- Mandatory cgcc cleanup.
- Change the VRPS read-write lock into a mutex. (Because there
are no readers.)
- Remove `static` modifier from `rtrhandler_handle_roa_v4()`'s
helper functions (in the hopes of getting a slightly more
revealing stack trace).
Still haven't found the problem. It behaves like a concurrency
error, but doesn't look possible.
pcarana [Sat, 17 Oct 2020 01:27:26 +0000 (20:27 -0500)]
Fix bug: data can be stale when the local-repository is deleted.
The data remained the same until the RRDP server had a delta update; in that case the updated files weren't found and the snapshot was processed, so the local cache was built again. If the RRDP server didn't have updates, the root manifest wasn't found and the results of the whole validation cycle were discarded.
Now, when the manifest isn't found and the RRDP server has no updates, force the snapshot processing to make sure that the error isn't on the RP's side. Also, update the daemon that cleans up the RRDP visited URIs, so that it deletes the files from their corresponding workspace.
Use a local workspace for RRDP related files, fixes #39.
+Whenever an RRDP file is identified (i.e. an update notification URI), create a directory at '--local-repository' where all of the RRDP files (XMLs as well as the 'publish' elements at those snapshot/delta files) will be created and read.
+The rsync URIs at the publish/withdraw elements are mapped to the location <--local-repository>/<rrdp workspace>/<URI part>. E.g. if '--local-repository=/tmp/fort' and the current workspace (each TAL has its own workspace) is 'ABC', then the URI 'rsync://example.com/foo/bar.cer' will be created at '/tmp/fort/ABC/example.com/foo/bar.cer'.
-RSYNC repositories are still created at '--local-repository'.
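The URI mapping described above can be sketched as follows. This is only an illustration: `map_uri_to_workspace()` and its signature are hypothetical, not the function used in the code base.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define RSYNC_PREFIX "rsync://"

/*
 * Hypothetical helper (not the project's actual function): maps an rsync URI
 * to <local_repo>/<workspace>/<URI without the "rsync://" prefix>.
 * Returns 0 on success, -1 if the URI isn't rsync or dst is too small.
 */
int
map_uri_to_workspace(char const *local_repo, char const *workspace,
    char const *uri, char *dst, size_t dst_len)
{
	size_t prefix_len = strlen(RSYNC_PREFIX);
	int written;

	assert(local_repo != NULL && workspace != NULL);
	assert(uri != NULL && dst != NULL);

	if (strncmp(uri, RSYNC_PREFIX, prefix_len) != 0)
		return -1; /* Only rsync URIs are mapped in this sketch. */

	written = snprintf(dst, dst_len, "%s/%s/%s", local_repo, workspace,
	    uri + prefix_len);
	if (written < 0 || (size_t)written >= dst_len)
		return -1; /* Output error or truncated path. */

	return 0;
}
```

With local_repo "/tmp/fort", workspace "ABC" and the URI 'rsync://example.com/foo/bar.cer', dst ends up as '/tmp/fort/ABC/example.com/foo/bar.cer', matching the example in the message.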
Change the way that the server handles client connections, fix memleak
+Using a thread for each server socket worked ok, but at a great cost since each thread can increase memory consumption; instead, use non-blocking sockets and select() to poll for client connections on each configured server socket.
+Fix memory leak when the incidence 'stale CRL' returns an error.
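A minimal sketch of the select() approach mentioned above, assuming one listening descriptor per configured server socket. The helper names (`set_nonblocking`, `poll_server_fds`, `accept_client`) and the return conventions are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Puts a descriptor in non-blocking mode. Returns 0 on success, -1 on error. */
int
set_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);

	if (flags == -1)
		return -1;
	return (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) ? -1 : 0;
}

/*
 * Hypothetical helper: waits up to timeout_ms for any descriptor in fds to
 * become readable (for a listening socket: a client is waiting).
 * Returns the index of the first ready descriptor, -1 on timeout, -2 on error.
 */
int
poll_server_fds(int const *fds, size_t count, long timeout_ms)
{
	fd_set readfds;
	struct timeval tv;
	int max_fd = -1;
	size_t i;

	assert(fds != NULL);

	FD_ZERO(&readfds);
	for (i = 0; i < count; i++) {
		if (fds[i] < 0 || fds[i] >= FD_SETSIZE)
			return -2; /* select() can't watch this descriptor. */
		FD_SET(fds[i], &readfds);
		if (fds[i] > max_fd)
			max_fd = fds[i];
	}

	tv.tv_sec = timeout_ms / 1000;
	tv.tv_usec = (timeout_ms % 1000) * 1000;

	switch (select(max_fd + 1, &readfds, NULL, NULL, &tv)) {
	case -1:
		return -2;
	case 0:
		return -1;
	}

	for (i = 0; i < count; i++)
		if (FD_ISSET(fds[i], &readfds))
			return (int)i;
	return -1;
}

/* Accepts one pending client. Returns its non-blocking fd, or -1. */
int
accept_client(int server_fd)
{
	int client_fd = accept(server_fd, NULL, NULL);

	if (client_fd == -1)
		return -1; /* Includes EAGAIN: nobody was actually waiting. */
	if (set_nonblocking(client_fd) == -1) {
		close(client_fd);
		return -1;
	}
	return client_fd;
}
```

A single thread can loop over poll_server_fds() and call accept_client() on whichever index it returns, instead of parking one thread per server socket.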
Fix AIA validation, log all 'enomem', use tmp files at http downloads
+The AIA validation flow didn't fully consider the scenario where the AIA extension of a TA child didn't match the actual location from which the TA was fetched (common case: when it's downloaded from an HTTPS URI), so even though the TA actually existed, it wasn't considered when the validator was working with local files.
+Replace all '-ENOMEM' return codes with the log function 'pr_enomem'.
+Use temporary files whenever an HTTPS file is being downloaded.
+Fix memory leak when working with local TA files.
+The message '{..} discarding any other validation results' is now sent to the operation log.
+Fix GCC 10 warning related to 'strncpy', use 'memcpy' instead.
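For the temporary file item above, one common pattern is to write the download to a side file and rename it over the final path only once everything succeeded. The sketch below assumes that pattern; the message doesn't say how the temporary file is named or moved, so the ".tmp" suffix and `save_via_tmp_file()` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: writes len bytes to "<path>.tmp" and renames the
 * result to path only if every step succeeded, so a failed download never
 * leaves a partial file at path.
 * Returns 0 on success, -1 on error.
 */
int
save_via_tmp_file(char const *path, void const *data, size_t len)
{
	char tmp_path[4096];
	FILE *file;
	int written;

	assert(path != NULL && data != NULL);

	written = snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", path);
	if (written < 0 || (size_t)written >= sizeof(tmp_path))
		return -1;

	file = fopen(tmp_path, "wb");
	if (file == NULL)
		return -1;

	if (fwrite(data, 1, len, file) != len) {
		fclose(file);
		remove(tmp_path);
		return -1;
	}
	if (fclose(file) != 0) {
		remove(tmp_path);
		return -1;
	}

	/* Replace the final file in one step. */
	if (rename(tmp_path, path) != 0) {
		remove(tmp_path);
		return -1;
	}
	return 0;
}
```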
+The new arguments are: http.enabled, http.priority, http.retry.count, http.retry.interval.
+rrdp.* args still exist. When any of them is set (via conf file or as arg) a warning message is displayed and the value of the argument is set to its corresponding http.* arg.
+Move the 'retries logic' to the http requests, since only RRDP flows had it.
+At a TAL: when shuffle-uris arg is set, do the shuffle first and then consider the priority set at rsync and http args, so that the priority argument can be honored.
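The "shuffle first, then apply priority" order can be sketched with a shuffle followed by a stable sort, so that URIs with the same priority keep their shuffled order. Everything here is illustrative: `struct tal_uri`, both function names, and the assumption that a higher priority value is tried first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical representation of one URI listed in a TAL. */
struct tal_uri {
	char const *value;
	unsigned int priority; /* Assumed: higher value is tried first. */
};

/* Fisher-Yates shuffle based on rand(); seed with srand() beforehand. */
void
shuffle_uris(struct tal_uri *uris, size_t count)
{
	struct tal_uri tmp;
	size_t i, j;

	assert(uris != NULL || count == 0);

	if (count < 2)
		return;
	for (i = count - 1; i > 0; i--) {
		j = (size_t)rand() % (i + 1);
		tmp = uris[i];
		uris[i] = uris[j];
		uris[j] = tmp;
	}
}

/*
 * Stable insertion sort, descending by priority. Being stable, it keeps the
 * (shuffled) relative order of URIs that share the same priority.
 */
void
order_by_priority(struct tal_uri *uris, size_t count)
{
	struct tal_uri key;
	size_t i, j;

	for (i = 1; i < count; i++) {
		key = uris[i];
		for (j = i; j > 0 && uris[j - 1].priority < key.priority; j--)
			uris[j] = uris[j - 1];
		uris[j] = key;
	}
}
```

Calling shuffle_uris() and then order_by_priority() randomizes the order within each access method while still honoring the configured priority; doing it the other way around would let the shuffle undo the priority.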
+The new incidences have the IDs 'incid-crl-stale' and 'incid-mft-stale'.
+SSL lib implementations (OpenSSL and LibreSSL at least) don't make it easy to ignore a stale CRL, so when the incidence exists and is set to warn/ignore, retry the verification using a cloned CRL with a valid 'nextUpdate' field.
Fix bug: error'd repositories weren't logged if a child repo was synced
+If a parent repository wasn't successfully synced (eg. LACNIC) but a child repository was synced (eg. Brazil), the errors related to the parent repository weren't logged to the operation log.
+Fix this by popping the working repository from the TA, since this was causing the error. All the repositories were erroneously related, so on success of any of them, the error logs were discarded.
+Two additional updates are done: don't rsync when forcing the download of a URI whose ancestor had a previous error, and remove line breaks from the stale repositories summary.
+Rename some local variables to aid dev reading.
pcarana [Fri, 26 Jun 2020 23:09:32 +0000 (18:09 -0500)]
Fix bug: didn't search local files when an RRDP URI failed previously
+Whenever an RRDP repository can't be fetched, an attempt to work with local files must be made. If RSYNC was disabled and there was an error fetching the RRDP repository, the next time that repository was found on a certificate it was rejected; the right thing to do is to consider this scenario and keep working locally.
pcarana [Fri, 26 Jun 2020 19:38:23 +0000 (14:38 -0500)]
Avoid additional operations after calling fork()
+Based on https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html, the function called by the child process now avoids malloc's and only redirects its output to the corresponding pipes before doing the rsync execution with execvp().
+This fixes #35. Something at the musl implementation (very specific to docker+alpine) hangs the child process right after its creation. The parent process waits for the child to end but it never does, so the container runs forever and never ends a validation cycle.
+Also, flush stderr/stdout before fork() to avoid a possible (in docker+alpine, almost sure) deadlock between parent process and its forked child.
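A sketch of the fork() pattern described above: flush stdio before forking, and in the child do nothing except redirect output to the pipe and call execvp(). `run_and_capture()` is a hypothetical helper that captures only stdout to keep the example short; it is not the rsync code itself.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: runs argv[0] through execvp() and stores what it
 * prints to stdout in out (NUL-terminated, truncated to out_len - 1 bytes).
 * Returns the child's exit status, or -1 on error.
 */
int
run_and_capture(char *const argv[], char *out, size_t out_len)
{
	int fds[2];
	pid_t pid;
	ssize_t n;
	size_t total = 0;
	int status;

	assert(argv != NULL && argv[0] != NULL);
	assert(out != NULL && out_len > 0);

	if (pipe(fds) == -1)
		return -1;

	/* Flush before fork() so the child doesn't inherit pending output. */
	fflush(stdout);
	fflush(stderr);

	pid = fork();
	if (pid == -1) {
		close(fds[0]);
		close(fds[1]);
		return -1;
	}

	if (pid == 0) {
		/* Child: no malloc(), no stdio; just redirect and exec. */
		close(fds[0]);
		if (dup2(fds[1], STDOUT_FILENO) == -1)
			_exit(126);
		close(fds[1]);
		execvp(argv[0], argv);
		_exit(127); /* Only reached if execvp() failed. */
	}

	/* Parent: read the child's output until EOF or until out is full. */
	close(fds[1]);
	while (total < out_len - 1) {
		n = read(fds[0], out + total, out_len - 1 - total);
		if (n <= 0)
			break;
		total += (size_t)n;
	}
	out[total] = '\0';
	close(fds[0]);

	if (waitpid(pid, &status, 0) == -1)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```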
pcarana [Fri, 19 Jun 2020 22:30:43 +0000 (17:30 -0500)]
Fix several bugs related to sync errors, update some log messages
+Fix bug: an endless loop when a requested URI error was removed.
+Fix bug: some error'd URIs could be logged even though their repository data was successfully fetched with another access method.
+Fix bug: if a TAL has more than one URI and there was an error fetching a URI, the following URIs in the list weren't considered to get the TA certificate.
+Only 'stderr' rsync output will be sent to operation log considering '--stale-repository-period', the 'stdout' rsync output will be sent to validation log at level info.
+Messages of rsync/rrdp retries are 'upgraded' from level info to warn (all on validation logs).
+Add a warning message (validation log) whenever local data is going to be utilized due to previous errors fetching repositories or TA certificates.
+Log all communication errors if 'log.level=debug'.
pcarana [Sat, 13 Jun 2020 00:34:57 +0000 (19:34 -0500)]
Replace args '*log.prefix' with '*log.tag', add help message.
+Do the replacement at code, docs and unit tests.
+Add a help message that's printed whenever there's an error at the configuration arguments.
+Fix a broken unit test.
+Fix the description of 'validation-log.tag'.
+Fix some errors at configuration examples ('examples/config.json') and at the web docs ('usage.html#--configuration-file').
pcarana [Thu, 4 Jun 2020 23:14:03 +0000 (18:14 -0500)]
Fix bug when applying SLURM, and configure the log level on empty dirs
+The bug: when a SLURM was successfully loaded, the internal flow continued to the error flow instead of stopping on success (a 'return' was needed). This led to worse errors later, such as a segfault when a valid SLURM was applied.
+Log error whenever the TALs configured directory is empty, log warning if the SLURM directory is empty.
pcarana [Wed, 13 May 2020 23:43:43 +0000 (18:43 -0500)]
Allow to work with cache on requests errors, common func to get date
+ Create common function to get the current date and time.
+ Identify request errors, specifically after trying to fetch data via http/rsync without success. This helps to identify if a whole repository can't be downloaded after a considerable time (it can be configured).
+ Allow to work with local files even when there was a download error.
+ Add 'stale-repository-period' argument to set the time period that must elapse to warn about stale repositories (this will be logged to the operation log).
- Code wasn't validating null result on strdup
- If a validation thread was interrupted,
`perform_standalone_validation()` was reading an uninitialized
exit status.
More or less as a side effect, I also merged the structures
`pthread_param` and `thread`, because their usage was similar
and shared ~50% of their members.
`do_file_validation()` is no longer responsible for freeing its
generic argument.
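For the strdup() item in the list above, the check boils down to something like this. `dup_string()` is a hypothetical wrapper; the -ENOMEM return follows the convention mentioned elsewhere in this log.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical wrapper: duplicates src into *dst.
 * Returns 0 on success, -ENOMEM if strdup() couldn't allocate memory
 * (in which case *dst is left untouched).
 */
int
dup_string(char const *src, char **dst)
{
	char *copy;

	assert(src != NULL && dst != NULL);

	copy = strdup(src);
	if (copy == NULL)
		return -ENOMEM; /* Previously unchecked. */

	*dst = copy;
	return 0;
}
```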
pcarana [Wed, 25 Mar 2020 00:39:01 +0000 (18:39 -0600)]
Add new incidences regarding manifest validation.
-Related to #28.
-'incid-file-at_mft-not-found': when a file listed in a manifest isn't found at the manifest publication point.
-'incid-file-at-mft-hash-not-match': the file hash doesn't match the hash listed at the manifest.
-Both incidences will be an error by default.
pcarana [Thu, 19 Mar 2020 22:55:20 +0000 (16:55 -0600)]
Update SLURM loading logic (use a cache to load new data).
+Stop searching for duplicate elements in the same file or in distinct files, also stop searching for covered prefixes in the same file; those checks don't exist in the RFC and they had a huge processing cost.
+Implement a SLURM cache when a new file is loaded; this way it's easier to check the rule at RFC 8416 section 4.2.
+Remove the whole context properties that were utilized to know on which file the loader was working.
pcarana [Fri, 13 Mar 2020 17:47:37 +0000 (11:47 -0600)]
Check for time condition met/unmet due to old libcurl impl
The 'problem' was found at CentOS 7: the libcurl implementation makes the 'If-Modified-Since' check at the client side. So, if the server responds with an HTTP OK (200) code but the dates don't match, the response content is ignored.
What's the problem? For us (the HTTP client) the response looks ok and we take the download as correct, but the downloaded file doesn't have content, so when it's read bad things happen (actually the error is logged and the fallback is to mark such repository as invalid and try the download from another repo, if such a repo is available).
pcarana [Fri, 13 Mar 2020 16:36:04 +0000 (10:36 -0600)]
Stop holding the write lock when the SLURM is loaded
There's no need to hold the lock: the SLURM loading action doesn't modify the current DB state; it's altering the new DB state, which will be utilized later to replace the current DB state (and that's where the lock is needed).
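The idea can be sketched as "build the new state unlocked, lock only for the swap". This sketch uses a plain pthread mutex instead of the read-write lock to stay short, and `struct db`, `update_db()` and `get_roa_count()` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real database state. */
struct db {
	int roa_count;
	int slurm_applied;
};

static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static struct db *current_db;

/* Builds a new state and swaps it in. Returns 0 on success, -1 on error. */
int
update_db(int roa_count)
{
	struct db *new_db, *old_db;

	assert(roa_count >= 0);

	new_db = malloc(sizeof(*new_db));
	if (new_db == NULL)
		return -1;

	/* Slow part (e.g. loading the SLURM): only new_db is touched, no lock. */
	new_db->roa_count = roa_count;
	new_db->slurm_applied = 1;

	/* Fast part: the lock is held only while the pointers are swapped. */
	pthread_mutex_lock(&db_lock);
	old_db = current_db;
	current_db = new_db;
	pthread_mutex_unlock(&db_lock);

	free(old_db);
	return 0;
}

/* Returns the ROA count of the current state, or -1 if there's none yet. */
int
get_roa_count(void)
{
	int count;

	pthread_mutex_lock(&db_lock);
	count = (current_db != NULL) ? current_db->roa_count : -1;
	pthread_mutex_unlock(&db_lock);
	return count;
}
```

Readers such as get_roa_count() are blocked only for the duration of the pointer swap, not for the whole load.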