These should be safe. They just print the signal number, the stack
trace, then restore the original behavior.
The only problem is that I can't do the same with SIGKILL or SIGSTOP,
but I suppose if SIGKILL were the problem, the kernel would have left
something in the logs. And SIGSTOP would have left the process alive.
ad841d50 was a mistake. It was never agreed in #40 that Fort should
shotgun blast its own face on the first ENOMEM, and even if it had been, the
idea is preposterous. Memory allocation failures are neither programming
errors nor an excuse to leave all the routers hanging.
While there's some truth to the notion that a Fort memory leak (which
has been exhausting memory over time) could be temporarily amended by
killing Fort (and letting the OS clean it up), the argument completely
misses the reality that memory allocation failures could happen
regardless of the existence of a memory leak.
A memory leak is a valid reason to throw away the results of a current
validation run (as long as the admin is warned), but an existing
validation result and the RTR server must remain afloat.
Hypothesis: Something (which I haven't spotted yet) was causing the
main thread to skip its wait before the pool threads finished their
tasks. Maybe something to do with the ready signal again?
So the main thread returned early, which means pool threads were
silently suppressed by the OS. That explains the early terminations and
nonexistent stack traces.
If I keep finding crippling errors like this, I will definitely have to
purge the thread pool. It's turned out to be a fucking bug colony at
this point. I'm sick of it.
Way I see it, the root of the problem was the thread pool's control
code, which was too complicated for its own good. A surprisingly large
part of why it was overcomplicated was because it reinvented thread
joining.
So I simplified the control code by removing the detach property. Now
that the main thread joins the proper way, the validation code will not
be interrupted anymore.
This might well be the solution for #49. However, it bothers me that I
still don't have a reasonable explanation as to why the main thread
seemed to be skipping wait.
1. (error) Fix the --work-offline flag.
It has been unused since commit 85478ff30ebc029abb0ded48de5b557f52a758e0.
2. (performance) Remove redundant fopen() and fclose() during
valid_file_or_dir().
If stat() is used instead of fstat(), there's no need to open and
close the file.
(Technically, it's no longer validating readability, but since the
validator downloads the files, read permission errors should be
extremely rare, and can be caught later.)
3. (fine) Remove return value from thread_pool_task_cb.
This wasn't a problem, but the return value was meaningless, and
no callers were using it.
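The stat()-based check from item 2 might look like this (the signature is an assumption, not Fort's actual prototype):

```c
#include <stdbool.h>
#include <sys/stat.h>

/* Answer "does this path exist, and is it a regular file or a
 * directory?" with a single stat() call, instead of the previous
 * fopen()/fstat()/fclose() round trip. Caveat from the note above:
 * this no longer proves readability, but since the validator downloads
 * the files itself, permission errors are rare and get caught later. */
bool valid_file_or_dir(char const *path)
{
	struct stat meta;

	if (stat(path, &meta) != 0)
		return false;
	return S_ISREG(meta.st_mode) || S_ISDIR(meta.st_mode);
}
```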
- Main validation loop: Remove some confusing, seemingly needless
wrapper functions.
- Libcurl: Catch lots of status codes properly
- Libcurl: Send proper data types to curl_easy_setopt()
(Argument types didn't match the documented requirements.)
- RTR server: Reduce argument lists.
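The curl_easy_setopt() type mismatches are invisible to the compiler because the function is variadic. A toy stand-in (not libcurl code) illustrating the class of bug:

```c
#include <stdarg.h>

/* Toy stand-in for a variadic setter such as curl_easy_setopt().
 * Because the function is variadic, the compiler cannot check the
 * argument's type: if the callee reads a long but the caller passed an
 * int (e.g. a bare 30 instead of 30L), the result is undefined
 * behavior that merely happens to work on some ABIs. */
long read_long_option(int option, ...)
{
	va_list args;
	long value;

	va_start(args, option);
	value = va_arg(args, long); /* the caller MUST have passed a long */
	va_end(args);
	return value;
}
```

With the real library, the fix is to pass exactly the type each option's documentation specifies, e.g. `curl_easy_setopt(curl, CURLOPT_TIMEOUT, 30L)` rather than a bare `30`.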
1. Merges the log and debug modules. I think their separation was the
reason why they forgot to add stack traces to syslog when they added
syslog to the project.
Not risking that mistake again.
2. Removes as many obstacles as possible from stack trace printing on
critical errors.
3. Adds mutexes to logging. This should prevent messages from mixing on
top of each other when several threads log at once.
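A minimal sketch of a mutex-guarded logger (the function name is hypothetical; the real module also routes to syslog and handles log levels):

```c
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>

static pthread_mutex_t log_lock = PTHREAD_MUTEX_INITIALIZER;

/* Serialize whole messages: the mutex guarantees the prefix, body and
 * newline are emitted as one unit, so concurrent threads cannot
 * interleave their output mid-line. Returns the number of body
 * characters printed, or a negative value on error. */
int pr_log(char const *fmt, ...)
{
	va_list args;
	int printed;

	pthread_mutex_lock(&log_lock);
	fputs("INF: ", stdout);
	va_start(args, fmt);
	printed = vfprintf(stdout, fmt, args);
	va_end(args);
	fputc('\n', stdout);
	pthread_mutex_unlock(&log_lock);

	return printed;
}
```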
1. (warning) The libcrypto error stack trace was always showing empty.
This was because of a bad counter.
2. (critical) Normal stack traces were only being printed in the
standard streams, never on syslog.
This is probably the reason why we don't have a proper error message
on #49. It's probably a segmentation fault.
Also a whole bunch of cleanup. The logging module had a bit of a
duplicate code problem.
I accidentally removed a lock operation in the previous commit,
so lots of undefined behavior was being triggered.
Also, restores (but improves) the thread ready signal. It's hard to
explain:
- Before: Workers send ready signal to parent,
but parent might not be listening yet;
Therefore the parent times out on wait.
- Previous: Workers do not send ready signal to parent.
Therefore, parent might signal work when no workers are ready yet;
Therefore nobody works.
- Now: Workers send ready signal to parent,
parent listens lazily (ie. late), but only if workers aren't ready
yet.
Therefore, correct behavior.
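The "Now" variant is the textbook condition-variable pattern: the signal is backed by state (a counter) checked under the mutex, so the wakeup cannot be lost. A sketch with hypothetical names:

```c
#include <pthread.h>

/* The ready "signal" is a counter guarded by a mutex, not a bare
 * pthread_cond_signal(). If the workers get here before the parent
 * starts listening, the parent sees ready == total and never sleeps. */
struct ready_signal {
	pthread_mutex_t lock;
	pthread_cond_t cond;
	unsigned int ready;
	unsigned int total;
};

void worker_report_ready(struct ready_signal *rs)
{
	pthread_mutex_lock(&rs->lock);
	rs->ready++;
	pthread_cond_signal(&rs->cond);
	pthread_mutex_unlock(&rs->lock);
}

/* Parent listens lazily: it only waits if workers aren't ready yet. */
void parent_wait_ready(struct ready_signal *rs)
{
	pthread_mutex_lock(&rs->lock);
	while (rs->ready < rs->total)
		pthread_cond_wait(&rs->cond, &rs->lock);
	pthread_mutex_unlock(&rs->lock);
}
```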
pcarana [Wed, 27 Jan 2021 15:32:18 +0000 (09:32 -0600)]
Fix rsync and thread pool bugs.
+Mistakenly (of course, it was a bug), the return value from the rsync execution was being confused with the return value from the execvp() call. The main problem occurred when rsync returned code 12 (error in rsync protocol data stream); in that case, the caller confused that error with ENOMEM (which also has value 12), which led it to terminate execution.
+The thread pool wait function wasn't considering pending tasks in the queue; also, the poll function was holding and releasing the mutex more often than needed, and the thread attributes are now initialized globally (thanks @ydahhrk for the code review).
+Increment the number of threads in the internal pool to 10.
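The code-12 confusion comes from mixing two separate namespaces: the child's exit status and errno. A sketch of keeping them apart (the helper name is hypothetical):

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Translate a wait()-style status into the child's exit code, keeping
 * it out of the errno namespace. rsync's exit code 12 ("error in rsync
 * protocol data stream") is numerically identical to ENOMEM on Linux,
 * so returning it raw into an errno-flavored error path misreads a
 * protocol failure as memory exhaustion. */
int child_exit_code(int wstatus)
{
	if (WIFEXITED(wstatus))
		return WEXITSTATUS(wstatus);
	return -1; /* killed by a signal, or not terminated */
}
```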
pcarana [Wed, 25 Nov 2020 23:28:40 +0000 (17:28 -0600)]
Add docs for thread-pool.* args, add flow to reject RTR clients
+An RTR client is rejected when there are no threads available in the pool to serve it.
+Add a new function in thread_pool.c to check whether the pool has threads available to work.
+Use an internal buffer in sockaddr2str(), since the buffer received as a parameter wasn't being used by anybody.
+Update max values: thread-pool.server.max=500, thread-pool.validation.max=100.
pcarana [Tue, 24 Nov 2020 00:20:40 +0000 (18:20 -0600)]
Use thread pool for RTR server/clients, validation cycles at main thread
+Change the previous logic: the RTR server lived in the main thread and the validation cycles ran in a separate thread. Now the validation cycles run in the main thread, and the RTR server is spawned in a new thread.
+Create an internal thread pool to handle the RTR server task and the 'delete RRDP dirs' tasks.
+Create a thread pool to handle incoming RTR clients. One thread is used per client.
+Create args: 'thread-pool.server.max' (threads spawned to serve RTR clients) and 'thread-pool.validation.max' (threads spawned to run validation cycles).
+Shutdown all living client sockets when the application ends its execution.
+Rename 'updates_daemon.*' to 'validation_run.*'.
pcarana [Fri, 6 Nov 2020 01:49:43 +0000 (19:49 -0600)]
Implement a thread pool, still pending to use at RTR clients
+The pool is basically a task queue. It's initialized with a fixed number of threads (all of them spawned at pool creation), each of which waits for pending tasks to work on.
+TODO: the number of threads per pool must be configurable.
+TODO: right now only one pool is used, for the TAL validations (and therefore the whole RPKI trees beneath them); at least one more pool could be used to receive RTR clients.
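A minimal fixed-capacity task-queue pool in the spirit described above (names and sizes are illustrative; error handling is trimmed for brevity):

```c
#include <pthread.h>
#include <stdlib.h>

#define QUEUE_CAPACITY 64

typedef void (*task_cb)(void *);

struct task {
	task_cb cb;
	void *arg;
};

struct thread_pool {
	pthread_mutex_t lock;
	pthread_cond_t has_work;
	struct task queue[QUEUE_CAPACITY]; /* circular buffer */
	size_t head, len;
	int stopping;
	pthread_t *threads;
	size_t nthreads;
};

/* Each worker loops: sleep until there's a task (or shutdown),
 * pop it, run it outside the lock. */
static void *pool_worker(void *arg)
{
	struct thread_pool *pool = arg;
	struct task task;

	for (;;) {
		pthread_mutex_lock(&pool->lock);
		while (pool->len == 0 && !pool->stopping)
			pthread_cond_wait(&pool->has_work, &pool->lock);
		if (pool->len == 0) { /* stopping, and queue is drained */
			pthread_mutex_unlock(&pool->lock);
			return NULL;
		}
		task = pool->queue[pool->head];
		pool->head = (pool->head + 1) % QUEUE_CAPACITY;
		pool->len--;
		pthread_mutex_unlock(&pool->lock);
		task.cb(task.arg);
	}
}

/* Spawn the fixed set of worker threads. Returns 0 on success. */
int pool_init(struct thread_pool *pool, size_t nthreads)
{
	size_t i;

	pthread_mutex_init(&pool->lock, NULL);
	pthread_cond_init(&pool->has_work, NULL);
	pool->head = pool->len = 0;
	pool->stopping = 0;
	pool->nthreads = nthreads;
	pool->threads = malloc(nthreads * sizeof(pthread_t));
	if (pool->threads == NULL)
		return -1;
	for (i = 0; i < nthreads; i++)
		if (pthread_create(&pool->threads[i], NULL, pool_worker, pool))
			return -1;
	return 0;
}

/* Queue a task; returns -1 if the queue is full. */
int pool_push(struct thread_pool *pool, task_cb cb, void *arg)
{
	struct task task = { cb, arg };

	pthread_mutex_lock(&pool->lock);
	if (pool->len == QUEUE_CAPACITY) {
		pthread_mutex_unlock(&pool->lock);
		return -1;
	}
	pool->queue[(pool->head + pool->len) % QUEUE_CAPACITY] = task;
	pool->len++;
	pthread_cond_signal(&pool->has_work);
	pthread_mutex_unlock(&pool->lock);
	return 0;
}

/* Let the workers drain the queue, then join them all. */
void pool_shutdown(struct thread_pool *pool)
{
	size_t i;

	pthread_mutex_lock(&pool->lock);
	pool->stopping = 1;
	pthread_cond_broadcast(&pool->has_work);
	pthread_mutex_unlock(&pool->lock);
	for (i = 0; i < pool->nthreads; i++)
		pthread_join(pool->threads[i], NULL);
	free(pool->threads);
}

/* Example task: atomically bump a counter. */
void count_up(void *arg)
{
	__atomic_fetch_add((int *)arg, 1, __ATOMIC_SEQ_CST);
}
```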
pcarana [Fri, 30 Oct 2020 20:18:13 +0000 (14:18 -0600)]
Add '--daemon' argument to daemonize fort, fixes #25
+When the flag is enabled, any value set in '--log.output' and '--validation-log.output' is overwritten with 'syslog' (all enabled logs will be sent to syslog).
+Update the docs to include the new argument.
pcarana [Wed, 28 Oct 2020 00:22:11 +0000 (18:22 -0600)]
Add argument '--init-tals' to fetch RIR TALs
+Once used, FORT tries to download the TALs and exits. In order to download the ARIN TAL, the user must explicitly accept its RPA by typing 'yes' (ignoring case) at stdin.
+Remove the write callback from HTTP download callers, it was unnecessary since every caller did the same thing.
+Update the docs to include the new argument.
pcarana [Sat, 17 Oct 2020 01:27:26 +0000 (20:27 -0500)]
Fix bug: data can be stale when the local-repository is deleted.
The data remained the same until an RRDP server had a delta update; in that case the updated files weren't found, so the snapshot was processed and the local cache was built again. If the RRDP server didn't have updates, the root manifest wasn't found and the whole validation cycle's results were discarded.
Now, when the manifest isn't found and the RRDP server has no updates, force the snapshot processing to make sure the error isn't on the RP's side. Also, update the daemon that cleans up the visited RRDP URIs, so that it deletes the files from their corresponding workspace.
Use a local workspace for RRDP related files, fixes #39.
+Whenever an RRDP file is identified (ie. update notification URI) create a directory at '--local-repository' where all of the RRDP files (XMLs as well as 'publish' elements at those snapshot/delta files) will be created and read.
+The rsync URIs at the publish/withdraw elements are mapped to the location <--local-repository>/<rrdp workspace>/<URI part>. Eg. if '--local-repository=/tmp/fort' and the current workspace (each TAL has its own workspace) is 'ABC', then the URI 'rsync://example.com/foo/bar.cer' will be created at '/tmp/fort/ABC/example.com/foo/bar.cer'.
-RSYNC repositories are still created at '--local-repository'.
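The URI-to-workspace mapping described above might be sketched like this (the function name and the set of recognized schemes are assumptions):

```c
#include <stdio.h>
#include <string.h>

/* Map a published URI to its path under the TAL's workspace:
 * <repo>/<workspace>/<URI minus scheme>. Returns 0 on success, -1 on
 * an unknown scheme or a result that doesn't fit the buffer. */
int map_uri(char const *repo, char const *workspace, char const *uri,
    char *result, size_t capacity)
{
	char const *rest;
	int written;

	/* Both recognized schemes happen to be 8 characters long. */
	if (strncmp(uri, "rsync://", 8) == 0)
		rest = uri + 8;
	else if (strncmp(uri, "https://", 8) == 0)
		rest = uri + 8;
	else
		return -1;

	written = snprintf(result, capacity, "%s/%s/%s", repo, workspace, rest);
	return (written < 0 || (size_t)written >= capacity) ? -1 : 0;
}
```

With the changelog's own example: repo `/tmp/fort`, workspace `ABC` and URI `rsync://example.com/foo/bar.cer` yield `/tmp/fort/ABC/example.com/foo/bar.cer`.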
Change the way that the server handles client connections, fix memleak
+Using a thread for each server socket worked OK, but at a great cost, since each thread can increase memory consumption; instead, use non-blocking sockets and select() to poll for client connections on each configured server socket.
+Fix memory leak when the incidence 'stale CRL' returns error.
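One round of the select()-based polling could look like the following sketch (the helper name is hypothetical):

```c
#include <stddef.h>
#include <sys/select.h>
#include <unistd.h>

/* Register every listening socket in an fd_set and let select() report
 * how many have a pending event, instead of parking one thread per
 * socket. Returns select()'s result: the number of ready descriptors,
 * 0 on timeout, -1 on error. */
int poll_servers(int const *fds, size_t count, fd_set *readfds,
    struct timeval *timeout)
{
	size_t i;
	int maxfd = -1;

	FD_ZERO(readfds);
	for (i = 0; i < count; i++) {
		FD_SET(fds[i], readfds);
		if (fds[i] > maxfd)
			maxfd = fds[i];
	}

	return select(maxfd + 1, readfds, NULL, NULL, timeout);
}
```

After it returns, the caller walks the descriptors and accept()s on each one for which FD_ISSET() is true.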
Fix AIA validation, log all 'enomem', use tmp files at http downloads
+The AIA validation flow didn't entirely consider the scenario where a TA child's AIA extension didn't match the actual location the TA was fetched from (common case: when it's downloaded from an HTTPS URI), so even though the TA actually existed, it wasn't considered when the validator was working with local files.
+Replace all '-ENOMEM' return codes with the log function 'pr_enomem'.
+Use temporary files whenever an HTTPS file is being downloaded.
+Fix memory leak when working with local TA files.
+The message '{..} discarding any other validation results' is now sent to the operation log.
+Fix GCC 10 warning related to 'strncpy', use 'memcpy' instead.
+The new arguments are: http.enabled, http.priority, http.retry.count, http.retry.interval.
+rrdp.* args still exist. When any of them is set (via conf file or as an arg), a warning message is displayed and the value of the argument is assigned to its corresponding http.* arg.
+Move the 'retries logic' to the http requests, since only RRDP flows had it.
+At a TAL: when the shuffle-uris arg is set, do the shuffle first and then consider the priority set by the rsync and http args, so that the priority argument can be honored.
+The new incidences have the IDs 'incid-crl-stale' and 'incid-mft-stale'.
+SSL lib implementations (OpenSSL and LibreSSL, at least) don't make it easy to ignore a stale CRL, so when the incidence exists and is set to warn/ignore, retry the verification using a cloned CRL with a valid 'nextUpdate' field.
Fix bug: error'd repositories weren't logged if a child repo was synced
+If a parent repository wasn't successfully synced (eg. LACNIC) but a child repository was synced (eg. Brazil), the errors related to the parent repository weren't logged to the operation log.
+Fix this by popping the working repository from the TA, since this was causing the error. All the repositories were erroneously related, so on success of any of them, the error logs were discarded.
+Two additional updates are done: don't rsync when forcing the download of a URI whose ancestor had a previous error, and remove line breaks from the stale repositories summary.
+Rename some local variables to aid dev reading.
pcarana [Fri, 26 Jun 2020 23:09:32 +0000 (18:09 -0500)]
Fix bug: didn't search local files when an RRDP URI had failed previously
+Whenever an RRDP repository can't be fetched, an attempt to work with the local files must be made. If RSYNC was disabled and there was an error fetching the RRDP repository, the next time that repository was found at a certificate it was being rejected; the right thing to do is to consider such a scenario and keep working locally.
pcarana [Fri, 26 Jun 2020 19:38:23 +0000 (14:38 -0500)]
Avoid additional operations after calling fork()
+Based on https://pubs.opengroup.org/onlinepubs/9699919799/functions/fork.html, the function called by the child process now avoids malloc() calls and only redirects its output to the corresponding pipes before doing the rsync execution with execvp().
+This fixes #35. Something in the musl implementation (very specific to docker+alpine) hangs the child process right after its creation; the parent process waits for the child to end but it never does, so the container runs forever and never finishes a validation cycle.
+Also, flush stderr/stdout before fork() to avoid a possible (in docker+alpine, almost certain) deadlock between the parent process and its forked child.
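The post-fork discipline described above can be sketched as follows (argument handling is simplified, and run_child is a hypothetical name):

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Flush stdio before fork() so buffered output is neither duplicated
 * in the child nor deadlocked behind a shared pipe; after fork(), the
 * child restricts itself to async-signal-safe calls (dup2, execvp,
 * _exit) -- no malloc, no stdio. Returns the child's exit code, -1 on
 * failure. */
int run_child(char *const argv[], int out_fd, int err_fd)
{
	pid_t child;
	int status;

	fflush(stdout);
	fflush(stderr);

	child = fork();
	if (child < 0)
		return -1;
	if (child == 0) {
		/* Child: async-signal-safe operations only from here on. */
		dup2(out_fd, STDOUT_FILENO);
		dup2(err_fd, STDERR_FILENO);
		execvp(argv[0], argv);
		_exit(127); /* exec itself failed */
	}
	if (waitpid(child, &status, 0) < 0)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```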