There are many ways in which a mismatching cache index can cause erratic
behavior that's hard to detect. Since the index is written at the end of
the validation cycle, crashing at any point between a cache refresh and
the index write results in a misindexed cache.
Deleting the index after loading it seems to be a reliable way to force
Fort to reset the cache after a crash.
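In code, the trick boils down to something like this (index_parse() and
struct cache_index are made-up names, not Fort's actual ones):

#include <err.h>
#include <unistd.h>

struct cache_index;
static struct cache_index *index_parse(char const *); /* hypothetical */

static struct cache_index *
index_load(char const *path)
{
	struct cache_index *idx;

	idx = index_parse(path);
	if (idx == NULL)
		return NULL; /* no index: the cache gets reset anyway */

	/* Delete the file now; it only reappears if the validation
	 * cycle ends cleanly and the index is rewritten. */
	if (unlink(path) != 0)
		warn("Cannot delete %s", path);

	return idx;
}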
- Rename extension.h to ext.h; the former collides with Extension.h.
- Move _DEFAULT_SOURCE to the source; it's not widespread enough for
Makefile.am.
- Add _DARWIN_C_SOURCE, needed by macOS for timegm() and mkdtemp().
- Add -flto to unit test AM_CFLAGS. This minimizes superfluous #includes
and mocks needed, and will hopefully make them consistent across
platforms.
- Delete _BSD_SOURCE; it seems orphaned. (Though see below.)
Works on Linux and Mac. Might have broken the BSDs; I can't test them
ATM.
- Separate node->mtim into attempt_ts and success_ts.
Because they're really two different timestamps: the former is meant
for node expiration, the latter for HTTP IMS (If-Modified-Since).
(A sketch follows this list.)
- Move removal of orphaned fallbacks to remove_abandoned().
Because orphaned refreshes need the same logic.
- Add the (randomly missing) expiration threshold for orphans.
It's still missing the implementation of remove_orphaned_files(),
but I'm still weighing options, as it seems it's going to be an
expensive operation that's rarely going to do anything.
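Roughly, the timestamp split looks like this (the comments paraphrase
the intent; the real struct obviously has more members):

#include <time.h>

struct cache_node {
	/* ... */
	time_t attempt_ts; /* last download attempt; drives node expiration */
	time_t success_ts; /* last successful download; drives HTTP If-Modified-Since */
};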
Both used to be indexed by caRepository, which could induce collisions.
RRDP fallbacks are now indexed by rpkiNotify + caRepository, ensuring
they're caged separately.
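Something like this (illustrative only; the real index format may
differ, but the point is that two CAs sharing a caRepository no longer
collide unless they also share rpkiNotify):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static char *
fallback_key(char const *rpki_notify, char const *ca_repository)
{
	size_t len = strlen(rpki_notify) + strlen(ca_repository) + 2;
	char *key = malloc(len);

	if (key != NULL)
		snprintf(key, len, "%s\t%s", rpki_notify, ca_repository);
	return key;
}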
The fork()s (needed to spawn rsyncs) duplicate Fort's process.
Which is messy in a multithreaded program. Quoting the Linux man page:
> * The child process is created with a single thread—the one that
> called fork(). The entire virtual address space of the parent is
> replicated in the child, including the states of mutexes, condition
> variables, and other pthreads objects. (...)
> * After a fork() in a multithreaded program, the child can safely call
> only async-signal-safe functions (...) until such time as it calls
> execve(2).
As far as I can tell, since the forked child was, in fact, careful to
only invoke async-signal-safe functions, this wasn't really a bug.
Still, it wasn't quality architecture either.
Moving the rsync spawner to a dedicated subprocess should completely
remove the threat of the forks clashing with the multithreading.
Relies on the new core loop design, so this won't work properly until
that's implemented.
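A sketch of the ordering only; request_loop() is a stand-in for
whatever protocol ends up shuttling rsync requests over the pipe:

#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static void request_loop(int); /* hypothetical: fork() + exec rsync per request */

/* Fork the spawner while Fort is still single-threaded. Every later
 * fork() happens in this child, which holds no mutexes, so the
 * caveats quoted above no longer apply to the validator's threads. */
static pid_t
spawner_start(int request_fd)
{
	pid_t child;

	child = fork();
	if (child != 0)
		return child; /* parent (or error): proceed to spawn the thread pool */

	request_loop(request_fd);
	exit(0);
}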
I feel like I need to relearn signals every time I have to interact with
them. Best get this done while the iron's hot.
1. The ROA file is first written as `<cache>/.roa`.
The RK file is first written as `<cache>/.rk`.
2. When the validation run is done, `.roa` is renamed to `--output.roa`,
and `.rk` becomes `--output.bgpsec`.
3. Most terminating signals unlink `.roa` and `.rk`.
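In code, steps 2 and 3 are roughly as follows (paths abbreviated; Fort
prefixes them with the cache directory, and the real handler covers
`.rk` as well):

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static char const TMP_ROA[] = ".roa"; /* lives in <cache>/ */

/* 2. Successful run: the temp file becomes --output.roa. */
static void
publish_roa(char const *output_roa)
{
	if (rename(TMP_ROA, output_roa) != 0)
		perror("rename");
}

/* 3. Terminating signal: discard the partial file. unlink() and
 * _exit() are async-signal-safe, so this is legal in a handler. */
static void
on_termination(int signo)
{
	unlink(TMP_ROA);
	_exit(128 + signo); /* one way to terminate; details may differ */
}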
The sigaction() code was in logging because it was originally conceived
for the SIGSEGV stack trace printing hack. The SIGPIPE ignorer was also
incidentally moved there at some point, but it has never had anything
to do with logging.
And I'm going to catch more signals in the upcoming commits, so this
really needs to be formalized into its own module.
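The module boils down to a single registration point, something like
this (print_stack_trace() is a made-up name for the SIGSEGV hack's
handler):

#include <signal.h>
#include <string.h>

void print_stack_trace(int); /* hypothetical name */

static int
signals_init(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sigemptyset(&sa.sa_mask);

	sa.sa_handler = SIG_IGN; /* the old SIGPIPE ignorer */
	if (sigaction(SIGPIPE, &sa, NULL) != 0)
		return -1;

	sa.sa_handler = print_stack_trace; /* the SIGSEGV hack */
	if (sigaction(SIGSEGV, &sa, NULL) != 0)
		return -1;

	/* Handlers for the terminating signals (the .roa/.rk cleanup)
	 * get registered here by the upcoming commits. */
	return 0;
}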
It seems I'm finally done making dramatic wide-reaching changes to the
codebase. There's still plenty to add and test, but I would like to
start pushing atomic commits from now on.
This is a squashed version of development branch "issue82". It includes
a few merges with main.
- `cache/rsync`, `cache/https` and `cache/rrdp` contain "refreshes"
(the exact latest files according to the servers). RRDP withdraws are
honored, and rsyncs run without --compare-dest.
- "Refresh" files marked as valid are backed up in `cache/fallback`
at the end of each validation cycle.
- Validation first tests fallback+refresh. (If a file exists in both,
refresh wins.) If that fails, it retries with fallback only.
- The index is not a tree; everything is caged in numbered directories
and indexed by exact URL, to prevent file overriding by URL hacking.
There's also a `cache/tmp` directory, where Fort temporarily dumps
notifications, snapshots and deltas. This directory will be removed
once #127 is fixed.
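To illustrate the "exact URL, numbered cage" mapping (these are not the
real structures, just the shape of the idea):

struct index_entry {
	char *url;  /* exact URL, e.g. "rsync://a.b.c/mod/x.cer" */
	char *cage; /* numbered directory inside the cache, e.g. "3" */
};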
X509V3_EXT_print() was being summoned to print extensions unrelated to
RPKI. The TODO wanted me to pick a suitable flag for extensions unknown
even to libcrypto.
For reference, this is how X509V3_EXT_print() prints an AIA, as a known
extension:
CA Issuers - URI:rsync://rpki.ripe.net/repository/aca/KpSo3VVK5wEHIJnHC2QHVV3d5mk.cer
This is how X509V3_EXT_print() prints the same AIA, as an unknown
extension, with X509V3_EXT_PARSE_UNKNOWN enabled:
These two quirks made the validation mostly a no-op.
There's also the issue that this implementation seems inefficient,
especially since Fort doesn't need to DER-encode anywhere else. By
checking the encoding while parsing, I would save a lot of memory
in addition to being able to delete that mess of encoding functions.
But I'm going to have to push that to the future. This is growing more
ambitious than I can afford during a release review, and given that the
code wasn't really doing anything productive in the first place, I'm not
losing much by simply axing it for now.
- Employ libssl's OID parsing rather than implement it from scratch.
- Rename `struct signed_object_args` to `struct ee_cert`, since it's
just a bunch of EE certificate data.
- Remove `struct signed_data`, because it wasn't actually contributing
anything.
rsync cannot download into standard output... which means rsync'd files
cannot be elegantly piped into --mode=print. So either the rsync has to
be done manually by the user... or --mode=print has to do it
internally.
And looking at the code that resulted... I now wish I had gone with the
former option. Because of the long overdue cache refactors, the user
needs to include --tal for this rsync to be compatible with the cache.
This sucks.
As a workaround, Fort will rsync into /tmp if --tal and/or --local-cache
aren't supplied:
$ fort --mode=print \
--validation-log.enabled \
--validation-log.level debug \
rsync://a.b.c/d/CRL.crl
...
May 10 13:32:44 DBG [Validation]: Executing rsync:
May 10 13:32:44 DBG [Validation]: rsync
May 10 13:32:44 DBG [Validation]: ...
May 10 13:32:44 DBG [Validation]: rsync://a.b.c/d/CRL.crl
May 10 13:32:44 DBG [Validation]: /tmp/fort-Q7tMhz/CRL.crl
...
{
	"tbsCertList": {
		"version": 1,
		...
Early in each validation run, Fort used to clear the --output.roa and
--output.bgpsec files, to make sure they were writable.
So this is why the files spent so much time being empty! This was not
acceptable. It didn't even guarantee the files would still remain
writable by the time Fort needed to properly populate them.
This function was always including the binary flag ("b") in the
fopen(3) mode, which seems to be inappropriate for the --output.roa and
--output.bgpsec files.
Well, the Unixes don't do anything with this flag, so this is more of a
semantic fine-tune than a bugfix.
The code was writing dates in Zulu format, which was fine. But then, to
read them back, it was loading them with mktime(), which is a local
timezone function. The effects of this bug depend on the time zone.
Files would expire from the cache up to 12 hours too early (UTC+) or
late (UTC-), or be updated up to 12 hours late (UTC-). (There's no such
thing as "updating too early" in this context, since Fort cannot refresh
before the present.)
I fixed this by swapping mktime() for timegm(), which is not great,
because the latter is still a nonstandard function. But it'll have to
do, because the other options are worse.
Below is my thought process on timegm() and the other options:
1. I want to store timestamps as human-readable strings.
2. Converting a timestamp to Zulu string is trivial: time_t ->
gmtime_r() -> struct tm (GMT) -> strftime() -> char*.
3. But converting a string to timestamp relies on timegm(), which is
nonstandard: char* -> strptime() -> struct tm (GMT) -> timegm() ->
time_t.
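For reference, a minimal sketch of both conversions (assuming a plain
ISO 8601 Zulu format string; error handling omitted):

#include <time.h>

#define ZULU_FMT "%Y-%m-%dT%H:%M:%SZ" /* assumed format */

/* 2. time_t -> "2024-05-10T13:32:44Z" */
static void
ts2str(time_t ts, char *buf, size_t len)
{
	struct tm tm;

	gmtime_r(&ts, &tm);
	strftime(buf, len, ZULU_FMT, &tm);
}

/* 3. "2024-05-10T13:32:44Z" -> time_t; this is the step that drags in
 * the nonstandard timegm(). */
static time_t
str2ts(char const *str)
{
	struct tm tm = { 0 };

	strptime(str, ZULU_FMT, &tm);
	return timegm(&tm);
}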
Brainstorm:
1. Store the dates in local time.
Hideous option, but not ENTIRELY insane.
Storing in local time will render the dates up to 24 hours off, I think.
But this only happens when the cache changes timezone, which isn't
really often.
But it's still pretty clumsy, and also not future-proof, as new date
fields would have to be constrained by the same limitation.
2/10 at best.
2. Ditch time_t, use struct tm in UTC for everything.
So I would rely on gmtime_r() only to get out of time_t, and never need
timegm() for anything.
Comparing field by field would be a pain, but it's interesting to note
that Fort is actually already doing it somewhere: tm_cmp(). I guess I
have to admire the audaciousness of past me.
What mainly scares me is that mktime() seems to be the only standard
function that is contractually obligated to normalize the tm, which
means I would have to keep mktime()ing every time I need to compare
them.
And mktime() is... a local time function. Probably wouldn't actually
work.
4/10. I hate this API.
3. Bite the bullet: Go modern POSIX; assume time_t is integer seconds
since the 1970 epoch.
I'm fantasizing about an early environment assertion that runs
gmtime_r() on a known time_t, and shows people the door if the
implementation isn't sane. (There's a sketch of this after the list.)
It would work even on POSIX-adjacent systems like most Linuxes.
This would have other benefits, like the ability to ditch difftime(),
and perform arithmetic operations on time_t's.
All the officially supported platforms get this aspect of POSIX right,
so what's stopping me?
Well, it would mean I would have to store the date as a seconds-since-
epoch integer, which is not as human-friendly and defeats the whole
point of the JSON format.
And yet... this feels so right, I might end up doing it even regardless
of the JSON data type and timegm().
But not today. 7/10.
4. Bite the bullet: Use timegm().
Interesting. timegm() might be added to C23:
> Changes integrated into the latest working draft of C23 are listed
> below.
> (...)
> - Add timegm() function in <time.h> to convert time structure into
> calendar time value - similar to function in glibc and musl libraries.
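For what it's worth, the environment assertion from option 3 could be
as simple as this (using the well-known reference value 1234567890 =
2009-02-13 23:31:30 UTC):

#include <assert.h>
#include <time.h>

/* Show people the door if time_t is not integer seconds since the
 * 1970 epoch. */
static void
assert_sane_time(void)
{
	time_t known = 1234567890; /* 2009-02-13 23:31:30 UTC */
	struct tm tm;

	assert(gmtime_r(&known, &tm) != NULL);
	assert(tm.tm_year == 109 && tm.tm_mon == 1 && tm.tm_mday == 13);
	assert(tm.tm_hour == 23 && tm.tm_min == 31 && tm.tm_sec == 30);
}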
Reject RRDP snapshots and deltas if they're not hosted on the same
origin as the notification. This prevents malicious notifications from
wasting other servers' bandwidth.
First half of draft-spaghetti-sidrops-rrdp-same-origin-00.
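A sketch of the check; uri_scheme(), uri_host() and uri_port() are
hypothetical accessors, and "same origin" here means scheme + host +
port, as in RFC 6454:

#include <stdbool.h>
#include <string.h>

struct uri;
char const *uri_scheme(struct uri const *); /* hypothetical accessors */
char const *uri_host(struct uri const *);
unsigned int uri_port(struct uri const *);

static bool
same_origin(struct uri const *notification, struct uri const *file)
{
	return strcmp(uri_scheme(notification), uri_scheme(file)) == 0
	    && strcmp(uri_host(notification), uri_host(file)) == 0
	    && uri_port(notification) == uri_port(file);
}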
- Remember old delta hashes.
- If one of the delta hashes changes in the notification, the session
needs to be re-snapshot'd, even if the session ID and serial seem
consistent.
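Roughly, the memo that needs to survive between refreshes (names are
illustrative; RRDP hashes are SHA-256):

#include <stdbool.h>
#include <string.h>

struct delta_memo {
	unsigned long serial;   /* the delta's serial number */
	unsigned char hash[32]; /* SHA-256 advertised by the notification */
};

/* True if the notification now advertises a different hash for a
 * serial we already remember, in which case the session has to be
 * re-snapshot'd. */
static bool
delta_desync(struct delta_memo const *old, unsigned char const *new_hash)
{
	return memcmp(old->hash, new_hash, sizeof(old->hash)) != 0;
}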
Before, only local paths were being normalized. This was because, by
some twisted logic, I was under the impression that `https://a.b.c/d`
and `https://a.b.c/d/.` are technically supposed to be different
identifiers. My recent visit to RFC3986 disproved this.
But regardless of what the standards say, it doesn't make sense to treat
those URLs differently in the context of RP validation. They will be
cached into the same directory.
Not that this aspect of the code used to be incorrect, to the best of
my knowledge. Fort needs normalization for proper pre-download
identification (which in turn is needed to prevent a file from being
downloaded twice during the same iteration). Old code used to identify
by local path, and the upcoming session desync commit will need to index
by remote path. Hence the need for this change.
In the upcoming commits, the cache is going to need to start managing
the notifications differently from the other downloadables, including
snapshots and deltas.
The logic used to tell the different objects apart was getting
convoluted, and had been naturally converging into the URI object since
the 1.6.0 refactors, IIRC.
So stop beating around the bush and commit to this design. Store the
type in the URI object, and add several values to the uri_type enum.
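Conceptually (these enumerators are placeholders; the commit doesn't
spell out the new names, only that notifications, snapshots and deltas
are now distinguished by a type stored in the URI object):

enum uri_type {
	UT_RSYNC,        /* placeholder names */
	UT_NOTIFICATION,
	UT_SNAPSHOT,
	UT_DELTA,
	/* ... */
};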
To prevent extra allocations and copies, strings in the RRDP module used
to be stored as `xmlChar *`s. This optimization doesn't hold water now
that I'm planning to also convert them to JSON.
Not much to report on this one. It seemed the code was casting string
types back and forth too carelessly, but it turns out its assumptions
are agreeable.
In particular, I got scared by what appeared to be a missing printable
ASCII validation. But no; libxml2 takes care of it, thanks to the XML
grammar.
So what was really missing was a comment explaining the logic, so I
don't have to dig it up again. And some unit tests to preserve the
guarantees.
There are other validations I'm not so sure about, but I decided to
leave them out of the scope of this commit.
The old decoding was done using libcrypto's BIO API, even though the
code didn't need to perform any stream chaining. (Which seems to be
BIO's whole point.)
Trying to figure out why it wasn't properly erroring on invalid base64,
I eventually realized I could dramatically simplify everything by
shifting the API layer. I'm now using `EVP_DecodeUpdate()` and friends.
It's so much cleaner and more correct now; it's incredible.
This had a cascade effect. The complexity of the base64.h API also
collapsed, which resulted in massive refactors on the callers.
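Roughly what the new path looks like (this is the generic EVP
incremental pattern rather than Fort's exact code):

#include <openssl/evp.h>

/* Decode base64 from `in` (len bytes) into `out`; returns the number
 * of decoded bytes, or -1 on malformed input. `out` needs roughly
 * 3 * len / 4 bytes. */
static int
decode_base64(unsigned char const *in, int len, unsigned char *out)
{
	EVP_ENCODE_CTX *ctx;
	int outl, total;

	ctx = EVP_ENCODE_CTX_new();
	if (ctx == NULL)
		return -1;

	EVP_DecodeInit(ctx);
	if (EVP_DecodeUpdate(ctx, out, &outl, in, len) < 0) {
		EVP_ENCODE_CTX_free(ctx);
		return -1; /* invalid base64 */
	}
	total = outl;
	if (EVP_DecodeFinal(ctx, out + total, &outl) < 0) {
		EVP_ENCODE_CTX_free(ctx);
		return -1;
	}
	total += outl;

	EVP_ENCODE_CTX_free(ctx);
	return total;
}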
1. The error code wasn't being caught.
2. The second argument was NULL, so the caller had no way of knowing
whether the string was fully consumed.
3. The code wasn't making sure the input was a clean hex string.
4. The input string wasn't NULL-terminated.
In the end, I ditched strtol(). It's too dangerous and cluttered for
this purpose.
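The replacement is basically a strict, hand-rolled hex parser; the
shape is something like this (not Fort's exact code):

#include <stddef.h>

/* Strict parser for one hex byte. Rejects anything that isn't exactly
 * two hex digits, and never reads past `len`, so unterminated input is
 * not a problem. Returns 0 on success, -1 on error. */
static int
parse_hex_byte(char const *str, size_t len, unsigned char *result)
{
	unsigned int value = 0;
	size_t i;

	if (len < 2)
		return -1;
	for (i = 0; i < 2; i++) {
		char c = str[i];
		value <<= 4;
		if (c >= '0' && c <= '9')
			value |= c - '0';
		else if (c >= 'a' && c <= 'f')
			value |= c - 'a' + 10;
		else if (c >= 'A' && c <= 'F')
			value |= c - 'A' + 10;
		else
			return -1; /* not a clean hex string */
	}

	*result = value;
	return 0;
}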
We expect [a] to bind the socket to any IPv4 address,
[b] to bind the socket to any IPv6 address,
and [c] to bind the socket to any IPv4 and IPv6 address.
Right? The BSDs work that way.
But Linux doesn't. For Linux,
[a] binds to any IPv4 address,
[b] binds to any IPv4 and IPv6 address,
and [c] is an error.
But I don't want to override the behavior because some admins are
probably used to it:
linux$ nc -6lknv :: 7890
Listening on :: 7890
Connection received on ::1 52814
Hello from IPv6!
Connection received on ::ffff:127.0.0.1 55456
Hello from IPv4!
Instead, let's print a warning.
Thanks to Benjamim Pinheiro for reporting this quirk.
- Hint SOCK_STREAM on getaddrinfo(), not on socket().
I guess this means it was attempting some redundant binds on failure?
- Print the "Attempting to bind socket to address '%s', port '%s'"
message in every addrinfo, not just the first one. (Improves log
legibility.)
- Improve the algorithm that produces the printable version of the
address (it used to dereference ai_addr->sa_data directly, like the
code in the previous commit), and fall back to whatever the user
provided if it fails.
- Weird inconsistency: Most errors used to cause create_server_socket()
to try the next addrinfo candidate, but getsockname() and listen()
failures aborted it immediately. I've no idea what that was about;
getsockname() isn't even a mandatory step.
- Attempt to bind all addrinfo candidates; they must all succeed.
(It used to stop on the first success.)
I'm conflicted about the last point. The old behavior appears to have
been inherited from the Linux manual page example, but doesn't make any
sense from where I'm standing.
If `/etc/hosts` defines
127.0.0.1 localhost
::1 localhost
and server.address is "localhost", shouldn't we attempt to bind to both
addresses?
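The new behavior, stripped down to its core (my paraphrase, not the
actual function; no logging, and the descriptors still need to be
handed to the server loop):

#include <netdb.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind and listen on every candidate; any failure fails the whole
 * thing. With the /etc/hosts above, "localhost" yields 127.0.0.1 and
 * ::1, and we now serve on both. */
static int
bind_all(struct addrinfo *candidates)
{
	struct addrinfo *ai;
	int fd;

	for (ai = candidates; ai != NULL; ai = ai->ai_next) {
		fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
		if (fd < 0)
			return -1;
		if (bind(fd, ai->ai_addr, ai->ai_addrlen) != 0
		    || listen(fd, SOMAXCONN) != 0) {
			close(fd);
			return -1;
		}
	}

	return 0;
}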
While trying to implement the upcoming patch, I found myself trying to
refactor this validation. The implementation was awkward; instead of
ntohs'ing sockaddr_in.sin_port, it extracted the bytes manually:
port = (unsigned char)((*result)->ai_addr->sa_data[0]) << 8;
port += (unsigned char)((*result)->ai_addr->sa_data[1]);
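For reference, the conventional ntohs() form would be something like
this (untested; needs <netinet/in.h>):

struct sockaddr_in *sin = (struct sockaddr_in *)(*result)->ai_addr;
port = ntohs(sin->sin_port);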
But on second thought, I don't see eye to eye with this validation.
It's intended to prevent `getaddrinfo()` from parsing
`--server.port=120000` as `--server.port=54464` (120000 % 65536) on
Linux, but... thing is, this "port out of bounds" quirk looks like an
everyday Linux idiosyncrasy.
Linux being Linux:
$ nc -lv 127.0.0.1 120000
Listening on localhost 54464
It's weird but seems inoffensive, and some beardos might even expect it.
`getaddrinfo()` returns proper errors in the BSDs, so the validation is
redundant there.