I found a couple of symbols I missed during the previous macro review:
- getline(). This forces POSIX >= 2008.
- IN6_ARE_ADDR_EQUAL(). I can't find where this is supposed to be
standardized, so I decided to ditch it.
Also, defining these macros in every dependent file scales poorly, so
I moved them to the CC directives, unifying them in the process.
_POSIX_C_SOURCE is now 200809L, and _XOPEN_SOURCE is 700 (to match).
Also, reduce -std to c99. I don't really have an issue with gnu11, but
it looks like the delta is heavily underutilized in the project.
Might revert it later.
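To illustrate the dependency (a hypothetical snippet, not an actual Fort
source file): getline() is only declared by `stdio.h` when
_POSIX_C_SOURCE >= 200809L, which now comes from the build flags
(something along the lines of `cc -std=c99 -D_POSIX_C_SOURCE=200809L
-D_XOPEN_SOURCE=700`) instead of per-file defines:

```c
#include <stdio.h>
#include <stdlib.h>

/* Needs _POSIX_C_SOURCE >= 200809L; otherwise getline() is not declared. */
static void
print_lines(FILE *file)
{
	char *line = NULL;
	size_t len = 0;

	while (getline(&line, &len, file) != -1)
		printf("%s", line);

	free(line);
}
```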
Perform a token attempt to create the cache directory, as well as a more
reasonable one to create its tmp/ subdirectory, whenever the validation
cycle begins.
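A minimal sketch of the intent (the function and its arguments are
illustrative; the real paths come from Fort's configuration):

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

static int
prepare_cache(char const *cache, char const *cache_tmp)
{
	/* Token attempt: if this fails, the error will surface later, with
	 * better context, once the cache is actually written to. */
	mkdir(cache, 0777);

	/* tmp/ gets the more reasonable treatment: an already-existing
	 * directory is fine, anything else is reported right away. */
	if (mkdir(cache_tmp, 0777) != 0 && errno != EEXIST) {
		int error = errno;
		fprintf(stderr, "Cannot create %s: %s\n", cache_tmp,
		    strerror(error));
		return error;
	}

	return 0;
}
```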
This option is causing portability issues, and I can't figure out why it
was introduced. None of the GitHub issues mention it, and 6401a4739ac512985158e63499e037dc2f2078db says
> Use SO_REUSEPORT at sockopts so that the RTR port can be reused
Reorganize `#include <>`s in accordance with the IEEE Std 1003.1,
the Linux man pages (which do a pretty good job explaining portability
nuances), the documentation of the dependencies and some common sense.
(Since it seems some of this stuff is undefined.)
The algorithm still induces some unnecessary includes. (e.g. the `NULL`
symbol induces `stddef.h`, `string.h`, `stdlib.h`, `stdio.h`, `unistd.h`
AND `locale.h`, because the standard states it should be defined in all
of them.) I don't think this is a problem for now; I'll optimize it
later.
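A contrived illustration of that redundancy (file and symbols invented
for the example): a unit that only calls strcmp() and compares pointers
against NULL currently gets both of these, even though `string.h` alone
already provides NULL:

```c
#include <stddef.h>	/* induced by NULL */
#include <string.h>	/* induced by strcmp(); also defines NULL */

int
names_match(char const *a, char const *b)
{
	if (a == NULL || b == NULL)
		return 0;
	return strcmp(a, b) == 0;
}
```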
The asn1c stuff was autogenerated, and the uthash stuff was copy-pasted
from its source project.
None of it serves any purpose. I'm not allergic to the possibility of
supporting these other environments, but this is not the time for it.
Also, it probably doesn't work anyway, since it has never been tested
in the context of Fort.
This doesn't delete all of the glue code; only the parts that need to
go for the upcoming commit.
I'm currently reviewing the system includes. There are some unnecessary
ones, as well as a few nonstandard quirks complicating portability.
This is the first of possibly quite a few commits intended to refactor
this relative mess. It makes all `#include ""`s relative to the root of
the source directory.
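For example (hypothetical header names; the actual files may differ):

```c
/* Before: relative to the including file's directory. */
#include "../rtr/pdu.h"

/* After: relative to the root of the source directory. */
#include "rtr/pdu.h"
```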
Mainly so a local cache tree traversal recovers from a malformed URI.
1. Atomize append operations
Failure used to leave the path builder in an essentially undefined
state, which meant the path had to be thrown away, precluding further
tree traversal.
A failing `append()` no longer modifies the path builder, allowing the
tree traversal to simply skip the ailing branch. (There's a small
sketch of this after the list.)
2. Remove lazy failure
The old path_builder was postponing error reporting to path compilation
(`pb_peek()` and `pb_compile()`).
This was an old optimization (meant to simplify path building code),
and was getting in the way of the atomizing.
Errors are now fail-fast, thrown during path construction
(`pb_append*()`).
3. Move path normalization to `uri`
Path normalization (collapsing `.`, `..` and `//`) was getting in the
way of the atomizing, in addition to only really being useful for the
URI-to-cache conversion code.
4. Restore support for absolute paths
This was just a small TODO spawned a few commits ago.
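Here's the sketch promised in point 1. It's a minimal illustration of
the atomized append, not Fort's actual path_builder (the struct layout
and names are assumptions):

```c
#include <errno.h>
#include <string.h>

struct path_builder {
	char path[4096];
	size_t len;
};

/* Appends "/<component>". On failure, the builder is left untouched, so
 * the caller can keep traversing sibling branches with the same builder. */
static int
pb_append(struct path_builder *pb, char const *component)
{
	size_t clen = strlen(component);

	if (pb->len + 1 + clen + 1 > sizeof(pb->path))
		return ENOSPC; /* fail fast; pb->path and pb->len unchanged */

	pb->path[pb->len] = '/';
	memcpy(pb->path + pb->len + 1, component, clen + 1);
	pb->len += 1 + clen;
	return 0;
}
```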
It's a bit smarter now. Addresses a bunch of issues at once, though it
still needs several tweaks and testing:
- #78: Provide a dedicated namespace for each RRDP notification, to
prevent malicious RPPs from overwriting files from other RPPs.
- #79: RRDP session and serial are no longer cached in RAM; they're
extracted from cached notification files as they are needed.
This prevents all RRDP from being considered outdated during startup.
- #80: rsync-strategy has been removed.
- #81: The cache now retains RRDP files.
The refactor has been more intrusive than intended. I've been retouching
the core loop and rrdp/https code, which has yielded the following
further disinfections:
- #77: Refactor the HTTP code so 304 is handled as success, despite no
file modifications having been made.
- It seems the old code was refusing to download RPPs via RRDP when said
RPP wasn't also (unrelatedly) served via rsync. This seemed to stem
from an old RFC misunderstanding from the previous developer.
- I've deprecated `rsync.priority` and `rrdp.priority`, mostly just to
simplify the code. I haven't seen anyone using these config fields,
and I think SIAs and/or randomness should be the ones to decide which
protocol is preferred for a given RPP, not Fort's admin.
- However, I have also decided to deprecate `shuffle_tal_uris`, because
I also suspect it's completely unused, and would like to hear some
complaints otherwise.
- Deprecated `rsync.arguments-flat`, because non-recursive rsyncs are
not needed anymore.
- Since RRDP files are no longer deleted immediately after use, the
`DEBUG_RRDP` compilation flag has lost its purpose, so I deleted it.
- The code was using `HASH_ADD_STR` on strings contained outside of the
node structure. This is illegal according to uthash's documentation,
and might have induced some crashes in the past.
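A reduced example of the legal alternative (the struct and the include
are assumptions, not Fort's actual code): uthash's documentation
reserves `HASH_ADD_STR` for string keys declared as char arrays inside
the struct; for a char pointer key, the documented macro is
`HASH_ADD_KEYPTR`:

```c
#include <stdlib.h>
#include <string.h>
#include "uthash.h"

struct node {
	char *name;		/* key: the string itself lives on the heap */
	UT_hash_handle hh;
};

static struct node *nodes;

static void
add_node(char const *name)
{
	struct node *n;

	n = calloc(1, sizeof(struct node));
	if (n == NULL)
		abort();
	n->name = strdup(name);
	if (n->name == NULL)
		abort();

	/* HASH_ADD_STR(nodes, name, n) would be the illegal variant here;
	 * HASH_ADD_KEYPTR is the one documented for pointer keys. */
	HASH_ADD_KEYPTR(hh, nodes, n->name, strlen(n->name), n);
}
```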
As discovered in 566835e8da0ce52b6bded39db72667eeb2e41001, this
validation was implemented incorrectly. Fort should locate the parent
certificate in the local cache by URI, not force-redownload by rsync.
The URI indexing will be implemented as part of #78. I'll reimplement
the validation properly then.
It's a collection of failed downloads, and it serves two purposes:
1. To list the failed downloads once the validation is over.
(Which is the functionality I deleted in the previous commit.)
2. To clue the AIA validation code (through an EREQFAILED) not to...
uh... redownload... the AIA certificate's... parent.
Huh.
Yeah, I have a few issues with the implementation:
A. Purpose 1 is obsolete.
B. Regarding purpose 2: Fort should never redownload a file that was
already traversed in the same validation cycle. This purpose is
plainly wrong.
Oh, you know what? I get it.
The original coder was probably concerned that the parent might have
been downloaded via RRDP, yet the child's AIA is always an rsync URI...
and because RRDP and rsync are cached into separate namespaces, well,
Fort wasn't going to find the parent.
But the thing is, it's a URI, not a URL. RRDP also refers to these files
by way of their "rsync" *URI* in its snapshots and deltas. RRDP might
be HTTP, but there is no such thing as http://path.to/certificate.cer.
This should be fixed by way of clever local cache resolution, not an
awkward redownload.
C. The scope of this table should be a single validation run, not
Fort's entire lifetime.
I think the reason why it's global is so the code could warn if a
particular resource had been down for several iterations.
But I wouldn't say that's Fort's job, and even if it is, it's probably
best to move it to Prometheus somehow. We're relying too much on the
logs.
The "summary" was a list of failed downloads, and from what I gather,
it was intended to clue the admin of a possible disconnection problem.
Thing is, the implementation was rather bewilderingly large and
obtrusive, and now that we're going to have a Prometheus endpoint, it's
most likely replaceable by an elegant counter.
And I haven't implemented Prometheus yet, but I would definitely love to
stop running into this invasive code all over the place, thanks.
Trying to recover is incorrect because we don't want to advertise an
incomplete or outdated VRP table to the routers. We don't want to rely
on the OOM-killer; we NEED to die on memory allocation failures ASAP.
Though this doesn't have much to do with the RRDP refactor, I'm doing it
early to get plenty of time for testing and review.
Partially F1xes #40. (Still need to update some dependency usages.)
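A minimal sketch of the fail-fast policy (the wrapper's name is
hypothetical, not necessarily the one Fort ends up with):

```c
#include <stdio.h>
#include <stdlib.h>

static void *
pmalloc(size_t size)
{
	void *result;

	result = malloc(size);
	if (result == NULL) {
		/* Don't limp along with a corruptible VRP table, and don't
		 * wait for the OOM killer; die right here, right now. */
		fprintf(stderr, "Out of memory.\n");
		abort();
	}

	return result;
}
```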
- Fort SHOULD die as soon as it realizes the VRP table is corrupted, as
we should not send garbage to the routers.
- Also, I'm not entirely sure the code would not crash later anyway,
since the table is, in fact, corrupted.
- Plus, if it doesn't crash, there would be no core dump to further
analyze the bug.
2. Point bug output to the currently active bug report
I was hesitant to put this on main because it seemed like a performance
tank, but truth be told, I'm being a wuss. These table iterations are
nothing compared to the amount of time Fort has to spend downloading.
And it looks like I'm never going to find this bug with the stack trace
alone.
Does not fix #83 nor #89, but prevents the crash at least.
Job Snijders [Sat, 21 Jan 2023 11:43:02 +0000]
Ensure X509 extensions are hashed & cached, before deciding a cert is CA or EE
If X509_check_ca() fails to cache X509v3 extension values, the return
value may be incorrect, leading to erroneously assuming a given certificate
is a CA or EE cert (while in reality it is the other, or neither).
This failure mode can arise because X509_check_ca() doesn't verify
whether libcrypto's (void)x509v3_cache_extensions(x) flipped the EXFLAG_INVALID
flag in x->ex_flags. Unfortunately, X509_check_ca() doesn't have a return code
to indicate an error, so this can't be fixed in libcrypto - the API is broken.
The workaround is to call X509_check_purpose(3) with a purpose argument of -1
before calling X509_check_ca(); this ensures the X509v3 extensions are cached.
Since X509_check_purpose() does have a return code to indicate errors, we can
use that to supplement X509_check_ca()'s shortcomings.
OpenBSD's rpki-client also uses the above approach.
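A reduced sketch of the pattern (Fort's actual caller and error
handling differ):

```c
#include <openssl/x509.h>
#include <openssl/x509v3.h>

static int
get_ca_status(X509 *cert)
{
	/*
	 * With a purpose of -1, X509_check_purpose() merely forces libcrypto
	 * to parse and cache the X509v3 extensions, and (unlike
	 * X509_check_ca()) it reports failure through its return value.
	 */
	if (X509_check_purpose(cert, -1, 0) <= 0)
		return -1; /* caching failed; X509_check_ca() can't be trusted */

	return X509_check_ca(cert); /* extensions are now known to be cached */
}
```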
backtrace() is a glibc-only feature. Some systems, such as Alpine,
do not use glibc.
It seems one solution is to rely on community ports, but I imagine it'd
be best to offload such a decision to the user. Not the safest.
Instead, if backtrace() is not available, just delete stack traces from
the binary. It's going to be a pain to debug, but that's the world we
live in, I guess.
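Roughly like this (`HAVE_BACKTRACE` is an assumed autoconf-style macro,
not necessarily the name the build will end up using):

```c
#ifdef HAVE_BACKTRACE
#include <execinfo.h>

static void
print_stack_trace(void)
{
	void *frames[64];
	int nframes;

	nframes = backtrace(frames, 64);
	backtrace_symbols_fd(frames, nframes, 2); /* 2: stderr */
}

#else

static void
print_stack_trace(void)
{
	/* backtrace() unavailable (e.g. musl); stack traces are omitted. */
}

#endif
```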
Turns libexec into an optional dependency. Fixes #87.
Also, the commit contains a review and update of the documentation's
Alpine dependency list. There was a lot of fat in there.
Certificate: Remove subject name uniqueness validation
RFC 6487:
> An issuer SHOULD use a different subject name if the subject's key
> pair has changed (i.e., when the CA issues a certificate as part of
> re-keying the subject.)
Fort's implementation was problematic. The code was comparing the
certificate's subject name and public key to siblings that had
potentially not been validated yet. It seems to me this would make it
possible for attackers to crash FORT (by posting invalid objects) or to
invalidate legitimate objects (by publishing siblings that contained
conflicting subject names and public keys, without having to worry about
the rest of the fields).
This would be somewhat difficult to fix. I asked on the mailing list
and Discord ("RPKI Community"), and it seems the consensus is "don't
validate it." Subject Names don't really matter that much, because
RPKI's primary concern is resource ownership, not identity. Furthermore,
I'm not convinced that chopping branches off the tree because of a
clumsy key rollover is a good idea.
On the other hand, it looks like the notfatal hash table API was being
used incorrectly. HASH_ADD_KEYPTR can OOM, but `errno` wasn't being
caught.
Fixing this is nontrivial, however, because strange `reqs_error`
functions are in the way, and that's spaghetti I decided to avoid.
Instead, I converted HASH_ADD_KEYPTR usage to the fatal hash table API.
That's the future according to #40, anyway.
I don't think this has anything to do with #83, though.
When the file didn't need to be downloaded, because the IMS
(If-Modified-Since) header did its job, there's nothing to delete, so
Fort shouldn't be complaining.
Like the previous commit, this is not a great solution, because IMS is
not the only trigger of file deletes, and the error message might be
helpful in other cases. Then again, I don't really agree with this eager
repository cleaning technique; it complicates debugging.
The proper solution is a WIP in the rrdp-refactor branch.
Lots of error messages were referring to the wrong file, and several of
them printed the correct file manually as what I can only describe as a
quick workaround.
It's not perfect; not all the RRDP code has been patched. That'll have
to wait until the deep RRDP refactor.
HTTP: Warn if a download exceeds 50% of the file size limit
Requested by Ties de Kock:
> Since some RPs now includes an upper limit on object size (some use
> 2MB if I recall correctly) I would appreciate a warning if an object
> goes over "a large fraction" of this limit (and a sample of the
> warning in the changelog and metrics if possible) - so people know
> what they need to alert on. In this situation operators can monitor
> for "natural growth" of an object and intervene, while the case that
> this check prevents (maliciously large objects) is still covered.
>
> The largest object I could find in the wild is 1.2MB (APNIC AS0 ROA).
> The RIPE NCC's largest object is smaller at the moment (but the CRL
> grows quickly if we do member CA keyrolls - since it adds all object
> on it).
>
> In summary:
>
> - I would recommend a warning (and preferably a metric) when an object
> of 50% of the object size limit is encountered.
> - I would like it if the hard limit is "safe" - especially CRLs can
> grow in some cases.
The metric will be added later, as part of #50. The warning is, e.g.:
File size exceeds 50% of the configured limit (10/20 bytes).
50% is hardcoded at the moment.
Notice that this is an HTTP-only patch. rsync does not warn.
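For the record, the check is roughly the following (a sketch assuming a
typical libcurl-style write callback; the names and the exact warning
text are illustrative):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct download {
	size_t received;	/* bytes received so far */
	size_t limit;		/* configured file size limit, in bytes */
	bool warned;		/* warn only once per file */
};

/* Called from the write callback every time a chunk arrives. */
static void
check_file_size(struct download *dl, size_t chunk)
{
	dl->received += chunk;

	if (!dl->warned && dl->received > dl->limit / 2) {
		fprintf(stderr, "File size exceeds 50%% of the configured "
		    "limit (%zu/%zu bytes).\n", dl->received, dl->limit);
		dl->warned = true;
	}
}
```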