It looks like %F, %T and %z are not portable conversion specifiers
for strptime() and strftime(). Therefore, change the date format
from "%FT%T%z" to "%Y-%m-%dT%H:%M:%SZ".
This means the JSON now employs Zulu, which also fixes a unit test that
used to be hardcoded to my own timezone. Yaay.
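A minimal sketch of the portable format in use. The helper name `format_zulu` is hypothetical (it is not taken from the code base), and the hardcoded `Z` is only correct because the time is broken down as UTC first:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/*
 * Hypothetical helper: writes `when` as "YYYY-MM-DDTHH:MM:SSZ".
 * Returns the number of characters written (excluding the terminator),
 * or 0 if the buffer is too small.
 */
size_t format_zulu(time_t when, char *buf, size_t buflen)
{
	struct tm *utc;

	assert(buf != NULL);

	/* gmtime() is plain ISO C; threaded code would prefer gmtime_r(). */
	utc = gmtime(&when);
	if (utc == NULL)
		return 0;

	/* Every specifier below is in ISO C89, unlike %F, %T and %z. */
	return strftime(buf, buflen, "%Y-%m-%dT%H:%M:%SZ", utc);
}
```

Because the output never depends on the local timezone, a test can compare against a fixed string.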
While playing with the configuration sample, I found out that setting
a `null` slurm property in the JSON was rejected, even though the SLURM
file itself is not mandatory.
So rethink nullability, for this and a few other fields as well.
Before, it used to clean old abandoned files, and nodes for which the
files seemed to have disappeared. Now it also deletes files for which
the node seems to have disappeared.
Job Snijders [Fri, 24 Nov 2023 09:46:37 +0000 (09:46 +0000)]
Don't set directory modtimes to match the source
When syncing against remote repositories, the modtimes of the
remote directories are irrelevant. In RRDP the directory
modtimes aren't signalled either. This should save some IOPS.
I'm finding lots of problems with the error reports:
- Some error messages are getting logged with what appears to be the
wrong severity, variant (normal vs libcrypto) or type (val vs op).
Also, the line between val and op is sometimes blurry.
- Some error messages are extremely ambiguous, which makes them useless.
It's hard to fix them because they tend to be caused by library utils
that either refuse to spit details, or export them through
undocumented, unreliable and/or inconsistent means.
- Another consequence of the generic errors is that it's hard to tell
the ENOMEMs apart, which sucks because we're supposed to handle them
differently.
- Some error messages aren't printing the offending function arguments,
which will make them hard to debug when they happen.
I'm anticipating another redesign of the framework, but I'm also trying
very hard not to do any new major rewrites before the next release.
rsync-ing every RPP separately is prohibitively expensive when RRDP is
unavailable.
As it turns out, rsync-ing the repository root happens to be an
unwritten standard. Fort was far from the only one doing it, and people
expect it.
This means I will eventually have to come up with a different way to box
RRDP RPPs, as the current implementation induces too many redownloads
when Fort needs to fall back to rsync.
I will have to leave that outside of 1.6.0 however, as I've fixed too
much stuff already, and I need a new release urgently.
Sort of reverts #80, though the flag will remain deleted. I don't think
there's a point in offloading this decision to the user.
Remove tmp directory step from --init-tals/--init-as0-tals
The code that handles these flags does not run with a cache context,
so the temporary file step was causing cache download issues despite
being completely unneeded.
Tried to protect access via mutex, but oh boy. That escalated quickly.
Instead, restore tree workspace isolation. Since the 1-thread-per-TAL
architecture has survived, this allows the validation to merrily read
and write the local cache without any locking.
Each thread now builds its own resource table. The main thread joins
them.
This basically eliminates resource sharing between validation threads.
Great from an engineering perspective, maybe not so much from the
performance angle.
Corner case. Suppose the cache has (for whatever reason) downloaded the
two following URLs separately:
rsync://a.b.c/d
rsync://a.b.c/d/e/f
(This might happen if the latter is downloaded in one iteration, then
the former is downloaded the next, and the cleanup timer hasn't kicked
in yet.)
This commit extends the existing priority selection algorithm to this
situation.
(The old way was to choose the most specific URL, which would go on to
lose to a different URL which might have lost to the less specific
version.)
Honestly, this is a micro-correction. It hopefully slightly increases
the chances of the cache fallback being useful in very specific unusual
situations, rather than guarantee it. But it's more consistent, future-
proof and looks more sensible in the tests.
handle_tal_uri() was returning 0 on soft errors and positive on success.
It's supposed to be the other way around.
This resulted in the main loop dropping successful tree traversals.
It also resulted in TA public key mismatches causing traversal
termination, which was a violation of RFC 8630:
> If the connection to the preferred URI fails or the retrieved CA
> certificate public key does not match the TAL public key, the RP
> SHOULD retrieve the CA certificate from the next URI, according to
> the local preference ranking of URIs.
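The intended control flow can be sketched as follows. This is an illustration only: `try_uris`, `uri_handler` and `stub_handler` are hypothetical names, and the convention assumed here is the corrected one (0 on success, nonzero on a soft error):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical callback: 0 on success, nonzero on a soft error such as
 * a failed download or a public key mismatch.
 */
typedef int (*uri_handler)(char const *uri, void *arg);

/*
 * Tries each URI in preference order. Returns the index of the first
 * URI that succeeded, or -1 if every one of them failed.
 */
int try_uris(char const *const *uris, size_t count, uri_handler handle,
    void *arg)
{
	size_t i;

	assert(handle != NULL);

	for (i = 0; i < count; i++) {
		if (handle(uris[i], arg) == 0)
			return (int)i; /* Success: keep this traversal. */
		/* Soft error: RFC 8630 says to move on to the next URI. */
	}

	return -1;
}

/* Stub for demonstration: pretends only rsync URIs can be retrieved. */
int stub_handler(char const *uri, void *arg)
{
	(void)arg;
	return (strncmp(uri, "rsync://", 8) == 0) ? 0 : 1;
}
```

With the return values swapped, the same loop would stop at the first soft error and skip past every success, which matches the two symptoms described above.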
- Improve usage of `xmlChar *`
(Was being cast to/from `char *` rather contract-breakingly.)
- Bunch of renames
- notification_metadata -> rrdp_session
(These are not exclusive to Notifications.)
- delta_head -> notification_delta
(The object specifically refers to Notification delta tags;
I don't know what "head" is supposed to allude to.)
- rdr_snapshot_ctx, rdr_delta_ctx -> rrdp_ctx
(Slightly tweaked the semantics of this, to reduce argument
lists.)
- Remove redundant fnstack pushes and pops
(Rather comically, snapshots and deltas were being stacked twice:
parse_snapshot() + rrdp_parse_snapshot(), parse_delta() +
process_delta().)
I don't really have any strong arguments to justify this. Been
progressively convincing myself to do it over time, and at this point I
think it's inevitable.
It's not that much code, most of this stuff doesn't need to be exported,
and the reduced API simplifies the review.
I originally meant to privatize and de-heap these structures,
but it turns out they were not just used by a single module;
they were only used by `parse_metadata()`.
(Their domain seemed larger than it was, because they were being
initialized elsewhere, for no apparent reason.)
Then, while trying to clean up the global namespace, I noticed that the
hack was actually intertwined with another one: `parse_metadata()` was
being used for two purposes that are never actually used simultaneously.
This cluttered it.
So also separate `parse_metadata()` from `validate_metadata()`.
If all of an RPP's URLs fail, fall back to the most sensible cached
candidate.
It seems this used to be only implemented for TAs, and the heuristics
for choosing a suitable fallback were rudimentary.
Elaborate, centralize and extend implementation to all cache content.
Side maintenance tweaks:
- Remove EREQFAILED, because it largely evolved from meaning "don't try
again" to "try again." So now it was just a redundant EAGAIN.
- Ditch redundant arguments from valid_file_or_dir().
- Merge the three URI arraylist implementations (certificate.c, tal.c
and manifest.c) into one.
- Move RRDP workspace URI from a thread variable to the stack.
(Code smell. It used to be awkward to follow this variable's lifespan
through the tree traversal.)
- Move struct publish and struct withdraw from the heap to the stack.
(Eliminate pointless allocations. These are not the only RRDP
structures I want to move to the stack.)
- Change file_metadata.uri from `char *` to `struct rpki_uri *`.
(This string was forcing the RRDP code to recompute the URI
repeatedly.)
My original intent was "deprecate thread-pool.validation.max," but it
turns out it was just a symptom of a (mostly inoffensive)
overcomplication.
thread-pool.validation.max has proven confusing to users, because it
doesn't make sense for it to be configurable. The thread count should
always equal the number of RPKI trees, which in turn equals the number
of TALs. There's no reason why Fort should offload this decision to the
user.
As for the thread pool, the validation cycle is not really a fitting
problem for such an elaborate solution, because the former involves a
very small number (typically 5) of long-lived threads that start at the
same time, once every hour or so.
So instead of pooling a configured number of threads in the beginning,
spawn raw threads as needed.
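A minimal sketch of the "one raw thread per TAL" shape, using plain pthreads. All names (`tree_task`, `validate_tree`, `validate_all`) are hypothetical, and the "validation" is a placeholder computation:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Per-thread state. Nothing in here is shared between threads. */
struct tree_task {
	pthread_t thread;
	int tal_index; /* Input: which TAL this thread owns */
	int result;    /* Output: written only by the owning thread */
};

/* Placeholder for a whole tree traversal. */
static void *validate_tree(void *arg)
{
	struct tree_task *task = arg;
	task->result = task->tal_index + 1; /* Stand-in for real work */
	return NULL;
}

/*
 * Spawns one thread per TAL, then joins them all. Returns the number of
 * trees that finished, or -1 if the task array cannot be allocated.
 */
int validate_all(int tal_count)
{
	struct tree_task *tasks;
	int spawned, i, done = 0;

	assert(tal_count > 0);

	tasks = calloc((size_t)tal_count, sizeof(*tasks));
	if (tasks == NULL)
		return -1;

	for (spawned = 0; spawned < tal_count; spawned++) {
		tasks[spawned].tal_index = spawned;
		if (pthread_create(&tasks[spawned].thread, NULL,
		    validate_tree, &tasks[spawned]) != 0)
			break; /* Join whatever did start. */
	}

	/* Results are only read after the join, so no locking is needed. */
	for (i = 0; i < spawned; i++) {
		if (pthread_join(tasks[i].thread, NULL) == 0 &&
		    tasks[i].result == i + 1)
			done++;
	}

	free(tasks);
	return done;
}
```

The thread count falls out of the number of TALs, so there is nothing left to configure.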
Tweak conceived while writing the previous commit's message.
- If a read() yields at least one Error Report, drop the connection.
This is because all the server-received error codes currently defined
are supposed to result in immediate connection termination.
If a future RFC defines a nonfatal error code, Error Reports should
probably be downgraded to the 'last PDU' rule below.
- Otherwise, if a read() yields multiple PDUs, drop all except for the
last one.
The last PDU is most likely what the client is expecting, I guess. Serial
Queries and Reset Queries are alternate means to achieve the same goal,
so it doesn't make sense to queue them.
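The two rules above can be sketched as a single pass over the bytes of one read(). This is an assumption-laden illustration: `pick_pdu` and `read_verdict` are hypothetical names, the 8-byte header layout and the Error Report type code (10) come from the RTR RFCs, and treating an impossible length field as grounds to drop the connection is my own choice for the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTR_HDR_LEN 8           /* version, type, 2 more bytes, length */
#define PDU_TYPE_ERROR_REPORT 10

enum read_verdict {
	RV_NOTHING,  /* No complete PDU in the buffer */
	RV_HANDLE,   /* Handle the PDU that starts at *last_offset */
	RV_DROP_CONN /* Error Report or impossible length */
};

static uint32_t read_be32(unsigned char const *b)
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	    ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Walks every complete PDU in `buf` and keeps only the last one. */
enum read_verdict pick_pdu(unsigned char const *buf, size_t len,
    size_t *last_offset)
{
	size_t offset = 0;
	enum read_verdict verdict = RV_NOTHING;

	assert(buf != NULL && last_offset != NULL);

	while (len - offset >= RTR_HDR_LEN) {
		uint32_t pdu_len = read_be32(buf + offset + 4);

		if (pdu_len < RTR_HDR_LEN)
			return RV_DROP_CONN; /* Cannot even cover the header */
		if (pdu_len > len - offset)
			break; /* Trailing fragment; not handled here */
		if (buf[offset + 1] == PDU_TYPE_ERROR_REPORT)
			return RV_DROP_CONN;

		*last_offset = offset; /* Later PDUs replace earlier ones */
		verdict = RV_HANDLE;
		offset += pdu_len;
	}

	return verdict;
}
```

Given a Reset Query followed by a Serial Query in the same read, only the Serial Query would be handled.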
Someone reported a security vulnerability in the server, but the details
are muddy, and clarifications have not arrived yet. I haven't been able
to reproduce it, but the review did yield room for improvement:
1. Buffer request bytes better
The old code seemed to assume each socket read consumed exactly one
(nonempty) TCP packet, and each such packet contained exactly one PDU.
I'm scratching my head at this, but I guess for most intents and
purposes, this assumption is not as lunatic as it seems. Benign RTR PDUs
are very small, and it doesn't make sense for a request packet to
contain multiple of them. Error Reports aside, it doesn't even make
sense for the client to send multiple PDUs in quick succession at all.
Regardless, I'm flushing that assumption down the toilet:
- If read() yields multiple PDUs, queue and handle them in sequence.
Although as I'm writing this I'm realizing that queuing PDUs is a dumb
idea, because Serial Queries and Reset Queries are alternate means to
achieve the same goal. If the client sent a new request, it's most
likely given up on the old one. Plus, queuing PDUs brings additional
complexity and risks. I'm going to have to change this in the next
commit.
- If a read() yields a fragmented PDU, buffer and prepend it to the next
successful read.
This will probably never happen, but it's nice to handle it properly
anyway.
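A rough sketch of the fragment handling, under the assumption of a small fixed-size buffer per client. `pdu_buffer`, `pdu_buffer_feed` and the capacity are hypothetical; only the 8-byte header with a big-endian length in its last four bytes comes from the RTR format:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define HDR_LEN 8
#define BUF_CAP 64 /* Arbitrary; benign requests are far smaller */

struct pdu_buffer {
	unsigned char bytes[BUF_CAP];
	size_t len; /* Bytes currently held, including any fragment */
};

/*
 * Appends one read()'s worth of bytes, counts the complete PDUs that
 * are now available, and keeps the trailing fragment for the next call.
 * Returns the PDU count, or -1 on overflow or an impossible length.
 */
int pdu_buffer_feed(struct pdu_buffer *pb, unsigned char const *data,
    size_t datalen)
{
	size_t offset = 0;
	int complete = 0;

	assert(pb != NULL && data != NULL);

	if (datalen > BUF_CAP - pb->len)
		return -1;
	memcpy(pb->bytes + pb->len, data, datalen);
	pb->len += datalen;

	while (pb->len - offset >= HDR_LEN) {
		unsigned char const *h = pb->bytes + offset;
		size_t pdu_len = ((size_t)h[4] << 24) | ((size_t)h[5] << 16) |
		    ((size_t)h[6] << 8) | (size_t)h[7];

		if (pdu_len < HDR_LEN || pdu_len > BUF_CAP)
			return -1;
		if (pdu_len > pb->len - offset)
			break; /* Fragment: wait for more bytes */

		/* A real server would hand the PDU to its handler here. */
		complete++;
		offset += pdu_len;
	}

	/* Shift the fragment (if any) to the front of the buffer. */
	memmove(pb->bytes, pb->bytes + offset, pb->len - offset);
	pb->len -= offset;
	return complete;
}
```

A Serial Query split after its tenth byte yields 0 on the first call and 1 on the second.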
2. Drop unused PDU parsers
An RTR server only needs to handle PDU types Serial Query, Reset Query
and Error Report. Fort also had dead code meant for the other PDU types.
I'm guessing they were intended for the Error Report internal PDU field,
but it turns out that's also unused.
3. Improve PDU validation
Since Serial Queries and Reset Queries are supposed to have constant
length, Fort was often ignoring the PDU header length field.
Fort now punishes incorrect lengths more aggressively.
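As an illustration of the stricter length check, here is a sketch that compares the header's length field against the fixed sizes the RTR RFCs give for the two query types (12 bytes for Serial Query, 8 for Reset Query). The function name `query_length_ok` is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum { PDU_SERIAL_QUERY = 1, PDU_RESET_QUERY = 2 };

/*
 * `hdr` points to an 8-byte RTR header: version, type, two more bytes,
 * then the total PDU length as a big-endian 32-bit integer.
 * Returns true only if the type is a query and the length field holds
 * the one value that type allows.
 */
bool query_length_ok(unsigned char const *hdr)
{
	uint32_t length;

	assert(hdr != NULL);

	length = ((uint32_t)hdr[4] << 24) | ((uint32_t)hdr[5] << 16) |
	    ((uint32_t)hdr[6] << 8) | (uint32_t)hdr[7];

	switch (hdr[1]) {
	case PDU_SERIAL_QUERY:
		return length == 12; /* Header + 4-byte serial number */
	case PDU_RESET_QUERY:
		return length == 8;  /* Header only */
	default:
		return false;        /* Not a query; validated elsewhere */
	}
}
```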
Was downloading rsync://a.b/c into rsync://a.b/c/c, because of an rsync
complication (from rsync(1)):
> A trailing slash on the source changes this behavior to avoid creating
> an additional directory level at the destination. You can think of a
> trailing / on a source as meaning "copy the contents of this
> directory" as opposed to "copy the directory by name", but in both
> cases the attributes of the containing directory are transferred to
> the containing directory on the destination. In other words, each of
> the following commands copies the files in the same way, including
> their setting of the attributes of /dest/foo:
>
> rsync -av /src/foo /dest
> rsync -av /src/foo/ /dest/foo
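The message above does not spell out the fix. One common way to sidestep the extra directory level is to make sure the source always ends in a slash before handing it to rsync; the helper below is a hypothetical sketch of that approach, not necessarily what the commit does:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Returns a newly allocated copy of `url` that is guaranteed to end in
 * '/', or NULL on allocation failure. The caller frees the result.
 */
char *with_trailing_slash(char const *url)
{
	size_t len;
	char *result;

	assert(url != NULL);

	len = strlen(url);
	result = malloc(len + 2); /* Room for '/' and '\0' */
	if (result == NULL)
		return NULL;

	memcpy(result, url, len);
	if (len == 0 || url[len - 1] != '/')
		result[len++] = '/';
	result[len] = '\0';
	return result;
}
```

With `rsync://a.b/c/` as the source, rsync copies the contents of `c` into the destination instead of creating another `c` inside it.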
I found a couple of symbols I missed during the previous macro review:
- getline(). This forces POSIX >= 2008.
- IN6_ARE_ADDR_EQUAL(). I can't find where this is supposed to be
standardized, so I decided to ditch it.
Also, declaring these macros in every dependent file is scaling poorly,
so I moved them to the CC directives, unifying them in the process.
_POSIX_C_SOURCE is now 200809L, and _XOPEN_SOURCE is 700 (to match).
Also, reduce -std to c99. I don't really have an issue with gnu11, but
it looks like the delta is heavily underutilized in the project.
Might revert it later.
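For reference, a small sketch of what the feature macro buys. `count_lines` is a hypothetical function; the macro is defined inline only so the snippet stands alone, whereas the commit passes it through the compiler flags:

```c
/*
 * Normally supplied by the build (e.g. -D_POSIX_C_SOURCE=200809L);
 * defined here only to keep the sketch self-contained. It has to come
 * before the first system header.
 */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/*
 * Counts the lines of a stream with getline(), which only entered
 * POSIX in the 2008 edition. Under a strict -std=c99, the declaration
 * is hidden unless the feature macro asks for it.
 */
long count_lines(FILE *stream)
{
	char *line = NULL;
	size_t cap = 0;
	long count = 0;

	assert(stream != NULL);

	while (getline(&line, &cap, stream) != -1)
		count++;

	free(line); /* getline() allocates; the caller releases */
	return count;
}
```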
Perform a token attempt to create the cache directory, as well as a more
reasonable one to create its tmp/ subdirectory, whenever the validation
cycle begins.
This option is causing portability issues, and I can't figure out why it
was introduced. None of the GitHub issues mention it, and 6401a4739ac512985158e63499e037dc2f2078db says
> Use SO_REUSEPORT at sockopts so that the RTR port can be reused
Reorganize `#include <>`s in accordance with the IEEE Std 1003.1,
the Linux man pages (which do a pretty good job explaining portability
nuances), the documentation of the dependencies and some common sense.
(Since it seems some of this stuff is undefined.)
The algorithm still induces some unnecessary includes. (e.g. the `NULL`
symbol induces `stddef.h`, `string.h`, `stdlib.h`, `stdio.h`, `unistd.h`
AND `locale.h`, because the standard states it should be defined in all
of them.) I don't think this is a problem for now; I'll optimize it
later.
The asn1c stuff was autogenerated, and the uthash stuff was copy-pasted
from its source project.
None of it serves any purpose. I'm not allergic to the possibility of
supporting these other environments, but this is not the time for it.
Also, it probably doesn't work anyway, since it has never been tested
in the context of Fort.
This doesn't delete all the glue code; only the parts needed
for the upcoming commit.
I'm currently reviewing the system includes. There are some unnecessary
ones, as well as a few nonstandard quirks complicating portability.
This is the first of possibly quite a few commits intended to refactor
this relative mess. It makes all `#include ""`s relative to the root of
the source directory.
Mainly so a local cache tree traversal recovers from a malformed URI.
1. Atomize append operations
Failure used to leave the path builder in an essentially undefined
state, which meant the path had to be thrown away, precluding further
tree traversal.
A failing `append()` no longer modifies the path builder, allowing the
tree traversal to simply skip the ailing branch.
2. Remove lazy failure
The old path_builder was postponing error reporting to path compilation
(`pb_peek()` and `pb_compile()`).
This was an old optimization (meant to simplify path building code),
and was getting in the way of the atomizing.
Errors are now fail-fast, thrown during path construction
(`pb_append*()`).
3. Move path normalization to `uri`
Path normalization (collapsing `.`, `..` and `/`) was getting in the
way of the atomizing, in addition to only really being useful for the
URI-to-cache conversion code.
4. Restore support for absolute paths
This was just a small TODO spawned a few commits ago.
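The atomic, fail-fast append from points 1 and 2 can be sketched like this. It is a toy with a fixed-size buffer; `path_sketch`, `sketch_init` and `sketch_append` are hypothetical names and do not mirror the real `pb_append*()` signatures:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define PATH_CAP 32 /* Tiny on purpose, so the failure path is easy to hit */

struct path_sketch {
	char buf[PATH_CAP];
	size_t len;
};

void sketch_init(struct path_sketch *pb)
{
	pb->buf[0] = '\0';
	pb->len = 0;
}

/*
 * Appends "/component" (or just "component" to an empty path).
 * All validation happens before the first write, so a failure leaves
 * `pb` exactly as it was. Returns 0, or an errno value.
 */
int sketch_append(struct path_sketch *pb, char const *component)
{
	size_t clen, sep;

	assert(pb != NULL && component != NULL);

	clen = strlen(component);
	if (clen == 0)
		return EINVAL; /* Reported now, not deferred to later */
	sep = (pb->len > 0) ? 1 : 0;
	if (pb->len + sep + clen >= PATH_CAP)
		return ENAMETOOLONG;

	/* Point of no return: nothing below can fail. */
	if (sep)
		pb->buf[pb->len++] = '/';
	memcpy(pb->buf + pb->len, component, clen + 1); /* Copies '\0' too */
	pb->len += clen;
	return 0;
}
```

After a failed append the caller can keep using the same builder for the next sibling, which is what lets a traversal skip one bad branch.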
It's a bit smarter now. Addresses a bunch of issues at once, though it
still needs several tweaks and testing:
- #78: Provide a dedicated namespace for each RRDP notification, to
prevent malicious RPPs from overriding files from other RPPs.
- #79: RRDP session and serial are no longer cached in RAM; they're
extracted from cached notification files as they are needed.
This prevents all RRDP from being considered outdated during startup.
- #80: rsync-strategy has been removed.
- #81: The cache now retains RRDP files.
The refactor has been more intrusive than intended. I've been retouching
the core loop and rrdp/https code, which has yielded the following
further disinfections:
- #77: Refactor the HTTP code so 304 is handled as success, despite no
file modifications having been made.
- It seems the old code was refusing to download RPPs via RRDP when said
RPP wasn't also (unrelatedly) served via rsync. This seemed to stem
from the previous developer misreading an RFC.
- I've deprecated `rsync.priority` and `rrdp.priority`, mostly just to
simplify the code. I haven't seen anyone using these config fields,
and I think SIAs and/or randomness should be the ones to decide which
protocol is preferred for a given RPP, not Fort's admin.
- However, I have also decided to deprecate `shuffle_tal_uris`, because
I also suspect it's completely unused, and would like to hear some
complaints otherwise.
- Deprecated `rsync.arguments-flat`, because non-recursive rsyncs are
not needed anymore.
- Since RRDP files are no longer deleted immediately after use, the
`DEBUG_RRDP` compilation has lost its purpose, so I deleted it.
- The code was using `HASH_ADD_STR` on strings contained outside of the
node structure. This is illegal according to uthash's documentation,
and might have induced some crashes in the past.
On closer inspection, none of the error messages logged in #98 imply a
problem, so I have reduced their severities, removed the stack traces
and improved the error messages.
Fixes the error messages half of #98. I still need to look into the
alleged discrepancies with Routinator and Cloudflare.