rt-table.c: Solution for missing igp_metric described in issue #154.
The igp_metric was set as an eattr, but not as a hostentry attribute.
However, the eattr was ignored and possibly even overwritten by
the hostentry (default) value.
The hostentry is initialized in ea_set_hostentry. There is an ea_list,
but it is not used for setting the igp_metric and apparently does not
contain the igp_metric at all. That is why I set it in
rt_next_hop_resolve_rte.
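A minimal sketch of the idea, assuming BIRD's generic attribute
helpers ea_set_attr_u32() and ea_gen_igp_metric; the real call site in
rt_next_hop_resolve_rte may look different:

    /* Sketch only: after resolving the next hop via a hostentry,
     * store the IGP metric as an eattr on the route, so it is not
     * silently replaced by the hostentry default later. */
    static void
    set_igp_metric_sketch(ea_list **attrs, u32 igp_metric)
    {
      ea_set_attr_u32(attrs, &ea_gen_igp_metric, 0, igp_metric);
    }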
rt-table: fixed rt_notify_basic - old route is now cleared in case it
was previously rejected.

The problem manifested in the RA_OPTIMAL channel mode. In this mode,
a route of a net must be either in the accepted map or in the rejected
map, but never in both. When there was a route in the accepted map and
we were adding another route for the same net, we added the new route,
removed the old one, and everything was fine. But when the first route
was in the rejected map, we just added the new route. That caused an
inconsistency: two routes of the same net were present in the maps
(one accepted and one rejected, or both rejected), which triggered
crashes on an assert in channel_notify_basic.
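A sketch of the fixed bookkeeping, assuming bitmap helpers like BIRD's
bmap_test()/bmap_clear(); the map and variable names are illustrative:

    /* When a new route replaces an old one, the old route must leave
     * whichever map it sits in; clearing only the accepted map was
     * the source of the inconsistency. */
    if (old_rte)
    {
      if (bmap_test(&c->export_accepted_map, old_rte->id))
        bmap_clear(&c->export_accepted_map, old_rte->id);
      else
        bmap_clear(&c->export_rejected_map, old_rte->id); /* the missing case */
    }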
Now, when we have the unlocked slab, we can also go for a completely
lockless global attribute cache, which has been the major contention
point in BIRD 3.
This is a complicated data structure, basically a lockless hash table
with use-counted items inside, allowing concurrent insertions and
deletions.
Hash tables have to dynamically resize their base array, and the rehash
routine runs from an event when needed.
Also, due to some funny meta-race conditions, it is now possible to
end up with multiple ea_list_storage instances with the same contents,
yet this seems to happen very rarely. (Prove us wrong, we dare you!)
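For illustration, a minimal sketch of how a lookup may take a
reference on a use-counted item in such a table; an assumption-laden
toy, not the actual code:

    #include <stdatomic.h>
    #include <stdint.h>

    struct item_sketch {
      struct item_sketch * _Atomic next_hash;  /* hash chain link */
      _Atomic uint64_t uc;                     /* use count */
      uint32_t hash_key;
      /* ...attribute data follows... */
    };

    static struct item_sketch *
    lookup_sketch(struct item_sketch * _Atomic *chain, uint32_t h)
    {
      /* assumes the caller holds an RCU read lock, so items cannot
       * be freed under our hands while we walk the chain */
      for (struct item_sketch *e = atomic_load(chain); e;
           e = atomic_load(&e->next_hash))
      {
        if (e->hash_key != h)
          continue;

        /* try to take a reference; a zero use count means a
         * concurrent deleter has won and the item is on its way out */
        uint64_t uc = atomic_load(&e->uc);
        while (uc)
          if (atomic_compare_exchange_weak(&e->uc, &uc, uc + 1))
            return e;   /* reference taken */

        /* lost the race: keep walking; this window is exactly how
         * duplicate instances with equal contents may appear */
      }
      return NULL;   /* not found: the caller inserts a fresh instance */
    }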
Slab: Allowing concurrent allocations without locking
Every active thread now gets its own page from which it allocates
blocks. This enables concurrent allocations without much contention,
alleviating the need for mutex locking over Slab allocations,
effectively making Slab lock-free.
This adds some overhead in case every thread just allocates one item
and finishes, yet such situations should not happen too often or have
too large an impact, so we don't need to care much. If something like
this happens, though, please cry. We'll cry with you, for sure.
Also there is now a cleanup routine which has to run often enough to
ensure that pages with some freed blocks become available for
allocation again.
This rework also changes the API of slabs, requiring an event list to
send cleanup events to, to be passed to sl_new().
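A sketch of the per-thread page idea under assumed, simplified
structures (not the actual slab internals):

    #include <stdint.h>

    struct sl_page_sketch {
      struct sl_page_sketch *next;
      uint16_t used, total;          /* blocks handed out / per page */
      /* block storage follows the header */
    };

    struct slab_sketch {
      uint32_t obj_size;
      /* indexed by the dense thread IDs introduced below */
      struct sl_page_sketch *thread_page[256];
    };

    static void *
    sl_alloc_sketch(struct slab_sketch *s, int thread_id)
    {
      struct sl_page_sketch *p = s->thread_page[thread_id];
      if (p && p->used < p->total)   /* fast path: private page, no lock */
        return (char *)(p + 1) + (p->used++ * s->obj_size);

      /* slow path (not shown): grab a fresh page, possibly one the
       * cleanup routine has made available again */
      return 0;
    }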
Since the beginning, there has been a years-old implementation of a
slab that is not actually a slab, just transparently passing all
requests to malloc and free. We don't need that anymore; we now have
other methods to determine whether the problem is in the allocator or
somewhere else, and as we are going to change the slab API anyway,
together with some behavioral updates, keeping a fake slab around is
just an anachronism.
For certain upcoming data structures, we actually need to use thread IDs
as functional information to index things, not just as a logging token.
Thus, we need them to be dense, not just flying around as they were until now.
To achieve this, we assign the IDs from a global hmap when the threads
are started, and properly return them when the threads are finished.
This way, the IDs of stopping threads are expected to be recycled,
whereas until now this was not expected to happen.
You may need to take care about this in your log reading apparatus.
Also there is now a hard limit on the maximum thread count, because an
unlimited thread count is too crazy to handle. But the limit is still
ridiculously high and nobody is ever expected to hit it anyway.
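The mechanism boils down to a recycled dense ID pool; a self-contained
sketch (the real code uses BIRD's hmap, the names here are
illustrative):

    #include <pthread.h>
    #include <stdint.h>

    #define MAX_THREADS 1024   /* the hard limit mentioned above */

    static pthread_mutex_t id_mutex = PTHREAD_MUTEX_INITIALIZER;
    static uint8_t id_used[MAX_THREADS];

    /* thread start: pick the lowest free ID, keeping IDs dense */
    static int thread_id_acquire(void)
    {
      pthread_mutex_lock(&id_mutex);
      for (int i = 0; i < MAX_THREADS; i++)
        if (!id_used[i])
        {
          id_used[i] = 1;
          pthread_mutex_unlock(&id_mutex);
          return i;
        }
      pthread_mutex_unlock(&id_mutex);
      return -1;   /* hard limit hit; not expected to ever happen */
    }

    /* thread finish: the ID may be recycled immediately */
    static void thread_id_release(int id)
    {
      pthread_mutex_lock(&id_mutex);
      id_used[id] = 0;
      pthread_mutex_unlock(&id_mutex);
    }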
Maria Matejka [Fri, 31 Jan 2025 12:17:11 +0000 (13:17 +0100)]
RCU: Add split-sync calls
Now, instead of synchronize_rcu(), one can call rcu_begin_sync(),
store the returned value, and (probably via defer_call()) use it later
with rcu_end_sync() to check that the grace period has indeed ended.
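A usage sketch; the three function names come from the commit message,
while the token type and the helper names are assumptions:

    /* writer side: unlink the old version, then mark the point
     * where the grace period begins */
    u64 token = rcu_begin_sync();
    queue_for_later(token, old_data);    /* e.g. via defer_call() */

    /* later, from the deferred call */
    if (rcu_end_sync(token))
      free_old_data(old_data);   /* all pre-sync readers are gone */
    else
      requeue(token, old_data);  /* grace period still running */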
Ondrej Zajicek [Thu, 9 Jan 2025 15:44:51 +0000 (16:44 +0100)]
lib: Unify alignment of allocators
Different internal allocators (memory blocks, linpools, and slabs) used
different ways to compute alignment. Unify them to use alignment based
on the standard max_align_t type.
On x86_64, this does not change alignment of memory blocks and linpools
(both old and new is 16), but it increases alignment of slabs from 8 to
16.
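The unified rounding can be pictured as follows (the macro names are
illustrative):

    #include <stddef.h>

    #define UNIFIED_ALIGN  _Alignof(max_align_t)   /* 16 on x86_64 */
    #define ALIGN_SIZE(s)  (((s) + UNIFIED_ALIGN - 1) & ~(UNIFIED_ALIGN - 1))
    /* e.g. ALIGN_SIZE(13) == 16, ALIGN_SIZE(16) == 16, ALIGN_SIZE(17) == 32 */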
Maria Matejka [Wed, 8 Jan 2025 19:22:21 +0000 (20:22 +0100)]
Table: more best route refeed fixes
Best route refeed is tricky. The journal may repeatedly include the
same route in the old and/or the new position in case of flaps. We
don't like checking that fully in the RCU critical section, which is
already way too long, thus we filter out repeated occurrences of the
current best route while keeping possibly more old routes.
We also don't want to send spurious withdraws, and we need to check that
only one notification per net is sent for RA_OPTIMAL.
A rejected map update was also missing in the case of an idempotent
squashed update, and last but not least, the best route journal should
not include invalid routes (import keep filtered).
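A sketch of the filtering idea, with hypothetical accessor names, just
to fix the intuition:

    /* While replaying journal entries for one net, report the
     * current best route at most once even if it flapped, but keep
     * distinct old routes so recipients can withdraw them. */
    const rte *best = net_best_route(n);   /* hypothetical accessor */
    _Bool best_seen = 0;

    for (const rte *r = first_entry(n); r; r = next_entry(r))
    {
      if (r->src == best->src)
      {
        if (best_seen)
          continue;        /* repeated occurrence filtered out */
        best_seen = 1;
      }
      replay_to_recipient(r);   /* hypothetical */
    }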
Maria Matejka [Tue, 24 Dec 2024 15:16:55 +0000 (16:16 +0100)]
Allocate the normalization buckets on stack
Even though allocating from tmp_linpool is quite cheap, it isn't cheap
when the block is larger than a page, which is the case here. Instead,
we now allocate just the result, which typically fits in a page,
avoiding the need for a malloc().
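The pattern, sketched with illustrative identifiers:

    /* The working buckets live on the stack (a VLA is fine even above
     * page size), and only the compact result is allocated. */
    ea_list *
    normalize_sketch(ea_list *src, unsigned count)
    {
      eattr *buckets[count];   /* formerly a tmp_linpool block > page */

      /* ...sort and deduplicate attributes into buckets... */

      /* the result typically fits in a page, so tmp_linpool is cheap */
      return build_result_sketch(tmp_linpool, buckets, count);
    }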
Maria Matejka [Tue, 24 Dec 2024 11:18:39 +0000 (12:18 +0100)]
Stonehenge: multi-slab allocator
To mid-term allocate and free lots of small blocks at a fast pace,
mb_alloc is too slow and causes heap bloating. We can already allocate
blocks from slabs, and if we allow for a little bit of inefficiency,
we can just use multiple slabs with stepped sizes.
This technique is already used in ea_list allocation, which is going
to be converted to Stonehenge.
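A sketch of the stepped-size dispatch (names are illustrative, not the
actual Stonehenge API):

    #define STEP       32
    #define NUM_SLABS  6    /* slabs for 32, 64, ..., 192 bytes */

    struct stonehenge_sketch {
      slab *s[NUM_SLABS];   /* created by sl_new() with stepped sizes */
    };

    static void *
    sth_alloc_sketch(struct stonehenge_sketch *sth, unsigned size)
    {
      unsigned i = (size + STEP - 1) / STEP;   /* pick the size step */
      if (i && i <= NUM_SLABS)
        return sl_alloc(sth->s[i - 1]);   /* bounded internal waste */
      return NULL;   /* oversized blocks would fall back to mb_alloc() */
    }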
Maria Matejka [Mon, 23 Dec 2024 10:58:05 +0000 (11:58 +0100)]
Kernel: feed only once during startup
There was an inefficiency in the initial scan state machine,
causing routes to be fed several times instead of just once.
Now the export startup is postponed until the first krt_scan()
finishes and we can actually do the pruning with full information.
Ondrej Zajicek [Tue, 17 Dec 2024 08:00:42 +0000 (09:00 +0100)]
Nest: Fix handling of 64-bit rte_src.private_id
The commit 21213be523baa7f2cbf0feaa617f265c55e9b17a expanded private_id
in route source to u64, but forgot to modify function arguments, so it
was still cropped to 32 bits, which may cause collisions for L3VPN.
This patch fixes that.
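The bug class, illustrated with hypothetical before/after prototypes
(not the actual function names):

    /* before: the stored field is u64, but the argument is u32 */
    struct rte_src *find_source(struct rte_owner *o, u32 id);

    u64 private_id = 0x100000001ULL;     /* 64-bit L3VPN discriminator */
    find_source(owner, private_id);      /* silently cropped to 0x1 */

    /* after: widen the argument to match the field */
    struct rte_src *find_source(struct rte_owner *o, u64 id);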
Ondrej Zajicek [Thu, 12 Dec 2024 03:04:07 +0000 (04:04 +0100)]
Netlink: Handle onlink flag on BSD-Netlink
On BSD, the onlink flag is not tracked or reported by the kernel. We
are using a heuristic that assigns the onlink flag to routes scanned
from the kernel. We should use the same heuristic in the BSD-Netlink
case too, as the onlink flag is not reported there either.
Fabian Bläse [Tue, 10 Dec 2024 01:14:06 +0000 (02:14 +0100)]
Babel: fix seqno wrapping on seqno request
The Babel seqno wraps around when reaching its maximum value (UINT16_MAX).
When comparing seqnos, this has to be taken into account. Therefore,
plain number comparisons do not work.
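The standard trick is to compare in modular arithmetic; a sketch of
the comparison class used for such fixes, not necessarily the exact
helper in the Babel code:

    #include <stdint.h>

    /* a < b in circular 16-bit space: interpret the difference as
     * signed, so values that wrapped past UINT16_MAX compare correctly */
    static inline int seqno_less(uint16_t a, uint16_t b)
    {
      return (int16_t)(a - b) < 0;
    }
    /* e.g. seqno_less(65535, 2) is true: 2 is ahead of 65535 after wrap */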
Maria Matejka [Sat, 21 Dec 2024 18:02:22 +0000 (19:02 +0100)]
BFD: Fix session reconfiguration locking order
The sessions have to be updated asynchronously to avoid
cross-locking between protocols.
Testsuite: cf-ibgp-bfd-switch, cf-ibgp-multi-bfd-auth
Fixes: #139
Thanks to Daniel Suchy <danny@danysek.cz> for reporting:
https://trubka.network.cz/pipermail/bird-users/2024-December/017984.html
Maria Matejka [Fri, 20 Dec 2024 10:28:00 +0000 (11:28 +0100)]
BGP: fix locking order error on dynamic protocol spawn
We missed that the protocol spawner violates the prescribed
locking order. When the rtable level is locked, no new protocol can be
started, thus we need to:
* create the protocol from a clean mainloop context
* in protocol start hook, take the socket
Testsuite: cf-bgp-autopeer
Fixes: #136
Thanks to Job Snijders <job@fastly.com> for reporting:
https://trubka.network.cz/pipermail/bird-users/2024-December/017980.html
Maria Matejka [Thu, 19 Dec 2024 11:28:27 +0000 (12:28 +0100)]
Kernel: when channel traces, we have to trace the final result
Otherwise it looks like we are sending too much traffic to netlink
every once in a while, which is not true. Now we can disambiguate
between in-kernel updates and ignored routes.
Maria Matejka [Thu, 19 Dec 2024 10:54:05 +0000 (11:54 +0100)]
Table: not feeding twice, once is enough
If there is no feed pending, the requested one should be activated
immediately; otherwise it gets activated only after the full run,
effectively running a full feed first and then the requested one.
Maria Matejka [Sun, 15 Dec 2024 20:04:22 +0000 (21:04 +0100)]
Table prune inhibited during reconfiguration
When many changes are done during reconfiguration, the table may
start pruning old routes before everything is settled down, slowing
down not only the reconfiguration, but also the shutdown process.
Maria Matejka [Sat, 14 Dec 2024 22:21:07 +0000 (23:21 +0100)]
Disable multiple malloc arenas
In our use case, these are impossibly greedy, because we often free
memory in a different thread than where we allocate it, forcing the
default allocator to scatter the used memory all over the place.
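With glibc this boils down to a single mallopt() call early at startup
(a sketch of the approach; the same effect is also available via the
MALLOC_ARENA_MAX environment variable):

    #include <malloc.h>

    int main(void)
    {
      mallopt(M_ARENA_MAX, 1);   /* keep all allocations in one arena */
      /* ... */
      return 0;
    }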
There was a suspicion that maybe the BIRD 3 version of ROA gets the
digesting wrong. This test covers the nastiest corner cases we could
think of, so now we can expect it to be right.
We have quite large critical sections and we need to allocate inside
them. This is something to revise properly later on, yet for now,
instead of slowly but surely growing the virtual memory address space,
it's better to optimize the cold page cache pickup and count situations
where this happened inside the critical section.
Lockfree journal: Cleanup hook runs only when needed.
The function lfjour_cleanup_hook() was scheduled each time any of the
journal recipients reached the end of a block of journal items or read
all of the journal items. Because lfjour_cleanup_hook() can clean only
those journal items that every recipient has already processed, it was
often called uselessly.
This commit restricts most of the useless scheduling. Only some
recipients are given a token allowing them to try to schedule the
cleanup hook. When a recipient wants to schedule the cleanup hook, it
checks whether it has a token. If yes, it decrements the number of
tokens the journal has issued (issued_tokens) and discards its own
token. If issued_tokens reaches zero, the recipient is allowed to
schedule the cleanup hook.
There is a maximum number of tokens a journal can give to its
recipients (max_tokens). A new recipient is given a token in its init,
unless the maximum number of tokens is already reached. The remaining
tokens are handed out to recipients in lfjour_cleanup_hook().
In the cleanup hook, the issued_tokens number is increased in order
to avoid calling the hook again before it finishes. Then, tokens are
given to the slowest recipients (but never to more than max_tokens
recipients). Before leaving lfjour_cleanup_hook(), the issued_tokens
number is decreased back.
If no other tokens are handed out, we have to make sure that
lfjour_cleanup_hook will be called again. If every item in the journal
was read by every recipient, tokens are given to random recipients. If
all recipients holding tokens have already finished, we give a token
to the first unfinished recipient we find, or we just call the hook
again.
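The token accounting can be sketched like this (the field names follow
the text above; the code is illustrative, not the actual source):

    #include <stdatomic.h>
    #include <stdint.h>

    struct lfjour_sketch {
      _Atomic uint64_t issued_tokens;   /* tokens currently handed out */
      /* ...journal state, cleanup event... */
    };

    struct recipient_sketch {
      _Atomic int has_token;
      /* ...read position... */
    };

    /* a recipient finished a block of journal items */
    static void
    recipient_done_sketch(struct lfjour_sketch *j, struct recipient_sketch *r)
    {
      /* without a token, somebody else will schedule the cleanup */
      if (!atomic_exchange(&r->has_token, 0))
        return;

      /* discard our token; returning the last one means issued_tokens
       * reached zero, so we may schedule the cleanup hook */
      if (atomic_fetch_sub(&j->issued_tokens, 1) == 1)
        schedule_cleanup_sketch(j);   /* hypothetical: fires the hook */
    }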
Ondrej Zajicek [Tue, 3 Dec 2024 00:19:44 +0000 (01:19 +0100)]
RPKI: Fix several errors in handling of Error PDU
Fix several errors including:
- Unaligned memory access to 'Length of Error Text' field
- No validation of 'Length of Encapsulated PDU' field
- No validation of 'Error Code' field
- No validation of characters in diagnostic message
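The unaligned-access fix typically follows this pattern (a sketch, not
necessarily the exact RPKI code):

    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>

    /* read a 32-bit PDU field (e.g. 'Length of Error Text') from an
     * arbitrarily aligned buffer position */
    static inline uint32_t
    pdu_get_u32(const void *p)
    {
      uint32_t v;
      memcpy(&v, p, sizeof(v));   /* no alignment requirement */
      return ntohl(v);            /* PDU fields are network byte order */
    }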