Now that we have the unlocked slab, we can also go for a completely
lockless global attribute cache, which has been the major contention
point in BIRD 3.
This is a complicated data structure, basically a lockless hash table
with use-counted items inside, allowing concurrent insertions and
deletions.
The hash table has to dynamically resize its base array, and the rehash
routine runs from an event when needed.
Also, due to some funny meta-race conditions, it is now possible to
end up with multiple ea_list_storage instances with the same contents,
yet this seems to happen very rarely. (Prove us wrong, we dare you!)
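To illustrate the use-counting part, here is a minimal sketch with
hypothetical names and simplified types (not BIRD's actual code): a
lockless lookup may only take a reference while the count is still
nonzero; a count of zero means a concurrent deleter already owns the
item and the lookup must skip it.

    #include <stdatomic.h>

    struct ea_storage_sketch {             /* hypothetical, simplified */
      struct ea_storage_sketch *next_hash; /* hash chain link */
      _Atomic unsigned uc;                 /* use count; 0 = being freed */
    };

    /* Try to acquire a reference during a lockless lookup; returns 0
     * if the item is already dying. */
    static int ea_ref_sketch(struct ea_storage_sketch *s)
    {
      unsigned uc = atomic_load_explicit(&s->uc, memory_order_acquire);
      while (uc)
        if (atomic_compare_exchange_weak_explicit(&s->uc, &uc, uc + 1,
              memory_order_acq_rel, memory_order_acquire))
          return 1;
      return 0;
    }

The duplicate instances mentioned above can arise exactly around this
point: two threads may both fail to find an item and then both insert
their own copy.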
Slab: Allowing concurrent allocations without locking
Every active thread now gets its own page from which it allocates
blocks. This enables concurrent allocations without much contention,
removing the need for mutex locking over Slab allocations and
effectively making Slab lock-free.
This adds some overhead when every thread allocates just one item and
finishes, yet such situations should not happen too often or have
too large an impact, so we don't need to care much. If something like
this happens, though, please cry. We'll cry with you, for sure.
There is now also a cleanup routine which has to run often enough to
ensure that pages with freed blocks become available for allocation
again. This rework also changes the slab API: sl_new() now takes an
event list to which the cleanup events are sent.
Since the beginning, there has been a years-old implementation of a
slab that is not actually a slab, just transparently passing all
requests to malloc and free. We don't need that anymore: we now have
other methods to determine whether the problem is in the allocator or
elsewhere, and as we are changing the slab API anyway, together with
some behavioral updates, keeping a fake slab around is just an
anachronism.
For certain upcoming data structures, we actually need to use thread IDs
as functional information to index things, not just as a logging token.
Thus, we need them to be dense, not just flying around as they were until now.
To achieve this, we assign the IDs from a global hmap when the threads
are started, and properly return them when the threads finish.
This means the IDs of stopped threads are now expected to be recycled,
whereas until now that was not expected to happen.
You may need to account for this in your log reading apparatus.
There is now also a hard limit on the thread count, because an unlimited
thread count is too crazy to handle. The limit is still ridiculously
high, though, and nobody is ever expected to hit it anyway.
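A sketch of the acquire/release pair, assuming BIRD's hierarchical
bitmap (hmap) where hmap_first_zero() yields the lowest unused index;
the wrapper names and the locking shown here are illustrative:

    #include <pthread.h>

    static pthread_mutex_t id_lock = PTHREAD_MUTEX_INITIALIZER;

    unsigned thread_id_acquire(struct hmap *ids)
    {
      pthread_mutex_lock(&id_lock);
      unsigned id = hmap_first_zero(ids);  /* lowest free ID keeps the space dense */
      hmap_set(ids, id);
      pthread_mutex_unlock(&id_lock);
      return id;
    }

    void thread_id_release(struct hmap *ids, unsigned id)
    {
      pthread_mutex_lock(&id_lock);
      hmap_clear(ids, id);                 /* this ID may now be recycled */
      pthread_mutex_unlock(&id_lock);
    }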
Maria Matejka [Fri, 31 Jan 2025 12:17:11 +0000 (13:17 +0100)]
RCU: Add split-sync calls
Now, instead of synchronize_rcu(), one can call rcu_begin_sync(),
store the returned value, and (probably via defer_call()) use it later
with rcu_end_sync() to check that the grace period has indeed ended.
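In usage terms, the split looks roughly like this (a sketch; the exact
type of the grace-period token is not spelled out in the commit):

    /* Blocking form: wait for a full grace period right here. */
    synchronize_rcu();

    /* Split form: note the current grace period... */
    u64 gp = rcu_begin_sync();    /* token type assumed for illustration */

    /* ...do other useful work, or park the check via defer_call()... */

    if (rcu_end_sync(gp))
      ;  /* the grace period has ended; stale readers are gone */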
Maria Matejka [Thu, 13 Feb 2025 17:25:44 +0000 (18:25 +0100)]
Taming static checker: flow[64]_validate_cf() checks NULL data
This does not apply to the current code, but if somebody chose to use
the flowspec validation functions on something totally broken, they
might crash unnecessarily.
Maria Matejka [Mon, 3 Feb 2025 14:21:52 +0000 (15:21 +0100)]
BFD session handling rework
The original implementation for BIRD 3 was rooted in my first attempts
at multithreading and it had several flaws, most notably
incomprehensible notification and request pickup routines.
The rework also converts to a double-loop architecture where one of the
loops (low-latency) solely runs the BFD socket communication, whereas
the other one does all the other shenanigans.
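The intended shape of the split, as a rough illustration (the hook and
helper names here are hypothetical; ev_send() is BIRD's real event
API): the low-latency loop does nothing but move BFD packets, punting
everything else to the other loop via events.

    /* Runs in the low-latency loop: keep this path as short as
     * possible so BFD timing stays tight even under load. */
    static int bfd_rx_hook_sketch(sock *sk, unsigned len)
    {
      bfd_session_update_sketch(sk, len);           /* parse + refresh timers */
      ev_send(bfd_notify_list, &bfd_notify_event);  /* slow work runs elsewhere */
      return 1;
    }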
Maria Matejka [Mon, 10 Mar 2025 08:15:53 +0000 (09:15 +0100)]
Table export: Relaxing too strict inconsistency assert
In case of refeeds, we may get old routes which we have not seen.
The table does not know that, and the channel ingress is the right
place to detect it.
Ondrej Zajicek [Thu, 6 Mar 2025 02:43:15 +0000 (03:43 +0100)]
Nest: Fix locking of tables by channels
Channels that are down keep pointers to routing tables but do not keep
them locked. That is safe because the existence of tables depends on
their being configured. But when a table was removed during
reconfiguration, the channel kept a dangling pointer from the moment
it went down until it was removed.
This could be triggered by 'show protocols all' and similar commands.
Change the locking so that a channel keeps its table locked for its
whole existence. The same change is already in BIRD 3.
Note that this is somewhat conceptually problematic, as downed channels
otherwise do not keep resources. Also, other objects in specialized
channels (igp_table, base_table in bgp_channel, mpls_domain /
mpls_range in mpls_channel) are still locked only when the channel is
active, but for them it is easier to keep track that they are not
accessed when they are deconfigured.
Maria Matejka [Mon, 3 Mar 2025 18:48:58 +0000 (19:48 +0100)]
Table export: Another inconsistency in refeeds
When a route has already been sent to the channel and a refeed runs,
because of a filter change or just because it was requested, the
old and new routes are the same, which was not anticipated
by rt_notify_basic().
Commit 69d1ffde4c72 ("Split route data structure to storage (ro) /
manipulation (rw) structures.") changed rte->net from a pointer to a
'struct network' to a 'struct net_addr', but kept the address-of (&)
operator before casting to 'net_addr_ip6_sadr *' when sending a
source-specific route to the kernel.
Because 'struct network' had an embedded struct member (struct
fib_node), the address-of was needed to get back to a pointer to the
data, but with the change in the commit mentioned above, e->net is now a
straight pointer to the address.
The bug meant that the source prefixes passed to the kernel were
essentially garbage, leading to routes in the kernel like:
default from b74:9e05:0:1:d8cf:c000::/86 via fe80::1 dev eth0 proto bird metric 32 pref medium
Fix this by getting rid of the address-of operator.
Note by committer: used our TYPE_CAST macro instead of a plain typecast
to avoid this kind of problem in the future.
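The bug and its fix in a nutshell (simplified), plus a checked-cast
macro in the spirit of the committer's note; this is an illustration,
not necessarily BIRD's exact TYPE_CAST definition:

    /* before 69d1ffde4c72: e->net pointed to a struct network, so '&'
     * was part of digging the embedded address out of it;
     * after: e->net is already a struct net_addr *, so the stray '&'
     * yields a pointer to the pointer itself -> garbage source prefix */
    dst = (net_addr_ip6_sadr *) &e->net;   /* broken */
    dst = (net_addr_ip6_sadr *) e->net;    /* fixed */

    /* A checked cast refuses to compile (with warnings as errors)
     * unless 'what' really is a 'from *': */
    #define TYPE_CAST(from, to, what) \
      ( (void) ((from *) NULL == (what)), (to) (what) )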
Fixes: 69d1ffde4c72 ("Split route data structure to storage (ro) / manipulation (rw) structures.")
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Maria Matejka <mq@jmq.cz>
Table export: fixed inconsistency in export_rejected_map
When updates arrived in such an order that the first one was rejected and
the second one got accepted, the export_rejected_map flag mistakenly
stayed set, leaking the route ID.
In the RA_OPTIMAL channel mode, there are consistency checks ensuring
that at most one route per net has been accepted or rejected. After
some time, the leaked ID and the stale bit in export_rejected_map
caused spurious assertion failures later in channel_notify_basic().
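The missing step, roughly (a sketch using BIRD-style bitmap helpers;
the accepted-map field name and surrounding variables are assumed for
illustration): accepting an update must clear any stale rejected bit
for that route ID.

    if (accepted)
    {
      bmap_clear(&c->export_rejected_map, rte_id);  /* the step that was missing */
      bmap_set(&c->export_accepted_map, rte_id);    /* field name assumed */
    }
    else
      bmap_set(&c->export_rejected_map, rte_id);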
Thanks to NIX-CZ and Maiyun Zhang for reporting this.
Maria Matejka [Wed, 12 Feb 2025 20:29:10 +0000 (21:29 +0100)]
BGP export table src fix
When exchanging routes in the BGP export table, we forgot to update
the src when add-path was off. This led to falsely claiming another
origin of the route in the export table dump, and also to holding
protocols in the flush state because their srcs were still kept in
the export tables.
Maria Matejka [Mon, 10 Feb 2025 11:29:51 +0000 (12:29 +0100)]
Fix channel restart sequence
If a channel goes start -> pause -> start, the original code crashed,
but this is a valid sequence for a protocol half-restart, going from UP
to START and then back to UP.
Ondrej Zajicek [Thu, 9 Jan 2025 15:44:51 +0000 (16:44 +0100)]
lib: Unify alignment of allocators
Different internal allocators (memory blocks, linpools, and slabs) used
different ways to compute alignment. Unify them to use alignment based
on the standard max_align_t type.
On x86_64, this does not change the alignment of memory blocks and
linpools (both old and new are 16), but it increases the alignment of
slabs from 8 to 16.
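The unified rule boils down to rounding sizes up to the alignment of
max_align_t (a sketch; the macro names here are illustrative, not
necessarily BIRD's):

    #include <stddef.h>

    #define ALLOC_ALIGN   _Alignof(max_align_t)  /* 16 on common x86_64 ABIs */
    #define ALIGN_SIZE(s) (((s) + ALLOC_ALIGN - 1) & ~(size_t)(ALLOC_ALIGN - 1))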
Maria Matejka [Mon, 13 Jan 2025 21:15:52 +0000 (22:15 +0100)]
BGP: fix shutdown crash when dynamic peer is just connected
In some edge cases, a dynamic BGP instance starts but has not yet
picked up the socket from the peer when it gets shut down, typically
on a complete shutdown. Fix this by just closing the socket instead of
asserting that it has already been picked up.
Maria Matejka [Wed, 8 Jan 2025 19:22:21 +0000 (20:22 +0100)]
Table: more best route refeed fixes
Best route refeed is tricky. The journal may repeatedly include the
same route in the old and/or the new position in case of flaps. We
don't like checking that fully in the RCU critical section, which is
already way too long, thus we filter out the repeated occurrence of
the current best route while keeping possibly more old routes.
We also don't want to send spurious withdraws, and we need to check that
only one notification per net is sent for RA_OPTIMAL.
A rejected map update was also missing in the case of an idempotent
squashed update, and last but not least, the best route journal should
not include invalid routes (import keep filtered).
Maria Matejka [Tue, 24 Dec 2024 15:16:55 +0000 (16:16 +0100)]
Allocate the normalization buckets on stack
Even though allocating from tmp_linpool is quite cheap,
it isn't cheap when the block is larger than a page, which is the case here.
Instead, we now allocate just the result, which typically fits in a page,
avoiding the need for a malloc().
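The shape of the change, illustratively (the bucket type and the
worst-case bound are hypothetical; tmp_alloc() is BIRD's temporary
linpool allocation):

    /* before: per-call buffer from tmp_linpool; for large inputs this
     * exceeds a page and falls through to malloc() */
    bucket *bs = tmp_alloc(count * sizeof(bucket));

    /* after: worst-case buckets live on the stack, and only the final,
     * typically page-sized result is allocated */
    bucket bs[MAX_BUCKETS_SKETCH];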
Maria Matejka [Tue, 24 Dec 2024 11:18:39 +0000 (12:18 +0100)]
Stonehenge: multi-slab allocator
To allocate and free lots of small mid-term blocks at a fast pace,
mb_alloc is too slow and causes heap bloat. We can already allocate
blocks from slabs, and if we allow for a little bit of inefficiency,
we can just use multiple slabs with stepped sizes.
This technique is already used in ea_list allocation, which is going
to be converted to Stonehenge.
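A sketch of the stepped-size idea (the step values and structure are
illustrative, not the actual Stonehenge code; sl_alloc() and mb_alloc()
are BIRD's real allocation calls): pick the first slab whose item size
fits, wasting at most one step of slack per block.

    #define STH_STEPS 6
    static const unsigned sth_step[STH_STEPS] = { 32, 64, 96, 128, 192, 256 };

    struct stonehenge_sketch {
      struct slab *s[STH_STEPS];   /* one slab per size step */
      pool *p;                     /* fallback pool for oversized blocks */
    };

    void *sth_alloc_sketch(struct stonehenge_sketch *sth, unsigned size)
    {
      for (unsigned i = 0; i < STH_STEPS; i++)
        if (size <= sth_step[i])
          return sl_alloc(sth->s[i]);   /* fixed-size slab allocation */
      return mb_alloc(sth->p, size);    /* rare: bigger than the largest step */
    }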
Maria Matejka [Mon, 23 Dec 2024 10:58:05 +0000 (11:58 +0100)]
Kernel: feed only once during startup
There was an inefficiency in the initial scan state machine,
causing routes to be fed several times instead of just once.
Now the export startup is postponed until the first krt_scan()
finishes and we can actually do the pruning with full information.