Maria Matejka [Thu, 17 Jul 2025 22:19:14 +0000 (00:19 +0200)]
CI: Autotests for BGP setting changes
There are actually 144 test variants. Choosing 12 of them, such that:
- m2 may request no RR, basic RR or enhanced RR
- m2 may have any combination of import and export table
- import and export table settings for m1 are pseudorandomized
- the same for multiple variants how to get basic RR negotiated
This should cover all the code with not too much resource consumption.
Maria Matejka [Wed, 25 Jun 2025 11:00:11 +0000 (13:00 +0200)]
BGP: restart on outgoing next hop setting change
When next hop self / keep / address changed, BGP only reloaded
the exports but it didn't apply the changes. To fix this problem
before actually implementing a proper change detection algorithm,
we restart the protocol if this setting changes.
Maria Matejka [Sun, 29 Jun 2025 18:14:31 +0000 (20:14 +0200)]
CI: adding tests cf-bgp-unnumbered and cf-bgp-error-states
The unnumbered test checks the onlink neighbor scenarios,
and the cf-bgp-error-states checks a regression for BIRD 3
where BGP crashed when listening socket failed to bind.
Maria Matejka [Sun, 29 Jun 2025 16:56:27 +0000 (18:56 +0200)]
BGP: Fix crash on listening socket failure
In that specific case, bgp_stop() runs from the main loop, effectively
causing the deferred state announcement to run from a bad context.
Fixed by forcibly announcing the state immediately.
Maria Matejka [Thu, 19 Jun 2025 16:57:15 +0000 (18:57 +0200)]
BGP: Allow onlink neighbors
In certain scenarios, the direct neighbor is not inside any prefix
assigned to the appropriate interface. There is a route for that address
pointing to that interface, though.
In such cases, the user may specify the neighbor as onlink, effectively
disabling the prefix check and trying to connect immediately. It is
expected that the operator ensures that the neighbor is indeed there.
Maria Matejka [Thu, 19 Jun 2025 19:12:52 +0000 (21:12 +0200)]
CI: Limit log bloating for netlab runners
There is no useful configurable file size limit for netlab, allowing
the job to eat up all the disk. Thus we limit it directly in the script
by setting ulimit -f to 1G.
Maria Matejka [Thu, 19 Jun 2025 16:40:23 +0000 (18:40 +0200)]
CI: split make gitlab to local and venv variant
If you prefer to install python3-jinja2 and pyaml yourself,
run gitlab-local. If you prefer the thing to create a venv
and pip3 all the dependencies, run gitlab-venv.
Maria Matejka [Tue, 17 Jun 2025 14:09:45 +0000 (16:09 +0200)]
RIP: fix split horizon instability
When split horizon is active (by default), RIP withdraws routes from
these neighbors which have sent the best route to us. This helps with
reducing routing loops. When ECMP is on (by default), RIP behaves this
way only to the first neighbor in the list.
In BIRD 2, the neighbors were first sorted and then the first one was
chosen. This was lost in BIRD 3 and the first neighbor in the unsorted
list was considered instead.
The overall result in routing does not change so this is not technically
a bug and should not result in misrouting. The final result is unstable
and time-sensitive though, making debugging harder and automatic tests
fail.
We are rectifying this to be stable again so that our tests may get
green again.
Fixes: #284 Reproduced-By: David Petera while checking RIP in VRFs Identified-By: Ondrej Zajicek
Maria Matejka [Fri, 13 Jun 2025 18:37:16 +0000 (20:37 +0200)]
Pipe: Fix inconsistency caused by rare pipe collision
When two different pipes modify the same route in two different ways
and both results are attempted to be imported into the same table,
the table complains and should ignore the whole update. Yet it did so
only if the routes were identical (which was also tested previously).
Instead of ignoring, this routine replaced the original route with the
new one from another sender which caused a discrepancy in counters in
the original sender and later crash on failed consistency check.
Discovered randomly when trying to reproduce another bug and
accidentally doing exactly this sequence of pipe collision and
then reconfiguration to rectify the collision by removing the pipe
altogether.
Maria Matejka [Thu, 12 Jun 2025 09:52:32 +0000 (11:52 +0200)]
CI: Fix netlab failure artifact collection
When implementing artifact collection for netlab,
I forgot that the coredumps and logs are generated outside the
repository. Moving the artifacts to the right place for collection.
Igor Putovny [Mon, 2 Jun 2025 11:44:42 +0000 (13:44 +0200)]
BGP: Disallow AS Sets by default
For a long time, AS Sets have been considered obsolete
but they were still valid by the original RFC. Recently,
RFC 9774 flipped this and AS Sets are now formally deprecated.
Therefore, all BGP sessions will now by default reject routes
containing AS Sets in their AS Paths. If you want to keep the
old behavior, you may simply state
allow as set;
in every BGP protocol configuration where you want to accept AS Sets.
Maria Matejka [Wed, 4 Jun 2025 12:53:36 +0000 (14:53 +0200)]
BGP: Restart if route refresh is impossible on attribute change
In previous commit, we force route refresh when some protocol attributes
change. Yet, when the neighbor doesn't support route refresh, we have to
restart the session, not send an unsupported request.
Note: if the neighbor is restarting right now with graceful restart
enabled, we keep the stale routes until the neighbor converges again.
Ondrej Zajicek [Tue, 3 Jun 2025 14:56:41 +0000 (16:56 +0200)]
BGP: Do route refresh after preference change
Reconfiguration of preference is handled by nest code by asking for
reload, but in case of BGP with import table, that just reloaded routes
with the old preference. In BGP, we can handle that by triggering full
route refresh.
Although, it would be probably better to set preference in nest, when
a route is propagated from the import table.
Ondrej Zajicek [Thu, 29 May 2025 15:34:35 +0000 (17:34 +0200)]
Nest: Fix route update after preference change
The route preference was ignored in route comparison, therefore if
a protocol changed it and then reloaded routes, they were ignored
and routes with the old prefernce were kept.
The bug was introduced 5 years ago, when preference was moved from
struct rte to struct rta.
Maria Matejka [Tue, 27 May 2025 08:58:23 +0000 (10:58 +0200)]
Filters: Re-enabling onlink nexthop attribute
There are still some semantic problems wrt. VRF settings,
effectively allowing botched nexthops to slip through filters,
but it makes more sense to re-enable onlink setting, than
to wait until we find out how to do that all properly.
Maria Matejka [Sat, 24 May 2025 12:16:44 +0000 (14:16 +0200)]
Table: Do not attempt to prune an empty table
Table pruning is requested from multiple places, including
a timer. Sometimes this may happen in a race condition with
table shutdown, and therefore we explicitly refuse to schedule
pruning if the table is empty.
Also added a full-blown check that a shutting-down table
is indeed empty, and several more asserts to catch imminent crashes
before they happen in hard-to-debug places.
Maria Matejka [Fri, 23 May 2025 18:40:22 +0000 (20:40 +0200)]
Shutdown: Do not export routes
When shutting down, the tables start flushing routes before all protocols
are even officially aware of the shutdown. This fix allows for a fast check
whether the shutdown is already running or not, and if so, all exports are
ignored instead of processing them.
One notable exception is the kernel protocol which needs to process all
the exports normally to actually withdraw the routes.
This is a hotfix for #251 and #252. Proper fix will require protocol
state machine refactoring.
Maria Matejka [Fri, 23 May 2025 17:17:53 +0000 (19:17 +0200)]
Table: fix a race condition in export
The race condition happens as follows:
- channel A starts feeding
- channel B imports a route ahead of the feeding pointer
- channel A exports this route and continues feeding from the pointer
- no other import hits this specific prefix
- there is at least one channel C which has not cleared this export
- channel A computes ecnt=0 for this prefix because all exports
have been already cleared
- the condition e >= ecnt mistakenly triggers retry
If the birdloops involved get assigned to the same thread, this race
condition then can't recover and the thread is stuck in an infinite
loop.
Fixed the race condition by moving the consistency check after actually
checking eligibility of the export, not before.
Maria Matejka [Wed, 21 May 2025 11:11:10 +0000 (13:11 +0200)]
ASPA: fix the table subscription
With the partial reloads of ROA, the ASPA reload must be different.
When merging ASPA from 2.16, we forgot about that, and the ROA digesting
procedures simply do not work with ASPA, so we instead re-add the settle
timer back.
Maria Matejka [Wed, 21 May 2025 11:53:35 +0000 (13:53 +0200)]
ASPA: fix aspa_check_upstream and aspa_check_downstream parse crash
Due to internal differences between BIRD 2 and 3, there is
FI_CURRENT_ROUTE in BIRD 3 and `val.rte == NULL` is not supported
as a shortcut for fetching the route object.
When merging, we forgot to fix this difference in the BGP config parser,
thus the constant folder failed to recognize that this is not a constant
expression, leading to accidentally dereference the NULL route pointer
in config parse time.
Maria Matejka [Wed, 21 May 2025 09:06:01 +0000 (11:06 +0200)]
BGP: fix warnings for roa_check() with no import table and route refresh on
When roa_check() appears in import filters, the autoreload is switched
on but prohibited for BGP if no import table is available. By fixing the
route refresh feature in 9edc421148fe9c5e7d6038b667ba8fafb587a1eb, we
inadvertently exposed another bug where Nest wasn't distinguishing
between locally and remotely available reload.
Whereas on manual reconfiguration, the route refresh is expected to be
invoked, the autoreload should not trigger any remote actions. Therefore
now the channels actualy indicate whether their reload hook triggers
remote actions or not. This information can be then used to decide
whether to allow autoreload or not.
Maria Matejka [Wed, 21 May 2025 08:15:49 +0000 (10:15 +0200)]
All events are required to have their hook
This is quite an obvious requirement but it wasn't enforced properly,
leading to unnecessarily tedious debugging when an uninitialized event
is improperly enqueued.
Now BIRD is going to crash as soon as somebody tries to enqueue such
an event, not when trying to execute nothing from a clean context.
When recycling an existing configuration (configure undo), the found_old
flags were already set. Due to this oversight, log files failed to
reopen on configure undo.
Ondrej Zajicek [Fri, 9 May 2025 12:53:58 +0000 (14:53 +0200)]
Conf: Improve parsing of config datatypes
Parser rules for configuration datatypes were inconsistent about whether
they accept literals, constants, or expressions. We should accept each
of these three everywhere.
The patch also simplifies the grammar, makes it more uniform, and adds
some documentation about that.
xtex [Fri, 9 May 2025 11:14:50 +0000 (07:14 -0400)]
Build: call tools/version with sh
As tools/version has a shebang line, it should be
fine to just call it without specifying bash.
Calling bash explicitly may lead to inconsistency,
as the first line of tools/version indicates /usr/bin/sh
but the script is always executed with bash.
And, it adds bash as a new build dependency.
Maria Matejka [Thu, 8 May 2025 21:03:57 +0000 (23:03 +0200)]
BGP: Fix repeated route refresh request
The previous approach was crashing on rapid successions of route refreshs
without even completing the previous ones. Now the newly requested refreshs
just queue and don't start multiple refreshs over and over again.
Igor Putovny [Mon, 5 May 2025 13:47:10 +0000 (15:47 +0200)]
BGP: Fix route refresh behavior
On import filter reconfiguration, the route refresh capability is now
honored and used instead of restarting the session.
On export filter reconfiguration, the enhanced route refresh capability
is used to indicate BoRR and EoRR, unless the export table is on. In
such cases, only relevant changes are sent as proper updates.
When route refresh request is received, the enhanced route refresh
capability is now honored.
Fix TCP AO keys dump to dump sockets from all loops
The AO keys dump changes frequently and we don't always get
a notification when the relevant data changes. With that, we have to
send an event to each single birdloop to request the dump, and wait for
it all to complete.
Only sockets in main birdloop were dumped. Now, all sockets are dumped.
Because of serialization problems, socket dump is cached for each birdloop
and it is continuously updated. These cached dumps are then just
collected and written out without actually looking at the live data.
Maria Matejka [Wed, 7 May 2025 13:02:24 +0000 (15:02 +0200)]
BGP: Fix crash on too long export
When BGP route is short enough to be accepted but too long after local
changes, it is converted to withdraw. In these cases though, there was
a dangling pointer left from the prefix structure to the attribute
bucket.
That pointer is now pointing at the right place after the bucket gets
converted to withdraw.
Thanks to ix.br for catching and reporting this issue.