Ondřej Surý [Sat, 31 Jan 2026 06:24:49 +0000 (07:24 +0100)]
Cleanup the duplicate logic and comments around add into NSEC tree
After merging the NORMAL, NSEC and NSEC3 trees into a single QP tree, there
were some comments still speaking about auxiliary NSEC tree. These were
cleaned up and the logic when we pass the qp tree (write transaction) to
qpzone_addrdataset_inner() was changed to be more obvious that this is
needed only when we are adding NSEC records.
Colin Vidal [Mon, 16 Mar 2026 10:36:25 +0000 (11:36 +0100)]
chg: dev: Exclude named.args.j2 and system test README files from license header checks
Exclude named.args.j2 files from license header checks so named.args can
be generated from Jinja templates. Also exclude system test README files
from the license header checks.
Ondřej Surý [Mon, 16 Mar 2026 10:06:28 +0000 (11:06 +0100)]
fix: dev: Fix use-after-free in xfrin_recv_done
Move the LIBDNS_XFRIN_RECV_DONE probe execution before dns_xfrin_detach
in xfrin_recv_done.
Previously, dns_xfrin_detach was called before the trace probe, which
could free the xfr object. Because the accessed member xfr->info is an
embedded array, the expression evaluates via pointer arithmetic rather
than a direct memory dereference. Although this prevents a reliable
crash in practice, it technically remains a use-after-free issue.
Reorder the statements to ensure the transfer context is fully valid
when the probe executes.
Closes #5786
Merge branch '5786-fix-dtrace-after-free' into 'main'
Ondřej Surý [Wed, 4 Mar 2026 16:08:50 +0000 (17:08 +0100)]
Fix use-after-free in xfrin_recv_done
Move the LIBDNS_XFRIN_RECV_DONE probe execution before dns_xfrin_detach
in xfrin_recv_done.
Previously, dns_xfrin_detach was called before the trace probe, which
could free the xfr object. Because the accessed member xfr->info is an
embedded array, the expression evaluates via pointer arithmetic rather
than a direct memory dereference. Although this prevents a reliable
crash in practice, it technically remains a use-after-free issue.
Reorder the statements to ensure the transfer context is fully valid
when the probe executes.
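
The reordering can be illustrated with a minimal sketch. The types and helpers below (xfr_sketch_t, probe_recv_done(), xfr_sketch_detach()) are hypothetical stand-ins, not the BIND 9 definitions; the only point shown is that the probe reads the context before the last reference is dropped.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the transfer context (not the BIND type). */
typedef struct xfr_sketch {
	int references;
	char info[64]; /* embedded array, like the xfr->info member */
} xfr_sketch_t;

/* Records what the (hypothetical) trace probe observed. */
static char last_probe[64];

static void
probe_recv_done(const char *info) {
	strncpy(last_probe, info, sizeof(last_probe) - 1);
	last_probe[sizeof(last_probe) - 1] = '\0';
}

static xfr_sketch_t *
xfr_sketch_new(const char *info) {
	xfr_sketch_t *xfr = calloc(1, sizeof(*xfr));
	if (xfr == NULL) {
		abort();
	}
	xfr->references = 1;
	strncpy(xfr->info, info, sizeof(xfr->info) - 1);
	return xfr;
}

/* Drops one reference; frees the object when the last one is gone. */
static void
xfr_sketch_detach(xfr_sketch_t **xfrp) {
	xfr_sketch_t *xfr = *xfrp;
	*xfrp = NULL;
	if (--xfr->references == 0) {
		free(xfr);
	}
}

/* Fixed ordering: run the probe while the context is still valid. */
static void
recv_done(xfr_sketch_t *xfr) {
	probe_recv_done(xfr->info);
	xfr_sketch_detach(&xfr);
}
```

Calling recv_done() on an object holding the last reference is safe in this ordering, because nothing touches the object after the detach.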
Arаm Sаrgsyаn [Mon, 16 Mar 2026 10:01:32 +0000 (10:01 +0000)]
fix: dev: Fix OpenSSL 4 compatibility issue when calling X509_get_subject_name()
Starting from OpenSSL 4, the X509_get_subject_name() function
returns a 'const' pointer to a name instead of a regular pointer.
Duplicate the name before operating on it, then free it.
Closes #5807
Merge branch '5807-openssl-4-X509_get_subject_name-compat-fix' into 'main'
Aram Sargsyan [Thu, 12 Mar 2026 13:10:38 +0000 (13:10 +0000)]
OpenSSL 4 compatibility fix
Starting from OpenSSL 4, the X509_get_subject_name() function
returns a 'const' pointer to a name instead of a regular pointer.
Duplicate the name before operating on it, then free it.
Ondřej Surý [Sat, 14 Mar 2026 11:53:51 +0000 (12:53 +0100)]
Simplify checkds_create() to return void
Since memory allocation never fails in BIND 9, checkds_create() cannot
fail. Change it to return void and use designated initializers,
removing error handling at all call sites.
Ondřej Surý [Sat, 14 Mar 2026 11:53:03 +0000 (12:53 +0100)]
Fix TSIG key and transport leaks in zone_notify() error paths
Two 'goto next' paths in zone_notify() skipped detaching the TSIG
key and transport, leaking them on TLS configuration failure and
when the destination address is disabled.
The INSIST in isc_radix_insert() checks node->data[RADIX_V4] and
node->node_num[RADIX_V4] twice due to a copy-paste error, never
verifying the RADIX_V6 fields.
Fix the second pair to check RADIX_V6.
Merge branch 'ondrej/fix-copy-paste-error-checking-RADIX_V4-instead-of-RADIX_V6' into 'main'
Ondřej Surý [Wed, 11 Mar 2026 12:17:56 +0000 (13:17 +0100)]
Fix INSIST copy-paste error checking RADIX_V4 instead of RADIX_V6
The INSIST in isc_radix_insert() checks node->data[RADIX_V4] and
node->node_num[RADIX_V4] twice due to a copy-paste error, never
verifying the RADIX_V6 fields.
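
A minimal sketch of the corrected check follows. The node layout and the "empty" sentinel values (NULL and -1) are assumptions made for illustration, and node_sketch_t is a hypothetical type rather than the isc_radix one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { RADIX_V4 = 0, RADIX_V6 = 1, RADIX_FAMILIES = 2 };

/* Hypothetical node with one slot per address family. */
typedef struct node_sketch {
	void *data[RADIX_FAMILIES];
	int node_num[RADIX_FAMILIES];
} node_sketch_t;

/*
 * Corrected form: the second pair checks RADIX_V6. The copy-paste bug
 * repeated the RADIX_V4 pair here, so a stale V6 entry went unnoticed.
 */
static bool
node_is_unused(const node_sketch_t *node) {
	return node->data[RADIX_V4] == NULL &&
	       node->node_num[RADIX_V4] == -1 &&
	       node->data[RADIX_V6] == NULL &&
	       node->node_num[RADIX_V6] == -1;
}
```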
Ondřej Surý [Sat, 14 Mar 2026 10:02:10 +0000 (11:02 +0100)]
fix: dev: Fix port validation rejecting valid port 65535
Three port validation checks use >= UINT16_MAX instead of > UINT16_MAX,
incorrectly rejecting port 65535 as out of range. Port 65535 is a valid
TCP/UDP port number. Other port checks in the same file already use the
correct > comparison.
Merge branch 'ondrej/fix-port-validation-rejecting-valid-port-65535' into 'main'
Ondřej Surý [Wed, 11 Mar 2026 12:18:01 +0000 (13:18 +0100)]
Fix port validation rejecting valid port 65535
A few port validation checks use >= UINT16_MAX instead of > UINT16_MAX,
incorrectly rejecting port 65535 as out of range. Port 65535 is a valid
TCP/UDP port number. Other port checks in the same file already use the
correct > comparison.
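
The off-by-one can be shown in isolation. Both helper names below are hypothetical and only model the upper-bound comparison described above, not the actual validation code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Buggy form: ">=" treats 65535 (UINT16_MAX) as out of range. */
static bool
port_out_of_range_buggy(uint32_t port) {
	return port >= UINT16_MAX;
}

/* Fixed form: only values above 65535 are out of range. */
static bool
port_out_of_range(uint32_t port) {
	return port > UINT16_MAX;
}
```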
Ondřej Surý [Sat, 14 Mar 2026 09:10:37 +0000 (10:10 +0100)]
fix: dev: Fix memory leak in dns_catz_options_setdefault() for zonedir
When defaults->zonedir is set, opts->zonedir is unconditionally
overwritten without freeing the previous value. This leaks memory
on every catalog zone update when zonedir defaults are configured.
Free the existing opts->zonedir before replacing it.
Merge branch 'ondrej/fix-memory-leak-in-dns_catz_options_setdefault' into 'main'
Ondřej Surý [Wed, 11 Mar 2026 12:17:32 +0000 (13:17 +0100)]
Fix memory leak in dns_catz_options_setdefault() for zonedir
When defaults->zonedir is set, opts->zonedir is unconditionally
overwritten without freeing the previous value. This leaks memory
on every catalog zone update when zonedir defaults are configured.
Free the existing opts->zonedir before replacing it.
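
The pattern can be sketched as below, assuming plain malloc/free instead of BIND's memory contexts. opts_sketch_t, dup_string() and opts_setdefault() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical options structure holding an owned string. */
typedef struct opts_sketch {
	char *zonedir;
} opts_sketch_t;

static char *
dup_string(const char *s) {
	size_t len = strlen(s) + 1;
	char *copy = malloc(len);
	if (copy == NULL) {
		abort();
	}
	memcpy(copy, s, len);
	return copy;
}

/*
 * Apply defaults: release the previous value before overwriting it.
 * Without the free(), every call would leak the old string.
 */
static void
opts_setdefault(const opts_sketch_t *defaults, opts_sketch_t *opts) {
	if (defaults->zonedir != NULL) {
		free(opts->zonedir); /* free(NULL) is a no-op */
		opts->zonedir = dup_string(defaults->zonedir);
	}
}
```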
Ondřej Surý [Sat, 14 Mar 2026 06:45:57 +0000 (07:45 +0100)]
fix: usr: Fix intermittent named crashes during asynchronous zone operations
Asynchronous zone loading and dumping operations occasionally dispatched tasks
to the wrong internal event loop. This threading violation triggered internal
safety assertions that abruptly terminated named. Strict loop affinity is now
enforced for these tasks, ensuring they execute on their designated threads
and preventing the crashes.
Closes #4882
Merge branch '4882-run-rndc-zone-commands-on-correct-loop' into 'main'
Ondřej Surý [Tue, 10 Mar 2026 17:25:54 +0000 (18:25 +0100)]
Dispatch async work jobs from the correct loop
Refactor dns_loadctx_t and dns_dumpctx_t to use standard
ISC_REFCOUNT_DECL and ISC_REFCOUNT_IMPL macros, retiring the
redundant manual attach and detach implementations.
Introduce dns_loadctx_enqueue() and dns_dumpctx_enqueue() to
ensure compliance with the new strict loop affinity in
isc_work_enqueue(). If the current loop does not match the
target loop, the enqueue operation is safely bounced to the
correct thread via isc_async_run().
Ondřej Surý [Tue, 10 Mar 2026 17:25:37 +0000 (18:25 +0100)]
Enforce isc_work enqueue loop affinity
Add a REQUIRE(isc_loop() == loop) assertion to isc_work_enqueue()
to strictly enforce that work is enqueued from the loop it is
assigned to. This loudly prohibits cross-thread queue manipulation
before it inevitably turns into a concurrency debugging nightmare.
Michał Kępień [Thu, 12 Mar 2026 11:27:36 +0000 (12:27 +0100)]
Fix a typo in job name
As hinted at by the comment preceding it, the job preparing packager
notifications was (rather unsurprisingly) supposed to be called
"prepare-packager-notification". Fix the typo in its name.
Petr Špaček [Tue, 10 Mar 2026 17:04:51 +0000 (18:04 +0100)]
Delete early access token when code is published
Technically this is not necessary because the token expires in one week
after creation, and new code would have got there only one week before
the next public release, but better be safe than sorry.
The catch is that after_script gets executed even if a job fails or is
canceled. Delete distros token only if publication succeeded.
Ondřej Surý [Tue, 10 Mar 2026 17:38:37 +0000 (18:38 +0100)]
fix: dev: Fix resquery reference imbalance on TCP connect failure
In fctx_query(), resquery_ref(query) is called before
dns_dispatch_connect() in anticipation of the resquery_connected()
callback consuming the reference. When dns_dispatch_connect() fails
synchronously on TCP (e.g. from dns_transport_get_tlsctx() failing
in tcp_dispatch_connect()), the connect callback is never scheduled,
so the extra reference is never consumed. This has been fixed.
Merge branch 'ondrej/fix-resquery-refcount' into 'main'
Ondřej Surý [Fri, 6 Mar 2026 16:06:24 +0000 (17:06 +0100)]
Fix resquery reference imbalance on TCP connect failure
In fctx_query(), resquery_ref(query) is called before
dns_dispatch_connect() in anticipation of the resquery_connected()
callback consuming the reference.
When dns_dispatch_connect() fails synchronously on TCP (e.g. from
dns_transport_get_tlsctx() failing in tcp_dispatch_connect()), the
connect callback is never scheduled, so the extra reference is never
consumed. The error path then tears down the query via manual cleanup
(isc_mem_put) without going through the refcount destructor, leaving
the reference imbalanced.
Fix by dropping the extra reference on the error path, just after
dns_dispatch_done() which cleans up the dispatch entry.
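
The reference balance can be modelled with a small sketch. query_sketch_t, connect_sketch() and start_query() are hypothetical, and the synchronous failure is simulated with a flag rather than a real connect call.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical query object with a plain reference counter. */
typedef struct query_sketch {
	int references;
} query_sketch_t;

/* Simulated connect: returns nonzero when it fails synchronously. */
static int
connect_sketch(bool fail_sync) {
	return fail_sync ? -1 : 0;
}

static int
start_query(query_sketch_t *query, bool fail_sync) {
	/* Reference taken on behalf of the connect callback. */
	query->references++;

	if (connect_sketch(fail_sync) != 0) {
		/*
		 * The callback was never scheduled, so nothing will
		 * consume the reference: drop it on the error path.
		 */
		query->references--;
		return -1;
	}
	return 0;
}
```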
Nicki Křížek [Tue, 10 Mar 2026 15:07:54 +0000 (16:07 +0100)]
chg: test: Disable statschannel RTT tests on FreeBSD
These tests rely on somewhat precise timing, as they test that answers
arrive in a particular latency bucket within the statschannel stats.
These tests are affected by various timing and network issues on our
FreeBSD CI runners and the results are very unstable. Skip these on
FreeBSD entirely.
Merge branch 'nicki/disable-statschannel-rtt-on-freebsd' into 'main'
Nicki Křížek [Tue, 10 Mar 2026 12:35:56 +0000 (13:35 +0100)]
Disable statschannel RTT tests on FreeBSD
These tests rely on somewhat precise timing, as they test that answers
arrive in a particular latency bucket within the statschannel stats.
These tests are affected by various timing and network issues on our
FreeBSD CI runners and the results are very unstable. Skip these on
FreeBSD entirely.
Nicki Křížek [Mon, 9 Mar 2026 15:48:24 +0000 (16:48 +0100)]
chg: ci: Re-enable shotgun runs for nightlies and tags
The recent rewrite of DNS Shotgun infrastructure might've improved the
prior instability. In order to evaluate, re-enable the regular shotgun
pipelines to gather data.
Merge branch 'nicki/ci-shotgun-enable' into 'main'
Nicki Křížek [Thu, 29 Jan 2026 10:10:10 +0000 (11:10 +0100)]
Re-enable shotgun runs
Make the shotgun pipelines on-demand with 5 samples (and no retry) by
default. MRs are compared to their base, while other sources (triggers,
web, schedule...) are compared against the latest released version.
For schedules, run the shotgun pipelines on Monday morning only, but
with the increased number of samples. This should provide useful data
without too many false positives.
Nicki Křížek [Mon, 9 Mar 2026 12:12:14 +0000 (13:12 +0100)]
chg: test: Log dnspython queries after .to_wire() is called
Some dns message modifications like TSIG happen only after .to_wire() is
called on the message. To ensure there isn't a discrepancy between what
has been logged and what has been sent, log the query after
dns.query.udp() is executed (which calls .to_wire() on the message).
Merge branch 'nicki/pytest-log-querymsg' into 'main'
Nicki Křížek [Tue, 3 Mar 2026 12:37:14 +0000 (13:37 +0100)]
Log dnspython queries after .to_wire() is called
Some dns message modifications like TSIG happen only after .to_wire() is
called on the message. To ensure there isn't a discrepancy between what
has been logged and what has been sent, log the query after
dns.query.udp() is executed (which calls .to_wire() on the message).
Alessio Podda [Fri, 6 Mar 2026 14:06:13 +0000 (14:06 +0000)]
chg: dev: Replace lock keyfile hashmap with lock pool
Kasp used a lock per zone origin in order to prevent concurrent access
to keyfiles. This led to substantial memory consumption in the case of
authoritative servers with many small zones, as lots of locks need to be
allocated.
Since the number of keyfile locks taken cannot exceed the number of
helper threads, it makes more sense to use a lock pool of fixed size
keyed by the hash of the origin name, leading to memory savings.
Merge branch 'alessio/keyfile-lock-pool' into 'main'
Alessio Podda [Fri, 27 Feb 2026 12:33:55 +0000 (13:33 +0100)]
Replace lock keyfile hashmap with lock pool
Kasp used a lock per zone origin in order to prevent concurrent access
to keyfiles. This led to substantial memory consumption in the case of
authoritative servers with many small zones, as lots of locks need to be
allocated.
Since the number of keyfile locks taken cannot exceed the number of
helper threads, it makes more sense to use a lock pool of fixed size
keyed by the hash of the origin name, leading to memory savings.
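
A fixed-size lock pool keyed by a name hash might look like the sketch below. The pool size, the FNV-1a hash and all identifiers are assumptions chosen for illustration; a real implementation would use a case-insensitive DNS name hash and BIND's own mutex wrappers.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define LOCKPOOL_SIZE 16 /* hypothetical; fixed regardless of zone count */

typedef struct lockpool {
	pthread_mutex_t locks[LOCKPOOL_SIZE];
} lockpool_t;

static void
lockpool_init(lockpool_t *pool) {
	for (size_t i = 0; i < LOCKPOOL_SIZE; i++) {
		pthread_mutex_init(&pool->locks[i], NULL);
	}
}

static void
lockpool_destroy(lockpool_t *pool) {
	for (size_t i = 0; i < LOCKPOOL_SIZE; i++) {
		pthread_mutex_destroy(&pool->locks[i]);
	}
}

/* FNV-1a, used here only as a simple stand-in for a name hash. */
static uint32_t
name_hash(const char *name) {
	uint32_t hash = 2166136261u;
	for (; *name != '\0'; name++) {
		hash ^= (unsigned char)*name;
		hash *= 16777619u;
	}
	return hash;
}

static size_t
lockpool_slot(const char *origin) {
	return name_hash(origin) % LOCKPOOL_SIZE;
}

/* The same origin always maps to the same lock; distinct origins may share one. */
static pthread_mutex_t *
lockpool_get(lockpool_t *pool, const char *origin) {
	return &pool->locks[lockpool_slot(origin)];
}
```

Two zones that hash to the same slot simply serialize on one mutex, which trades a small amount of contention for memory that no longer grows with the number of zones.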
This commit adds a new CI job to update the BIND9 version in the
isc-projects/bind9-docker project, which will cause the docker images
to be rebuilt for release. Previously a manual step.
A notification is sent to the relevant Mattermost channel.
fix: usr: Fix setting retire in dns_keymgr_key_init
A wrong-variable bug in `dns_keymgr_key_init()` caused the DNSSEC key inactive
time to never be read. As a result, the key state was not retracting zone
signatures when it should have, delaying the key rollover.
ISC would like to thank Naresh Kandula Parmar (Nottiboy) for reporting this.
Closes #5774
Merge branch '5774-fix-setting-retire' into 'main'
Make the maximum number of processed delegation nameservers configurable
via the new 'max-delegation-servers' option (default: 13), replacing the
hardcoded NS_PROCESSING_LIMIT (20).
The default is reduced to 13 to precisely match the maximum number of
root servers that can fit into a classic 512-byte UDP payload. This
provides a natural, historically sound cap that mitigates resource
exhaustion and amplification attacks from artificially inflated or
misconfigured delegations.
The configuration option is strictly bounded between 1 and 100 to ensure
resolver stability.
Merge branch 'ondrej/make-NS_PROCESSING_LIMIT-configurable' into 'main'
Make the maximum number of processed delegation nameservers configurable
via the new 'max-delegation-servers' option (default: 13), replacing the
hardcoded NS_PROCESSING_LIMIT (20).
The default is reduced to 13 to precisely match the maximum number of
root servers that can fit into a classic 512-byte UDP payload. This
provides a natural, historically sound cap that mitigates resource
exhaustion and amplification attacks from artificially inflated or
misconfigured delegations.
The configuration option is strictly bounded between 1 and 100 to ensure
resolver stability.
Štěpán Balážik [Tue, 3 Mar 2026 06:50:22 +0000 (06:50 +0000)]
fix: ci: Fix .respdiff-recent-named anchor to work when the ABI changes
Previously, on 9.20 and 9.18, both builds (reference and the version
being tested) would use the same .so files which led to a crash if the
ABI changed.
Use `git worktree` to get completely separate build environment for the
reference version.
This is not a problem on 9.21 as Meson is smart and covers this mistake,
but apply the fix to it as well for consistency.
This also is not a problem on non-MR pipelines: the latest released version
was used as a reference there, so the .so versions would differ.
Štěpán Balážik [Mon, 2 Mar 2026 14:54:53 +0000 (15:54 +0100)]
Fix .respdiff-recent-named anchor to work when the ABI changes
Previously, on 9.20 and 9.18, both builds (reference and the version
being tested) would use the same .so files which led to a crash if the
ABI changed.
Use `git worktree` to get completely separate build environment for the
reference version.
This is not a problem on 9.21 as Meson is smart and covers this mistake,
but apply the fix to it as well for consistency.
Colin Vidal [Sun, 1 Mar 2026 08:21:03 +0000 (09:21 +0100)]
fix: usr: Resolve "key defined in view is not found"
A recent change in `2956e4fc45b3c2142a3351682d4200647448f193` hardened the `key` name check when used in `primaries` to immediately reject the configuration if the key was not defined (rather than only checking whether the key name was correctly formed). However, the change introduced a regression that prevented the use of a `key` defined in a view. This is now fixed.
Colin Vidal [Mon, 23 Feb 2026 18:36:19 +0000 (19:36 +0100)]
checkconf: check key existence in views
Commit `2956e4fc45b3c2142a3351682d4200647448f193` hardened the `key`
name check when used in `primaries` to reject the configuration if
the key was not defined, rather than simply checking whether the
key name was correctly formed.
However, the key name check didn't include the view configuration,
causing keys not to be recognized if they were defined inside the
view and not at the global level. This regression is now fixed.
Michał Kępień [Fri, 27 Feb 2026 15:52:20 +0000 (16:52 +0100)]
chg: doc: Update Sphinx-related Python packages
Update Sphinx-related Python packages to their current versions pulled
in by "pip install sphinx-rtd-theme" run in a fresh Debian "bookworm"
container.
Merge branch 'michal/update-sphinx-related-python-packages' into 'main'
Michał Kępień [Fri, 27 Feb 2026 13:10:26 +0000 (14:10 +0100)]
Update Sphinx-related Python packages
Update Sphinx-related Python packages to their current versions pulled
in by "pip install sphinx-rtd-theme" run in a fresh Debian "bookworm"
container.
Arаm Sаrgsyаn [Thu, 26 Feb 2026 17:21:24 +0000 (17:21 +0000)]
new: usr: Provide response round-trip time (RTT) counters via statistics channel
Previously, :iscman:`named` provided RTT counters for outgoing
queries performed by itself during name resolutions. Now this
has been improved to provide more granular counters (histogram),
and to also provide RTT counters for the incoming queries.
Closes #5279
Merge branch '5279-query-rtt-isc_histo_t-statistics' into 'main'
Aram Sargsyan [Thu, 15 Jan 2026 14:46:06 +0000 (14:46 +0000)]
Replace the outgoing queries RTT histogram code with isc_histomulti
The granularity of the simple histogram with a fixed number of ranges
sometimes isn't good enough. As there's a need to implement new
histogram statistics for the incoming query times (RTT), it was decided
to also update the existing RTT statistics of the outgoing queries
so that they look similar and use common code.
Remove the old histogram code from the resolver and from the statistics
channel. Reimplement the outgoing queries RTT histogram using the
isc_histomulti module, and prepare the necessary base for implementing
the incoming queries RTT histogram. The statistics channel will be
updated to expose the new histograms in an upcoming commit.
Aram Sargsyan [Thu, 15 Jan 2026 14:38:44 +0000 (14:38 +0000)]
Use standard reference counting for isc_histomulti
Use reference counting for isc_histomulti module so that it's
possible to attach/detach to/from the objects when used in the
statistics channel in the coming commits.
Ondřej Surý [Thu, 26 Feb 2026 06:33:29 +0000 (07:33 +0100)]
chg: dev: Implement Fisher-Yates shuffle for nameserver selection
Replace the two-pass "random start index and wrap around" logic in
fctx_getaddresses_nameservers() with a statistically sound partial
Fisher-Yates shuffle.
The previous implementation picked a random starting node and did two
passes over the linked list to find query candidates. The new logic
introduces fctx_getaddresses_nsorder() to perform an in-place
randomization of indices into a bounded, stack-allocated lookup array
(nsorder) representing the "winning" fetch slots.
The nameserver dataset is now traversed in exactly one sequential pass:
1. Every nameserver is evaluated for local cached data.
2. If the current nameserver's sequential index exists in the randomized
nsorder array, it is permitted to launch an outgoing network fetch.
3. If not, it is restricted to local lookups via DNS_ADBFIND_NOFETCH.
This guarantees a fair random distribution for outbound queries while
maximizing local cache hits, entirely within O(1) memory and without
the overhead of linked-list pointer shuffling or dynamic allocation.
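
The selection step can be sketched as a partial Fisher-Yates shuffle over indices. This is an illustration under stated assumptions, not the fctx_getaddresses_nsorder() code: MAX_NS, pick_fetch_slots() and may_fetch() are hypothetical, and rand() with a modulo is used for brevity even though it is slightly biased, so a real resolver would use a uniform random source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_NS 32 /* hypothetical bound on the stack-allocated arrays */

/*
 * Pick up to `want` distinct indices out of [0, count) into order[].
 * Only the first `want` positions are shuffled (partial Fisher-Yates).
 * Returns the number of indices written.
 */
static size_t
pick_fetch_slots(size_t count, size_t want, size_t order[MAX_NS]) {
	size_t idx[MAX_NS];

	if (count > MAX_NS) {
		count = MAX_NS;
	}
	if (want > count) {
		want = count;
	}
	for (size_t i = 0; i < count; i++) {
		idx[i] = i;
	}
	for (size_t i = 0; i < want; i++) {
		/* Swap position i with a random position in [i, count). */
		size_t j = i + (size_t)rand() % (count - i);
		size_t tmp = idx[i];
		idx[i] = idx[j];
		idx[j] = tmp;
		order[i] = idx[i];
	}
	return want;
}

/* True if the nameserver at `index` won one of the fetch slots. */
static bool
may_fetch(const size_t *order, size_t picked, size_t index) {
	for (size_t i = 0; i < picked; i++) {
		if (order[i] == index) {
			return true;
		}
	}
	return false;
}
```

During the single sequential pass, a nameserver whose index is not in order[] would be limited to cached data only, which corresponds to step 3 above.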
Closes #5695
Merge branch '5695-refactor-the-random-NS-selection' into 'main'
Colin Vidal [Wed, 25 Feb 2026 18:01:22 +0000 (19:01 +0100)]
Add test coverage for nameserver processing limits
Introduce a new system test (nsprocessinglimit) to verify that the
resolver strictly respects outgoing network fetch quotas when presented
with heavily delegated, unresponsive zones.
This test acts as a regression check for the recent Fisher-Yates nameserver
selection refactor. It sets up an authoritative server delegating a zone
to 23 distinct nameservers (all pointing to unresponsive loopback IPs).
Using dnstap, the test forces a resolution failure and verifies that:
1. The resolver successfully traverses the zone delegation path.
2. The resolver caps the outgoing network queries to the delegated
nameservers exactly at the processing limit (20 fetches), ensuring
array boundaries and dynamic fetch quotas are strictly enforced without
crashing or hanging.
Ondřej Surý [Wed, 25 Feb 2026 15:46:40 +0000 (16:46 +0100)]
Implement Fisher-Yates shuffle for nameserver selection
Replace the two-pass "random start index and wrap around" logic in
fctx_getaddresses_nameservers() with a statistically sound Fisher-Yates
shuffle.
The previous implementation picked a random starting node and did two
passes over the linked list to find query candidates. The new logic
extracts the available nameservers into a bounded, stack-allocated array
of dns_rdata_t structures.
This array is then randomized in-place using a Fisher-Yates shuffle.
Finally, the shuffled array is traversed sequentially to launch fetches
until the dynamic quota (fctx->pending_running >= fetches_allowed) is
reached.
This guarantees a fair random distribution for outbound queries while
properly respecting dynamic query limits, entirely within O(1) memory
and without the overhead of linked-list pointer shuffling or multiple
dataset traversals.
Ondřej Surý [Wed, 25 Feb 2026 09:05:55 +0000 (10:05 +0100)]
fix: usr: Remove deterministic selection of nameserver
When selecting nameserver addresses to be looked up we were
always selecting them in dnssec name order from the start of
the nameserver rrset. This could lead to resolution failure
despite there being addresses that could be resolved for the
other names. Use a random starting point when selecting which
names to look up.
Closes #5695
Closes #5745
Merge branch '5695-add-random-server-selection' into 'main'
Colin Vidal [Tue, 24 Feb 2026 16:30:56 +0000 (17:30 +0100)]
system test covering NS randomization
Add randomizens system test which ensures that NS are randomly selected.
The test relies on the fact that the `getaddresses_allowed()` logic won't
allow querying more than 3 NS at the top level. The `example.` zone has
4 NS and the first 3 are lame. As a result, if the resolver doesn't
randomize the NS selection, it will only query the first 3, which
won't give an answer, and fail. With randomization enabled, there is a
chance that the resolver queries the fourth NS and gets the result.
Mark Andrews [Fri, 19 Dec 2025 07:12:06 +0000 (18:12 +1100)]
Remove deterministic selection of nameserver
When selecting nameserver addresses to be looked up we were
always selecting them in dnssec name order from the start of
the nameserver rrset. This could lead to resolution failure
despite there being addresses that could be resolved for the
other names. Use a random starting point when selecting which
names to look up.
Ondřej Surý [Wed, 25 Feb 2026 06:29:23 +0000 (07:29 +0100)]
sec: usr: Remove purged adb names and entries from SIEVE list immediately
Both expire_name() and expire_entry() use the isc_async mechanism to remove
the names and entries from the SIEVE-LRU lists on the matching isc_loop.
Under certain circumstances, this could lead to double counting the
purged names/entries when purging the SIEVE-LRU lists under the overmem
condition. This would cause not enough memory to be cleaned up and the
ADB would then never recover from the overmem condition, leading to an OOM
crash of named.
Merge branch 'ondrej/fix-runaway-memory-in-adb' into 'main'
Ondřej Surý [Tue, 10 Feb 2026 05:16:31 +0000 (06:16 +0100)]
Remove purged adb names and entries from SIEVE list immediately
Both `expire_name()` and `expire_entry()` use the isc_async mechanism to
remove names and entries from the SIEVE-LRU lists on the matching
isc_loop.
Under heavy load when the cleaning mechanism didn't have the chance to
kick in yet, this delay could lead to double-counting the purged names
and entries when purging the SIEVE-LRU lists during an overmem
condition. This would result in insufficient memory being cleaned up,
causing the ADB to never recover from the overmem condition and leading
to an OOM crash of `named`.
This patch resolves the issue by bypassing the async queue and executing
the removal synchronously if the target loop matches the current
isc_loop().
If a BIND 9 administrator imports an invalid SKR file, a local stack
buffer in the import function might overflow. This could lead to
memory corruption on the stack and ultimately a server crash.
This has been fixed.
ISC would like to thank mcsky23 for bringing this bug to our attention.
Closes #5758
Merge branch '5758-fix-stack-overflow-via-rndc-skr-import' into 'main'