Ondřej Surý [Sat, 16 May 2026 10:30:01 +0000 (12:30 +0200)]
new: dev: Enable PR-Agent reviews on merge requests
Adds a CI job that runs PR-Agent against each merge request opened
from the canonical repository, posting an automated review and
code-improvement suggestions as MR comments. The job is gated to
same-project source branches so the OpenAI key and personal access
token are not exposed to fork pipelines.
Ondřej Surý [Sat, 16 May 2026 06:23:50 +0000 (08:23 +0200)]
Add PR-Agent job to GitLab CI for merge-request review
Run PR-Agent's `review` and `improve` commands against each merge
request from the canonical repository, posting an automated review
and code-improvement suggestions as MR comments. The rule restricts
the job to MRs whose source project matches CI_PROJECT_PATH so the
OpenAI key and GitLab personal access token are never exposed to
fork pipelines.
Ondřej Surý [Fri, 15 May 2026 08:08:46 +0000 (10:08 +0200)]
Allow any valid DNS name as a key name
TSIG key names need to be any valid DNS name so that update-policy
"self" rules work with arbitrary names. Replace the
alnum+'.'+'-'+'_' charset filter in the key-generation tools with a
dns_name_fromstring() validity check.
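A rough Python sketch of the difference between the two checks follows. It is not the C code from the key-generation tools: is_valid_dns_name() is a hypothetical stand-in for a dns_name_fromstring() validity check and only enforces label and name length limits, without escape handling.

```python
import re

# Old behaviour as described above: alphanumerics plus '.', '-' and '_'.
OLD_CHARSET = re.compile(r"^[A-Za-z0-9._-]+$")


def old_charset_ok(name: str) -> bool:
    return bool(OLD_CHARSET.match(name))


def is_valid_dns_name(name: str) -> bool:
    """Hypothetical, simplified validity check (no escape sequences)."""
    if name == ".":
        return True
    if name.endswith("."):
        name = name[:-1]
    labels = name.split(".")
    # Every label must be 1..63 octets long.
    if any(len(label) == 0 or len(label.encode()) > 63 for label in labels):
        return False
    # Wire form: one length octet per label, plus the root label.
    wire_len = sum(len(label.encode()) + 1 for label in labels) + 1
    return wire_len <= 255
```

Under these assumptions a name such as "host/example.com" fails the old charset filter but passes the structural check.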
Ondřej Surý [Fri, 15 May 2026 07:33:09 +0000 (09:33 +0200)]
chg: dev: Use SipHash-1-3 for hash tables, keep SipHash-2-4 for cookies
SipHash-2-4 was designed as a conservative PRF/MAC with extra rounds
against future attacks. For hash tables, where outputs are never
exposed, SipHash-1-3 provides sufficient collision resistance with
fewer rounds. As the SipHash author noted: "I would be very surprised
if SipHash-1-3 introduced weaknesses for hash tables."
DNS cookies continue to use SipHash-2-4 since cookie values are sent
on the wire and must resist online attacks.
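For illustration only, the round counts are the sole difference between the two variants. The following is a minimal pure-Python sketch of SipHash-c-d, not the implementation used in BIND; the function name and parameters are chosen here for the example.

```python
MASK = 0xFFFFFFFFFFFFFFFF


def _rotl(x: int, b: int) -> int:
    return ((x << b) | (x >> (64 - b))) & MASK


def _sipround(v0, v1, v2, v3):
    v0 = (v0 + v1) & MASK; v1 = _rotl(v1, 13); v1 ^= v0; v0 = _rotl(v0, 32)
    v2 = (v2 + v3) & MASK; v3 = _rotl(v3, 16); v3 ^= v2
    v0 = (v0 + v3) & MASK; v3 = _rotl(v3, 21); v3 ^= v0
    v2 = (v2 + v1) & MASK; v1 = _rotl(v1, 17); v1 ^= v2; v2 = _rotl(v2, 32)
    return v0, v1, v2, v3


def siphash(key: bytes, data: bytes, c: int = 2, d: int = 4) -> int:
    """SipHash with c compression rounds and d finalization rounds."""
    k0 = int.from_bytes(key[:8], "little")
    k1 = int.from_bytes(key[8:16], "little")
    v0 = k0 ^ 0x736F6D6570736575
    v1 = k1 ^ 0x646F72616E646F6D
    v2 = k0 ^ 0x6C7967656E657261
    v3 = k1 ^ 0x7465646279746573
    body = len(data) - (len(data) % 8)
    for i in range(0, body, 8):
        m = int.from_bytes(data[i:i + 8], "little")
        v3 ^= m
        for _ in range(c):
            v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
        v0 ^= m
    # Last block: remaining bytes plus the input length in the top byte.
    last = ((len(data) & 0xFF) << 56) | int.from_bytes(data[body:], "little")
    v3 ^= last
    for _ in range(c):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    v0 ^= last
    v2 ^= 0xFF
    for _ in range(d):
        v0, v1, v2, v3 = _sipround(v0, v1, v2, v3)
    return v0 ^ v1 ^ v2 ^ v3
```

In this sketch siphash(key, data, c=1, d=3) would be the hash-table variant and the defaults give SipHash-2-4.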
Ondřej Surý [Fri, 15 May 2026 06:03:16 +0000 (08:03 +0200)]
fix: test: Fix flaky reclimit test
The max-types-per-name cache eviction tests were flaky because two test steps were missing a sleep between queries, causing TTL-based cache verification to fail when both queries completed within the same second.
Merge branch 'ondrej/fix-flaky-reclimit' into 'main'
The cache verification in steps 11 and 15 checks that the TTL has
decreased from its initial value to confirm the response was served
from cache, but the sleep between the two queries was missing. Both
queries could complete within the same second, leaving the TTL
unchanged and causing the test to incorrectly conclude the entry was
not cached.
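The failure mode can be reproduced with a toy model. This is a sketch with hypothetical names, assuming the cache reports the remaining TTL in whole seconds; it is not the system test code.

```python
class CachedEntry:
    """Toy cache entry that reports its remaining TTL in whole seconds."""

    def __init__(self, ttl: int, now: float):
        self.ttl = ttl
        self.inserted_at = int(now)

    def remaining_ttl(self, now: float) -> int:
        return self.ttl - (int(now) - self.inserted_at)


def served_from_cache(initial_ttl: int, observed_ttl: int) -> bool:
    # The check described above: a decreased TTL means a cached answer.
    return observed_ttl < initial_ttl


entry = CachedEntry(ttl=300, now=100.2)
# Second query within the same second: TTL unchanged, check wrongly fails.
same_second = served_from_cache(300, entry.remaining_ttl(100.9))
# Second query after a one-second sleep: TTL has dropped, check passes.
after_sleep = served_from_cache(300, entry.remaining_ttl(101.2))
```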
Ondřej Surý [Fri, 15 May 2026 05:48:26 +0000 (07:48 +0200)]
chg: dev: Skip in-domain nameservers that have no glue
A referral that names a nameserver inside the delegated zone but
provides no address for it leaves the resolver unable to reach that
server. named now logs "missing mandatory glue for <name>" at notice
level and skips the nameserver.
Merge branch 'ondrej/dont-store-missing-in-domain-glue-ns' into 'main'
Ondřej Surý [Wed, 6 May 2026 10:37:03 +0000 (12:37 +0200)]
Drop in-domain NS without glue from the delegation set
Pull the dns_message_findname() lookups into cache_delegglue() and
cache_delegglue6() so each helper now owns its glue lookup and returns
the number of addresses cached. cache_delegns() splits referrals into
two cases: in-domain (the NS name is below the delegation point) and
sibling/in-bailiwick.
An in-domain NS without glue is unresolvable by definition - the
resolver would have to ask the very server it's trying to find. Log
"missing mandatory glue" at notice level and skip the deleg entirely
rather than leaving an unusable entry in the set. A new
dns_delegset_freedeleg() undoes a fresh dns_delegset_allocdeleg() so
the rest of the delegation set is preserved.
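The in-domain rule can be sketched in a few lines of Python. This is an illustration with hypothetical helper names, not the cache_delegns() code; names are compared as plain strings.

```python
def is_subdomain(name: str, parent: str) -> bool:
    name = name.lower().rstrip(".")
    parent = parent.lower().rstrip(".")
    return name == parent or name.endswith("." + parent)


def build_delegation(zone, ns_names, glue):
    """Split a referral into usable NS entries and skipped ones."""
    kept, skipped = [], []
    for ns in ns_names:
        addrs = glue.get(ns, [])
        if is_subdomain(ns, zone) and not addrs:
            # In-domain NS without glue: "missing mandatory glue for <name>".
            skipped.append(ns)
            continue
        # Out-of-domain NS names can still be resolved separately.
        kept.append((ns, addrs))
    return kept, skipped
```

The remaining entries are kept even when one nameserver is skipped, which mirrors the intent of preserving the rest of the delegation set.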
Ondřej Surý [Fri, 15 May 2026 04:57:00 +0000 (06:57 +0200)]
chg: usr: Fall back to TCP on a UDP response with a mismatched query id
BIND used to wait silently for the correct DNS message id on a UDP fetch
even after receiving a response from the expected server with the wrong
id, leaving room for off-path spoofing attempts to keep guessing within
that window. The resolver now retries the fetch over TCP on the first
such response, and a new MismatchTCP statistics counter tracks how
often the fallback fires.
Closes #5449
Merge branch '5449-immediate-tcp-fallback-on-id-mismatch' into 'main'
Ondřej Surý [Thu, 14 May 2026 10:20:19 +0000 (12:20 +0200)]
Switch UDP fetches to TCP on the first response with a wrong query id
Until now, the dispatcher silently dropped UDP responses from the
expected peer that carried the wrong DNS message id and kept listening
for the correct id to arrive within the read timeout. An off-path
attacker who knows the destination address and source port of an
outgoing fetch could exploit that quiet retry window to flood the
resolver with guessed responses; with a gigabit link the per-query
success probability grows linearly with the number of guesses that
arrive before the legitimate answer or the timeout.
Treat any such mismatch as a possible spoofing attempt and let the
resolver immediately retry the same query over TCP, the same control
path the truncation handler already uses.
Add a resolver statistics counter, exposed as 'queries retried over TCP
after a response with mismatched query id' in rndc stats and as
'MismatchTCP' in the statistics channel.
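A minimal sketch of the decision, with hypothetical names and not the dispatcher code. It assumes responses from unexpected peers are simply ignored, and models the attacker as sending distinct 16-bit id guesses.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    IGNORE = "ignore"
    RETRY_TCP = "retry over TCP"


def classify_udp_response(expected_peer, expected_id, peer, msg_id):
    if peer != expected_peer:
        return Action.IGNORE      # assumption: not from the queried server
    if msg_id != expected_id:
        return Action.RETRY_TCP   # first mismatch ends the guessing window
    return Action.ACCEPT


def spoof_success_probability(guesses: int) -> float:
    """Chance that one of `guesses` distinct 16-bit ids is the right one."""
    return min(guesses, 65536) / 65536
```

With the old behaviour an attacker could land many guesses inside one read timeout; with the fallback, the first wrong guess moves the fetch to TCP.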
Ondřej Surý [Thu, 14 May 2026 06:52:58 +0000 (08:52 +0200)]
fix: dev: Fix data race during rndc dumpdb or zone load
Both 'rndc dumpdb' against a server with zones and asynchronous zone
loading had a timing window in which the operation's completion could
fire before the server had finished registering the operation,
occasionally leading to a crash. The completion is now delivered after
the registration is in place.
Closes #5952
Merge branch '5952-fix-masterdump-async-ctx-race' into 'main'
Ondřej Surý [Fri, 8 May 2026 05:46:03 +0000 (07:46 +0200)]
Fix data race in async master dump/load context publication
Bouncing the offload itself to the target loop let the after-work
callback fire on the target thread and run the user's done callback
before the calling thread had published *dctxp / *lctxp. Enqueue on
the calling loop and bounce only the done callback instead, so the
publish is sequenced before the cross-thread hand-off by construction
and cannot be reintroduced by reordering the entry-point body.
Mark Andrews [Thu, 14 May 2026 00:00:21 +0000 (10:00 +1000)]
Disable output escaping in bind9.xsl
The statistics charts were not displaying on some browsers (e.g. Chrome)
due to '>' being escaped as '&gt;'. Use disable-output-escaping="yes" to
turn this off.
Colin Vidal [Wed, 13 May 2026 20:31:32 +0000 (22:31 +0200)]
fix: test: Fix cyclic glues (again)
The previous fix `ed90d578b3a98f45eb8bc09966e9c4ab870a156d` used
`wait_for_line()` by mistake, but the test needs to wait for two log
lines to be printed before continuing.
In principle, `wait_for_all()` should do, but `running` should always be
printed first, so `wait_for_sequence()` seems to be the right fit here.
Merge branch 'colin/fix-cyclic-glues-again' into 'main'
Colin Vidal [Wed, 13 May 2026 13:20:35 +0000 (15:20 +0200)]
Fix cyclic glues (again)
The previous fix `ed90d578b3a98f45eb8bc09966e9c4ab870a156d` used
`wait_for_line()` by mistake, but the test needs to wait for two log
lines to be printed before continuing (rather than continuing as soon
as one of them is printed).
Instead, `wait_for_all()` is used since the order of the two expected
log lines is not guaranteed.
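The three matching behaviours discussed here differ as follows. This is a sketch over a static list of log lines with hypothetical helper names; the real isctest helpers also wait and time out, which is omitted.

```python
def any_line(log, patterns):
    """True as soon as any one pattern appears (wait_for_line-like)."""
    return any(p in line for line in log for p in patterns)


def all_lines(log, patterns):
    """True once every pattern has appeared, in any order."""
    return all(any(p in line for line in log) for p in patterns)


def in_sequence(log, patterns):
    """True only if the patterns appear in the given order."""
    lines = iter(log)
    # Each pattern consumes lines from where the previous one matched.
    return all(any(p in line for line in lines) for p in patterns)


log = ["zone example/IN: loaded serial 2", "running"]
wanted = ["running", "zone example/IN: loaded"]
```

For this log, any_line() and all_lines() succeed for `wanted`, while in_sequence() succeeds only when the patterns are listed in the order the lines were printed.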
The global RUNNER_SCRIPT_TIMEOUT: 55m in the parent pipeline was being
forwarded to the stress and tsan:stress child pipelines, where forwarded
yaml variables outrank job-level variables. That caused stress jobs with
BIND_STRESS_TESTS_RUN_TIME >= 60 to be killed at 55 minutes, regardless
of the per-job RUNNER_SCRIPT_TIMEOUT set in the generated child config.
Set forward:yaml_variables: false on both trigger jobs; the generated
configs already declare every variable they need.
Assisted-by: Claude:claude-opus-4-7
Merge branch 'mnowak/fix-stress-test-script-timeout' into 'main'
Michal Nowak [Wed, 13 May 2026 09:44:26 +0000 (11:44 +0200)]
Selectively inherit yaml vars in stress trigger jobs
The parent's global RUNNER_SCRIPT_TIMEOUT: 55m was reaching the stress
and tsan:stress child pipelines via inherited yaml variables, where
inherited values outrank the child's job-level variables. That caused
stress jobs with BIND_STRESS_TESTS_RUN_TIME >= 60 to be killed at 55
minutes, regardless of the per-job RUNNER_SCRIPT_TIMEOUT set in the
generated child config.
Use inherit:variables with a positive list on both trigger jobs:
inherit only CI_REGISTRY_IMAGE so the parent's registry override
(needed for image pulls in the child) flows through, while keeping
RUNNER_SCRIPT_TIMEOUT (and other globals) out of the child pipeline's
variable scope. The per-job RUNNER_SCRIPT_TIMEOUT values set by the
generated child config now take effect.
Michal Nowak [Wed, 25 Mar 2026 12:31:49 +0000 (13:31 +0100)]
Set RUNNER_SCRIPT_TIMEOUTs
Sometimes jobs can get stuck and be terminated by GitLab, leaving us
without artefacts that could contain useful information about why the
job got stuck.
Colin Vidal [Tue, 12 May 2026 14:42:43 +0000 (16:42 +0200)]
fix: test: Fix cyclic_glue system test
The cyclic_glue system test waited for the `running` log line after
an `rndc reload` command, but did not wait for the line confirming that
the changed zone had been reloaded (`zone <zone>/IN: loaded`).
As a result, the test could randomly fail. This is now fixed.
Closes #5953
Merge branch '5953-fix-cyclic-glue-test' into 'main'
Colin Vidal [Tue, 12 May 2026 12:42:35 +0000 (14:42 +0200)]
Fix cyclic_glue system test
The cyclic_glue system test waited for the `running` log line after
an `rndc reload` command, but did not wait for the line confirming that
the changed zone had been reloaded (`zone <zone>/IN: loaded`).
As a result, the test could randomly fail. This is now fixed.
Ondřej Surý [Tue, 12 May 2026 14:17:59 +0000 (16:17 +0200)]
chg: usr: Cap glue records cached from a referral
named cached every glue record from a referral, retaining far more
than resolution will ever use. The number of nameservers and
addresses kept per referral is now bounded in the delegation database.
Closes #5701
Merge branch '5701-limit-the-number-of-GLUE-records' into 'main'
Ondřej Surý [Wed, 6 May 2026 10:35:22 +0000 (12:35 +0200)]
Cap glue records cached from a referral
The resolver populated the delegation database with every NS RR and
every glue address from a referral, with no aggregate bound. Resolution
only ever uses the first max-delegation-servers NS owners and a handful
of addresses per NS, so anything beyond that is dead memory.
Stop the NS loop in cache_delegns() at view->max_delegation_servers and
cap each glue rdataset at DELEG_MAX_GLUES_PER_NS (20) addresses, so each
NS owner contributes at most 20 A and 20 AAAA glues.
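The two bounds can be sketched as below. This is an illustration only, not the cache_delegns() code; the referral is modelled as an ordered mapping from NS name to (A, AAAA) address lists, and the value passed for max_delegation_servers is up to the caller.

```python
DELEG_MAX_GLUES_PER_NS = 20  # per address family, as stated above


def cap_referral(ns_glue, max_delegation_servers):
    """Keep the first N NS owners and at most 20 A and 20 AAAA glues each."""
    capped = {}
    for ns, (a, aaaa) in list(ns_glue.items())[:max_delegation_servers]:
        capped[ns] = (
            a[:DELEG_MAX_GLUES_PER_NS],
            aaaa[:DELEG_MAX_GLUES_PER_NS],
        )
    return capped
```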
Michał Kępień [Mon, 11 May 2026 15:43:55 +0000 (17:43 +0200)]
chg: ci: Add commit link and diff to RPM build job logs
The output of update_rpms.py is terse, making it difficult to verify its
actions. Add a commit link and "git show" output to the log of every CI
job running the update_rpms.py script in "build" mode to facilitate
double-checking its actions.
Merge branch 'michal/add-commit-link-and-diff-to-rpm-build-job-logs' into 'main'
Michał Kępień [Mon, 11 May 2026 15:41:50 +0000 (17:41 +0200)]
Add commit link and diff to RPM build job logs
The output of update_rpms.py is terse, making it difficult to verify its
actions. Add a commit link and "git show" output to the log of every CI
job running the update_rpms.py script in "build" mode to facilitate
double-checking its actions.
Michał Kępień [Mon, 11 May 2026 14:23:16 +0000 (16:23 +0200)]
fix: ci: Increase GIT_DEPTH for the "assign-milestones" job
Cloning tags with the default GIT_DEPTH of 1 prevents the milestone
assignment script from identifying any merge requests that are included
in a given release. Fix by increasing GIT_DEPTH to an arbitrary value
that is high enough for practical purposes.
The GIT_DEPTH CI variable defaults to 1 for all jobs through the
top-level "variables" key. Explicitly setting it to 1 in job
definitions is unnecessary and may cause confusion. Remove these
redundant assignments.
Merge branch 'michal/fix-assign-milestones-job' into 'main'
Michał Kępień [Mon, 11 May 2026 14:07:47 +0000 (16:07 +0200)]
Remove redundant "GIT_DEPTH: 1" assignments
The GIT_DEPTH CI variable defaults to 1 for all jobs through the
top-level "variables" key. Explicitly setting it to 1 in job
definitions is unnecessary and may cause confusion. Remove these
redundant assignments.
Michał Kępień [Mon, 11 May 2026 14:07:47 +0000 (16:07 +0200)]
Increase GIT_DEPTH for the "assign-milestones" job
Cloning tags with the default GIT_DEPTH of 1 prevents the milestone
assignment script from identifying any merge requests that are included
in a given release. Fix by increasing GIT_DEPTH to an arbitrary value
that is high enough for practical purposes.
Michal Nowak [Mon, 11 May 2026 13:34:30 +0000 (15:34 +0200)]
new: test: Add isctest.transfer.transfer_message() helper and convert tests
Add a new helper function, `isctest.transfer.transfer_message()`, to
`bin/tests/system/isctest/transfer.py` that generates the log message
produced by `xfrin_log()` in `lib/dns/xfrin.c` for an incoming zone
transfer:
transfer of '<zone>/IN' from <source_ns>#<port>: <msg>
The explicit use of `port` matches current shell system usage.
- zone - zone name without class (e.g. "example.com")
- source_ns - IP string, or None to wildcard the source address
- msg - the transfer-level message
(e.g. "Transfer status: success")
- port - integer source port, or None to wildcard the port number
When both source_ns and port are concrete values a plain str is returned
and `wait_for_line()` treats it as a literal substring match. Whenever
either is `None` a compiled `re.Pattern` is returned, with the unknown part
replaced by a constrained wildcard:
- source_ns=None, port=None -> from .*#[0-9]+:
- source_ns=None, port=53 -> from .*#53:
- source_ns="1.2.3.4", port=None -> from 1.2.3.4#[0-9]+:
- source_ns="1.2.3.4", port=N -> "from 1.2.3.4#N:" (plain str)
The port wildcard is [0-9]+ (not .*) because a port is always numeric.
Convert all hard-coded transfer log patterns in the Python system tests
to use transfer_message().
Notable cases:
- `mirror_root_zone`: source_ns=None (live internet, any root server),
port=53.
- `cipher_suites`: source_ns="10.53.0.1", port=None (each zone transfers
over a different TLS port).
- `test_under_signed_transfer`: parametrize gains a boolean xfrin_msg
flag to distinguish messages that go through xfrin_log() from
lower-level TSIG errors that do not.
Testing
-------
All system tests pass under `pytest -n auto`. The `mirror_root_zone`
live-internet test was also verified separately with
`CI_ENABLE_LIVE_INTERNET_TESTS=1`.
LLM usage
---------
This commit was produced in an interactive session with Claude Code
(Claude Sonnet 4.6), guided step by step by a human reviewer.
Closes #5735
Merge branch '5735-make-transfer-message-formatter' into 'main'
Michal Nowak [Mon, 11 May 2026 11:24:22 +0000 (13:24 +0200)]
Add isctest.transfer.transfer_message() helper and convert tests
Add a new helper function, isctest.transfer.transfer_message(), to
bin/tests/system/isctest/transfer.py that generates the log message
produced by xfrin_log() in lib/dns/xfrin.c for an incoming zone
transfer:
transfer of '<zone>/IN' from <source_ns>#<port>: <msg>
The helper always returns a compiled re.Pattern. source_ns and port
each accept None to match any source address / port. msg accepts
either a plain str (regex-escaped automatically) or a compiled
re.Pattern (spliced into the regex as-is), so callers that need regex
syntax in the message part can pass Re(r"...") without having to
wrap the whole result.
source_ns is passed through re.escape() when provided, so dots in
IPv4 addresses (e.g. "10.53.0.1") match a literal dot rather than
any character.
Convert the existing call sites across the system tests to use the
new helper.
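A minimal sketch of a helper with the behaviour described in this commit follows. It is not the code in isctest/transfer.py: the parameter order and the wildcard forms (".*" for the address, "[0-9]+" for the port) are assumptions made for the example.

```python
import re


def transfer_message(zone, source_ns, msg, port=None):
    """Build a regex for an incoming-transfer log line (sketch only)."""
    # None means "match any source address" / "match any port".
    src = ".*" if source_ns is None else re.escape(source_ns)
    prt = "[0-9]+" if port is None else str(port)
    # A compiled pattern is spliced in as-is; a plain str is escaped.
    body = msg.pattern if isinstance(msg, re.Pattern) else re.escape(msg)
    return re.compile(
        f"transfer of '{re.escape(zone)}/IN' from {src}#{prt}: {body}"
    )
```

Because source_ns goes through re.escape(), "10.53.0.1" does not match an address such as "10a53b0c1".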
Alessio Podda [Mon, 11 May 2026 12:52:17 +0000 (12:52 +0000)]
chg: dev: Make dns_glue_t private to qpzone
The dns_glue struct currently contains four dns_rdataset structs to hold
the glue. These structs are over 100 bytes each because they need to be
able to hold data for multiple types of databases.
Since the dns_glue_t type is only used by qpzone, we can instead hold pointers
to the vecheaders directly, and only bind the vecheaders to the
rdatasets when adding the glue to the message.
This leads to a 33% memory reduction in some authoritative benchmarks.
Alessio Podda [Sat, 14 Feb 2026 21:20:41 +0000 (22:20 +0100)]
Delay binding glue to rdataset
The dns_glue struct currently contains four dns_rdataset structs to hold
the glue. These structs are over 100 bytes each because they need to be
able to hold data for multiple types of databases.
Since the dns_glue_t type is only used by qpzone, we can instead hold
pointers to the vecheaders directly, and only bind the vecheaders to
the rdatasets when adding the glue to the message.
The dns_glue_t, dns_gluelist_t and dns_glue_additionaldata_ctx types are
only used in qpzone.c. This commit moves them to the private header
qpzone_p.h.
This is done in preparation of a followup commit that will refactor them
to use types that are private to qpzone.
Michał Kępień [Mon, 11 May 2026 08:09:09 +0000 (10:09 +0200)]
fix: ci: Fix triggering rules for the "publish-cleanup" job
The "publish-cleanup" tag pipeline job is currently created for all
security releases, including BIND -S releases, but it depends on the
"publish" job, which is only created for open source releases. This
breaks CI configuration for BIND -S tags, preventing pipelines from
getting created for such tags altogether. Fix by only creating the
"publish-cleanup" job in tag pipelines for open source security
releases.
Merge branch 'michal/fix-triggering-rules-for-the-publish-cleanup-job' into 'main'
Michał Kępień [Mon, 11 May 2026 08:07:38 +0000 (10:07 +0200)]
Fix triggering rules for the "publish-cleanup" job
The "publish-cleanup" tag pipeline job is currently created for all
security releases, including BIND -S releases, but it depends on the
"publish" job, which is only created for open source releases. This
breaks CI configuration for BIND -S tags, preventing pipelines from
getting created for such tags altogether. Fix by only creating the
"publish-cleanup" job in tag pipelines for open source security
releases.
Michał Kępień [Thu, 7 May 2026 16:05:37 +0000 (18:05 +0200)]
chg: ci: Mark merged security fixes as "Not released yet"
Adjust the triggering rules for the "merged-metadata" CI job so that
merge requests merged into security-* branches are automatically
assigned to the "Not released yet" milestone, just like merge requests
targeting public branches. This enables merge requests containing
security fixes to be correctly processed by release automation scripts.
Merge branch 'pspacek/extend-not-released-yet-milestone' into 'main'
Petr Špaček [Tue, 5 May 2026 13:04:36 +0000 (15:04 +0200)]
Mark merged security fixes as "Not released yet"
Adjust the triggering rules for the "merged-metadata" CI job so that
merge requests merged into security-* branches are automatically
assigned to the "Not released yet" milestone, just like merge requests
targeting public branches. This enables merge requests containing
security fixes to be correctly processed by release automation scripts.
Michał Kępień [Thu, 7 May 2026 15:51:36 +0000 (17:51 +0200)]
chg: ci: Enable automatic backports for security fixes
Ensure the "backports" CI job is created when new changes are merged
into security-* branches. This enables using backport automation for
security fixes.
Merge branch 'michal/extend-automatic-backports' into 'main'
Michał Kępień [Thu, 7 May 2026 15:45:35 +0000 (17:45 +0200)]
Enable automatic backports for security fixes
Ensure the "backports" CI job is created when new changes are merged
into security-* branches. This enables using backport automation for
security fixes.
Aydın Mercan [Tue, 5 May 2026 12:27:06 +0000 (15:27 +0300)]
[CVE-2026-3593] sec: usr: Fix use-after-free in DNS-over-HTTPS when processing HTTP/2 SETTINGS frames
A use-after-free vulnerability in the DNS-over-HTTPS implementation
could cause named to crash when a client sends a flood of HTTP/2
SETTINGS frames while a DoH response is being written. This affects
servers with DoH (DNS-over-HTTPS) enabled.
ISC would like to thank Naresh Kandula Parmar (Nottiboy) for reporting this.
Ondřej Surý [Wed, 6 May 2026 08:12:35 +0000 (10:12 +0200)]
Pass empty string instead of NULL to ns_client_dumpmessage()
The two new call sites added by the CLASS-validation work passed NULL
as the reason, but ns_client_dumpmessage() bails out early on a NULL
reason — so the message dump never happened. The intent was to dump
the message and let the follow-up ns_client_log() carry the reason
text, so pass "" to suppress the prefix without short-circuiting the
dump.
Aydın Mercan [Tue, 10 Mar 2026 11:48:02 +0000 (14:48 +0300)]
Fix use-after-free in DoH write buffer after HTTP/2 send
After the send callback completes, the UV request is freed but
the HTTP/2 socket's write buffer still points to the freed memory.
If nghttp2 subsequently needs to send frames (e.g. SETTINGS ACK),
the server_read_callback reads from the dangling buffer.
Clear the write buffer before freeing the UV request.
Ondřej Surý [Fri, 1 May 2026 08:13:10 +0000 (10:13 +0200)]
[CVE-2026-5946] sec: usr: Disable recursion, UPDATE, and NOTIFY for non-IN views
Recursion, dynamic updates (UPDATE), and zone change notifications
(NOTIFY) are now disabled for views with a class other than IN
(such as CHAOS or HESIOD); authoritative service for non-IN zones
(e.g. version.bind in class CHAOS) continues to work as before.
Servers configured with recursion yes in a non-IN view will log a
warning at startup, and named-checkconf flags the same condition.
UPDATE and NOTIFY messages that specify the meta-classes ANY or NONE
in the question section are now rejected with FORMERR.
This addresses a set of closely related security issues collectively
identified as CVE-2026-5946. ISC would like to thank Mcsky23 for
bringing these issues to our attention.
Closes: https://gitlab.isc.org/isc-projects/bind9/-/issues/5784
Merge branch 'each-security-disable-chaos-recursion' into 'security-main'
Replace the hysteretic hi_water/lo_water switch with a stochastic
check: always false below lo_water, always true at or above hi_water,
linearly ramped probability in between. This spreads cache cleaning
across many inserts instead of triggering a thundering herd once the
hi_water mark is crossed (which causes every addrdataset to enter the
LRU purge path simultaneously and serializes lookups behind the node
write locks).
The is_overmem atomic and its stores are no longer needed and are
removed. The existing tests that asserted specific hysteretic state
transitions are simplified to check only the deterministic boundaries.
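The ramp described above can be sketched as follows. The names are hypothetical and the random source is a parameter so the boundaries can be tested; this is not the C implementation.

```python
import random


def overmem_probability(inuse: int, lo_water: int, hi_water: int) -> float:
    """0 below lo_water, 1 at or above hi_water, linear in between."""
    if inuse < lo_water:
        return 0.0
    if inuse >= hi_water:
        return 1.0
    return (inuse - lo_water) / (hi_water - lo_water)


def should_clean(inuse, lo_water, hi_water, rng=random.random) -> bool:
    # rng() returns a float in [0.0, 1.0), so probability 1.0 always fires
    # and probability 0.0 never does.
    return rng() < overmem_probability(inuse, lo_water, hi_water)
```

Halfway between the two marks, roughly every second insert would trigger cleaning, instead of every insert once hi_water is crossed.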
Aydın Mercan [Mon, 9 Mar 2026 12:48:34 +0000 (15:48 +0300)]
Add system test for HTTP/2 SETTINGS frame flood
Send a valid DoH query followed by a flood of SETTINGS frames to
trigger a use-after-free in the write buffer. Under ASan, named
will abort if the bug is present.
Ondřej Surý [Fri, 1 May 2026 06:51:33 +0000 (08:51 +0200)]
chg: dev: Harden GSS-API context establishment in TKEY negotiation
Implement RFC 3645 Section 3.1.1 client-side check for REPLAY, MUTUAL, and INTEG flags after gss_init_sec_context() completes. Add server-side INTEG flag check after gss_accept_sec_context(). Also fixes an uninitialized gss_name_t on the error path in dst_gssapi_initctx().
Merge branch 'ondrej/harden-gssapi-integration' into 'security-main'
Evan Hunt [Mon, 9 Mar 2026 04:50:04 +0000 (15:50 +1100)]
Test server behavior when sending various UPDATE requests
Send update messages for zones with CLASS0, ANY and NONE. The class
ANY UPDATE also attempts to delete a KX record in an existing IN
class zone to trigger a REQUIRE.
Fixed a memory leak where each GSS-API TKEY negotiation leaked a
security context inside the GSS library. An unauthenticated attacker
could exhaust server memory by sending repeated TKEY queries to a
server with tkey-gssapi-keytab configured. The leaked memory was
allocated by the GSS library, bypassing BIND's memory accounting.
Multi-round GSS-API negotiation (GSS_S_CONTINUE_NEEDED) is now
rejected, as BIND never supported it correctly and Kerberos/SPNEGO
completes in a single round.
Closes: https://gitlab.isc.org/isc-projects/bind9/-/issues/5752
Merge branch '5752-fix-memory-leak-in-TKEY-negotiation' into 'security-main'
Check GSS_C_REPLAY_FLAG in client-side ret_flags validation
RFC 3645 Section 3.1.1 mandates that the client MUST abandon the
algorithm if replay_det_state is FALSE after GSS_Init_sec_context
completes. The previous commit checked MUTUAL and INTEG but missed
REPLAY, even though it was already requested in the input flags.
Add GSS_C_REPLAY_FLAG to the ret_flags bitmask check so all three
required properties (replay detection, mutual authentication, and
integrity) are verified.
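The bitmask check amounts to the following. This is a Python sketch using the flag values from the GSS-API C bindings (RFC 2744); the function name is hypothetical.

```python
# Flag values as defined in the GSS-API C bindings (RFC 2744).
GSS_C_MUTUAL_FLAG = 2
GSS_C_REPLAY_FLAG = 4
GSS_C_INTEG_FLAG = 32

REQUIRED_FLAGS = GSS_C_MUTUAL_FLAG | GSS_C_REPLAY_FLAG | GSS_C_INTEG_FLAG


def client_ret_flags_ok(ret_flags: int) -> bool:
    """All three required properties must be present, not just some."""
    return (ret_flags & REQUIRED_FLAGS) == REQUIRED_FLAGS
```

Comparing against the full mask, rather than testing for a non-zero result, is what makes a missing REPLAY flag fail the check.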
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Evan Hunt [Tue, 17 Mar 2026 20:45:11 +0000 (13:45 -0700)]
Test UPDATE behavior in CHAOS and other non-IN classes
Send various UPDATE requests that are known to have caused
crashes previously with deliberately misconfigured non-IN
zones; confirm that UPDATE is not processed.
Ondřej Surý [Fri, 1 May 2026 06:07:20 +0000 (08:07 +0200)]
[CVE-2026-5947] sec: usr: Fix crash in resolver when SIG(0)-signed responses are received under load
A resolver could crash when handling a SIG(0)-signed response if the
matching client query was cancelled while signature verification was
still in progress — for example, when the recursive-clients quota
was exhausted. This has been fixed.
Closes isc-projects/bind9#5819
Merge branch '5819-fix-heap-use-after-free-in-resquery_response_continue' into 'security-main'
Fix output token and GSS context leaks in TKEY/GSS-API error paths
In dst_gssapi_acceptctx(), rename outtoken to outtokenp (matching BIND
convention for output pointer parameters) and free the allocated output
token buffer on error in the cleanup path.
In process_gsstkey(), route the empty-principal error path through
cleanup via CLEANUP() instead of returning early, so that the output
token, GSS context, and TSIG key are all freed consistently by the
existing cleanup block.
Ondřej Surý [Wed, 18 Mar 2026 00:02:24 +0000 (01:02 +0100)]
Verify integrity flag on server-side GSS-API context
After gss_accept_sec_context() completes, verify that the INTEG flag
is set in ret_flags. Without integrity protection, GSS-TSIG message
authentication cannot function correctly.
The server side was previously passing NULL for ret_flags, meaning it
never verified the negotiated security properties. The client side
was fixed in the previous commit; this fixes the server side.
Colin Vidal [Thu, 30 Apr 2026 18:49:05 +0000 (20:49 +0200)]
[CVE-2026-3592] sec: usr: Limit resolver server list size
When resolving a domain with many nameservers that share overlapping IP addresses (e.g., 10 NS records all pointing at the same set of addresses), BIND could previously waste time querying duplicate addresses and build up excessively large server lists. Addresses in the resolver's server list are now deduplicated so that each unique IP is queried only once per resolution attempt, regardless of how many NS records point to it. The number of addresses stored per nameserver name is also capped at 6 (combined A and AAAA), preventing memory and CPU overhead from domains with unusually large NS/glue sets.
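As a rough illustration with hypothetical names, and assuming the per-name cap is applied before deduplication (the description does not say which comes first):

```python
MAX_ADDRS_PER_NS = 6  # combined A and AAAA, per the description above


def build_server_list(ns_addrs):
    """Return the unique addresses to query, in first-seen order."""
    seen = set()
    servers = []
    for ns_name, addrs in ns_addrs.items():
        # Assumed order: cap per nameserver name, then deduplicate globally.
        for addr in addrs[:MAX_ADDRS_PER_NS]:
            if addr not in seen:
                seen.add(addr)
                servers.append(addr)
    return servers
```

With 10 NS names that all point at the same eight addresses, this sketch yields six servers rather than eighty.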
Closes isc-projects/bind9#5641
Merge branch '5641-selfpointedglue' into 'security-main'
Ondřej Surý [Tue, 17 Mar 2026 03:45:16 +0000 (04:45 +0100)]
Fix use-after-free in resolver SIG(0) async verification path
When a SIG(0)-signed response triggers async ECDSA verification via
dns_message_checksig_async(), the respctx_t holds a raw pointer to
the resquery_t. If the fetch context is shut down while verification
is in flight (e.g. due to recursive-clients quota exhaustion), the
query is destroyed and the callback dereferences a dangling pointer.
Take a reference on the resquery_t when initializing the respctx_t,
and release it in both cleanup paths. The query's own reference to
the fetch context keeps the fctx alive transitively.
Ondřej Surý [Fri, 20 Mar 2026 07:43:28 +0000 (08:43 +0100)]
Add regression test for GSS-API context leak via TKEY CONTINUE
Send crafted SPNEGO NegTokenInit tokens that propose the krb5
mechanism without a mechToken. This causes gss_accept_sec_context()
to return GSS_S_CONTINUE_NEEDED, which on unfixed code leaks the
GSS context handle (~520 bytes per query).
The test verifies that the server rejects the negotiation (TKEY
error != 0, no continuation token) rather than returning a CONTINUE
response (error=0 with output token).
Ondřej Surý [Tue, 17 Mar 2026 23:28:19 +0000 (00:28 +0100)]
Implement RFC 3645 Section 3.1.1 ret_flags check in GSS-API client
After gss_init_sec_context() completes, verify that both MUTUAL and
INTEG flags are set in ret_flags. RFC 3645 Section 3.1.1 requires
the client to abandon the algorithm if either flag is missing, as
the security context would not provide mutual authentication or
message integrity.
Also fix uninitialized gss_name_t variable in dst_gssapi_initctx()
that could cause undefined behavior if gss_import_name() fails and
the cleanup path calls gss_release_name() on the uninitialized
value.
Evan Hunt [Tue, 17 Mar 2026 20:24:43 +0000 (13:24 -0700)]
Skip "deny-answer-address" for non-IN addresses
Ensure that we don't attempt an ACL match for answer addresses
when handling a class-CHAOS zone. This is an additional line of
defense for YWH-PGM40640-74.
Colin Vidal [Thu, 30 Apr 2026 17:41:47 +0000 (19:41 +0200)]
fix: usr: Do not resend query after BADCOOKIE answer on TCP
When an upstream server answers BADCOOKIE, no matter which transport is used,
the resolver resends the query using TCP. However, if the upstream
server responded with BADCOOKIE again over TCP, the resolver would keep
resending until the maximum query count was reached.
This is now fixed by no longer resending once the query has already been
sent over TCP.
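The loop and its fix can be modelled with a toy simulation. The names and the query limit are hypothetical, and the upstream server is assumed to answer BADCOOKIE every time.

```python
def should_resend_after_badcookie(transport: str) -> bool:
    # Fixed behaviour: resend only if the query has not yet gone over TCP.
    return transport != "tcp"


def queries_sent(max_queries: int, fixed: bool) -> int:
    """Count queries sent to a server that always answers BADCOOKIE."""
    sent, transport = 1, "udp"
    while sent < max_queries:
        resend = should_resend_after_badcookie(transport) if fixed else True
        if not resend:
            break
        transport = "tcp"  # BADCOOKIE always triggers a resend over TCP
        sent += 1
    return sent
```

Under these assumptions the old logic keeps going until the query limit, while the fixed logic stops after one UDP query and one TCP retry.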
See isc-projects/bind9#5804
Merge branch '5804-resend-loop-badcookie' into 'security-main'
Colin Vidal [Thu, 2 Apr 2026 08:43:00 +0000 (10:43 +0200)]
update `max-delegation-servers` documentation
Clarify how `max-delegation-servers` is used in the resolver, in
particular, the fact that it, in practice, caps the maximum outgoing
queries to resolve a name at a given delegation point.
Ondřej Surý [Tue, 17 Mar 2026 23:10:35 +0000 (00:10 +0100)]
Fix GSS-API context leak in TKEY negotiation
Reject multi-round GSS-API negotiation (GSS_S_CONTINUE_NEEDED) in
dst_gssapi_acceptctx(). Each call to gss_accept_sec_context()
allocates a context inside the GSS library; without this fix, the
context handle was passed back to process_gsstkey() which did not
store it persistently, leaking it on every incomplete negotiation.
An unauthenticated attacker could exhaust server memory by sending
repeated TKEY queries with GSSAPI tokens, each leaking one GSS
context. The leaked memory is allocated by the GSS library via
malloc(), bypassing BIND's memory accounting.
In practice, Kerberos/SPNEGO (the only mechanism used with BIND)
completes in a single round, so rejecting continuation does not
affect real-world deployments. See RFC 3645 Section 4.1.3.
Mark Andrews [Tue, 3 Mar 2026 23:00:56 +0000 (10:00 +1100)]
Reject meta-classes in UPDATE and NOTIFY messages
NOTIFY and UPDATE messages must specify a data class in the
QUESTION/ZONE section. NONE and ANY are meta-classes and not
appropriate here. Return FORMERR if either is used.
Rejecting messages with a query class of NONE addresses YWH-PGM40640-72,
YWH-PGM40640-82, and YWH-PGM40640-83. Rejecting messages with a query
class of ANY addresses YWH-PGM40640-87, YWH-PGM40640-88, and
YWH-PGM40640-117.
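The rule described above can be sketched as follows. This is a minimal illustration in Python, not the BIND code: the function name `question_class_rcode` and the string return values are hypothetical, while the opcode and class numbers are the standard DNS assignments.

```python
# Standard DNS opcode and class numbers.
OPCODE_QUERY, OPCODE_NOTIFY, OPCODE_UPDATE = 0, 4, 5
CLASS_IN, CLASS_NONE, CLASS_ANY = 1, 254, 255


def question_class_rcode(opcode, qclass):
    """Hypothetical helper: rcode for the QUESTION/ZONE section class check.

    NOTIFY and UPDATE must name a data class, so the meta-classes
    NONE and ANY are answered with FORMERR. "NOERROR" here only means
    that this particular check passed.
    """
    if opcode in (OPCODE_NOTIFY, OPCODE_UPDATE) and qclass in (CLASS_NONE, CLASS_ANY):
        return "FORMERR"
    return "NOERROR"
```

For example, `question_class_rcode(OPCODE_UPDATE, CLASS_NONE)` yields "FORMERR", while the same class in an ordinary QUERY is left to other checks.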
A bug during bad server handling could cause the resolver to enter an infinite loop, continuously sending queries to an upstream server with no exit condition, until the resolver query timeout was hit. This has been fixed.
ISC would like to thank Billy Baraja (BielraX) for bringing this issue to our attention.
Closes isc-projects/bind9#5804
Merge branch '5804-resend-loop' into 'security-main'
Colin Vidal [Fri, 10 Apr 2026 12:55:09 +0000 (14:55 +0200)]
Update resend_loop_badcookie system test
Update the resend_loop_badcookie system test to ensure there is no
attempt to resend the query using TCP when getting BADCOOKIE from an
upstream server using this transport already.
Colin Vidal [Wed, 1 Apr 2026 20:31:50 +0000 (22:31 +0200)]
add max-delegation-servers tests for out-of-domain NS
Add a new system test which ensures that the `max-delegation-servers`
limit is also respected when a domain has only NS names (and no
glue). In particular, this tests the case where there are multiple NS
names and multiple IPs per name.
If the number of IPs (even from the first picked NS name) reaches
`max-delegation-servers`, and the resolution does not succeed, the
resolver won't attempt another NS name, as it has already used all its
"credit".
Ondřej Surý [Wed, 4 Mar 2026 09:46:58 +0000 (10:46 +0100)]
Validate DNS message CLASS early in request processing
Reject requests with unsupported or misused CLASS values before
further processing. Only IN, CH, HS, RESERVED0 (for DNS Cookies),
ANY (for TKEY negotiation), and NONE (for DNS UPDATE) are accepted;
all other classes return NOTIMP. Misuse of NONE or ANY outside
their allowed contexts returns FORMERR.
This adds further protection against bugs of the same general class
as YWH-PGM40640-70 and YWH-PGM40640-73.
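A rough sketch of the early CLASS check, under stated assumptions: the function name `validate_class` and the two boolean context flags are hypothetical stand-ins for whatever the server uses to decide that ANY or NONE is legitimate in a given request. The commit does not describe any context restriction for RESERVED0, so the sketch accepts it unconditionally.

```python
# Standard DNS class numbers.
RESERVED0, IN, CH, HS, NONE, ANY = 0, 1, 3, 4, 254, 255


def validate_class(rdclass, any_allowed=False, none_allowed=False):
    """Hypothetical early check: map a request CLASS to an rcode.

    any_allowed  -- assumption: True when the request is a TKEY negotiation
    none_allowed -- assumption: True when the request is a DNS UPDATE
    """
    if rdclass in (IN, CH, HS, RESERVED0):
        return "NOERROR"
    if rdclass == ANY:
        return "NOERROR" if any_allowed else "FORMERR"
    if rdclass == NONE:
        return "NOERROR" if none_allowed else "FORMERR"
    # Every other class value is unsupported.
    return "NOTIMP"
```

Doing this once, before any further processing, means later code paths no longer need to defend individually against unexpected class values.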
Colin Vidal [Tue, 7 Apr 2026 20:18:58 +0000 (22:18 +0200)]
rctx_resend() increments query counters
Calls to `rctx_resend()` are made internally within the resolver, in
flows that are not supposed to happen more than once. For instance,
if a query fails and a specific flag "F" is not set, the resolver sets
the flag and tries again. This cannot occur more than once, because if
the next attempt also fails, flag "F" is already set, so the
resolver moves on to the next server (or gives up).
However, a subtle bug, such as a missing flag check, could lead
to an unbounded loop retrying the same server. This is now
impossible, as `rctx_resend()` also increments the query counters (so if
such a case occurs, it stops once the maximum limit is reached).
The dns_resstatscounter_retry counter is also only incremented if
`fctx_query()` succeeds, similar to what is done in `fctx_try()`.
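The safety net can be illustrated with a small model. This is only a sketch: `FetchContext`, `resend`, and `send_query` are hypothetical names, and the real counters and limits in the resolver are more involved.

```python
class FetchContext:
    """Hypothetical stand-in for the per-fetch state."""

    def __init__(self, max_queries):
        self.max_queries = max_queries
        self.query_count = 0  # bounded by max_queries
        self.retries = 0      # models the retry statistics counter


def resend(fctx, send_query):
    """Resend a query; return False once the query budget is spent."""
    if fctx.query_count >= fctx.max_queries:
        return False
    fctx.query_count += 1
    if not send_query():
        return False
    # The retry statistic is only counted when the send succeeded.
    fctx.retries += 1
    return True


# A deliberately buggy caller that never checks a "tried already" flag:
fctx = FetchContext(max_queries=5)
attempts = 0
while resend(fctx, lambda: True):
    attempts += 1
```

Even though the caller loops unconditionally, the loop ends after `max_queries` attempts because every resend is charged against the query counter.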
Colin Vidal [Fri, 10 Apr 2026 12:54:49 +0000 (14:54 +0200)]
Do not resend after BADCOOKIE answer on TCP
When an upstream server answers BADCOOKIE, no matter the transport used,
the resolver eventually resends the query using TCP. However, if the
upstream server responds with BADCOOKIE again over TCP, the resolver
would keep resending until the maximum query count is reached.
This is now fixed by stopping resending once the query has already been
sent over TCP.
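A simplified model of the fixed behaviour is shown below. It is a sketch under assumptions: `query_server` and the `upstream` callback are hypothetical, the model switches to TCP immediately after the first BADCOOKIE (the real resolver only does so eventually), and "GIVE_UP" stands in for whatever the resolver does next with this server.

```python
def query_server(upstream, max_queries=50):
    """Return (result, queries_sent) for one upstream server.

    upstream(transport) is a hypothetical callback returning an rcode string.
    """
    transport = "udp"
    sent = 0
    while sent < max_queries:
        rcode = upstream(transport)
        sent += 1
        if rcode != "BADCOOKIE":
            return rcode, sent
        if transport == "tcp":
            # Already sent over TCP: do not resend again.
            return "GIVE_UP", sent
        transport = "tcp"
    return "GIVE_UP", sent


# An upstream that answers BADCOOKIE on every transport.
result = query_server(lambda transport: "BADCOOKIE")
```

With the TCP check in place, a server that always answers BADCOOKIE costs two queries instead of consuming the whole query budget.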
Colin Vidal [Wed, 4 Mar 2026 17:25:32 +0000 (18:25 +0100)]
Add SRTT-based server selection system test
Verify that the resolver selects authoritative servers in increasing
SRTT order. Four servers are configured with increasing response
delays. 100 queries are sent, expecting most to go to the fastest
server (ns2). Then ns2 stops responding, another 100 queries are
sent and should go to ns3 (the next fastest), and so on through
ns4 and ns5. Each query uses a unique name to avoid cache hits.
Evan Hunt [Wed, 4 Mar 2026 21:24:52 +0000 (13:24 -0800)]
Disable UPDATE and NOTIFY for non-IN classes
Return NOTIMP for UPDATE and NOTIFY requests received for views with a
class other than IN. Only QUERY is now supported for non-IN views such
as CHAOS.
When running dns_rdata_tostruct() with types that are only defined
for class IN, ensure that the class is correct before proceeding.
Add an assertion that any zone being updated is of class IN. (Note
that previously, a DLZ zone could have its class value set incorrectly
to NONE; this has been fixed.)
This addresses YWH-PGM40640-70 and YWH-PGM40640-73 (as well as any
similar problems that might otherwise occur in the future) by minimizing
the code paths that can be reached by rdata classes other than IN, so it
is safe for the implementation to assume that rdatatypes that are only
defined for class IN, such as SVCB or WKS, have been parsed and
validated, and not accepted as unknown/opaque data.
Colin Vidal [Thu, 5 Feb 2026 10:20:11 +0000 (11:20 +0100)]
Add system test for self-pointed glue deduplication
Test the resolver's behavior with self-pointed glue where each NS
has the same set of addresses. Verify that addresses are
deduplicated and each unique IP is only queried once.
Also test the NS processing limit (max-delegation-servers) and the
ADB address limit (adbaddrslimit), both individually and combined.
Evan Hunt [Tue, 3 Mar 2026 22:00:38 +0000 (14:00 -0800)]
Disable recursion for non-IN classes
Force recursion off, and set allow-recursion/allow-recursion-on ACLs
to none, for views with a class other than IN. Log a configuration
warning if recursion is explicitly enabled for a non-IN view.
This addresses YWH-PGM40640-74 and YWH-PGM40640-75 by preventing any
attempt at recursive processing in a class-CHAOS view, ensuring that
server addresses used for recursive queries and received in recursive
responses are of the expected format.
The resolver will repeatedly resend queries until the fetch timeout
expires, resulting in thousands of qrysent while the quota
counter remains 0.
Colin Vidal [Wed, 4 Feb 2026 09:18:42 +0000 (10:18 +0100)]
Remove duplicate addresses from the resolver SLIST
The SLIST (essentially `fctx->finds`, forwarders and dual-stack
alternatives aside) can have duplicate server addresses when multiple
in-domain nameservers share the same IP addresses:
sub.example. NS ns1.sub.example.
sub.example. NS ns2.sub.example.
ns1.sub.example. A 1.2.3.4
ns1.sub.example. A 5.6.7.8
ns2.sub.example. A 1.2.3.4
ns2.sub.example. A 5.6.7.8
If both 1.2.3.4 and 5.6.7.8 fail to return a valid answer, the resolver
would query each address twice.
The problem is fixed by replacing the two-phase server selection (sort
each find list by SRTT, sort finds by head SRTT) with a single linear
scan in nextaddress() that finds the lowest-SRTT unmarked, non-duplicate
address across all find lists.
The old approach had a correctness bug: after sorting, the resolver
picked the next address from the "current" find list rather than
globally. For example, with find lists [1, 15, 26] and [3, 4, 5], the
second pick would be SRTT 15 instead of the correct SRTT 3.
The new approach is both simpler and correct: each call to nextaddress()
walks all addresses, skips marked and duplicate entries, and returns the
one with the lowest SRTT. While this walk is repeated for each server
attempt, it operates on a small bounded list and is negligible compared
to the network I/O of querying the server.
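The single linear scan can be sketched as follows. This is an illustrative model, not the code of nextaddress(): the `Addr` class, the `next_address` name, and the representation of find lists as Python lists are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Addr:
    ip: str
    srtt: int
    marked: bool = False  # set once the address has been picked


def next_address(finds):
    """Pick the lowest-SRTT address that is unmarked and not a duplicate."""
    # IPs already picked, possibly through another find list.
    tried = {a.ip for find in finds for a in find if a.marked}
    best = None
    for find in finds:
        for addr in find:
            if addr.marked or addr.ip in tried:
                continue
            if best is None or addr.srtt < best.srtt:
                best = addr
    if best is not None:
        best.marked = True
    return best


# The example from the text: find lists with SRTTs [1, 15, 26] and [3, 4, 5].
finds = [
    [Addr("10.0.0.1", 1), Addr("10.0.0.2", 15), Addr("10.0.0.3", 26)],
    [Addr("10.0.0.4", 3), Addr("10.0.0.5", 4), Addr("10.0.0.6", 5)],
]
order = []
while (addr := next_address(finds)) is not None:
    order.append(addr.srtt)
```

Here `order` ends up as [1, 3, 4, 5, 15, 26], so the second pick is SRTT 3 rather than 15. With the ns1/ns2 delegation shown earlier, where both find lists contain 1.2.3.4 and 5.6.7.8, each IP is returned once and the scan then returns nothing.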
Colin Vidal [Thu, 5 Feb 2026 08:46:01 +0000 (09:46 +0100)]
Limit the number of addresses returned per ADB find
The number of `dns_adbaddrfind_t` entries (NS addresses with metadata such
as SRTT) returned from an ADB NS name lookup is now limited by the caller.
The default value (outside the resolver) is `max-delegation-servers`; the
resolver, for a given fetch, starts with `max-delegation-servers` and
decrements it at each ADB fetch. This ensures that, for a given
delegation, no more than 13 nameservers will be contacted.
This is the same mechanism used when looking up `dns_adbaddrfind_t` from
a list of glue addresses.
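The shrinking per-fetch budget can be modelled roughly like this. It is a sketch under assumptions: `collect_addresses` is a hypothetical name, the input is a plain list of (NS name, addresses) pairs, and the budget is assumed to be decremented by the number of addresses each lookup returned.

```python
def collect_addresses(ns_addresses, max_delegation_servers=13):
    """Gather at most max_delegation_servers addresses across NS names.

    ns_addresses is a list of (ns_name, [ip, ...]) pairs in lookup order.
    """
    budget = max_delegation_servers
    selected = []
    for ns_name, addrs in ns_addresses:
        if budget == 0:
            break  # all "credit" used: no further NS name is looked up
        found = addrs[:budget]  # the caller caps what one lookup returns
        selected.extend((ns_name, ip) for ip in found)
        budget -= len(found)
    return selected


# Three NS names with ten addresses each.
delegation = [
    (f"ns{n}.example.", [f"192.0.2.{n * 10 + i}" for i in range(10)])
    for n in (1, 2, 3)
]
chosen = collect_addresses(delegation)
```

In this example `chosen` holds 13 entries: ten from the first name, three from the second, and none from the third.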