Michał Kępień [Tue, 18 Mar 2025 15:28:18 +0000 (16:28 +0100)]
Handle queries indefinitely on each TCP connection
Instead of closing every incoming TCP connection after handling a single
query, continue receiving queries on each TCP connection until the
client disconnects itself. When coupled with response dropping, this
enables silently receiving all incoming data, simulating an unresponsive
server.
Michał Kępień [Tue, 18 Mar 2025 15:28:18 +0000 (16:28 +0100)]
Enable receiving chunked TCP DNS messages
A TCP DNS client may send its queries in chunks, causing
StreamReader.read() to return less data than previously declared by the
client as the DNS message length; even the two-octet DNS message length
itself may be split up into two single-octet transmissions. Sending
data in chunks is valid client behavior that should not be treated as an
error. Add a new helper method for reading TCP data in a loop, properly
distinguishing between chunked queries and client disconnections. Use
the new method for reading all TCP data from clients.
Michał Kępień [Tue, 18 Mar 2025 15:28:18 +0000 (16:28 +0100)]
Handle connection resets during reading
A TCP peer may reset the connection at any point, but asyncserver.py
currently only handles connection resets when it is sending data to the
client. Handle connection resets during reading in the same way.
Michał Kępień [Tue, 18 Mar 2025 15:28:18 +0000 (16:28 +0100)]
Simplify peer address formatting
Add a helper class, Peer, which holds the <host, port> tuple of a
connection endpoint and gets pretty-printed when formatted as a string.
This enables passing instances of this new class directly to logging
functions, eliminating the need for the AsyncDnsServer._format_peer()
helper method.
Nicki Křížek [Tue, 18 Mar 2025 09:29:10 +0000 (09:29 +0000)]
chg: ci: Allow re-run of the shotgun jobs to reduce false positives
The false positive rate is about 10-20 % when evaluating shotgun results
from a single run. Attempt to reduce the false positive rate by allowing
a re-run of failed jobs.
Merge branch 'nicki/ci-shotgun-reduce-false-positives' into 'main'
Nicki Křížek [Wed, 12 Mar 2025 16:24:05 +0000 (17:24 +0100)]
Allow re-run of the shotgun jobs to reduce false positive
The false positive rate is about 10-20 % when evaluating shotgun results
from a single run. Attempt to reduce the false positive rate by allowing
a re-run of failed jobs.
While there is a slight risk that barely noticable decreases in
performance might slip by more easily in MRs, they'd still likely pop up
during nightly or pre-release testing.
Also increase the tolerance threshold for DoH latency comparisons, as
those tests often experience increased jitter in the tail end latencies.
Michał Kępień [Tue, 18 Mar 2025 05:55:17 +0000 (05:55 +0000)]
chg: test: Use isctest.asyncserver in the "qmin" test
Replace custom DNS servers used in the "qmin" system test with new code
based on the isctest.asyncserver module. The revised code employs zone
files and a limited amount of custom logic, which massively improves
test readability and maintainability, extends logging, and fixes
non-compliant replies sent by some of the custom servers in response to
certain queries (e.g. AA=0 in authoritative empty non-terminal
responses, non-glue address records in ADDITIONAL section).
Merge branch 'michal/qmin-asyncserver' into 'main'
Michał Kępień [Tue, 18 Mar 2025 05:19:01 +0000 (06:19 +0100)]
Broaden vulture exclude glob for ans.py servers
The vulture tool seems to be unable to follow how the parent classes
defined in bin/tests/system/qmin/qmin_ans.py use mandatory properties
specified by child classes in bin/tests/system/qmin/ans*/ans.py. Make
the tool ignore not just ans.py servers, but also *_ans.py utility
modules above the ansX/ subdirectories to prevent false positives about
unused code from causing CI pipeline failures.
Michał Kępień [Tue, 18 Mar 2025 05:19:01 +0000 (06:19 +0100)]
Ignore .hypothesis files created by system tests
Some versions of the Hypothesis Python library - notably the one
included in stock OS repositories for Ubuntu 20.04 Focal Fossa - cause a
.hypothesis file to be created in a Python script's working directory
when the hypothesis module is present in its import chain. Ignore such
files by adding them to the list of expected test artifacts to prevent
pytest teardown checks from failing due to these files appearing in the
file system after running system tests.
Michał Kępień [Tue, 18 Mar 2025 05:19:01 +0000 (06:19 +0100)]
Fix PYTHONPATH set for ans.py servers by start.pl
Commit 6c010a5644324947c8c13b5600cd8d988ae7684f caused the PYTHONPATH
environment variable to be set for ans.py servers started using
start.pl. However, no system test has actually used the new
isctest.asyncserver module since that change was applied, so it has not
been noticed until now that including the source directory in PYTHONPATH
is only sufficient for in-tree builds. Include the build directory
instead of the source directory in the PYTHONPATH environment variable
set for ans.py servers started by start.pl so that they work correctly
for both in-tree and out-of-tree builds.
Michał Kępień [Tue, 18 Mar 2025 05:19:01 +0000 (06:19 +0100)]
Use isctest.asyncserver in the "qmin" test
Replace custom DNS servers used in the "qmin" system test with new code
based on the isctest.asyncserver module. The revised code employs zone
files and a limited amount of custom logic, which massively improves
test readability and maintainability, extends logging, and fixes
non-compliant replies sent by some of the custom servers in response to
certain queries (e.g. AA=0 in authoritative empty non-terminal
responses, non-glue address records in ADDITIONAL section).
Ondřej Surý [Mon, 17 Mar 2025 14:27:38 +0000 (15:27 +0100)]
Remove the kludges for records in the bad sections
There were kludges to help process responses from authoritative servers
giving RRs in wrong sections (mentioning BIND 8). These should just go
away and such responses should not be processed.
Evan Hunt [Sat, 15 Mar 2025 01:26:35 +0000 (01:26 +0000)]
fix: nil: Add new convenience functions to classify rdata types
- `dns_rdatatype_ismulti()` returns true if a given type can have
multiple answers: ANY, RRSIG, or SIG.
- `dns_rdatatype_issig()` returns true for a signature: RRSIG or SIG.
- `dns_rdatatype_isaddr()` returns true for an address: A or AAAA.
- `dns_rdatatype_isalias()` returns true for an alias: CNAME or DNAME.
Code has been modified to use these functions where applicable.
These and all similar functions (e.g., `dns_rdatatype_ismeta()`, `dns_rdatatype_issingleton()`, etc) are now `static inline` functions defined in `rdata.h`.
Merge branch 'each-rdatatype-functions' into 'main'
Evan Hunt [Tue, 4 Mar 2025 23:51:49 +0000 (15:51 -0800)]
add new functions to classify rdata types
- dns_rdatatype_ismulti() returns true if a given type can have
multiple answers: ANY, RRSIG, or SIG.
- dns_rdatatype_issig() returns true for a signature: RRSIG or SIG.
- dns_rdatatype_isaddr() returns true for an address: A or AAAA.
- dns_rdatatype_isalias() returns true for an alias: CNAME or DNAME.
Evan Hunt [Fri, 14 Mar 2025 23:19:36 +0000 (23:19 +0000)]
fix: dev: step() could ignore rollbacks
The `step()` function (used for stepping to the prececessor or successor of a database node) could overlook a node if there was an rdataset that was marked IGNORE because it had been rolled back, covering an active rdataset under it.
Closes #5170
Merge branch '5170-step-ignores-rollback' into 'main'
Evan Hunt [Sat, 15 Feb 2025 05:42:34 +0000 (21:42 -0800)]
qpzone.c:step() could ignore rollbacks
the step() function (used for stepping to the prececessor or
successor of a database node) could overlook a node because
there was an rdataset marked IGNORE because it had been rolled
back, covering an active rdataset under it.
Evan Hunt [Fri, 14 Mar 2025 22:26:36 +0000 (22:26 +0000)]
fix: dev: Fix handling of revoked keys
When a key is revoked, its key ID changes due to the inclusion of the "revoked" flag. A collision between this changed key ID
and an unrelated public-only key could cause a crash in `dnssec-signzone`.
Closes #5231
Merge branch '5231-fix-keyid-collision' into 'main'
Evan Hunt [Tue, 11 Mar 2025 20:36:00 +0000 (13:36 -0700)]
fix handling of revoked keys
when a key is revoked its key ID changes, due to the inclusion
of the "revoke" flag. a collision between this changed key ID and
that of an unrelated public-only key could cause a crash in
dnssec-signzone.
Mark Andrews [Fri, 14 Mar 2025 05:28:01 +0000 (05:28 +0000)]
fix: test: Tune many types tests in reclimit test
The `I:checking that lifting the limit will allow everything to get
cached (20)` test was failing due to the TTL of the records being
too short for the elapsed time of the test. Raise the TTL to fix
this and adjust other tests as needed.
Closes #5206
Merge branch '5206-tune-last-sub-test-of-reclimit' into 'main'
Mark Andrews [Wed, 26 Feb 2025 21:36:54 +0000 (08:36 +1100)]
Tune many types tests in reclimit test
The 'I:checking that lifting the limit will allow everything to get
cached (20)' test was failing due to the TTL of the records being
too short for the elapsed time of the test. Raise the TTL to fix
this and adjust other tests as needed.
Mark Andrews [Fri, 14 Mar 2025 02:02:52 +0000 (02:02 +0000)]
fix: usr: QNAME minimization could leak the query type
When performing QNAME minimization, `named` now sends an NS query for the original query name, before sending the final query. This prevents the parent zone from learning the original query type, in the event that the query name is a delegation point.
For example, when looking up an address record for `example.com`, NS queries are now sent to the servers for both `com` and `example.com`, before the address query is sent to the servers for `example.com`. Previously, an address query would have been sent to the servers for `com`.
Closes #4805
Merge branch '4805-missing-qname-ns-query-when-using-qname-minimisation' into 'main'
Mark Andrews [Thu, 18 Jul 2024 03:27:23 +0000 (13:27 +1000)]
Don't leak the original QTYPE to parent zone
When performing QNAME minimization, named now sends an NS
query for the original QNAME, to prevent the parent zone from
receiving the QTYPE.
For example, when looking up example.com/A, we now send NS queries
for both com and example.com before sending the A query to the
servers for example.com. Previously, an A query for example.com
would have been sent to the servers for com.
Several system tests needed to be adjusted for the new query pattern:
- Some queries in the serve-stale test were sent to the wrong server.
- The synthfromdnssec test could fail due to timing issues; this
has been addressed by adding a 1-second delay.
- The cookie test could fail due to the a change in the count of
TSIG records received in the "check that missing COOKIE with a
valid TSIG signed response does not trigger TCP fallback" test case.
- The GL #4652 regression test case in the chain system test depends
on a particular query order, which no longer occurs when QNAME
minimization is active. We now disable qname-minimization
for that test.
Mark Andrews [Thu, 15 Aug 2024 13:34:59 +0000 (23:34 +1000)]
Fix handling of ISC_R_TIMEOUT in resume_qmin()
If a timeout occurs when sending a QMIN query, QNAME
minimization should be disabled. This now causes a hard
failure in strict mode, or a fallback to non-minimized queries
in relaxed mode.
Mark Andrews [Fri, 14 Mar 2025 00:45:20 +0000 (00:45 +0000)]
new: usr: dig can now display the received BADVERS message during negotiation
Dig +showbadvers now displays the received BADVERS message and
continues the EDNS version negotiation. Previously to see the
BADVERS message +noednsneg had to be specified which terminated the
EDNS negotiation. Additionally the specified EDNS value (+edns=value)
is now used when making all the initial queries with +trace. i.e EDNS
version negotiation will be performed with each server when performing
the trace.
Closes #5234
Merge branch '5234-have-dig-display-the-badvers-message' into 'main'
Mark Andrews [Tue, 11 Mar 2025 23:02:05 +0000 (10:02 +1100)]
Add "+showbadvers" to dig and reset EDNS version
Add "+showbadvers" to display the BADVERS response similarly
to "+showbadcookie". Additionally reset the EDNS version to
the requested version in "dig +trace" so that EDNS version
negotiation can be tested at all levels of the trace rather
that just when requesting the root nameservers.
Matthijs Mekking [Thu, 13 Mar 2025 08:28:37 +0000 (09:28 +0100)]
Raise max-clients-per-query to be at least
In the case where 'clients-per-query' is larger than
'max-clients-per-query', raise 'max-clients-per-query' so that
'clients-per-query' equals 'max-clients-per-query' and log a warning
that this is what happened.
Colin Vidal [Thu, 13 Mar 2025 11:56:37 +0000 (11:56 +0000)]
new: usr: Add support for EDE 20 (Not Authoritative)
Support was added for EDE codes 20 (Not Authoritative) when client requests recursion (RD) but the server has recursion disabled.
RFC 8914 mention EDE 20 should also be returned if the client doesn't have the RD bit set (and recursion is needed) but it doesn't apply for
BIND as BIND would try to resolve from the "deepest" referral in AUTHORITY section. For example, if the client asks for "www.isc.org/A" but the server only knows the root domain, it will return NOERROR but no answer for "www.isc.og/A", just the list of other servers to ask.
Colin Vidal [Wed, 12 Mar 2025 09:28:27 +0000 (10:28 +0100)]
add support for EDE 20 (Not Authoritative)
Extended DNS Error message EDE 20 (Not Authoritative) is now sent when
client request recursion (RD) but the server has recursion disabled.
RFC 8914 mention EDE 20 should also be returned if the client doesn't
have the RD bit set (and recursion is needed) but it doesn't apply for
BIND as BIND would try to resolve from the "deepest" referral in
AUTHORITY section. For example, if the client asks for "www.isc.org/A"
but the server only knows the root domain, it will returns NOERROR but
no answer for "www.isc.og/A", just the list of other servers to ask.
Colin Vidal [Wed, 12 Mar 2025 09:53:11 +0000 (10:53 +0100)]
add support for EDE 7 and 8
Extended DNS Error messages EDE 7 (expired key) and EDE 8 (validity
period of the key not yet started) are now sent in case of such DNSSEC
validation failures.
Refactor the existing validator extended error APIs in order to make it
easy to have a consisdent extra info (with domain/type) in the various
use case (i.e. when the EDE depends on validator state,
validate_extendederror or when the EDE doesn't depend of any state but
can be called directly in a specific flow).
Matthijs Mekking [Wed, 12 Mar 2025 15:14:52 +0000 (16:14 +0100)]
ksr: Take into account key collisions
When generating new key pairs, one test checks if existing keys that
match the time bundle are selected, rather than extra keys being
generated. Part of the test is to check the verbose output, counting
the number of "Selecting" and "Generating" occurences. But if there
is a key collision, the ksr tool will output that the key already
exists and includes the substring "already exists, or might collide
with another key upon revokation. Generating a new key".
So substract by one the generated counter if there is a "collide"
occurrence.
Ondřej Surý [Wed, 5 Mar 2025 11:20:21 +0000 (11:20 +0000)]
chg: dev: Cleanup parts of the isc_mem API
This MR changes custom attach/detach implementation with refcount macros, replaces isc_mem_destroy() with isc_mem_detach(), and does various small cleanups.
Merge branch 'ondrej/cleanup-isc_mem-api' into 'main'
Replace attach/detach in isc_mem with refcount implementation
The isc_mem API is one of the most commonly used APIs that didn't
used ISC_REFCOUNT_DECL and ISC_REFCOUNT_IMPL macros. Replace the
implementation of isc_mem_attach(), isc_mem_detach() and
isc_mem_destroy() with the respective macros.
This also removes the legacy isc_mem_destroy() functionality that would
check whether all references had been detached from the memory context
as it doesn't work reliably when using the call_rcu() API. Instead of
doing this individually, call isc_mem_checkdestroyed(stderr) from the
isc_mem_destroy() macro to keep the extra check that all contexts were
freed when the program is exiting.
Mark Andrews [Wed, 23 Jun 2021 09:51:51 +0000 (19:51 +1000)]
Implement digest_sig and digest_rrsig for ZONEMD
ZONEMD needs to be able to digest SIG and RRSIG records. The signer
field can be compressed in SIG so we need to call dns_name_digest().
While for RRSIG the records the signer field is not compressed the
canonical form has the signer field downcased (RFC 4034, 6.2). This
also implies that compare_rrsig needs to downcase the signer field
during comparison.
Ondřej Surý [Wed, 5 Mar 2025 06:49:59 +0000 (06:49 +0000)]
fix: dev: Fix the foundname vs dcname madness in qpcache_findzonecut()
The qpcache_findzonecut() accepts two "foundnames": 'foundname' and
'dcname' could be NULL. Originally, when 'dcname' would be NULL, the
'dcname' would be set to 'foundname' which basically means that we were
copying the .ndata over itself for no apparent reason.
Merge branch 'ondrej/refactor-qpcache_findzonecut' into 'main'
Ondřej Surý [Sun, 2 Feb 2025 14:50:15 +0000 (15:50 +0100)]
Fix the foundname vs dcname madness in qpcache_findzonecut()
The qpcache_findzonecut() accepts two "foundnames": 'foundname' and
'dcname' could be NULL. Originally, when 'dcname' would be NULL, the
'dcname' would be set to 'foundname'. Then code like this was present:
Evan Hunt [Sun, 2 Mar 2025 05:03:51 +0000 (21:03 -0800)]
when recording an rr trace, use libtool
when running a system test with the USE_RR environment
variable set to 1, an rr trace is generated for named.
because rr wasn't run using libtool --mode=execute, the
trace would actually be generated for the wrapper script
generated by libtool, not for the actual named binary.
Artem Boldariev [Mon, 3 Mar 2025 10:10:33 +0000 (10:10 +0000)]
fix: dev: Post [CVE-2024-12705] Performance Drop Fixes, Part 2
This merge request addresses several key performance bottlenecks in the DoH (DNS over HTTPS) implementation by introducing significant optimizations and improvements.
### Key Improvements
1. **Simplification and Optimisation of `http_do_bio()` Function**:
- The code flow in the `http_do_bio()` function has been significantly simplified.
2. **Flushing HTTP Write Buffer on Outgoing DNS Messages**:
- The buffer is flushed and a send operation is performed when there is an outgoing DNS message.
3. **Bumping Active Streams Processing Limit**:
- The total number of active streams has been increased to 60% of the total streams limit.
These changes collectively enhance the performance and reliability of the DoH implementation, making it more efficient and robust for handling high-load scenarios, particularly noticeable in long runs (>= 1h) of `stress:long:rpz:doh+udp:linux:*` tests. It improves perf. for tests for BIND 9.18, but it likely will have a positive but less pronounced effect on newer versions as well.
In essence, the merge request fixes three bottlenecks stacked upon each other.
*It is a logical continuation of the merge requests !10109.* !10109, unfortunately, did not completely [address the performance drop in 9.18](https://gitlab.isc.org/isc-projects/bind9/-/pipelines/221545) for longer runs of the stress test. This merge request [addresses that](https://gitlab.isc.org/isc-projects/bind9/-/pipelines/223661).
**P.S.**
The origin of the fixes is, in fact, the branch in !10193. So this MR is a ... *forward port* of them.
Merge branch 'artem-doh-performance-drop-post-fix' into 'main'
Artem Boldariev [Tue, 25 Feb 2025 17:58:24 +0000 (19:58 +0200)]
DoH: Bump the active streams processing limit
This commit bumps the total number of active streams (= the opened
streams for which a request is received, but response is not ready) to
60% of the total streams limit.
The previous limit turned out to be too tight as revealed by
longer (≥1h) runs of "stress:long:rpz:doh+udp:linux:*" tests.
Artem Boldariev [Tue, 25 Feb 2025 07:52:19 +0000 (09:52 +0200)]
DoH: Flush HTTP write buffer on an outgoing DNS message
Previously, the code would try to avoid sending any data regardless of
what it is unless:
a) The flush limit is reached;
b) There are no sends in flight.
This strategy is used to avoid too numerous send requests with little
amount of data. However, it has been proven to be too aggressive and,
in fact, harms performance in some cases (e.g., on longer (≥1h) runs
of "stress:long:rpz:doh+udp:linux:*").
Now, additionally to the listed cases, we also:
c) Flush the buffer and perform a send operation when there is an
outgoing DNS message passed to the code (which is indicated by the
presence of a send callback).
That helps improve performance for "stress:long:rpz:doh+udp:linux:*"
tests.
Artem Boldariev [Mon, 24 Feb 2025 16:32:23 +0000 (18:32 +0200)]
DoH: Limit the number of delayed IO processing requests
Previously, a function for continuing IO processing on the next UV
tick was introduced (http_do_bio_async()). The intention behind this
function was to ensure that http_do_bio() is eventually called at
least once in the future. However, the current implementation allows
queueing multiple such delayed requests needlessly. There is currently
no need for these excessive requests as http_do_bio() can requeue them
if needed. At the same time, each such request can lead to a memory
allocation, particularly in BIND 9.18.
This commit ensures that the number of enqueued delayed IO processing
requests never exceeds one in order to avoid potentially bombarding IO
threads with the delayed requests needlessly.
Artem Boldariev [Thu, 20 Feb 2025 20:08:01 +0000 (22:08 +0200)]
DoH: Simplify http_do_bio()
This commit significantly simplifies the code flow in the
http_do_bio() function, which is responsible for processing incoming
and outgoing HTTP/2 data. It seems that the way it was structured
before was indirectly caused by the presence of the missing callback
calls bug, fixed in 8b8f4d500d9c1d41d95d34a79c8935823978114c.
The change introduced by this commit is known to remove a bottleneck
and allows reproducible and measurable performance improvement for
long runs (>= 1h) of "stress:long:rpz:doh+udp:linux:*" tests.
Additionally, it fixes a similar issue with potentially missing send
callback calls processing and hardens the code against use-after-free
errors related to the session object (they can potentially occur).
Ondřej Surý [Sat, 1 Mar 2025 06:35:58 +0000 (06:35 +0000)]
rem: dev: Cleanup isc/util.h header and friends
Cleanup short list macros from <isc/util.h>, remove two unused headers, move locking macros to respective headers and use only the C11 static assertion.
Merge branch 'ondrej/cleanup-short-macros' into 'main'
Ondřej Surý [Fri, 28 Feb 2025 20:51:18 +0000 (21:51 +0100)]
Remove STATIC_ASSERT variants in favor of the C11 variant
Previously, a gcc < 4.6 shim for _Static_assert() was included. Such an
old compiler is not supported now anyway, so the macro variant has been
removed in favor of a single definition using _Static_assert().
Ondřej Surý [Fri, 28 Feb 2025 20:49:48 +0000 (21:49 +0100)]
Move locking macros into individual headers
Previously, the LOCK()/UNLOCK() and friends macros were defined in the
isc/util.h header. Those macros were moved to their respective headers
as those would have to be included anyway if that particular lock was in
use.
Ondřej Surý [Fri, 28 Feb 2025 20:40:50 +0000 (21:40 +0100)]
Remove superflous header includes from isc/util.h header
Formerly, isc/util.h would pull a few extra headers (isc/list.h,
isc/attributes.h, isc/result.h and errno.h). These includes were
removed in favor of including them directly when used.