Matthijs Mekking [Thu, 22 May 2025 09:23:48 +0000 (11:23 +0200)]
Fix spurious missing key files log messages
I suspect this happens because the old key is purged by one zone view,
and the other view then complains about the missing key file.
Keys that are unused or being purged should not be taken into account.
The keyring is maintained per zone. So in one zone, a key in the
keyring is being purged. The corresponding key file is removed.
The key maintenance is done for the other zone view. The key in that
keyring is not yet set to purge, but its corresponding key file is
removed. This leads to the "some keys are missing" log message.
I think we should not check the purge variable at this point, but the
current time and purge-keys duration. That is what this commit does.
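The idea of the commit can be sketched as follows. This is a minimal illustration with hypothetical names (demo_key_t, missing_file_is_expected, and the fields are all invented for this sketch), not the actual BIND 9 keymgr API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical sketch: a key file may legitimately be absent once
 * enough time has passed for the key to be purged, regardless of
 * whether *this* zone view's keyring has already flagged the key
 * as purged.  All names are illustrative.
 */
typedef struct {
	int64_t removed;    /* when the key was retired (epoch seconds) */
	bool	purge_flag; /* per-keyring flag; may lag in other views */
} demo_key_t;

/*
 * A missing key file is expected (not an error) once the purge
 * window has elapsed, even if purge_flag is still false in this
 * view's keyring.
 */
static bool
missing_file_is_expected(const demo_key_t *key, int64_t now,
			 int64_t purge_interval) {
	if (key->removed == 0) {
		return false; /* key never retired; file must exist */
	}
	return now >= key->removed + purge_interval;
}
```

Checking the time against the purge-keys duration instead of the per-keyring purge flag makes the decision identical in every zone view.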
Michał Kępień [Wed, 14 May 2025 18:02:45 +0000 (18:02 +0000)]
[9.18] chg: test: Mark test_idle_timeout as flaky on FreeBSD 13
The test_idle_timeout check in the "timeouts" system test has been
failing often on FreeBSD 13 AWS hosts. Adding timestamped debug logging
shows that the time.sleep() calls used in that check are returning
significantly later than asked to on that platform (e.g. after 4 seconds
when just 1 second is requested), breaking the test's timing assumptions
and triggering false positives. These failures are not an indication of
a bug in named and have not been observed on any other platform. Mark
the problematic check as flaky, but only on FreeBSD 13, so that other
failure modes are caught appropriately.
Backport of MR !10459
Merge branch 'backport-michal/mark-test_idle_timeout-as-flaky-on-freebsd-13-9.18' into 'bind-9.18'
Michał Kępień [Wed, 14 May 2025 07:50:33 +0000 (09:50 +0200)]
Mark test_idle_timeout as flaky on FreeBSD 13
The test_idle_timeout check in the "timeouts" system test has been
failing often on FreeBSD 13 AWS hosts. Adding timestamped debug logging
shows that the time.sleep() calls used in that check are returning
significantly later than asked to on that platform (e.g. after 4 seconds
when just 1 second is requested), breaking the test's timing assumptions
and triggering false positives. These failures are not an indication of
a bug in named and have not been observed on any other platform. Mark
the problematic check as flaky, but only on FreeBSD 13, so that other
failure modes are caught appropriately.
Michal Nowak [Mon, 5 May 2025 15:06:10 +0000 (15:06 +0000)]
[9.18] chg: ci: Run linkchecker only on Wednesdays
Some domains tested by linkchecker may think that we connect to them too
often and will refuse connection or reply with an error code, which makes
this job fail. Let's check links only on Wednesdays.
Backport of MR !10439
Merge branch 'backport-mnowak/run-linkchecker-only-sometimes-9.18' into 'bind-9.18'
Michal Nowak [Mon, 5 May 2025 10:57:47 +0000 (12:57 +0200)]
Run linkchecker only on Wednesdays
Some domains tested by linkchecker may think that we connect to them too
often and will refuse connection or reply with an error code, which
makes this job fail. Let's check links only on Wednesdays.
Michal Nowak [Mon, 5 May 2025 10:13:58 +0000 (10:13 +0000)]
[9.18] chg: ci: Disable linkcheck on www.gnu.org
The check has been failing with the following error for some time:
broken https://www.gnu.org/software/libidn/#libidn2 - HTTPSConnectionPool(host='www.gnu.org', port=443): Max retries exceeded with url: /software/libidn/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5bd4c14590>: Failed to establish a new connection: [Errno 111] Connection refused'))
Backport of MR !10436
Merge branch 'backport-mnowak/linkcheck-disable-www-gnu-org-9.18' into 'bind-9.18'
Michal Nowak [Mon, 5 May 2025 09:50:03 +0000 (11:50 +0200)]
Disable linkcheck on www.gnu.org
The check has been failing with the following error for some time:
broken https://www.gnu.org/software/libidn/#libidn2 - HTTPSConnectionPool(host='www.gnu.org', port=443): Max retries exceeded with url: /software/libidn/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f5bd4c14590>: Failed to establish a new connection: [Errno 111] Connection refused'))
Over the past few years, some of the initial decisions made about which
GitLab CI jobs to run for all merge requests and which of them to run
just for scheduled/web-triggered pipelines turned out to be less than
ideal in practice: test coverage was found to be too lax in some areas
and on the other hand unnecessarily repetitive in others. For example,
compilation failures for certain build types that are not exercised for
every merge request (e.g. FIPS-enabled builds) turned out to be much
more common in practice than e.g. test failures happening only on a
subset of releases of a given Linux distribution.
To limit excessive resource use while retaining broad test coverage,
adjust GitLab CI job triggering rules for merge request pipelines as
follows:
- run all possible build jobs for every merge request; compilation
failures triggered for build flavors that were only tested in
scheduled pipelines turned out to be surprisingly commonplace and
became a nuisance over time, particularly given that the run times
of build jobs are much lower than those of test jobs,
- for every merge request, run at least one system & unit test job for
each build flavor (e.g. sanitizer-enabled, FIPS-enabled,
out-of-tree, tarball-based, etc.),
- limit the amount of test jobs run for each distinct operating
system; for example, only run system & unit test jobs for Ubuntu
24.04 Noble Numbat in merge request pipelines, skipping those for
Ubuntu 22.04 Jammy Jellyfish and Ubuntu 20.04 Focal Fossa (while
still running them in other pipeline types, e.g. in scheduled
pipelines),
- ensure every merge request is tested on Oracle Linux 8, which is the
operating system with the oldest package versions out of the systems
that are still supported by this BIND 9 branch,
- decrease the number of test jobs run with sanitizers enabled while
still testing with both ASAN and TSAN and both GCC and Clang for
every merge request.
These changes do not affect the set of jobs created for any other
pipeline type (triggered by a schedule, by a GitLab API call, by the web
interface, etc.); only merge request pipelines are affected.
Backport of MR !10349
Merge branch 'backport-michal/revise-ci-job-triggering-rules-9.18' into 'bind-9.18'
Over the past few years, some of the initial decisions made about which
GitLab CI jobs to run for all merge requests and which of them to run
just for scheduled/web-triggered pipelines turned out to be less than
ideal in practice: test coverage was found to be too lax in some areas
and on the other hand unnecessarily repetitive in others. For example,
compilation failures for certain build types that are not exercised for
every merge request (e.g. FIPS-enabled builds) turned out to be much
more common in practice than e.g. test failures happening only on a
subset of releases of a given Linux distribution.
To limit excessive resource use while retaining broad test coverage,
adjust GitLab CI job triggering rules for merge request pipelines as
follows:
- run all possible build jobs for every merge request; compilation
failures triggered for build flavors that were only tested in
scheduled pipelines turned out to be surprisingly commonplace and
became a nuisance over time, particularly given that the run times
of build jobs are much lower than those of test jobs,
- for every merge request, run at least one system & unit test job for
each build flavor (e.g. sanitizer-enabled, FIPS-enabled,
out-of-tree, tarball-based, etc.),
- limit the amount of test jobs run for each distinct operating
system; for example, only run system & unit test jobs for Ubuntu
24.04 Noble Numbat in merge request pipelines, skipping those for
Ubuntu 22.04 Jammy Jellyfish and Ubuntu 20.04 Focal Fossa (while
still running them in other pipeline types, e.g. in scheduled
pipelines),
- ensure every merge request is tested on Oracle Linux 8, which is the
operating system with the oldest package versions out of the systems
that are still supported by this BIND 9 branch,
- decrease the number of test jobs run with sanitizers enabled while
still testing with both ASAN and TSAN and both GCC and Clang for
every merge request.
These changes do not affect the set of jobs created for any other
pipeline type (triggered by a schedule, by a GitLab API call, by the web
interface, etc.); only merge request pipelines are affected.
Michal Nowak [Tue, 29 Apr 2025 11:02:11 +0000 (11:02 +0000)]
[9.18] rem: ci: Drop OpenBSD from the CI
With the ongoing process of moving CI workloads to AWS, OpenBSD poses a
challenge, as there is no OpenBSD AMI image in the AWS catalog. Building
our image from scratch is disproportionately complicated, given that
OpenBSD is not a common deployment platform for BIND 9. Otherwise,
OpenBSD stays at the "Best-Effort" level of support.
Backport of MR !10375
Merge branch 'backport-mnowak/drop-openbsd-from-ci-9.18' into 'bind-9.18'
Michal Nowak [Wed, 9 Apr 2025 12:05:42 +0000 (14:05 +0200)]
Drop OpenBSD from the CI
With the ongoing process of moving CI workloads to AWS, OpenBSD poses a
challenge, as there is no OpenBSD AMI image in the AWS catalog. Building
our image from scratch is disproportionately complicated, given that
OpenBSD is not a common deployment platform for BIND 9. Otherwise,
OpenBSD stays at the "Best-Effort" level of support.
[9.18] fix: dev: Unify the int32_t vs int_fast32_t when working with atomic types
There's a mismatch between the atomic and non-atomic types that could
potentially lead to a rwlock deadlock (after about two billion (2^31) writes).
Use int_fast32_t when loading the atomic_int_fast32_t types in the
isc_rwlock unit.
Closes #5280
Merge branch '5280-match-the-types-in-isc_rwlock-9.18' into 'bind-9.18'
Unify the int32_t vs int_fast32_t when working with atomic types
There's a mismatch between the atomic and non-atomic types that could
potentially lead to a rwlock deadlock (after about two billion (2^31) writes).
Use int_fast32_t when loading the atomic_int_fast32_t types in the
isc_rwlock unit.
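The pitfall can be demonstrated in isolation. On many 64-bit platforms int_fast32_t is 64 bits wide, so copying an atomic_int_fast32_t into a plain int32_t silently truncates once the counter passes INT32_MAX; the sketch below (illustrative helper names, not the isc_rwlock code) contrasts the matched and mismatched load:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Matching type: no truncation, comparisons stay sane. */
static int_fast32_t
load_counter_correct(atomic_int_fast32_t *cnt) {
	int_fast32_t v = atomic_load(cnt);
	return v;
}

/* Mismatched type: value wraps if int_fast32_t is wider than 32 bits. */
static int32_t
load_counter_buggy(atomic_int_fast32_t *cnt) {
	int32_t v = (int32_t)atomic_load(cnt);
	return v;
}

/*
 * For values that fit in 32 bits both loads agree, which is why the
 * bug only surfaces after roughly two billion writes.
 */
static bool
demo_small_values_agree(int_fast32_t value) {
	atomic_int_fast32_t c;
	atomic_init(&c, value);
	return load_counter_buggy(&c) == (int32_t)value &&
	       load_counter_correct(&c) == value;
}
```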
With `dnssec-policy` you can pregenerate keys; if they are eligible, rather than creating a new key, a key is selected from the pregenerated keys. A key is eligible if it is unused, i.e. it has no key timing metadata set.
Backport of MR !10385
Merge branch 'backport-matthijs-clarify-pregenerating-keys-9.18' into 'bind-9.18'
With dnssec-policy you can pregenerate keys; if they are eligible,
rather than creating a new key, a key is selected from the pregenerated
keys. A key is eligible if it is unused, i.e. it has no key timing
metadata set.
Michal Nowak [Mon, 14 Apr 2025 11:43:52 +0000 (11:43 +0000)]
[9.18] fix: test: Fix check_pid() in runtime system test on FreeBSD
The original check_pid() always returned 0 on FreeBSD, even if the
process was still running. This makes the "verifying that named checks
for conflicting named processes" check fail on FreeBSD with TSAN.
Backport of MR !10373
Merge branch 'backport-mnowak/fix-runtime-pid-check-9.18' into 'bind-9.18'
Michal Nowak [Thu, 3 Apr 2025 11:38:03 +0000 (13:38 +0200)]
Fix check_pid() in runtime system test on FreeBSD
The original check_pid() always returned 0 on FreeBSD, even if the
process was still running. This makes the "verifying that named checks
for conflicting named processes" check fail on FreeBSD with TSAN.
The timers can be destroyed while the timer actions are still running,
and when the action calls isc_event_free() it can assert, because it's
trying to access the destroyed timer object.
Prior to destroying the timers, first disable them, then wait a short
grace period before destroying them.
Closes #5228
Merge branch '5228-task-unit-test-fix-9.18' into 'bind-9.18'
The timers can be destroyed while the timer actions are still running,
and when the action calls isc_event_free() it can assert, because it's
trying to access the destroyed timer object.
Before destroying the timers, first disable them, then wait a
two-second grace period before destroying them.
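The teardown ordering can be sketched like this. The names (demo_timer_t, demo_safe_teardown, etc.) are invented for illustration; the real code uses isc_timer and the unit-test framework:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
	bool enabled;
	bool destroyed;
} demo_timer_t;

static void
demo_timer_disable(demo_timer_t *t) {
	t->enabled = false;
}

static void
demo_timer_destroy(demo_timer_t *t) {
	t->destroyed = true;
}

/* Timer action: must never run against a destroyed timer. */
static bool
demo_timer_fire(demo_timer_t *t) {
	if (!t->enabled) {
		return false; /* disabled: the action is a no-op */
	}
	assert(!t->destroyed); /* this would be a use-after-destroy */
	return true;
}

/*
 * Safe teardown: disable first so in-flight actions become no-ops
 * (in the real fix a grace period elapses between the two calls),
 * then destroy.
 */
static void
demo_safe_teardown(demo_timer_t *t) {
	demo_timer_disable(t);
	/* grace period elapses here */
	demo_timer_destroy(t);
}

static bool
demo_late_fire_is_harmless(void) {
	demo_timer_t t = { true, false };
	demo_safe_teardown(&t);
	/* An action arriving after teardown no longer asserts. */
	return demo_timer_fire(&t) == false;
}
```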
[9.18] fix: usr: Stop caching lack of EDNS support
`named` could falsely learn that a server doesn't support EDNS when
a spoofed response was received; that subsequently prevented DNSSEC
lookups from being made. This has been fixed.
TSAN reports a lock-order-inversion (potential deadlock) issue in
add_trace_entry().
While it is true that in one case a lock in the 'isc_mem_t' structure is
locked first, and then a lock in the 'FILE' structure is locked second,
and in the second case it is the other way around, this isn't an
issue, because those are 'FILE' structures for totally different files,
used in different parts of the code.
Closes #5266
Backport of MR !10355
Merge branch 'backport-5266-freebsd-suppress-tsan-lock-order-inversion-false-positive-9.18' into 'bind-9.18'
TSAN reports a lock-order-inversion (potential deadlock) issue in
add_trace_entry():
WARNING: ThreadSanitizer: lock-order-inversion (potential deadlock)
Cycle in lock order graph: M0001 (0x000000000001) => M0002 (0x000000000002) => M0001
Mutex M0002 acquired here while holding mutex M0001 in main thread:
#0 _pthread_mutex_lock /usr/src/contrib/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1342:3
#1 add_trace_entry lib/isc/mem.c:210:2
#2 isc__mem_get lib/isc/mem.c:606:2
#3 isc_buffer_allocate lib/isc/./include/isc/buffer.h:1080:23
#4 pushandgrow lib/isc/lex.c:321:3
#5 isc_lex_gettoken lib/isc/lex.c:445:22
#6 cfg_gettoken lib/isccfg/parser.c:3490:11
#7 cfg_parse_mapbody lib/isccfg/parser.c:2230:3
#8 cfg_parse_obj lib/isccfg/parser.c:247:11
#9 parse2 lib/isccfg/parser.c:628:11
#10 cfg_parse_file lib/isccfg/parser.c:668:11
#11 load_configuration bin/named/server.c:8069:13
#12 run_server bin/named/server.c:9518:2
#13 isc__async_cb lib/isc/async.c:110:3
#14 uv__async_io /tmp/libuv-1.50.0/src/unix/async.c:208:5
#15 uv__io_poll /tmp/libuv-1.50.0/src/unix/kqueue.c:369:9
#16 uv_run /tmp/libuv-1.50.0/src/unix/core.c:460:5
#17 loop_thread lib/isc/loop.c:327:6
#18 thread_body lib/isc/thread.c:89:8
#19 isc_thread_main lib/isc/thread.c:124:2
#20 isc_loopmgr_run lib/isc/loop.c:513:2
#21 main bin/named/main.c:1469:2
Mutex M0001 previously acquired by the same thread here:
#0 _pthread_mutex_lock /usr/src/contrib/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1342:3
#1 _flockfile /usr/src/lib/libc/stdio/_flock_stub.c:72:3
#2 cfg_gettoken lib/isccfg/parser.c:3490:11
#3 cfg_parse_mapbody lib/isccfg/parser.c:2230:3
#4 cfg_parse_obj lib/isccfg/parser.c:247:11
#5 parse2 lib/isccfg/parser.c:628:11
#6 cfg_parse_file lib/isccfg/parser.c:668:11
#7 load_configuration bin/named/server.c:8069:13
#8 run_server bin/named/server.c:9518:2
#9 isc__async_cb lib/isc/async.c:110:3
#10 uv__async_io /tmp/libuv-1.50.0/src/unix/async.c:208:5
#11 uv__io_poll /tmp/libuv-1.50.0/src/unix/kqueue.c:369:9
#12 uv_run /tmp/libuv-1.50.0/src/unix/core.c:460:5
#13 loop_thread lib/isc/loop.c:327:6
#14 thread_body lib/isc/thread.c:89:8
#15 isc_thread_main lib/isc/thread.c:124:2
#16 isc_loopmgr_run lib/isc/loop.c:513:2
#17 main bin/named/main.c:1469:2
Mutex M0001 acquired here while holding mutex M0002 in main thread:
#0 _pthread_mutex_lock /usr/src/contrib/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1342:3
#1 _flockfile /usr/src/lib/libc/stdio/_flock_stub.c:72:3
#2 print_active lib/isc/mem.c:629:3
#3 isc_mem_stats lib/isc/mem.c:694:2
#4 main bin/named/main.c:1498:4
Mutex M0002 previously acquired by the same thread here:
#0 _pthread_mutex_lock /usr/src/contrib/llvm-project/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp:1342:3
#1 isc_mem_stats lib/isc/mem.c:668:2
#2 main bin/named/main.c:1498:4
SUMMARY: ThreadSanitizer: lock-order-inversion (potential deadlock) lib/isc/mem.c:210:2 in add_trace_entry
In the first stack frame ('M0001'->'M0002' lock order) cfg_gettoken()
uses flockfile() to lock 'M0001' for the 'FILE' object associated with
the configuration file (e.g. the configuration file itself and
whatever it includes, like a zone database), then it locks a memory
context mutex M0002.
In the other stack frame ('M0002'->'M0001' lock order) isc_mem_stats()
locks a memory context mutex M0002, then it uses fprintf(), which
internally locks a 'M0001' mutex with flockfile() to write into the
'named.memstats' memory statistics file.
While it is true that in one case a lock in the 'isc_mem_t' structure is
locked first, and then a lock in the 'FILE' structure is locked second,
and in the second case it is the other way around, this isn't an
issue, because those are 'FILE' structures for totally different files,
used in different parts of the code.
It was also manually confirmed that 'named.memstats' doesn't get
processed by cfg_gettoken(), and is used only in the second stack
frame's code flow when named is exiting.
Michal Nowak [Wed, 2 Apr 2025 18:45:53 +0000 (18:45 +0000)]
[9.18] chg: nil: Suppress FreeBSD-specific TSAN false-positive data race
TSAN reports a data race in FreeBSD's memset(), called by its
__crt_calloc() memory allocation function. There is a very similar
bug report [1] in FreeBSD bug tracker, and an existing code-review [2]
that tries to address an issue, the description of which is very
similar to what we are seeing.
Suppress this report by adding its signature to '.tsan-suppress'.
Suppress FreeBSD-specific TSAN false-positive data race
TSAN reports a data race in FreeBSD's memset(), called by its
__crt_calloc() memory allocation function. There is a very similar
bug report [1] in FreeBSD bug tracker, and an existing code-review [2]
that tries to address an issue, the description of which is very
similar to what we are seeing.
Suppress this report by adding its signature to '.tsan-suppress'.
[9.18] chg: ci: Update issue closing regex in dangerfile.py
Update issue regex in danger file
The regular expression in `dangerfile.py` has been updated to match
the one in GitLab and bind9-qa (isc-projects/bind9-qa!41), i.e.
https://docs.gitlab.com/user/project/issues/managing_issues/#default-closing-pattern.
Backport of MR !10361
Merge branch 'backport-andoni/update-issue-regex-in-danger-file-9.18' into 'bind-9.18'
Update the regular expression used for extracting references to GitLab
issues closed by a given merge request so that it is identical to the
one used by GitLab [1].
[9.18] new: ci: Allow pushing branches and tags to customer git repos
For pipelines in the private repository, add an optional manual job,
which allows the current branch to be pushed into the specified
customer's git repository. This can be useful to provide patch previews
for early testing.
For tags created in a private repository, add a manual job which pushes
the created tag to all entitled customers.
Backport of MR !10323
Merge branch 'backport-nicki/ci-customer-git-automation-9.18' into 'bind-9.18'
Nicki Křížek [Tue, 25 Mar 2025 15:51:24 +0000 (16:51 +0100)]
Allow pushing branches and tags to customer git repos
For pipelines in the private repository, add an optional manual job,
which allows the current branch to be pushed into the specified
customer's git repository. This can be useful to provide patch previews
for early testing.
For tags created in a private repository, add a manual job which pushes
the created tag to all entitled customers.
Aram Sargsyan [Mon, 31 Mar 2025 19:56:10 +0000 (19:56 +0000)]
[9.18] fix: usr: Fix resolver statistics counters for timed out responses
When query responses timed out, the resolver could incorrectly increase the regular responses counters, even if no response was received. This has been fixed.
Closes #5193
Backport of MR !10227
Merge branch 'backport-5193-resolver-statistics-counters-fix-9.18' into 'bind-9.18'
Aram Sargsyan [Thu, 6 Mar 2025 14:28:48 +0000 (14:28 +0000)]
Fix the resolver's RTT-ranged response statistics counters
When a response times out, the fctx_cancelquery() function
incorrectly counts it in the 'dns_resstatscounter_queryrtt5'
counter (i.e. >=1600 ms). To avoid this, the rctx_timedout()
function should make sure that 'rctx->finish' is NULL. And in order
to adjust the RTT values for the timed-out server, 'rctx->no_response'
should be true. Update the rctx_timedout() function to make those
changes.
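The fix can be modeled as below. Field and function names (demo_rctx_t, demo_rctx_timedout) are hypothetical stand-ins for the BIND 9 internals described above:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative model: a timed-out query must not be counted in the
 * RTT-ranged response buckets, and must be treated as "no response"
 * when adjusting the server's RTT.
 */
typedef struct {
	const char *finish;   /* non-NULL => counted in RTT buckets */
	bool	    no_response; /* true => server RTT gets penalized */
} demo_rctx_t;

static void
demo_rctx_timedout(demo_rctx_t *rctx) {
	rctx->finish = NULL;	  /* skip the >=1600 ms RTT bucket */
	rctx->no_response = true; /* adjust RTT as for a lost response */
}

static bool
demo_counts_in_rtt_buckets(const demo_rctx_t *rctx) {
	return rctx->finish != NULL;
}

static bool
demo_timeout_scenario(void) {
	demo_rctx_t rctx = { "finish-timestamp", false };
	demo_rctx_timedout(&rctx);
	return !demo_counts_in_rtt_buckets(&rctx) && rctx.no_response;
}
```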
Aram Sargsyan [Thu, 6 Mar 2025 14:26:23 +0000 (14:26 +0000)]
Fix resolver responses statistics counter
The resquery_response() function increases the response counter without
checking if the response was successful. Increase the counter only when
the result indicates success.
Nicki Křížek [Fri, 28 Mar 2025 12:22:51 +0000 (12:22 +0000)]
[9.18] chg: doc: Remove -S changelog templates from open-source edition
These changelogs meant for -S edition were introduced to avoid rebase
conflicts. However, the same result can be achieved by linking the -S
changelogs directly from their open-source variants, rather than
including the -S changelogs directly in changelog.rst.
Nicki Křížek [Thu, 27 Mar 2025 12:51:29 +0000 (13:51 +0100)]
Remove -S changelog templates from open-source edition
These changelogs meant for -S edition were introduced to avoid rebase
conflicts. However, the same result can be achieved by linking the -S
changelogs directly from their open-source variants, rather than
including the -S changelogs directly in changelog.rst.
Ondřej Surý [Wed, 26 Mar 2025 12:09:19 +0000 (12:09 +0000)]
[9.18] fix: dev: Validating ADB fetches could cause a crash in import_rdataset()
Previously, in some cases, the resolver could return rdatasets of type CNAME or DNAME without the result code being set to `DNS_R_CNAME` or `DNS_R_DNAME`. This could trigger an assertion failure in the ADB. The resolver error has been fixed.
Closes #5201
Backport of MR !10172
Backport of MR !10178
Merge branch 'backport-5201-adb-cname-error-9.18' into 'bind-9.18'
Evan Hunt [Tue, 25 Feb 2025 22:41:41 +0000 (14:41 -0800)]
set eresult based on the type in ncache_adderesult()
when the caching of a negative record failed because of the
presence of a positive one, ncache_adderesult() could override
this to ISC_R_SUCCESS. this could cause CNAME and DNAME responses
to be handled incorrectly. ncache_adderesult() now sets the result
code correctly in such cases.
Michal Nowak [Tue, 25 Mar 2025 16:51:00 +0000 (16:51 +0000)]
[9.18] fix: test: Limit X-Bloat header size to 100KB
Otherwise curl 8.13 rejects the line with:
I:Check HTTP/1.1 keep-alive with truncated stream (21)
curl: option --header: error encountered when reading a file
curl: try 'curl --help' or 'curl --manual' for more information
Also, see https://github.com/curl/curl/pull/16572.
Closes #5249
Backport of MR !10319
Merge branch 'backport-5249-statschannel-limit-http-header-size-9.18' into 'bind-9.18'
Michal Nowak [Tue, 25 Mar 2025 13:14:52 +0000 (14:14 +0100)]
Limit X-Bloat header size to 100KB
Otherwise curl 8.13 rejects the line with:
I:Check HTTP/1.1 keep-alive with truncated stream (21)
curl: option --header: error encountered when reading a file
curl: try 'curl --help' or 'curl --manual' for more information
Also, see https://github.com/curl/curl/pull/16572.
Evan Hunt [Tue, 25 Mar 2025 07:34:26 +0000 (07:34 +0000)]
[9.18] fix: usr: Don't enforce NOAUTH/NOCONF flags in DNSKEYs
All DNSKEY keys are able to authenticate. The `DNS_KEYTYPE_NOAUTH` (and `DNS_KEYTYPE_NOCONF`) flags were defined for the KEY rdata type, and are not applicable to DNSKEY. Previously, however, because the DNSKEY implementation was built on top of KEY, the `_NOAUTH` flag prevented authentication in DNSKEYs as well. This has been corrected.
Closes #5240
Backport of MR !10261
Merge branch 'backport-5240-ignore-noauth-flag-9.18' into 'bind-9.18'
Evan Hunt [Fri, 14 Mar 2025 00:44:49 +0000 (17:44 -0700)]
Don't check DNS_KEYFLAG_NOAUTH
All DNSKEY keys are able to authenticate. The DNS_KEYTYPE_NOAUTH
(and DNS_KEYTYPE_NOCONF) flags were defined for the KEY rdata type,
and are not applicable to DNSKEY.
Previously, because the DNSKEY implementation was built on top of
KEY, the NOAUTH flag prevented authentication in DNSKEYs as well.
This has been corrected.
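The corrected check amounts to gating the NOAUTH test on the rdata type. The sketch below uses invented DEMO_* constants (the real flag and type values live in BIND 9's keyvalues.h); only the shape of the logic is taken from the commit message:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative rdata-type codes and flag bit; not BIND 9's values. */
enum { DEMO_TYPE_KEY = 25, DEMO_TYPE_DNSKEY = 48 };
#define DEMO_KEYTYPE_NOAUTH 0x8000

/*
 * Only the KEY rdata type honors the NOAUTH bit; for DNSKEY the
 * flag carries no meaning, so authentication is always possible.
 */
static bool
demo_key_can_authenticate(uint16_t rdtype, uint16_t flags) {
	if (rdtype == DEMO_TYPE_KEY &&
	    (flags & DEMO_KEYTYPE_NOAUTH) != 0) {
		return false;
	}
	return true; /* all DNSKEY keys are able to authenticate */
}
```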
Evan Hunt [Thu, 13 Mar 2025 19:20:40 +0000 (12:20 -0700)]
Tidy up keyvalue.h definitions
Use enums for DNS_KEYFLAG_, DNS_KEYTYPE_, DNS_KEYOWNER_, DNS_KEYALG_,
and DNS_KEYPROTO_ values.
Remove values that are never used.
Eliminate the obsolete DNS_KEYFLAG_SIGNATORYMASK. Instead, add three
more RESERVED bits for the key flag values that it covered but which
were never used.
Artem Boldariev [Mon, 24 Mar 2025 09:34:21 +0000 (09:34 +0000)]
chg: usr: Fix network manager issue when both success and timeout callbacks can be called for the same read request
This commit simplifies the code flow in tls_cycle_input() and makes
the incoming data processing similar to that in TCP DNS. In
particular, we now decipher all the incoming data before making a
single isc__nm_process_sock_buffer() call. Previously we would
decipher data bit-by-bit and process each deciphered bit
via isc__nm_process_sock_buffer(), which made the code much less
predictable, particularly in areas like pausing and resuming reads.
The newer approach also allowed us to get rid of some old kludges.
Closes #5247
Merge branch '5247-unexpected-callbacks' into 'bind-9.18'
Artem Boldariev [Wed, 19 Mar 2025 13:11:26 +0000 (15:11 +0200)]
TLS DNS: Simplify tls_cycle_input()
This commit simplifies the code flow in tls_cycle_input() and makes
the incoming data processing similar to that in TCP DNS. In
particular, we now decipher all the incoming data before making a
single isc__nm_process_sock_buffer() call. Previously we would
decipher data bit-by-bit and process each deciphered bit
via isc__nm_process_sock_buffer(), which made the code much less
predictable, particularly in areas like pausing and resuming reads.
The newer approach also allowed us to get rid of some old kludges.
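The restructured flow has this shape (all names here, demo_tls_read and demo_cycle_input, are hypothetical; the real code is the network-manager TLS layer):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DEMO_BUF_MAX 4096

/* Stand-in for the TLS incoming stream. */
typedef struct {
	const char *src;
	size_t	    len, off;
} demo_tls_t;

/* Returns up to 'max' deciphered bytes, 0 when drained. */
static size_t
demo_tls_read(demo_tls_t *tls, char *out, size_t max) {
	size_t n = tls->len - tls->off;
	if (n > max) {
		n = max;
	}
	memcpy(out, tls->src + tls->off, n);
	tls->off += n;
	return n;
}

/*
 * New flow: drain the TLS layer completely first, then hand the
 * whole buffer to the processing routine exactly once (the old
 * flow processed each deciphered bit immediately).
 */
static size_t
demo_cycle_input(demo_tls_t *tls, char *buf, size_t bufmax) {
	size_t total = 0, n;
	while ((n = demo_tls_read(tls, buf + total, bufmax - total)) > 0) {
		total += n; /* keep draining before processing */
	}
	/* a single isc__nm_process_sock_buffer()-style call goes here */
	return total;
}

static bool
demo_drains_in_one_pass(void) {
	demo_tls_t tls = { "hello world", 11, 0 };
	char buf[DEMO_BUF_MAX];
	return demo_cycle_input(&tls, buf, sizeof(buf)) == 11 &&
	       memcmp(buf, "hello world", 11) == 0;
}
```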
Nicki Křížek [Tue, 18 Mar 2025 13:20:06 +0000 (13:20 +0000)]
[9.18] chg: ci: Allow re-run of the shotgun jobs to reduce false positives
The false positive rate is about 10-20 % when evaluating shotgun results
from a single run. Attempt to reduce the false positive rate by allowing
a re-run of failed jobs.
Backport of MR !10271
Merge branch 'backport-nicki/ci-shotgun-reduce-false-positives-9.18' into 'bind-9.18'
Nicki Křížek [Wed, 12 Mar 2025 16:24:05 +0000 (17:24 +0100)]
Allow re-run of the shotgun jobs to reduce false positives
The false positive rate is about 10-20 % when evaluating shotgun results
from a single run. Attempt to reduce the false positive rate by allowing
a re-run of failed jobs.
While there is a slight risk that barely noticeable decreases in
performance might slip by more easily in MRs, they'd still likely pop up
during nightly or pre-release testing.
Also increase the tolerance threshold for DoH latency comparisons, as
those tests often experience increased jitter in the tail end latencies.
Mark Andrews [Sat, 15 Mar 2025 00:33:04 +0000 (00:33 +0000)]
[9.18] fix: test: Tune many types tests in reclimit test
The `I:checking that lifting the limit will allow everything to get
cached (20)` test was failing due to the TTL of the records being
too short for the elapsed time of the test. Raise the TTL to fix
this and adjust other tests as needed.
Closes #5206
Backport of MR !10177
Merge branch 'backport-5206-tune-last-sub-test-of-reclimit-9.18' into 'bind-9.18'
Mark Andrews [Wed, 26 Feb 2025 21:36:54 +0000 (08:36 +1100)]
Tune many types tests in reclimit test
The 'I:checking that lifting the limit will allow everything to get
cached (20)' test was failing due to the TTL of the records being
too short for the elapsed time of the test. Raise the TTL to fix
this and adjust other tests as needed.
Mark Andrews [Wed, 23 Jun 2021 09:51:51 +0000 (19:51 +1000)]
Implement digest_sig and digest_rrsig for ZONEMD
ZONEMD needs to be able to digest SIG and RRSIG records. The signer
field can be compressed in SIG, so we need to call dns_name_digest().
While for RRSIG records the signer field is not compressed, the
canonical form has the signer field downcased (RFC 4034, section 6.2).
This also implies that compare_rrsig needs to downcase the signer field
during comparison.
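The downcased comparison rule can be illustrated on plain strings (the real code operates on wire-format dns_name_t objects; demo_signer_compare is an invented helper):

```c
#include <assert.h>
#include <ctype.h>

/*
 * RFC 4034 section 6.2: in canonical form the signer name is
 * lowercased, so comparisons must be case-insensitive octet by
 * octet.
 */
static int
demo_signer_compare(const char *a, const char *b) {
	for (; *a != '\0' && *b != '\0'; a++, b++) {
		int ca = tolower((unsigned char)*a);
		int cb = tolower((unsigned char)*b);
		if (ca != cb) {
			return ca - cb; /* compare downcased octets */
		}
	}
	return tolower((unsigned char)*a) - tolower((unsigned char)*b);
}
```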
Evan Hunt [Sun, 2 Mar 2025 05:03:51 +0000 (21:03 -0800)]
when recording an rr trace, use libtool
when running a system test with the USE_RR environment
variable set to 1, an rr trace is generated for named.
because rr wasn't run using libtool --mode=execute, the
trace was actually generated for the wrapper script produced
by libtool, not for the actual named binary.
Aram Sargsyan [Tue, 4 Mar 2025 10:49:30 +0000 (10:49 +0000)]
[9.18] fix: dev: Fix memory ordering issues with atomic operations in the quota.c module
Change all the non-locked operations on `quota->used` and
`quota->waiting` to "acq/rel" for inter-thread synchronization. Some
loads are left as "relaxed", because they are under a locked mutex
which also provides protection.
Also use relaxed memory ordering for `quota->max` and `quota->soft`,
as done in the main branch; possible ordering issues for these
variables are acceptable.
Closes #5018
Merge branch '5018-quota-memory-ordering-fixes-9.18' into 'bind-9.18'
Aram Sargsyan [Thu, 27 Feb 2025 16:48:52 +0000 (16:48 +0000)]
Fix memory ordering for operations with quota->used and quota->waiting
Change all the non-locked operations on 'quota->used' and
'quota->waiting' to "acq/rel" for inter-thread synchronization. Some
loads are left as "relaxed", because they are under a locked mutex
which also provides protection.
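A minimal sketch of those ordering choices, with hypothetical names (demo_quota_t, demo_quota_attach; the real structure is isc_quota in quota.c): acquire/release on the cross-thread counter, relaxed on the configuration field where ordering drift is acceptable.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
	atomic_uint_fast32_t used;
	atomic_uint_fast32_t max; /* relaxed: drift is acceptable */
} demo_quota_t;

static bool
demo_quota_attach(demo_quota_t *q) {
	uint_fast32_t max =
		atomic_load_explicit(&q->max, memory_order_relaxed);
	/* acq_rel: synchronizes with releases in other threads */
	uint_fast32_t prev =
		atomic_fetch_add_explicit(&q->used, 1,
					  memory_order_acq_rel);
	if (prev >= max) {
		/* over quota: undo the increment */
		atomic_fetch_sub_explicit(&q->used, 1,
					  memory_order_release);
		return false;
	}
	return true;
}

static void
demo_quota_detach(demo_quota_t *q) {
	atomic_fetch_sub_explicit(&q->used, 1, memory_order_release);
}

static bool
demo_quota_scenario(void) {
	demo_quota_t q;
	atomic_init(&q.used, 0);
	atomic_init(&q.max, 1);
	bool first = demo_quota_attach(&q);  /* fits within max */
	bool second = demo_quota_attach(&q); /* exceeds max */
	demo_quota_detach(&q);
	return first && !second && atomic_load(&q.used) == 0;
}
```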
Artem Boldariev [Tue, 25 Feb 2025 17:58:24 +0000 (19:58 +0200)]
DoH: Bump the active streams processing limit
This commit bumps the limit on the number of active streams (i.e. opened
streams for which a request has been received but a response is not yet
ready) to 60% of the total streams limit.
The previous limit turned out to be too tight as revealed by
longer (≥1h) runs of "stress:long:rpz:doh+udp:linux:*" tests.
Artem Boldariev [Tue, 25 Feb 2025 07:52:19 +0000 (09:52 +0200)]
DoH: Flush HTTP write buffer on an outgoing DNS message
Previously, the code would avoid sending any data, regardless of
what it was, unless:
a) The flush limit is reached;
b) There are no sends in flight.
This strategy is used to avoid too numerous send requests with little
amount of data. However, it has been proven to be too aggressive and,
in fact, harms performance in some cases (e.g., on longer (≥1h) runs
of "stress:long:rpz:doh+udp:linux:*").
Now, in addition to the listed cases, we also:
c) Flush the buffer and perform a send operation when there is an
outgoing DNS message passed to the code (which is indicated by the
presence of a send callback).
That helps improve performance for "stress:long:rpz:doh+udp:linux:*"
tests.
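The three flush conditions combine into a single predicate; the sketch below uses invented names (demo_should_flush and its parameters) to show the decision, not the actual DoH code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*demo_send_cb_t)(void);

static void
demo_cb(void) { /* stands in for a DNS-message send callback */
}

static bool
demo_should_flush(size_t buffered, size_t flush_limit,
		  unsigned int sends_in_flight, demo_send_cb_t send_cb) {
	if (buffered >= flush_limit) {
		return true; /* (a) flush limit reached */
	}
	if (sends_in_flight == 0) {
		return true; /* (b) no sends in flight */
	}
	if (send_cb != NULL) {
		return true; /* (c) outgoing DNS message present */
	}
	return false;
}
```

Condition (c) is the one this commit adds: a pending send callback marks an outgoing DNS message, and the buffer is flushed immediately for it.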
Artem Boldariev [Mon, 24 Feb 2025 16:32:23 +0000 (18:32 +0200)]
DoH: Limit the number of delayed IO processing requests
Previously, a function for continuing IO processing on the next UV
tick was introduced (http_do_bio_async()). The intention behind this
function was to ensure that http_do_bio() is eventually called at
least once in the future. However, the current implementation allows
queueing multiple such delayed requests needlessly. There is currently
no need for these excessive requests as http_do_bio() can requeue them
if needed. At the same time, each such request can lead to a memory
allocation, particularly in BIND 9.18.
This commit ensures that the number of enqueued delayed IO processing
requests never exceeds one in order to avoid potentially bombarding IO
threads with the delayed requests needlessly.
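The dedup guard can be sketched with a single pending flag (demo_http_t and the helpers are hypothetical names for the http_do_bio_async() mechanism described above):

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
	bool	     bio_pending; /* a delayed request is already queued */
	unsigned int enqueued;	  /* how many requests were really queued */
} demo_http_t;

static void
demo_do_bio_async(demo_http_t *h) {
	if (h->bio_pending) {
		return; /* one is enough; it can requeue itself */
	}
	h->bio_pending = true;
	h->enqueued++; /* real code allocates and queues a callback here */
}

static void
demo_bio_done(demo_http_t *h) {
	h->bio_pending = false;
}

static bool
demo_dedup_scenario(void) {
	demo_http_t h = { false, 0 };
	demo_do_bio_async(&h);
	demo_do_bio_async(&h); /* coalesced with the pending request */
	demo_bio_done(&h);
	demo_do_bio_async(&h); /* allowed again after the first one ran */
	return h.enqueued == 2;
}
```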
Artem Boldariev [Thu, 20 Feb 2025 20:08:01 +0000 (22:08 +0200)]
DoH: Simplify http_do_bio()
This commit significantly simplifies the code flow in the
http_do_bio() function, which is responsible for processing incoming
and outgoing HTTP/2 data. It seems that the way it was structured
before was indirectly caused by the presence of the missing callback
calls bug, fixed in 8b8f4d500d9c1d41d95d34a79c8935823978114c.
The change introduced by this commit is known to remove a bottleneck
and allows reproducible and measurable performance improvement for
long runs (>= 1h) of "stress:long:rpz:doh+udp:linux:*" tests.
Additionally, it fixes a similar issue with potentially missing send
callback calls processing and hardens the code against use-after-free
errors related to the session object (they can potentially occur).