Tom Krizek [Mon, 26 Jun 2023 13:19:35 +0000 (15:19 +0200)]
Use timeout for rndc status in shutdown test
Pass 5 second timeout to the rndc status command(s) to avoid hitting the
hard 10 second timeout from subprocess.call, which would result in an
unwanted exception that would only mask the real issue: if the rndc
status times out in this test, it is likely due to the server not
stopping as it should.
Tom Krizek [Mon, 26 Jun 2023 13:18:45 +0000 (15:18 +0200)]
Split shutdown test into separate test cases
The shutdown test attempts to shut down the server using two different
methods - rndc and sigterm. Use pytest.mark.parametrize to run these as
separate test cases for easier identification of failures.
Evan Hunt [Wed, 28 Jun 2023 21:52:53 +0000 (14:52 -0700)]
fix a TSAN bug in "rndc fetchlimit"
fctx counters could be accessed without locking when
"rndc fetchlimit" is called; while this is probably harmless
in production, it triggered TSAN reports in system tests.
Evan Hunt [Wed, 28 Jun 2023 16:28:01 +0000 (09:28 -0700)]
minor refactoring of resume_qmin() for clarity
make the code flow clearer by enumerating the result codes that
are treated as success conditions for an intermediate minimized
query (ISC_R_SUCCESS, DNS_R_DELEGATION, DNS_R_NXRRSET, etc), rather
than just folding them all into the 'default' branch of a switch
statement.
Tom Krizek [Mon, 26 Jun 2023 15:14:16 +0000 (17:14 +0200)]
Fix checking for executables in shell conditions in tests
Surround the variables which are checked whether they're executable in
double quotes. Without them, empty paths won't be properly interpreted
as not executable.
Tom Krizek [Mon, 26 Jun 2023 14:40:03 +0000 (16:40 +0200)]
Disable delv tests under TSAN
Since delv can occasionally hang in system tests when running with TSAN
(see GL#4119), disable these tests as a workaround. Otherwise, the hung
delv process will just waste CI resources and prevent any meaningful
output from the rest of the test suite.
Mark Andrews [Mon, 19 Jun 2023 07:49:05 +0000 (17:49 +1000)]
Test legacy HMAC key files with dig
tsig-keygen is now used to generate key files for TSIG. These have
a different format to those that were generated by dnssec-keygen.
Test that dig can still read these files.
Mark Andrews [Mon, 19 Jun 2023 04:14:39 +0000 (14:14 +1000)]
Test support with legacy HMAC K files with nsupdate
tsig-keygen generates key files that are different to those that
where generated by dnssec-keygen. Check that nsupdate can still
read those old format files.
Mark Andrews [Mon, 19 Jun 2023 04:17:14 +0000 (14:17 +1000)]
Restore the ability to read legacy K*+157+* files
The ability to read legacy HMAC-MD5 K* keyfile pairs using algorithm
number 157 was accidentally lost when the algorithm numbers were
consolidated into a single block, in commit 09f7e0607a34d90eae53f862954e98c31b5ae532.
The assumption was that these algorithm numbers were only known
internally, but they were also used in key files. But since HMAC-MD5
got renumbered from 157 to 160, legacy HMAC-MD5 key files no longer
work.
Move HMAC-MD5 back to 157 and GSSAPI back to 160. Add exception for
GSSAPI to list_hmac_algorithms.
Tony Finch [Tue, 6 Jun 2023 14:20:44 +0000 (15:20 +0100)]
Check for overflow in jemalloc_shim
When compiled using a malloc that lacks an equivalent to sallocx(),
the jemalloc_shim adds a size prefix to each allocation. We must check
that this does not overflow.
Tony Finch [Tue, 6 Jun 2023 14:11:13 +0000 (15:11 +0100)]
Add <isc/overflow.h> for checked mul, add, and sub
The `ISC_OVERFLOW_XXX()` macros are usually wrappers around
`__builtin_xxx_overflow()`, with alternative implementations
for compilers that lack the builtins.
Replace the overflow checks in `isc/time.c` with the new macros.
Ondřej Surý [Wed, 7 Jun 2023 13:23:08 +0000 (15:23 +0200)]
Use per-loop memory contexts for dns_resolver child objects
The dns_resolver creates a lot of smaller objects (fetch context, fetch
counter, query, response, ...) and those are all loop-bound.
Previously, those objects were allocated from the a single resolver
context, which in turn increases contention between threads - remember
"dead by thousand atomic paper cuts". Instead of using a single memory
context, use the per-loop memory contexts that are bound to a specific
loop and thus there's no contention between them when doing the memory
accounting.
Ondřej Surý [Mon, 26 Jun 2023 13:58:48 +0000 (15:58 +0200)]
Remove the explicit call_rcu thread creating and destruction
The free_all_cpu_call_rcu_data() call can consume hundreds of
milliseconds on shutdown. Don't try to be smart and let the RCU library
handle this internally.
Evan Hunt [Fri, 2 Jun 2023 00:14:49 +0000 (17:14 -0700)]
explicitly set dnssec-validation in system tests
the default value of dnssec-validation is 'auto', which causes
a server to send a key refresh query to the root zone when starting
up. this is undesirable behavior in system tests, so this commit
sets dnssec-validation to either 'yes' or 'no' in all tests where
it had not previously been set.
this change had the mostly-harmless side effect of changing the cached
trust level of unvalidated answer data from 'answer' to 'authanswer',
which caused a few test cases in which dumped cache data was examined in
the serve-stale system test to fail. those test cases have now been
updated to expect 'authanswer'.
Tom Krizek [Thu, 22 Jun 2023 16:08:17 +0000 (18:08 +0200)]
Check for proper file size output in dnstap test
Previously, the first check silently failed, as 454 is apparently (in my
local setup) the minimum output size for the dnstap output, rather than
470 which the test was expecting. Effectively, the check served as a 5
second sleep rather than waiting for the proper file size.
Additionally, check the expected file sizes and fail if expectations
aren't met.
Tom Krizek [Thu, 22 Jun 2023 15:57:22 +0000 (17:57 +0200)]
Check for proper log message in kasp test
The log message is supposed to contain the zone name which was
erroneously omitted, but didn't pop up during tests, since return code
was silently ignored.
Now it actually waits for the proper log message rather than being an
equivalent of 3 second sleep (which was also sufficient to make the test
pass, thus we detected no failure).
In HTTP/1.0 and HTTP/1.1, RFC 9112 section 9.6 says the last response
in a connection should include a `Connection: close` header, but the
statschannel server omitted it.
In an HTTP/1.0 response, the statschannel server can sometimes send a
`Connection: keep-alive` header when it is about to close the
connection. There are two ways:
If the first request on a connection is keep-alive and the second
request is not, then _both_ responses have `Connection: keep-alive`
but the connection is (correctly) closed after the second response.
If a single request contains
Connection: close
Connection: keep-alive
then RFC 9112 section 9.3 says the keep-alive header is ignored, but
the statschannel sends a spurious keep-alive in its response, though
it correctly closes the connection.
To fix these bugs, make it more clear that the `httpd->flags` are part
of the per-request-response state. The Connection: flags are now
described in terms of the effect they have instead of what causes them
to be set.
Michał Kępień [Thu, 15 Jun 2023 14:17:14 +0000 (16:17 +0200)]
Fix entity renumbering in util/parse_tsan.py
util/parse_tsan.py builds tables of mutexes, threads, and pointers it
finds in the TSAN report provided to it as a command-line argument and
then replaces all mentions of each of these entities so that they are
numbered sequentially in the processed report. For example, this line:
Cycle in lock order graph: M0 (...) => M5 (...) => M9 (...) => M0
is expected to become:
Cycle in lock order graph: M1 (...) => M2 (...) => M3 (...) => M1
Problems arise when the gaps between mutex/thread identifiers present on
a single line are smaller than the total number of mutexes/threads found
by the script so far. For example, the following line:
Cycle in lock order graph: M0 (...) => M1 (...) => M2 (...) => M0
first gets turned into:
Cycle in lock order graph: M1 (...) => M1 (...) => M2 (...) => M1
and then into:
Cycle in lock order graph: M2 (...) => M2 (...) => M2 (...) => M2
In other words, lines like this become garbled due to information loss.
The problem stems from the fact that the numbering scheme the script
uses for identifying mutexes and threads is exactly the same as the one
used by TSAN itself. Update util/parse_tsan.py so that it uses
zero-padded numbers instead, making the "overlapping" demonstrated above
impossible.
Ondřej Surý [Thu, 15 Jun 2023 08:24:21 +0000 (10:24 +0200)]
Make isc_result tables smaller
The isc_result_t enum was to sparse when each library code would skip to
next << 16 as a base. Remove the huge holes in the isc_result_t enum to
make the isc_result tables more compact.
This change required a rewrite how we map dns_rcode_t to isc_result_t
and back, so we don't ever return neither isc_result_t value nor
dns_rcode_t out of defined range.
Ondřej Surý [Thu, 15 Jun 2023 08:24:21 +0000 (10:24 +0200)]
Refactor how we map isc_result_t <-> dns_rcode_t
The mapping functions between isc_result_t and dns_rcode_t could return
both isc_result_t values not defined in the header and dns_rcode_t
values not defined in the header because it blindly maps anything
withing full 12-bits defined for RCODEs to isc_result_t and back.
Refactor the dns_result_{from,to}rcode() functions to always return
valid isc_result_t and dns_rcode_t values by explicitly mapping the
values to each other and returning DNS_R_SERVFAIL (dns_rcode_servfail)
when encountering value out of the defined range.
Tom Krizek [Thu, 15 Jun 2023 08:48:21 +0000 (10:48 +0200)]
Mark CI failure of system:clang:tsan as an error again
Both the issues causing frequent failures have been resolved. The job
seems to have stabilized and there's no longer a need to mark the
failure as a mere warnings.
Aram Sargsyan [Fri, 9 Jun 2023 07:13:27 +0000 (07:13 +0000)]
Fix a data race between the dns_zone and dns_catz modules
The dns_zone_catz_enable_db() and dns_zone_catz_disable_db()
functions can race with similar operations in the catz module
because there is no synchronization between the threads.
Add catz functions which use the view's catalog zones' lock
when registering/unregistering the database update notify callback,
and use those functions in the dns_zone module, instead of doing it
directly.
Tony Finch [Fri, 9 Jun 2023 07:22:47 +0000 (08:22 +0100)]
CHANGES note for [GL #4134]
[cleanup] Report "permission denied" instead of "unexpected error"
when trying to update a zone file is on a read-only file
system. Thanks to Midnight Veil. [GL #4134]
Mark Andrews [Thu, 8 Jun 2023 06:58:45 +0000 (16:58 +1000)]
Use RCU for view->adb access
view->adb may be referenced while the view is shutting down as the
zone uses a weak reference to the view and examines view->adb but
dns_view_detach call dns_adb_detach to clear view->adb.
- style fixes and general tidying-up in tkey.c
- remove the unused 'intoken' parameter from dns_tkey_buildgssquery()
- remove an unnecessary call to dns_tkeyctx_create() in ns_server_create()
(the TKEY context that was created there would soon be destroyed and
another one created when the configuration was loaded).
purely to assuage my desire for consistency across modules,
result variables have been renamed to 'result' as they are
throughout most of BIND. there are no other changes.
this function was no longer needed, because the algorithm name is no
longer copied into the tsigkey object by dns_tsigkey_createfromkey();
it's always just a pointer to a statically defined name.
use algorithm number instead of name to create TSIG keys
the prior practice of passing a dns_name containing the
expanded name of an algorithm to dns_tsigkey_create() and
dns_tsigkey_createfromkey() is unnecessarily cumbersome;
we can now pass the algorithm number instead.
- remove the 'ring' parameter from dns_tsigkey_createfromkey(),
and use dns_tsigkeyring_add() to add key objects to a keyring instead.
- add a magic number to dns_tsigkeyring_t
- change dns_tsigkeyring_dumpanddetach() to dns_tsigkeyring_dump();
we now call dns_tsigkeyring_detach() separately.
- remove 'maxgenerated' from dns_tsigkeyring_t since it never changes.
- style cleanups.
- simplify the function parameters to dns_tsigkey_create():
+ remove 'restored' and 'generated', they're only ever set to false.
+ remove 'creator' because it's only ever set to NULL.
+ remove 'inception' and 'expiry' because they're only ever set to
(0, 0) or (now, now), and either way, this means "never expire".
+ remove 'ring' because we can just use dns_tsigkeyring_add() instead.
- rename dns_keyring_restore() to dns_tsigkeyring_restore() to match the
rest of the functions operating on dns_tsigkeyring objects.
Matthijs Mekking [Wed, 14 Jun 2023 07:05:31 +0000 (09:05 +0200)]
Update findzonekeys function name in log message
The "dns_dnssec_findzonekeys2" log message is a leftover from when that
was the name of the function. Rename to match the current name of the
function.
Matthijs Mekking [Tue, 13 Jun 2023 15:45:08 +0000 (17:45 +0200)]
Add dynamic update prepub and doubleksk test case
Add two test cases for zones that use auto-dnssec, but not
inline-signing, and make sure that the change for find_zone_keys()
do not affect introducing a new key that is intended for signing.
See note https://gitlab.isc.org/isc-projects/bind9/-/merge_requests/7638#note_355944
The find_zone_keys() function was not working properly for
inline-signed zones. It only worked if the DNSKEY records were also
published in the unsigned version of the zone. But this is not the
case when you use dnssec-policy, the DNSKEY records will only occur
in the signed version of the zone. Therefor, when looking for keys
to sign the zone, only the newly added keys in the dynamic update
were found (which could be zero), ignoring existing keys.
Also, if a DNSKEY was added, it would try to sign the zone with just
this new key, and this would only work if the key files for that key
were imported into the key-directory.
This is a design error, because the goal is to sign the zone with the
keys for which we actually have key files for. So instead of looking
for DNSKEY records to then search for the matching key files, call
dns_dnssec_findmatchingkeys() which just looks for the keys we have
on disk for the given zone. It will also set the correct DNSSEC
signing hints.
Matthijs Mekking [Tue, 13 Jun 2023 13:59:53 +0000 (15:59 +0200)]
Add log check in multisigner system test
When we add DNSKEY records via dynamic update, this should no longer
trigger signing the zone with these keys. This currently happens when
'find_zone_keys()' looks up the keys by inspecting the DNSKEY RRset,
then attempting to read the corresponding key files.
Add checks that inspect the logs whether an attempt to read the key
files for the newly added keys was done (and failed because these files
are not available).
Aram Sargsyan [Tue, 13 Jun 2023 09:58:29 +0000 (09:58 +0000)]
Fix catz db update callback registration logic error
When a catalog zone is updated using AXFR, the zone database is changed,
so it is required to unregister the update notification callback from
the old database, and register it for the new one.
Currently, here is the order of the steps happening in such scenario:
1. The zone.c:zone_startload() function registers the notify callback
on the new database using dns_zone_catz_enable_db()
2. The callback, when called, notices that the new 'db' is different
than 'catz->db', and unregisters the old callback for 'catz->db',
marks that it's unregistered by setting 'catz->db_registered' to
false, then it schedules an update if it isn't already scheduled.
3. The offloaded update process, after completing its job, notices that
'catz->db_registered' is false, and (re)registers the update callback
for the current database it is working on. There is no harm here even
if it was registered also on step 1, and we can't skip it, because
this function can also be called "artificially" during a
reconfiguration, and in that case the registration step is required
here.
A problem arises when before step 1 an update process was already
in a running state, operating on the old database, and finishing its
work only after step 2. As described in step 3, dns__catz_update_cb()
notices that 'catz->db_registered' is false and registers the callback
on the current database it is working on, which, at that state, is
already obsolete and unused by the zone. When it detaches the database,
the function which is responsible for its cleanup (e.g. free_rbtdb())
asserts because there is a registered update notify callback there.
To fix the problem, instead of delaying the (re)registration to step 3,
make sure that the new callback is registered and 'catz->db_registered'
is accordingly marked on step 2.
Tom Krizek [Tue, 13 Jun 2023 08:52:01 +0000 (10:52 +0200)]
Avoid false positive in serve-stale system test check
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.
Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.