Matthijs Mekking [Tue, 27 Jun 2023 14:29:50 +0000 (16:29 +0200)]
Add test for "three is a crowd" bug (GL #2375)
Add this test scenario for a bug fixed a while ago. When a third key is
introduced while the previous rollover hasn't finished yet, the keymgr
could decide to remove the first two keys, because it was not checking
for an indirect dependency on the keys.
In other words, the previous bug behavior was that the first two keys
were removed from the zone too soon.
This test case checks that all three keys stay in the zone, and no keys
are removed premature after another new key has been introduced.
Matthijs Mekking [Tue, 27 Jun 2023 14:27:35 +0000 (16:27 +0200)]
Check all keys despite early failure
In the kasp script, if one expected key is not found, continue checking
the other key ids, even if there is no match for the first one. This
provides a bit more information which keys mismatch and makes for
easier debugging test failures.
Tom Krizek [Mon, 26 Jun 2023 15:14:16 +0000 (17:14 +0200)]
Fix checking for executables in shell conditions in tests
Surround the variables which are checked whether they're executable in
double quotes. Without them, empty paths won't be properly interpreted
as not executable.
Tom Krizek [Mon, 26 Jun 2023 14:40:03 +0000 (16:40 +0200)]
Disable delv tests under TSAN
Since delv can occasionally hang in system tests when running with TSAN
(see GL#4119), disable these tests as a workaround. Otherwise, the hung
delv process will just waste CI resources and prevent any meaningful
output from the rest of the test suite.
Tom Krizek [Thu, 22 Jun 2023 16:08:17 +0000 (18:08 +0200)]
Check for proper file size output in dnstap test
Previously, the first check silently failed, as 450 is apparently (in
the CI) the minimum output size for the dnstap output, rather than
470 which the test was expecting. Effectively, the check served as a 5
second sleep rather than waiting for the proper file size.
Additionally, check the expected file sizes and fail if expectations
aren't met.
Tom Krizek [Thu, 22 Jun 2023 15:57:22 +0000 (17:57 +0200)]
Check for proper log message in kasp test
The log message is supposed to contain the zone name which was
erroneously omitted, but didn't pop up during tests, since return code
was silently ignored.
Now it actually waits for the proper log message rather than being an
equivalent of 3 second sleep (which was also sufficient to make the test
pass, thus we detected no failure).
Michał Kępień [Thu, 15 Jun 2023 14:17:14 +0000 (16:17 +0200)]
Fix entity renumbering in util/parse_tsan.py
util/parse_tsan.py builds tables of mutexes, threads, and pointers it
finds in the TSAN report provided to it as a command-line argument and
then replaces all mentions of each of these entities so that they are
numbered sequentially in the processed report. For example, this line:
Cycle in lock order graph: M0 (...) => M5 (...) => M9 (...) => M0
is expected to become:
Cycle in lock order graph: M1 (...) => M2 (...) => M3 (...) => M1
Problems arise when the gaps between mutex/thread identifiers present on
a single line are smaller than the total number of mutexes/threads found
by the script so far. For example, the following line:
Cycle in lock order graph: M0 (...) => M1 (...) => M2 (...) => M0
first gets turned into:
Cycle in lock order graph: M1 (...) => M1 (...) => M2 (...) => M1
and then into:
Cycle in lock order graph: M2 (...) => M2 (...) => M2 (...) => M2
In other words, lines like this become garbled due to information loss.
The problem stems from the fact that the numbering scheme the script
uses for identifying mutexes and threads is exactly the same as the one
used by TSAN itself. Update util/parse_tsan.py so that it uses
zero-padded numbers instead, making the "overlapping" demonstrated above
impossible.
Tom Krizek [Fri, 2 Jun 2023 08:53:42 +0000 (10:53 +0200)]
Adjust the respdiff failure threshold for a new dataset
This is just a slight tweak for the respdiff CI test. The new dataset
has a different set of queries and it results in a slightly more
SERVFAILs rather than timeouts in the respdiff-long-third-party test.
In our comparison script, timeouts are not counted towards the
threshold. While the total number of differences remains roughly the
same, the different distributions of them (among SERVFAIL vs timeout)
warrants a slight bump in the threshold in order to avoid test failures.
Tom Krizek [Tue, 13 Jun 2023 08:52:01 +0000 (10:52 +0200)]
Avoid false positive in serve-stale system test check
The purpose of the check is to verify the server has survived the
previous barrage of queries. This is done by sending a query and
checking we get a NOERROR response back.
Previously, that query could've been affected by a servfail cache - the
server would return a SERVFAIL answer, thus failing the check, despite
being up and running. Use version.bind txt ch query to avoid the
interference of servfail cache.
The 'refresh_rrset' variable is used to determine if we can detach from
the client. This can cause a hang on shutdown. To fix this, move setting
of the 'nodetach' variable up to where 'refresh_rrset' is set (in
query_lookup(), and thus not in ns_query_done()), and set it to false
when actually refreshing the RRset, so that when this lookup is
completed, the client will be detached.
Evan Hunt [Fri, 26 May 2023 06:53:50 +0000 (23:53 -0700)]
Stale answer lookups could loop when over recursion quota
When a query was aborted because of the recursion quota being exceeded,
but triggered a stale answer response and a stale data refresh query,
it could cause named to loop back where we are iterating and following
a delegation. Having no good answer in cache, we would fall back to
using serve-stale again, use the stale data, try to refresh the RRset,
and loop back again, without ever terminating until crashing due to
stack overflow.
This happens because in the functions 'query_notfound()' and
'query_delegation_recurse()', we check whether we can fall back to
serving stale data. We shouldn't do so if we are already refreshing
an RRset due to having prioritized stale data in cache.
In other words, we need to add an extra check to 'query_usestale()' to
disallow serving stale data if we are currently refreshing a stale
RRset.
As an additional mitigation to prevent looping, we now use the result
code ISC_R_ALREADYRUNNING rather than ISC_R_FAILURE when a recursion
loop is encountered, and we check for that condition in
'query_usestale()' as well.
Ondřej Surý [Tue, 30 May 2023 06:46:17 +0000 (08:46 +0200)]
Improve RBT overmem cache cleaning
When cache memory usage is over the configured cache size (overmem) and
we are cleaning unused entries, it might not be enough to clean just two
entries if the entries to be expired are smaller than the newly added
rdata. This could be abused by an attacker to cause a remote Denial of
Service by possibly running out of the operating system memory.
Currently, the addrdataset() tries to do a single TTL-based cleaning
considering the serve-stale TTL and then optionally moves to overmem
cleaning if we are in that condition. Then the overmem_purge() tries to
do another single TTL based cleaning from the TTL heap and then continue
with LRU-based cleaning up to 2 entries cleaned.
Squash the TTL-cleaning mechanism into single call from addrdataset(),
but ignore the serve-stale TTL if we are currently overmem.
Then instead of having a fixed number of entries to clean, pass the size
of newly added rdatasetheader to the overmem_purge() function and
cleanup at least the size of the newly added data. This prevents the
cache going over the configured memory limit (`max-cache-size`).
Additionally, refactor the overmem_purge() function to reduce for-loop
nesting for readability.
Michal Nowak [Wed, 31 May 2023 11:14:19 +0000 (13:14 +0200)]
Disable minimal update check with no keys on Windows
The $t1 value equals $t2 due to the time elapsed between "rndc
managed-keys status" calls being equal to the normal active refresh
period (as calculated per rules listed in RFC 5011 section 2.3) minus an
"hour" (as set using -T mkeytimers). This value equality is expected to
happen on really slow machines. On our Windows CI runner, it happens
very often.
Michal Nowak [Fri, 26 May 2023 08:50:58 +0000 (10:50 +0200)]
Change images for TSAN jobs
Fedora 38 and Debian "bullseye" images were "forked" to images used only
for TSAN CI jobs. The new images contain TSAN-aware liburcu that does
not fit well with ASAN CI jobs for which original images were also used.
liburcu is not used in this branch, but images are shared among
branches, and their use needs to be consistent in all maintained
branches.
We recently fixed a bug where in some cases (when following an
expired CNAME for example), named could return SERVFAIL if the target
record is still valid (see isc-projects/bind9#3678, and
isc-projects/bind9!7096). We fixed this by considering non-stale
RRsets as well during the stale lookup.
However, this triggered a new bug because despite the answer from
cache not being stale, the lookup may be triggered by serve-stale.
If the answer from database is not stale, the fix in
isc-projects/bind9!7096 erroneously skips the serve-stale logic.
Add 'answer_found' checks to the serve-stale logic to fix this issue.
Add a test case where when priming the cache with a slow authoritative
resolver, the stale-answer-client-timeout option should not return
a delegation to the client (it should wait until an applicable answer
is found, if no entry is found in the cache).
Michal Nowak [Thu, 18 May 2023 10:53:21 +0000 (12:53 +0200)]
Disable exceeded quota check on Windows
This check is too unstable on Windows. Given the bind-9.16 branch is in
security fixes-only mode, something unlikely to be investigated before
the branch goes EOL.
Mark Andrews [Wed, 10 May 2023 22:20:54 +0000 (08:20 +1000)]
Remove init and deinit from fuzz.h
Constructors and destructors for the main program are not reliable
as they may be called before constructors for shared libraries they
depend upon or be called after destructors of shared libraries they
depend upon.
Aram Sargsyan [Thu, 11 May 2023 12:08:13 +0000 (12:08 +0000)]
Check whether zone->db is a valid pointer before attaching
The zone_resigninc() function does not check the validity of
'zone->db', which can crash named if the zone was unloaded earlier,
for example with "rndc delete".
Check that 'zone->db' is not 'NULL' before attaching to it, like
it is done in zone_sign() and zone_nsec3chain() functions, which
can similarly be called by zone maintenance.
Tom Krizek [Thu, 6 Apr 2023 12:05:30 +0000 (14:05 +0200)]
Ensure named always terminates in the shutdown test
Previously, if an exception would happen inside the `with` block, the
error handler would wait indefinitely for the process to end. That would
never happen, since the termination signal was never sent to named and
the test would get stuck.
Using the try-finally block ensures that the named process is always
killed and any exception or errors will be handled gracefully.
Tom Krizek [Thu, 6 Apr 2023 12:01:43 +0000 (14:01 +0200)]
Refactor shutdown test into more helper functions
Improve code readability by splitting the test into more functions. Some
could be re-used later on for more general-purpose subprocess handling
or named checks.
Mark Andrews [Wed, 19 Apr 2023 00:34:49 +0000 (10:34 +1000)]
Check the pointer alignments when deserialising
deserialize_corrupt_test may corrupt the pointers such that they
is no longer properly aligned. Check that the alignment is consistent
with memory returned from isc_mem before checking the magic value.