Aram Sargsyan [Mon, 27 Mar 2023 10:56:22 +0000 (10:56 +0000)]
Fix a data race in dns__catz_update_cb()
The dns__catz_update_cb() function was earlier updated (see d2ecff3c4a0d961041b860515858d258d40462d7) to use a separate
'dns_db_t' object ('catz->updb' instead of 'catz->db') to
avoid a race between the 'dns__catz_update_cb()' and
'dns_catz_dbupdate_callback()' functions, but the 'REQUIRE'
check there still checks the validity of the 'catz->db' object.
Evan Hunt [Wed, 22 Mar 2023 22:01:30 +0000 (15:01 -0700)]
deprecate delegation-only and root-delegation only
These options and zone type were created to address the
SiteFinder controversy, in which certain TLD's redirected queries
rather than returning NXDOMAIN. since TLD's are now DNSSEC-signed,
this is no longer likely to be a problem.
The deprecation message for 'type delegation-only' is issued from
the configuration checker rather than the parser. therefore,
isccfg_check_namedconf() has been modified to take a 'nodeprecate'
parameter to suppress the warning when named-checkconf is used with
the command-line option to ignore warnings on deprecated options (-i).
Ondřej Surý [Thu, 23 Mar 2023 09:48:39 +0000 (10:48 +0100)]
Properly handle ISC_R_SHUTTINGDOWN in resquery_response()
When resquery_response() was called with ISC_R_SHUTTINDOWN, the region
argument would be NULL, but rctx_respinit() would try to pass
region->base and region->len to the isc_buffer_init() leading to
a NULL pointer dereference. Properly handle non-ISC_R_SUCCESS by
ignoring the provided region.
Tom Krizek [Mon, 13 Mar 2023 16:45:20 +0000 (17:45 +0100)]
Lighten the CI load during the dupsigs test
Previously, an AXFR request would be issued every second while waiting
for the zone to be signed. This might've been the cause of issues in CI
where many tests are running in parallel and any extra load may increase
test instability.
Instead, check for the last NSEC record to have a signature before
commencing the AXFR request to check the zone has been fully signed.
Also increase the time for the zone signing to a total of 60+10 seconds
up from the previous 30.
Tom Krizek [Mon, 13 Mar 2023 14:13:44 +0000 (15:13 +0100)]
Redirect dupsigs test output to proper logger
Ensure messages from dupsigs system test end up in its log rather than
stdout. Previously, the output was hard to debug when running the tests
in parallel and messages wouldn't end up in the dupsigs.log.
Aram Sargsyan [Tue, 21 Mar 2023 11:42:28 +0000 (11:42 +0000)]
Hold a catz reference while the update process is running
This should delay the catalog zone from being destroyed during
shutdown, if the update process is still running.
Doing this should not introduce significant shutdown delays, as
the update function constantly checks the 'shuttingdown' flag
and cancels the process if it is set.
Tom Krizek [Thu, 9 Mar 2023 12:33:31 +0000 (13:33 +0100)]
Use the default retention time for CI artifacts
The instance-wide GitLab CI artifact retention time was changed to 1 day
up from the previous value of 12 hours. Remove our explicit overrides
for 1 day artifact retention time, as it is the default now.
Previously, most of our jobs had overrides for 1 day retention, while
some of our jobs used the default 12 hours. This discrepancy could be
quite impractical at times.
Artem Boldariev [Tue, 7 Mar 2023 21:16:11 +0000 (23:16 +0200)]
DoT: remove TLS-related kludge in isc__nmsocket_connecttimeout_cb()
This commit ensures that 'sock->tls.pending_req' is not getting
nullified during TLS connection timeout callback as it prevents the
connection callback being called when connecting was not successful.
We expect 'isc__nm_failed_connect_cb() to be called from
'isc__nm_tlsdns_shutdown()' when establishing connections was
successful, but with 'sock->tls.pending_req' nullified that will not
happen.
The code removed most likely was required in older iterations of the
NM, but to me it seems that now it does only harm. One of the well
know pronounced effects is leading to irrecoverable zone transfer
hangs via TLS.
Mark Andrews [Tue, 14 Mar 2023 02:13:14 +0000 (13:13 +1100)]
When signing with a new algorithm preserve NSEC/NSEC3 chains
If the zone already has existing NSEC/NSEC3 chains then zone_sign
needs to continue to use them. If there are no chains then use
kasp setting otherwise generate an NSEC chain.
Aram Sargsyan [Fri, 10 Mar 2023 11:07:13 +0000 (11:07 +0000)]
Improve dnstap system test reliability
The dnstap system test fails intermittently, and it appears to be
a timing issue - adding a short delay after running 'fstrm_capture',
and before running 'dnstap -reopen' improves the situation from
50% failures (5 out of 10 times) to 0% failures (0 out of 20 times),
tested locally.
The reason is that 'fstrm_capture' is executed in the background,
and due to OS scheduling and other factors, the listener socket
may not be ready when the following command runs and tells 'named'
to (re)open it.
Michal Nowak [Thu, 9 Mar 2023 10:10:53 +0000 (11:10 +0100)]
Drop parallel build from stress tests
BUILD_PARALLEL_JOBS environmental variable is set to 6, which does not
align well with 4 and 8 CPU core systems dedicated to CI "stress" tests.
When multiple parallel jobs run on the host, they compete for resources
with an undesirable result: 6 compiler processes of one job may starve
named, resulting in lower-than-expected throughput and minutes-long
query response latency spikes.
Better drop the build parallelism of BIND-under-test. About 1-2 minutes
are added to the 60-65 minutes long job duration.
Aram Sargsyan [Tue, 7 Mar 2023 14:08:52 +0000 (14:08 +0000)]
Fix the placement of printing dig output comments in doth system test
There can be comments in dig output for a zone transfer only in case
of an error, so we should print those errors not when wait_for_tls_xfer
succeeds, but when it fails.
Also, there is no point in printing those comments when a failure was
indeed expected.
Michal Nowak [Tue, 28 Feb 2023 16:49:43 +0000 (17:49 +0100)]
Build BIND in stress test jobs with common CFLAGS
By omission, BIND was not built with common CFLAGS in the stress test
jobs. Building with common CFLAGS and -Og should help GDB produce a
backtrace with more information.
The serve-stale system test was intermittently failing due to a timing
issue:
I:serve-stale:check stale data.example TXT was refreshed...
I:serve-stale:failed
The RRset is refreshed, however, it first checks for an expected log
line, prior checking that the stale data.example TXT was refreshed
(using dig). This log line is there to ensure the record is actually
refreshed before we start querying again. Alternatively we could just
retry_quiet 10 <wait for dig output matches expectations>. It would
lower the chances for intermittent test failures, since there is no
longer a "check for log line, sleep one second if check fails, check
for log line, ...", prior to the check.
Mark Andrews [Fri, 3 Mar 2023 03:04:34 +0000 (14:04 +1100)]
Now logs UV versions when starting up
Named now logs both compile time and run time UV versions when
starting up. This is useful information to have when debugging
network issues involving named.
Aram Sargsyan [Thu, 2 Mar 2023 08:52:25 +0000 (08:52 +0000)]
catz: use two pairs of dns_db_t and dns_dbversion_t in a catalog zone
As it is done in the RPZ module, use 'db' and 'dbversion' for the
database we are going to update to, and 'updb' and 'updbversion' for
the database we are working on.
Doing this should avoid a race between the 'dns__catz_update_cb()' and
'dns_catz_dbupdate_callback()' functions.
Aram Sargsyan [Wed, 1 Mar 2023 12:30:46 +0000 (12:30 +0000)]
Fix view's zones reverting bug during reconfiguration
During reconfiguration, the configure_view() function reverts the
configured zones to the previous view in case if there is an error.
It uses the 'zones_configured' boolean variable to decide whether
it is required to revert the zones, i.e. the error happened after
all the zones were successfully configured.
The problem is that it does not account for the case when an error
happens during the configuration of one of the zones (not the first),
in which case there are zones that are already configured for the
new view (and they need to be reverted), and there are zones that
are not (starting from the failed one).
Since 'zones_configured' remains 'false', the configured zones are
not reverted.
Replace the 'zones_configured' variable with a pointer to the latest
successfully configured zone configuration element, and when reverting,
revert up to and including that zone.
Aram Sargsyan [Wed, 1 Mar 2023 12:47:25 +0000 (12:47 +0000)]
Add a catz system test check for [GL #3911]
The trick is to configure a duplicate zone, which comes after the
catalog zone, where the duplicate zone is an existing member zone.
In that scenario, all the zones which come before the "faulty" zone
in the configuration file will fail to be reverted to the previous
version of the view after a reconfiguration error, and in this
particular case that will result in an assertion failure when the
catalog zone update is initiated, because it will be still tied to
the new version of the view, which was dismissed.
Mark Andrews [Thu, 23 Feb 2023 22:39:34 +0000 (09:39 +1100)]
Extract test coverage statistics from the gcov job
In older GitLab versions, the regular expression used for extracting
test coverage statistics from the output of GitLab CI jobs was
configured in the project's settings, using GitLab's web interface.
That changed in recent GitLab versions [1]; the previous configuration
method was removed from the web interface altogether as of GitLab 15.0.
The relevant regular expression is now supposed to be set in the
relevant job's definition in .gitlab-ci.yml.
Set the regular expression used for extracting test coverage
statistics in the definition of the "gcov" GitLab CI job. Use the
regular expression suggested in GitLab's documentation [2].
Aram Sargsyan [Fri, 27 Jan 2023 08:47:52 +0000 (08:47 +0000)]
catz: unregister the db update-notify callback before detaching from db
When detaching from the previous version of the database, make sure
that the update-notify callback is unregistered, otherwise there is
an INSIST check which can generate an assertion failure in free_rbtdb(),
which checks that there are no outstanding update listeners in the list.
Aram Sargsyan [Thu, 26 Jan 2023 19:08:19 +0000 (19:08 +0000)]
Process db callbacks in zone_loaddone() after zone_postload()
The zone_postload() function can fail and unregister the callbacks.
Call dns_db_endload() only after calling zone_postload() to make
sure that the registered update-notify callbacks are not called
when the zone loading has failed during zone_postload().
Also, don't ignore the return value of zone_postload().
Aram Sargsyan [Fri, 27 Jan 2023 09:22:11 +0000 (09:22 +0000)]
Add a system test for [GL #3777]
Add the 'ixfr-from-differences yes;' option to trigger a failed
zone postload operation when a zone is updated but the serial
number is not updated, then issue two successive 'rndc reload'
commands to trigger the bug, which causes an assertion failure.
Artem Boldariev [Wed, 25 May 2022 11:49:32 +0000 (14:49 +0300)]
Increase server start timeout for system tests
This commit increases server start timeout from 60 to 90 seconds in
order to avoid system test failures on some platforms due to inability
to initialise TLS contexts in time.
Mark Andrews [Tue, 28 Feb 2023 03:10:56 +0000 (14:10 +1100)]
Fix 'lame server clients are dropped below the hard limit' test
The test was setting a minimum count for recursive clients which
was not always being met (e.g. 91 instead of 100) producing a false
positive. Lower the lower bound on recursive clients for this
test to 1.
Michał Kępień [Tue, 28 Feb 2023 11:54:02 +0000 (12:54 +0100)]
Add a DNSRPS-enabled build to regular CI pipelines
DNSRPS-enabled builds have recently been silently broken a few times due
to that feature not being tested in regular CI pipelines. Add the
--enable-dnsrps --enable-dnsrps-dl switches to the ./configure
invocation in one of the CI jobs run for all merge requests so that
DNSRPS-related build issues can be detected in advance.
It is important to note that this change by itself does NOT enable
actual testing of the DNSRPS feature as doing that requires a DNSRPS
provider library to be present on the test host.
Michał Kępień [Tue, 28 Feb 2023 11:54:02 +0000 (12:54 +0100)]
(Mostly) fix building bin/tests/system/rpz/dnsrps
Building the bin/tests/system/rpz/dnsrps helper binary is currently not
possible at all as the necessary compiler and linker flag definitions
are missing from bin/tests/system/Makefile.am. Add these as a basis for
addressing the problem.
Unfortunately, this is where the "mostly" bit mentioned in this commit's
subject line comes into play. The dlopen() parts of DNSRPS code have
not yet been reworked to use libuv's dlopen() API (uv_dlopen() etc.)
(See commit 37b9511ce1dd9ba66a6620c5ff617016eb81188f for prior work in
this area.) While it is certainly possible to do that, implementing
such a change without testing it in practice against a usable librpz.so
(i.e. a DNSRPS provider library) is bound to cause more trouble and
confusion than keeping the code the way it is right now. However,
making that code buildable as-is requires linking against a C standard
library that exports the dlopen(), dlsym(), and dlclose() symbols used
by the DNSRPS dynamic loading code. glibc 2.34+ satisfies that
requirement, but older glibc versions do not (these come with a separate
libdl shared library that would need to be linked in as well). (Other
C standard library implementations have not been examined.) Since the
long-term plan is to rely on libuv's dlopen() API exclusively and
detecting the shared object containing dlopen() & friends would only
pull in build system complexity for no good reason, assume for now that
the target system provides the dlopen() API in its C standard library.
This change enables the system test suite to be run for a BIND 9 build
prepared using --enable-dnsrps --enable-dnsrps-dl (on systems satisfying
the requirement explained above). However, it is important to note that
this change by itself does NOT enable actual testing of the DNSRPS
feature as doing that requires a DNSRPS provider library to be present
on the test host.
Ondřej Surý [Thu, 23 Feb 2023 10:10:39 +0000 (11:10 +0100)]
Pause the catz dbiterator while processing the zone
The dbiterator read-locks the whole zone and it stayed locked during
whole processing time when catz is being read. Pause the iterator, so
the updates to catz zone are not being blocked while processing the catz
update.
Ondřej Surý [Mon, 27 Feb 2023 23:00:23 +0000 (23:00 +0000)]
Unlock catzs during dns__catz_update_cb()
Instead of holding the catzs->lock the whole time we process the catz
update, only hold it for hash table lookup and then release it. This
should unblock any other threads that might be processing updates to
catzs triggered by extra incoming transfer.
Aram Sargsyan [Mon, 27 Feb 2023 22:53:23 +0000 (22:53 +0000)]
Offload catalog zone updates
Offload catalog zone processing so that the network manager threads
are not interrupted by a large catalog zone update.
Introduce a new 'updaterunning' state alongside with 'updatepending',
like it is done in the RPZ module.
Note that the dns__catz_update_cb() function currently holds the
catzs->lock during the whole process, which is far from being optimal,
but the issue is going to be addressed separately.
Aram Sargsyan [Mon, 27 Feb 2023 21:29:24 +0000 (21:29 +0000)]
Add shutdown signaling for catalog zones
This change should make sure that catalog zone update processing
doesn't happen when the catalog zone is being shut down. This
should help avoid races when offloading the catalog zone updates
in the follow-up commit.