Tom Krizek [Thu, 2 Mar 2023 12:02:48 +0000 (13:02 +0100)]
Add release metadata update to release checklist
The release engineering automation we have relies on up-to-date
information about our upcoming release plans. Ensure these are updated
at the end of each release cycle.
Michal Nowak [Thu, 2 Mar 2023 10:42:45 +0000 (11:42 +0100)]
Move "disallow merging to maintained branches" closer to tagging
Updating GitLab settings for all maintained branches to disallow merging
to them has an unfortunate consequence: daily scheduled pipelines won't
be executed anymore. This is a problem because we need the pipelines to
ensure no new bugs were introduced just before a code freeze.
The "Announce (on Mattermost) that the code freeze is in effect" item is
still in place but is now more of a social "disallow merging to
maintained branches".
Mark Andrews [Fri, 3 Mar 2023 03:04:34 +0000 (14:04 +1100)]
Now logs UV versions when starting up
Named now logs both compile time and run time UV versions when
starting up. This is useful information to have when debugging
network issues involving named.
Aram Sargsyan [Thu, 2 Mar 2023 08:52:25 +0000 (08:52 +0000)]
catz: use two pairs of dns_db_t and dns_dbversion_t in a catalog zone
As it is done in the RPZ module, use 'db' and 'dbversion' for the
database we are going to update to, and 'updb' and 'updbversion' for
the database we are working on.
Doing this should avoid a race between the 'dns__catz_update_cb()' and
'dns_catz_dbupdate_callback()' functions.
Aram Sargsyan [Wed, 1 Mar 2023 12:30:46 +0000 (12:30 +0000)]
Fix view's zones reverting bug during reconfiguration
During reconfiguration, the configure_view() function reverts the
configured zones to the previous view in case if there is an error.
It uses the 'zones_configured' boolean variable to decide whether
it is required to revert the zones, i.e. the error happened after
all the zones were successfully configured.
The problem is that it does not account for the case when an error
happens during the configuration of one of the zones (not the first),
in which case there are zones that are already configured for the
new view (and they need to be reverted), and there are zones that
are not (starting from the failed one).
Since 'zones_configured' remains 'false', the configured zones are
not reverted.
Replace the 'zones_configured' variable with a pointer to the latest
successfully configured zone configuration element, and when reverting,
revert up to and including that zone.
Aram Sargsyan [Wed, 1 Mar 2023 12:47:25 +0000 (12:47 +0000)]
Add a catz system test check for [GL #3911]
The trick is to configure a duplicate zone, which comes after the
catalog zone, where the duplicate zone is an existing member zone.
In that scenario, all the zones which come before the "faulty" zone
in the configuration file will fail to be reverted to the previous
version of the view after a reconfiguration error, and in this
particular case that will result in an assertion failure when the
catalog zone update is initiated, because it will be still tied to
the new version of the view, which was dismissed.
Mark Andrews [Thu, 23 Feb 2023 22:39:34 +0000 (09:39 +1100)]
Extract test coverage statistics from the gcov job
In older GitLab versions, the regular expression used for extracting
test coverage statistics from the output of GitLab CI jobs was
configured in the project's settings, using GitLab's web interface.
That changed in recent GitLab versions [1]; the previous configuration
method was removed from the web interface altogether as of GitLab 15.0.
The relevant regular expression is now supposed to be set in the
relevant job's definition in .gitlab-ci.yml.
Set the regular expression used for extracting test coverage
statistics in the definition of the "gcov" GitLab CI job. Use the
regular expression suggested in GitLab's documentation [2].
Ondřej Surý [Tue, 28 Feb 2023 13:14:06 +0000 (14:14 +0100)]
Decouple view->resolver and friends shutdown and detach
In !7538, the shutdown procedure was simplified, but the ordering was
wrong, we need to shutdown the resolver, adb and requestmgr before
detaching those objects from the view, because there are cross
dependencies between at least the resolver and the adb.
Execute the shutdown(s) first, only when all three shutdowns have been
executed, detach those objects from the view.
Michał Kępień [Tue, 28 Feb 2023 11:54:02 +0000 (12:54 +0100)]
Add a DNSRPS-enabled build to regular CI pipelines
DNSRPS-enabled builds have recently been silently broken a few times due
to that feature not being tested in regular CI pipelines. Add the
--enable-dnsrps --enable-dnsrps-dl switches to the ./configure
invocation in one of the CI jobs run for all merge requests so that
DNSRPS-related build issues can be detected in advance.
It is important to note that this change by itself does NOT enable
actual testing of the DNSRPS feature as doing that requires a DNSRPS
provider library to be present on the test host.
Michał Kępień [Tue, 28 Feb 2023 11:54:02 +0000 (12:54 +0100)]
(Mostly) fix building bin/tests/system/rpz/dnsrps
Building the bin/tests/system/rpz/dnsrps helper binary is currently not
possible at all as the necessary compiler and linker flag definitions
are missing from bin/tests/system/Makefile.am. Add these as a basis for
addressing the problem.
Unfortunately, this is where the "mostly" bit mentioned in this commit's
subject line comes into play. The dlopen() parts of DNSRPS code have
not yet been reworked to use libuv's dlopen() API (uv_dlopen() etc.)
(See commit 37b9511ce1dd9ba66a6620c5ff617016eb81188f for prior work in
this area.) While it is certainly possible to do that, implementing
such a change without testing it in practice against a usable librpz.so
(i.e. a DNSRPS provider library) is bound to cause more trouble and
confusion than keeping the code the way it is right now. However,
making that code buildable as-is requires linking against a C standard
library that exports the dlopen(), dlsym(), and dlclose() symbols used
by the DNSRPS dynamic loading code. glibc 2.34+ satisfies that
requirement, but older glibc versions do not (these come with a separate
libdl shared library that would need to be linked in as well). (Other
C standard library implementations have not been examined.) Since the
long-term plan is to rely on libuv's dlopen() API exclusively and
detecting the shared object containing dlopen() & friends would only
pull in build system complexity for no good reason, assume for now that
the target system provides the dlopen() API in its C standard library.
This change enables the system test suite to be run for a BIND 9 build
prepared using --enable-dnsrps --enable-dnsrps-dl (on systems satisfying
the requirement explained above). However, it is important to note that
this change by itself does NOT enable actual testing of the DNSRPS
feature as doing that requires a DNSRPS provider library to be present
on the test host.
Ondřej Surý [Thu, 5 Jan 2023 08:12:35 +0000 (09:12 +0100)]
Implement dns_db node tracing
This implements node reference tracing that passes all the internal
layers from dns_db API (and friends) to increment_reference() and
decrement_reference().
It can be enabled by #defining DNS_DB_NODETRACE in <dns/trace.h> header.
Matthijs Mekking [Tue, 21 Feb 2023 07:50:04 +0000 (08:50 +0100)]
dnssec-signzone can now create multiple CDS RRs
Change the commandline option -G to take a string that determines what
sync records should be published. It is a comma-separated string with
each element being either "cdnskey", or "cds:<algorithm>", where
<algorithm> is a valid digest type. Duplicates are suppressed.
Matthijs Mekking [Tue, 31 Jan 2023 09:25:48 +0000 (10:25 +0100)]
Update code to publish CDS with other digest type
Now that we can configure a different digest type, update the code
to honor the configuration. Update 'dns_dnssec_syncupdate' so that
the correct CDS record is published, and also when deleting CDS records,
ensure that all possible CDS records are removed from the zone.
Matthijs Mekking [Tue, 31 Jan 2023 09:20:00 +0000 (10:20 +0100)]
Add test case for different digest type
Change one of the test cases to use a different digest type (4). The
system tests and kasp script need to be updated to take into account
the new algorithm (instead of the hard coded 2).
Mark Andrews [Tue, 28 Feb 2023 03:10:56 +0000 (14:10 +1100)]
Fix 'lame server clients are dropped below the hard limit' test
The test was setting a minimum count for recursive clients which
was not always being met (e.g. 91 instead of 100) producing a false
positive. Lower the lower bound on recursive clients for this
test to 1.
Tony Finch [Thu, 16 Feb 2023 21:42:04 +0000 (21:42 +0000)]
Improve qp-trie compaction in write transactions
In general, it's better to do one thorough compaction when a batch of
work is complete, which is the way that `update` transactions work.
Conversely, `write` transactions are designed so that lots of little
transactions are not too inefficient, but they need explicit
compaction. This changes `dns_qp_compact()` so that it is easier to
compact any time that makes sense, if there isn't a better way to
schedule compaction. And `dns_qpmulti_commit()` only recycles garbage
when there is enough to make it worthwhile.
Tony Finch [Fri, 17 Feb 2023 18:28:24 +0000 (18:28 +0000)]
Improve qp-trie refcount debugging
Add some qp-trie tracing macros which can be enabled by a
developer. These print a message when a leaf is attached or
detached, indicating which part of the qp-trie implementation
did so. The refcount methods must now return the refcount value
so it can be printed by the trace macros.
Tony Finch [Thu, 22 Dec 2022 14:55:14 +0000 (14:55 +0000)]
Refactor qp-trie to use QSBR
The first working multi-threaded qp-trie was stuck with an unpleasant
trade-off:
* Use `isc_rwlock`, which has acceptable write performance, but
terrible read scalability because the qp-trie made all accesses
through a single lock.
* Use `liburcu`, which has great read scalability, but terrible
write performance, because I was relying on `rcu_synchronize()`
which is rather slow. And `liburcu` is LGPL.
To get the best of both worlds, we need our own scalable read side,
which we now have with `isc_qsbr`. And we need to modify the write
side so that it is not blocked by readers.
Better write performance requires an async cleanup function like
`call_rcu()`, instead of the blocking `rcu_synchronize()`. (There
is no blocking cleanup in `isc_qsbr`, because I have concluded
that it would be an attractive nuisance.)
Until now, all my multithreading qp-trie designs have been based
around two versions, read-only and mutable. This is too few to
work with asynchronous cleanup. The bare minimum (as in epoch
based reclamation) is three, but it makes more sense to support an
arbitrary number. Doing multi-version support "properly" makes
fewer assumptions about how safe memory reclamation works, and it
makes snapshots and rollbacks simpler.
To avoid making the memory management even more complicated, I
have introduced a new kind of "packed reader node" to anchor the
root of a version of the trie. This is simpler because it re-uses
the existing chunk lifetime logic - see the discussion under
"packed reader nodes" in `qp_p.h`.
I have also made the chunk lifetime logic simpler. The idea of a
"generation" is gone; instead, chunks are either mutable or
immutable. And the QSBR phase number is used to indicate when a
chunk can be reclaimed.
Instead of the `shared_base` flag (which was basically a one-bit
reference count, with a two version limit) the base array now has a
refcount, which replaces the confusing ad-hoc lifetime logic with
something more familiar and systematic.
Tony Finch [Wed, 28 Sep 2022 15:56:46 +0000 (16:56 +0100)]
Benchmarks for the qp-trie
The main benchmark is `qpmulti`, which exercizes the qp-trie
transactional API with differing numbers of threads and differing data
sizes, to get some idea of how its performance scales.
The `load-names` benchmark compares the times to populate and query
and the memory used by various BIND data structures: qp-trie, hash
table (chained), hash map (closed), and red-black tree.
The `qp-dump` program is a test utility rather than a benchmark. It
populates a qp-trie and prints it out, either in an ad-hoc text
format, or as input to the graphviz `dot` program.
Tony Finch [Sun, 12 Jun 2022 14:52:35 +0000 (15:52 +0100)]
Fuzz testing the qp-trie
Ensure dns_qpkey_fromname() and dns_qpkey_toname() are inverses.
Excercise a single-threaded dns_qp_t with a fixed set of random keys
and a small chunk size. Use the table of names to ensure that the trie
is behaving as expected. This is (in effect) randomized testing like
the `qpmulti` unit test, but making use of coverage-guided fuzzing
and (in principle) test case minimization.
Tony Finch [Thu, 6 Oct 2022 14:59:14 +0000 (15:59 +0100)]
Test the qp-trie transactional API
Randomized testing with intensive consistency and correctness checks
make it much easier to get good coverage and to shake out bugs than
hand-written unit tests for specific cases.
These tests only run in a single thread, but each test transaction
uses both a write/update and a query/snapshot, to ensure that
modifications are not visible to concurrent readers.
Tony Finch [Tue, 14 Jun 2022 15:20:28 +0000 (16:20 +0100)]
Test infrastructure for the qp-trie
This change adds a number of support routines for the unit tests, and
for benchmarks and fuzz tests to be added later. It isn't necessary to
include the support routines in libdns, since they are not needed by
BIND's installed programs. So `libtest` seems like the best place for
them.
The tests themselves verify that dns_qpkey_fromname() behaves as
expected.
Tony Finch [Wed, 22 Feb 2023 18:08:26 +0000 (18:08 +0000)]
Fix qp-trie refcounting mistake
The error occurred when:
* The bump chunk was re-used across multiple write transactions.
In this situation the bump chunk is marked immutable, but the
immutable flag is disregarded for cells after the fender, which
were allocated in the current transaction.
* The bump chunk fills up during an insert operation, so that the
enlarged twigs vector is allocated from a new bump chunk.
* Before this happened, we should have (but didn't) made the twigs
vector mutable. This would have adjusted its refcounts as necessary.
* However, moving to a new bump chunk has a side effect: twigs that
were previously considered mutable because they are after the
fender become immutable.
* Because of this, the old twigs vector was not destroyed as expected.
* So leaves were duplicated without their refcounts being increased.
The effect is that the refcounts were lower than they should have
been, and underflowed. The tests failed to check for refcount
underflow, so this mistake was detected much later than it ideally
could have been.
After the fix, it is now correct not to ensure the twigs are mutable,
because they are about to be copied to a larger vector. Instead, we
need to find out whether `squash_twigs()` destroyed the old twigs, and
adjust the refcounts accordingly.
Tony Finch [Mon, 9 May 2022 13:31:35 +0000 (14:31 +0100)]
Add a qp-trie data structure
A qp-trie is a kind of radix tree that is particularly well-suited to
DNS servers. I invented the qp-trie in 2015, based on Dan Bernstein's
crit-bit trees and Phil Bagwell's HAMT. https://dotat.at/prog/qp/
This code incorporates some new ideas that I prototyped using
NLnet Labs NSD in 2020 (optimizations for DNS names as keys)
and 2021 (custom allocator and garbage collector).
https://dotat.at/cgi/git/nsd.git
The BIND version of my qp-trie code has a number of improvements
compared to the prototype developed for NSD.
* The main omission in the prototype was the very sketchy outline of
how locking might work. Now the locking has been implemented,
using a reader/writer lock and a mutex. However, it is designed to
benefit from liburcu if that is available.
* The prototype was designed for two-version concurrency, one
version for readers and one for the writer. The new code supports
multiversion concurrency, to provide a basis for BIND's dbversion
machinery, so that updates are not blocked by long-running zone
transfers.
* There are now two kinds of transaction that modify the trie: an
`update` aims to support many very small zones without wasting
memory; a `write` avoids unnecessary allocation to help the
performance of many small changes to the cache.
* There is also a single-threaded interface for situations where
concurrent access is not necessary.
* The API makes better use of types to make it more clear which
operations are permitted when.
* The lookup table used to convert a DNS name to a qp-trie key is
now initialized by a run-time constructor instead of a programmer
using copy-and-paste. Key conversion is more flexible, so the
qp-trie can be used with keys other than DNS names.
* There has been much refactoring and re-arranging things to improve
the terminology and order of presentation in the code, and the
internal documentation has been moved from a comment into a file
of its own.
Some of the required functionality has been stripped out, to be
brought back later after the basics are known to work.
* Garbage collector performance statistics are missing.
* Fancy searches are missing, such as longest match and
nearest match.
* Iteration is missing.
* Search for update is missing, for cases where the caller needs to
know if the value object is mutable or not.
Aram Sargsyan [Fri, 27 Jan 2023 08:47:52 +0000 (08:47 +0000)]
catz: unregister the db update-notify callback before detaching from db
When detaching from the previous version of the database, make sure
that the update-notify callback is unregistered, otherwise there is
an INSIST check which can generate an assertion failure in free_rbtdb(),
which checks that there are no outstanding update listeners in the list.
Aram Sargsyan [Thu, 26 Jan 2023 19:08:19 +0000 (19:08 +0000)]
Process db callbacks in zone_loaddone() after zone_postload()
The zone_postload() function can fail and unregister the callbacks.
Call dns_db_endload() only after calling zone_postload() to make
sure that the registered update-notify callbacks are not called
when the zone loading has failed during zone_postload().
Also, don't ignore the return value of zone_postload().
Aram Sargsyan [Fri, 27 Jan 2023 09:22:11 +0000 (09:22 +0000)]
Add a system test for [GL #3777]
Add the 'ixfr-from-differences yes;' option to trigger a failed
zone postload operation when a zone is updated but the serial
number is not updated, then issue two successive 'rndc reload'
commands to trigger the bug, which causes an assertion failure.
Ondřej Surý [Thu, 23 Feb 2023 10:10:39 +0000 (11:10 +0100)]
Pause the catz dbiterator while processing the zone
The dbiterator read-locks the whole zone and it stayed locked during
whole processing time when catz is being read. Pause the iterator, so
the updates to catz zone are not being blocked while processing the catz
update.
Ondřej Surý [Fri, 24 Feb 2023 09:46:00 +0000 (09:46 +0000)]
Unlock catzs during dns__catz_update_cb()
Instead of holding the catzs->lock the whole time we process the catz
update, only hold it for hash table lookup and then release it. This
should unblock any other threads that might be processing updates to
catzs triggered by extra incoming transfer.
Aram Sargsyan [Tue, 21 Feb 2023 14:23:38 +0000 (14:23 +0000)]
Offload catalog zone updates
Offload catalog zone processing so that the network manager threads
are not interrupted by a large catalog zone update.
Introduce a new 'updaterunning' state alongside with 'updatepending',
like it is done in the RPZ module.
Note that the dns__catz_update_cb() function currently holds the
catzs->lock during the whole process, which is far from being optimal,
but the issue is going to be addressed separately.