Ondřej Surý [Sun, 13 Nov 2022 10:04:30 +0000 (11:04 +0100)]
Remove isc_resource API and set limits directly in named_os unit
The only function left in the isc_resource API was setting the file
limit. Replace the whole unit with a simple getrlimit to check the
maximum value of RLIMIT_NOFILE and set the maximum back to rlimit_cur.
This is more compatible than trying to set RLIMIT_UNLIMITED on the
RLIMIT_NOFILE as it doesn't work on Linux (see man 5 proc on
/proc/sys/fs/nr_open), neither it does on Darwin kernel (see man 2
getrlimit).
The only place where the maximum value could be raised under privileged
user would be BSDs, but the `named_os_adjustnofile()` were not called
there before. We would apply the increased limits only on Linux and Sun
platforms.
Ondřej Surý [Sun, 13 Nov 2022 09:28:17 +0000 (10:28 +0100)]
Mark setting operating system limits from named.conf as ancient
After deprecating the operating system limits settings (coresize,
datasize, files and stacksize), mark them as ancient and remove the code
that sets the values from config.
Ondřej Surý [Thu, 3 Nov 2022 16:42:12 +0000 (17:42 +0100)]
Propagate the shutdown event to the recursing ns_client(s)
Send the ns_query_cancel() on the recursing clients when we initiate the
named shutdown for faster shutdown.
When we are shutting down the resolver, we cancel all the outstanding
fetches, and the ISC_R_CANCEL events doesn't propagate to the ns_client
callback.
In the future, the better solution how to fix this would be to look at
the shutdown paths and let them all propagate from bottom (loopmgr) to
top (f.e. ns_client).
Ondřej Surý [Tue, 6 Dec 2022 14:59:35 +0000 (15:59 +0100)]
Fix reference counting in get_attached_entry
When get_attached_entry() encounters entry that would be expired, it
needs to get reference to the entry before calling maybe_expire_entry(),
so the ADB entry doesn't get destroyed inside the its own lock.
Michal Nowak [Tue, 22 Nov 2022 09:27:17 +0000 (10:27 +0100)]
Add ASAN- and TSAN-enabled respdiff jobs
Neither of the new CI jobs can reliably pass at the moment; hence they
are defined with "allow_failure: true" until issues in the code base are
resolved.
Mark Andrews [Wed, 30 Nov 2022 08:32:11 +0000 (19:32 +1100)]
Check that restored catalog zone works
Using a restored catalog zone excercised a use-after-free bug.
The test checks that the use-after-free bug is gone and is just
a reasonable behaviour check in its own right.
Mark Andrews [Wed, 30 Nov 2022 07:44:37 +0000 (18:44 +1100)]
Call dns_db_updatenotify_unregister earlier
dns_db_updatenotify_unregister needed to be called earlier to ensure
that listener->onupdate_arg always points to a valid object. The
existing lazy cleanup in rbtdb_free did not ensure that.
Matthijs Mekking [Thu, 17 Nov 2022 13:52:26 +0000 (13:52 +0000)]
Consider non-stale data when in serve-stale mode
With 'stale-answer-enable yes;' and 'stale-answer-client-timeout off;',
consider the following situation:
A CNAME record and its target record are in the cache, then the CNAME
record expires, but the target record is still valid.
When a new query for the CNAME record arrives, and the query fails,
the stale record is used, and then the query "restarts" to follow
the CNAME target. The problem is that the query's multiple stale
options (like DNS_DBFIND_STALEOK) are not reset, so 'query_lookup()'
treats the restarted query as a lookup following a failed lookup,
and returns a SERVFAIL answer when there is no stale data found in the
cache, even if there is valid non-stale data there available.
With this change, query_lookup() now considers non-stale data in the
cache in the first place, and returns it if it is available.
Artem Boldariev [Tue, 22 Nov 2022 13:11:57 +0000 (15:11 +0200)]
TLS: check for sock->recv_cb when handling received data
This commit adds a check if 'sock->recv_cb' might have been nullified
during the call to 'sock->recv_cb'. That could happen, e.g. by an
indirect call to 'isc_nmhandle_close()' from within the callback when
wrapping up.
Artem Boldariev [Fri, 2 Dec 2022 08:41:28 +0000 (10:41 +0200)]
DoH: Avoid accessing non-atomic listener socket flags when accepting
This commit ensures that the non-atomic flags inside a DoH listener
socket object (and associated worker) are accessed when doing accept
for a connection only from within the context of the dedicated thread,
but not other worker threads.
Artem Boldariev [Thu, 1 Dec 2022 20:42:54 +0000 (22:42 +0200)]
TLS: Avoid accessing non-atomic listener socket flags during HS
This commit ensures that the non-atomic flags inside a TLS listener
socket object (and associated worker) are accessed when doing
handshake for a connection only from within the context of the
dedicated thread, but not other worker threads.
Tom Krizek [Wed, 30 Nov 2022 16:44:27 +0000 (17:44 +0100)]
Add libnghttp2 prerequisite for doth system test
While some of these tests are for DoT which doesn't require nghttp2,
the server configs won't allow the server to start without nghttp2
support during compile time.
It might be possible to split these tests into DoT and DoH and only
require nghttp2 for DoH tests, but since almost all of our CI jobs are
compiled with nghttp2, we wouldn't gain a lot of coverage, so it's
probably not worth the effort.
Tom Krizek [Thu, 1 Dec 2022 14:12:28 +0000 (15:12 +0100)]
Use feature-test to detect feature support in system tests
Previously, there were two different ways to detect feature support.
Either through an environment variable set by configure in conf.sh, or
using the feature-test utility.
It is more simple and consistent to have only one way of detecting the
feature support. Using the feature-test utility seems superior the the
environment variables set by configure.
Artem Boldariev [Wed, 30 Nov 2022 20:58:13 +0000 (22:58 +0200)]
TLS: Avoid accessing listener socket flags from other threads
This commit ensures that the flags inside a TLS listener socket
object (and associated worker) are accessed when accepting a
connection only from within the context of the dedicated thread, but
not other worker threads.
Ondřej Surý [Thu, 1 Dec 2022 17:31:05 +0000 (18:31 +0100)]
Honour single read per client isc_nm_read() call in the TLSDNS
The TLSDNS transport was not honouring the single read callback for
TLSDNS client. It would call the read callbacks repeatedly in case the
single TLS read would result in multiple DNS messages in the decoded
buffer.
Ondřej Surý [Fri, 4 Nov 2022 14:04:29 +0000 (15:04 +0100)]
Refactor the dns_resolver fetch context hash tables and locking
This is second in the series of fixing the usage of hashtables in the
dns_adb and the dns_resolver units.
Currently, the fetch buckets (used to hold the fetch context) and zone
buckets (used to hold per-domain counters) would never get cleaned from
the memory. Combined with the fact that the hashtable now grows as
needed (instead of using hashtable as buckets), the memory usage in the
resolver can just grow and it never drops down.
In this commit, the usage of hashtables (hashmaps) has been completely
rewritten, so there are no "buckets" and all the matching conditions are
directly mapped into the hashtable key:
1. For per-domain counter hashtable, this is simple as the lowercase
domain name is used directly as a counter.
2. For fetch context hashtable, this requires copying some extra flags
back and forth in the key.
As we don't hold the "buckets" forever, the cleaning mechanism has been
rewritten as well:
1. For per-domain counter hashtable, this is again much simpler, as we
only need to check whether the usage counter is still zero under the
lock and bail-out on cleaning if the counter is in use.
2. For fetch context hashtable, this is more complicated as the fetch
context cannot be reused after it has been finished. The algorithm
is different, the fetch context is always removed from the
hashtable, but if we find the fetch context that has been marked
as finished in the lookup function, we help with the cleaning from
the hashtable and try again.
Couple of additional changes have been implemented in this refactoring
as those were needed for correct functionality and could not be split
into individual commits (or would not make sense as seperate commits):
1. The dns_resolver_createfetch() has an option to create "unshared"
fetch. The "unshared" fetch will never get matched, so there's
little point in storing the "unshared" fetch in the hashtable.
Therefore the "unshared" fetches are now detached from the
hashtable and live just on their own.
2. Replace the custom reference counting with ISC_REFCOUNT_DECL/IMPL
macros for better tracing.
3. fctx_done_detach() is idempotent, it makes the "final" detach (the
one matching the create function) only once. But that also means
that it has to be called before the detach that kept the fetch
context alive in the callback. A new macro fctx_done_unref() has
been added to allow this code flow:
Doing this the other way around could cause fctx to get destroyed in
the fctx_unref() first and fctx_done_detach() would cause UAF.
4. The resume_qmin() and resume_dslookup() callbacks have been
refactored for more readability and simpler code paths. The
validated() callback has also received some of the simplifications,
but it should be refactored in the future as it is bit of spaghetti
now.
This commit ensures that send callbacks are always called from within
the context of its worker thread even in the case of
shuttigdown/inactive socket, just like TCP transport does and with
which TLS attempts to be as compatible as possible.
Artem Boldariev [Thu, 27 Oct 2022 18:16:36 +0000 (21:16 +0300)]
TLS Stream: use ISC_R_CANCELLED error when shutting down
This commit changes ISC_R_NOTCONNECTED error code to ISC_R_CANCELLED
when attempting to start reading data on the shutting down socket in
order to make its behaviour compatible with that of TCP and not break
the common code in the unit tests.
Artem Boldariev [Thu, 27 Oct 2022 17:13:06 +0000 (20:13 +0300)]
TLS Stream: fix isc_nm_read_stop() and reading flags handling
It turned out that after the latest Network Manager refactoring
'sock->reading' flag was not processed correctly. Due to this
isc_nm_read_stop() might not work as expected because reading from the
underlying TCP socket could have been resume in 'tls_do_bio()'
regardless of the 'sock->reading' value.
This bug did not seem to cause problems with DoH, so it was not
noticed, but Stream DNS has more strict expectations regarding the
underlying transport.
Additionally to the above, the 'sock->recv_read' flag was completely
ignored and corresponding logic was completely unimplemented. That did
not allow to implement one fine detail compared to TCP: once reading
is started, it could be satisfied by one datum reading.
Tony Finch [Tue, 29 Nov 2022 20:23:31 +0000 (20:23 +0000)]
Compress zone transfers properly
After change 5995, zone transfers were using a small
compression context that only had space for the first
few dozen names in each message. They now use a large
compression context with enough space for every name.
Ondřej Surý [Wed, 30 Nov 2022 09:46:40 +0000 (10:46 +0100)]
Don't log "final reference detached" on INFO level
The "final reference detached" message was meant to be DEBUG(1), but was
instead kept at INFO level. Move it to the DEBUG(1) logging level, so
it's not printed under normal operations.
Ondřej Surý [Wed, 23 Nov 2022 08:56:19 +0000 (09:56 +0100)]
Keep the unlink adb entries until expiration
Currently, the ADB uses TTL of 0 for ADB names that the server is
authoritative for and TTL of 10 seconds for HINT and GLUE ADB names.
This requires the unlinked ADB entries to be kept around, because they
would disappear too quickly. This especially affect the root zone as
the trust level is "ultimate" for the root zone nameservers.
This commit restores the ability to keep the unlinked ADB entries in the
database for later reuse, restores printing the unlinked entries and
adds some extra cleaning of the unlinked ADB entries on the tail of the
LRU list (similar to what we are doing for the ADB names).
Ondřej Surý [Thu, 13 Oct 2022 10:00:54 +0000 (12:00 +0200)]
Refactor the dns_adb unit
The dns_adb unit has been refactored to be much simpler. Following
changes have been made:
1. Simplify the ADB to always allow GLUE and hints
There were only two places where dns_adb_createfind() was used - in
the dns_resolver unit where hints and GLUE addresses were ok, and in
the dns_zone where dns_adb_createfind() would be called without
DNS_ADBFIND_HINTOK and DNS_ADBFIND_GLUEOK set.
Simplify the logic by allowing hint and GLUE addresses when looking
up the nameserver addresses to notify. The difference is negligible
and would cause a difference in the notified addresses only when
there's mismatch between the parent and child addresses and we
haven't cached the child addresses yet.
2. Drop the namebuckets and entrybuckets
Formerly, the namebuckets and entrybuckets were used to reduced the
lock contention when accessing the double-linked lists stored in each
bucket. In the previous refactoring, the custom hashtable for the
buckets has been replaced with isc_ht/isc_hashmap, so only a single
item (mostly, see below) would end up in each bucket.
Removing the entrybuckets has been straightforward, the only matching
was done on the isc_sockaddr_t member of the dns_adbentry.
Removing the zonebuckets required GLUEOK and HINTOK bits to be
removed because the find could match entries with-or-without the bits
set, and creating a custom key that stores the
DNS_ADBFIND_STARTATZONE in the first byte of the key, so we can do a
straightforward lookup into the hashtable without traversing a list
that contains items with different flags.
3. Remove unassociated entries from ADB database
Previously, the adbentries could live in the ADB database even after
unlinking them from dns_adbnames. Such entries would show up as
"Unassociated entries" in the ADB dump. The benefit of keeping such
entries is little - the chance that we link such entry to a adbname
is small, and it's simpler to evict unlinked entries from the ADB
cache (and the hashtable) than create second LRU cleaning mechanism.
Unlinked ADB entries are now directly deleted from the hash
table (hashmap) upon destruction.
4. Cleanup expired entries from the hash table
When buckets were still in place, the code would keep the buckets
always allocated and never shrink the hash table (hashmap). With
proper reference counting in place, we can delete the adbnames from
the hash table and the LRU list.
5. Stop purging the names early when we hit the time limit
Because the LRU list is now time ordered, we can stop purging the
names when we find a first entry that doesn't fullfil our time-based
eviction criteria because no further entry on the LRU list will meet
the criteria.
Future work:
1. Lock contention
In this commit, the focus was on correctness of the data structure,
but in the future, the lock contention in the ADB database needs to
be addressed. Currently, we use simple mutex to lock the hash
tables, because we almost always need to use a write lock for
properly purging the hashtables. The ADB database needs to be
sharded (similar to the effect that buckets had in the past). Each
shard would contain own hashmap and own LRU list.
2. Time-based purging
The ADB names and entries stay intact when there are no lookups.
When we add separate shards, a timer needs to be added for time-based
cleaning in case there's no traffic hashing to the inactive shard.
3. Revisit the 30 minutes limit
The ADB cache is capped at 30 minutes. This needs to be revisited,
and at least the limit should be configurable (in both directions).
Ondřej Surý [Thu, 13 Oct 2022 06:11:30 +0000 (08:11 +0200)]
Create per-thread task for dns_adb resolver fetches
The dns_adb would serialize all fetches on a single task. Create a
per-thread task, so the fetches will stay local to the thread that
initiated the fetch.
Ondřej Surý [Mon, 1 Aug 2022 07:07:38 +0000 (09:07 +0200)]
Purge stale ADB names globaly, not per bucket
Before the refactoring, there was only few buckets with many names in
them, so cleaning up stale ADB names per-bucket made sense. After the
refactoring, each bucket directly maps to ADB name, so purging has been
effectively disabled.
Create a global LRU list for ADB names (and ADB entries) and purge the
stale ADB names globally.
Previously, the name and entry buckets were much larger, so the dead
names and entries were moved to a secondary list to be cleaned
later (f.e. after the already running fetch has been canceled). After
the last refactoring, the bucket now contains only the name (entry)
itself and thus the extra list has a little use. Remove the .deadnames
and .deadentries from dns_adbnamebucket_t and dns_adbentrybucket_t
structures.
Ondřej Surý [Wed, 5 Oct 2022 09:21:28 +0000 (11:21 +0200)]
Refactor dns_rpz unit to use single reference counting
The dns_rpz_zones structure was using .refs and .irefs for strong and
weak reference counting. Rewrite the unit to use just a single
reference counting + shutdown sequence (dns_rpz_destroy_rpzs) that must
be called by the creator of the dns_rpz_zones_t object. Remove the
reference counting from the dns_rpz_zone structure as it is not needed
because the zone objects are fully embedded into the dns_rpz_zones
structure and dns_rpz_zones_t object must never be destroyed before all
dns_rpz_zone_t objects.
The dns_rps_zones_t reference counting uses the new ISC_REFCOUNT_TRACE
capability - enable by defining DNS_RPZ_TRACE in the dns/rpz.h header.
Additionally, add magic numbers to the dns_rpz_zone and dns_rpz_zones
structures.
Ondřej Surý [Wed, 5 Oct 2022 09:19:52 +0000 (11:19 +0200)]
Add extra set of ISC_REFCOUNT_TRACE_{IMPL,DECL} macros
The new ISC_REFCOUNT_TRACE_{IMPL,DECL} macros can be used to add a
reference tracing capability to any unit using the reference counting.
It requires a little bit of extra work in each header as you can't have
a define from inside a define (see rpz.h), but it's fairly easy to add
tracing to any struct using reference counting with these macros.
Ondřej Surý [Wed, 2 Nov 2022 10:40:19 +0000 (11:40 +0100)]
Remove the unused cache cleaning mechanism from dns_cache API
The dns_cache API contained a cache cleaning mechanism that would be
disabled for 'rbt' based cache. As named doesn't have any other cache
implementations, remove the cache cleaning mechanism from dns_cache API.
Ondřej Surý [Wed, 2 Nov 2022 10:40:19 +0000 (11:40 +0100)]
Remove the dead external cache cleaning mechanism from RBTDB
The RBTDB has own cache cleaning mechanism and therefor the iterator
.cleaning member would never be set to true. Remove the code that
checks for iterator->cleaning from the RBTDB.
Artem Boldariev [Thu, 10 Nov 2022 19:05:00 +0000 (21:05 +0200)]
TCP: use uv_try_write() to optimise sends
This commit make TCP code use uv_try_write() on best effort basis,
just like TCP DNS and TLS DNS code does.
This optimisation was added in
'caa5b6548a11da6ca772d6f7e10db3a164a18f8d' but, similar change was
mistakenly omitted for generic TCP code. This commit fixes that.