Diego Fronza [Tue, 20 Oct 2020 00:25:34 +0000 (21:25 -0300)]
Added system test for stale-refresh-time
This test works as follow:
- Query for data.example rrset.
- Sleep until its TTL expires (2 secs).
- Disable authoritative server.
- Query for data.example again.
- Since server is down, answer come from stale cache, which has
a configured stale-answer-ttl of 3 seconds.
- Enable authoritative server.
- Query for data.example again
- Since last query before activating authoritative server failed, and
since 'stale-refresh-time' seconds hasn't elapsed yet, answer should
come from stale cache and not from the authoritative server.
Diego Fronza [Tue, 20 Oct 2020 00:24:38 +0000 (21:24 -0300)]
Adjusted ancient rrset system test
Before the stale-refresh-time feature, the system test for ancient rrset
was somewhat based on the average time the previous tests and queries
were taking, thus not very precise.
After the addition of stale-refresh-time the system test for ancient
rrset started to fail since the queries for stale records (low
max-stale-ttl) were not taking the time to do a full resolution
anymore, since the answers now were coming from the cache (because the
rrset were stale and within stale-refresh-time window after the
previous resolution failure).
To handle this, the correct time to wait before rrset become ancient is
calculated from max-stale-ttl configuration plus the TTL set in the
rrset used in the tests (ans2/ans.pl).
Then before sending queries for ancient rrset, we check if we need to
sleep enough to ensure those rrset will be marked as ancient.
Diego Fronza [Mon, 19 Oct 2020 20:02:03 +0000 (17:02 -0300)]
Add stale-refresh-time option
Before this update, BIND would attempt to do a full recursive resolution
process for each query received if the requested rrset had its ttl
expired. If the resolution fails for any reason, only then BIND would
check for stale rrset in cache (if 'stale-cache-enable' and
'stale-answer-enable' is on).
The problem with this approach is that if an authoritative server is
unreachable or is failing to respond, it is very unlikely that the
problem will be fixed in the next seconds.
A better approach to improve performance in those cases, is to mark the
moment in which a resolution failed, and if new queries arrive for that
same rrset, try to respond directly from the stale cache, and do that
for a window of time configured via 'stale-refresh-time'.
Only when this interval expires we then try to do a normal refresh of
the rrset.
The logic behind this commit is as following:
- In query.c / query_gotanswer(), if the test of 'result' variable falls
to the default case, an error is assumed to have happened, and a call
to 'query_usestale()' is made to check if serving of stale rrset is
enabled in configuration.
- If serving of stale answers is enabled, a flag will be turned on in
the query context to look for stale records:
query.c:6839
qctx->client->query.dboptions |= DNS_DBFIND_STALEOK;
- A call to query_lookup() will be made again, inside it a call to
'dns_db_findext()' is made, which in turn will invoke rbdb.c /
cache_find().
- In rbtdb.c / cache_find() the important bits of this change is the
call to 'check_stale_header()', which is a function that yields true
if we should skip the stale entry, or false if we should consider it.
- In check_stale_header() we now check if the DNS_DBFIND_STALEOK option
is set, if that is the case we know that this new search for stale
records was made due to a failure in a normal resolution, so we keep
track of the time in which the failured occured in rbtdb.c:4559:
header->last_refresh_fail_ts = search->now;
- In check_stale_header(), if DNS_DBFIND_STALEOK is not set, then we
know this is a normal lookup, if the record is stale and the query
time is between last failure time + stale-refresh-time window, then
we return false so cache_find() knows it can consider this stale
rrset entry to return as a response.
The last additions are two new methods to the database interface:
- setservestale_refresh
- getservestale_refresh
Those were added so rbtdb can be aware of the value set in configuration
option, since in that level we have no access to the view object.
Mark Andrews [Tue, 13 Oct 2020 02:00:36 +0000 (13:00 +1100)]
Address TSAN error between dns_rbt_findnode() and subtractrdataset().
Having dns_rbt_findnode() in previous_closest_nsec() check of
node->data is a optimisation that triggers a TSAN error with
subtractrdataset(). find_closest_nsec() still needs to check if
the NSEC record are active or not and look for a earlier NSEC records
if it isn't. Set DNS_RBTFIND_EMPTYDATA so node->data isn't referenced
without the node lock being held.
Ondřej Surý [Tue, 10 Nov 2020 10:23:05 +0000 (11:23 +0100)]
netmgr: Add additional safeguards to netmgr/tls.c
This commit adds couple of additional safeguards against running
sends/reads on inactive sockets. The changes was modeled after the
changes we made to netmgr/tcpdns.c
Witold Kręcicki [Thu, 1 Oct 2020 11:18:47 +0000 (13:18 +0200)]
Add DoT support to bind
Parse the configuration of tls objects into SSL_CTX* objects. Listen on
DoT if 'tls' option is setup in listen-on directive. Use DoT/DoH ports
for DoT/DoH.
This commit adds stub parser support and tests for:
- "tls" statement, specifying key and cert.
- an optional "tls" keyvalue in listen-on statements for DoT
configuration.
Documentation for these options has also been added to the ARM, but
needs further work.
Witold Kręcicki [Wed, 13 May 2020 15:37:51 +0000 (17:37 +0200)]
netmgr: server-side TLS support
Add server-side TLS support to netmgr - that includes moving some of the
isc_nm_ functions from tcp.c to a wrapper in netmgr.c calling a proper
tcp or tls function, and a new isc_nm_listentls() function.
Add DoT support to tcpdns - isc_nm_listentlsdns().
Mark Andrews [Fri, 23 Oct 2020 02:33:34 +0000 (13:33 +1100)]
Fixup legacy test to account for not falling back to EDNS 512 lookups.
The SOA lookup for edns512 could succeed if the negative response
for ns.edns512/AAAA completed before all the edns512/SOA query
attempts are made. The ns.edns512/AAAA lookup returns tc=1 and
the SOA record is cached after processing the NODATA response.
Lookup a TXT record at edns512 and look it up instead of the
SOA record.
Removed 'checking that TCP failures do not influence EDNS statistics
in the ADB' as it is no longer appropriate.
Evan Hunt [Mon, 9 Nov 2020 20:33:37 +0000 (12:33 -0800)]
address some possible shutdown races in xfrin
there were two failures during observed in testing, both occurring
when 'rndc halt' was run rather than 'rndc stop' - the latter dumps
zone contents to disk and presumably introduced enough delay to
prevent the races:
- a failure when the zone was shut down and called dns_xfrin_detach()
before the xfrin had finished connecting; the connect timeout
terminated without detaching its handle
- a failure when the tcpdns socket timer fired after the outerhandle
had already been cleared.
this commit incidentally addresses a failure observed in mutexatomic
due to a variable having been initialized incorrectly.
Ondřej Surý [Sat, 10 Oct 2020 05:26:18 +0000 (07:26 +0200)]
Add libssl libraries to Windows build
This commit extends the perl Configure script to also check for libssl
in addition to libcrypto and change the vcxproj source files to link
with both libcrypto and libssl.
Ondřej Surý [Mon, 9 Nov 2020 10:32:55 +0000 (11:32 +0100)]
Refactor the xfrin reference counting
Previously, the xfrin object relied on four different reference counters
(`refs`, `connects`, `sends`, `recvs`) and destroyed the xfrin object
only if all of them were zero. This commit reduces the reference
counting only to the `references` (renamed from `refs`) counter. We
keep the existing `connects`, `sends` and `recvs` as safe guards, but
they are not formally needed.
Evan Hunt [Fri, 9 Oct 2020 23:38:30 +0000 (16:38 -0700)]
remove isc_task from xfrin
since the network manager is now handling timeouts, xfrin doesn't
need an isc_task object.
it may be necessary to revert this later if we find that it's
important for zone_xfrdone() to be executed in the zone task context.
currently things seem to be working well without that, though.
Ondřej Surý [Sat, 7 Nov 2020 19:48:37 +0000 (20:48 +0100)]
netmgr: Don't crash if socket() returns an error in udpconnect
socket() call can return an error - e.g. EMFILE, so we need to handle
this nicely and not crash.
Additionally wrap the socket() call inside a platform independent helper
function as the Socket data type on Windows is unsigned integer:
> This means, for example, that checking for errors when the socket and
> accept functions return should not be done by comparing the return
> value with –1, or seeing if the value is negative (both common and
> legal approaches in UNIX). Instead, an application should use the
> manifest constant INVALID_SOCKET as defined in the Winsock2.h header
> file.
Ondřej Surý [Fri, 6 Nov 2020 15:12:17 +0000 (16:12 +0100)]
dig: Refactor recv_done, so there's less exit paths
The recv_done() callback had many exit paths with different conditions,
and every path had it's own set of destructors. The refactored code now
has unified exit path with descriptive goto labels matching the intent:
The only exception to the rule is check_for_more_data() path, where the
part of the query gets reused, so the query->readhandle and query gets
detached on it's own, and by going to the keep_query, we are just
skipping calling the destructors again.
Ondřej Surý [Fri, 6 Nov 2020 12:11:08 +0000 (13:11 +0100)]
netmgr: Always load the result from async socket
Because we use result earlier for setting the loadbalancing on the
socket, we could be left with a ISC_R_NOTIMPLEMENTED value stored in the
variable and when the UDP connection would succeed, we would
errorneously return this value instead of ISC_R_SUCCESS.
Evan Hunt [Thu, 5 Nov 2020 22:15:14 +0000 (14:15 -0800)]
dig: prevent query from being detached if udpconnect fails on first attempt
FreeBSD sometimes returns spurious errors in UDP connect() attempts,
so we try a few times before giving up. However, each failed attempt
triggers a call to udp_ready() in dighost.c, and that was causing
the query object to be detached prematurely.
Ondřej Surý [Thu, 5 Nov 2020 14:22:38 +0000 (15:22 +0100)]
dig: add reference counter to the dig_lookup_t object
Sometimes, the dig_lookup_t could be destroyed before the final
send_done() callback was be called, leading to dereferencing an
already freed dig_lookup_t object. By making the dig_lookup_t
reference counted, we are ensuring that it won't be freed until
the last reference (from dig_query_t .lookup) is released.
one of the tests in the resolver system test depends on dig
getting no response to its first two query attempts, and SERVFAIL
on the third after resolution times out.
using a 5-second retry timer in dig means the SERVFAIL response
could occur while dig is discarding the second query and preparing
to send the third. in this case the server's response could be
missed. shortening the retry interval to 4 seconds ensures that
dig has already sent the third query when the SERVFAIL response
arrives.
also, the serve-stale system test could fail due to a race in which
it timed out after waiting ten seconds for a file to be written, and
the dig timeout was just a bit longer. this is addressed by extending
the dig timeout to 11 seconds for this test.
Evan Hunt [Tue, 3 Nov 2020 03:58:05 +0000 (19:58 -0800)]
add isc_nmhandle_settimeout() function
this function sets the read timeout for the socket associated
with a netmgr handle and, if the timer is running, resets it.
for TCPDNS sockets it also sets the read timeout and resets the
timer on the outer TCP socket.
because dig now uses the netmgr, printing of response messages
happens in a different thread than setup. the IDN output filtering
procedure, which set using dns_name_settotextfilter(), is stored as
thread-local data, and so if it's set during setup, it won't be
accessible when printing. we now set it immediately before printing,
in the same thread, and clear it immedately afterward.
The network manager does not support returning UDP datagrams to
clients from unexpected sources; it is therefore not possible for
dig to accept them. The "+[no]unexpected" option has therefore
been removed from the dig command and its documentation.
Michał Kępień [Thu, 5 Nov 2020 10:45:19 +0000 (11:45 +0100)]
Fix detection of CMake-built libuv on Windows
As of libuv 1.36.0, CMake is the only supported build method for libuv
on Windows. Account for that fact by adjusting the relevant paths and
DLL file names used in the win32utils/Configure script. Update
Windows-specific documentation accordingly.
Michał Kępień [Thu, 5 Nov 2020 10:45:19 +0000 (11:45 +0100)]
Use "image" key in Windows GitLab CI job templates
Our GitLab Runner Custom executor scripts now use the "image" key for
determining the Windows Docker image to use for a given CI job. Update
.gitlab-ci.yml to reflect that change.
Michał Kępień [Thu, 5 Nov 2020 06:53:43 +0000 (07:53 +0100)]
Wait for the "fast-expire" zone to be transferred
In order for a "fast-expire/IN: response-policy zone expired" message to
be logged in ns3/named.run, the "fast-expire" zone must first be
transferred in by that server. However, with unfavorable timing, ns3
may be stopped before it manages to fetch the "fast-expire" zone from
ns5 and after the latter has been reconfigured to no longer serve that
zone. In such a case, the "rpz" system test will report a false
positive for the relevant check. Prevent that from happening by
ensuring ns3 manages to transfer the "fast-expire" zone before getting
shut down.
Some setup scripts uses DEFAULT_ALGORITHM in their dnssec-policy
and/or initial signing. The tests still used the literal values
13, ECDSAP256SHA256, and 256. Replace those occurrences where
appropriate.
Ondřej Surý [Mon, 2 Nov 2020 14:55:12 +0000 (15:55 +0100)]
Put up additional safe guards to not use inactive/closed tcpdns socket
When we are operating on the tcpdns socket, we need to double check
whether the socket or its outerhandle or its listener or its mgr is
still active and when not, bail out early.
Witold Kręcicki [Sat, 31 Oct 2020 20:08:53 +0000 (21:08 +0100)]
Fix improper closed connection handling in tcpdns.
If dnslisten_readcb gets a read callback it needs to verify that the
outer socket wasn't closed in the meantime, and issue a CANCELED callback
if it was.
Ondřej Surý [Tue, 27 Oct 2020 16:12:41 +0000 (17:12 +0100)]
add a netmgr unit test
tests of UDP and TCP cases including:
- sending and receiving
- closure sockets without reading or sending
- closure of sockets at various points while sending and receiving
- since the teste is multithreaded, cmocka now aborts tests on the
first failure, so that failures in subthreads are caught and
reported correctly.
Ondřej Surý [Thu, 29 Oct 2020 11:04:00 +0000 (12:04 +0100)]
Fix more races between connect and shutdown
There were more races that could happen while connecting to a
socket while closing or shutting down the same socket. This
commit introduces a .closing flag to guard the socket from
being closed twice.