Ondřej Surý [Tue, 2 Feb 2021 08:28:23 +0000 (09:28 +0100)]
Unit-test fixes and manual page updates for DoH configuration
This commit contains fixes to unit tests to make them work well on
various platforms (in particular ones shipping old versions of
OpenSSL) and fo different configurations.
It also updates the generated manpage to include DoH configuration
options.
Artem Boldariev [Mon, 7 Dec 2020 12:19:10 +0000 (14:19 +0200)]
Initial support for DNS-over-HTTP(S)
This commit completes the support for DNS-over-HTTP(S) built on top of
nghttp2 and plugs it into the BIND. Support for both GET and POST
requests is present, as required by RFC8484.
Both encrypted (via TLS) and unencrypted HTTP/2 connections are
supported. The latter are mostly there for debugging/troubleshooting
purposes and for the means of encryption offloading to third-party
software (as might be desirable in some environments to simplify TLS
certificates management).
This commit adds stub parser support and tests for:
- an "https-server" global option for HTTP/2 configuration.
- an "https-endpoint" view option for DoH configuration.
- add specific -p options to set dns, tls, or https port numbers by
specifying -p dns-PORT, -p tls=PORT, or -p https=PORT. NOTE: this
change only affects syntax; specifying the TLS port on the command
line DOES NOT CURRENTLY HAVE ANY EFFECT.
- change option names to tls-port and https-port for consistency.
- change variable names in system tests to TLSPORT and HTTPSPORT,
and report them when running tests.
Documentation for these options has also been added to the ARM, but
needs further work.
Ondřej Surý [Thu, 14 Jan 2021 11:51:25 +0000 (12:51 +0100)]
implement xfrin via XoT
Add support for a "tls" key/value pair for zone primaries, referencing
either a "tls" configuration statement or "ephemeral". If set to use
TLS, zones will send SOA and AXFR/IXFR queries over a TLS channel.
Diego Fronza [Wed, 27 Jan 2021 22:09:46 +0000 (19:09 -0300)]
Fix race condition on check_stale_header
This commit fix a race that could happen when two or more threads have
failed to refresh the same RRset, the threads could simultaneously
attempt to update the header->last_refresh_fail_ts field in
check_stale_header, a field used to implement stale-refresh-time.
Matthijs Mekking [Thu, 28 Jan 2021 11:30:08 +0000 (12:30 +0100)]
Add test for serve-stale /w fetch-limits
Add a test case when fetch-limits are reached and we have stale data
in cache.
This test starts with a positive answer for 'data.example/TXT' in
cache.
1. Reload named.conf to set fetch limits.
2. Disable responses from the authoritative server.
3. Now send a batch of queries to the resolver, until hitting the
fetch limits. We can detect this by looking at the response RCODE,
at some point we will see SERVFAIL responses.
4. At that point we will turn on serve-stale.
5. Clients should see stale answers now.
6. An incoming query should not set the stale-refresh-time window,
so a following query should still get a stale answer because of a
resolver failure (and not because it was in the stale-refresh-time
window).
Matthijs Mekking [Wed, 27 Jan 2021 15:59:27 +0000 (16:59 +0100)]
Only start stale refresh window when resuming
If we did not attempt a fetch due to fetch-limits, we should not start
the stale-refresh-time window.
Introduce a new flag DNS_DBFIND_STALESTART to differentiate between
a resolver failure and unexpected error. If we are resuming, this
indicates a resolver failure, then start the stale-refresh-time window,
otherwise don't start the stale-refresh-time window, but still fall
back to using stale data.
(This commit also wraps some docstrings to 80 characters width)
Matthijs Mekking [Tue, 19 Jan 2021 08:04:29 +0000 (09:04 +0100)]
Use stale data also if we are not resuming
Before this change, BIND will only fallback to using stale data if
there was an actual attempt to resolve the query. Then on a timeout,
the stale data from cache becomes eligible.
This commit changes this so that on any unexpected error stale data
becomes eligble (you would still have to have 'stale-answer-enable'
enabled of course).
If there is no stale data, this may return in an error again, so don't
loop on stale data lookup attempts. If the DNS_DBFIND_STALEOK flag is
set, this means we already tried to lookup stale data, so if that is
the case, don't use stale again.
When 'opensslecdsa_parse()' encounters a label tag in the private key
file, load the private key with 'opensslecdsa_fromlabel()'. Otherwise
load it from the private structure.
This was attempted before with 'load_privkey()' and 'uses_engine()',
but had the same flaw as 'opensslecdsa_fromlabel()' had previously,
that is getting the private and public key separately, juggling with
pointers between EC_KEY and EVP_PKEY, did not create a valid
cryptographic key that could be used for signing.
The 'opensslecdsa_fromlabel()' function does not need to get the
OpenSSL engine twice to load the private and public key. Also no need
to call 'dst_key_to_eckey()' as the EC_KEY can be derived from the
loaded EVP_PKEY's.
Add some extra checks to ensure the key has the same base id and curve
(group nid) as the dst key.
Since we already have the EVP_PKEY, no need to call 'finalize_eckey()',
instead just set the right values in the key structure.
Matthijs Mekking [Tue, 15 Dec 2020 13:09:05 +0000 (14:09 +0100)]
Don't set pubkey if eckey already has public key
The 'ecdsa_check()' function tries to correctly set the public key
on the eckey, but this should be skipped if the public key is
retrieved via the private key.
Matthijs Mekking [Mon, 30 Nov 2020 11:28:11 +0000 (12:28 +0100)]
Correctly update pointers to pubkey and privkey
The functions 'load_pubkey_from_engine()' and
'load_privkey_from_engine()' did not correctly store the pointers.
Update both functions to add 'EC_KEY_set_public_key()' and
'EC_KEY_set_private_key()' respectively, so that the pointers to
the public and private keys survive the "load from engine" functions.
Matthijs Mekking [Wed, 25 Nov 2020 08:23:57 +0000 (09:23 +0100)]
load_pubkey_from_engine() should load public key
The 'function load_pubkey_from_engine()' made a call to the libssl
function 'ENGINE_load_private_key'. This is a copy paste error and
should be 'ENGINE_load_public_key'.
Matthijs Mekking [Wed, 20 Jan 2021 15:13:49 +0000 (16:13 +0100)]
Update code flow in query.c wrt stale data
First of all, there was a flaw in the code related to the
'stale-refresh-time' option. If stale answers are enabled, and we
returned stale data, then it was assumed that it was because we were
in the 'stale-refresh-time' window. But now we could also have returned
stale data because of a 'stale-answer-client-timeout'. To fix this,
introduce a rdataset attribute DNS_RDATASETATTR_STALE_WINDOW to
indicate whether the stale cache entry was returned because the
'stale-refresh-time' window is active.
Second, remove the special case handling when the result is
DNS_R_NCACHENXRRSET. This can be done more generic in the code block
when dealing with stale data.
Putting all stale case handling in the code block when dealing with
stale data makes the code more easy to follow.
Update documentation to be more verbose and to match then new code
flow.
Diego Fronza [Thu, 14 Jan 2021 20:44:19 +0000 (17:44 -0300)]
Extracted common function from query_lookup and query_refresh_rrset
Both functions employed the same code lines to allocate query context
buffers, which are used to store query results, so this shared portion
of code was extracted out to a new function, qctx_prepare_buffers.
Also, this commit uses qctx_init to initialize the query context whitin
query_refresh_rrset function.
Diego Fronza [Wed, 23 Dec 2020 13:47:02 +0000 (10:47 -0300)]
Add system tests for stale-answer-client-timeout
This commit add 4 tests for the new option:
1. Test default configuration of stale-answer-client-timeout, a
value of 1.8 seconds, with stale-refresh-time disabled.
2. Test disabling of stale-answer-client-timeout.
3. Test stale-answer-client-timeout with a value of zero, in this
case we take advantage of a log entry which shows that a stale
answer was promptly used before an attempt to refresh the RRset
is made. We also check, by activating a disabled authoritative
server, that the RRset was successfully refreshed after that.
4. Test stale-answer-client-timeout 0 with stale-refresh-time 4, in
this test we want to ensure a couple things:
- If we have a stale RRSet entry in cache, a request must be
promptly answered with this data, while BIND must also attempt
to refresh the RRSet in background.
- If the attempt to refresh the RRSet times out, the RRSet must
have its stale-refresh-time window activated.
- If a new request for the same RRSet arrives, it must be
promptly answered with stale data due to stale-refresh-time
being active for this RRSet, in this case no attempt to refresh
the RRSet is made.
- Enable authoritative server, ensure that the RRSet was not
refreshed, to honor stale-refresh-time.
- Wait for stale-refresh-window time pass, send another request
for the same RRSet, this time we expect the answer to be the
stale entry in cache being hit due to
stale-answer-client-timeout 0.
- Send another request, this time we expect the answer to be an
active RRSet, since it must have been refreshed during the
previous request.
Diego Fronza [Mon, 21 Dec 2020 18:54:54 +0000 (15:54 -0300)]
Allow stale data to be used before name resolution
This commit allows stale RRset to be used (if available) for responding
a query, before an attempt to refresh an expired, or otherwise resolve
an unavailable RRset in cache is made.
For that to work, a value of zero must be specified for
stale-answer-client-timeout statement.
To better understand the logic implemented, there are three flags being
used during database lookup and other parts of code that must be
understood:
. DNS_DBFIND_STALEOK: This flag is set when BIND fails to refresh a
RRset due to timeout (resolver-query-timeout), its intent is to
try to look for stale data in cache as a fallback, but only if
stale answers are enabled in configuration.
This flag is also used to activate stale-refresh-time window, since it
is the only way the database knows that a resolution has failed.
. DNS_DBFIND_STALEENABLED: This flag is used as a hint to the database
that it may use stale data. It is always set during query lookup if
stale answers are enabled, but only effectively used during
stale-refresh-time window. Also during this window, the resolver will
not try to resolve the query, in other words no attempt to refresh the
data in cache is made when the stale-refresh-time window is active.
. DNS_DBFIND_STALEONLY: This new introduced flag is used when we want
stale data from the database, but not due to a failure in resolution,
it also doesn't require stale-refresh-time window timer to be active.
As long as there is a stale RRset available, it should be returned.
It is mainly used in two situations:
1. When stale-answer-client-timeout timer is triggered: in that case
we want to know if there is stale data available to answer the
client.
2. When stale-answer-client-timeout value is set to zero: in that
case, we also want to know if there is some stale RRset available
to promptly answer the client.
We must also discern between three situations that may happen when
resolving a query after the addition of stale-answer-client-timeout
statement, and how to handle them:
1. Are we running query_lookup() due to stale-answer-client-timeout
timer being triggered?
In this case, we look for stale data, making use of
DNS_DBFIND_STALEONLY flag. If a stale RRset is available then
respond the client with the data found, mark this query as
answered (query attribute NS_QUERYATTR_ANSWERED), so when the
fetch completes the client won't be answered twice.
We must also take care of not detaching from the client, as a
fetch will still be running in background, this is handled by the
following snippet:
if (!QUERY_STALEONLY(&client->query)) {
isc_nmhandle_detach(&client->reqhandle);
}
Which basically tests if DNS_DBFIND_STALEONLY flag is set, which
means we are here due to a stale-answer-client-timeout timer
expiration.
2. Are we running query_lookup() due to resolver-query-timeout being
triggered?
In this case, DNS_DBFIND_STALEOK flag will be set and an attempt
to look for stale data will be made.
As already explained, this flag is algo used to activate
stale-refresh-time window, as it means that we failed to refresh
a RRset due to timeout.
It is ok in this situation to detach from the client, as the
fetch is already completed.
3. Are we running query_lookup() during the first time, looking for
a RRset in cache and stale-answer-client-timeout value is set to
zero?
In this case, if stale answers are enabled (probably), we must do
an initial database lookup with DNS_DBFIND_STALEONLY flag set, to
indicate to the database that we want stale data.
If we find an active RRset, proceed as normal, answer the client
and the query is done.
If we find a stale RRset we respond to the client and mark the
query as answered, but don't detach from the client yet as an
attempt in refreshing the RRset will still be made by means of
the new introduced function 'query_resolve'.
If no active or stale RRset is available, begin resolution as
usual.
Diego Fronza [Fri, 18 Dec 2020 19:24:56 +0000 (16:24 -0300)]
Added option for disabling stale-answer-client-timeout
This commit allows to specify "disabled" or "off" in
stale-answer-client-timeout statement. The logic to support this
behavior will be added in the subsequent commits.
This commit also ensures an upper bound to stale-answer-client-timeout
which equals to one second less than 'resolver-query-timeout'.
Diego Fronza [Fri, 11 Dec 2020 19:12:48 +0000 (16:12 -0300)]
Adjusted serve-stale test
After the addition of stale-answer-client-timeout a test was broken due
to the following behavior expected by the test.
1. Prime cache data.example txt.
2. Disable authoritative server.
3. Send a query for data.example txt.
4. Recursive server will timeout and answer from cache with stale RRset.
5. Recursive server will activate stale-refresh-time due to the previous
failure in attempting to refresh the RRset.
6. Send a query for data.example txt.
7. Expect stale answer from cache due to stale-refresh-time
window being active, even if authoritative server is up.
Problem is that in step 4, due to the new option
stale-answer-client-timeout, recursive server will answer with stale
data before the actual fetch completes.
Since the original fetch is still running in background, if we re-enable
the authoritative server during that time, the RRset will actually be
successfully refreshed, and stale-refresh-window will not be activated.
The next queries will fail because they expect the TTL of the RRset to
match the one in the stale cache, not the one just refreshed.
To solve this, we explicitly disable stale-answer-client-timeout for
this test, as it's not the feature we are interested in testing here
anyways.
Diego Fronza [Fri, 11 Dec 2020 17:10:31 +0000 (14:10 -0300)]
Add stale-answer-client-timeout option
The general logic behind the addition of this new feature works as
folows:
When a client query arrives, the basic path (query.c / ns_query_recurse)
was to create a fetch, waiting for completion in fetch_callback.
With the introduction of stale-answer-client-timeout, a new event of
type DNS_EVENT_TRYSTALE may invoke fetch_callback, whenever stale
answers are enabled and the fetch took longer than
stale-answer-client-timeout to complete.
When an event of type DNS_EVENT_TRYSTALE triggers fetch_callback, we
must ensure that the folowing happens:
1. Setup a new query context with the sole purpose of looking up for
stale RRset only data, for that matters a new flag was added
'DNS_DBFIND_STALEONLY' used in database lookups.
. If a stale RRset is found, mark the original client query as
answered (with a new query attribute named NS_QUERYATTR_ANSWERED),
so when the fetch completion event is received later, we avoid
answering the client twice.
. If a stale RRset is not found, cleanup and wait for the normal
fetch completion event.
2. In ns_query_done, we must change this part:
/*
* If we're recursing then just return; the query will
* resume when recursion ends.
*/
if (RECURSING(qctx->client)) {
return (qctx->result);
}
To this:
if (RECURSING(qctx->client) && !QUERY_STALEONLY(qctx->client)) {
return (qctx->result);
}
Otherwise we would not proceed to answer the client if it happened
that a stale answer was found when looking up for stale only data.
When an event of type DNS_EVENT_FETCHDONE triggers fetch_callback, we
proceed as before, resuming query, updating stats, etc, but a few
exceptions had to be added, most important of which are two:
1. Before answering the client (ns_client_send), check if the query
wasn't already answered before.
2. Before detaching a client, e.g.
isc_nmhandle_detach(&client->reqhandle), ensure that this is the
fetch completion event, and not the one triggered due to
stale-answer-client-timeout, so a correct call would be:
if (!QUERY_STALEONLY(client)) {
isc_nmhandle_detach(&client->reqhandle);
}
Other than these notes, comments were added in code in attempt to make
these updates easier to follow.
Diego Fronza [Sat, 28 Nov 2020 21:07:29 +0000 (18:07 -0300)]
Avoid iterating name twice when constructing fctx->info
This is a minor performance improvement, we store the result of the
first call to strlcat to use as an offset in the next call when
constructing fctx->info string.
Ondřej Surý [Tue, 12 Jan 2021 12:38:44 +0000 (13:38 +0100)]
Use -release instead of -version-info for internal library SONAMEs
The BIND 9 libraries are considered to be internal only and hence the
API and ABI changes a lot. Keeping track of the API/ABI changes takes
time and it's a complicated matter as the safest way to make everything
stable would be to bump any library in the dependency chain as in theory
if libns links with libdns, and a binary links with both, and we bump
the libdns SOVERSION, but not the libns SOVERSION, the old libns might
be loaded by binary pulling old libdns together with new libdns loaded
by the binary. The situation gets even more complicated with loading
the plugins that have been compiled with few versions old BIND 9
libraries and then dynamically loaded into the named.
We are picking the safest option possible and usable for internal
libraries - instead of using -version-info that has only a weak link to
BIND 9 version number, we are using -release libtool option that will
embed the corresponding BIND 9 version number into the library name.
That means that instead of libisc.so.1701 (as an example) the library
will now be named libisc-9.17.10.so.
Ondřej Surý [Thu, 17 Dec 2020 10:40:29 +0000 (11:40 +0100)]
Refactor TLSDNS module to work with libuv/ssl directly
* Following the example set in 634bdfb16d8, the tlsdns netmgr
module now uses libuv and SSL primitives directly, rather than
opening a TLS socket which opens a TCP socket, as the previous
model was difficult to debug. Closes #2335.
* Remove the netmgr tls layer (we will have to re-add it for DoH)
* Add isc_tls API to wrap the OpenSSL SSL_CTX object into libisc
library; move the OpenSSL initialization/deinitialization from dstapi
needed for OpenSSL 1.0.x to the isc_tls_{initialize,destroy}()
* Add couple of new shims needed for OpenSSL 1.0.x
* When LibreSSL is used, require at least version 2.7.0 that
has the best OpenSSL 1.1.x compatibility and auto init/deinit
* Enforce OpenSSL 1.1.x usage on Windows
* Added a TLSDNS unit test and implemented a simple TLSDNS echo
server and client.
Evan Hunt [Wed, 20 Jan 2021 21:37:52 +0000 (13:37 -0800)]
check whether taskset works before running cpu test
the taskset command used for the cpu system test seems
to be failing under vmware, causing a test failure. we
can try the taskset command and skip the test if it doesn't
work.
Matthijs Mekking [Mon, 14 Dec 2020 09:36:17 +0000 (10:36 +0100)]
Remove the option 'filter-aaaa' options
The 'filter-aaaa', 'filter-aaaa-on-v4', and 'filter-aaaa-on-v6' options
are replaced by the filter-aaaa plugin. This plugin was introduced in
9.13.5 and so it is safe to remove the named.conf options.
When compiling BIND 9 without lmdb, this is promoted from
'not operational' to 'not configured', resulting in a failure (and no
longer a warning) if ldmb-related configuration options are set.
Special case certain system tests to avoid test failures on systems
that do not have lmdb.