Ondřej Surý [Thu, 12 Nov 2020 09:32:18 +0000 (10:32 +0100)]
Refactor netmgr and add more unit tests
This is a part of the works that intends to make the netmgr stable,
testable, maintainable and tested. It contains a numerous changes to
the netmgr code and unfortunately, it was not possible to split this
into smaller chunks as the work here needs to be committed as a complete
works.
NOTE: There's a quite a lot of duplicated code between udp.c, tcp.c and
tcpdns.c and it should be a subject to refactoring in the future.
The changes that are included in this commit are listed here
(extensively, but not exclusively):
* The netmgr_test unit test was split into individual tests (udp_test,
tcp_test, tcpdns_test and newly added tcp_quota_test)
* The udp_test and tcp_test has been extended to allow programatic
failures from the libuv API. Unfortunately, we can't use cmocka
mock() and will_return(), so we emulate the behaviour with #define and
including the netmgr/{udp,tcp}.c source file directly.
* The netievents that we put on the nm queue have variable number of
members, out of these the isc_nmsocket_t and isc_nmhandle_t always
needs to be attached before enqueueing the netievent_<foo> and
detached after we have called the isc_nm_async_<foo> to ensure that
the socket (handle) doesn't disappear between scheduling the event and
actually executing the event.
* Cancelling the in-flight TCP connection using libuv requires to call
uv_close() on the original uv_tcp_t handle which just breaks too many
assumptions we have in the netmgr code. Instead of using uv_timer for
TCP connection timeouts, we use platform specific socket option.
* Fix the synchronization between {nm,async}_{listentcp,tcpconnect}
When isc_nm_listentcp() or isc_nm_tcpconnect() is called it was
waiting for socket to either end up with error (that path was fine) or
to be listening or connected using condition variable and mutex.
Several things could happen:
0. everything is ok
1. the waiting thread would miss the SIGNAL() - because the enqueued
event would be processed faster than we could start WAIT()ing.
In case the operation would end up with error, it would be ok, as
the error variable would be unchanged.
2. the waiting thread miss the sock->{connected,listening} = `true`
would be set to `false` in the tcp_{listen,connect}close_cb() as
the connection would be so short lived that the socket would be
closed before we could even start WAIT()ing
* The tcpdns has been converted to using libuv directly. Previously,
the tcpdns protocol used tcp protocol from netmgr, this proved to be
very complicated to understand, fix and make changes to. The new
tcpdns protocol is modeled in a similar way how tcp netmgr protocol. Closes: #2194, #2283, #2318, #2266, #2034, #1920
* The tcp and tcpdns is now not using isc_uv_import/isc_uv_export to
pass accepted TCP sockets between netthreads, but instead (similar to
UDP) uses per netthread uv_loop listener. This greatly reduces the
complexity as the socket is always run in the associated nm and uv
loops, and we are also not touching the libuv internals.
There's an unfortunate side effect though, the new code requires
support for load-balanced sockets from the operating system for both
UDP and TCP (see #2137). If the operating system doesn't support the
load balanced sockets (either SO_REUSEPORT on Linux or SO_REUSEPORT_LB
on FreeBSD 12+), the number of netthreads is limited to 1.
* The netmgr has now two debugging #ifdefs:
1. Already existing NETMGR_TRACE prints any dangling nmsockets and
nmhandles before triggering assertion failure. This options would
reduce performance when enabled, but in theory, it could be enabled
on low-performance systems.
2. New NETMGR_TRACE_VERBOSE option has been added that enables
extensive netmgr logging that allows the software engineer to
precisely track any attach/detach operations on the nmsockets and
nmhandles. This is not suitable for any kind of production
machine, only for debugging.
* The tlsdns netmgr protocol has been split from the tcpdns and it still
uses the old method of stacking the netmgr boxes on top of each other.
We will have to refactor the tlsdns netmgr protocol to use the same
approach - build the stack using only libuv and openssl.
* Limit but not assert the tcp buffer size in tcp_alloc_cb Closes: #2061
Mark Andrews [Thu, 26 Nov 2020 04:59:14 +0000 (15:59 +1100)]
Adjust default value of "max-recursion-queries"
Since the queries sent towards root and TLD servers are now included in
the count (as a result of the fix for CVE-2020-8616),
"max-recursion-queries" has a higher chance of being exceeded by
non-attack queries. Increase its default value from 75 to 100.
Michal Nowak [Thu, 19 Nov 2020 11:33:27 +0000 (12:33 +0100)]
Drop bin/tests/headerdep_test.sh.in
The bin/tests/headerdep_test.sh script has not been updated since it was
first created and it cannot be used as-is with the current BIND source
code. Better tools (e.g. "include-what-you-use") emerged since the
script was committed back in 2000, so instead of trying to bring it up
to date, remove it from the source repository.
Michal Nowak [Thu, 19 Nov 2020 09:35:57 +0000 (10:35 +0100)]
Revise OPTIONS.md
- The STD_CDEFINES build-time variable was dropped when the build
system was migrated to Automake. CPPFLAGS is the variable which
should now be used for setting preprocessor macros.
- Sort the list of preprocessor macros which affect BIND behavior.
Remove ISC_BUFFER_USEINLINE from the list as it can be controlled
using its relevant ./configure option (--enable-buffer-useinline).
Rename NS_RUN_PID_DIR to NAMED_RUN_PID_DIR to match the source code.
Mark Andrews [Thu, 12 Nov 2020 22:45:47 +0000 (09:45 +1100)]
Tighten DNS COOKIE response handling
Fallback to TCP when we have already seen a DNS COOKIE response
from the given address and don't have one in this UDP response. This
could be a server that has turned off DNS COOKIE support, a
misconfigured anycast server with partial DNS COOKIE support, or a
spoofed response. Falling back to TCP is the correct behaviour in
all 3 cases.
Michal Nowak [Tue, 24 Nov 2020 16:39:23 +0000 (17:39 +0100)]
Write traceback file to the same directory as core file
The traceback files could overwrite each other on systems which do not
use different core dump file names for different processes. Prevent
that by writing the traceback file to the same directory as the core
dump file.
These changes still do not prevent the operating system from overwriting
a core dump file if the same binary crashes multiple times in the same
directory and core dump files are named identically for different
processes.
Diego Fronza [Wed, 25 Nov 2020 19:01:06 +0000 (16:01 -0300)]
Silence coverity warnings in query.c
Return value of dns_db_getservestalerefresh() and
dns_db_getservestalettl() functions were previously unhandled.
This commit purposefully ignore those return values since there is
no side effect if those results are != ISC_R_SUCCESS, it also supress
Coverity warnings.
Michał Kępień [Thu, 26 Nov 2020 12:10:40 +0000 (13:10 +0100)]
Use proper cmocka macros for pointer checks
Make sure pointer checks in unit tests use cmocka assertion macros
dedicated for use with pointers instead of those dedicated for use with
integers or booleans.
Matthijs Mekking [Fri, 13 Nov 2020 11:26:05 +0000 (12:26 +0100)]
Add NSEC3PARAM unit test, refactor zone.c
Add unit test to ensure the right NSEC3PARAM event is scheduled in
'dns_zone_setnsec3param()'. To avoid scheduling and managing actual
tasks, split up the 'dns_zone_setnsec3param()' function in two parts:
1. 'dns__zone_lookup_nsec3param()' that will check if the requested
NSEC3 parameters already exist, and if a new salt needs to be
generated.
2. The actual scheduling of the new NSEC3PARAM event (if needed).
When generating a new salt, compare it with the previous NSEC3
paremeters to ensure the new parameters are different from the
previous ones.
This moves the salt generation call from 'bin/named/*.s' to
'lib/dns/zone.c'. When setting new NSEC3 parameters, you can set a new
function parameter 'resalt' to enforce a new salt to be generated. A
new salt will also be generated if 'salt' is set to NULL.
Logging salt with zone context can now be done with 'dnssec_log',
removing the need for 'dns_nsec3_log_salt'.
Matthijs Mekking [Fri, 23 Oct 2020 13:02:19 +0000 (15:02 +0200)]
Change nsec3param salt config to saltlen
Upon request from Mark, change the configuration of salt to salt
length.
Introduce a new function 'dns_zone_checknsec3aram' that can be used
upon reconfiguration to check if the existing NSEC3 parameters are
in sync with the configuration. If a salt is used that matches the
configured salt length, don't change the NSEC3 parameters.
Matthijs Mekking [Tue, 13 Oct 2020 15:48:22 +0000 (17:48 +0200)]
Check nsec3param configuration values
Check 'nsec3param' configuration for the number of iterations. The
maximum number of iterations that are allowed are based on the key
size (see https://tools.ietf.org/html/rfc5155#section-10.3).
Check 'nsec3param' configuration for correct salt. If the string is
not "-" or hex-based, this is a bad salt.
Matthijs Mekking [Tue, 13 Oct 2020 12:52:02 +0000 (14:52 +0200)]
Don't use 'rndc signing' with kasp
The 'rndc signing' command allows you to manipulate the private
records that are used to store signing state. Don't use these with
'dnssec-policy' as such manipulations may violate the policy (if you
want to change the NSEC3 parameters, change the policy and reconfig).
Matthijs Mekking [Tue, 13 Oct 2020 12:48:04 +0000 (14:48 +0200)]
Fix a reconfig bug wrt inline-signing
When doing 'rndc reconfig', named may complain about a zone not being
reusable because it has a raw version of the zone, and the new
configuration has not set 'inline-signing'. However, 'inline-signing'
may be implicitly true if a 'dnssec-policy' is used for the zone, and
the zone is not dynamic.
Improve the check in 'named_zone_reusable'. Create a new function for
checking 'inline-signing' configuration that matches existing code in
'bin/named/server.c'.
Matthijs Mekking [Tue, 13 Oct 2020 12:39:21 +0000 (14:39 +0200)]
Support for NSEC3 in dnssec-policy
Implement support for NSEC3 in dnssec-policy. Store the configuration
in kasp objects. When configuring a zone, call 'dns_zone_setnsec3param'
to queue an nsec3param event. This will ensure that any previous
chains will be removed and a chain according to the dnssec-policy is
created.
Add tests for dnssec-policy zones that uses the new 'nsec3param'
option, as well as changing to new values, changing to NSEC, and
changing from NSEC.
Michał Kępień [Wed, 25 Nov 2020 11:45:47 +0000 (12:45 +0100)]
Convert add_quota() to a function
cppcheck 2.2 reports the following false positive:
lib/isc/tests/quota_test.c:71:21: error: Array 'quotas[101]' accessed at index 110, which is out of bounds. [arrayIndexOutOfBounds]
isc_quota_t *quotas[110];
^
The above is not even an array access, so this report is obviously
caused by a cppcheck bug. Yet, it seems to be triggered by the presence
of the add_quota() macro, which should really be a function. Convert
the add_quota() macro to a function in order to make the code cleaner
and to prevent the above cppcheck 2.2 false positive from being
triggered.
Michał Kępień [Wed, 25 Nov 2020 11:45:47 +0000 (12:45 +0100)]
Silence cppcheck 2.2 false positive in udp_recv()
cppcheck 2.2 reports the following false positive:
lib/dns/dispatch.c:1239:14: warning: Either the condition 'resp==NULL' is redundant or there is possible null pointer dereference: resp. [nullPointerRedundantCheck]
if (disp != resp->disp) {
^
lib/dns/dispatch.c:1210:11: note: Assuming that condition 'resp==NULL' is not redundant
if (resp == NULL) {
^
lib/dns/dispatch.c:1239:14: note: Null pointer dereference
if (disp != resp->disp) {
^
Apparently this version of cppcheck gets confused about conditional
"goto" statements because line 1239 can never be reached if 'resp' is
NULL.
Move a code block to prevent the above false positive from being
reported without affecting the processing logic.
Michał Kępień [Wed, 25 Nov 2020 11:45:47 +0000 (12:45 +0100)]
Teach cppcheck that fatal() does not return
cppcheck is not aware that the bin/dnssec/dnssectool.c:fatal() function
does not return. This triggers certain cppcheck 2.2 false positives,
for example:
bin/dnssec/dnssec-signzone.c:3471:13: warning: Either the condition 'ndskeys==8' is redundant or the array 'dskeyfile[8]' is accessed at index 8, which is out of bounds. [arrayIndexOutOfBoundsCond]
dskeyfile[ndskeys++] = isc_commandline_argument;
^
bin/dnssec/dnssec-signzone.c:3468:16: note: Assuming that condition 'ndskeys==8' is not redundant
if (ndskeys == MAXDSKEYS) {
^
bin/dnssec/dnssec-signzone.c:3471:13: note: Array index out of bounds
dskeyfile[ndskeys++] = isc_commandline_argument;
^
bin/dnssec/dnssec-signzone.c:772:20: warning: Either the condition 'l->hashbuf==NULL' is redundant or there is pointer arithmetic with NULL pointer. [nullPointerArithmeticRedundantCheck]
memset(l->hashbuf + l->entries * l->length, 0, l->length);
^
bin/dnssec/dnssec-signzone.c:768:18: note: Assuming that condition 'l->hashbuf==NULL' is not redundant
if (l->hashbuf == NULL) {
^
bin/dnssec/dnssec-signzone.c:772:20: note: Null pointer addition
memset(l->hashbuf + l->entries * l->length, 0, l->length);
^
Instead of suppressing all such warnings individually, conditionally
define a preprocessor macro which prevents them from being triggered.
Michał Kępień [Wed, 25 Nov 2020 11:45:47 +0000 (12:45 +0100)]
Remove cppcheck 2.0 false positive workarounds
The cppcheck bug which commit 481fa34e50a6183273f71175adf93bfb12cad1e9
works around was fixed in cppcheck 2.2. Drop the relevant hack from the
definition of the cppcheck GitLab CI job.
Evan Hunt [Fri, 20 Nov 2020 01:58:45 +0000 (17:58 -0800)]
create system test with asynchronous plugin
the test-async plugin uses ns_query_hookasync() at the
NS_QUERY_DONE_SEND hook point to call an asynchronous function.
the only effect is to change the query response code to "NOTIMP",
so we can confirm that the hook ran and resumed correctly.
implementation of hook-based asynchronous functionality
previously query plugins were strictly synchrounous - the query
process would be interrupted at some point, data would be looked
up or a change would be made, and then the query processing would
resume immediately.
this commit enables query plugins to initiate asynchronous processes
and resume on a completion event, as with recursion.
several small changes to query processing to make it easier to
use hook-based recursion (and other asynchronous functionlity)
later.
- recursion quota check is now a separate function,
check_recursionquota(), which is called by ns_query_recurse().
- pass isc_result to query_nxdomain() instead of bool.
the value of 'empty_wild' will be determined in the function
based on the passed result. this is similar to query_nodata(),
and makes the signatures of the two functions more consistent.
- pass the current 'result' value into plugin hooks.
Michał Kępień [Tue, 24 Nov 2020 13:51:51 +0000 (14:51 +0100)]
Refactor libidn2 detection code
Make the code block handling the --with-libidn2=/path/to/libidn2 form of
the --with-libidn2 build-time option behave more similarly to the
PKG_CHECK_MODULES() macro.
Michal Nowak [Mon, 23 Nov 2020 14:07:56 +0000 (15:07 +0100)]
Remove unused DLZ_DRIVER_MYSQL_* build variables
The DLZ_DRIVER_MYSQL_INCLUDES and DLZ_DRIVER_MYSQL_LIBS build variables
are not used anywhere. Remove their definitions and the associated
AC_SUBST() calls.
Michal Nowak [Mon, 23 Nov 2020 13:59:12 +0000 (14:59 +0100)]
Remove AC_SUBST() calls from AX_LIB_LMDB()
LMDB build variables are already substituted by AC_SUBST() calls in
configure.ac and therefore the latter should not be duplicated in the
AX_LIB_LMDB() helper macro.
Michał Kępień [Mon, 23 Nov 2020 10:46:50 +0000 (11:46 +0100)]
Enable "stress" tests to be run on demand
The "stress" test can be run in different ways, depending on:
- the tested scenario (authoritative, recursive),
- the operating system used (Linux, FreeBSD),
- the architecture used (amd64, arm64).
Currently, all supported "stress" test variants are automatically
launched for all scheduled pipelines and for pipelines started for tags;
there is no possibility of running these tests on demand, which could be
useful in certain circumstances.
Employ the "only:variables" key to enable fine-grained control over the
list of "stress" test jobs to be run for a given pipeline. Three CI
variables are used to specify the list of "stress" test jobs to create:
- BIND_STRESS_TEST_MODE: specifies the test mode to use; must be
explicitly set in order for any "stress" test job to be created;
allowed values are: "authoritative", "recursive",
- BIND_STRESS_TEST_OS: specifies the operating system to run the test
on; allowed values are: "linux", "freebsd"; defaults to "linux", may
be overridden at pipeline creation time,
- BIND_STRESS_TEST_ARCH: specifies the architecture to run the test
on; allowed values are: "amd64", "arm64"; defaults to "amd64", may
be overridden at pipeline creation time.
Since case-insensitive regular expressions are used for determining
which jobs to run, every variable described above may contain multiple
values. For example, setting the BIND_STRESS_TEST_MODE variable to
"authoritative,recursive" will cause the "stress" test to be run in both
supported scenarios (either on the default OS/architecture combination,
i.e. Linux/amd64, or, if the relevant variables are explicitly
specified, the requested OS/architecture combinations).
Ondřej Surý [Wed, 11 Nov 2020 09:46:33 +0000 (10:46 +0100)]
Turn all the callback to be always asynchronous
When calling the high level netmgr functions, the callback would be
sometimes called synchronously if we catch the failure directly, or
asynchronously if it happens later. The synchronous call to the
callback could create deadlocks as the caller would not expect the
failed callback to be executed directly.
Matthijs Mekking [Tue, 10 Nov 2020 13:48:24 +0000 (14:48 +0100)]
Change serve-stale test stale-answer-ttl
Using a 'stale-answer-ttl' the same value as the authoritative ttl
value makes it hard to differentiate between a response from the
stale cache and a response from the authoritative server.
Change the stale-answer-ttl from 2 to 4, so that it differs from the
authoritative ttl.
Diego Fronza [Tue, 20 Oct 2020 19:07:56 +0000 (16:07 -0300)]
Wait for multiple parallel dig commands to fully finish
The strategy of running many dig commands in parallel and
waiting for the respective output files to be non empty was
resulting in random test failures, hard to reproduce, where
it was possible that the subsequent reading of the files could
have been failing due to the file's content not being fully flushed.
Instead of checking if output files are non empty, we now wait
for the dig processes to finish.