Ondřej Surý [Tue, 20 Oct 2020 21:51:08 +0000 (23:51 +0200)]
Use libuv's shared library handling capabilities
While libltdl is a feature-rich library, BIND 9 code only uses its basic
capabilities, which are also provided by libuv and which BIND 9 already
uses for other purposes. As libuv's cross-platform shared library
handling interface is modeled after the POSIX dlopen() interface,
converting code using the latter to the former is simple. Replace
libltdl function calls with their libuv counterparts, refactoring the
code as necessary. Remove all use of libltdl from the BIND 9 source
tree.
Ondřej Surý [Tue, 20 Oct 2020 21:51:08 +0000 (23:51 +0200)]
Refactor the cleanup code in lt_dl code
The cleanup code that would clean the object after plugin/dlz/dyndb
loading has failed was duplicating the destructor for the object, so
instead of the extra code, we just use the destructor instead.
Ondřej Surý [Wed, 28 Oct 2020 14:25:44 +0000 (15:25 +0100)]
Unify lt_dlopen() error handling
Make sure an error gets logged when any lt_dlopen() call in the source
tree fails. Also make sure that NULL values returned by lt_dlerror()
are replaced with a generic error message to prevent passing NULL as an
argument for the %s format specifier.
Ondřej Surý [Mon, 26 Oct 2020 10:14:49 +0000 (11:14 +0100)]
Remove redundant lt_dlerror() calls
The redundant lt_dlerror() calls were taken from the examples to clean
any previous errors from lt_dl...() calls. However upon code
inspection, it was discovered there are no such paths that could cause
the lt_dlerror() to return spurious error messages.
Michal Nowak [Tue, 27 Oct 2020 09:20:05 +0000 (10:20 +0100)]
Get rid of bashisms in string comparisons
The double equal sign ('==') is a Bash-specific string comparison
operator. Ensure the single equal sign ('=') is used in all POSIX shell
scripts in the system test suite in order to retain their portability.
Michal Nowak [Tue, 16 Jun 2020 12:19:41 +0000 (14:19 +0200)]
Add "stress" tests to GitLab CI
Run "stress" tests for scheduled pipelines and pipelines created for
tags. These tests were previously only performed manually (as part of
pre-release testing of each new BIND version). Their purpose is to
detect memory leaks and potential performance issues.
As the run time of each "stress" test itself is set to 1 hour, set the
GitLab CI job timeout to 2 hours in order to account for the extra time
needed to set the test up and gather its results.
Ondřej Surý [Thu, 22 Oct 2020 10:32:18 +0000 (12:32 +0200)]
Postpone the isc_app_shutdown() after rndc response has been sent
When `rndc stop` is received, the isc_app_shutdown() was being called
before response to the rndc client has been sent; as the
isc_app_shutdown() also tears down the netmgr, the message was never
sent and rndc would complain about connection being interrupted in the
middle of the transaction. We now postpone the shutdown after the rndc
response has been sent.
Ondřej Surý [Wed, 21 Oct 2020 10:52:09 +0000 (12:52 +0200)]
Fix the isc_nm_closedown() to actually close the pending connections
1. The isc__nm_tcp_send() and isc__nm_tcp_read() was not checking
whether the socket was still alive and scheduling reads/sends on
closed socket.
2. The isc_nm_read(), isc_nm_send() and isc_nm_resumeread() have been
changed to always return the error conditions via the callbacks, so
they always succeed. This applies to all protocols (UDP, TCP and
TCPDNS).
Ondřej Surý [Wed, 21 Oct 2020 06:56:21 +0000 (08:56 +0200)]
Fix the way tcp_send_direct() is used
There were two problems how tcp_send_direct() was used:
1. The tcp_send_direct() can return ISC_R_CANCELED (or translated error
from uv_tcp_send()), but the isc__nm_async_tcpsend() wasn't checking
the error code and not releasing the uvreq in case of an error.
2. In isc__nm_tcp_send(), when the TCP send is already in the right
netthread, it uses tcp_send_direct() to send the TCP packet right
away. When that happened the uvreq was not freed, and the error code
was returned to the caller. We need to return ISC_R_SUCCESS and
rather use the callback to report an error in such case.
Ondřej Surý [Tue, 20 Oct 2020 18:57:19 +0000 (20:57 +0200)]
Explicitly stop reading before closing the nmtcpsocket
When closing the socket that is actively reading from the stream, the
read_cb() could be called between uv_close() and close callback when the
server socket has been already detached hence using sock->statichandle
after it has been already freed.
Ondřej Surý [Tue, 20 Oct 2020 06:07:44 +0000 (08:07 +0200)]
Fix the way udp_send_direct() is used
There were two problems how udp_send_direct() was used:
1. The udp_send_direct() can return ISC_R_CANCELED (or translated error
from uv_udp_send()), but the isc__nm_async_udpsend() wasn't checking
the error code and not releasing the uvreq in case of an error.
2. In isc__nm_udp_send(), when the UDP send is already in the right
netthread, it uses udp_send_direct() to send the UDP packet right
away. When that happened the uvreq was not freed, and the error code
was returned to the caller. We need to return ISC_R_SUCCESS and
rather use the callback to report an error in such case.
The reason for running the last two jobs above sequentially rather than
in parallel is that both of them create *.gcda files (containing
coverage data) in the same locations. While some way of merging these
files from different job artifact archives could probably be designed
with the help of additional tools, the simplest thing to do is not to
run unit test and system test jobs in parallel, carrying *.gcda files
over between jobs as gcov knows how to append coverage data to existing
*.gcda files.
Also note that test coverage will not be visualized if any of the jobs
in the above dependency chain fails (because the gcov job will not be
run).
Diego Fronza [Thu, 10 Sep 2020 18:33:15 +0000 (15:33 -0300)]
Added test for the proposed fix
This test is very simple, two nameserver instances are created:
- ns4: master, with 'minimal-responses yes', authoritative
for example. zone
- ns5: slave, stub zone
The first thing verified is the transfer of zone data from master
to slave, which should be saved in ns5/example.db.
After that, a query is issued to ns5 asking for target.example.
TXT, a record present in the master database with the "test" string
as content.
If that query works, it means stub zone successfully request
nameserver addresses from master, ns4.example. A/AAAA
The presence of both A/AAAA records for ns4 is also verified in the
stub zone local file, ns5/example.db.
Diego Fronza [Thu, 10 Sep 2020 18:09:14 +0000 (15:09 -0300)]
Fix transfer of glue records in stub zones if master has minimal-responses set
Stub zones don't make use of AXFR/IXFR for the transfering of zone
data, instead, a single query is issued to the master asking for
their nameserver records (NS).
That works fine unless master is configured with 'minimal-responses'
set to yes, in which case glue records are not provided by master
in the answer with nameservers authoritative for the zone, leaving
stub zones with incomplete databases.
This commit fix this problem in a simple way, when the answer with
the authoritative nameservers is received from master (stub_callback),
for each nameserver listed (save_nsrrset), a A and AAAA records for
the name is verified in the additional section, and if not present
a query is created to resolve the corresponsing missing glue.
A struct 'stub_cb_args' was added to keep relevant information for
performing a query, like TSIG key, udp size, dscp value, etc, this
information is borrowed from, and created within function 'ns_query',
where the resolving of nameserver from master starts.
A new field was added to the struct 'dns_stub', an atomic integer,
namely pending_requests, which is used to keep how many queries are
created when resolving nameserver addresses that were missing in
the glue.
When the value of pending_requests is zero we know we can release
resources, adjust zone timers, dump to zone file, etc.
Michal Nowak [Mon, 19 Oct 2020 05:47:57 +0000 (07:47 +0200)]
Run unit tests on OpenBSD in GitLab CI
Unlike other maintained BIND branches, the "main" BIND branch does not
require Kyua for running unit tests, which has been an obstacle for
adding an OpenBSD unit test job to GitLab CI. Experiments show that a
complete BIND unit test suite completes in a few minutes on OpenBSD and
that unit tests are not as severely affected by OpenBSD performance
issues as system tests are. Add a GitLab CI job which runs unit tests
on OpenBSD to every pipeline.
Diego Fronza [Thu, 1 Oct 2020 17:04:05 +0000 (14:04 -0300)]
Fix dnstap system test on FreeBSD
This commit ensures that dnstap output files captured
by fstrm_capture are properly flushed before any attempt
on reading them with dnstap-read is done.
By reading fstrm-capture source code it was noticed that
signal SIGHUP is used to flush the capture file.
Matthijs Mekking [Tue, 20 Oct 2020 08:57:16 +0000 (10:57 +0200)]
Don't increment network error stats on UV_EOF
When networking statistics was added to the netmgr (in commit 5234a8e00a6ae1df738020f27544594ccb8d5215), two lines were added that
increment the 'STATID_RECVFAIL' statistic: One if 'uv_read_start'
fails and one at the end of the 'read_cb'. The latter happens
if 'nread < 0'.
According to the libuv documentation, I/O read callbacks (such as for
files and sockets) are passed a parameter 'nread'. If 'nread' is less
than 0, there was an error and 'UV_EOF' is the end of file error, which
you may want to handle differently.
In other words, we should not treat EOF as a RECVFAIL error.
Mark Andrews [Wed, 16 Sep 2020 02:40:52 +0000 (12:40 +1000)]
Try to improve rrl timing
Add a +burst option to mdig so that we have a second to setup the
mdig calls then they run at the start of the next second.
RRL uses 'queries in a second' as a approximation to
'queries per second'. Getting the bursts of traffic to all happen in
the same second should prevent false negatives in the system test.
We now have a second to setup the traffic in. Then the traffic should
be sent at the start of the next second. If that still fails we
should move to +burst=<now+2> (further extend mdig) instead of the
implicit <now+1> as the trigger second.
Mark Andrews [Mon, 12 Oct 2020 06:51:09 +0000 (17:51 +1100)]
Complete the isc_nmhandle_detach() in the worker thread.
isc_nmhandle_detach() needs to complete in the same thread
as shutdown_walk_cb() to avoid a race. Clear the caller's
pointer then pass control to the worker if necessary.
WARNING: ThreadSanitizer: data race
Write of size 8 at 0x000000000001 by thread T1:
#0 isc_nmhandle_detach lib/isc/netmgr/netmgr.c:1258:15
#1 control_command bin/named/controlconf.c:388:3
#2 dispatch lib/isc/task.c:1152:7
#3 run lib/isc/task.c:1344:2
Clone the csock in accept_connection(), not in callback
If we clone the csock (children socket) in TCP accept_connection()
instead of passing the ssock (server socket) to the call back and
cloning it there we unbreak the assumption that every socket is handled
inside it's own worker thread and therefore we can get rid of (at least)
callback locking.
Ondřej Surý [Fri, 2 Oct 2020 07:28:29 +0000 (09:28 +0200)]
Change the isc__nm_tcpdns_stoplistening() to be asynchronous event
The isc__nm_tcpdns_stoplistening() would call isc__nmsocket_clearcb()
that would clear the .accept_cb from non-netmgr thread. Change the
tcpdns_stoplistening to enqueue ievent that would get processed in the
right netmgr thread to avoid locking.
Mark Andrews [Fri, 2 Oct 2020 06:17:51 +0000 (16:17 +1000)]
Fix the data race on shutdown/reconfig in control channel
The controllistener could be freed before the event posted by
isc_nm_stoplistening() has been processed. This commit adds
a reference counter to the controllistener to determine when
to free the listener.
Adjust legacy and digdelv tests for default 1232 EDNS Buffer Size
* the legacy test with -T maxudp512 will just fail, e.g. if the packets
larger than 512 octets are dropped along the path, the proper response
is to fail
* digdelv test was just expecting default server EDNS buffer size to be
4096, the test needed only slight adjustment
Simplify the EDNS buffer size logic for DNS Flag Day 2020
The DNS Flag Day 2020 aims to remove the IP fragmentation problem from
the UDP DNS communication. In this commit, we implement the required
changes and simplify the logic for picking the EDNS Buffer Size.
1. The defaults for `edns-udp-size`, `max-udp-size` and
`nocookie-udp-size` have been changed to `1232` (the value picked by
DNS Flag Day 2020).
2. The probing heuristics that would try 512->4096->1432->1232 buffer
sizes has been removed and the resolver will always use just the
`edns-udp-size` value.
3. Instead of just disabling the PMTUD mechanism on the UDP sockets, we
now set IP_DONTFRAG (IPV6_DONTFRAG) flag. That means that the UDP
packets won't get ever fragmented. If the ICMP packets are lost the
UDP will just timeout and eventually be retried over TCP.
Ondřej Surý [Mon, 5 Oct 2020 11:14:04 +0000 (13:14 +0200)]
Split reusing the addr/port and load-balancing socket options
The SO_REUSEADDR, SO_REUSEPORT and SO_REUSEPORT_LB has different meaning
on different platform. In this commit, we split the function to set the
reuse of address/port and setting the load-balancing into separate
functions.
The libuv library already have multiplatform support for setting
SO_REUSEADDR and SO_REUSEPORT that allows binding to the same address
and port, but unfortunately, when used after the load-balancing socket
options have been already set, it overrides the previous setting, so we
need our own helper function to enable the SO_REUSEADDR/SO_REUSEPORT
first and then enable the load-balancing socket option.