Michał Kępień [Wed, 29 Jan 2020 13:50:26 +0000 (14:50 +0100)]
List atypical failures in system test summary
Each system test can be marked as failed not only due to some tested
component(s) not behaving as expected, but also because of core dumps,
assertion failures, and/or ThreadSanitizer reports being found among its
artifacts. Make the system test summary list the tests which exhibit
such atypical symptoms to more clearly present the nature of problems
found.
Tony Finch [Thu, 16 Jan 2020 15:46:04 +0000 (15:46 +0000)]
Send NOFITY messages after deleting private-type records.
The `rndc signing -clear` command cleans up the private-type records
that keep track of zone signing activity, but before this change it
did not tell the secondary servers that the zone has changed.
Diego Fronza [Wed, 15 Jan 2020 18:22:06 +0000 (15:22 -0300)]
Added test for the proposed fix
Added test to ensure that NXDOMAIN is returned when BIND is queried for a
non existing domain in CH class (if a view of CHAOS class is configured)
and that it also doesn't crash anymore in those cases.
Diego Fronza [Wed, 15 Jan 2020 17:39:38 +0000 (14:39 -0300)]
Fixed crash when querying for non existing domain in chaos class
Function dns_view_findzonecut in view.c wasn't correctly handling
classes other than IN (chaos, hesiod, etc) whenever the name being
looked up wasn't in cache or in any of the configured zone views' database.
That resulted in a NULL fname being used in resolver.c:4900, which
in turn was triggering abort.
Witold Kręcicki [Tue, 21 Jan 2020 13:20:19 +0000 (14:20 +0100)]
dnssec: use less-or-equal when looking at SyncPublish time
If we created a key, mark its SyncPublish time as 'now' and started
bind the key might not be published if the SyncPublish time is in
the same second as the time the zone is loaded. This is mostly
for dnssec system test, as this kind of scenario is very unlikely
in a real world environment.
Witold Kręcicki [Mon, 20 Jan 2020 10:39:14 +0000 (11:39 +0100)]
Fix a race in taskmgr between worker and task pausing/unpausing.
To reproduce the race - create a task, send two events to it, first one
must take some time. Then, from the outside, pause(), unpause() and detach()
the task.
When the long-running event is processed by the task it is in
task_state_running state. When we called pause() the state changed to
task_state_paused, on unpause we checked that there are events in the task
queue, changed the state to task_state_ready and enqueued the task on the
workers readyq. We then detach the task.
The dispatch() is done with processing the event, it processes the second
event in the queue, and then shuts down the task and frees it (as it's not
referenced anymore). Dispatcher then takes the, already freed, task from
the queue where it was wrongly put, causing an use-after free and,
subsequently, either an assertion failure or a segmentation fault.
The probability of this happening is very slim, yet it might happen under a
very high load, more probably on a recursive resolver than on an
authoritative.
The fix introduces a new 'task_state_pausing' state - to which tasks
are moved if they're being paused while still running. They are moved
to task_state_paused state when dispatcher is done with them, and
if we unpause a task in paused state it's moved back to task_state_running
and not requeued.
Tony Finch [Tue, 14 Jan 2020 19:23:31 +0000 (19:23 +0000)]
dnssec: do not publish CDS records when -Psync is in the future
This is a bug I encountered when trying to schedule an algorithm
rollover. My plan, for a zone whose maximum TTL is 48h, was to sign
with the new algorithm and schedule a change of CDS records for more
than 48 hours in the future, roughly like this:
$ dnssec-keygen -a 13 -fk -Psync now+50h $zone
$ dnssec-keygen -a 13 $zone
$ dnssec-settime -Dsync now+50h $zone_ksk_old
However the algorithm 13 CDS was published immediately, which could
have made the zone bogus.
To reveal the bug using the `smartsign` test, this change just adds a
KSK with all its times in the future, so it should not affect the
existing checks at all. But the final check (that there are no CDS or
CDSNSKEY records after -Dsync) fails with the old `syncpublish()`
logic, because the future key's sync records appear early. With the
new `syncpublish()` logic the future key does not affect the test, as
expected, and it now passes.
Witold Kręcicki [Fri, 17 Jan 2020 13:42:57 +0000 (14:42 +0100)]
Fix possible race in socket destruction.
When two threads unreferenced handles coming from one socket while
the socket was being destructed we could get a use-after-free:
Having handle H1 coming from socket S1, H2 coming from socket S2,
S0 being a parent socket to S1 and S2:
Thread A Thread B
Unref handle H1 Unref handle H2
Remove H1 from S1 active handles Remove H2 from S2 active handles
nmsocket_maybe_destroy(S1) nmsocket_maybe_destroy(S2)
nmsocket_maybe_destroy(S0) nmsocket_maybe_destroy(S0)
LOCK(S0->lock)
Go through all children, figure
out that we have no more active
handles:
sum of S0->children[i]->ah == 0
UNLOCK(S0->lock)
destroy(S0)
LOCK(S0->lock)
- but S0 is already gone
Witold Kręcicki [Fri, 17 Jan 2020 10:42:35 +0000 (11:42 +0100)]
clean up some handle/client reference counting errors in error cases.
We weren't consistent about who should unreference the handle in
case of network error. Make it consistent so that it's always the
client code responsibility to unreference the handle - either
in the callback or right away if send function failed and the callback
will never be called.
Witold Kręcicki [Thu, 16 Jan 2020 10:53:31 +0000 (11:53 +0100)]
cleanup properly if we fail to initialize ns_client structure
If taskmgr is shutting down ns_client_setup will fail to create
a task for the newly created client, we weren't cleaning up already
created/attached things (memory context, server, clientmgr).
Witold Kręcicki [Thu, 16 Jan 2020 11:13:28 +0000 (12:13 +0100)]
netmgr: fix a non-thread-safe access to libuv structures
In tcp and udp stoplistening code we accessed libuv structures
from a different thread, which caused a shutdown crash when named
was under load. Also added additional DbC checks making sure we're
in a proper thread when accessing uv_ functions.
Witold Kręcicki [Thu, 16 Jan 2020 10:52:58 +0000 (11:52 +0100)]
netmgr: don't send to an inactive (closing) udp socket
We had a race in which n UDP socket could have been already closing
by libuv but we still sent data to it. Mark socket as not-active
when stopping listening and verify that socket is not active when
trying to send data to it.
Evan Hunt [Wed, 15 Jan 2020 16:19:19 +0000 (08:19 -0800)]
fix a bug when validating negative cache entries
if validator_start() is called with validator->event->message set to
NULL, we can't use message->rcode to decide which negative proofs are
needed, so we use the rdataset attributes instead to determine whether
the rdataset was cached as NXDOMAIN or NODATA.
Witold Kręcicki [Wed, 15 Jan 2020 13:53:42 +0000 (14:53 +0100)]
netmgr: have a single source of truth for tcpdns callback
We pass interface as an opaque argument to tcpdns listening socket.
If we stop listening on an interface but still have in-flight connections
the opaque 'interface' is not properly reference counted, and we might
hit a dead memory. We put just a single source of truth in a listening
socket and make the child sockets use that instead of copying the
value from listening socket. We clean the callback when we stop listening.
Witold Kręcicki [Wed, 15 Jan 2020 11:29:41 +0000 (12:29 +0100)]
netmgr:
- isc__netievent_storage_t was to small to contain
isc__netievent__socket_streaminfo_t on Windows
- handle isc_uv_export and isc_uv_import errors properly
- rewrite isc_uv_export and isc_uv_import on Windows