Vladimír Čunát [Mon, 12 Apr 2021 13:23:02 +0000 (15:23 +0200)]
validator: avoid assertion in an edge-case
Case: NSEC3 with too many iterations used for a positive wildcard proof.
It certainly isn't a perfect fix yet; the whole validator would benefit
from a general overhaul.
Tomas Krizek [Tue, 30 Mar 2021 17:24:08 +0000 (19:24 +0200)]
daemon/http: fix memleak if http_write_pkt() fails
This can happen for example when we want to send an answer, but the
http stream (or the connection?) is already closed.
Direct leak of 48 byte(s) in 1 object(s) allocated from:
#0 0x7f5ad2445459 in __interceptor_malloc /build/gcc/src/gcc/libsanitizer/asan/asan_malloc_linux.cpp:145
#1 0x55c0db3fc442 in http_write_pkt ../daemon/http.c:610
#2 0x55c0db3fc882 in http_write ../daemon/http.c:651
#3 0x55c0db3e9bb1 in qr_task_send ../daemon/worker.c:700
#4 0x55c0db3ee86c in qr_task_finalize ../daemon/worker.c:1321
#5 0x55c0db3f0123 in qr_task_step ../daemon/worker.c:1633
#6 0x55c0db3f0982 in worker_submit ../daemon/worker.c:1755
#7 0x55c0db3d992a in session_wirebuf_process ../daemon/session.c:759
#8 0x55c0db3c5f01 in udp_recv ../daemon/io.c:89
#9 0x7f5ad22b0e0e (/usr/lib/libuv.so.1+0x20e0e)
Vladimír Čunát [Fri, 26 Mar 2021 10:58:42 +0000 (11:58 +0100)]
clear kr_query::flags.CACHED
I suspect there's an edge case where the cache thinks it provided enough
data but the iterator (or something else) disagrees and resolution continues.
We observed (flags.CACHED == true) even when processing a reply from
internet, and that could be confusing and even trigger a segfault.
Clearing the flag sounds OK semantically; it never meant that no cached
data have been used within the kr_query (e.g. zone cut, DS/DNSKEY, ...)
Štěpán Balážik [Wed, 17 Mar 2021 14:53:33 +0000 (15:53 +0100)]
selection: cap the timeout value when probing a random server
This patch caps the timeout set on UDP queries to servers chosen in the
EXPLORE phase of the selection algorithm to two times the timeout that
would be set if we were EXPLOITing.
This means that we no longer spend an unreasonable amount of time
probing servers that are probably dead anyway, while ensuring that we do
probe them from time to time to check whether they have come back to life.
If the timeout value is capped and the server fails to respond, we don't
punish the server for it, i.e. we don't cache the timeout.
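A minimal sketch of the capping rule described above, written in Python for
illustration only (the resolver itself is C). The names Choice,
explore_timeout and should_cache_timeout are hypothetical; only the factor
of two and the "don't cache a capped timeout" rule come from this commit
message.

```python
from dataclasses import dataclass

# From the commit message: the EXPLORE timeout is capped at two times
# the timeout that would be used when EXPLOITing.
EXPLORE_TIMEOUT_FACTOR = 2


@dataclass
class Choice:
    """Hypothetical result of choosing a timeout for one UDP query."""
    timeout_ms: int
    timeout_capped: bool


def explore_timeout(raw_explore_ms: int, exploit_ms: int) -> Choice:
    """Cap the timeout for a randomly probed server."""
    cap = EXPLORE_TIMEOUT_FACTOR * exploit_ms
    if raw_explore_ms > cap:
        return Choice(cap, True)
    return Choice(raw_explore_ms, False)


def should_cache_timeout(choice: Choice) -> bool:
    """A timeout on a capped query is not held against the server."""
    return not choice.timeout_capped
```

For example, a server whose own estimate would call for 5000 ms gets only
800 ms when the EXPLOIT timeout is 400 ms, and a timeout on that query is
not cached.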
Vladimír Čunát [Tue, 16 Mar 2021 09:39:50 +0000 (10:39 +0100)]
utils/cache_gc: fix crashes/assertions on RTT entries
I missed some parts when finishing this. I should've tested it better.
GC would hit assertions or NULL dereferences when removing entries,
and eventually that would lead to cache overflowing (and getting
cleared).
Vladimír Čunát [Thu, 11 Mar 2021 14:23:28 +0000 (15:23 +0100)]
dnstap: don't break request resolution on dnstap errors
This isn't a regression of 5.3.0 changes.
Layer functions are supposed to return new values for ctx->state,
but here we were sometimes returning kr_error(EFOO) which altered
processing of the request.
Our case: answers directly from policy module would not end up
finishing the request and we'd hit an assert at the end of processing.
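The contract described above can be sketched as follows, in Python for
illustration only. State, dnstap_layer and the send callback are
hypothetical stand-ins, not the resolver's real API; the point is only that
a logging failure is swallowed and the incoming state is returned unchanged.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    """Hypothetical stand-in for the values of ctx->state."""
    CONSUME = 1
    PRODUCE = 2
    DONE = 3
    FAIL = 4


def dnstap_layer(ctx_state: State, send: Callable[[], None]) -> State:
    """Try to log a message; never let a logging error change the state."""
    try:
        send()
    except OSError:
        # Assumption: the failure is at most logged. Returning an error
        # code here instead of a state is what altered request processing.
        pass
    return ctx_state
```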
Štěpán Balážik [Thu, 18 Feb 2021 11:10:26 +0000 (12:10 +0100)]
lib/selection{,_iter}.c: allow switching back to UDP
Switching to TCP instead of querying very slow servers over UDP has had
an unwanted side effect: we would sometimes get stuck with a server
permanently switched to TCP. And if the server happened not to reply over
TCP, we were in trouble.
Therefore, after a TCP connect fails or times out, we give the server one
last chance over UDP. This will not prevent the next request from trying
TCP on this server again, but we don't care because
DNS MUST ******* work over TCP.
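A rough sketch of the fallback described above, in Python for illustration
only. Transport, Event and next_transport are hypothetical names; the only
behaviour taken from this commit message is that a failed or timed-out TCP
attempt earns the server one more try over UDP within the same request.

```python
from enum import Enum
from typing import Optional, Tuple


class Transport(Enum):
    UDP = "udp"
    TCP = "tcp"


class Event(Enum):
    TCP_CONNECT_FAILED = "tcp_connect_failed"
    TCP_TIMEOUT = "tcp_timeout"
    UDP_TIMEOUT = "udp_timeout"


def next_transport(current: Transport, event: Event,
                   udp_retry_used: bool) -> Tuple[Optional[Transport], bool]:
    """Return (transport for the next attempt, updated udp_retry_used).

    None means: give up on this server for the current request.
    """
    tcp_failed = event in (Event.TCP_CONNECT_FAILED, Event.TCP_TIMEOUT)
    if current is Transport.TCP and tcp_failed and not udp_retry_used:
        return Transport.UDP, True  # the one last chance over UDP
    return None, udp_retry_used
```

The flag is per request in this sketch, so a later request is still free to
try TCP on the same server.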
Vladimír Čunát [Fri, 12 Feb 2021 09:06:25 +0000 (10:06 +0100)]
daemon/udp_queue: drop the error logging
We should do this for all transports and probably just in verbose mode.
We were printing lots of these on Turris OS (for one user at least):
https://forum.turris.cz/t/5-1-8-kresd-throwing-many-errors-in-var-log-messages/14775
EACCES in particular apparently may happen (on Linux) when the network
is "unavailable", EPERM because of firewall/netfilter:
https://stackoverflow.com/a/23869102
Vladimír Čunát [Wed, 10 Feb 2021 11:56:14 +0000 (12:56 +0100)]
modules/{http,watchdog}: fix stability problems
As first noted in commit d1a229ae9, in some cases we do call chains that
are not supported for JIT in LuaJIT.
I'm not 100% sure all of these are needed to comply, but the functions
here are really small and probably not used that heavily,
so I don't think it will be costly to interpret them
(and avoiding crashes is more important).
In my tests this fixed occasional crashes when using http://*/trace/*
Vladimír Čunát [Thu, 28 Jan 2021 10:37:05 +0000 (11:37 +0100)]
policy.ANSWER: minor fixes, mainly around NODATA answers
- return SOA in NODATA answers and allow customizing it
- only call ensure_answer() if really generating an answer
(otherwise we might e.g. deplete XDP buffers, in extreme cases)
Vladimír Čunát [Mon, 1 Feb 2021 09:09:16 +0000 (10:09 +0100)]
when FORMERR comes, differentiate based on OPT
In particular, non-support of EDNS is implied iff FORMERR without OPT
comes. If OPT is there, one possibility is that there was something
wrong in the OPT that *we* sent, but it seems much more likely that
this particular server is just bad and we want to try another one.
https://tools.ietf.org/html/rfc6891#section-7
In particular, we would be in trouble if we dropped OPT in a zone
that is covered by DNSSEC.
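The decision above can be sketched as follows, in Python for illustration
only. Action and on_formerr are hypothetical names; what exactly the
resolver does once it concludes that EDNS is unsupported is an assumption
here and is only labelled, not implemented.

```python
from enum import Enum


class Action(Enum):
    # Assumption: the server is treated as not supporting EDNS.
    ASSUME_NO_EDNS = "assume_no_edns"
    # From the commit message: the server is probably just bad.
    TRY_ANOTHER_SERVER = "try_another_server"


def on_formerr(reply_has_opt: bool) -> Action:
    """React to a FORMERR reply depending on whether it carries OPT."""
    if not reply_has_opt:
        # FORMERR without OPT is the only case implying missing EDNS support.
        return Action.ASSUME_NO_EDNS
    # With OPT present, dropping EDNS would be wrong (and harmful in a
    # DNSSEC-covered zone), so prefer a different server instead.
    return Action.TRY_ANOTHER_SERVER
```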
Vladimír Čunát [Mon, 1 Feb 2021 08:57:46 +0000 (09:57 +0100)]
lib/selection: rename to *_FORMERR for consistency
It's now consistent with KNOT_RCODE_FORMERR and the official name
https://www.iana.org/assignments/dns-parameters/dns-parameters.xhtml#dns-parameters-6
Vladimír Čunát [Tue, 26 Jan 2021 11:25:09 +0000 (12:25 +0100)]
lib/selection: refactor kr_selection_error_str()
This way leaves less room for mistakes, etc. It's just the idea from:
https://gitlab.nic.cz/knot/knot-resolver/-/commit/dd0c99bdb6332ba3628833a8543a5f9f33141ddd#note_191580
Štěpán Balážik [Wed, 20 Jan 2021 18:33:14 +0000 (19:33 +0100)]
selection_iter: relax NSNXAttack mitigation
Previously the mitigation would stop some longer benign resolutions.
We can safely zero the subquery counter when we choose a concrete transport
for the query (i.e. NS name with known IP address).
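A small sketch of the relaxed counter, in Python for illustration only.
NsnxGuard, its method names and the limit value are hypothetical; the only
part taken from this commit message is that the counter is zeroed once a
concrete transport (an NS name with a known IP address) is chosen.

```python
MAX_NS_SUBQUERIES = 3  # hypothetical limit, not the resolver's real value


class NsnxGuard:
    """Counts NS-name subqueries made without reaching any server."""

    def __init__(self, limit: int = MAX_NS_SUBQUERIES) -> None:
        self.limit = limit
        self.count = 0

    def on_ns_name_subquery(self) -> bool:
        """Return True if another NS-name subquery is still allowed."""
        if self.count >= self.limit:
            return False
        self.count += 1
        return True

    def on_transport_chosen(self) -> None:
        """An NS name with a known address was picked: progress was made."""
        self.count = 0
```

With the reset in place, a long but benign resolution that keeps reaching
real servers never exhausts the budget.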
Štěpán Balážik [Wed, 20 Jan 2021 15:19:18 +0000 (16:19 +0100)]
selection: force resolution of new NS name after lame delegation
Lame delegations are weird; they breed more lame delegations on broken
zones, since trying another server from the same set usually doesn't help.
We force resolution of another NS name in hope of getting somewhere.
Štěpán Balážik [Tue, 19 Jan 2021 12:39:04 +0000 (13:39 +0100)]
iterate.c: don't copy NO_MINIMIZE when following a CNAME
Instead copy it from the request's options.
Reasoning: Minimization might have been turned off as a workaround for
broken authoritative servers which don't support it. There is no
reason to drop minimization when switching zones when following a CNAME.
Štěpán Balážik [Thu, 14 Jan 2021 14:39:31 +0000 (15:39 +0100)]
iterate: rework error handling from iterate.c
Previously there were resolve_badmsg and resolve_error functions used
to apply workarounds. This is now moved to selection.c and iterate.c
just provides feedback using the server selection API. Errors are now
handled centrally in selection.c:error.