Tomas Krizek [Mon, 5 Oct 2020 14:18:24 +0000 (16:18 +0200)]
distro/arch: remove upgrade script
Arch is a fast moving rolling release and users expect to upgrade
their configs. The migration period over 6 months is certainly
sufficient - remove this hard to read convoluted code.
Vladimír Čunát [Tue, 6 Oct 2020 07:15:43 +0000 (09:15 +0200)]
validator: avoid using RRSIG from a different packet
Restrict tried RRSIGs by qry_uid equality.
I see no use case against and it could be confusing.
(Also rewrite the conditions around to positive form.)
An assertion in cache noticed an NSEC with _SECURE rank but no RRSIG
(in practice). It was a side-effect of still not keeping RRSIGs with
their RRs in some places. It wasn't a security problem, as it doesn't
really matter where the signatures came from. Theoretically it
might've lead to incorrect caching (missing usable RRSIGs), as cache
was restricting qry_uid to match, but that hasn't been noticed
in practice.
Vladimír Čunát [Wed, 7 Oct 2020 07:51:16 +0000 (09:51 +0200)]
contrib/cleanup: be more correct
It's mainly about the fact that FD can be zero (though it's not common).
Our current usage is just in tests and seems fine.
I don't think that other negative FDs are possible, but I'm lazy to find
"proof" in POSIX and using other negative values than -1 doesn't make
sense to me anyway (might be an assert, I guess).
Tomas Krizek [Fri, 2 Oct 2020 11:04:01 +0000 (13:04 +0200)]
modules/http: answer to /dns-query endpoint as well
When using DoH, it seems /dns-query is a more common convetion for
an endpoint name. Let's use it in addition to /doh, since it doesn't
hurt anything and makes kresd more alike the other DoH implementations
out there. It'll also play more nicely with kdig, which uses /dns-query
as default as well.
Tomas Krizek [Fri, 2 Oct 2020 08:23:40 +0000 (10:23 +0200)]
clang: silence useless warning in lib/layer.h
This silences the following warning, which frequently appears in Travis
CI.
./lib/layer.h:51:21: warning: result of comparison of constant 32 with
expression of type 'enum kr_layer_state' is always true
[-Wtautological-constant-out-of-range-compare]
return s >= 0 && s < (1 << 5);
~ ^ ~~~~~~~~
Vladimír Čunát [Wed, 16 Sep 2020 11:10:44 +0000 (13:10 +0200)]
daemon/io_tty_process_input: remove a special case
After changes in this MR, sending an empty newline evaluates in lua
as nil, and that seems fine. Let's drop this piece of code;
it was broken now anyway (incorrect `io_mode_text` part).
Tomas Krizek [Tue, 29 Sep 2020 10:15:31 +0000 (12:15 +0200)]
ci: re-try OBS distrotests
Often, the VM fails to boot (even twice in the row, as already handled
by the test itself) which leads to false negative result.
These nightly tests fail far too often (at least on of the ~7 tests) and
generate annoying notifications.
Adding yet another layer of retries should reduce the number of false
negatives without diminishing the value of the test (since the real
packaging issues are 100 % reproducible).
Tomas Krizek [Wed, 23 Sep 2020 11:32:21 +0000 (13:32 +0200)]
ci: don't run ta_update after build
This test fails far too often due to conditions inside the CI
environment. This test is already executed in test:valgrind without
paralelism, so let's keep it there instead.
Tomas Krizek [Tue, 8 Sep 2020 08:42:03 +0000 (10:42 +0200)]
daemon/worker: simplify err handling in worker_submit()
If pkt was NULL, the function would return error from parse_packet(),
which would then have no effect besides returning an error code from
worker_submit(). There were also needless checks that pkt is indeed
not NULL in cases where it was no longer possible for it to be so.
This also removes assert(false) statements and simply return an error.
Tomas Krizek [Wed, 26 Aug 2020 11:13:51 +0000 (13:13 +0200)]
daemon/io: don't notify worker on udp_recv() errors
The action doesn't increase any counter or do any error handling.
It would simply SEGFAULT. Even if it didn't worker_submit() would
just return an error code.
Tomas Krizek [Fri, 4 Sep 2020 10:47:54 +0000 (12:47 +0200)]
daemon/worker: fix connection teardown in tls_hs_cb()
Ensure both tasklist and waitinglist is always cleared when tearing down
connection (otherwise the session close will fail on assert).
The previous assert could be triggered when the while loop in the code
above would successfuly perform qr_task_send() for one of the
tasks in waitinglist and then fail on a subsequent one.
Tomas Krizek [Thu, 3 Sep 2020 11:10:04 +0000 (13:10 +0200)]
daemon: handle IO error when processing wire buffer
This fixes the following assert:
daemon/worker.c:1157: qr_task_finalize: Assertion `!session_flags(source_session)->closing' failed.
Scenario which leads to the above error:
1. We're using a stateful protocol.
2. Enough data arrive in a single tcp_recv() call to put at least
two queries into the session's wire buffer.
3. session_wirebuf_process() calls worker_submit() which calls
qr_task_step().
4. In the qr_task_step() the query state changes to KR_STATE_DONE,
then qr_task_finalize() is called.
5. qr_task_send() is called, but it fails. This is where
qr_task_finalize() closes the session, but used to return no error.
6. When the next query is processed in session_wirebuf_process(),
steps 3 and 4 are followed and qr_task_finalize() is called.
7. Since the session is already closed, the assert is triggered.
Debugging this was a lot of fun... All hail the rr debugger!
Štěpán Balážik [Wed, 19 Aug 2020 08:52:17 +0000 (10:52 +0200)]
daemon/worker: start retransmit timer after UDP packet is sent
Previously this was done *before* calling uv_udp_send which lead to many
early retransmits (significant amount of time might pass between calling
uv_udp_send and the moment the packet is actually send to the wire).
Vladimír Čunát [Sat, 22 Aug 2020 09:47:51 +0000 (11:47 +0200)]
lib/cache kr_cdb_api::space_usage(): also use kr_cdb_pt
- The malloc-free pair could be avoided without difficulty,
but it seemed like premature optimization.
- The libknot functions make error handling a bit difficult
(zero is theoretically valid and doesn't show error type),
but writing this properly without libknot would need 10-20
additional lines of code and the risk of encountering errors
in this function seems very low anyway.
Petr Špaček [Mon, 7 Sep 2020 14:08:05 +0000 (16:08 +0200)]
cache: fix race in assert_right_version
This change fixes race condition in assert_right_version(). Racy
situation:
- Two instances have the (empty) cache open: New binary and old binary.
- New binary executes count() inside assert_right_version(), which
internally starts RO transaction. Returned count is 0.
- Old binary does some writes (RW transaction parallel to RO in the first
process).
- New binary skips cache clear because cache was empty at the time of check.
- Result: The old binary wrote data with an old format into cache which
was not cleared and silenty changed version number to a new one.
This is not complete fix because we lack mechanism to detect cache format
change at run-time, but at least it removes one nasty corner case and
cost of this change seems to be minimal.
Vladimír Čunát [Fri, 4 Sep 2020 18:54:52 +0000 (20:54 +0200)]
lib/cache: switch .cachelock to fcntl()
This gives us correctness, especially on "staleness" detection.
For simplicity we now don't remove "stale" .cachelock on opening cache,
but it doesn't obstruct us in any way (and overflow will remove it).
Vladimír Čunát [Fri, 4 Sep 2020 17:31:51 +0000 (19:31 +0200)]
lib/cache: tweaks round transactions
- The switched order is documented not to make difference,
but it seems much clearer this way.
- MDB_TXN_FULL wasn't handled correctly (a reversed condition)
and current LMDB code indicates that such transaction is
not recoverable anyway... so we give up on trying.
Vladimír Čunát [Thu, 27 Aug 2020 13:08:48 +0000 (15:08 +0200)]
cache, GC: improve handling of LMDB maxsize
This version seems to work OK. Unfortunately we had to resort to
an extra write and cache reopening when attempting to set cache size.
And even so, decreasing the size can't really be done, so we only warn
about failing to do that.
Vladimír Čunát [Mon, 24 Aug 2020 15:47:29 +0000 (17:47 +0200)]
tests: fine tune integration test for GC
TL;DR: tune the test - now it works quite reliably for me,
though it's perhaps not nice.
With 1 MiB cache it's not easy to avoid overflows, as the defaults are
meant for much larger sizes. Normal GC target is to decrease usage
by 10% when above 80% in 100 records per transaction. That just won't
work reliable due to 10% being only 25 pages.
This commit makes the test run GC with more suitable tuning and
frequently pauses kresd to give GC better chance to catch up.
Vladimír Čunát [Tue, 18 Aug 2020 16:45:28 +0000 (18:45 +0200)]
lib/cache: abort() if emergency cache-clear fails
As the code has been so far, there's no usable cache in that case
and some code just can't handle that. Up to now we were getting
SIGSEGV from inside LMDB on the next attempted operation.
We might consider loosening preallocation in that case or even
retrying after a short sleep. Systemd's restart after hold-off
timeout has an effect similar to the short sleep.
Vladimír Čunát [Tue, 18 Aug 2020 09:34:43 +0000 (11:34 +0200)]
utils/cache_gc: tolerate ESPACE unless twice in a row
In the unlikely case that GC happens "too late", it could fail when
deleting, in which case it seems best to reopen the cache and try again,
as it will probably be deleted by a kresd instance by the next interval.
Vladimír Čunát [Mon, 17 Aug 2020 17:15:04 +0000 (19:15 +0200)]
utils/cache_gc: avoid too long RO transactions
Until now the analyzing pass over full DB was taking place
in a single RO transaction. For an unknown reason this caused kresd
processes to get MDB_MAP_FULL from mdb_put(), even though clearly there
were plenty free pages at that point.
Basic experiments show that 1k steps are OK and 10k steps are not.
Vladimír Čunát [Mon, 17 Aug 2020 08:38:20 +0000 (10:38 +0200)]
lib/cache: abort transactions on errors
This apparently gets rid of MDB_BAD_TXN failures that we were getting
when cache overflows. Unfortunately LMDB docs don't mention that
after operation failures one should abort the corresponding transaction.
Vladimír Čunát [Mon, 31 Aug 2020 07:29:44 +0000 (09:29 +0200)]
scripts, docs: specify lua version in `luarocks install`
On some systems luarocks defaults to other lua version (e.g. Fedora),
so the result would not be usable from kresd. I didn't touch scripts
for older distro versions (Debian < 10, Ubuntu < 20.04, CentOS 7).