Yu Watanabe [Tue, 23 Mar 2021 05:38:18 +0000 (14:38 +0900)]
firewall-util: probe firewall backend in fw_ctx_new()
FirewallContext is used by networkd and nspawn. Both allocates the
context when it is really necessary. Hence, it is not necessary to delay
probing backend.
Moreover, if iptables backend is not enabled on build, and nftables is
not supported by kernel, previously `fw_nftables_init()` is called
everytime when we try to configure masquerade or dnat. It causes
significant performance loss.
The syntax like "0666" is very unclear. It only makes sense for some subset of
people who do C programming. Let's use the much more sensible modern python
syntax instead.
resolved: don't accept responses to query unless they completely answer our questions
When we checking if the responses we collected for a DnsQuery are
sufficient to complete it we previously only check if one of the
collected response RRs matches at least one of the question RR keys.
This changes the logic to require that there must be at least one
response RR matched *each* of the question RR keys before considering
the answer complete.
Otherwise we might end up accepting an A reply as complete answer for an
A/AAAA query and vice versa, but we want to make sure we wait until we
get a reply on both types before returning this to the user in all
cases.
This has been broken for basically forever, but didn't surface until b1eea703e01da1e280e179fb119449436a0c9b8e since until then we'd basically
ignore the auxiliary RRs included in CNAME/DNAME replies. Once that
commit was made we'd start using the auxiliary RRs included in
CNAME/DNAME replies but those typically included only A or only AAAA
which we then took for complete.
Sergey Bugaev [Mon, 22 Mar 2021 15:31:12 +0000 (18:31 +0300)]
log: protect errno in log_open()
Commit 0b1f3c768ce1bd1490a5e53f539976dcef8ca765 has introduced log_open()
calls after exec fails post-fork. However, the log_open() call itself could
change the value of errno, which, for me, manifested in:
$ coredumpctl gdb
...
Failed to invoke gdb: Success
repart: make sure to grow partition table after growing backing loopback file
This fixes the --size= switch, i.e. where we grow a disk image: after
growing it we need to expand the partition table so that its idea of the
the medium size matches the new reality. Otherwise our disk size
calculations in the subsequent steps might still use the original
ungrown size.
(This used to work, I guess this was borked when libfdisk learnt the
concept of "minimized" partition tables)
Michael Gisbers [Fri, 19 Mar 2021 10:38:53 +0000 (11:38 +0100)]
correct incorrect command in NEWS (#19048)
* for /dev/vsock a file permission of 0o666 was mentioned but 0666 is probably better understood, so let's use that
* correct non existing command 'ip dev'
Frantisek Sumsal [Thu, 18 Mar 2021 10:59:53 +0000 (11:59 +0100)]
coccinelle: filter out a couple of 'false-positive' transformations
* flag-set.cocci: perform the transformation only if the second
argument is a constant
* sd-journal/lookup3.c: skip the cocci completely for this file, since
it's not "ours"
* strjoina.cocci: skip the transformation on the "test_strjoina" test,
since it intentionally tests the "incorrect" expression we're trying to
transform (the same thing was already done in strjoin.cocci)
Luca Boccassi [Wed, 17 Mar 2021 14:34:36 +0000 (14:34 +0000)]
resolved: simplify min_ttl check
rr is asserted upon a few lines above, no need to check for null.
Coverity-found issue, CID 1450844
CID 1450844: Null pointer dereferences (REVERSE_INULL)
Null-checking "rr" suggests that it may be null, but it has already
been dereferenced on all paths leading to the check.
fileio: don't use realloc() in read_full_virtual_file()
We aren't interested in the data previousl read, hence free() followed
by malloc() is typically better since it means libc doesn't have to
restore the contained data needlessly.
systemctl: pecify read_full_file() size argument as NULL
If it is specified as NULL read_full_file() assumes the caller wants a C
string, and it looks for embedded NUL bytes to ensure that works. Given
we don#t actually use the size argument here, let's drop it.
(in one case the size argument is used, but not for actually processing
the full returned data, but just as a shortcut to compare things with
the original string. Let's drop use of that there, too given the risk of
embedded NUL bytes in the data read.)
tree-wide: use read_full_virtual_file() where appropriate
Wherever we read virtual files we better should use
read_full_virtual_file(), to make sure we get a consistent response
given how weird the kernel's handling with partial read on such file
systems is.
Anita Zhang [Tue, 16 Mar 2021 00:21:45 +0000 (17:21 -0700)]
oomd: sort by pgscan rate not pgscan
For pressure based killing we want to target who has the highest
increase in pgscan from the previous interval (vs. the previous logic
which used raw pgscan). This will prevent biasing towards long running
cgroups as mentioned in #19007.
Mike Gilbert [Tue, 9 Mar 2021 22:57:37 +0000 (17:57 -0500)]
cg_unified_cached: return ENOMEDIUM if we cannot find a known hierarchy
When the test suite is being run in a foreign environment,
/sys/fs/cgroup might not be set up in a way that we recognize.
Returning ENOMEDIUM causes the tests to be skipped in this case.
Yu Watanabe [Mon, 8 Mar 2021 06:39:53 +0000 (15:39 +0900)]
sd-event: re-check new epoll events when a child event is queued
Previously, when a process outputs something and exit just after
epoll_wait() but before process_child(), then the IO event is ignored
even if the IO event has higher priority. See #18190.
This can be solved by checking epoll event again after process_child().
However, there exists a possibility that another process outputs and
exits just after process_child() but before the second epoll_wait().
When the IO event has lower priority than the child event, still IO
event is processed.
So, this makes new epoll events and child events are checked in a loop
until no new event is detected. To prevent an infinite loop, the number
of maximum trial is set to 10.
resolved: don't flush answer RRs on CNAME redirect too early
When doing a CNAME/DNAME redirect let's first check if the answer we
already have fully answers the redirected question already. If so, let's
use that. If not, let's properly restart things.
This simply removes one call to dns_answer_reset() that was placed too
early: instead of resetting when we detect a CNAME/DNAME redirect, do so
only after checking if the answer we already have doesn't match the
reply, and then decide to *actually* follow it. Or in other words: rely
on the dns_answer_reset() call in dns_query_go() which we'll call to
actually begin with the redirected question.
(This doesn't really matter as much as one might think, since our cache
stepped in anyway and answered the questions before going back to the
network. However, this adds noise if RRs with very short TTLs are cached
– which some CDNs do – and is of course relavant when people turn off
the local cache.)
Previously by mistake we'd always match every single reply we get in a
CNAME chain to the original question from the stub client. That's
broken, we need to test it against the CNAME query we are currently
looking at.
The effect of this incorrect matching was that we'd assign the RRs to
the wrong section since we'd assume they'd be auxiliary answers instead
of primary answers.
When responding from DNS cache, let's slightly tweak how the TTL is
lowered: as before let's round down when converting from our internal µs
to the external seconds. (This is preferable, since records should
better be cached too short instead of too long.) Let's avoid rounding
down to zero though, since that has special semantics in many cases (in
particular mDNS). Let's just use 1s in that case.
resolved: take shortest TTL of all of RRs in answer as cache lifetime
We nowadays cache full answer RRset combinations instead of just the
exact matching rrset. This means we should not cache RRs that are not
immediate answers to our question for longer then their own RRs. Or in
other words: let's determine the shortest TTL of all RRs in the whole
answer, and use that as cache lifetime.
Luca Boccassi [Sun, 14 Mar 2021 12:36:15 +0000 (12:36 +0000)]
man: specify that ProtectProc= does not work with root/cap_sys_ptrace
When using hidepid=invisible on procfs, the kernel will check if the
gid of the process trying to access /proc is the same as the gid of
the process that mounted the /proc instance, or if it has the ptrace
capability:
Given we set up the /proc instance as root for system services,
The same restriction applies to CAP_SYS_PTRACE, if a process runs with
it then hidepid=invisible has no effect.
ProtectProc effectively can only be used with User= or DynamicUser=yes,
without CAP_SYS_PTRACE.
Update the documentation to explicitly state these limitations.
Daan De Meyer [Fri, 12 Mar 2021 22:09:44 +0000 (22:09 +0000)]
boot: Move console declarations to missing_efi.h
These were added to eficonex.h in gnu-efi 3.0.13. Let's move them
to missing_efi.h behind an appropriate guard to fix the build with
recent versions of gnu-efi.
Kevin Backhouse [Fri, 12 Mar 2021 17:00:56 +0000 (18:00 +0100)]
ask-password-api: fix error handling on invalid unicode character
The integer overflow happens when utf8_encoded_valid_unichar() returns an error
code. The error code is a negative number: -22. This overflows when it is
assigned to `z` (type `size_t`). This can cause an infinite loop if the value
of `q` is 22 or larger.
To reproduce the bug, you need to run `systemd-ask-password` and enter an
invalid unicode character, followed by a backspace character.