Michal Sekletár [Fri, 27 Mar 2020 16:01:59 +0000 (17:01 +0100)]
sd-journal: remove the dead code and actually fix #14695
journal_file_fstat() returns an error if we call it on already unlinked
journal file and hence we never reach remove_file_real() which is the
entire point.
I must have made some mistake while testing the fix that got me thinking
the issue is gone while opposite was true.
swap: finish the secondary swap units' jobs if deactivation of the primary swap unit fails
Currently, if deactivation of the primary swap unit fails:
# LANG=C systemctl --no-pager stop dev-mapper-fedora\\x2dswap.swap
Job for dev-mapper-fedora\x2dswap.swap failed.
See "systemctl status "dev-mapper-fedora\\x2dswap.swap"" and "journalctl -xe" for details.
then there are still the running stop jobs for all the secondary swap units
that follow the primary one:
# systemctl list-jobs
JOB UNIT TYPE STATE
3233 dev-disk-by\x2duuid-2dc8b9b1\x2da0a5\x2d44d8\x2d89c4\x2d6cdd26cd5ce0.swap stop running
3232 dev-dm\x2d1.swap stop running
3231 dev-disk-by\x2did-dm\x2duuid\x2dLVM\x2dyuXWpCCIurGzz2nkGCVnUFSi7GH6E3ZcQjkKLnF0Fil0RJmhoLN8fcOnDybWCMTj.swap stop running
3230 dev-disk-by\x2did-dm\x2dname\x2dfedora\x2dswap.swap stop running
3234 dev-fedora-swap.swap stop running
5 jobs listed.
This remains endlessly because their JobTimeoutUSec is infinity:
# LANG=C systemctl show -p JobTimeoutUSec dev-fedora-swap.swap
JobTimeoutUSec=infinity
If this issue happens during system shutdown, the system shutdown appears to
get hang and the system will be forcibly shutdown or rebooted 30 minutes later
by the following configuration:
The scenario in the real world seems that there is some service unit with
KillMode=none, processes whose memory is being swapped out are not killed
during stop operation in the service unit and then swapoff command fails.
On the other hand, it works well in successful case of swapoff command because
the secondary jobs monitor /proc/swaps file and can detect deletion of the
corresponding swap file.
This commit fixes the issue by finishing the secondary swap units' jobs if
deactivation of the primary swap unit fails.
pid1: fix the names of AllowedCPUs= and AllowedMemoryNodes=
The original PR was submitted with CPUSetCpus and CPUSetMems, which was later
changed to AllowedCPUs and AllowedMemmoryNodes everywhere (including the parser
used by systemd-run), but not in the parser for unit files.
Since we already released -rc1, let's keep support for the old names. I think
we can remove it in a release or two if anyone remembers to do that.
Ryan Gonzalez [Sat, 23 Feb 2019 05:45:03 +0000 (23:45 -0600)]
cryptsetup: Treat key file errors as a failed password attempt
6f177c7dc092eb68762b4533d41b14244adb2a73 caused key file errors to immediately fail, which would make it hard to correct an issue due to e.g. a crypttab typo or a damaged key file.
The `coproc` implementation seems to be a little bit different in older
bash versions, so the `strace` is sometimes started AFTER `systemctl
daemon-reload`, which causes unexpected fails. Let's help it a little by
sleeping for a bit.
test: support MPOL_LOCAL matching in unpatched strace versions
The MPOL_LOCAL constant is not recognized in current strace versions.
Let's match at least the numerical value of this constant until the
strace patch is approved & merged.
Pavel Hrdina [Mon, 29 Jul 2019 15:50:05 +0000 (17:50 +0200)]
cgroup: introduce support for cgroup v2 CPUSET controller
Introduce support for configuring cpus and mems for processes using
cgroup v2 CPUSET controller. This allows users to limit which cpus
and memory NUMA nodes can be used by processes to better utilize
system resources.
The cgroup v2 interfaces to control it are cpuset.cpus and cpuset.mems
where the requested configuration is written. However, it doesn't mean
that the requested configuration will be actually used as parent cgroup
may limit the cpus or mems as well. In order to reflect the real
configuration cgroup v2 provides read-only files cpuset.cpus.effective
and cpuset.mems.effective which are exported to users as well.
core: imply NNP and SUID/SGID restriction for DynamicUser=yes service
Let's be safe, rather than sorry. This way DynamicUser=yes services can
neither take benefit of, nor create SUID/SGID binaries.
Given that DynamicUser= is a recent addition only we should be able to
get away with turning this on, even though this is strictly speaking a
binary compatibility breakage.
This patch extracts the code in charge of initializing the default values for
those rlimits in order to create dedicated functions, which take care of their
initialization.
These functions are then called in parse_configuration() so we make sure that
the default values for these rlimits get restored every time PID1 is reloading
its configuration.
Let's experiment with this in systemd upstream first. Downstreams and
users can after all still comment this easily.
Besides compat concern the most often heard issue with such high PIDs is
usability, since they are potentially hard to type. I am not entirely sure though
whether 4194304 (as largest new PID) is that much worse to type or to
copy than 65563.
This should also simplify management of per system tasks limits as by
this move the sysctl /proc/sys/kernel/threads-max becomes the primary
knob to control how many processes to have in parallel.
Jan Synacek [Fri, 31 Jan 2020 14:17:25 +0000 (15:17 +0100)]
polkit: when authorizing via PK let's re-resolve callback/userdata instead of caching it
Previously, when doing an async PK query we'd store the original
callback/userdata pair and call it again after the PK request is
complete. This is problematic, since PK queries might be slow and in the
meantime the userdata might be released and re-acquired. Let's avoid
this by always traversing through the message handlers so that we always
re-resolve the callback and userdata pair and thus can be sure it's
up-to-date and properly valid.
Jan Synacek [Fri, 31 Jan 2020 10:34:45 +0000 (11:34 +0100)]
sd-bus: introduce API for re-enqueuing incoming messages
When authorizing via PolicyKit we want to process incoming method calls
twice: once to process and figure out that we need PK authentication,
and a second time after we aquired PK authentication to actually execute
the operation. With this new call sd_bus_enqueue_for_read() we have a
way to put an incoming message back into the read queue for this
purpose.
This might have other uses too, for example debugging.
Related: CVE-2020-1712
bus-message: introduce two kinds of references to bus messages
Before this commit bus messages had a single reference count: when it
reached zero the message would be freed. This simple approach meant a
cyclic dependency was typically seen: a message that was enqueued in a
bus connection object would reference the bus connection object but also
itself be referenced by the bus connection object. So far out strategy
to avoid cases like this was: make sure to process the bus connection
regularly so that messages don#t stay queued, and at exit flush/close
the connection so that the message queued would be emptied, and thus the
cyclic dependencies resolved. Im many cases this isn't done properly
however.
With this change, let's address the issue more systematically: let's
break the reference cycle. Specifically, there are now two types of
references to a bus message:
1. A regular one, which keeps both the message and the bus object it is
associated with pinned.
2. A "queue" reference, which is weaker: it pins the message, but not
the bus object it is associated with.
The idea is then that regular user handling uses regular references, but
when a message is enqueued on its connection, then this takes a "queue"
reference instead. This then means that a queued message doesn't imply
the connection itself remains pinned, only regular references to the
connection or a message associated with it do. Thus, if we end up in the
situation where a user allocates a bus and a message and enqueues the
latter in the former and drops all refs to both, then this will detect
this case and free both.
Note that this scheme isn't perfect, it only covers references between
messages and the busses they are associated with. If OTOH a bus message
is enqueued on a different bus than it is associated with cyclic deps
cannot be recognized with this simple algorithm, and thus if you enqueue
a message associated with a bus A on a bus B, and another message
associated with bus B on a bus A, a cyclic ref will be in effect and not
be discovered. However, given that this is an exotic case (though one
that happens, consider systemd-bus-stdio-bridge), it should be OK not to
cover with this, and people have to explicit flush all queues on exit in
that case.
Note that this commit only establishes the separate reference counters
per message. A follow-up commit will start making use of this from the
bus connection object.
sd-bus: make sure dispatch_rqueue() initializes return parameter on all types of success
Let's make sure our own code follows coding style and initializes all
return values on all types of success (and leaves it uninitialized in
all types of failure).
Let's do this like we usually do and size arrays with size_t.
We already do this for the "allocated" counter correctly, and externally
we expose the queue sizes as uint64_t anyway, hence there's really no
point in usigned "unsigned" internally.
cryptsetup: rework how we log about activation failures
First of all let's always log where the errors happen, and not in an
upper stackframe, in all cases. Previously we'd do this somethis one way
and sometimes another, which resulted in sometimes duplicate logging and
sometimes none.
When we cannot activate something due to bad password the kernel gives
us EPERM. Let's uniformly return this EAGAIN, so tha the next password
is tried. (previously this was done in most cases but not in all)
When we get EPERM let's also explicitly indicate that this probably
means the password is simply wrong.
# LANG=C journalctl --no-pager -u A.service -u B.service -u C.target -b
-- Logs begin at Mon 2019-09-09 00:25:06 EDT, end at Thu 2019-10-24 22:28:47 EDT. --
Oct 24 22:27:47 localhost.localdomain systemd[1]: Starting A...
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Child 967 belongs to A.service.
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Running next main command for state start.
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Passing 0 fds to service
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: About to execute: /usr/bin/sleep 60
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Forked /usr/bin/sleep as 968
Oct 24 22:27:47 localhost.localdomain systemd[968]: A.service: Executing: /usr/bin/sleep 60
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Trying to enqueue job A.service/reload/replace
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Merged into running job, re-running: A.service/reload as 1288
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Enqueued job A.service/reload as 1288
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Unit cannot be reloaded because it is inactive.
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Job 1288 A.service/reload finished, result=invalid
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Passing 0 fds to service
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: About to execute: /usr/bin/echo B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Forked /usr/bin/echo as 970
Oct 24 22:27:52 localhost.localdomain systemd[970]: B.service: Executing: /usr/bin/echo B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Failed to send unit change signal for B.service: Connection reset by peer
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Changed dead -> start
Oct 24 22:27:52 localhost.localdomain systemd[1]: Starting B...
Oct 24 22:27:52 localhost.localdomain echo[970]: B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Child 970 belongs to B.service.
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Changed start -> exited
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Job 1371 B.service/start finished, result=done
Oct 24 22:27:52 localhost.localdomain systemd[1]: Started B.
Oct 24 22:27:52 localhost.localdomain systemd[1]: C.target: Job 1287 C.target/start finished, result=done
Oct 24 22:27:52 localhost.localdomain systemd[1]: Reached target C.
Oct 24 22:27:52 localhost.localdomain systemd[1]: C.target: Failed to send unit change signal for C.target: Connection reset by peer
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Child 968 belongs to A.service.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Running next main command for state start.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Passing 0 fds to service
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: About to execute: /usr/bin/echo A2
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Forked /usr/bin/echo as 972
Oct 24 22:28:47 localhost.localdomain systemd[972]: A.service: Executing: /usr/bin/echo A2
Oct 24 22:28:47 localhost.localdomain echo[972]: A2
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Child 972 belongs to A.service.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Changed start -> exited
The issue occurs not only in reload command, i.e.:
Michal Sekletár [Wed, 27 Nov 2019 13:27:58 +0000 (14:27 +0100)]
cryptsetup: reduce the chance that we will be OOM killed
cryptsetup introduced optional locking scheme that should serialize
unlocking keyslots which use memory hard key derivation
function (argon2). Using the serialization should prevent OOM situation
in early boot while unlocking encrypted volumes.
Michal Sekletár [Mon, 18 Nov 2019 11:50:11 +0000 (12:50 +0100)]
core: introduce NUMAPolicy and NUMAMask options
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.
Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
shared/cpu-set-util: only force range printing one time
The idea is to have at least one range to make the new format clearly
distinguishable from the old. But it is enough to just do it once.
In particular, in case the affinity would be specified like 0, 2, 4, 6…,
this gives much shorter output.
cpu_set_malloc() was the last user. It doesn't seem useful to keep
it just to save the allocation of a few hundred bytes in a test, so
it is dropped and a fixed maximum is allocated (1024 bytes).
pid1: when reloading configuration, forget old settings
If we had a configuration setting from a configuration file, and it was
removed, we'd still remember the old value, because there's was no mechanism to
"reset" everything, just to assign new values.
Note that the effect of this is limited. For settings that have an "ongoing" effect,
like systemd.confirm_spawn, the new value is simply used. But some settings can only
be set at start.
In particular, CPUAffinity= will be updated if set to a new value, but if
CPUAffinity= is fully removed, it will not be reset, simply because we don't
know what to reset it to. We might have inherited a setting, or we might have
set it ourselves. In principle we could remember the "original" value that was
set when we were executed, but propagate this over reloads and reexecs, but
that would be a lot of work for little gain. So this corner case of removal of
CPUAffinity= is not handled fully, and a reboot is needed to execute the
change. As a work-around, a full mask of CPUAffinity=0-8191 can be specified.
pid1: don't reset setting from /proc/cmdline upon restart
We have settings which may be set on the kernel command line, and also
in /proc/cmdline (for pid1). The settings in /proc/cmdline have higher priority
of course. When a reload was done, we'd reload just the configuration file,
losing the overrides.
So read /proc/cmdline again during reload.
Also, when initially reading the configuration file when program starts,
don't treat any errors as fatal. The configuration done in there doesn't
seem important enough to refuse boot.
This makes the handling of this option match what we do in unit files. I think
consistency is important here. (As it happens, it is the only option in
system.conf that is "non-atomic", i.e. where there's a list of things which can
be split over multiple assignments. All other options are single-valued, so
there's no issue of how to handle multiple assignments.)
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
fuzz-journal-stream: avoid assertion failure on samples which don't fit in pipe
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11587.
We had a sample which was large enough that write(2) failed to push all the
data into the pipe, and an assert failed. The code could be changed to use
a loop, but then we'd need to interleave writes and sd_event_run (to process
the journal). I don't think the complexity is worth it — fuzzing works best
if the sample is not too huge anyway. So let's just reject samples above 64k,
and tell oss-fuzz about this limit.
journald: check whether sscanf has changed the value corresponding to %n
It's possible for sscanf to receive strings containing all three fields
and not matching the template at the same time. When this happens the
value of k doesn't change, which basically means that process_audit_string
tries to access memory randomly. Sometimes it works and sometimes it doesn't :-)
See also https://bugzilla.redhat.com/show_bug.cgi?id=1059314.
test: initialize syslog_fd in fuzz-journald-kmsg too
This is a follow-up to 8857fb9beb9dcb that prevents the fuzzer from crashing with
```
==220==ERROR: AddressSanitizer: ABRT on unknown address 0x0000000000dc (pc 0x7ff4953c8428 bp 0x7ffcf66ec290 sp 0x7ffcf66ec128 T0)
SCARINESS: 10 (signal)
#0 0x7ff4953c8427 in gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x35427)
#1 0x7ff4953ca029 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x37029)
#2 0x7ff49666503a in log_assert_failed_realm /work/build/../../src/systemd/src/basic/log.c:805:9
#3 0x7ff496614ecf in safe_close /work/build/../../src/systemd/src/basic/fd-util.c:66:17
#4 0x548806 in server_done /work/build/../../src/systemd/src/journal/journald-server.c:2064:9
#5 0x5349fa in LLVMFuzzerTestOneInput /work/build/../../src/systemd/src/fuzz/fuzz-journald-kmsg.c:26:9
#6 0x592755 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:571:15
#7 0x590627 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool*) /src/libfuzzer/FuzzerLoop.cpp:480:3
#8 0x594432 in fuzzer::Fuzzer::MutateAndTestOne() /src/libfuzzer/FuzzerLoop.cpp:708:19
#9 0x5973c6 in fuzzer::Fuzzer::Loop(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, fuzzer::fuzzer_allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) /src/libfuzzer/FuzzerLoop.cpp:839:5
#10 0x574541 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:764:6
#11 0x5675fc in main /src/libfuzzer/FuzzerMain.cpp:20:10
#12 0x7ff4953b382f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#13 0x420f58 in _start (/out/fuzz-journald-kmsg+0x420f58)
```
The function takes a pointer to a random block of memory and
the length of that block. It shouldn't crash every time it sees
a zero byte at the beginning there.
This should help the dev-kmsg fuzzer to keep going.