This patch extracts the code in charge of initializing the default values for
those rlimits in order to create dedicated functions, which take care of their
initialization.
These functions are then called in parse_configuration() so we make sure that
the default values for these rlimits get restored every time PID1 is reloading
its configuration.
Let's experiment with this in systemd upstream first. Downstreams and
users can after all still comment this easily.
Besides compat concern the most often heard issue with such high PIDs is
usability, since they are potentially hard to type. I am not entirely sure though
whether 4194304 (as largest new PID) is that much worse to type or to
copy than 65563.
This should also simplify management of per system tasks limits as by
this move the sysctl /proc/sys/kernel/threads-max becomes the primary
knob to control how many processes to have in parallel.
Jan Synacek [Fri, 31 Jan 2020 14:17:25 +0000 (15:17 +0100)]
polkit: when authorizing via PK let's re-resolve callback/userdata instead of caching it
Previously, when doing an async PK query we'd store the original
callback/userdata pair and call it again after the PK request is
complete. This is problematic, since PK queries might be slow and in the
meantime the userdata might be released and re-acquired. Let's avoid
this by always traversing through the message handlers so that we always
re-resolve the callback and userdata pair and thus can be sure it's
up-to-date and properly valid.
Jan Synacek [Fri, 31 Jan 2020 10:34:45 +0000 (11:34 +0100)]
sd-bus: introduce API for re-enqueuing incoming messages
When authorizing via PolicyKit we want to process incoming method calls
twice: once to process and figure out that we need PK authentication,
and a second time after we aquired PK authentication to actually execute
the operation. With this new call sd_bus_enqueue_for_read() we have a
way to put an incoming message back into the read queue for this
purpose.
This might have other uses too, for example debugging.
Related: CVE-2020-1712
bus-message: introduce two kinds of references to bus messages
Before this commit bus messages had a single reference count: when it
reached zero the message would be freed. This simple approach meant a
cyclic dependency was typically seen: a message that was enqueued in a
bus connection object would reference the bus connection object but also
itself be referenced by the bus connection object. So far out strategy
to avoid cases like this was: make sure to process the bus connection
regularly so that messages don#t stay queued, and at exit flush/close
the connection so that the message queued would be emptied, and thus the
cyclic dependencies resolved. Im many cases this isn't done properly
however.
With this change, let's address the issue more systematically: let's
break the reference cycle. Specifically, there are now two types of
references to a bus message:
1. A regular one, which keeps both the message and the bus object it is
associated with pinned.
2. A "queue" reference, which is weaker: it pins the message, but not
the bus object it is associated with.
The idea is then that regular user handling uses regular references, but
when a message is enqueued on its connection, then this takes a "queue"
reference instead. This then means that a queued message doesn't imply
the connection itself remains pinned, only regular references to the
connection or a message associated with it do. Thus, if we end up in the
situation where a user allocates a bus and a message and enqueues the
latter in the former and drops all refs to both, then this will detect
this case and free both.
Note that this scheme isn't perfect, it only covers references between
messages and the busses they are associated with. If OTOH a bus message
is enqueued on a different bus than it is associated with cyclic deps
cannot be recognized with this simple algorithm, and thus if you enqueue
a message associated with a bus A on a bus B, and another message
associated with bus B on a bus A, a cyclic ref will be in effect and not
be discovered. However, given that this is an exotic case (though one
that happens, consider systemd-bus-stdio-bridge), it should be OK not to
cover with this, and people have to explicit flush all queues on exit in
that case.
Note that this commit only establishes the separate reference counters
per message. A follow-up commit will start making use of this from the
bus connection object.
sd-bus: make sure dispatch_rqueue() initializes return parameter on all types of success
Let's make sure our own code follows coding style and initializes all
return values on all types of success (and leaves it uninitialized in
all types of failure).
Let's do this like we usually do and size arrays with size_t.
We already do this for the "allocated" counter correctly, and externally
we expose the queue sizes as uint64_t anyway, hence there's really no
point in usigned "unsigned" internally.
cryptsetup: rework how we log about activation failures
First of all let's always log where the errors happen, and not in an
upper stackframe, in all cases. Previously we'd do this somethis one way
and sometimes another, which resulted in sometimes duplicate logging and
sometimes none.
When we cannot activate something due to bad password the kernel gives
us EPERM. Let's uniformly return this EAGAIN, so tha the next password
is tried. (previously this was done in most cases but not in all)
When we get EPERM let's also explicitly indicate that this probably
means the password is simply wrong.
# LANG=C journalctl --no-pager -u A.service -u B.service -u C.target -b
-- Logs begin at Mon 2019-09-09 00:25:06 EDT, end at Thu 2019-10-24 22:28:47 EDT. --
Oct 24 22:27:47 localhost.localdomain systemd[1]: Starting A...
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Child 967 belongs to A.service.
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Running next main command for state start.
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Passing 0 fds to service
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: About to execute: /usr/bin/sleep 60
Oct 24 22:27:47 localhost.localdomain systemd[1]: A.service: Forked /usr/bin/sleep as 968
Oct 24 22:27:47 localhost.localdomain systemd[968]: A.service: Executing: /usr/bin/sleep 60
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Trying to enqueue job A.service/reload/replace
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Merged into running job, re-running: A.service/reload as 1288
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Enqueued job A.service/reload as 1288
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Unit cannot be reloaded because it is inactive.
Oct 24 22:27:52 localhost.localdomain systemd[1]: A.service: Job 1288 A.service/reload finished, result=invalid
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Passing 0 fds to service
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: About to execute: /usr/bin/echo B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Forked /usr/bin/echo as 970
Oct 24 22:27:52 localhost.localdomain systemd[970]: B.service: Executing: /usr/bin/echo B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Failed to send unit change signal for B.service: Connection reset by peer
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Changed dead -> start
Oct 24 22:27:52 localhost.localdomain systemd[1]: Starting B...
Oct 24 22:27:52 localhost.localdomain echo[970]: B
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Child 970 belongs to B.service.
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Changed start -> exited
Oct 24 22:27:52 localhost.localdomain systemd[1]: B.service: Job 1371 B.service/start finished, result=done
Oct 24 22:27:52 localhost.localdomain systemd[1]: Started B.
Oct 24 22:27:52 localhost.localdomain systemd[1]: C.target: Job 1287 C.target/start finished, result=done
Oct 24 22:27:52 localhost.localdomain systemd[1]: Reached target C.
Oct 24 22:27:52 localhost.localdomain systemd[1]: C.target: Failed to send unit change signal for C.target: Connection reset by peer
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Child 968 belongs to A.service.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Running next main command for state start.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Passing 0 fds to service
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: About to execute: /usr/bin/echo A2
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Forked /usr/bin/echo as 972
Oct 24 22:28:47 localhost.localdomain systemd[972]: A.service: Executing: /usr/bin/echo A2
Oct 24 22:28:47 localhost.localdomain echo[972]: A2
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Child 972 belongs to A.service.
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Main process exited, code=exited, status=0/SUCCESS
Oct 24 22:28:47 localhost.localdomain systemd[1]: A.service: Changed start -> exited
The issue occurs not only in reload command, i.e.:
Michal Sekletár [Wed, 27 Nov 2019 13:27:58 +0000 (14:27 +0100)]
cryptsetup: reduce the chance that we will be OOM killed
cryptsetup introduced optional locking scheme that should serialize
unlocking keyslots which use memory hard key derivation
function (argon2). Using the serialization should prevent OOM situation
in early boot while unlocking encrypted volumes.
Michal Sekletár [Mon, 18 Nov 2019 11:50:11 +0000 (12:50 +0100)]
core: introduce NUMAPolicy and NUMAMask options
Make possible to set NUMA allocation policy for manager. Manager's
policy is by default inherited to all forked off processes. However, it
is possible to override the policy on per-service basis. Currently we
support, these policies: default, prefer, bind, interleave, local.
See man 2 set_mempolicy for details on each policy.
Overall NUMA policy actually consists of two parts. Policy itself and
bitmask representing NUMA nodes where is policy effective. Node mask can
be specified using related option, NUMAMask. Default mask can be
overwritten on per-service level.
shared/cpu-set-util: only force range printing one time
The idea is to have at least one range to make the new format clearly
distinguishable from the old. But it is enough to just do it once.
In particular, in case the affinity would be specified like 0, 2, 4, 6…,
this gives much shorter output.
cpu_set_malloc() was the last user. It doesn't seem useful to keep
it just to save the allocation of a few hundred bytes in a test, so
it is dropped and a fixed maximum is allocated (1024 bytes).
pid1: when reloading configuration, forget old settings
If we had a configuration setting from a configuration file, and it was
removed, we'd still remember the old value, because there's was no mechanism to
"reset" everything, just to assign new values.
Note that the effect of this is limited. For settings that have an "ongoing" effect,
like systemd.confirm_spawn, the new value is simply used. But some settings can only
be set at start.
In particular, CPUAffinity= will be updated if set to a new value, but if
CPUAffinity= is fully removed, it will not be reset, simply because we don't
know what to reset it to. We might have inherited a setting, or we might have
set it ourselves. In principle we could remember the "original" value that was
set when we were executed, but propagate this over reloads and reexecs, but
that would be a lot of work for little gain. So this corner case of removal of
CPUAffinity= is not handled fully, and a reboot is needed to execute the
change. As a work-around, a full mask of CPUAffinity=0-8191 can be specified.
pid1: don't reset setting from /proc/cmdline upon restart
We have settings which may be set on the kernel command line, and also
in /proc/cmdline (for pid1). The settings in /proc/cmdline have higher priority
of course. When a reload was done, we'd reload just the configuration file,
losing the overrides.
So read /proc/cmdline again during reload.
Also, when initially reading the configuration file when program starts,
don't treat any errors as fatal. The configuration done in there doesn't
seem important enough to refuse boot.
This makes the handling of this option match what we do in unit files. I think
consistency is important here. (As it happens, it is the only option in
system.conf that is "non-atomic", i.e. where there's a list of things which can
be split over multiple assignments. All other options are single-valued, so
there's no issue of how to handle multiple assignments.)
The CPU_SET_S api is pretty bad. In particular, it has a parameter for the size
of the array, but operations which take two (CPU_EQUAL_S) or even three arrays
(CPU_{AND,OR,XOR}_S) still take just one size. This means that all arrays must
be of the same size, or buffer overruns will occur. This is exactly what our
code would do, if it received an array of unexpected size over the network.
("Unexpected" here means anything different from what cpu_set_malloc() detects
as the "right" size.)
Let's rework this, and store the size in bytes of the allocated storage area.
The code will now parse any number up to 8191, independently of what the current
kernel supports. This matches the kernel maximum setting for any architecture,
to make things more portable.
fuzz-journal-stream: avoid assertion failure on samples which don't fit in pipe
Fixes https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=11587.
We had a sample which was large enough that write(2) failed to push all the
data into the pipe, and an assert failed. The code could be changed to use
a loop, but then we'd need to interleave writes and sd_event_run (to process
the journal). I don't think the complexity is worth it — fuzzing works best
if the sample is not too huge anyway. So let's just reject samples above 64k,
and tell oss-fuzz about this limit.
journald: check whether sscanf has changed the value corresponding to %n
It's possible for sscanf to receive strings containing all three fields
and not matching the template at the same time. When this happens the
value of k doesn't change, which basically means that process_audit_string
tries to access memory randomly. Sometimes it works and sometimes it doesn't :-)
See also https://bugzilla.redhat.com/show_bug.cgi?id=1059314.
test: initialize syslog_fd in fuzz-journald-kmsg too
This is a follow-up to 8857fb9beb9dcb that prevents the fuzzer from crashing with
```
==220==ERROR: AddressSanitizer: ABRT on unknown address 0x0000000000dc (pc 0x7ff4953c8428 bp 0x7ffcf66ec290 sp 0x7ffcf66ec128 T0)
SCARINESS: 10 (signal)
#0 0x7ff4953c8427 in gsignal (/lib/x86_64-linux-gnu/libc.so.6+0x35427)
#1 0x7ff4953ca029 in abort (/lib/x86_64-linux-gnu/libc.so.6+0x37029)
#2 0x7ff49666503a in log_assert_failed_realm /work/build/../../src/systemd/src/basic/log.c:805:9
#3 0x7ff496614ecf in safe_close /work/build/../../src/systemd/src/basic/fd-util.c:66:17
#4 0x548806 in server_done /work/build/../../src/systemd/src/journal/journald-server.c:2064:9
#5 0x5349fa in LLVMFuzzerTestOneInput /work/build/../../src/systemd/src/fuzz/fuzz-journald-kmsg.c:26:9
#6 0x592755 in fuzzer::Fuzzer::ExecuteCallback(unsigned char const*, unsigned long) /src/libfuzzer/FuzzerLoop.cpp:571:15
#7 0x590627 in fuzzer::Fuzzer::RunOne(unsigned char const*, unsigned long, bool, fuzzer::InputInfo*, bool*) /src/libfuzzer/FuzzerLoop.cpp:480:3
#8 0x594432 in fuzzer::Fuzzer::MutateAndTestOne() /src/libfuzzer/FuzzerLoop.cpp:708:19
#9 0x5973c6 in fuzzer::Fuzzer::Loop(std::__1::vector<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >, fuzzer::fuzzer_allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > > > const&) /src/libfuzzer/FuzzerLoop.cpp:839:5
#10 0x574541 in fuzzer::FuzzerDriver(int*, char***, int (*)(unsigned char const*, unsigned long)) /src/libfuzzer/FuzzerDriver.cpp:764:6
#11 0x5675fc in main /src/libfuzzer/FuzzerMain.cpp:20:10
#12 0x7ff4953b382f in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x2082f)
#13 0x420f58 in _start (/out/fuzz-journald-kmsg+0x420f58)
```
The function takes a pointer to a random block of memory and
the length of that block. It shouldn't crash every time it sees
a zero byte at the beginning there.
This should help the dev-kmsg fuzzer to keep going.
resolved: allow access to Set*Link and Revert methods through polkit
This matches what is done in networkd very closely. In fact even the
policy descriptions are all identical (with s/network/resolve), except
for the last one:
resolved has org.freedesktop.resolve1.revert while
networkd has org.freedesktop.network1.revert-ntp and
org.freedesktop.network1.revert-dns so the description is a bit different.
This doesn't matter much, but let's just do the loop once and allocate
the populate the result set on the fly. If we find an error, it'll get
cleaned up automatically.
This only affects systemd-resolved. bus_open_system_watch_bind_with_description()
is also used in timesyncd, but it has no methods, only read-only properties, and
in networkd, but it annotates all methods with SD_BUS_VTABLE_UNPRIVILEGED and does
polkit checks.
Fabian Henneke [Wed, 21 Aug 2019 09:17:59 +0000 (11:17 +0200)]
udev: Add id program and rule for FIDO security tokens
Add a fido_id program meant to be run for devices in the hidraw
subsystem via an IMPORT directive. The program parses the HID report
descriptor and assigns the ID_SECURITY_TOKEN environment variable if a
declared usage matches the FIDO_CTAPHID_USAGE declared in the FIDO CTAP
specification. This replaces the previous approach of whitelisting all
known security token models manually.
This commit is accompanied by a test suite and a fuzzer target for the
descriptor parsing routine.
Michal Sekletar [Tue, 26 Feb 2019 16:33:27 +0000 (17:33 +0100)]
selinux: don't log SELINUX_INFO and SELINUX_WARNING messages to audit
Previously we logged even info message from libselinux as USER_AVC's to
audit. For example, setting SELinux to permissive mode generated
following audit message,
This is unnecessary and wrong at the same time. First, kernel already
records audit event that SELinux was switched to permissive mode, also
the type of the message really shouldn't be USER_AVC.
Let's ignore SELINUX_WARNING and SELINUX_INFO and forward to audit only
USER_AVC's and errors as these two libselinux message types have clear
mapping to audit message types.
Andrew Jorgensen [Wed, 25 Jul 2018 15:06:57 +0000 (08:06 -0700)]
shared/sleep-config: exclude zram devices from hibernation candidates
On a host with sufficiently large zram but with no actual swap, logind will
respond to CanHibernate() with yes. With this patch, it will correctly respond
no, unless there are other swap devices to consider.
Frantisek Sumsal [Mon, 21 Oct 2019 16:39:39 +0000 (18:39 +0200)]
test: bump the second partition's size to 50M
The former size (10M) caused systemd-journald to crash with SIGABRT when
used on a LUKS2 partition, as the LUKS2 metadata consume a significant
part of the 10M partition, thus leaving no space for the journal file
itself (relevant for TEST-02-CRYPTSETUP). This change has been present
in upstream for a while anyway.
Frantisek Sumsal [Fri, 15 Mar 2019 09:05:33 +0000 (10:05 +0100)]
test: use PBKDF2 instead of Argon2 in cryptsetup...
to reduce memory requirements for volume manipulation. Also,
to further improve the test performance, reduce number of PBKDF
iterations to 1000 (allowed minimum).
Anita Zhang [Mon, 8 Oct 2018 03:28:36 +0000 (20:28 -0700)]
core: implement per unit journal rate limiting
Add LogRateLimitIntervalSec= and LogRateLimitBurst= options for
services. If provided, these values get passed to the journald
client context, and those values are used in the rate limiting
function in the journal over the the journald.conf values.
Franck Bui [Tue, 19 Mar 2019 09:59:26 +0000 (10:59 +0100)]
core: only watch processes when it's really necessary
If we know that main pid is our child then it's unnecessary to watch all
other processes of a unit since in this case we will get SIGCHLD when the main
process will exit and will act upon accordingly.
So let's watch all processes only if the main process is not our child since in
this case we need to detect when the cgroup will become empty in order to
figure out when the service becomes dead. This is only needed by cgroupv1.
Thanks Renaud Métrich for backporting this to RHEL.
Resolves: #1744972
Franck Bui [Mon, 18 Mar 2019 19:59:36 +0000 (20:59 +0100)]
core: reduce the number of stalled PIDs from the watched processes list when possible
Some PIDs can remain in the watched list even though their processes have
exited since a long time. It can easily happen if the main process of a forking
service manages to spawn a child before the control process exits for example.
However when a pid is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows if the pid should belong to this unit exclusively: if
we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.
If we learnt about a PID in any other form (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
Thanks Renaud Métrich for backporting this to RHEL.
Resolves: #1744972
Jan Synacek [Thu, 17 Oct 2019 07:37:35 +0000 (09:37 +0200)]
udev: introduce CONST key name
Currently, there is no way to match against system-wide constants, such
as architecture or virtualization type, without forking helper binaries.
That potentially results in a huge number of spawned processes which
output always the same answer.
This patch introduces a special CONST keyword which takes a hard-coded
string as its key and returns a value assigned to that key. Currently
implemented are CONST{arch} and CONST{virt}, which can be used to match
against the system's architecture and virtualization type.
Michal Sekletar [Tue, 3 Sep 2019 08:05:42 +0000 (10:05 +0200)]
buildsys: don't garbage collect sections while linking
gc-sections is actually very aggressive and garbage collects ELF
sections used by annobin gcc plugin and annocheck then reports gaps in
coverage. Let's drop that linker flag.
core: try to reopen /dev/kmsg again right after mounting /dev
I was debugging stuff during early boot, and was confused that I never
found the logs for it in kmsg. The reason for that was that /proc is
generally not mounted the first time we do log_open() and hence
log_set_target(LOG_TARGET_KMSG) we do when running as PID 1 had not
effect. A lot later during start-up we call log_open() again where this
is fixed (after the point where we close all remaining fds still open),
but in the meantime no logs every got written to kmsg. This patch fixes
that.
ask-password: prevent buffer overrow when reading from keyring
When we read from keyring, a temporary buffer is allocated in order to
determine the size needed for the entire data. However, when zeroing that area,
we use the data size returned by the read instead of the lesser size allocate
for the buffer.
That will cause memory corruption that causes systemd-cryptsetup to crash
either when a single large password is used or when multiple passwords have
already been pushed to the keyring.
kernel-install: do not require non-empty kernel cmdline
When booting with Fedora-Server-dvd-x86_64-30-20190411.n.0.iso,
/proc/cmdline is empty (libvirt, qemu host with bios, not sure if that
matters), after installation to disk, anaconda would "crash" in kernel-core
%posttrans, after calling kernel-install, because dracut would fail
with
> Could not determine the kernel command line parameters.
> Please specify the kernel command line in /etc/kernel/cmdline!
I guess it's legitimate, even if unusual, to have no cmdline parameters.
Two changes are done in this patch:
1. do not fail if the cmdline is empty.
2. if /usr/lib/kernel/cmdline or /etc/kernel/cmdline are present, but
empty, ignore /proc/cmdline. If there's explicit configuration to
have empty cmdline, don't ignore it.
The same change was done in dracut:
https://github.com/dracutdevs/dracut/pull/561.