Lukas Nykryn [Thu, 25 Jun 2015 07:20:59 +0000 (09:20 +0200)]
Revert "core: one step back again, for nspawn we actually can't wait for cgroups running empty since systemd will get exactly zero notifications about it"
Jussi Pakkanen [Sat, 6 Apr 2019 19:59:06 +0000 (21:59 +0200)]
meson: drop misplaced -Wl,--undefined argument
Ld's man page says the following:
-u symbol
--undefined=symbol
Force symbol to be entered in the output file as an undefined symbol. Doing
this may, for example, trigger linking of additional modules from standard
libraries. -u may be repeated with different option arguments to enter
additional undefined symbols. This option is equivalent to the "EXTERN"
linker script command.
If this option is being used to force additional modules to be pulled into
the link, and if it is an error for the symbol to remain undefined, then the
option --require-defined should be used instead.
This would imply that it always requires an argument, which this does not
pass. Thus it will grab the next argument on the command line as its
argument. Before it took one of the many -lrt args (presumably) and now it
grabs something other random linker argument and things break.
[zj: this line was added in the first version of the meson configuration back
in 5c23128daba7236a6080383b2a5649033cfef85c. AFAICT, this was a mistake. No
such flag appeared in Makefile.am at the time.]
sd-bus: if we receive an invalid dbus message, ignore and proceeed
dbus-daemon might have a slightly different idea of what a valid msg is
than us (for example regarding valid msg and field sizes). Let's hence
try to proceed if we can and thus drop messages rather than fail the
connection if we fail to validate a message.
Hopefully the differences in what is considered valid are not visible
for real-life usecases, but are specific to exploit attempts only.
Allocate temporary strings to hold dbus paths on the heap
Paths are limited to BUS_PATH_SIZE_MAX but the maximum size is anyway too big
to be allocated on the stack, so let's switch to the heap where there is a
clear way to understand if the allocation fails.
Refuse dbus message paths longer than BUS_PATH_SIZE_MAX limit.
Even though the dbus specification does not enforce any length limit on the
path of a dbus message, having to analyze too long strings in PID1 may be
time-consuming and it may have security impacts.
In any case, the limit is set so high that real-life applications should not
have a problem with it.
bus-socket: Fix line_begins() to accept word matching full string
The switch to memory_startswith() changed the logic to only look for a space or
NUL byte after the matched word, but matching the full size should also be
acceptable.
This changed the behavior of parsing of "AUTH\r\n", where m will be set to 4,
since even though the word will match, the check for it being followed by ' '
or NUL will make line_begins() return false.
Tested:
- Using netcat to connect to the private socket directly:
$ echo -ne '\0AUTH\r\n' | sudo nc -U /run/systemd/private
REJECTED EXTERNAL ANONYMOUS
- Running the Ignition blackbox test:
$ sudo sh -c 'PATH=$PWD/bin/amd64:$PATH ./tests.test'
PASS
tests: also run TEST-01-BASIC in an unprivileged container (#9957)
This should make it much easier to catch regressions like
https://github.com/systemd/systemd/issues/9914 and
https://github.com/systemd/systemd/issues/8535.
This GDB script was converted to use Python 3 along with all other
Python scripts in commit b95f5528cc, but still used the Python 2 print
statement syntax instead of the Python 3 print function. Fix that.
We also add the Python 2 compatibility statement, just in case some GDB
still uses Python 2 instead of Python 3.
Jan Synacek [Wed, 30 Jan 2019 09:36:53 +0000 (10:36 +0100)]
rules: implement new memory hotplug policy
Our new policy is based on following motivations (assumptions),
* we want to allow the system to use hotplugged memory
* we want memory ballon inflation to work as expected in VMs (going for small
to big in terms of memory footprint)
* we want to allow memory hotplug and memory hot-unplug on high-end
enterprise server (we assume that node0 will have sufficient memory
resources and marking all memory as movable shouldn't be a problem)
Policy:
* nevert online memory on s390 (on both physical and z/VM)
* mark memory as "online_movable" on physical machines
* mark memory as "online" in VMs
If you have the feeling that all this is very wrong and we shouldn't
encode complex policies in udev rules you are absolutely right. However,
for now, we don't have any better place where to put it. In ideal world
we would have a user-space daemon that would be able to configure the
system wrt. to currently present HW and user-defined policy.
Frantisek Sumsal [Tue, 29 Jan 2019 18:33:15 +0000 (19:33 +0100)]
test: replace echo with socat
The original version of the test used netcat along with a standard
AF_UNIX socket, which caused issues across different netcat
implementations. The AF_UNIX socket was then replaced by a FIFO with a
simple echo, which, however, suffers from the same issue (some echo
implementations don't check if the write() was successful).
Let's revert back to the AF_UNIX socket, but replace netcat with socat,
which, hopefully, resolves the main issue.
Michal Sekletar [Fri, 14 Dec 2018 14:17:27 +0000 (15:17 +0100)]
journald: correctly attribute log messages also with cgroupsv1
With cgroupsv1 a zombie process is migrated to root cgroup in all
hierarchies. This was changed for unified hierarchy and /proc/PID/cgroup
reports cgroup to which process belonged before it exited.
Be more suspicious about cgroup path reported by the kernel and use
unit_id provided by the log client if the kernel reports that process is
running in the root cgroup.
Users tend to care the most about 'log->unit_id' mapping so systemctl
status can correctly report last log lines. Also we wouldn't be able to
infer anything useful from "/" path anyway.
journal-remote: set a limit on the number of fields in a message
Existing use of E2BIG is replaced with ENOBUFS (entry too long), and E2BIG is
reused for the new error condition (too many fields).
This matches the change done for systemd-journald, hence forming the second
part of the fix for CVE-2018-16865
(https://bugzilla.redhat.com/show_bug.cgi?id=1653861).
Calling mhd_respond(), which ulimately calls MHD_queue_response() is
ineffective at point, becuase MHD_queue_response() immediately returns
MHD_NO signifying an error, because the connection is in state
MHD_CONNECTION_CONTINUE_SENT.
As Christian Grothoff kindly explained:
> You are likely calling MHD_queue_repsonse() too late: once you are
> receiving upload_data, HTTP forces you to process it all. At this time,
> MHD has already sent "100 continue" and cannot take it back (hence you
> get MHD_NO!).
>
> In your request handler, the first time when you are called for a
> connection (and when hence *upload_data_size == 0 and upload_data ==
> NULL) you must check the content-length header and react (with
> MHD_queue_response) based on this (to prevent MHD from automatically
> generating 100 continue).
If we ever encounter this kind of error, print a warning and immediately
abort the connection. (The alternative would be to keep reading the data,
but ignore it, and return an error after we get to the end of data.
That is possible, but of course puts additional load on both the
sender and reciever, and doesn't seem important enough just to return
a good error message.)
Note that sending of the error does not work (the connection is always aborted
when MHD_queue_response is used with MHD_RESPMEM_MUST_FREE, as in this case)
with libµhttpd 0.59, but works with 0.61:
https://src.fedoraproject.org/rpms/libmicrohttpd/pull-request/1
journald: lower the maximum entry size limit to ½ for non-sealed fds
We immediately read the whole contents into memory, making thigs much more
expensive. Sealed fds should be used instead since they are more efficient
on our side.
journald: when processing a native message, bail more quickly on overbig messages
We'd first parse all or most of the message, and only then consider if it
is not too large. Also, when encountering a single field over the limit,
we'd still process the preceding part of the message. Let's be stricter,
and check size limits early, and let's refuse the whole message if it fails
any of the size limits.
journald: set a limit on the number of fields (1k)
We allocate a iovec entry for each field, so with many short entries,
our memory usage and processing time can be large, even with a relatively
small message size. Let's refuse overly long entries.
What from I can see, the problem is not from an alloca, despite what the CVE
description says, but from the attack multiplication that comes from creating
many very small iovecs: (void* + size_t) for each three bytes of input message.
journald: periodically drop cache for all dead PIDs
In normal use, this allow us to drop dead entries from the cache and reduces
the cache size so that we don't evict entries unnecessarily. The time limit is
there mostly to serve as a guard against malicious logging from many different
PIDs.
journal: limit the number of entries in the cache based on available memory
This is far from perfect, but should give mostly reasonable values. My
assumption is that if somebody has a few hundred MB of memory, they are
unlikely to have thousands of processes logging. A hundred would already be a
lot. So let's scale the cache size propritionally to the total memory size,
with clamping on both ends.
The formula gives 64 cache entries for each GB of RAM.
procfs-util: expose functionality to query total memory
procfs_memory_get_current is renamed to procfs_memory_get_used, because
"current" can mean anything, including total memory, used memory, and free
memory, as long as the value is up to date.
coredump: fix message when we fail to save a journald coredump
If creation of the message failed, we'd write a bogus entry:
systemd-coredump[1400]: Cannot store coredump of 416 (systemd-journal): No space left on device
systemd-coredump[1400]: MESSAGE=Process 416 (systemd-journal) of user 0 dumped core.
systemd-coredump[1400]: Coredump diverted to
2MB might be hard for some clients to use meaningfully, but OTOH, it is
important to log the full commandline sometimes. For example, when the program
is crashing, the exact argument list is useful.
journald: do not store the iovec entry for process commandline on stack
This fixes a crash where we would read the commandline, whose length is under
control of the sending program, and then crash when trying to create a stack
allocation for it.
coredump: remove duplicate MESSAGE= prefix from message
systemd-coredump[9982]: MESSAGE=Process 771 (systemd-journal) of user 0 dumped core.
systemd-coredump[9982]: Coredump diverted to /var/lib/systemd/coredump/core...
log_dispatch() calls log_dispatch_internal() which calls write_to_journal()
which appends MESSAGE= on its own.
tests: drop the precondition check for inherited flag
Docker's default capability set has the inherited flag already
set - that breaks tests which expect otherwise. Let's just
drop the check and run the test anyway.
Michael Biebl [Mon, 16 Jul 2018 09:27:44 +0000 (11:27 +0200)]
test: Drop SKIP_INITRD for QEMU-based tests
Not all distros support booting without an initrd. E.g. the Debian
kernel builds ext4 as a module and so relies on an initrd to
successfully start the QEMU-based images.
Lubomir Rintel [Wed, 28 Nov 2018 10:44:20 +0000 (11:44 +0100)]
sysctl.d: switch net.ipv4.conf.all.rp_filter from 1 to 2
This switches the RFC3704 Reverse Path filtering from Strict mode to Loose
mode. The Strict mode breaks some pretty common and reasonable use cases,
such as keeping connections via one default route alive after another one
appears (e.g. plugging an Ethernet cable when connected via Wi-Fi).
The strict filter also makes it impossible for NetworkManager to do
connectivity check on a newly arriving default route (it starts with a
higher metric and is bumped lower if there's connectivity).
Kernel's default is 0 (no filter), but a Loose filter is good enough. The
few use cases where a Strict mode could make sense can easily override
this.
The distributions that don't care about the client use cases and prefer a
strict filter could just ship a custom configuration in
/usr/lib/sysctl.d/ to override this.
Michal Sekletar [Tue, 4 Sep 2018 18:03:34 +0000 (20:03 +0200)]
cryptsetup-generator: allow whitespace characters in keydev specification
For example, <luks.uuid>=/keyfile:LABEL="KEYFILE FS" previously wouldn't
work, because we truncated label at the first whitespace character,
i.e. LABEL="KEYFILE".
cryptsetup: don't use %m if there's no error to show
We are not the ones receiving an error here, but the ones generating it,
hence we shouldn't show it with %m, that's just confusing, as it
suggests we received an error from some other call.
Michal Sekletar [Thu, 30 Aug 2018 08:45:11 +0000 (08:45 +0000)]
cryptsetup-generator: introduce basic keydev support
Dracut has a support for unlocking encrypted drives with keyfile stored
on the external drive. This support is included in the generated initrd
only if systemd module is not included.
When systemd is used in initrd then attachment of encrypted drives is
handled by systemd-cryptsetup tools. Our generator has support for
keyfile, however, it didn't support keyfile on the external block
device (keydev).
This commit introduces basic keydev support. Keydev can be specified per
luks.uuid on the kernel command line. Keydev is automatically mounted
during boot and we look for keyfile in the keydev
mountpoint (i.e. keyfile path is prefixed with the keydev mount point
path). After crypt device is attached we automatically unmount
where keyfile resides.
Jan Synacek [Wed, 31 Oct 2018 11:50:19 +0000 (12:50 +0100)]
sd-bus: properly initialize containers
Fixes a SIGSEGV introduced by commit 38a5315a3a6fab745d8c86ff9e486faaf50b28d1.
The same problem doesn't exist upstream, as the container structure
there is initialized using a compound literal, which is zeroed out by
default.
sd-bus: unify three code-paths which free struct bus_container
We didn't free one of the fields in two of the places.
$ valgrind --show-leak-kinds=all --leak-check=full \
build/fuzz-bus-message \
test/fuzz/fuzz-bus-message/leak-c09c0e2256d43bc5e2d02748c8d8760e7bc25d20
...
==14457== HEAP SUMMARY:
==14457== in use at exit: 3 bytes in 1 blocks
==14457== total heap usage: 509 allocs, 508 frees, 51,016 bytes allocated
==14457==
==14457== 3 bytes in 1 blocks are definitely lost in loss record 1 of 1
==14457== at 0x4C2EBAB: malloc (vg_replace_malloc.c:299)
==14457== by 0x53AFE79: strndup (in /usr/lib64/libc-2.27.so)
==14457== by 0x4F52EB8: free_and_strndup (string-util.c:1039)
==14457== by 0x4F8E1AB: sd_bus_message_peek_type (bus-message.c:4193)
==14457== by 0x4F76CB5: bus_message_dump (bus-dump.c:144)
==14457== by 0x108F12: LLVMFuzzerTestOneInput (fuzz-bus-message.c:24)
==14457== by 0x1090F7: main (fuzz-main.c:34)
==14457==
==14457== LEAK SUMMARY:
==14457== definitely lost: 3 bytes in 1 blocks
detect-virt: do not try to read all of /proc/cpuinfo
Quoting https://github.com/systemd/systemd/issues/10074:
> detect_vm_uml() reads /proc/cpuinfo with read_full_file()
> read_full_file() has a file max limit size of READ_FULL_BYTES_MAX=(4U*1024U*1024U)
> Unfortunately, the size of my /proc/cpuinfo is bigger, approximately:
> echo $(( 4* $(cat /proc/cpuinfo | wc -c)))
> 9918072
> This causes read_full_file() to fail and the Condition test fallout.
Let's just read line by line until we find an intersting line. This also
helps if not running under UML, because we avoid reading as much data.
Lukas Nykryn [Thu, 25 Oct 2018 14:21:26 +0000 (16:21 +0200)]
proc-cmdline: introduce PROC_CMDLINE_RD_STRICT
Our current set of flags allows an option to be either
use just in initrd or both in initrd and normal system.
This new flag is intended to be used in the case where
you want apply some settings just in initrd or just
in normal system.
resolved: create /etc/resolv.conf symlink at runtime
If the symlink doesn't exists, and we are being started, let's
create it to provie name resolution.
If it exists, do nothing. In particular, if it is a broken symlink,
we cannot really know if the administator configured it to point to
a location used by some service that hasn't started yet, so we
don't touch it in that case either.
Introduce free_and_strndup and use it in bus-message.c
v2: fix error in free_and_strndup()
When the orignal and copied message were the same, but shorter than specified
length l, memory read past the end of the buffer would be performed. A test
case is included: a string that had an embedded NUL ("q\0") is used to replace
"q".
v3: Fix one more bug in free_and_strndup and add tests.
v4: Some style fixed based on review, one more use of free_and_replace, and
make the tests more comprehensive.
It's not useful to bump the reference count before checking if the object is
NULL. Thanks to d40f5cc498 we can do this ;).
Related to https://bugzilla.redhat.com/show_bug.cgi?id=1576084,
https://bugzilla.redhat.com/show_bug.cgi?id=1575340,
https://bugzilla.redhat.com/show_bug.cgi?id=1575350. I'm not sure why those two
people hit this code path, while most people don't. At least we won't abort.