Follow up for #4546:
> @@ -848,8 +848,7 @@ static int bus_kernel_make_message(sd_bus *bus, struct kdbus_msg *k) {
if (k->src_id == KDBUS_SRC_ID_KERNEL)
bus_message_set_sender_driver(bus, m);
else {
- xsprintf(m->sender_buffer, ":1.%llu",
- (unsigned long long)k->src_id);
+ xsprintf(m->sender_buffer, ":1.%"PRIu64, k->src_id);
This produces:
src/libsystemd/sd-bus/bus-kernel.c: In function ‘bus_kernel_make_message’:
src/libsystemd/sd-bus/bus-kernel.c:851:44: warning: format ‘%lu’ expects argument of type ‘long
unsigned int’, but argument 4 has type ‘__u64 {aka long long unsigned int}’ [-Wformat=]
xsprintf(m->sender_buffer, ":1.%"PRIu64, k->src_id);
^
If we encounter the (unlikely) situation where the combined path to the
new root and a path to a mount to be moved together exceed maximum path length,
we shouldn't crash, but fail this path instead.
This reverts some changes introduced in d054f0a4d4.
xsprintf should be used in cases where we calculated the right buffer
size by hand (using DECIMAL_STRING_MAX and such), and never in cases where
we are printing externally specified strings of arbitrary length.
Unfortunately, github drops the original commiter when a PR is "squashed" (even
if it is only a single commit) and replaces it with some rubbish
github-specific user id. Thus, to make the contributors list somewhat useful,
update the .mailmap file and undo all the weirdness github applied there.
pid1: fix fd memleak when we hit FileDescriptorStoreMax limit
Since service_add_fd_store() already does the check, remove the redundant check
from service_add_fd_store_set().
Also, print a warning when repopulating FDStore after daemon-reexec and we hit
the limit. This is a user visible issue, so we should not discard fds silently.
(Note that service_deserialize_item is impacted by the return value from
service_add_fd_store(), but we rely on the general error message, so the caller
does not need to be modified, and does not show up in the diff.)
core: change mount_synthesize_root() return to int
Let's propagate the error here, instead of eating it up early.
In a later change we should probably also change mount_enumerate() to propagate
errors up, but that would mean we'd have to change the unit vtable, and thus
change all unit types, hence is quite an invasive change.
nspawn: if we set up a loopback device, try to mount it with "discard"
Let's make sure that our loopback files remain sparse, hence let's set
"discard" as mount option on file systems that support it if the backing device
is a loopback.
systemctl: tweak the "systemctl list-units" output a bit
Make the underlining between the header and the body and between the units of
different types span the whole width of the table.
Let's never make the table wider than necessary (which is relevant due the
above).
When space is limited and we can't show the full ID or description string
prefer showing the full ID over the full description. The ID is after all
something people might want to copy/paste, while the description is mostly just
helpful decoration.
sysctl: do not fail systemd-sysctl.service if /proc/sys is mounted read-only
Let's make missing write access to /proc/sys non-fatal to the sysctl service.
This is a follow-up to 411e869f497c7c7bd0688f1e3500f9043bc56e48 which altered
the condition for running the sysctl service to check for /proc/sys/net being
writable, accepting that /proc/sys might be read-only. In order to ensure the
boot-up stays clean in containers lower the log level for the EROFS errors
generated due to this.
core: rework the "no_gc" unit flag to become a more generic "perpetual" flag
So far "no_gc" was set on -.slice and init.scope, to units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more funcitonality off it. Specifically, refuse
enqueing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
man: document that too strict system call filters may affect the service manager
If execve() or socket() is filtered the service manager might get into trouble
executing the service binary, or handling any failures when this fails. Mention
this in the documentation.
The other option would be to implicitly whitelist all system calls that are
required for these codepaths. However, that appears less than desirable as this
would mean socket() and many related calls have to be whitelisted
unconditionally. As writing system call filters requires a certain level of
expertise anyway it sounds like the better option to simply document these
issues and suggest that the user disables system call filters in the service
temporarily in order to debug any such failures.
execute: apply seccomp filters after changing selinux/aa/smack contexts
Seccomp is generally an unprivileged operation, changing security contexts is
most likely associated with some form of policy. Moreover, while seccomp may
influence our own flow of code quite a bit (much more than the security context
change) make sure to apply the seccomp filters immediately before executing the
binary to invoke.
This also moves enforcement of NNP after the security context change, so that
NNP cannot affect it anymore. (However, the security policy now has to permit
the NNP change).
This change has a good chance of breaking current SELinux/AA/SMACK setups, because
the policy might not expect this change of behaviour. However, it's technically
the better choice I think and should hence be applied.
@resources contains various syscalls that alter resource limits and memory and
scheduling parameters of processes. As such they are good candidates to block
for most services.
@basic-io contains a number of basic syscalls for I/O, similar to the list
seccomp v1 permitted but slightly more complete. It should be useful for
building basic whitelisting for minimal sandboxes
The system call is already part in @default hence implicitly allowed anyway.
Also, if it is actually blocked then systemd couldn't execute the service in
question anymore, since the application of seccomp is immediately followed by
it.
Djalal Harouni [Tue, 25 Oct 2016 14:24:35 +0000 (16:24 +0200)]
core: lets apply working directory just after mount namespaces
This makes applying groups after applying the working directory, this
may allow some flexibility but at same it is not a big deal since we
don't execute or do anything between applying working directory and
droping groups.
Rewrite the function to be slightly simpler. In particular, if a specific
match is found (like ConditionVirtualization=yes), simply return an answer
immediately, instead of relying that "yes" will not be matched by any of
the virtualization names below.
detect-virt: add --private-users switch to check if a userns is active
Various things don't work when we're running in a user namespace, but it's
pretty hard to reliably detect if that is true.
A function is added which looks at /proc/self/uid_map and returns false
if the default "0 0 UINT32_MAX" is found, and true if it finds anything else.
This misses the case where an 1:1 mapping with the full range was used, but
I don't know how to distinguish this case.
'systemd-detect-virt --private-users' is very similar to
'systemd-detect-virt --chroot', but we check for a user namespace instead.
Fixes:
$ ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Feb 17 2016 /bin/sh -> dash
$ ./autogen.sh c
./autogen.sh: 22: ./autogen.sh: [[: not found
...
checking whether make supports nested variables... (cached) yes
checking build system type... Invalid configuration `c': machine `c' not
recognized
configure: error: /bin/bash build-aux/config.sub c failed
Dongsu Park [Tue, 25 Oct 2016 12:51:01 +0000 (14:51 +0200)]
test: skip exec tests when inaccessible dir is unavailable
In case of running test-execute on systems with systemd < v232, several
tests like privatedevices or protectkernelmodules fail because
/run/systemd/inaccessible/ doesn't exist. In these cases, we should skip
tests to avoid unnecessary errors.
See also https://github.com/systemd/systemd/pull/4243#issuecomment-253665566