core:namespace: put paths protected by ProtectKernelTunables= in
Instead of having all these paths everywhere, put the ones that are
protected by ProtectKernelTunables= into their own table. This way it
is easy to add paths and track which ones are protected.
execute: filter low-level I/O syscalls if PrivateDevices= is set
If device access is restricted via PrivateDevices=, let's also block the
various low-level I/O syscalls at the same time, so that we know that the
minimal set of devices in our virtualized /dev are really everything the unit
can access.
units: further lock down our long-running services
Let's make this an excercise in dogfooding: let's turn on more security
features for all our long-running services.
Specifically:
- Turn on RestrictRealtime=yes for all of them
- Turn on ProtectKernelTunables=yes and ProtectControlGroups=yes for most of
them
- Turn on RestrictAddressFamilies= for all of them, but different sets of
address families for each
Also, always order settings in the unit files, that the various sandboxing
features are close together.
Add a couple of missing, older settings for a numbre of unit files.
Note that this change turns off AF_INET/AF_INET6 from udevd, thus effectively
turning of networking from udev rule commands. Since this might break stuff
(that is already broken I'd argue) this is documented in NEWS.
Let's merge a couple of columns, to make the table a bit shorter. This
effectively just drops whitespace, not contents, but makes the currently
humungous table much much more compact.
man: rework documentation for ReadOnlyPaths= and related settings
This reworks the documentation for ReadOnlyPaths=, ReadWritePaths=,
InaccessiblePaths=. It no longer claims that we'd follow symlinks relative to
the host file system. (Which wasn't true actually, as we didn't follow symlinks
at all in the most recent releases, and we know do follow them, but relative to
RootDirectory=).
This also replaces all references to the fact that all fs namespacing options
can be undone with enough privileges and disable propagation by a single one in
the documentation of ReadOnlyPaths= and friends, and then directs the read to
this in all other places.
Moreover a hint is added to the documentation of SystemCallFilter=, suggesting
usage of ~@mount in case any of the fs namespacing related options are used.
man: in user-facing documentaiton don't reference C function names
Let's drop the reference to the cap_from_name() function in the documentation
for the capabilities setting, as it is hardly helpful. Our readers are not
necessarily C hackers knowing the semantics of cap_from_name(). Moreover, the
strings we accept are just the plain capability names as listed in
capabilities(7) hence there's really no point in confusing the user with
anything else.
namespace: chase symlinks for mounts to set up in userspace
This adds logic to chase symlinks for all mount points that shall be created in
a namespace environment in userspace, instead of leaving this to the kernel.
This has the advantage that we can correctly handle absolute symlinks that
shall be taken relative to a specific root directory. Moreover, we can properly
handle mounts created on symlinked files or directories as we can merge their
mounts as necessary.
(This also drops the "done" flag in the namespace logic, which was never
actually working, but was supposed to permit a partial rollback of the
namespace logic, which however is only mildly useful as it wasn't clear in
which case it would or would not be able to roll back.)
execute: drop group priviliges only after setting up namespace
If PrivateDevices=yes is set, the namespace code creates device nodes in /dev
that should be owned by the host's root, hence let's make sure we set up the
namespace before dropping group privileges.
core: imply ProtectHome=read-only and ProtectSystem=strict if DynamicUser=1
Let's make sure that services that use DynamicUser=1 cannot leave files in the
file system should the system accidentally have a world-writable directory
somewhere.
This effectively ensures that directories need to be whitelisted rather than
blacklisted for access when DynamicUser=1 is set.
Let's tighten our sandbox a bit more: with this change ProtectSystem= gains a
new setting "strict". If set, the entire directory tree of the system is
mounted read-only, but the API file systems /proc, /dev, /sys are excluded
(they may be managed with PrivateDevices= and ProtectKernelTunables=). Also,
/home and /root are excluded as those are left for ProtectHome= to manage.
In this mode, all "real" file systems (i.e. non-API file systems) are mounted
read-only, and specific directories may only be excluded via
ReadWriteDirectories=, thus implementing an effective whitelist instead of
blacklist of writable directories.
While we are at, also add /efi to the list of paths always affected by
ProtectSystem=. This is a follow-up for b52a109ad38cd37b660ccd5394ff5c171a5e5355 which added /efi as alternative for
/boot. Our namespacing logic should respect that too.
Previously, if ReadWritePaths= was nested inside a ReadOnlyPaths=
specification, then we'd first recursively apply the ReadOnlyPaths= paths, and
make everything below read-only, only in order to then flip the read-only bit
again for the subdirs listed in ReadWritePaths= below it.
This is not only ugly (as for the dirs in question we first turn on the RO bit,
only to turn it off again immediately after), but also problematic in
containers, where a container manager might have marked a set of dirs read-only
and this code will undo this is ReadWritePaths= is set for any.
With this patch behaviour in this regard is altered: ReadOnlyPaths= will not be
applied to the children listed in ReadWritePaths= in the first place, so that
we do not need to turn off the RO bit for those after all.
This means that ReadWritePaths=/ReadOnlyPaths= may only be used to turn on the
RO bit, but never to turn it off again. Or to say this differently: if some
dirs are marked read-only via some external tool, then ReadWritePaths= will not
undo it.
This is not only the safer option, but also more in-line with what the man page
currently claims:
"Entries (files or directories) listed in ReadWritePaths= are
accessible from within the namespace with the same access rights as
from outside."
To implement this change bind_remount_recursive() gained a new "blacklist"
string list parameter, which when passed may contain subdirs that shall be
excluded from the read-only mounting.
A number of functions are updated to add more debug logging to make this more
digestable.
execute: move suppression of HOME=/ and SHELL=/bin/nologin into user-util.c
This adds a new call get_user_creds_clean(), which is just like
get_user_creds() but returns NULL in the home/shell parameters if they contain
no useful information. This code previously lived in execute.c, but by
generalizing this we can reuse it in run.c.
sysctl: configure kernel parameters in the order they occur in each sysctl configuration files (#4205)
Currently, systemd-sysctl command configures kernel parameters in each sysctl
configuration files in random order due to characteristics of iterator of
Hashmap.
However, kernel parameters need to be configured in the order they occur in
each sysctl configuration files.
- For example, consider fs.suid_coredump and kernel.core_pattern. If
fs.suid_coredump=2 is configured before kernel.core_pattern= whose default
value is "core", then kernel outputs the following message:
Unsafe core_pattern used with suid_dumpable=2. Pipe handler or fully qualified core dump path required.
Note that the security issue mentioned in this message has already been fixed
on recent kernels, so this is just a warning message on such kernels. But
it's still confusing to users that this message is output on some boot and
not output on another boot.
- I don't know but there could be other kernel parameters that are significant
in the order they are configured.
- The legacy sysctl command configures kernel parameters in the order they
occur in each sysctl configuration files. Although I didn't find any official
specification explaining this behavior of sysctl command, I don't think there
is any meaningful reason to change this behavior, in particular, to the
random one.
This commit does the change by simply using OrderedHashmap instead of Hashmap.
Luca Bruno [Sat, 24 Sep 2016 12:30:42 +0000 (12:30 +0000)]
nspawn: decouple --boot from CLONE_NEWIPC (#4180)
This commit is a minor tweak after the split of `--share-system`, decoupling the `--boot`
option from IPC namespacing.
Historically there has been a single `--share-system` option for sharing IPC/PID/UTS with the
host, which was incompatible with boot/pid1 mode. After the split, it is now possible to express
the requirements with better granularity.
For reference, this is a followup to #4023 which contains references to previous discussions.
I realized too late that CLONE_NEWIPC is not strictly needed for boot mode.
journal: fix HMAC calculation when appending a data object
Since commit 5996c7c295e073ce21d41305169132c8aa993ad0 (v190 !), the
calculation of the HMAC is broken because the hash for a data object
including a field is done in the wrong order: the field object is
hashed before the data object is.
However during verification, the hash is done in the opposite order as
objects are scanned sequentially.
NSS modules (libnss_*.so.*) need to be installed into
${rootlibdir} (typically /lib) in order to be used. Previously, the
modules were installed into ${libdir}, thus usually ending up in
/usr/lib, even on systems where split usr is enabled, or ${libdir} is
passed explicitly.
If Kind is not specied, the message about "Invalid Kind" was misleading.
If Kind was specified in an invalid way, we get a message in the parsing
phase anyway. Reword the message to cover both cases better.
man: mention that netdev,network files support dropins
Also update the description of drop-ins in systemd.unit(5) to say that .d
directories, not .conf files, are in /etc/system/system, /run/systemd/system,
etc.
Updated formatting for printing the key for FSS (#4165)
The key used to be jammed next to the local file path. Based on the format string on line 1675, I determined that the order of arguments was written incorrectly, and updated the function based on that assumption.
Before:
```
Please write down the following secret verification key. It should be stored
at a safe location and should not be saved locally on disk.
Tomáš Janoušek [Thu, 15 Sep 2016 23:26:31 +0000 (01:26 +0200)]
logind: fix /run/user/$UID creation in apparmor-confined containers (#4154)
When a docker container is confined with AppArmor [1] and happens to run
on top of a kernel that supports mount mediation [2], e.g. any Ubuntu
kernel, mount(2) returns EACCES instead of EPERM. This then leads to:
systemd-logind[33]: Failed to mount per-user tmpfs directory /run/user/1000: Permission denied
login[42]: pam_systemd(login:session): Failed to create session: Access denied
and user sessions don't start.
This also applies to selinux that too returns EACCES on mount denial.
Ivan Shapovalov [Tue, 13 Sep 2016 00:04:35 +0000 (03:04 +0300)]
update-done, condition: write the timestamp to the file as well and use it to prevent false-positives
This fixes https://bugs.freedesktop.org/show_bug.cgi?id=90192 and #4130
for real. Also, remove timestamp check in update-done.c altogether since
the whole operation is idempotent.
We were already unconditionally using the unicode character when the
input string was not pure ASCII, leading to different behaviour in
depending on the input string.
journal-remote: implement %m support in mhd_respondf
errno value is not protected (it is undefined after this function returns).
Various mhd_* functions are not documented to protect errno, so this could not
guaranteed anyway.
According to its manual page, flags given to mkostemp(3) shouldn't include
O_RDWR, O_CREAT or O_EXCL flags as these are always included. Beyond
those, the only flag that all callers (except a few tests where it
probably doesn't matter) use is O_CLOEXEC, so set that unconditionally.
This parser will also be used in libinput, which uses the MIT license, so
relicense this file to the more permissive license to make bidirectional code
flow easier. parse_hwdb.py is only useful during building of the project, and
is not part of the installation, so effectively both licenses are very similar.
In particular, the licensing of binary packages produced by systemd is not
influenced in any way, because the MIT licensed part is not installed.
Like many other recent thinkpads the factory default pointingstick
sensitivity on these devices is quite low, making the pointingstick
very slow in moving the cursor.
This extends the existing hwdb rules for tweaking the sensitivity to
also apply to the X1 Tablet models.
Signed-off-by: Dennis Wassenberg <dennis.wassenberg@secunet.com>
When root was empty or equal to "/", chroot_symlinks_same was called with
root==NULL, and strjoina returned "", so the code thought both paths are equal
even if they were not. Fix that by always providing a non-null first argument
to strjoina.
Michael Olbrich [Fri, 9 Sep 2016 15:05:06 +0000 (17:05 +0200)]
unit: sent change signal before removing the unit if necessary (#4106)
If the unit is in the dbus queue when it is removed then the last change
signal is never sent. Fix this by checking the dbus queue and explicitly
send the change signal before sending the remove signal.