sysctl: do not fail systemd-sysctl.service if /proc/sys is mounted read-only
Let's make missing write access to /proc/sys non-fatal to the sysctl service.
This is a follow-up to 411e869f497c7c7bd0688f1e3500f9043bc56e48 which altered
the condition for running the sysctl service to check for /proc/sys/net being
writable, accepting that /proc/sys might be read-only. In order to ensure the
boot-up stays clean in containers, lower the log level for the EROFS errors
generated due to this.
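A minimal sketch of the policy described above, written in Python rather than the C of the actual service. The names `level_for_write_error`, `apply_sysctls` and the `write_setting` callback are hypothetical, and treating every non-EROFS error as a failure is an assumption of this sketch, not something the commit message states.

```python
import errno
import logging

log = logging.getLogger("sysctl-sketch")


def level_for_write_error(err):
    """Pick a log level for a failed /proc/sys write (illustrative policy)."""
    # Assumption: a read-only /proc/sys (EROFS) is expected in containers,
    # so it is logged quietly; everything else stays a warning.
    return logging.DEBUG if err == errno.EROFS else logging.WARNING


def apply_sysctls(settings, write_setting):
    """Apply all settings; return True unless a non-EROFS error occurred.

    write_setting(key, value) is a hypothetical callback raising OSError.
    """
    ok = True
    for key, value in settings.items():
        try:
            write_setting(key, value)
        except OSError as e:
            log.log(level_for_write_error(e.errno),
                    "Couldn't write '%s' to '%s': %s", value, key, e.strerror)
            if e.errno != errno.EROFS:
                ok = False  # only errors other than EROFS count as failure
    return ok
```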
core: rework the "no_gc" unit flag to become a more generic "perpetual" flag
So far "no_gc" was set on -.slice and init.scope, two units that are always
running, cannot be stopped and never exist in an "inactive" state. Since these
units are the only users of this flag, let's remodel it and rename it
"perpetual" and let's derive more functionality off it. Specifically, refuse
enqueuing stop jobs for these units, and report that they are "unstoppable" in
the CanStop bus property.
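The two behaviours derived from the flag can be sketched as follows. This is an illustrative Python model, not the manager's C code: the `Unit` class, its `perpetual` field and `enqueue_stop_job` are hypothetical names, and the real CanStop property may depend on more than this one flag.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    name: str
    perpetual: bool = False  # hypothetical field modelling the renamed flag

    @property
    def can_stop(self):
        # Perpetual units report themselves as unstoppable.
        return not self.perpetual


def enqueue_stop_job(unit):
    """Return a job description, or refuse if the unit may never be stopped."""
    if unit.perpetual:
        raise PermissionError(f"Unit {unit.name} is perpetual and cannot be stopped")
    return ("stop", unit.name)
```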
Djalal Harouni [Tue, 25 Oct 2016 14:24:35 +0000 (16:24 +0200)]
core: lets apply working directory just after mount namespaces
This means groups are now applied after the working directory. This
may allow some flexibility, but at the same time it is not a big deal, since
we don't execute or do anything between applying the working directory and
dropping groups.
Rewrite the function to be slightly simpler. In particular, if a specific
match is found (like ConditionVirtualization=yes), simply return an answer
immediately, instead of relying on "yes" not being matched by any of
the virtualization names below.
detect-virt: add --private-users switch to check if a userns is active
Various things don't work when we're running in a user namespace, but it's
pretty hard to reliably detect if that is true.
A function is added which looks at /proc/self/uid_map and returns false
if the default "0 0 UINT32_MAX" is found, and true if it finds anything else.
This misses the case where a 1:1 mapping with the full range was used, but
I don't know how to distinguish this case.
'systemd-detect-virt --private-users' is very similar to
'systemd-detect-virt --chroot', but we check for a user namespace instead.
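The uid_map check described above can be sketched like this. It is a Python illustration of the stated rule, not the function added by the commit; the names `uid_map_indicates_userns` and `running_in_userns` are hypothetical. Like the original, it cannot tell a full-range 1:1 mapping apart from the default.

```python
UINT32_MAX = 2**32 - 1


def uid_map_indicates_userns(uid_map_text):
    """Return False for the default "0 0 UINT32_MAX" map, True for anything else."""
    lines = [ln.split() for ln in uid_map_text.splitlines() if ln.strip()]
    if len(lines) == 1 and len(lines[0]) == 3:
        try:
            inside, outside, count = map(int, lines[0])
        except ValueError:
            return True  # unparsable content is "anything else"
        if (inside, outside, count) == (0, 0, UINT32_MAX):
            return False
    return True


def running_in_userns(path="/proc/self/uid_map"):
    """Read the map of the calling process and apply the check."""
    with open(path) as f:
        return uid_map_indicates_userns(f.read())
```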
Fixes:
$ ls -l /bin/sh
lrwxrwxrwx 1 root root 4 Feb 17 2016 /bin/sh -> dash
$ ./autogen.sh c
./autogen.sh: 22: ./autogen.sh: [[: not found
...
checking whether make supports nested variables... (cached) yes
checking build system type... Invalid configuration `c': machine `c' not recognized
configure: error: /bin/bash build-aux/config.sub c failed
Dongsu Park [Tue, 25 Oct 2016 12:51:01 +0000 (14:51 +0200)]
test: skip exec tests when inaccessible dir is unavailable
In case of running test-execute on systems with systemd < v232, several
tests like privatedevices or protectkernelmodules fail because
/run/systemd/inaccessible/ doesn't exist. In these cases, we should skip
tests to avoid unnecessary errors.
See also https://github.com/systemd/systemd/pull/4243#issuecomment-253665566
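The skip condition amounts to a directory check before the affected tests run. A small Python sketch of the idea follows; test-execute itself is written in C, and the helper name `inaccessible_dir_missing` and the `ExecSandboxTests` class are made up for illustration.

```python
import os
import unittest

INACCESSIBLE_DIR = "/run/systemd/inaccessible/"


def inaccessible_dir_missing(path=INACCESSIBLE_DIR):
    """True if the directory the sandboxing tests rely on does not exist."""
    return not os.path.isdir(path)


class ExecSandboxTests(unittest.TestCase):
    @unittest.skipIf(inaccessible_dir_missing(), "inaccessible dir is unavailable")
    def test_privatedevices(self):
        # Placeholder body; a real test would run a unit with PrivateDevices=yes.
        pass
```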
Since this unit is synthesized anyway there's no point in actually shipping it
on disk. This also has the benefit that "cd /usr/lib/systemd/system ; ls *"
won't be confused by the leading dash of the file name anymore.
core: move initialization of -.slice and init.scope into the unit_load() callbacks
Previously, we'd synthesize the root slice unit and the init scope unit in the
enumerator callbacks for the unit type. This is problematic if either of them
is already referenced from a unit that is loaded as result of another unit
type's enumerator logic.
Let's clean this up and simply create the two objects from the enumerator
callbacks, if they are not around yet. Do the actual filling in of the settings
from the unit_load() callbacks, to match how other units are loaded.
"myhostname" should probably be dropped eventually, but when we do this we
should do it in full, and not only drop it from the suggested nsswitch.conf
for one of the modules, but also drop it in source and stop referring to it
altogether.
Note that nss-resolve doesn't replace nss-myhostname in full: the former only
works if D-Bus/resolved is available for resolving the local hostname, the
latter works in all cases even if D-Bus or resolved are not in operation, hence
there's some value in keeping the line as it is right now. Note that neither
dns nor myhostname are considered at all with the above configuration unless
the resolve module actually returns UNAVAIL. Thus, even though handling of
local hostname resolving is implemented twice this way it is only executed once
for each lookup.
nss-resolve: be a bit more careful with returning NSS_STATUS_NOTFOUND
Let's tighten the cases when our module returns NSS_STATUS_NOTFOUND. Let's do
so only if we actually managed to talk to resolved. In all other cases stick to
NSS_STATUS_UNAVAIL as before, as it clearly indicates that our module or the
system is borked, and the "dns" fallback should really take place.
In particular this fixes the 2nd-level fallback from our own dlopen() based
fallback handling. In this case we really should return UNAVAIL so that the
caller can apply its own fallback still.
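The tightened rule can be summarised as a small decision function. This is a Python sketch of the logic as described in the message, not the module's C code; `NssStatus` only models the three status names involved, and `lookup_status` with its two boolean inputs is a hypothetical simplification.

```python
from enum import Enum


class NssStatus(Enum):
    # Symbolic stand-ins for the NSS_STATUS_* codes named in the text.
    UNAVAIL = "NSS_STATUS_UNAVAIL"
    NOTFOUND = "NSS_STATUS_NOTFOUND"
    SUCCESS = "NSS_STATUS_SUCCESS"


def lookup_status(talked_to_resolved, found):
    """NOTFOUND only if resolved actually answered; otherwise UNAVAIL."""
    if not talked_to_resolved:
        # The module or the system is broken: let the next fallback run.
        return NssStatus.UNAVAIL
    return NssStatus.SUCCESS if found else NssStatus.NOTFOUND
```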
"oldumount()" is not a syscall, but simply a wrapper for it, the actual syscall
nr is called "umount" (and the nr of umount() is called umount2 internally).
"sysctl()" is not a syscall, but "_sysctl()" is. Fix this in the table.
Without these changes libseccomp cannot actually translate the tables in full.
This wasn't noticed before as the code was written defensively for this case.
seccomp: add new seccomp_init_conservative() helper
This adds a new seccomp_init_conservative() helper call that is mostly just a
wrapper around seccomp_init(), but turns off NNP and adds in all secondary
archs, for best compatibility with everything else.
Pretty much all of our code used the very same constructs for these three
steps, hence unifying this in one small function makes things a lot shorter.
This also changes incorrect usage of the "scmp_filter_ctx" type at various
places. libseccomp defines it as typedef to "void*", i.e. it is a pointer type
(pretty poor choice already!) that casts implicitly to and from all other
pointer types (even poorer choice: you defined a confusing type now, and don't
even gain any bit of type safety through it...). A lot of the code assumed the
type would refer to a structure, and hence added additional "*" here and there.
Remove that.
core: rework apply_protect_kernel_modules() to use seccomp_add_syscall_filter_set()
Let's simplify this call, by making use of the new infrastructure.
This is actually more in line with Djalal's original patch, but instead of
searching for the filter set in the array by its name we can now use the set
index and jump directly to it.
- rename the SystemCallFilterSet structure to SyscallFilterSet. So far the main
instance of it (the syscall_filter_sets[] array) used to abbreviate
"SystemCall" as "Syscall". Let's stick to one of the two syntaxes, and not
mix and match too wildly. Let's pick the shorter name in this case, as it is
sufficiently well established to not confuse hackers reading this.
- Export explicit indexes into the syscall_filter_sets[] array via an enum.
This way, code that wants to make use of a specific filter set, can index it
directly via the enum, instead of having to search for it. This makes
apply_private_devices() in particular a lot simpler.
- Provide two new helper calls in seccomp-util.c: syscall_filter_set_find() to
find a set by its name, seccomp_add_syscall_filter_set() to add a set to a
seccomp object.
- Update SystemCallFilter= parser to use extract_first_word(). Let's work on
deprecating FOREACH_WORD_QUOTED().
- Simplify apply_private_devices() using this functionality
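The difference between indexing a set through an enum and searching for it by name can be sketched as below. This is a Python model of the idea only: `SyscallFilterSetIndex` and the two table entries are illustrative, the syscall lists are abbreviated placeholders, and only the name `syscall_filter_set_find()` is taken from the text.

```python
from enum import IntEnum


class SyscallFilterSetIndex(IntEnum):
    # Hypothetical index names; the order must match the table below.
    CLOCK = 0
    RAW_IO = 1


# (set name, syscalls) pairs; contents are abbreviated for illustration.
SYSCALL_FILTER_SETS = [
    ("@clock", ("adjtimex", "settimeofday")),
    ("@raw-io", ("ioperm", "iopl")),
]


def syscall_filter_set_find(name):
    """Linear search by name, for callers that only have a string."""
    for entry in SYSCALL_FILTER_SETS:
        if entry[0] == name:
            return entry
    return None


# Code that knows which set it wants can skip the search entirely.
raw_io = SYSCALL_FILTER_SETS[SyscallFilterSetIndex.RAW_IO]
```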
- The field name in the timestamp file is changed from "TimestampNSec=" to
"TIMESTAMP_NSEC=". This is done simply to reflect the fact that we parse the
file with the env var file parser, and hence the contents should better
follow the usual capitalization of env vars, i.e. be all uppercase.
- Needless negation of the errno parameter of log_error_errno() and friends has
been removed.
- Instead of manually calculating the nsec remainder of the timestamp, use
timespec_store().
- We now check whether we were able to write the timestamp file in full with
fflush_and_check() the way we usually do it.
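A rough Python sketch of the file format and the remainder calculation mentioned in the list above. The helper names are hypothetical, and the assumption that the file holds one `TIMESTAMP_NSEC=<integer>` line (possibly next to other lines) is only inferred from the text.

```python
NSEC_PER_SEC = 1_000_000_000


def split_nsec(timestamp_nsec):
    """Split a nanosecond timestamp into (seconds, nanosecond remainder)."""
    return divmod(timestamp_nsec, NSEC_PER_SEC)


def format_timestamp_file(timestamp_nsec):
    """Render the timestamp in env-var style, with an uppercase field name."""
    return f"TIMESTAMP_NSEC={timestamp_nsec}\n"


def parse_timestamp_file(text):
    """Return the stored timestamp, or None if the field is absent."""
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "TIMESTAMP_NSEC":
            return int(value)
    return None
```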
Patrik Flykt [Mon, 24 Oct 2016 11:44:01 +0000 (14:44 +0300)]
networkd-ndisc: Don't add NDisc route for local address (#4467)
When systemd-networkd is run on the same IPv6 enabled interface where
radvd is announcing prefixes, a route is being set up pointing to the
interface address. As this will fail with an invalid argument error,
the link is marked as failed and messages like the
following will appear in the logs:
systemd-networkd[21459]: eth1: Could not set NDisc route or address: Invalid argument
systemd-networkd[21459]: eth1: Failed
Should the interface be required by systemd-networkd-wait-online,
network-online.target will wait until its timeout hits, thereby
significantly delaying system startup.
The fix is to check whether the gateway address obtained from NDisc
messages is equal to any of the interface addresses on the same link
and not set the NDisc route in that case.
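The check can be sketched as a comparison of the gateway against the link's own addresses. This is a Python illustration using the standard `ipaddress` module; `should_set_ndisc_route` is a hypothetical name and the addresses in the usage note are examples.

```python
import ipaddress


def should_set_ndisc_route(gateway, link_addresses):
    """Return False if the NDisc gateway is one of the link's own addresses."""
    gw = ipaddress.IPv6Address(gateway)
    # Compare parsed addresses so that different spellings still match.
    return all(ipaddress.IPv6Address(addr) != gw for addr in link_addresses)
```

For example, `should_set_ndisc_route("fe80::1", ["fe80::1", "2001:db8::5"])` is false, so no route would be requested and the link would not be marked as failed.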
Djalal Harouni [Mon, 24 Oct 2016 11:13:06 +0000 (13:13 +0200)]
core: do not assert when sysconf(_SC_NGROUPS_MAX) fails (#4466)
Remove the assert and check the return code of sysconf(_SC_NGROUPS_MAX).
_SC_NGROUPS_MAX maps to NGROUPS_MAX, which is defined in <limits.h> as
65536 these days. The value is exposed through the read-only sysctl
/proc/sys/kernel/ngroups_max, and the kernel assumes that it is always
positive, otherwise things may break. Follow this and support only
positive values; in all other cases return either -errno or -EOPNOTSUPP.
Now if there are systems that want to rewrite NGROUPS_MAX, they
should not pass SupplementaryGroups= in units at all, not even with an
empty value; in this case nothing fails and we just ignore supplementary
groups. However, if SupplementaryGroups= is passed, even if it is empty,
we have to assume that groups will be manipulated by us or by the kernel,
and since the kernel always assumes that NGROUPS_MAX is positive, we
follow that and support only positive values.
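The return-value handling described above, sketched in Python with `os.sysconf` standing in for the C call. The function name `get_ngroups_max` and the injectable `sysconf` parameter are hypothetical; the mapping of failures to -errno and of non-positive values to -EOPNOTSUPP follows the message.

```python
import errno
import os


def get_ngroups_max(sysconf=os.sysconf):
    """Return NGROUPS_MAX, or a negative errno-style value on failure."""
    try:
        value = sysconf("SC_NGROUPS_MAX")
    except OSError as e:
        return -e.errno  # the call itself failed
    if value <= 0:
        return -errno.EOPNOTSUPP  # only positive limits are supported
    return value
```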
Jan Synacek [Thu, 20 Oct 2016 13:20:11 +0000 (15:20 +0200)]
shared, systemctl: teach is-enabled to show installation targets
It may be desired by users to know what targets a particular service is
installed into. Improve user friendliness by teaching the is-enabled
command to show such information when used with --full.
This patch makes use of the newly added UnitFileFlags and adds
UNIT_FILE_DRY_RUN flag into it. Since the API had already been modified,
it's now easy to add the new dry-run feature for other commands as
well. As a next step, --dry-run could be added to systemctl, which in
turn might pave the way for a long requested dry-run feature when
running systemctl start.
> vfs: Don't create inodes with a uid or gid unknown to the vfs
It is expected that filesystems can not represent uids and gids from
outside of their user namespace. Keep things simple by not even
trying to create filesystem nodes with non-sense uids and gids.
So, we actually should `reset_uid_gid` early to prevent https://github.com/systemd/systemd/pull/4223#issuecomment-252522955
Spawning container fedora-rawhide on /var/lib/machines/fedora-rawhide.
Press ^] three times within 1s to kill container.
Child died too early.
Selected user namespace base 1073283072 and range 65536.
Failed to mount to /sys/fs/cgroup/systemd: No such file or directory
* `mount_all (outer_child)` creates `container_dir/sys/fs/selinux`
* `mount_all (outer_child)` doesn't patch `container_dir/sys/fs` and so on.
* `mount_sysfs (inner_child)` tries to create `/sys/fs/cgroup`
* This fails
basic: fallback to the fstat if we don't have access to the /proc/self/fdinfo
https://github.com/systemd/systemd/pull/4372#discussion_r83354107:
I get `open("/proc/self/fdinfo/13", O_RDONLY|O_CLOEXEC) = -1 EACCES (Permission denied)`
core: do not set no_new_privileges flag in config_parse_syscall_filter
If SystemCallFilter= was set, and subsequently cleared, the no_new_privileges flag
was not reset properly. We don't need to set this flag here, it will be
set automatically in unit_patch_contexts() if syscall_filter is set.
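Why deriving the flag later avoids the stale state can be shown with a small model. This is a Python sketch, not the C parser: `ExecContext`, `parse_syscall_filter` and `patch_context` are simplified stand-ins, and the real unit_patch_contexts() may apply further conditions that are not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class ExecContext:
    syscall_filter: set = field(default_factory=set)
    no_new_privileges: bool = False


def parse_syscall_filter(ctx, value):
    """Parse one assignment; an empty value resets the filter."""
    if not value.strip():
        ctx.syscall_filter.clear()
        return
    ctx.syscall_filter.update(value.split())
    # Note: no_new_privileges is deliberately not touched here.


def patch_context(ctx):
    """Derive the flag from the final filter state, after all parsing is done."""
    if ctx.syscall_filter:
        ctx.no_new_privileges = True
```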
tree-wide: make parse_proc_cmdline() strip "rd." prefix automatically
This stripping is controlled by a new boolean parameter. When the parameter
is true, it means that the caller does not care about the distinction between
initrd and real root, and wants to act on both rd-dot-prefixed and unprefixed
parameters in the initramfs, and only on the unprefixed parameters in real
root. If the parameter is false, behaviour is the same as before.
Changes by caller:
log.c (systemd.log_*): changed to accept rd-dot-prefix params
pid1: no change, custom logic
cryptsetup-generator: no change, still accepts rd-dot-prefix params
debug-generator: no change, does not accept rd-dot-prefix params
fsck: changed to accept rd-dot-prefix params
fstab-generator: no change, custom logic
gpt-auto-generator: no change, custom logic
hibernate-resume-generator: no change, does not accept rd-dot-prefix params
journald: changed to accept rd-dot-prefix params
modules-load: no change, still accepts rd-dot-prefix params
quote-check: no change, does not accept rd-dot-prefix params
udevd: no change, still accepts rd-dot-prefix params
I added support for "rd." params in the three cases where I think it's
useful: logging, fsck options, journald forwarding options.
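The effect of the new parameter can be sketched as follows. This Python function is a simplified model, not the C implementation: it ignores quoting, takes `in_initrd` as an argument instead of detecting it, and leaves words untouched when stripping is disabled, which is an assumption about the previous behaviour.

```python
def parse_proc_cmdline(cmdline, in_initrd, strip_rd_prefix):
    """Return (key, value) pairs; value is None for a switch without "="."""
    items = []
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if strip_rd_prefix and key.startswith("rd."):
            if not in_initrd:
                # In the real root only unprefixed parameters are acted on.
                continue
            key = key[len("rd."):]
        items.append((key, value if sep else None))
    return items
```

With stripping enabled, a caller that handles "systemd.log_level" sees "rd.systemd.log_level=debug" under that key in the initramfs and does not see it at all in the real root.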
- do not crash if an option without value is specified on the kernel command
line, e.g. "udev.log-priority" :P
- simplify the code a bit
- warn about unknown "udev.*" options — this should make it easier to spot
typos and reduce user confusion
journald: convert journald to use parse_proc_cmdline
This makes journald use the common option parsing functionality.
One behavioural change is implemented:
"systemd.journald.forward_to_syslog" is now equivalent to
"systemd.journald.forward_to_syslog=1".
I think it's nicer to use this way.
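The behavioural change amounts to treating a missing value as true. A Python sketch of that rule follows; `parse_boolean` here is a local approximation of the usual boolean spellings, and `parse_forward_to_syslog` is a hypothetical helper, not journald's code.

```python
def parse_boolean(value):
    """Approximate boolean parser for the sketch; raises on unknown input."""
    v = value.lower()
    if v in ("1", "yes", "y", "true", "t", "on"):
        return True
    if v in ("0", "no", "n", "false", "f", "off"):
        return False
    raise ValueError(f"not a boolean: {value!r}")


def parse_forward_to_syslog(key, value):
    """Return the setting, or None if the key is not the one we handle."""
    if key != "systemd.journald.forward_to_syslog":
        return None
    # A bare switch without "=value" now counts as enabled.
    return True if value is None else parse_boolean(value)
```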