nspawn: enable major=0/minor=0 devices inside the container (#3773)
https://github.com/systemd/systemd/pull/3685 introduced
/run/systemd/inaccessible/{chr,blk} to map inacessible devices,
this patch allows systemd running inside a nspawn container to create
/run/systemd/inaccessible/{chr,blk}.
Alexander Kurtz [Thu, 21 Jul 2016 00:29:54 +0000 (02:29 +0200)]
bootctl: Always use upper case for "/EFI/BOOT" and "/EFI/BOOT/BOOT*.EFI".
If the ESP is not mounted with "iocharset=ascii", but with "iocharset=utf8"
(which is for example the default in Debian), the file system becomes case
sensitive. This means that a file created as "FooBarBaz" cannot be accessed as
"foobarbaz" since those are then considered different files.
Moreover, a file created as "FooBar" can then also not be accessed as "foobar",
and it also prevents such a file from being created, as both would use the same
8.3 short name "FOOBAR".
Even though the UEFI specification [0] does give the canonical spelling for
the files mentioned above, not all implementations completely conform to that,
so it's possible that those files would already exist, but with a different
spelling, causing subtle bugs when scanning or modifying the ESP.
While the proper fix would of course be that everybody conformed to the
standard, we can work around this problem by just referencing the files by
their 8.3 short names, i.e. using upper case.
Fixes: #3740
[0] <http://www.uefi.org/specifications>, version 2.6, section 3.5.1.1
nspawn: when netns is on, mount /proc/sys/net writable
Normally we make all of /proc/sys read-only in a container, but if we do have
netns enabled we can make /proc/sys/net writable, as things are virtualized
then.
units: fix TasksMax=16384 for systemd-nspawn@.service
When a container scope is allocated via machined it gets 16K set already since cf7d1a30e44bf380027a2e73f9bf13f423a33cc1. Make sure when a container is run as
system service it gets the same values.
core: normalize header inclusion in execute.h a bit
We don't actually need any functionality from cgroup.h in execute.h, hence
don't include that. However, we do need the Unit structure from unit.h, hence
include that, and move it as late as possible, since it needs the definitions
from execute.h.
All other functions in execute.c that need the unit id take a Unit* parameter
as first argument. Let's change connect_logger_as() to follow a similar logic.
core: when forcibly killing/aborting left-over unit processes log about it
Let's lot at LOG_NOTICE about any processes that we are going to
SIGKILL/SIGABRT because clean termination of them didn't work.
This turns the various boolean flag parameters to cg_kill(), cg_migrate() and
related calls into a single binary flags parameter, simply because the function
now gained even more parameters and the parameter listed shouldn't get too
long.
Logging for killing processes is done either when the kill signal is SIGABRT or
SIGKILL, or on explicit request if KILL_TERMINATE_AND_LOG instead of LOG_TERMINATE
is passed. This isn't used yet in this patch, but is made use of in a later
patch.
rules: make sure always set at least one property on rfkill devices
The rfkill service waits for rfkill device initialization as reported by
udev_device_is_initialized(), and if that is never reported it might dead-lock.
However, udev never reports completed initialization for devices that have no
properties or tags set. For some rfkill devices this might be the case, in
particular those which are connected to exotic busses, where path_id returns
nothing.
This patch simply sets the SYSTEM_RFKILL property on all rfkill devices, to
ensure that udev_device_is_initialized() always reports something useful and we
don't dead-lock.
man: revise entry about specifying a file path (#3739)
* Specifying a device node has an effect much larger than a simple shortcut
for a field/value match, so the original sentence is no longer a good way
to start the paragraph.
* Specifying a device node causes matches to be generated for all ancestor
devices of the device specified, not just its parents.
* Indicates that the path must be absolute, but that it may be a link.
* Eliminates a few typos.
doc,core: Read{Write,Only}Paths= and InaccessiblePaths=
This patch renames Read{Write,Only}Directories= and InaccessibleDirectories=
to Read{Write,Only}Paths= and InaccessiblePaths=, previous names are kept
as aliases but they are not advertised in the documentation.
namespace: unify limit behavior on non-directory paths
Despite the name, `Read{Write,Only}Directories=` already allows for
regular file paths to be masked. This commit adds the same behavior
to `InaccessibleDirectories=` and makes it explicit in the doc.
This patch introduces `/run/systemd/inaccessible/{reg,dir,chr,blk,fifo,sock}`
{dile,device}nodes and mounts on the appropriate one the paths specified
in `InacessibleDirectories=`.
Based on Luca's patch from https://github.com/systemd/systemd/pull/3327
journalctl: make sure that journalctl's --all switch also has an effect on json output
With this change, binary record data is formatted as string if --all is
specified when using json output. This is inline with the effect of --all on
the other available output modes.
sd-journal: when formatting log messages, implicitly strip trailing whitespace
When converting log messages from human readable text into binary records to
send off to journald in sd_journal_print(), strip trailing whitespace in the
log message. This way, handling of logs made via syslog(), stdout/stderr and
sd_journal_print() are treated the same way: trailing (but not leading)
whitespace is automatically removed, in particular \n and \r. Note that in case
of syslog() and stdout/stderr based logging the stripping takes place
server-side though, while for the native protocol based transport this takes
place client-side. This is because in the former cases conversion from
free-form human-readable strings into structured, binary log records takes
place on the server-side while for journal-native logging it happens on the
client side, and after conversion into binary records we probably shouldn't
alter the data anymore.
Jan Janssen [Mon, 18 Jul 2016 19:19:32 +0000 (21:19 +0200)]
sd-boot: Fix waiting for keyboard input (#3735)
WaitForKeyEx may never return on some UEFI systems depending
on firmware, hardware configuration and the phase of the moon.
Use ConIn->WaitForKey unconditionally instead.
nspawn: decrease mkdir error logging in /sys to debug priority (#3748)
Such mkdir errors happen for example when trying to mkdir /sys/fs/selinux.
/sys is documented to be readonly in the container, so mkdir errors below /sys
can be expected.
They shouldn't be logged as warnings since they lead users to think that
there is something wrong.
basic/strv: add an extra NUL after strings in strv_make_nulstr
strv_make_nulstr was creating a nulstr which was not a valid nulstr,
because it was missing the terminating NUL. This didn't cause any issues,
because strv_parse_nulstr correctly parsed the result, using the
separately specified length.
But it's confusing to have something called nulstr which really isn't.
It is likely that somebody will try to use strv_make_nulstr() in
some other place, incorrectly.
This patch changes strv_parse_nulstr() to produce a valid nulstr, and
changes the output length parameter to be the minimum number of bytes
which can be later on parsed by strv_parse_nulstr(). This allows the
only user in ask-password-api to be slightly simplified.
manager: don't skip sigchld handler for main and control pid for services (#3738)
During stop when service has one "regular" pid one main pid and one
control pid and the sighld for the regular one is processed first the
unit_tidy_watch_pids will skip the main and control pid and does not
remove them from u->pids(). But then we skip the sigchld event because we
already did one in the iteration and there are two pids in u->pids.
v2: Use general unit_main_pid() and unit_control_pid() instead of
reaching directly to service structure.
Michael Biebl [Sat, 16 Jul 2016 16:51:45 +0000 (18:51 +0200)]
man: mention system-shutdown hook directory in synopsis (#3741)
The distinction between systemd-shutdown the binary vs system-shutdown
the hook directory (without the 'd') is not immediately obvious and can
be quite confusing if you are looking for a directory which doesn't exist.
Therefore explicitly mention the hook directory in the synopsis with a
trailing slash to make it clearer which is which.
This adds a build script and a settings file for "mkosi", a tool for putting
together full, bootable disk images for container managers of EFI systems and
VMs.
With these files it's enough to type "mkosi" in the project directory to
generate a bootable Fedora 24 OS image with a version of systemd compiled fresh
from the working tree.
Sometimes, the persistent storage rules should be skipped for a subset
of devices. For example, the Qubes operating system prevents dom0 from
parsing untrusted block device content (such as filesystem metadata) by
shipping a custom 60-persistent-storage.rules, patched to bail out early
if the device name matches a hardcoded pattern.
As a less brittle and more flexible alternative, this commit adds a line
to the two relevant .rules files which makes them test the value of the
UDEV_DISABLE_PERSISTENT_STORAGE_RULES_FLAG device property, modeled
after the various DM_UDEV_DISABLE_*_RULES_FLAG properties.
Stef Walter [Fri, 15 Jul 2016 10:24:34 +0000 (12:24 +0200)]
udev: Line buffer 'udev monitor' output (#3733)
Callers of the 'udev monitor' tool expect to see output when
an event occurs. The stdio buffering defeats that. This patch
switches it to line buffering.
macros: provide %_systemdgeneratordir and %_systemdusergeneratordir (#3672)
... as requested in
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/message/DJ7HDNRM5JGBSA4HL3UWW5ZGLQDJ6Y7M/.
Adding the macro makes it marginally easier to create generators
for outside projects.
I opted for "generatordir" and "usergeneratordir" to match
%unitdir and %userunitdir. OTOH, "_systemd" prefix makes it obvious
that this is related to systemd. "%_generatordir" would be to generic
of a name.
Daniel Mack [Fri, 15 Jul 2016 02:56:11 +0000 (04:56 +0200)]
network-ndisc: avoid VLAs (#3725)
Do not allocate objects of dynamic and potentially large size on the stack
to avoid both clang compilation errors and unpredictable runtime behavior
on exotic platforms. Use the heap for that instead.
While at it, refactor the code a bit. Access 's->domain' via
NDISC_DNSSL_DOMAIN(), and refrain from allocating 'x' independently, but
rather reuse 's' if we're dealing with a new entry to the set.
There's really no reason to use 10s here, let's instead default to 90s like we
do for everything else.
The SIGKILL during the final killing spree is in most regards the fourth level
of a safety net, after all: any normal service should have already been stopped
during the normal service shutdown logic, first via SIGTERM and then SIGKILL,
and then also via SIGTERM during the finall killing spree before we send
SIGKILL. And as a fourth level safety net it should only be required in
exceptional cases, which means it's safe to rais the default timeout, as normal
shutdowns should never be delayed by it.
Note that journald excludes itself from the normal service shutdown, and relies
on the final killing spree to terminate it (this is because it wants to cover
the normal shutdown phase's complete logging). If the system's IO is
excessively slow, then the 10s might not be enough for journald to sync
everything to disk and logs might get lost during shutdown.
Luca Bruno [Tue, 12 Jul 2016 09:55:26 +0000 (11:55 +0200)]
seccomp: only abort on syscall name resolution failures (#3701)
seccomp_syscall_resolve_name() can return a mix of positive and negative
(pseudo-) syscall numbers, while errors are signaled via __NR_SCMP_ERROR.
This commit lets the syscall filter parser only abort on real parsing
failures, letting libseccomp handle pseudo-syscall number on its own
and allowing proper multiplexed syscalls filtering.
rules: block: add support for pmem devices (#3683)
Persistent memory devices can be exposed as block devices as /dev/pmemN
and /dev/pmemNs. pmemN is the raw device and is byte-addressable from
within the kernel and when mmapped by applications from a DAX-mounted
file system. pmemNs has the block translation table (BTT) layered on top,
offering atomic sector/block access. Both pmemN and pmemNs are expected
to contain file systems.
blkid(8) and lsblk(8) seem to correctly report on pmemN and pmemNs.
systemd v219 will populate /dev/disk/by-uuid/ when, for example, mkfs is
used on pmem, but systemd v228 does not.
David Michael [Fri, 8 Jul 2016 03:43:01 +0000 (20:43 -0700)]
core: queue loading transient units after setting their properties (#3676)
The unit load queue can be processed in the middle of setting the
unit's properties, so its load_state would no longer be UNIT_STUB
for the check in bus_unit_set_properties(), which would cause it to
incorrectly return an error.
Daniel Mack [Fri, 8 Jul 2016 02:29:35 +0000 (04:29 +0200)]
cgroup: fix memory cgroup limit regression on kernel 3.10 (#3673)
Commit da4d897e ("core: add cgroup memory controller support on the unified
hierarchy (#3315)") changed the code in src/core/cgroup.c to always write
the real numeric value from the cgroup parameters to the
"memory.limit_in_bytes" attribute file.
For parameters set to CGROUP_LIMIT_MAX, this results in the string
"18446744073709551615" being written into that file, which is UINT64_MAX.
Before that commit, CGROUP_LIMIT_MAX was special-cased to the string "-1".
This causes a regression on CentOS 7, which is based on kernel 3.10, as the
value is interpreted as *signed* 64 bit, and clamped to 0:
Hence, all units that are subject to the limits enforced by the memory
controller will crash immediately, even though they have no actual limit
set. This happens to for the user.slice, for instance:
By cleaning up before setting up PAM we maintain control of overriding
behavior in setting variables. Otherwise, pam_putenv is in control.
This also makes sure we use a cleaned up environment in replacing
variables in argv.
Daniel Mack [Thu, 7 Jul 2016 04:30:34 +0000 (06:30 +0200)]
basic: log: Increase static buffer for source file location (#3674)
Commit d054f0a4 ("tree-wide: use xsprintf() where applicable") used a
semantic patch approach to change a number of locations from
snprintf(buf, sizeof(buf), FMT, ...)
to
xsprintf(buf, FMT, ...)
The problem is that xsprintf() wraps the snprintf() in an
assert_message_se(), so if snprintf() reports an overflow of the
destination buffer, the binary will now terminate.
This hit a user running a version of systemd that was built from a
deeply nested system path.
Fix this by
a) Switching back to snprintf() for this particular case. We should really
rather truncate the location string than crash in such situations.
b) Increasing the size of that static string buffer, to make the event more
unlikely.