Michael Vogt [Wed, 24 Jan 2018 10:18:46 +0000 (11:18 +0100)]
test: add TEST-21-SYSUSERS test
This test tests the systemd-sysuser binary via the --root=$TESTDIR
option and ensures that for the given inputs the expected passwd
and group files will be generated.
Michael Vogt [Wed, 24 Jan 2018 08:26:51 +0000 (09:26 +0100)]
sysuser: use OrderedHashmap
This means we have more predicable behavior for "u foo uid:gid" lines
and also makes the generated files appear in the same order as the
inputs. So e.g.
```
u root 0 - /root
u daemon 1 - /usr/sbin
u games 5:60 - /usr/games
```
will generate
```
root:x:0:0::/root:/bin/sh
daemon:x:1:1::/usr/sbin:/sbin/nologin
games:x:5:60::/usr/games:/sbin/nologin
```
Michael Vogt [Tue, 23 Jan 2018 18:53:08 +0000 (19:53 +0100)]
sysusers: allow uid:gid in sysusers.conf files
This PR allows to write sysuser.conf lines like:
```
u games 5:60 -
```
This will create an a "games" user with uid 5 and games group with
gid 60. This is arguable ugly, however it is required to represent
certain configurations like the default passwd file on Debian and
Ubuntu.
When the ":" syntax is used and there is a group with the given
gid already then no new group is created. This allows writing the
following:
```
g unrelated 60
u games 5:60 -
```
which will create a "games" user with the uid 5 and the primary
gid 60. No group games is created here (might be useful for [1]).
core: rework how we count the n_on_console counter
Let's add a per-unit boolean that tells us whether our unit is currently
counted or not. This way it's unlikely we get out of sync again and
things are generally more robust.
This also allows us to remove the counting logic specific to service
units (which was in fact mostly a copy from the generic implementation),
in favour of fully generic code.
This call determines whether a specific unit currently needs access to
the console. It's a fancy wrapper around
exec_context_may_touch_console() ultimately, however for service units
we'll explicitly exclude the SERVICE_EXITED state from when we report
true.
manager: minor manager_get_show_status() simplification
Since the the whole function ultimately is just a fancy getter for the
show_status field, let's actually return it as last step literally
without an extra needless "if".
This removes LOG_TARGET_SAFE. It's made redundant by the new
"prohibit-ipc" logging flag, as it used to have a similar effect: avoid
logging to the journal/syslog, i.e. any local services in order to avoid
deadlocks when we lock from PID 1 or its utility processes (such as
generators).
All previous users of LOG_TARGET_SAFE are switched over to the new
setting. This makes things a bit safer for all, as not even the
SYSTEMD_LOG_TARGET env var can be used to accidentally log to the
journal anymore in these programs.
log: add new "prohibit_ipc" flag to logging system
If set, we'll avoid logging to any IPC log targets, i.e. syslog or the
journal, but allow stderr, kmsg, console logging.
This is useful as PID 1 wants to turn this off explicitly as long as the
journal is not up.
Previously we'd open/close the log stream to these services whenever
needed but this is incompatible with the "open_when_needed" logic
introduced in #6915, which might open the log streams whenever it likes,
including possibly inside of the child process we fork off that'll
become journald later on. Hence, let's make this all explicit, and
instead of managing when we open/close log streams add a boolean that
clearly prohibits the IPC targets when needed, so that opening can be
done at any time, but will honour this.
log: make log_set_upgrade_syslog_to_journal() take effect immediately
This doesn't matter much, and we don't rely on it, but I think it's much
nicer if we log_set_target() and log_set_upgrade_syslog_to_journal() can
be called in either order and have the same effect.
It is often the case that a file descriptor and its corresponding IO
sd_event_source share a life span. When this is the case, developers will
have to unref the event source and close the file descriptor. Instead, we
can just have the event source take ownership of the file descriptor and
close it when the event source is freed. This is especially useful when
combined with cleanup attributes and sd_event_source_unrefp().
The time-related functions in sd-event.h take as inputs constants (CLOCK_*)
defined in time.h. By including time.h in sd-event.h, we free the developer
from having to do this manually.
Apparently O_NONBLOCK is the modern name used in most documentation and
for most cases in our sources. Let's hence replace the old alias
O_NDELAY and stick to O_NONBLOCK everywhere.
tmpfiles: make "f" lines behaviour match what the documentation says
CHANGE OF BEHAVIOUR — with this commit "f" line's behaviour is altered
to match what the documentation says: if an "argument" string is
specified it is written to the file only when the file didn't exist
before. Previously, it would be appended to the file each time
systemd-tmpfiles was invoked — which is not a particularly useful
behaviour as the tool is not idempotent then and the indicated files
grow without bounds each time the tool is invoked.
I did some spelunking whether this change in behaviour would break
things, but afaics nothing relies on the previous O_APPEND behaviour of
this line type, hence I think it's relatively safe to make "f" lines
work the way the docs say, rather than adding a new modifier for it or
so.
The left side of the || expression is conditionalized on SERVICE_START,
but SERVICE_START is blanket listed on the right side anyway, hence we
can drop the left side entirely without any change in behaviour.
Moreover, if main_pid is initialized, it should be watched, hence this
is even the safe and right thing to do.
core: rework how we track which PIDs to watch for a unit
Previously, we'd maintain two hashmaps keyed by PIDs, pointing to Unit
interested in SIGCHLD events for them. This scheme allowed a specific
PID to be watched by exactly 0, 1 or 2 units.
With this rework this is replaced by a single hashmap which is primarily
keyed by the PID and points to a Unit interested in it. However, it
optionally also keyed by the negated PID, in which case it points to a
NULL terminated array of additional Unit objects also interested. This
scheme means arbitrary numbers of Units may now watch the same PID.
Runtime and memory behaviour should not be impact by this change, as for
the common case (i.e. each PID only watched by a single unit) behaviour
stays the same, but for the uncommon case (a PID watched by more than
one unit) we only pay with a single additional memory allocation for the
array.
Why this all? Primarily, because allowing exactly two units to watch a
specific PID is not sufficient for some niche cases, as processes can
belong to more than one unit these days:
1. sd_notify() with MAINPID= can be used to attach a process from a
different cgroup to multiple units.
2. Similar, the PIDFile= setting in unit files can be used for similar
setups,
3. By creating a scope unit a main process of a service may join a
different unit, too.
4. On cgroupsv1 we frequently end up watching all processes remaining in
a scope, and if a process opens lots of scopes one after the other it
might thus end up being watch by many of them.
This patch hence removes the 2-unit-per-PID limit. It also makes a
couple of other changes, some of them quite relevant:
- manager_get_unit_by_pid() (and the bus call wrapping it) when there's
ambiguity will prefer returning the Unit the process belongs to based on
cgroup membership, and only check the watch-pids hashmap if that
fails. This change in logic is probably more in line with what people
expect and makes things more stable as each process can belong to
exactly one cgroup only.
- Every SIGCHLD event is now dispatched to all units interested in its
PID. Previously, there was some magic conditionalization: the SIGCHLD
would only be dispatched to the unit if it was only interested in a
single PID only, or the PID belonged to the control or main PID or we
didn't dispatch a signle SIGCHLD to the unit in the current event loop
iteration yet. These rules were quite arbitrary and also redundant as
the the per-unit handlers would filter the PIDs anyway a second time.
With this change we'll hence relax the rules: all we do now is
dispatch every SIGCHLD event exactly once to each unit interested in
it, and it's up to the unit to then use or ignore this. We use a
generation counter in the unit to ensure that we only invoke the unit
handler once for each event, protecting us from confusion if a unit is
both associated with a specific PID through cgroup membership and
through the "watch_pids" logic. It also protects us from being
confused if the "watch_pids" hashmap is altered while we are
dispatching to it (which is a very likely case).
- sd_notify() message dispatching has been reworked to be very similar
to SIGCHLD handling now. A generation counter is used for dispatching
as well.
This also adds a new test that validates that "watch_pid" registration
and unregstration works correctly.
core: unify call we use to synthesize cgroup empty events when we stopped watching any unit PIDs
This code is very similar in scope and service units, let's unify it in
one function. This changes little for service units, but for scope units
makes sure we go through the cgroup queue, which is something we should
do anyway.
service: don't send out dbus change notifications spuriously on SIGCHLD
Let's send them out only if the main or control processe exited and we
recorded a new exit status that is worth reporting. But if any other
service process died this is nothing to report since we don't expose any
properties about that anyway.
core: fix manager_get_unit_by_pid() special casing of manager PID
Previously, we'd hard map PID 1 to the manager scope unit. That's wrong
however when we are run in --user mode, as the PID 1 is outside of the
subtree we manage and the manager PID might be very differently. Correct
that by checking for getpid() rather than hardcoding 1.
tmpfiles: create parent directories if they are missing for more line types
Currently, we create leading directories implicitly for all lines that
create directory or directory-like nodes.
With this, we also do the same for a number of other lines: f/F, C, p,
L, c/b (that is regular files, pipes, symlinks, device nodes as well as
file trees we copy).
The leading directories are created with te default access mode of 0755.
If something else is desired, users should simply declare appropriate
"d" lines.
pid1: rework how we dispatch SIGCHLD and other signals
This fundamentally makes one change: we never process more than one
signal or more than one waitid() event per event loop. We'll never tight
loop around waitid() or around read() on our signalfd instead, but
always return to the main event loop after processing one event.
By doing this we put the event priorization handling into full power
again, as we'll always check for higher priority events before looking
at the next signal or waitid() again.
This introduces a new "defer" event source "sigchld_event". It's enabled
as soon as we see SIGCHLD, and disabled as soon as waitid() reported no
further children pending. It's running at a relatively high priority,
one step higher than signal handling itself, but lower than
/proc/self/mountinfo event handling, so that the latter always takes
precedence.
Since we want to process sd_notify() events at an even higher priority
than SIGCHLD (as before) it is moved one priority step up, too.
mount,swap: write event loop priority as "SD_EVENT_PRIORITY_NORMAL-x"
We do that in all other cases, let's do it here too. Since
SD_EVENT_PRIORITY_NORMAL evaluates to zero there's zero effective
difference, but it makes things easier to grok and grep for if we always
express relative priorities within PID 1 only.
manager: split out send_ready and basic.target checking into functions of their own
Let's shorten manager_check_finished() a bit by splitting out checking
of basic.target and the two things we do when we reach it.
This should not change behaviour, except for one thing: we now check
basic.target's actual state for figuring out whether it is up, instead
of generically checking whether it has any job queued. This is arguably
more correct, and is what other code does too for similar purposes, for
example manager_state()
Currently, sd-bus supports the ability to have thread-local default busses.
However, this is less useful than it can be since all functions which
require an sd_bus* as input require the caller to pass it. This patch adds
a new macro which allows the developer to pass a constant SD_BUS_DEFAULT,
SD_BUS_DEFAULT_USER or SD_BUS_DEFAULT_SYSTEM instead. This reduces work for
the caller.
For example:
r = sd_bus_default(&bus);
r = sd_bus_call_method(bus, ...);
sd_bus_unref(bus);
Becomes:
r = sd_bus_call_method(SD_BUS_DEFAULT, ...);
If the specified thread-local default bus does not exist, the function
calls will return -ENOPKG. No bus will ever be implicitly created.
Currently, sd-event supports the ability to have a thread-local default
event loop. However, this is less useful than it can be since all functions
which require an sd_event* as input require the caller to pass it. This
patch adds a new macro which allows the developer to pass a constant
SD_EVENT_DEFAULT instead. This reduces work for the caller.
For example:
r = sd_event_default(&e);
r = sd_event_add_io(e, ...);
sd_event_unref(e);
Becomes:
r = sd_event_add_io(SD_EVENT_DEFAULT, ...);
If no thread-local default event loop exists, the function calls will
return -ENOPKG. No event loop will ever be implicitly created.
The DHCPv6 client will set its state to DHCP6_STATE_STOPPED if
an error occurs or when receiving an Information Reply DHCPv6
message. Once in DHCP6_STATE_STOPPED, the DHCPv6 client needs
to be restarted by calling sd_dhcp6_client_start().
As of pull request #7796 client_reset() no longer closes the
network socket, thus a call to sd_dhcp6_client_start() needs to
check whether the file descriptor already exists in order not to
create a new one. Likewise, a call to sd_dhcp6_client_unref()
must now close the network socket as client_reset() is not
closing it.
Alan Jenkins [Sat, 20 Jan 2018 20:12:09 +0000 (20:12 +0000)]
mount: don't consider activated until /sbin/mount returns
So far, we considered mount units activated as soon as the mount
appeared. This avoided seeing a difference between mounts started by
systemd, and e.g. by running `mount` from a terminal.
(`umount` was not handled this way).
However in some cases, options passed to `mount` require additional
system calls after the mount is successfully created. E.g. the
`private` mount option, or the `ro` option on bind mounts.
It seems best to wait for mount to finish doing that. E.g. in
the `private` case, the current behaviour could theoretically cause
non-deterministic results, as child mounts inherit the
private/shared propagation setting from their parent.
This also avoids a special case in mount_reload().
Alan Jenkins [Mon, 22 Jan 2018 17:42:25 +0000 (17:42 +0000)]
mount: clarify that umount retries do not (anymore) allow multiple timeouts
It _looks_ as if, back when we used to retry unsuccessful calls to umount,
this would have inflated the effective timeout. Multiplying it by
RETRY_UMOUNT_MAX. Which is set to 32.
I'm surprised if it's true: I would have expected it to be noticed during
the work on NFS timeouts. But I can't see what would have stopped it.
Clarify that I do not expect this to happen anymore. I think each
individual umount call is allowed up to the full timeout, but if umount
ever exited with a signal status, we would stop retrying.
To be extra clear, make sure that we do not retry in the event that umount
perversely returned EXIT_SUCCESS after receiving SIGTERM.
Alan Jenkins [Sat, 20 Jan 2018 20:05:52 +0000 (20:05 +0000)]
mount: mountinfo event is supposed to always arrive before SIGCHLD
"Due to the io event priority logic we can be sure the new mountinfo is
loaded before we process the SIGCHLD for the mount command."
I think this is a reasonable expectation. But if it works, then the
other comment must be false:
"Note that mount(8) returning and the kernel sending us a mount table
change event might happen out-of-order."
Therefore we can clean up the code for the latter.
If this is working as advertised, then we can make sure that mount units
fail if the mount we thought we were creating did not actually appear,
due to races or trickery (or because /sbin/mount did something unexpected
despite returning EXIT_SUCCESS).
Include a specific warning message for this failure.
If we give up when the mount point is still mounted after 32 successful
calls to /sbin/umount, that seems a fairly similar case. So make that
message a LOG_WARN as well (not LOG_DEBUG). Also, this was recently changed to only
retry while umount is returning EXIT_SUCCESS; in that case in particular
there would be no other messages in the log to suggest what had happened.
Martin Pitt [Mon, 22 Jan 2018 20:17:08 +0000 (21:17 +0100)]
hwdb: map zoomin/out keys to up/down
Some keyboards come with a zoom see-saw or rocker which until now got
mapped to the Linux "zoomin/out" keys in hwdb. However, these keycodes
are not recognized by any major desktop. They now produce Up/Down key
events so that they can be used for scrolling.
The internet is full of instructions how to "unbreak" these keys, e. g.
ott [Tue, 23 Jan 2018 00:53:31 +0000 (01:53 +0100)]
resolve: Adjust and unify D-Bus call timeout (#7847)
DNS queries have a timeout of DNS_TRANSACTION_ATTEMPTS_MAX *
DNS_TIMEOUT_MAX_USEC = 120 s. Calls to the ResolveHostname method of
the org.freedesktop.resolve1.Manager interface have various call
timeouts that are smaller than 120 s. So it seems correct to adjust
the call timeout to the maximum query timeout and to unify the call
timeout among all callers.
A timeout of 120 s might seem large, in particular since BIND does seem
to have a query timeout of 10 s. However, it seems match the timeout
value of 120 s of Unbound. Moreover, the query and timeout handling of
resolve have problems and might be improved in the future, so this
change is at best an interim solution.
Jan Klötzke [Thu, 11 Jan 2018 09:44:38 +0000 (10:44 +0100)]
systemd-analyze: add service-watchdogs verb
New debug verb that enables or disables the service runtime watchdogs
and emergency actions during runtime. This is the systemd-analyze
version of the systemd.service_watchdogs command line option.
Armin Widegreen [Thu, 11 Jan 2018 11:42:56 +0000 (12:42 +0100)]
journal: Fix journal dumping for json, cat and export output
Incorporating the fix from d00f1d57 into other output formats of journalctl.
If journal files are corrupted, e.g. not cleanly closed, some journal
entries can not be read by output options other than 'short' (default).
If such entries has been identified, they will now just be skipped.
Michal Koutný [Tue, 16 Jan 2018 18:22:46 +0000 (19:22 +0100)]
core/timer: Prevent timer looping when unit cannot start
When a unit job finishes early (e.g. when fork(2) fails) triggered unit goes
through states
stopped->failed (or failed->failed),
in case a ExecStart= command fails unit passes through
stopped->starting->failed.
The former transition doesn't result in unit active/inactive timestamp being
updated and timer (OnUnitActiveSec= or OnUnitInactiveSec=) would use an expired
timestamp triggering immediately again (repeatedly).
This patch exploits timer's last trigger timestamp to ensure the timer isn't
triggered more frequently than OnUnitActiveSec=/OnUnitInactiveSec= period.
core: propagate TasksMax= on the root slice to sysctls
The cgroup "pids" controller is not supported on the root cgroup.
However we expose TasksMax= on it, but currently don't actually apply it
to anything. Let's correct this: if set, let's propagate things to the
right sysctls.
This way we can expose TasksMax= on all units in a somewhat sensible
way.