pid1: properly remove references to the unit from gc queue during final cleanup
When various references to the unit were dropped during cleanup in unit_free(),
add_to_gc_queue() could be called on this unit. If the unit was previously in
the gc queue (at the time when unit_free() was called on it), this wouldn't
matter, because it'd have in_gc_queue still set even though it was already
removed from the queue. But if it wasn't set, then the unit could be added to
the queue. Then after unit_free() would deallocate the unit, we would be left
with a dangling pointer in gc_queue.
A unit could be added to the gc queue in two places called from unit_free():
in the job_install calls, and in unit_ref_unset(). The first was OK, because
it was above the LIST_REMOVE(gc_queue,...) call, but the second was not, because
it was after that. Move all the LIST_REMOVE() calls down.
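The ordering problem can be illustrated with a minimal sketch. The types and helpers below (`struct unit`, `struct manager`, `drop_references()`, `unit_release()`) are simplified stand-ins invented for this example, not the real pid1 code. The only point is that the queue removal has to come after every cleanup step that might queue the unit again.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the real structures (assumption). */
struct unit {
        struct unit *gc_next;
        bool in_gc_queue;
};

struct manager {
        struct unit *gc_queue;
};

void add_to_gc_queue(struct manager *m, struct unit *u) {
        if (u->in_gc_queue)
                return;
        u->gc_next = m->gc_queue;
        m->gc_queue = u;
        u->in_gc_queue = true;
}

void remove_from_gc_queue(struct manager *m, struct unit *u) {
        struct unit **p;

        if (!u->in_gc_queue)
                return;
        /* Unlink u from the singly linked queue. */
        for (p = &m->gc_queue; *p; p = &(*p)->gc_next)
                if (*p == u) {
                        *p = u->gc_next;
                        break;
                }
        u->gc_next = NULL;
        u->in_gc_queue = false;
}

/* Hypothetical stand-in for cleanup steps such as unit_ref_unset(),
 * which may put the unit on the gc queue again. */
void drop_references(struct manager *m, struct unit *u) {
        add_to_gc_queue(m, u);
}

void unit_release(struct manager *m, struct unit *u) {
        drop_references(m, u);      /* may re-queue u */
        remove_from_gc_queue(m, u); /* must come last */
        /* only now is it safe for the caller to free u */
}
```

With the two calls in `unit_release()` swapped, the queue would still point at the unit after it has been freed.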
pid1: free basic unit information at the very end, before freeing the unit
We would free stuff like the names of the unit first, and then recurse
into other structures to remove the unit from there. Technically this
was OK, since the code did not access the name, but this makes debugging
harder. And if any log messages are added in any of those functions, they
are likely to access u->id and such other basic information about the unit.
So let's move the removal of this "basic" information towards the end
of unit_free().
pid1: fix collection of cycles of units which reference one another
A .socket will reference a .service unit, by registering a UnitRef with the
.service unit. If this .service unit has the .socket unit listed in Wants or
Sockets or such, a cycle will be created. We would not free this cycle
properly, because we treated any unit with non-empty refs as uncollectable. To
solve this issue, treat refs with UnitRef in u->refs_by_target similarly to
the refs in u->dependencies, and check if the "other" unit is known to be
needed. If it is not needed, do not treat the reference from it as preventing
the unit we are looking at from being freed.
The source unit manages the reference. It allocates the UnitRef structure and
registers it in the target unit, and then the reference must be destroyed
before the source unit is destroyed. Thus, it should be OK to include the
pointer to the source unit; it should be live as long as the reference exists.
This adds some paranoia code that moves some of the fds we allocate for
longer periods of time to fds > 2 if they are allocated below this
boundary. This is a paranoid safety thing, in order to avoid that
external code might end up erroneously using our fds under the assumption
they were valid stdin/stdout/stderr. Think: some app closes
stdin/stdout/stderr and then invokes 'fprintf(stderr, …' which causes
writes on our fds.
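A minimal sketch of such a helper is shown below. The name `move_fd_above_stdio()` and the exact error handling are assumptions made for this example and need not match the helper added by the commit. It relies only on `fcntl()` with `F_DUPFD` / `F_DUPFD_CLOEXEC`, which return the lowest free descriptor greater than or equal to the given minimum.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <unistd.h>

/* Returns a descriptor > 2 referring to the same open file, or the
 * original fd if it is already above the stdio range or the
 * duplication fails. Hypothetical helper for illustration. */
int move_fd_above_stdio(int fd) {
        int flags, copy;

        if (fd < 0 || fd > 2)
                return fd;

        /* Preserve the close-on-exec flag of the original descriptor. */
        flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return fd;

        copy = fcntl(fd, (flags & FD_CLOEXEC) ? F_DUPFD_CLOEXEC : F_DUPFD, 3);
        if (copy < 0)
                return fd;

        close(fd);
        return copy;
}
```

Falling back to the original descriptor on failure keeps the helper usable as a best-effort wrapper around any call that returns an fd.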
This both adds the helper to do the moving as well as ports over a
number of users to this new logic. Since we don't want to litter all our
code with invocations of this I tried to strictly focus on fds we keep
open for long periods of time only and only in code that is frequently
loaded into foreign programs (under the assumption that in our own
codebase we are smart enough to always keep stdin/stdout/stderr
allocated to avoid this pitfall). Specifically this means all code used
by NSS and our sd-xyz API:
Simon Fowler [Fri, 9 Feb 2018 16:37:39 +0000 (02:37 +1000)]
Suspend on lid close based on power status. (#8016)
This change adds support for controlling the suspend-on-lid-close
behaviour based on the power status as well as whether the machine is
docked or has an external monitor. For backwards compatibility the new
configuration file variable is ignored completely by default, and must
be set explicitly before being considered in any decisions.
service: relax PID file symlink chain checks a bit (#8133)
Let's read the PID file after all if there's a potentially unsafe
symlink chain in place. But if we do, then refuse taking the PID if it's
outside of the cgroup.
So far we didn't document control, transient, dbus config, or generator paths.
But those paths are visible to users, and they need to understand why systemd
loads units from those paths, and how the precedence hierarchy looks.
The whole thing is a bit messy, since the list of paths is quite long.
I made the tables a bit shorter by combining rows for the alternatives
where $XDG_* is set and the fallback.
In various places, tags are split like <element
param="blah">
this. This is necessary to keep everything in one logical XML line so that
docbook renders the table properly.
Strictly speaking, online upgrades of user instances through daemon-reexec will
be broken. We can get away with this since
a) reexecs of the user instance are not commonly done, at least package upgrade
scripts don't do this afaik.
b) cgroups aren't delegatable on cgroupsv1, so there's little reason to use "systemctl
set-property" for --user mode
shared/path-lookup: rearrange paths in --global mode to match --user mode
It's not good if the paths are in different order. With --user, we expect
more paths, but it must be a strict superset, and the order for the ones
that appear in both sets must be the same.
path-lookup: include paths from --global in --user search path too
This doesn't matter that much, because set-property --global does not work,
so at least those paths wouldn't be used automatically. It is still possible
to create such snippets manually, so we better fix this.
Mao [Thu, 1 Feb 2018 09:33:13 +0000 (17:33 +0800)]
udevadm: allow trigger command to be synchronous
There are cases that we want to trigger and settle only specific
commands. For example, let's say at boot time we want to make sure all
the graphics devices are working correctly because it's critical for
booting, but not the USB subsystem (we'll trigger USB events later). So
we do:
However, we cannot block the kernel from emitting kernel events from
discovering USB devices. So if any of the USB kernel events were emitted
before the settle command, the settle command would still wait for the
entire queue to complete. And if the USB event takes a long time to be
processed, the system slows down.
The new `settle` option allows the `trigger` command to wait for only
the triggered events, and effectively solves this problem.
man: fix capability name in man:systemd-tmpfiles(8) (#8139)
CAP_ADMIN does not exist (the closest existing capability name would be
CAP_SYS_ADMIN), and according to man:open(2) and man:capabilities(7),
the capability required to specify O_NOATIME is actually CAP_FOWNER.
Peter Portante [Sun, 28 Jan 2018 21:48:04 +0000 (16:48 -0500)]
Periodically call sd_journal_process in journalctl
If `journalctl` takes a long time to process messages, and during that
time journal file rotation occurs, a `journalctl` client will keep
those rotated files open until it calls `sd_journal_process()`, which
typically happens as a result of calling `sd_journal_wait()` below in
the "following" case. By periodically calling `sd_journal_process()`
during the processing loop we shrink the window of time a client
instance has open file descriptors for rotated (deleted) journal
files.
**Warning**
This change does not appear to solve the case of a "paused" output
stream. If somebody is using `journalctl | less` and pauses the
output, then without a background thread periodically listening for
inotify delete events and cleaning up, journal logs will eventually
stop flowing in cases where a journal client with enough open files
causes the "free" disk space threshold to be crossed.
Shawn Landden [Sat, 3 Feb 2018 18:16:33 +0000 (10:16 -0800)]
sd-bus: cleanup ssh sessions (Closes: #8076)
We still invoke ssh unnecessarily when there is incompatible or erroneous input.
The follow-up to finish that would make the code a bit more verbose,
as it would require repeating this bit:
```
r = bus_connect_transport(arg_transport, arg_host, false, &bus);
if (r < 0) {
        log_error_errno(r, "Failed to create bus connection: %m");
        goto finish;
}
sd_bus_set_allow_interactive_authorization(bus, arg_ask_password);
```
in every verb, after parsing.
v2: add waitpid() to avoid a zombie process, switch to SIGTERM from SIGKILL
v3: refactor, wait in bus_start_address()
Susant Sahani [Thu, 8 Feb 2018 09:22:46 +0000 (14:52 +0530)]
networkd: vxlan require Remote= to be a non multicast address (#8117)
Remote= must be a non multicast address. ip-link(8) says:
> remote IPADDR - specifies the unicast destination IP address to
> use in outgoing packets when the destination link layer address
> is not known in the VXLAN device forwarding database.
Faalagorn [Thu, 8 Feb 2018 08:14:55 +0000 (09:14 +0100)]
man: .service <filename> to <literal> (#8126)
Changed <filename>.service</filename> to <literal>.service</literal> to match style in other manual pages: man 5 systemd.socket, device, mount, automount, swap, target, path, timer, slice and scope.
Alan Jenkins [Thu, 8 Feb 2018 08:14:32 +0000 (08:14 +0000)]
journal: avoid code that relies on LOG_KERN == 0 (#8110)
LOG_FAC() is the general way to extract the logging facility (when it has
been combined with the logging priority).
LOG_FACMASK can be used to mask off the priority so you only have the
logging facility bits... but to get the logging facility e.g. LOG_USER,
you also have to bitshift it as well. (The priority is in the low bits,
and so only requires masking).
((priority & LOG_FACMASK) == LOG_KERN) happens to work only because
LOG_KERN is 0, and hence has the same value with or without the bitshift.
Code that relies on weird assumptions like this could make it harder to
realize how the logging values are treated.
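A small sketch of the difference, using only the macros from `<syslog.h>`. The two helper names are made up for illustration. Masking leaves the facility bits in place, `LOG_FAC()` also shifts them down, and the two results coincide only for facility 0.

```c
#include <syslog.h>

/* Facility as a small number: 0 for the kernel facility, 1 for the
 * user facility, and so on. */
int facility_number(int priority) {
        return LOG_FAC(priority);
}

/* Facility bits left where they are: masked, but not shifted. */
int facility_bits(int priority) {
        return priority & LOG_FACMASK;
}
```

For a value such as `LOG_KERN | LOG_WARNING` both helpers return 0, which is why a check written with only the mask appears to work. For any other facility the two results differ.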
Faalagorn [Wed, 7 Feb 2018 18:10:41 +0000 (19:10 +0100)]
man: "reboot" to "power off" in poweroff.target (#8124)
Changed "reboot" to "power off" in poweroff.target description. It was most likely copied and pasted from the reboot.target below, compare with e.g. halt.target
Franck Bui [Wed, 7 Feb 2018 13:08:02 +0000 (14:08 +0100)]
core: use id unit when retrieving unit file state (#8038)
Previous code was using the basename(id->fragment_path) which returned
an incorrect result if the unit was an instance.
For example, assuming that no instances of "template" have been created so far:
$ systemctl enable template@1
Created symlink from /etc/systemd/system/multi-user.target.wants/template@1.service to /usr/lib/systemd/system/template@.service.
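To see why the basename of the fragment path is the wrong key for an instance, consider the following sketch. `path_basename()` is a hypothetical helper written for this example, not the function used in the core, and it assumes that the fragment path of an instance points at the template file, as the symlink target above suggests.

```c
#include <string.h>

/* Returns the part of the path after the last '/', or the whole
 * string if there is no '/'. Hypothetical helper for illustration. */
const char *path_basename(const char *path) {
        const char *slash = strrchr(path, '/');

        return slash ? slash + 1 : path;
}
```

For `/usr/lib/systemd/system/template@.service` the helper yields `template@.service`, that is, the template rather than the instance `template@1.service` whose state was requested.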
process-util: use raw_getpid() in getpid_cache() internally (#8115)
We have the raw_getpid() definition in place anyway, and it's certainly
beneficial to expose the same semantics on pre glibc 2.24 and after it
too, hence always bypass glibc for this, and always cache things on our
side.
Add more file triggers to handle more aspects of systemd (#8090)
For quite a while now, there have been file triggers to handle
automatically setting up service units in upstream systemd. However,
most of the actions being done by these macros upon files can be set up
as RPM file triggers.
In fact, in Mageia, we had been doing this for most of these. In particular,
we have file triggers in place for sysusers, tmpfiles, hwdb, and the journal.
This change adds Lua versions of the original file triggers used in Mageia,
based on the existing Lua-based file triggers for service units.
In addition, we can also have useful file triggers for udev rules, sysctl
directives, and binfmt directives. These are based on the other existing
file triggers.
Yu Watanabe [Tue, 6 Feb 2018 08:08:38 +0000 (17:08 +0900)]
nss-mymachines: add work-around to silence gcc warning
This is similar to 3c3d384ae93700ef08545b078c37065fdb98eee7 and
a workaround for the following warning.
```
In file included from ../src/basic/in-addr-util.h:28,
from ../src/nss-mymachines/nss-mymachines.c:31:
../src/nss-mymachines/nss-mymachines.c: In function '_nss_mymachines_getgrnam_r':
../src/nss-mymachines/nss-mymachines.c:653:32: warning: argument to 'sizeof' in 'memset' call is the same pointer type 'char *' as the destination; expected 'char' or an explicit length [-Wsizeof-pointer-memaccess]
memzero(buffer, sizeof(char*));
^~~~
../src/basic/util.h:118:39: note: in definition of macro 'memzero'
#define memzero(x,l) (memset((x), 0, (l)))
^
../src/nss-mymachines/nss-mymachines.c: In function '_nss_mymachines_getgrgid_r':
../src/nss-mymachines/nss-mymachines.c:730:32: warning: argument to 'sizeof' in 'memset' call is the same pointer type 'char *' as the destination; expected 'char' or an explicit length [-Wsizeof-pointer-memaccess]
memzero(buffer, sizeof(char*));
^~~~
../src/basic/util.h:118:39: note: in definition of macro 'memzero'
#define memzero(x,l) (memset((x), 0, (l)))
^
```
Yu Watanabe [Tue, 6 Feb 2018 08:05:58 +0000 (17:05 +0900)]
networkd: fix dhcp6_prefixes_compare_func()
Found by the following warning by gcc.
```
../src/network/networkd-manager.c: In function 'dhcp6_prefixes_compare_func':
../src/network/networkd-manager.c:1383:16: warning: 'memcmp' reading 16 bytes from a region of size 8 [-Wstringop-overflow=]
return memcmp(&a, &b, sizeof(*a));
^
```
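The warning points at a comparison of the two pointer variables themselves rather than of the addresses they point to. A minimal sketch of a corrected comparison follows, assuming the keys are `struct in6_addr` values, as the 16-byte size in the warning suggests. The function name is hypothetical and the actual fix in networkd may differ in detail.

```c
#include <netinet/in.h>
#include <string.h>

/* Compares the 16 address bytes, not the 8-byte pointer values. */
int prefix_compare(const void *_a, const void *_b) {
        const struct in6_addr *a = _a, *b = _b;

        return memcmp(a, b, sizeof(*a));
}
```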
Yu Watanabe [Tue, 6 Feb 2018 07:00:34 +0000 (16:00 +0900)]
core: make ExecRuntime be manager managed object
Before this, each ExecRuntime object was owned by a unit. However,
it may be shared with other units which enable JoinsNamespaceOf=.
Thus, during serialization/deserialization, its sharing information,
more specifically the reference counter, was lost, which caused
issue #7790.
This makes ExecRuntime objects be managed by manager, and changes
the serialization/deserialization process.
Alan Jenkins [Mon, 5 Feb 2018 16:53:40 +0000 (16:53 +0000)]
journal: include kmsg lines from the systemd process which exec()d us (#8078)
Let the journal capture messages emitted by systemd, before it ran
exec("/usr/lib/systemd/systemd-journald"). Usually such messages will only
appear with `systemd.log_level=debug`. kmsg lines written after the exec()
will be ignored as before.
In other words, we are avoiding reading our own lines, which start
"systemd-journald[100]: " assuming we are PID 100. But now we will start
allowing ourself to read lines which start "systemd[100]: ", or any other
prefix which is not "systemd-journald[100]: ".
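The rule described above can be sketched as a plain prefix check. This is only an illustration: `is_own_kmsg_line()` is a hypothetical helper, and the real journald code may match on parsed fields of the kmsg record rather than on the raw text.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* True if the line starts with "systemd-journald[PID]: " for the given
 * PID, i.e. it is one of our own lines and should be skipped. */
bool is_own_kmsg_line(const char *line, long pid) {
        char prefix[64];
        int n;

        n = snprintf(prefix, sizeof(prefix), "systemd-journald[%ld]: ", pid);
        if (n < 0 || (size_t) n >= sizeof(prefix))
                return false;

        return strncmp(line, prefix, (size_t) n) == 0;
}
```

A line starting with "systemd[100]: " does not match the prefix for PID 100, so it would now be read.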
So this can't help you see messages when we fail to exec() journald :). But,
it makes it easier to see what the pre-exec() messages look like in
the successful case. Comparing messages like this can be useful when
debugging. Noticing weird omissions of messages, otoh, makes me anxious.
nss-systemd: add work-around to silence gcc warning
In file included from ../src/basic/fs-util.h:32,
from ../src/nss-systemd/nss-systemd.c:28:
../src/nss-systemd/nss-systemd.c: In function '_nss_systemd_getgrnam_r':
../src/nss-systemd/nss-systemd.c:416:32: warning: argument to 'sizeof' in 'memset' call is the same pointer type 'char *' as the destination; expected 'char' or an explicit length [-Wsizeof-pointer-memaccess]
memzero(buffer, sizeof(char*));
^~~~
../src/basic/util.h:118:39: note: in definition of macro 'memzero'
#define memzero(x,l) (memset((x), 0, (l)))
^
gcc is trying to be helpful, and it's not far from being right. It _looks_ like
sizeof(char*) is an error, but in this case we're really leaving a space empty
for a pointer, and our calculation is correct. Since this is a short file,
let's just use simplest option and turn off the warning above the two functions
that trigger it.
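Why `sizeof(char*)` is the intended length can be seen in a small sketch of the buffer layout: one pointer-sized slot holds the NULL that terminates an empty member list, and the name follows it. The function and parameter names are hypothetical, and the sketch sidesteps the warning with a named constant instead of disabling it, which is a different approach from the one taken in the commit.

```c
#include <stddef.h>
#include <string.h>

/* Lays out an empty, NULL-terminated member list followed by the name
 * in a caller-supplied buffer. The buffer is assumed to be suitably
 * aligned for a pointer. Returns 0 on success, -1 if it is too small. */
int fill_group_buffer(char *buffer, size_t buflen, const char *name,
                      char ***ret_members, char **ret_name) {
        const size_t slot = sizeof(char *); /* room for one pointer */
        size_t l = strlen(name);

        if (buflen < slot + l + 1)
                return -1;

        memset(buffer, 0, slot);     /* the list terminator */
        strcpy(buffer + slot, name); /* the name goes right after it */

        *ret_members = (char **) buffer;
        *ret_name = buffer + slot;
        return 0;
}
```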
I expect that this will be mostly obsoleted by transfiletriggers that
(I hope) we will soon add. But let's do this for completeness anyway.
I'm keeping the description of the macro a bit vague, since I expect
that it'll be changed when transfiletriggers are added.
man: document meaning of age in tmpfiles.d (#8092)
This documents how the age of a file is determined, which previously was
only alluded to in other parts of the documentation. Fixes #8091.
The phrasings of “last modification timestamp” etc. are taken from
man:inode(7) (as of man-pages 4.14). The debug messages in tmpfiles.c
use different messages (“modify time”), which according to a code
comment follow man:stat(1); however, my copy of that manpage (from GNU
coreutils 8.29) documents %y as “time of last data modification”
instead.
test: sort imports and use "new" string formatting
Followed PEP8 and PEP3101 rules (#8079)
Imports are re-ordered alphabetically to follow PEP8.
Old-style string formatting (" example %s " % exampleVar) is rewritten as new-style string
formatting (" example {} ".format(exampleVar)) to follow PEP3101.
Yu Watanabe [Thu, 1 Feb 2018 10:39:30 +0000 (19:39 +0900)]
systemctl: show: use EnvironmentFiles= instead of EnvironmentFile=
EnvironmentFile= is used in the unit file, but in the dbus,
the related field name is EnvironmentFiles=.
As with the other variables, let's use the field name instead of the name
used in the unit file setting.
Alan Jenkins [Sun, 4 Feb 2018 20:46:27 +0000 (20:46 +0000)]
slice: system.slice should be perpetual like -.mount
`-.mount` is placed in `system.slice`, and hence depends on it.
`-.mount` is always active and can never be stopped. Therefore the same
should be true of `system.slice`.
Synthesize it as perpetual (unless systemd is running as a user manager).
Notice we also drop `Before=slices.target` as unnecessary.
AFAICS the justification for `perpetual` is to provide extra protection
against unintentionally stopping every single service. So adding
system.slice to the perpetual units is perfectly consistent.
I don't expect this will (or can) fix any other problem. And the
`perpetual` protection probably isn't formal enough to spend much time
thinking about. I've just noticed this a couple of times, as something
that looks strange.
Might be a bit surprising that we have user.slice on-disk but not
system.slice, but I think it's ok. `systemctl status system.slice` will
still point you towards `man systemd.special`. The only detail is that the
system slice disables `DefaultDependencies`. If you're worrying about how
system shutdown works when you read `man systemd.slice`, I think it is not
too hard to guess that system.slice might do this:
> Only slice units involved with early boot
> or late system shutdown should disable this option
(Docs are great. I really appreciate the systemd ones).
Alan Jenkins [Sun, 4 Feb 2018 20:16:50 +0000 (20:16 +0000)]
slice, scope: IgnoreOnIsolate=yes is already the default
`IgnoreOnIsolate=yes` is the default for slices and scopes. So it's not
essential to set it on root.slice or init.scope.
We don't need to worry about a bad unit file configuration. Any attempt
to stop these units should fail, since we mark them as `perpetual`.
Also since init.scope cannot be stopped, there is no point setting
`KillSignal=SIGRTMIN+14`. According to both documentation and testing,
KillSignal= does not affect the behaviour of `systemctl kill`.
Alan Jenkins [Fri, 2 Feb 2018 16:06:32 +0000 (16:06 +0000)]
seccomp: allow x86-64 syscalls on x32, used by the VDSO (fix #8060)
The VDSO provided by the kernel for x32, uses x86-64 syscalls instead of
x32 ones.
I think we can safely allow this; the set of x86-64 syscalls should be
very similar to the x32 ones. The real point is not to allow *x86*
syscalls, because some of those are inconveniently multiplexed and we're
apparently not able to block the specific actions we want to.
Boucman [Fri, 2 Feb 2018 14:58:40 +0000 (15:58 +0100)]
do not report total time when kernel time is not provided (#8063)
The whole systemd-analyze time logic is based on the fact that monotonic
time 0 is the start of the kernel.
If the firmware does not provide a correct time, firmware_time degrades to
0, which is the start of the kernel. The difference between FinishTime and
firmware_time is thus correct.
That assumption is still true with containers, but the start time of the
kernel is not what the user expects: it's the time when the host booted.
The total is thus still correct, but highly misleading. Containers can be
easily detected (and, in fact, already are) by systemd not reporting any
kernel non-monotonic timestamp.
This patch simply avoids printing a misleading time when it can detect that
case.
basic/hashmap: tweak code to avoid pointless gcc warning
gcc says:
[196/1142] Compiling C object 'src/basic/basic@sta/hashmap.c.o'.
../src/basic/hashmap.c: In function ‘cachemem_maintain’:
../src/basic/hashmap.c:1913:17: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
mem->active = r = true;
^~~
which conflates two things: the first is transitive assignment a = b = c = d;
the second is assignment of the value of an expression, which happens to be
an assignment expression here, and boolean. While the second _should_ be
parenthesized, the first should _not_, and it's more natural to understand
our code as the first, and gcc should treat this as an exception and not emit
the warning. But since it's a while until this will be fixed, let's update
our code too.