bus-util: improve logging when we can't connect to the bus
Previously, we'd already have explicit logging for the case where
$XDG_RUNTIME_DIR is not set. Let's also add some explicit logging for
the EPERM/ACCESS case. Let's also in both cases suggest the
--machine=<user>@.host syntax.
And while we are at it, let's remove side-effects from the macro.
By checking for both the EPERM/EACCES case and the $XDG_RUNTIME_DIR case
we will now catch both the cases where people use "su" to issue a
"systemctl --user" operation, and those where they (more correctly, but
still not good enough) call "su -".
So far, the bridge always acted as if "--system" was used, i.e. would
unconditionally connect to the system bus. Let's add "--user" too, to
connect to the users session bus.
This is mostly for completeness' sake.
I wanted to use this when making sd-bus's ability to connect to other
user's D-Bus busses work, but it didn't exist so far. In the interest of
keeping things compatible the implementation in sd-bus will not use the
new "--user" switch, and instead manually construct the right bus path
via "--path=", but we still should add the proper switches, as
preparation for a brighter future, one day.
sd-bus: add API for connecting to a specific user's user bus of a specific container
This is unfortunately harder to implement than it sounds. The user's bus
is bound a to the user's lifecycle after all (i.e. only exists as long
as the user has at least one PAM session), and the path dynamically (at
least theoretically, in practice it's going to be the same always)
generated via $XDG_RUNTIME_DIR in /run/.
To fix this properly, we'll thus go through PAM before connecting to a
user bus. Which is hard since we cannot just link against libpam in the
container, since the container might have been compiled entirely
differently. So our way out is to use systemd-run from outside, which
invokes a transient unit that does PAM from outside, doing so via D-Bus.
Inside the transient unit we then invoke systemd-stdio-bridge which
forwards D-Bus from the user bus to us. The systemd-stdio-bridge makes
up the PAM session and thus we can sure tht the bus exists at least as
long as the bus connection is kept.
Or so say this differently: if you use "systemctl -M lennart@foobar"
now, the bus connection works like this:
2. systemd-run gets a connection to the "foobar" container's
system bus, and invokes the "systemd-stdio-bridge" binary as
transient service inside a PAM session for the user "lennart"
3. The systemd-stdio-bridge then proxies our D-Bus traffic to
the user bus.
sd-bus (on host) → systemd-run (on host) → systemd-stdio-bridge (in container)
Complicated? Well, to some point yes, but otoh it's actually nice in
various other ways, primarily as it makes the -H and -M codepaths more
alike. In the -H case (i.e. connect to remote host via SSH) a very
similar three steps are used. The only difference is that instead of
"systemd-run" the "ssh" binary is used to invoke the stdio bridge in a
PAM session of some other system. Thus we get similar implementation and
isolation for similar operations.
So far when asked for augmented bus credentials and the process was
already gone we'd fail fatally. Let's make this graceful instead, and
never allow augmenting fail due to PID having vanished — unless the
augmenting is the explicit and only purpose of the requested operation.
This should be safe as clients have to explicitly query the acquired
creds anyway and handle if they couldn't be acquired. Moreover we
already handle permission problems gracefully, thus clients must be
ready to deal with missing creds.
This is useful to make selinux authorization work for short-lived client
proceses. PReviously we'd augment creds to have more info to log about
(the selinux decision would not be based on augmented data however,
because that'd be unsafe), and would fail if we couldn't get it. Now,
we'll try to acquire the data, but if we cannot acquire it, we'll still
do the selinux check, except that logging will be more limited.
hostname-util: flagsify hostname_is_valid(), drop machine_name_is_valid()
Let's clean up hostname_is_valid() a bit: let's turn the second boolean
argument into a more explanatory flags field, and add a flag that
accepts the special name ".host" as valid. This is useful for the
container logic, where the special hostname ".host" refers to the "root
container", i.e. the host system itself, and can be specified at various
places.
let's also get rid of machine_name_is_valid(). It was just an alias,
which is confusing and even more so now that we have the flags param.
logs-show: drop redundant validation of machine name
The immediately following container_get_leader() call validate the name
anyway, no need to twice exactly the same way twice immediately after
each other.
Gaurav [Fri, 4 Dec 2020 11:15:15 +0000 (16:45 +0530)]
Detect special character in dbus interface name
Added debug log to detect special character in dbus interface names.
Helps to detect a case mentioned in https://github.com/systemd/systemd/issues/14636
When something fails, we need some logs to figure out what happened.
This is primarily relevant for connection errors, but in general we
want to log about all errors, even if they are relatively unlikely.
We want one log on failure, and generally no logs on success.
The general idea is to not log in static functions, and to log in the
non-static functions. Non-static functions which call other functions
may thus log or not log as appropriate to have just one log entry in the
end.
Yu Watanabe [Thu, 10 Dec 2020 23:34:13 +0000 (08:34 +0900)]
sd-device: make TAGS= property prefixed and suffixed with ":"
The commit 6f3ac0d51766b0b9101676cefe5c4ba81feba436 drops the prefix and
suffix in TAGS= property. But there exists several rules that have like
`TAGS=="*:tag:*"`. So, the property must be always prefixed and suffixed
with ":".
Vito Caputo [Sun, 6 Dec 2020 08:21:17 +0000 (00:21 -0800)]
mmap-cache: drop ret_size from mmap_cache_get()
The ret_size result is a bit of an awkward optimization that in a
sense enables bypassing the mmap-cache API, while encouraging
duplication of logic it already implements.
It's only utilized in one place; journal_file_move_to_object(),
apparently to avoid the overhead of remapping the whole object
again once its header, and thus its actual size, is known.
With mmap-cache's context cache, the overhead of simply
re-getting the object with the now known size should already be
negligible. So it's not clear what benefit this brings, unless
avoiding some function calls that do very little in the hot
context-cache hit case is of such a priority.
There's value in having all object-sized gets pass through
mmap_cache_get(), as it provides a single entrypoint for
instrumentation in profiling/statistics gathering. When
journal_file_move_to_object() bypasses getting the full object
size, you don't capture the full picture on the mmap-cache side
in terms of object sizes explicitly loaded from a journal file.
I'd like to see additional accounting in mmap_cache_get() in a
future commit, taking advantage of this change.
Quoting Andy Lutomirski:
> The upcoming Linux SGX driver has a device node /dev/sgx. User code opens
> it, does various setup things, mmaps it, and needs to be able to create
> PROT_EXEC mappings. This gets quite awkward if /dev is mounted noexec.
We already didn't use noexec in spawn, and this extends this behaviour to other
systems.
Afaik, the kernel would refuse execve() on a character or block device
anyway. Thus noexec on /dev matters only for actual binaries copied to /dev,
which requires root privileges in the first place.
We don't do noexec on either /tmp or /dev/shm (because that causes immediate
problems with stuff like Java and cffi). And if you have those two at your
disposal anyway, having noexec on /dev doesn't seem important. So the 'noexec'
attribute on /dev doesn't really mean much, since there are multiple other
similar directories which don't require root privileges to write to.
Yu Watanabe [Fri, 11 Dec 2020 03:15:45 +0000 (12:15 +0900)]
network: do not reconfigure interface when the link gains carrier but udev not initialized it yet
When an interface gains carrier but udev have not initialized the
interface or link_initialized_handler() has not been called yet,
then link_configure will be called twice. Thus LLDP client will be
configured twice, and triggers assertion.
When the .so module is loaded, it gets a separate copy of stuff in src/basic,
including the log level variables. So any logging settings are unaffected by
the loading program calling log_parse_environment() or such. Let's also parse
the environment here so that we can have nice logging.
Initialization is done from each exported function, and pthread_once_t is used
to avoid duplicate initialization. I didn't merge PROTECT_ERRNO into
NSS_ENTRYPOINT_BEGIN because UNPROTECT_ERRNO is called in a bunch of places
and it would feel strange to have PROTECT_ERRNO hidden, but not UNPROTECT_ERRNO.
The most interesting stuff in this module is the varlink messages, and any
potential errors in json. So let's enable json logging when debug messages are
enabled.
With those changes, figuring out the issue in
https://github.com/systemd/systemd/pull/17823 is trivial:
$ LD_LIBRARY_PATH=build/ SYSTEMD_LOG_COLOR=1 SYSTEMD_LOG_LOCATION=1 SYSTEMD_LOG_LEVEL=debug getent hosts mirrors.fedoraproject.org
src/shared/varlink.c:237: n/a: varlink: setting state idle-client
src/shared/varlink.c:1240: n/a: Sending message: {"method":"io.systemd.Resolve.ResolveHostname","parameters":{"name":"mirrors.fedoraproject.org","family":10}}
src/shared/varlink.c:240: n/a: varlink: changing state idle-client → calling
src/shared/varlink.c:588: n/a: New incoming message: {"parameters":{"addresses":[{"ifindex":0,"family":10,"address":[42,5,208,20,0,16,120,3,247,116,77,124,226,119,164,87]},{"ifindex":0,"family":10,"address":[42,5,208,28,12,106,204,3,38,58,132,9,185,97,126,2]},{"ifindex":0,"family":10,"address":[38,32,0,82,0,3,0,1,222,173,190,239,202,254,254,215]},{"ifindex":0,"family":10,"address":[38,5,188,128,48,16,6,0,222,173,190,239,202,254,254,217]},{"ifindex":0,"family":10,"address":[38,4,21,128,254,0,0,0,222,173,190,239,202,254,254,209]},{"ifindex":0,"family":10,"address":[38,32,0,82,0,3,0,1,222,173,190,239,202,254,254,214]},{"ifindex":0,"family":10,"address":[38,16,0,40,48,144,48,1,222,173,190,239,202,254,254,211]},{"ifindex":0,"family":10,"address":[32,1,65,120,0,2,18,105,0,0,0,0,0,0,254,210]}],"name":"wildcard.fedoraproject.org","flags":1}}
src/shared/varlink.c:240: n/a: varlink: changing state calling → called
src/shared/varlink.c:240: n/a: varlink: changing state called → idle-client
src/nss-resolve/nss-resolve.c:84: (string):1:40: JSON field 'ifindex' is out of bounds for an interface index.
Jinyuan Si [Fri, 4 Dec 2020 02:38:28 +0000 (10:38 +0800)]
cryptsetup: Fix crypto device missing issue after bootup
Normally, the udev rules operate on "change" events. But when
coldplugging, there's an "add" event present. The udev rules have to
recognize this and do some actions in this particular situation, too.
Also, we don't want the nodes to be created prematurely on "add"
events while not coldplugging. The udev rules will check
DM_UDEV_PRIMARY_SOURCE_FLAG to see if the device was activated
correctly before and if not, it ignore the "add" event totally.
This way the udev rules can support udev triggers generating "add"
events (e.g. "udevadm trigger --action=add" or
"echo add > /sys/block/<dm_device>/uevent").
In this case, the udevd service is started after
systemd-cryptsetup@config.service, is started, which will cause udevd
service to miss the "change" uevent with DM_UDEV_PRIMARY_SOURCE_FLAG
flag generated by systemd-cryptsetup@config.service. To solve this
issue, we let the cryptsetup service be started after the udevd
service.
Back in 5248e7e1f11aba6859de0b28f0dd3778b22842f2 (July 2017) we moved over to
"_gateway", with the old name declared to be temporary measure. Since we're
doing a bunch of changes to resolved now, it seems to be a good moment to make
this simplification and not add support for the compat name in new code.
seccomp: don't install filters for archs that can't use syscalls
When seccomp_restrict_archs is called, architectures that are blocked
are replaced by the SECCOMP_LOCAL_ARCH_BLOCKED marker so that they are
not disabled again and filters are not installed for them.
This can make some service that use SystemCallArchitecture= and
SystemCallFilter= start faster.
Vito Caputo [Thu, 3 Dec 2020 06:11:23 +0000 (22:11 -0800)]
mmap-cache: bind prot(ection) to MMapFileDescriptor
There are no mmap_cache_get() users that actually deviate prot
from the JournalFile's f->prot.
So there's no point in making this a separate parameter to
mmap_cache_get(), nor is there any need to store it in
JournalFile's f->prot.
Instead just pass it to mmap_cache_add_fd() at MMapFileDescriptor
creation, storing it in there for the mmap() callers, which
already receive MMapFileDescriptor *.
For functions receiving both an MMapFileDescriptor * and prot,
the prot argument has been simply removed and call sites updated.
Formalizing this fd:prot binding at the public API also enables
discarding the prot check in window_matches(), which is a hot
function on long window lists, so a minor CPU efficiency gain
should be had there as seen with the past removal of the fd
check. Unnoticable for uncached journals, but maybe a little
runtime improvement when cached in specific circumstances.
window_matches_fd() has also been simplified to treat the
MMapFileDescrptor * as equivalent to its fd and prot.
E.g. in nss-resolve it is still useful to print the location of the error:
src/test/test-nss.c:231: dlsym(0x0x1dc6fb0, _nss_resolve_gethostbyname2_r) → 0x0x7fdbfc53f626
(string):1:40: JSON field ifindex is out of bounds for an interface index.
I opted to use a partially duplicated if condition to avoid nesting. It's nice
to have the log calls vertically aligned. The compiler will optimize this nicely.
Takashi Iwai [Wed, 9 Dec 2020 09:56:51 +0000 (10:56 +0100)]
udev: Fix sound.target dependency
The recent bug report indicated a race at device creation and the
sound.target dependencies, and the cause turned out to be the condition
of the sound.target trigger. Currently it's set for "card*", but this
is actually the parent object; i.e. the sound.target is triggered before
the sound devices are created.
For assuring the whole sound device creations beforehand, we need to use
"controlC*" instead of "card*"; as already described in
78-sound-card.rules, this is guaranteed to be the last device, and can
be used as a synchronization point.
On really old kernels (< 4.14+) a bug in /proc/1/sched handling in the
kernel could be used to determine whether we are running in a PID
namespace. This hasn't worked for a long time, and there's little point
in making things work on old kernels we can't make work on current
kernels, hence let's drop that old cruft.
Daan De Meyer [Sun, 6 Dec 2020 11:42:45 +0000 (11:42 +0000)]
mkosi: Add gdb to final images
Let's add a debugger to the mkosi images so we can debug coredumps
from inside mkosi qemu VMs (and hopefully in the future from
mkosi systemd-nspawn containers as well).
Daan De Meyer [Mon, 7 Dec 2020 22:18:28 +0000 (22:18 +0000)]
Silence cgroups v1 read-only filesystem warning
Avoid warning messages when booting systemd-nspawn containers and using
hybrid or legacy cgroups. systemd-nspawn mounts the cgroups v1 controller
tree as read-only so these errors are expected and not problematic.
Partially fixes #17862.
```
‣ Processing default...
Spawning container image on /home/daan/projects/systemd/image.raw.
Press ^] three times within 1s to kill container.
systemd 247 running in system mode. (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Welcome to Fedora 33 (Thirty Three)!
Queued start job for default target Graphical Interface.
-.slice: Failed to migrate controller cgroups from , ignoring: Read-only file system
system.slice: Failed to delete controller cgroups /system.slice, ignoring: Read-only file system
[ OK ] Created slice system-getty.slice.
[ OK ] Created slice system-modprobe.slice.
user.slice: Failed to delete controller cgroups /user.slice, ignoring: Read-only file system
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on User Database Manager Socket.
dev-hugepages.mount: Failed to delete controller cgroups /dev-hugepages.mount, ignoring: Read-only file system
Mounting Huge Pages File System...
sys-fs-fuse-connections.mount: Failed to delete controller cgroups /sys-fs-fuse-connections.mount, ignoring: Read-only file system
Mounting FUSE Control File System...
Starting Journal Service...
Starting Remount Root and Kernel File Systems...
system.slice: Failed to delete controller cgroups /system.slice, ignoring: Read-only file system
```
After: `mkosi --default .mkosi/mkosi.fedora boot`
```
‣ Processing default...
Spawning container image on /home/daan/projects/systemd/mkosi.output/image.raw.
Press ^] three times within 1s to kill container.
systemd 247 running in system mode. (+PAM +AUDIT +SELINUX -APPARMOR +IMA +SMACK +SECCOMP +GCRYPT +GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC +KMOD +LIBCRYPTSETUP +LIBFDISK +PCRE2 +PWQUALITY +P11KIT +QRENCODE +BZIP2 +LZ4 +XZ +ZLIB +ZSTD +XKBCOMMON +UTMP +SYSVINIT default-hierarchy=unified)
Detected virtualization systemd-nspawn.
Detected architecture x86-64.
Welcome to Fedora 33 (Thirty Three)!
Queued start job for default target Graphical Interface.
[ OK ] Created slice system-getty.slice.
[ OK ] Created slice system-modprobe.slice.
[ OK ] Created slice User and Session Slice.
[ OK ] Started Dispatch Password Requests to Console Directory Watch.
[ OK ] Started Forward Password Requests to Wall Directory Watch.
[ OK ] Reached target Local Encrypted Volumes.
[ OK ] Reached target Paths.
[ OK ] Reached target Remote File Systems.
[ OK ] Reached target Slices.
[ OK ] Reached target Swap.
[ OK ] Listening on Process Core Dump Socket.
[ OK ] Listening on initctl Compatibility Named Pipe.
[ OK ] Listening on Journal Socket (/dev/log).
[ OK ] Listening on Journal Socket.
[ OK ] Listening on User Database Manager Socket.
Mounting Huge Pages File System...
Mounting FUSE Control File System...
Starting Journal Service...
Starting Remount Root and Kernel File Systems...
[ OK ] Mounted Huge Pages File System.
[ OK ] Mounted FUSE Control File System.
[ OK ] Finished Remount Root and Kernel File Systems.
Starting Create Static Device Nodes in /dev...
[ OK ] Finished Create Static Device Nodes in /dev.
[ OK ] Reached target Local File Systems (Pre).
[ OK ] Reached target Local File Systems.
Starting Restore /run/initramfs on shutdown...
[ OK ] Finished Restore /run/initramfs on shutdown.
[ OK ] Started Journal Service.
Starting Flush Journal to Persistent Storage...
[ OK ] Finished Flush Journal to Persistent Storage.
Starting Create Volatile Files and Directories...
[ OK ] Finished Create Volatile Files and Directories.
Starting Network Name Resolution...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Finished Update UTMP about System Boot/Shutdown.
[ OK ] Reached target System Initialization.
[ OK ] Started Daily Cleanup of Temporary Directories.
[ OK ] Reached target Timers.
[ OK ] Listening on D-Bus System Message Bus Socket.
[ OK ] Reached target Sockets.
[ OK ] Reached target Basic System.
Starting Home Area Manager...
Starting User Login Management...
Starting Permit User Sessions...
[ OK ] Finished Permit User Sessions.
[ OK ] Started Console Getty.
[ OK ] Reached target Login Prompts.
Starting D-Bus System Message Bus...
[ OK ] Started D-Bus System Message Bus.
[ OK ] Started Home Area Manager.
[ OK ] Started User Login Management.
[ OK ] Reached target Multi-User System.
[ OK ] Reached target Graphical Interface.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ OK ] Started Network Name Resolution.
[ OK ] Reached target Host and Network Name Lookups.
Fedora 33 (Thirty Three) (built from systemd tree)
Kernel 5.9.11-arch2-1 on an x86_64 (console)
```
test: add test that dlopen()'s all our weak library deps once
This test should ensure we notice if distros update shared libraries
that broke so name, and we still use the old soname.
(In contrast to what the commit summary says, this currently doesn#t
cover really all such deps, specifically xkbcommon and PCRE are missing,
since they currently aren't loaded from src/shared/. This is stuff to
fix later)
Michael Marley [Tue, 8 Dec 2020 02:27:38 +0000 (21:27 -0500)]
manager: Fix HW watchdog when systemd starts before driver loaded
When manager_{set|override}_watchdog is called, set the watchdog timeout
regardless of whether the hardware watchdog was successfully initialized. If
the watchdog was requested but could not be initialized, then instead of
pinging it, attempt to initialize it again. This ensures that the hardware
watchdog is initialized even if the kernel module for it isn't loaded when
systemd starts (which is quite likely, unless it is compiled in).
This builds on work by @danc86 in https://github.com/systemd/systemd/pull/17460,
but fixes the issue of not updating the watchdog timeout with the actual value
from the hardware.
Quit frankly, I am not convinced we should do all this at all. If
close()ing of these input devices is really that slow, then this should
probably be fixed in the kernel, not worked around in userspace like
this.