Yu Watanabe [Sun, 29 Jun 2025 20:18:32 +0000 (05:18 +0900)]
pretty-print: several cleanups for cat_files()
- drop redundant error messages in cat_files(), as cat_file() internally
logs errors,
- show an empty line and the filename before opening each file, so that
any error messages are not mixed up with the previous file,
- drop unnecessary fflush(),
- use RET_GATHER() and continue to show files even if some files cannot
be shown.
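The control flow described in the list above can be sketched as follows. This is a Python illustration only, not the C code from the commit: it assumes RET_GATHER() keeps the first negative result while later calls still run, and the `# path` header format and the helper signatures are hypothetical.

```python
import errno
import sys


def cat_file(path, out):
    """Print one file to `out`; return 0 on success or a negative errno.

    Errors are logged here, so callers do not need to log them again.
    """
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            out.write(f.read())
    except OSError as e:
        print(f"Failed to open {path}: {e.strerror}", file=sys.stderr)
        return -(e.errno or errno.EIO)
    return 0


def cat_files(paths, out):
    """Show every file, continuing past failures; return the first error."""
    ret = 0
    for i, path in enumerate(paths):
        # Separator and header go out before the file is opened, so an
        # error message is attributed to the right file.
        if i > 0:
            out.write("\n")
        out.write(f"# {path}\n")

        r = cat_file(path, out)
        # Assumed RET_GATHER() semantics: remember only the first failure.
        if r < 0 and ret >= 0:
            ret = r
    return ret
```

Calling `cat_files(["a.conf", "missing.conf", "c.conf"], sys.stdout)` would still print the first and third file and return the negative errno of the failed open.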
```
r = cg_create(SYSTEMD_CGROUP_CONTROLLER, test_a);
- if (IN_SET(r, -EPERM, -EACCES, -EROFS)) {
+ if (IN_SET(r, -EPERM, -EACCES, -EROFS, -ENOENT)) {
log_info_errno(r, "Skipping %s: %m", __func__);
return;
}
```
I confirmed that the `ERRNO_IS_NEG_FS_WRITE_REFUSED` macro is equivalent
to checking the first 3 error codes above, so the addition of the check
for `ENOENT` is still just as relevant as it was in 252, but adding it
into the macro would be inconsistent with its name, description, and
possible other uses. Hence, in this PR I'm adding the extra check into
the `if`.
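The reasoning can be illustrated with a small Python sketch. The function names below are hypothetical stand-ins; the only facts taken from the text are that the macro covers EPERM, EACCES and EROFS, and that ENOENT is checked separately in the `if`.

```python
import errno


def errno_is_neg_fs_write_refused(r: int) -> bool:
    """Stand-in for the macro: the file system refused a write."""
    return r in (-errno.EPERM, -errno.EACCES, -errno.EROFS)


def should_skip_test(r: int) -> bool:
    """ENOENT is checked next to the macro, not folded into it, because
    'no such file' is not a refused write."""
    return errno_is_neg_fs_write_refused(r) or r == -errno.ENOENT
```

Keeping the extra check at the call site leaves the macro's meaning intact for its other users.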
Plumbing to perform SELinux checks in varlink API (#38146)
This PR does minimal changes to introduce varlink support. Ideally, the
code should switch to using `mac_selinux_get_our_label()` and new
`mac_selinux_get_peer_label()`. But I leave it for now to minimize
breakage. `mac_selinux_get_peer_label()` remains unused.
This is a prep step to merge
https://github.com/systemd/systemd/pull/38032
Add a small paragraph explaining how a BPF token works, how it is
created, and how it relates to the BPF filesystem.
Move all the relevant documentation into the PrivateBPF= section and
point all the BPFDelegate* options to it.
Introduce ERRNO_IS_FS_WRITE_REFUSED(), and use it in binfmt_mounted() (#38117)
- This introduces ERRNO_IS_FS_WRITE_REFUSED(), and applies it where
applicable.
- This propagates unexpected errors in access_fd() called by binfmt_mounted()
to the caller.
- Renames binfmt_mounted() to binfmt_mounted_and_writable(), as it also
checks the fs is writable.
- Voidifies one disable_binfmt() call in shutdown.c.
To implement --bind-user in systemd-vmspawn, we need a transient
version of these credentials. These are useful when the home directory
of the user is mounted into the container/vm and every trace of the user
will be (mostly) gone again when the container/vm is shut down.
Li Tian [Tue, 8 Jul 2025 06:44:35 +0000 (14:44 +0800)]
Add --entry-type=type1|type2 option to kernel-install.
Both kernel-core and kernel-uki-virt call kernel-install upon removal. An additional argument is needed so that removing one of them does not completely remove both the traditional kernel and the UKI.
Even if we're not using --accept=, it's very useful to be able to
synchronize on systemd-socket-activate having bound to its listen
socket, so let's always send READY=1. This means the payload can't
send READY=1 anymore but it's doubtful whether that's useful in this
case in the first place.
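A caller that wants to synchronize on this could wait for the notification roughly as sketched below. This is an assumption-laden illustration, not code from the commit: it presumes the caller has already bound an AF_UNIX datagram socket and exported its path via $NOTIFY_SOCKET (the usual sd_notify protocol), and `wait_for_ready()` is a hypothetical helper name.

```python
import socket


def wait_for_ready(sock: socket.socket, timeout: float = 5.0) -> bool:
    """Block until a datagram containing the line READY=1 arrives.

    Returns False if no datagram arrives within `timeout` seconds.
    The timeout applies to each receive call, not to the whole wait.
    """
    sock.settimeout(timeout)
    while True:
        try:
            datagram = sock.recv(4096)
        except socket.timeout:
            return False
        # Notification messages are newline-separated KEY=VALUE pairs.
        if b"READY=1" in datagram.split(b"\n"):
            return True
```

Once the helper returns True, the listen socket is bound and a client can connect without racing against startup.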
vmspawn: Use virtio-blk-pci for image instead of virtio-scsi-pci
We don't need a full blown SCSI controller just to present the main
root drive device to the VM. Let's simplify the storage stack by using
virtio-blk-pci instead.
Additionally, virtio-blk-pci is a builtin module in Arch and Fedora
which means we can do qemu direct kernel boot without needing an initrd.
vmspawn: Disable hpet for vmspawn x86 virtual machines
hpet is an emulated clocksource that is generally discouraged in favor
of kvm-clock or tsc for virtual machines. While vmspawn's virtual machines
already use kvm-clock, leaving hpet enabled causes qemu on the host to
consume a non-trivial amount of cpu, so let's disable the hpet feature since
we're not making use of it anyway.
ci: also set TEST_RUNNER environment variable in coverage test
Otherwise, integration-test-wrapper.py will fail.
```
Traceback (most recent call last):
File "/home/runner/work/systemd/systemd/test/integration-tests/integration-test-wrapper.py", line 693, in <module>
main()
~~~~^^
File "/home/runner/work/systemd/systemd/test/integration-tests/integration-test-wrapper.py", line 677, in main
runner = os.environ['TEST_RUNNER']
~~~~~~~~~~^^^^^^^^^^^^^^^
File "<frozen os>", line 717, in __getitem__
KeyError: 'TEST_RUNNER'
```
ukify: fix version detection for aarch64 zboot kernels with gzip or lzma compression
Fixes https://github.com/systemd/systemd/issues/34780. The number in the header
is the size of the *compressed* data, so for gzip we'd read the initial part of
the decompressed data (equal to the size of the compressed data) and not find
the version string. Later on, Fedora switched to zstd compression, and there we
correctly use the number as the size of the compressed data, so we stopped
hitting the issue, but we should still fix it for older kernels.
I verified that the fix works for gzip-compressed kernels. I also made the same
change for the code for lzma compression. I'm pretty sure it is the right thing,
even though I don't have such a kernel at hand to test.
>>> ukify.Uname.scrape('/lib/modules/6.12.0-0.rc2.24.fc42.aarch64/vmlinuz')
Real-Mode Kernel Header magic not found
+ readelf --notes /lib/modules/6.12.0-0.rc2.24.fc42.aarch64/vmlinuz
readelf: Error: Not an ELF file - it has the wrong magic bytes at the start
Found uname version: 6.12.0-0.rc2.24.fc42.aarch64
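The size mix-up can be reproduced with a toy example. The sketch below does not use the real zboot header layout or the actual ukify code: the container format (a 4-byte little-endian compressed size followed by a gzip stream) and all function names are hypothetical, chosen only to show why the number must be applied to the compressed data.

```python
import gzip
import io
import re
import struct

VERSION_RE = re.compile(rb"Linux version (\S+)")


def make_toy_image(payload: bytes) -> bytes:
    """Hypothetical container: compressed size, gzip data, then a trailer."""
    compressed = gzip.compress(payload)
    return struct.pack("<I", len(compressed)) + compressed + b"trailer"


def scrape_version_buggy(image: bytes):
    """Old behaviour: treat the size as a count of decompressed bytes."""
    (size,) = struct.unpack_from("<I", image, 0)
    head = gzip.GzipFile(fileobj=io.BytesIO(image[4:])).read(size)
    m = VERSION_RE.search(head)
    return m.group(1).decode() if m else None


def scrape_version(image: bytes):
    """Fixed behaviour: slice `size` compressed bytes, decompress them all."""
    (size,) = struct.unpack_from("<I", image, 0)
    text = gzip.decompress(image[4:4 + size])
    m = VERSION_RE.search(text)
    return m.group(1).decode() if m else None
```

With a payload whose version string sits far beyond the compressed size, the buggy variant only sees the start of the decompressed data and finds nothing, while the fixed variant finds the string.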
closes #37602, see there for extra motivation and considered
alternatives.
On typical systems, only a few services need to create SUID/SGID files.
This often is limited to the user explicitly setting suid/sgid, the
`systemd-tmpfiles*` services, and the package manager. Allowing a
default to globally restrict creation of suid/sgid files makes it easier
to apply this restriction precisely.
## testing done
- built on aarch64-linux and x86_64-linux
- ran a VM test on x86_64-linux, checking for:
- VM system boots successfully
- defaults apply (`yes`, `no`, and undefined)
- systemd tmpfiles can set suid/sgid on journal log path
- Other services explicitly defining `RestrictSUIDSGID=no` can create
suid files
The recently added test case TEST-07-PID1.subgroup-kill.sh surfaced a
race: if we enumerate PIDs in a cgroup and the cgroup is removed at the
very same time, reading will result in ENODEV. We need to handle that
gracefully. Hence let's do so.
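One way to handle the race is sketched below in Python. The actual fix is in systemd's C code and may differ: treating ENODEV as "the cgroup is gone, no more PIDs" is an assumption, and `read_cgroup_pids()` and its `open_fn` parameter are hypothetical.

```python
import errno


def read_cgroup_pids(path, open_fn=open):
    """Read PIDs from a cgroup.procs-style file, one PID per line.

    If the cgroup is removed while we are reading, the kernel reports
    ENODEV; this sketch treats that as the end of the list.
    """
    pids = []
    try:
        with open_fn(path) as f:
            for line in f:
                line = line.strip()
                if line:
                    pids.append(int(line))
    except OSError as e:
        if e.errno != errno.ENODEV:
            raise  # anything else is still a real error
    return pids
```

Only ENODEV is swallowed, so unrelated failures such as EIO still reach the caller.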
If emergency.target is started while initrd-cleanup.service/start is queued,
the initrd-cleanup job is not canceled. In parallel to the emergency
units, it eventually runs the service, which in turn isolates and starts
initrd-switch-root.target. This stops the emergency units and effectively
starts the initrd boot process again, which likely fails again like the
initial attempt. The system is thus stuck in a loop, never really reaching
emergency.target.
This can be triggered if a service in between initrd-parse-etc.service
and initrd.target fails.
With this conflict added, starting emergency.target automatically cancels
initrd-cleanup.service/start, avoiding the loop.
Add a new option `PrivateBPF=` to mount a private instance of bpffs.
Also add four configuration options
`BPFDelegate{Commands,Maps,Programs,Attachments}=` which set the
corresponding bpffs mount options in order to create BPF tokens:
https://lwn.net/Articles/947173/
Matteo Croce [Thu, 15 May 2025 14:32:46 +0000 (16:32 +0200)]
core: add options to delegate BPFFS token creation
Add four new options BPFDelegate{Commands,Maps,Programs,Attachments}=
in order to delegate to a BPFFS instance the permission to create tokens.
The value is a list of options taken from:
https://github.com/torvalds/linux/blob/v6.14/include/uapi/linux/bpf.h#L922-L1121
The special value "any" allows every possible value.
More information about BPF tokens is available here:
https://lwn.net/Articles/947173/
Matteo Croce [Fri, 27 Jun 2025 12:17:00 +0000 (14:17 +0200)]
core: Introduce PrivateBPF= to mount a private BPFFS
Add a new option PrivateBPF= to mount a new instance of bpffs within a
namespace.
PrivateBPF= can be set to "no" to use the host bpffs in read-only mode
and "yes" to do a new mount.
The mount is done with the new fsopen()/fsmount() API because in the future
we'll hook some commands between the two calls.
I added them in 41afb5eb7214727301132aedc381831fbfc78e37 without too
much explanation. Most likely the idea was to get rid of unused code
in libsystemd.so [1]. But now that I'm testing this, it doesn't seem
to have an effect. LTO is needed to get rid of unused functions, and
it's enough to have LTO without those options. Those options might have
some downsides [2], so let's disable them since there are doubts and no
particularly good reason to have them.
But keep the -Wl,--gc-sections option. Without this, libsystemd.so
grows a little:
-rwxr-xr-x 1 zbyszek zbyszek 5532424 07-08 13:24 build/libsystemd.so.0.40.0-orig
-rwxr-xr-x 1 zbyszek zbyszek 5614472 07-08 13:26 build/libsystemd.so.0.40.0-no-sections
-rwxr-xr-x 1 zbyszek zbyszek 5532392 07-08 13:27 build/libsystemd.so.0.40.0
Let's apply the --gc-sections option always to make the debug and final
builds more similar.
We need to verify that distro packages don't unexpectedly grow after this.
We put the name of the variable in the message, but it is a local variable
and the name does not have global meaning. We end up with pointless copies
of the error string:
$ strings build/libsystemd.so.0.40.0 | grep 'big enough'
xsprintf: p[] must be big enough
xsprintf: error[] must be big enough
xsprintf: prefix[] must be big enough
xsprintf: pty[] must be big enough
xsprintf: mode[] must be big enough
xsprintf: t[] must be big enough
xsprintf: s[] must be big enough
xsprintf: spid[] must be big enough
xsprintf: header_priority[] must be big enough
xsprintf: header_pid[] must be big enough
xsprintf: path[] must be big enough
xsprintf: buf[] must be big enough
The error message already shows the file, line, and function name, which
is enough to identify the problem:
Assertion 'xsprintf: buffer too small' failed at src/test/test-string-util.c:20, function test_xsprintf(). Aborting.
Merge shared/exec-directory-util.? into basic/unit-def.?
Suggested in
https://github.com/systemd/systemd/pull/35892#discussion_r2180322856.
This is a tiny amount of code and does not warrant having a separate file
and spawning a separate instance of the compiler during the build.
Note: it took me a while to confirm that the contents of that table and
function don't end up in libsystemd.so. The issue is that they _are_ present in
it, unless LTO is used. We actually use link_whole[libbasic_static] for
libsystemd, so we end up with all that code there. LTO is needed to clean
that up.
As is usually the case, the bitfields don't create the expected space savings,
because the field that follows needs to be aligned. But we don't want to fully
drop the bitfields here, because then ConditionType and ConditionResult are
each 4 bytes, and the whole struct grows from 32 to 40 bytes (on amd64). We
potentially have lots of little Conditions and that'd waste some memory.
Make each of the four fields one byte. This still allows the compiler to
generate simpler code without changing the struct size: