git.ipfire.org Git - thirdparty/systemd.git/log

Ivan Kruglov [Fri, 15 May 2026 14:02:15 +0000 (07:02 -0700)]

test: add integration tests for io.systemd.Job varlink methods

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Mon, 18 May 2026 07:45:23 +0000 (00:45 -0700)]

test: wait for systemd to finish reexec in TEST-74-AUX-UTILS.varlinkctl.sh

commit | commitdiff | tree

Ivan Kruglov [Fri, 15 May 2026 14:01:43 +0000 (07:01 -0700)]

test: split TEST-74-AUX-UTILS.varlinkctl.sh into per-interface subtests

Split the monolithic varlinkctl test script into separate files per varlink interface for better organization and easier maintenance:
- varlinkctl.sh: core varlinkctl tool tests (CLI, transports, socket discovery, upgrade/serve) and io.systemd.Manager
- varlinkctl-network.sh: io.systemd.Network
- varlinkctl-unit.sh: io.systemd.Unit (system + user manager)
- varlinkctl-metrics.sh: io.systemd.Metrics

No functional changes — the test content is moved as-is.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Fri, 15 May 2026 14:01:27 +0000 (07:01 -0700)]

core: introduce io.systemd.Job interface with List, Cancel, and ClearAll methods

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Fri, 15 May 2026 14:00:07 +0000 (07:00 -0700)]

shared: extend Job varlink type with Unit and ActivationDetails fields

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Fri, 15 May 2026 14:13:35 +0000 (07:13 -0700)]

json-util: add JSON_BUILD_PAIR_ENUM_NON_EMPTY macro

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Fri, 15 May 2026 10:16:00 +0000 (03:16 -0700)]

json-util: add json_dispatch_job_id() dispatcher for job IDs

Job IDs are uint32_t values that are always >= 1 (the manager's ID counter starts at 1 and wraps from UINT32_MAX back to 1, never assigning 0). Add a dedicated dispatch function that validates this constraint, rejecting 0 and treating null as "unset" (0).
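
The validation rule described above can be modelled in a few lines. This is only a Python sketch of the stated constraint, not the C dispatcher itself; the function names `dispatch_job_id()` and `next_job_id()` are hypothetical stand-ins.

```python
UINT32_MAX = 2**32 - 1

def dispatch_job_id(value):
    """Validate a job ID taken from JSON; None (JSON null) maps to 0, meaning "unset"."""
    if value is None:
        return 0
    # bool is a subclass of int in Python, so reject it explicitly
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError("job ID must be an unsigned integer")
    if value < 1 or value > UINT32_MAX:
        raise ValueError("job ID out of range: %r" % (value,))
    return value

def next_job_id(current):
    """Advance the ID counter, wrapping from UINT32_MAX back to 1 so 0 is never assigned."""
    return 1 if current >= UINT32_MAX else current + 1
```

Keeping 0 out of the valid range is what lets it double as the "unset" marker for a null input.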

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Luca Boccassi [Fri, 15 May 2026 09:20:56 +0000 (10:20 +0100)]

Implement Service Context/Runtime for io.systemd.Unit.List (#42098)

The PR implements the following objects + tests for
io.systemd.Unit.List:
- ServiceContext
- ServiceRuntime

It's hopefully the last PR of the long sequence of:

* https://github.com/systemd/systemd/pull/37432
* https://github.com/systemd/systemd/pull/37646
* https://github.com/systemd/systemd/pull/38032
* https://github.com/systemd/systemd/pull/38212
* https://github.com/systemd/systemd/pull/39391
* https://github.com/systemd/systemd/pull/41980
* https://github.com/systemd/systemd/pull/42057

commit | commitdiff | tree

Ivan Kruglov [Thu, 14 May 2026 16:41:50 +0000 (09:41 -0700)]

test: add ServiceContext/Runtime enum and integration tests

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Thu, 14 May 2026 16:41:39 +0000 (09:41 -0700)]

core: expand ServiceContext and add ServiceRuntime for io.systemd.Unit.List

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Ivan Kruglov [Thu, 14 May 2026 16:41:24 +0000 (09:41 -0700)]

shared: add exec_command_status_build_json() and ExecCommandStatus varlink type to common

Add exec_command_status_build_json() and exec_command_status_list_build_json() to varlink-common, alongside exec_command_build_json() and exec_command_list_build_json(). The status list function is the runtime counterpart of the command list function — the two arrays are positionally aligned so index N in the status array corresponds to index N in the command array. Commands that have not yet run produce null entries to preserve alignment.

Add the ExecCommandStatus varlink struct type to varlink-idl-common next to ExecCommand. It contains PID, timestamps, and mutually exclusive ExitStatus (int, for normal exit) / ExitSignal (string, for signal kill).
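
A rough Python model of the alignment rule follows. The field names PID, ExitStatus and ExitSignal come from the description above; the shape of the command dictionaries and the helper names are assumptions made for illustration only.

```python
def build_status(pid, exit_status=None, exit_signal=None):
    """Build one status entry; ExitStatus and ExitSignal are mutually exclusive."""
    if exit_status is not None and exit_signal is not None:
        raise ValueError("ExitStatus and ExitSignal are mutually exclusive")
    entry = {"PID": pid}
    if exit_status is not None:
        entry["ExitStatus"] = exit_status   # int, normal exit
    if exit_signal is not None:
        entry["ExitSignal"] = exit_signal   # string, killed by signal
    return entry

def build_status_list(commands):
    """Return one entry per command; commands that have not run yet yield None (JSON null)."""
    return [cmd.get("status") for cmd in commands]
```

Because a null is emitted for every command that has not run, index N of the result always refers to command N.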

commit | commitdiff | tree

Yu Watanabe [Fri, 15 May 2026 06:33:11 +0000 (15:33 +0900)]

TODO: fix typo

commit | commitdiff | tree

Yu Watanabe [Fri, 15 May 2026 06:04:02 +0000 (15:04 +0900)]

sd-journal: update comments

commit | commitdiff | tree

Michal Sekletar [Wed, 13 May 2026 14:20:55 +0000 (16:20 +0200)]

core: make manager event loop rate limit configurable

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Luca Boccassi [Thu, 14 May 2026 20:06:15 +0000 (21:06 +0100)]

ci: switch SUSE mkosi mirror to cdn.o.o

The cdn mirror is preferred by SUSE for clouds/CIs. There have been issues with some
mirrors, which fail to download from GHA quite often lately, so hopefully this will
make it reliable again.

commit | commitdiff | tree

Aleksa Sarai [Thu, 14 May 2026 10:15:06 +0000 (17:15 +0700)]

sysupdate: mkdir_parents CurrentSymlink= path

This was missing in the CurrentSymlink= creation path, and leads to
partially-broken update installs.

Signed-off-by: Aleksa Sarai <aleksa@amutable.com>

commit | commitdiff | tree

Philip Withnall [Sun, 3 May 2026 21:36:32 +0000 (22:36 +0100)]

test: Add a sysupdate test for files which are a prefix match of each other

This tests whether the pattern matching code checks it’s matched the
whole string and not just a prefix (see commit 4ffb60319b).

In particular, this tests a setup which KDE currently use in their
sysupdate images, where two regular file transfers are done, one of a
`foo.erofs` file, and the other `foo.erofs.caibx`. As one is a prefix of
the other, they were hitting this bug.
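
The class of bug can be illustrated with ordinary regular expressions. Note that sysupdate uses its own pattern syntax, so the regex below is only a stand-in, and both helper names are hypothetical.

```python
import re

def matches_prefix_only(pattern, name):
    """Buggy variant: accepts any name that merely starts with a match."""
    return re.match(pattern, name) is not None

def matches_whole(pattern, name):
    """Fixed variant: the pattern has to consume the entire name."""
    return re.fullmatch(pattern, name) is not None

PATTERN = r"foo\.erofs"
```

With the prefix-only check, `foo.erofs.caibx` is wrongly claimed by the transfer that is meant to handle `foo.erofs`; the whole-string check keeps the two transfers apart.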

See:
- https://files.kde.org/kde-linux/sysupdate/v2/
- https://github.com/KDE/kde-linux/tree/master/mkosi.extra/usr/lib/sysupdate.d

Signed-off-by: Philip Withnall <pwithnall@gnome.org>
Fixes: https://github.com/systemd/systemd/issues/38605
Fixes: https://github.com/systemd/systemd/issues/41288

commit | commitdiff | tree

r-vdp [Mon, 13 Apr 2026 12:41:11 +0000 (14:41 +0200)]

sd-radv: do not stop on transient send errors

When the periodic RA timer fires, any error returned by sendmsg()
currently propagates up through sd_radv_send() into radv_timeout(),
which then calls sd_radv_stop(). The RA engine is never started again
until the next carrier transition.

On an 802.3ad bond there is a window right after carrier-up where the
link is administratively up but no aggregator has been selected yet, so
sendmsg() returns ENOBUFS. If the very first RA after a flap lands in
that window, radv stops permanently and all clients lose their SLAAC
addresses, on-link/PD prefixes, and default router once the previously
advertised lifetimes expire, while IPv4 keeps working, leading to a very
confusing situation with v4 up and v6 down.

Handle this the same way solicited RAs already do (see
radv_process_packet()): log the failure and reschedule the timer instead
of giving up. ra_sent is left untouched on failure so we stay in the
fast initial-advertisement regime until a send actually succeeds.
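
The control flow after the fix might be modelled roughly as below. This is a Python sketch, not the sd-radv code: the class, its attributes, and the `send` callable are hypothetical, and only the behaviour (log, keep running, re-arm, leave ra_sent alone on failure) is taken from the description above.

```python
import errno
import os

class RadvTimer:
    def __init__(self, send):
        self.send = send      # callable that raises OSError when sendmsg() would fail
        self.ra_sent = 0      # number of successfully sent advertisements
        self.running = True
        self.rearmed = 0      # how often the periodic timer was rescheduled
        self.log = []

    def on_timeout(self):
        try:
            self.send()
        except OSError as e:
            # Transient failure: log it and carry on instead of stopping the engine.
            self.log.append("send failed, ignoring: %s" % os.strerror(e.errno))
        else:
            self.ra_sent += 1
        self.rearmed += 1     # reschedule whether or not the send succeeded
```

Since ra_sent only advances on success, a failed first send leaves the engine in its initial fast-advertisement phase.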

commit | commitdiff | tree

r-vdp [Mon, 13 Apr 2026 17:03:27 +0000 (19:03 +0200)]

network: honour static IPv6LL addresses in network_adjust_*()

link_radv_enabled() and link_ndisc_enabled() use
link_ipv6ll_enabled_harder(), which considers a static fe80:: address
in [Address] sufficient to run radv/ndisc even when LinkLocalAddressing=
(or IPv6LinkLocalAddressGenerationMode=none, which network_verify()
folds into the same flag) disables the kernel-generated link-local.

network_adjust_radv()/ndisc()/dhcp() however only check the raw
link_local flag and zero router_prefix_delegation / ndisc / dhcp&IPV6
at parse time, so the runtime gate never gets a chance to fire.

Factor the static-LL lookup out of link_ipv6ll_enabled_harder() into a
Network-level helper and use it in the three network_adjust_*()
functions, bringing parse-time and runtime behaviour in line.
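
As an illustration of the intended check, here is a small Python sketch. The helper names and the flat list of address strings are assumptions; the real code works on the parsed [Address] sections of a Network object.

```python
import ipaddress

def network_has_static_ipv6ll(addresses):
    """Return True if any statically configured address is an IPv6 link-local (fe80::/10) one."""
    for a in addresses:
        ip = ipaddress.ip_interface(a).ip
        if ip.version == 6 and ip.is_link_local:
            return True
    return False

def ipv6ll_available(link_local_ipv6, addresses):
    """Parse-time gate: kernel-generated link-local is enabled, or a static one is configured."""
    return link_local_ipv6 or network_has_static_ipv6ll(addresses)
```

Using the same predicate at parse time and at runtime is what keeps the two code paths from disagreeing.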

commit | commitdiff | tree

我超厉害 [Thu, 14 May 2026 17:51:29 +0000 (01:51 +0800)]

sd-device: use ERRNO_IS_NEG_DEVICE_ABSENT() for device-id load failures (#41764)

Device enumeration may encounter transient errors such as ENXIO when devices
appear or disappear concurrently. These conditions represent expected "device absent"
races and should be treated uniformly across the enumeration logic.

This change replaces the ENODEV-specific check with ERRNO_IS_NEG_DEVICE_ABSENT(),
ensuring that all expected disappearance conditions are handled consistently.
Unexpected errors are still propagated, while expected races are ignored without
aborting the enumeration.
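
A Python sketch of the resulting enumeration logic follows. The exact set of errno values covered by ERRNO_IS_NEG_DEVICE_ABSENT() is assumed here to be ENODEV, ENXIO and ENOENT; the loader callback and the function names are hypothetical.

```python
import errno
import os

# Assumption: the errno values treated as "device absent".
DEVICE_ABSENT = frozenset({errno.ENODEV, errno.ENXIO, errno.ENOENT})

def errno_is_neg_device_absent(r):
    """r follows the negative-errno convention: < 0 on error, >= 0 on success."""
    return r < 0 and -r in DEVICE_ABSENT

def enumerate_devices(candidates, load_device_id):
    found = []
    for name in candidates:
        r = load_device_id(name)
        if errno_is_neg_device_absent(r):
            continue                      # expected race: the device went away
        if r < 0:
            raise OSError(-r, os.strerror(-r))   # unexpected errors still propagate
        found.append(name)
    return found
```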

commit | commitdiff | tree

Yu Watanabe [Thu, 14 May 2026 17:47:23 +0000 (02:47 +0900)]

A few more conversions of options and verbs (#41795)

I had those prepared before but I didn't submit them because the
automatic layout didn't work well. In two cases now the sync of widths
between verbs and options is disabled and one case is left with the
automatic alignment. I think it's good enough to merge.

commit | commitdiff | tree

Ivan Kruglov [Thu, 14 May 2026 16:41:09 +0000 (09:41 -0700)]

core: move service_context_build_json() to varlink-service.c

Move the existing (partial) service context builder from varlink-unit.c into its own varlink-service.c file, following the pattern established by other unit type context builders (varlink-path.c, varlink-scope.c, etc.). No functional change.

commit | commitdiff | tree

Yu Watanabe [Thu, 14 May 2026 15:38:19 +0000 (00:38 +0900)]

meson: don't use Python module for host Python (#41959)

Checking for pefile required that module to be made available for the
Python used to build systemd, even though it's only used at runtime,
potentially via a different Python installation.

Furthermore, Meson's Python module doesn't do the right thing when cross
compiling and looking up a Python for the host system, so this would end
up uselessly checking whether the build Python had the pefile module,
which is not needed. Even if it were made to check the host Python using
find_program, it still relies on being able to run its Python, which in
a cross scenario it probably wouldn't be able to do.

All in all, this check does more harm than good, and prevents building
ukify in valid configurations, so remove it.

commit | commitdiff | tree

noxiouz [Tue, 17 Mar 2026 23:55:51 +0000 (23:55 +0000)]

coredump: add JSON output support to coredumpctl info

Implement support for the --json= flag in the info subcommand
(issue #38844). Previously, coredumpctl info always produced
human-readable text output regardless of --json=.

Add a CoredumpFields struct that holds all journal fields extracted
for a coredump entry, along with coredump_fields_done() to release
member resources and coredump_fields_load() to populate the struct
from a journal entry. Both print_info() and the new print_info_json()
use this shared loader, eliminating the duplicate RETRIEVE loop.

print_info_json() builds a JSON object with the same fields shown by
print_info(). Missing fields are omitted via SD_JSON_BUILD_PAIR_CONDITION,
matching the tolerant behavior of print_info() rather than skipping the
entry entirely. Signal/Reason handling mirrors print_info(): normal
coredumps (MESSAGE_ID == SD_MESSAGE_COREDUMP_STR) emit a numeric Signal
field; non-normal entries (kernel oops, etc.) emit a Reason field with
the raw text from COREDUMP_SIGNAL.
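
The field handling described above could be sketched like this in Python. The journal field names (MESSAGE_ID, COREDUMP_PID, COREDUMP_EXE, COREDUMP_SIGNAL) and the Signal/Reason keys follow the text; the other output key names and the placeholder message ID are assumptions, not the actual coredumpctl output format.

```python
# Placeholder standing in for SD_MESSAGE_COREDUMP_STR (assumption, not the real value).
COREDUMP_MESSAGE_ID = "placeholder-coredump-message-id"

def build_info_json(fields):
    """Build the JSON object for one entry, omitting fields that are missing."""
    out = {}
    for journal_key, json_key in (("COREDUMP_PID", "PID"), ("COREDUMP_EXE", "Executable")):
        if journal_key in fields:
            out[json_key] = fields[journal_key]
    sig = fields.get("COREDUMP_SIGNAL")
    if sig is not None:
        if fields.get("MESSAGE_ID") == COREDUMP_MESSAGE_ID:
            out["Signal"] = int(sig)   # normal coredump: numeric signal
        else:
            out["Reason"] = sig        # kernel oops etc.: raw text
    return out
```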

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Daan De Meyer [Mon, 11 May 2026 19:58:24 +0000 (21:58 +0200)]

btrfs: Beef up btrfs_subvol_make()

Let's make sure we handle AT_FDCWD and XAT_FDROOT
properly by using xopenat().

commit | commitdiff | tree

Daan De Meyer [Tue, 12 May 2026 13:03:49 +0000 (13:03 +0000)]

nspawn: split boot parameters into env vars and argv

When the kernel hands the command line to PID 1, any KEY=VALUE assignment
whose KEY does not contain a '.' is exported as an environment variable
(with '-' replaced by '_') rather than passed as an argument. Mimic the
same split in --boot mode so kernel-cmdline-style arguments passed after
the container path behave as they would on a real boot.
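
The split rule can be written down compactly. The following Python sketch only mirrors the rule as stated in this message; the function name is hypothetical and the real implementation lives in nspawn's C code.

```python
def split_boot_parameters(params):
    """Split kernel-cmdline-style parameters into (environment, argv)."""
    env, argv = {}, []
    for p in params:
        key, sep, value = p.partition("=")
        if sep and key and "." not in key:
            # KEY=VALUE without a '.' in KEY becomes an environment variable
            env[key.replace("-", "_")] = value
        else:
            # everything else is passed through as an argument
            argv.append(p)
    return env, argv
```

For example, `systemd.unit=rescue.target` stays in argv because its key contains a '.', while `foo-bar=1` is exported as `foo_bar=1`.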

commit | commitdiff | tree

Yu Watanabe [Thu, 14 May 2026 14:29:47 +0000 (23:29 +0900)]

Implement Socket Context/Runtime for io.systemd.Unit.List (#42057)

The PR implements the following objects + tests for
io.systemd.Unit.List:
- SocketContext
- SocketRuntime

It's a continuation of the following PRs:

* https://github.com/systemd/systemd/pull/37432
* https://github.com/systemd/systemd/pull/37646
* https://github.com/systemd/systemd/pull/38032
* https://github.com/systemd/systemd/pull/38212
* https://github.com/systemd/systemd/pull/39391
* https://github.com/systemd/systemd/pull/41980

commit | commitdiff | tree

Frantisek Sumsal [Thu, 14 May 2026 11:05:02 +0000 (13:05 +0200)]

profile: bail out early if promptvars is disabled

We need promptvars, otherwise the prompt strings won't undergo parameter
expansion and we'd print them literally:

$ shopt -u promptvars
$ echo foo
$(__systemd_osc_context_ps0)foo

Resolves: #40620

commit | commitdiff | tree

Emanuele Rocca [Thu, 14 May 2026 11:31:24 +0000 (13:31 +0200)]

test-fs-util: check for CAP_DAC_OVERRIDE in xopenat_auto_rw_ro

When running test_xopenat_auto_rw_ro under a non-root user with the
CAP_DAC_OVERRIDE capability, the test currently fails.

As the comment already says, root bypasses mode bits via CAP_DAC_OVERRIDE so
let's check for that instead of the effective user ID.
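
For illustration, a capability check of this kind can be done from the CapEff line of /proc/self/status. This Python sketch is not what the C test does (it uses systemd's own capability helpers); the function name is hypothetical, and bit index 1 for CAP_DAC_OVERRIDE is taken from linux/capability.h.

```python
CAP_DAC_OVERRIDE = 1  # bit index in the capability mask

def have_effective_cap(status_text, cap):
    """Check whether capability bit `cap` is set in the CapEff line of a /proc/<pid>/status dump."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            mask = int(line.split()[1], 16)
            return bool((mask >> cap) & 1)
    return False
```

On Linux this could be called as `have_effective_cap(open("/proc/self/status").read(), CAP_DAC_OVERRIDE)`, which also returns True for a non-root user that holds the capability.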

Signed-off-by: Emanuele Rocca <emanuele.rocca@arm.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Thu, 23 Apr 2026 19:43:19 +0000 (21:43 +0200)]

various: fix duplicated logging from parse_path_argument

As pointed out in review, parse_path_argument can fail for non-oom reasons.
But the function already logs, so the correct thing to do is to just
propagate the error.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Tue, 14 Apr 2026 10:56:01 +0000 (12:56 +0200)]

busctl: convert to the new option and verb parsers

The conversion doesn't work great, because some of the verbs take many
arguments and the first column is extremely wide. So similarly to
kernel-install, I dropped the sync of column widths. This allows the
help for options to use most of the available space.

-C/--capsule is now documented, fixup for
00431b2b66cb59540deda4ea018170a289673585.

Verb functions are renamed to match verb names.

The missing first param is added to the synopsis of "wait".
It now matches the man page.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Tue, 14 Apr 2026 14:42:41 +0000 (16:42 +0200)]

busctl: reorder option cases to match --help output

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Thu, 16 Apr 2026 09:38:06 +0000 (11:38 +0200)]

localectl: convert to the new option and verb parsers

The verb synopses are long, so they got broken up:
===============================================================================
> localectl [OPTIONS…] COMMAND …

Query or change system locale and keyboard settings.

Commands:
  [status]                      Show current locale settings
  set-locale LOCALE...          Set system locale
  list-locales                  Show known locales
  set-keymap MAP [MAP]          Set console and X11 keyboard mappings
  list-keymaps                  Show known virtual console keyboard mappings
  set-x11-keymap LAYOUT [MODEL  Set X11 and console keyboard mappings
    [VARIANT [OPTIONS]]]
  list-x11-keymap-models        Show known X11 keyboard mapping models
  list-x11-keymap-layouts       Show known X11 keyboard mapping layouts
  list-x11-keymap-variants      Show known X11 keyboard mapping variants
    [LAYOUT]
  list-x11-keymap-options       Show known X11 keyboard mapping options

Options:
  -h --help                     Show this help
     --version                  Show package version
  -l --full                     Do not ellipsize output
     --no-pager                 Do not start a pager
     --no-ask-password          Do not prompt for password
  -H --host=[USER@]HOST         Operate on remote host
  -M --machine=CONTAINER        Operate on local container
     --no-convert               Don't convert keyboard mappings

See the localectl(1) man page for details.
===============================================================================

But I think this is OK. Everything is readable. On a more normal terminal,
everything fits nicely.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Thu, 16 Apr 2026 08:25:49 +0000 (10:25 +0200)]

kernel-install: convert to the new option and verb parsers

The verb synopses are very long because of the many parameters.
Previously they were shown without help and occupied all available columns.
With the autogenerated help format, this doesn't work great. So the
verbs and options tables are not synced, so that help for options can
use more columns. I think in this case this is better than the
alternatives.

Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>

commit | commitdiff | tree

Luca Boccassi [Thu, 14 May 2026 12:10:13 +0000 (13:10 +0100)]

cgroup: Add CPUSetPartition= setting (#42013)

Add support for configuring cpuset partition type via the
CPUSetPartition= unit file setting. This controls the kernel's
cpuset.cpus.partition cgroup attribute.

The setting takes one of "member", "root", or "isolated". This is
useful for real-time workloads that require dedicated CPU resources
without interference from other processes.

When set, systemd will write the partition type to the
cpuset.cpus.partition cgroup file. If the kernel rejects the value
(e.g., due to partition hierarchy rules), a warning is logged and the
unit continues with the kernel's default partition type.
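
The apply step might look roughly like the following Python model. The three accepted values and the cpuset.cpus.partition file name come from the text above; the function name, the return convention, and the log wording are assumptions.

```python
import logging
import os

VALID_PARTITIONS = ("member", "root", "isolated")

def apply_cpuset_partition(cgroup_dir, value):
    """Write the partition type; on failure log a warning and carry on (returns False)."""
    if value not in VALID_PARTITIONS:
        raise ValueError("invalid CPUSetPartition= value: %r" % (value,))
    path = os.path.join(cgroup_dir, "cpuset.cpus.partition")
    try:
        with open(path, "w") as f:
            f.write(value)
    except OSError as e:
        # Not fatal: the unit keeps the kernel's default partition type.
        logging.warning("Failed to set %s to %s, ignoring: %s", path, value, e)
        return False
    return True
```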

Co-developed-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Thu, 14 May 2026 09:51:56 +0000 (11:51 +0200)]

shared/options: introduce OPTION_COMMON_{ENTRY_TOKEN,MAKE_ENTRY_DIRECTORY}

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Thu, 14 May 2026 09:23:40 +0000 (11:23 +0200)]

shared/verbs: split verbs in two lines when the synopsis is > 25 characters

The help tests would not pass because in cases where the verb synopsis
is very long, we'd format the table badly if the terminal is fairly
narrow. I experimented with a few solutions, but overall, it's hard to
achieve very good layout with the automatic formatting. I think the
approach in this commit works the best: we end up with a two- or
three-line verb synopsis, which is similar to what we did manually
before.

$ COLUMNS=80 build/localectl -h
...
Commands:
  [status]                      Show current locale settings
  set-locale LOCALE...          Set system locale
  list-locales                  Show known locales
  set-keymap MAP [MAP]          Set console and X11 keyboard mappings
  list-keymaps                  Show known virtual console keyboard mappings
  set-x11-keymap LAYOUT [MODEL  Set X11 and console keyboard mappings
    [VARIANT [OPTIONS]]]
  list-x11-keymap-models        Show known X11 keyboard mapping models
  list-x11-keymap-layouts       Show known X11 keyboard mapping layouts
  list-x11-keymap-variants      Show known X11 keyboard mapping variants
    [LAYOUT]
  list-x11-keymap-options       Show known X11 keyboard mapping options

I think that almost nobody actually uses an 80 column terminal, and if
they do, they probably don't spend too much time looking at our --help
output there. So the goal here is to do something reasonable and robust
and get the tests to pass.

We can use strjoina here because the strings are fully under our
control.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Fri, 24 Apr 2026 11:02:25 +0000 (13:02 +0200)]

shared/format-table: shorten code a bit

Define variables at point of initialization so the whole thing is easier
to read.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Thu, 14 May 2026 09:33:09 +0000 (11:33 +0200)]

Convert systemctl to option and verb macros (#42088)

This one was non-trivial, so it'd benefit from a close review.

commit | commitdiff | tree

Luca Boccassi [Thu, 14 May 2026 08:57:00 +0000 (09:57 +0100)]

export system memory and number of cpus in basic metrics (#42076)

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 22:34:03 +0000 (00:34 +0200)]

fuzz-systemctl-parse-argv: add two corpus files to test compat parsers

Looking at the corpus examples, I'm not sure the fuzzer even went into
the compat parsers. None of the files have argv[0] that'd cause
invoked_as() to go into the compat paths. So add the files to provide
a quick test and possibly bias the fuzzer search into the right
direction.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 22:21:25 +0000 (00:21 +0200)]

fuzz-systemctl-parse-argv: update suppression of logging and resetting of state

There's certainly more than one way to skin this particular cat,
so I'm keeping this as a separate commit.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 21:29:46 +0000 (23:29 +0200)]

shared/options: implement the equivalent of 'opterr'

All log messages during option parsing are emitted using log_full,
and the level is set as LOG_ERR + state->log_level_shift. The default
shift is 0, but if set to e.g. 4, we log at LOG_DEBUG, and if set
to 5 or higher, logging is effectively suppressed. (Unless compiled
with LOG_TRACE, in which case it is suppressed if the shift is set to 6
or higher.) So this gives something like 'opterr', except without
global state and potentially more flexible.
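
The arithmetic can be checked with a tiny model. The syslog values LOG_ERR = 3 and LOG_DEBUG = 7 are standard; treating LOG_TRACE as level 8 is an assumption, and the helper names are hypothetical.

```python
LOG_ERR = 3
LOG_DEBUG = 7
LOG_TRACE = 8   # assumption: one step more verbose than LOG_DEBUG

def option_log_level(shift):
    """Level used for option-parsing messages: LOG_ERR plus the configured shift."""
    return LOG_ERR + shift

def can_be_logged(shift, most_verbose=LOG_DEBUG):
    """A message can only be emitted if its level does not exceed the most verbose known level."""
    return option_log_level(shift) <= most_verbose
```

A shift of 4 lands exactly on LOG_DEBUG, so such messages only appear when debug logging is enabled; a shift of 5 falls off the end of the scale unless LOG_TRACE is available.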

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 21:02:48 +0000 (23:02 +0200)]

systemctl: convert shutdown_parse_argv to OPTION macros

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 20:56:25 +0000 (22:56 +0200)]

systemctl: convert halt_parse_argv to OPTION macros

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 20:24:14 +0000 (22:24 +0200)]

systemctl: convert verbs to VERB macros

systemctl_main() is moved to systemctl.c to allow fuzz-systemctl-parse-argv
to compile. It needs systemctl_help(), which needs the verb table, with the
expected groups. Once we provide that, the linker needs all the verb_*
functions. So add dummy implementations in fuzz-systemctl-parse-argv to
allow the link to happen.

The alternative would be to provide an empty option table, but that
seems to be more complicated. With the real verb table we can also
simulate parsing of the whole command line with the full verb set, so it
seems better to test with the real verb table.

$ nm build/fuzz-systemctl-parse-argv | rg 0000000000418885
0000000000418885 T verb_add_dependency
0000000000418885 T verb_bind
0000000000418885 T verb_cancel
0000000000418885 T verb_cat
0000000000418885 T verb_clean_or_freeze
0000000000418885 T verb_edit
0000000000418885 T verb_enable
0000000000418885 T verb_get_default
0000000000418885 T verb_import_environment
0000000000418885 T verb_is_active
0000000000418885 T verb_is_enabled
0000000000418885 T verb_is_failed
0000000000418885 T verb_is_system_running
0000000000418885 T verb_kill
0000000000418885 T verb_list_automounts
0000000000418885 T verb_list_dependencies
0000000000418885 T verb_list_jobs
0000000000418885 T verb_list_machines
0000000000418885 T verb_list_paths
0000000000418885 T verb_list_sockets
0000000000418885 T verb_list_timers
0000000000418885 T verb_list_unit_files
0000000000418885 T verb_list_units
0000000000418885 T verb_log_setting
0000000000418885 T verb_mount_image
0000000000418885 t verb_noop
0000000000418885 T verb_preset_all
0000000000418885 T verb_reset_failed
0000000000418885 T verb_service_log_setting
0000000000418885 T verb_service_watchdogs
0000000000418885 T verb_set_default
0000000000418885 T verb_set_environment
0000000000418885 T verb_set_property
0000000000418885 T verb_show
0000000000418885 T verb_show_environment
0000000000418885 T verb_start_special
0000000000418885 T verb_start_system_special
0000000000418885 T verb_switch_root
0000000000418885 T verb_trivial_method
0000000000418885 T verb_whoami

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 15:50:24 +0000 (17:50 +0200)]

systemctl: convert parse_argv to OPTION macros

The verbs[] table still lives in systemctl-main.c — only the option parsing
side is migrated. systemctl_dispatch_parse_argv() gains a remaining_args
out-param so run() can pass the parsed positional args to systemctl_main(),
which dispatches via _dispatch_verb_with_args() instead of dispatch_verb().

The Options section of --help now renders from the OPTION declarations; the
verb sections still use raw printfs and will be converted alongside the
verbs[] migration.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 15:39:26 +0000 (17:39 +0200)]

systemctl: reorder cases in parse_argv() to match order in --help

Compatibility-only options (--fail, --irreversible, --ignore-dependencies,
--no-legend) are grouped at the end alongside the '.' / '?' error handlers.
The case 'P': … _fallthrough_; case 'p': pair is kept intact and placed at
-p's slot in --help, so -P sits immediately before -p in the source.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 15:32:44 +0000 (17:32 +0200)]

systemctl: split out helper for --what and allow resetting

Analogous to parent commit.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 15:18:30 +0000 (17:18 +0200)]

systemctl: split out helper for --type= and allow resetting

Analogous to grandparent commit.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 15:10:00 +0000 (17:10 +0200)]

systemctl: split out helper for --property=

We explicitly handled --property= in a specific way, so preserve
that behaviour.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 15:00:00 +0000 (17:00 +0200)]

systemctl: split out helper for --state= and allow resetting

So far we'd reject --state=, but it seems nicer to make it reset the
setting as we generally do. The output variable is modified in place…
Option parsing isn't atomic anyway, so I think it's fine to do that.
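
In Python terms, the new behaviour amounts to something like the sketch below. The helper name and the comma-separated value format are assumptions for illustration; the point is that an empty value resets the accumulated list and that the list is modified in place.

```python
def parse_state_option(states, value):
    """Handle one --state= occurrence, modifying `states` in place."""
    if value == "":
        states.clear()                 # an empty value resets the setting
    else:
        states.extend(s for s in value.split(",") if s)
    return states
```

So `--state=failed --state= --state=active` would end up selecting only `active`.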

commit | commitdiff | tree

glemco [Tue, 12 May 2026 17:47:43 +0000 (19:47 +0200)]

cgroups: Refactor cgroup_apply_cpuset() argument order

Cgroup functions take the name key first and then the value, but
cgroup_apply_cpuset() has the opposite order.

Swap the name and value (cpuset) arguments.

commit | commitdiff | tree

glemco [Sun, 10 May 2026 09:48:27 +0000 (11:48 +0200)]

cgroup: Add CPUSetPartition= setting

Add support for configuring cpuset partition type via the
CPUSetPartition= unit file setting. This controls the kernel's
cpuset.cpus.partition cgroup attribute.

The setting takes one of "member", "root", or "isolated". This is
useful for real-time workloads that require dedicated CPU resources
without interference from other processes.

When set, systemd will write the partition type to the
cpuset.cpus.partition cgroup file. If the kernel rejects the value
(e.g., due to partition hierarchy rules), a warning is logged and the
unit continues with the kernel's default partition type.

Co-developed-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Gabriele Monaco <gmonaco@redhat.com>

commit | commitdiff | tree

Christian Brauner [Thu, 14 May 2026 06:13:00 +0000 (08:13 +0200)]

vmspawn: multifunction-pack pcie-root-ports on pcie.0 (#42077)

The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.

pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs while
we are mid-feature-probe, reported as 'QMP connection dropped during
feature probing'.

Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so vmspawn's
QMP device_add machinery is unaffected. 14 ports collapse to 2 pcie.0
slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.

The chassis/slot properties (used for ACPI hotplug identity) stay as i+1
— they live in a uint8_t namespace independent of the PCI BDF and are
still unique. Base PCI slot 0x10 sits above the auto-assigned virtio
devices (which land at 0x01-0x03 in config order) and below the q35 LPC
reservation at 0x1f.
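
The packing scheme and the slot budget can be sketched as follows. The base slot 0x10, eight functions per device-number, chassis/slot = i+1, and multifunction=on for function 0 follow the description above; the function names and the exact property string are illustrative assumptions and not the command line vmspawn generates.

```python
BASE_SLOT = 0x10        # first pcie.0 device-number used for packed root ports
FUNCS_PER_SLOT = 8      # PCI functions 0-7 share one device-number

def port_address(i):
    """Map root port index i to (pcie.0 device-number, function)."""
    return BASE_SLOT + i // FUNCS_PER_SLOT, i % FUNCS_PER_SLOT

def root_port_args(i):
    slot, func = port_address(i)
    props = ["pcie-root-port", "bus=pcie.0",
             "chassis=%d" % (i + 1), "slot=%d" % (i + 1),
             "addr=0x%x.%d" % (slot, func)]
    if func == 0:
        props.append("multifunction=on")   # only function 0 carries the flag
    return ",".join(props)

def slots_used(n_ports):
    """Number of pcie.0 device-numbers consumed by n_ports packed root ports."""
    return -(-n_ports // FUNCS_PER_SLOT)   # ceiling division
```

With 14 ports, `slots_used(14)` is 2, which is where the 2 + 3 + 20 = 25 budget for the nvme_basic test comes from.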

While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now mirrors
assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives (root +
extras + bind volumes) take one builtin port each, SCSI drives take none
— they share a controller drawn from the hotplug pool at device-add
time. Tighten the cap from UINT8_MAX to 192 (24 packed device-numbers ×
8) so we cannot claim more than 24 slots on pcie.0 regardless of how
many extras/runtime-mounts a caller asks for.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 13:06:57 +0000 (15:06 +0200)]

update TODO

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 13:00:08 +0000 (15:00 +0200)]

report-basic: export PhysicalMemorybytes + CPUsOnline metrics

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 12:59:35 +0000 (14:59 +0200)]

cpu-set-util: introduce cpus_online().

Add a helper that tries to determine the number of installed CPUs. This
borrows heavily from physical_memory(), i.e. uses the physical number,
but caps by per-container cpuset.

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 12:58:21 +0000 (14:58 +0200)]

cpu-set-util: add cpu_set_count() helper

Let's add a minor, simplifying helper for getting the number of CPUs in a
mask.

commit | commitdiff | tree

Daan De Meyer [Wed, 13 May 2026 10:21:10 +0000 (12:21 +0200)]

nsresourced: re-link GID delegation file after atomic UID file write

userns_registry_remove() restores a sub-delegated UID range by writing
the previous owner's data to u<UID>.delegate with WRITE_STRING_FILE_ATOMIC.
Atomic writes go via a temp file and rename, which replaces the directory
entry with a fresh inode and severs the hardlink to g<GID>.delegate. The
stale GID side then keeps pointing at the prior inode with outdated owner
and ancestor data, so subsequent lookups via GID return wrong results.

Re-create the hardlink after the atomic write so the two views stay in
sync, matching what userns_registry_store() already does after writing
a new delegation.

commit | commitdiff | tree

Daan De Meyer [Wed, 13 May 2026 20:21:57 +0000 (22:21 +0200)]

blockdev-util: Drop name argument from BLKPG functions

We don't use it, the kernel ignores it, let's just drop
the argument. Saves callers from having to ensure the name
they pass in fits in the 64 char buffer.
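
For context, a BLKPG request without a name can be prepared as sketched below. This assumes the UAPI definitions from `<linux/blkpg.h>`; `blkpg_fill_add()` is a hypothetical helper name and not the systemd function.

```c
#include <assert.h>
#include <linux/blkpg.h>
#include <string.h>

/* Hypothetical helper: fill in a BLKPG_ADD_PARTITION request without any
 * partition name. The devname/volname buffers are simply left zeroed. */
static void blkpg_fill_add(struct blkpg_ioctl_arg *arg, struct blkpg_partition *part,
                           int nr, long long start, long long length) {
        assert(arg);
        assert(part);

        memset(part, 0, sizeof(*part));
        part->pno = nr;
        part->start = start;    /* byte offset of the partition */
        part->length = length;  /* size of the partition in bytes */

        memset(arg, 0, sizeof(*arg));
        arg->op = BLKPG_ADD_PARTITION;
        arg->datalen = sizeof(*part);
        arg->data = part;
}
```

The filled-in `arg` would then be passed to `ioctl(fd, BLKPG, &arg)` on the whole-disk block device.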

commit | commitdiff | tree

Christian Brauner [Wed, 13 May 2026 13:58:11 +0000 (15:58 +0200)]

vmspawn: multifunction-pack pcie-root-ports on pcie.0

The pre-allocated pcie-root-port block in run_virtual_machine() places
every port directly on pcie.0 with an auto-assigned PCI address. A
minimal VM already costs 4 builtin + 10 hotplug spares = 14 pcie.0
slots, on top of 3 implicit virtio devices (virtio-rng-pci,
virtio-balloon, virtio-serial-pci) for another 3.

pcie.0 has 32 device-numbers; q35 reserves 0x00 (host bridge) and 0x1f
(ICH9 LPC), leaving ~30 auto-assignable slots. TEST-64-UDEV-STORAGE-
nvme_basic pushes 20 '-device nvme' lines through
$SYSTEMD_VMSPAWN_QEMU_EXTRA, which vmspawn does not see — total demand
14 + 3 + 20 = 37 > 30. Bus realization fails after QEMU's chardev has
already emitted the QMP greeting, and the monitor socket POLLHUPs
while we are mid-feature-probe, reported as 'QMP connection dropped
during feature probing'.

Pack the root ports as multifunction devices, 8 per pcie.0 device-
number (QEMU docs/pcie.txt:84, 117-120, 255-258). Function 0 of each
group carries multifunction=on; functions 1-7 ride the same slot via
addr=N.F. Each function remains independently hot-pluggable so
vmspawn's QMP device_add machinery is unaffected. 14 ports collapse to
2 pcie.0 slots; the nvme_basic budget becomes 2 + 3 + 20 = 25.

The chassis/slot properties (used for ACPI hotplug identity) stay as
i+1 — they live in a uint8_t namespace independent of the PCI BDF and
are still unique. Base PCI slot 0x10 sits above the auto-assigned
virtio devices (which land at 0x01-0x03 in config order) and below
the q35 LPC reservation at 0x1f.
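
The slot arithmetic above can be checked with a small sketch. The helper names are hypothetical and this is not the vmspawn code; it only models the mapping from port index to pcie.0 device-number and function, using the base slot 0x10 and 8 functions per slot stated in this message.

```c
#include <assert.h>
#include <stdbool.h>

#define PORTS_PER_SLOT 8u    /* PCI functions 0..7 per device-number */
#define BASE_SLOT      0x10u /* first pcie.0 device-number used for root ports */

struct port_addr {
        unsigned slot;       /* pcie.0 device-number (the N in addr=N.F) */
        unsigned function;   /* PCI function (the F in addr=N.F) */
        bool multifunction;  /* true only for function 0 of each group */
};

/* Number of pcie.0 device-numbers consumed by n_ports packed root ports. */
static unsigned slots_needed(unsigned n_ports) {
        return (n_ports + PORTS_PER_SLOT - 1) / PORTS_PER_SLOT;
}

/* Address of the i-th root port (0-based) under multifunction packing. */
static struct port_addr port_addr_for(unsigned i) {
        struct port_addr a = {
                .slot = BASE_SLOT + i / PORTS_PER_SLOT,
                .function = i % PORTS_PER_SLOT,
        };
        a.multifunction = a.function == 0;
        return a;
}
```

With 14 ports, `slots_needed(14)` is 2, so the nvme_basic budget is 2 + 3 + 20 = 25 device-numbers, which fits below the roughly 30 that are auto-assignable.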

While here, rebuild the slot-count formula to match what
assign_pcie_ports() actually allocates. The +1 'SCSI controller' term
was bogus — virtio-scsi-pci comes from the hotplug-spares pool via
hotplug_port_owner[] in vmspawn-qmp.c, never from a builtin port (see
the comment in assign_pcie_ports()). The +1 'network' and +1 'vsock'
terms are now conditional on arg_network_stack and use_vsock. Bind
volumes were missing entirely. And the per-drive accounting now
mirrors assign_pcie_ports()'s skip-SCSI behaviour: non-SCSI drives
(root + extras + bind volumes) take one builtin port each, SCSI
drives take none — they share a controller drawn from the hotplug
pool at device-add time. Cap at 120 ports (15 device-numbers × 8) so
we cannot run off the end of the 5-bit PCI device-number space — the
usable range starting at 0x10 ends at 0x1e because ICH9 LPC sits at
0x1f.0 single-function, blocking the rest of that slot for
multifunction packing.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 13:19:54 +0000 (15:19 +0200)]

core: when figuring out whether to create orphanage units, consult vtable instead of allowlist

As per https://github.com/systemd/systemd/pull/41986#pullrequestreview-4281939586

This also corrects the list of unit types a bit:

1. this removes the mount/automount unit types from the list, since for these types
   we do not allow aliases/renaming anyway.

2. this adds socket + swap units to the list, since they can change
   name, and for both of them we actually do fork off processes hence
   track resources.

Follow-up for: #41986

commit | commitdiff | tree

Luca Boccassi [Wed, 13 May 2026 21:04:14 +0000 (22:04 +0100)]

import: two minor debuggability improvements (#42081)

TEST-13-NSPAWN.machined occasionally fails when importing, and it's hard
to debug, so try to make it better. E.g.:

https://github.com/systemd/systemd/actions/runs/25800895182/job/75790334230?pr=42071

commit | commitdiff | tree

Daan De Meyer [Wed, 13 May 2026 19:23:48 +0000 (21:23 +0200)]

repart: Add debug logging for block_device_partition_add()

commit | commitdiff | tree

Luca Boccassi [Wed, 13 May 2026 14:17:39 +0000 (15:17 +0100)]

mkosi: update debian commit reference to 8b9ea8981eee267a2fa493435f2869f7b2479350

* 8b9ea8981e Install new files for upstream build
* b230cf0490 use dh-cruft to register & purge volatile files
* 8f9b9952e1 Install new files for upstream build

commit | commitdiff | tree

Luca Boccassi [Wed, 13 May 2026 19:12:47 +0000 (20:12 +0100)]

preset: enable cgroup metrics logic (#42075)

commit | commitdiff | tree

Luca Boccassi [Wed, 13 May 2026 17:39:06 +0000 (18:39 +0100)]

import: try to capture tar exit codes on failure

TEST-13-NSPAWN.machined occasionally fails with a tar error, and it's hard
to say what the problem is as the exit code is lost. Try to capture it.

[ 34.054] systemd-importd[504]: (transfer18) Imported 92%.
[ 34.118] systemd-importd[504]: (transfer18) Failed to decode and write: Broken pipe
[ 34.119] systemd-importd[504]: (transfer18) Exiting.
[ 34.121] systemd-importd[504]: (transfer18) Failed to allocate transient user namespace: Operation not permitted
[ 34.121] systemd-importd[504]: Transfer process failed with exit code 1.

Follow-up for b6e676ce41508e2aeea22202fc8f234126177f52

commit | commitdiff | tree

Luca Boccassi [Wed, 13 May 2026 17:31:27 +0000 (18:31 +0100)]

import: do not create foreign ns on cleanup if not needed

The user ns is only used if the appropriate flag is set, so avoid
creating it unless it is. This avoids a spurious EPERM error in
TEST-13-NSPAWN.machined that is confusing when debugging failures.

[ 34.054] systemd-importd[504]: (transfer18) Imported 92%.
[ 34.118] systemd-importd[504]: (transfer18) Failed to decode and write: Broken pipe
[ 34.119] systemd-importd[504]: (transfer18) Exiting.
[ 34.121] systemd-importd[504]: (transfer18) Failed to allocate transient user namespace: Operation not permitted
[ 34.121] systemd-importd[504]: Transfer process failed with exit code 1.

Follow-up for 1be8caa6be6f5a10a7dea5ac562a0df5c5fac2e9

commit | commitdiff | tree

Daan De Meyer [Wed, 13 May 2026 13:35:21 +0000 (13:35 +0000)]

TEST-64-UDEV-STORAGE: Drop number of nvme devices to 12

qemu by default only has 30 PCI slots and with vmspawn now reserving
some of those for its hotplug features, we go over the limit for the
nvme test.

Let's drop the number of nvme devices to 12 to fix the conflict.

commit | commitdiff | tree

Lennart Poettering [Tue, 12 May 2026 14:13:36 +0000 (16:13 +0200)]

copy: retire splice() use for copying files on disk

Apparently splice() is quite problematic, hence just don't use it anymore.
It's also unnecessary these days: either copy_file_range() or sendfile()
typically works nowadays, so the splice() fallback doesn't give us much
anymore.

(At least I am not aware of a combo of fds where splice() would work
where neither copy_file_range() nor sendfile() would work).

This leaves one use of splice() in place, in
src/shared/socket-forward.c. We should probably kill that too, but
that'd require some reworking to use sendfile() I guess, and I am too
lazy for that right now. Moreover, in contrast to the other uses it's
probably even safe, since it uses an intermediary pipe always. But what
do I know...

Fixes: #29044

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 07:52:19 +0000 (09:52 +0200)]

meson: move systemd-sysupdate to /usr/bin/

Let's make systemd-sysupdate easy to call. It was added in 2021,
it's here to stay, and it's not "experimental" in any way.

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 13:57:24 +0000 (15:57 +0200)]

update TODO

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 13:25:41 +0000 (15:25 +0200)]

preset: enable cgroup metrics logic

This stuff is so useful, and should work out of the box I am sure. Given
that the metrics are only generated on request this shouldn't create any
additional burden by default.

Yes, this might enlarge reports a bit, if generated with everything on,
but we really should solve that at the report generation level, not at
the point where we make the metrics available.

Follow-up for: 4409e52494d803426a365b6636a66fd2dfc70b62

commit | commitdiff | tree

Lennart Poettering [Wed, 13 May 2026 12:12:37 +0000 (14:12 +0200)]

update TODO

commit | commitdiff | tree

Chris Down [Wed, 13 May 2026 12:25:08 +0000 (21:25 +0900)]

core: do not leak resources when handling stale alias state on reload (#41986)

The fix for the corrupted state when units become aliased on reload
leaks the now-aliased unit's resources, which become untracked and
essentially lost.

While fixing the state corruption is of course necessary, leaking
processes/etc. is not ideal for a system and service manager, so
instead attempt to keep track of them by creating stub units
on-the-fly.
This way resources are not leaked, there are clear indications of
where they moved, and all state can be tracked as expected.

commit | commitdiff | tree

Christian Brauner [Wed, 13 May 2026 11:58:46 +0000 (13:58 +0200)]

RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM (#41340)

This series adds a new `RestrictFileSystemAccess=` setting in the
`[Manager]` section of `system.conf` that enforces a deny-default
execution policy: only binaries residing on signed dm-verity block
devices (and the initramfs during early boot) are permitted to execute.
Everything else — tmpfs, procfs, sysfs, anonymous executable mappings,
unsigned dm-verity devices — is denied.

The directive takes the values `no` (default), `exec` (lock down
execution), and accepts `yes` as an alias for `exec`. The name is
deliberately broader than what the initial values cover so the same
setting can grow to restrict other filesystem access categories in the
future (e.g. `any` to deny all access from untrusted filesystems, not
just execution).

### How it works

The BPF program is entirely self-contained; PID1 loads it and the kernel
does the rest. When dm-verity brings up a device, the kernel calls
`security_bdev_setintegrity()` twice during `verity_preresume()`: once
with the root hash and once with the signature validity status. Our
`lsm/bdev_setintegrity` hook captures the second call and records the
device number in a BPF hash map if the signature is valid. When a device
is torn down, `lsm/bdev_free_security` cleans up the map entry. No
userspace map population is needed at any point.

The enforcement side hooks `bprm_check_security` (execve), `mmap_file`
(PROT_EXEC mappings including shared libraries), and `file_mprotect`
(W→X transitions like JIT and libffi). Each hook resolves the file's
backing device via `file->f_inode->i_sb->s_dev` and looks it up in the
verity device map. For block-backed filesystems, `s_dev` equals
`s_bdev->bd_dev`, which avoids an extra pointer chase and NULL check on
`s_bdev` — non-block filesystems simply miss in the map and get denied
by the default policy.

During early boot the initramfs needs to be trusted as well, since it
runs before any dm-verity volume is mounted. PID1 writes the initramfs
superblock's device number into a BPF global before attaching the
programs, and clears it after `switch_root` to close the trust window.
As a prerequisite, PID1 also verifies that
`dm_verity.require_signatures=1` is active — without it, unsigned
dm-verity devices could be created, which would weaken the security
model even though the BPF program would correctly deny execution from
them.

### Surviving daemon-reexec

The BPF programs and their verity device map must survive PID1
re-execution (daemon-reexec, switch_root, soft-reboot). Without
preservation, `manager_free()` would destroy the skeleton, the link FDs
would close, programs would detach, and the map would be freed. After
exec, a fresh skeleton would have an empty map — but existing dm-verity
devices have already signaled their integrity and won't do so again. A
deny-default policy plus an empty map means all execution denied and the
system is bricked.

We solve this by serializing the raw BPF link FDs and the `.bss` map FD
across exec using systemd's existing `serialize_fd` / `fdset_cloexec` /
`deserialize_fd` infrastructure. The kernel reference chain (link FD →
`struct bpf_link` → `struct bpf_prog` → `struct bpf_map`) keeps programs
attached and map data intact as long as the dup'd FDs survive. After
exec, PID1 detects the deserialized FDs and skips skeleton re-creation
entirely. If switching root, it uses the deserialized `.bss` map FD to
clear `initramfs_s_dev` via a targeted `mmap()` write, preserving the
other guard globals in `.bss`.

We intentionally avoid bpffs pinning. Pinned objects are discoverable
and manipulable by any process with sufficient privileges
(`BPF_OBJ_GET`, unlink). FD serialization keeps everything private to
PID1 with no external attack surface.

### Self-protection

BPF LSM programs attached via the tracing trampoline (`BPF_LSM_MAC`) are
inherently tamper-resistant — `bpf_tracing_link_lops` has no
`.update_prog` and no `.detach` callbacks, so the kernel rejects
`BPF_LINK_UPDATE` with `-EINVAL` and `BPF_LINK_DETACH` with
`-EOPNOTSUPP`. Once attached, our programs cannot be modified or
detached through the `bpf()` syscall.

The remaining attack vector is map injection: `BPF_MAP_GET_FD_BY_ID` to
obtain an FD to `verity_devices`, then `BPF_MAP_UPDATE_ELEM` to insert a
fake trusted device. The self-protection guard blocks this with three
hooks. `lsm/bpf_map` fires inside `bpf_map_new_fd()`, the chokepoint for
all code paths that produce a map FD, and denies access to our map IDs
from any process other than PID1 (identified via `tgid == 1`, which is
unspoofable — `bpf_get_current_pid_tgid()` reads `current->tgid` from
`pid->numbers[0].nr`, the init-namespace PID). `lsm/bpf_prog` provides
analogous protection for program FDs as defense-in-depth. `lsm/bpf`
handles `BPF_LINK_GET_FD_BY_ID` at the command level since there is no
`security_bpf_link()` hook in the kernel.

The guard starts inactive — all protected IDs default to 0 in `.bss`,
and no real BPF object has ID 0 — so there is no window where it
interferes with PID1's own setup. After attaching all programs, PID1
queries the kernel-assigned IDs via `bpf_obj_get_info_by_fd()` and
writes them into the guard's globals. From that point on, the guard is
active. The guard has zero collateral damage: it only denies access to
our specific object IDs, leaving bpftrace, bpftool,
`RestrictFileSystems=`, and all other BPF usage completely unaffected.

Additionally, a ptrace guard (`lsm/ptrace_access_check`) blocks
`PTRACE_MODE_ATTACH` to PID1 from other processes, preventing extraction
of sensitive state from PID1's address space via ptrace, `/proc/1/mem`,
`process_vm_readv()`, or `pidfd_getfd()`. `PTRACE_MODE_READ` is allowed
so that monitoring tools and `systemctl` continue to work normally.

### Limitations

- The enforcement hooks resolve trust by looking at
`file->f_inode->i_sb->s_dev` — the device number of the superblock that
owns the file's inode. This works correctly for files directly on a
dm-verity block device, but it does not see through overlayfs. When a
file is accessed on an overlay mount, `f_inode` points to the overlay
inode, and `i_sb->s_dev` is the overlay superblock's anonymous device
number — not the underlying dm-verity device. The overlay superblock has
no backing block device, so the lookup misses in the verity map and
execution is denied by the default policy.

This means that overlayfs mounts whose lower layers are on
dm-verity-protected volumes will currently have execution blocked, even
though the actual data is integrity-protected. The correct fix requires
a kernel extension that allows the BPF program to call something like
`d_real_inode()` to resolve through the overlay to the real inode on the
underlying filesystem, and then check that inode's superblock device
number against the verity map. I plan to add a BPF kfunc exposing this
functionality in a follow-up kernel series.

- Multi-device filesystems such as btrfs use entirely synthetic device
numbers and there is no way to reach the actual device backing the inode
from the inode itself. So `RestrictFileSystemAccess=` only works
reliably with a subset of filesystems. In practice this isn't a problem
because the feature is tailored to erofs; using it on arbitrary
filesystems requires careful vetting of the actual filesystem behaviour.

- The initial implementation also blocks JIT-style execution that relies
on memory mapped as executable. This is part of `exec` semantics today and
can be loosened later by introducing finer-grained values (a common
pattern in systemd — following the precedent of `ProtectSystem=`, which
started as a boolean and later grew `auto`/`yes`/`full`/`strict`
semantics).

- The configuration is a system-wide setting with no per-unit opt-out.
This is intentional for the initial implementation: a global invariant
is easier to reason about and harder to accidentally weaken. Per-unit
relaxation can be added later if a concrete need arises.

### Testing

The series includes unit tests and integration tests covering both the
core enforcement logic and the self-protection guard. The unit test
loads the skeleton, attaches programs, populates guard globals, and
verifies that protected IDs are set correctly. The integration tests
exercise the guard by attempting `BPF_MAP_GET_FD_BY_ID` and
`BPF_PROG_GET_FD_BY_ID` from a non-PID1 process and verifying that
access is denied.

What we cannot currently test end-to-end is actual execution enforcement
against a dm-verity-signed root filesystem. The systemd test suite does
not yet have infrastructure for booting a VM with a signed dm-verity
rootfs image — the existing mkosi-based test framework lacks the ability
to produce and boot such images. This will hopefully change soon when
Daan integrates barrage into the test suite.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

commit | commitdiff | tree

Daan De Meyer [Tue, 12 May 2026 07:41:01 +0000 (09:41 +0200)]

test: Modernize btrfs tests

Convert test-btrfs to use the test framework and
assertions, merge the physical offset test into it
and beef it up to include what TEST-83-BTRFS does and
finally get rid of TEST-83-BTRFS as it is unneeded now.

commit | commitdiff | tree

Daan De Meyer [Wed, 13 May 2026 11:06:35 +0000 (13:06 +0200)]

libc,shared: detect newer library symbols at runtime via weak references (#42065)

For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we previously
gated the calls behind build-time HAVE_* checks. Replace these with weak
external references, falling back to the raw syscall at runtime when the
loaded glibc lacks the symbol. Drop the corresponding cc.has_function() loop
from meson.build and disable -Wredundant-decls /
readability-redundant-declaration for src/libc/ via meson c_args and a local
.clang-tidy.

For optional libraries (libcryptsetup, libdw, libarchive), drop the per-symbol
HAVE_* checks. Always declare the prototypes, suppressing the redundant-decl
warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT, and resolve the
symbols after the main dlopen via a new DLSYM_OPTIONAL() helper that only
assigns on success. libarchive's *_is_set wrappers now use fallback functions
as their pointer initializers, so call sites never need to NULL-check.

The same treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np in
process-util.c and epoll_pwait2 in sd-event.c. coredump-config and
coredump-submit get a dlopen_dw_has_dwfl_set_sysroot() helper. The kexec
arch gate now uses defined(__NR_kexec_file_load) directly; pidfd.h uses
__has_include_next() to decide whether to pull in glibc's header.

This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these symbols
are absent.
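
The weak-reference pattern can be sketched as follows. This is an illustration only, not the code from src/libc/: it uses `gettid()` as a stand-in symbol (it is not one of the wrappers named above), and `gettid_compat()` is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Weak reference: if the libc loaded at runtime does not export gettid(),
 * the symbol's address resolves to NULL instead of failing to link or load.
 * On libcs that already declare gettid() this is a redundant declaration,
 * which is why -Wredundant-decls has to be silenced for such code. */
extern pid_t gettid(void) __attribute__((weak));

/* Hypothetical wrapper: prefer the libc symbol, fall back to the raw syscall. */
static pid_t gettid_compat(void) {
        if (gettid)
                return gettid();

        return (pid_t) syscall(SYS_gettid);
}
```

Both branches return the same value on a libc that provides the symbol, so callers cannot tell which path was taken.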

commit | commitdiff | tree

Daan De Meyer [Wed, 13 May 2026 09:51:18 +0000 (11:51 +0200)]

dhcp-message: introduce several more functions to parse/append DHCP options (#42063)

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 09:22:12 +0000 (11:22 +0200)]

Convert loginctl to option and verb macros (#42066)

commit | commitdiff | tree

Christian Brauner [Tue, 12 May 2026 14:04:44 +0000 (16:04 +0200)]

ci: disable BPF framework in Jammy build tests

Jammy's kernel is too old at this point, and doesn't even provide a
vmlinux.h, so disable the feature in the build smoketests to let us
add new features.

Co-developed-by: Luca Boccassi <luca.boccassi@gmail.com>
Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Christian Brauner [Fri, 8 May 2026 08:53:16 +0000 (10:53 +0200)]

core: work around btf_ctx_access() rejection of const void * in BPF LSM

Kernels before v6.16 (missing commit 1271a40eeafa "bpf: Allow access to
const void pointer arguments in tracing programs") have a bug in
btf_ctx_access() where const void * parameters in LSM hook signatures
are not recognized as void pointers. The function checks t->type == 0
to detect void *, but for const void * the BTF chain is PTR -> CONST ->
void, so t->type points to the CONST node rather than directly to
type_id 0. This causes the verifier to reject any BPF program that
reads the const void *value argument of bdev_setintegrity:

func 'bpf_lsm_bdev_setintegrity' arg2 type UNKNOWN is not a struct
invalid bpf_context access off=16 size=8
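
The type-chain issue can be modeled in a few lines. This is a toy model with hypothetical names (`toy_type`, `is_void_ptr_strict()`, `is_void_ptr_skip_modifiers()`), not kernel code; it only assumes what is stated above, namely that type ID 0 stands for void and that `const void *` is encoded as PTR -> CONST -> void.

```c
#include <assert.h>
#include <stdbool.h>

enum toy_kind { TOY_VOID, TOY_PTR, TOY_CONST };

struct toy_type {
        enum toy_kind kind;
        unsigned ref; /* ID of the referenced type; 0 means void */
};

/* Toy type table: ID 0 is void by convention. */
static const struct toy_type toy_types[] = {
        [0] = { TOY_VOID,  0 },
        [1] = { TOY_PTR,   0 }, /* void *       : PTR -> void          */
        [2] = { TOY_CONST, 0 }, /* const void                           */
        [3] = { TOY_PTR,   2 }, /* const void * : PTR -> CONST -> void */
};

/* Models the problematic check: only a pointer that refers directly to
 * type ID 0 is treated as a void pointer. */
static bool is_void_ptr_strict(unsigned id) {
        return toy_types[id].kind == TOY_PTR && toy_types[id].ref == 0;
}

/* Models a check that skips CONST modifiers before comparing with ID 0. */
static bool is_void_ptr_skip_modifiers(unsigned id) {
        if (toy_types[id].kind != TOY_PTR)
                return false;

        unsigned ref = toy_types[id].ref;
        while (ref != 0 && toy_types[ref].kind == TOY_CONST)
                ref = toy_types[ref].ref;

        return ref == 0;
}
```

In this model the strict check accepts type 1 (`void *`) but rejects type 3 (`const void *`), while the modifier-skipping check accepts both.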

Work around this by providing a compat variant of the
bdev_setintegrity BPF program that avoids reading the const void *value
argument entirely. Instead it reads the size argument (a scalar integer)
directly from the raw BPF context (ctx[3]), which is not subject to the
broken type check. This is safe because dm-verity guarantees that value
and size are always in lockstep: both NULL/0 for unsigned devices, both
non-zero for signed devices.

The loader tries the full version first (which reads both value and size
for defense-in-depth) and falls back to the compat variant if loading
fails. bpf_program__set_autoload(false) disables whichever variant is
not needed so the verifier never sees it.

This compat logic can be removed once the minimum kernel baseline
includes the 1271a40eeafa fix.

Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Christian Brauner [Fri, 8 May 2026 08:52:18 +0000 (10:52 +0200)]

test: add integration tests for RestrictFileSystemAccess= BPF LSM

Add TEST-90-RESTRICT-FSACCESS with two subtests:

config subtest — Tests PID1's RestrictFileSystemAccess= configuration parsing and
failure modes via system.conf drop-ins and daemon-reexec:
- Default RestrictFileSystemAccess=no produces no log messages
- RestrictFileSystemAccess=yes without BPF LSM logs appropriate warning
- RestrictFileSystemAccess=yes without require_signatures is correctly rejected
   by the test helper binary's precondition check

enforce subtest — Tests actual BPF LSM enforcement using a test helper
binary (test-bpf-restrict-fsaccess) that loads the BPF skeleton with
initramfs_s_dev set to the rootfs s_dev, pins BPF links, and exits:
- Execution from rootfs continues to work (trusted via initramfs_s_dev)
- Execution from tmpfs is blocked with EPERM
- Execution from a signed dm-verity device is allowed, driven via
   systemd-run -p RootImage= against the pre-built signed minimal_0
   images that mkosi ships and signs at image build time (no on-the-fly
   squashfs / verity hash tree / signature build required)
- After BPF detach, enforcement is lifted

All tests skip gracefully when prerequisites are not met (BPF LSM, BPF
framework, dm-verity tools, signing keys).

Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Christian Brauner [Fri, 8 May 2026 08:50:20 +0000 (10:50 +0200)]

core: expose internal helpers for test-bpf-restrict-fsaccess

Make dm_verity_require_signatures() non-static and declare it in the
header so the test helper binary can exercise the same precondition
checks that PID1 uses.

Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Christian Brauner [Fri, 8 May 2026 08:49:10 +0000 (10:49 +0200)]

core: add self-protection guard for RestrictFileSystemAccess= BPF LSM

Add self-protection guard programs to the RestrictFileSystemAccess= skeleton that
prevent non-PID1 processes from obtaining FDs to our maps, programs, or
links via the bpf() syscall.

This blocks the primary attack vector against the RestrictFileSystemAccess= policy:
using BPF_MAP_GET_FD_BY_ID to get an FD to the verity_devices map,
then BPF_MAP_UPDATE_ELEM to inject fake trusted devices. Protection of
program and link IDs is defense-in-depth (the kernel already blocks
BPF_LINK_UPDATE and BPF_LINK_DETACH for LSM tracing links).

Additionally, a ptrace guard (lsm/ptrace_access_check) blocks
PTRACE_MODE_ATTACH to PID1 from other processes, preventing
extraction of sensitive state from PID1's address space via
ptrace, /proc/1/mem, process_vm_readv(), or pidfd_getfd().

Guard logic:
1. Allow all BPF ops from PID1 (tgid == 1, unspoofable)
2. Deny BPF_MAP_GET_FD_BY_ID for our protected map IDs
3. Deny BPF_PROG_GET_FD_BY_ID for our program IDs
4. Deny BPF_LINK_GET_FD_BY_ID for our link IDs
5. Allow everything else (zero collateral damage)
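
The five rules above can be expressed as a plain C decision function. This is a userspace model for illustration, not the BPF program: the `toy_*` names, the fixed-size ID arrays, and the choice of `-EPERM` as the denial code are all assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum toy_cmd {
        TOY_MAP_GET_FD_BY_ID,
        TOY_PROG_GET_FD_BY_ID,
        TOY_LINK_GET_FD_BY_ID,
        TOY_OTHER,
};

#define TOY_MAX_IDS 4

/* Protected object IDs; all zero means the guard is still inactive. */
struct toy_guard {
        uint32_t map_ids[TOY_MAX_IDS];
        uint32_t prog_ids[TOY_MAX_IDS];
        uint32_t link_ids[TOY_MAX_IDS];
};

static bool toy_id_protected(const uint32_t *ids, uint32_t id) {
        if (id == 0) /* unset slot; no real BPF object has ID 0 */
                return false;

        for (size_t i = 0; i < TOY_MAX_IDS; i++)
                if (ids[i] == id)
                        return true;

        return false;
}

/* Returns 0 to allow the operation or -EPERM to deny it. */
static int toy_guard_check(const struct toy_guard *g, uint32_t tgid, enum toy_cmd cmd, uint32_t id) {
        assert(g);

        if (tgid == 1) /* rule 1: PID1 may do anything */
                return 0;

        switch (cmd) {
        case TOY_MAP_GET_FD_BY_ID:  /* rule 2 */
                return toy_id_protected(g->map_ids, id) ? -EPERM : 0;
        case TOY_PROG_GET_FD_BY_ID: /* rule 3 */
                return toy_id_protected(g->prog_ids, id) ? -EPERM : 0;
        case TOY_LINK_GET_FD_BY_ID: /* rule 4 */
                return toy_id_protected(g->link_ids, id) ? -EPERM : 0;
        default:                    /* rule 5: everything else is allowed */
                return 0;
        }
}
```

Because an all-zero `toy_guard` protects nothing, the model also reflects the "starts inactive" property: until IDs are written, every request is allowed.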

The guard starts inactive (all protected IDs default to 0 in .bss).
After skeleton attach, PID1 queries kernel-assigned IDs via
bpf_obj_get_info_by_fd() and writes them into the guard globals via
the mmap'd .bss, then extracts owned FDs and destroys the skeleton.
Destroying the skeleton unmaps the .bss page from PID1's address
space, so no BPF state — guard globals, protected map/prog/link IDs,
initramfs_s_dev — remains readable via /proc/1/mem. The kernel map
data persists (held by the dup'd FDs) but is only accessible via
bpf_map_* syscalls, which the guard itself blocks for non-PID1.

Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Christian Brauner [Fri, 8 May 2026 08:48:12 +0000 (10:48 +0200)]

core: preserve RestrictFileSystemAccess= BPF state across daemon-reexec

The BPF link and .bss map FDs must survive PID1 re-execution
(daemon-reexec, switch_root, soft-reboot). Without serialization,
manager_free() closes them before execv, programs detach, and the
verity_devices map is freed. After exec a fresh skeleton would have
an empty map — but existing dm-verity devices have already called
bdev_setintegrity and won't call it again. The result would be a
deny-default policy with an empty map, i.e., all execution denied
and the system bricked.

Add serialize/deserialize support using systemd's existing
serialize_fd / fdset_cloexec / deserialize_fd infrastructure:

Before exec (in manager_serialize via bpf_restrict_fsaccess_serialize):
  - Dup each link FD and the .bss map FD into the FDSet
  - fdset_cloexec(fds, false) + execv() preserves them across exec

After exec (in manager_deserialize + bpf_restrict_fsaccess_setup):
  - Deserialize the link FDs and .bss map FD into the Manager struct
  - bpf_restrict_fsaccess_setup() detects the deserialized FDs and skips
    skeleton re-creation entirely — the programs are already attached
  - If no longer in initrd, clear initramfs_s_dev in the kernel map

No bpffs pinning is needed. This avoids a bpffs mount dependency and
eliminates the external attack surface that pinned objects would create
(discoverable/manipulable via unlink or BPF_OBJ_GET). The FDs remain
private to PID1.

Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Christian Brauner [Fri, 8 May 2026 08:45:23 +0000 (10:45 +0200)]

core: add RestrictFileSystemAccess= BPF LSM for dm-verity execution enforcement

Add a new RestrictFileSystemAccess= boolean setting in the [Manager] section of
system.conf that enforces execution only from signed dm-verity block
devices and the initramfs during early boot.

When RestrictFileSystemAccess=yes is set, PID1 loads a BPF LSM program early in boot
that:

Integrity tracking (self-populating, no userspace involvement):
- bdev_setintegrity: records dm-verity signature status in a BPF hash
map when the kernel signals device integrity via
security_bdev_setintegrity()
- bdev_free_security: removes devices from the map on teardown

Execution enforcement (deny-default policy):
- bprm_check_security: blocks execve() from untrusted sources
- mmap_file: blocks PROT_EXEC mmap (shared libs, anonymous exec memory)
- file_mprotect: blocks W->X transitions (JIT, libffi, etc.)

Trust anchors:
- Signed dm-verity volumes (sig_valid flag in the BPF map)
- Initramfs (s_dev captured at load time, cleared after switch_root)
- Everything else is denied (tmpfs, procfs, sysfs, anonymous PROT_EXEC)

PID1 requires dm-verity require_signatures=1 to be enabled and refuses
to load the BPF program otherwise, ensuring the kernel enforces that all
dm-verity devices carry valid signatures.

After attach, PID1 extracts owned FDs from the skeleton (link FDs +
.bss map FD) and lets the skeleton be destroyed. The dup'd link FDs
keep programs attached via the kernel reference chain (link FD ->
bpf_link -> bpf_prog -> bpf_map). Destroying the skeleton unmaps the
.bss page from PID1's address space so no BPF state is readable via
/proc/1/mem. The .bss map FD is retained for targeted writes (clearing
initramfs_s_dev after switch_root via mmap).

Signed-off-by: Christian Brauner <brauner@kernel.org>

commit | commitdiff | tree

Daan De Meyer [Tue, 12 May 2026 14:29:18 +0000 (16:29 +0200)]

libc,shared: detect newer library symbols at runtime

For libc syscall wrappers (pidfd_open, fsopen, openat2, etc.) we previously
gated the calls behind build-time HAVE_* checks. Replace these with shim
functions in src/libc/ that fall back to the raw syscall at runtime when the
loaded glibc lacks the symbol. The infrastructure lives in src/libc/libc-shim.h:
DEFINE_SYSCALL_SHIM falls back to a direct syscall, DEFINE_LIBC_SHIM returns
ENOSYS (for posix_spawn-family helpers that have no corresponding syscall), and
DEFINE_LIBC_ERRNO_SHIM sets errno=ENOSYS and returns -1 (for read/write-style
helpers). The weak reference to the libc symbol is bound via __asm__("name")
rename so the bare libc identifier never appears as a C token — this avoids
both #undef boilerplate against override-header redirects and the resulting
-Wredundant-decls warning. Drop the corresponding cc.has_function() loop from
meson.build.

For optional libraries (libcryptsetup, libdw, libarchive), drop the per-symbol
HAVE_* checks. Always declare the prototypes, suppressing the redundant-decl
warnings via DISABLE_WARNING_REDUNDANT_DECLS and NOLINT, and resolve the symbols
after the main dlopen via a new DLSYM_OPTIONAL() helper that only assigns on
success. libcryptsetup's crypt_set_keyring_to_link / crypt_token_set_external_path
and libarchive's *_is_set wrappers use fallback functions as their pointer
initializers (returning -ENOSYS and 0 respectively), so call sites can invoke
the symbol unconditionally and just check for -ENOSYS where the "not supported"
distinction matters.

The same shim treatment applies to pidfd_spawn / posix_spawnattr_setcgroup_np
(src/libc/spawn.c) and epoll_pwait2 (src/libc/epoll.c), with corresponding
override headers in src/include/override/spawn.h and
src/include/override/sys/epoll.h. posix_spawn_wrapper() in process-util.c and
epoll_wait_usec() in sd-event.c now detect ENOSYS in the return value instead
of checking the function pointer, falling back to plain posix_spawn() and
epoll_wait() respectively. coredump-config and coredump-submit get a
dlopen_dw_has_dwfl_set_sysroot() helper. The kexec arch gate now uses
defined(__NR_kexec_file_load) directly; pidfd.h uses __has_include_next() to
decide whether to pull in glibc's header.

This lets binaries built against newer glibc / libcryptsetup / libdw /
libarchive headers still load and run on older targets where these symbols are
absent.

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Tue, 12 May 2026 13:27:52 +0000 (15:27 +0200)]

loginctl: convert to OPTION and VERB macros

--help output is the same, except for the expected formatting changes
and moving of --no-pager/--no-legend/--no-ask-password to the end.

Co-developed-by: Claude Opus 4.7 <noreply@anthropic.com>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 05:47:40 +0000 (07:47 +0200)]

shared/verbs: allow all groups to be named

When verb groups were added, I assumed that the first group would always
be the unnamed group, or in other words, that a VERB_GROUP() line cannot
appear first. This provides an additional check on whether the verbs
have been reordered by the compiler or linker. But that check is weak
and we can do a better check anyway. And this limitation is unexpected,
since we allow that for OPTIONs. The code should all work without an
unnamed group, once this assertion is removed.

commit | commitdiff | tree

Daan De Meyer [Tue, 12 May 2026 19:54:06 +0000 (21:54 +0200)]

syscall: add kexec_file_load to the generated override header

This makes __NR_kexec_file_load available on architectures where the kernel
UAPI headers don't define it, matching the runtime fallback path in
src/libc/kexec.c which is gated on #ifdef __NR_kexec_file_load.

commit | commitdiff | tree

Christian Brauner [Wed, 13 May 2026 06:48:38 +0000 (08:48 +0200)]

vmspawn: add io.systemd.MachineInstance.ReplaceStorage (#42017)

A follow-up to the AddStorage / RemoveStorage series. ReplaceStorage
swaps the *backing file* of an already-attached storage device on a
running vmspawn-managed VM, leaving the guest-visible device frontend
(virtio-blk, virtio-scsi, nvme, scsi-cd) and every other property of
the device untouched. The intended use is to point an existing disk
at a new image without the guest seeing a hot-unplug/hot-plug cycle.

The signature mirrors AddStorage minus the 'config' field: the
device frontend doesn't change, only the backing behind it. Read-
only / read-write is derived from the new fd's O_ACCMODE; scsi-cd is
forced read-only to match the boot-time policy. S_ISBLK on the new
fd selects host_device vs file driver, matching AddStorage.
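
A minimal sketch of how the access mode and driver choice could be derived from a file descriptor, using only fcntl() and fstat(). The BackingInfo type and backing_info_from_fd() are hypothetical names for illustration and are not taken from the vmspawn sources.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result type for this sketch. */
typedef struct BackingInfo {
        bool read_only;
        const char *driver;   /* "host_device" or "file" */
} BackingInfo;

/* Returns 0 on success, -1 on failure (errno set by fcntl() or fstat()). */
int backing_info_from_fd(int fd, bool is_scsi_cd, BackingInfo *ret) {
        struct stat st;
        int flags;

        assert(fd >= 0);
        assert(ret);

        flags = fcntl(fd, F_GETFL);
        if (flags < 0)
                return -1;
        if (fstat(fd, &st) < 0)
                return -1;

        /* scsi-cd is always read-only; otherwise follow the fd's open mode. */
        ret->read_only = is_scsi_cd || (flags & O_ACCMODE) == O_RDONLY;

        /* Block devices use the host_device driver, everything else "file". */
        ret->driver = S_ISBLK(st.st_mode) ? "host_device" : "file";
        return 0;
}
```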

The QMP primitive is blockdev-reopen. It cannot change a file /
host_device node's 'filename' so we can't just point the existing
file node at a new fd, but it can swap a format node's 'file' child
to a different existing monitor-owned node by node-name reference
(case 3 in qemu/qapi/block-core.json:5034-5040). The chain is:

  add-fd          (host fd → new fdset)
  blockdev-add    (new file node, filename=/dev/fdset/N — fd-only)
  remove-fd       (release monitor's ref; new file holds the dup)
  blockdev-reopen (format node, file = new file node-name)
  blockdev-del    (old file node; its dup release frees old fdset)

The reopen options must restate every option the original blockdev-
add emitted on the format node — blockdev-reopen resets any
unspecified option to its driver default. The 'file' field is a
node-name string reference, never a path.
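
To make the reopen step concrete, here is a sketch that formats a blockdev-reopen command in which 'file' refers to another node by name. The helper name, the node names, and the choice of restated options (only 'read-only' here) are assumptions for illustration; a real caller would restate every option from the original blockdev-add, and this sketch does no JSON escaping, so it assumes plain identifier-like names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Formats a QMP blockdev-reopen command into buf.
 * Returns the string length on success, -1 on truncation or bad input. */
int format_blockdev_reopen(char *buf, size_t size,
                           const char *format_node, const char *driver,
                           const char *new_file_node, bool read_only) {
        int r;

        assert(buf);
        assert(format_node);
        assert(driver);
        assert(new_file_node);

        /* A format node cannot name itself as its own 'file' child. */
        if (strcmp(format_node, new_file_node) == 0)
                return -1;

        r = snprintf(buf, size,
                     "{\"execute\":\"blockdev-reopen\",\"arguments\":{\"options\":[{"
                     "\"driver\":\"%s\",\"node-name\":\"%s\","
                     "\"file\":\"%s\",\"read-only\":%s}]}}",
                     driver, format_node, new_file_node,
                     read_only ? "true" : "false");
        if (r < 0 || (size_t) r >= size)
                return -1;

        return r;
}
```

Note that the 'file' value is the node-name of the freshly added file node, never a filesystem path, which matches the constraint stated above.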

No new errors and no new IDL types beyond the method itself;
everything is built on the existing NoSuchStorage / StorageImmutable
/ NotConnected / EBUSY vocabulary.

The series is:

  vmspawn: split blockdev-add into separate file and format calls
      Preparatory refactor. qemu/blockdev.c:3440 only marks the
      top-level BDS returned by blockdev-add as monitor-owned;
      inline children are NOT, so blockdev-del later rejects them
      with "Node X is not owned by the monitor". Split into two
      blockdev-add calls so the file node is independently
      deletable. DriveInfo gains qmp_file_node_name and a
      file_generation counter; the teardown helper deletes format
      then file (file-first is rejected as "node used as 'file'
      of Y"). The ephemeral path was already structured this way;
      only the regular add path changes. Drops the now-unused
      qmp_build_blockdev_add_inline().

  shared/varlink-io.systemd.MachineInstance: add ReplaceStorage method
      IDL only: ReplaceStorage(fileDescriptorIndex, name). No new
      errors.

  vmspawn: implement io.systemd.MachineInstance.ReplaceStorage
      vmspawn_qmp_replace_block_device() entry point, ReplaceCtx
      (refcounted, ReplaceCtxStateFlags for partial-state tracking)
      and four async callbacks plus an idempotent replace_fail.
      file_generation is bumped before issuing blockdev-add so
      retries don't collide on node-name.
      BLOCK_DEVICE_STATE_REPLACE_PENDING gates concurrent
      Replace / Remove on the same drive. On reopen success the
      trailing blockdev-del of the old file node fires from the
      reopen callback; its failure logs a warning and still replies
      success (the swap already committed; the orphan resolves at VM
      exit). QMP disconnect mid-replace routes via
      qmp_client_fail_pending → replace_fail → NotConnected.

  test: integration test for io.systemd.MachineInstance.ReplaceStorage
      TEST-87-AUX-UTILS-VM.replace-storage covers happy-path replace,
      successive replaces (file_generation rotation), StorageImmutable
      rejection on the boot-time drive, NoSuchStorage on unknown
      names, InvalidParameter on malformed names, and clean
      RemoveStorage after a replace (proves the new file node is
      monitor-owned and the teardown order works). Backing files are
      passed via 'varlinkctl --push-fd'; no machinectl front-end is
      added in this round.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

commit | commitdiff | tree

Zbigniew Jędrzejewski-Szmek [Wed, 13 May 2026 06:07:27 +0000 (08:07 +0200)]

report-basic: expose os-release fields as a metric (#41988)

Add io.systemd.Basic.OSRelease metric family that reports all the fields
in os-release.

commit | commitdiff | tree

Yu Watanabe [Sun, 12 Apr 2026 23:54:38 +0000 (08:54 +0900)]

dhcp-message: introduce dhcp_message_get_option_dnr()

This is for DHCP option 162 (DNR).

commit | commitdiff | tree

Yu Watanabe [Sun, 12 Apr 2026 19:02:39 +0000 (04:02 +0900)]

dhcp-message: introduce dhcp_message_{append,get}_option_6rd()

These are for DHCP option 212 (6rd).

commit | commitdiff | tree

Yu Watanabe [Sat, 11 Apr 2026 21:18:50 +0000 (06:18 +0900)]

dhcp-message: introduce dhcp_message_{append,get}_option_routes()

These are for DHCP options 33 (static route), 121 (classless static
route), and 249 (private classless static route).
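
For reference, the option 121 wire format (RFC 3442) packs each route as a prefix length byte, the significant octets of the destination, and a 4-byte router address. The parser below is an independent sketch of that encoding with hypothetical names; it is not the dhcp_message_get_option_routes() implementation, and it does not cover the fixed 8-byte pairs of option 33.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical route record for this sketch. */
typedef struct Route {
        uint8_t prefixlen;
        uint8_t dst[4];
        uint8_t gw[4];
} Route;

/* Parses an RFC 3442 payload. Returns the number of routes stored,
 * or -1 on malformed data or when max_routes is exceeded. */
int parse_classless_routes(const uint8_t *data, size_t len,
                           Route *routes, size_t max_routes) {
        size_t n = 0;

        assert(data || len == 0);
        assert(routes || max_routes == 0);

        while (len > 0) {
                uint8_t prefixlen = data[0];
                size_t significant, entry;

                if (prefixlen > 32 || n >= max_routes)
                        return -1;

                /* Only ceil(prefixlen / 8) destination octets are sent. */
                significant = (prefixlen + 7u) / 8u;
                entry = 1 + significant + 4;
                if (len < entry)
                        return -1;

                routes[n].prefixlen = prefixlen;
                memset(routes[n].dst, 0, sizeof(routes[n].dst));
                memcpy(routes[n].dst, data + 1, significant);
                memcpy(routes[n].gw, data + 1 + significant, 4);

                data += entry;
                len -= entry;
                n++;
        }

        return (int) n;
}
```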

commit | commitdiff | tree

Yu Watanabe [Sat, 11 Apr 2026 19:50:05 +0000 (04:50 +0900)]

dhcp: move definition of sd_dhcp_route and related functions to dhcp-route.[ch]

This also renames arguments for storing results.
No functional change, just refactoring and preparation for later commits.

commit | commitdiff | tree

Yu Watanabe [Sun, 19 Apr 2026 07:09:10 +0000 (16:09 +0900)]

dhcp-message: add SIP server option support

DHCP option 120 (SIP server) takes a list of addresses or
domain names, and the first byte in the data classifies which type is
stored. Let's extend _addresses() and _domains() to make them support
the SIP server option.
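
The leading encoding byte follows RFC 3361: 0 means a list of domain names, 1 means a list of IPv4 addresses. A small sketch of that classification step, with hypothetical names that are not taken from the dhcp-message code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encoding values defined by RFC 3361. */
enum {
        SIP_ENC_DOMAINS   = 0,
        SIP_ENC_ADDRESSES = 1,
};

/* Inspects the first byte of an option 120 payload.
 * Returns the encoding and stores the length of the remaining data in
 * *ret_payload_len, or returns -1 if the option is malformed. */
int sip_option_classify(const uint8_t *data, size_t len, size_t *ret_payload_len) {
        size_t payload;

        assert(data || len == 0);
        assert(ret_payload_len);

        if (len < 2)   /* encoding byte plus at least one byte of data */
                return -1;

        payload = len - 1;

        switch (data[0]) {
        case SIP_ENC_DOMAINS:
                break;
        case SIP_ENC_ADDRESSES:
                /* IPv4 addresses are 4 bytes each. */
                if (payload % 4 != 0)
                        return -1;
                break;
        default:
                return -1;
        }

        *ret_payload_len = payload;
        return data[0];
}
```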

commit | commitdiff | tree

Yu Watanabe [Sat, 11 Apr 2026 17:59:16 +0000 (02:59 +0900)]

dhcp-message: introduce dhcp_message_get_option_domains()

This is for e.g. DHCP option 119 (domain search).
