From: Christian Brauner <christian@amutable.com>
Date: Wed, 13 May 2026 11:58:46 +0000 (+0200)
Subject: RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM ... 
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=e28aab8840cbec43830af5c1520e9d9e123ad896;p=thirdparty%2Fsystemd.git

RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM (#41340)

This series adds a new `RestrictFileSystemAccess=` setting in the
`[Manager]` section of `system.conf` that enforces a deny-default
execution policy: only binaries residing on signed dm-verity block
devices (and the initramfs during early boot) are permitted to execute.
Everything else — tmpfs, procfs, sysfs, anonymous executable mappings,
unsigned dm-verity devices — is denied.

The directive takes the values `no` (default) and `exec` (lock down
execution); `yes` is accepted as an alias for `exec`. The name is
deliberately broader than what the initial values cover so the same
setting can grow to restrict other filesystem access categories in the
future (e.g. `any` to deny all access from untrusted filesystems, not
just execution).
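
As an illustration of the value handling described above, here is a
minimal sketch of a parser for the setting's argument. The enum and
function names are hypothetical and are not the ones used in the series;
only the accepted strings (`no`, `exec`, `yes`) come from the text.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical names for illustration; not the identifiers used by systemd. */
typedef enum {
        RESTRICT_FS_ACCESS_INVALID = -1,
        RESTRICT_FS_ACCESS_NO,
        RESTRICT_FS_ACCESS_EXEC,
} RestrictFsAccess;

/* Map the configured string to a mode; unknown strings are rejected. */
static RestrictFsAccess restrict_fs_access_from_string(const char *s) {
        assert(s);

        if (strcmp(s, "no") == 0)
                return RESTRICT_FS_ACCESS_NO;

        /* "yes" is accepted as an alias for "exec". */
        if (strcmp(s, "exec") == 0 || strcmp(s, "yes") == 0)
                return RESTRICT_FS_ACCESS_EXEC;

        return RESTRICT_FS_ACCESS_INVALID;
}
```

A future value such as `any` would become one more enum entry and one
more string comparison, which is the kind of growth the broader name is
meant to leave room for.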

### How it works

The BPF program is entirely self-contained; PID1 loads it and the kernel
does the rest. When dm-verity brings up a device, the kernel calls
`security_bdev_setintegrity()` twice during `verity_preresume()`: once
with the root hash and once with the signature validity status. Our
`lsm/bdev_setintegrity` hook captures the second call and records the
device number in a BPF hash map if the signature is valid. When a device
is torn down, `lsm/bdev_free_security` cleans up the map entry. No
userspace map population is needed at any point.
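
The tracking logic can be modelled in plain userspace C. This is only a
sketch of the behaviour described above, not the BPF program itself: the
fixed-size table stands in for the BPF hash map, and the enum and
function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

#define MAX_VERITY_DEVICES 16 /* arbitrary capacity for this sketch */

/* Hypothetical stand-ins for the two notifications sent per device. */
enum integrity_type { INTEGRITY_ROOTHASH, INTEGRITY_SIG_VALID };

/* Userspace stand-in for the BPF hash map; 0 marks an empty slot. */
static dev_t verity_devices[MAX_VERITY_DEVICES];

static dev_t *find_slot(dev_t dev) {
        for (size_t i = 0; i < MAX_VERITY_DEVICES; i++)
                if (verity_devices[i] == dev)
                        return &verity_devices[i];
        return NULL;
}

static bool verity_is_trusted(dev_t dev) {
        assert(dev != 0);
        return find_slot(dev) != NULL;
}

/* Models lsm/bdev_setintegrity: only a valid-signature notification records the device. */
static void on_setintegrity(dev_t dev, enum integrity_type type, bool sig_valid) {
        assert(dev != 0);

        if (type != INTEGRITY_SIG_VALID || !sig_valid || find_slot(dev))
                return;

        dev_t *slot = find_slot(0);
        if (slot) /* a full table leaves the device untrusted, i.e. it fails closed */
                *slot = dev;
}

/* Models lsm/bdev_free_security: forget the device when it is torn down. */
static void on_free_security(dev_t dev) {
        assert(dev != 0);

        dev_t *slot = find_slot(dev);
        if (slot)
                *slot = 0;
}
```

In this model the root hash notification never adds an entry, a device
with an invalid signature stays untrusted, and tearing a device down
removes its entry again.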

The enforcement side hooks `bprm_check_security` (execve), `mmap_file`
(PROT_EXEC mappings including shared libraries), and `file_mprotect`
(W→X transitions like JIT and libffi). Each hook resolves the file's
backing device via `file->f_inode->i_sb->s_dev` and looks it up in the
verity device map. For block-backed filesystems, `s_dev` equals
`s_bdev->bd_dev`, so reading `s_dev` avoids an extra pointer chase and a
NULL check on `s_bdev`. Non-block filesystems simply miss in the map and
get denied by the default policy.
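
The deny-default decision itself is small. The following userspace
sketch shows its shape under two assumptions that are not taken from the
series: the trusted devices are passed in as a plain array rather than
looked up in a BPF map, and denial is reported as `-EPERM`.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Deny-default lookup: 0 allows, -EPERM (an assumed error code) denies. */
static int check_exec_dev(dev_t s_dev, const dev_t *trusted, size_t n_trusted) {
        assert(trusted || n_trusted == 0);

        for (size_t i = 0; i < n_trusted; i++)
                if (trusted[i] == s_dev)
                        return 0;

        return -EPERM; /* anything not in the table is untrusted */
}

/* mmap_file/file_mprotect-style check: only requests asking for PROT_EXEC are looked up. */
static int check_map_prot(int prot, dev_t s_dev, const dev_t *trusted, size_t n_trusted) {
        if (!(prot & PROT_EXEC))
                return 0;

        return check_exec_dev(s_dev, trusted, n_trusted);
}
```

The sketch only covers file-backed requests. Anonymous executable
mappings have no backing file at all and are denied by the policy
described in this message.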

During early boot the initramfs needs to be trusted as well, since it
runs before any dm-verity volume is mounted. PID1 writes the initramfs
superblock's device number into a BPF global before attaching the
programs, and clears it after `switch_root` to close the trust window.
As a prerequisite, PID1 also verifies that
`dm_verity.require_signatures=1` is active — without it, unsigned
dm-verity devices could be created, which would weaken the security
model even though the BPF program would correctly deny execution from
them.
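
The initramfs trust window can be pictured as a single global that is
set once and later zeroed. The sketch below is a userspace model, not
the BPF code: `initramfs_s_dev` is the global named later in this
message, while the helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/types.h>

/* Models the BPF global; 0 means that no initramfs is trusted. */
static dev_t initramfs_s_dev;

/* Written before the programs are attached. */
static void trust_initramfs(dev_t dev) {
        assert(dev != 0);
        initramfs_s_dev = dev;
}

/* Called after switch_root to close the trust window. */
static void close_trust_window(void) {
        initramfs_s_dev = 0;
}

/* Checked by the enforcement hooks in addition to the verity device map. */
static bool dev_is_trusted_initramfs(dev_t dev) {
        return initramfs_s_dev != 0 && dev == initramfs_s_dev;
}
```

Once the window is closed, even a file on the old initramfs superblock
is treated like any other untrusted filesystem.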

### Surviving daemon-reexec

The BPF programs and their verity device map must survive PID1
re-execution (daemon-reexec, switch_root, soft-reboot). Without
preservation, `manager_free()` would destroy the skeleton, the link FDs
would close, programs would detach, and the map would be freed. After
exec, a fresh skeleton would have an empty map — but existing dm-verity
devices have already signaled their integrity and won't do so again. A
deny-default policy plus an empty map means all execution is denied and
the system is bricked.

We solve this by serializing the raw BPF link FDs and the `.bss` map FD
across exec using systemd's existing `serialize_fd` / `fdset_cloexec` /
`deserialize_fd` infrastructure. The kernel reference chain (link FD →
`struct bpf_link` → `struct bpf_prog` → `struct bpf_map`) keeps programs
attached and map data intact as long as the dup'd FDs survive. After
exec, PID1 detects the deserialized FDs and skips skeleton re-creation
entirely. If switching root, it uses the deserialized `.bss` map FD to
clear `initramfs_s_dev` via a targeted `mmap()` write, preserving the
other guard globals in `.bss`.

We intentionally avoid bpffs pinning. Pinned objects are discoverable
and manipulable by any process with sufficient privileges
(`BPF_OBJ_GET`, unlink). FD serialization keeps everything private to
PID1 with no external attack surface.

### Self-protection

BPF LSM programs attached via the tracing trampoline (`BPF_LSM_MAC`) are
inherently tamper-resistant — `bpf_tracing_link_lops` has no
`.update_prog` and no `.detach` callbacks, so the kernel rejects
`BPF_LINK_UPDATE` with `-EINVAL` and `BPF_LINK_DETACH` with
`-EOPNOTSUPP`. Once attached, our programs cannot be modified or
detached through the `bpf()` syscall.

The remaining attack vector is map injection: `BPF_MAP_GET_FD_BY_ID` to
obtain an FD to `verity_devices`, then `BPF_MAP_UPDATE_ELEM` to insert a
fake trusted device. The self-protection guard blocks this with three
hooks. `lsm/bpf_map` fires inside `bpf_map_new_fd()`, the chokepoint for
all code paths that produce a map FD, and denies access to our map IDs
from any process other than PID1 (identified via `tgid == 1`, which is
unspoofable: `bpf_get_current_pid_tgid()` returns `current->tgid`, which
is taken from `pid->numbers[0].nr`, the init-namespace PID). `lsm/bpf_prog` provides
analogous protection for program FDs as defense-in-depth. `lsm/bpf`
handles `BPF_LINK_GET_FD_BY_ID` at the command level since there is no
`security_bpf_link()` hook in the kernel.

The guard starts inactive — all protected IDs default to 0 in `.bss`,
and no real BPF object has ID 0 — so there is no window where it
interferes with PID1's own setup. After attaching all programs, PID1
queries the kernel-assigned IDs via `bpf_obj_get_info_by_fd()` and
writes them into the guard's globals. From that point on, the guard is
active. The guard causes no collateral damage: it denies access only to
our specific object IDs, leaving bpftrace, bpftool,
`RestrictFileSystems=`, and all other BPF usage unaffected.
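
The guard's decision for map FDs can be sketched as a userspace model.
The array, the function names, and the `-EPERM` return value are
assumptions made for illustration; the zero-initialized IDs and the
`tgid == 1` exemption follow the description above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define N_PROTECTED_MAPS 2 /* e.g. the verity device map and the .bss map */

/* Zero-initialized like .bss; 0 never matches a real BPF object ID. */
static uint32_t protected_map_ids[N_PROTECTED_MAPS];

/* Models the lsm/bpf_map decision for a caller asking for a map FD. */
static int guard_bpf_map(uint32_t map_id, uint32_t caller_tgid) {
        if (caller_tgid == 1) /* PID1 may always reach its own objects */
                return 0;

        for (size_t i = 0; i < N_PROTECTED_MAPS; i++)
                if (protected_map_ids[i] != 0 && protected_map_ids[i] == map_id)
                        return -EPERM;

        return 0; /* unrelated maps are never affected */
}

/* Models PID1 writing the kernel-assigned IDs after attaching. */
static void guard_activate(uint32_t verity_map_id, uint32_t bss_map_id) {
        assert(verity_map_id != 0 && bss_map_id != 0);

        protected_map_ids[0] = verity_map_id;
        protected_map_ids[1] = bss_map_id;
}
```

Before `guard_activate()` runs, every request is allowed. Afterwards
only the two recorded IDs are refused, and only for callers other than
PID1.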

Additionally, a ptrace guard (`lsm/ptrace_access_check`) blocks
`PTRACE_MODE_ATTACH` to PID1 from other processes, preventing extraction
of sensitive state from PID1's address space via ptrace, `/proc/1/mem`,
`process_vm_readv()`, or `pidfd_getfd()`. `PTRACE_MODE_READ` is allowed
so that monitoring tools and `systemctl` continue to work normally.
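
The ptrace guard reduces to a check on the target and the access mode.
This is a userspace model with a hypothetical function name and an
assumed `-EPERM` result. The two mode constants are defined locally and
are meant to mirror the kernel's `PTRACE_MODE_READ` and
`PTRACE_MODE_ATTACH` bits; treat the numeric values as illustrative.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Local copies for the model; intended to mirror the kernel's mode bits. */
#define MODEL_PTRACE_MODE_READ   0x01u
#define MODEL_PTRACE_MODE_ATTACH 0x02u

/* Models lsm/ptrace_access_check: 0 allows, -EPERM denies. */
static int guard_ptrace(uint32_t tracer_tgid, uint32_t target_tgid, unsigned mode) {
        /* every access check carries at least one of the two mode bits */
        assert(mode & (MODEL_PTRACE_MODE_READ | MODEL_PTRACE_MODE_ATTACH));

        if (target_tgid != 1 || tracer_tgid == 1)
                return 0; /* only foreign access to PID1 is of interest */

        if (mode & MODEL_PTRACE_MODE_ATTACH)
                return -EPERM;

        return 0; /* read-level access stays available */
}
```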

### Limitations

- The enforcement hooks resolve trust by looking at
`file->f_inode->i_sb->s_dev` — the device number of the superblock that
owns the file's inode. This works correctly for files directly on a
dm-verity block device, but it does not see through overlayfs. When a
file is accessed on an overlay mount, `f_inode` points to the overlay
inode, and `i_sb->s_dev` is the overlay superblock's anonymous device
number — not the underlying dm-verity device. The overlay superblock has
no backing block device, so the lookup misses in the verity map and
execution is denied by the default policy.

This means that overlayfs mounts whose lower layers are on
dm-verity-protected volumes will currently have execution blocked, even
though the actual data is integrity-protected. The correct fix requires
a kernel extension that allows the BPF program to call something like
`d_real_inode()` to resolve through the overlay to the real inode on the
underlying filesystem, and then check that inode's superblock device
number against the verity map. I plan to add a BPF kfunc exposing this
functionality in a follow-up kernel series.

- Multi-device filesystems such as btrfs use entirely synthetic device
numbers, and there is no way to reach the actual device backing the
inode from the inode itself. `RestrictFileSystemAccess=` therefore only
works reliably with a subset of filesystems. In practice this isn't a
problem because the feature is tailored to erofs; using it on other
filesystems requires careful vetting of the actual filesystem behaviour.

- The initial implementation also blocks JIT-style execution that relies
on mapping memory as executable. This is part of `exec` semantics today
and can be loosened later by introducing finer-grained values. This is a
common pattern in systemd: `ProtectSystem=` started as a boolean and
later grew `auto`/`yes`/`full`/`strict` semantics.

- The configuration is a system-wide setting with no per-unit opt-out.
This is intentional for the initial implementation: a global invariant
is easier to reason about and harder to accidentally weaken. Per-unit
relaxation can be added later if a concrete need arises.
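
The overlayfs limitation above can be observed from userspace. The
sketch below uses `stat()` and checks whether the reported device number
has major 0, which is what anonymous (non-block-backed) superblocks are
assigned. The helper names are hypothetical, and `st_dev` is only an
approximation of the `s_dev` the hooks see: some filesystems, btrfs
among them, report their own synthetic numbers here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Anonymous superblocks (tmpfs, procfs, overlayfs, ...) use major number 0. */
static bool dev_is_anonymous(dev_t dev) {
        return major(dev) == 0;
}

/* Fetch the device number that userspace sees for a path; 0 or -errno. */
static int path_backing_dev(const char *path, dev_t *ret) {
        struct stat st;

        assert(path && ret);

        if (stat(path, &st) < 0)
                return -errno;

        *ret = st.st_dev;
        return 0;
}
```

A binary on an overlay mount would report an anonymous device here even
when its lower layer is a signed dm-verity volume, so the lookup in the
verity map cannot succeed for it.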

### Testing

The series includes unit tests and integration tests covering both the
core enforcement logic and the self-protection guard. The unit test
loads the skeleton, attaches programs, populates guard globals, and
verifies that protected IDs are set correctly. The integration tests
exercise the guard by attempting `BPF_MAP_GET_FD_BY_ID` and
`BPF_PROG_GET_FD_BY_ID` from a non-PID1 process and verifying that
access is denied.
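
A probe of the kind the integration tests perform can be written
directly against the `bpf()` syscall. This is a sketch rather than the
test code from the series: the helper name is made up, and the exact
error the guard returns is not stated here, so a caller should only rely
on the result being negative.

```c
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Ask the kernel for an FD to the map or program with the given ID.
 * Returns the new FD on success or -errno on failure. */
static int bpf_get_fd_by_id(enum bpf_cmd cmd, uint32_t id) {
        union bpf_attr attr;

        assert(id != 0); /* real BPF object IDs start at 1 */

        memset(&attr, 0, sizeof(attr));

        switch (cmd) {
        case BPF_MAP_GET_FD_BY_ID:
                attr.map_id = id;
                break;
        case BPF_PROG_GET_FD_BY_ID:
                attr.prog_id = id;
                break;
        default:
                return -EINVAL; /* this sketch only covers maps and programs */
        }

        long fd = syscall(__NR_bpf, cmd, &attr, sizeof(attr));
        return fd < 0 ? -errno : (int) fd;
}
```

A test process that is not PID1 would call this with one of the
protected IDs and expect a negative result, while the same call with the
ID of an unrelated map should keep working for a sufficiently privileged
caller.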

What we cannot currently test end-to-end is actual execution enforcement
against a dm-verity-signed root filesystem. The systemd test suite does
not yet have infrastructure for booting a VM with a signed dm-verity
rootfs image — the existing mkosi-based test framework lacks the ability
to produce and boot such images. This will hopefully change soon when
Daan integrates barrage into the test suite.

Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>
---

e28aab8840cbec43830af5c1520e9d9e123ad896