From: Christian Brauner
Date: Wed, 13 May 2026 11:58:46 +0000 (+0200)
Subject: RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM ...
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=e28aab8840cbec43830af5c1520e9d9e123ad896;p=thirdparty%2Fsystemd.git

RestrictFileSystemAccess= — dm-verity filesystem access enforcement via BPF LSM (#41340)

This series adds a new `RestrictFileSystemAccess=` setting in the `[Manager]` section of `system.conf` that enforces a deny-default execution policy: only binaries residing on signed dm-verity block devices (and the initramfs during early boot) are permitted to execute. Everything else (tmpfs, procfs, sysfs, anonymous executable mappings, unsigned dm-verity devices) is denied.

The directive takes the values `no` (default) and `exec` (lock down execution), and accepts `yes` as an alias for `exec`. The name is deliberately broader than what the initial values cover so the same setting can grow to restrict other filesystem access categories in the future (e.g. `any` to deny all access from untrusted filesystems, not just execution).

### How it works

The BPF program is entirely self-contained; PID1 loads it and the kernel does the rest. When dm-verity brings up a device, the kernel calls `security_bdev_setintegrity()` twice during `verity_preresume()`: once with the root hash and once with the signature validity status. Our `lsm/bdev_setintegrity` hook captures the second call and records the device number in a BPF hash map if the signature is valid. When a device is torn down, `lsm/bdev_free_security` cleans up the map entry. No userspace map population is needed at any point.

The enforcement side hooks `bprm_check_security` (execve), `mmap_file` (PROT_EXEC mappings including shared libraries), and `file_mprotect` (W→X transitions like JIT and libffi). Each hook resolves the file's backing device via `file->f_inode->i_sb->s_dev` and looks it up in the verity device map.
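
The trust decision described above can be modelled in plain userspace C. This is a minimal sketch under stated assumptions, not the BPF program from this series: the names `verity_model`, `model_setintegrity()`, `model_bdev_free()` and `model_exec_allowed()` are hypothetical, a small fixed array stands in for the BPF hash map, and device numbers are plain integers (with 0 used as an "unset" sentinel) rather than values read from `file->f_inode->i_sb->s_dev`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_DEVICES 16

/* Hypothetical userspace stand-in for the verity device map and the
 * initramfs device global. The real series keeps these in BPF maps. */
struct verity_model {
        uint32_t devices[MODEL_MAX_DEVICES]; /* 0 marks a free slot */
        uint32_t initramfs_s_dev;            /* 0 once the trust window is closed */
};

/* Models lsm/bdev_setintegrity: record a device only if its signature is valid. */
bool model_setintegrity(struct verity_model *m, uint32_t dev, bool sig_valid) {
        size_t free_slot = MODEL_MAX_DEVICES;

        if (!sig_valid || dev == 0)
                return false;

        for (size_t i = 0; i < MODEL_MAX_DEVICES; i++) {
                if (m->devices[i] == dev)
                        return true; /* already recorded */
                if (m->devices[i] == 0 && free_slot == MODEL_MAX_DEVICES)
                        free_slot = i;
        }

        if (free_slot == MODEL_MAX_DEVICES)
                return false; /* table full: the device stays untrusted */

        m->devices[free_slot] = dev;
        return true;
}

/* Models lsm/bdev_free_security: forget a device when it is torn down. */
void model_bdev_free(struct verity_model *m, uint32_t dev) {
        for (size_t i = 0; i < MODEL_MAX_DEVICES; i++)
                if (dev != 0 && m->devices[i] == dev)
                        m->devices[i] = 0;
}

/* Models the enforcement hooks: deny unless s_dev is a recorded verity
 * device or the still-trusted initramfs. */
bool model_exec_allowed(const struct verity_model *m, uint32_t s_dev) {
        if (s_dev == 0)
                return false;
        if (m->initramfs_s_dev != 0 && s_dev == m->initramfs_s_dev)
                return true;
        for (size_t i = 0; i < MODEL_MAX_DEVICES; i++)
                if (m->devices[i] == s_dev)
                        return true;
        return false; /* deny by default */
}
```

In this model, a device that was never reported with a valid signature simply misses in the table and is denied, and clearing `initramfs_s_dev` is enough to withdraw trust from the initramfs.
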
For block-backed filesystems, `s_dev` equals `s_bdev->bd_dev`, which avoids an extra pointer chase and NULL check on `s_bdev`. Non-block filesystems simply miss in the map and get denied by the default policy.

During early boot the initramfs needs to be trusted as well, since it runs before any dm-verity volume is mounted. PID1 writes the initramfs superblock's device number into a BPF global before attaching the programs, and clears it after `switch_root` to close the trust window.

As a prerequisite, PID1 also verifies that `dm_verity.require_signatures=1` is active. Without it, unsigned dm-verity devices could be created, which would weaken the security model even though the BPF program would correctly deny execution from them.

### Surviving daemon-reexec

The BPF programs and their verity device map must survive PID1 re-execution (daemon-reexec, switch_root, soft-reboot). Without preservation, `manager_free()` would destroy the skeleton, the link FDs would close, programs would detach, and the map would be freed. After exec, a fresh skeleton would have an empty map, but existing dm-verity devices have already signaled their integrity and won't do so again. A deny-default policy plus an empty map means all execution is denied and the system is bricked.

We solve this by serializing the raw BPF link FDs and the `.bss` map FD across exec using systemd's existing `serialize_fd` / `fdset_cloexec` / `deserialize_fd` infrastructure. The kernel reference chain (link FD → `struct bpf_link` → `struct bpf_prog` → `struct bpf_map`) keeps programs attached and map data intact as long as the dup'd FDs survive.

After exec, PID1 detects the deserialized FDs and skips skeleton re-creation entirely. If switching root, it uses the deserialized `.bss` map FD to clear `initramfs_s_dev` via a targeted `mmap()` write, preserving the other guard globals in `.bss`.

We intentionally avoid bpffs pinning.
Pinned objects are discoverable and manipulable by any process with sufficient privileges (`BPF_OBJ_GET`, unlink). FD serialization keeps everything private to PID1 with no external attack surface.

### Self-protection

BPF LSM programs attached via the tracing trampoline (`BPF_LSM_MAC`) are inherently tamper-resistant: `bpf_tracing_link_lops` has no `.update_prog` and no `.detach` callbacks, so the kernel rejects `BPF_LINK_UPDATE` with `-EINVAL` and `BPF_LINK_DETACH` with `-EOPNOTSUPP`. Once attached, our programs cannot be modified or detached through the `bpf()` syscall.

The remaining attack vector is map injection: `BPF_MAP_GET_FD_BY_ID` to obtain an FD to `verity_devices`, then `BPF_MAP_UPDATE_ELEM` to insert a fake trusted device. The self-protection guard blocks this with three hooks:

- `lsm/bpf_map` fires inside `bpf_map_new_fd()`, the chokepoint for all code paths that produce a map FD, and denies access to our map IDs from any process other than PID1. PID1 is identified via `tgid == 1`, which is unspoofable: `bpf_get_current_pid_tgid()` reads `current->tgid` from `pid->numbers[0].nr`, the init-namespace PID.
- `lsm/bpf_prog` provides analogous protection for program FDs as defense-in-depth.
- `lsm/bpf` handles `BPF_LINK_GET_FD_BY_ID` at the command level since there is no `security_bpf_link()` hook in the kernel.

The guard starts inactive: all protected IDs default to 0 in `.bss`, and no real BPF object has ID 0, so there is no window where it interferes with PID1's own setup. After attaching all programs, PID1 queries the kernel-assigned IDs via `bpf_obj_get_info_by_fd()` and writes them into the guard's globals. From that point on, the guard is active.

The guard has zero collateral damage: it only denies access to our specific object IDs, leaving bpftrace, bpftool, `RestrictFileSystems=`, and all other BPF usage completely unaffected.
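
The guard's decision rule can be sketched as a small userspace C model. This is an illustration under assumptions, not the code from this series: `guard_model`, `GUARD_MAX_IDS` and `guard_access_allowed()` are hypothetical names, and the caller's `tgid` and the object ID are passed in as plain integers instead of being obtained from `bpf_get_current_pid_tgid()` and the kernel object.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define GUARD_MAX_IDS 4

/* Hypothetical mirror of the guard globals: protected object IDs that
 * start out as 0, which never matches a real BPF object. */
struct guard_model {
        uint32_t protected_ids[GUARD_MAX_IDS];
};

/* Returns true if a process with the given tgid may obtain an FD for the
 * BPF object with the given ID. */
bool guard_access_allowed(const struct guard_model *g, uint32_t obj_id, uint32_t tgid) {
        if (tgid == 1)
                return true; /* PID1 may always reach its own objects */

        for (size_t i = 0; i < GUARD_MAX_IDS; i++) {
                if (g->protected_ids[i] == 0)
                        continue; /* slot not populated yet: guard inactive for it */
                if (g->protected_ids[i] == obj_id)
                        return false; /* protected object, caller is not PID1 */
        }

        return true; /* unrelated BPF object: not our business */
}
```

With every slot still 0 the function allows everything, which matches the "starts inactive" behaviour; once an ID is written, only that ID is denied to callers other than PID1.
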

Additionally, a ptrace guard (`lsm/ptrace_access_check`) blocks `PTRACE_MODE_ATTACH` to PID1 from other processes, preventing extraction of sensitive state from PID1's address space via ptrace, `/proc/1/mem`, `process_vm_readv()`, or `pidfd_getfd()`. `PTRACE_MODE_READ` is allowed so that monitoring tools and `systemctl` continue to work normally.

### Limitations

- The enforcement hooks resolve trust by looking at `file->f_inode->i_sb->s_dev`, the device number of the superblock that owns the file's inode. This works correctly for files directly on a dm-verity block device, but it does not see through overlayfs. When a file is accessed on an overlay mount, `f_inode` points to the overlay inode, and `i_sb->s_dev` is the overlay superblock's anonymous device number, not the underlying dm-verity device. The overlay superblock has no backing block device, so the lookup misses in the verity map and execution is denied by the default policy.

  This means that overlayfs mounts whose lower layers are on dm-verity-protected volumes will currently have execution blocked, even though the actual data is integrity-protected.

  The correct fix requires a kernel extension that allows the BPF program to call something like `d_real_inode()` to resolve through the overlay to the real inode on the underlying filesystem, and then check that inode's superblock device number against the verity map. I plan to add a BPF kfunc exposing this functionality in a follow-up kernel series.
- Multi-device filesystems such as btrfs use entirely synthetic device numbers and there is no way to reach the actual device backing the inode from the inode itself. So `RestrictFileSystemAccess=` only works reliably with a subset of filesystems. In practice this isn't a problem because the feature is tailored to erofs; using it on arbitrary filesystems requires careful vetting of the actual filesystem behaviour.
- The initial implementation also blocks JIT-style execution that relies on memory being mapped executable. This is part of `exec` semantics today and can be loosened later by introducing finer-grained values (a common pattern in systemd, following the precedent of `ProtectSystem=`, which started as a boolean and later grew `auto`/`yes`/`full`/`strict` semantics).
- The configuration is a system-wide setting with no per-unit opt-out. This is intentional for the initial implementation: a global invariant is easier to reason about and harder to accidentally weaken. Per-unit relaxation can be added later if a concrete need arises.

### Testing

The series includes unit tests and integration tests covering both the core enforcement logic and the self-protection guard. The unit test loads the skeleton, attaches programs, populates guard globals, and verifies that protected IDs are set correctly. The integration tests exercise the guard by attempting `BPF_MAP_GET_FD_BY_ID` and `BPF_PROG_GET_FD_BY_ID` from a non-PID1 process and verifying that access is denied.

What we cannot currently test end-to-end is actual execution enforcement against a dm-verity-signed root filesystem. The systemd test suite does not yet have infrastructure for booting a VM with a signed dm-verity rootfs image: the existing mkosi-based test framework lacks the ability to produce and boot such images. This will hopefully change soon when Daan integrates barrage into the test suite.

Signed-off-by: Christian Brauner (Amutable)
---
e28aab8840cbec43830af5c1520e9d9e123ad896