core: preserve RestrictFileSystemAccess= BPF state across daemon-reexec

The BPF link and .bss map FDs must survive PID1 re-execution
(daemon-reexec, switch_root, soft-reboot). Without serialization,
manager_free() closes them before execv, programs detach, and the
verity_devices map is freed. After exec, a fresh skeleton would start
with an empty map, but existing dm-verity devices have already called
bdev_setintegrity and won't call it again, so the map would never be
repopulated. The result would be a default-deny policy with an empty
map, i.e., all execution denied and the system bricked.

Add serialize/deserialize support using systemd's existing
serialize_fd / fdset_cloexec / deserialize_fd infrastructure:

Before exec (in manager_serialize via bpf_restrict_fsaccess_serialize):
- Dup each link FD and the .bss map FD into the FDSet
- fdset_cloexec(fds, false) + execv() preserves them across exec
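
For illustration only, the "before exec" steps can be modelled with plain
POSIX calls. This is a minimal sketch and not the code in this commit:
fd_set_cloexec(), serialize_fd_copy() and the "key=<fd>" line format are
hypothetical stand-ins for what fdset_cloexec() and serialize_fd() do
inside systemd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: set or clear FD_CLOEXEC on a single descriptor.
 * Clearing the flag is what lets the descriptor survive execv(). */
static int fd_set_cloexec(int fd, int cloexec) {
        int flags;

        assert(fd >= 0);

        flags = fcntl(fd, F_GETFD);
        if (flags < 0)
                return -1;

        flags = cloexec ? (flags | FD_CLOEXEC) : (flags & ~FD_CLOEXEC);
        return fcntl(fd, F_SETFD, flags);
}

/* Hypothetical helper: duplicate fd, record "key=<copy>" in the
 * serialization stream and return the copy (or -1 on error). The copy
 * keeps FD_CLOEXEC set until the caller is about to exec, and it stays
 * open even if the original is closed during teardown. */
static int serialize_fd_copy(FILE *f, const char *key, int fd) {
        int copy;

        assert(f);
        assert(key);

        copy = fcntl(fd, F_DUPFD_CLOEXEC, 3);
        if (copy < 0)
                return -1;

        if (fprintf(f, "%s=%d\n", key, copy) < 0) {
                close(copy);
                return -1;
        }

        return copy;
}
```

A caller would invoke serialize_fd_copy() once per link FD and once for
the .bss map FD, then fd_set_cloexec(copy, 0) on each copy right before
execv().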

After exec (in manager_deserialize + bpf_restrict_fsaccess_setup):
- Deserialize the link FDs and .bss map FD into the Manager struct
- bpf_restrict_fsaccess_setup() detects the deserialized FDs and skips
  skeleton re-creation entirely, since the programs are already attached
- If no longer in initrd, clear initramfs_s_dev in the kernel map
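
The "after exec" side can be sketched the same way. Again this is an
assumption-laden model, not the actual implementation: struct
fsaccess_fds, deserialize_fd_value() and fsaccess_already_attached() are
hypothetical names, and a single link FD stands in for however many the
real Manager keeps.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the Manager fields holding inherited FDs. */
struct fsaccess_fds {
        int link_fd;      /* one BPF link FD, -1 if not inherited */
        int bss_map_fd;   /* .bss map FD, -1 if not inherited */
};

/* Parse a serialized FD number, verify that it names an open descriptor
 * in this process and re-arm FD_CLOEXEC on it. Returns the FD or -1. */
static int deserialize_fd_value(const char *value) {
        char *end = NULL;
        long n;
        int flags;

        assert(value);

        errno = 0;
        n = strtol(value, &end, 10);
        if (errno != 0 || end == value || *end != '\0' || n < 0 || n > INT_MAX)
                return -1;

        flags = fcntl((int) n, F_GETFD);
        if (flags < 0)
                return -1; /* number does not refer to an open FD */

        if (fcntl((int) n, F_SETFD, flags | FD_CLOEXEC) < 0)
                return -1;

        return (int) n;
}

/* Setup can skip creating a fresh skeleton only if every FD was
 * inherited; otherwise it has to start from scratch. */
static bool fsaccess_already_attached(const struct fsaccess_fds *s) {
        assert(s);

        return s->link_fd >= 0 && s->bss_map_fd >= 0;
}
```

Re-arming FD_CLOEXEC right after deserialization keeps the FDs from
leaking into child processes that PID1 spawns later.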

No bpffs pinning is needed. This avoids a bpffs mount dependency and
eliminates the external attack surface that pinned objects would create
(discoverable/manipulable via unlink or BPF_OBJ_GET). The FDs remain
private to PID1.

Signed-off-by: Christian Brauner <brauner@kernel.org>