From: Christian Brauner
Date: Wed, 4 Feb 2026 22:24:31 +0000 (+0100)
Subject: nsresourced: Ensure that all user namespaces are cleaned-up
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=e8416e854b6e5d5164fac72dc575856f641a2cee;p=thirdparty%2Fsystemd.git

nsresourced: Ensure that all user namespaces are cleaned-up

The code here assumes that free_user_ns() is called for every single user
namespace. That, however, has never been the case, and the logic of
free_user_ns() is a bit more involved.

A nested user namespace pins its parent user namespace. IOW, the lifetime
of a parent user namespace is at least as long as that of its child user
namespaces. If a parent user namespace becomes unused (no namespace file
descriptors or tasks use it anymore), it sticks around, its lifetime still
bound to the child user namespace.

free_user_ns() takes advantage of that behavior. If a child user namespace
is freed and its parent user namespace is already unused, then
free_user_ns() frees both the child and the parent user namespace. This
means a single free_user_ns() call can free two (or more) user namespaces,
since it keeps walking up to any ancestor that is likewise unused. Hence,
the BPF program, which hooks free_user_ns(), never sees the parent user
namespace being freed.

We can fix this by piggy-backing on another function that is called for
every single user namespace being freed: retire_userns_sysctls(). This
requires CONFIG_SYSCTL, but systemd doesn't work without that anyway. The
return type of the program needs to change to a scalar type, as required
by libbpf.

In the long term, what we need is appropriate LSM infrastructure for this,
including hooks that get called on namespace destruction.

Thanks to Daan DeMeyer for figuring out that the cast is needed.
Signed-off-by: Christian Brauner
---

diff --git a/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c b/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c
index f327e9004b3..10abcc32276 100644
--- a/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c
+++ b/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c
@@ -155,25 +155,22 @@ int BPF_PROG(userns_restrict_path_link, struct dentry *old_dentry, const struct
         return validate_path(new_dir, ret);
 }

-SEC("kprobe/free_user_ns")
-void BPF_KPROBE(userns_restrict_free_user_ns, struct work_struct *work) {
-        struct user_namespace *userns;
+SEC("kprobe/retire_userns_sysctls")
+int BPF_KPROBE(userns_restrict_retire_userns_sysctls, struct user_namespace *userns) {
         unsigned inode;
         void *mnt_id_map;

         /* Inform userspace that a user namespace just went away. I wish there was a nicer way to hook into
          * user namespaces being deleted than using kprobes, but couldn't find any. */
-
-        userns = bpf_rdonly_cast(container_of(work, struct user_namespace, work),
-                                 bpf_core_type_id_kernel(struct user_namespace));
-
+        userns = bpf_rdonly_cast(userns, bpf_core_type_id_kernel(struct user_namespace));
         inode = userns->ns.inum;

         mnt_id_map = bpf_map_lookup_elem(&userns_mnt_id_hash, &inode);
         if (!mnt_id_map) /* No rules installed for this userns? Then send no notification. */
-                return;
+                return 0;

         bpf_ringbuf_output(&userns_ringbuf, &inode, sizeof(inode), 0);
+        return 0;
 }

 static const char _license[] SEC("license") = "GPL";