From: Christian Brauner <brauner@kernel.org>
Date: Wed, 4 Feb 2026 22:24:31 +0000 (+0100)
Subject: nsresourced: Ensure that all user namespaces are cleaned-up
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=e8416e854b6e5d5164fac72dc575856f641a2cee;p=thirdparty%2Fsystemd.git

nsresourced: Ensure that all user namespaces are cleaned-up

The code here assumes that free_user_ns() is called for every single
user namespace. That however has never been the case and the logic for
free_user_ns() is a bit more involved.

A nested user namespace pins its parent user namespace. IOW, the
lifetime of a parent user namespace is at least as long as that of its
child user namespaces.

If a parent user namespace becomes unused (no namespace file descriptors
or tasks using it anymore) then it will stick around, with its lifetime
still bound to the child user namespace.

free_user_ns() takes advantage of that behavior. If a child user
namespace is freed and its parent user namespace is already unused,
then free_user_ns() will free both the child and the parent user
namespace. This means a single free_user_ns() call frees two user
namespaces. Hence, the bpf program never sees the parent user namespace
being freed.
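
To make the miscount concrete, here is a minimal userspace sketch of the
behavior described above. It is a toy model, not kernel code: every type
and function name in it (toy_userns, toy_free, toy_teardown, and so on)
is hypothetical, and walking the parent chain in a loop is an assumption
about how the cascade generalizes beyond the two-namespace case. It only
counts how often the outer free function is entered versus how often a
per-namespace teardown step runs.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these types and names are hypothetical, not kernel API. */
struct toy_userns {
        struct toy_userns *parent; /* a child holds one reference on its parent */
        int refcount;
        int freed;
};

struct toy_counters {
        int entry_calls;    /* times the outer free function is entered */
        int teardown_calls; /* times the per-namespace teardown step runs */
};

/* Stand-in for a teardown step that runs once per freed namespace. */
static void toy_teardown(struct toy_userns *ns, struct toy_counters *c) {
        ns->freed = 1;
        c->teardown_calls++;
}

/* Stand-in for the outer free function: entered once, may free several. */
static void toy_free(struct toy_userns *ns, struct toy_counters *c) {
        c->entry_calls++;
        while (ns) {
                struct toy_userns *parent = ns->parent;

                toy_teardown(ns, c);
                /* Drop the reference the child held on its parent and
                 * stop if there is no parent or it is still in use. */
                if (!parent || --parent->refcount > 0)
                        break;
                ns = parent;
        }
}

/* Drop one reference and free the namespace on the last one. */
static void toy_put(struct toy_userns *ns, struct toy_counters *c) {
        assert(ns->refcount > 0);
        if (--ns->refcount == 0)
                toy_free(ns, c);
}

/* Parent becomes unused while the child is alive, then the child goes away. */
static struct toy_counters toy_scenario(void) {
        struct toy_counters c = { 0, 0 };
        struct toy_userns parent = { NULL, 1, 0 };   /* one user reference */
        struct toy_userns child = { &parent, 1, 0 };

        parent.refcount++;      /* the child pins the parent */
        toy_put(&parent, &c);   /* last user reference gone, still pinned */
        assert(!parent.freed);
        toy_put(&child, &c);    /* frees the child, then the parent */
        assert(child.freed && parent.freed);
        return c;
}
```

In this model toy_scenario() enters toy_free() once while toy_teardown()
runs twice, which mirrors why a probe on the outer function can miss the
parent while a probe on a per-namespace step does not.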

We can fix this by piggy-backing on retire_userns_sysctls(), which is
called for every single user namespace being freed. This requires
CONFIG_SYSCTL but systemd doesn't work without that anyway.

The return type of the bpf program needs to change from void to int, a
scalar type as required by libbpf.

Long-term what we need is appropriate LSM infrastructure for this
including hooks that get called on namespace destruction.

Thanks to Daan DeMeyer for figuring out that the cast is needed.

Signed-off-by: Christian Brauner <brauner@kernel.org>
---

diff --git a/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c b/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c
index f327e9004b3..10abcc32276 100644
--- a/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c
+++ b/src/nsresourced/bpf/userns-restrict/userns-restrict.bpf.c
@@ -155,25 +155,22 @@ int BPF_PROG(userns_restrict_path_link, struct dentry *old_dentry, const struct
         return validate_path(new_dir, ret);
 }
 
-SEC("kprobe/free_user_ns")
-void BPF_KPROBE(userns_restrict_free_user_ns, struct work_struct *work) {
-        struct user_namespace *userns;
+SEC("kprobe/retire_userns_sysctls")
+int BPF_KPROBE(userns_restrict_retire_userns_sysctls, struct user_namespace *userns) {
         unsigned inode;
         void *mnt_id_map;
 
         /* Inform userspace that a user namespace just went away. I wish there was a nicer way to hook into
          * user namespaces being deleted than using kprobes, but couldn't find any. */
-
-        userns = bpf_rdonly_cast(container_of(work, struct user_namespace, work),
-                                 bpf_core_type_id_kernel(struct user_namespace));
-
+        userns = bpf_rdonly_cast(userns, bpf_core_type_id_kernel(struct user_namespace));
         inode = userns->ns.inum;
 
         mnt_id_map = bpf_map_lookup_elem(&userns_mnt_id_hash, &inode);
         if (!mnt_id_map) /* No rules installed for this userns? Then send no notification. */
-                return;
+                return 0;
 
         bpf_ringbuf_output(&userns_ringbuf, &inode, sizeof(inode), 0);
+        return 0;
 }
 
 static const char _license[] SEC("license") = "GPL";