From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Date: Mon, 21 Jul 2025 14:09:27 +0000 (+0200)
Subject: 6.12-stable patches
X-Git-Tag: v6.1.147~47
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=a265c8a52b512f659cf86dabeb2ffb544c60d20e;p=thirdparty%2Fkernel%2Fstable-queue.git

6.12-stable patches

added patches:
	btrfs-fix-block-group-refcount-race-in-btrfs_create_pending_block_groups.patch
	clone_private_mnt-make-sure-that-caller-has-cap_sys_admin-in-the-right-userns.patch
	sched-change-nr_uninterruptible-type-to-unsigned-long.patch
---

diff --git a/queue-6.12/btrfs-fix-block-group-refcount-race-in-btrfs_create_pending_block_groups.patch b/queue-6.12/btrfs-fix-block-group-refcount-race-in-btrfs_create_pending_block_groups.patch
new file mode 100644
index 0000000000..fcf6f2ce8e
--- /dev/null
+++ b/queue-6.12/btrfs-fix-block-group-refcount-race-in-btrfs_create_pending_block_groups.patch
@@ -0,0 +1,108 @@
+From 2d8e5168d48a91e7a802d3003e72afb4304bebfa Mon Sep 17 00:00:00 2001
+From: Boris Burkov <boris@bur.io>
+Date: Wed, 5 Mar 2025 15:03:13 -0800
+Subject: btrfs: fix block group refcount race in btrfs_create_pending_block_groups()
+
+From: Boris Burkov <boris@bur.io>
+
+commit 2d8e5168d48a91e7a802d3003e72afb4304bebfa upstream.
+
+Block group creation is done in two phases, which results in a slightly
+unintuitive property: a block group can be allocated/deallocated from
+after btrfs_make_block_group() adds it to the space_info with
+btrfs_add_bg_to_space_info(), but before creation is completely completed
+in btrfs_create_pending_block_groups(). As a result, it is possible for a
+block group to go unused and have 'btrfs_mark_bg_unused' called on it
+concurrently with 'btrfs_create_pending_block_groups'. This causes a
+number of issues, which were fixed with the block group flag
+'BLOCK_GROUP_FLAG_NEW'.
+
+However, this fix is not quite complete. Since it does not use the
+unused_bg_lock, it is possible for the following race to occur:
+
+btrfs_create_pending_block_groups            btrfs_mark_bg_unused
+                                           if list_empty // false
+        list_del_init
+        clear_bit
+                                           else if (test_bit) // true
+                                                list_move_tail
+
+And we get into the exact same broken ref count and invalid new_bgs
+state for transaction cleanup that BLOCK_GROUP_FLAG_NEW was designed to
+prevent.
+
+The broken refcount aspect will result in a warning like:
+
+  [1272.943527] refcount_t: underflow; use-after-free.
+  [1272.943967] WARNING: CPU: 1 PID: 61 at lib/refcount.c:28 refcount_warn_saturate+0xba/0x110
+  [1272.944731] Modules linked in: btrfs virtio_net xor zstd_compress raid6_pq null_blk [last unloaded: btrfs]
+  [1272.945550] CPU: 1 UID: 0 PID: 61 Comm: kworker/u32:1 Kdump: loaded Tainted: G        W          6.14.0-rc5+ #108
+  [1272.946368] Tainted: [W]=WARN
+  [1272.946585] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Arch Linux 1.16.3-1-1 04/01/2014
+  [1272.947273] Workqueue: btrfs_discard btrfs_discard_workfn [btrfs]
+  [1272.947788] RIP: 0010:refcount_warn_saturate+0xba/0x110
+  [1272.949532] RSP: 0018:ffffbf1200247df0 EFLAGS: 00010282
+  [1272.949901] RAX: 0000000000000000 RBX: ffffa14b00e3f800 RCX: 0000000000000000
+  [1272.950437] RDX: 0000000000000000 RSI: ffffbf1200247c78 RDI: 00000000ffffdfff
+  [1272.950986] RBP: ffffa14b00dc2860 R08: 00000000ffffdfff R09: ffffffff90526268
+  [1272.951512] R10: ffffffff904762c0 R11: 0000000063666572 R12: ffffa14b00dc28c0
+  [1272.952024] R13: 0000000000000000 R14: ffffa14b00dc2868 R15: 000001285dcd12c0
+  [1272.952850] FS:  0000000000000000(0000) GS:ffffa14d33c40000(0000) knlGS:0000000000000000
+  [1272.953458] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+  [1272.953931] CR2: 00007f838cbda000 CR3: 000000010104e000 CR4: 00000000000006f0
+  [1272.954474] Call Trace:
+  [1272.954655]  <TASK>
+  [1272.954812]  ? refcount_warn_saturate+0xba/0x110
+  [1272.955173]  ? __warn.cold+0x93/0xd7
+  [1272.955487]  ? refcount_warn_saturate+0xba/0x110
+  [1272.955816]  ? report_bug+0xe7/0x120
+  [1272.956103]  ? handle_bug+0x53/0x90
+  [1272.956424]  ? exc_invalid_op+0x13/0x60
+  [1272.956700]  ? asm_exc_invalid_op+0x16/0x20
+  [1272.957011]  ? refcount_warn_saturate+0xba/0x110
+  [1272.957399]  btrfs_discard_cancel_work.cold+0x26/0x2b [btrfs]
+  [1272.957853]  btrfs_put_block_group.cold+0x5d/0x8e [btrfs]
+  [1272.958289]  btrfs_discard_workfn+0x194/0x380 [btrfs]
+  [1272.958729]  process_one_work+0x130/0x290
+  [1272.959026]  worker_thread+0x2ea/0x420
+  [1272.959335]  ? __pfx_worker_thread+0x10/0x10
+  [1272.959644]  kthread+0xd7/0x1c0
+  [1272.959872]  ? __pfx_kthread+0x10/0x10
+  [1272.960172]  ret_from_fork+0x30/0x50
+  [1272.960474]  ? __pfx_kthread+0x10/0x10
+  [1272.960745]  ret_from_fork_asm+0x1a/0x30
+  [1272.961035]  </TASK>
+  [1272.961238] ---[ end trace 0000000000000000 ]---
+
+Though we have seen them in the async discard workfn as well. It is
+most likely to happen after a relocation finishes which cancels discard,
+tears down the block group, etc.
+
+Fix this fully by taking the lock around the list_del_init + clear_bit
+so that the two are done atomically.
+
+Fixes: 0657b20c5a76 ("btrfs: fix use-after-free of new block group that became unused")
+Reviewed-by: Qu Wenruo <wqu@suse.com>
+Reviewed-by: Filipe Manana <fdmanana@suse.com>
+Signed-off-by: Boris Burkov <boris@bur.io>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Alva Lan <alvalan9@foxmail.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ fs/btrfs/block-group.c |    3 +++
+ 1 file changed, 3 insertions(+)
+
+--- a/fs/btrfs/block-group.c
++++ b/fs/btrfs/block-group.c
+@@ -2780,8 +2780,11 @@ void btrfs_create_pending_block_groups(s
+ 		/* Already aborted the transaction if it failed. */
+ next:
+ 		btrfs_dec_delayed_refs_rsv_bg_inserts(fs_info);
++
++		spin_lock(&fs_info->unused_bgs_lock);
+ 		list_del_init(&block_group->bg_list);
+ 		clear_bit(BLOCK_GROUP_FLAG_NEW, &block_group->runtime_flags);
++		spin_unlock(&fs_info->unused_bgs_lock);
+ 
+ 		/*
+ 		 * If the block group is still unused, add it to the list of
diff --git a/queue-6.12/clone_private_mnt-make-sure-that-caller-has-cap_sys_admin-in-the-right-userns.patch b/queue-6.12/clone_private_mnt-make-sure-that-caller-has-cap_sys_admin-in-the-right-userns.patch
new file mode 100644
index 0000000000..ba4d4667df
--- /dev/null
+++ b/queue-6.12/clone_private_mnt-make-sure-that-caller-has-cap_sys_admin-in-the-right-userns.patch
@@ -0,0 +1,48 @@
+From c28f922c9dcee0e4876a2c095939d77fe7e15116 Mon Sep 17 00:00:00 2001
+From: Al Viro <viro@zeniv.linux.org.uk>
+Date: Sun, 1 Jun 2025 20:11:06 -0400
+Subject: clone_private_mnt(): make sure that caller has CAP_SYS_ADMIN in the right userns
+
+From: Al Viro <viro@zeniv.linux.org.uk>
+
+commit c28f922c9dcee0e4876a2c095939d77fe7e15116 upstream.
+
+What we want is to verify there is that clone won't expose something
+hidden by a mount we wouldn't be able to undo.  "Wouldn't be able to undo"
+may be a result of MNT_LOCKED on a child, but it may also come from
+lacking admin rights in the userns of the namespace mount belongs to.
+
+clone_private_mnt() checks the former, but not the latter.
+
+There's a number of rather confusing CAP_SYS_ADMIN checks in various
+userns during the mount, especially with the new mount API; they serve
+different purposes and in case of clone_private_mnt() they usually,
+but not always end up covering the missing check mentioned above.
+
+Reviewed-by: Christian Brauner <brauner@kernel.org>
+Reported-by: "Orlando, Noah" <Noah.Orlando@deshaw.com>
+Fixes: 427215d85e8d ("ovl: prevent private clone if bind mount is not allowed")
+Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
+[ merge conflict resolution: clone_private_mount() was reworked in
+  db04662e2f4f ("fs: allow detached mounts in clone_private_mount()").
+  Tweak the relevant ns_capable check so that it works on older kernels ]
+Signed-off-by: Noah Orlando <Noah.Orlando@deshaw.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ fs/namespace.c |    5 +++++
+ 1 file changed, 5 insertions(+)
+
+--- a/fs/namespace.c
++++ b/fs/namespace.c
+@@ -2263,6 +2263,11 @@ struct vfsmount *clone_private_mount(con
+ 	if (!check_mnt(old_mnt))
+ 		goto invalid;
+ 
++	if (!ns_capable(old_mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)) {
++		up_read(&namespace_sem);
++		return ERR_PTR(-EPERM);
++	}
++
+ 	if (has_locked_children(old_mnt, path->dentry))
+ 		goto invalid;
+ 
diff --git a/queue-6.12/sched-change-nr_uninterruptible-type-to-unsigned-long.patch b/queue-6.12/sched-change-nr_uninterruptible-type-to-unsigned-long.patch
new file mode 100644
index 0000000000..99097943c8
--- /dev/null
+++ b/queue-6.12/sched-change-nr_uninterruptible-type-to-unsigned-long.patch
@@ -0,0 +1,54 @@
+From 36569780b0d64de283f9d6c2195fd1a43e221ee8 Mon Sep 17 00:00:00 2001
+From: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
+Date: Wed, 9 Jul 2025 17:33:28 +0000
+Subject: sched: Change nr_uninterruptible type to unsigned long
+
+From: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
+
+commit 36569780b0d64de283f9d6c2195fd1a43e221ee8 upstream.
+
+The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
+32-bit") changed nr_uninterruptible to an unsigned int. But the
+nr_uninterruptible values for each of the CPU runqueues can grow to
+large numbers, sometimes exceeding INT_MAX. This is valid, if, over
+time, a large number of tasks are migrated off of one CPU after going
+into an uninterruptible state. Only the sum of all nr_interruptible
+values across all CPUs yields the correct result, as explained in a
+comment in kernel/sched/loadavg.c.
+
+Change the type of nr_uninterruptible back to unsigned long to prevent
+overflows, and thus the miscalculation of load average.
+
+Fixes: e6fe3f422be1 ("sched: Make multiple runqueue task counters 32-bit")
+
+Signed-off-by: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
+Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
+Link: https://lkml.kernel.org/r/20250709173328.606794-1-aruna.ramakrishna@oracle.com
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ kernel/sched/loadavg.c |    2 +-
+ kernel/sched/sched.h   |    2 +-
+ 2 files changed, 2 insertions(+), 2 deletions(-)
+
+--- a/kernel/sched/loadavg.c
++++ b/kernel/sched/loadavg.c
+@@ -80,7 +80,7 @@ long calc_load_fold_active(struct rq *th
+ 	long nr_active, delta = 0;
+ 
+ 	nr_active = this_rq->nr_running - adjust;
+-	nr_active += (int)this_rq->nr_uninterruptible;
++	nr_active += (long)this_rq->nr_uninterruptible;
+ 
+ 	if (nr_active != this_rq->calc_load_active) {
+ 		delta = nr_active - this_rq->calc_load_active;
+--- a/kernel/sched/sched.h
++++ b/kernel/sched/sched.h
+@@ -1156,7 +1156,7 @@ struct rq {
+ 	 * one CPU and if it got migrated afterwards it may decrease
+ 	 * it on another CPU. Always updated under the runqueue lock:
+ 	 */
+-	unsigned int		nr_uninterruptible;
++	unsigned long 		nr_uninterruptible;
+ 
+ 	struct task_struct __rcu	*curr;
+ 	struct sched_dl_entity	*dl_server;
diff --git a/queue-6.12/series b/queue-6.12/series
index a501eff194..2b04334bdc 100644
--- a/queue-6.12/series
+++ b/queue-6.12/series
@@ -138,3 +138,5 @@ drm-mediatek-only-announce-afbc-if-really-supported.patch
 libbpf-fix-handling-of-bpf-arena-relocations.patch
 efivarfs-fix-memory-leak-of-efivarfs_fs_info-in-fs_c.patch
 sched-change-nr_uninterruptible-type-to-unsigned-long.patch
+clone_private_mnt-make-sure-that-caller-has-cap_sys_admin-in-the-right-userns.patch
+btrfs-fix-block-group-refcount-race-in-btrfs_create_pending_block_groups.patch