From: Greg Kroah-Hartman Date: Fri, 2 Sep 2022 06:14:35 +0000 (+0200) Subject: 5.15-stable patches X-Git-Tag: v4.9.327~13 X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=f8c56d049165fb4133051bd06d39f6ecb557fadc;p=thirdparty%2Fkernel%2Fstable-queue.git 5.15-stable patches added patches: btrfs-fix-space-cache-corruption-and-potential-double-allocations.patch kprobes-don-t-call-disarm_kprobe-for-disabled-kprobes.patch --- diff --git a/queue-5.15/btrfs-fix-space-cache-corruption-and-potential-double-allocations.patch b/queue-5.15/btrfs-fix-space-cache-corruption-and-potential-double-allocations.patch new file mode 100644 index 00000000000..8b8f29e5f54 --- /dev/null +++ b/queue-5.15/btrfs-fix-space-cache-corruption-and-potential-double-allocations.patch @@ -0,0 +1,304 @@ +From ced8ecf026fd8084cf175530ff85c76d6085d715 Mon Sep 17 00:00:00 2001 +From: Omar Sandoval +Date: Tue, 23 Aug 2022 11:28:13 -0700 +Subject: btrfs: fix space cache corruption and potential double allocations + +From: Omar Sandoval + +commit ced8ecf026fd8084cf175530ff85c76d6085d715 upstream. + +When testing space_cache v2 on a large set of machines, we encountered a +few symptoms: + +1. "unable to add free space :-17" (EEXIST) errors. +2. Missing free space info items, sometimes caught with a "missing free + space info for X" error. +3. Double-accounted space: ranges that were allocated in the extent tree + and also marked as free in the free space tree, ranges that were + marked as allocated twice in the extent tree, or ranges that were + marked as free twice in the free space tree. If the latter made it + onto disk, the next reboot would hit the BUG_ON() in + add_new_free_space(). +4. On some hosts with no on-disk corruption or error messages, the + in-memory space cache (dumped with drgn) disagreed with the free + space tree. + +All of these symptoms have the same underlying cause: a race between +caching the free space for a block group and returning free space to the +in-memory space cache for pinned extents causes us to double-add a free +range to the space cache. This race exists when free space is cached +from the free space tree (space_cache=v2) or the extent tree +(nospace_cache, or space_cache=v1 if the cache needs to be regenerated). +struct btrfs_block_group::last_byte_to_unpin and struct +btrfs_block_group::progress are supposed to protect against this race, +but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when +waiting for a transaction commit") subtly broke this by allowing +multiple transactions to be unpinning extents at the same time. + +Specifically, the race is as follows: + +1. An extent is deleted from an uncached block group in transaction A. +2. btrfs_commit_transaction() is called for transaction A. +3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed + ref for the deleted extent. +4. __btrfs_free_extent() -> do_free_extent_accounting() -> + add_to_free_space_tree() adds the deleted extent back to the free + space tree. +5. do_free_extent_accounting() -> btrfs_update_block_group() -> + btrfs_cache_block_group() queues up the block group to get cached. + block_group->progress is set to block_group->start. +6. btrfs_commit_transaction() for transaction A calls + switch_commit_roots(). It sets block_group->last_byte_to_unpin to + block_group->progress, which is block_group->start because the block + group hasn't been cached yet. +7. The caching thread gets to our block group. 
Since the commit roots + were already switched, load_free_space_tree() sees the deleted extent + as free and adds it to the space cache. It finishes caching and sets + block_group->progress to U64_MAX. +8. btrfs_commit_transaction() advances transaction A to + TRANS_STATE_SUPER_COMMITTED. +9. fsync calls btrfs_commit_transaction() for transaction B. Since + transaction A is already in TRANS_STATE_SUPER_COMMITTED and the + commit is for fsync, it advances. +10. btrfs_commit_transaction() for transaction B calls + switch_commit_roots(). This time, the block group has already been + cached, so it sets block_group->last_byte_to_unpin to U64_MAX. +11. btrfs_commit_transaction() for transaction A calls + btrfs_finish_extent_commit(), which calls unpin_extent_range() for + the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by + transaction B!), so it adds the deleted extent to the space cache + again! + +This explains all of our symptoms above: + +* If the sequence of events is exactly as described above, when the free + space is re-added in step 11, it will fail with EEXIST. +* If another thread reallocates the deleted extent in between steps 7 + and 11, then step 11 will silently re-add that space to the space + cache as free even though it is actually allocated. Then, if that + space is allocated *again*, the free space tree will be corrupted + (namely, the wrong item will be deleted). +* If we don't catch this free space tree corruption, it will continue + to get worse as extents are deleted and reallocated. + +The v1 space_cache is synchronously loaded when an extent is deleted +(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group() +with load_cache_only=1), so it is not normally affected by this bug. +However, as noted above, if we fail to load the space cache, we will +fall back to caching from the extent tree and may hit this bug. + +The easiest fix for this race is to also make caching from the free +space tree or extent tree synchronous. Josef tested this and found no +performance regressions. + +A few extra changes fall out of this change. Namely, this fix does the +following, with step 2 being the crucial fix: + +1. Factor btrfs_caching_ctl_wait_done() out of + btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl + that we already hold a reference to. +2. Change the call in btrfs_cache_block_group() of + btrfs_wait_space_cache_v1_finished() to + btrfs_caching_ctl_wait_done(), which makes us wait regardless of the + space_cache option. +3. Delete the now unused btrfs_wait_space_cache_v1_finished() and + space_cache_v1_done(). +4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to + `bool wait` to more accurately describe its new meaning. +5. Change a few callers which had a separate call to + btrfs_wait_block_group_cache_done() to use wait = true instead. +6. Make btrfs_wait_block_group_cache_done() static now that it's not + used outside of block-group.c anymore. 
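+
+As an illustration of items 1 and 2, a simplified sketch of the new wait path
+(names as in the block-group.c hunks below; the caching-work setup, locking and
+refcounting are elided):
+
+  /* Wait on a caching_ctl we already hold a reference to (item 1). */
+  static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache,
+                                         struct btrfs_caching_control *caching_ctl)
+  {
+          /* Block until the caching thread has finished loading free space. */
+          wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
+          return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0;
+  }
+
+  int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
+  {
+          struct btrfs_caching_control *caching_ctl = NULL;
+          int ret = 0;
+
+          /* ... allocate caching_ctl and queue the caching work ... */
+
+          /*
+           * Item 2: if the caller asked to wait, wait for caching to finish
+           * regardless of the space_cache option, so unpinning can never
+           * race with a half-cached block group.
+           */
+          if (wait && caching_ctl)
+                  ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
+          if (caching_ctl)
+                  btrfs_put_caching_control(caching_ctl);
+          return ret;
+  }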
+ +Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") +CC: stable@vger.kernel.org # 5.12+ +Reviewed-by: Filipe Manana +Signed-off-by: Omar Sandoval +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/block-group.c | 47 +++++++++++++++-------------------------------- + fs/btrfs/block-group.h | 4 +--- + fs/btrfs/ctree.h | 1 - + fs/btrfs/extent-tree.c | 30 ++++++------------------------ + 4 files changed, 22 insertions(+), 60 deletions(-) + +--- a/fs/btrfs/block-group.c ++++ b/fs/btrfs/block-group.c +@@ -418,39 +418,26 @@ void btrfs_wait_block_group_cache_progre + btrfs_put_caching_control(caching_ctl); + } + +-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache) ++static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache, ++ struct btrfs_caching_control *caching_ctl) ++{ ++ wait_event(caching_ctl->wait, btrfs_block_group_done(cache)); ++ return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0; ++} ++ ++static int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache) + { + struct btrfs_caching_control *caching_ctl; +- int ret = 0; ++ int ret; + + caching_ctl = btrfs_get_caching_control(cache); + if (!caching_ctl) + return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0; +- +- wait_event(caching_ctl->wait, btrfs_block_group_done(cache)); +- if (cache->cached == BTRFS_CACHE_ERROR) +- ret = -EIO; ++ ret = btrfs_caching_ctl_wait_done(cache, caching_ctl); + btrfs_put_caching_control(caching_ctl); + return ret; + } + +-static bool space_cache_v1_done(struct btrfs_block_group *cache) +-{ +- bool ret; +- +- spin_lock(&cache->lock); +- ret = cache->cached != BTRFS_CACHE_FAST; +- spin_unlock(&cache->lock); +- +- return ret; +-} +- +-void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache, +- struct btrfs_caching_control *caching_ctl) +-{ +- wait_event(caching_ctl->wait, space_cache_v1_done(cache)); +-} +- + #ifdef CONFIG_BTRFS_DEBUG + static void fragment_free_space(struct btrfs_block_group *block_group) + { +@@ -727,9 +714,8 @@ done: + btrfs_put_block_group(block_group); + } + +-int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only) ++int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait) + { +- DEFINE_WAIT(wait); + struct btrfs_fs_info *fs_info = cache->fs_info; + struct btrfs_caching_control *caching_ctl = NULL; + int ret = 0; +@@ -762,10 +748,7 @@ int btrfs_cache_block_group(struct btrfs + } + WARN_ON(cache->caching_ctl); + cache->caching_ctl = caching_ctl; +- if (btrfs_test_opt(fs_info, SPACE_CACHE)) +- cache->cached = BTRFS_CACHE_FAST; +- else +- cache->cached = BTRFS_CACHE_STARTED; ++ cache->cached = BTRFS_CACHE_STARTED; + cache->has_caching_ctl = 1; + spin_unlock(&cache->lock); + +@@ -778,8 +761,8 @@ int btrfs_cache_block_group(struct btrfs + + btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work); + out: +- if (load_cache_only && caching_ctl) +- btrfs_wait_space_cache_v1_finished(cache, caching_ctl); ++ if (wait && caching_ctl) ++ ret = btrfs_caching_ctl_wait_done(cache, caching_ctl); + if (caching_ctl) + btrfs_put_caching_control(caching_ctl); + +@@ -3200,7 +3183,7 @@ int btrfs_update_block_group(struct btrf + * space back to the block group, otherwise we will leak space. 
+ */ + if (!alloc && !btrfs_block_group_done(cache)) +- btrfs_cache_block_group(cache, 1); ++ btrfs_cache_block_group(cache, true); + + byte_in_group = bytenr - cache->start; + WARN_ON(byte_in_group > cache->length); +--- a/fs/btrfs/block-group.h ++++ b/fs/btrfs/block-group.h +@@ -251,9 +251,7 @@ void btrfs_dec_nocow_writers(struct btrf + void btrfs_wait_nocow_writers(struct btrfs_block_group *bg); + void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache, + u64 num_bytes); +-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache); +-int btrfs_cache_block_group(struct btrfs_block_group *cache, +- int load_cache_only); ++int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait); + void btrfs_put_caching_control(struct btrfs_caching_control *ctl); + struct btrfs_caching_control *btrfs_get_caching_control( + struct btrfs_block_group *cache); +--- a/fs/btrfs/ctree.h ++++ b/fs/btrfs/ctree.h +@@ -454,7 +454,6 @@ struct btrfs_free_cluster { + enum btrfs_caching_type { + BTRFS_CACHE_NO, + BTRFS_CACHE_STARTED, +- BTRFS_CACHE_FAST, + BTRFS_CACHE_FINISHED, + BTRFS_CACHE_ERROR, + }; +--- a/fs/btrfs/extent-tree.c ++++ b/fs/btrfs/extent-tree.c +@@ -2572,17 +2572,10 @@ int btrfs_pin_extent_for_log_replay(stru + return -EINVAL; + + /* +- * pull in the free space cache (if any) so that our pin +- * removes the free space from the cache. We have load_only set +- * to one because the slow code to read in the free extents does check +- * the pinned extents. ++ * Fully cache the free space first so that our pin removes the free space ++ * from the cache. + */ +- btrfs_cache_block_group(cache, 1); +- /* +- * Make sure we wait until the cache is completely built in case it is +- * missing or is invalid and therefore needs to be rebuilt. +- */ +- ret = btrfs_wait_block_group_cache_done(cache); ++ ret = btrfs_cache_block_group(cache, true); + if (ret) + goto out; + +@@ -2605,12 +2598,7 @@ static int __exclude_logged_extent(struc + if (!block_group) + return -EINVAL; + +- btrfs_cache_block_group(block_group, 1); +- /* +- * Make sure we wait until the cache is completely built in case it is +- * missing or is invalid and therefore needs to be rebuilt. 
+- */ +- ret = btrfs_wait_block_group_cache_done(block_group); ++ ret = btrfs_cache_block_group(block_group, true); + if (ret) + goto out; + +@@ -4324,7 +4312,7 @@ have_block_group: + ffe_ctl.cached = btrfs_block_group_done(block_group); + if (unlikely(!ffe_ctl.cached)) { + ffe_ctl.have_caching_bg = true; +- ret = btrfs_cache_block_group(block_group, 0); ++ ret = btrfs_cache_block_group(block_group, false); + + /* + * If we get ENOMEM here or something else we want to +@@ -6082,13 +6070,7 @@ int btrfs_trim_fs(struct btrfs_fs_info * + + if (end - start >= range->minlen) { + if (!btrfs_block_group_done(cache)) { +- ret = btrfs_cache_block_group(cache, 0); +- if (ret) { +- bg_failed++; +- bg_ret = ret; +- continue; +- } +- ret = btrfs_wait_block_group_cache_done(cache); ++ ret = btrfs_cache_block_group(cache, true); + if (ret) { + bg_failed++; + bg_ret = ret; diff --git a/queue-5.15/kprobes-don-t-call-disarm_kprobe-for-disabled-kprobes.patch b/queue-5.15/kprobes-don-t-call-disarm_kprobe-for-disabled-kprobes.patch new file mode 100644 index 00000000000..499bccfac19 --- /dev/null +++ b/queue-5.15/kprobes-don-t-call-disarm_kprobe-for-disabled-kprobes.patch @@ -0,0 +1,150 @@ +From 9c80e79906b4ca440d09e7f116609262bb747909 Mon Sep 17 00:00:00 2001 +From: Kuniyuki Iwashima +Date: Fri, 12 Aug 2022 19:05:09 -0700 +Subject: kprobes: don't call disarm_kprobe() for disabled kprobes + +From: Kuniyuki Iwashima + +commit 9c80e79906b4ca440d09e7f116609262bb747909 upstream. + +The assumption in __disable_kprobe() is wrong, and it could try to disarm +an already disarmed kprobe and fire the WARN_ONCE() below. [0] We can +easily reproduce this issue. + +1. Write 0 to /sys/kernel/debug/kprobes/enabled. + + # echo 0 > /sys/kernel/debug/kprobes/enabled + +2. Run execsnoop. At this time, one kprobe is disabled. + + # /usr/share/bcc/tools/execsnoop & + [1] 2460 + PCOMM PID PPID RET ARGS + + # cat /sys/kernel/debug/kprobes/list + ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE] + ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE] + +3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes + kprobes_all_disarmed to false but does not arm the disabled kprobe. + + # echo 1 > /sys/kernel/debug/kprobes/enabled + + # cat /sys/kernel/debug/kprobes/list + ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE] + ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE] + +4. Kill execsnoop, when __disable_kprobe() calls disarm_kprobe() for the + disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace(). + + # fg + /usr/share/bcc/tools/execsnoop + ^C + +Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses +some cleanups and leaves the aggregated kprobe in the hash table. Then, +__unregister_trace_kprobe() initialises tk->rp.kp.list and creates an +infinite loop like this. + + aggregated kprobe.list -> kprobe.list -. + ^ | + '.__.' + +In this situation, these commands fall into the infinite loop and result +in RCU stall or soft lockup. + + cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the + infinite loop with RCU. + + /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex, + and __get_valid_kprobe() is stuck in + the loop. + +To avoid the issue, make sure we don't call disarm_kprobe() for disabled +kprobes. 
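+
+Roughly, the resulting guard looks as follows (a simplified sketch of the
+kernel/kprobes.c hunk at the end of this patch; error propagation and the rest
+of __disable_kprobe() are elided):
+
+  /* Try to disarm and disable this/parent probe */
+  if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
+          /*
+           * Even if 'kprobes_all_disarmed' is false, 'orig_p' might not have
+           * been armed yet, because arm_all_kprobes() only arms kprobes on a
+           * best-effort basis.  Only disarm a probe that is not already
+           * marked disabled.
+           */
+          if (!kprobes_all_disarmed && !kprobe_disabled(orig_p)) {
+                  ret = disarm_kprobe(orig_p, true);
+                  if (ret) {
+                          p->flags &= ~KPROBE_FLAG_DISABLED;
+                          /* ... propagate the error to the caller ... */
+                  }
+          }
+  }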
+ +[0] +Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2) +WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129) +Modules linked in: ena +CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28 +Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017 +RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129) +Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94 +RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282 +RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001 +RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff +RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff +R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40 +R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000 +FS: 00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000 +CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0 +DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 +PKRU: 55555554 +Call Trace: + + __disable_kprobe (kernel/kprobes.c:1716) + disable_kprobe (kernel/kprobes.c:2392) + __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340) + disable_trace_kprobe (kernel/trace/trace_kprobe.c:429) + perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168) + perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295) + _free_event (kernel/events/core.c:4971) + perf_event_release_kernel (kernel/events/core.c:5176) + perf_release (kernel/events/core.c:5186) + __fput (fs/file_table.c:321) + task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1)) + exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201) + syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296) + do_syscall_64 (arch/x86/entry/common.c:87) + entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120) +RIP: 0033:0x7fe7ff210654 +Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc +RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 +RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654 +RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008 +RBP: 0000000000000000 R08: 94ae31d6fda838a4 R0900007fe8001c9d30 +R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600 +R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560 + + +Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com +Fixes: 69d54b916d83 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.") +Signed-off-by: Kuniyuki Iwashima +Reported-by: Ayushman Dutta +Cc: "Naveen N. Rao" +Cc: Anil S Keshavamurthy +Cc: "David S. 
Miller" +Cc: Masami Hiramatsu +Cc: Wang Nan +Cc: Kuniyuki Iwashima +Cc: Kuniyuki Iwashima +Cc: Ayushman Dutta +Cc: +Signed-off-by: Andrew Morton +Signed-off-by: Greg Kroah-Hartman +--- + kernel/kprobes.c | 9 +++++---- + 1 file changed, 5 insertions(+), 4 deletions(-) + +--- a/kernel/kprobes.c ++++ b/kernel/kprobes.c +@@ -1705,11 +1705,12 @@ static struct kprobe *__disable_kprobe(s + /* Try to disarm and disable this/parent probe */ + if (p == orig_p || aggr_kprobe_disabled(orig_p)) { + /* +- * If kprobes_all_disarmed is set, orig_p +- * should have already been disarmed, so +- * skip unneed disarming process. ++ * Don't be lazy here. Even if 'kprobes_all_disarmed' ++ * is false, 'orig_p' might not have been armed yet. ++ * Note arm_all_kprobes() __tries__ to arm all kprobes ++ * on the best effort basis. + */ +- if (!kprobes_all_disarmed) { ++ if (!kprobes_all_disarmed && !kprobe_disabled(orig_p)) { + ret = disarm_kprobe(orig_p, true); + if (ret) { + p->flags &= ~KPROBE_FLAG_DISABLED; diff --git a/queue-5.15/series b/queue-5.15/series index b87d8c2597d..278dc4f1ff1 100644 --- a/queue-5.15/series +++ b/queue-5.15/series @@ -66,3 +66,5 @@ testing-selftests-nft_flowtable.sh-use-random-netns-.patch btrfs-move-lockdep-class-helpers-to-locking.c.patch btrfs-fix-lockdep-splat-with-reloc-root-extent-buffe.patch btrfs-tree-checker-check-for-overlapping-extent-item.patch +kprobes-don-t-call-disarm_kprobe-for-disabled-kprobes.patch +btrfs-fix-space-cache-corruption-and-potential-double-allocations.patch