--- /dev/null
+From ced8ecf026fd8084cf175530ff85c76d6085d715 Mon Sep 17 00:00:00 2001
+From: Omar Sandoval <osandov@fb.com>
+Date: Tue, 23 Aug 2022 11:28:13 -0700
+Subject: btrfs: fix space cache corruption and potential double allocations
+
+From: Omar Sandoval <osandov@fb.com>
+
+commit ced8ecf026fd8084cf175530ff85c76d6085d715 upstream.
+
+When testing space_cache v2 on a large set of machines, we encountered a
+few symptoms:
+
+1. "unable to add free space :-17" (EEXIST) errors.
+2. Missing free space info items, sometimes caught with a "missing free
+ space info for X" error.
+3. Double-accounted space: ranges that were allocated in the extent tree
+ and also marked as free in the free space tree, ranges that were
+ marked as allocated twice in the extent tree, or ranges that were
+ marked as free twice in the free space tree. If the latter made it
+ onto disk, the next reboot would hit the BUG_ON() in
+ add_new_free_space().
+4. On some hosts with no on-disk corruption or error messages, the
+ in-memory space cache (dumped with drgn) disagreed with the free
+ space tree.
+
+All of these symptoms have the same underlying cause: a race between
+caching the free space for a block group and returning free space to the
+in-memory space cache for pinned extents causes us to double-add a free
+range to the space cache. This race exists when free space is cached
+from the free space tree (space_cache=v2) or the extent tree
+(nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
+struct btrfs_block_group::last_byte_to_unpin and struct
+btrfs_block_group::progress are supposed to protect against this race,
+but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
+waiting for a transaction commit") subtly broke this by allowing
+multiple transactions to be unpinning extents at the same time.
+
+Specifically, the race is as follows:
+
+1. An extent is deleted from an uncached block group in transaction A.
+2. btrfs_commit_transaction() is called for transaction A.
+3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
+ ref for the deleted extent.
+4. __btrfs_free_extent() -> do_free_extent_accounting() ->
+ add_to_free_space_tree() adds the deleted extent back to the free
+ space tree.
+5. do_free_extent_accounting() -> btrfs_update_block_group() ->
+ btrfs_cache_block_group() queues up the block group to get cached.
+ block_group->progress is set to block_group->start.
+6. btrfs_commit_transaction() for transaction A calls
+ switch_commit_roots(). It sets block_group->last_byte_to_unpin to
+ block_group->progress, which is block_group->start because the block
+ group hasn't been cached yet.
+7. The caching thread gets to our block group. Since the commit roots
+ were already switched, load_free_space_tree() sees the deleted extent
+ as free and adds it to the space cache. It finishes caching and sets
+ block_group->progress to U64_MAX.
+8. btrfs_commit_transaction() advances transaction A to
+ TRANS_STATE_SUPER_COMMITTED.
+9. fsync calls btrfs_commit_transaction() for transaction B. Since
+ transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
+ commit is for fsync, it advances.
+10. btrfs_commit_transaction() for transaction B calls
+ switch_commit_roots(). This time, the block group has already been
+ cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
+11. btrfs_commit_transaction() for transaction A calls
+ btrfs_finish_extent_commit(), which calls unpin_extent_range() for
+ the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
+ transaction B!), so it adds the deleted extent to the space cache
+ again!
+
+This explains all of our symptoms above:
+
+* If the sequence of events is exactly as described above, when the free
+ space is re-added in step 11, it will fail with EEXIST.
+* If another thread reallocates the deleted extent in between steps 7
+ and 11, then step 11 will silently re-add that space to the space
+ cache as free even though it is actually allocated. Then, if that
+ space is allocated *again*, the free space tree will be corrupted
+ (namely, the wrong item will be deleted).
+* If we don't catch this free space tree corruption, it will continue
+ to get worse as extents are deleted and reallocated.
+
+The v1 space_cache is synchronously loaded when an extent is deleted
+(btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
+with load_cache_only=1), so it is not normally affected by this bug.
+However, as noted above, if we fail to load the space cache, we will
+fall back to caching from the extent tree and may hit this bug.
+
+The easiest fix for this race is to also make caching from the free
+space tree or extent tree synchronous. Josef tested this and found no
+performance regressions.
+
+A few extra changes fall out of this change. Namely, this fix does the
+following, with step 2 being the crucial fix:
+
+1. Factor btrfs_caching_ctl_wait_done() out of
+ btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
+ that we already hold a reference to.
+2. Change the call in btrfs_cache_block_group() of
+ btrfs_wait_space_cache_v1_finished() to
+ btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
+ space_cache option.
+3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
+ space_cache_v1_done().
+4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
+ `bool wait` to more accurately describe its new meaning.
+5. Change a few callers which had a separate call to
+ btrfs_wait_block_group_cache_done() to use wait = true instead.
+6. Make btrfs_wait_block_group_cache_done() static now that it's not
+ used outside of block-group.c anymore.
+
+Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
+CC: stable@vger.kernel.org # 5.12+
+Reviewed-by: Filipe Manana <fdmanana@suse.com>
+Signed-off-by: Omar Sandoval <osandov@fb.com>
+Signed-off-by: David Sterba <dsterba@suse.com>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ fs/btrfs/block-group.c | 47 +++++++++++++++--------------------------------
+ fs/btrfs/block-group.h | 4 +---
+ fs/btrfs/ctree.h | 1 -
+ fs/btrfs/extent-tree.c | 30 ++++++------------------------
+ 4 files changed, 22 insertions(+), 60 deletions(-)
+
+--- a/fs/btrfs/block-group.c
++++ b/fs/btrfs/block-group.c
+@@ -418,39 +418,26 @@ void btrfs_wait_block_group_cache_progre
+ btrfs_put_caching_control(caching_ctl);
+ }
+
+-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
++static int btrfs_caching_ctl_wait_done(struct btrfs_block_group *cache,
++ struct btrfs_caching_control *caching_ctl)
++{
++ wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
++ return cache->cached == BTRFS_CACHE_ERROR ? -EIO : 0;
++}
++
++static int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache)
+ {
+ struct btrfs_caching_control *caching_ctl;
+- int ret = 0;
++ int ret;
+
+ caching_ctl = btrfs_get_caching_control(cache);
+ if (!caching_ctl)
+ return (cache->cached == BTRFS_CACHE_ERROR) ? -EIO : 0;
+-
+- wait_event(caching_ctl->wait, btrfs_block_group_done(cache));
+- if (cache->cached == BTRFS_CACHE_ERROR)
+- ret = -EIO;
++ ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
+ btrfs_put_caching_control(caching_ctl);
+ return ret;
+ }
+
+-static bool space_cache_v1_done(struct btrfs_block_group *cache)
+-{
+- bool ret;
+-
+- spin_lock(&cache->lock);
+- ret = cache->cached != BTRFS_CACHE_FAST;
+- spin_unlock(&cache->lock);
+-
+- return ret;
+-}
+-
+-void btrfs_wait_space_cache_v1_finished(struct btrfs_block_group *cache,
+- struct btrfs_caching_control *caching_ctl)
+-{
+- wait_event(caching_ctl->wait, space_cache_v1_done(cache));
+-}
+-
+ #ifdef CONFIG_BTRFS_DEBUG
+ static void fragment_free_space(struct btrfs_block_group *block_group)
+ {
+@@ -727,9 +714,8 @@ done:
+ btrfs_put_block_group(block_group);
+ }
+
+-int btrfs_cache_block_group(struct btrfs_block_group *cache, int load_cache_only)
++int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait)
+ {
+- DEFINE_WAIT(wait);
+ struct btrfs_fs_info *fs_info = cache->fs_info;
+ struct btrfs_caching_control *caching_ctl = NULL;
+ int ret = 0;
+@@ -762,10 +748,7 @@ int btrfs_cache_block_group(struct btrfs
+ }
+ WARN_ON(cache->caching_ctl);
+ cache->caching_ctl = caching_ctl;
+- if (btrfs_test_opt(fs_info, SPACE_CACHE))
+- cache->cached = BTRFS_CACHE_FAST;
+- else
+- cache->cached = BTRFS_CACHE_STARTED;
++ cache->cached = BTRFS_CACHE_STARTED;
+ cache->has_caching_ctl = 1;
+ spin_unlock(&cache->lock);
+
+@@ -778,8 +761,8 @@ int btrfs_cache_block_group(struct btrfs
+
+ btrfs_queue_work(fs_info->caching_workers, &caching_ctl->work);
+ out:
+- if (load_cache_only && caching_ctl)
+- btrfs_wait_space_cache_v1_finished(cache, caching_ctl);
++ if (wait && caching_ctl)
++ ret = btrfs_caching_ctl_wait_done(cache, caching_ctl);
+ if (caching_ctl)
+ btrfs_put_caching_control(caching_ctl);
+
+@@ -3200,7 +3183,7 @@ int btrfs_update_block_group(struct btrf
+ * space back to the block group, otherwise we will leak space.
+ */
+ if (!alloc && !btrfs_block_group_done(cache))
+- btrfs_cache_block_group(cache, 1);
++ btrfs_cache_block_group(cache, true);
+
+ byte_in_group = bytenr - cache->start;
+ WARN_ON(byte_in_group > cache->length);
+--- a/fs/btrfs/block-group.h
++++ b/fs/btrfs/block-group.h
+@@ -251,9 +251,7 @@ void btrfs_dec_nocow_writers(struct btrf
+ void btrfs_wait_nocow_writers(struct btrfs_block_group *bg);
+ void btrfs_wait_block_group_cache_progress(struct btrfs_block_group *cache,
+ u64 num_bytes);
+-int btrfs_wait_block_group_cache_done(struct btrfs_block_group *cache);
+-int btrfs_cache_block_group(struct btrfs_block_group *cache,
+- int load_cache_only);
++int btrfs_cache_block_group(struct btrfs_block_group *cache, bool wait);
+ void btrfs_put_caching_control(struct btrfs_caching_control *ctl);
+ struct btrfs_caching_control *btrfs_get_caching_control(
+ struct btrfs_block_group *cache);
+--- a/fs/btrfs/ctree.h
++++ b/fs/btrfs/ctree.h
+@@ -454,7 +454,6 @@ struct btrfs_free_cluster {
+ enum btrfs_caching_type {
+ BTRFS_CACHE_NO,
+ BTRFS_CACHE_STARTED,
+- BTRFS_CACHE_FAST,
+ BTRFS_CACHE_FINISHED,
+ BTRFS_CACHE_ERROR,
+ };
+--- a/fs/btrfs/extent-tree.c
++++ b/fs/btrfs/extent-tree.c
+@@ -2572,17 +2572,10 @@ int btrfs_pin_extent_for_log_replay(stru
+ return -EINVAL;
+
+ /*
+- * pull in the free space cache (if any) so that our pin
+- * removes the free space from the cache. We have load_only set
+- * to one because the slow code to read in the free extents does check
+- * the pinned extents.
++ * Fully cache the free space first so that our pin removes the free space
++ * from the cache.
+ */
+- btrfs_cache_block_group(cache, 1);
+- /*
+- * Make sure we wait until the cache is completely built in case it is
+- * missing or is invalid and therefore needs to be rebuilt.
+- */
+- ret = btrfs_wait_block_group_cache_done(cache);
++ ret = btrfs_cache_block_group(cache, true);
+ if (ret)
+ goto out;
+
+@@ -2605,12 +2598,7 @@ static int __exclude_logged_extent(struc
+ if (!block_group)
+ return -EINVAL;
+
+- btrfs_cache_block_group(block_group, 1);
+- /*
+- * Make sure we wait until the cache is completely built in case it is
+- * missing or is invalid and therefore needs to be rebuilt.
+- */
+- ret = btrfs_wait_block_group_cache_done(block_group);
++ ret = btrfs_cache_block_group(block_group, true);
+ if (ret)
+ goto out;
+
+@@ -4324,7 +4312,7 @@ have_block_group:
+ ffe_ctl.cached = btrfs_block_group_done(block_group);
+ if (unlikely(!ffe_ctl.cached)) {
+ ffe_ctl.have_caching_bg = true;
+- ret = btrfs_cache_block_group(block_group, 0);
++ ret = btrfs_cache_block_group(block_group, false);
+
+ /*
+ * If we get ENOMEM here or something else we want to
+@@ -6082,13 +6070,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *
+
+ if (end - start >= range->minlen) {
+ if (!btrfs_block_group_done(cache)) {
+- ret = btrfs_cache_block_group(cache, 0);
+- if (ret) {
+- bg_failed++;
+- bg_ret = ret;
+- continue;
+- }
+- ret = btrfs_wait_block_group_cache_done(cache);
++ ret = btrfs_cache_block_group(cache, true);
+ if (ret) {
+ bg_failed++;
+ bg_ret = ret;
--- /dev/null
+From 9c80e79906b4ca440d09e7f116609262bb747909 Mon Sep 17 00:00:00 2001
+From: Kuniyuki Iwashima <kuniyu@amazon.com>
+Date: Fri, 12 Aug 2022 19:05:09 -0700
+Subject: kprobes: don't call disarm_kprobe() for disabled kprobes
+
+From: Kuniyuki Iwashima <kuniyu@amazon.com>
+
+commit 9c80e79906b4ca440d09e7f116609262bb747909 upstream.
+
+The assumption in __disable_kprobe() is wrong, and it could try to disarm
+an already disarmed kprobe and fire the WARN_ONCE() below. [0] We can
+easily reproduce this issue.
+
+1. Write 0 to /sys/kernel/debug/kprobes/enabled.
+
+ # echo 0 > /sys/kernel/debug/kprobes/enabled
+
+2. Run execsnoop. At this time, one kprobe is disabled.
+
+ # /usr/share/bcc/tools/execsnoop &
+ [1] 2460
+ PCOMM PID PPID RET ARGS
+
+ # cat /sys/kernel/debug/kprobes/list
+ ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE]
+ ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE]
+
+3. Write 1 to /sys/kernel/debug/kprobes/enabled, which changes
+ kprobes_all_disarmed to false but does not arm the disabled kprobe.
+
+ # echo 1 > /sys/kernel/debug/kprobes/enabled
+
+ # cat /sys/kernel/debug/kprobes/list
+ ffffffff91345650 r __x64_sys_execve+0x0 [FTRACE]
+ ffffffff91345650 k __x64_sys_execve+0x0 [DISABLED][FTRACE]
+
+4. Kill execsnoop, when __disable_kprobe() calls disarm_kprobe() for the
+ disabled kprobe and hits the WARN_ONCE() in __disarm_kprobe_ftrace().
+
+ # fg
+ /usr/share/bcc/tools/execsnoop
+ ^C
+
+Actually, WARN_ONCE() is fired twice, and __unregister_kprobe_top() misses
+some cleanups and leaves the aggregated kprobe in the hash table. Then,
+__unregister_trace_kprobe() initialises tk->rp.kp.list and creates an
+infinite loop like this.
+
+ aggregated kprobe.list -> kprobe.list -.
+ ^ |
+ '.__.'
+
+In this situation, these commands fall into the infinite loop and result
+in RCU stall or soft lockup.
+
+ cat /sys/kernel/debug/kprobes/list : show_kprobe_addr() enters into the
+ infinite loop with RCU.
+
+ /usr/share/bcc/tools/execsnoop : warn_kprobe_rereg() holds kprobe_mutex,
+ and __get_valid_kprobe() is stuck in
+ the loop.
+
+To avoid the issue, make sure we don't call disarm_kprobe() for disabled
+kprobes.
+
+[0]
+Failed to disarm kprobe-ftrace at __x64_sys_execve+0x0/0x40 (error -2)
+WARNING: CPU: 6 PID: 2460 at kernel/kprobes.c:1130 __disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
+Modules linked in: ena
+CPU: 6 PID: 2460 Comm: execsnoop Not tainted 5.19.0+ #28
+Hardware name: Amazon EC2 c5.2xlarge/, BIOS 1.0 10/16/2017
+RIP: 0010:__disarm_kprobe_ftrace.isra.19 (kernel/kprobes.c:1129)
+Code: 24 8b 02 eb c1 80 3d c4 83 f2 01 00 75 d4 48 8b 75 00 89 c2 48 c7 c7 90 fa 0f 92 89 04 24 c6 05 ab 83 01 e8 e4 94 f0 ff <0f> 0b 8b 04 24 eb b1 89 c6 48 c7 c7 60 fa 0f 92 89 04 24 e8 cc 94
+RSP: 0018:ffff9e6ec154bd98 EFLAGS: 00010282
+RAX: 0000000000000000 RBX: ffffffff930f7b00 RCX: 0000000000000001
+RDX: 0000000080000001 RSI: ffffffff921461c5 RDI: 00000000ffffffff
+RBP: ffff89c504286da8 R08: 0000000000000000 R09: c0000000fffeffff
+R10: 0000000000000000 R11: ffff9e6ec154bc28 R12: ffff89c502394e40
+R13: ffff89c502394c00 R14: ffff9e6ec154bc00 R15: 0000000000000000
+FS: 00007fe800398740(0000) GS:ffff89c812d80000(0000) knlGS:0000000000000000
+CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
+CR2: 000000c00057f010 CR3: 0000000103b54006 CR4: 00000000007706e0
+DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
+DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
+PKRU: 55555554
+Call Trace:
+<TASK>
+ __disable_kprobe (kernel/kprobes.c:1716)
+ disable_kprobe (kernel/kprobes.c:2392)
+ __disable_trace_kprobe (kernel/trace/trace_kprobe.c:340)
+ disable_trace_kprobe (kernel/trace/trace_kprobe.c:429)
+ perf_trace_event_unreg.isra.2 (./include/linux/tracepoint.h:93 kernel/trace/trace_event_perf.c:168)
+ perf_kprobe_destroy (kernel/trace/trace_event_perf.c:295)
+ _free_event (kernel/events/core.c:4971)
+ perf_event_release_kernel (kernel/events/core.c:5176)
+ perf_release (kernel/events/core.c:5186)
+ __fput (fs/file_table.c:321)
+ task_work_run (./include/linux/sched.h:2056 (discriminator 1) kernel/task_work.c:179 (discriminator 1))
+ exit_to_user_mode_prepare (./include/linux/resume_user_mode.h:49 kernel/entry/common.c:169 kernel/entry/common.c:201)
+ syscall_exit_to_user_mode (./arch/x86/include/asm/jump_label.h:55 ./arch/x86/include/asm/nospec-branch.h:384 ./arch/x86/include/asm/entry-common.h:94 kernel/entry/common.c:133 kernel/entry/common.c:296)
+ do_syscall_64 (arch/x86/entry/common.c:87)
+ entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:120)
+RIP: 0033:0x7fe7ff210654
+Code: 15 79 89 20 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb be 0f 1f 00 8b 05 9a cd 20 00 48 63 ff 85 c0 75 11 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 3a f3 c3 48 83 ec 18 48 89 7c 24 08 e8 34 fc
+RSP: 002b:00007ffdbd1d3538 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
+RAX: 0000000000000000 RBX: 0000000000000008 RCX: 00007fe7ff210654
+RDX: 0000000000000000 RSI: 0000000000002401 RDI: 0000000000000008
+RBP: 0000000000000000 R08: 94ae31d6fda838a4 R09: 00007fe8001c9d30
+R10: 00007ffdbd1d34b0 R11: 0000000000000246 R12: 00007ffdbd1d3600
+R13: 0000000000000000 R14: fffffffffffffffc R15: 00007ffdbd1d3560
+</TASK>
+
+Link: https://lkml.kernel.org/r/20220813020509.90805-1-kuniyu@amazon.com
+Fixes: 69d54b916d83 ("kprobes: makes kprobes/enabled works correctly for optimized kprobes.")
+Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
+Reported-by: Ayushman Dutta <ayudutta@amazon.com>
+Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
+Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
+Cc: "David S. Miller" <davem@davemloft.net>
+Cc: Masami Hiramatsu <mhiramat@kernel.org>
+Cc: Wang Nan <wangnan0@huawei.com>
+Cc: Kuniyuki Iwashima <kuniyu@amazon.com>
+Cc: Kuniyuki Iwashima <kuni1840@gmail.com>
+Cc: Ayushman Dutta <ayudutta@amazon.com>
+Cc: <stable@vger.kernel.org>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
+Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
+---
+ kernel/kprobes.c | 9 +++++----
+ 1 file changed, 5 insertions(+), 4 deletions(-)
+
+--- a/kernel/kprobes.c
++++ b/kernel/kprobes.c
+@@ -1705,11 +1705,12 @@ static struct kprobe *__disable_kprobe(s
+ /* Try to disarm and disable this/parent probe */
+ if (p == orig_p || aggr_kprobe_disabled(orig_p)) {
+ /*
+- * If kprobes_all_disarmed is set, orig_p
+- * should have already been disarmed, so
+- * skip unneed disarming process.
++ * Don't be lazy here. Even if 'kprobes_all_disarmed'
++ * is false, 'orig_p' might not have been armed yet.
++ * Note arm_all_kprobes() __tries__ to arm all kprobes
++ * on the best effort basis.
+ */
+- if (!kprobes_all_disarmed) {
++ if (!kprobes_all_disarmed && !kprobe_disabled(orig_p)) {
+ ret = disarm_kprobe(orig_p, true);
+ if (ret) {
+ p->flags &= ~KPROBE_FLAG_DISABLED;