From: Greg Kroah-Hartman
Date: Mon, 7 Mar 2022 07:39:00 +0000 (+0100)
Subject: 5.15-stable patches
X-Git-Tag: v4.9.305~22
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=2773007de653e0cf683cdd067e90006939f5706f;p=thirdparty%2Fkernel%2Fstable-queue.git

5.15-stable patches

added patches:
	btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch
	btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch
	btrfs-do-not-warn_on-if-we-have-pageerror-set.patch
	btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch
	btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch
	btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch
	tracing-fix-return-value-of-__setup-handlers.patch
	tracing-histogram-fix-sorting-on-old-cpu-value.patch
---

diff --git a/queue-5.15/btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch b/queue-5.15/btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch
new file mode 100644
index 00000000000..a2ea312d46e
--- /dev/null
+++ b/queue-5.15/btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch
@@ -0,0 +1,72 @@
+From 4751dc99627e4d1465c5bfa8cb7ab31ed418eff5 Mon Sep 17 00:00:00 2001
+From: Filipe Manana
+Date: Mon, 28 Feb 2022 16:29:28 +0000
+Subject: btrfs: add missing run of delayed items after unlink during log replay
+
+From: Filipe Manana
+
+commit 4751dc99627e4d1465c5bfa8cb7ab31ed418eff5 upstream.
+
+During log replay, whenever we need to check if a name (dentry) exists in
+a directory we do searches on the subvolume tree for inode references or
+directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys
+as well, before kernel 5.17). However, when we unlink a name during log
+replay, through btrfs_unlink_inode(), we may not delete inode references
+and dir index keys from a subvolume tree and instead just add the
+deletions to the delayed inode's delayed items, which will only be run
+when we commit the transaction used for log replay. This means that after
+an unlink operation during log replay, if we attempt to search for the
+same name, we will not see that the name was already deleted, since the
+deletion is recorded only on the delayed items.
+
+We run delayed items after every unlink operation during log replay,
+except at unlink_old_inode_refs() and at add_inode_ref(). This was an
+oversight, as delayed items should be run after every unlink, for the
+reasons stated above.
+
+So fix those two cases.
+
+Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation")
+Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination")
+CC: stable@vger.kernel.org # 4.19+
+Signed-off-by: Filipe Manana
+Signed-off-by: David Sterba
+Signed-off-by: Greg Kroah-Hartman
+---
+ fs/btrfs/tree-log.c | 18 ++++++++++++++++++
+ 1 file changed, 18 insertions(+)
+
+--- a/fs/btrfs/tree-log.c
++++ b/fs/btrfs/tree-log.c
+@@ -1329,6 +1329,15 @@ again:
+ 				inode, name, namelen);
+ 		kfree(name);
+ 		iput(dir);
++		/*
++		 * Whenever we need to check if a name exists or not, we
++		 * check the subvolume tree. So after an unlink we must
++		 * run delayed items, so that future checks for a name
++		 * during log replay see that the name does not exists
++		 * anymore.
++ */ ++ if (!ret) ++ ret = btrfs_run_delayed_items(trans); + if (ret) + goto out; + goto again; +@@ -1580,6 +1589,15 @@ static noinline int add_inode_ref(struct + */ + if (!ret && inode->i_nlink == 0) + inc_nlink(inode); ++ /* ++ * Whenever we need to check if a name exists or ++ * not, we check the subvolume tree. So after an ++ * unlink we must run delayed items, so that future ++ * checks for a name during log replay see that the ++ * name does not exists anymore. ++ */ ++ if (!ret) ++ ret = btrfs_run_delayed_items(trans); + } + if (ret < 0) + goto out; diff --git a/queue-5.15/btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch b/queue-5.15/btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch new file mode 100644 index 00000000000..e8e76e52a18 --- /dev/null +++ b/queue-5.15/btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch @@ -0,0 +1,279 @@ +From b4be6aefa73c9a6899ef3ba9c5faaa8a66e333ef Mon Sep 17 00:00:00 2001 +From: Josef Bacik +Date: Fri, 18 Feb 2022 14:56:10 -0500 +Subject: btrfs: do not start relocation until in progress drops are done + +From: Josef Bacik + +commit b4be6aefa73c9a6899ef3ba9c5faaa8a66e333ef upstream. + +We hit a bug with a recovering relocation on mount for one of our file +systems in production. I reproduced this locally by injecting errors +into snapshot delete with balance running at the same time. This +presented as an error while looking up an extent item + + WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 + CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 + RIP: 0010:lookup_inline_extent_backref+0x647/0x680 + RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 + RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 + RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 + RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 + R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 + R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 + FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 + Call Trace: + + insert_inline_extent_backref+0x46/0xd0 + __btrfs_inc_extent_ref.isra.0+0x5f/0x200 + ? btrfs_merge_delayed_refs+0x164/0x190 + __btrfs_run_delayed_refs+0x561/0xfa0 + ? btrfs_search_slot+0x7b4/0xb30 + ? btrfs_update_root+0x1a9/0x2c0 + btrfs_run_delayed_refs+0x73/0x1f0 + ? btrfs_update_root+0x1a9/0x2c0 + btrfs_commit_transaction+0x50/0xa50 + ? btrfs_update_reloc_root+0x122/0x220 + prepare_to_merge+0x29f/0x320 + relocate_block_group+0x2b8/0x550 + btrfs_relocate_block_group+0x1a6/0x350 + btrfs_relocate_chunk+0x27/0xe0 + btrfs_balance+0x777/0xe60 + balance_kthread+0x35/0x50 + ? btrfs_balance+0xe60/0xe60 + kthread+0x16b/0x190 + ? set_kthread_struct+0x40/0x40 + ret_from_fork+0x22/0x30 + + +Normally snapshot deletion and relocation are excluded from running at +the same time by the fs_info->cleaner_mutex. However if we had a +pending balance waiting to get the ->cleaner_mutex, and a snapshot +deletion was running, and then the box crashed, we would come up in a +state where we have a half deleted snapshot. 
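+
+Roughly, the sequence that leaves a half-deleted snapshot behind is (an
+illustrative timeline based on the description above, not taken from the
+original report):
+
+  1) The cleaner, holding the cleaner_mutex, starts dropping a snapshot
+     and makes partial progress (a drop_progress key is recorded in the
+     root item).
+
+  2) A pending balance sits waiting on the cleaner_mutex.
+
+  3) The box crashes before the drop completes.
+
+  4) After the next mount, the drop is only resumed on a later run of
+     the cleaner, so relocation can start first.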
+ +Again, in the normal case the snapshot deletion needs to complete before +relocation can start, but in this case relocation could very well start +before the snapshot deletion completes, as we simply add the root to the +dead roots list and wait for the next time the cleaner runs to clean up +the snapshot. + +Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that +had a pending drop_progress key. If they do then we know we were in the +middle of the drop operation and set a flag on the fs_info. Then +balance can wait until this flag is cleared to start up again. + +If there are DEAD_ROOT's that don't have a drop_progress set then we're +safe to start balance right away as we'll be properly protected by the +cleaner_mutex. + +CC: stable@vger.kernel.org # 5.10+ +Reviewed-by: Filipe Manana +Signed-off-by: Josef Bacik +Reviewed-by: David Sterba +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/ctree.h | 10 ++++++++++ + fs/btrfs/disk-io.c | 10 ++++++++++ + fs/btrfs/extent-tree.c | 10 ++++++++++ + fs/btrfs/relocation.c | 13 +++++++++++++ + fs/btrfs/root-tree.c | 15 +++++++++++++++ + fs/btrfs/transaction.c | 33 ++++++++++++++++++++++++++++++++- + fs/btrfs/transaction.h | 1 + + 7 files changed, 91 insertions(+), 1 deletion(-) + +--- a/fs/btrfs/ctree.h ++++ b/fs/btrfs/ctree.h +@@ -593,6 +593,9 @@ enum { + /* Indicate whether there are any tree modification log users */ + BTRFS_FS_TREE_MOD_LOG_USERS, + ++ /* Indicate we have half completed snapshot deletions pending. */ ++ BTRFS_FS_UNFINISHED_DROPS, ++ + #if BITS_PER_LONG == 32 + /* Indicate if we have error/warn message printed on 32bit systems */ + BTRFS_FS_32BIT_ERROR, +@@ -1098,8 +1101,15 @@ enum { + BTRFS_ROOT_HAS_LOG_TREE, + /* Qgroup flushing is in progress */ + BTRFS_ROOT_QGROUP_FLUSHING, ++ /* This root has a drop operation that was started previously. */ ++ BTRFS_ROOT_UNFINISHED_DROP, + }; + ++static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info) ++{ ++ clear_and_wake_up_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags); ++} ++ + /* + * Record swapped tree blocks of a subvolume tree for delayed subtree trace + * code. For detail check comment in fs/btrfs/qgroup.c. +--- a/fs/btrfs/disk-io.c ++++ b/fs/btrfs/disk-io.c +@@ -3659,6 +3659,10 @@ int __cold open_ctree(struct super_block + + set_bit(BTRFS_FS_OPEN, &fs_info->flags); + ++ /* Kick the cleaner thread so it'll start deleting snapshots. */ ++ if (test_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags)) ++ wake_up_process(fs_info->cleaner_kthread); ++ + clear_oneshot: + btrfs_clear_oneshot_options(fs_info); + return 0; +@@ -4340,6 +4344,12 @@ void __cold close_ctree(struct btrfs_fs_ + */ + kthread_park(fs_info->cleaner_kthread); + ++ /* ++ * If we had UNFINISHED_DROPS we could still be processing them, so ++ * clear that bit and wake up relocation so it can stop. ++ */ ++ btrfs_wake_unfinished_drop(fs_info); ++ + /* wait for the qgroup rescan worker to stop */ + btrfs_qgroup_wait_for_completion(fs_info, false); + +--- a/fs/btrfs/extent-tree.c ++++ b/fs/btrfs/extent-tree.c +@@ -5541,6 +5541,7 @@ int btrfs_drop_snapshot(struct btrfs_roo + int ret; + int level; + bool root_dropped = false; ++ bool unfinished_drop = false; + + btrfs_debug(fs_info, "Drop subvolume %llu", root->root_key.objectid); + +@@ -5583,6 +5584,8 @@ int btrfs_drop_snapshot(struct btrfs_roo + * already dropped. 
+ */ + set_bit(BTRFS_ROOT_DELETING, &root->state); ++ unfinished_drop = test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state); ++ + if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { + level = btrfs_header_level(root->node); + path->nodes[level] = btrfs_lock_root_node(root); +@@ -5758,6 +5761,13 @@ out_free: + btrfs_free_path(path); + out: + /* ++ * We were an unfinished drop root, check to see if there are any ++ * pending, and if not clear and wake up any waiters. ++ */ ++ if (!err && unfinished_drop) ++ btrfs_maybe_wake_unfinished_drop(fs_info); ++ ++ /* + * So if we need to stop dropping the snapshot for whatever reason we + * need to make sure to add it back to the dead root list so that we + * keep trying to do the work later. This also cleans up roots if we +--- a/fs/btrfs/relocation.c ++++ b/fs/btrfs/relocation.c +@@ -3967,6 +3967,19 @@ int btrfs_relocate_block_group(struct bt + int rw = 0; + int err = 0; + ++ /* ++ * This only gets set if we had a half-deleted snapshot on mount. We ++ * cannot allow relocation to start while we're still trying to clean up ++ * these pending deletions. ++ */ ++ ret = wait_on_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS, TASK_INTERRUPTIBLE); ++ if (ret) ++ return ret; ++ ++ /* We may have been woken up by close_ctree, so bail if we're closing. */ ++ if (btrfs_fs_closing(fs_info)) ++ return -EINTR; ++ + bg = btrfs_lookup_block_group(fs_info, group_start); + if (!bg) + return -ENOENT; +--- a/fs/btrfs/root-tree.c ++++ b/fs/btrfs/root-tree.c +@@ -280,6 +280,21 @@ int btrfs_find_orphan_roots(struct btrfs + + WARN_ON(!test_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state)); + if (btrfs_root_refs(&root->root_item) == 0) { ++ struct btrfs_key drop_key; ++ ++ btrfs_disk_key_to_cpu(&drop_key, &root->root_item.drop_progress); ++ /* ++ * If we have a non-zero drop_progress then we know we ++ * made it partly through deleting this snapshot, and ++ * thus we need to make sure we block any balance from ++ * happening until this snapshot is completely dropped. ++ */ ++ if (drop_key.objectid != 0 || drop_key.type != 0 || ++ drop_key.offset != 0) { ++ set_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags); ++ set_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state); ++ } ++ + set_bit(BTRFS_ROOT_DEAD_TREE, &root->state); + btrfs_add_dead_root(root); + } +--- a/fs/btrfs/transaction.c ++++ b/fs/btrfs/transaction.c +@@ -1341,6 +1341,32 @@ again: + } + + /* ++ * If we had a pending drop we need to see if there are any others left in our ++ * dead roots list, and if not clear our bit and wake any waiters. ++ */ ++void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info) ++{ ++ /* ++ * We put the drop in progress roots at the front of the list, so if the ++ * first entry doesn't have UNFINISHED_DROP set we can wake everybody ++ * up. ++ */ ++ spin_lock(&fs_info->trans_lock); ++ if (!list_empty(&fs_info->dead_roots)) { ++ struct btrfs_root *root = list_first_entry(&fs_info->dead_roots, ++ struct btrfs_root, ++ root_list); ++ if (test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state)) { ++ spin_unlock(&fs_info->trans_lock); ++ return; ++ } ++ } ++ spin_unlock(&fs_info->trans_lock); ++ ++ btrfs_wake_unfinished_drop(fs_info); ++} ++ ++/* + * dead roots are old snapshots that need to be deleted. 
This allocates + * a dirty root struct and adds it into the list of dead roots that need to + * be deleted +@@ -1352,7 +1378,12 @@ void btrfs_add_dead_root(struct btrfs_ro + spin_lock(&fs_info->trans_lock); + if (list_empty(&root->root_list)) { + btrfs_grab_root(root); +- list_add_tail(&root->root_list, &fs_info->dead_roots); ++ ++ /* We want to process the partially complete drops first. */ ++ if (test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state)) ++ list_add(&root->root_list, &fs_info->dead_roots); ++ else ++ list_add_tail(&root->root_list, &fs_info->dead_roots); + } + spin_unlock(&fs_info->trans_lock); + } +--- a/fs/btrfs/transaction.h ++++ b/fs/btrfs/transaction.h +@@ -217,6 +217,7 @@ int btrfs_wait_for_commit(struct btrfs_f + + void btrfs_add_dead_root(struct btrfs_root *root); + int btrfs_defrag_root(struct btrfs_root *root); ++void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info); + int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root); + int btrfs_commit_transaction(struct btrfs_trans_handle *trans); + int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans); diff --git a/queue-5.15/btrfs-do-not-warn_on-if-we-have-pageerror-set.patch b/queue-5.15/btrfs-do-not-warn_on-if-we-have-pageerror-set.patch new file mode 100644 index 00000000000..65709a8091b --- /dev/null +++ b/queue-5.15/btrfs-do-not-warn_on-if-we-have-pageerror-set.patch @@ -0,0 +1,119 @@ +From a50e1fcbc9b85fd4e95b89a75c0884cb032a3e06 Mon Sep 17 00:00:00 2001 +From: Josef Bacik +Date: Fri, 18 Feb 2022 10:17:39 -0500 +Subject: btrfs: do not WARN_ON() if we have PageError set + +From: Josef Bacik + +commit a50e1fcbc9b85fd4e95b89a75c0884cb032a3e06 upstream. + +Whenever we do any extent buffer operations we call +assert_eb_page_uptodate() to complain loudly if we're operating on an +non-uptodate page. Our overnight tests caught this warning earlier this +week + + WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 + CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 + Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 + Workqueue: btrfs-cache btrfs_work_helper + RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 + RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 + RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 + RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 + RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 + R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 + R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 + FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 + Call Trace: + + extent_buffer_test_bit+0x3f/0x70 + free_space_test_bit+0xa6/0xc0 + load_free_space_tree+0x1f6/0x470 + caching_thread+0x454/0x630 + ? rcu_read_lock_sched_held+0x12/0x60 + ? rcu_read_lock_sched_held+0x12/0x60 + ? rcu_read_lock_sched_held+0x12/0x60 + ? lock_release+0x1f0/0x2d0 + btrfs_work_helper+0xf2/0x3e0 + ? lock_release+0x1f0/0x2d0 + ? finish_task_switch.isra.0+0xf9/0x3a0 + process_one_work+0x26d/0x580 + ? process_one_work+0x580/0x580 + worker_thread+0x55/0x3b0 + ? process_one_work+0x580/0x580 + kthread+0xf0/0x120 + ? 
kthread_complete_and_exit+0x20/0x20 + ret_from_fork+0x1f/0x30 + +This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer +uptodate when we fail to write it"), however all that fix did was keep +us from finding extent buffers after a failed writeout. It didn't keep +us from continuing to use a buffer that we already had found. + +In this case we're searching the commit root to cache the block group, +so we can start committing the transaction and switch the commit root +and then start writing. After the switch we can look up an extent +buffer that hasn't been written yet and start processing that block +group. Then we fail to write that block out and clear Uptodate on the +page, and then we start spewing these errors. + +Normally we're protected by the tree lock to a certain degree here. If +we read a block we have that block read locked, and we block the writer +from locking the block before we submit it for the write. However this +isn't necessarily fool proof because the read could happen before we do +the submit_bio and after we locked and unlocked the extent buffer. + +Also in this particular case we have path->skip_locking set, so that +won't save us here. We'll simply get a block that was valid when we +read it, but became invalid while we were using it. + +What we really want is to catch the case where we've "read" a block but +it's not marked Uptodate. On read we ClearPageError(), so if we're +!Uptodate and !Error we know we didn't do the right thing for reading +the page. + +Fix this by checking !Uptodate && !Error, this way we will not complain +if our buffer gets invalidated while we're using it, and we'll maintain +the spirit of the check which is to make sure we have a fully in-cache +block while we're messing with it. + +CC: stable@vger.kernel.org # 5.4+ +Signed-off-by: Josef Bacik +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/extent_io.c | 16 +++++++++++++--- + 1 file changed, 13 insertions(+), 3 deletions(-) + +--- a/fs/btrfs/extent_io.c ++++ b/fs/btrfs/extent_io.c +@@ -6801,14 +6801,24 @@ static void assert_eb_page_uptodate(cons + { + struct btrfs_fs_info *fs_info = eb->fs_info; + ++ /* ++ * If we are using the commit root we could potentially clear a page ++ * Uptodate while we're using the extent buffer that we've previously ++ * looked up. We don't want to complain in this case, as the page was ++ * valid before, we just didn't write it out. Instead we want to catch ++ * the case where we didn't actually read the block properly, which ++ * would have !PageUptodate && !PageError, as we clear PageError before ++ * reading. 
++ */ + if (fs_info->sectorsize < PAGE_SIZE) { +- bool uptodate; ++ bool uptodate, error; + + uptodate = btrfs_subpage_test_uptodate(fs_info, page, + eb->start, eb->len); +- WARN_ON(!uptodate); ++ error = btrfs_subpage_test_error(fs_info, page, eb->start, eb->len); ++ WARN_ON(!uptodate && !error); + } else { +- WARN_ON(!PageUptodate(page)); ++ WARN_ON(!PageUptodate(page) && !PageError(page)); + } + } + diff --git a/queue-5.15/btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch b/queue-5.15/btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch new file mode 100644 index 00000000000..e10ccf5b919 --- /dev/null +++ b/queue-5.15/btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch @@ -0,0 +1,175 @@ +From d99478874355d3a7b9d86dfb5d7590d5b1754b1f Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Thu, 17 Feb 2022 12:12:02 +0000 +Subject: btrfs: fix lost prealloc extents beyond eof after full fsync + +From: Filipe Manana + +commit d99478874355d3a7b9d86dfb5d7590d5b1754b1f upstream. + +When doing a full fsync, if we have prealloc extents beyond (or at) eof, +and the leaves that contain them were not modified in the current +transaction, we end up not logging them. This results in losing those +extents when we replay the log after a power failure, since the inode is +truncated to the current value of the logged i_size. + +Just like for the fast fsync path, we need to always log all prealloc +extents starting at or beyond i_size. The fast fsync case was fixed in +commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size +after fsync log replay") but it missed the full fsync path. The problem +exists since the very early days, when the log tree was added by +commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize +synchronous operations"). + +Example reproducer: + + $ mkfs.btrfs -f /dev/sdc + $ mount /dev/sdc /mnt + + # Create our test file with many file extent items, so that they span + # several leaves of metadata, even if the node/page size is 64K. Use + # direct IO and not fsync/O_SYNC because it's both faster and it avoids + # clearing the full sync flag from the inode - we want the fsync below + # to trigger the slow full sync code path. + $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo + + # Now add two preallocated extents to our file without extending the + # file's size. One right at i_size, and another further beyond, leaving + # a gap between the two prealloc extents. + $ xfs_io -c "falloc -k 16M 1M" /mnt/foo + $ xfs_io -c "falloc -k 20M 1M" /mnt/foo + + # Make sure everything is durably persisted and the transaction is + # committed. This makes all created extents to have a generation lower + # than the generation of the transaction used by the next write and + # fsync. + sync + + # Now overwrite only the first extent, which will result in modifying + # only the first leaf of metadata for our inode. Then fsync it. This + # fsync will use the slow code path (inode full sync bit is set) because + # it's the first fsync since the inode was created/loaded. + $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo + + # Extent list before power failure. + $ xfs_io -c "fiemap -v" /mnt/foo + /mnt/foo: + EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS + 0: [0..7]: 2178048..2178055 8 0x0 + 1: [8..16383]: 26632..43007 16376 0x0 + 2: [16384..32767]: 2156544..2172927 16384 0x0 + 3: [32768..34815]: 2172928..2174975 2048 0x800 + 4: [34816..40959]: hole 6144 + 5: [40960..43007]: 2174976..2177023 2048 0x801 + + + + # Mount fs again, trigger log replay. 
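+  # (A power failure is assumed to happen at this point; the mount
+  # below then triggers log replay.)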
+ $ mount /dev/sdc /mnt + + # Extent list after power failure and log replay. + $ xfs_io -c "fiemap -v" /mnt/foo + /mnt/foo: + EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS + 0: [0..7]: 2178048..2178055 8 0x0 + 1: [8..16383]: 26632..43007 16376 0x0 + 2: [16384..32767]: 2156544..2172927 16384 0x1 + + # The prealloc extents at file offsets 16M and 20M are missing. + +So fix this by calling btrfs_log_prealloc_extents() when we are doing a +full fsync, so that we always log all prealloc extents beyond eof. + +A test case for fstests will follow soon. + +CC: stable@vger.kernel.org # 4.19+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------ + 1 file changed, 31 insertions(+), 12 deletions(-) + +--- a/fs/btrfs/tree-log.c ++++ b/fs/btrfs/tree-log.c +@@ -4423,7 +4423,7 @@ static int log_one_extent(struct btrfs_t + + /* + * Log all prealloc extents beyond the inode's i_size to make sure we do not +- * lose them after doing a fast fsync and replaying the log. We scan the ++ * lose them after doing a full/fast fsync and replaying the log. We scan the + * subvolume's root instead of iterating the inode's extent map tree because + * otherwise we can log incorrect extent items based on extent map conversion. + * That can happen due to the fact that extent maps are merged when they +@@ -5208,6 +5208,7 @@ static int copy_inode_items_to_log(struc + struct btrfs_log_ctx *ctx, + bool *need_log_inode_item) + { ++ const u64 i_size = i_size_read(&inode->vfs_inode); + struct btrfs_root *root = inode->root; + int ins_start_slot = 0; + int ins_nr = 0; +@@ -5228,13 +5229,21 @@ again: + if (min_key->type > max_key->type) + break; + +- if (min_key->type == BTRFS_INODE_ITEM_KEY) ++ if (min_key->type == BTRFS_INODE_ITEM_KEY) { + *need_log_inode_item = false; +- +- if ((min_key->type == BTRFS_INODE_REF_KEY || +- min_key->type == BTRFS_INODE_EXTREF_KEY) && +- inode->generation == trans->transid && +- !recursive_logging) { ++ } else if (min_key->type == BTRFS_EXTENT_DATA_KEY && ++ min_key->offset >= i_size) { ++ /* ++ * Extents at and beyond eof are logged with ++ * btrfs_log_prealloc_extents(). ++ * Only regular files have BTRFS_EXTENT_DATA_KEY keys, ++ * and no keys greater than that, so bail out. ++ */ ++ break; ++ } else if ((min_key->type == BTRFS_INODE_REF_KEY || ++ min_key->type == BTRFS_INODE_EXTREF_KEY) && ++ inode->generation == trans->transid && ++ !recursive_logging) { + u64 other_ino = 0; + u64 other_parent = 0; + +@@ -5265,10 +5274,8 @@ again: + btrfs_release_path(path); + goto next_key; + } +- } +- +- /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */ +- if (min_key->type == BTRFS_XATTR_ITEM_KEY) { ++ } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) { ++ /* Skip xattrs, logged later with btrfs_log_all_xattrs() */ + if (ins_nr == 0) + goto next_slot; + ret = copy_items(trans, inode, dst_path, path, +@@ -5321,9 +5328,21 @@ next_key: + break; + } + } +- if (ins_nr) ++ if (ins_nr) { + ret = copy_items(trans, inode, dst_path, path, ins_start_slot, + ins_nr, inode_only, logged_isize); ++ if (ret) ++ return ret; ++ } ++ ++ if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) { ++ /* ++ * Release the path because otherwise we might attempt to double ++ * lock the same leaf with btrfs_log_prealloc_extents() below. 
++ */ ++ btrfs_release_path(path); ++ ret = btrfs_log_prealloc_extents(trans, inode, dst_path); ++ } + + return ret; + } diff --git a/queue-5.15/btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch b/queue-5.15/btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch new file mode 100644 index 00000000000..b80e6bbc1ca --- /dev/null +++ b/queue-5.15/btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch @@ -0,0 +1,210 @@ +From 5fd76bf31ccfecc06e2e6b29f8c809e934085b99 Mon Sep 17 00:00:00 2001 +From: Omar Sandoval +Date: Thu, 17 Feb 2022 15:14:43 -0800 +Subject: btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() + +From: Omar Sandoval + +commit 5fd76bf31ccfecc06e2e6b29f8c809e934085b99 upstream. + +We are seeing crashes similar to the following trace: + +[38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs] +[38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54 +[38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 +[38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs] +[38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206 +[38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14 +[38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360 +[38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000 +[38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800 +[38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360 +[38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000 +[38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0 +[38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 +[38.992528] Call Trace: +[38.992854] +[38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs] +[38.993941] btrfs_balance+0x78e/0xea0 [btrfs] +[38.994801] ? vsnprintf+0x33c/0x520 +[38.995368] ? __kmalloc_track_caller+0x351/0x440 +[38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs] +[38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs] +[38.997867] ? mod_objcg_state+0xee/0x340 +[38.998552] ? seq_release+0x24/0x30 +[38.999184] ? proc_nr_files+0x30/0x30 +[38.999654] ? call_rcu+0xc8/0x2f0 +[39.000228] ? __x64_sys_ioctl+0x84/0xc0 +[39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] +[39.001973] __x64_sys_ioctl+0x84/0xc0 +[39.002566] do_syscall_64+0x3a/0x80 +[39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae +[39.003735] RIP: 0033:0x7f11c166959b +[39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 +[39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b +[39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003 +[39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0 +[39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3 +[39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001 +[39.015040] +[39.015418] ---[ end trace 0000000000000000 ]--- +[43.131559] ------------[ cut here ]------------ +[43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717! 
+[43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI +[43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54 +[43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 +[43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] +[43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 +[43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 +[43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff +[43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 +[43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 +[43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 +[43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000 +[43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0 +[43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 +[43.150559] Call Trace: +[43.150904] +[43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs] +[43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs] +[43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] +[43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs] +[43.154475] ? __check_object_size+0x150/0x170 +[43.155170] ? preempt_count_add+0x49/0xa0 +[43.155753] ? __x64_sys_ioctl+0x84/0xc0 +[43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] +[43.157456] __x64_sys_ioctl+0x84/0xc0 +[43.157980] do_syscall_64+0x3a/0x80 +[43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae +[43.159231] RIP: 0033:0x7f7657f1e59b +[43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 +[43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b +[43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003 +[43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 +[43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 +[43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800 +[43.166863] +[43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common +[43.169552] ---[ end trace 0000000000000000 ]--- +[43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] +[43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 +[43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 +[43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff +[43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 +[43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 +[43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 +[43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000 +[43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0 +[43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + +We first hit the WARN_ON(rc->block_group->pinned > 0) in +btrfs_relocate_block_group() and then the BUG_ON(!cache) in 
+unpin_extent_range(). This tells us that we are exiting relocation and +removing the block group with bytes still pinned for that block group. +This is supposed to be impossible: the last thing relocate_block_group() +does is commit the transaction to get rid of pinned extents. + +Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when +waiting for a transaction commit") introduced an optimization so that +commits from fsync don't have to wait for the previous commit to unpin +extents. This was only intended to affect fsync, but it inadvertently +made it possible for any commit to skip waiting for the previous commit +to unpin. This is because if a call to btrfs_commit_transaction() finds +that another thread is already committing the transaction, it waits for +the other thread to complete the commit and then returns. If that other +thread was in fsync, then it completes the commit without completing the +previous commit. This makes the following sequence of events possible: + +Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________ +btrfs_commit_transaction(N) | | + btrfs_run_delayed_refs | | + pin extents | | + ... | | + state = UNBLOCKED |btrfs_sync_file | + | btrfs_start_transaction(N + 1) |relocate_block_group + | | btrfs_join_transaction(N + 1) + | btrfs_commit_transaction(N + 1) | + ... | trans->state = COMMIT_START | + | | btrfs_commit_transaction(N + 1) + | | wait_for_commit(N + 1, COMPLETED) + | wait_for_commit(N, SUPER_COMMITTED)| + state = SUPER_COMMITTED | ... | + btrfs_finish_extent_commit| | + unpin_extent_range() | trans->state = COMPLETED | + | | return + | | + ... | |Thread 1 isn't done, so pinned > 0 + | |and we WARN + | | + | |btrfs_remove_block_group + unpin_extent_range() | | + Thread 3 removed the | | + block group, so we BUG| | + +There are other sequences involving SUPER_COMMITTED transactions that +can cause a similar outcome. + +We could fix this by making relocation explicitly wait for unpinning, +but there may be other cases that need it. Josef mentioned ENOSPC +flushing and the free space cache inode as other potential victims. +Rather than playing whack-a-mole, this fix is conservative and makes all +commits not in fsync wait for all previous transactions, which is what +the optimization intended. + +Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") +CC: stable@vger.kernel.org # 5.15+ +Reviewed-by: Filipe Manana +Signed-off-by: Omar Sandoval +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/transaction.c | 32 +++++++++++++++++++++++++++++++- + 1 file changed, 31 insertions(+), 1 deletion(-) + +--- a/fs/btrfs/transaction.c ++++ b/fs/btrfs/transaction.c +@@ -846,7 +846,37 @@ btrfs_attach_transaction_barrier(struct + static noinline void wait_for_commit(struct btrfs_transaction *commit, + const enum btrfs_trans_state min_state) + { +- wait_event(commit->commit_wait, commit->state >= min_state); ++ struct btrfs_fs_info *fs_info = commit->fs_info; ++ u64 transid = commit->transid; ++ bool put = false; ++ ++ while (1) { ++ wait_event(commit->commit_wait, commit->state >= min_state); ++ if (put) ++ btrfs_put_transaction(commit); ++ ++ if (min_state < TRANS_STATE_COMPLETED) ++ break; ++ ++ /* ++ * A transaction isn't really completed until all of the ++ * previous transactions are completed, but with fsync we can ++ * end up with SUPER_COMMITTED transactions before a COMPLETED ++ * transaction. Wait for those. 
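++		 *
++		 * Note: each transaction we wait on here is pinned with a
++		 * temporary reference (the refcount_inc() below), so it
++		 * cannot be freed while we sleep in wait_event(); the
++		 * reference is dropped right after the wait (the "put"
++		 * handling above).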
++		 */
++
++		spin_lock(&fs_info->trans_lock);
++		commit = list_first_entry_or_null(&fs_info->trans_list,
++						  struct btrfs_transaction,
++						  list);
++		if (!commit || commit->transid > transid) {
++			spin_unlock(&fs_info->trans_lock);
++			break;
++		}
++		refcount_inc(&commit->use_count);
++		put = true;
++		spin_unlock(&fs_info->trans_lock);
++	}
+ }
+ 
+ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
diff --git a/queue-5.15/btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch b/queue-5.15/btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch
new file mode 100644
index 00000000000..82af62d1065
--- /dev/null
+++ b/queue-5.15/btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch
@@ -0,0 +1,82 @@
+From d4aef1e122d8bbdc15ce3bd0bc813d6b44a7d63a Mon Sep 17 00:00:00 2001
+From: Sidong Yang
+Date: Mon, 28 Feb 2022 01:43:40 +0000
+Subject: btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
+
+From: Sidong Yang
+
+commit d4aef1e122d8bbdc15ce3bd0bc813d6b44a7d63a upstream.
+
+Commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and
+qgroup rescan worker") by Kawasaki resolves the deadlock between quota
+disable and the qgroup rescan worker. But there is also a similar
+deadlock case, involving enabling or disabling quotas while creating or
+removing a qgroup. It can be reproduced with the simple script below.
+
+for i in {1..100}
+do
+	btrfs quota enable /mnt &
+	btrfs qgroup create 1/0 /mnt &
+	btrfs qgroup destroy 1/0 /mnt &
+	btrfs quota disable /mnt &
+done
+
+Here's why the deadlock happens:
+
+1) The quota rescan task is running.
+
+2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
+   mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
+   the quota rescan task to complete.
+
+3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
+   the qgroup_ioctl_lock mutex, because it's being held by task A. At that
+   point task B is holding a transaction handle for the current transaction.
+
+4) The quota rescan task calls btrfs_commit_transaction(). This results
+   in it waiting for all other tasks to release their handles on the
+   transaction, but task B is blocked on the qgroup_ioctl_lock mutex
+   while holding a handle on the transaction, and that mutex is being held
+   by task A, which is waiting for the quota rescan task to complete,
+   resulting in a deadlock between these 3 tasks.
+
+To resolve this issue, the thread disabling quotas should unlock
+qgroup_ioctl_lock before waiting for rescan completion. Move the
+btrfs_qgroup_wait_for_completion() call after the unlock of
+qgroup_ioctl_lock.
+
+Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
+CC: stable@vger.kernel.org # 5.4+
+Reviewed-by: Filipe Manana
+Reviewed-by: Shin'ichiro Kawasaki
+Signed-off-by: Sidong Yang
+Reviewed-by: David Sterba
+Signed-off-by: David Sterba
+Signed-off-by: Greg Kroah-Hartman
+---
+ fs/btrfs/qgroup.c | 9 ++++++++-
+ 1 file changed, 8 insertions(+), 1 deletion(-)
+
+--- a/fs/btrfs/qgroup.c
++++ b/fs/btrfs/qgroup.c
+@@ -1197,13 +1197,20 @@ int btrfs_quota_disable(struct btrfs_fs_
+ 		goto out;
+ 
+ 	/*
++	 * Unlock the qgroup_ioctl_lock mutex before waiting for the rescan worker to
++	 * complete. Otherwise we can deadlock because btrfs_remove_qgroup() needs
++	 * to lock that mutex while holding a transaction handle and the rescan
++	 * worker needs to commit a transaction.
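++	 * In short: qgroup_ioctl_lock must never be held while waiting
++	 * for the rescan worker to finish.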
++ */ ++ mutex_unlock(&fs_info->qgroup_ioctl_lock); ++ ++ /* + * Request qgroup rescan worker to complete and wait for it. This wait + * must be done before transaction start for quota disable since it may + * deadlock with transaction by the qgroup rescan worker. + */ + clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags); + btrfs_qgroup_wait_for_completion(fs_info, false); +- mutex_unlock(&fs_info->qgroup_ioctl_lock); + + /* + * 1 For the root item diff --git a/queue-5.15/series b/queue-5.15/series index 98d85d6e5a4..f70ad123e12 100644 --- a/queue-5.15/series +++ b/queue-5.15/series @@ -249,3 +249,11 @@ input-elan_i2c-fix-regulator-enable-count-imbalance-after-suspend-resume.patch input-samsung-keypad-properly-state-iomem-dependency.patch hid-add-mapping-for-key_dictate.patch hid-add-mapping-for-key_all_applications.patch +tracing-histogram-fix-sorting-on-old-cpu-value.patch +tracing-fix-return-value-of-__setup-handlers.patch +btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch +btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch +btrfs-do-not-warn_on-if-we-have-pageerror-set.patch +btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch +btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch +btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch diff --git a/queue-5.15/tracing-fix-return-value-of-__setup-handlers.patch b/queue-5.15/tracing-fix-return-value-of-__setup-handlers.patch new file mode 100644 index 00000000000..0421d0a8279 --- /dev/null +++ b/queue-5.15/tracing-fix-return-value-of-__setup-handlers.patch @@ -0,0 +1,82 @@ +From 1d02b444b8d1345ea4708db3bab4db89a7784b55 Mon Sep 17 00:00:00 2001 +From: Randy Dunlap +Date: Wed, 2 Mar 2022 19:17:44 -0800 +Subject: tracing: Fix return value of __setup handlers + +From: Randy Dunlap + +commit 1d02b444b8d1345ea4708db3bab4db89a7784b55 upstream. + +__setup() handlers should generally return 1 to indicate that the +boot options have been handled. + +Using invalid option values causes the entire kernel boot option +string to be reported as Unknown and added to init's environment +strings, polluting it. + + Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6 + kprobe_event=p,syscall_any,$arg1 trace_options=quiet + trace_clock=jiffies", will be passed to user space. + + Run /sbin/init as init process + with arguments: + /sbin/init + with environment: + HOME=/ + TERM=linux + BOOT_IMAGE=/boot/bzImage-517rc6 + kprobe_event=p,syscall_any,$arg1 + trace_options=quiet + trace_clock=jiffies + +Return 1 from the __setup() handlers so that init's environment is not +polluted with kernel boot options. 
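+
+For reference, the general shape of a handler under this convention (a
+minimal sketch with hypothetical names, assuming <linux/init.h> and
+<linux/string.h>; not part of the patch below):
+
+  static char my_opt_buf[64];
+
+  static int __init my_opt_setup(char *str)
+  {
+          /* Stash the value for later parsing. */
+          strlcpy(my_opt_buf, str, sizeof(my_opt_buf));
+          /*
+           * Return 1 to mark the option as handled. Returning 0 would
+           * make the kernel treat it as unknown and pass it on to init,
+           * polluting init's environment.
+           */
+          return 1;
+  }
+  __setup("my_opt=", my_opt_setup);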
+ +Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru +Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org + +Cc: stable@vger.kernel.org +Fixes: 7bcfaf54f591 ("tracing: Add trace_options kernel command line parameter") +Fixes: e1e232ca6b8f ("tracing: Add trace_clock= kernel parameter") +Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") +Signed-off-by: Randy Dunlap +Reported-by: Igor Zhbanov +Acked-by: Masami Hiramatsu +Signed-off-by: Steven Rostedt (Google) +Signed-off-by: Greg Kroah-Hartman +--- + kernel/trace/trace.c | 4 ++-- + kernel/trace/trace_kprobe.c | 2 +- + 2 files changed, 3 insertions(+), 3 deletions(-) + +--- a/kernel/trace/trace.c ++++ b/kernel/trace/trace.c +@@ -235,7 +235,7 @@ static char trace_boot_options_buf[MAX_T + static int __init set_trace_boot_options(char *str) + { + strlcpy(trace_boot_options_buf, str, MAX_TRACER_SIZE); +- return 0; ++ return 1; + } + __setup("trace_options=", set_trace_boot_options); + +@@ -246,7 +246,7 @@ static int __init set_trace_boot_clock(c + { + strlcpy(trace_boot_clock_buf, str, MAX_TRACER_SIZE); + trace_boot_clock = trace_boot_clock_buf; +- return 0; ++ return 1; + } + __setup("trace_clock=", set_trace_boot_clock); + +--- a/kernel/trace/trace_kprobe.c ++++ b/kernel/trace/trace_kprobe.c +@@ -31,7 +31,7 @@ static int __init set_kprobe_boot_events + strlcpy(kprobe_boot_events_buf, str, COMMAND_LINE_SIZE); + disable_tracing_selftest("running kprobe events"); + +- return 0; ++ return 1; + } + __setup("kprobe_event=", set_kprobe_boot_events); + diff --git a/queue-5.15/tracing-histogram-fix-sorting-on-old-cpu-value.patch b/queue-5.15/tracing-histogram-fix-sorting-on-old-cpu-value.patch new file mode 100644 index 00000000000..008e8109e08 --- /dev/null +++ b/queue-5.15/tracing-histogram-fix-sorting-on-old-cpu-value.patch @@ -0,0 +1,80 @@ +From 1d1898f65616c4601208963c3376c1d828cbf2c7 Mon Sep 17 00:00:00 2001 +From: "Steven Rostedt (Google)" +Date: Tue, 1 Mar 2022 22:29:04 -0500 +Subject: tracing/histogram: Fix sorting on old "cpu" value + +From: Steven Rostedt (Google) + +commit 1d1898f65616c4601208963c3376c1d828cbf2c7 upstream. + +When trying to add a histogram against an event with the "cpu" field, it +was impossible due to "cpu" being a keyword to key off of the running CPU. +So to fix this, it was changed to "common_cpu" to match the other generic +fields (like "common_pid"). But since some scripts used "cpu" for keying +off of the CPU (for events that did not have "cpu" as a field, which is +most of them), a backward compatibility trick was added such that if "cpu" +was used as a key, and the event did not have "cpu" as a field name, then +it would fallback and switch over to "common_cpu". + +This fix has a couple of subtle bugs. One was that when switching over to +"common_cpu", it did not change the field name, it just set a flag. But +the code still found a "cpu" field. The "cpu" field is used for filtering +and is returned when the event does not have a "cpu" field. 
This was found by:
+
+ # cd /sys/kernel/tracing
+ # echo hist:key=cpu,pid:sort=cpu > events/sched/sched_wakeup/trigger
+ # cat events/sched/sched_wakeup/hist
+
+Which showed the histogram unsorted:
+
+{ cpu: 19, pid: 1175 } hitcount: 1
+{ cpu: 6, pid: 239 } hitcount: 2
+{ cpu: 23, pid: 1186 } hitcount: 14
+{ cpu: 12, pid: 249 } hitcount: 2
+{ cpu: 3, pid: 994 } hitcount: 5
+
+Instead of hard coding the "cpu" checks, take advantage of the fact that
+trace_find_event_field() returns a special field for "cpu" and "CPU" if
+the event does not have a "cpu" field. This special field has the
+"filter_type" of "FILTER_CPU". Check that to test if the returned field is
+of the CPU type instead of doing the string compare.
+
+Also, fix the sorting bug by testing for the hist_field flag of
+HIST_FIELD_FL_CPU when setting up the sort routine. Otherwise it will use
+the special CPU field to know what compare routine to use, and since that
+special field does not have a size, it returns tracing_map_cmp_none.
+
+Cc: stable@vger.kernel.org
+Fixes: 1e3bac71c505 ("tracing/histogram: Rename "cpu" to "common_cpu"")
+Reported-by: Daniel Bristot de Oliveira
+Signed-off-by: Steven Rostedt (Google)
+Signed-off-by: Greg Kroah-Hartman
+---
+ kernel/trace/trace_events_hist.c | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+
+--- a/kernel/trace/trace_events_hist.c
++++ b/kernel/trace/trace_events_hist.c
+@@ -2049,9 +2049,9 @@ parse_field(struct hist_trigger_data *hi
+ 		/*
+ 		 * For backward compatibility, if field_name
+ 		 * was "cpu", then we treat this the same as
+-		 * common_cpu.
++		 * common_cpu. This also works for "CPU".
+ 		 */
+-		if (strcmp(field_name, "cpu") == 0) {
++		if (field && field->filter_type == FILTER_CPU) {
+ 			*flags |= HIST_FIELD_FL_CPU;
+ 		} else {
+ 			hist_err(tr, HIST_ERR_FIELD_NOT_FOUND,
+@@ -4478,7 +4478,7 @@ static int create_tracing_map_fields(str
+ 
+ 		if (hist_field->flags & HIST_FIELD_FL_STACKTRACE)
+ 			cmp_fn = tracing_map_cmp_none;
+-		else if (!field)
++		else if (!field || hist_field->flags & HIST_FIELD_FL_CPU)
+ 			cmp_fn = tracing_map_cmp_num(hist_field->size,
+ 						     hist_field->is_signed);
+ 		else if (is_string_field(field))