From: Greg Kroah-Hartman
Date: Mon, 7 Mar 2022 07:39:00 +0000 (+0100)
Subject: 5.15-stable patches
X-Git-Tag: v4.9.305~22
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=2773007de653e0cf683cdd067e90006939f5706f;p=thirdparty%2Fkernel%2Fstable-queue.git

5.15-stable patches

added patches:
	btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch
	btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch
	btrfs-do-not-warn_on-if-we-have-pageerror-set.patch
	btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch
	btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch
	btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch
	tracing-fix-return-value-of-__setup-handlers.patch
	tracing-histogram-fix-sorting-on-old-cpu-value.patch
---

diff --git a/queue-5.15/btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch b/queue-5.15/btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch
new file mode 100644
index 00000000000..a2ea312d46e
--- /dev/null
+++ b/queue-5.15/btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch
@@ -0,0 +1,72 @@
+From 4751dc99627e4d1465c5bfa8cb7ab31ed418eff5 Mon Sep 17 00:00:00 2001
+From: Filipe Manana
+Date: Mon, 28 Feb 2022 16:29:28 +0000
+Subject: btrfs: add missing run of delayed items after unlink during log replay
+
+From: Filipe Manana
+
+commit 4751dc99627e4d1465c5bfa8cb7ab31ed418eff5 upstream.
+
+During log replay, whenever we need to check if a name (dentry) exists in
+a directory we do searches on the subvolume tree for inode references or
+directory entries (BTRFS_DIR_INDEX_KEY keys, and BTRFS_DIR_ITEM_KEY keys
+as well, before kernel 5.17). However, when we unlink a name during log
+replay, through btrfs_unlink_inode(), we may not delete inode references
+and dir index keys from a subvolume tree and instead just add the
+deletions to the delayed inode's delayed items, which will only be run
+when we commit the transaction used for log replay. This means that after
+an unlink operation during log replay, if we attempt to search for the
+same name, we will not see that the name was already deleted, since the
+deletion is recorded only on the delayed items.
+
+We run delayed items after every unlink operation during log replay,
+except at unlink_old_inode_refs() and at add_inode_ref(). This was an
+oversight, as delayed items should be run after every unlink, for the
+reasons stated above.
+
+So fix those two cases.
+
+Fixes: 0d836392cadd5 ("Btrfs: fix mount failure after fsync due to hard link recreation")
+Fixes: 1f250e929a9c9 ("Btrfs: fix log replay failure after unlink and link combination")
+CC: stable@vger.kernel.org # 4.19+
+Signed-off-by: Filipe Manana
+Signed-off-by: David Sterba
+Signed-off-by: Greg Kroah-Hartman
+---
+ fs/btrfs/tree-log.c | 18 ++++++++++++++++++
+ 1 file changed, 18 insertions(+)
+
+--- a/fs/btrfs/tree-log.c
++++ b/fs/btrfs/tree-log.c
+@@ -1329,6 +1329,15 @@ again:
+ 				inode, name, namelen);
+ 		kfree(name);
+ 		iput(dir);
++		/*
++		 * Whenever we need to check if a name exists or not, we
++		 * check the subvolume tree. So after an unlink we must
++		 * run delayed items, so that future checks for a name
++		 * during log replay see that the name does not exists
++		 * anymore.
++ */ ++ if (!ret) ++ ret = btrfs_run_delayed_items(trans); + if (ret) + goto out; + goto again; +@@ -1580,6 +1589,15 @@ static noinline int add_inode_ref(struct + */ + if (!ret && inode->i_nlink == 0) + inc_nlink(inode); ++ /* ++ * Whenever we need to check if a name exists or ++ * not, we check the subvolume tree. So after an ++ * unlink we must run delayed items, so that future ++ * checks for a name during log replay see that the ++ * name does not exists anymore. ++ */ ++ if (!ret) ++ ret = btrfs_run_delayed_items(trans); + } + if (ret < 0) + goto out; diff --git a/queue-5.15/btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch b/queue-5.15/btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch new file mode 100644 index 00000000000..e8e76e52a18 --- /dev/null +++ b/queue-5.15/btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch @@ -0,0 +1,279 @@ +From b4be6aefa73c9a6899ef3ba9c5faaa8a66e333ef Mon Sep 17 00:00:00 2001 +From: Josef Bacik +Date: Fri, 18 Feb 2022 14:56:10 -0500 +Subject: btrfs: do not start relocation until in progress drops are done + +From: Josef Bacik + +commit b4be6aefa73c9a6899ef3ba9c5faaa8a66e333ef upstream. + +We hit a bug with a recovering relocation on mount for one of our file +systems in production. I reproduced this locally by injecting errors +into snapshot delete with balance running at the same time. This +presented as an error while looking up an extent item + + WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 + CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 + RIP: 0010:lookup_inline_extent_backref+0x647/0x680 + RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 + RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 + RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 + RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 + R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 + R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 + FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 + Call Trace: + + insert_inline_extent_backref+0x46/0xd0 + __btrfs_inc_extent_ref.isra.0+0x5f/0x200 + ? btrfs_merge_delayed_refs+0x164/0x190 + __btrfs_run_delayed_refs+0x561/0xfa0 + ? btrfs_search_slot+0x7b4/0xb30 + ? btrfs_update_root+0x1a9/0x2c0 + btrfs_run_delayed_refs+0x73/0x1f0 + ? btrfs_update_root+0x1a9/0x2c0 + btrfs_commit_transaction+0x50/0xa50 + ? btrfs_update_reloc_root+0x122/0x220 + prepare_to_merge+0x29f/0x320 + relocate_block_group+0x2b8/0x550 + btrfs_relocate_block_group+0x1a6/0x350 + btrfs_relocate_chunk+0x27/0xe0 + btrfs_balance+0x777/0xe60 + balance_kthread+0x35/0x50 + ? btrfs_balance+0xe60/0xe60 + kthread+0x16b/0x190 + ? set_kthread_struct+0x40/0x40 + ret_from_fork+0x22/0x30 + + +Normally snapshot deletion and relocation are excluded from running at +the same time by the fs_info->cleaner_mutex. However if we had a +pending balance waiting to get the ->cleaner_mutex, and a snapshot +deletion was running, and then the box crashed, we would come up in a +state where we have a half deleted snapshot. 
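+
+Roughly, the sequence that leaves a half-deleted snapshot behind is (an
+illustrative timeline based on the description above, not taken from the
+original report):
+
+  1) The cleaner, holding the cleaner_mutex, starts dropping a snapshot
+     and makes partial progress (a drop_progress key is recorded in the
+     root item).
+
+  2) A pending balance sits waiting on the cleaner_mutex.
+
+  3) The box crashes before the drop completes.
+
+  4) After the next mount, the drop is only resumed on a later run of
+     the cleaner, so relocation can start first.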
+ +Again, in the normal case the snapshot deletion needs to complete before +relocation can start, but in this case relocation could very well start +before the snapshot deletion completes, as we simply add the root to the +dead roots list and wait for the next time the cleaner runs to clean up +the snapshot. + +Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that +had a pending drop_progress key. If they do then we know we were in the +middle of the drop operation and set a flag on the fs_info. Then +balance can wait until this flag is cleared to start up again. + +If there are DEAD_ROOT's that don't have a drop_progress set then we're +safe to start balance right away as we'll be properly protected by the +cleaner_mutex. + +CC: stable@vger.kernel.org # 5.10+ +Reviewed-by: Filipe Manana +Signed-off-by: Josef Bacik +Reviewed-by: David Sterba +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/ctree.h | 10 ++++++++++ + fs/btrfs/disk-io.c | 10 ++++++++++ + fs/btrfs/extent-tree.c | 10 ++++++++++ + fs/btrfs/relocation.c | 13 +++++++++++++ + fs/btrfs/root-tree.c | 15 +++++++++++++++ + fs/btrfs/transaction.c | 33 ++++++++++++++++++++++++++++++++- + fs/btrfs/transaction.h | 1 + + 7 files changed, 91 insertions(+), 1 deletion(-) + +--- a/fs/btrfs/ctree.h ++++ b/fs/btrfs/ctree.h +@@ -593,6 +593,9 @@ enum { + /* Indicate whether there are any tree modification log users */ + BTRFS_FS_TREE_MOD_LOG_USERS, + ++ /* Indicate we have half completed snapshot deletions pending. */ ++ BTRFS_FS_UNFINISHED_DROPS, ++ + #if BITS_PER_LONG == 32 + /* Indicate if we have error/warn message printed on 32bit systems */ + BTRFS_FS_32BIT_ERROR, +@@ -1098,8 +1101,15 @@ enum { + BTRFS_ROOT_HAS_LOG_TREE, + /* Qgroup flushing is in progress */ + BTRFS_ROOT_QGROUP_FLUSHING, ++ /* This root has a drop operation that was started previously. */ ++ BTRFS_ROOT_UNFINISHED_DROP, + }; + ++static inline void btrfs_wake_unfinished_drop(struct btrfs_fs_info *fs_info) ++{ ++ clear_and_wake_up_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags); ++} ++ + /* + * Record swapped tree blocks of a subvolume tree for delayed subtree trace + * code. For detail check comment in fs/btrfs/qgroup.c. +--- a/fs/btrfs/disk-io.c ++++ b/fs/btrfs/disk-io.c +@@ -3659,6 +3659,10 @@ int __cold open_ctree(struct super_block + + set_bit(BTRFS_FS_OPEN, &fs_info->flags); + ++ /* Kick the cleaner thread so it'll start deleting snapshots. */ ++ if (test_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags)) ++ wake_up_process(fs_info->cleaner_kthread); ++ + clear_oneshot: + btrfs_clear_oneshot_options(fs_info); + return 0; +@@ -4340,6 +4344,12 @@ void __cold close_ctree(struct btrfs_fs_ + */ + kthread_park(fs_info->cleaner_kthread); + ++ /* ++ * If we had UNFINISHED_DROPS we could still be processing them, so ++ * clear that bit and wake up relocation so it can stop. ++ */ ++ btrfs_wake_unfinished_drop(fs_info); ++ + /* wait for the qgroup rescan worker to stop */ + btrfs_qgroup_wait_for_completion(fs_info, false); + +--- a/fs/btrfs/extent-tree.c ++++ b/fs/btrfs/extent-tree.c +@@ -5541,6 +5541,7 @@ int btrfs_drop_snapshot(struct btrfs_roo + int ret; + int level; + bool root_dropped = false; ++ bool unfinished_drop = false; + + btrfs_debug(fs_info, "Drop subvolume %llu", root->root_key.objectid); + +@@ -5583,6 +5584,8 @@ int btrfs_drop_snapshot(struct btrfs_roo + * already dropped. 
+ */ + set_bit(BTRFS_ROOT_DELETING, &root->state); ++ unfinished_drop = test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state); ++ + if (btrfs_disk_key_objectid(&root_item->drop_progress) == 0) { + level = btrfs_header_level(root->node); + path->nodes[level] = btrfs_lock_root_node(root); +@@ -5758,6 +5761,13 @@ out_free: + btrfs_free_path(path); + out: + /* ++ * We were an unfinished drop root, check to see if there are any ++ * pending, and if not clear and wake up any waiters. ++ */ ++ if (!err && unfinished_drop) ++ btrfs_maybe_wake_unfinished_drop(fs_info); ++ ++ /* + * So if we need to stop dropping the snapshot for whatever reason we + * need to make sure to add it back to the dead root list so that we + * keep trying to do the work later. This also cleans up roots if we +--- a/fs/btrfs/relocation.c ++++ b/fs/btrfs/relocation.c +@@ -3967,6 +3967,19 @@ int btrfs_relocate_block_group(struct bt + int rw = 0; + int err = 0; + ++ /* ++ * This only gets set if we had a half-deleted snapshot on mount. We ++ * cannot allow relocation to start while we're still trying to clean up ++ * these pending deletions. ++ */ ++ ret = wait_on_bit(&fs_info->flags, BTRFS_FS_UNFINISHED_DROPS, TASK_INTERRUPTIBLE); ++ if (ret) ++ return ret; ++ ++ /* We may have been woken up by close_ctree, so bail if we're closing. */ ++ if (btrfs_fs_closing(fs_info)) ++ return -EINTR; ++ + bg = btrfs_lookup_block_group(fs_info, group_start); + if (!bg) + return -ENOENT; +--- a/fs/btrfs/root-tree.c ++++ b/fs/btrfs/root-tree.c +@@ -280,6 +280,21 @@ int btrfs_find_orphan_roots(struct btrfs + + WARN_ON(!test_bit(BTRFS_ROOT_ORPHAN_ITEM_INSERTED, &root->state)); + if (btrfs_root_refs(&root->root_item) == 0) { ++ struct btrfs_key drop_key; ++ ++ btrfs_disk_key_to_cpu(&drop_key, &root->root_item.drop_progress); ++ /* ++ * If we have a non-zero drop_progress then we know we ++ * made it partly through deleting this snapshot, and ++ * thus we need to make sure we block any balance from ++ * happening until this snapshot is completely dropped. ++ */ ++ if (drop_key.objectid != 0 || drop_key.type != 0 || ++ drop_key.offset != 0) { ++ set_bit(BTRFS_FS_UNFINISHED_DROPS, &fs_info->flags); ++ set_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state); ++ } ++ + set_bit(BTRFS_ROOT_DEAD_TREE, &root->state); + btrfs_add_dead_root(root); + } +--- a/fs/btrfs/transaction.c ++++ b/fs/btrfs/transaction.c +@@ -1341,6 +1341,32 @@ again: + } + + /* ++ * If we had a pending drop we need to see if there are any others left in our ++ * dead roots list, and if not clear our bit and wake any waiters. ++ */ ++void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info) ++{ ++ /* ++ * We put the drop in progress roots at the front of the list, so if the ++ * first entry doesn't have UNFINISHED_DROP set we can wake everybody ++ * up. ++ */ ++ spin_lock(&fs_info->trans_lock); ++ if (!list_empty(&fs_info->dead_roots)) { ++ struct btrfs_root *root = list_first_entry(&fs_info->dead_roots, ++ struct btrfs_root, ++ root_list); ++ if (test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state)) { ++ spin_unlock(&fs_info->trans_lock); ++ return; ++ } ++ } ++ spin_unlock(&fs_info->trans_lock); ++ ++ btrfs_wake_unfinished_drop(fs_info); ++} ++ ++/* + * dead roots are old snapshots that need to be deleted. 
This allocates + * a dirty root struct and adds it into the list of dead roots that need to + * be deleted +@@ -1352,7 +1378,12 @@ void btrfs_add_dead_root(struct btrfs_ro + spin_lock(&fs_info->trans_lock); + if (list_empty(&root->root_list)) { + btrfs_grab_root(root); +- list_add_tail(&root->root_list, &fs_info->dead_roots); ++ ++ /* We want to process the partially complete drops first. */ ++ if (test_bit(BTRFS_ROOT_UNFINISHED_DROP, &root->state)) ++ list_add(&root->root_list, &fs_info->dead_roots); ++ else ++ list_add_tail(&root->root_list, &fs_info->dead_roots); + } + spin_unlock(&fs_info->trans_lock); + } +--- a/fs/btrfs/transaction.h ++++ b/fs/btrfs/transaction.h +@@ -217,6 +217,7 @@ int btrfs_wait_for_commit(struct btrfs_f + + void btrfs_add_dead_root(struct btrfs_root *root); + int btrfs_defrag_root(struct btrfs_root *root); ++void btrfs_maybe_wake_unfinished_drop(struct btrfs_fs_info *fs_info); + int btrfs_clean_one_deleted_snapshot(struct btrfs_root *root); + int btrfs_commit_transaction(struct btrfs_trans_handle *trans); + int btrfs_commit_transaction_async(struct btrfs_trans_handle *trans); diff --git a/queue-5.15/btrfs-do-not-warn_on-if-we-have-pageerror-set.patch b/queue-5.15/btrfs-do-not-warn_on-if-we-have-pageerror-set.patch new file mode 100644 index 00000000000..65709a8091b --- /dev/null +++ b/queue-5.15/btrfs-do-not-warn_on-if-we-have-pageerror-set.patch @@ -0,0 +1,119 @@ +From a50e1fcbc9b85fd4e95b89a75c0884cb032a3e06 Mon Sep 17 00:00:00 2001 +From: Josef Bacik +Date: Fri, 18 Feb 2022 10:17:39 -0500 +Subject: btrfs: do not WARN_ON() if we have PageError set + +From: Josef Bacik + +commit a50e1fcbc9b85fd4e95b89a75c0884cb032a3e06 upstream. + +Whenever we do any extent buffer operations we call +assert_eb_page_uptodate() to complain loudly if we're operating on an +non-uptodate page. Our overnight tests caught this warning earlier this +week + + WARNING: CPU: 1 PID: 553508 at fs/btrfs/extent_io.c:6849 assert_eb_page_uptodate+0x3f/0x50 + CPU: 1 PID: 553508 Comm: kworker/u4:13 Tainted: G W 5.17.0-rc3+ #564 + Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014 + Workqueue: btrfs-cache btrfs_work_helper + RIP: 0010:assert_eb_page_uptodate+0x3f/0x50 + RSP: 0018:ffffa961440a7c68 EFLAGS: 00010246 + RAX: 0017ffffc0002112 RBX: ffffe6e74453f9c0 RCX: 0000000000001000 + RDX: ffffe6e74467c887 RSI: ffffe6e74453f9c0 RDI: ffff8d4c5efc2fc0 + RBP: 0000000000000d56 R08: ffff8d4d4a224000 R09: 0000000000000000 + R10: 00015817fa9d1ef0 R11: 000000000000000c R12: 00000000000007b1 + R13: ffff8d4c5efc2fc0 R14: 0000000001500000 R15: 0000000001cb1000 + FS: 0000000000000000(0000) GS:ffff8d4dbbd00000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 00007ff31d3448d8 CR3: 0000000118be8004 CR4: 0000000000370ee0 + Call Trace: + + extent_buffer_test_bit+0x3f/0x70 + free_space_test_bit+0xa6/0xc0 + load_free_space_tree+0x1f6/0x470 + caching_thread+0x454/0x630 + ? rcu_read_lock_sched_held+0x12/0x60 + ? rcu_read_lock_sched_held+0x12/0x60 + ? rcu_read_lock_sched_held+0x12/0x60 + ? lock_release+0x1f0/0x2d0 + btrfs_work_helper+0xf2/0x3e0 + ? lock_release+0x1f0/0x2d0 + ? finish_task_switch.isra.0+0xf9/0x3a0 + process_one_work+0x26d/0x580 + ? process_one_work+0x580/0x580 + worker_thread+0x55/0x3b0 + ? process_one_work+0x580/0x580 + kthread+0xf0/0x120 + ? 
kthread_complete_and_exit+0x20/0x20 + ret_from_fork+0x1f/0x30 + +This was partially fixed by c2e39305299f01 ("btrfs: clear extent buffer +uptodate when we fail to write it"), however all that fix did was keep +us from finding extent buffers after a failed writeout. It didn't keep +us from continuing to use a buffer that we already had found. + +In this case we're searching the commit root to cache the block group, +so we can start committing the transaction and switch the commit root +and then start writing. After the switch we can look up an extent +buffer that hasn't been written yet and start processing that block +group. Then we fail to write that block out and clear Uptodate on the +page, and then we start spewing these errors. + +Normally we're protected by the tree lock to a certain degree here. If +we read a block we have that block read locked, and we block the writer +from locking the block before we submit it for the write. However this +isn't necessarily fool proof because the read could happen before we do +the submit_bio and after we locked and unlocked the extent buffer. + +Also in this particular case we have path->skip_locking set, so that +won't save us here. We'll simply get a block that was valid when we +read it, but became invalid while we were using it. + +What we really want is to catch the case where we've "read" a block but +it's not marked Uptodate. On read we ClearPageError(), so if we're +!Uptodate and !Error we know we didn't do the right thing for reading +the page. + +Fix this by checking !Uptodate && !Error, this way we will not complain +if our buffer gets invalidated while we're using it, and we'll maintain +the spirit of the check which is to make sure we have a fully in-cache +block while we're messing with it. + +CC: stable@vger.kernel.org # 5.4+ +Signed-off-by: Josef Bacik +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/extent_io.c | 16 +++++++++++++--- + 1 file changed, 13 insertions(+), 3 deletions(-) + +--- a/fs/btrfs/extent_io.c ++++ b/fs/btrfs/extent_io.c +@@ -6801,14 +6801,24 @@ static void assert_eb_page_uptodate(cons + { + struct btrfs_fs_info *fs_info = eb->fs_info; + ++ /* ++ * If we are using the commit root we could potentially clear a page ++ * Uptodate while we're using the extent buffer that we've previously ++ * looked up. We don't want to complain in this case, as the page was ++ * valid before, we just didn't write it out. Instead we want to catch ++ * the case where we didn't actually read the block properly, which ++ * would have !PageUptodate && !PageError, as we clear PageError before ++ * reading. 
++ */ + if (fs_info->sectorsize < PAGE_SIZE) { +- bool uptodate; ++ bool uptodate, error; + + uptodate = btrfs_subpage_test_uptodate(fs_info, page, + eb->start, eb->len); +- WARN_ON(!uptodate); ++ error = btrfs_subpage_test_error(fs_info, page, eb->start, eb->len); ++ WARN_ON(!uptodate && !error); + } else { +- WARN_ON(!PageUptodate(page)); ++ WARN_ON(!PageUptodate(page) && !PageError(page)); + } + } + diff --git a/queue-5.15/btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch b/queue-5.15/btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch new file mode 100644 index 00000000000..e10ccf5b919 --- /dev/null +++ b/queue-5.15/btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch @@ -0,0 +1,175 @@ +From d99478874355d3a7b9d86dfb5d7590d5b1754b1f Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Thu, 17 Feb 2022 12:12:02 +0000 +Subject: btrfs: fix lost prealloc extents beyond eof after full fsync + +From: Filipe Manana + +commit d99478874355d3a7b9d86dfb5d7590d5b1754b1f upstream. + +When doing a full fsync, if we have prealloc extents beyond (or at) eof, +and the leaves that contain them were not modified in the current +transaction, we end up not logging them. This results in losing those +extents when we replay the log after a power failure, since the inode is +truncated to the current value of the logged i_size. + +Just like for the fast fsync path, we need to always log all prealloc +extents starting at or beyond i_size. The fast fsync case was fixed in +commit 471d557afed155 ("Btrfs: fix loss of prealloc extents past i_size +after fsync log replay") but it missed the full fsync path. The problem +exists since the very early days, when the log tree was added by +commit e02119d5a7b439 ("Btrfs: Add a write ahead tree log to optimize +synchronous operations"). + +Example reproducer: + + $ mkfs.btrfs -f /dev/sdc + $ mount /dev/sdc /mnt + + # Create our test file with many file extent items, so that they span + # several leaves of metadata, even if the node/page size is 64K. Use + # direct IO and not fsync/O_SYNC because it's both faster and it avoids + # clearing the full sync flag from the inode - we want the fsync below + # to trigger the slow full sync code path. + $ xfs_io -f -d -c "pwrite -b 4K 0 16M" /mnt/foo + + # Now add two preallocated extents to our file without extending the + # file's size. One right at i_size, and another further beyond, leaving + # a gap between the two prealloc extents. + $ xfs_io -c "falloc -k 16M 1M" /mnt/foo + $ xfs_io -c "falloc -k 20M 1M" /mnt/foo + + # Make sure everything is durably persisted and the transaction is + # committed. This makes all created extents to have a generation lower + # than the generation of the transaction used by the next write and + # fsync. + sync + + # Now overwrite only the first extent, which will result in modifying + # only the first leaf of metadata for our inode. Then fsync it. This + # fsync will use the slow code path (inode full sync bit is set) because + # it's the first fsync since the inode was created/loaded. + $ xfs_io -c "pwrite 0 4K" -c "fsync" /mnt/foo + + # Extent list before power failure. + $ xfs_io -c "fiemap -v" /mnt/foo + /mnt/foo: + EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS + 0: [0..7]: 2178048..2178055 8 0x0 + 1: [8..16383]: 26632..43007 16376 0x0 + 2: [16384..32767]: 2156544..2172927 16384 0x0 + 3: [32768..34815]: 2172928..2174975 2048 0x800 + 4: [34816..40959]: hole 6144 + 5: [40960..43007]: 2174976..2177023 2048 0x801 + + + + # Mount fs again, trigger log replay. 
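+  # (A power failure is assumed to happen at this point; the mount
+  # below then triggers log replay.)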
+ $ mount /dev/sdc /mnt + + # Extent list after power failure and log replay. + $ xfs_io -c "fiemap -v" /mnt/foo + /mnt/foo: + EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS + 0: [0..7]: 2178048..2178055 8 0x0 + 1: [8..16383]: 26632..43007 16376 0x0 + 2: [16384..32767]: 2156544..2172927 16384 0x1 + + # The prealloc extents at file offsets 16M and 20M are missing. + +So fix this by calling btrfs_log_prealloc_extents() when we are doing a +full fsync, so that we always log all prealloc extents beyond eof. + +A test case for fstests will follow soon. + +CC: stable@vger.kernel.org # 4.19+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/tree-log.c | 43 +++++++++++++++++++++++++++++++------------ + 1 file changed, 31 insertions(+), 12 deletions(-) + +--- a/fs/btrfs/tree-log.c ++++ b/fs/btrfs/tree-log.c +@@ -4423,7 +4423,7 @@ static int log_one_extent(struct btrfs_t + + /* + * Log all prealloc extents beyond the inode's i_size to make sure we do not +- * lose them after doing a fast fsync and replaying the log. We scan the ++ * lose them after doing a full/fast fsync and replaying the log. We scan the + * subvolume's root instead of iterating the inode's extent map tree because + * otherwise we can log incorrect extent items based on extent map conversion. + * That can happen due to the fact that extent maps are merged when they +@@ -5208,6 +5208,7 @@ static int copy_inode_items_to_log(struc + struct btrfs_log_ctx *ctx, + bool *need_log_inode_item) + { ++ const u64 i_size = i_size_read(&inode->vfs_inode); + struct btrfs_root *root = inode->root; + int ins_start_slot = 0; + int ins_nr = 0; +@@ -5228,13 +5229,21 @@ again: + if (min_key->type > max_key->type) + break; + +- if (min_key->type == BTRFS_INODE_ITEM_KEY) ++ if (min_key->type == BTRFS_INODE_ITEM_KEY) { + *need_log_inode_item = false; +- +- if ((min_key->type == BTRFS_INODE_REF_KEY || +- min_key->type == BTRFS_INODE_EXTREF_KEY) && +- inode->generation == trans->transid && +- !recursive_logging) { ++ } else if (min_key->type == BTRFS_EXTENT_DATA_KEY && ++ min_key->offset >= i_size) { ++ /* ++ * Extents at and beyond eof are logged with ++ * btrfs_log_prealloc_extents(). ++ * Only regular files have BTRFS_EXTENT_DATA_KEY keys, ++ * and no keys greater than that, so bail out. ++ */ ++ break; ++ } else if ((min_key->type == BTRFS_INODE_REF_KEY || ++ min_key->type == BTRFS_INODE_EXTREF_KEY) && ++ inode->generation == trans->transid && ++ !recursive_logging) { + u64 other_ino = 0; + u64 other_parent = 0; + +@@ -5265,10 +5274,8 @@ again: + btrfs_release_path(path); + goto next_key; + } +- } +- +- /* Skip xattrs, we log them later with btrfs_log_all_xattrs() */ +- if (min_key->type == BTRFS_XATTR_ITEM_KEY) { ++ } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) { ++ /* Skip xattrs, logged later with btrfs_log_all_xattrs() */ + if (ins_nr == 0) + goto next_slot; + ret = copy_items(trans, inode, dst_path, path, +@@ -5321,9 +5328,21 @@ next_key: + break; + } + } +- if (ins_nr) ++ if (ins_nr) { + ret = copy_items(trans, inode, dst_path, path, ins_start_slot, + ins_nr, inode_only, logged_isize); ++ if (ret) ++ return ret; ++ } ++ ++ if (inode_only == LOG_INODE_ALL && S_ISREG(inode->vfs_inode.i_mode)) { ++ /* ++ * Release the path because otherwise we might attempt to double ++ * lock the same leaf with btrfs_log_prealloc_extents() below. 
++ */ ++ btrfs_release_path(path); ++ ret = btrfs_log_prealloc_extents(trans, inode, dst_path); ++ } + + return ret; + } diff --git a/queue-5.15/btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch b/queue-5.15/btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch new file mode 100644 index 00000000000..b80e6bbc1ca --- /dev/null +++ b/queue-5.15/btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch @@ -0,0 +1,210 @@ +From 5fd76bf31ccfecc06e2e6b29f8c809e934085b99 Mon Sep 17 00:00:00 2001 +From: Omar Sandoval +Date: Thu, 17 Feb 2022 15:14:43 -0800 +Subject: btrfs: fix relocation crash due to premature return from btrfs_commit_transaction() + +From: Omar Sandoval + +commit 5fd76bf31ccfecc06e2e6b29f8c809e934085b99 upstream. + +We are seeing crashes similar to the following trace: + +[38.969182] WARNING: CPU: 20 PID: 2105 at fs/btrfs/relocation.c:4070 btrfs_relocate_block_group+0x2dc/0x340 [btrfs] +[38.973556] CPU: 20 PID: 2105 Comm: btrfs Not tainted 5.17.0-rc4 #54 +[38.974580] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 +[38.976539] RIP: 0010:btrfs_relocate_block_group+0x2dc/0x340 [btrfs] +[38.980336] RSP: 0000:ffffb0dd42e03c20 EFLAGS: 00010206 +[38.981218] RAX: ffff96cfc4ede800 RBX: ffff96cfc3ce0000 RCX: 000000000002ca14 +[38.982560] RDX: 0000000000000000 RSI: 4cfd109a0bcb5d7f RDI: ffff96cfc3ce0360 +[38.983619] RBP: ffff96cfc309c000 R08: 0000000000000000 R09: 0000000000000000 +[38.984678] R10: ffff96cec0000001 R11: ffffe84c80000000 R12: ffff96cfc4ede800 +[38.985735] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96cfc3ce0360 +[38.987146] FS: 00007f11c15218c0(0000) GS:ffff96d6dfb00000(0000) knlGS:0000000000000000 +[38.988662] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[38.989398] CR2: 00007ffc922c8e60 CR3: 00000001147a6001 CR4: 0000000000370ee0 +[38.990279] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[38.991219] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 +[38.992528] Call Trace: +[38.992854] +[38.993148] btrfs_relocate_chunk+0x27/0xe0 [btrfs] +[38.993941] btrfs_balance+0x78e/0xea0 [btrfs] +[38.994801] ? vsnprintf+0x33c/0x520 +[38.995368] ? __kmalloc_track_caller+0x351/0x440 +[38.996198] btrfs_ioctl_balance+0x2b9/0x3a0 [btrfs] +[38.997084] btrfs_ioctl+0x11b0/0x2da0 [btrfs] +[38.997867] ? mod_objcg_state+0xee/0x340 +[38.998552] ? seq_release+0x24/0x30 +[38.999184] ? proc_nr_files+0x30/0x30 +[38.999654] ? call_rcu+0xc8/0x2f0 +[39.000228] ? __x64_sys_ioctl+0x84/0xc0 +[39.000872] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] +[39.001973] __x64_sys_ioctl+0x84/0xc0 +[39.002566] do_syscall_64+0x3a/0x80 +[39.003011] entry_SYSCALL_64_after_hwframe+0x44/0xae +[39.003735] RIP: 0033:0x7f11c166959b +[39.007324] RSP: 002b:00007fff2543e998 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 +[39.008521] RAX: ffffffffffffffda RBX: 00007f11c1521698 RCX: 00007f11c166959b +[39.009833] RDX: 00007fff2543ea40 RSI: 00000000c4009420 RDI: 0000000000000003 +[39.011270] RBP: 0000000000000003 R08: 0000000000000013 R09: 00007f11c16f94e0 +[39.012581] R10: 0000000000000000 R11: 0000000000000246 R12: 00007fff25440df3 +[39.014046] R13: 0000000000000000 R14: 00007fff2543ea40 R15: 0000000000000001 +[39.015040] +[39.015418] ---[ end trace 0000000000000000 ]--- +[43.131559] ------------[ cut here ]------------ +[43.132234] kernel BUG at fs/btrfs/extent-tree.c:2717! 
+[43.133031] invalid opcode: 0000 [#1] PREEMPT SMP PTI +[43.133702] CPU: 1 PID: 1839 Comm: btrfs Tainted: G W 5.17.0-rc4 #54 +[43.134863] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 +[43.136426] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] +[43.139913] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 +[43.140629] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 +[43.141604] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff +[43.142645] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 +[43.143669] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 +[43.144657] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 +[43.145686] FS: 00007f7657dd68c0(0000) GS:ffff96d6df640000(0000) knlGS:0000000000000000 +[43.146808] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[43.147584] CR2: 00007f7fe81bf5b0 CR3: 00000001093ee004 CR4: 0000000000370ee0 +[43.148589] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[43.149581] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 +[43.150559] Call Trace: +[43.150904] +[43.151253] btrfs_finish_extent_commit+0x88/0x290 [btrfs] +[43.152127] btrfs_commit_transaction+0x74f/0xaa0 [btrfs] +[43.152932] ? btrfs_attach_transaction_barrier+0x1e/0x50 [btrfs] +[43.153786] btrfs_ioctl+0x1edc/0x2da0 [btrfs] +[43.154475] ? __check_object_size+0x150/0x170 +[43.155170] ? preempt_count_add+0x49/0xa0 +[43.155753] ? __x64_sys_ioctl+0x84/0xc0 +[43.156437] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs] +[43.157456] __x64_sys_ioctl+0x84/0xc0 +[43.157980] do_syscall_64+0x3a/0x80 +[43.158543] entry_SYSCALL_64_after_hwframe+0x44/0xae +[43.159231] RIP: 0033:0x7f7657f1e59b +[43.161819] RSP: 002b:00007ffda5cd1658 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 +[43.162702] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7657f1e59b +[43.163526] RDX: 0000000000000000 RSI: 0000000000009408 RDI: 0000000000000003 +[43.164358] RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 +[43.165208] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 +[43.166029] R13: 00005621b91c3232 R14: 00005621b91ba580 R15: 00007ffda5cd1800 +[43.166863] +[43.167125] Modules linked in: btrfs blake2b_generic xor pata_acpi ata_piix libata raid6_pq scsi_mod libcrc32c virtio_net virtio_rng net_failover rng_core failover scsi_common +[43.169552] ---[ end trace 0000000000000000 ]--- +[43.171226] RIP: 0010:unpin_extent_range+0x37a/0x4f0 [btrfs] +[43.174767] RSP: 0000:ffffb0dd4216bc70 EFLAGS: 00010246 +[43.175600] RAX: 0000000000000000 RBX: ffff96cfc34490f8 RCX: 0000000000000001 +[43.176468] RDX: 0000000080000001 RSI: 0000000051d00000 RDI: 00000000ffffffff +[43.177357] RBP: 0000000000000000 R08: 0000000000000000 R09: ffff96cfd07dca50 +[43.178271] R10: ffff96cfc46e8a00 R11: fffffffffffec000 R12: 0000000041d00000 +[43.179178] R13: ffff96cfc3ce0000 R14: ffffb0dd4216bd08 R15: 0000000000000000 +[43.180071] FS: 00007f7657dd68c0(0000) GS:ffff96d6df800000(0000) knlGS:0000000000000000 +[43.181073] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 +[43.181808] CR2: 00007fe09905f010 CR3: 00000001093ee004 CR4: 0000000000370ee0 +[43.182706] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 +[43.183591] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + +We first hit the WARN_ON(rc->block_group->pinned > 0) in +btrfs_relocate_block_group() and then the BUG_ON(!cache) in 
+unpin_extent_range(). This tells us that we are exiting relocation and +removing the block group with bytes still pinned for that block group. +This is supposed to be impossible: the last thing relocate_block_group() +does is commit the transaction to get rid of pinned extents. + +Commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when +waiting for a transaction commit") introduced an optimization so that +commits from fsync don't have to wait for the previous commit to unpin +extents. This was only intended to affect fsync, but it inadvertently +made it possible for any commit to skip waiting for the previous commit +to unpin. This is because if a call to btrfs_commit_transaction() finds +that another thread is already committing the transaction, it waits for +the other thread to complete the commit and then returns. If that other +thread was in fsync, then it completes the commit without completing the +previous commit. This makes the following sequence of events possible: + +Thread 1____________________|Thread 2 (fsync)_____________________|Thread 3 (balance)___________________ +btrfs_commit_transaction(N) | | + btrfs_run_delayed_refs | | + pin extents | | + ... | | + state = UNBLOCKED |btrfs_sync_file | + | btrfs_start_transaction(N + 1) |relocate_block_group + | | btrfs_join_transaction(N + 1) + | btrfs_commit_transaction(N + 1) | + ... | trans->state = COMMIT_START | + | | btrfs_commit_transaction(N + 1) + | | wait_for_commit(N + 1, COMPLETED) + | wait_for_commit(N, SUPER_COMMITTED)| + state = SUPER_COMMITTED | ... | + btrfs_finish_extent_commit| | + unpin_extent_range() | trans->state = COMPLETED | + | | return + | | + ... | |Thread 1 isn't done, so pinned > 0 + | |and we WARN + | | + | |btrfs_remove_block_group + unpin_extent_range() | | + Thread 3 removed the | | + block group, so we BUG| | + +There are other sequences involving SUPER_COMMITTED transactions that +can cause a similar outcome. + +We could fix this by making relocation explicitly wait for unpinning, +but there may be other cases that need it. Josef mentioned ENOSPC +flushing and the free space cache inode as other potential victims. +Rather than playing whack-a-mole, this fix is conservative and makes all +commits not in fsync wait for all previous transactions, which is what +the optimization intended. + +Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") +CC: stable@vger.kernel.org # 5.15+ +Reviewed-by: Filipe Manana +Signed-off-by: Omar Sandoval +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/transaction.c | 32 +++++++++++++++++++++++++++++++- + 1 file changed, 31 insertions(+), 1 deletion(-) + +--- a/fs/btrfs/transaction.c ++++ b/fs/btrfs/transaction.c +@@ -846,7 +846,37 @@ btrfs_attach_transaction_barrier(struct + static noinline void wait_for_commit(struct btrfs_transaction *commit, + const enum btrfs_trans_state min_state) + { +- wait_event(commit->commit_wait, commit->state >= min_state); ++ struct btrfs_fs_info *fs_info = commit->fs_info; ++ u64 transid = commit->transid; ++ bool put = false; ++ ++ while (1) { ++ wait_event(commit->commit_wait, commit->state >= min_state); ++ if (put) ++ btrfs_put_transaction(commit); ++ ++ if (min_state < TRANS_STATE_COMPLETED) ++ break; ++ ++ /* ++ * A transaction isn't really completed until all of the ++ * previous transactions are completed, but with fsync we can ++ * end up with SUPER_COMMITTED transactions before a COMPLETED ++ * transaction. Wait for those. 
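++		 *
++		 * Note: each transaction we wait on here is pinned with a
++		 * temporary reference (the refcount_inc() below), so it
++		 * cannot be freed while we sleep in wait_event(); the
++		 * reference is dropped right after the wait (the "put"
++		 * handling above).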
++		 */
++
++		spin_lock(&fs_info->trans_lock);
++		commit = list_first_entry_or_null(&fs_info->trans_list,
++						  struct btrfs_transaction,
++						  list);
++		if (!commit || commit->transid > transid) {
++			spin_unlock(&fs_info->trans_lock);
++			break;
++		}
++		refcount_inc(&commit->use_count);
++		put = true;
++		spin_unlock(&fs_info->trans_lock);
++	}
+ }
+ 
+ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid)
diff --git a/queue-5.15/btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch b/queue-5.15/btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch
new file mode 100644
index 00000000000..82af62d1065
--- /dev/null
+++ b/queue-5.15/btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch
@@ -0,0 +1,82 @@
+From d4aef1e122d8bbdc15ce3bd0bc813d6b44a7d63a Mon Sep 17 00:00:00 2001
+From: Sidong Yang
+Date: Mon, 28 Feb 2022 01:43:40 +0000
+Subject: btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
+
+From: Sidong Yang
+
+commit d4aef1e122d8bbdc15ce3bd0bc813d6b44a7d63a upstream.
+
+Commit e804861bd4e6 ("btrfs: fix deadlock between quota disable and
+qgroup rescan worker") by Kawasaki resolves the deadlock between quota
+disable and the qgroup rescan worker. But there is also a similar
+deadlock case, involving enabling or disabling quotas while creating or
+removing a qgroup. It can be reproduced with the simple script below.
+
+for i in {1..100}
+do
+	btrfs quota enable /mnt &
+	btrfs qgroup create 1/0 /mnt &
+	btrfs qgroup destroy 1/0 /mnt &
+	btrfs quota disable /mnt &
+done
+
+Here's why the deadlock happens:
+
+1) The quota rescan task is running.
+
+2) Task A calls btrfs_quota_disable(), locks the qgroup_ioctl_lock
+   mutex, and then calls btrfs_qgroup_wait_for_completion(), to wait for
+   the quota rescan task to complete.
+
+3) Task B calls btrfs_remove_qgroup() and it blocks when trying to lock
+   the qgroup_ioctl_lock mutex, because it's being held by task A. At that
+   point task B is holding a transaction handle for the current transaction.
+
+4) The quota rescan task calls btrfs_commit_transaction(). This results
+   in it waiting for all other tasks to release their handles on the
+   transaction, but task B is blocked on the qgroup_ioctl_lock mutex
+   while holding a handle on the transaction, and that mutex is being held
+   by task A, which is waiting for the quota rescan task to complete,
+   resulting in a deadlock between these 3 tasks.
+
+To resolve this issue, the thread disabling quotas should unlock
+qgroup_ioctl_lock before waiting for rescan completion. Move the
+btrfs_qgroup_wait_for_completion() call after the unlock of
+qgroup_ioctl_lock.
+
+Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
+CC: stable@vger.kernel.org # 5.4+
+Reviewed-by: Filipe Manana
+Reviewed-by: Shin'ichiro Kawasaki
+Signed-off-by: Sidong Yang
+Reviewed-by: David Sterba
+Signed-off-by: David Sterba
+Signed-off-by: Greg Kroah-Hartman
+---
+ fs/btrfs/qgroup.c | 9 ++++++++-
+ 1 file changed, 8 insertions(+), 1 deletion(-)
+
+--- a/fs/btrfs/qgroup.c
++++ b/fs/btrfs/qgroup.c
+@@ -1197,13 +1197,20 @@ int btrfs_quota_disable(struct btrfs_fs_
+ 		goto out;
+ 
+ 	/*
++	 * Unlock the qgroup_ioctl_lock mutex before waiting for the rescan worker to
++	 * complete. Otherwise we can deadlock because btrfs_remove_qgroup() needs
++	 * to lock that mutex while holding a transaction handle and the rescan
++	 * worker needs to commit a transaction.
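++	 * In short: qgroup_ioctl_lock must never be held while waiting
++	 * for the rescan worker to finish.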
++ */ ++ mutex_unlock(&fs_info->qgroup_ioctl_lock); ++ ++ /* + * Request qgroup rescan worker to complete and wait for it. This wait + * must be done before transaction start for quota disable since it may + * deadlock with transaction by the qgroup rescan worker. + */ + clear_bit(BTRFS_FS_QUOTA_ENABLED, &fs_info->flags); + btrfs_qgroup_wait_for_completion(fs_info, false); +- mutex_unlock(&fs_info->qgroup_ioctl_lock); + + /* + * 1 For the root item diff --git a/queue-5.15/series b/queue-5.15/series index 98d85d6e5a4..f70ad123e12 100644 --- a/queue-5.15/series +++ b/queue-5.15/series @@ -249,3 +249,11 @@ input-elan_i2c-fix-regulator-enable-count-imbalance-after-suspend-resume.patch input-samsung-keypad-properly-state-iomem-dependency.patch hid-add-mapping-for-key_dictate.patch hid-add-mapping-for-key_all_applications.patch +tracing-histogram-fix-sorting-on-old-cpu-value.patch +tracing-fix-return-value-of-__setup-handlers.patch +btrfs-fix-lost-prealloc-extents-beyond-eof-after-full-fsync.patch +btrfs-fix-relocation-crash-due-to-premature-return-from-btrfs_commit_transaction.patch +btrfs-do-not-warn_on-if-we-have-pageerror-set.patch +btrfs-qgroup-fix-deadlock-between-rescan-worker-and-remove-qgroup.patch +btrfs-add-missing-run-of-delayed-items-after-unlink-during-log-replay.patch +btrfs-do-not-start-relocation-until-in-progress-drops-are-done.patch diff --git a/queue-5.15/tracing-fix-return-value-of-__setup-handlers.patch b/queue-5.15/tracing-fix-return-value-of-__setup-handlers.patch new file mode 100644 index 00000000000..0421d0a8279 --- /dev/null +++ b/queue-5.15/tracing-fix-return-value-of-__setup-handlers.patch @@ -0,0 +1,82 @@ +From 1d02b444b8d1345ea4708db3bab4db89a7784b55 Mon Sep 17 00:00:00 2001 +From: Randy Dunlap +Date: Wed, 2 Mar 2022 19:17:44 -0800 +Subject: tracing: Fix return value of __setup handlers + +From: Randy Dunlap + +commit 1d02b444b8d1345ea4708db3bab4db89a7784b55 upstream. + +__setup() handlers should generally return 1 to indicate that the +boot options have been handled. + +Using invalid option values causes the entire kernel boot option +string to be reported as Unknown and added to init's environment +strings, polluting it. + + Unknown kernel command line parameters "BOOT_IMAGE=/boot/bzImage-517rc6 + kprobe_event=p,syscall_any,$arg1 trace_options=quiet + trace_clock=jiffies", will be passed to user space. + + Run /sbin/init as init process + with arguments: + /sbin/init + with environment: + HOME=/ + TERM=linux + BOOT_IMAGE=/boot/bzImage-517rc6 + kprobe_event=p,syscall_any,$arg1 + trace_options=quiet + trace_clock=jiffies + +Return 1 from the __setup() handlers so that init's environment is not +polluted with kernel boot options. 
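+
+For reference, the general shape of a handler under this convention (a
+minimal sketch with hypothetical names, assuming <linux/init.h> and
+<linux/string.h>; not part of the patch below):
+
+  static char my_opt_buf[64];
+
+  static int __init my_opt_setup(char *str)
+  {
+          /* Stash the value for later parsing. */
+          strlcpy(my_opt_buf, str, sizeof(my_opt_buf));
+          /*
+           * Return 1 to mark the option as handled. Returning 0 would
+           * make the kernel treat it as unknown and pass it on to init,
+           * polluting init's environment.
+           */
+          return 1;
+  }
+  __setup("my_opt=", my_opt_setup);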
+ +Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru +Link: https://lkml.kernel.org/r/20220303031744.32356-1-rdunlap@infradead.org + +Cc: stable@vger.kernel.org +Fixes: 7bcfaf54f591 ("tracing: Add trace_options kernel command line parameter") +Fixes: e1e232ca6b8f ("tracing: Add trace_clock= kernel parameter") +Fixes: 970988e19eb0 ("tracing/kprobe: Add kprobe_event= boot parameter") +Signed-off-by: Randy Dunlap +Reported-by: Igor Zhbanov +Acked-by: Masami Hiramatsu +Signed-off-by: Steven Rostedt (Google) +Signed-off-by: Greg Kroah-Hartman +--- + kernel/trace/trace.c | 4 ++-- + kernel/trace/trace_kprobe.c | 2 +- + 2 files changed, 3 insertions(+), 3 deletions(-) + +--- a/kernel/trace/trace.c ++++ b/kernel/trace/trace.c +@@ -235,7 +235,7 @@ static char trace_boot_options_buf[MAX_T + static int __init set_trace_boot_options(char *str) + { + strlcpy(trace_boot_options_buf, str, MAX_TRACER_SIZE); +- return 0; ++ return 1; + } + __setup("trace_options=", set_trace_boot_options); + +@@ -246,7 +246,7 @@ static int __init set_trace_boot_clock(c + { + strlcpy(trace_boot_clock_buf, str, MAX_TRACER_SIZE); + trace_boot_clock = trace_boot_clock_buf; +- return 0; ++ return 1; + } + __setup("trace_clock=", set_trace_boot_clock); + +--- a/kernel/trace/trace_kprobe.c ++++ b/kernel/trace/trace_kprobe.c +@@ -31,7 +31,7 @@ static int __init set_kprobe_boot_events + strlcpy(kprobe_boot_events_buf, str, COMMAND_LINE_SIZE); + disable_tracing_selftest("running kprobe events"); + +- return 0; ++ return 1; + } + __setup("kprobe_event=", set_kprobe_boot_events); + diff --git a/queue-5.15/tracing-histogram-fix-sorting-on-old-cpu-value.patch b/queue-5.15/tracing-histogram-fix-sorting-on-old-cpu-value.patch new file mode 100644 index 00000000000..008e8109e08 --- /dev/null +++ b/queue-5.15/tracing-histogram-fix-sorting-on-old-cpu-value.patch @@ -0,0 +1,80 @@ +From 1d1898f65616c4601208963c3376c1d828cbf2c7 Mon Sep 17 00:00:00 2001 +From: "Steven Rostedt (Google)" +Date: Tue, 1 Mar 2022 22:29:04 -0500 +Subject: tracing/histogram: Fix sorting on old "cpu" value + +From: Steven Rostedt (Google) + +commit 1d1898f65616c4601208963c3376c1d828cbf2c7 upstream. + +When trying to add a histogram against an event with the "cpu" field, it +was impossible due to "cpu" being a keyword to key off of the running CPU. +So to fix this, it was changed to "common_cpu" to match the other generic +fields (like "common_pid"). But since some scripts used "cpu" for keying +off of the CPU (for events that did not have "cpu" as a field, which is +most of them), a backward compatibility trick was added such that if "cpu" +was used as a key, and the event did not have "cpu" as a field name, then +it would fallback and switch over to "common_cpu". + +This fix has a couple of subtle bugs. One was that when switching over to +"common_cpu", it did not change the field name, it just set a flag. But +the code still found a "cpu" field. The "cpu" field is used for filtering +and is returned when the event does not have a "cpu" field. 
This was found by:
+
+ # cd /sys/kernel/tracing
+ # echo hist:key=cpu,pid:sort=cpu > events/sched/sched_wakeup/trigger
+ # cat events/sched/sched_wakeup/hist
+
+Which showed the histogram unsorted:
+
+{ cpu: 19, pid: 1175 } hitcount: 1
+{ cpu: 6, pid: 239 } hitcount: 2
+{ cpu: 23, pid: 1186 } hitcount: 14
+{ cpu: 12, pid: 249 } hitcount: 2
+{ cpu: 3, pid: 994 } hitcount: 5
+
+Instead of hard coding the "cpu" checks, take advantage of the fact that
+trace_find_event_field() returns a special field for "cpu" and "CPU" if
+the event does not have a "cpu" field. This special field has the
+"filter_type" of "FILTER_CPU". Check that to test if the returned field is
+of the CPU type instead of doing the string compare.
+
+Also, fix the sorting bug by testing for the hist_field flag of
+HIST_FIELD_FL_CPU when setting up the sort routine. Otherwise it will use
+the special CPU field to know what compare routine to use, and since that
+special field does not have a size, it returns tracing_map_cmp_none.
+
+Cc: stable@vger.kernel.org
+Fixes: 1e3bac71c505 ("tracing/histogram: Rename "cpu" to "common_cpu"")
+Reported-by: Daniel Bristot de Oliveira
+Signed-off-by: Steven Rostedt (Google)
+Signed-off-by: Greg Kroah-Hartman
+---
+ kernel/trace/trace_events_hist.c | 6 +++---
+ 1 file changed, 3 insertions(+), 3 deletions(-)
+
+--- a/kernel/trace/trace_events_hist.c
++++ b/kernel/trace/trace_events_hist.c
+@@ -2049,9 +2049,9 @@ parse_field(struct hist_trigger_data *hi
+ 		/*
+ 		 * For backward compatibility, if field_name
+ 		 * was "cpu", then we treat this the same as
+-		 * common_cpu.
++		 * common_cpu. This also works for "CPU".
+ 		 */
+-		if (strcmp(field_name, "cpu") == 0) {
++		if (field && field->filter_type == FILTER_CPU) {
+ 			*flags |= HIST_FIELD_FL_CPU;
+ 		} else {
+ 			hist_err(tr, HIST_ERR_FIELD_NOT_FOUND,
+@@ -4478,7 +4478,7 @@ static int create_tracing_map_fields(str
+ 
+ 		if (hist_field->flags & HIST_FIELD_FL_STACKTRACE)
+ 			cmp_fn = tracing_map_cmp_none;
+-		else if (!field)
++		else if (!field || hist_field->flags & HIST_FIELD_FL_CPU)
+ 			cmp_fn = tracing_map_cmp_num(hist_field->size,
+ 						     hist_field->is_signed);
+ 		else if (is_string_field(field))