From: Greg Kroah-Hartman
Date: Mon, 14 Mar 2022 09:18:42 +0000 (+0100)
Subject: 5.16-stable patches
X-Git-Tag: v4.9.307~14
X-Git-Url: http://git.ipfire.org/?a=commitdiff_plain;h=28403a472a2e119d73f331f9ddc6fe232128767d;p=thirdparty%2Fkernel%2Fstable-queue.git

5.16-stable patches

added patches:
	btrfs-make-send-work-with-concurrent-block-group-relocation.patch
---

diff --git a/queue-5.16/btrfs-make-send-work-with-concurrent-block-group-relocation.patch b/queue-5.16/btrfs-make-send-work-with-concurrent-block-group-relocation.patch
new file mode 100644
index 00000000000..cb09b9dad14
--- /dev/null
+++ b/queue-5.16/btrfs-make-send-work-with-concurrent-block-group-relocation.patch
@@ -0,0 +1,998 @@
+From d96b34248c2f4ea8cd09286090f2f6f77102eaab Mon Sep 17 00:00:00 2001
+From: Filipe Manana
+Date: Mon, 22 Nov 2021 12:03:38 +0000
+Subject: btrfs: make send work with concurrent block group relocation
+
+From: Filipe Manana
+
+commit d96b34248c2f4ea8cd09286090f2f6f77102eaab upstream.
+
+We don't allow send and balance/relocation to run in parallel, in order
+to prevent send from failing or silently producing a bad stream. This
+is because while send is using an extent (especially a metadata extent),
+or is about to read a metadata extent that it expects to belong to a
+specific parent node, relocation can run, the transaction used for the
+relocation is committed and the extent gets reallocated while send is
+still using it, so send ends up with different content than expected.
+This can result in failing to read a metadata extent due to failure of
+the validation checks (parent transid, level, etc.), failure to find a
+backreference for a data extent, and other unexpected failures. Besides
+reallocation, there's also a similar problem of an extent getting
+discarded when it's unpinned after the transaction used for block group
+relocation is committed.
+
+The restriction between balance and send was added in commit 9e967495e0e0
+("Btrfs: prevent send failures and crashes due to concurrent relocation"),
+kernel 5.3, while the more general restriction between send and relocation
+was added in commit 1cea5cf0e664 ("btrfs: ensure relocation never runs
+while we have send operations running"), kernel 5.14.
+
+Both send and relocation can be very long running operations: relocation
+because it has to do a lot of IO and expensive backreference lookups when
+there are many snapshots, and send due to read IO when operating on very
+large trees. This makes it inconvenient for users and tools to schedule
+both operations.
+
+For zoned filesystems we also have automatic block group relocation, so
+send can fail with -EAGAIN when users least expect it, or send can end up
+delaying the block group relocation for too long. In the future we might
+also get automatic block group relocation for non-zoned filesystems.
+
+This change makes it possible for send and relocation to run in parallel.
+This is achieved the following way:
+
+1) For all tree searches, send acquires a read lock on the commit root
+   semaphore;
+
+2) After each tree search, and before releasing the commit root semaphore,
+   the leaf is cloned and placed in the search path (struct btrfs_path);
+
+3) After releasing the commit root semaphore, the changed_cb() callback
+   is invoked, which operates on the leaf and writes commands to the pipe
+   (or file, in case send/receive is not used with a pipe). It's important
+   not to hold a lock on the commit root semaphore here, because if we did
+   we could deadlock when sending and receiving to the same filesystem
+   using a pipe: the send task blocks on the pipe because it's full, while
+   the receive task, which is the only consumer of the pipe, triggers a
+   transaction commit when attempting to create a subvolume or reserve
+   space for a write operation, for example, and the transaction commit
+   blocks trying to write lock the commit root semaphore, resulting in a
+   deadlock;
+
+4) Before moving to the next key, or advancing to the next change in case
+   of an incremental send, check if a transaction used for relocation was
+   committed (or is about to finish its commit). If so, release the search
+   path(s) and restart the search from where we were before, so that we
+   don't operate on stale extent buffers. The search restarts are always
+   possible because both the send and parent roots are RO, and no one can
+   add, remove or update keys (change their offset) in RO trees - the
+   only exception is deduplication, but that is still not allowed to run
+   in parallel with send;
+
+5) Periodically check if there is contention on the commit root semaphore,
+   which means there is a transaction commit trying to write lock it, and
+   if so release the semaphore and reschedule, so as to avoid causing any
+   significant delays to transaction commits (a condensed sketch of this
+   loop follows below).
+
+This leaves some room for optimizations, so that send does fewer path
+releases and re-searches of the trees while relocation is running, but
+for now it's kept simple, as it performs quite well (on very large trees
+with resulting send streams in the order of a few hundred gigabytes).
+
+Test case btrfs/187, from fstests, stresses relocation, send and
+deduplication attempting to run in parallel, but without verifying that
+send succeeds and that it produces correct streams. A new test case will
+be added that exercises relocation happening in parallel with send and
+then checks that send succeeds and the resulting streams are correct.
+
+A final note is that for now this still leaves in place the mutual
+exclusion between send operations and deduplication on files belonging
+to a root used by send operations. A solution for that will be slightly
+more complex, but it will eventually be built on top of this change.
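+
+To make points 4 and 5 above concrete, here is a condensed sketch of the
+main loop this patch adds to btrfs_compare_trees() (the full hunk appears
+later in this patch; declarations, error paths and the actual tree
+advancing/comparison code are elided, so this is an illustration of the
+pattern rather than the complete function):
+
+	down_read(&fs_info->commit_root_sem);
+	sctx->last_reloc_trans = fs_info->last_reloc_trans;
+
+	while (1) {
+		/*
+		 * Point 5: a transaction commit may be blocked trying to
+		 * write lock the commit root semaphore, so back off and
+		 * reschedule to avoid delaying it.
+		 */
+		if (need_resched() ||
+		    rwsem_is_contended(&fs_info->commit_root_sem)) {
+			up_read(&fs_info->commit_root_sem);
+			cond_resched();
+			down_read(&fs_info->commit_root_sem);
+		}
+
+		/*
+		 * Point 4: a transaction used for relocating a block
+		 * group was committed since the last check, so release
+		 * the paths and repeat the searches, to avoid operating
+		 * on stale extent buffers.
+		 */
+		if (fs_info->last_reloc_trans > sctx->last_reloc_trans) {
+			ret = restart_after_relocation(left_path, right_path,
+						       &left_key, &right_key,
+						       left_level, right_level,
+						       sctx);
+			if (ret < 0)
+				goto out_unlock;
+			sctx->last_reloc_trans = fs_info->last_reloc_trans;
+		}
+
+		/* ... advance the trees and invoke changed_cb() ... */
+	}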
+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Anand Jain +Signed-off-by: Greg Kroah-Hartman +--- + fs/btrfs/block-group.c | 9 - + fs/btrfs/ctree.c | 98 ++++++++++--- + fs/btrfs/ctree.h | 14 - + fs/btrfs/disk-io.c | 4 + fs/btrfs/relocation.c | 13 - + fs/btrfs/send.c | 357 ++++++++++++++++++++++++++++++++++++++++++------- + fs/btrfs/transaction.c | 4 + 7 files changed, 395 insertions(+), 104 deletions(-) + +--- a/fs/btrfs/block-group.c ++++ b/fs/btrfs/block-group.c +@@ -1508,7 +1508,6 @@ void btrfs_reclaim_bgs_work(struct work_ + container_of(work, struct btrfs_fs_info, reclaim_bgs_work); + struct btrfs_block_group *bg; + struct btrfs_space_info *space_info; +- LIST_HEAD(again_list); + + if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags)) + return; +@@ -1585,18 +1584,14 @@ void btrfs_reclaim_bgs_work(struct work_ + div64_u64(zone_unusable * 100, bg->length)); + trace_btrfs_reclaim_block_group(bg); + ret = btrfs_relocate_chunk(fs_info, bg->start); +- if (ret && ret != -EAGAIN) ++ if (ret) + btrfs_err(fs_info, "error relocating chunk %llu", + bg->start); + + next: ++ btrfs_put_block_group(bg); + spin_lock(&fs_info->unused_bgs_lock); +- if (ret == -EAGAIN && list_empty(&bg->bg_list)) +- list_add_tail(&bg->bg_list, &again_list); +- else +- btrfs_put_block_group(bg); + } +- list_splice_tail(&again_list, &fs_info->reclaim_bgs); + spin_unlock(&fs_info->unused_bgs_lock); + mutex_unlock(&fs_info->reclaim_bgs_lock); + btrfs_exclop_finish(fs_info); +--- a/fs/btrfs/ctree.c ++++ b/fs/btrfs/ctree.c +@@ -1568,32 +1568,13 @@ static struct extent_buffer *btrfs_searc + struct btrfs_path *p, + int write_lock_level) + { +- struct btrfs_fs_info *fs_info = root->fs_info; + struct extent_buffer *b; + int root_lock = 0; + int level = 0; + + if (p->search_commit_root) { +- /* +- * The commit roots are read only so we always do read locks, +- * and we always must hold the commit_root_sem when doing +- * searches on them, the only exception is send where we don't +- * want to block transaction commits for a long time, so +- * we need to clone the commit root in order to avoid races +- * with transaction commits that create a snapshot of one of +- * the roots used by a send operation. +- */ +- if (p->need_commit_sem) { +- down_read(&fs_info->commit_root_sem); +- b = btrfs_clone_extent_buffer(root->commit_root); +- up_read(&fs_info->commit_root_sem); +- if (!b) +- return ERR_PTR(-ENOMEM); +- +- } else { +- b = root->commit_root; +- atomic_inc(&b->refs); +- } ++ b = root->commit_root; ++ atomic_inc(&b->refs); + level = btrfs_header_level(b); + /* + * Ensure that all callers have set skip_locking when +@@ -1659,6 +1640,42 @@ out: + return b; + } + ++/* ++ * Replace the extent buffer at the lowest level of the path with a cloned ++ * version. The purpose is to be able to use it safely, after releasing the ++ * commit root semaphore, even if relocation is happening in parallel, the ++ * transaction used for relocation is committed and the extent buffer is ++ * reallocated in the next transaction. ++ * ++ * This is used in a context where the caller does not prevent transaction ++ * commits from happening, either by holding a transaction handle or holding ++ * some lock, while it's doing searches through a commit root. ++ * At the moment it's only used for send operations. 
++ */ ++static int finish_need_commit_sem_search(struct btrfs_path *path) ++{ ++ const int i = path->lowest_level; ++ const int slot = path->slots[i]; ++ struct extent_buffer *lowest = path->nodes[i]; ++ struct extent_buffer *clone; ++ ++ ASSERT(path->need_commit_sem); ++ ++ if (!lowest) ++ return 0; ++ ++ lockdep_assert_held_read(&lowest->fs_info->commit_root_sem); ++ ++ clone = btrfs_clone_extent_buffer(lowest); ++ if (!clone) ++ return -ENOMEM; ++ ++ btrfs_release_path(path); ++ path->nodes[i] = clone; ++ path->slots[i] = slot; ++ ++ return 0; ++} + + /* + * btrfs_search_slot - look for a key in a tree and perform necessary +@@ -1695,6 +1712,7 @@ int btrfs_search_slot(struct btrfs_trans + const struct btrfs_key *key, struct btrfs_path *p, + int ins_len, int cow) + { ++ struct btrfs_fs_info *fs_info = root->fs_info; + struct extent_buffer *b; + int slot; + int ret; +@@ -1736,6 +1754,11 @@ int btrfs_search_slot(struct btrfs_trans + + min_write_lock_level = write_lock_level; + ++ if (p->need_commit_sem) { ++ ASSERT(p->search_commit_root); ++ down_read(&fs_info->commit_root_sem); ++ } ++ + again: + prev_cmp = -1; + b = btrfs_search_slot_get_root(root, p, write_lock_level); +@@ -1930,6 +1953,16 @@ cow_done: + done: + if (ret < 0 && !p->skip_release_on_error) + btrfs_release_path(p); ++ ++ if (p->need_commit_sem) { ++ int ret2; ++ ++ ret2 = finish_need_commit_sem_search(p); ++ up_read(&fs_info->commit_root_sem); ++ if (ret2) ++ ret = ret2; ++ } ++ + return ret; + } + ALLOW_ERROR_INJECTION(btrfs_search_slot, ERRNO); +@@ -4414,7 +4447,9 @@ int btrfs_next_old_leaf(struct btrfs_roo + int level; + struct extent_buffer *c; + struct extent_buffer *next; ++ struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_key key; ++ bool need_commit_sem = false; + u32 nritems; + int ret; + int i; +@@ -4431,14 +4466,20 @@ again: + + path->keep_locks = 1; + +- if (time_seq) ++ if (time_seq) { + ret = btrfs_search_old_slot(root, &key, path, time_seq); +- else ++ } else { ++ if (path->need_commit_sem) { ++ path->need_commit_sem = 0; ++ need_commit_sem = true; ++ down_read(&fs_info->commit_root_sem); ++ } + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); ++ } + path->keep_locks = 0; + + if (ret < 0) +- return ret; ++ goto done; + + nritems = btrfs_header_nritems(path->nodes[0]); + /* +@@ -4561,6 +4602,15 @@ again: + ret = 0; + done: + unlock_up(path, 0, 1, 0, NULL); ++ if (need_commit_sem) { ++ int ret2; ++ ++ path->need_commit_sem = 1; ++ ret2 = finish_need_commit_sem_search(path); ++ up_read(&fs_info->commit_root_sem); ++ if (ret2) ++ ret = ret2; ++ } + + return ret; + } +--- a/fs/btrfs/ctree.h ++++ b/fs/btrfs/ctree.h +@@ -576,7 +576,6 @@ enum { + /* + * Indicate that relocation of a chunk has started, it's set per chunk + * and is toggled between chunks. +- * Set, tested and cleared while holding fs_info::send_reloc_lock. + */ + BTRFS_FS_RELOC_RUNNING, + +@@ -676,6 +675,12 @@ struct btrfs_fs_info { + + u64 generation; + u64 last_trans_committed; ++ /* ++ * Generation of the last transaction used for block group relocation ++ * since the filesystem was last mounted (or 0 if none happened yet). ++ * Must be written and read while holding btrfs_fs_info::commit_root_sem. ++ */ ++ u64 last_reloc_trans; + u64 avg_delayed_ref_runtime; + + /* +@@ -1006,13 +1011,6 @@ struct btrfs_fs_info { + + struct crypto_shash *csum_shash; + +- spinlock_t send_reloc_lock; +- /* +- * Number of send operations in progress. +- * Updated while holding fs_info::send_reloc_lock. 
+- */ +- int send_in_progress; +- + /* Type of exclusive operation running, protected by super_lock */ + enum btrfs_exclusive_operation exclusive_operation; + +--- a/fs/btrfs/disk-io.c ++++ b/fs/btrfs/disk-io.c +@@ -2858,6 +2858,7 @@ static int __cold init_tree_roots(struct + /* All successful */ + fs_info->generation = generation; + fs_info->last_trans_committed = generation; ++ fs_info->last_reloc_trans = 0; + + /* Always begin writing backup roots after the one being used */ + if (backup_index < 0) { +@@ -2993,9 +2994,6 @@ void btrfs_init_fs_info(struct btrfs_fs_ + spin_lock_init(&fs_info->swapfile_pins_lock); + fs_info->swapfile_pins = RB_ROOT; + +- spin_lock_init(&fs_info->send_reloc_lock); +- fs_info->send_in_progress = 0; +- + fs_info->bg_reclaim_threshold = BTRFS_DEFAULT_RECLAIM_THRESH; + INIT_WORK(&fs_info->reclaim_bgs_work, btrfs_reclaim_bgs_work); + } +--- a/fs/btrfs/relocation.c ++++ b/fs/btrfs/relocation.c +@@ -3858,25 +3858,14 @@ out: + * 0 success + * -EINPROGRESS operation is already in progress, that's probably a bug + * -ECANCELED cancellation request was set before the operation started +- * -EAGAIN can not start because there are ongoing send operations + */ + static int reloc_chunk_start(struct btrfs_fs_info *fs_info) + { +- spin_lock(&fs_info->send_reloc_lock); +- if (fs_info->send_in_progress) { +- btrfs_warn_rl(fs_info, +-"cannot run relocation while send operations are in progress (%d in progress)", +- fs_info->send_in_progress); +- spin_unlock(&fs_info->send_reloc_lock); +- return -EAGAIN; +- } + if (test_and_set_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags)) { + /* This should not happen */ +- spin_unlock(&fs_info->send_reloc_lock); + btrfs_err(fs_info, "reloc already running, cannot start"); + return -EINPROGRESS; + } +- spin_unlock(&fs_info->send_reloc_lock); + + if (atomic_read(&fs_info->reloc_cancel_req) > 0) { + btrfs_info(fs_info, "chunk relocation canceled on start"); +@@ -3898,9 +3887,7 @@ static void reloc_chunk_end(struct btrfs + /* Requested after start, clear bit first so any waiters can continue */ + if (atomic_read(&fs_info->reloc_cancel_req) > 0) + btrfs_info(fs_info, "chunk relocation canceled during operation"); +- spin_lock(&fs_info->send_reloc_lock); + clear_and_wake_up_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags); +- spin_unlock(&fs_info->send_reloc_lock); + atomic_set(&fs_info->reloc_cancel_req, 0); + } + +--- a/fs/btrfs/send.c ++++ b/fs/btrfs/send.c +@@ -24,6 +24,7 @@ + #include "transaction.h" + #include "compression.h" + #include "xattr.h" ++#include "print-tree.h" + + /* + * Maximum number of references an extent can have in order for us to attempt to +@@ -98,6 +99,15 @@ struct send_ctx { + struct btrfs_key *cmp_key; + + /* ++ * Keep track of the generation of the last transaction that was used ++ * for relocating a block group. This is periodically checked in order ++ * to detect if a relocation happened since the last check, so that we ++ * don't operate on stale extent buffers for nodes (level >= 1) or on ++ * stale disk_bytenr values of file extent items. ++ */ ++ u64 last_reloc_trans; ++ ++ /* + * infos of the currently processed inode. In case of deleted inodes, + * these are the values from the deleted inode. + */ +@@ -1427,6 +1437,26 @@ static int find_extent_clone(struct send + if (ret < 0) + goto out; + ++ down_read(&fs_info->commit_root_sem); ++ if (fs_info->last_reloc_trans > sctx->last_reloc_trans) { ++ /* ++ * A transaction commit for a transaction in which block group ++ * relocation was done just happened. 
++ * The disk_bytenr of the file extent item we processed is ++ * possibly stale, referring to the extent's location before ++ * relocation. So act as if we haven't found any clone sources ++ * and fallback to write commands, which will read the correct ++ * data from the new extent location. Otherwise we will fail ++ * below because we haven't found our own back reference or we ++ * could be getting incorrect sources in case the old extent ++ * was already reallocated after the relocation. ++ */ ++ up_read(&fs_info->commit_root_sem); ++ ret = -ENOENT; ++ goto out; ++ } ++ up_read(&fs_info->commit_root_sem); ++ + if (!backref_ctx.found_itself) { + /* found a bug in backref code? */ + ret = -EIO; +@@ -6601,6 +6631,50 @@ static int changed_cb(struct btrfs_path + { + int ret = 0; + ++ /* ++ * We can not hold the commit root semaphore here. This is because in ++ * the case of sending and receiving to the same filesystem, using a ++ * pipe, could result in a deadlock: ++ * ++ * 1) The task running send blocks on the pipe because it's full; ++ * ++ * 2) The task running receive, which is the only consumer of the pipe, ++ * is waiting for a transaction commit (for example due to a space ++ * reservation when doing a write or triggering a transaction commit ++ * when creating a subvolume); ++ * ++ * 3) The transaction is waiting to write lock the commit root semaphore, ++ * but can not acquire it since it's being held at 1). ++ * ++ * Down this call chain we write to the pipe through kernel_write(). ++ * The same type of problem can also happen when sending to a file that ++ * is stored in the same filesystem - when reserving space for a write ++ * into the file, we can trigger a transaction commit. ++ * ++ * Our caller has supplied us with clones of leaves from the send and ++ * parent roots, so we're safe here from a concurrent relocation and ++ * further reallocation of metadata extents while we are here. Below we ++ * also assert that the leaves are clones. ++ */ ++ lockdep_assert_not_held(&sctx->send_root->fs_info->commit_root_sem); ++ ++ /* ++ * We always have a send root, so left_path is never NULL. We will not ++ * have a leaf when we have reached the end of the send root but have ++ * not yet reached the end of the parent root. ++ */ ++ if (left_path->nodes[0]) ++ ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, ++ &left_path->nodes[0]->bflags)); ++ /* ++ * When doing a full send we don't have a parent root, so right_path is ++ * NULL. When doing an incremental send, we may have reached the end of ++ * the parent root already, so we don't have a leaf at right_path. ++ */ ++ if (right_path && right_path->nodes[0]) ++ ASSERT(test_bit(EXTENT_BUFFER_UNMAPPED, ++ &right_path->nodes[0]->bflags)); ++ + if (result == BTRFS_COMPARE_TREE_SAME) { + if (key->type == BTRFS_INODE_REF_KEY || + key->type == BTRFS_INODE_EXTREF_KEY) { +@@ -6647,14 +6721,46 @@ out: + return ret; + } + ++static int search_key_again(const struct send_ctx *sctx, ++ struct btrfs_root *root, ++ struct btrfs_path *path, ++ const struct btrfs_key *key) ++{ ++ int ret; ++ ++ if (!path->need_commit_sem) ++ lockdep_assert_held_read(&root->fs_info->commit_root_sem); ++ ++ /* ++ * Roots used for send operations are readonly and no one can add, ++ * update or remove keys from them, so we should be able to find our ++ * key again. The only exception is deduplication, which can operate on ++ * readonly roots and add, update or remove keys to/from them - but at ++ * the moment we don't allow it to run in parallel with send. 
++ */ ++ ret = btrfs_search_slot(NULL, root, key, path, 0, 0); ++ ASSERT(ret <= 0); ++ if (ret > 0) { ++ btrfs_print_tree(path->nodes[path->lowest_level], false); ++ btrfs_err(root->fs_info, ++"send: key (%llu %u %llu) not found in %s root %llu, lowest_level %d, slot %d", ++ key->objectid, key->type, key->offset, ++ (root == sctx->parent_root ? "parent" : "send"), ++ root->root_key.objectid, path->lowest_level, ++ path->slots[path->lowest_level]); ++ return -EUCLEAN; ++ } ++ ++ return ret; ++} ++ + static int full_send_tree(struct send_ctx *sctx) + { + int ret; + struct btrfs_root *send_root = sctx->send_root; + struct btrfs_key key; ++ struct btrfs_fs_info *fs_info = send_root->fs_info; + struct btrfs_path *path; +- struct extent_buffer *eb; +- int slot; + + path = alloc_path_for_send(); + if (!path) +@@ -6665,6 +6771,10 @@ static int full_send_tree(struct send_ct + key.type = BTRFS_INODE_ITEM_KEY; + key.offset = 0; + ++ down_read(&fs_info->commit_root_sem); ++ sctx->last_reloc_trans = fs_info->last_reloc_trans; ++ up_read(&fs_info->commit_root_sem); ++ + ret = btrfs_search_slot_for_read(send_root, &key, path, 1, 0); + if (ret < 0) + goto out; +@@ -6672,15 +6782,35 @@ static int full_send_tree(struct send_ct + goto out_finish; + + while (1) { +- eb = path->nodes[0]; +- slot = path->slots[0]; +- btrfs_item_key_to_cpu(eb, &key, slot); ++ btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]); + + ret = changed_cb(path, NULL, &key, + BTRFS_COMPARE_TREE_NEW, sctx); + if (ret < 0) + goto out; + ++ down_read(&fs_info->commit_root_sem); ++ if (fs_info->last_reloc_trans > sctx->last_reloc_trans) { ++ sctx->last_reloc_trans = fs_info->last_reloc_trans; ++ up_read(&fs_info->commit_root_sem); ++ /* ++ * A transaction used for relocating a block group was ++ * committed or is about to finish its commit. Release ++ * our path (leaf) and restart the search, so that we ++ * avoid operating on any file extent items that are ++ * stale, with a disk_bytenr that reflects a pre ++ * relocation value. This way we avoid as much as ++ * possible to fallback to regular writes when checking ++ * if we can clone file ranges. 
++ */ ++ btrfs_release_path(path); ++ ret = search_key_again(sctx, send_root, path, &key); ++ if (ret < 0) ++ goto out; ++ } else { ++ up_read(&fs_info->commit_root_sem); ++ } ++ + ret = btrfs_next_item(send_root, path); + if (ret < 0) + goto out; +@@ -6698,6 +6828,20 @@ out: + return ret; + } + ++static int replace_node_with_clone(struct btrfs_path *path, int level) ++{ ++ struct extent_buffer *clone; ++ ++ clone = btrfs_clone_extent_buffer(path->nodes[level]); ++ if (!clone) ++ return -ENOMEM; ++ ++ free_extent_buffer(path->nodes[level]); ++ path->nodes[level] = clone; ++ ++ return 0; ++} ++ + static int tree_move_down(struct btrfs_path *path, int *level, u64 reada_min_gen) + { + struct extent_buffer *eb; +@@ -6707,6 +6851,8 @@ static int tree_move_down(struct btrfs_p + u64 reada_max; + u64 reada_done = 0; + ++ lockdep_assert_held_read(&parent->fs_info->commit_root_sem); ++ + BUG_ON(*level == 0); + eb = btrfs_read_node_slot(parent, slot); + if (IS_ERR(eb)) +@@ -6730,6 +6876,10 @@ static int tree_move_down(struct btrfs_p + path->nodes[*level - 1] = eb; + path->slots[*level - 1] = 0; + (*level)--; ++ ++ if (*level == 0) ++ return replace_node_with_clone(path, 0); ++ + return 0; + } + +@@ -6743,8 +6893,10 @@ static int tree_move_next_or_upnext(stru + path->slots[*level]++; + + while (path->slots[*level] >= nritems) { +- if (*level == root_level) ++ if (*level == root_level) { ++ path->slots[*level] = nritems - 1; + return -1; ++ } + + /* move upnext */ + path->slots[*level] = 0; +@@ -6776,14 +6928,20 @@ static int tree_advance(struct btrfs_pat + } else { + ret = tree_move_down(path, level, reada_min_gen); + } +- if (ret >= 0) { +- if (*level == 0) +- btrfs_item_key_to_cpu(path->nodes[*level], key, +- path->slots[*level]); +- else +- btrfs_node_key_to_cpu(path->nodes[*level], key, +- path->slots[*level]); +- } ++ ++ /* ++ * Even if we have reached the end of a tree, ret is -1, update the key ++ * anyway, so that in case we need to restart due to a block group ++ * relocation, we can assert that the last key of the root node still ++ * exists in the tree. ++ */ ++ if (*level == 0) ++ btrfs_item_key_to_cpu(path->nodes[*level], key, ++ path->slots[*level]); ++ else ++ btrfs_node_key_to_cpu(path->nodes[*level], key, ++ path->slots[*level]); ++ + return ret; + } + +@@ -6813,6 +6971,97 @@ static int tree_compare_item(struct btrf + } + + /* ++ * A transaction used for relocating a block group was committed or is about to ++ * finish its commit. Release our paths and restart the search, so that we are ++ * not using stale extent buffers: ++ * ++ * 1) For levels > 0, we are only holding references of extent buffers, without ++ * any locks on them, which does not prevent them from having been relocated ++ * and reallocated after the last time we released the commit root semaphore. ++ * The exception are the root nodes, for which we always have a clone, see ++ * the comment at btrfs_compare_trees(); ++ * ++ * 2) For leaves, level 0, we are holding copies (clones) of extent buffers, so ++ * we are safe from the concurrent relocation and reallocation. However they ++ * can have file extent items with a pre relocation disk_bytenr value, so we ++ * restart the start from the current commit roots and clone the new leaves so ++ * that we get the post relocation disk_bytenr values. Not doing so, could ++ * make us clone the wrong data in case there are new extents using the old ++ * disk_bytenr that happen to be shared. 
++ */ ++static int restart_after_relocation(struct btrfs_path *left_path, ++ struct btrfs_path *right_path, ++ const struct btrfs_key *left_key, ++ const struct btrfs_key *right_key, ++ int left_level, ++ int right_level, ++ const struct send_ctx *sctx) ++{ ++ int root_level; ++ int ret; ++ ++ lockdep_assert_held_read(&sctx->send_root->fs_info->commit_root_sem); ++ ++ btrfs_release_path(left_path); ++ btrfs_release_path(right_path); ++ ++ /* ++ * Since keys can not be added or removed to/from our roots because they ++ * are readonly and we do not allow deduplication to run in parallel ++ * (which can add, remove or change keys), the layout of the trees should ++ * not change. ++ */ ++ left_path->lowest_level = left_level; ++ ret = search_key_again(sctx, sctx->send_root, left_path, left_key); ++ if (ret < 0) ++ return ret; ++ ++ right_path->lowest_level = right_level; ++ ret = search_key_again(sctx, sctx->parent_root, right_path, right_key); ++ if (ret < 0) ++ return ret; ++ ++ /* ++ * If the lowest level nodes are leaves, clone them so that they can be ++ * safely used by changed_cb() while not under the protection of the ++ * commit root semaphore, even if relocation and reallocation happens in ++ * parallel. ++ */ ++ if (left_level == 0) { ++ ret = replace_node_with_clone(left_path, 0); ++ if (ret < 0) ++ return ret; ++ } ++ ++ if (right_level == 0) { ++ ret = replace_node_with_clone(right_path, 0); ++ if (ret < 0) ++ return ret; ++ } ++ ++ /* ++ * Now clone the root nodes (unless they happen to be the leaves we have ++ * already cloned). This is to protect against concurrent snapshotting of ++ * the send and parent roots (see the comment at btrfs_compare_trees()). ++ */ ++ root_level = btrfs_header_level(sctx->send_root->commit_root); ++ if (root_level > 0) { ++ ret = replace_node_with_clone(left_path, root_level); ++ if (ret < 0) ++ return ret; ++ } ++ ++ root_level = btrfs_header_level(sctx->parent_root->commit_root); ++ if (root_level > 0) { ++ ret = replace_node_with_clone(right_path, root_level); ++ if (ret < 0) ++ return ret; ++ } ++ ++ return 0; ++} ++ ++/* + * This function compares two trees and calls the provided callback for + * every changed/new/deleted item it finds. + * If shared tree blocks are encountered, whole subtrees are skipped, making +@@ -6840,10 +7089,10 @@ static int btrfs_compare_trees(struct bt + int right_root_level; + int left_level; + int right_level; +- int left_end_reached; +- int right_end_reached; +- int advance_left; +- int advance_right; ++ int left_end_reached = 0; ++ int right_end_reached = 0; ++ int advance_left = 0; ++ int advance_right = 0; + u64 left_blockptr; + u64 right_blockptr; + u64 left_gen; +@@ -6911,12 +7160,18 @@ static int btrfs_compare_trees(struct bt + down_read(&fs_info->commit_root_sem); + left_level = btrfs_header_level(left_root->commit_root); + left_root_level = left_level; ++ /* ++ * We clone the root node of the send and parent roots to prevent races ++ * with snapshot creation of these roots. Snapshot creation COWs the ++ * root node of a tree, so after the transaction is committed the old ++ * extent can be reallocated while this send operation is still ongoing. ++ * So we clone them, under the commit root semaphore, to be race free. 
++ */ + left_path->nodes[left_level] = + btrfs_clone_extent_buffer(left_root->commit_root); + if (!left_path->nodes[left_level]) { +- up_read(&fs_info->commit_root_sem); + ret = -ENOMEM; +- goto out; ++ goto out_unlock; + } + + right_level = btrfs_header_level(right_root->commit_root); +@@ -6924,9 +7179,8 @@ static int btrfs_compare_trees(struct bt + right_path->nodes[right_level] = + btrfs_clone_extent_buffer(right_root->commit_root); + if (!right_path->nodes[right_level]) { +- up_read(&fs_info->commit_root_sem); + ret = -ENOMEM; +- goto out; ++ goto out_unlock; + } + /* + * Our right root is the parent root, while the left root is the "send" +@@ -6936,7 +7190,6 @@ static int btrfs_compare_trees(struct bt + * will need to read them at some point. + */ + reada_min_gen = btrfs_header_generation(right_root->commit_root); +- up_read(&fs_info->commit_root_sem); + + if (left_level == 0) + btrfs_item_key_to_cpu(left_path->nodes[left_level], +@@ -6951,11 +7204,26 @@ static int btrfs_compare_trees(struct bt + btrfs_node_key_to_cpu(right_path->nodes[right_level], + &right_key, right_path->slots[right_level]); + +- left_end_reached = right_end_reached = 0; +- advance_left = advance_right = 0; ++ sctx->last_reloc_trans = fs_info->last_reloc_trans; + + while (1) { +- cond_resched(); ++ if (need_resched() || ++ rwsem_is_contended(&fs_info->commit_root_sem)) { ++ up_read(&fs_info->commit_root_sem); ++ cond_resched(); ++ down_read(&fs_info->commit_root_sem); ++ } ++ ++ if (fs_info->last_reloc_trans > sctx->last_reloc_trans) { ++ ret = restart_after_relocation(left_path, right_path, ++ &left_key, &right_key, ++ left_level, right_level, ++ sctx); ++ if (ret < 0) ++ goto out_unlock; ++ sctx->last_reloc_trans = fs_info->last_reloc_trans; ++ } ++ + if (advance_left && !left_end_reached) { + ret = tree_advance(left_path, &left_level, + left_root_level, +@@ -6964,7 +7232,7 @@ static int btrfs_compare_trees(struct bt + if (ret == -1) + left_end_reached = ADVANCE; + else if (ret < 0) +- goto out; ++ goto out_unlock; + advance_left = 0; + } + if (advance_right && !right_end_reached) { +@@ -6975,54 +7243,55 @@ static int btrfs_compare_trees(struct bt + if (ret == -1) + right_end_reached = ADVANCE; + else if (ret < 0) +- goto out; ++ goto out_unlock; + advance_right = 0; + } + + if (left_end_reached && right_end_reached) { + ret = 0; +- goto out; ++ goto out_unlock; + } else if (left_end_reached) { + if (right_level == 0) { ++ up_read(&fs_info->commit_root_sem); + ret = changed_cb(left_path, right_path, + &right_key, + BTRFS_COMPARE_TREE_DELETED, + sctx); + if (ret < 0) + goto out; ++ down_read(&fs_info->commit_root_sem); + } + advance_right = ADVANCE; + continue; + } else if (right_end_reached) { + if (left_level == 0) { ++ up_read(&fs_info->commit_root_sem); + ret = changed_cb(left_path, right_path, + &left_key, + BTRFS_COMPARE_TREE_NEW, + sctx); + if (ret < 0) + goto out; ++ down_read(&fs_info->commit_root_sem); + } + advance_left = ADVANCE; + continue; + } + + if (left_level == 0 && right_level == 0) { ++ up_read(&fs_info->commit_root_sem); + cmp = btrfs_comp_cpu_keys(&left_key, &right_key); + if (cmp < 0) { + ret = changed_cb(left_path, right_path, + &left_key, + BTRFS_COMPARE_TREE_NEW, + sctx); +- if (ret < 0) +- goto out; + advance_left = ADVANCE; + } else if (cmp > 0) { + ret = changed_cb(left_path, right_path, + &right_key, + BTRFS_COMPARE_TREE_DELETED, + sctx); +- if (ret < 0) +- goto out; + advance_right = ADVANCE; + } else { + enum btrfs_compare_tree_result result; +@@ -7036,11 +7305,13 @@ static int 
btrfs_compare_trees(struct bt + result = BTRFS_COMPARE_TREE_SAME; + ret = changed_cb(left_path, right_path, + &left_key, result, sctx); +- if (ret < 0) +- goto out; + advance_left = ADVANCE; + advance_right = ADVANCE; + } ++ ++ if (ret < 0) ++ goto out; ++ down_read(&fs_info->commit_root_sem); + } else if (left_level == right_level) { + cmp = btrfs_comp_cpu_keys(&left_key, &right_key); + if (cmp < 0) { +@@ -7080,6 +7351,8 @@ static int btrfs_compare_trees(struct bt + } + } + ++out_unlock: ++ up_read(&fs_info->commit_root_sem); + out: + btrfs_free_path(left_path); + btrfs_free_path(right_path); +@@ -7429,21 +7702,7 @@ long btrfs_ioctl_send(struct file *mnt_f + if (ret) + goto out; + +- spin_lock(&fs_info->send_reloc_lock); +- if (test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags)) { +- spin_unlock(&fs_info->send_reloc_lock); +- btrfs_warn_rl(fs_info, +- "cannot run send because a relocation operation is in progress"); +- ret = -EAGAIN; +- goto out; +- } +- fs_info->send_in_progress++; +- spin_unlock(&fs_info->send_reloc_lock); +- + ret = send_subvol(sctx); +- spin_lock(&fs_info->send_reloc_lock); +- fs_info->send_in_progress--; +- spin_unlock(&fs_info->send_reloc_lock); + if (ret < 0) + goto out; + +--- a/fs/btrfs/transaction.c ++++ b/fs/btrfs/transaction.c +@@ -163,6 +163,10 @@ static noinline void switch_commit_roots + struct btrfs_caching_control *caching_ctl, *next; + + down_write(&fs_info->commit_root_sem); ++ ++ if (test_bit(BTRFS_FS_RELOC_RUNNING, &fs_info->flags)) ++ fs_info->last_reloc_trans = trans->transid; ++ + list_for_each_entry_safe(root, tmp, &cur_trans->switch_commits, + dirty_list) { + list_del_init(&root->dirty_list); diff --git a/queue-5.16/perf-parse-fix-event-parser-error-for-hybrid-systems.patch b/queue-5.16/perf-parse-fix-event-parser-error-for-hybrid-systems.patch index a83bb6f5881..50627321132 100644 --- a/queue-5.16/perf-parse-fix-event-parser-error-for-hybrid-systems.patch +++ b/queue-5.16/perf-parse-fix-event-parser-error-for-hybrid-systems.patch @@ -73,14 +73,12 @@ Link: https://lore.kernel.org/r/20220307151627.30049-1-zhengjun.xing@linux.intel Signed-off-by: Arnaldo Carvalho de Melo Signed-off-by: Greg Kroah-Hartman --- - tools/perf/util/parse-events.c | 6 ++++-- + tools/perf/util/parse-events.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) -diff --git a/tools/perf/util/parse-events.c b/tools/perf/util/parse-events.c -index dfb50a5f83d0..24997925ae00 100644 --- a/tools/perf/util/parse-events.c +++ b/tools/perf/util/parse-events.c -@@ -1648,6 +1648,7 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, +@@ -1648,6 +1648,7 @@ int parse_events_multi_pmu_add(struct pa { struct parse_events_term *term; struct list_head *list = NULL; @@ -88,7 +86,7 @@ index dfb50a5f83d0..24997925ae00 100644 struct perf_pmu *pmu = NULL; int ok = 0; char *config; -@@ -1674,7 +1675,6 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, +@@ -1674,7 +1675,6 @@ int parse_events_multi_pmu_add(struct pa } list_add_tail(&term->list, head); @@ -96,7 +94,7 @@ index dfb50a5f83d0..24997925ae00 100644 /* Add it for all PMUs that support the alias */ list = malloc(sizeof(struct list_head)); if (!list) -@@ -1687,13 +1687,15 @@ int parse_events_multi_pmu_add(struct parse_events_state *parse_state, +@@ -1687,13 +1687,15 @@ int parse_events_multi_pmu_add(struct pa list_for_each_entry(alias, &pmu->aliases, list) { if (!strcasecmp(alias->name, str)) { @@ -113,6 +111,3 @@ index dfb50a5f83d0..24997925ae00 100644 } } } --- -2.35.1 - diff --git 
a/queue-5.16/series b/queue-5.16/series
index 1a698eca9ed..6779dee1e14 100644
--- a/queue-5.16/series
+++ b/queue-5.16/series
@@ -117,3 +117,4 @@ x86-sgx-free-backing-memory-after-faulting-the-enclave-page.patch
 x86-traps-mark-do_int3-nokprobe_symbol.patch
 drm-panel-select-drm_dp_helper-for-drm_panel_edp.patch
 perf-parse-fix-event-parser-error-for-hybrid-systems.patch
+btrfs-make-send-work-with-concurrent-block-group-relocation.patch