From: Greg Kroah-Hartman
Date: Thu, 18 Jun 2020 14:55:22 +0000 (+0200)
Subject: 5.7-stable patches
X-Git-Tag: v4.4.228~53
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=d80b84061a964f0d259e6287d02b197cd3a20e8d;p=thirdparty%2Fkernel%2Fstable-queue.git

5.7-stable patches

added patches:
      btrfs-fix-a-race-between-scrub-and-block-group-removal-allocation.patch
      btrfs-fix-corrupt-log-due-to-concurrent-fsync-of-inodes-with-shared-extents.patch
      btrfs-fix-error-handling-when-submitting-direct-i-o-bio.patch
      btrfs-fix-space_info-bytes_may_use-underflow-after-nocow-buffered-write.patch
      btrfs-fix-space_info-bytes_may_use-underflow-during-space-cache-writeout.patch
      btrfs-fix-wrong-file-range-cleanup-after-an-error-filling-dealloc-range.patch
      btrfs-force-chunk-allocation-if-our-global-rsv-is-larger-than-metadata.patch
      btrfs-free-alien-device-after-device-add.patch
      btrfs-include-non-missing-as-a-qualifier-for-the-latest_bdev.patch
      btrfs-reloc-fix-reloc-root-leak-and-null-pointer-dereference.patch
      btrfs-send-emit-file-capabilities-after-chown.patch
---

diff --git a/queue-5.7/btrfs-fix-a-race-between-scrub-and-block-group-removal-allocation.patch b/queue-5.7/btrfs-fix-a-race-between-scrub-and-block-group-removal-allocation.patch
new file mode 100644
index 00000000000..3b3dced7fad
--- /dev/null
+++ b/queue-5.7/btrfs-fix-a-race-between-scrub-and-block-group-removal-allocation.patch
@@ -0,0 +1,270 @@
+From 2473d24f2b77da0ffabcbb916793e58e7f57440b Mon Sep 17 00:00:00 2001
+From: Filipe Manana
+Date: Fri, 8 May 2020 11:01:10 +0100
+Subject: btrfs: fix a race between scrub and block group removal/allocation
+
+From: Filipe Manana
+
+commit 2473d24f2b77da0ffabcbb916793e58e7f57440b upstream.
+
+When scrub is verifying the extents of a block group for a device, it is
+possible that the corresponding block group gets removed and its logical
+address and device extents get used for a new block group allocation.
+When this happens scrub incorrectly reports that errors were detected
+and, if the new block group has a different profile than the old one (the
+deleted block group), we can crash due to a null pointer dereference.
+Possibly other unexpected and weird consequences can happen as well.
+
+Consider the following sequence of actions that leads to the null pointer
+dereference crash when scrub is running in parallel with balance:
+
+1) Balance sets block group X to read-only mode and starts relocating it.
+   Block group X is a metadata block group, has a raid1 profile (two
+   device extents, each one in a different device) and a logical address
+   of 19424870400;
+
+2) Scrub is running and finds device extent E, which belongs to block
+   group X. It enters scrub_stripe() to find all extents allocated to
+   block group X, the search is done using the extent tree;
+
+3) Balance finishes relocating block group X and removes block group X;
+
+4) Balance starts relocating another block group and when trying to
+   commit the current transaction as part of the preparation step
+   (prepare_to_relocate()), it blocks because scrub is running;
+
+5) The scrub task finds the metadata extent at the logical address
+   19425001472 and marks the pages of the extent to be read by a bio
+   (struct scrub_bio). The extent item's flags, which have the bit
+   BTRFS_EXTENT_FLAG_TREE_BLOCK set, are added to each page (struct
+   scrub_page). It is these flags in the scrub pages that tell the
+   bio's end io function (scrub_bio_end_io_worker) which type of extent
+   it is dealing with.
+   At this point we end up with 4 pages in a bio which is ready for
+   submission (the metadata extent has a size of 16Kb, so that gives 4
+   pages on x86);
+
+6) At the next iteration of scrub_stripe(), scrub checks that there is a
+   pause request from the relocation task trying to commit a transaction,
+   therefore it submits the pending bio and pauses, waiting for the
+   transaction commit to complete before resuming;
+
+7) The relocation task commits the transaction. The device extent E, that
+   was used by our block group X, is now available for allocation, since
+   the commit root for the device tree was swapped by the transaction
+   commit;
+
+8) Another task doing a direct IO write allocates a new data block group Y
+   which ends up using device extent E. This new block group Y also ends
+   up getting the same logical address that block group X had: 19424870400.
+   This happens because block group X was the block group with the highest
+   logical address and, when allocating Y, find_next_chunk() returns the
+   end offset of the current last block group to be used as the logical
+   address for the new block group, which is
+
+     18351128576 + 1073741824 = 19424870400
+
+   So our new block group Y has the same logical address and device extent
+   that block group X had. However Y is a data block group, while X was
+   a metadata one, and Y has a raid0 profile, while X had a raid1 profile;
+
+9) After allocating block group Y, the direct IO submits a bio to write
+   to device extent E;
+
+10) The read bio submitted by scrub reads the 4 pages (16Kb) from device
+    extent E, which now correspond to the data written by the task that
+    did a direct IO write. Then at the end io function associated with
+    the bio, scrub_bio_end_io_worker(), we call scrub_block_complete()
+    which calls scrub_checksum(). This latter function checks the flags
+    of the first page, and sees that the bit BTRFS_EXTENT_FLAG_TREE_BLOCK
+    is set in the flags, so it assumes it has a metadata extent and
+    then calls scrub_checksum_tree_block(). That function returns an
+    error, since interpreting data as a metadata extent causes the
+    checksum verification to fail.
+
+    So this makes scrub_checksum() call scrub_handle_errored_block(),
+    which determines 'failed_mirror_index' to be 1, since the device
+    extent E was allocated as the second mirror of block group X.
+
+    It allocates BTRFS_MAX_MIRRORS scrub_block structures as an array at
+    'sblocks_for_recheck', and all the memory is initialized to zeroes by
+    kcalloc().
+
+    After that it calls scrub_setup_recheck_block(), which is responsible
+    for filling each of those structures. However, when that function
+    calls btrfs_map_sblock() against the logical address of the metadata
+    extent, 19425001472, it gets a struct btrfs_bio ('bbio') that matches
+    the current block group Y. However block group Y has a raid0 profile
+    and not a raid1 profile like X had, so the following call returns 1:
+
+      scrub_nr_raid_mirrors(bbio)
+
+    And as a result scrub_setup_recheck_block() only initializes the
+    first (index 0) scrub_block structure in 'sblocks_for_recheck'.
+
+    Then scrub_recheck_block() is called by scrub_handle_errored_block()
+    with the second (index 1) scrub_block structure as the argument,
+    because 'failed_mirror_index' was previously set to 1.
+    This scrub_block was not initialized by scrub_setup_recheck_block(),
+    so it has zero pages, its 'page_count' member is 0 and its 'pagev'
+    page array has all members pointing to NULL.
+ + Finally when scrub_recheck_block() calls scrub_recheck_block_checksum() + we have a NULL pointer dereference when accessing the flags of the first + page, as pavev[0] is NULL: + + static void scrub_recheck_block_checksum(struct scrub_block *sblock) + { + (...) + if (sblock->pagev[0]->flags & BTRFS_EXTENT_FLAG_DATA) + scrub_checksum_data(sblock); + (...) + } + + Producing a stack trace like the following: + + [542998.008985] BUG: kernel NULL pointer dereference, address: 0000000000000028 + [542998.010238] #PF: supervisor read access in kernel mode + [542998.010878] #PF: error_code(0x0000) - not-present page + [542998.011516] PGD 0 P4D 0 + [542998.011929] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI + [542998.012786] CPU: 3 PID: 4846 Comm: kworker/u8:1 Tainted: G B W 5.6.0-rc7-btrfs-next-58 #1 + [542998.014524] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 + [542998.016065] Workqueue: btrfs-scrub btrfs_work_helper [btrfs] + [542998.017255] RIP: 0010:scrub_recheck_block_checksum+0xf/0x20 [btrfs] + [542998.018474] Code: 4c 89 e6 ... + [542998.021419] RSP: 0018:ffffa7af0375fbd8 EFLAGS: 00010202 + [542998.022120] RAX: 0000000000000000 RBX: ffff9792e674d120 RCX: 0000000000000000 + [542998.023178] RDX: 0000000000000001 RSI: ffff9792e674d120 RDI: ffff9792e674d120 + [542998.024465] RBP: 0000000000000000 R08: 0000000000000067 R09: 0000000000000001 + [542998.025462] R10: ffffa7af0375fa50 R11: 0000000000000000 R12: ffff9791f61fe800 + [542998.026357] R13: ffff9792e674d120 R14: 0000000000000001 R15: ffffffffc0e3dfc0 + [542998.027237] FS: 0000000000000000(0000) GS:ffff9792fb200000(0000) knlGS:0000000000000000 + [542998.028327] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + [542998.029261] CR2: 0000000000000028 CR3: 00000000b3b18003 CR4: 00000000003606e0 + [542998.030301] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + [542998.031316] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + [542998.032380] Call Trace: + [542998.032752] scrub_recheck_block+0x162/0x400 [btrfs] + [542998.033500] ? __alloc_pages_nodemask+0x31e/0x460 + [542998.034228] scrub_handle_errored_block+0x6f8/0x1920 [btrfs] + [542998.035170] scrub_bio_end_io_worker+0x100/0x520 [btrfs] + [542998.035991] btrfs_work_helper+0xaa/0x720 [btrfs] + [542998.036735] process_one_work+0x26d/0x6a0 + [542998.037275] worker_thread+0x4f/0x3e0 + [542998.037740] ? process_one_work+0x6a0/0x6a0 + [542998.038378] kthread+0x103/0x140 + [542998.038789] ? kthread_create_worker_on_cpu+0x70/0x70 + [542998.039419] ret_from_fork+0x3a/0x50 + [542998.039875] Modules linked in: dm_snapshot dm_thin_pool ... + [542998.047288] CR2: 0000000000000028 + [542998.047724] ---[ end trace bde186e176c7f96a ]--- + +This issue has been around for a long time, possibly since scrub exists. +The last time I ran into it was over 2 years ago. After recently fixing +fstests to pass the "--full-balance" command line option to btrfs-progs +when doing balance, several tests started to more heavily exercise balance +with fsstress, scrub and other operations in parallel, and therefore +started to hit this issue again (with btrfs/061 for example). + +Fix this by having scrub increment the 'trimming' counter of the block +group, which pins the block group in such a way that it guarantees neither +its logical address nor device extents can be reused by future block group +allocations until we decrement the 'trimming' counter. 
Also make sure that +on each iteration of scrub_stripe() we stop scrubbing the block group if +it was removed already. + +A later patch in the series will rename the block group's 'trimming' +counter and its helpers to a more generic name, since now it is not used +exclusively for pinning while trimming anymore. + +CC: stable@vger.kernel.org # 4.4+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/scrub.c | 38 ++++++++++++++++++++++++++++++++++++-- + 1 file changed, 36 insertions(+), 2 deletions(-) + +--- a/fs/btrfs/scrub.c ++++ b/fs/btrfs/scrub.c +@@ -3046,7 +3046,8 @@ out: + static noinline_for_stack int scrub_stripe(struct scrub_ctx *sctx, + struct map_lookup *map, + struct btrfs_device *scrub_dev, +- int num, u64 base, u64 length) ++ int num, u64 base, u64 length, ++ struct btrfs_block_group *cache) + { + struct btrfs_path *path, *ppath; + struct btrfs_fs_info *fs_info = sctx->fs_info; +@@ -3284,6 +3285,20 @@ static noinline_for_stack int scrub_stri + break; + } + ++ /* ++ * If our block group was removed in the meanwhile, just ++ * stop scrubbing since there is no point in continuing. ++ * Continuing would prevent reusing its device extents ++ * for new block groups for a long time. ++ */ ++ spin_lock(&cache->lock); ++ if (cache->removed) { ++ spin_unlock(&cache->lock); ++ ret = 0; ++ goto out; ++ } ++ spin_unlock(&cache->lock); ++ + extent = btrfs_item_ptr(l, slot, + struct btrfs_extent_item); + flags = btrfs_extent_flags(l, extent); +@@ -3457,7 +3472,7 @@ static noinline_for_stack int scrub_chun + if (map->stripes[i].dev->bdev == scrub_dev->bdev && + map->stripes[i].physical == dev_offset) { + ret = scrub_stripe(sctx, map, scrub_dev, i, +- chunk_offset, length); ++ chunk_offset, length, cache); + if (ret) + goto out; + } +@@ -3555,6 +3570,23 @@ int scrub_enumerate_chunks(struct scrub_ + goto skip; + + /* ++ * Make sure that while we are scrubbing the corresponding block ++ * group doesn't get its logical address and its device extents ++ * reused for another block group, which can possibly be of a ++ * different type and different profile. We do this to prevent ++ * false error detections and crashes due to bogus attempts to ++ * repair extents. 
++ */ ++ spin_lock(&cache->lock); ++ if (cache->removed) { ++ spin_unlock(&cache->lock); ++ btrfs_put_block_group(cache); ++ goto skip; ++ } ++ btrfs_get_block_group_trimming(cache); ++ spin_unlock(&cache->lock); ++ ++ /* + * we need call btrfs_inc_block_group_ro() with scrubs_paused, + * to avoid deadlock caused by: + * btrfs_inc_block_group_ro() +@@ -3609,6 +3641,7 @@ int scrub_enumerate_chunks(struct scrub_ + } else { + btrfs_warn(fs_info, + "failed setting block group ro: %d", ret); ++ btrfs_put_block_group_trimming(cache); + btrfs_put_block_group(cache); + scrub_pause_off(fs_info); + break; +@@ -3695,6 +3728,7 @@ int scrub_enumerate_chunks(struct scrub_ + spin_unlock(&cache->lock); + } + ++ btrfs_put_block_group_trimming(cache); + btrfs_put_block_group(cache); + if (ret) + break; diff --git a/queue-5.7/btrfs-fix-corrupt-log-due-to-concurrent-fsync-of-inodes-with-shared-extents.patch b/queue-5.7/btrfs-fix-corrupt-log-due-to-concurrent-fsync-of-inodes-with-shared-extents.patch new file mode 100644 index 00000000000..d9d3eae6641 --- /dev/null +++ b/queue-5.7/btrfs-fix-corrupt-log-due-to-concurrent-fsync-of-inodes-with-shared-extents.patch @@ -0,0 +1,316 @@ +From e289f03ea79bbc6574b78ac25682555423a91cbb Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Mon, 18 May 2020 12:14:50 +0100 +Subject: btrfs: fix corrupt log due to concurrent fsync of inodes with shared extents + +From: Filipe Manana + +commit e289f03ea79bbc6574b78ac25682555423a91cbb upstream. + +When we have extents shared amongst different inodes in the same subvolume, +if we fsync them in parallel we can end up with checksum items in the log +tree that represent ranges which overlap. + +For example, consider we have inodes A and B, both sharing an extent that +covers the logical range from X to X + 64KiB: + +1) Task A starts an fsync on inode A; + +2) Task B starts an fsync on inode B; + +3) Task A calls btrfs_csum_file_blocks(), and the first search in the + log tree, through btrfs_lookup_csum(), returns -EFBIG because it + finds an existing checksum item that covers the range from X - 64KiB + to X; + +4) Task A checks that the checksum item has not reached the maximum + possible size (MAX_CSUM_ITEMS) and then releases the search path + before it does another path search for insertion (through a direct + call to btrfs_search_slot()); + +5) As soon as task A releases the path and before it does the search + for insertion, task B calls btrfs_csum_file_blocks() and gets -EFBIG + too, because there is an existing checksum item that has an end + offset that matches the start offset (X) of the checksum range we want + to log; + +6) Task B releases the path; + +7) Task A does the path search for insertion (through btrfs_search_slot()) + and then verifies that the checksum item that ends at offset X still + exists and extends its size to insert the checksums for the range from + X to X + 64KiB; + +8) Task A releases the path and returns from btrfs_csum_file_blocks(), + having inserted the checksums into an existing checksum item that got + its size extended. At this point we have one checksum item in the log + tree that covers the logical range from X - 64KiB to X + 64KiB; + +9) Task B now does a search for insertion using btrfs_search_slot() too, + but it finds that the previous checksum item no longer ends at the + offset X, it now ends at an of offset X + 64KiB, so it leaves that item + untouched. 
+ + Then it releases the path and calls btrfs_insert_empty_item() + that inserts a checksum item with a key offset corresponding to X and + a size for inserting a single checksum (4 bytes in case of crc32c). + Subsequent iterations end up extending this new checksum item so that + it contains the checksums for the range from X to X + 64KiB. + + So after task B returns from btrfs_csum_file_blocks() we end up with + two checksum items in the log tree that have overlapping ranges, one + for the range from X - 64KiB to X + 64KiB, and another for the range + from X to X + 64KiB. + +Having checksum items that represent ranges which overlap, regardless of +being in the log tree or in the chekcsums tree, can lead to problems where +checksums for a file range end up not being found. This type of problem +has happened a few times in the past and the following commits fixed them +and explain in detail why having checksum items with overlapping ranges is +problematic: + + 27b9a8122ff71a "Btrfs: fix csum tree corruption, duplicate and outdated checksums" + b84b8390d6009c "Btrfs: fix file read corruption after extent cloning and fsync" + 40e046acbd2f36 "Btrfs: fix missing data checksums after replaying a log tree" + +Since this specific instance of the problem can only happen when logging +inodes, because it is the only case where concurrent attempts to insert +checksums for the same range can happen, fix the issue by using an extent +io tree as a range lock to serialize checksum insertion during inode +logging. + +This issue could often be reproduced by the test case generic/457 from +fstests. When it happens it produces the following trace: + + BTRFS critical (device dm-0): corrupt leaf: root=18446744073709551610 block=30625792 slot=42, csum end range (15020032) goes beyond the start range (15015936) of the next csum item + BTRFS info (device dm-0): leaf 30625792 gen 7 total ptrs 49 free space 2402 owner 18446744073709551610 + BTRFS info (device dm-0): refs 1 lock (w:0 r:0 bw:0 br:0 sw:0 sr:0) lock_owner 0 current 15884 + item 0 key (18446744073709551606 128 13979648) itemoff 3991 itemsize 4 + item 1 key (18446744073709551606 128 13983744) itemoff 3987 itemsize 4 + item 2 key (18446744073709551606 128 13987840) itemoff 3983 itemsize 4 + item 3 key (18446744073709551606 128 13991936) itemoff 3979 itemsize 4 + item 4 key (18446744073709551606 128 13996032) itemoff 3975 itemsize 4 + item 5 key (18446744073709551606 128 14000128) itemoff 3971 itemsize 4 + (...) + BTRFS error (device dm-0): block=30625792 write time tree block corruption detected + ------------[ cut here ]------------ + WARNING: CPU: 1 PID: 15884 at fs/btrfs/disk-io.c:539 btree_csum_one_bio+0x268/0x2d0 [btrfs] + Modules linked in: btrfs dm_thin_pool ... + CPU: 1 PID: 15884 Comm: fsx Tainted: G W 5.6.0-rc7-btrfs-next-58 #1 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 + RIP: 0010:btree_csum_one_bio+0x268/0x2d0 [btrfs] + Code: c7 c7 ... 
+ RSP: 0018:ffffbb0109e6f8e0 EFLAGS: 00010296 + RAX: 0000000000000000 RBX: ffffe1c0847b6080 RCX: 0000000000000000 + RDX: 0000000000000000 RSI: ffffffffaa963988 RDI: 0000000000000001 + RBP: ffff956a4f4d2000 R08: 0000000000000000 R09: 0000000000000001 + R10: 0000000000000526 R11: 0000000000000000 R12: ffff956a5cd28bb0 + R13: 0000000000000000 R14: ffff956a649c9388 R15: 000000011ed82000 + FS: 00007fb419959e80(0000) GS:ffff956a7aa00000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 0000000000fe6d54 CR3: 0000000138696005 CR4: 00000000003606e0 + DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + Call Trace: + btree_submit_bio_hook+0x67/0xc0 [btrfs] + submit_one_bio+0x31/0x50 [btrfs] + btree_write_cache_pages+0x2db/0x4b0 [btrfs] + ? __filemap_fdatawrite_range+0xb1/0x110 + do_writepages+0x23/0x80 + __filemap_fdatawrite_range+0xd2/0x110 + btrfs_write_marked_extents+0x15e/0x180 [btrfs] + btrfs_sync_log+0x206/0x10a0 [btrfs] + ? kmem_cache_free+0x315/0x3b0 + ? btrfs_log_inode+0x1e8/0xf90 [btrfs] + ? __mutex_unlock_slowpath+0x45/0x2a0 + ? lockref_put_or_lock+0x9/0x30 + ? dput+0x2d/0x580 + ? dput+0xb5/0x580 + ? btrfs_sync_file+0x464/0x4d0 [btrfs] + btrfs_sync_file+0x464/0x4d0 [btrfs] + do_fsync+0x38/0x60 + __x64_sys_fsync+0x10/0x20 + do_syscall_64+0x5c/0x280 + entry_SYSCALL_64_after_hwframe+0x49/0xbe + RIP: 0033:0x7fb41953a6d0 + Code: 48 3d ... + RSP: 002b:00007ffcc86bd218 EFLAGS: 00000246 ORIG_RAX: 000000000000004a + RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fb41953a6d0 + RDX: 0000000000000009 RSI: 0000000000040000 RDI: 0000000000000003 + RBP: 0000000000040000 R08: 0000000000000001 R09: 0000000000000009 + R10: 0000000000000064 R11: 0000000000000246 R12: 0000556cf4b2c060 + R13: 0000000000000100 R14: 0000000000000000 R15: 0000556cf322b420 + irq event stamp: 0 + hardirqs last enabled at (0): [<0000000000000000>] 0x0 + hardirqs last disabled at (0): [] copy_process+0x74f/0x2020 + softirqs last enabled at (0): [] copy_process+0x74f/0x2020 + softirqs last disabled at (0): [<0000000000000000>] 0x0 + ---[ end trace d543fc76f5ad7fd8 ]--- + +In that trace the tree checker detected the overlapping checksum items at +the time when we triggered writeback for the log tree when syncing the +log. + +Another trace that can happen is due to BUG_ON() when deleting checksum +items while logging an inode: + + BTRFS critical (device dm-0): slot 81 key (18446744073709551606 128 13635584) new key (18446744073709551606 128 13635584) + BTRFS info (device dm-0): leaf 30949376 gen 7 total ptrs 98 free space 8527 owner 18446744073709551610 + BTRFS info (device dm-0): refs 4 lock (w:1 r:0 bw:0 br:0 sw:1 sr:0) lock_owner 13473 current 13473 + item 0 key (257 1 0) itemoff 16123 itemsize 160 + inode generation 7 size 262144 mode 100600 + item 1 key (257 12 256) itemoff 16103 itemsize 20 + item 2 key (257 108 0) itemoff 16050 itemsize 53 + extent data disk bytenr 13631488 nr 4096 + extent data offset 0 nr 131072 ram 131072 + (...) + ------------[ cut here ]------------ + kernel BUG at fs/btrfs/ctree.c:3153! + invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI + CPU: 1 PID: 13473 Comm: fsx Not tainted 5.6.0-rc7-btrfs-next-58 #1 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 + RIP: 0010:btrfs_set_item_key_safe+0x1ea/0x270 [btrfs] + Code: 0f b6 ... 
+ RSP: 0018:ffff95e3889179d0 EFLAGS: 00010282 + RAX: 0000000000000000 RBX: 0000000000000051 RCX: 0000000000000000 + RDX: 0000000000000000 RSI: ffffffffb7763988 RDI: 0000000000000001 + RBP: fffffffffffffff6 R08: 0000000000000000 R09: 0000000000000001 + R10: 00000000000009ef R11: 0000000000000000 R12: ffff8912a8ba5a08 + R13: ffff95e388917a06 R14: ffff89138dcf68c8 R15: ffff95e388917ace + FS: 00007fe587084e80(0000) GS:ffff8913baa00000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 00007fe587091000 CR3: 0000000126dac005 CR4: 00000000003606e0 + DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + Call Trace: + btrfs_del_csums+0x2f4/0x540 [btrfs] + copy_items+0x4b5/0x560 [btrfs] + btrfs_log_inode+0x910/0xf90 [btrfs] + btrfs_log_inode_parent+0x2a0/0xe40 [btrfs] + ? dget_parent+0x5/0x370 + btrfs_log_dentry_safe+0x4a/0x70 [btrfs] + btrfs_sync_file+0x42b/0x4d0 [btrfs] + __x64_sys_msync+0x199/0x200 + do_syscall_64+0x5c/0x280 + entry_SYSCALL_64_after_hwframe+0x49/0xbe + RIP: 0033:0x7fe586c65760 + Code: 00 f7 ... + RSP: 002b:00007ffe250f98b8 EFLAGS: 00000246 ORIG_RAX: 000000000000001a + RAX: ffffffffffffffda RBX: 00000000000040e1 RCX: 00007fe586c65760 + RDX: 0000000000000004 RSI: 0000000000006b51 RDI: 00007fe58708b000 + RBP: 0000000000006a70 R08: 0000000000000003 R09: 00007fe58700cb61 + R10: 0000000000000100 R11: 0000000000000246 R12: 00000000000000e1 + R13: 00007fe58708b000 R14: 0000000000006b51 R15: 0000558de021a420 + Modules linked in: dm_log_writes ... + ---[ end trace c92a7f447a8515f5 ]--- + +CC: stable@vger.kernel.org # 4.4+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/ctree.h | 3 +++ + fs/btrfs/disk-io.c | 5 ++++- + fs/btrfs/extent-io-tree.h | 1 + + fs/btrfs/tree-log.c | 22 +++++++++++++++++++--- + include/trace/events/btrfs.h | 1 + + 5 files changed, 28 insertions(+), 4 deletions(-) + +--- a/fs/btrfs/ctree.h ++++ b/fs/btrfs/ctree.h +@@ -1146,6 +1146,9 @@ struct btrfs_root { + /* Record pairs of swapped blocks for qgroup */ + struct btrfs_qgroup_swapped_blocks swapped_blocks; + ++ /* Used only by log trees, when logging csum items */ ++ struct extent_io_tree log_csum_range; ++ + #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS + u64 alloc_bytenr; + #endif +--- a/fs/btrfs/disk-io.c ++++ b/fs/btrfs/disk-io.c +@@ -1137,9 +1137,12 @@ static void __setup_root(struct btrfs_ro + root->log_transid = 0; + root->log_transid_committed = -1; + root->last_log_commit = 0; +- if (!dummy) ++ if (!dummy) { + extent_io_tree_init(fs_info, &root->dirty_log_pages, + IO_TREE_ROOT_DIRTY_LOG_PAGES, NULL); ++ extent_io_tree_init(fs_info, &root->log_csum_range, ++ IO_TREE_LOG_CSUM_RANGE, NULL); ++ } + + memset(&root->root_key, 0, sizeof(root->root_key)); + memset(&root->root_item, 0, sizeof(root->root_item)); +--- a/fs/btrfs/extent-io-tree.h ++++ b/fs/btrfs/extent-io-tree.h +@@ -44,6 +44,7 @@ enum { + IO_TREE_TRANS_DIRTY_PAGES, + IO_TREE_ROOT_DIRTY_LOG_PAGES, + IO_TREE_INODE_FILE_EXTENT, ++ IO_TREE_LOG_CSUM_RANGE, + IO_TREE_SELFTEST, + }; + +--- a/fs/btrfs/tree-log.c ++++ b/fs/btrfs/tree-log.c +@@ -3299,6 +3299,7 @@ static void free_log_tree(struct btrfs_t + + clear_extent_bits(&log->dirty_log_pages, 0, (u64)-1, + EXTENT_DIRTY | EXTENT_NEW | EXTENT_NEED_WAIT); ++ extent_io_tree_release(&log->log_csum_range); + btrfs_put_root(log); + } + +@@ -3916,9 +3917,21 @@ static int log_csums(struct btrfs_trans_ + struct btrfs_root *log_root, + struct 
btrfs_ordered_sum *sums) + { ++ const u64 lock_end = sums->bytenr + sums->len - 1; ++ struct extent_state *cached_state = NULL; + int ret; + + /* ++ * Serialize logging for checksums. This is to avoid racing with the ++ * same checksum being logged by another task that is logging another ++ * file which happens to refer to the same extent as well. Such races ++ * can leave checksum items in the log with overlapping ranges. ++ */ ++ ret = lock_extent_bits(&log_root->log_csum_range, sums->bytenr, ++ lock_end, &cached_state); ++ if (ret) ++ return ret; ++ /* + * Due to extent cloning, we might have logged a csum item that covers a + * subrange of a cloned extent, and later we can end up logging a csum + * item for a larger subrange of the same extent or the entire range. +@@ -3928,10 +3941,13 @@ static int log_csums(struct btrfs_trans_ + * trim and adjust) any existing csum items in the log for this range. + */ + ret = btrfs_del_csums(trans, log_root, sums->bytenr, sums->len); +- if (ret) +- return ret; ++ if (!ret) ++ ret = btrfs_csum_file_blocks(trans, log_root, sums); + +- return btrfs_csum_file_blocks(trans, log_root, sums); ++ unlock_extent_cached(&log_root->log_csum_range, sums->bytenr, lock_end, ++ &cached_state); ++ ++ return ret; + } + + static noinline int copy_items(struct btrfs_trans_handle *trans, +--- a/include/trace/events/btrfs.h ++++ b/include/trace/events/btrfs.h +@@ -89,6 +89,7 @@ TRACE_DEFINE_ENUM(COMMIT_TRANS); + { IO_TREE_TRANS_DIRTY_PAGES, "TRANS_DIRTY_PAGES" }, \ + { IO_TREE_ROOT_DIRTY_LOG_PAGES, "ROOT_DIRTY_LOG_PAGES" }, \ + { IO_TREE_INODE_FILE_EXTENT, "INODE_FILE_EXTENT" }, \ ++ { IO_TREE_LOG_CSUM_RANGE, "LOG_CSUM_RANGE" }, \ + { IO_TREE_SELFTEST, "SELFTEST" }) + + #define BTRFS_GROUP_FLAGS \ diff --git a/queue-5.7/btrfs-fix-error-handling-when-submitting-direct-i-o-bio.patch b/queue-5.7/btrfs-fix-error-handling-when-submitting-direct-i-o-bio.patch new file mode 100644 index 00000000000..dd8cefb2c7b --- /dev/null +++ b/queue-5.7/btrfs-fix-error-handling-when-submitting-direct-i-o-bio.patch @@ -0,0 +1,63 @@ +From 6d3113a193e3385c72240096fe397618ecab6e43 Mon Sep 17 00:00:00 2001 +From: Omar Sandoval +Date: Thu, 16 Apr 2020 14:46:12 -0700 +Subject: btrfs: fix error handling when submitting direct I/O bio + +From: Omar Sandoval + +commit 6d3113a193e3385c72240096fe397618ecab6e43 upstream. + +In btrfs_submit_direct_hook(), if a direct I/O write doesn't span a RAID +stripe or chunk, we submit orig_bio without cloning it. In this case, we +don't increment pending_bios. Then, if btrfs_submit_dio_bio() fails, we +decrement pending_bios to -1, and we never complete orig_bio. Fix it by +initializing pending_bios to 1 instead of incrementing later. + +Fixing this exposes another bug: we put orig_bio prematurely and then +put it again from end_io. Fix it by not putting orig_bio. + +After this change, pending_bios is really more of a reference count, but +I'll leave that cleanup separate to keep the fix small. 
+ +Fixes: e65e15355429 ("btrfs: fix panic caused by direct IO") +CC: stable@vger.kernel.org # 4.4+ +Reviewed-by: Nikolay Borisov +Reviewed-by: Josef Bacik +Reviewed-by: Johannes Thumshirn +Signed-off-by: Omar Sandoval +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/inode.c | 6 +++--- + 1 file changed, 3 insertions(+), 3 deletions(-) + +--- a/fs/btrfs/inode.c ++++ b/fs/btrfs/inode.c +@@ -7939,7 +7939,6 @@ static int btrfs_submit_direct_hook(stru + + /* bio split */ + ASSERT(geom.len <= INT_MAX); +- atomic_inc(&dip->pending_bios); + do { + clone_len = min_t(int, submit_len, geom.len); + +@@ -7989,7 +7988,8 @@ submit: + if (!status) + return 0; + +- bio_put(bio); ++ if (bio != orig_bio) ++ bio_put(bio); + out_err: + dip->errors = 1; + /* +@@ -8030,7 +8030,7 @@ static void btrfs_submit_direct(struct b + bio->bi_private = dip; + dip->orig_bio = bio; + dip->dio_bio = dio_bio; +- atomic_set(&dip->pending_bios, 0); ++ atomic_set(&dip->pending_bios, 1); + io_bio = btrfs_io_bio(bio); + io_bio->logical = file_offset; + diff --git a/queue-5.7/btrfs-fix-space_info-bytes_may_use-underflow-after-nocow-buffered-write.patch b/queue-5.7/btrfs-fix-space_info-bytes_may_use-underflow-after-nocow-buffered-write.patch new file mode 100644 index 00000000000..e1841e16f07 --- /dev/null +++ b/queue-5.7/btrfs-fix-space_info-bytes_may_use-underflow-after-nocow-buffered-write.patch @@ -0,0 +1,197 @@ +From 467dc47ea99c56e966e99d09dae54869850abeeb Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Wed, 27 May 2020 11:16:07 +0100 +Subject: btrfs: fix space_info bytes_may_use underflow after nocow buffered write + +From: Filipe Manana + +commit 467dc47ea99c56e966e99d09dae54869850abeeb upstream. + +When doing a buffered write we always try to reserve data space for it, +even when the file has the NOCOW bit set or the write falls into a file +range covered by a prealloc extent. This is done both because it is +expensive to check if we can do a nocow write (checking if an extent is +shared through reflinks or if there's a hole in the range for example), +and because when writeback starts we might actually need to fallback to +COW mode (for example the block group containing the target extents was +turned into RO mode due to a scrub or balance). + +When we are unable to reserve data space we check if we can do a nocow +write, and if we can, we proceed with dirtying the pages and setting up +the range for delalloc. In this case the bytes_may_use counter of the +data space_info object is not incremented, unlike in the case where we +are able to reserve data space (done through btrfs_check_data_free_space() +which calls btrfs_alloc_data_chunk_ondemand()). + +Later when running delalloc we attempt to start writeback in nocow mode +but we might revert back to cow mode, for example because in the meanwhile +a block group was turned into RO mode by a scrub or relocation. The cow +path after successfully allocating an extent ends up calling +btrfs_add_reserved_bytes(), which expects the bytes_may_use counter of +the data space_info object to have been incremented before - but we did +not do it when the buffered write started, since there was not enough +available data space. 
So btrfs_add_reserved_bytes() ends up decrementing +the bytes_may_use counter anyway, and when the counter's current value +is smaller then the size of the allocated extent we get a stack trace +like the following: + + ------------[ cut here ]------------ + WARNING: CPU: 0 PID: 20138 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] + Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...) + CPU: 0 PID: 20138 Comm: kworker/u8:15 Not tainted 5.6.0-rc7-btrfs-next-58 #5 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 + Workqueue: writeback wb_workfn (flush-btrfs-1754) + RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] + Code: ff ff 48 (...) + RSP: 0018:ffffbda18a4b3568 EFLAGS: 00010287 + RAX: 0000000000000000 RBX: ffff9ca076f5d800 RCX: 0000000000000000 + RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9ca068470410 + RBP: fffffffffffff000 R08: 0000000000000001 R09: 0000000000000000 + R10: ffff9ca079d58040 R11: 0000000000000000 R12: ffff9ca068470400 + R13: ffff9ca0408b2000 R14: 0000000000001000 R15: ffff9ca076f5d800 + FS: 0000000000000000(0000) GS:ffff9ca07a600000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 00005605dbfe7048 CR3: 0000000138570006 CR4: 00000000003606f0 + DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + Call Trace: + find_free_extent+0x4a0/0x16c0 [btrfs] + btrfs_reserve_extent+0x91/0x180 [btrfs] + cow_file_range+0x12d/0x490 [btrfs] + run_delalloc_nocow+0x341/0xa40 [btrfs] + btrfs_run_delalloc_range+0x1ea/0x6d0 [btrfs] + ? find_lock_delalloc_range+0x221/0x250 [btrfs] + writepage_delalloc+0xe8/0x150 [btrfs] + __extent_writepage+0xe8/0x4c0 [btrfs] + extent_write_cache_pages+0x237/0x530 [btrfs] + ? btrfs_wq_submit_bio+0x9f/0xc0 [btrfs] + extent_writepages+0x44/0xa0 [btrfs] + do_writepages+0x23/0x80 + __writeback_single_inode+0x59/0x700 + writeback_sb_inodes+0x267/0x5f0 + __writeback_inodes_wb+0x87/0xe0 + wb_writeback+0x382/0x590 + ? wb_workfn+0x4a2/0x6c0 + wb_workfn+0x4a2/0x6c0 + process_one_work+0x26d/0x6a0 + worker_thread+0x4f/0x3e0 + ? process_one_work+0x6a0/0x6a0 + kthread+0x103/0x140 + ? kthread_create_worker_on_cpu+0x70/0x70 + ret_from_fork+0x3a/0x50 + irq event stamp: 0 + hardirqs last enabled at (0): [<0000000000000000>] 0x0 + hardirqs last disabled at (0): [] copy_process+0x74f/0x2020 + softirqs last enabled at (0): [] copy_process+0x74f/0x2020 + softirqs last disabled at (0): [<0000000000000000>] 0x0 + ---[ end trace f9f6ef8ec4cd8ec9 ]--- + +So to fix this, when falling back into cow mode check if space was not +reserved, by testing for the bit EXTENT_NORESERVE in the respective file +range, and if not, increment the bytes_may_use counter for the data +space_info object. Also clear the EXTENT_NORESERVE bit from the range, so +that if the cow path fails it decrements the bytes_may_use counter when +clearing the delalloc range (through the btrfs_clear_delalloc_extent() +callback). 
+ +Fixes: 7ee9e4405f264e ("Btrfs: check if we can nocow if we don't have data space") +CC: stable@vger.kernel.org # 4.4+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/inode.c | 61 ++++++++++++++++++++++++++++++++++++++++++++++++++----- + 1 file changed, 56 insertions(+), 5 deletions(-) + +--- a/fs/btrfs/inode.c ++++ b/fs/btrfs/inode.c +@@ -49,6 +49,7 @@ + #include "qgroup.h" + #include "delalloc-space.h" + #include "block-group.h" ++#include "space-info.h" + + struct btrfs_iget_args { + struct btrfs_key *location; +@@ -1355,6 +1356,56 @@ static noinline int csum_exist_in_range( + return 1; + } + ++static int fallback_to_cow(struct inode *inode, struct page *locked_page, ++ const u64 start, const u64 end, ++ int *page_started, unsigned long *nr_written) ++{ ++ struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; ++ u64 range_start = start; ++ u64 count; ++ ++ /* ++ * If EXTENT_NORESERVE is set it means that when the buffered write was ++ * made we had not enough available data space and therefore we did not ++ * reserve data space for it, since we though we could do NOCOW for the ++ * respective file range (either there is prealloc extent or the inode ++ * has the NOCOW bit set). ++ * ++ * However when we need to fallback to COW mode (because for example the ++ * block group for the corresponding extent was turned to RO mode by a ++ * scrub or relocation) we need to do the following: ++ * ++ * 1) We increment the bytes_may_use counter of the data space info. ++ * If COW succeeds, it allocates a new data extent and after doing ++ * that it decrements the space info's bytes_may_use counter and ++ * increments its bytes_reserved counter by the same amount (we do ++ * this at btrfs_add_reserved_bytes()). So we need to increment the ++ * bytes_may_use counter to compensate (when space is reserved at ++ * buffered write time, the bytes_may_use counter is incremented); ++ * ++ * 2) We clear the EXTENT_NORESERVE bit from the range. We do this so ++ * that if the COW path fails for any reason, it decrements (through ++ * extent_clear_unlock_delalloc()) the bytes_may_use counter of the ++ * data space info, which we incremented in the step above. ++ */ ++ count = count_range_bits(io_tree, &range_start, end, end + 1 - start, ++ EXTENT_NORESERVE, 0); ++ if (count > 0) { ++ struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; ++ struct btrfs_space_info *sinfo = fs_info->data_sinfo; ++ ++ spin_lock(&sinfo->lock); ++ btrfs_space_info_update_bytes_may_use(fs_info, sinfo, count); ++ spin_unlock(&sinfo->lock); ++ ++ clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, 0, 0, ++ NULL); ++ } ++ ++ return cow_file_range(inode, locked_page, start, end, page_started, ++ nr_written, 1); ++} ++ + /* + * when nowcow writeback call back. This checks for snapshots or COW copies + * of the extents that exist in the file, and COWs the file as required. 
+@@ -1602,9 +1653,9 @@ out_check: + * NOCOW, following one which needs to be COW'ed + */ + if (cow_start != (u64)-1) { +- ret = cow_file_range(inode, locked_page, +- cow_start, found_key.offset - 1, +- page_started, nr_written, 1); ++ ret = fallback_to_cow(inode, locked_page, cow_start, ++ found_key.offset - 1, ++ page_started, nr_written); + if (ret) { + if (nocow) + btrfs_dec_nocow_writers(fs_info, +@@ -1693,8 +1744,8 @@ out_check: + + if (cow_start != (u64)-1) { + cur_offset = end; +- ret = cow_file_range(inode, locked_page, cow_start, end, +- page_started, nr_written, 1); ++ ret = fallback_to_cow(inode, locked_page, cow_start, end, ++ page_started, nr_written); + if (ret) + goto error; + } diff --git a/queue-5.7/btrfs-fix-space_info-bytes_may_use-underflow-during-space-cache-writeout.patch b/queue-5.7/btrfs-fix-space_info-bytes_may_use-underflow-during-space-cache-writeout.patch new file mode 100644 index 00000000000..4339934d574 --- /dev/null +++ b/queue-5.7/btrfs-fix-space_info-bytes_may_use-underflow-during-space-cache-writeout.patch @@ -0,0 +1,146 @@ +From 2166e5edce9ac1edf3b113d6091ef72fcac2d6c4 Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Wed, 27 May 2020 11:16:19 +0100 +Subject: btrfs: fix space_info bytes_may_use underflow during space cache writeout + +From: Filipe Manana + +commit 2166e5edce9ac1edf3b113d6091ef72fcac2d6c4 upstream. + +We always preallocate a data extent for writing a free space cache, which +causes writeback to always try the nocow path first, since the free space +inode has the prealloc bit set in its flags. + +However if the block group that contains the data extent for the space +cache has been turned to RO mode due to a running scrub or balance for +example, we have to fallback to the cow path. In that case once a new data +extent is allocated we end up calling btrfs_add_reserved_bytes(), which +decrements the counter named bytes_may_use from the data space_info object +with the expection that this counter was previously incremented with the +same amount (the size of the data extent). + +However when we started writeout of the space cache at cache_save_setup(), +we incremented the value of the bytes_may_use counter through a call to +btrfs_check_data_free_space() and then decremented it through a call to +btrfs_prealloc_file_range_trans() immediately after. So when starting the +writeback if we fallback to cow mode we have to increment the counter +bytes_may_use of the data space_info again to compensate for the extent +allocation done by the cow path. + +When this issue happens we are incorrectly decrementing the bytes_may_use +counter and when its current value is smaller then the amount we try to +subtract we end up with the following warning: + + ------------[ cut here ]------------ + WARNING: CPU: 3 PID: 657 at fs/btrfs/space-info.h:115 btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] + Modules linked in: btrfs blake2b_generic xor raid6_pq libcrc32c (...) + CPU: 3 PID: 657 Comm: kworker/u8:7 Tainted: G W 5.6.0-rc7-btrfs-next-58 #5 + Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 + Workqueue: writeback wb_workfn (flush-btrfs-1591) + RIP: 0010:btrfs_add_reserved_bytes+0x3d6/0x4e0 [btrfs] + Code: ff ff 48 (...) 
+ RSP: 0000:ffffa41608f13660 EFLAGS: 00010287 + RAX: 0000000000001000 RBX: ffff9615b93ae400 RCX: 0000000000000000 + RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff9615b96ab410 + RBP: fffffffffffee000 R08: 0000000000000001 R09: 0000000000000000 + R10: ffff961585e62a40 R11: 0000000000000000 R12: ffff9615b96ab400 + R13: ffff9615a1a2a000 R14: 0000000000012000 R15: ffff9615b93ae400 + FS: 0000000000000000(0000) GS:ffff9615bb200000(0000) knlGS:0000000000000000 + CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 + CR2: 000055cbbc2ae178 CR3: 0000000115794006 CR4: 00000000003606e0 + DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 + DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 + Call Trace: + find_free_extent+0x4a0/0x16c0 [btrfs] + btrfs_reserve_extent+0x91/0x180 [btrfs] + cow_file_range+0x12d/0x490 [btrfs] + btrfs_run_delalloc_range+0x9f/0x6d0 [btrfs] + ? find_lock_delalloc_range+0x221/0x250 [btrfs] + writepage_delalloc+0xe8/0x150 [btrfs] + __extent_writepage+0xe8/0x4c0 [btrfs] + extent_write_cache_pages+0x237/0x530 [btrfs] + extent_writepages+0x44/0xa0 [btrfs] + do_writepages+0x23/0x80 + __writeback_single_inode+0x59/0x700 + writeback_sb_inodes+0x267/0x5f0 + __writeback_inodes_wb+0x87/0xe0 + wb_writeback+0x382/0x590 + ? wb_workfn+0x4a2/0x6c0 + wb_workfn+0x4a2/0x6c0 + process_one_work+0x26d/0x6a0 + worker_thread+0x4f/0x3e0 + ? process_one_work+0x6a0/0x6a0 + kthread+0x103/0x140 + ? kthread_create_worker_on_cpu+0x70/0x70 + ret_from_fork+0x3a/0x50 + irq event stamp: 0 + hardirqs last enabled at (0): [<0000000000000000>] 0x0 + hardirqs last disabled at (0): [] copy_process+0x74f/0x2020 + softirqs last enabled at (0): [] copy_process+0x74f/0x2020 + softirqs last disabled at (0): [<0000000000000000>] 0x0 + ---[ end trace bd7c03622e0b0a52 ]--- + ------------[ cut here ]------------ + +So fix this by incrementing the bytes_may_use counter of the data +space_info when we fallback to the cow path. If the cow path is successful +the counter is decremented after extent allocation (by +btrfs_add_reserved_bytes()), if it fails it ends up being decremented as +well when clearing the delalloc range (extent_clear_unlock_delalloc()). + +This could be triggered sporadically by the test case btrfs/061 from +fstests. + +Fixes: 82d5902d9c681b ("Btrfs: Support reading/writing on disk free ino cache") +CC: stable@vger.kernel.org # 4.4+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/inode.c | 20 +++++++++++++++----- + 1 file changed, 15 insertions(+), 5 deletions(-) + +--- a/fs/btrfs/inode.c ++++ b/fs/btrfs/inode.c +@@ -1360,6 +1360,8 @@ static int fallback_to_cow(struct inode + const u64 start, const u64 end, + int *page_started, unsigned long *nr_written) + { ++ const bool is_space_ino = btrfs_is_free_space_inode(BTRFS_I(inode)); ++ const u64 range_bytes = end + 1 - start; + struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; + u64 range_start = start; + u64 count; +@@ -1387,19 +1389,27 @@ static int fallback_to_cow(struct inode + * that if the COW path fails for any reason, it decrements (through + * extent_clear_unlock_delalloc()) the bytes_may_use counter of the + * data space info, which we incremented in the step above. ++ * ++ * If we need to fallback to cow and the inode corresponds to a free ++ * space cache inode, we must also increment bytes_may_use of the data ++ * space_info for the same reason. 
Space caches always get a prealloc ++ * extent for them, however scrub or balance may have set the block ++ * group that contains that extent to RO mode. + */ +- count = count_range_bits(io_tree, &range_start, end, end + 1 - start, ++ count = count_range_bits(io_tree, &range_start, end, range_bytes, + EXTENT_NORESERVE, 0); +- if (count > 0) { ++ if (count > 0 || is_space_ino) { ++ const u64 bytes = is_space_ino ? range_bytes : count; + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; + struct btrfs_space_info *sinfo = fs_info->data_sinfo; + + spin_lock(&sinfo->lock); +- btrfs_space_info_update_bytes_may_use(fs_info, sinfo, count); ++ btrfs_space_info_update_bytes_may_use(fs_info, sinfo, bytes); + spin_unlock(&sinfo->lock); + +- clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, 0, 0, +- NULL); ++ if (count > 0) ++ clear_extent_bit(io_tree, start, end, EXTENT_NORESERVE, ++ 0, 0, NULL); + } + + return cow_file_range(inode, locked_page, start, end, page_started, diff --git a/queue-5.7/btrfs-fix-wrong-file-range-cleanup-after-an-error-filling-dealloc-range.patch b/queue-5.7/btrfs-fix-wrong-file-range-cleanup-after-an-error-filling-dealloc-range.patch new file mode 100644 index 00000000000..b75bc5a5967 --- /dev/null +++ b/queue-5.7/btrfs-fix-wrong-file-range-cleanup-after-an-error-filling-dealloc-range.patch @@ -0,0 +1,39 @@ +From e2c8e92d1140754073ad3799eb6620c76bab2078 Mon Sep 17 00:00:00 2001 +From: Filipe Manana +Date: Wed, 27 May 2020 11:15:53 +0100 +Subject: btrfs: fix wrong file range cleanup after an error filling dealloc range + +From: Filipe Manana + +commit e2c8e92d1140754073ad3799eb6620c76bab2078 upstream. + +If an error happens while running dellaloc in COW mode for a range, we can +end up calling extent_clear_unlock_delalloc() for a range that goes beyond +our range's end offset by 1 byte, which affects 1 extra page. This results +in clearing bits and doing page operations (such as a page unlock) outside +our target range. + +Fix that by calling extent_clear_unlock_delalloc() with an inclusive end +offset, instead of an exclusive end offset, at cow_file_range(). + +Fixes: a315e68f6e8b30 ("Btrfs: fix invalid attempt to free reserved space on failure to cow range") +CC: stable@vger.kernel.org # 4.14+ +Signed-off-by: Filipe Manana +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/inode.c | 2 +- + 1 file changed, 1 insertion(+), 1 deletion(-) + +--- a/fs/btrfs/inode.c ++++ b/fs/btrfs/inode.c +@@ -1142,7 +1142,7 @@ out_unlock: + */ + if (extent_reserved) { + extent_clear_unlock_delalloc(inode, start, +- start + cur_alloc_size, ++ start + cur_alloc_size - 1, + locked_page, + clear_bits, + page_ops); diff --git a/queue-5.7/btrfs-force-chunk-allocation-if-our-global-rsv-is-larger-than-metadata.patch b/queue-5.7/btrfs-force-chunk-allocation-if-our-global-rsv-is-larger-than-metadata.patch new file mode 100644 index 00000000000..0a877ff412a --- /dev/null +++ b/queue-5.7/btrfs-force-chunk-allocation-if-our-global-rsv-is-larger-than-metadata.patch @@ -0,0 +1,119 @@ +From 9c343784c4328781129bcf9e671645f69fe4b38a Mon Sep 17 00:00:00 2001 +From: Josef Bacik +Date: Fri, 13 Mar 2020 15:28:48 -0400 +Subject: btrfs: force chunk allocation if our global rsv is larger than metadata + +From: Josef Bacik + +commit 9c343784c4328781129bcf9e671645f69fe4b38a upstream. + +Nikolay noticed a bunch of test failures with my global rsv steal +patches. At first he thought they were introduced by them, but they've +been failing for a while with 64k nodes. 
+ +The problem is with 64k nodes we have a global reserve that calculates +out to 13MiB on a freshly made file system, which only has 8MiB of +metadata space. Because of changes I previously made we no longer +account for the global reserve in the overcommit logic, which means we +correctly allow overcommit to happen even though we are already +overcommitted. + +However in some corner cases, for example btrfs/170, we will allocate +the entire file system up with data chunks before we have enough space +pressure to allocate a metadata chunk. Then once the fs is full we +ENOSPC out because we cannot overcommit and the global reserve is taking +up all of the available space. + +The most ideal way to deal with this is to change our space reservation +stuff to take into account the height of the tree's that we're +modifying, so that our global reserve calculation does not end up so +obscenely large. + +However that is a huge undertaking. Instead fix this by forcing a chunk +allocation if the global reserve is larger than the total metadata +space. This gives us essentially the same behavior that happened +before, we get a chunk allocated and these tests can pass. + +This is meant to be a stop-gap measure until we can tackle the "tree +height only" project. + +Fixes: 0096420adb03 ("btrfs: do not account global reserve in can_overcommit") +CC: stable@vger.kernel.org # 5.4+ +Reviewed-by: Nikolay Borisov +Tested-by: Nikolay Borisov +Signed-off-by: Josef Bacik +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/block-rsv.c | 3 +++ + fs/btrfs/transaction.c | 18 ++++++++++++++++++ + 2 files changed, 21 insertions(+) + +--- a/fs/btrfs/block-rsv.c ++++ b/fs/btrfs/block-rsv.c +@@ -5,6 +5,7 @@ + #include "block-rsv.h" + #include "space-info.h" + #include "transaction.h" ++#include "block-group.h" + + /* + * HOW DO BLOCK RESERVES WORK +@@ -405,6 +406,8 @@ void btrfs_update_global_block_rsv(struc + else + block_rsv->full = 0; + ++ if (block_rsv->size >= sinfo->total_bytes) ++ sinfo->force_alloc = CHUNK_ALLOC_FORCE; + spin_unlock(&block_rsv->lock); + spin_unlock(&sinfo->lock); + } +--- a/fs/btrfs/transaction.c ++++ b/fs/btrfs/transaction.c +@@ -21,6 +21,7 @@ + #include "dev-replace.h" + #include "qgroup.h" + #include "block-group.h" ++#include "space-info.h" + + #define BTRFS_ROOT_TRANS_TAG 0 + +@@ -523,6 +524,7 @@ start_transaction(struct btrfs_root *roo + u64 num_bytes = 0; + u64 qgroup_reserved = 0; + bool reloc_reserved = false; ++ bool do_chunk_alloc = false; + int ret; + + /* Send isn't supposed to start transactions. */ +@@ -585,6 +587,9 @@ start_transaction(struct btrfs_root *roo + delayed_refs_bytes); + num_bytes -= delayed_refs_bytes; + } ++ ++ if (rsv->space_info->force_alloc) ++ do_chunk_alloc = true; + } else if (num_items == 0 && flush == BTRFS_RESERVE_FLUSH_ALL && + !delayed_refs_rsv->full) { + /* +@@ -667,6 +672,19 @@ got_it: + current->journal_info = h; + + /* ++ * If the space_info is marked ALLOC_FORCE then we'll get upgraded to ++ * ALLOC_FORCE the first run through, and then we won't allocate for ++ * anybody else who races in later. We don't care about the return ++ * value here. ++ */ ++ if (do_chunk_alloc && num_bytes) { ++ u64 flags = h->block_rsv->space_info->flags; ++ ++ btrfs_chunk_alloc(h, btrfs_get_alloc_profile(fs_info, flags), ++ CHUNK_ALLOC_NO_FORCE); ++ } ++ ++ /* + * btrfs_record_root_in_trans() needs to alloc new extents, and may + * call btrfs_join_transaction() while we're also starting a + * transaction. 
diff --git a/queue-5.7/btrfs-free-alien-device-after-device-add.patch b/queue-5.7/btrfs-free-alien-device-after-device-add.patch new file mode 100644 index 00000000000..078fccdee9f --- /dev/null +++ b/queue-5.7/btrfs-free-alien-device-after-device-add.patch @@ -0,0 +1,63 @@ +From 7f551d969037cc128eca60688d9c5a300d84e665 Mon Sep 17 00:00:00 2001 +From: Anand Jain +Date: Tue, 5 May 2020 02:58:26 +0800 +Subject: btrfs: free alien device after device add + +From: Anand Jain + +commit 7f551d969037cc128eca60688d9c5a300d84e665 upstream. + +When an old device has new fsid through 'btrfs device add -f ' our +fs_devices list has an alien device in one of the fs_devices lists. + +By having an alien device in fs_devices, we have two issues so far + +1. missing device does not not show as missing in the userland + +2. degraded mount will fail + +Both issues are caused by the fact that there's an alien device in the +fs_devices list. (Alien means that it does not belong to the filesystem, +identified by fsid, or does not contain btrfs filesystem at all, eg. due +to overwrite). + +A device can be scanned/added through the control device ioctls +SCAN_DEV, DEVICES_READY or by ADD_DEV. + +And device coming through the control device is checked against the all +other devices in the lists, but this was not the case for ADD_DEV. + +This patch fixes both issues above by removing the alien device. + +CC: stable@vger.kernel.org # 5.4+ +Signed-off-by: Anand Jain +Reviewed-by: David Sterba +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/volumes.c | 12 +++++++++++- + 1 file changed, 11 insertions(+), 1 deletion(-) + +--- a/fs/btrfs/volumes.c ++++ b/fs/btrfs/volumes.c +@@ -2663,8 +2663,18 @@ int btrfs_init_new_device(struct btrfs_f + ret = btrfs_commit_transaction(trans); + } + +- /* Update ctime/mtime for libblkid */ ++ /* ++ * Now that we have written a new super block to this device, check all ++ * other fs_devices list if device_path alienates any other scanned ++ * device. ++ * We can ignore the return value as it typically returns -EINVAL and ++ * only succeeds if the device was an alien. ++ */ ++ btrfs_forget_devices(device_path); ++ ++ /* Update ctime/mtime for blkid or udev */ + update_dev_time(device_path); ++ + return ret; + + error_sysfs: diff --git a/queue-5.7/btrfs-include-non-missing-as-a-qualifier-for-the-latest_bdev.patch b/queue-5.7/btrfs-include-non-missing-as-a-qualifier-for-the-latest_bdev.patch new file mode 100644 index 00000000000..a24a4abff13 --- /dev/null +++ b/queue-5.7/btrfs-include-non-missing-as-a-qualifier-for-the-latest_bdev.patch @@ -0,0 +1,77 @@ +From 998a0671961f66e9fad4990ed75f80ba3088c2f1 Mon Sep 17 00:00:00 2001 +From: Anand Jain +Date: Tue, 5 May 2020 02:58:25 +0800 +Subject: btrfs: include non-missing as a qualifier for the latest_bdev + +From: Anand Jain + +commit 998a0671961f66e9fad4990ed75f80ba3088c2f1 upstream. + +btrfs_free_extra_devids() updates fs_devices::latest_bdev to point to +the bdev with greatest device::generation number. For a typical-missing +device the generation number is zero so fs_devices::latest_bdev will +never point to it. + +But if the missing device is due to alienation [1], then +device::generation is not zero and if it is greater or equal to the rest +of device generations in the list, then fs_devices::latest_bdev ends up +pointing to the missing device and reports the error like [2]. 
+ +[1] We maintain devices of a fsid (as in fs_device::fsid) in the +fs_devices::devices list, a device is considered as an alien device +if its fsid does not match with the fs_device::fsid + +Consider a working filesystem with raid1: + + $ mkfs.btrfs -f -d raid1 -m raid1 /dev/sda /dev/sdb + $ mount /dev/sda /mnt-raid1 + $ umount /mnt-raid1 + +While mnt-raid1 was unmounted the user force-adds one of its devices to +another btrfs filesystem: + + $ mkfs.btrfs -f /dev/sdc + $ mount /dev/sdc /mnt-single + $ btrfs dev add -f /dev/sda /mnt-single + +Now the original mnt-raid1 fails to mount in degraded mode, because +fs_devices::latest_bdev is pointing to the alien device. + + $ mount -o degraded /dev/sdb /mnt-raid1 + +[2] +mount: wrong fs type, bad option, bad superblock on /dev/sdb, + missing codepage or helper program, or other error + + In some cases useful info is found in syslog - try + dmesg | tail or so. + + kernel: BTRFS warning (device sdb): devid 1 uuid 072a0192-675b-4d5a-8640-a5cf2b2c704d is missing + kernel: BTRFS error (device sdb): failed to read devices + kernel: BTRFS error (device sdb): open_ctree failed + +Fix the root cause by checking if the device is not missing before it +can be considered for the fs_devices::latest_bdev. + +CC: stable@vger.kernel.org # 4.19+ +Reviewed-by: Josef Bacik +Signed-off-by: Anand Jain +Reviewed-by: David Sterba +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/volumes.c | 2 ++ + 1 file changed, 2 insertions(+) + +--- a/fs/btrfs/volumes.c ++++ b/fs/btrfs/volumes.c +@@ -1042,6 +1042,8 @@ again: + &device->dev_state)) { + if (!test_bit(BTRFS_DEV_STATE_REPLACE_TGT, + &device->dev_state) && ++ !test_bit(BTRFS_DEV_STATE_MISSING, ++ &device->dev_state) && + (!latest_dev || + device->generation > latest_dev->generation)) { + latest_dev = device; diff --git a/queue-5.7/btrfs-reloc-fix-reloc-root-leak-and-null-pointer-dereference.patch b/queue-5.7/btrfs-reloc-fix-reloc-root-leak-and-null-pointer-dereference.patch new file mode 100644 index 00000000000..b663bf7c1a9 --- /dev/null +++ b/queue-5.7/btrfs-reloc-fix-reloc-root-leak-and-null-pointer-dereference.patch @@ -0,0 +1,134 @@ +From 51415b6c1b117e223bc083e30af675cb5c5498f3 Mon Sep 17 00:00:00 2001 +From: Qu Wenruo +Date: Tue, 19 May 2020 10:13:20 +0800 +Subject: btrfs: reloc: fix reloc root leak and NULL pointer dereference + +From: Qu Wenruo + +commit 51415b6c1b117e223bc083e30af675cb5c5498f3 upstream. + +[BUG] +When balance is canceled, there is a pretty high chance that unmounting +the fs can lead to lead the NULL pointer dereference: + + BTRFS warning (device dm-3): page private not zero on page 223158272 + ... 
+ BTRFS warning (device dm-3): page private not zero on page 223162368 + BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1 + BUG: kernel NULL pointer dereference, address: 0000000000000168 + #PF: supervisor read access in kernel mode + #PF: error_code(0x0000) - not-present page + PGD 0 P4D 0 + Oops: 0000 [#1] PREEMPT SMP NOPTI + CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53 + Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 + RIP: 0010:__lock_acquire+0x5dc/0x24c0 + Call Trace: + lock_acquire+0xab/0x390 + _raw_spin_lock+0x39/0x80 + btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs] + release_extent_buffer+0xb2/0x170 [btrfs] + free_extent_buffer+0x66/0xb0 [btrfs] + btrfs_put_root+0x8e/0x130 [btrfs] + btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs] + btrfs_free_fs_info+0xe5/0x120 [btrfs] + btrfs_kill_super+0x1f/0x30 [btrfs] + deactivate_locked_super+0x3b/0x80 + deactivate_super+0x3e/0x50 + cleanup_mnt+0x109/0x160 + __cleanup_mnt+0x12/0x20 + task_work_run+0x67/0xa0 + exit_to_usermode_loop+0xc5/0xd0 + syscall_return_slowpath+0x205/0x360 + do_syscall_64+0x6e/0xb0 + entry_SYSCALL_64_after_hwframe+0x49/0xb3 + RIP: 0033:0x7fd028ef740b + +[CAUSE] +When balance is canceled, all reloc roots are marked as orphan, and +orphan reloc roots are going to be cleaned up. + +However for orphan reloc roots and merged reloc roots, their lifespan +are quite different: + + Merged reloc roots | Orphan reloc roots by cancel +-------------------------------------------------------------------- +create_reloc_root() | create_reloc_root() +|- refs == 1 | |- refs == 1 + | +btrfs_grab_root(reloc_root); | btrfs_grab_root(reloc_root); +|- refs == 2 | |- refs == 2 + | +root->reloc_root = reloc_root; | root->reloc_root = reloc_root; + >>> No difference so far <<< + | +prepare_to_merge() | prepare_to_merge() +|- btrfs_set_root_refs(item, 1);| |- if (!err) (err == -EINTR) + | +merge_reloc_roots() | merge_reloc_roots() +|- merge_reloc_root() | |- Doing nothing to put reloc root + |- insert_dirty_subvol() | |- refs == 2 + |- __del_reloc_root() | + |- btrfs_put_root() | + |- refs == 1 | + >>> Now orphan reloc roots still have refs 2 <<< + | +clean_dirty_subvols() | clean_dirty_subvols() +|- btrfs_drop_snapshot() | |- btrfS_drop_snapshot() + |- reloc_root get freed | |- reloc_root still has refs 2 + | related ebs get freed, but + | reloc_root still recorded in + | allocated_roots +btrfs_check_leaked_roots() | btrfs_check_leaked_roots() +|- No leaked roots | |- Leaked reloc_roots detected + | |- btrfs_put_root() + | |- free_extent_buffer(root->node); + | |- eb already freed, caused NULL + | pointer dereference + +[FIX] +The fix is to clear fs_root->reloc_root and put it at +merge_reloc_roots() time, so that we won't leak reloc roots. 
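+
+For the orphan branch of merge_reloc_roots() this amounts to roughly the
+following (a simplified preview of the hunk further down, not a
+replacement for it):
+
+	root = read_fs_root(fs_info, reloc_root->root_key.offset);
+	if (btrfs_root_refs(&reloc_root->root_item) > 0) {
+		/* Merged reloc root, handled by merge_reloc_root() as before */
+	} else {
+		if (!IS_ERR(root)) {
+			if (root->reloc_root == reloc_root) {
+				/* Drop the reference held via root->reloc_root */
+				root->reloc_root = NULL;
+				btrfs_put_root(reloc_root);
+			}
+			/* Drop the reference taken by read_fs_root() above */
+			btrfs_put_root(root);
+		}
+		/* The orphan reloc root is then queued for cleanup as before */
+	}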
+ +Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots") +CC: stable@vger.kernel.org # 5.1+ +Tested-by: Johannes Thumshirn +Signed-off-by: Qu Wenruo +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/relocation.c | 12 +++++++++--- + 1 file changed, 9 insertions(+), 3 deletions(-) + +--- a/fs/btrfs/relocation.c ++++ b/fs/btrfs/relocation.c +@@ -2624,12 +2624,10 @@ again: + reloc_root = list_entry(reloc_roots.next, + struct btrfs_root, root_list); + ++ root = read_fs_root(fs_info, reloc_root->root_key.offset); + if (btrfs_root_refs(&reloc_root->root_item) > 0) { +- root = read_fs_root(fs_info, +- reloc_root->root_key.offset); + BUG_ON(IS_ERR(root)); + BUG_ON(root->reloc_root != reloc_root); +- + ret = merge_reloc_root(rc, root); + btrfs_put_root(root); + if (ret) { +@@ -2639,6 +2637,14 @@ again: + goto out; + } + } else { ++ if (!IS_ERR(root)) { ++ if (root->reloc_root == reloc_root) { ++ root->reloc_root = NULL; ++ btrfs_put_root(reloc_root); ++ } ++ btrfs_put_root(root); ++ } ++ + list_del_init(&reloc_root->root_list); + /* Don't forget to queue this reloc root for cleanup */ + list_add_tail(&reloc_root->reloc_dirty_list, diff --git a/queue-5.7/btrfs-send-emit-file-capabilities-after-chown.patch b/queue-5.7/btrfs-send-emit-file-capabilities-after-chown.patch new file mode 100644 index 00000000000..df51c7022cd --- /dev/null +++ b/queue-5.7/btrfs-send-emit-file-capabilities-after-chown.patch @@ -0,0 +1,154 @@ +From 89efda52e6b6930f80f5adda9c3c9edfb1397191 Mon Sep 17 00:00:00 2001 +From: Marcos Paulo de Souza +Date: Sun, 10 May 2020 23:15:07 -0300 +Subject: btrfs: send: emit file capabilities after chown + +From: Marcos Paulo de Souza + +commit 89efda52e6b6930f80f5adda9c3c9edfb1397191 upstream. + +Whenever a chown is executed, all capabilities of the file being touched +are lost. When doing incremental send with a file with capabilities, +there is a situation where the capability can be lost on the receiving +side. The sequence of actions bellow shows the problem: + + $ mount /dev/sda fs1 + $ mount /dev/sdb fs2 + + $ touch fs1/foo.bar + $ setcap cap_sys_nice+ep fs1/foo.bar + $ btrfs subvolume snapshot -r fs1 fs1/snap_init + $ btrfs send fs1/snap_init | btrfs receive fs2 + + $ chgrp adm fs1/foo.bar + $ setcap cap_sys_nice+ep fs1/foo.bar + + $ btrfs subvolume snapshot -r fs1 fs1/snap_complete + $ btrfs subvolume snapshot -r fs1 fs1/snap_incremental + + $ btrfs send fs1/snap_complete | btrfs receive fs2 + $ btrfs send -p fs1/snap_init fs1/snap_incremental | btrfs receive fs2 + +At this point, only a chown was emitted by "btrfs send" since only the +group was changed. This makes the cap_sys_nice capability to be dropped +from fs2/snap_incremental/foo.bar + +To fix that, only emit capabilities after chown is emitted. The current +code first checks for xattrs that are new/changed, emits them, and later +emit the chown. Now, __process_new_xattr skips capabilities, letting +only finish_inode_if_needed to emit them, if they exist, for the inode +being processed. + +This behavior was being worked around in "btrfs receive" side by caching +the capability and only applying it after chown. Now, xattrs are only +emmited _after_ chown, making that workaround not needed anymore. 
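+
+File capabilities live in the security.capability extended attribute
+(XATTR_NAME_CAPS in the kernel), which is why the chown emitted in the
+stream clears them on the receiving side and why they are now re-emitted
+afterwards. The following standalone userspace check (not part of this
+patch, only an illustration) can be used on the receive side to see
+whether a file still carries that xattr:
+
+	/* caps-check.c: report whether a file carries security.capability */
+	#include <errno.h>
+	#include <stdio.h>
+	#include <string.h>
+	#include <sys/xattr.h>
+
+	int main(int argc, char **argv)
+	{
+		char buf[1024];
+		ssize_t len;
+
+		if (argc != 2) {
+			fprintf(stderr, "usage: %s <file>\n", argv[0]);
+			return 1;
+		}
+
+		/* The same xattr the new send_capabilities() helper looks up */
+		len = getxattr(argv[1], "security.capability", buf, sizeof(buf));
+		if (len < 0) {
+			printf("%s: no security.capability xattr (%s)\n",
+			       argv[1], strerror(errno));
+			return 1;
+		}
+		printf("%s: security.capability present, %zd bytes\n",
+		       argv[1], len);
+		return 0;
+	}
+
+Running it (or getcap from libcap) against fs2/snap_incremental/foo.bar
+after the incremental receive, with a btrfs receive that does not carry
+the old userspace workaround, shows the xattr missing without this patch
+and present with it.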
+ +Link: https://github.com/kdave/btrfs-progs/issues/202 +CC: stable@vger.kernel.org # 4.4+ +Suggested-by: Filipe Manana +Reviewed-by: Filipe Manana +Signed-off-by: Marcos Paulo de Souza +Signed-off-by: David Sterba +Signed-off-by: Greg Kroah-Hartman + +--- + fs/btrfs/send.c | 67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++ + 1 file changed, 67 insertions(+) + +--- a/fs/btrfs/send.c ++++ b/fs/btrfs/send.c +@@ -23,6 +23,7 @@ + #include "btrfs_inode.h" + #include "transaction.h" + #include "compression.h" ++#include "xattr.h" + + /* + * Maximum number of references an extent can have in order for us to attempt to +@@ -4545,6 +4546,10 @@ static int __process_new_xattr(int num, + struct fs_path *p; + struct posix_acl_xattr_header dummy_acl; + ++ /* Capabilities are emitted by finish_inode_if_needed */ ++ if (!strncmp(name, XATTR_NAME_CAPS, name_len)) ++ return 0; ++ + p = fs_path_alloc(); + if (!p) + return -ENOMEM; +@@ -5107,6 +5112,64 @@ static int send_extent_data(struct send_ + return 0; + } + ++/* ++ * Search for a capability xattr related to sctx->cur_ino. If the capability is ++ * found, call send_set_xattr function to emit it. ++ * ++ * Return 0 if there isn't a capability, or when the capability was emitted ++ * successfully, or < 0 if an error occurred. ++ */ ++static int send_capabilities(struct send_ctx *sctx) ++{ ++ struct fs_path *fspath = NULL; ++ struct btrfs_path *path; ++ struct btrfs_dir_item *di; ++ struct extent_buffer *leaf; ++ unsigned long data_ptr; ++ char *buf = NULL; ++ int buf_len; ++ int ret = 0; ++ ++ path = alloc_path_for_send(); ++ if (!path) ++ return -ENOMEM; ++ ++ di = btrfs_lookup_xattr(NULL, sctx->send_root, path, sctx->cur_ino, ++ XATTR_NAME_CAPS, strlen(XATTR_NAME_CAPS), 0); ++ if (!di) { ++ /* There is no xattr for this inode */ ++ goto out; ++ } else if (IS_ERR(di)) { ++ ret = PTR_ERR(di); ++ goto out; ++ } ++ ++ leaf = path->nodes[0]; ++ buf_len = btrfs_dir_data_len(leaf, di); ++ ++ fspath = fs_path_alloc(); ++ buf = kmalloc(buf_len, GFP_KERNEL); ++ if (!fspath || !buf) { ++ ret = -ENOMEM; ++ goto out; ++ } ++ ++ ret = get_cur_path(sctx, sctx->cur_ino, sctx->cur_inode_gen, fspath); ++ if (ret < 0) ++ goto out; ++ ++ data_ptr = (unsigned long)(di + 1) + btrfs_dir_name_len(leaf, di); ++ read_extent_buffer(leaf, buf, data_ptr, buf_len); ++ ++ ret = send_set_xattr(sctx, fspath, XATTR_NAME_CAPS, ++ strlen(XATTR_NAME_CAPS), buf, buf_len); ++out: ++ kfree(buf); ++ fs_path_free(fspath); ++ btrfs_free_path(path); ++ return ret; ++} ++ + static int clone_range(struct send_ctx *sctx, + struct clone_root *clone_root, + const u64 disk_byte, +@@ -5972,6 +6035,10 @@ static int finish_inode_if_needed(struct + goto out; + } + ++ ret = send_capabilities(sctx); ++ if (ret < 0) ++ goto out; ++ + /* + * If other directory inodes depended on our current directory + * inode's move/rename, now do their move/rename operations. 
diff --git a/queue-5.7/series b/queue-5.7/series index 78e56fe6d65..b04a7e18d75 100644 --- a/queue-5.7/series +++ b/queue-5.7/series @@ -253,3 +253,14 @@ bpf-fix-up-bpf_skb_adjust_room-helper-s-skb-csum-set.patch s390-bpf-maintain-8-byte-stack-alignment.patch kasan-stop-tests-being-eliminated-as-dead-code-with-.patch string.h-fix-incompatibility-between-fortify_source-.patch +btrfs-free-alien-device-after-device-add.patch +btrfs-include-non-missing-as-a-qualifier-for-the-latest_bdev.patch +btrfs-fix-a-race-between-scrub-and-block-group-removal-allocation.patch +btrfs-send-emit-file-capabilities-after-chown.patch +btrfs-force-chunk-allocation-if-our-global-rsv-is-larger-than-metadata.patch +btrfs-reloc-fix-reloc-root-leak-and-null-pointer-dereference.patch +btrfs-fix-error-handling-when-submitting-direct-i-o-bio.patch +btrfs-fix-corrupt-log-due-to-concurrent-fsync-of-inodes-with-shared-extents.patch +btrfs-fix-wrong-file-range-cleanup-after-an-error-filling-dealloc-range.patch +btrfs-fix-space_info-bytes_may_use-underflow-after-nocow-buffered-write.patch +btrfs-fix-space_info-bytes_may_use-underflow-during-space-cache-writeout.patch