From: Filipe Manana
Date: Thu, 8 Jan 2026 16:16:38 +0000 (+0000)
Subject: btrfs: invalidate pages instead of truncate after reflinking
X-Git-Tag: v6.19-rc6~15^2~2
X-Git-Url: http://git.ipfire.org/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=882680774933fd276023e01cf0261c2350d7201e;p=thirdparty%2Flinux.git

btrfs: invalidate pages instead of truncate after reflinking

Qu reported that generic/164 often fails because the read operations get
zeroes when they expect to get all bytes with a value of either 0x61 or
0x62.

The issue stems from truncating the pages from the page cache instead of
invalidating them, as truncating can zero page contents. This zeroing
happens not only when the range is not page sized (as noted in a comment
in truncate_inode_pages_range()) but also when we are using large
folios: they need to be split, and the splitting can fail.

Stealing Qu's comment in the thread linked below:

"We can have the following case:

   0           4K          8K          12K         16K
   |           |           |           |           |
   |<------ Extent A ----->|<------ Extent B ----->|

 The page size is still 4K, but the folio we got is 16K.

 Then if we remap the range [8K, 16K), truncate_inode_pages_range() will
 get the large folio at index 0, sized 16K, and call
 truncate_inode_partial_folio(). That in turn calls folio_zero_range()
 for the [8K, 16K) range first, then tries to split the folio into
 smaller ones to properly drop them from the cache.

 But if the splitting fails (e.g. racing with other operations holding
 the filemap lock), the partially zeroed large folio will be kept,
 resulting in the range [8K, 16K) being zeroed while the folio is still
 a 16K sized large one."

So instead of truncating, invalidate the page cache range with a call to
filemap_invalidate_inode(), which besides not doing any zeroing also
ensures that, while it is invalidating folios, no new folios are added.
This helps ensure that buffered reads that happen while a reflink
operation is in progress always get either the whole old data (the one
before the reflink) or the whole new data, which is what generic/164
expects.

Link: https://lore.kernel.org/linux-btrfs/7fb9b44f-9680-4c22-a47f-6648cb109ddf@suse.com/
Reported-by: Qu Wenruo
Reviewed-by: Qu Wenruo
Reviewed-by: Boris Burkov
Signed-off-by: Filipe Manana
Signed-off-by: David Sterba
---

diff --git a/fs/btrfs/reflink.c b/fs/btrfs/reflink.c
index b5fe95baf92e..58dc3e5057ce 100644
--- a/fs/btrfs/reflink.c
+++ b/fs/btrfs/reflink.c
@@ -705,7 +705,6 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	struct inode *src = file_inode(file_src);
 	struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
 	int ret;
-	int wb_ret;
 	u64 len = olen;
 	u64 bs = fs_info->sectorsize;
 	u64 end;
@@ -750,25 +749,29 @@ static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
 	btrfs_lock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state);
 	ret = btrfs_clone(src, inode, off, olen, len, destoff, 0);
 	btrfs_unlock_extent(&BTRFS_I(inode)->io_tree, destoff, end, &cached_state);
+	if (ret < 0)
+		return ret;
 
 	/*
 	 * We may have copied an inline extent into a page of the destination
-	 * range, so wait for writeback to complete before truncating pages
+	 * range, so wait for writeback to complete before invalidating pages
 	 * from the page cache. This is a rare case.
 	 */
-	wb_ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
-	ret = ret ? ret : wb_ret;
+	ret = btrfs_wait_ordered_range(BTRFS_I(inode), destoff, len);
+	if (ret < 0)
+		return ret;
+
 	/*
-	 * Truncate page cache pages so that future reads will see the cloned
-	 * data immediately and not the previous data.
+	 * Invalidate page cache so that future reads will see the cloned data
+	 * immediately and not the previous data.
 	 */
-	truncate_inode_pages_range(&inode->i_data,
-				   round_down(destoff, PAGE_SIZE),
-				   round_up(destoff + len, PAGE_SIZE) - 1);
+	ret = filemap_invalidate_inode(inode, false, destoff, end);
+	if (ret < 0)
+		return ret;
 
 	btrfs_btree_balance_dirty(fs_info);
 
-	return ret;
+	return 0;
 }
 
 static int btrfs_remap_file_range_prep(struct file *file_in, loff_t pos_in,
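
[Editorial note: a minimal standalone sketch, illustration only and not
part of the patch, of the call pattern btrfs_clone_files() now relies
on. The wrapper name invalidate_reflink_range() is hypothetical.
filemap_invalidate_inode() takes an inclusive byte range, and passing
false for its 'flush' argument skips starting writeback on the range
first (the patch has already waited for ordered extents at this point).]

	#include <linux/fs.h>
	#include <linux/pagemap.h>

	/*
	 * Hypothetical wrapper, assuming 'end' is an inclusive byte offset
	 * as in btrfs_clone_files(). Per the commit message above: unlike
	 * truncate_inode_pages_range(), filemap_invalidate_inode() does no
	 * zeroing of folio contents, and it holds the mapping's invalidate
	 * lock so that no new folios are added while the range is being
	 * invalidated.
	 */
	static int invalidate_reflink_range(struct inode *inode, u64 destoff, u64 end)
	{
		return filemap_invalidate_inode(inode, false, destoff, end);
	}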