1 From stable+bounces-179491-greg=kroah.com@vger.kernel.org Sat Sep 13 16:13:05 2025
2 From: Sasha Levin <sashal@kernel.org>
3 Date: Sat, 13 Sep 2025 10:12:33 -0400
4 Subject: btrfs: fix corruption reading compressed range when block size is smaller than page size
5 To: stable@vger.kernel.org
6 Cc: Qu Wenruo <wqu@suse.com>, Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
7 Message-ID: <20250913141233.1363589-2-sashal@kernel.org>
9 From: Qu Wenruo <wqu@suse.com>
11 [ Upstream commit 9786531399a679fc2f4630d2c0a186205282ab2f ]
14 With 64K page size (aarch64 with 64K page size config) and 4K btrfs
15 block size, the following workload can easily lead to a corrupted read:
17 mkfs.btrfs -f -s 4k $dev > /dev/null
18 mount -o compress $dev $mnt
19 xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
20 echo "correct result:"
21 od -Ad -t x1 $mnt/base
22 xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
23 -c "reflink $mnt/base 0 32k 32k" \
24 -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
25 echo "incorrect result:"
29 This shows the following result:
32 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
36 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
38 0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40 0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
44 Notice the zero in the range [32K, 60K), which is incorrect.
47 With extra trace printk, it shows the following events during od:
48 (some unrelated info removed like CPU and context)
50 od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
52 The "r/i" is indicating the root and inode number. In our case the file
53 "new" is using ino 258 from fs tree (root 5).
55 Here notice the @prev_em_start pointer is NULL. This means the
56 btrfs_do_readpage() is called from btrfs_read_folio(), not from
59 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
60 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
61 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
62 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
63 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
64 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
65 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
66 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
68 These above 32K blocks will be read from the first half of the
69 compressed data extent.
71 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
73 Note here there is no btrfs_submit_compressed_read() call. Which is
75 Although both extent maps at 0 and 32K are pointing to the same compressed
76 data, their offsets are different thus can not be merged into the same
79 So this means the compressed data read merge check is doing something
82 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
83 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
84 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
85 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
86 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
87 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
88 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
89 od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
91 The function btrfs_submit_compressed_read() is only called at the end of
92 folio read. The compressed bio will only have an extent map of range [0,
93 32K), but the original bio passed in is for the whole 64K folio.
95 This will cause the decompression part to only fill the first 32K,
96 leaving the rest untouched (aka, filled with zero).
98 This incorrect compressed read merge leads to the above data corruption.
100 There were similar problems that happened in the past, commit 808f80b46790
101 ("Btrfs: update fix for read corruption of compressed and shared
102 extents") is doing pretty much the same fix for readahead.
104 But that's back to 2015, where btrfs still only supports bs (block size)
105 == ps (page size) cases.
106 This means btrfs_do_readpage() only needs to handle a folio which
107 contains exactly one block.
109 Only btrfs_readahead() can lead to a read covering multiple blocks.
110 Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
112 With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
113 assumption that a folio can only contain one block.
115 Now btrfs_read_folio() can also read multiple blocks in one go.
116 But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
117 existing bio force submission check will never be triggered.
119 In theory, this can also happen for btrfs with large folios, but since
120 large folio is still experimental, we don't need to bother it, thus only
121 bs < ps support is affected for now.
124 Instead of passing @prev_em_start to do the proper compressed extent
125 check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
126 the existing bio force submission logic will always be triggered.
128 CC: stable@vger.kernel.org # 5.15+
129 Reviewed-by: Filipe Manana <fdmanana@suse.com>
130 Signed-off-by: Qu Wenruo <wqu@suse.com>
131 Signed-off-by: David Sterba <dsterba@suse.com>
132 Signed-off-by: Sasha Levin <sashal@kernel.org>
133 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
135 fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
136 1 file changed, 30 insertions(+), 10 deletions(-)
138 --- a/fs/btrfs/extent_io.c
139 +++ b/fs/btrfs/extent_io.c
140 @@ -111,6 +111,24 @@ struct btrfs_bio_ctrl {
142 unsigned long submit_bitmap;
143 struct readahead_control *ractl;
146 + * The start offset of the last used extent map by a read operation.
148 + * This is for proper compressed read merge.
149 + * U64_MAX means we are starting the read and have made no progress yet.
151 + * The current btrfs_bio_is_contig() only uses disk_bytenr as
152 + * the condition to check if the read can be merged with previous
153 + * bio, which is not correct. E.g. two file extents pointing to the
154 + * same extent but with different offset.
156 + * So here we need to do extra checks to only merge reads that are
157 + * covered by the same extent map.
158 + * Just extent_map::start will be enough, as they are unique
159 + * inside the same inode.
164 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
165 @@ -910,7 +928,7 @@ static void btrfs_readahead_expand(struc
166 * return 0 on success, otherwise return error
168 static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
169 - struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
170 + struct btrfs_bio_ctrl *bio_ctrl)
172 struct inode *inode = folio->mapping->host;
173 struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
174 @@ -1020,12 +1038,11 @@ static int btrfs_do_readpage(struct foli
175 * non-optimal behavior (submitting 2 bios for the same extent).
177 if (compress_type != BTRFS_COMPRESS_NONE &&
178 - prev_em_start && *prev_em_start != (u64)-1 &&
179 - *prev_em_start != em->start)
180 + bio_ctrl->last_em_start != U64_MAX &&
181 + bio_ctrl->last_em_start != em->start)
182 force_bio_submit = true;
185 - *prev_em_start = em->start;
186 + bio_ctrl->last_em_start = em->start;
188 btrfs_free_extent_map(em);
190 @@ -1239,12 +1256,15 @@ int btrfs_read_folio(struct file *file,
191 const u64 start = folio_pos(folio);
192 const u64 end = start + folio_size(folio) - 1;
193 struct extent_state *cached_state = NULL;
194 - struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
195 + struct btrfs_bio_ctrl bio_ctrl = {
196 + .opf = REQ_OP_READ,
197 + .last_em_start = U64_MAX,
199 struct extent_map *em_cached = NULL;
202 lock_extents_for_read(inode, start, end, &cached_state);
203 - ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
204 + ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
205 btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);
207 btrfs_free_extent_map(em_cached);
208 @@ -2582,7 +2602,8 @@ void btrfs_readahead(struct readahead_co
210 struct btrfs_bio_ctrl bio_ctrl = {
211 .opf = REQ_OP_READ | REQ_RAHEAD,
214 + .last_em_start = U64_MAX,
217 struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
218 @@ -2590,12 +2611,11 @@ void btrfs_readahead(struct readahead_co
219 const u64 end = start + readahead_length(rac) - 1;
220 struct extent_state *cached_state = NULL;
221 struct extent_map *em_cached = NULL;
222 - u64 prev_em_start = (u64)-1;
224 lock_extents_for_read(inode, start, end, &cached_state);
226 while ((folio = readahead_folio(rac)) != NULL)
227 - btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
228 + btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
230 btrfs_unlock_extent(&inode->io_tree, start, end, &cached_state);