From stable+bounces-179556-greg=kroah.com@vger.kernel.org Sun Sep 14 06:02:28 2025
From: Sasha Levin <sashal@kernel.org>
Date: Sun, 14 Sep 2025 00:01:57 -0400
Subject: btrfs: fix corruption reading compressed range when block size is smaller than page size
To: stable@vger.kernel.org
Cc: Qu Wenruo <wqu@suse.com>, Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
Message-ID: <20250914040157.1958299-2-sashal@kernel.org>

From: Qu Wenruo <wqu@suse.com>

[ Upstream commit 9786531399a679fc2f4630d2c0a186205282ab2f ]
[BUG]
With 64K page size (aarch64 with 64K page size config) and 4K btrfs
block size, the following workload can easily lead to a corrupted read:

  mkfs.btrfs -f -s 4k $dev > /dev/null
  mount -o compress $dev $mnt
  xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
  echo "correct result:"
  od -Ad -t x1 $mnt/base
  xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
         -c "reflink $mnt/base 0 32k 32k" \
         -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
  echo "incorrect result:"
  od -Ad -t x1 $mnt/new
  umount $mnt
This shows the following result:

  correct result:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0065536
  incorrect result:
  0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  *
  0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
  *
  0065536

Notice the zeros in the range [32K, 60K), which are incorrect.
[CAUSE]
With extra trace printk, the following events show up during od
(some unrelated info, like CPU and context, removed):

 od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000

The "r/i" indicates the root and inode number. In our case the file
"new" is using ino 258 from the fs tree (root 5).

Notice that the @prev_em_start pointer is NULL here. This means
btrfs_do_readpage() was called from btrfs_read_folio(), not from
btrfs_readahead().
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768

All of the above blocks, covering the range [0, 32K), will be read
from the first half of the compressed data extent.
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768

Note that there is no btrfs_submit_compressed_read() call here, which
is incorrect: although both extent maps at 0 and 32K point to the same
compressed data, their offsets are different and thus they cannot be
merged into the same read.

So the compressed data read merge check is doing something wrong.
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
 od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
The function btrfs_submit_compressed_read() is only called at the end
of the folio read. The compressed bio will only have an extent map of
range [0, 32K), but the original bio passed in covers the whole 64K
folio.

This causes the decompression part to fill only the first 32K, leaving
the rest untouched (i.e., filled with zeros).

This incorrect compressed read merge leads to the above data
corruption.
A similar problem happened in the past: commit 808f80b46790
("Btrfs: update fix for read corruption of compressed and shared
extents") applied pretty much the same fix for readahead.

But that was back in 2015, when btrfs only supported the bs (block
size) == ps (page size) case. This meant btrfs_do_readpage() only
needed to handle a folio containing exactly one block.

Only btrfs_readahead() could lead to a read covering multiple blocks,
thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
With the v5.15 kernel, btrfs introduced bs < ps support. This breaks
the above assumption that a folio can only contain one block.

Now btrfs_read_folio() can also read multiple blocks in one go. But
btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
existing bio force submission check is never triggered.

In theory, this can also happen for btrfs with large folios, but since
large folio support is still experimental we don't need to bother with
it; thus only bs < ps setups are affected for now.
[FIX]
Instead of passing @prev_em_start around to do the proper compressed
extent check, introduce a new member, btrfs_bio_ctrl::last_em_start,
so that the existing bio force submission logic is always triggered.

CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
---
 fs/btrfs/extent_io.c | 46 ++++++++++++++++++++++++++++++++--------------
 1 file changed, 32 insertions(+), 14 deletions(-)

--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -104,6 +104,24 @@ struct btrfs_bio_ctrl {
 	btrfs_bio_end_io_t end_io_func;
 	struct writeback_control *wbc;
 	struct readahead_control *ractl;
+
+	/*
+	 * The start offset of the last used extent map by a read operation.
+	 *
+	 * This is for proper compressed read merge.
+	 * U64_MAX means we are starting the read and have made no progress yet.
+	 *
+	 * The current btrfs_bio_is_contig() only uses disk_bytenr as
+	 * the condition to check if the read can be merged with previous
+	 * bio, which is not correct. E.g. two file extents pointing to the
+	 * same extent but with different offset.
+	 *
+	 * So here we need to do extra checks to only merge reads that are
+	 * covered by the same extent map.
+	 * Just extent_map::start will be enough, as they are unique
+	 * inside the same inode.
+	 */
+	u64 last_em_start;
 };
 
 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
@@ -978,7 +996,7 @@ static void btrfs_readahead_expand(struc
  * return 0 on success, otherwise return error
  */
 static int btrfs_do_readpage(struct page *page, struct extent_map **em_cached,
-			     struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
+			     struct btrfs_bio_ctrl *bio_ctrl)
 {
 	struct inode *inode = page->mapping->host;
 	struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
@@ -1095,12 +1113,11 @@ static int btrfs_do_readpage(struct page
 		 * non-optimal behavior (submitting 2 bios for the same extent).
 		 */
 		if (test_bit(EXTENT_FLAG_COMPRESSED, &em->flags) &&
-		    prev_em_start && *prev_em_start != (u64)-1 &&
-		    *prev_em_start != em->start)
+		    bio_ctrl->last_em_start != (u64)-1 &&
+		    bio_ctrl->last_em_start != em->start)
 			force_bio_submit = true;
 
-		if (prev_em_start)
-			*prev_em_start = em->start;
+		bio_ctrl->last_em_start = em->start;
 
 		free_extent_map(em);
 		em = NULL;
@@ -1146,12 +1163,15 @@ int btrfs_read_folio(struct file *file,
 	struct btrfs_inode *inode = BTRFS_I(page->mapping->host);
 	u64 start = page_offset(page);
 	u64 end = start + PAGE_SIZE - 1;
-	struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
+	struct btrfs_bio_ctrl bio_ctrl = {
+		.opf = REQ_OP_READ,
+		.last_em_start = (u64)-1,
+	};
 	int ret;
 
 	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
 
-	ret = btrfs_do_readpage(page, NULL, &bio_ctrl, NULL);
+	ret = btrfs_do_readpage(page, NULL, &bio_ctrl);
 	/*
 	 * If btrfs_do_readpage() failed we will want to submit the assembled
 	 * bio to do the cleanup.
@@ -1163,8 +1183,7 @@ int btrfs_read_folio(struct file *file,
 static inline void contiguous_readpages(struct page *pages[], int nr_pages,
 					u64 start, u64 end,
 					struct extent_map **em_cached,
-					struct btrfs_bio_ctrl *bio_ctrl,
-					u64 *prev_em_start)
+					struct btrfs_bio_ctrl *bio_ctrl)
 {
 	struct btrfs_inode *inode = BTRFS_I(pages[0]->mapping->host);
 	int index;
@@ -1172,8 +1191,7 @@ static inline void contiguous_readpages(
 	btrfs_lock_and_flush_ordered_range(inode, start, end, NULL);
 
 	for (index = 0; index < nr_pages; index++) {
-		btrfs_do_readpage(pages[index], em_cached, bio_ctrl,
-				  prev_em_start);
+		btrfs_do_readpage(pages[index], em_cached, bio_ctrl);
 		put_page(pages[index]);
 	}
 }
@@ -2255,11 +2273,11 @@ void extent_readahead(struct readahead_c
 {
 	struct btrfs_bio_ctrl bio_ctrl = {
 		.opf = REQ_OP_READ | REQ_RAHEAD,
-		.ractl = rac
+		.ractl = rac,
+		.last_em_start = (u64)-1,
 	};
 	struct page *pagepool[16];
 	struct extent_map *em_cached = NULL;
-	u64 prev_em_start = (u64)-1;
 	int nr;
 
 	while ((nr = readahead_page_batch(rac, pagepool))) {
@@ -2267,7 +2285,7 @@ void extent_readahead(struct readahead_c
 		u64 contig_end = contig_start + readahead_batch_length(rac) - 1;
 
 		contiguous_readpages(pagepool, nr, contig_start, contig_end,
-				     &em_cached, &bio_ctrl, &prev_em_start);
+				     &em_cached, &bio_ctrl);
 	}
 
 	if (em_cached)