]> git.ipfire.org Git - thirdparty/kernel/stable-queue.git/blob
d56e125c58b25d2e6068c61416d1c41ca3c67360
[thirdparty/kernel/stable-queue.git] /
1 From stable+bounces-179510-greg=kroah.com@vger.kernel.org Sat Sep 13 18:14:04 2025
2 From: Sasha Levin <sashal@kernel.org>
3 Date: Sat, 13 Sep 2025 12:13:53 -0400
4 Subject: btrfs: fix corruption reading compressed range when block size is smaller than page size
5 To: stable@vger.kernel.org
6 Cc: Qu Wenruo <wqu@suse.com>, Filipe Manana <fdmanana@suse.com>, David Sterba <dsterba@suse.com>, Sasha Levin <sashal@kernel.org>
7 Message-ID: <20250913161353.1443138-2-sashal@kernel.org>
8
9 From: Qu Wenruo <wqu@suse.com>
10
11 [ Upstream commit 9786531399a679fc2f4630d2c0a186205282ab2f ]
12
13 [BUG]
14 With 64K page size (aarch64 with 64K page size config) and 4K btrfs
15 block size, the following workload can easily lead to a corrupted read:
16
17 mkfs.btrfs -f -s 4k $dev > /dev/null
18 mount -o compress $dev $mnt
19 xfs_io -f -c "pwrite -S 0xff 0 64k" $mnt/base > /dev/null
20 echo "correct result:"
21 od -Ad -t x1 $mnt/base
22 xfs_io -f -c "reflink $mnt/base 32k 0 32k" \
23 -c "reflink $mnt/base 0 32k 32k" \
24 -c "pwrite -S 0xff 60k 4k" $mnt/new > /dev/null
25 echo "incorrect result:"
26 od -Ad -t x1 $mnt/new
27 umount $mnt
28
29 This shows the following result:
30
31 correct result:
32 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
33 *
34 0065536
35 incorrect result:
36 0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
37 *
38 0032768 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
39 *
40 0061440 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
41 *
42 0065536
43
44 Notice the zero in the range [32K, 60K), which is incorrect.
45
46 [CAUSE]
47 With extra trace printk, it shows the following events during od:
48 (some unrelated info removed like CPU and context)
49
50 od-3457 btrfs_do_readpage: enter r/i=5/258 folio=0(65536) prev_em_start=0000000000000000
51
52 The "r/i" is indicating the root and inode number. In our case the file
53 "new" is using ino 258 from fs tree (root 5).
54
55 Here notice the @prev_em_start pointer is NULL. This means the
56 btrfs_do_readpage() is called from btrfs_read_folio(), not from
57 btrfs_readahead().
58
59 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=0 got em start=0 len=32768
60 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=4096 got em start=0 len=32768
61 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=8192 got em start=0 len=32768
62 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=12288 got em start=0 len=32768
63 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=16384 got em start=0 len=32768
64 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=20480 got em start=0 len=32768
65 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=24576 got em start=0 len=32768
66 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=28672 got em start=0 len=32768
67
68 These above 32K blocks will be read from the first half of the
69 compressed data extent.
70
71 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=32768 got em start=32768 len=32768
72
73 Note here there is no btrfs_submit_compressed_read() call. Which is
74 incorrect now.
75 Although both extent maps at 0 and 32K are pointing to the same compressed
76 data, their offsets are different thus can not be merged into the same
77 read.
78
79 So this means the compressed data read merge check is doing something
80 wrong.
81
82 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=36864 got em start=32768 len=32768
83 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=40960 got em start=32768 len=32768
84 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=45056 got em start=32768 len=32768
85 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=49152 got em start=32768 len=32768
86 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=53248 got em start=32768 len=32768
87 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=57344 got em start=32768 len=32768
88 od-3457 btrfs_do_readpage: r/i=5/258 folio=0(65536) cur=61440 skip uptodate
89 od-3457 btrfs_submit_compressed_read: cb orig_bio: file off=0 len=61440
90
91 The function btrfs_submit_compressed_read() is only called at the end of
92 folio read. The compressed bio will only have an extent map of range [0,
93 32K), but the original bio passed in is for the whole 64K folio.
94
95 This will cause the decompression part to only fill the first 32K,
96 leaving the rest untouched (aka, filled with zero).
97
98 This incorrect compressed read merge leads to the above data corruption.
99
100 There were similar problems that happened in the past, commit 808f80b46790
101 ("Btrfs: update fix for read corruption of compressed and shared
102 extents") is doing pretty much the same fix for readahead.
103
104 But that's back to 2015, where btrfs still only supports bs (block size)
105 == ps (page size) cases.
106 This means btrfs_do_readpage() only needs to handle a folio which
107 contains exactly one block.
108
109 Only btrfs_readahead() can lead to a read covering multiple blocks.
110 Thus only btrfs_readahead() passes a non-NULL @prev_em_start pointer.
111
112 With v5.15 kernel btrfs introduced bs < ps support. This breaks the above
113 assumption that a folio can only contain one block.
114
115 Now btrfs_read_folio() can also read multiple blocks in one go.
116 But btrfs_read_folio() doesn't pass a @prev_em_start pointer, thus the
117 existing bio force submission check will never be triggered.
118
119 In theory, this can also happen for btrfs with large folios, but since
120 large folio is still experimental, we don't need to bother it, thus only
121 bs < ps support is affected for now.
122
123 [FIX]
124 Instead of passing @prev_em_start to do the proper compressed extent
125 check, introduce one new member, btrfs_bio_ctrl::last_em_start, so that
126 the existing bio force submission logic will always be triggered.
127
128 CC: stable@vger.kernel.org # 5.15+
129 Reviewed-by: Filipe Manana <fdmanana@suse.com>
130 Signed-off-by: Qu Wenruo <wqu@suse.com>
131 Signed-off-by: David Sterba <dsterba@suse.com>
132 [ Adjust context ]
133 Signed-off-by: Sasha Levin <sashal@kernel.org>
134 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
135 ---
136 fs/btrfs/extent_io.c | 40 ++++++++++++++++++++++++++++++----------
137 1 file changed, 30 insertions(+), 10 deletions(-)
138
139 --- a/fs/btrfs/extent_io.c
140 +++ b/fs/btrfs/extent_io.c
141 @@ -109,6 +109,24 @@ struct btrfs_bio_ctrl {
142 */
143 unsigned long submit_bitmap;
144 struct readahead_control *ractl;
145 +
146 + /*
147 + * The start offset of the last used extent map by a read operation.
148 + *
149 + * This is for proper compressed read merge.
150 + * U64_MAX means we are starting the read and have made no progress yet.
151 + *
152 + * The current btrfs_bio_is_contig() only uses disk_bytenr as
153 + * the condition to check if the read can be merged with previous
154 + * bio, which is not correct. E.g. two file extents pointing to the
155 + * same extent but with different offset.
156 + *
157 + * So here we need to do extra checks to only merge reads that are
158 + * covered by the same extent map.
159 + * Just extent_map::start will be enough, as they are unique
160 + * inside the same inode.
161 + */
162 + u64 last_em_start;
163 };
164
165 static void submit_one_bio(struct btrfs_bio_ctrl *bio_ctrl)
166 @@ -955,7 +973,7 @@ static void btrfs_readahead_expand(struc
167 * return 0 on success, otherwise return error
168 */
169 static int btrfs_do_readpage(struct folio *folio, struct extent_map **em_cached,
170 - struct btrfs_bio_ctrl *bio_ctrl, u64 *prev_em_start)
171 + struct btrfs_bio_ctrl *bio_ctrl)
172 {
173 struct inode *inode = folio->mapping->host;
174 struct btrfs_fs_info *fs_info = inode_to_fs_info(inode);
175 @@ -1066,12 +1084,11 @@ static int btrfs_do_readpage(struct foli
176 * non-optimal behavior (submitting 2 bios for the same extent).
177 */
178 if (compress_type != BTRFS_COMPRESS_NONE &&
179 - prev_em_start && *prev_em_start != (u64)-1 &&
180 - *prev_em_start != em->start)
181 + bio_ctrl->last_em_start != U64_MAX &&
182 + bio_ctrl->last_em_start != em->start)
183 force_bio_submit = true;
184
185 - if (prev_em_start)
186 - *prev_em_start = em->start;
187 + bio_ctrl->last_em_start = em->start;
188
189 free_extent_map(em);
190 em = NULL;
191 @@ -1115,12 +1132,15 @@ int btrfs_read_folio(struct file *file,
192 const u64 start = folio_pos(folio);
193 const u64 end = start + folio_size(folio) - 1;
194 struct extent_state *cached_state = NULL;
195 - struct btrfs_bio_ctrl bio_ctrl = { .opf = REQ_OP_READ };
196 + struct btrfs_bio_ctrl bio_ctrl = {
197 + .opf = REQ_OP_READ,
198 + .last_em_start = U64_MAX,
199 + };
200 struct extent_map *em_cached = NULL;
201 int ret;
202
203 btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
204 - ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl, NULL);
205 + ret = btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
206 unlock_extent(&inode->io_tree, start, end, &cached_state);
207
208 free_extent_map(em_cached);
209 @@ -2391,7 +2411,8 @@ void btrfs_readahead(struct readahead_co
210 {
211 struct btrfs_bio_ctrl bio_ctrl = {
212 .opf = REQ_OP_READ | REQ_RAHEAD,
213 - .ractl = rac
214 + .ractl = rac,
215 + .last_em_start = U64_MAX,
216 };
217 struct folio *folio;
218 struct btrfs_inode *inode = BTRFS_I(rac->mapping->host);
219 @@ -2399,12 +2420,11 @@ void btrfs_readahead(struct readahead_co
220 const u64 end = start + readahead_length(rac) - 1;
221 struct extent_state *cached_state = NULL;
222 struct extent_map *em_cached = NULL;
223 - u64 prev_em_start = (u64)-1;
224
225 btrfs_lock_and_flush_ordered_range(inode, start, end, &cached_state);
226
227 while ((folio = readahead_folio(rac)) != NULL)
228 - btrfs_do_readpage(folio, &em_cached, &bio_ctrl, &prev_em_start);
229 + btrfs_do_readpage(folio, &em_cached, &bio_ctrl);
230
231 unlock_extent(&inode->io_tree, start, end, &cached_state);
232