From c7f65ad33a7107498d54a178888ec64134240ad9 Mon Sep 17 00:00:00 2001
From: Sasha Levin <sashal@kernel.org>
Date: Wed, 28 Feb 2024 11:37:56 +0000
Subject: btrfs: fix race when detecting delalloc ranges during fiemap

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit 978b63f7464abcfd364a6c95f734282c50f3decf ]

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice,
but it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
   has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the
   inode's subvolume tree searching for file extent items, without having
   the whole fiemap target range locked in the inode's io tree - the
   change introduced recently by commit b0ad381fa769 ("btrfs: fix
   deadlock with fiemap and extent locking"). It only locks ranges in
   the io tree when it finds a hole or prealloc extent since that
   commit;

3) Note that fiemap clones each leaf before using it, and this is to
   avoid deadlocks when locking a file range in the inode's io tree and
   the fiemap buffer is memory mapped to some file, because writing
   to the page with btrfs_page_mkwrite() will wait on any ordered extent
   for the page's range and the ordered extent needs to lock the range
   and may need to modify the same leaf, therefore leading to a deadlock
   on the leaf;

4) While iterating the file extent items in the cloned leaf before
   finding the hole in the range [64M, 65M[, the delalloc in that range
   is flushed and its ordered extent completes - meaning the corresponding
   file extent item is in the inode's subvolume tree, but not present in
   the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
   the cloned leaf (or a file extent item with disk_bytenr == 0 in case
   the NO_HOLES feature is not enabled), it will lock that file range in
   the inode's io tree and then search for delalloc by checking for the
   EXTENT_DELALLOC bit in the io tree for that range and ordered extents
   (with btrfs_find_delalloc_in_range()). But it finds nothing since the
   delalloc in that range was already flushed and the ordered extent
   completed and is gone - as a result fiemap will not report that there's
   delalloc or an extent for the range [64M, 65M[, so user space will be
   misled into thinking that there's a hole in that range.

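To make the user space impact concrete, here is a minimal sketch of the
kind of FIEMAP-based hole detection that cp versions before coreutils
9.0 relied on (this program is illustrative only and not part of the
patch; error handling is reduced to the bare minimum):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <linux/fiemap.h>

  int main(int argc, char **argv)
  {
          struct fiemap *fm;
          int fd;

          if (argc < 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);
          if (fd < 0)
                  return 1;
          /* Room for the header plus up to 32 extent records. */
          fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
          if (!fm)
                  return 1;
          fm->fm_start = 0;
          fm->fm_length = FIEMAP_MAX_OFFSET;
          /* No FIEMAP_FLAG_SYNC, so delalloc is not flushed first. */
          fm->fm_flags = 0;
          fm->fm_extent_count = 32;
          if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                  return 1;
          /*
           * A cp-style copy loop reads only the reported ranges
           * (delalloc shows up with FIEMAP_EXTENT_DELALLOC set) and
           * treats everything else as a hole. If the kernel fails to
           * report a delalloc range, its data is silently skipped.
           */
          for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
                  printf("logical %llu len %llu flags 0x%x\n",
                         (unsigned long long)fm->fm_extents[i].fe_logical,
                         (unsigned long long)fm->fm_extents[i].fe_length,
                         fm->fm_extents[i].fe_flags);
          return 0;
  }
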
This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:

  # generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
  #    --- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
  #    +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
  #    @@ -1,3 +1,9 @@
  #     QA output created by 094
  #     fiemap run with sync
  #     fiemap run without sync
  #    +ERROR: couldn't find extent at 7
  #    +map is 'HHDDHPPDPHPH'
  #    +logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
  #    +logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
  #    ...
  #    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:

1) Always lock the whole range in the inode's io tree before starting to
   iterate the inode's subvolume tree searching for file extent items,
   just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with
   fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent
   to report, write instead to a temporary buffer (1 page). When that
   buffer becomes full, stop iterating the file extent items, unlock the
   range in the io tree, release the search path, submit all the entries
   kept in that buffer to the fiemap buffer, and then resume the search
   for file extent items after locking the remainder of the range in the
   io tree again (see the sketch below).

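The following stand-alone sketch models that control flow (the names and
the toy "extent scan" are hypothetical, only the buffering pattern
mirrors the patch): entries are staged in a small fixed-size buffer
while the range is "locked", and are only copied to their final
destination after "unlocking", with the scan restarting right after the
last staged entry:

  #include <stdio.h>

  /* Stands in for PAGE_SIZE / sizeof(struct btrfs_fiemap_entry). */
  #define ENTRIES_SIZE 4

  struct entry { unsigned long long offset, len; };

  /*
   * Copy staged entries to their final destination. In fiemap this step
   * may fault on a user space buffer, which is why it must run with the
   * extent range unlocked.
   */
  static void flush(struct entry *buf, int *pos)
  {
          for (int i = 0; i < *pos; i++)
                  printf("emit [%llu, %llu)\n", buf[i].offset,
                         buf[i].offset + buf[i].len);
          *pos = 0;
  }

  int main(void)
  {
          struct entry buf[ENTRIES_SIZE];
          unsigned long long start = 0, end = 40;
          int pos = 0;

          while (start < end) {
                  /* "lock" [start, end) and scan from start... */
                  for (unsigned long long off = start; off < end; off += 4) {
                          if (pos == ENTRIES_SIZE) {
                                  /* Buffer full: resume right after the
                                   * last staged entry, like
                                   * cache->next_search_offset. */
                                  start = buf[pos - 1].offset + buf[pos - 1].len;
                                  goto flush_and_restart;
                          }
                          buf[pos].offset = off;
                          buf[pos].len = 4;
                          pos++;
                  }
                  start = end;
  flush_and_restart:
                  /* ..."unlock" the range, then it is safe to flush. */
                  flush(buf, &pos);
          }
          return 0;
  }
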
A buffer with the size of a page allows for 146 entries in a system
with 4K pages. This is a large enough value to get good performance by
avoiding too many restarts of the search for file extent items. In other
words, this preserves the huge performance gains made to fiemap in the
last two years, while avoiding the deadlocks in case the fiemap buffer
is memory mapped to the same file (useless in practice, but possible and
exercised by fuzz testing and syzbot).

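(For reference, the 146 figure comes from the entry layout: struct
btrfs_fiemap_entry holds three u64 fields plus one u32, i.e. 3 * 8 + 4 =
28 bytes of payload, and 4096 / 28 = 146. If the compiler pads the
struct to 32 bytes for 8-byte alignment, the same division yields
4096 / 32 = 128 entries - still far more than enough to keep restarts
rare.)
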
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 221 +++++++++++++++++++++++++++++++------------
 1 file changed, 160 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 45d427c3033d7..5acb2cb79d4bf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2410,12 +2410,65 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 	return try_release_extent_state(tree, page, mask);
 }
 
+struct btrfs_fiemap_entry {
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+};
+
 /*
- * To cache previous fiemap extent
+ * Indicate to the caller of emit_fiemap_extent() that it needs to unlock the
+ * file range from the inode's io tree, release the subvolume tree search path,
+ * flush the fiemap cache, relock the file range and re-search the subvolume tree.
+ * The value here is something negative that can't be confused with a valid
+ * errno value and different from 1 because that's also a return value from
+ * fiemap_fill_next_extent() and also it's often used to mean some btree search
+ * did not find a key, so make it some distinct negative value.
+ */
+#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
+
+/*
+ * Used to:
+ *
+ * - Cache the next entry to be emitted to the fiemap buffer, so that we can
+ *   merge extents that are contiguous and can be grouped as a single one;
  *
- * Will be used for merging fiemap extent
+ * - Store extents ready to be written to the fiemap buffer in an intermediary
+ *   buffer. This intermediary buffer is to ensure that in case the fiemap
+ *   buffer is memory mapped to the fiemap target file, we don't deadlock
+ *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
+ *   an extent range in order to prevent races with delalloc flushing and
+ *   ordered extent completion, which is needed in order to reliably detect
+ *   delalloc in holes and prealloc extents. And this can lead to a deadlock
+ *   if the fiemap buffer is memory mapped to the file we are running fiemap
+ *   against (a silly, useless in practice scenario, but possible) because
+ *   btrfs_page_mkwrite() will try to lock the same extent range.
  */
 struct fiemap_cache {
+	/* An array of ready fiemap entries. */
+	struct btrfs_fiemap_entry *entries;
+	/* Number of entries in the entries array. */
+	int entries_size;
+	/* Index of the next entry in the entries array to write to. */
+	int entries_pos;
+	/*
+	 * Once the entries array is full, this indicates what's the offset for
+	 * the next file extent item we must search for in the inode's subvolume
+	 * tree after unlocking the extent range in the inode's io tree and
+	 * releasing the search path.
+	 */
+	u64 next_search_offset;
+	/*
+	 * This matches struct fiemap_extent_info::fi_mapped_extents, we use it
+	 * to count emitted extents ourselves and stop instead of relying on
+	 * fiemap_fill_next_extent() because we buffer ready fiemap entries at
+	 * the @entries array, and we want to stop as soon as we hit the max
+	 * amount of extents to map, not just to save time but also to make the
+	 * logic at extent_fiemap() simpler.
+	 */
+	unsigned int extents_mapped;
+	/* Fields for the cached extent (unsubmitted, not ready, extent). */
 	u64 offset;
 	u64 phys;
 	u64 len;
@@ -2423,6 +2476,28 @@ struct fiemap_cache {
 	bool cached;
 };
 
+static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
+			      struct fiemap_cache *cache)
+{
+	for (int i = 0; i < cache->entries_pos; i++) {
+		struct btrfs_fiemap_entry *entry = &cache->entries[i];
+		int ret;
+
+		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
+					      entry->phys, entry->len,
+					      entry->flags);
+		/*
+		 * Ignore 1 (reached max entries) because we keep track of that
+		 * ourselves in emit_fiemap_extent().
+		 */
+		if (ret < 0)
+			return ret;
+	}
+	cache->entries_pos = 0;
+
+	return 0;
+}
+
 /*
  * Helper to submit fiemap extent.
  *
@@ -2437,8 +2512,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 			      struct fiemap_cache *cache,
 			      u64 offset, u64 phys, u64 len, u32 flags)
 {
+	struct btrfs_fiemap_entry *entry;
 	u64 cache_end;
-	int ret = 0;
 
 	/* Set at the end of extent_fiemap(). */
 	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
@@ -2451,7 +2526,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 	 * find an extent that starts at an offset behind the end offset of the
 	 * previous extent we processed. This happens if fiemap is called
 	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
-	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
+	 * after we had to unlock the file range, release the search path, emit
+	 * the fiemap extents stored in the buffer (cache->entries array) and
+	 * then lock the remainder of the range and re-search the btree.
 	 *
 	 * For example we are in leaf X processing its last item, which is the
 	 * file extent item for file range [512K, 1M[, and after
@@ -2564,11 +2641,35 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 
 emit:
 	/* Not mergeable, need to submit cached one */
-	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
-				      cache->len, cache->flags);
-	cache->cached = false;
-	if (ret)
-		return ret;
+
+	if (cache->entries_pos == cache->entries_size) {
+		/*
+		 * We will need to re-search from the end offset of the last
+		 * stored extent and not from the current offset, because after
+		 * unlocking the range and releasing the path, if there's a hole
+		 * between that end offset and this current offset, a new extent
+		 * may have been inserted due to a new write, so we don't want
+		 * to miss it.
+		 */
+		entry = &cache->entries[cache->entries_size - 1];
+		cache->next_search_offset = entry->offset + entry->len;
+		cache->cached = false;
+
+		return BTRFS_FIEMAP_FLUSH_CACHE;
+	}
+
+	entry = &cache->entries[cache->entries_pos];
+	entry->offset = cache->offset;
+	entry->phys = cache->phys;
+	entry->len = cache->len;
+	entry->flags = cache->flags;
+	cache->entries_pos++;
+	cache->extents_mapped++;
+
+	if (cache->extents_mapped == fieinfo->fi_extents_max) {
+		cache->cached = false;
+		return 1;
+	}
 assign:
 	cache->cached = true;
 	cache->offset = offset;
@@ -2694,8 +2795,8 @@ static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path
 	 * neighbour leaf).
 	 * We also need the private clone because holding a read lock on an
 	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
-	 * when we call fiemap_fill_next_extent(), because that may cause a page
-	 * fault when filling the user space buffer with fiemap data.
+	 * when we check if extents are shared, as backref walking may need to
+	 * lock the same leaf we are processing.
 	 */
 	clone = btrfs_clone_extent_buffer(path->nodes[0]);
 	if (!clone)
@@ -2735,34 +2836,16 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
 	 * it beyond i_size.
 	 */
 	while (cur_offset < end && cur_offset < i_size) {
-		struct extent_state *cached_state = NULL;
 		u64 delalloc_start;
 		u64 delalloc_end;
 		u64 prealloc_start;
-		u64 lockstart;
-		u64 lockend;
 		u64 prealloc_len = 0;
 		bool delalloc;
 
-		lockstart = round_down(cur_offset, inode->root->fs_info->sectorsize);
-		lockend = round_up(end, inode->root->fs_info->sectorsize);
-
-		/*
-		 * We are only locking for the delalloc range because that's the
-		 * only thing that can change here. With fiemap we have a lock
-		 * on the inode, so no buffered or direct writes can happen.
-		 *
-		 * However mmaps and normal page writeback will cause this to
-		 * change arbitrarily. We have to lock the extent lock here to
-		 * make sure that nobody messes with the tree while we're doing
-		 * btrfs_find_delalloc_in_range.
-		 */
-		lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
 							delalloc_cached_state,
 							&delalloc_start,
 							&delalloc_end);
-		unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		if (!delalloc)
 			break;
 
@@ -2930,6 +3013,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		  u64 start, u64 len)
 {
 	const u64 ino = btrfs_ino(inode);
+	struct extent_state *cached_state = NULL;
 	struct extent_state *delalloc_cached_state = NULL;
 	struct btrfs_path *path;
 	struct fiemap_cache cache = { 0 };
@@ -2942,26 +3026,33 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	bool stopped = false;
 	int ret;
 
+	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
+	cache.entries = kmalloc_array(cache.entries_size,
+				      sizeof(struct btrfs_fiemap_entry),
+				      GFP_KERNEL);
 	backref_ctx = btrfs_alloc_backref_share_check_ctx();
 	path = btrfs_alloc_path();
-	if (!backref_ctx || !path) {
+	if (!cache.entries || !backref_ctx || !path) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
+restart:
 	range_start = round_down(start, sectorsize);
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
+	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out;
+		goto out_unlock;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out;
+		goto out_unlock;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -3008,7 +3099,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  backref_ctx, 0, 0, 0,
 						  prev_extent_end, hole_end);
 			if (ret < 0) {
-				goto out;
+				goto out_unlock;
 			} else if (ret > 0) {
 				/* fiemap_fill_next_extent() told us to stop. */
 				stopped = true;
@@ -3064,7 +3155,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 							  extent_gen,
 							  backref_ctx);
 				if (ret < 0)
-					goto out;
+					goto out_unlock;
 				else if (ret > 0)
 					flags |= FIEMAP_EXTENT_SHARED;
 			}
@@ -3075,9 +3166,9 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
-			/* fiemap_fill_next_extent() told us to stop. */
+			/* emit_fiemap_extent() told us to stop. */
 			stopped = true;
 			break;
 		}
@@ -3086,12 +3177,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out;
+			goto out_unlock;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3100,22 +3191,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 	}
 check_eof_delalloc:
-	/*
-	 * Release (and free) the path before emitting any final entries to
-	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
-	 * once we find no more file extent items exist, we may have a
-	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
-	 * faults when copying data to the user space buffer.
-	 */
-	btrfs_free_path(path);
-	path = NULL;
-
 	if (!stopped && prev_extent_end < range_end) {
 		ret = fiemap_process_hole(inode, fieinfo, &cache,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out;
+			goto out_unlock;
 		prev_extent_end = range_end;
 	}
 
@@ -3123,28 +3204,16 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		const u64 i_size = i_size_read(&inode->vfs_inode);
 
 		if (prev_extent_end < i_size) {
-			struct extent_state *cached_state = NULL;
 			u64 delalloc_start;
 			u64 delalloc_end;
-			u64 lockstart;
-			u64 lockend;
 			bool delalloc;
 
-			lockstart = round_down(prev_extent_end, sectorsize);
-			lockend = round_up(i_size, sectorsize);
-
-			/*
-			 * See the comment in fiemap_process_hole as to why
-			 * we're doing the locking here.
-			 */
-			lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			delalloc = btrfs_find_delalloc_in_range(inode,
 								prev_extent_end,
 								i_size - 1,
 								&delalloc_cached_state,
 								&delalloc_start,
 								&delalloc_end);
-			unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			if (!delalloc)
 				cache.flags |= FIEMAP_EXTENT_LAST;
 		} else {
@@ -3152,9 +3221,39 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 	}
 
+out_unlock:
+	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
+		btrfs_release_path(path);
+		ret = flush_fiemap_cache(fieinfo, &cache);
+		if (ret)
+			goto out;
+		len -= cache.next_search_offset - start;
+		start = cache.next_search_offset;
+		goto restart;
+	} else if (ret < 0) {
+		goto out;
+	}
+
+	/*
+	 * Must free the path before emitting to the fiemap buffer because we
+	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
+	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
+	 * waiting for an ordered extent that in order to complete needs to
+	 * modify that leaf, therefore leading to a deadlock.
+	 */
+	btrfs_free_path(path);
+	path = NULL;
+
+	ret = flush_fiemap_cache(fieinfo, &cache);
+	if (ret)
+		goto out;
+
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
 out:
 	free_extent_state(delalloc_cached_state);
+	kfree(cache.entries);
 	btrfs_free_backref_share_ctx(backref_ctx);
 	btrfs_free_path(path);
 	return ret;
-- 
2.43.0
