From c7f65ad33a7107498d54a178888ec64134240ad9 Mon Sep 17 00:00:00 2001
From: Sasha Levin <sashal@kernel.org>
Date: Wed, 28 Feb 2024 11:37:56 +0000
Subject: btrfs: fix race when detecting delalloc ranges during fiemap

From: Filipe Manana <fdmanana@suse.com>

[ Upstream commit 978b63f7464abcfd364a6c95f734282c50f3decf ]

For fiemap we recently stopped locking the target extent range for the
whole duration of the fiemap call, in order to avoid a deadlock in a
scenario where the fiemap buffer happens to be a memory mapped range of
the same file. This use case is very unlikely to be useful in practice,
but it may be triggered by fuzz testing (syzbot, etc).

This however introduced a race that makes us miss delalloc ranges for
file regions that are currently holes, so the caller of fiemap will not
be aware that there's data for some file regions. This can be quite
serious for some use cases - for example in coreutils versions before 9.0,
the cp program used fiemap to detect holes and data in the source file,
copying only regions with data (extents or delalloc) from the source file
to the destination file in order to preserve holes (see the documentation
for its --sparse command line option). This means that if cp was used
with a source file that had delalloc in a hole, the destination file could
end up without that data, which is effectively a data loss issue, if it
happened to hit the race described below.

The race happens like this:

1) Fiemap is called, without the FIEMAP_FLAG_SYNC flag, for a file that
   has delalloc in the file range [64M, 65M[, which is currently a hole;

2) Fiemap locks the inode in shared mode, then starts iterating the
   inode's subvolume tree searching for file extent items, without having
   the whole fiemap target range locked in the inode's io tree - the
   change introduced recently by commit b0ad381fa769 ("btrfs: fix
   deadlock with fiemap and extent locking"). It only locks ranges in
   the io tree when it finds a hole or prealloc extent since that
   commit;

3) Note that fiemap clones each leaf before using it, and this is to
   avoid deadlocks when locking a file range in the inode's io tree and
   the fiemap buffer is memory mapped to some file, because writing
   to the page with btrfs_page_mkwrite() will wait on any ordered extent
   for the page's range and the ordered extent needs to lock the range
   and may need to modify the same leaf, therefore leading to a deadlock
   on the leaf;

4) While iterating the file extent items in the cloned leaf before
   finding the hole in the range [64M, 65M[, the delalloc in that range
   is flushed and its ordered extent completes - meaning the corresponding
   file extent item is in the inode's subvolume tree, but not present in
   the cloned leaf that fiemap is iterating over;

5) When fiemap finds the hole in the [64M, 65M[ range by seeing the gap in
   the cloned leaf (or a file extent item with disk_bytenr == 0 in case
   the NO_HOLES feature is not enabled), it will lock that file range in
   the inode's io tree and then search for delalloc by checking for the
   EXTENT_DELALLOC bit in the io tree for that range and ordered extents
   (with btrfs_find_delalloc_in_range()). But it finds nothing since the
   delalloc in that range was already flushed and the ordered extent
   completed and is gone - as a result fiemap will not report that there's
   delalloc or an extent for the range [64M, 65M[, so user space will be
   misled into thinking that there's a hole in that range.

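To make the user space impact concrete, here is a minimal sketch of the
kind of FIEMAP-based hole detection that cp versions before coreutils
9.0 relied on (this program is illustrative only and not part of the
patch; error handling is reduced to the bare minimum):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/fs.h>
  #include <linux/fiemap.h>

  int main(int argc, char **argv)
  {
          struct fiemap *fm;
          int fd;

          if (argc < 2)
                  return 1;
          fd = open(argv[1], O_RDONLY);
          if (fd < 0)
                  return 1;
          /* Room for the header plus up to 32 extent records. */
          fm = calloc(1, sizeof(*fm) + 32 * sizeof(struct fiemap_extent));
          if (!fm)
                  return 1;
          fm->fm_start = 0;
          fm->fm_length = FIEMAP_MAX_OFFSET;
          /* No FIEMAP_FLAG_SYNC, so delalloc is not flushed first. */
          fm->fm_flags = 0;
          fm->fm_extent_count = 32;
          if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0)
                  return 1;
          /*
           * A cp-style copy loop reads only the reported ranges
           * (delalloc shows up with FIEMAP_EXTENT_DELALLOC set) and
           * treats everything else as a hole. If the kernel fails to
           * report a delalloc range, its data is silently skipped.
           */
          for (unsigned int i = 0; i < fm->fm_mapped_extents; i++)
                  printf("logical %llu len %llu flags 0x%x\n",
                         (unsigned long long)fm->fm_extents[i].fe_logical,
                         (unsigned long long)fm->fm_extents[i].fe_length,
                         fm->fm_extents[i].fe_flags);
          return 0;
  }
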
This could actually be sporadically triggered with test case generic/094
from fstests, which reports a missing extent/delalloc range like this:

  # generic/094 2s ... - output mismatch (see /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad)
  #    --- tests/generic/094.out 2020-06-10 19:29:03.830519425 +0100
  #    +++ /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad 2024-02-28 11:00:00.381071525 +0000
  #    @@ -1,3 +1,9 @@
  #     QA output created by 094
  #     fiemap run with sync
  #     fiemap run without sync
  #    +ERROR: couldn't find extent at 7
  #    +map is 'HHDDHPPDPHPH'
  #    +logical: [ 5.. 6] phys: 301517.. 301518 flags: 0x800 tot: 2
  #    +logical: [ 8.. 8] phys: 301520.. 301520 flags: 0x800 tot: 1
  #    ...
  #    (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/generic/094.out /home/fdmanana/git/hub/xfstests/results//generic/094.out.bad' to see the entire diff)

So in order to fix this, while still avoiding deadlocks in the case where
the fiemap buffer is memory mapped to the same file, change fiemap to work
like the following:

1) Always lock the whole range in the inode's io tree before starting to
   iterate the inode's subvolume tree searching for file extent items,
   just like we did before commit b0ad381fa769 ("btrfs: fix deadlock with
   fiemap and extent locking");

2) Now instead of writing to the fiemap buffer every time we have an extent
   to report, write instead to a temporary buffer (1 page). When that
   buffer becomes full, stop iterating the file extent items, unlock the
   range in the io tree, release the search path, submit all the entries
   kept in that buffer to the fiemap buffer, and then resume the search
   for file extent items after locking the remainder of the range in the
   io tree again (see the sketch below).

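The following stand-alone sketch models that control flow (the names and
the toy "extent scan" are hypothetical, only the buffering pattern
mirrors the patch): entries are staged in a small fixed-size buffer
while the range is "locked", and are only copied to their final
destination after "unlocking", with the scan restarting right after the
last staged entry:

  #include <stdio.h>

  /* Stands in for PAGE_SIZE / sizeof(struct btrfs_fiemap_entry). */
  #define ENTRIES_SIZE 4

  struct entry { unsigned long long offset, len; };

  /*
   * Copy staged entries to their final destination. In fiemap this step
   * may fault on a user space buffer, which is why it must run with the
   * extent range unlocked.
   */
  static void flush(struct entry *buf, int *pos)
  {
          for (int i = 0; i < *pos; i++)
                  printf("emit [%llu, %llu)\n", buf[i].offset,
                         buf[i].offset + buf[i].len);
          *pos = 0;
  }

  int main(void)
  {
          struct entry buf[ENTRIES_SIZE];
          unsigned long long start = 0, end = 40;
          int pos = 0;

          while (start < end) {
                  /* "lock" [start, end) and scan from start... */
                  for (unsigned long long off = start; off < end; off += 4) {
                          if (pos == ENTRIES_SIZE) {
                                  /* Buffer full: resume right after the
                                   * last staged entry, like
                                   * cache->next_search_offset. */
                                  start = buf[pos - 1].offset + buf[pos - 1].len;
                                  goto flush_and_restart;
                          }
                          buf[pos].offset = off;
                          buf[pos].len = 4;
                          pos++;
                  }
                  start = end;
  flush_and_restart:
                  /* ..."unlock" the range, then it is safe to flush. */
                  flush(buf, &pos);
          }
          return 0;
  }
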
A buffer with the size of a page allows for 146 entries in a system
with 4K pages. This is a large enough value to get good performance by
avoiding too many restarts of the search for file extent items. In other
words, this preserves the huge performance gains made to fiemap in the
last two years, while avoiding the deadlocks in case the fiemap buffer
is memory mapped to the same file (useless in practice, but possible and
exercised by fuzz testing and syzbot).

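(For reference, the 146 figure comes from the entry layout: struct
btrfs_fiemap_entry holds three u64 fields plus one u32, i.e. 3 * 8 + 4 =
28 bytes of payload, and 4096 / 28 = 146. If the compiler pads the
struct to 32 bytes for 8-byte alignment, the same division yields
4096 / 32 = 128 entries - still far more than enough to keep restarts
rare.)
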
Fixes: b0ad381fa769 ("btrfs: fix deadlock with fiemap and extent locking")
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/btrfs/extent_io.c | 221 +++++++++++++++++++++++++++++++------------
 1 file changed, 160 insertions(+), 61 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 45d427c3033d7..5acb2cb79d4bf 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2410,12 +2410,65 @@ int try_release_extent_mapping(struct page *page, gfp_t mask)
 	return try_release_extent_state(tree, page, mask);
 }
 
+struct btrfs_fiemap_entry {
+	u64 offset;
+	u64 phys;
+	u64 len;
+	u32 flags;
+};
+
 /*
- * To cache previous fiemap extent
+ * Indicate to the caller of emit_fiemap_extent() that it needs to unlock the
+ * file range from the inode's io tree, release the subvolume tree search path,
+ * flush the fiemap cache, relock the file range and re-search the subvolume tree.
+ * The value here is something negative that can't be confused with a valid
+ * errno value and different from 1 because that's also a return value from
+ * fiemap_fill_next_extent() and also it's often used to mean some btree search
+ * did not find a key, so make it some distinct negative value.
+ */
+#define BTRFS_FIEMAP_FLUSH_CACHE (-(MAX_ERRNO + 1))
+
+/*
+ * Used to:
+ *
+ * - Cache the next entry to be emitted to the fiemap buffer, so that we can
+ *   merge extents that are contiguous and can be grouped as a single one;
  *
- * Will be used for merging fiemap extent
+ * - Store extents ready to be written to the fiemap buffer in an intermediary
+ *   buffer. This intermediary buffer is to ensure that in case the fiemap
+ *   buffer is memory mapped to the fiemap target file, we don't deadlock
+ *   during btrfs_page_mkwrite(). This is because during fiemap we are locking
+ *   an extent range in order to prevent races with delalloc flushing and
+ *   ordered extent completion, which is needed in order to reliably detect
+ *   delalloc in holes and prealloc extents. And this can lead to a deadlock
+ *   if the fiemap buffer is memory mapped to the file we are running fiemap
+ *   against (a silly, useless in practice scenario, but possible) because
+ *   btrfs_page_mkwrite() will try to lock the same extent range.
  */
 struct fiemap_cache {
+	/* An array of ready fiemap entries. */
+	struct btrfs_fiemap_entry *entries;
+	/* Number of entries in the entries array. */
+	int entries_size;
+	/* Index of the next entry in the entries array to write to. */
+	int entries_pos;
+	/*
+	 * Once the entries array is full, this indicates what's the offset for
+	 * the next file extent item we must search for in the inode's subvolume
+	 * tree after unlocking the extent range in the inode's io tree and
+	 * releasing the search path.
+	 */
+	u64 next_search_offset;
+	/*
+	 * This matches struct fiemap_extent_info::fi_mapped_extents, we use it
+	 * to count emitted extents ourselves and stop instead of relying on
+	 * fiemap_fill_next_extent() because we buffer ready fiemap entries at
+	 * the @entries array, and we want to stop as soon as we hit the max
+	 * amount of extents to map, not just to save time but also to make the
+	 * logic at extent_fiemap() simpler.
+	 */
+	unsigned int extents_mapped;
+	/* Fields for the cached extent (unsubmitted, not ready, extent). */
 	u64 offset;
 	u64 phys;
 	u64 len;
@@ -2423,6 +2476,28 @@ struct fiemap_cache {
 	bool cached;
 };
 
+static int flush_fiemap_cache(struct fiemap_extent_info *fieinfo,
+			      struct fiemap_cache *cache)
+{
+	for (int i = 0; i < cache->entries_pos; i++) {
+		struct btrfs_fiemap_entry *entry = &cache->entries[i];
+		int ret;
+
+		ret = fiemap_fill_next_extent(fieinfo, entry->offset,
+					      entry->phys, entry->len,
+					      entry->flags);
+		/*
+		 * Ignore 1 (reached max entries) because we keep track of that
+		 * ourselves in emit_fiemap_extent().
+		 */
+		if (ret < 0)
+			return ret;
+	}
+	cache->entries_pos = 0;
+
+	return 0;
+}
+
 /*
  * Helper to submit fiemap extent.
  *
@@ -2437,8 +2512,8 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 			      struct fiemap_cache *cache,
 			      u64 offset, u64 phys, u64 len, u32 flags)
 {
+	struct btrfs_fiemap_entry *entry;
 	u64 cache_end;
-	int ret = 0;
 
 	/* Set at the end of extent_fiemap(). */
 	ASSERT((flags & FIEMAP_EXTENT_LAST) == 0);
@@ -2451,7 +2526,9 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 	 * find an extent that starts at an offset behind the end offset of the
 	 * previous extent we processed. This happens if fiemap is called
 	 * without FIEMAP_FLAG_SYNC and there are ordered extents completing
-	 * while we call btrfs_next_leaf() (through fiemap_next_leaf_item()).
+	 * after we had to unlock the file range, release the search path, emit
+	 * the fiemap extents stored in the buffer (cache->entries array) and
+	 * then lock the remainder of the range and re-search the btree.
 	 *
 	 * For example we are in leaf X processing its last item, which is the
 	 * file extent item for file range [512K, 1M[, and after
@@ -2564,11 +2641,35 @@ static int emit_fiemap_extent(struct fiemap_extent_info *fieinfo,
 
 emit:
 	/* Not mergeable, need to submit cached one */
-	ret = fiemap_fill_next_extent(fieinfo, cache->offset, cache->phys,
-				      cache->len, cache->flags);
-	cache->cached = false;
-	if (ret)
-		return ret;
+
+	if (cache->entries_pos == cache->entries_size) {
+		/*
+		 * We will need to re-search from the end offset of the last
+		 * stored extent and not from the current offset, because after
+		 * unlocking the range and releasing the path, if there's a hole
+		 * between that end offset and this current offset, a new extent
+		 * may have been inserted due to a new write, so we don't want
+		 * to miss it.
+		 */
+		entry = &cache->entries[cache->entries_size - 1];
+		cache->next_search_offset = entry->offset + entry->len;
+		cache->cached = false;
+
+		return BTRFS_FIEMAP_FLUSH_CACHE;
+	}
+
+	entry = &cache->entries[cache->entries_pos];
+	entry->offset = cache->offset;
+	entry->phys = cache->phys;
+	entry->len = cache->len;
+	entry->flags = cache->flags;
+	cache->entries_pos++;
+	cache->extents_mapped++;
+
+	if (cache->extents_mapped == fieinfo->fi_extents_max) {
+		cache->cached = false;
+		return 1;
+	}
 assign:
 	cache->cached = true;
 	cache->offset = offset;
@@ -2694,8 +2795,8 @@ static int fiemap_search_slot(struct btrfs_inode *inode, struct btrfs_path *path
 	 * neighbour leaf).
 	 * We also need the private clone because holding a read lock on an
 	 * extent buffer of the subvolume's b+tree will make lockdep unhappy
-	 * when we call fiemap_fill_next_extent(), because that may cause a page
-	 * fault when filling the user space buffer with fiemap data.
+	 * when we check if extents are shared, as backref walking may need to
+	 * lock the same leaf we are processing.
 	 */
 	clone = btrfs_clone_extent_buffer(path->nodes[0]);
 	if (!clone)
@@ -2735,34 +2836,16 @@ static int fiemap_process_hole(struct btrfs_inode *inode,
 	 * it beyond i_size.
 	 */
 	while (cur_offset < end && cur_offset < i_size) {
-		struct extent_state *cached_state = NULL;
 		u64 delalloc_start;
 		u64 delalloc_end;
 		u64 prealloc_start;
-		u64 lockstart;
-		u64 lockend;
 		u64 prealloc_len = 0;
 		bool delalloc;
 
-		lockstart = round_down(cur_offset, inode->root->fs_info->sectorsize);
-		lockend = round_up(end, inode->root->fs_info->sectorsize);
-
-		/*
-		 * We are only locking for the delalloc range because that's the
-		 * only thing that can change here. With fiemap we have a lock
-		 * on the inode, so no buffered or direct writes can happen.
-		 *
-		 * However mmaps and normal page writeback will cause this to
-		 * change arbitrarily. We have to lock the extent lock here to
-		 * make sure that nobody messes with the tree while we're doing
-		 * btrfs_find_delalloc_in_range.
-		 */
-		lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		delalloc = btrfs_find_delalloc_in_range(inode, cur_offset, end,
 							delalloc_cached_state,
 							&delalloc_start,
 							&delalloc_end);
-		unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 		if (!delalloc)
 			break;
 
@@ -2930,6 +3013,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		  u64 start, u64 len)
 {
 	const u64 ino = btrfs_ino(inode);
+	struct extent_state *cached_state = NULL;
 	struct extent_state *delalloc_cached_state = NULL;
 	struct btrfs_path *path;
 	struct fiemap_cache cache = { 0 };
@@ -2942,26 +3026,33 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 	bool stopped = false;
 	int ret;
 
+	cache.entries_size = PAGE_SIZE / sizeof(struct btrfs_fiemap_entry);
+	cache.entries = kmalloc_array(cache.entries_size,
+				      sizeof(struct btrfs_fiemap_entry),
+				      GFP_KERNEL);
 	backref_ctx = btrfs_alloc_backref_share_check_ctx();
 	path = btrfs_alloc_path();
-	if (!backref_ctx || !path) {
+	if (!cache.entries || !backref_ctx || !path) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
+restart:
 	range_start = round_down(start, sectorsize);
 	range_end = round_up(start + len, sectorsize);
 	prev_extent_end = range_start;
 
+	lock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
 	ret = fiemap_find_last_extent_offset(inode, path, &last_extent_end);
 	if (ret < 0)
-		goto out;
+		goto out_unlock;
 	btrfs_release_path(path);
 
 	path->reada = READA_FORWARD;
 	ret = fiemap_search_slot(inode, path, range_start);
 	if (ret < 0) {
-		goto out;
+		goto out_unlock;
 	} else if (ret > 0) {
 		/*
 		 * No file extent item found, but we may have delalloc between
@@ -3008,7 +3099,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 						  backref_ctx, 0, 0, 0,
 						  prev_extent_end, hole_end);
 			if (ret < 0) {
-				goto out;
+				goto out_unlock;
 			} else if (ret > 0) {
 				/* fiemap_fill_next_extent() told us to stop. */
 				stopped = true;
@@ -3064,7 +3155,7 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 							  extent_gen,
 							  backref_ctx);
 				if (ret < 0)
-					goto out;
+					goto out_unlock;
 				else if (ret > 0)
 					flags |= FIEMAP_EXTENT_SHARED;
 			}
@@ -3075,9 +3166,9 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
-			/* fiemap_fill_next_extent() told us to stop. */
+			/* emit_fiemap_extent() told us to stop. */
 			stopped = true;
 			break;
 		}
@@ -3086,12 +3177,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 next_item:
 		if (fatal_signal_pending(current)) {
 			ret = -EINTR;
-			goto out;
+			goto out_unlock;
 		}
 
 		ret = fiemap_next_leaf_item(inode, path);
 		if (ret < 0) {
-			goto out;
+			goto out_unlock;
 		} else if (ret > 0) {
 			/* No more file extent items for this inode. */
 			break;
@@ -3100,22 +3191,12 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 	}
 check_eof_delalloc:
-	/*
-	 * Release (and free) the path before emitting any final entries to
-	 * fiemap_fill_next_extent() to keep lockdep happy. This is because
-	 * once we find no more file extent items exist, we may have a
-	 * non-cloned leaf, and fiemap_fill_next_extent() can trigger page
-	 * faults when copying data to the user space buffer.
-	 */
-	btrfs_free_path(path);
-	path = NULL;
-
 	if (!stopped && prev_extent_end < range_end) {
 		ret = fiemap_process_hole(inode, fieinfo, &cache,
 					  &delalloc_cached_state, backref_ctx,
 					  0, 0, 0, prev_extent_end, range_end - 1);
 		if (ret < 0)
-			goto out;
+			goto out_unlock;
 		prev_extent_end = range_end;
 	}
 
@@ -3123,28 +3204,16 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		const u64 i_size = i_size_read(&inode->vfs_inode);
 
 		if (prev_extent_end < i_size) {
-			struct extent_state *cached_state = NULL;
 			u64 delalloc_start;
 			u64 delalloc_end;
-			u64 lockstart;
-			u64 lockend;
 			bool delalloc;
 
-			lockstart = round_down(prev_extent_end, sectorsize);
-			lockend = round_up(i_size, sectorsize);
-
-			/*
-			 * See the comment in fiemap_process_hole as to why
-			 * we're doing the locking here.
-			 */
-			lock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			delalloc = btrfs_find_delalloc_in_range(inode,
 								prev_extent_end,
 								i_size - 1,
 								&delalloc_cached_state,
 								&delalloc_start,
 								&delalloc_end);
-			unlock_extent(&inode->io_tree, lockstart, lockend, &cached_state);
 			if (!delalloc)
 				cache.flags |= FIEMAP_EXTENT_LAST;
 		} else {
@@ -3152,9 +3221,39 @@ int extent_fiemap(struct btrfs_inode *inode, struct fiemap_extent_info *fieinfo,
 		}
 	}
 
+out_unlock:
+	unlock_extent(&inode->io_tree, range_start, range_end, &cached_state);
+
+	if (ret == BTRFS_FIEMAP_FLUSH_CACHE) {
+		btrfs_release_path(path);
+		ret = flush_fiemap_cache(fieinfo, &cache);
+		if (ret)
+			goto out;
+		len -= cache.next_search_offset - start;
+		start = cache.next_search_offset;
+		goto restart;
+	} else if (ret < 0) {
+		goto out;
+	}
+
+	/*
+	 * Must free the path before emitting to the fiemap buffer because we
+	 * may have a non-cloned leaf and if the fiemap buffer is memory mapped
+	 * to a file, a write into it (through btrfs_page_mkwrite()) may trigger
+	 * waiting for an ordered extent that in order to complete needs to
+	 * modify that leaf, therefore leading to a deadlock.
+	 */
+	btrfs_free_path(path);
+	path = NULL;
+
+	ret = flush_fiemap_cache(fieinfo, &cache);
+	if (ret)
+		goto out;
+
 	ret = emit_last_fiemap_cache(fieinfo, &cache);
 out:
 	free_extent_state(delalloc_cached_state);
+	kfree(cache.entries);
 	btrfs_free_backref_share_ctx(backref_ctx);
 	btrfs_free_path(path);
 	return ret;
-- 
2.43.0
