Qu Wenruo [Thu, 29 Jan 2026 03:23:45 +0000 (13:53 +1030)]
btrfs: get rid of compressed_folios[] usage for encoded writes
Currently only encoded writes use btrfs_submit_compressed_write(),
which relies on the compressed_bio::compressed_folios[] array.
Change the only call site to call the new helper,
btrfs_alloc_compressed_write(), to allocate a compressed bio, then queue
needed folios into that bio, and finally call
btrfs_submit_compressed_write() to submit the compressed bio.
This change has one hidden benefit: previously we used
btrfs_alloc_folio_array() for the folios passed to
btrfs_submit_compressed_write(), which doesn't use the compression
page pool for bs == ps cases.
Now we call btrfs_alloc_compr_folio(), which benefits from the page pool.
The other obvious benefit is that we no longer need to allocate an array
to hold all those folios, thus one less error path.
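As a rough sketch of the new call flow at that call site (helper names as
in this log; the argument lists and the 'bbio' member naming are
assumptions, not the actual signatures):

    struct compressed_bio *cb;
    int i;

    /* Allocate the compressed bio covering the target file range. */
    cb = btrfs_alloc_compressed_write(inode, file_offset, num_bytes);

    /* Queue each folio straight into the bio, no array needed. */
    for (i = 0; i < nr_folios; i++)
        bio_add_folio_nofail(&cb->bbio.bio, folios[i],
                             folio_size(folios[i]), 0);

    btrfs_submit_compressed_write(cb);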
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 29 Jan 2026 03:23:44 +0000 (13:53 +1030)]
btrfs: get rid of compressed_folios[] usage for compressed read
Currently btrfs_submit_compressed_read() still uses
compressed_bio::compressed_folios[] array.
Change it to allocate each folio and queue them into the compressed bio
so that we do not need to allocate that array.
Considering how small each compressed read bio is (less than 128KiB), we
do not benefit that much from btrfs_alloc_folio_array() anyway,
while we may benefit more from btrfs_alloc_compr_folio() by using
the global folio pool.
So changing from btrfs_alloc_folio_array() to btrfs_alloc_compr_folio()
in a loop should still be fine.
This removes one error path, and paves the way to completely remove
compressed_folios[] array.
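A sketch of the per-folio loop described above (assuming
btrfs_alloc_compr_folio() takes no arguments and hands out one page-sized
folio at a time; error handling abbreviated):

    u32 cur;

    for (cur = 0; cur < compressed_len; cur += PAGE_SIZE) {
        struct folio *folio = btrfs_alloc_compr_folio();

        if (!folio)
            return -ENOMEM; /* folios already queued get freed via the bio */
        bio_add_folio_nofail(&cb->bbio.bio, folio, PAGE_SIZE, 0);
    }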
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 29 Jan 2026 03:23:42 +0000 (13:53 +1030)]
btrfs: switch to btrfs_compress_bio() interface for compressed writes
This switch has the following benefits:
- A single structure to handle all compression
No more extra members like compressed_folios[] or compress_type; all
of those are now carried by the compressed bio.
This makes the async_extent structure much smaller.
- Simpler error handling
A single cleanup_compressed_bio() will handle everything, with no extra
compressed_folios[] array to worry about.
Some extra notes:
- Compressed folios releasing
Now we use a bio_for_each_folio_all() loop to release the folios of the
bio (see the sketch after these notes). This works for both the old
compressed_folios[] array and the new pure bio method.
For the old compressed_folios[] array, all folios of that array are
queued into the bio, thus releasing the folios from the bio is the same
as releasing each folio of the array. We just need to make sure no folio
is released twice, from both the array and the bio.
For the new pure bio method, that array is NULL, so it's just the usual
folio releasing of a bio.
The only extra note is for end_bbio_compressed_read(): as its folios
are allocated using btrfs_alloc_folio_array(), they must only be
released by regular folio_put(), not btrfs_free_compr_folio().
- Rounding up the bio to block size
We cannot simply increase bi_size, as that would not increase the
length of the last bvec.
Thus we have to properly add the last part into the bio, which is done
by the helper round_up_last_block().
The reason we do not round those bios up at compression time is to keep
the unaligned compressed size, so that it can be used for inline
extents.
If we rounded the bios up in *_compress_bio(), every compressed bio
would be at least one fs block in size, resulting in no inline
compressed extents.
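A minimal sketch of the release loop from the first note, assuming the
compressed_bio embeds its btrfs_bio as 'bbio' (the read completion path
would call folio_put() instead, as explained above):

    struct folio_iter fi;

    /* Release every folio queued into the compressed bio. */
    bio_for_each_folio_all(fi, &cb->bbio.bio)
        btrfs_free_compr_folio(fi.folio);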
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 29 Jan 2026 03:23:41 +0000 (13:53 +1030)]
btrfs: introduce btrfs_compress_bio() helper
The helper will allocate a new compressed_bio, do the compression, and
return it to the caller.
This greatly simplifies the compression path, as we no longer need to
allocate a folio array, which removes an error path. Furthermore, the
compressed bio structure can be used for submission with very minor
modifications (like rounding up bi_size and populating bi_sector).
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 29 Jan 2026 03:23:38 +0000 (13:53 +1030)]
btrfs: lzo: introduce lzo_compress_bio() helper
The new helper has the following enhancements over the existing
lzo_compress_folios():
- Much smaller parameter list
No more shared IN/OUT members, no need to pre-allocate a
compressed_folios[] array.
Just a workspace list header and a compressed_bio pointer.
Everything else can be fetched from that @cb pointer.
- Ready-to-be-submitted compressed bio
Although the caller still needs to do some common work, like rounding
up and zeroing the tail part of the last fs block.
Some work is specific to LZO and not needed by the other multi-run
compression interfaces:
- Need to write a LZO header or segment header
Use the new write_and_queue_folio() helper to do the bio_add_folio()
call and folio switching.
- Need to update the LZO header after compression is done
Use bio_first_folio_all() to grab the first folio and update the header.
- Extra corner case of error handling
This can happen when we have queued part of a folio and then hit an
error. In that case the queued folios will be released by the bio,
thus we can only release a folio that has no part queued.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 30 Jan 2026 17:06:45 +0000 (17:06 +0000)]
btrfs: raid56: fix memory leak of btrfs_raid_bio::stripe_uptodate_bitmap
We allocate the bitmap but we never free it in free_raid_bio_pointers().
Fix this by adding a bitmap_free() call against the stripe_uptodate_bitmap
of a raid bio.
Fixes: 1810350b04ef ("btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap")
Reported-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/linux-btrfs/20260126045315.GA31641@lst.de/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Fri, 30 Jan 2026 00:11:22 +0000 (16:11 -0800)]
btrfs: tests: add unit tests for pending extent walking functions
I ran into another sort of trivial bug in v1 of the patch and concluded
that these functions really ought to be unit tested.
These two functions form the core of searching the chunk allocation
pending extent bitmap and have relatively easily definable semantics, so
unit testing them can help ensure the correctness of chunk allocation.
I also made a minor unrelated fix in volumes.h to properly forward
declare btrfs_space_info. Because of the order of the includes in the
new test, this was actually hitting a latent build warning.
Note:
This is an early example for me of a commit authored in part by an AI
agent, so I wanted to be clear about what I did. I defined a trivial
test and explained the set of tests I wanted to the agent, and it
produced the large set of test cases seen here. I then checked each test
case to make sure it matched the description, and simplified the
constants and numbers until they looked reasonable to me. I then checked
the looping logic to make sure it kept to the original spirit of the
trivial test. Finally, I carefully combed over all the lines it wrote
to loop over the tests it generated, to make sure they followed our code
style guide.
Assisted-by: Claude:claude-opus-4-5
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Fri, 30 Jan 2026 00:11:21 +0000 (16:11 -0800)]
btrfs: fix EEXIST abort due to non-consecutive gaps in chunk allocation
I have been observing a number of systems aborting at
insert_dev_extents() in btrfs_create_pending_block_groups(). The
following is a sample stack trace of such an abort coming from forced
chunk allocation (typically behind CONFIG_BTRFS_EXPERIMENTAL) but this
can theoretically happen to any DUP chunk allocation.
Inspecting these aborts with drgn, I observed a pattern of overlapping
chunk_maps. Note how stripe 1 of the first chunk overlaps in physical
address with stripe 0 of the second chunk.
Now how could this possibly happen? All chunk allocation is protected by
the chunk_mutex, so racing allocations should see a consistent view of
the CHUNK_ALLOCATED bit in the chunk allocation extent-io-tree
(device->alloc_state, as set by chunk_map_device_set_bits()). The tree
itself is protected by a spin lock, and clearing/setting the bits is
always protected by fs_info->mapping_tree_lock, so no race is apparent.
It turns out that there is a subtle bug in the logic regarding chunk
allocations that have happened in the current transaction, known as
"pending extents". The chunk allocation as defined in
find_free_dev_extent() is a loop which searches the commit root of the
dev_root and looks for gaps between DEV_EXTENT items. For those gaps, it
then checks alloc_state bitmap for any pending extents and adjusts the
hole that it finds accordingly. However, the logic in that adjustment
assumes that the first pending extent is the only one in that range.
E.g., given a layout with two non-consecutive pending extents A and B in
a hole passed to dev_extent_hole_check() via *hole_start and *hole_size,
the code incorrectly returns a "hole" from the end of pending extent A
to the passed in hole end, failing to account for pending extent B.
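The fixed semantics can be sketched as follows, using hypothetical names
loosely based on the two new helpers described later in this message
(not the actual code):

    /*
     * Instead of stopping at the first pending extent, keep advancing
     * past pending extents until the hole in front of us is big enough.
     */
    while (find_first_pending_extent(device, hole_start, hole_end,
                                     &pend_start, &pend_end)) {
        if (pend_start - hole_start >= num_bytes)
            break;             /* the hole before the pending extent fits */
        hole_start = pend_end; /* skip past it and keep searching */
    }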
However, it is not entirely obvious that it is actually possible to
produce such a layout. I was able to reproduce it, but with some
contortions: I continued to use the force chunk allocation sysfs file
and I introduced a long delay (10 seconds) into the start of the cleaner
thread. I also prevented the unused bgs cleaning logic from ever
deleting metadata bgs. These help make it easier to deterministically
produce the condition but shouldn't really matter if you imagine the
conditions happening by race/luck. Allocations/frees can happen
concurrently with the cleaner thread preparing to process an unused
extent and both create some used chunks with an unused chunk
interleaved, all during one transaction. Then btrfs_delete_unused_bgs()
sees the unused one and clears it, leaving a range with several pending
chunk allocations and a gap in the middle.
The basic idea is that the unused_bgs cleanup work happens on a worker
so if we allocate 3 block groups in one transaction, then the cleaner
work kicked off by the previous transaction comes through and deletes
the middle one of the 3, then the commit root shows no dev extents and
we have the bad pattern in the extent-io-tree. One final consideration
is that the code happens to loop to the next hole if there are no more
extents at all, so we need one more dev extent way past the area we are
working in. Something like the following demonstrates the technique:
# push the BG frontier out to 20G
fallocate -l 20G $mnt/foo
# allocate one more that will prevent the "no more dev extents" luck
fallocate -l 1G $mnt/sticky
# sync
sync
# clear out the allocation area
rm $mnt/foo
sync
_cleaner
# let everything quiesce
sleep 20
sync
# dev tree should have one bg 20G out and the rest at the beginning..
# sort of like an empty FS but with a random sticky chunk.
# kick off the cleaner in the background, remember it will sleep 10s
# before doing interesting work
_cleaner &
# let the cleaner thread definitely finish, it will remove the data bg
sleep 10
# this allocation sees the non-consecutive pending metadata chunks with
# data chunk gap of 1G and allocates a 2G extent in that hole. ENOSPC!
echo 1 > "$(_btrfs_sysfs_space_info $dev metadata)/force_chunk_alloc"
As for the fix, it is not that obvious. I could not see a trivial way to
do it even by adding backup loops into find_free_dev_extent(), so I
opted to change the semantics of dev_extent_hole_check() to not stop
looping until it finds a sufficiently big hole. For clarity, this also
required changing the helper function contains_pending_extent() into two
new helpers which find the first pending extent and the first suitable
hole in a range.
I attempted to clean up the documentation and range calculations to be
as consistent and clear as possible for the future.
I also looked at the zoned case and concluded that the loop there is
different and not to be unified with this one. As far as I can tell, the
zoned check will only further constrain the hole so looping back to find
more holes is acceptable. Though given that zoned really only appends, I
find it highly unlikely that it is susceptible to this bug.
Fixes: 1b9845081633 ("Btrfs: fix find_free_dev_extent() malfunction in case device tree has hole")
Reported-by: Dimitrios Apostolou <jimis@gmx.net>
Closes: https://lore.kernel.org/linux-btrfs/q7760374-q1p4-029o-5149-26p28421s468@tzk.arg/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
jinbaohong [Wed, 28 Jan 2026 07:06:41 +0000 (07:06 +0000)]
btrfs: fix transaction commit blocking during trim of unallocated space
When trimming unallocated space, btrfs_trim_fs() holds the device_list_mutex
for the entire duration while iterating through all devices. On large
filesystems with significant unallocated space, this operation can take
minutes or even hours.
This causes a problem because btrfs_run_dev_stats(), which is called
during transaction commit, also requires device_list_mutex:
While trim is running, all transaction commits are blocked waiting for
the mutex.
Fix this by refactoring btrfs_trim_free_extents() to process devices in
bounded chunks (up to 2GB per iteration) and release device_list_mutex
between chunks.
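A sketch of the chunked iteration (the SZ_2G step matches the 2GB bound
above; the btrfs_trim_free_extents() arguments here are assumptions):

    u64 start = 0;

    while (start < device->total_bytes) {
        u64 len = min_t(u64, SZ_2G, device->total_bytes - start);

        mutex_lock(&fs_devices->device_list_mutex);
        ret = btrfs_trim_free_extents(device, start, len, &trimmed);
        mutex_unlock(&fs_devices->device_list_mutex);
        if (ret)
            break;
        start += len; /* transaction commits can run between chunks */
    }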
Signed-off-by: robbieko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jinbaohong [Wed, 28 Jan 2026 07:06:40 +0000 (07:06 +0000)]
btrfs: handle user interrupt properly in btrfs_trim_fs()
When a fatal signal is pending or the process is freezing,
btrfs_trim_block_group() and btrfs_trim_free_extents() return -ERESTARTSYS.
Currently this is treated as a regular error: the loops continue to the
next iteration and count it as a block group or device failure.
Instead, break out of the loops immediately and return -ERESTARTSYS to
userspace without counting it as a failure. Also skip the device loop
entirely if the block group loop was interrupted.
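The shape of the change in the block group loop (illustrative only):

    ret = btrfs_trim_block_group(block_group, &group_trimmed,
                                 start, end, range->minlen);
    if (ret == -ERESTARTSYS)
        break;      /* interrupted: bail out, not a trim failure */
    if (ret) {
        bg_failed++;
        continue;   /* a real failure, try the next block group */
    }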
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jinbaohong [Wed, 28 Jan 2026 07:06:39 +0000 (07:06 +0000)]
btrfs: preserve first error in btrfs_trim_fs()
When multiple block groups or devices fail during trim, preserve the
first error encountered rather than the last one. The first error is
typically more useful for debugging as it represents the original
failure, while subsequent errors may be cascading effects.
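The pattern, sketched with hypothetical variable names:

    if (ret) {
        bg_failed++;
        /* Keep only the first error; later ones may be cascading. */
        if (!bg_ret)
            bg_ret = ret;
        continue;
    }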
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
jinbaohong [Wed, 28 Jan 2026 07:06:38 +0000 (07:06 +0000)]
btrfs: continue trimming remaining devices on failure
Commit 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle
error better") intended to make device trimming continue even if one
device fails, tracking failures and reporting them at the end. However,
it used 'break' instead of 'continue', causing the loop to exit on the
first device failure.
Fix this by replacing 'break' with 'continue'.
Fixes: 93bba24d4b5a ("btrfs: Enhance btrfs_trim_fs function to handle error better")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Robbie Ko <robbieko@synology.com>
Signed-off-by: jinbaohong <jinbaohong@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 23 Jan 2026 10:05:12 +0000 (10:05 +0000)]
btrfs: do not BUG_ON() in btrfs_remove_block_group()
There's no need to BUG_ON(), we can just abort the transaction and return
an error.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 23 Jan 2026 09:49:57 +0000 (09:49 +0000)]
btrfs: abort transaction on error in btrfs_remove_block_group()
When btrfs_remove_block_group() fails we abort the transaction in its
single caller (btrfs_remove_chunk()). This makes it harder to find out
where exactly the failure happened, as several steps inside
btrfs_remove_block_group() can fail.
So make btrfs_remove_block_group() abort the transaction whenever an
error happens, instead of aborting in its caller.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Boris Burkov [Tue, 23 Dec 2025 00:15:44 +0000 (16:15 -0800)]
btrfs: fix block_group_tree dirty_list corruption
When the incompat flag EXTENT_TREE_V2 is set, we unconditionally add the
block group tree to the switch_commits list before calling
switch_commit_roots, as we do for the tree root and the chunk root.
However, the block group tree uses normal root dirty tracking and in any
transaction that does an allocation and dirties a block group, the block
group root will already be linked to a list by the dirty_list field and
this use of list_add_tail() is invalid and corrupts the prev/next
members of block_group_root->dirty_list.
This is apparent on a subsequent list_del on the prev if we enable
CONFIG_DEBUG_LIST:
Furthermore, this list corruption eventually (when we happen to add a
new block group) results in getting the switch_commits and
dirty_cowonly_roots lists mixed up and attempting to call update_root
on the tree root which can't be found in the tree root, resulting in a
transaction abort:
Since the block group tree was pulled out of the extent tree and uses
normal root dirty tracking, remove the offending extra list_add. This
fixes the list corruption and the resulting fs corruption.
Fixes: 14033b08a029 ("btrfs: don't save block group root into super block")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: fix copying the flags of btrfs_bio after split
When a btrfs_bio gets split, only 'bbio->csum_search_commit_root' gets
copied to the new btrfs_bio; all the other flags don't.
When a bio is split in btrfs_submit_chunk(), btrfs_split_bio() creates
the new split bio via btrfs_bio_init() which zeroes the struct with
memset. Looking at btrfs_split_bio(), it copies csum_search_commit_root
from the original but does not copy can_use_append.
After the split, the code does:
bbio = split;
bio = &bbio->bio;
This means the split bio (with can_use_append = false) gets submitted,
not the original. In btrfs_submit_dev_bio(), the condition:
if (btrfs_bio(bio)->can_use_append && btrfs_dev_is_sequential(...))
will be false for the split bio even when writing to a sequential zone.
Does the split bio need to inherit can_use_append from the original? The
old code used a local variable use_append which persisted across the
split.
Copy the rest of the flags as well.
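I.e. the split setup ends up copying all the flags, along the lines of
(sketch; member names as referenced above):

    /* In btrfs_split_bio(), after btrfs_bio_init() zeroed the new bbio. */
    split->csum_search_commit_root = orig->csum_search_commit_root;
    split->can_use_append = orig->can_use_append;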
Link: https://lore.kernel.org/linux-btrfs/20260125132120.2525146-1-clm@meta.com/
Reported-by: Chris Mason <clm@meta.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Fri, 23 Jan 2026 12:41:36 +0000 (21:41 +0900)]
btrfs: zoned: fixup last alloc pointer after extent removal for RAID0/10
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
That will then cause an error due to a mismatch of the write pointers.
We can fix up this case by moving the alloc_offset to the corresponding
write pointer position.
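The idea of the fixup, sketched (wp_offset is a hypothetical name for
the write pointer position recovered from the other, sequential zone):

    /*
     * The last extent ended before the real write pointer, e.g. because
     * it was removed: move the pseudo write pointer forward to match.
     */
    if (cache->alloc_offset < wp_offset)
        cache->alloc_offset = wp_offset;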
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Fri, 23 Jan 2026 12:41:35 +0000 (21:41 +0900)]
btrfs: zoned: fixup last alloc pointer after extent removal for DUP
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
That will then cause an error due to a mismatch of the write pointers.
We can fix up this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: c0d90a79e8e6 ("btrfs: zoned: fix alloc_offset calculation for partly conventional block groups")
CC: stable@vger.kernel.org # 6.16+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 17 Dec 2025 11:14:04 +0000 (20:14 +0900)]
btrfs: zoned: fixup last alloc pointer after extent removal for RAID1
When a block group is composed of a sequential write zone and a
conventional zone, we recover the (pseudo) write pointer of the
conventional zone using the end of the last allocated position.
However, if the last extent in a block group is removed, the last extent
position will be smaller than the other real write pointer position.
That will then cause an error due to a mismatch of the write pointers.
We can fix up this case by moving the alloc_offset to the corresponding
write pointer position.
Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:07:32 +0000 (20:07 +0000)]
btrfs: remove out label in btrfs_wait_for_commit()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:06:57 +0000 (20:06 +0000)]
btrfs: remove out label in btrfs_init_space_info()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:06:20 +0000 (20:06 +0000)]
btrfs: remove out label in btrfs_check_rw_degradable()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:05:43 +0000 (20:05 +0000)]
btrfs: remove out label in finish_verity()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:05:02 +0000 (20:05 +0000)]
btrfs: remove out label in scrub_find_fill_first_stripe()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:04:26 +0000 (20:04 +0000)]
btrfs: remove out label in lzo_decompress()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:03:45 +0000 (20:03 +0000)]
btrfs: remove out label in btrfs_mark_extent_written()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:02:51 +0000 (20:02 +0000)]
btrfs: remove out label in btrfs_csum_file_blocks()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 20:01:31 +0000 (20:01 +0000)]
btrfs: remove out_failed label in find_lock_delalloc_range()
There is no point in having the label since all it does is return the
value in the 'found' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:59:13 +0000 (19:59 +0000)]
btrfs: remove out label in load_extent_tree_free()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:58:06 +0000 (19:58 +0000)]
btrfs: remove pointless out labels from uuid-tree.c
Some functions (btrfs_uuid_iter_rem() and btrfs_check_uuid_tree_entry())
have an 'out' label that does nothing but return, making it pointless.
Simplify this by removing the label and returning instead of gotos plus
setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:55:54 +0000 (19:55 +0000)]
btrfs: remove pointless out labels from inode.c
Some functions (insert_inline_extent() and insert_reserved_file_extent())
have an 'out' label that does nothing but return, making it pointless.
Simplify this by removing the label and returning instead of gotos plus
setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:54:00 +0000 (19:54 +0000)]
btrfs: remove pointless out labels from free-space-cache.c
Some functions (update_cache_item(), find_free_space(), trim_bitmaps(),
btrfs_remove_free_space() and cleanup_free_space_cache_v1()) have an 'out'
label that does nothing but return, making it pointless. Simplify this by
removing the label and returning instead of gotos plus setting the 'ret'
variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:52:10 +0000 (19:52 +0000)]
btrfs: remove pointless out labels from extent-tree.c
Some functions (lookup_extent_data_ref(), __btrfs_mod_ref() and
btrfs_free_tree_block()) have an 'out' label that does nothing but
return, making it pointless. Simplify this by removing the label and
returning instead of gotos plus setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:50:03 +0000 (19:50 +0000)]
btrfs: remove pointless out labels from disk-io.c
Some functions (btrfs_validate_extent_buffer() and
btrfs_start_pre_rw_mount()) have an 'out' label that does nothing but
return, making it pointless. Simplify this by removing the label and
returning instead of gotos plus setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:48:33 +0000 (19:48 +0000)]
btrfs: remove pointless out labels from qgroup.c
Some functions (__del_qgroup_relation() and
qgroup_trace_new_subtree_blocks()) have an 'out' label that does nothing
but return, making it pointless. Simplify this by removing the label and
returning instead of gotos plus setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 11:29:02 +0000 (11:29 +0000)]
btrfs: remove pointless out labels from send.c
Some functions (process_extent(), process_recorded_refs_if_needed(),
changed_inode(), compare_refs() and changed_cb()) have an 'out' label that
does nothing but return, making it pointless. Simplify this by removing
the label and returning instead of gotos plus setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 11:25:31 +0000 (11:25 +0000)]
btrfs: remove pointless out labels from ioctl.c
Some functions (__btrfs_ioctl_snap_create(), btrfs_ioctl_subvol_setflags()
and copy_to_sk()) have an 'out' label that does nothing but return, making
it pointless. Simplify this by removing the label and returning instead of
gotos plus setting the 'ret' variable.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 19:35:23 +0000 (19:35 +0000)]
btrfs: qgroup: return correct error when deleting qgroup relation item
If we fail to delete the second qgroup relation item, we end up returning
success or -ENOENT in case the first item does not exist, instead of
returning the error from the second item deletion.
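The shape of the fix (illustrative, error handling condensed; the helper
name is assumed):

    /* Delete both relation items, (src, dst) and (dst, src). */
    ret = del_qgroup_relation_item(trans, src, dst);
    ret2 = del_qgroup_relation_item(trans, dst, src);
    if (ret2 < 0 && ret2 != -ENOENT)
        ret = ret2; /* don't mask the second deletion's error */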
Fixes: 73798c465b66 ("btrfs: qgroup: Try our best to delete qgroup relations")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 9 Jan 2026 17:17:40 +0000 (18:17 +0100)]
btrfs: embed delayed root to struct btrfs_fs_info
The fs_info::delayed_root is allocated dynamically but there's only one
instance per filesystem, so we can embed it into fs_info itself.
The two objects have the same lifetime and the delayed root is always
present, so we don't need to allocate it on demand from a slab.
There's still some space left in fs_info before the 4K boundary, so
there won't be a spill over to the next page on a release config (the
size grows from 3880 to 3952 bytes). In case we want to shrink fs_info
there are still holes to fill, or we can separate out other non-core or
optional structures if needed.
Qu Wenruo [Wed, 21 Jan 2026 23:07:59 +0000 (09:37 +1030)]
btrfs: tests: prepare extent map tests for strict alignment checks
Currently the extent map self tests have the following points that will
cause false alerts for the incoming strict extent map alignment checks:
- Incorrect inlined extent map size
This does not follow what the kernel does for inlined extents, as
btrfs_extent_item_to_extent_map() always uses the fs block size as the
length, not ram_bytes.
Fix it by using SZ_4K as the extent map's length.
- Incorrect btrfs_fs_info::sectorsize
We always use PAGE_SIZE, which can be larger than 4K, while all the
hardcoded numbers in the test case assume a 4K fs block size.
Fix it by using fixed SZ_4K fs block size when allocating the dummy
btrfs_fs_info.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
In the inode self tests, there are several problems:
- Invalid file extents
E.g. hole range [4K, 4K + 4) is completely invalid.
Only inlined extent maps can have an unaligned ram_bytes, and even for
that case, the generated extent map will use sectorsize as em->len.
- Unaligned hole after inlined extent
The kernel never does this by itself; the current btrfs_get_extent()
will only return a single inlined extent map that covers the first
block.
- Incorrect numbers in the comments
E.g. 12291, no matter if you add or subtract 1, is not aligned to 4K.
The proper number for 12K is 12288; I don't know why there is even a
diff of 3, and it completely doesn't match the extent map we
inserted later.
- Hard-to-modify sequence in setup_file_extents()
If some unfortunate person, just like me, needs to modify
setup_file_extents(), good luck not screwing up the file offset.
Fix them by:
- Remove invalid unaligned extent maps
This mostly means removing the [4K, 4K + 4) hole case.
The remaining ones are already properly aligned.
This slightly changes the on-disk data extent allocation: with that
removed, the regular extents at [4K, 8K) and [8K, 12K) could be merged.
So also add a 4K gap between those two data extents to prevent em
merging.
- Remove the implied hole after an inlined extent
Just like what the kernel is doing for inlined extents in the real
world.
- Update the comments using proper numbers with 'K' suffixes
Since there is no unaligned range except the first inlined one, we can
always use numbers with 'K' suffixes, which are way easier to read and
will always be aligned to at least 1024.
- Add comments in setup_file_extents()
So that we're clear about the file offset for each test file extent.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 21 Jan 2026 16:35:56 +0000 (16:35 +0000)]
btrfs: unfold transaction aborts in btrfs_finish_one_ordered()
We have a single transaction abort that can be caused either by a failure
from a call to btrfs_mark_extent_written(), if we are dealing with a
write to a prealloc extent, or otherwise from a call to
insert_ordered_extent_file_extent(). So when the transaction abort happens
we cannot know for sure which case failed. Unfold the aborts so that
it's clear which call failed.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 21 Jan 2026 20:48:35 +0000 (20:48 +0000)]
btrfs: deal with missing root in sample_block_group_extent_item()
In case the root does not exist, which is unexpected, btrfs_extent_root()
returns NULL, but we ignore that, so if it happens we can trigger a
NULL pointer dereference later. So verify that we found the root and log an
error message in case it's missing.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 21 Jan 2026 20:42:53 +0000 (20:42 +0000)]
btrfs: remove bogus root search condition in sample_block_group_extent_item()
There's no need to pass the maximum between the block group's start offset
and BTRFS_SUPER_INFO_OFFSET (64K) since we can't have any block groups
allocated in the first megabyte, as that's reserved space. Furthermore,
even if we could, the correct thing to do was to pass the block group's
start offset anyway - and that's precisely what we do for block groups
that happen to contain a superblock mirror (the range for the super block
is never marked as free and it's marked as dirty in the
fs_info->excluded_extents io tree).
So simplify this and get rid of that maximum expression.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 31 Oct 2025 23:52:16 +0000 (10:22 +1030)]
btrfs: fallback to buffered IO if the data profile has duplication
[BACKGROUND]
Inspired by a recent kernel bug report, which is related to direct IO
buffer modification during writeback, that leads to contents mismatch of
different RAID1 mirrors.
[CAUSE AND PROBLEMS]
The root cause is exactly the same as explained in commit 968f19c5b1b7
("btrfs: always fallback to buffered write if the inode requires
checksum"): we cannot trust a direct IO buffer, which can be modified
halfway through writeback.
Unlike data checksum verification, if this happens on inodes without
data checksum but whose data has extra mirrors, it will lead to
stealth data mismatches between the mirrors.
This is way harder to detect without data checksums.
Furthermore for RAID56, we can even have data without checksum and data
with checksum mixed inside the same full stripe.
In that case if the direct IO buffer got changed halfway for the
nodatasum part, the data with checksum immediately lost its ability to
recover, e.g.:
" " = Good old data or parity calculated using good old data
"X" = Data modified during writeback
        0               32K             64K
Data 1  |                               | Has csum
Data 2  |XXXXXXXXXXXXXXXX               | No csum
Parity  |                               |
In the above case, the parity is calculated using data 1 (has csum, from
page cache, won't change during writeback), and old data 2 (has no csum,
direct IO write).
After parity is calculated, but before submission to the storage, direct
IO buffer of data 2 is modified, causing the range [0, 32K) of data 2
to have different content.
Now all data is submitted to the storage, and the fs got fully synced.
Then the device of data 1 is lost and has to be rebuilt from data 2 and
parity. But since data 2 has some modified contents, and the parity was
calculated using the old data, the recovered data is not the same as the
original data 1, causing a data checksum mismatch.
[FIX]
Fix the problem by checking the data allocation profile.
If our data allocation profile is either RAID0 or SINGLE, we can allow
true zero-copy direct IO and the end user is fully responsible for any
race.
However this is not going to fix all situations, as it's still possible
to race with balance where the fs got a new data profile after the data
allocation profile check.
But this fix should still greatly reduce the window of the original bug.
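A sketch of the profile check (the helper below is a hypothetical
stand-in; the profile/flag names are from existing btrfs code):

    static bool dio_profile_allows_zero_copy(struct btrfs_fs_info *fs_info)
    {
        u64 profile = btrfs_data_alloc_profile(fs_info) &
                      BTRFS_BLOCK_GROUP_PROFILE_MASK;

        /* SINGLE (no profile bit set) and RAID0 keep a single copy. */
        return profile == 0 || profile == BTRFS_BLOCK_GROUP_RAID0;
    }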
Filipe Manana [Tue, 20 Jan 2026 12:09:20 +0000 (12:09 +0000)]
btrfs: assert block group is locked in btrfs_use_block_group_size_class()
It's supposed to be called with the block group locked, in order to read
and set its size_class member, so assert it's locked.
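Since block_group->lock is a spinlock, the assertion boils down to:

    lockdep_assert_held(&block_group->lock);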
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 12:02:23 +0000 (12:02 +0000)]
btrfs: don't pass block group argument to load_block_group_size_class()
There's no need to pass the block group since we can extract it from
the given caching control structure. Same goes for its helper function
sample_block_group_extent_item().
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 11:55:14 +0000 (11:55 +0000)]
btrfs: allocate path on stack in load_block_group_size_class()
Instead of allocating and freeing a path in every iteration of
load_block_group_size_class(), through its helper function
sample_block_group_extent_item(), allocate the path in the former and
pass it to the latter. The path is allocated on the stack since it's
small and we are in a workqueue context, so there's not much stack usage.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 20 Jan 2026 11:42:43 +0000 (11:42 +0000)]
btrfs: make load_block_group_size_class() return void
There's no point in returning anything since determining and setting a
size class for a block group is an optimization, not something critical.
The only caller of load_block_group_size_class() (the caching thread)
does not do anything with the return value anyway, exactly because having
a size class is just an optimization and it can always be set later when
adding reserved bytes to a block group (btrfs_add_reserved_bytes()).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 20 Jan 2026 00:00:10 +0000 (10:30 +1030)]
btrfs: zstd: use folio_iter to handle zstd_decompress_bio()
Currently zstd_decompress_bio() uses the
compressed_bio->compressed_folios[] array to grab each compressed folio.
However cb->compressed_folios[] is just an array of pointers to the
folios of the compressed bio, meaning we can drop the array and grab
each folio directly from the compressed bio.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 20 Jan 2026 00:00:09 +0000 (10:30 +1030)]
btrfs: zlib: use folio_iter to handle zlib_decompress_bio()
Currently zlib_decompress_bio() uses the
compressed_bio->compressed_folios[] array to grab each compressed folio.
However cb->compressed_folios[] is just an array of pointers to the
folios of the compressed bio, meaning we can drop the array and grab
each folio directly from the compressed bio.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 20 Jan 2026 00:00:08 +0000 (10:30 +1030)]
btrfs: lzo: use folio_iter to handle lzo_decompress_bio()
Currently lzo_decompress_bio() uses the
compressed_bio->compressed_folios[] array to grab each compressed folio.
This makes the code easy to read, as we only need to maintain a single
iterator, @cur_in, and can easily grab any random folio using
@cur_in >> min_folio_shift as an index.
However lzo_decompress_bio() only ever advances one folio at a time,
and compressed_folios[] is just an array of pointers to the folios of
the compressed bio, thus we have no real random access requirement in
lzo_decompress_bio().
Replace the compressed_folios[] access by a helper, get_current_folio(),
which uses folio_iter and an external folio counter to properly switch
the folio when needed.
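A sketch of what such a helper can look like (hypothetical shape, built
on the generic bio_next_folio() iterator):

    static struct folio *get_current_folio(struct bio *bio,
                                           struct folio_iter *fi,
                                           u32 *cur_folio, u32 wanted)
    {
        /* Advance only when we have crossed into the next folio. */
        while (*cur_folio < wanted) {
            bio_next_folio(fi, bio);
            (*cur_folio)++;
        }
        return fi->folio;
    }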
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sun YangKai [Wed, 14 Jan 2026 03:47:03 +0000 (11:47 +0800)]
btrfs: consolidate reclaim readiness checks in btrfs_should_reclaim()
Move the filesystem state validation from btrfs_reclaim_bgs_work() into
btrfs_should_reclaim() to centralize the reclaim eligibility logic.
Since it is the only caller of btrfs_should_reclaim(), there's no
functional change.
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Sun YangKai [Wed, 14 Jan 2026 03:47:02 +0000 (11:47 +0800)]
btrfs: fix periodic reclaim condition
Problems with the current implementation:
1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
negative reclaimable_bytes to trigger reclaim unexpectedly
2. The "space must be freed between scans" assumption breaks the
two-scan requirement: first scan marks block groups, second scan
reclaims them. Without the second scan, no reclamation occurs.
Instead, track actual reclaim progress: pause reclaim while block groups
are queued to be reclaimed, and resume only when progress is made. This
ensures reclaim continues until no further progress can be made. Also
resume periodic reclaim when there's enough free space.
Since we now track whether reclaim is making any progress, it's
unnecessary to set periodic_reclaim_ready to false when we fail to
reclaim a block group.
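Problem 1 in isolation (illustrative only; trigger_reclaim() is a
stand-in name):

    s64 reclaimable_bytes = -SZ_1M; /* can go negative when over-reserved */
    u64 chunk_sz = SZ_1G;

    /*
     * The comparison converts the s64 to u64, so a negative value
     * becomes huge and unexpectedly triggers reclaim.
     */
    if (reclaimable_bytes > chunk_sz)
        trigger_reclaim();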
Fixes: 813d4c6422516 ("btrfs: prevent pathological periodic reclaim loops")
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: don't pass io_ctl to __btrfs_write_out_cache()
There's no need to pass both the block_group and block_group::io_ctl to
__btrfs_write_out_cache().
Stop passing io_ctl to __btrfs_write_out_cache() and dereference it
from the block group inside the function.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 16 Jan 2026 10:57:00 +0000 (10:57 +0000)]
btrfs: use the btrfs_extent_map_end() helper everywhere
We have a helper to calculate an extent map's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.
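I.e. conversions of the following form (sketch):

    end = em->start + em->len;      /* open coded exclusive end */
    end = btrfs_extent_map_end(em); /* becomes the helper call */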
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 16 Jan 2026 10:24:06 +0000 (10:24 +0000)]
btrfs: use the btrfs_block_group_end() helper everywhere
We have a helper to calculate a block group's exclusive end offset, but
we only use it in some places. Update every site that open codes the
calculation to use the helper.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove bogus NULL checks in __btrfs_write_out_cache()
Dan reported a new smatch warning in free-space-cache.c:
New smatch warnings:
fs/btrfs/free-space-cache.c:1207 write_pinned_extent_entries() warn: variable dereferenced before check 'block_group' (see line 1203)
But the check whether the block_group pointer is NULL is bogus, because to
get to this point block_group::io_ctl has already been dereferenced
further up the call-chain when calling __btrfs_write_out_cache() from
btrfs_write_out_cache().
Remove the bogus checks for block_group == NULL in
__btrfs_write_out_cache() and its siblings.
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/r/202601170636.WsePMV5H-lkp@intel.com/
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:17 +0000 (14:09 +0000)]
btrfs: populate fully_remapped_bgs_list on mount
Add a function btrfs_populate_fully_remapped_bgs_list(), called on
mount, which looks for fully remapped block groups
(i.e. identity_remap_count == 0) that haven't yet had their chunk
stripes and device extents removed.
This happens when a filesystem is unmounted while async discard has not
yet finished, as otherwise the data range occupied by the chunk stripes
would be permanently unusable.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:16 +0000 (14:09 +0000)]
btrfs: handle discarding fully-remapped block groups
Discard normally works by iterating over the free-space entries of a
block group. This doesn't work for fully-remapped block groups, as we
removed their free-space entries when we started relocation.
For sync discard, call btrfs_discard_extent() when we commit the
transaction in which the last identity remap was removed.
For async discard, add a new function btrfs_trim_fully_remapped_block_group()
to be called by the discard worker, which iterates over the block
group's range using the normal async discard rules. Once we reach the
end, remove the chunk's stripes and device extents to get back its free
space.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:15 +0000 (14:09 +0000)]
btrfs: allow balancing remap tree
Balancing the METADATA_REMAP chunk, i.e. the chunk in which the remap tree
lives, is a special case.
We can't use the remap tree itself for this, as then we'd have no way to
boostrap it on mount. And we can't use the pre-remap tree code for this
as it relies on walking the extent tree, and we're not creating backrefs
for METADATA_REMAP chunks.
So instead, if a balance would relocate any METADATA_REMAP block groups, mark
those block groups as readonly and COW every leaf of the remap tree.
There are more sophisticated ways of doing this, such as only COWing
nodes within a block group that's to be relocated, but they're fiddly
and have lots of edge cases. Plus it's not anticipated that
a) the number of METADATA_REMAP chunks is going to be particularly large, or
b) users will want to relocate only some of these chunks - the main
use case here is to unbreak RAID conversion and device removal.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:14 +0000 (14:09 +0000)]
btrfs: add do_remap parameter to btrfs_discard_extent()
btrfs_discard_extent() can be called either when an extent is removed
or from walking the free-space tree. With a remapped block group these
two things are no longer equivalent: the extent's addresses are
remapped, while the free-space tree exclusively uses underlying
addresses.
Add a do_remap parameter to btrfs_discard_extent() and
btrfs_map_discard(), saying whether or not the address needs to be run
through the remap tree first.
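The new parameter, sketched on the prototype (the other argument names
are assumptions based on the current helper):

    int btrfs_discard_extent(struct btrfs_fs_info *fs_info, u64 bytenr,
                             u64 num_bytes, u64 *actual_bytes,
                             bool do_remap);

Extent removal then passes true (remapped addresses), while the
free-space tree walk passes false (underlying addresses).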
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:13 +0000 (14:09 +0000)]
btrfs: replace identity remaps with actual remaps when doing relocations
Add a function do_remap_tree_reloc(), which does the actual work of
doing a relocation using the remap tree.
In a loop we call do_remap_reloc_trans(), which searches for the first
identity remap for the block group. We call btrfs_reserve_extent() to
find space elsewhere for it, and read the data into memory and write it
to the new location. We then carve out the identity remap and replace it
with an actual remap, which points to the new location in which to look.
Once the last identity remap has been removed we call
last_identity_remap_gone(), which, as with deletions, removes the
chunk's stripes and device extents.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:12 +0000 (14:09 +0000)]
btrfs: move existing remaps before relocating block group
If when relocating a block group we find that `remap_bytes` > 0 in its
block group item, that means that it has been the destination block
group for another that has been remapped.
We need to search the remap tree for any remap backrefs within this
range, and move the data to a third block group. This is because
otherwise btrfs_translate_remap() could end up following an unbounded
chain of remaps, which would only get worse over time.
We only relocate one block group at a time, so `remap_bytes` will only
ever go down while we are doing this. Once we're finished we set the
REMAPPED flag on the block group, which will permanently prevent any
other data from being moved to within it.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:11 +0000 (14:09 +0000)]
btrfs: handle setting up relocation of block group with remap-tree
Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.
If the block group is SYSTEM, btrfs_relocate_block_group() proceeds as it
does already, as bootstrapping issues mean that these block groups have
to be processed the existing way. The same goes for METADATA_REMAP block
groups, which
are dealt with in a later patch.
Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.
The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.
(Thanks to Sun YangKai for his suggestions.)
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:10 +0000 (14:09 +0000)]
btrfs: handle deletions from remapped block group
Handle the case where we free an extent from a block group that has the
REMAPPED flag set. Because the remap tree is orthogonal to the extent
tree, for data this may be within any number of identity remaps or
actual remaps. If we're freeing a metadata node, this will be wholly
inside one or the other.
btrfs_remove_extent_from_remap_tree() searches the remap tree for the
remaps that cover the range in question, then calls
remove_range_from_remap_tree() for each one, to punch a hole in the
remap and adjust the free-space tree.
For an identity remap, remove_range_from_remap_tree() will adjust the
block group's `identity_remap_count` if this changes. If it reaches
zero we mark the block group as fully remapped.
Fully remapped block groups have their chunk stripes removed and their
device extents freed, which makes the disk space available again to the
chunk allocator. This happens asynchronously: in the cleaner thread for
sync discard and nodiscard, and (in a later patch) in the discard worker
for async discard.
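A sketch of the per-range hole punch (illustrative; the helper names
and the count-delta plumbing are assumptions):

static int remove_range_from_remap_tree(struct btrfs_trans_handle *trans,
					struct btrfs_block_group *bg,
					u64 start, u64 len)
{
	int count_delta = 0;
	int ret;

	/* Punch [start, start + len) out of the covering remap item. */
	ret = punch_remap_hole(trans, start, len, &count_delta);
	if (ret < 0)
		return ret;

	/* Give the space back to the free-space tree. */
	ret = return_range_to_free_space(trans, start, len);
	if (ret < 0)
		return ret;

	/*
	 * For identity remaps the punch may split one item into two
	 * (delta +1) or remove it outright (delta -1).
	 */
	bg->identity_remap_count += count_delta;
	if (bg->identity_remap_count == 0)
		mark_block_group_fully_remapped(bg);
	return 0;
}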
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:09 +0000 (14:09 +0000)]
btrfs: redirect I/O for remapped block groups
Change btrfs_map_block() so that if the block group has the REMAPPED
flag set, we call btrfs_translate_remap() to obtain a new address.
btrfs_translate_remap() searches the remap tree for a range
corresponding to the logical address passed to btrfs_map_block(). If it
is within an identity remap, this part of the block group hasn't yet
been relocated, and so we use the existing address.
If it is within an actual remap, we subtract the start of the remap
range and add the address of its destination, contained in the item's
payload.
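Put as code, the translation is a single rebase (a minimal sketch; the
lookup helper and its result type are hypothetical):

static u64 btrfs_translate_remap(struct btrfs_fs_info *fs_info, u64 logical)
{
	struct remap_item ri;	/* hypothetical: start, dest, is_identity */

	if (lookup_remap(fs_info, logical, &ri) < 0)
		return logical;	/* no remap item; should not happen here */

	if (ri.is_identity)
		return logical;	/* not relocated yet, address stands */

	/* Offset within the remap, rebased onto its destination. */
	return ri.dest + (logical - ri.start);
}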
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:08 +0000 (14:09 +0000)]
btrfs: allow mounting filesystems with remap-tree incompat flag
If we encounter a filesystem with the remap-tree incompat flag set,
validate its compatibility with the other flags, and load the remap tree
using the values that have been added to the superblock.
The remap-tree feature depends on the free-space-tree, but no-holes and
block-group-tree have been made dependencies to reduce the testing
matrix. Similarly, I'm not aware of any reason why mixed-bg and zoned
would be incompatible with remap-tree, but these are blocked for the
time being until they can be fully tested.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:07 +0000 (14:09 +0000)]
btrfs: add extended version of struct block_group_item
Add a struct btrfs_block_group_item_v2, which is used in the block group
tree if the remap-tree incompat flag is set.
This adds two new fields to the block group item: `remap_bytes` and
`identity_remap_count`.
`remap_bytes` records the amount of data that's physically within this
block group, but nominally in another, remapped block group. This is
necessary because this data will need to be moved first if this block
group is itself relocated. If `remap_bytes` > 0, this is an indicator to
the relocation thread that it will need to search the remap-tree for
backrefs. A block group must also have `remap_bytes` == 0 before it can
be dropped.
`identity_remap_count` records how many identity remap items are located
in the remap tree for this block group. When relocation is begun for
this block group, this is set to the number of holes in the free-space
tree for this range. As identity remaps are converted into actual remaps
by the relocation process, this number is decreased. Once it reaches 0,
either because of relocation or because extents have been deleted, the
block group has been fully remapped and its chunk's device extents are
removed.
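The on-disk item could look roughly like this (the v1 fields are the
existing ones; the placement and widths of the two new fields are
assumptions):

struct btrfs_block_group_item_v2 {
	__le64 used;
	__le64 chunk_objectid;
	__le64 flags;
	__le64 remap_bytes;		/* data physically here, nominally elsewhere */
	__le32 identity_remap_count;	/* identity remaps left for this bg */
} __attribute__ ((__packed__));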
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:06 +0000 (14:09 +0000)]
btrfs: rename struct btrfs_block_group field commit_used to last_used
Rename the field commit_used in struct btrfs_block_group to last_used,
for clarity and consistency with the similar fields we're about to add.
It's not obvious that commit_used means "used as of the last commit"
rather than "usage related to a commit".
Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:05 +0000 (14:09 +0000)]
btrfs: don't add metadata items for the remap tree to the extent tree
There is the following potential problem with the remap tree and delayed refs:
* Remapped extent freed in a delayed ref, which removes an entry from the
remap tree
* Remap tree now small enough to fit in a single leaf
* Corruption as we now have a level-0 block with a level-1 metadata item
in the extent tree
One solution to this would be to rework the remap tree code so that it operates
via delayed refs. But as we're hoping to remove cow-only metadata items in the
future anyway, change things so that the remap tree doesn't have any entries in
the extent tree. This also has the benefit of reducing write amplification.
We also make the clear_cache mount option a no-op, as with extent tree
v2, since the free-space tree can no longer be recreated from the
extent tree.
Finally, disable relocating the remap tree itself; support for this is
added back in a later patch. As things stand we would get corruption,
since the traditional relocation method walks the extent tree and we're
removing the remap tree's metadata items.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:02 +0000 (14:09 +0000)]
btrfs: add METADATA_REMAP chunk type
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the
remap tree.
This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.
The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.
The sizing to 32MB per chunk, matching the SYSTEM chunk, is an
estimate; we can adjust it later if it proves to be too big or too small.
This works out to be ~500,000 remap items, which for a 4KB block size
covers ~2GB of remapped data in the worst case and ~500TB in the best case.
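A sketch of the arithmetic behind those numbers (the per-item overhead
is an assumption, not taken from the patch):

/*
 * 32 MiB chunk / ~67 bytes per remap item        ~= 500,000 items
 *     (25-byte leaf item header + an assumed ~40-byte payload)
 * worst case, one 4 KiB block per remap:
 *     500,000 * 4 KiB                            ~= 2 GiB remapped
 * best case, one whole 1 GiB block group per remap:
 *     500,000 * 1 GiB                            ~= 500 TB remapped
 */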
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 7 Jan 2026 14:09:01 +0000 (14:09 +0000)]
btrfs: add definitions and constants for remap-tree
Add an incompat flag for the new remap-tree feature, and the constants
and definitions needed to support it.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 15 Jan 2026 21:17:35 +0000 (21:17 +0000)]
btrfs: add and use helper to compute the available space for a block group
We currently have three places that compute how much available space a
block group has. Add a helper function for this and use it in those
places.
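Such a helper might be shaped like this (a sketch; exactly which
accounting terms the real helper subtracts is an assumption):

static inline u64 btrfs_block_group_avail(const struct btrfs_block_group *bg)
{
	/* Whatever is not used, reserved, pinned or unusable is available. */
	return bg->length - bg->used - bg->reserved - bg->pinned -
	       bg->zone_unusable;
}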
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 13 Jan 2026 16:42:57 +0000 (16:42 +0000)]
btrfs: tag as unlikely error handling in run_one_delayed_ref()
We don't expect to get errors unless we have a corrupted fs, bad RAM or a
bug. So tag the error handling as unlikely.
This slightly reduces the module's text size on x86_64 using gcc 14.2.0-19
from Debian.
Before this change:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1939458  172512   15592 2127562  2076ca fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1939398  172512   15592 2127502  20768e fs/btrfs/btrfs.ko
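The annotation itself is the usual pattern (a generic illustration, not
the exact call sites):

	if (unlikely(ret)) {
		btrfs_err(fs_info, "error we do not expect to hit: %d", ret);
		return ret;
	}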
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 13 Jan 2026 16:39:00 +0000 (16:39 +0000)]
btrfs: remove unnecessary else branch in run_one_delayed_ref()
There is no need for an else branch to deal with an unexpected delayed ref
type. We can just change the previous branch to deal with this by checking
if the ref type is not BTRFS_EXTENT_OWNER_REF_KEY, since that branch is
useless as it only sets 'ret' to zero when it's already zero. So merge the
two branches.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 13 Jan 2026 16:37:26 +0000 (16:37 +0000)]
btrfs: don't BUG() on unexpected delayed ref type in run_one_delayed_ref()
There is no need to BUG(); we can just return an error and log an error
message.
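The shape of the change (illustrative; the exact message and errno are
assumptions):

	default:
		/* Was: BUG(); now log and bubble up an error instead. */
		btrfs_err(trans->fs_info, "unexpected delayed ref type %d",
			  node->type);
		ret = -EUCLEAN;	/* assumed errno */
		break;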
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
jinbaohong [Wed, 14 Jan 2026 01:18:15 +0000 (01:18 +0000)]
btrfs: use READA_FORWARD_ALWAYS for device extent verification
btrfs_verify_dev_extents() iterates through the entire device tree
during mount to verify dev extents against chunks. Since this function
scans the whole tree, READA_FORWARD_ALWAYS is more appropriate than
READA_FORWARD.
While the device tree is typically small (a few hundred KB even for
multi-TB filesystems), using the correct readahead mode for full-tree
iteration is more consistent with the intended usage.
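The change itself is a one-liner on the search path (sketch):

	path->reada = READA_FORWARD_ALWAYS;	/* was: READA_FORWARD */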
Signed-off-by: robbieko <robbieko@synology.com> Signed-off-by: jinbaohong <jinbaohong@synology.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 9 Jan 2026 12:09:18 +0000 (12:09 +0000)]
btrfs: update comment for delalloc flush and oe wait in btrfs_clone_files()
Make the comment more detailed about why we need to flush delalloc and
wait for ordered extent completion before attempting to invalidate the
page cache.
Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 6 Jan 2026 12:30:28 +0000 (13:30 +0100)]
btrfs: split btrfs_fs_closing() and change return type to bool
There are two tests in btrfs_fs_closing(), but checking the
BTRFS_FS_CLOSING_DONE bit is done in only one place,
load_extent_tree_free(). As this is an inline function we can reduce
the size of the generated code. The return type can also be changed to
bool as this becomes a simple condition.
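The split could look roughly like this (the CLOSING_START/CLOSING_DONE
bits exist today; the second helper's name is a guess):

static inline bool btrfs_fs_closing(struct btrfs_fs_info *fs_info)
{
	return test_bit(BTRFS_FS_CLOSING_START, &fs_info->flags);
}

/* Only load_extent_tree_free() needs the stronger test. */
static inline bool btrfs_fs_closing_done(struct btrfs_fs_info *fs_info)
{
	return test_bit(BTRFS_FS_CLOSING_DONE, &fs_info->flags);
}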
   text    data     bss     dec     hex filename
1674006  146704   15560 1836270  1c04ee pre/btrfs.ko
1673772  146704   15560 1836036  1c0404 post/btrfs.ko
DELTA: -234
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 6 Jan 2026 02:50:30 +0000 (13:20 +1030)]
btrfs: reject single block sized compression early
Currently, for an inode that needs compression, even if a delalloc
range is only a single fs block in size and cannot be inlined, we will
still go through the compression path.
Then inside compress_file_range() we have one extra check to reject
such single-block ranges and fall back to a regular uncompressed write.
This rejection is in fact a little too late: we have already allocated
memory for async_chunk and delayed the submission, just to fall back to
the same uncompressed write.
Change the behavior to reject such cases earlier, in
inode_need_compress(), so for a single-block range we won't even bother
going through the compression path.
And since the inline and single-block checks are now inside
inode_need_compress(), which compress_file_range() also calls, we no
longer need a dedicated check inside compress_file_range().
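The early rejection boils down to a size check in inode_need_compress()
(a sketch; the surrounding logic is elided):

	/*
	 * A single-block range that could not be inlined gains nothing
	 * from compression, as the result still occupies one block;
	 * write it uncompressed.
	 */
	if (end - start + 1 <= fs_info->sectorsize)
		return false;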
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Julia Lawall [Tue, 30 Dec 2025 16:32:45 +0000 (17:32 +0100)]
btrfs: update outdated comment in __add_block_group_free_space()
The function add_block_group_free_space() was renamed
btrfs_add_block_group_free_space() by commit 6fc5ef782988 ("btrfs:
add btrfs prefix to free space tree exported functions"). Update
the comment accordingly.
Do some reorganization of the next few lines to keep the comment
within 80 characters.
Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
# btrfs ins dump-tree -t free-space-tree $dev
btrfs-progs v6.16
free space tree key (FREE_SPACE_TREE ROOT_ITEM 0)
leaf 30556160 items 13 free space 15918 generation 8 owner FREE_SPACE_TREE
leaf 30556160 flags 0x1(WRITTEN) backref revision 1
item 0 key (1048576 FREE_SPACE_INFO 4194304) itemoff 16275 itemsize 8
free space info extent count 1 flags 0
item 1 key (1048576 FREE_SPACE_EXTENT 4194304) itemoff 16275 itemsize 0
free space extent
item 2 key (5242880 FREE_SPACE_INFO 8388608) itemoff 16267 itemsize 8
free space info extent count 1 flags 0
item 3 key (5242880 FREE_SPACE_EXTENT 8388608) itemoff 16267 itemsize 0
free space extent
^^^ Above 4 items are all before the first chunk.
item 4 key (13631488 FREE_SPACE_INFO 8388608) itemoff 16259 itemsize 8
free space info extent count 1 flags 0
item 5 key (13631488 FREE_SPACE_EXTENT 8388608) itemoff 16259 itemsize 0
free space extent
...
This can trigger btrfs check errors.
[CAUSE]
It's a bug in the free space tree implementation of btrfs-progs, which
doesn't delete the involved free-space-tree entries for the
to-be-deleted chunk/block group.
[ENHANCEMENT]
The most common fix is to clear the space cache and rebuild it, but
that requires a ro->rw remount, which may not be possible for a rootfs,
and also relies on users passing the "clear_cache" mount option
manually.
Here introduce a kernel fix which, at the first RW mount, automatically
deletes any entries that are before the first block group.
For filesystems without such a problem, the cost is just a single tree
search with no modification to the free space tree, so the overhead
should be minimal.
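The first-RW-mount check might be shaped like this (illustrative;
helper names are hypothetical):

	struct btrfs_key key;
	int ret;

	/* One search: the very first key in the free-space tree. */
	ret = lookup_first_free_space_key(fs_info, &key);
	if (ret < 0)
		return ret;

	if (key.objectid >= first_block_group_start)
		return 0;	/* common case: nothing to fix */

	/* Stale pre-chunk entries: delete everything below the first bg. */
	return delete_free_space_keys_below(trans, first_block_group_start);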
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Zhen Ni [Fri, 19 Dec 2025 07:36:49 +0000 (15:36 +0800)]
btrfs: simplify check for zoned NODATASUM writes in btrfs_submit_chunk()
This function already dereferences 'inode' multiple times earlier,
making the additional NULL check at line 840 redundant since the
function would have crashed already if inode were NULL.
After commit 81cea6cd7041 ("btrfs: remove btrfs_bio::fs_info by
extracting it from btrfs_bio::inode"), the btrfs_bio::inode field is
mandatory for all btrfs_bio allocations and is guaranteed to be
non-NULL.
Simplify the condition for allocating dummy checksums for zoned
NODATASUM data by removing the unnecessary 'inode &&' check.
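The simplification is just dropping the first operand (a diff-shaped
sketch; `needs_dummy_csums` stands in for the rest of the real
condition):

-	if (inode && btrfs_is_zoned(fs_info) && needs_dummy_csums)
+	if (btrfs_is_zoned(fs_info) && needs_dummy_csums)
		/* ... allocate dummy checksums for zoned NODATASUM data ... */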
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 17 Dec 2025 15:53:59 +0000 (15:53 +0000)]
btrfs: avoid transaction commit on error in insert_balance_item()
There's no point in committing the transaction if we failed to insert the
balance item, since we haven't done anything else after we started/joined
the transaction. Also stop using two variables for tracking the return
value and use only 'ret'.
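The error path then becomes (a sketch; the insertion step is
abbreviated behind a hypothetical wrapper):

	ret = insert_the_balance_item(trans);	/* abbreviated */
	if (ret) {
		/* Nothing else happened in this transaction: don't commit. */
		btrfs_end_transaction(trans);
		return ret;
	}
	return btrfs_commit_transaction(trans);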
Reviewed-by: Daniel Vacek <neelx@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>