git.ipfire.org Git - thirdparty/kernel/linux.git/log

Qu Wenruo [Tue, 25 Nov 2025 08:19:56 +0000 (18:49 +1030)]

btrfs: fix a potential path leak in print_data_reloc_error()

Inside print_data_reloc_error(), if extent_from_logical() failed we
return immediately.

However, in the following cases extent_from_logical() can return an
error while still holding a path:

- btrfs_search_slot() returned 0

- No backref item found in extent tree

- No flags_ret provided
  This is not possible at this call site though.

So in the first two cases we can return without releasing the path,
causing extent buffer leaks.
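
A minimal userspace sketch of the leak pattern and its fix. All names
(demo_path, demo_lookup(), demo_report()) are hypothetical stand-ins and
not the btrfs code; the point is only that the error return must release
the path too:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_path: it only counts held buffers. */
struct demo_path {
	int held_buffers;
};

static void demo_release_path(struct demo_path *path)
{
	assert(path != NULL);
	path->held_buffers = 0;
}

/*
 * Hypothetical lookup that models the failure mode: the search succeeds and
 * pins a buffer in the path, then the function fails on a later check.
 */
static int demo_lookup(struct demo_path *path, int has_backref)
{
	path->held_buffers = 1;		/* search returned 0, buffer is held */
	if (!has_backref)
		return -ENOENT;		/* error returned with the path held */
	return 0;
}

/* Release the path on the error return as well as on success. */
static int demo_report(struct demo_path *path, int has_backref)
{
	int ret = demo_lookup(path, has_backref);

	if (ret < 0) {
		demo_release_path(path);
		return ret;
	}
	/* ... use the path ... */
	demo_release_path(path);
	return 0;
}
```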

Fixes: b9a9a85059cd ("btrfs: output affected files when relocation fails")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Mon, 8 Dec 2025 09:25:48 +0000 (19:55 +1030)]

Revert "btrfs: add ASSERTs on prealloc in qgroup functions"

This reverts commit 252877a8701530fde861a4f27710c1e718e97caa.

Commit 252877a87015 ("btrfs: add ASSERTs on prealloc in qgroup
functions") tries to remove the kfree() of the preallocated qgroup at
several call sites, but this cannot work as intended:

- btrfs_quota_enable()
- btrfs_create_qgroup()
  If add_qgroup_item() failed, we go to out_free_path and at that time
  prealloc is not yet utilized and will trigger the new ASSERT().

- btrfs_qgroup_inherit()
  If qgroup_auto_inherit() failed, prealloc is not yet utilized and
  will trigger the new ASSERT()
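
A minimal userspace sketch of why the unconditional free has to stay.
The names (demo_qgroup, demo_slot, demo_create()) are hypothetical and
only model the ownership rule: prealloc is set to NULL once consumed, so
freeing it at the exit label is correct on both paths, while asserting
that it is NULL would fire on the early failure path:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct demo_qgroup {
	int id;
};

/* Hypothetical registry slot that takes ownership of a preallocated object. */
static struct demo_qgroup *demo_slot;

static int demo_create(int id, int fail_early)
{
	struct demo_qgroup *prealloc = calloc(1, sizeof(*prealloc));
	int ret = 0;

	if (!prealloc)
		return -ENOMEM;

	if (fail_early) {
		ret = -EIO;	/* failure before prealloc is consumed */
		goto out;
	}

	prealloc->id = id;
	demo_slot = prealloc;	/* ownership moves to the registry */
	prealloc = NULL;
out:
	/* free(NULL) is a no-op, so this handles both paths. */
	free(prealloc);
	return ret;
}
```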

Reported-by: syzbot+b44d4a4885bc82af2a06@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/69369331.a70a0220.38f243.009e.GAE@google.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 3 Dec 2025 17:02:00 +0000 (17:02 +0000)]

btrfs: do not skip logging new dentries when logging a new name

When we are logging a directory and the log context indicates that we
are logging a new name for some other file (that is or was inside that
directory), we skip logging the inodes for new dentries in the directory.

This is ok most of the time, but if after the rename or link operation
that triggered the logging of that directory, we have an explicit fsync
of that directory without the directory inode being evicted and reloaded,
we end up never logging the inodes for the new dentries that we found
during the new name logging, as the next directory fsync will only process
dentries that were added after the last time we logged the directory (we
are doing an incremental directory logging).

So make sure we always log new dentries for a directory even if we are
in a context of logging a new name.

We started skipping logging inodes for new dentries as of commit
c48792c6ee7a ("btrfs: do not log new dentries when logging that a new name
exists") and it was fine back then, because when logging a directory we
always iterated over all the directory entries (for leaves changed in the
current transaction) so a subsequent fsync would always log anything that
was previously skipped while logging a directory when logging a new name
(with btrfs_log_new_name()). But later support for incrementally logging
a directory was added in commit dc2872247ec0 ("btrfs: keep track of the
last logged keys when logging a directory"), to avoid checking all dir
items every time we log a directory, so the check to skip dentry logging
added in the first commit should have been removed when the incremental
support for logging a directory was added.

A test case for fstests will follow soon.

Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/84c4e713-85d6-42b9-8dcf-0722ed26cb05@gmail.com/
Fixes: dc2872247ec0 ("btrfs: keep track of the last logged keys when logging a directory")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Thu, 27 Nov 2025 16:35:59 +0000 (16:35 +0000)]

btrfs: don't log conflicting inode if it's a dir moved in the current transaction

We can't log a conflicting inode if it's a directory and it was moved
from one parent directory to another parent directory in the current
transaction, as this can result in an attempt to have a directory with
two hard links during log replay, one for the old parent directory and
another for the new parent directory.

The following scenario triggers that issue:

1) We have directories "dir1" and "dir2" created in a past transaction.
   Directory "dir1" has inode A as its parent directory;

2) We move "dir1" to some other directory;

3) We create a file with the name "dir1" in directory inode A;

4) We fsync the new file. This results in logging the inode of the new file
   and the inode for the directory "dir1" that was previously moved in the
   current transaction. So the log tree has the INODE_REF item for the
   new location of "dir1";

5) We move the new file to some other directory. This results in updating
   the log tree to include the new INODE_REF for the new location of the
   file and removes the INODE_REF for the old location. This happens
   during the rename when we call btrfs_log_new_name();

6) We fsync the file, and that persists the log tree changes done in the
   previous step (btrfs_log_new_name() only updates the log tree in
   memory);

7) We have a power failure;

8) Next time the fs is mounted, log replay happens and when processing
   the inode for directory "dir1" we find a new INODE_REF and add that
   link, but we don't remove the old link of the inode since we have
   not logged the old parent directory of the directory inode "dir1".

As a result, after log replay finishes, when we trigger writeback of the
subvolume tree's extent buffers, the tree check will detect that we have
a directory with a hard link count of 2 and we get a mount failure.
The errors and stack traces reported in dmesg/syslog are like this:

   [ 3845.729764] BTRFS info (device dm-0): start tree-log replay
   [ 3845.730304] page: refcount:3 mapcount:0 mapping:000000005c8a3027 index:0x1d00 pfn:0x11510c
   [ 3845.731236] memcg:ffff9264c02f4e00
   [ 3845.731751] aops:btree_aops [btrfs] ino:1
   [ 3845.732300] flags: 0x17fffc00000400a(uptodate|private|writeback|node=0|zone=2|lastcpupid=0x1ffff)
   [ 3845.733346] raw: 017fffc00000400a 0000000000000000 dead000000000122 ffff9264d978aea8
   [ 3845.734265] raw: 0000000000001d00 ffff92650e6d4738 00000003ffffffff ffff9264c02f4e00
   [ 3845.735305] page dumped because: eb page dump
   [ 3845.735981] BTRFS critical (device dm-0): corrupt leaf: root=5 block=30408704 slot=6 ino=257, invalid nlink: has 2 expect no more than 1 for dir
   [ 3845.737786] BTRFS info (device dm-0): leaf 30408704 gen 10 total ptrs 17 free space 14881 owner 5
   [ 3845.737789] BTRFS info (device dm-0): refs 4 lock_owner 0 current 30701
   [ 3845.737792] item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
   [ 3845.737794] inode generation 3 transid 9 size 16 nbytes 16384
   [ 3845.737795] block group 0 mode 40755 links 1 uid 0 gid 0
   [ 3845.737797] rdev 0 sequence 2 flags 0x0
   [ 3845.737798] atime 1764259517.0
   [ 3845.737800] ctime 1764259517.572889464
   [ 3845.737801] mtime 1764259517.572889464
   [ 3845.737802] otime 1764259517.0
   [ 3845.737803] item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
   [ 3845.737805] index 0 name_len 2
   [ 3845.737807] item 2 key (256 DIR_ITEM 2363071922) itemoff 16077 itemsize 34
   [ 3845.737808] location key (257 1 0) type 2
   [ 3845.737810] transid 9 data_len 0 name_len 4
   [ 3845.737811] item 3 key (256 DIR_ITEM 2676584006) itemoff 16043 itemsize 34
   [ 3845.737813] location key (258 1 0) type 2
   [ 3845.737814] transid 9 data_len 0 name_len 4
   [ 3845.737815] item 4 key (256 DIR_INDEX 2) itemoff 16009 itemsize 34
   [ 3845.737816] location key (257 1 0) type 2
   [ 3845.737818] transid 9 data_len 0 name_len 4
   [ 3845.737819] item 5 key (256 DIR_INDEX 3) itemoff 15975 itemsize 34
   [ 3845.737820] location key (258 1 0) type 2
   [ 3845.737821] transid 9 data_len 0 name_len 4
   [ 3845.737822] item 6 key (257 INODE_ITEM 0) itemoff 15815 itemsize 160
   [ 3845.737824] inode generation 9 transid 10 size 6 nbytes 0
   [ 3845.737825] block group 0 mode 40755 links 2 uid 0 gid 0
   [ 3845.737826] rdev 0 sequence 1 flags 0x0
   [ 3845.737827] atime 1764259517.572889464
   [ 3845.737828] ctime 1764259517.572889464
   [ 3845.737830] mtime 1764259517.572889464
   [ 3845.737831] otime 1764259517.572889464
   [ 3845.737832] item 7 key (257 INODE_REF 256) itemoff 15801 itemsize 14
   [ 3845.737833] index 2 name_len 4
   [ 3845.737834] item 8 key (257 INODE_REF 258) itemoff 15787 itemsize 14
   [ 3845.737836] index 2 name_len 4
   [ 3845.737837] item 9 key (257 DIR_ITEM 2507850652) itemoff 15754 itemsize 33
   [ 3845.737838] location key (259 1 0) type 1
   [ 3845.737839] transid 10 data_len 0 name_len 3
   [ 3845.737840] item 10 key (257 DIR_INDEX 2) itemoff 15721 itemsize 33
   [ 3845.737842] location key (259 1 0) type 1
   [ 3845.737843] transid 10 data_len 0 name_len 3
   [ 3845.737844] item 11 key (258 INODE_ITEM 0) itemoff 15561 itemsize 160
   [ 3845.737846] inode generation 9 transid 10 size 8 nbytes 0
   [ 3845.737847] block group 0 mode 40755 links 1 uid 0 gid 0
   [ 3845.737848] rdev 0 sequence 1 flags 0x0
   [ 3845.737849] atime 1764259517.572889464
   [ 3845.737850] ctime 1764259517.572889464
   [ 3845.737851] mtime 1764259517.572889464
   [ 3845.737852] otime 1764259517.572889464
   [ 3845.737853] item 12 key (258 INODE_REF 256) itemoff 15547 itemsize 14
   [ 3845.737855] index 3 name_len 4
   [ 3845.737856] item 13 key (258 DIR_ITEM 1843588421) itemoff 15513 itemsize 34
   [ 3845.737857] location key (257 1 0) type 2
   [ 3845.737858] transid 10 data_len 0 name_len 4
   [ 3845.737860] item 14 key (258 DIR_INDEX 2) itemoff 15479 itemsize 34
   [ 3845.737861] location key (257 1 0) type 2
   [ 3845.737862] transid 10 data_len 0 name_len 4
   [ 3845.737863] item 15 key (259 INODE_ITEM 0) itemoff 15319 itemsize 160
   [ 3845.737865] inode generation 10 transid 10 size 0 nbytes 0
   [ 3845.737866] block group 0 mode 100600 links 1 uid 0 gid 0
   [ 3845.737867] rdev 0 sequence 2 flags 0x0
   [ 3845.737868] atime 1764259517.580874966
   [ 3845.737869] ctime 1764259517.586121869
   [ 3845.737870] mtime 1764259517.580874966
   [ 3845.737872] otime 1764259517.580874966
   [ 3845.737873] item 16 key (259 INODE_REF 257) itemoff 15306 itemsize 13
   [ 3845.737874] index 2 name_len 3
   [ 3845.737875] BTRFS error (device dm-0): block=30408704 write time tree block corruption detected
   [ 3845.739448] ------------[ cut here ]------------
   [ 3845.740092] WARNING: CPU: 5 PID: 30701 at fs/btrfs/disk-io.c:335 btree_csum_one_bio+0x25a/0x270 [btrfs]
   [ 3845.741439] Modules linked in: btrfs dm_flakey crc32c_cryptoapi (...)
   [ 3845.750626] CPU: 5 UID: 0 PID: 30701 Comm: mount Tainted: G        W           6.18.0-rc6-btrfs-next-218+ #1 PREEMPT(full)
   [ 3845.752414] Tainted: [W]=WARN
   [ 3845.752828] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
   [ 3845.754499] RIP: 0010:btree_csum_one_bio+0x25a/0x270 [btrfs]
   [ 3845.755460] Code: 31 f6 48 89 (...)
   [ 3845.758685] RSP: 0018:ffffa8d9c5677678 EFLAGS: 00010246
   [ 3845.759450] RAX: 0000000000000000 RBX: ffff92650e6d4738 RCX: 0000000000000000
   [ 3845.760309] RDX: 0000000000000000 RSI: ffffffff9aab45b9 RDI: ffff9264c4748000
   [ 3845.761239] RBP: ffff9264d4324000 R08: 0000000000000000 R09: ffffa8d9c5677468
   [ 3845.762607] R10: ffff926bdc1fffa8 R11: 0000000000000003 R12: ffffa8d9c5677680
   [ 3845.764099] R13: 0000000000004000 R14: ffff9264dd624000 R15: ffff9264d978aba8
   [ 3845.765094] FS:  00007f751fa5a840(0000) GS:ffff926c42a82000(0000) knlGS:0000000000000000
   [ 3845.766226] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   [ 3845.766970] CR2: 0000558df1815380 CR3: 000000010ed88003 CR4: 0000000000370ef0
   [ 3845.768009] Call Trace:
   [ 3845.768392]  <TASK>
   [ 3845.768714]  btrfs_submit_bbio+0x6ee/0x7f0 [btrfs]
   [ 3845.769640]  ? write_one_eb+0x28e/0x340 [btrfs]
   [ 3845.770588]  btree_write_cache_pages+0x2f0/0x550 [btrfs]
   [ 3845.771286]  ? alloc_extent_state+0x19/0x100 [btrfs]
   [ 3845.771967]  ? merge_next_state+0x1a/0x90 [btrfs]
   [ 3845.772586]  ? set_extent_bit+0x233/0x8b0 [btrfs]
   [ 3845.773198]  ? xas_load+0x9/0xc0
   [ 3845.773589]  ? xas_find+0x14d/0x1a0
   [ 3845.773969]  do_writepages+0xc6/0x160
   [ 3845.774367]  filemap_fdatawrite_wbc+0x48/0x60
   [ 3845.775003]  __filemap_fdatawrite_range+0x5b/0x80
   [ 3845.775902]  btrfs_write_marked_extents+0x61/0x170 [btrfs]
   [ 3845.776707]  btrfs_write_and_wait_transaction+0x4e/0xc0 [btrfs]
   [ 3845.777379]  ? _raw_spin_unlock_irqrestore+0x23/0x40
   [ 3845.777923]  btrfs_commit_transaction+0x5ea/0xd20 [btrfs]
   [ 3845.778551]  ? _raw_spin_unlock+0x15/0x30
   [ 3845.778986]  ? release_extent_buffer+0x34/0x160 [btrfs]
   [ 3845.779659]  btrfs_recover_log_trees+0x7a3/0x7c0 [btrfs]
   [ 3845.780416]  ? __pfx_replay_one_buffer+0x10/0x10 [btrfs]
   [ 3845.781499]  open_ctree+0x10bb/0x15f0 [btrfs]
   [ 3845.782194]  btrfs_get_tree.cold+0xb/0x16c [btrfs]
   [ 3845.782764]  ? fscontext_read+0x15c/0x180
   [ 3845.783202]  ? rw_verify_area+0x50/0x180
   [ 3845.783667]  vfs_get_tree+0x25/0xd0
   [ 3845.784047]  vfs_cmd_create+0x59/0xe0
   [ 3845.784458]  __do_sys_fsconfig+0x4f6/0x6b0
   [ 3845.784914]  do_syscall_64+0x50/0x1220
   [ 3845.785340]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
   [ 3845.785980] RIP: 0033:0x7f751fc7f4aa
   [ 3845.786759] Code: 73 01 c3 48 (...)
   [ 3845.789951] RSP: 002b:00007ffcdba45dc8 EFLAGS: 00000246 ORIG_RAX: 00000000000001af
   [ 3845.791402] RAX: ffffffffffffffda RBX: 000055ccc8291c20 RCX: 00007f751fc7f4aa
   [ 3845.792688] RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000003
   [ 3845.794308] RBP: 000055ccc8292120 R08: 0000000000000000 R09: 0000000000000000
   [ 3845.795829] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   [ 3845.797183] R13: 00007f751fe11580 R14: 00007f751fe1326c R15: 00007f751fdf8a23
   [ 3845.798633]  </TASK>
   [ 3845.799067] ---[ end trace 0000000000000000 ]---
   [ 3845.800215] BTRFS: error (device dm-0) in btrfs_commit_transaction:2553: errno=-5 IO failure (Error while writing out transaction)
   [ 3845.801860] BTRFS warning (device dm-0 state E): Skipping commit of aborted transaction.
   [ 3845.802815] BTRFS error (device dm-0 state EA): Transaction aborted (error -5)
   [ 3845.803728] BTRFS: error (device dm-0 state EA) in cleanup_transaction:2036: errno=-5 IO failure
   [ 3845.805374] BTRFS: error (device dm-0 state EA) in btrfs_replay_log:2083: errno=-5 IO failure (Failed to recover log tree)
   [ 3845.807919] BTRFS error (device dm-0 state EA): open_ctree failed: -5

Fix this by never logging a conflicting inode that is a directory and was
moved in the current transaction (its last_unlink_trans equals the current
transaction) and instead fall back to a transaction commit.

A test case for fstests will follow soon.

Reported-by: Vyacheslav Kovalevsky <slva.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/7bbc9419-5c56-450a-b5a0-efeae7457113@gmail.com/
CC: stable@vger.kernel.org # 6.1+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Dan Carpenter [Thu, 27 Nov 2025 07:14:24 +0000 (10:14 +0300)]

btrfs: tests: fix double btrfs_path free in remove_extent_ref()

We converted this code to use auto free cleanup.h magic but one
remaining free was accidentally left behind which leads to a double free
bug.
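
A minimal userspace sketch of scope-based cleanup, assuming the
GCC/Clang cleanup attribute that cleanup.h builds on. DEMO_AUTO_FREE and
the other names are hypothetical analogues, not the kernel's
BTRFS_PATH_AUTO_FREE macro. An explicit free(path) before the return
would free the same pointer a second time when the handler runs:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int demo_free_calls;

/* Cleanup handler: receives a pointer to the annotated variable. */
static void demo_free_path(void **p)
{
	if (*p) {
		free(*p);
		demo_free_calls++;
	}
}

/* Hypothetical analogue of an auto-free declaration macro. */
#define DEMO_AUTO_FREE(name) \
	void *name __attribute__((cleanup(demo_free_path))) = NULL

static int demo_remove_ref(void)
{
	DEMO_AUTO_FREE(path);

	path = malloc(16);
	if (!path)
		return -ENOMEM;

	/* ... work with path ... */

	/* No explicit free(path): the handler runs when path leaves scope. */
	return 0;
}
```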

Fixes: a320476ca8a3 ("btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Fri, 21 Nov 2025 16:56:46 +0000 (16:56 +0000)]

btrfs: remove unnecessary inode key in btrfs_log_all_parents()

We are setting up an inode key to look up the parent directory inode but
all we need is the inode's objectid. The use of the key was necessary in
the past but since commit 0202e83fdab0 ("btrfs: simplify iget helpers") we
only need the objectid.

So remove the key variable on the stack and use instead a simple u64 for
the inode's objectid.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Fri, 21 Nov 2025 15:56:14 +0000 (15:56 +0000)]

btrfs: remove redundant zero/NULL initializations in btrfs_alloc_root()

We have allocated the root with kzalloc() so all the memory is already
zero initialized, therefore it's redundant to assign 0 and NULL to several
of the root members. Remove all of them except the atomic initializations
since atomic_t is an opaque type and it's not a good practice to assume
its internals.
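
A minimal userspace sketch of the same idea, with calloc() standing in
for kzalloc() and a C11 atomic standing in for atomic_t. The struct and
its fields are hypothetical. It relies on all-zero bytes representing 0
and NULL, which the kernel also assumes:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, reduced root structure. */
struct demo_root {
	void *node;
	unsigned long long last_trans;
	atomic_int refs;	/* treated as opaque: initialized explicitly */
};

static struct demo_root *demo_alloc_root(void)
{
	/* calloc() returns zeroed memory, like kzalloc(). */
	struct demo_root *root = calloc(1, sizeof(*root));

	if (!root)
		return NULL;

	/* No "root->node = NULL" or "root->last_trans = 0" needed here. */
	atomic_init(&root->refs, 0);
	return root;
}
```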

This slightly reduces the binary size.
With gcc 14.2.0-19 from Debian on x86_64, before this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1939404 162963   15592 2117959 205147 fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1939212 162963   15592 2117767 205087 fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

David Sterba [Tue, 18 Nov 2025 16:06:46 +0000 (17:06 +0100)]

btrfs: remaining BTRFS_PATH_AUTO_FREE conversions

Convert the remaining btrfs_path users to the auto cleaning, this seems
to be the last batch. Most of the conversions are trivial, only adding the
declaration and removing the freeing, or changing the goto patterns to
return.

There are some functions with many changes, like __btrfs_free_extent(),
btrfs_remove_from_free_space_tree() or btrfs_add_to_free_space_tree()
but it still follows the same pattern.

Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 19 Nov 2025 17:59:52 +0000 (17:59 +0000)]

btrfs: send: do not allocate memory for xattr data when checking it exists

When checking if xattrs were deleted we don't care about their data, but
we are allocating memory for the data and copying it, which only wastes
time and can result in an unnecessary error in case the allocation fails.
So stop allocating memory and copying data by making find_xattr() and
__find_xattr() skip those steps if the given data buffer is NULL.
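
A minimal userspace sketch of an existence check that skips the copy
when no data buffer is requested. The table and demo_find_xattr() are
hypothetical and much simpler than the send code:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct demo_xattr {
	const char *name;
	const char *value;
};

static const struct demo_xattr demo_xattrs[] = {
	{ "user.a", "1" },
	{ "user.b", "22" },
};

/*
 * Returns 0 if the xattr exists, -ENODATA if it does not and -ENOMEM on
 * allocation failure. When @data is NULL only existence is checked, so
 * nothing is allocated or copied and -ENOMEM cannot happen.
 */
static int demo_find_xattr(const char *name, char **data)
{
	size_t count = sizeof(demo_xattrs) / sizeof(demo_xattrs[0]);

	for (size_t i = 0; i < count; i++) {
		if (strcmp(demo_xattrs[i].name, name) != 0)
			continue;
		if (data) {
			size_t len = strlen(demo_xattrs[i].value) + 1;

			*data = malloc(len);
			if (!*data)
				return -ENOMEM;
			memcpy(*data, demo_xattrs[i].value, len);
		}
		return 0;
	}
	return -ENODATA;
}
```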

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 19 Nov 2025 16:43:11 +0000 (16:43 +0000)]

btrfs: send: add unlikely to all unexpected overflow checks

There are several checks for unexpected overflows of buffers and path
lengths that make us fail the send operation with an error if for some
highly unexpected reason they happen. So add the unlikely tag to those
checks to hint the compiler to generate better code, while also making
it more explicit in the source that it's highly unexpected.
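
A minimal userspace sketch of such a check. demo_unlikely() mirrors the
usual __builtin_expect() definition of the kernel's unlikely(); the
limit and the function name are hypothetical:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace copy of the usual unlikely() definition (GCC/Clang builtin). */
#define demo_unlikely(x) __builtin_expect(!!(x), 0)

#define DEMO_PATH_MAX 4096

/* Returns the new length, or -EOVERFLOW if the buffer would overflow. */
static long demo_append_len(size_t cur_len, size_t add_len)
{
	if (demo_unlikely(cur_len > DEMO_PATH_MAX ||
			  add_len > DEMO_PATH_MAX - cur_len))
		return -EOVERFLOW;
	return (long)(cur_len + add_len);
}
```

The hint does not change the result of the check, only how the compiler
lays out the two branches.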

With gcc 14.2.0-19 from Debian on x86_64, I also got a small reduction
in the text size of the btrfs module.

Before:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1936917 162723   15592 2115232 2046a0 fs/btrfs/btrfs.ko

After:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1936789 162723   15592 2115104 204620 fs/btrfs/btrfs.ko

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 19 Nov 2025 13:06:55 +0000 (13:06 +0000)]

btrfs: reduce arguments to btrfs_del_inode_ref_in_log()

Instead of passing a root and the objectid of the parent directory, just
pass the directory inode, since from it we can extract both the root and
the objectid, reducing the number of arguments by one. It also makes the
function more consistent with other log tree functions in the sense that
we pass the inode and not only its objectid.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 19 Nov 2025 13:01:20 +0000 (13:01 +0000)]

btrfs: remove root argument from btrfs_del_dir_entries_in_log()

There's no need to pass the root as we can extract it from the directory
inode, so remove it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 19 Nov 2025 12:35:10 +0000 (12:35 +0000)]

btrfs: use test_and_set_bit() in btrfs_delayed_delete_inode_ref()

Instead of separately testing and then setting the
BTRFS_DELAYED_NODE_DEL_IREF bit in the delayed node's flags, use
test_and_set_bit(), which makes the code shorter without compromising
readability and gets rid of the label and goto.
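
A minimal userspace sketch of the pattern, using C11 atomic_fetch_or()
as a stand-in for the kernel's test_and_set_bit(). The bit number,
struct and function names are hypothetical:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEMO_NODE_DEL_IREF 3	/* hypothetical bit number */

struct demo_node {
	atomic_ulong flags;
	int reserved;
};

/* Sets bit @nr and returns whether it was already set. */
static bool demo_test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
	unsigned long mask = 1UL << nr;

	return (atomic_fetch_or(addr, mask) & mask) != 0;
}

static int demo_delete_inode_ref(struct demo_node *node)
{
	/* Only the caller that flips the bit does the one-time work. */
	if (!demo_test_and_set_bit(DEMO_NODE_DEL_IREF, &node->flags))
		node->reserved++;
	return 0;
}
```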

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Josef Bacik [Tue, 18 Nov 2025 16:08:43 +0000 (17:08 +0100)]

btrfs: don't search back for dir inode item in INO_LOOKUP_USER

We don't need to search back to the inode item: the directory inode
number is in key.offset, so simply use that. If we can't find the
directory we'll get an ENOENT at the iget().
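
A minimal sketch of the key layout this relies on: for an INODE_REF key
the objectid is the child inode and the offset is the parent directory
inode. The struct and helper are hypothetical, reduced stand-ins for
struct btrfs_key and the ioctl code:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_INODE_REF_KEY 12	/* same value as BTRFS_INODE_REF_KEY */

/* Reduced stand-in for struct btrfs_key. */
struct demo_key {
	uint64_t objectid;	/* inode number of the child */
	uint8_t type;
	uint64_t offset;	/* INODE_REF: inode number of the parent dir */
};

/* The parent directory can be read straight from the ref key. */
static uint64_t demo_parent_dir(const struct demo_key *ref_key)
{
	assert(ref_key->type == DEMO_INODE_REF_KEY);
	return ref_key->offset;
}
```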

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>

Josef Bacik [Tue, 18 Nov 2025 16:08:41 +0000 (17:08 +0100)]

btrfs: don't rewrite ret from inode_permission

In our user-safe ino resolve ioctl we turn any error returned by
inode_permission() into -EACCES. This is redundant, and could be wrong
if we got an ENOMEM or some other error from the security layer, so
simply return the actual return value.
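
A minimal userspace sketch of the difference. demo_permission() is a
hypothetical stand-in for inode_permission(), driven by a mode value
purely for illustration:

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical permission check: 0 = allowed, 1 = denied, other = no memory. */
static int demo_permission(int mode)
{
	if (mode == 0)
		return 0;
	if (mode == 1)
		return -EACCES;
	return -ENOMEM;
}

/* Old behaviour: every failure is reported as -EACCES. */
static int demo_lookup_rewrite(int mode)
{
	if (demo_permission(mode))
		return -EACCES;
	return 0;
}

/* New behaviour: the caller sees the real error code. */
static int demo_lookup_passthrough(int mode)
{
	return demo_permission(mode);
}
```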

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Fixes: 23d0b79dfaed ("btrfs: Add unprivileged version of ino_lookup ioctl")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>

Josef Bacik [Tue, 18 Nov 2025 16:08:40 +0000 (17:08 +0100)]

btrfs: add orig_logical to btrfs_bio for encryption

When checksumming the encrypted bio on writes we need to know which
logical address this checksum is for. At the point where we get the
encrypted bio the bi_sector is the physical location on the target disk,
so we need to save the original logical offset in the btrfs_bio. Then
we can use this when checksumming the bio instead of the
bio->bi_iter.bi_sector.

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>

Sweet Tea Dorminy [Tue, 18 Nov 2025 16:08:39 +0000 (17:08 +0100)]

btrfs: disable verity on encrypted inodes

Right now there isn't a way to encrypt things that aren't either
filenames in directories or data on blocks on disk with extent
encryption, so for now, disable verity usage with encryption on btrfs.

fscrypt with fsverity should be possible and it can be implemented
in the future.

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>

Omar Sandoval [Tue, 18 Nov 2025 16:08:38 +0000 (17:08 +0100)]

btrfs: disable various operations on encrypted inodes

Initially, only normal data extents will be encrypted. This change
restricts various other operations:

- allow reflinking only if both inodes have the same encryption status
- disable inline data on encrypted inodes

Note: The patch was taken from v5 of fscrypt patchset
(https://lore.kernel.org/linux-btrfs/cover.1706116485.git.josef@toxicpanda.com/)
which was handled over time by various people: Omar Sandoval, Sweet Tea
Dorminy, Josef Bacik.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>

Sun YangKai [Fri, 14 Nov 2025 07:24:48 +0000 (15:24 +0800)]

btrfs: remove redundant level reset in btrfs_del_items()

When btrfs_del_items() empties a leaf, it deletes the leaf unless it's
the root node. For the root leaf case, the code used to reset its level
to 0 via btrfs_set_header_level(). This is redundant as leaf nodes
always have level == 0.

Remove the unnecessary level assignment and invert the conditional to
handle only the non-root leaf deletion. The root leaf is correctly left
as-is.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Sun YangKai [Fri, 14 Nov 2025 07:24:47 +0000 (15:24 +0800)]

btrfs: simplify leaf traversal after path release in btrfs_next_old_leaf()

After releasing the path in btrfs_next_old_leaf(), we need to re-check
the leaf because a balance operation may have added items or removed the
last item. The original code handled this with two separate conditional
blocks, the second marked with a lengthy comment explaining a "missed
case".

Merge these two blocks into a single logical structure that handles both
scenarios more clearly.

Also update the comment to be more concise and accurate, incorporating the
explanation directly into the main block rather than a separate annotation.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Sun YangKai [Fri, 14 Nov 2025 07:24:46 +0000 (15:24 +0800)]

btrfs: optimize balance_level() path reference handling

Instead of incrementing refcount on 'left' node when it's referenced by
path, simply transfer ownership to path and set left to NULL. This
eliminates:

- Unnecessary refcount increment/decrement operations
- Redundant conditional checks for left node cleanup

The path now consistently owns the left node reference when used.
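
A minimal userspace sketch contrasting the two patterns. The refcounted
buffer, the path and the helper names are hypothetical, not the
balance_level() code:

```c
#include <assert.h>
#include <stddef.h>

struct demo_eb {
	int refs;
};

struct demo_path {
	struct demo_eb *node;
};

static void demo_get(struct demo_eb *eb)
{
	eb->refs++;
}

static void demo_put(struct demo_eb *eb)
{
	if (!eb)
		return;
	assert(eb->refs > 0);
	eb->refs--;
}

/* Old pattern: the path takes its own reference, the local one is dropped. */
static void demo_attach_with_get(struct demo_path *path, struct demo_eb *left)
{
	demo_get(left);
	path->node = left;
	demo_put(left);		/* cleanup of the local reference */
}

/* New pattern: the local reference itself moves to the path. */
static void demo_attach_by_transfer(struct demo_path *path,
				    struct demo_eb **left)
{
	path->node = *left;
	*left = NULL;		/* later cleanup sees NULL and does nothing */
}
```

Both variants leave the buffer with one reference owned by the path; the
second gets there without the increment/decrement pair.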

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Sun YangKai [Fri, 14 Nov 2025 07:24:45 +0000 (15:24 +0800)]

btrfs: factor out root promotion logic into promote_child_to_root()

The balance_level() function is overly long and contains a cold code path
that handles promoting a child node to root when the root has only one item.
This code has distinct logic that is clearer and more maintainable when
isolated in its own function.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Sun, 16 Nov 2025 00:02:50 +0000 (10:32 +1030)]

btrfs: raid56: remove the "_step" infix

The following functions were introduced as an intermediate step for
bs > ps support:

- rbio_stripe_step_paddr()
- rbio_pstripe_step_paddr()
- rbio_qstripe_step_paddr()
- sector_step_paddr_in_rbio()

They carried the "_step" infix because functions with the same names
minus the infix already existed, with a different parameter list.

Now that those existing functions have been cleaned up, there is no need
to keep the "_step" infix, so just remove it completely.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Fri, 14 Nov 2025 08:45:28 +0000 (19:15 +1030)]

btrfs: raid56: enable bs > ps support

The support code for bs > ps is complete, enable it and update
assertions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Fri, 14 Nov 2025 04:59:46 +0000 (15:29 +1030)]

btrfs: raid56: prepare finish_parity_scrub() to support bs > ps cases

The function finish_parity_scrub() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper, verify_one_parity_step()
  Since the P/Q generation is always done in a vertical stripe, we have
  to handle the range step by step.

- Only clear the rbio->dbitmap if all steps of an fs block match

- Remove rbio_stripe_paddr() and sector_paddr_in_rbio() helpers
  Now we either use the paddrs version for checksum, or the step version
  for P/Q generation/recovery.

- Make alloc_rbio_essential_pages() handle bs > ps cases
  Since for bs > ps cases, one fs block needs multiple pages, the
  existing simple check against rbio->stripe_pages[] is not enough.

  Extract a dedicated helper, alloc_rbio_sector_pages(), for the
  existing alloc_rbio_essential_pages(), which is still based on sector
  number.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Fri, 14 Nov 2025 04:09:09 +0000 (14:39 +1030)]

btrfs: raid56: prepare rbio_bio_add_io_paddr() to support bs > ps cases

The function rbio_bio_add_io_paddr() assumes each fs block can be mapped
by one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce a helper bio_add_paddrs()
  Previously we only needed to add a single page to a bio for an fs
  block, but now we need to add multiple pages, which means we can fail
  halfway.

  In that case we need to properly revert the bio (only its size though).

- Rename rbio_add_io_paddr() to rbio_add_io_paddrs()
  And change the @paddr parameter to @paddrs[].

- Change all callers to use the updated rbio_add_io_paddrs()
  For the @paddrs pointer used for the new function, it can be grabbed
  using sector_paddrs_in_rbio() and rbio_stripe_paddrs() helpers.
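
A minimal user-space sketch of the "add all steps or revert" idea
described above. struct toy_bio and the toy_*() functions are
hypothetical stand-ins and not the kernel's bio API. Unlike the real
helper, which only needs to revert the bio size, this toy version
rolls back both the slot count and the size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_SLOTS 4

/* Hypothetical stand-in for a bio: a few slots plus a total size. */
struct toy_bio {
	uint64_t slots[TOY_MAX_SLOTS];
	unsigned int nr_slots;
	unsigned int size;
};

/* Add one step to the toy bio, fails when all slots are used. */
static bool toy_bio_add_one(struct toy_bio *bio, uint64_t paddr,
			    unsigned int step)
{
	if (bio->nr_slots >= TOY_MAX_SLOTS)
		return false;
	bio->slots[bio->nr_slots++] = paddr;
	bio->size += step;
	return true;
}

/*
 * Add all steps of one fs block, or none of them.
 * On a halfway failure the toy bio is rolled back to its old state.
 */
static bool toy_bio_add_paddrs(struct toy_bio *bio, const uint64_t *paddrs,
			       unsigned int nsteps, unsigned int step)
{
	const unsigned int old_slots = bio->nr_slots;
	const unsigned int old_size = bio->size;

	assert(nsteps > 0);
	for (unsigned int i = 0; i < nsteps; i++) {
		if (!toy_bio_add_one(bio, paddrs[i], step)) {
			bio->nr_slots = old_slots;
			bio->size = old_size;
			return false;
		}
	}
	return true;
}
```

A caller that gets false back can treat the toy bio as full and start
a new one, without seeing a partially added fs block.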

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 17 Nov 2025 04:09:51 +0000 (14:39 +1030)]

btrfs: raid56: prepare steal_rbio() to support bs > ps cases

The function steal_rbio() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce two helpers to calculate the sector number
  Previously we assumed one page would contain at least one fs block, thus
  we could use something like "sectors_per_page = PAGE_SIZE / sectorsize;",
  but with bs > ps support that number will be 0.

  Instead introduce two helpers:

  * page_nr_to_sector_nr()
    Returns the sector number of the first sector covered by the page.

  * page_nr_to_num_sectors()
    Returns how many sectors are covered by the page.

  And use the returned values for bitmap operations rather than the
  open-coded "PAGE_SIZE / sectorsize".
  Those helpers also have extra ASSERT()s to catch weird numbers.

- Use above helpers
  The involved functions are:
  * steal_rbio_page()
  * is_data_stripe_page()
  * full_page_sectors_uptodate()
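
A rough user-space sketch of what such helpers could look like. The
toy_*() names, the fixed 4K TOY_PAGE_SIZE and the exact formulas are
assumptions made for illustration, the real helpers take an rbio and
may differ in detail.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u

/* Sector number of the first sector covered by page @page_nr. */
static unsigned int toy_page_nr_to_sector_nr(unsigned int sectorsize,
					      unsigned int page_nr)
{
	/* Catch weird numbers: sectorsize must be a power of two. */
	assert(sectorsize && (sectorsize & (sectorsize - 1)) == 0);
	/* Byte offset of the page divided by the sector size. */
	return (unsigned int)(((unsigned long long)page_nr * TOY_PAGE_SIZE) /
			      sectorsize);
}

/* How many sectors are (fully or partially) covered by one page. */
static unsigned int toy_page_nr_to_num_sectors(unsigned int sectorsize)
{
	assert(sectorsize && (sectorsize & (sectorsize - 1)) == 0);
	if (sectorsize >= TOY_PAGE_SIZE)
		return 1;
	return TOY_PAGE_SIZE / sectorsize;
}
```

With an 8K sector size the open-coded "PAGE_SIZE / sectorsize" gives 0,
while the sketch above still reports one (partially covered) sector per
page.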

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 17 Nov 2025 03:27:55 +0000 (13:57 +1030)]

btrfs: raid56: prepare set_bio_pages_uptodate() to support bs > ps cases

The function set_bio_pages_uptodate() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Update find_stripe_sector_nr() to check only the first step paddr
  We don't need to check each paddr, as the bios are still aligned to fs
  block size, thus checking the first step is enough.

- Use step size to iterate the bio
  This means we only need to find the sector number for the first step
  of each fs block, and skip the remaining part.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Fri, 14 Nov 2025 04:00:25 +0000 (14:30 +1030)]

btrfs: raid56: prepare verify_bio_data_sectors() to support bs > ps cases

The function verify_bio_data_sectors() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Make get_bio_sector_nr() consider bs > ps cases
  The function is utilized to calculate the sector number of a device
  bio submitted by btrfs raid56 layer.

- Assemble a local paddrs[] for checksum calculation

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple non-contiguous
  pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Fri, 14 Nov 2025 03:31:15 +0000 (14:01 +1030)]

btrfs: raid56: prepare verify_one_sector() to support bs > ps cases

The function verify_one_sector() assumes each fs block can be mapped by
one page, blocking bs > ps support for raid56.

Prepare it for bs > ps cases by:

- Introduce helpers to get a paddrs pointer
  Thankfully all the higher layer bios should still be aligned to fs
  block size, thus a fs block should still be fully covered by the bio.

  Introduce sector_paddrs_in_rbio() and rbio_stripe_paddrs(), which will
  return a paddrs pointer inside btrfs_raid_bio::bio_paddrs[] or
  stripe_paddrs[].

  The pointer can be directly passed to
  btrfs_calculate_block_csum_pages() to verify the checksum.

- Open code btrfs_check_block_csum()
  btrfs_check_block_csum() only supports fs blocks backed by large
  folios.

  But for raid56 we can have fs blocks backed by multiple non-contiguous
  pages, e.g. direct IO, encoded read/write/send.

  So instead of using btrfs_check_block_csum(), open code it to use
  btrfs_calculate_block_csum_pages().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Fri, 14 Nov 2025 03:19:33 +0000 (13:49 +1030)]

btrfs: raid56: prepare recover_vertical() to support bs > ps cases

Currently recover_vertical() assumes that every fs block can be mapped
by one page, which is blocking bs > ps support for raid56.

Prepare recover_vertical() to support bs > ps cases by:

- Introduce recover_vertical_step() helper
  Which will recover a full step (min(PAGE_SIZE, sectorsize)).

  Now recover_vertical() will do the error check for the specified
  sector, do the recovery step by step, then do the sector verification.

- Fix a spelling error of get_rbio_vertical_errors()
  The old name has a typo: "veritical".
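
The step by step recovery can be sketched in user space as below. This
is only a toy model: it covers the single missing stripe, XOR parity
(RAID5 style) case, and all toy_*() names plus the layout of
@other_steps are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* One step is min(PAGE_SIZE, sectorsize). */
static unsigned int toy_step_size(unsigned int sectorsize)
{
	return sectorsize < TOY_PAGE_SIZE ? sectorsize : TOY_PAGE_SIZE;
}

/* Rebuild one step by XOR-ing the same step of all surviving stripes. */
static void toy_recover_step(uint8_t *dest, uint8_t *const *others,
			     unsigned int nr_others, unsigned int step)
{
	memset(dest, 0, step);
	for (unsigned int i = 0; i < nr_others; i++)
		for (unsigned int j = 0; j < step; j++)
			dest[j] ^= others[i][j];
}

/*
 * Rebuild one fs block, step by step.
 * @other_steps[s * nr_others + i] is step @s of surviving stripe @i, so
 * the steps of one fs block do not need to be physically contiguous.
 */
static void toy_recover_block(uint8_t **dest_steps, uint8_t **other_steps,
			      unsigned int nr_others, unsigned int sectorsize)
{
	const unsigned int step = toy_step_size(sectorsize);
	const unsigned int nsteps = sectorsize / step;

	assert(nr_others > 0);
	for (unsigned int s = 0; s < nsteps; s++)
		toy_recover_step(dest_steps[s], &other_steps[s * nr_others],
				 nr_others, step);
}
```

The error check before and the sector verification after the loop are
left out of the sketch.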

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 13 Nov 2025 09:40:38 +0000 (20:10 +1030)]

btrfs: raid56: prepare generate_pq_vertical() for bs > ps cases

Unlike btrfs_calculate_block_csum_pages(), we cannot handle multiple
pages at the same time for P/Q generation.

So here we introduce a new @step_nr, and various helpers to grab the
sub-block page from the rbio, and generate the P/Q stripe page by page.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 13 Nov 2025 08:41:36 +0000 (19:11 +1030)]

btrfs: raid56: introduce a new parameter to locate a sector

Since we cannot ensure that all bios from the higher layer are backed by
large folios (e.g. direct IO, encoded read/write/send), we need the
ability to locate a sub-block (i.e. a page) inside a full stripe.

So the existing @stripe_nr + @sector_nr combination is not enough to
locate such a page for bs > ps cases.

Introduce a new parameter, @step_nr, to locate the page of a larger fs
block.  The naming follows the conventions used elsewhere inside
btrfs, where one step is min(sectorsize, PAGE_SIZE).

It's still a preparation, only touching the following aspects:

- btrfs_dump_rbio()
  To show the new @sector_nsteps member.

- btrfs_raid_bio::sector_nsteps
  Recording how many steps there are inside a fs block.

- Enlarge btrfs_raid_bio::*_paddrs[] size
  To take @sector_nsteps into consideration.

- index_one_bio()
- index_stripe_sectors()
- memcpy_from_bio_to_stripe()
- cache_rbio_pages()
- need_read_stripe_sectors()
  Those functions are iterating *_paddrs[], which needs to take
  sector_nsteps into consideration.

- Rename rbio_stripe_sector_index() to rbio_sector_index()
  The "stripe" part is not that helpful.

  And an extra ASSERT() before returning the result.

- Add a new rbio_paddr_index() helper
  This will take the extra @step_nr into consideration.

- The comments of btrfs_raid_bio
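
To illustrate how the extra @step_nr could extend the index calculation,
here is a small user-space sketch. struct toy_rbio, the toy_*() helpers
and the exact array layout (all steps of a sector stored back to back)
are assumptions, not necessarily what the real helpers do.

```c
#include <assert.h>

/* Hypothetical, stripped down stand-in for btrfs_raid_bio. */
struct toy_rbio {
	unsigned int real_stripes;	/* Data + P/Q stripes. */
	unsigned int stripe_nsectors;	/* Sectors per stripe. */
	unsigned int sector_nsteps;	/* Steps per fs block. */
};

/* Index of a sector inside the full stripe. */
static unsigned int toy_sector_index(const struct toy_rbio *rbio,
				     unsigned int stripe_nr,
				     unsigned int sector_nr)
{
	assert(stripe_nr < rbio->real_stripes);
	assert(sector_nr < rbio->stripe_nsectors);
	return stripe_nr * rbio->stripe_nsectors + sector_nr;
}

/* Index into a *_paddrs[] style array, one entry per step. */
static unsigned int toy_paddr_index(const struct toy_rbio *rbio,
				    unsigned int stripe_nr,
				    unsigned int sector_nr,
				    unsigned int step_nr)
{
	assert(step_nr < rbio->sector_nsteps);
	return toy_sector_index(rbio, stripe_nr, sector_nr) *
	       rbio->sector_nsteps + step_nr;
}
```

E.g. with 8 sectors per stripe and 2 steps per fs block, step 1 of
sector 3 in stripe 1 lands at index (1 * 8 + 3) * 2 + 1 = 23.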

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 13 Nov 2025 08:24:25 +0000 (18:54 +1030)]

btrfs: raid56: add an overview for the btrfs_raid_bio structure

The structure needs to track both the pages from higher layer bio and
internal pages, thus it can be a little complex to grasp.

Add an overview of the structure, especially how we track different
pages from higher layer bios and internal ones, to save some time for
future developers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 3 Nov 2025 02:21:09 +0000 (12:51 +1030)]

btrfs: scrub: always update btrfs_scrub_progress::last_physical

[BUG]
When a scrub failed immediately without any byte scrubbed, the returned
btrfs_scrub_progress::last_physical will always be 0, even if there is a
non-zero @start passed into btrfs_scrub_dev() for resume cases.

This will reset the progress and make later scrub resume start from the
beginning.

[CAUSE]
The function btrfs_scrub_dev() accepts a @progress parameter to copy its
updated progress to the caller. There are cases where we either don't
touch progress::last_physical at all or copy 0 into last_physical:

- last_physical not updated at all
  If some error happened before scrubbing any super block or chunk, we
  will not copy the progress, leaving the @last_physical untouched.

  E.g. failing to allocate @sctx, scrubbing a missing device, or
  there already being a running scrub.

  All those cases won't touch @progress at all, so last_physical is
  left untouched, which means 0 in most cases.

- Error out before scrubbing any bytes
  In those cases we allocated @sctx, and sctx->stat.last_physical is all
  zero (initialized by kvzalloc()).
  Unfortunately some critical errors happened during
  scrub_enumerate_chunks() or scrub_supers() before any stripe is really
  scrubbed.

  In that case although we will copy sctx->stat back to @progress, since
  no byte is really scrubbed, last_physical will be overwritten to 0.

[FIX]
Make sure the parameter @progress always has its @last_physical member
updated to @start parameter inside btrfs_scrub_dev().

At the very beginning of the function, set @progress->last_physical to
@start, so that even if we error out without doing progress copying,
last_physical is still at @start.

Then after we got @sctx allocated, set sctx->stat.last_physical to
@start, this will make sure even if we didn't get any byte scrubbed, at
the progress copying stage the @last_physical is not left as zero.

This should resolve the resume progress reset problem.
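
The two assignments can be modelled in user space as below. This is
only a sketch of the control flow: struct toy_progress, struct toy_sctx
and toy_scrub_dev() with its two failure switches are hypothetical and
heavily simplified.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct toy_progress {
	uint64_t last_physical;
};

struct toy_sctx {
	struct toy_progress stat;
};

static int toy_scrub_dev(uint64_t start, struct toy_progress *progress,
			 bool fail_early, bool fail_before_scrub)
{
	struct toy_sctx *sctx;
	int ret = 0;

	/* Covers all error paths that never copy the progress. */
	progress->last_physical = start;

	if (fail_early)
		return -ENODEV;

	sctx = calloc(1, sizeof(*sctx));
	if (!sctx)
		return -ENOMEM;

	/* Covers errors hit before any byte is scrubbed. */
	sctx->stat.last_physical = start;

	if (fail_before_scrub)
		ret = -EIO;
	else
		sctx->stat.last_physical = start + 4096; /* Pretend progress. */

	/* The progress copying stage. */
	*progress = sctx->stat;
	free(sctx);
	return ret;
}
```

In both failure modes the caller gets @start back in last_physical, so
a later resume does not restart from 0.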

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Mon, 17 Nov 2025 12:15:09 +0000 (12:15 +0000)]

btrfs: place all boolean fields together in struct find_free_extent_ctl

Move the 'retry_uncached' and 'hint' fields close to the other boolean
fields so that we remove a hole from the structure and reduce its size
from 136 bytes down to 128 bytes. Currently this structure is only
allocated on the stack of btrfs_reserve_extent().
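
As an illustration of how grouping small fields removes holes, here is
a toy pair of structures. They are not the real find_free_extent_ctl
layout, the field selection is made up, and the exact sizes depend on
the ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Booleans scattered between 8 byte members, each one leaves a hole. */
struct toy_ctl_scattered {
	uint64_t ram_bytes;
	bool delalloc;
	uint64_t search_start;
	bool hint;
};

/* Same members, with the booleans placed next to each other. */
struct toy_ctl_grouped {
	uint64_t ram_bytes;
	uint64_t search_start;
	bool delalloc;
	bool hint;
};

/* How many bytes the regrouping saves on the current ABI. */
static size_t toy_bytes_saved(void)
{
	return sizeof(struct toy_ctl_scattered) -
	       sizeof(struct toy_ctl_grouped);
}
```

On a typical 64 bit ABI the scattered version takes 32 bytes and the
grouped one 24 bytes. Tools like pahole can show such holes for real
kernel structures.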

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Mon, 17 Nov 2025 12:02:29 +0000 (12:02 +0000)]

btrfs: use booleans for delalloc arguments and struct find_free_extent_ctl

The struct find_free_extent_ctl uses an int for the 'delalloc' field but
it's always used as a boolean, and its value is used to be passed to
several functions to signal if we are dealing with delalloc. The same goes
for the 'is_data' argument from btrfs_reserve_extent(). So change the type
from int to bool and move the field definition in the find_free_extent_ctl
structure so that it's close to other bool fields and reduces the size of
the structure from 144 down to 136 bytes (at the moment it's only declared
on the stack of btrfs_reserve_extent(), never allocated otherwise).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Fri, 14 Nov 2025 16:00:04 +0000 (16:00 +0000)]

btrfs: use bool type for btrfs_path members used as booleans

Many fields of struct btrfs_path are used as booleans but their type is
an unsigned int (of 1 bit width to save space). Change the type to
bool, keeping the :1 suffix so that they combine with the previous u8
fields in order to save space. This makes the code clearer by using
explicit true/false and more in line with the preferred style, while
preserving the size of the structure.
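
As a side note, a 1 bit bool field and a 1 bit unsigned int field also
behave differently on assignment, which the following sketch shows. The
toy structures only borrow a few member names for illustration and are
not the real struct btrfs_path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Old style: 1 bit wide unsigned int used as a boolean. */
struct toy_path_old {
	uint8_t reada;
	uint8_t lowest_level;
	unsigned int skip_locking:1;
};

/* New style: bool with the :1 suffix kept. */
struct toy_path_new {
	uint8_t reada;
	uint8_t lowest_level;
	bool skip_locking:1;
};

/* Store @v into the unsigned int bit field, only the lowest bit is kept. */
static unsigned int toy_store_old(unsigned int v)
{
	struct toy_path_old p = { 0 };

	p.skip_locking = v;
	return p.skip_locking;
}

/* Store @v into the bool bit field, any non-zero value becomes true. */
static bool toy_store_new(unsigned int v)
{
	struct toy_path_new p = { 0 };

	p.skip_locking = v;
	return p.skip_locking;
}
```

Storing 2 gives 0 for the unsigned int field but true for the bool
field, so besides readability the bool type is also a bit more
forgiving.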

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 12:07:14 +0000 (12:07 +0000)]

btrfs: update check_skip variable after unlocking current node

There's no need to update the local variable 'check_skip' to false inside
the critical section delimited by the lock of the current node, so do it
after unlocking the node.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 16:44:41 +0000 (16:44 +0000)]

btrfs: abort transaction on item count overflow in __push_leaf_left()

If we try to push an item count from the right leaf that is greater than
the number of items in the leaf, we just emit a warning. This should
never happen but if it does we get an underflow in the new number of
items in the right leaf and chaos follows from it. So replace the warning
with proper error handling, by aborting the transaction and returning
-EUCLEAN, and proper logging by using btrfs_crit() instead of WARN(),
which gives us proper formatting and information about the filesystem.
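
The underflow and the new style of check can be sketched in user space
as below. toy_push_left() is a hypothetical reduction of the real
function: the logging and the transaction abort are only hinted at in a
comment, and the EUCLEAN fallback value is the Linux one.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, fallback for other platforms. */
#endif

#if defined(__GNUC__)
#define toy_unlikely(x) __builtin_expect(!!(x), 0)
#else
#define toy_unlikely(x) (x)
#endif

/* Remove @push_items items from a right leaf holding @right_nritems. */
static int toy_push_left(uint32_t *right_nritems, uint32_t push_items)
{
	if (toy_unlikely(push_items > *right_nritems)) {
		/*
		 * The kernel would log a critical message and abort the
		 * transaction here, instead of only warning and going on.
		 */
		return -EUCLEAN;
	}
	*right_nritems -= push_items;
	return 0;
}
```

Without the early return, pushing 5 items out of a leaf with 3 items
would leave an unsigned item count of 4294967294.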

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 11:52:34 +0000 (11:52 +0000)]

btrfs: always use right leaf variable in __push_leaf_left()

The 'right' variable points to path->nodes[0] and path->nodes[0] is never
changed, but some places use 'right' while others refer to path->nodes[0].
Update all sites to use 'right', as it's not only shorter but also easier
to reason about, since it means the right leaf and avoids any confusion
with the sibling left leaf.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 11:46:34 +0000 (11:46 +0000)]

btrfs: remove duplicated leaf dirty status clearing in __push_leaf_right()

We have already called btrfs_clear_buffer_dirty() against the left leaf in
the code above:

  btrfs_set_header_nritems(left, left_nritems);

  if (left_nritems)
       btrfs_mark_buffer_dirty(trans, left);
  else
       btrfs_clear_buffer_dirty(trans, left);

So remove the second check for a 0 number of items in the left leaf and
the repeated call to btrfs_clear_buffer_dirty() against the left leaf.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 11:32:44 +0000 (11:32 +0000)]

btrfs: always use left leaf variable in __push_leaf_right()

The 'left' variable points to path->nodes[0] and path->nodes[0] is never
changed, but some places use 'left' while others refer to path->nodes[0].
Update all sites to use 'left', as it's not only shorter but also easier
to reason about, since it means the left leaf and avoids any confusion
with the sibling right leaf.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 13:04:13 +0000 (13:04 +0000)]

btrfs: add unlikely to critical error in btrfs_extend_item()

It's not expected to get a data size greater than the leaf's free space,
which would lead to a leaf dump and BUG(), so tag the if statement's
expression as unlikely, hinting the compiler to potentially generate
better code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 12:59:19 +0000 (12:59 +0000)]

btrfs: remove pointless return value update in btrfs_del_items()

The call to btrfs_del_leaf() can only return an error (negative value) or
zero (success). If we didn't get an error then 'ret' is zero, so it's
pointless to set it to zero again.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Thu, 13 Nov 2025 12:52:45 +0000 (12:52 +0000)]

btrfs: fix leaf leak in an error path in btrfs_del_items()

If the call to btrfs_del_leaf() fails we return without decrementing the
extra ref we took on the leaf, therefore leaking it. Fix this by ensuring
we drop the ref count before returning the error.
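
The shape of the fix, reduced to a user-space toy. struct toy_leaf, the
plain integer ref count and the failure switch are all made up for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_leaf {
	int refs;
};

/* Stand-in for a leaf deletion that may fail. */
static int toy_del_leaf(struct toy_leaf *leaf, bool fail)
{
	(void)leaf;
	return fail ? -EIO : 0;
}

static int toy_del_items(struct toy_leaf *leaf, bool fail)
{
	int ret;

	assert(leaf->refs > 0);
	leaf->refs++;		/* The extra ref taken before the deletion. */
	ret = toy_del_leaf(leaf, fail);
	leaf->refs--;		/* Dropped before any return, error or not. */
	if (ret < 0)
		return ret;
	return 0;
}
```

The buggy pattern would return on error before the decrement, leaving
the ref count one too high forever.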

Fixes: 751a27615dda ("btrfs: do not BUG_ON() on tree mod log failures at btrfs_del_ptr()")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Zhen Ni [Fri, 14 Nov 2025 07:53:13 +0000 (15:53 +0800)]

btrfs: fix incomplete parameter rename in btrfs_decompress()

Commit 2c25716dcc25 ("btrfs: zlib: fix and simplify the inline extent
decompression") renamed the 'start_byte' parameter to 'dest_pgoff' in
btrfs_decompress(). The remaining 'start_byte' references are
inconsistent with the actual implementation and may cause confusion for
developers.

Ensure consistency between function declaration and implementation.

Signed-off-by: Zhen Ni <zhen.ni@easystack.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

David Sterba [Tue, 11 Nov 2025 14:31:52 +0000 (15:31 +0100)]

btrfs: make a few more ASSERTs verbose

We have support for an optional string to be printed in ASSERT() (added in
19468a623a9109 ("btrfs: enhance ASSERT() to take optional format
string")), but it's not yet used everywhere it could be, so add it to a
few more files.

Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 10 Nov 2025 22:42:01 +0000 (09:12 +1030)]

btrfs: enable encoded read/write/send for bs > ps cases

Since the read verification and read repair now both support bs > ps
without large folios, we can enable encoded read/write/send.

Now we can relax the alignment in assert_bbio_alignment() to
min(blocksize, PAGE_SIZE).
But also add the extra blocksize based alignment check for the logical
and length of the bbio.

There is a pitfall in btrfs_add_compress_bio_folios(), which relies on
the folios passed in to meet the minimal folio order.
But now we can pass regular page sized folios in, so update it to check
each folio's size instead of using the minimal folio size.

This even allows btrfs_add_compress_bio_folios() to handle a folios array
with different sizes, though thankfully we don't yet need to handle such
a crazy situation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 10 Nov 2025 22:42:00 +0000 (09:12 +1030)]

btrfs: make read verification handle bs > ps cases without large folios

The current read verification also relies on large folios to support
bs > ps cases, but that introduces quite a few limitations.

To enhance read-repair to support bs > ps without large folios:

- Make btrfs_data_csum_ok() accept an array of paddrs
  Which can pass the paddrs[] directly into
  btrfs_calculate_block_csum_pages().

- Make repair_one_sector() accept an array of paddrs
  So that it can submit a repair bio backed by regular pages, not only
  large folios.
  This requires us to allocate more slots at bio allocation time though.

  Also since the caller may have only partially advanced the saved_iter
  for bs > ps cases, we can not directly trust the logical bytenr from
  saved_iter (can be unaligned), thus a manual round down is necessary
  for the logical bytenr.

- Make btrfs_check_read_bio() build an array of paddrs
  The tricky part is that we can only call btrfs_data_csum_ok() after
  all involved pages are assembled.

  This means at the call time of btrfs_check_read_bio(), our offset
  inside the bio is already at the end of the fs block.
  Thus we must re-calculate @bio_offset for btrfs_data_csum_ok() and
  repair_one_sector().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 10 Nov 2025 22:41:59 +0000 (09:11 +1030)]

btrfs: make btrfs_repair_io_failure() handle bs > ps cases without large folios

Currently btrfs_repair_io_failure() only accepts a single @paddr
parameter, and for bs > ps cases it's required that @paddr is backed by
a large folio.

That assumption has quite some limitations, preventing us from utilizing
true zero-copy direct-io and encoded read/writes.

To address the problem, enhance btrfs_repair_io_failure() by:

- Accept an array of paddrs, up to 64K / PAGE_SIZE entries
  This kind of acts like a bio_vec, but with very limited entries, as the
  function is only utilized to repair one fs data block, or a tree block.

  Both have an upper size limit (BTRFS_MAX_BLOCK_SIZE, i.e. 64K), so we
  don't need the full bio_vec thing to handle it.

- Allocate a bio with multiple slots
  Previously even for bs > ps cases, we only passed in a contiguous
  physical address range, thus a single slot will be enough.

  But not anymore, so we have to allocate a bio structure, rather than
  using the on-stack one.

- Use on-stack memory to allocate @paddrs array
  It's at most 16 pages (4K page size, 64K block size), will take up at
  most 128 bytes.
  I think the on-stack cost is still acceptable.

- Add one extra check to make sure the repair bio is exactly one block

- Utilize btrfs_repair_io_failure() to submit a single bio for metadata
  This should improve the read-repair performance for metadata, as now
  we submit a node sized bio then wait, rather than submitting each block
  of the metadata and waiting for each submitted block.

- Add one extra parameter indicating the step
  This is due to the fact that metadata step can be as large as
  nodesize, instead of sectorsize.
  So we need a way to distinguish metadata and data repair.

- Reduce the width of @length parameter of btrfs_repair_io_failure()
  Since we only call btrfs_repair_io_failure() on a single data or
  metadata block, u64 is overkill.
  Use u32 instead and add an extra ASSERT() to make sure the length
  never exceeds BTRFS_MAX_BLOCK_SIZE.
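
The on-stack cost mentioned above can be double checked with a few
lines of user-space C. The TOY_* constants and toy_phys_addr_t are
assumptions matching the 4K page size, 64K block size and 64 bit
physical address case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BLOCK_SIZE	(64u * 1024u)
#define TOY_PAGE_SIZE		4096u
#define TOY_MAX_STEPS		(TOY_MAX_BLOCK_SIZE / TOY_PAGE_SIZE)

/* Assumed to be 64 bit wide, like phys_addr_t on 64 bit systems. */
typedef uint64_t toy_phys_addr_t;

/* Number of paddrs[] entries needed to cover @length with @step. */
static unsigned int toy_nr_steps(uint32_t length, uint32_t step)
{
	assert(length <= TOY_MAX_BLOCK_SIZE);
	assert(step && length % step == 0);
	return length / step;
}

/* Size of a worst case on-stack paddrs[] array. */
static size_t toy_paddrs_stack_bytes(void)
{
	toy_phys_addr_t paddrs[TOY_MAX_STEPS];

	return sizeof(paddrs);
}
```

A 16K data block needs 4 entries with a 4K step, while a 16K tree block
repaired with a nodesize step needs a single entry. The worst case
array is 16 entries, i.e. 128 bytes.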

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Mon, 10 Nov 2025 22:41:58 +0000 (09:11 +1030)]

btrfs: make btrfs_csum_one_bio() handle bs > ps without large folios

For bs > ps cases, all folios passed into btrfs_csum_one_bio() are
ensured to be backed by large folios.  But that requirement excludes
features like direct IO and encoded writes.

To support bs > ps without large folios, enhance btrfs_csum_one_bio()
by:

- Split btrfs_calculate_block_csum() into two versions
  * btrfs_calculate_block_csum_folio()
    For call sites where a fs block is always backed by a large folio.

    This will do extra checks on the folio size, build a paddrs[] array,
    and pass it into the newer btrfs_calculate_block_csum_pages()
    helper.

    For now btrfs_check_block_csum() is still using this version.

  * btrfs_calculate_block_csum_pages()
    For call sites that may hit a fs block backed by noncontiguous pages.
    The pages are represented by paddrs[] array, which includes the
    offset inside the page.

    This function will do the proper sub-block handling.

- Make btrfs_csum_one_bio() use btrfs_calculate_block_csum_pages()
  This means we will need to build a local paddrs[] array, and after
  filling a fs block, do the checksum calculation.
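
A small user-space sketch of checksumming one fs block that is spread
over several noncontiguous pages. FNV-1a is only used as a stand-in for
the real checksum algorithms, and the toy_*() names and the steps[]
array (playing the role of paddrs[]) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CSUM_INIT	2166136261u	/* FNV-1a offset basis. */
#define TOY_CSUM_PRIME	16777619u	/* FNV-1a prime. */

/* Feed @len bytes into a running FNV-1a hash. */
static uint32_t toy_csum_update(uint32_t csum, const uint8_t *buf, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		csum ^= buf[i];
		csum *= TOY_CSUM_PRIME;
	}
	return csum;
}

/*
 * Checksum one fs block made of @nsteps buffers of @step_size bytes each.
 * The buffers are fed in order, so they need not be contiguous.
 */
static uint32_t toy_block_csum_pages(const uint8_t *const *steps,
				     unsigned int nsteps, size_t step_size)
{
	uint32_t csum = TOY_CSUM_INIT;

	assert(nsteps > 0);
	for (unsigned int i = 0; i < nsteps; i++)
		csum = toy_csum_update(csum, steps[i], step_size);
	return csum;
}
```

Feeding two separate 4K buffers gives the same result as feeding one
contiguous 8K buffer with the same content, which is the property the
sub-block handling relies on.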

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Filipe Manana [Tue, 11 Nov 2025 15:40:47 +0000 (15:40 +0000)]

btrfs: move struct reserve_ticket definition to space-info.c

It's not used anywhere outside space-info.c so move it from space-info.h
into space-info.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

David Sterba [Wed, 15 Oct 2025 16:48:37 +0000 (18:48 +0200)]

btrfs: move and rename CSUM_FMT definition

Move the CSUM_FMT* definitions to fs.h, where BTRFS_KEY_FMT is defined,
and add the prefix for consistency.

Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Sun YangKai [Tue, 7 Oct 2025 03:35:12 +0000 (11:35 +0800)]

btrfs: tests: do trivial BTRFS_PATH_AUTO_FREE conversions

Trivial pattern for the auto freeing where there are no operations
between btrfs_free_path() and the function returns.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 9 Oct 2025 04:40:01 +0000 (15:10 +1030)]

btrfs: raid56: remove sector_ptr structure

Since the sector_ptr structure now only contains a single paddr, there
is no need to use that structure.

Instead use phys_addr_t array for bio and stripe pointers.

This means several helpers also need to accept a paddr instead of
a sector_ptr pointer.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 9 Oct 2025 04:40:00 +0000 (15:10 +1030)]

btrfs: raid56: move sector_ptr::uptodate into a dedicated bitmap

The uptodate boolean member can be extracted into a bitmap, which will
save us some space (1 bit vs 8 bits per sector).

Furthermore we do not need to record the uptodate bitmap for bio
sectors, because a valid bio_sectors[].paddr means there is a bio and
its content is uptodate.
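
To make the space saving concrete, here is a tiny user-space bitmap.
The toy_*() helpers only imitate the idea of the kernel bitmap API and
are not the real set_bit()/test_bit() implementations.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BITS_PER_LONG	(sizeof(unsigned long) * CHAR_BIT)
#define TOY_BITS_TO_LONGS(n)	\
	(((n) + TOY_BITS_PER_LONG - 1) / TOY_BITS_PER_LONG)

/* Mark sector @nr as uptodate. */
static void toy_set_bit(unsigned long *map, unsigned int nr)
{
	map[nr / TOY_BITS_PER_LONG] |= 1UL << (nr % TOY_BITS_PER_LONG);
}

/* Check if sector @nr is uptodate. */
static bool toy_test_bit(const unsigned long *map, unsigned int nr)
{
	return (map[nr / TOY_BITS_PER_LONG] >> (nr % TOY_BITS_PER_LONG)) & 1UL;
}
```

For 64 sectors the bitmap takes 8 bytes, while one bool per sector
takes at least 64 bytes.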

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 9 Oct 2025 04:39:59 +0000 (15:09 +1030)]

btrfs: raid56: remove sector_ptr::has_paddr member

We can use paddr -1 as an indicator for unset/uninitialized paddr.

We can not use paddr 0: unlike virtual address 0, which is never mapped
and thus will always trigger a page fault, physical address 0 may be a
valid page.

So here we follow swiotlb to use (paddr)-1 as a special indicator for
invalid/unset physical address.

Even if the PFN may still be valid, our usage of the physical address
should always be aligned to fs block size (or page size for bs > ps
cases), thus such -1 paddr should never be a valid one.

With this special -1 paddr, we can get rid of has_paddr member and save
1 byte for sector_ptr structure.
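
A short user-space sketch of the sentinel. toy_phys_addr_t,
TOY_INVALID_PADDR and the helpers are hypothetical names, and a 64 bit
physical address type is assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t toy_phys_addr_t;

/* All bits set, can never be aligned to a block or page size. */
#define TOY_INVALID_PADDR	((toy_phys_addr_t)-1)

/* A paddr is set unless it holds the sentinel, 0 is a valid value. */
static bool toy_paddr_is_set(toy_phys_addr_t paddr)
{
	return paddr != TOY_INVALID_PADDR;
}

/* Check if @paddr is aligned to @align, which must be a power of two. */
static bool toy_paddr_is_aligned(toy_phys_addr_t paddr, uint32_t align)
{
	assert(align && (align & (align - 1)) == 0);
	return (paddr & (align - 1)) == 0;
}
```

Since every paddr we record is aligned, the all-ones value can not
collide with a real entry, so no separate has_paddr flag is needed.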

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Baolin Liu [Tue, 11 Nov 2025 12:05:58 +0000 (20:05 +0800)]

btrfs: simplify list initialization in btrfs_compr_pool_scan()

In btrfs_compr_pool_scan(), use LIST_HEAD() to declare and initialize
the 'remove' list_head in one step instead of using INIT_LIST_HEAD()
separately.
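
For readers not familiar with the two styles, here is a user-space
imitation of the list helpers. The toy_* definitions mimic the shape of
the kernel's list.h macros but are simplified re-implementations.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_list_head {
	struct toy_list_head *next, *prev;
};

/* Declare and initialize an empty list head in one step. */
#define TOY_LIST_HEAD_INIT(name)	{ &(name), &(name) }
#define TOY_LIST_HEAD(name) \
	struct toy_list_head name = TOY_LIST_HEAD_INIT(name)

/* Runtime initialization of an already declared list head. */
static void toy_init_list_head(struct toy_list_head *list)
{
	assert(list);
	list->next = list;
	list->prev = list;
}

/* An empty list head points to itself. */
static bool toy_list_empty(const struct toy_list_head *head)
{
	return head->next == head;
}
```

Both styles end up with a head pointing to itself, the one step
declaration just removes the window where the variable is declared but
not yet initialized.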

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Thu, 6 Nov 2025 09:32:15 +0000 (20:02 +1030)]

btrfs: scrub: factor out parity scrub code into a helper

The function scrub_raid56_parity_stripe() is handling the parity stripe
by the following steps:

- Scrub each data stripe
  And make sure everything is fine in each data stripe

- Cache the data stripe into the raid bio

- Use the cached raid bio to scrub the target parity stripe

Extract the last two steps into a new helper,
scrub_raid56_cached_parity(), as a cleanup and make the error handling
more straightforward.

With the following minor cleanups:

- Use on-stack bio structure
  The bio is always empty thus we do not need any bio vector nor the
  block device. Thus there is no need to allocate a bio, the on-stack
  one is more than enough to cut it.

- Remove the unnecessary btrfs_put_bioc() call if btrfs_map_block()
  failed
  If btrfs_map_block() failed, @bioc_ret will not be touched thus
  there is no need to call btrfs_put_bioc() in this case.

- Use a proper out: tag to do the cleanup
  Now the error cleanup is much shorter and simpler, just
  btrfs_bio_counter_dec() and bio_uninit().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Wed, 5 Nov 2025 09:58:12 +0000 (20:28 +1030)]

btrfs: make sure extent and csum paths are always released in scrub_raid56_parity_stripe()

Unlike queue_scrub_stripe() which uses the global sctx->extent_path and
sctx->csum_path which are always released at the end of scrub_stripe(),
scrub_raid56_parity_stripe() uses local extent_path and csum_path, as
that function is going to handle the full stripe, whose bytenr may be
smaller than the bytenr in the global sctx paths.

However the cleanup of local extent/csum paths is only happening after
we have successfully submitted an rbio.

There are several error routes where we didn't release those two paths:

- scrub_find_fill_first_stripe() errored out at csum tree search
  In that case extent_path is still valid, and that function itself will
  not release the extent_path passed in.
  And the function returns directly without releasing both paths.

- The full stripe is empty
- Some blocks failed to be recovered
- btrfs_map_block() failed
- raid56_parity_alloc_scrub_rbio() failed
  The function returns directly without releasing both paths.

Fix it by covering btrfs_release_path() calls inside the out: tag.

This is just a hot fix; in the long run we will switch to scope based
auto freeing for both local paths.

Fixes: 1dc4888e725d ("btrfs: scrub: avoid unnecessary extent tree search preparing stripes")
Fixes: 3c771c194402 ("btrfs: scrub: avoid unnecessary csum tree search preparing stripes")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

commit | commitdiff | tree

Qu Wenruo [Wed, 5 Nov 2025 21:45:03 +0000 (08:15 +1030)]

btrfs: use kvcalloc for btrfs_bio::csum allocation

[BUG]
There is a report that memory allocation failed for btrfs_bio::csum
during a large read:

  b2sum: page allocation failure: order:4, mode:0x40c40(GFP_NOFS|__GFP_COMP), nodemask=(null),cpuset=/,mems_allowed=0
  CPU: 0 UID: 0 PID: 416120 Comm: b2sum Tainted: G        W           6.17.0 #1 NONE
  Tainted: [W]=WARN
  Hardware name: Raspberry Pi 4 Model B Rev 1.5 (DT)
  Call trace:
   show_stack+0x18/0x30 (C)
   dump_stack_lvl+0x5c/0x7c
   dump_stack+0x18/0x24
   warn_alloc+0xec/0x184
   __alloc_pages_slowpath.constprop.0+0x21c/0x730
   __alloc_frozen_pages_noprof+0x230/0x260
   ___kmalloc_large_node+0xd4/0xf0
   __kmalloc_noprof+0x1c8/0x260
   btrfs_lookup_bio_sums+0x214/0x278
   btrfs_submit_chunk+0xf0/0x3c0
   btrfs_submit_bbio+0x2c/0x4c
   submit_one_bio+0x50/0xac
   submit_extent_folio+0x13c/0x340
   btrfs_do_readpage+0x4b0/0x7a0
   btrfs_readahead+0x184/0x254
   read_pages+0x58/0x260
   page_cache_ra_unbounded+0x170/0x24c
   page_cache_ra_order+0x360/0x3bc
   page_cache_async_ra+0x1a4/0x1d4
   filemap_readahead.isra.0+0x44/0x74
   filemap_get_pages+0x2b4/0x3b4
   filemap_read+0xc4/0x3bc
   btrfs_file_read_iter+0x70/0x7c
   vfs_read+0x1ec/0x2c0
   ksys_read+0x4c/0xe0
   __arm64_sys_read+0x18/0x24
   el0_svc_common.constprop.0+0x5c/0x130
   do_el0_svc+0x1c/0x30
   el0_svc+0x30/0xa0
   el0t_64_sync_handler+0xa0/0xe4
   el0t_64_sync+0x198/0x19c

[CAUSE]
Btrfs needs to allocate memory for btrfs_bio::csum for large reads, so
that we can later verify the contents of the read.

However, nowadays a read bio can easily go beyond BIO_MAX_VECS *
PAGE_SIZE (which is 1MiB for 4K page sizes), because with multi-page bvecs
one bvec can cover more than one page, as long as the pages are
physically adjacent.

This will become more common when the large folio support is moved out
of experimental features.

In the above case, a read larger than 4MiB with SHA256 checksums (32
bytes for each 4K block) will be able to trigger an order 4 allocation.

Order 4 is larger than PAGE_ALLOC_COSTLY_ORDER (3), thus without
extra flags such an allocation will not retry.

And if the system has a very small amount of memory (e.g. an RPi4 with a
low memory spec, or a VM with little RAM), or the memory is heavily
fragmented, such an allocation will fail and cause the above warning.
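
The arithmetic behind the order 4 claim can be checked with a small
sketch. It assumes 4K pages, a 4K block size, and that a large
allocation is rounded up to a power-of-two number of pages; the helper
names are made up for illustration.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE	4096u	/* assumption: 4K pages */
#define SKETCH_BLOCK_SIZE	4096u	/* assumption: 4K fs block size */

/* Bytes of checksum needed to cover a read of read_bytes. */
static size_t csum_bytes(size_t read_bytes, size_t csum_size)
{
	size_t blocks = (read_bytes + SKETCH_BLOCK_SIZE - 1) / SKETCH_BLOCK_SIZE;

	return blocks * csum_size;
}

/* Smallest order such that (page size << order) >= bytes. */
static unsigned int alloc_order(size_t bytes)
{
	unsigned int order = 0;

	while (((size_t)SKETCH_PAGE_SIZE << order) < bytes)
		order++;
	return order;
}
```

With these assumptions an exact 4MiB read needs 32KiB of checksums
(order 3), and one more block pushes the request to order 4.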

[FIX]
Although btrfs is handling the memory allocation failure correctly, we
do not really need physically contiguous memory just to store
our checksums.

In fact btrfs_csum_one_bio() is already using kvzalloc() to reduce the
memory pressure.

So follow suit and use kvcalloc() for btrfs_bio::csum.

Reported-by: Calvin Owens <calvin@wbinvd.org>
Link: https://lore.kernel.org/linux-btrfs/20251105180054.511528-1-calvin@wbinvd.org/
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Gladyshev Ilya [Sun, 2 Nov 2025 07:38:52 +0000 (10:38 +0300)]

btrfs: don't generate any code from ASSERT() in release builds

The current definition of ASSERT(cond) as (void)(cond) is redundant,
since these checks have no side effects and don't affect code logic.

However, some checks contain READ_ONCE() or other compiler-unfriendly
constructs. For example, ASSERT(list_empty) in btrfs_add_delalloc_inode()
was compiled to a redundant mov instruction due to this issue.

Define ASSERT as BUILD_BUG_ON_INVALID for !CONFIG_BTRFS_ASSERT builds,
which uses the sizeof(cond) trick. Also mark full_page_sectors_uptodate()
as __maybe_unused to suppress the "unneeded declaration" warning (it's
still needed at compile time).
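
A userspace sketch of the sizeof(cond) trick, modelled on the shape of
BUILD_BUG_ON_INVALID as far as I understand it. ASSERT_SKETCH and the
helper functions are illustrative names only. The condition is merely
an operand of sizeof, so it is type-checked but never evaluated.

```c
/*
 * Analog of the release-build definition: the expression must still
 * compile, but no code is generated to evaluate it.
 */
#define ASSERT_SKETCH(cond)	((void)sizeof((long)(cond)))

static int calls;	/* counts how often the check really ran */

static int check_with_side_effect(void)
{
	calls++;
	return 1;
}

static int run_release_assert(void)
{
	ASSERT_SKETCH(check_with_side_effect());
	return calls;	/* unchanged: the call inside sizeof was not executed */
}
```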

Signed-off-by: Gladyshev Ilya <foxido@foxido.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Fri, 24 Oct 2025 04:38:34 +0000 (15:08 +1030)]

btrfs: introduce btrfs_bio::async_csum

[ENHANCEMENT]
Btrfs currently calculates data checksums then submits the bio.

But after commit 968f19c5b1b7 ("btrfs: always fallback to buffered write
if the inode requires checksum"), any write with data checksum will
fall back to buffered IO, meaning the content will not change during
writeback.

This means we're safe to calculate the data checksum and submit the bio
in parallel, and only need the following new behavior:

- Wait for the csum generation to finish before calling btrfs_bio::end_io()
  Otherwise this can lead to use-after-free for the csum generation worker.

- Save the current bi_iter for csum_one_bio()
  As the submission part can advance btrfs_bio::bio.bi_iter, if not
  saved csum_one_bio() may get an empty bi_iter and not generate any
  checksum.

  Unfortunately this means we have to increase the size of btrfs_bio by
  16 bytes, but this is still acceptable.

As usual, such a new feature is hidden behind the experimental flag.

[THEORETICAL ANALYSIS]
Consider the following theoretical hardware performance, which should be
more or less close to modern mainstream hardware:

Memory bandwidth: 50GiB/s
CRC32C bandwidth: 45GiB/s
SSD bandwidth: 8GiB/s

Then write bandwidth with data checksum before the patch is:

1 / ( 1 / 50 + 1 / 45 + 1 / 8) = 5.98 GiB/s

After the patch, the bandwidth is:

1 / ( 1 / 50 + max( 1 / 45, 1 / 8)) = 6.90 GiB/s

The difference is about a 15.3% improvement.
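
The two estimates can be reproduced with a short sketch. It assumes the
simple model used above: per-byte costs add up when stages run back to
back, and only the slower stage counts when checksum generation and the
device write overlap. The function names are illustrative.

```c
/* Effective throughput when all three stages run back to back. */
static double serial_bw(double mem, double csum, double ssd)
{
	return 1.0 / (1.0 / mem + 1.0 / csum + 1.0 / ssd);
}

/* Checksum and device write overlap, so only the slower of the two costs time. */
static double overlapped_bw(double mem, double csum, double ssd)
{
	double t_csum = 1.0 / csum;
	double t_ssd = 1.0 / ssd;

	return 1.0 / (1.0 / mem + (t_csum > t_ssd ? t_csum : t_ssd));
}
```

Calling both helpers with 50, 45 and 8 (GiB/s) gives values close to the
5.98 and 6.90 GiB/s quoted above.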

[REAL WORLD BENCHMARK]
I'm using a Zen5 (HX 370) as the host, the VM has 4GiB memory, 10 vCPUs, the
storage is backed by a PCIe gen3 x4 NVMe.

The test is a direct IO write with 1MiB block size, writing 7GiB of data
into a btrfs mount with data checksums. Thus the direct write will
fall back to a buffered one:

Vanilla Datasum: 1619.97 MiB/s
Patched Datasum: 1792.26 MiB/s
Diff +10.6 %

In my case the bottleneck is the storage, so the improvement does not
reach the theoretical one, but it is still observable.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Thu, 23 Oct 2025 22:02:41 +0000 (08:32 +1030)]

btrfs: relax btrfs_inode::ordered_tree_lock IRQ locking context

We used the IRQ version of the spinlock for ordered_tree_lock, as
btrfs_finish_ordered_extent() can be called in end_bbio_data_write()
which was in IRQ context.

However, since we're moving all the btrfs_bio::end_io() calls into task
context, there is no longer any need to support IRQ context, thus we can
relax to regular spin_lock()/spin_unlock() for btrfs_inode::ordered_tree_lock.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Thu, 23 Oct 2025 08:02:34 +0000 (18:32 +1030)]

btrfs: remove btrfs_fs_info::compressed_write_workers

The reason why end_bbio_compressed_write() queues a work into
compressed_write_workers wq is for end_compressed_writeback() call, as
it will grab all the involved folios and clear the writeback flags,
which may sleep.

However, now that we always run btrfs_bio::end_io() in task context, there
is no need to queue the work anymore.

Just remove btrfs_fs_info::compressed_write_workers and
compressed_bio::write_end_work.

There is a comment about the work queued into compressed_write_workers;
change it to refer to flushing the endio wq instead, which is responsible
for handling all data endio functions.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Thu, 23 Oct 2025 04:49:16 +0000 (15:19 +1030)]

btrfs: make sure all btrfs_bio::end_io are called in task context

[BACKGROUND]
Btrfs has a lot of different bi_end_io functions, to handle different
raid profiles. But they introduced a lot of different contexts for
btrfs_bio::end_io() calls:

- Simple read bios
  Run in task context, backed by either endio_meta_workers or
  endio_workers.

- Simple write bios
  Run in IRQ context.

- RAID56 write or rebuild bios
  Run in task context, backed by rmw_workers.

- Mirrored write bios
  Run in IRQ context.

This is inconsistent, and contributes to the number of workqueues used
in btrfs.

[ENHANCEMENT]
Make all the above bios call their btrfs_bio::end_io() in task context,
backed by either endio_meta_workers for metadata, or endio_workers for
data.

For simple write bios, merge the handling into simple_end_io_work().

For mirrored write bios, it will be a little more complex, since either
the original or the cloned bios can run the final btrfs_bio::end_io().

Here we make sure the cloned bios are using btrfs_bioset, to reuse the
end_io_work, and run both original and cloned work inside the workqueue.

Add extra ASSERT()s to make sure btrfs_bio_end_io() is running in task
context.

This not only unifies the context for btrfs_bio::end_io() functions, but
also opens a new door for further btrfs_bio::end_io() related cleanups.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Tue, 28 Oct 2025 22:05:33 +0000 (08:35 +1030)]

btrfs: remove btrfs_bio::fs_info by extracting it from btrfs_bio::inode

Currently there is only one caller which doesn't populate
btrfs_bio::inode, and that's scrub.

The idea is scrub doesn't want any automatic csum verification nor
read-repair, as everything will be handled by scrub itself.

However, that behavior is really no different from that of the metadata
inode, thus we can reuse btree_inode as btrfs_bio::inode for scrub.

The only exception is in btrfs_submit_chunk() where if a bbio is from
scrub or data reloc inode, we set rst_search_commit_root to true.
This means we still need a way to distinguish scrub from metadata, but
that can be done by a new flag inside btrfs_bio.

Now that btrfs_bio::inode is a mandatory parameter, we can extract fs_info
from that inode and thus remove btrfs_bio::fs_info, saving 8 bytes in the
btrfs_bio structure.
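
A simplified sketch of the idea, using hypothetical stub structures
rather than the real btrfs_bio layout: once the inode pointer is always
set, the cached fs_info pointer is redundant and can be derived on
demand, shrinking the structure by one pointer.

```c
#include <stddef.h>

/* Hypothetical stand-ins for btrfs_fs_info and btrfs_inode. */
struct fs_info_stub {
	int id;
};

struct inode_stub {
	struct fs_info_stub *fs_info;
};

/* Before: both the inode and a cached fs_info pointer are stored. */
struct bbio_old {
	struct inode_stub *inode;
	struct fs_info_stub *fs_info;
	void *private;
};

/* After: only the inode is stored, fs_info is reached through it. */
struct bbio_new {
	struct inode_stub *inode;
	void *private;
};

static struct fs_info_stub *bbio_fs_info(const struct bbio_new *bbio)
{
	return bbio->inode->fs_info;
}
```

On a 64-bit build the size difference between the two stub structures is
one pointer, which matches the 8 bytes mentioned above.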

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Mon, 27 Oct 2025 23:36:36 +0000 (10:06 +1030)]

btrfs: headers cleanup to remove unnecessary local includes

[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.

Then btrfs will fail to compile.

[CAUSE]
There is a recursive include chain:

  "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
  "bio.h"

That recursive inclusion is causing problems for btrfs.

[ENHANCEMENT]
To reduce the risk of recursive inclusion:

- Remove unnecessary local includes from btrfs headers
  Either the included header is pulled in by other headers, or is
  completely unnecessary.

- Remove btrfs local includes if the header only requires a pointer
  In that case let the implementing C file pull in the required header.

  This is especially important for headers like "btrfs_inode.h" which
  pull in a lot of other btrfs headers, making them a minefield of
  recursive inclusion.

- Remove unnecessary temporary structure definitions
  Either we have already included the header defining the structure, or
  the definition is completely unused.

Now including "btrfs_inode.h" inside "bio.h" is completely fine.
Although "btrfs_inode.h" still includes "extent_map.h", that header
only includes "fs.h", so there is no more chain back to "bio.h".
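
The "header only requires a pointer" case can be sketched in a single
file. All names here are hypothetical. The first part plays the role of
a header that gets by with a forward declaration, the second part plays
the role of the implementing C file that needs the full definition.

```c
/* Header part: a forward declaration is enough for storing or passing pointers. */
struct inode_fwd;

struct bio_hdr_sketch {
	struct inode_fwd *inode;	/* pointer only, no full definition needed */
	unsigned int flags;
};

int inode_fwd_ino(const struct inode_fwd *inode);

/* Implementation part: only this code needs the complete structure. */
struct inode_fwd {
	int ino;
};

int inode_fwd_ino(const struct inode_fwd *inode)
{
	return inode->ino;
}

static int bio_ino(const struct bio_hdr_sketch *bio)
{
	return inode_fwd_ino(bio->inode);
}
```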

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Mon, 27 Oct 2025 08:28:47 +0000 (18:58 +1030)]

btrfs: replace BTRFS_MAX_BIO_SECTORS with BIO_MAX_VECS

It's impossible to have a btrfs bio with more than BIO_MAX_VECS vectors
anyway. And there is only one location utilizing that macro, just
replace it with BIO_MAX_VECS. Both have the same value.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Andy Shevchenko [Fri, 31 Oct 2025 07:55:09 +0000 (08:55 +0100)]

btrfs: replace const_ilog2() with ilog2()

const_ilog2() was a workaround of some sparse issue, which has never
appeared in the C functions. Replace it with ilog2().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Johannes Thumshirn [Wed, 22 Oct 2025 09:19:59 +0000 (11:19 +0200)]

btrfs: zoned: show statistics for zoned filesystems

Provide statistics for zoned filesystems. These statistics include the
number of active block-groups, how many of them are reclaimable or unused,
whether the filesystem needs to be reclaimed, the currently assigned
relocation and treelog block-groups if they're present, and a list of
active zones.

Example:
  active block-groups: 4
   reclaimable: 0
   unused: 2
   need reclaim: false
  data relocation block-group: 4294967296
  active zones:
   start: 1610612736, wp: 344064 used: 16384, reserved: 0, unusable: 327680
   start: 1879048192, wp: 34963456 used: 131072, reserved: 0, unusable: 34832384
   start: 4026531840, wp: 0 used: 0, reserved: 0, unusable: 0
   start: 4294967296, wp: 0 used: 0, reserved: 0, unusable: 0

Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Miquel Sabaté Solà [Fri, 24 Oct 2025 10:21:43 +0000 (12:21 +0200)]

btrfs: add ASSERTs on prealloc in qgroup functions

The prealloc variable in these functions is always initialized to
NULL. Whenever we allocate memory for it, if it fails then NULL is
preserved, otherwise we delegate the ownership of the pointer to
add_qgroup_rb() and set it right after to NULL.

Since in any case the pointer ends up being NULL at the end of its
usage, we can safely remove calls to kfree() for it, while adding an
ASSERT as an extra check.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Miquel Sabaté Solà [Fri, 24 Oct 2025 10:21:42 +0000 (12:21 +0200)]

btrfs: apply the AUTO_K(V)FREE macros throughout the code

Apply the AUTO_KFREE and AUTO_KVFREE macros wherever it makes
sense. Since this macro is expected to improve code readability, it has
been avoided in places where the lifetime of objects wasn't easy to
follow and a cleanup attribute would've made things worse; or when the
cleanup section of a function involved many other things and thus there
was no readability impact anyways. This change has also not been applied
in extremely short functions where readability was clearly not an issue.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Miquel Sabaté Solà [Fri, 24 Oct 2025 10:21:41 +0000 (12:21 +0200)]

btrfs: define the AUTO_KFREE/AUTO_KVFREE helper macros

These are two simple macros which ensure that a pointer is initialized
to NULL and with the proper cleanup attribute for it.
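
A userspace analog of such a macro, assuming GCC or Clang (it relies on
the cleanup variable attribute). AUTO_FREE_SKETCH, auto_free() and the
frees counter are illustrative names, not the kernel definitions, and
free() stands in for kfree().

```c
#include <stdlib.h>

static int frees;	/* counts cleanup invocations, for demonstration only */

/* Cleanup handlers receive a pointer to the annotated variable. */
static void auto_free(void *p)
{
	frees++;
	free(*(void **)p);	/* free(NULL) is a no-op */
}

/* Declares a pointer initialized to NULL that is freed when it goes out of scope. */
#define AUTO_FREE_SKETCH(name)	*name __attribute__((cleanup(auto_free))) = NULL

static int use_buffer(int fail_early)
{
	char AUTO_FREE_SKETCH(buf);

	if (fail_early)
		return -1;	/* buf is still NULL, cleanup is harmless */

	buf = malloc(64);
	if (!buf)
		return -1;
	buf[0] = 'x';
	return 0;		/* buf is freed automatically on return */
}
```

The NULL initialization matters: without it an early return would hand
an uninitialized pointer to the cleanup handler.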

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Miquel Sabaté Solà [Fri, 24 Oct 2025 10:21:40 +0000 (12:21 +0200)]

btrfs: declare free_ipath() via DEFINE_FREE()

The free_ipath() function was being used as a cleanup function
everywhere. Declare it via DEFINE_FREE() so we can use this function
with the __free() helper.

The name has also been adjusted so it's closer to the type's name.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Sun, 19 Oct 2025 00:45:28 +0000 (11:15 +1030)]

btrfs: scrub: cancel the run if there is a pending signal

Unlike relocation, scrub never checks pending signals, and even
relocation only explicitly checks for fatal signals (SIGKILL), not
for regular ones.

Thankfully relocation can still be interrupted by regular signals by
the usage of wait_on_bit(), which is called with TASK_INTERRUPTIBLE.

Do the same for scrub/dev-replace, so that regular signals can also
cancel the scrub/replace run, and more importantly handle v2 cgroup
freezing, which is based on signal handling code inside the kernel; the
freezing() function will not return true for v2 cgroup freezing.

This will address the problem that systemd slice freezing will time out
on long running scrub/dev-replace.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Sun, 19 Oct 2025 00:45:27 +0000 (11:15 +1030)]

btrfs: scrub: cancel the run if the process or fs is being frozen

It's a known bug that btrfs scrub/dev-replace can prevent the system
from suspending.

There are at least two factors involved:

- Holding super_block::s_writers for the whole scrub/dev-replace duration
  We hold that percpu rw semaphore through mnt_want_write_file() for the
  whole scrub/dev-replace duration.

  That will prevent the fs from being frozen, which can be initiated by
  either the user (e.g. fsfreeze) or power management suspend/hibernate.

- Stuck in the kernel space for a long time
  During suspend all user processes (and some kernel threads) will
  be frozen.
  But if a user space process has entered the kernel (scrub ioctl) and
  does not return for a long time, it will make process freezing time out.

  Unfortunately scrub/dev-replace is a long running ioctl, and it will
  prevent the btrfs process from returning to user space, thus making PM
  suspend/hibernate time out.

Address them in one go:

- Introduce a new helper should_cancel_scrub()
  Which includes the existing cancel request and new fs/process freezing
  checks.

  Here we have to check both fs and process freezing for PM
  suspend/hibernate.

  PM can be configured to freeze filesystems before processes.
  (The current default is not to freeze filesystems, but freezing the
  filesystems is planned as the new default.)

  Checking only fs freezing will fail PM without fs freezing, as the
  process freezing will time out.

  Checking only process freezing will fail PM with fs freezing since the
  fs freezing happens before process freezing.

  And the return value will indicate the reason, -ECANCELED for the
  explicitly canceled runs, and -EINTR for fs freeze or PM reasons.

- Cancel the run if should_cancel_scrub() is true
  Unfortunately canceling is the only feasible solution here; pausing is
  not possible as we would still stay in kernel space and thus still
  prevent the process from being frozen.
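
The decision logic described in the list above can be sketched as
follows. This is not the kernel helper itself: struct scrub_state and
its fields are hypothetical stand-ins for the cancel request, the fs
freeze state and the process freeze state that the real code would
query.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical snapshot of the state the helper would look at. */
struct scrub_state {
	bool cancel_requested;	/* explicit cancel from the user */
	bool fs_freezing;	/* the filesystem is being frozen */
	bool task_freezing;	/* the calling process is being frozen */
};

/* Returns 0 to keep going, or a negative errno naming the reason to stop. */
static int should_cancel_scrub_sketch(const struct scrub_state *s)
{
	if (s->cancel_requested)
		return -ECANCELED;
	/* Both checks are needed: PM may or may not freeze filesystems first. */
	if (s->fs_freezing || s->task_freezing)
		return -EINTR;
	return 0;
}
```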

This will cause a user impacting behavior change:

  Dev-replace can be interrupted by PM, and there is no way to resume
  other than starting from the beginning again.

This means dev-replace may fail on newer kernels, and end users will
need extra steps like using systemd-inhibit to prevent
suspend/hibernate, to get back the old uninterrupted behavior.

This behavior change will need extra documentation updates and
communication with projects involving scrub/dev-replace including
btrfs-progs.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Link: https://lore.kernel.org/linux-btrfs/d93b2a2d-6ad9-4c49-809f-11d769a6f30a@app.fastmail.com/
Reported-by: Chris Murphy <lists@colorremedies.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Qu Wenruo [Sun, 19 Oct 2025 00:45:26 +0000 (11:15 +1030)]

btrfs: scrub: add cancel/pause/removed bg checks for raid56 parity stripes

For raid56, data and parity stripes are handled differently.

Data stripes are handled just like regular RAID1/RAID10 stripes,
going through the regular scrub_simple_mirror().

But for parity stripes we have to read out all involved data stripes and
do any needed verification and repair, then scrub the parity stripe.

This process will take a much longer time than a regular stripe, but
unlike scrub_simple_mirror(), we do not check if we should cancel/pause
or the block group is already removed.

Align the behavior of scrub_raid56_parity_stripe() to
scrub_simple_mirror(), by adding:

- Cancel check
- Pause check
- Removed block group check

Since those checks are the same as in scrub_simple_mirror(), also
update the comments of scrub_simple_mirror():

- Remove overly obvious comments
  We do not need extra comments on what we're checking, it's really too
  obvious.

- Remove a stale comment about pausing
  Now the scrub always queues all involved stripes and submits them
  in one go, so there is no more submission part during pausing.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Wed, 22 Oct 2025 18:15:00 +0000 (19:15 +0100)]

btrfs: annotate as unlikely fs aborted checks in space flushing code

It's not expected to have the fs in an aborted state, so surround the
abortion checks with unlikely to make it clear it's unexpected and to
hint the compiler to generate better code.
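
A small userspace sketch of the annotation, assuming GCC or Clang for
__builtin_expect(). The structure, the function and the -EROFS return
value are hypothetical and only illustrate how an aborted-fs check is
marked as the cold path.

```c
#include <errno.h>

/* Same shape as the kernel's unlikely(): a branch prediction hint only. */
#define unlikely_sketch(x)	__builtin_expect(!!(x), 0)

/* Hypothetical stand-in for the filesystem state. */
struct fs_stub {
	int aborted;
};

static int flush_space_sketch(const struct fs_stub *fs, int *flushed)
{
	/* Not expected to happen, so tell the compiler it is the cold path. */
	if (unlikely_sketch(fs->aborted))
		return -EROFS;

	(*flushed)++;
	return 0;
}
```

The hint does not change what the code computes, only how the compiler
lays out the two branches.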

Also at maybe_fail_all_tickets() untangle all repeated checks for the
abortion into a single if-then-else. This makes things more readable
and makes the compiler generate less code. On x86_64 with gcc 14.2.0-19
from Debian I got the following object size differences.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2021606 179704   25088 2226398 21f8de fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  2021458 179704   25088 2226250 21f84a fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Tue, 21 Oct 2025 15:35:19 +0000 (16:35 +0100)]

btrfs: avoid space_info locking when checking if tickets are served

When checking if a ticket was served, we take the space_info's spinlock.
If the ticket was served (its ->bytes is 0) or had an error (its ->error
is not 0) then we just unlock the space_info and return.

This however causes contention on the space_info's spinlock, which is
heavily used (space reservation, space flushing, allocating and
deallocating an extent from a block group (btrfs_update_block_group()),
etc).

Instead of using the space_info's spinlock to check if a ticket was
served, use a per ticket spinlock which isn't used by anyone other than
the task that created the ticket (stack allocated) and the task that
serves the ticket (a reclaim task or any task deallocating space that
ends up at btrfs_try_granting_tickets()).

After applying this patch and all previous patches from the same patchset
(many attempt to reduce space_info critical sections), lockstat showed
some improvements for a fs_mark test regarding the space_info's spinlock
'lock'. The lockstat results:

Before patchset:

  con-bounces:     13733858
  contentions:     15902322
  waittime-total:  264902529.72
  acq-bounces:     28161791
  acquisitions:    38679282

After patchset:

  con-bounces:     12032220
  contentions:     13598034
  waittime-total:  221806127.28
  acq-bounces:     24717947
  acquisitions:    34103281

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 21:59:03 +0000 (22:59 +0100)]

btrfs: move ticket wakeup and finalization to remove_ticket()

Instead of repeating the wakeup and setup of the ->bytes or ->error field,
move those steps to remove_ticket() to avoid duplication. This is also
needed for the next patch in the series, so that we avoid duplicating more
logic.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Thu, 23 Oct 2025 12:24:22 +0000 (13:24 +0100)]

btrfs: add data_race() in btrfs_account_ro_block_groups_free_space()

Surround the intentional empty list check with the data_race() annotation
so that tools like KCSAN don't report a data race. The race is intentional
as it's harmless and we want to avoid lock contention of the space_info
since its lock is heavily used (space reservation, space flushing, extent
allocation and deallocation, etc).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 15:08:50 +0000 (16:08 +0100)]

btrfs: remove pointless label and goto from unpin_extent_range()

There's no need to have an 'out' label and jump there in case we can
not find a block group. We can simply return directly since there are no
resources to release, removing the need for the label and the 'ret'
variable.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 14:53:01 +0000 (15:53 +0100)]

btrfs: reduce block group critical section in unpin_extent_range()

There's no need to update the bytes_pinned, bytes_readonly and
max_extent_size fields of the space_info while inside the critical section
delimited by the block group's lock. So move that out of the block group's
critical section, but still inside the space_info's critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 12:52:11 +0000 (13:52 +0100)]

btrfs: change 'reserved' argument from pin_down_extent() to bool

It's used as a boolean, so convert it from int type to bool type.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 12:48:33 +0000 (13:48 +0100)]

btrfs: remove 'reserved' argument from btrfs_pin_extent()

All callers pass a value of 1 (true) to it, so remove it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 12:40:56 +0000 (13:40 +0100)]

btrfs: use local variable for space_info in pin_down_extent()

Instead of dereferencing the block group multiple times to access its
space_info, use a local variable to shorten the lines and make the code
easier to read. While at it, also rename the block group argument from
'cache' to 'bg', as the cache name is confusing and dates from the old
days when the block group structure was named 'btrfs_block_group_cache'.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 12:37:32 +0000 (13:37 +0100)]

btrfs: reduce block group critical section in pin_down_extent()

There's no need to update the bytes_reserved and bytes_may_use fields of
the space_info while holding the block group's spinlock. We are only
making the critical section longer than necessary. So move the space_info
updates outside of the block group's critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 12:17:23 +0000 (13:17 +0100)]

btrfs: reduce block group critical section in do_trimming()

There's no need to update the bytes_reserved and bytes_readonly fields of
the space_info while holding the block group's spinlock. We are only
making the critical section longer than necessary. So move the space_info
updates outside of the block group's critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 11:57:34 +0000 (12:57 +0100)]

btrfs: reduce block group critical section in btrfs_add_reserved_bytes()

We are doing some things inside the block group's critical section that
are relevant only to the space_info: updating the space_info counters
bytes_reserved and bytes_may_use as well as trying to grant tickets
(calling btrfs_try_granting_tickets()), and the latter can take some
time. So move all those updates to outside the block group's critical
section and still inside the space_info's critical section. Like this
we keep the block group's critical section only for block group updates
and can help reduce contention on a block group's lock.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 11:47:26 +0000 (12:47 +0100)]

btrfs: reduce block group critical section in btrfs_free_reserved_bytes()

There's no need to update the space_info fields (bytes_reserved,
max_extent_size, bytes_readonly, bytes_zone_unusable) while holding the
block group's spinlock. So move those updates to happen after we unlock
the block group (and while holding the space_info locked of course), so
that all we do under the block group's critical section is to update the
block group itself.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Mon, 20 Oct 2025 11:39:52 +0000 (12:39 +0100)]

btrfs: reduce space_info critical section in btrfs_chunk_alloc()

There's no need to update local variables while holding the space_info's
spinlock, since the update isn't using anything from the space_info. So
move these updates outside the critical section to shorten it.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Fri, 17 Oct 2025 16:58:23 +0000 (17:58 +0100)]

btrfs: remove double underscore prefix from __reserve_bytes()

The use of a double underscore prefix is discouraged and we have no
justification at all for it since there's no reserve_bytes() counterpart.
So remove the prefix.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Fri, 17 Oct 2025 16:34:36 +0000 (17:34 +0100)]

btrfs: process ticket outside global reserve critical section

In steal_from_global_rsv() there's no need to process the ticket inside
the critical section of the global reserve. Move the ticket processing to
happen after the critical section. This helps reduce contention on the
global reserve's spinlock.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

Filipe Manana [Fri, 17 Oct 2025 16:30:38 +0000 (17:30 +0100)]

btrfs: assign booleans to global reserve's full field

We have a couple of places that are assigning 0 and 1 to the full field of
the global reserve. This is harmless since 0 is converted to false and
1 converted to true, but for better readability, replace these with true
and false since the field is of type bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


Filipe Manana [Fri, 17 Oct 2025 16:26:58 +0000 (17:26 +0100)]

btrfs: assert space_info is locked in steal_from_global_rsv()

The caller is supposed to have locked the space_info, so assert that.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>


Filipe Manana [Fri, 17 Oct 2025 16:14:11 +0000 (17:14 +0100)]

btrfs: avoid unnecessary reclaim calculation in priority_reclaim_metadata_space()

If the given ticket was already served (its ->bytes is 0), then we wasted
time calculating the metadata reclaim size. So calculate it only after we
checked the ticket was not yet served.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
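
The reordering described in the commit above amounts to checking the cheap
condition before doing the calculation. A minimal sketch, assuming
hypothetical names (struct demo_prio_ticket, demo_calc_reclaim(),
demo_priority_reclaim()) and a made-up calculation that merely counts how
often it runs; it is not the kernel function:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ticket: bytes == 0 means it was already served. */
struct demo_prio_ticket {
	uint64_t bytes;
};

/* Counts invocations so the skipped work is observable. */
static unsigned int demo_calc_calls;

/* Stand-in for the reclaim size calculation; the formula is made up. */
static uint64_t demo_calc_reclaim(uint64_t used)
{
	demo_calc_calls++;
	return used / 2;
}

static uint64_t demo_priority_reclaim(const struct demo_prio_ticket *ticket,
				      uint64_t used)
{
	/* Check the ticket first: nothing to reclaim if it was served. */
	if (ticket->bytes == 0)
		return 0;

	/* Calculate only once we know the result will be used. */
	return demo_calc_reclaim(used);
}
```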


Filipe Manana [Fri, 17 Oct 2025 16:07:22 +0000 (17:07 +0100)]

btrfs: shorten critical section in btrfs_preempt_reclaim_metadata_space()

We are doing a lot of small calculations and assignments while holding the
space_info's spinlock, which is a heavily used lock for space reservation
and flushing. There's no point in holding the lock for so long when all we
want is to call need_preemptive_reclaim() and get a consistent value for a
couple of counters from the space_info. Instead, grab the counters into
local variables, release the lock and then use the local variables.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
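
The approach described in the commit above (copy the counters under the
lock, then compute from the copies) can be sketched as follows. Everything
here is an assumption made for illustration: struct demo_space_info, its
two counters, the demo_spin_lock() and demo_spin_unlock() helpers built on
a C11 atomic_flag, and the subtraction that stands in for the real
calculations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the space_info; not the kernel structure. */
struct demo_space_info {
	atomic_flag lock;	/* plays the role of the spinlock */
	uint64_t bytes_may_use;
	uint64_t bytes_pinned;
};

static void demo_spin_lock(atomic_flag *lock)
{
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void demo_spin_unlock(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

static uint64_t demo_to_reclaim(struct demo_space_info *si)
{
	uint64_t may_use;
	uint64_t pinned;

	/* Critical section: only take a consistent copy of the counters. */
	demo_spin_lock(&si->lock);
	may_use = si->bytes_may_use;
	pinned = si->bytes_pinned;
	demo_spin_unlock(&si->lock);

	/* The arithmetic works on the local copies, outside the lock. */
	return may_use > pinned ? may_use - pinned : 0;
}
```

Both counters are read inside the same critical section, so the pair is
consistent even though the arithmetic runs after the unlock.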


Filipe Manana [Fri, 17 Oct 2025 15:54:12 +0000 (16:54 +0100)]

btrfs: increment loop count outside critical section during metadata reclaim

In btrfs_preempt_reclaim_metadata_space() there's no need to increment the
local variable that tracks the number of iterations of the while loop
while inside the critical section delimited by the space_info's spinlock.
That spinlock is heavily used by space reservation and flushing code, so
it's desirable to have its critical sections as short as possible.
So move the loop count increment outside the critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

A mirror of Linus' kernel repository