git.ipfire.org Git - thirdparty/linux.git/log

btrfs: mark dirty extent range for out of bound prealloc extents

In btrfs_fallocate(), when the allocated range overlaps with a prealloc
extent and the extent starts after i_size, the range doesn't get marked
dirty in file_extent_tree. This results in persisting an incorrect
disk_i_size for the inode when not using the no-holes feature.

This is reproducible since commit 41a2ee75aab0 ("btrfs: introduce
per-inode file extent tree"), then became hidden since commit 3d7db6e8bd22
("btrfs: don't allocate file extent tree for non regular files") and then
visible again after commit 8679d2687c35 ("btrfs: initialize
inode::file_extent_tree after i_mode has been set"), which fixes the
previous commit.

The following reproducer triggers the problem:

$ cat test.sh

MNT=/mnt/test
DEV=/dev/vdb

mkdir -p $MNT

mkfs.btrfs -f -O ^no-holes $DEV
mount $DEV $MNT

touch $MNT/file1
fallocate -n -o 1M -l 2M $MNT/file1

umount $MNT
mount $DEV $MNT

len=$((1 * 1024 * 1024))

fallocate -o 1M -l $len $MNT/file1

du --bytes $MNT/file1

umount $MNT
mount $DEV $MNT

du --bytes $MNT/file1

umount $MNT

Running the reproducer gives the following result:

$ ./test.sh
(...)
2097152 /mnt/test/file1
1048576 /mnt/test/file1

The difference is exactly 1048576 as we assigned.

Fix by adding a call to btrfs_inode_set_file_extent_range() in
btrfs_fallocate_update_isize().

Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree")
Signed-off-by: austinchang <austinchang@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: set inode flag BTRFS_INODE_COPY_EVERYTHING when logging new name

If we are logging a new name make sure our inode has the runtime flag
BTRFS_INODE_COPY_EVERYTHING set so that at btrfs_log_inode() we will find
new inode refs/extrefs in the subvolume tree and copy them into the log
tree.

We are currently doing it when adding a new link but we are missing it
when renaming.

An example where this makes a new name not persisted:

  1) create symlink with name foo in directory A
  2) fsync directory A, which persists the symlink
  3) rename the symlink from foo to bar
  4) fsync directory A to persist the new symlink name

Step 4 isn't working correctly as it's not logging the new name and also
leaving the old inode ref in the log tree, so after a power failure the
symlink still has the old name of "foo". This is because when we first
fsync directoy A we log the symlink's inode (as it's a new entry) and at
btrfs_log_inode() we set the log mode to LOG_INODE_ALL and then because
we are using that mode and the inode has the runtime flag
BTRFS_INODE_NEEDS_FULL_SYNC set, we clear that flag as well as the flag
BTRFS_INODE_COPY_EVERYTHING. That means the next time we log the inode,
during the rename through the call to btrfs_log_new_name() (calling
btrfs_log_inode_parent() and then btrfs_log_inode()), we will not search
the subvolume tree for new refs/extrefs and jump directory to the
'log_extents' label.

Fix this by making sure we set BTRFS_INODE_COPY_EVERYTHING on an inode
when we are about to log a new name. A test case for fstests will follow
soon.

Reported-by: Vyacheslav Kovalevsky <slava.kovalevskiy.2014@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/ac949c74-90c2-4b9a-b7fd-1ffc5c3175c7@gmail.com/
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix memory leak of qgroup_list in btrfs_add_qgroup_relation

When btrfs_add_qgroup_relation() is called with invalid qgroup levels
(src >= dst), the function returns -EINVAL directly without freeing the
preallocated qgroup_list structure passed by the caller. This causes a
memory leak because the caller unconditionally sets the pointer to NULL
after the call, preventing any cleanup.

The issue occurs because the level validation check happens before the
mutex is acquired and before any error handling path that would free
the prealloc pointer. On this early return, the cleanup code at the
'out' label (which includes kfree(prealloc)) is never reached.

In btrfs_ioctl_qgroup_assign(), the code pattern is:

    prealloc = kzalloc(sizeof(*prealloc), GFP_KERNEL);
    ret = btrfs_add_qgroup_relation(trans, sa->src, sa->dst, prealloc);
    prealloc = NULL;  // Always set to NULL regardless of return value
    ...
    kfree(prealloc);  // This becomes kfree(NULL), does nothing

When the level check fails, 'prealloc' is never freed by either the
callee or the caller, resulting in a 64-byte memory leak per failed
operation. This can be triggered repeatedly by an unprivileged user
with access to a writable btrfs mount, potentially exhausting kernel
memory.

Fix this by freeing prealloc before the early return, ensuring prealloc
is always freed on all error paths.

Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Shardul Bankar <shardulsb08@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: ensure no dirty metadata is written back for an fs with errors

[BUG]
During development of a minor feature (make sure all btrfs_bio::end_io()
is called in task context), I noticed a crash in generic/388, where
metadata writes triggered new works after btrfs_stop_all_workers().

It turns out that it can even happen without any code modification, just
using RAID5 for metadata and the same workload from generic/388 is going
to trigger the use-after-free.

[CAUSE]
If btrfs hits an error, the fs is marked as error, no new
transaction is allowed thus metadata is in a frozen state.

But there are some metadata modifications before that error, and they are
still in the btree inode page cache.

Since there will be no real transaction commit, all those dirty folios
are just kept as is in the page cache, and they can not be invalidated
by invalidate_inode_pages2() call inside close_ctree(), because they are
dirty.

And finally after btrfs_stop_all_workers(), we call iput() on btree
inode, which triggers writeback of those dirty metadata.

And if the fs is using RAID56 metadata, this will trigger RMW and queue
new works into rmw_workers, which is already stopped, causing warning
from queue_work() and use-after-free.

[FIX]
Add a special handling for write_one_eb(), that if the fs is already in
an error state, immediately mark the bbio as failure, instead of really
submitting them.

Then during close_ctree(), iput() will just discard all those dirty
tree blocks without really writing them back, thus no more new jobs for
already stopped-and-freed workqueues.

The extra discard in write_one_eb() also acts as an extra safenet.
E.g. the transaction abort is triggered by some extent/free space
tree corruptions, and since extent/free space tree is already corrupted
some tree blocks may be allocated where they shouldn't be (overwriting
existing tree blocks). In that case writing them back will further
corrupting the fs.

CC: stable@vger.kernel.org # 6.6+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: ref-verify: fix IS_ERR() vs NULL check in btrfs_build_ref_tree()

btrfs_extent_root()/btrfs_global_root() does not return error pointers,
it returns NULL on error.

Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Link: https://lore.kernel.org/all/aNJfvxj0anEnk9Dm@stanley.mountain/
Fixes : ed4e6b5d644c ("btrfs: ref-verify: handle damaged extent root tree")
CC: stable@vger.kernel.org # 6.17+
Signed-off-by: Amit Dhingra <mechanicalamit@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix delayed_node ref_tracker use after free

Move the print before releasing the delayed node.

In my initial testing there was a bug that was causing delayed_nodes
to not get freed which is why I put the print after the release. This
obviously neglects the case where the delayed node is properly freed.

Add condition to make sure we only print if we have more than one
reference to the delayed_node to prevent printing when we only have
the reference taken in btrfs_kill_all_delayed_nodes().

Fixes: b767a28d6154 ("btrfs: print leaked references in kill_all_delayed_nodes()")
Tested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leo Martins <loemra.dev@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: fix duplicated rmdir operations when using extrefs

Commit 29d6d30f5c8a ("Btrfs: send, don't send rmdir for same target
multiple times") has fixed an issue that a send stream contained a rmdir
operation for the same directory multiple times. After that fix we keep
track of the last directory for which we sent a rmdir operation and
compare with it before sending a rmdir for the parent inode of a deleted
hardlink we are processing. But there is still a corner case that in
between rmdir dir operations for the same inode we find deleted hardlinks
for other parent inodes, so tracking just the last inode for which we sent
a rmdir operation is not enough.

Hardlinks of a file in the same directory are stored in the same INODE_REF
item, but if the number of hardlinks is too large and can not fit in a
leaf, we use INODE_EXTREF items to store them. The key of an INODE_EXTREF
item is (inode_id, INODE_EXTREF, hash[name, parent ino]), so between two
hardlinks for the same parent directory, we can find others for other
parent directories. For example for the reproducer below we get the
following (from a btrfs inspect-internal dump-tree output):

    item 0 key (259 INODE_EXTREF 2309449) itemoff 16257 itemsize 26
            index 6925 parent 257 namelen 8 name: foo.6923
    item 1 key (259 INODE_EXTREF 2311350) itemoff 16231 itemsize 26
            index 6588 parent 258 namelen 8 name: foo.6587
    item 2 key (259 INODE_EXTREF 2457395) itemoff 16205 itemsize 26
            index 6611 parent 257 namelen 8 name: foo.6609
    (...)

So tracking the last directory's inode number does not work in this case
since we process a link for parent inode 257, then for 258 and then back
again for 257, and that second time we process a deleted link for 257 we
think we have not yet sent a rmdir operation.

Fix this by using a rbtree to keep track of all the directories for which
we have already sent rmdir operations, and add those directories to the
'check_dirs' ref list in process_recorded_refs() only if the directory is
not yet in the rbtree, otherwise skip it since it means we have already
sent a rmdir operation for that directory.

The following test script reproduces the problem:

  $ cat test.sh
  #!/bin/bash

  DEV=/dev/sdi
  MNT=/mnt/sdi

  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  mkdir $MNT/a $MNT/b

  echo 123 > $MNT/a/foo
  for ((i = 1; i <= 1000; i++)); do
     ln $MNT/a/foo $MNT/a/foo.$i
     ln $MNT/a/foo $MNT/b/foo.$i
  done

  btrfs subvolume snapshot -r $MNT $MNT/snap1
  btrfs send $MNT/snap1 -f /tmp/base.send

  rm -r $MNT/a $MNT/b

  btrfs subvolume snapshot -r $MNT $MNT/snap2
  btrfs send -p $MNT/snap1 $MNT/snap2 -f /tmp/incremental.send

  umount $MNT
  mkfs.btrfs -f $DEV
  mount $DEV $MNT

  btrfs receive $MNT -f /tmp/base.send
  btrfs receive $MNT -f /tmp/incremental.send

  rm -f /tmp/base.send /tmp/incremental.send

  umount $MNT

When running it, it fails like this:

  $ ./test.sh
  (...)
  At subvol snap1
  At snapshot snap2
  ERROR: rmdir o257-9-0 failed: No such file or directory

CC: <stable@vger.kernel.org>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Ting-Chang Hou <tchou@synology.com>
[ Updated changelog ]
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: directly free partially initialized fs_info in btrfs_check_leaked_roots()

If fs_info->super_copy or fs_info->super_for_commit allocated failed in
btrfs_get_tree_subvol(), then no need to call btrfs_free_fs_info().
Otherwise btrfs_check_leaked_roots() would access NULL pointer because
fs_info->allocated_roots had not been initialised.

syzkaller reported the following information:
  ------------[ cut here ]------------
  BUG: unable to handle page fault for address: fffffffffffffbb0
  #PF: supervisor read access in kernel mode
  #PF: error_code(0x0000) - not-present page
  PGD 64c9067 P4D 64c9067 PUD 64cb067 PMD 0
  Oops: Oops: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 1402 Comm: syz.1.35 Not tainted 6.15.8 #4 PREEMPT(lazy)
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), (...)
  RIP: 0010:arch_atomic_read arch/x86/include/asm/atomic.h:23 [inline]
  RIP: 0010:raw_atomic_read include/linux/atomic/atomic-arch-fallback.h:457 [inline]
  RIP: 0010:atomic_read include/linux/atomic/atomic-instrumented.h:33 [inline]
  RIP: 0010:refcount_read include/linux/refcount.h:170 [inline]
  RIP: 0010:btrfs_check_leaked_roots+0x18f/0x2c0 fs/btrfs/disk-io.c:1230
  [...]
  Call Trace:
   <TASK>
   btrfs_free_fs_info+0x310/0x410 fs/btrfs/disk-io.c:1280
   btrfs_get_tree_subvol+0x592/0x6b0 fs/btrfs/super.c:2029
   btrfs_get_tree+0x63/0x80 fs/btrfs/super.c:2097
   vfs_get_tree+0x98/0x320 fs/super.c:1759
   do_new_mount+0x357/0x660 fs/namespace.c:3899
   path_mount+0x716/0x19c0 fs/namespace.c:4226
   do_mount fs/namespace.c:4239 [inline]
   __do_sys_mount fs/namespace.c:4450 [inline]
   __se_sys_mount fs/namespace.c:4427 [inline]
   __x64_sys_mount+0x28c/0x310 fs/namespace.c:4427
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0x92/0x180 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x76/0x7e
  RIP: 0033:0x7f032eaffa8d
  [...]

Fixes: 3bb17a25bcb0 ("btrfs: add get_tree callback for new mount API")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Daniel Vacek <neelx@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dewei Meng <mengdewei@cqsoftware.com.cn>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: fix -Wflex-array-member-not-at-end warning in struct send_ctx

The warning -Wflex-array-member-not-at-end was introduced in GCC-14, and
we are getting ready to enable it, globally.

Fix the following warning:

fs/btrfs/send.c:181:24: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end]

and move the declaration of send_ctx::cur_inode_path to the end.

Notice that struct fs_path contains a flexible array member inline_buf,
but also a padding array and a limit calculated for the usable space of
inline_buf (FS_PATH_INLINE_SIZE). It is not the pattern where flexible
array is in the middle of a structure and could potentially overwrite
other members.

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: fix bounds check in check_inode_extref()

The parentheses for the unlikely() annotation were put in the wrong
place so it means that the condition is basically never true and the
bounds checking is skipped.

Fixes: aab9458b9f00 ("btrfs: tree-checker: add inode extref checks")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix memory leaks when rejecting a non SINGLE data profile without an RST

At the end of btrfs_load_block_group_zone_info() the first thing we do
is to ensure that if the mapping type is not a SINGLE one and there is
no RAID stripe tree, then we return early with an error.

Doing that, though, prevents the code from running the last calls from
this function which are about freeing memory allocated during its
run. Hence, in this case, instead of returning early, we set the ret
value and fall through the rest of the cleanup code.

Fixes: 5906333cc4af ("btrfs: zoned: don't skip block group profile checks on conventional zones")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix incorrect readahead expansion length

The intent of btrfs_readahead_expand() was to expand to the length of
the current compressed extent being read. However, "ram_bytes" is *not*
that, in the case where a single physical compressed extent is used for
multiple file extents.

Consider this case with a large compressed extent C and then later two
non-compressed extents N1 and N2 written over C, leaving C1 and C2
pointing to offset/len pairs of C:

[ C ]
[ N1 ][ C1 ][ N2 ][ C2 ]

In such a case, ram_bytes for both C1 and C2 is the full uncompressed
length of C. So starting readahead in C1 will expand the readahead past
the end of C1, past N2, and into C2. This will then expand readahead
again, to C2_start + ram_bytes, way past EOF. First of all, this is
totally undesirable, we don't want to read the whole file in arbitrary
chunks of the large underlying extent if it happens to exist. Secondly,
it results in zeroing the range past the end of C2 up to ram_bytes. This
is particularly unpleasant with fs-verity as it can zero and set
uptodate pages in the verity virtual space past EOF. This incorrect
readahead behavior can lead to verity verification errors, if we iterate
in a way that happens to do the wrong readahead.

Fix this by using em->len for readahead expansion, not em->ram_bytes,
resulting in the expected behavior of stopping readahead at the extent
boundary.

Reported-by: Max Chernoff <git@maxchernoff.ca>
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2399898
Fixes: 9e9ff875e417 ("btrfs: use readahead_expand() on compressed extents")
CC: stable@vger.kernel.org # 6.17
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: do not assert we found block group item when creating free space tree

Currently, when building a free space tree at populate_free_space_tree(),
if we are not using the block group tree feature, we always expect to find
block group items (either extent items or a block group item with key type
BTRFS_BLOCK_GROUP_ITEM_KEY) when we search the extent tree with
btrfs_search_slot_for_read(), so we assert that we found an item. However
this expectation is wrong since we can have a new block group created in
the current transaction which is still empty and for which we still have
not added the block group's item to the extent tree, in which case we do
not have any items in the extent tree associated to the block group.

The insertion of a new block group's block group item in the extent tree
happens at btrfs_create_pending_block_groups() when it calls the helper
insert_block_group_item(). This typically is done when a transaction
handle is released, committed or when running delayed refs (either as
part of a transaction commit or when serving tickets for space reservation
if we are low on free space).

So remove the assertion at populate_free_space_tree() even when the block
group tree feature is not enabled and update the comment to mention this
case.

Syzbot reported this with the following stack trace:

  BTRFS info (device loop3 state M): rebuilding free space tree
  assertion failed: ret == 0 :: 0, in fs/btrfs/free-space-tree.c:1115
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/free-space-tree.c:1115!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 1 UID: 0 PID: 6352 Comm: syz.3.25 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:populate_free_space_tree+0x700/0x710 fs/btrfs/free-space-tree.c:1115
  Code: ff ff e8 d3 (...)
  RSP: 0018:ffffc9000430f780 EFLAGS: 00010246
  RAX: 0000000000000043 RBX: ffff88805b709630 RCX: fea61d0e2e79d000
  RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
  RBP: ffffc9000430f8b0 R08: ffffc9000430f4a7 R09: 1ffff92000861e94
  R10: dffffc0000000000 R11: fffff52000861e95 R12: 0000000000000001
  R13: 1ffff92000861f00 R14: dffffc0000000000 R15: 0000000000000000
  FS:  00007f424d9fe6c0(0000) GS:ffff888125afc000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 00007fd78ad212c0 CR3: 0000000076d68000 CR4: 00000000003526f0
  Call Trace:
   <TASK>
   btrfs_rebuild_free_space_tree+0x1ba/0x6d0 fs/btrfs/free-space-tree.c:1364
   btrfs_start_pre_rw_mount+0x128f/0x1bf0 fs/btrfs/disk-io.c:3062
   btrfs_remount_rw fs/btrfs/super.c:1334 [inline]
   btrfs_reconfigure+0xaed/0x2160 fs/btrfs/super.c:1559
   reconfigure_super+0x227/0x890 fs/super.c:1076
   do_remount fs/namespace.c:3279 [inline]
   path_mount+0xd1a/0xfe0 fs/namespace.c:4027
   do_mount fs/namespace.c:4048 [inline]
   __do_sys_mount fs/namespace.c:4236 [inline]
   __se_sys_mount+0x313/0x410 fs/namespace.c:4213
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f424e39066a
  Code: d8 64 89 02 (...)
  RSP: 002b:00007f424d9fde68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
  RAX: ffffffffffffffda RBX: 00007f424d9fdef0 RCX: 00007f424e39066a
  RDX: 0000200000000180 RSI: 0000200000000380 RDI: 0000000000000000
  RBP: 0000200000000180 R08: 00007f424d9fdef0 R09: 0000000000000020
  R10: 0000000000000020 R11: 0000000000000246 R12: 0000200000000380
  R13: 00007f424d9fdeb0 R14: 0000000000000000 R15: 00002000000002c0
   </TASK>
  Modules linked in:
  ---[ end trace 0000000000000000 ]---

Reported-by: syzbot+884dc4621377ba579a6f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/68dc3dab.a00a0220.102ee.004e.GAE@google.com/
Fixes: a5ed91828518 ("Btrfs: implement the free space B-tree")
CC: <stable@vger.kernel.org> # 6.1.x: 1961d20f6fa8: btrfs: fix assertion when building free space tree
CC: <stable@vger.kernel.org> # 6.1.x
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: do not use folio_test_partial_kmap() in ASSERT()s

[BUG]
Syzbot reported an ASSERT() triggered inside scrub:

  BTRFS info (device loop0): scrub: started on devid 1
  assertion failed: !folio_test_partial_kmap(folio) :: 0, in fs/btrfs/scrub.c:697
  ------------[ cut here ]------------
  kernel BUG at fs/btrfs/scrub.c:697!
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 0 UID: 0 PID: 6077 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 08/18/2025
  RIP: 0010:scrub_stripe_get_kaddr+0x1bb/0x1c0 fs/btrfs/scrub.c:697
  Call Trace:
   <TASK>
   scrub_bio_add_sector fs/btrfs/scrub.c:932 [inline]
   scrub_submit_initial_read+0xf21/0x1120 fs/btrfs/scrub.c:1897
   submit_initial_group_read+0x423/0x5b0 fs/btrfs/scrub.c:1952
   flush_scrub_stripes+0x18f/0x1150 fs/btrfs/scrub.c:1973
   scrub_stripe+0xbea/0x2a30 fs/btrfs/scrub.c:2516
   scrub_chunk+0x2a3/0x430 fs/btrfs/scrub.c:2575
   scrub_enumerate_chunks+0xa70/0x1350 fs/btrfs/scrub.c:2839
   btrfs_scrub_dev+0x6e7/0x10e0 fs/btrfs/scrub.c:3153
   btrfs_ioctl_scrub+0x249/0x4b0 fs/btrfs/ioctl.c:3163
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:597 [inline]
   __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:583
   do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
   do_syscall_64+0xfa/0xfa0 arch/x86/entry/syscall_64.c:94
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>
  ---[ end trace 0000000000000000 ]---

Which doesn't make much sense, as all the folios we allocated for scrub
should not be highmem.

[CAUSE]
Thankfully syzbot has a detailed kernel config file, showing that
CONFIG_DEBUG_KMAP_LOCAL_FORCE_MAP is set to y.

And that debug option will force all folio_test_partial_kmap() to return
true, to improve coverage on highmem tests.

But in our case we really just want to make sure the folios we allocated
are not highmem (and they are indeed not). Such incorrect result from
folio_test_partial_kmap() is just screwing up everything.

[FIX]
Replace folio_test_partial_kmap() to folio_test_highmem() so that we
won't bother those highmem specific debuging options.

Fixes: 5fbaae4b8567 ("btrfs: prepare scrub to support bs > ps cases")
Reported-by: syzbot+bde59221318c592e6346@syzkaller.appspotmail.com
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: only set the device specific options after devices are opened

[BUG]
With v6.17-rc kernels, btrfs will always set 'ssd' mount option even if
the block device is not a rotating one:

  # cat /sys/block/sdd/queue/rotational
  1
  # cat /etc/fstab:
  LABEL=DATA2     /data2  btrfs rw,relatime,space_cache=v2,subvolid=5,subvol=/,nofail,nosuid,nodev      0 0

  # mount
  [...]
  /dev/sdd on /data2 type btrfs (rw,nosuid,nodev,relatime,ssd,space_cache=v2,subvolid=5,subvol=/)

[CAUSE]
The 'ssd' mount option is set by set_device_specific_options(), and it
expects that if there is any rotating device in the btrfs, it will set
fs_devices::rotating.

However after commit bddf57a70781 ("btrfs: delay btrfs_open_devices()
until super block is created"), the device opening is delayed until the
super block is created.

But the timing of set_device_specific_options() is still left as is,
this makes the function be called without any device opened.

Since no device is opened, thus fs_devices::rotating will never be set,
making btrfs incorrectly set 'ssd' mount option.

[FIX]
Only call set_device_specific_options() after btrfs_open_devices().

Also only call set_device_specific_options() after a new mount, if we're
mounting a mounted btrfs, there is no need to set the device specific
mount options again.

Reported-by: HAN Yuwei <hrx@bupt.moe>
Link: https://lore.kernel.org/linux-btrfs/C8FF75669DFFC3C5+5f93bf8a-80a0-48a6-81bf-4ec890abc99a@bupt.moe/
Fixes: bddf57a70781 ("btrfs: delay btrfs_open_devices() until super block is created")
CC: stable@vger.kernel.org # 6.17
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix memory leak on duplicated memory in the qgroup assign ioctl

On 'btrfs_ioctl_qgroup_assign' we first duplicate the argument as
provided by the user, which is kfree'd in the end. But this was not the
case when allocating memory for 'prealloc'. In this case, if it somehow
failed, then the previous code would go directly into calling
'mnt_drop_write_file', without freeing the string duplicated from the
user space.

Fixes: 4addc1ffd67a ("btrfs: qgroup: preallocate memory before adding a relation")
CC: stable@vger.kernel.org # 6.12+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix clearing of BTRFS_FS_RELOC_RUNNING if relocation already running

When starting relocation, at reloc_chunk_start(), if we happen to find
the flag BTRFS_FS_RELOC_RUNNING is already set we return an error
(-EINPROGRESS) to the callers, however the callers call reloc_chunk_end()
which will clear the flag BTRFS_FS_RELOC_RUNNING, which is wrong since
relocation was started by another task and still running.

Finding the BTRFS_FS_RELOC_RUNNING flag already set is an unexpected
scenario, but still our current behaviour is not correct.

Fix this by never calling reloc_chunk_end() if reloc_chunk_start() has
returned an error, which is what logically makes sense, since the general
widespread pattern is to have end functions called only if the counterpart
start functions succeeded. This requires changing reloc_chunk_start() to
clear BTRFS_FS_RELOC_RUNNING if there's a pending cancel request.

Fixes: 907d2710d727 ("btrfs: add cancellable chunk relocation support")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix PAGE_SIZE format specifier in open_ctree()

There is an instance of -Wformat when targeting 32-bit architectures due
to using a 'size_t' specifier (which is 'unsigned int' for 32-bit
platforms) to print PAGE_SIZE:

  In file included from fs/btrfs/compression.h:17,
                   from fs/btrfs/extent_io.h:15,
                   from fs/btrfs/locking.h:13,
                   from fs/btrfs/ctree.h:19,
                   from fs/btrfs/disk-io.c:22:
  fs/btrfs/disk-io.c: In function 'open_ctree':
  include/linux/kern_levels.h:5:25: error: format '%zu' expects argument of type 'size_t', but argument 4 has type 'long unsigned int' [-Werror=format=]
  ...
  fs/btrfs/disk-io.c:3398:17: note: in expansion of macro 'btrfs_warn'
   3398 |                 btrfs_warn(fs_info,
        |                 ^~~~~~~~~~

PAGE_SIZE is consistently defined as an 'unsigned long' in
include/vsdo/page.h so use '%lu' to clear up the warning.

Fixes: 98077f7f2180 ("btrfs: enable experimental bs > ps support")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Tested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid potential out-of-bounds in btrfs_encode_fh()

The function btrfs_encode_fh() does not properly account for the three
cases it handles.

Before writing to the file handle (fh), the function only returns to the
user BTRFS_FID_SIZE_NON_CONNECTABLE (5 dwords, 20 bytes) or
BTRFS_FID_SIZE_CONNECTABLE (8 dwords, 32 bytes).

However, when a parent exists and the root ID of the parent and the
inode are different, the function writes BTRFS_FID_SIZE_CONNECTABLE_ROOT
(10 dwords, 40 bytes).

If *max_len is not large enough, this write goes out of bounds because
BTRFS_FID_SIZE_CONNECTABLE_ROOT is greater than
BTRFS_FID_SIZE_CONNECTABLE originally returned.

This results in an 8-byte out-of-bounds write at
fid->parent_root_objectid = parent_root_id.

A previous attempt to fix this issue was made but was lost.

https://lore.kernel.org/all/4CADAEEC020000780001B32C@vpn.id2.novell.com/

Although this issue does not seem to be easily triggerable, it is a
potential memory corruption bug that should be fixed. This patch
resolves the issue by ensuring the function returns the appropriate size
for all three cases and validates that *max_len is large enough before
writing any data.

Fixes: be6e8dc0ba84 ("NFS support for btrfs - v3")
CC: stable@vger.kernel.org # 3.0+
Signed-off-by: Anderson Nascimento <anderson@allelesecurity.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()

After setting the BTRFS_ROOT_FORCE_COW flag on the root we are doing a
full write barrier, smp_wmb(), but we don't need to, all we need is a
smp_mb__after_atomic(). The use of the smp_wmb() is from the old days
when we didn't use a bit and used instead an int field in the root to
signal if cow is forced. After the int field was changed to a bit in
the root's state (flags field), we forgot to update the memory barrier
in create_pending_snapshot() to smp_mb__after_atomic(), but we did the
change in commit_fs_roots() after clearing BTRFS_ROOT_FORCE_COW. That
happened in commit 27cdeb7096b8 ("Btrfs: use bitfield instead of integer
data type for the some variants in btrfs_root"). On the reader side, in
should_cow_block(), we also use the counterpart smp_mb__before_atomic()
which generates further confusion.

So change the smp_wmb() to smp_mb__after_atomic(). In fact we don't
even need any barrier at all since create_pending_snapshot() is called
in the critical section of a transaction commit and therefore no one
can concurrently join/attach the transaction, or start a new one, until
the transaction is unblocked. By the time someone starts a new transaction
and enters should_cow_block(), a lot of implicit memory barriers already
took place by having acquired several locks such as fs_info->trans_lock
and extent buffer locks on the root node at least. Nevertlheless, for
consistency use smp_mb__after_atomic() after setting the force cow bit
in create_pending_snapshot().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add unlikely annotations to branches leading to transaction abort

The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen.

Transaction abort is one such error, the btrfs_abort_transaction()
inlines code to check the state and print a warning, this ought to be
out of the hot path.

The most common pattern is when transaction abort is called after
checking a return value and the control flow leads to a quick return.
In other cases it may not be necessary to add unlikely() e.g. when the
function returns anyway or the control flow is not changed noticeably.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add unlikely annotations to branches leading to EIO

The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EIO is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add unlikely annotations to branches leading to EUCLEAN

The unlikely() annotation is a static prediction hint that compiler may
use to reorder code out of hot path. We use it elsewhere (namely
tree-checker.c) for error branches that almost never happen, where
EUCLEAN (a corruption) is one of them.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions

Trivial pattern for the auto freeing with goto -> return conversions
if possible.

The following cases are considered trivial in this patch:

1. Cases where there are no operations between btrfs_free_path() and the
   function returns.
2. Cases where only simple cleanup operations (such as kfree(), kvfree(),
   clear_bit(), and fs_path_free()) are present between
   btrfs_free_path() and the function return.

Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: zoned: don't fail mount needlessly due to too many active zones

Previously BTRFS did not look at a device's reported max_open_zones limit,
but starting with commit 04147d8394e8 ("btrfs: zoned: limit active zones
to max_open_zones"), zoned BTRFS limited the number of concurrently used
block-groups to the number of max_open_zones a device reported, if it
hadn't already reported a number of max_active_zones.

Starting with commit 04147d8394e8 the number of open zones is treated the
same way as active zones. But this leads to mount failures on filesystems
which have been used before 04147d8394e8 because too many zones are in an
open state.

Ignore the new limitations on these filesystems, so zones can be finished
or evacuated.

Reported-by: Yuwei Han <hrx@bupt.moe>
Link: https://lore.kernel.org/all/2F48A90AF7DDF380+1790bcfd-cb6f-456b-870d-7982f21b5eae@bupt.moe/
Fixes: 04147d8394e8 ("btrfs: zoned: limit active zones to max_open_zones")
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()

As pointed out in the documentation, calling 'kmalloc' with open-coded
arithmetic can lead to unfortunate overflows and this particular way of
using it has been deprecated. Instead, it's preferred to use
'kmalloc_array' in cases where it might apply so an overflow check is
performed.

Note this is an API cleanup and is not fixing any overflows because in
all cases the multipliers are bounded small numbers derived from number
of items in leaves/nodes.

Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: enable experimental bs > ps support

With all the preparation patches, we're able to finally enable btrfs
block size (sector size) larger than page size support and give it a
full fstests run.

And obviously this new feature is hidden behind experimental flags, and
should not be considered as a core feature yet as btrfs' default block
size is still 4K.

But this is still a feature that will shine in the future where 16K
block sized device are widely adopted.

For now there are some features explicitly disabled:

- Direct IO
  This is the most complex part to support, the root reason is we can
  not control the pages of iov iter passed in.

  User space programs can only ensure the virtual addresses are
  contiguous, but have no control on their physical addresses.

  Our bs > ps support heavily relies on large folios, and direct IO
  memory can easily break it.

  So direct IO is disabled and will always fall back to buffered IO.

- RAID56
  In theory we can convert RAID56 to use large folios, but it will need
  to be converted back to page based if we want to support direct IO in
  the future.
  So just reject it for now.

- Encoded send
- Encoded read
  Both are utilizing btrfs_encoded_read_regular_fill_pages(), and send
  is utilizing vmallocated memory.
  Unfortunately for vmallocated memory we can not guarantee the minimal
  folio order.

  For send, it will just always fallback to regular writes, which reads
  from page cache and will follow the existing folio order requirement.

- Encoded write
  Encoded write itself is allocating pages by themselves, and we can
  easily change it to follow the minimal order.
  But since encoded read is already disabled, there is no need to only
  enable encoded write.

Finally just like what we did for bs < ps support in the past, add a
warning message for bs > ps mounts.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add extra ASSERT()s to catch unaligned bios

Btrfs uses btrfs_bio to handle read/write of logical address, for the
incoming bs > ps support, btrfs has extra requirements:

- One folio must contain at least one fs block
- No fs block can cross folio boundaries

This requirement is not hard to maintain, thanks to the address space's
minimal folio order.

But not all btrfs bios are generated through address space, e.g.
compression and scrub.

To catch possible unaligned bios, introduce a helper,
assert_bbio_alginment(), for each btrfs_bio in btrfs_submit_bbio().

This will check the following things:

- bv_offset is aligned to block size
- bv_len is aligned to block size

With a btrfs bio passing above checks, unless it's empty it will ensure
the requirements for bs > ps support.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix symbolic link reading when bs > ps

[BUG DURING BS > PS TEST]
When running the following script on a btrfs whose block size is larger
than page size, e.g. 8K block size and 4K page size, it will trigger a
kernel BUG:

  # mkfs.btrfs -s 8k $dev
  # mount $dev $mnt
  # mkdir $mnt/dir
  # ln -s dir $mnt/link
  # ls $mnt/link

The call trace looks like this:

  BTRFS warning (device dm-2): support for block size 8192 with page size 4096 is experimental, some features may be missing
  BTRFS info (device dm-2): checking UUID tree
  BTRFS info (device dm-2): enabling ssd optimizations
  BTRFS info (device dm-2): enabling free space tree
  ------------[ cut here ]------------
  kernel BUG at /home/adam/linux/include/linux/highmem.h:275!
  Oops: invalid opcode: 0000 [#1] SMP
  CPU: 8 UID: 0 PID: 667 Comm: ls Tainted: G           OE       6.17.0-rc4-custom+ #283 PREEMPT(full)
  Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 02/02/2022
  RIP: 0010:zero_user_segments.constprop.0+0xdc/0xe0 [btrfs]
  Call Trace:
   <TASK>
   btrfs_get_extent.cold+0x85/0x101 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   btrfs_do_readpage+0x244/0x750 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   btrfs_read_folio+0x9c/0x100 [btrfs 7453c70c03e631c8d8bfdd4264fa62d3e238da6f]
   filemap_read_folio+0x37/0xe0
   do_read_cache_folio+0x94/0x3e0
   __page_get_link.isra.0+0x20/0x90
   page_get_link+0x16/0x40
   step_into+0x69b/0x830
   path_lookupat+0xa7/0x170
   filename_lookup+0xf7/0x200
   ? set_ptes.isra.0+0x36/0x70
   vfs_statx+0x7a/0x160
   do_statx+0x63/0xa0
   __x64_sys_statx+0x90/0xe0
   do_syscall_64+0x82/0xae0
   entry_SYSCALL_64_after_hwframe+0x4b/0x53
   </TASK>

Please note bs > ps support is still under development and the
enablement patch is not even in btrfs development branch.

[CAUSE]
Btrfs reuses its data folio read path to handle symbolic links, as the
symbolic link target is stored as an inline data extent.

But for newly created inodes, btrfs only set the minimal order if the
target inode is a regular file.

Thus for above newly created symbolic link, it doesn't properly respect
the minimal folio order, and triggered the above crash.

[FIX]
Call btrfs_set_inode_mapping_order() unconditionally inside
btrfs_create_new_inode().

For symbolic links this will fix the crash as now the folio will meet
the minimal order.

For regular files this brings no change.

For directory/bdev/char and all the other types of inodes, they won't
go through the data read path, thus no effect either.

Fixes: cc38d178ff33 ("btrfs: enable large data folio support under CONFIG_BTRFS_EXPERIMENTAL")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare scrub to support bs > ps cases

This involves:

- Migrate scrub_stripe::pages[] to folios[]

- Use btrfs_alloc_folio_array() and folio_put() to alloc above array.

- Migrate scrub_stripe_get_kaddr() and scrub_stripe_get_paddr() to use
  folio interfaces

- Migrate raid56_parity_cache_data_pages() to
  raid56_parity_cache_data_folios()
  Since scrub is the only caller still using pages.

  This helper will copy the folio array contents into rbio::stripe_pages,
  with sector uptodate flags updated.

  And a new ASSERT() to make sure bs > ps cases will not hit this path.

Since most scrub code is based on kaddr/paddr, the migration itself is
pretty straightforward.

And since we're here, also move the loop to set the
stripe_sectors[].uptodate out of the copy loop.
As we always mark all the sectors as uptodate for the data stripe, it's
easier to do in one go, other than doing it inside the copy loop.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare zlib to support bs > ps cases

This involves converting the following functions to use correct folio
sizes/shifts:

- zlib_compress_folios()
- zlib_decompress_bio()

There is a special handling for s390 hardware acceleration.
With bs > ps cases, we can go with 16K block size on s390 (which uses
fixed 4K page size).
In that case we do not need to do the buffer copy as our folio is large
enough for hardware acceleration.

So factor out the s390 specific and folio size check into a helper,
need_special_buffer().

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare lzo to support bs > ps cases

This involves converting the following functions to use correct folio
sizes/shifts:

- copy_compress_data_to_page()
- lzo_compress_folios()
- lzo_decompress_bio()

Just like zstd, lzo has some extra incorrect usage of kmap_local_folio()
that the offset is always 0.

This will not handle HIGHMEM large folios correctly, but those cases are
already rejected explicitly so it should not cause problems when bs > ps
support is enabled.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare zstd to support bs > ps cases

This involves converting the following functions to use proper folio
sizes/shifts:

- zstd_compress_folios()
- zstd_decompress_bio()

The function zstd_decompress() is already using block size correctly
without using page size, thus it needs no modification.

And since zstd compression is calling kmap_local_folio(), the existing
code cannot handle large folios with HIGHMEM, as kmap_local_folio()
requires us to handle one page range each time.

I do not really think it's worth to spend time on some feature that will
be deprecated eventually. So here just add an extra explicit rejection
for bs > ps with HIGHMEM feature enabled kernels.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: prepare compression folio alloc/free for bs > ps cases

This includes the following preparation for bs > ps cases:

- Always alloc/free the folio directly if bs > ps
  This adds a new @fs_info parameter for btrfs_alloc_compr_folio(), thus
  affecting all compression algorithms.

  For btrfs_free_compr_folio() it needs no parameter for now, as we can
  use the folio size to skip the caching part.

  For now the change is just to passing a @fs_info into the function,
  all the folio size assumption is still based on page size.

- Properly zero the last folio in compress_file_range()
  Since the compressed folios can be larger than a page, we need to
  properly zero the whole folio.

- Use correct folio size for btrfs_add_compressed_bio_folios()
  Instead of page size, use the correct folio size.

- Use correct folio size/shift for btrfs_compress_filemap_get_folio()
  As we are not only using simple page sized folios anymore.

- Use correct folio size for btrfs_decompress()
  There is an ASSERT() making sure the decompressed range is no larger
  than a page, which will be triggered for bs > ps cases.

- Skip readahead for compressed pages
  Similar to subpage cases.

- Make btrfs_alloc_folio_array() to accept a new @order parameter

- Add a helper to calculate the minimal folio size

All those changes should not affect the existing bs <= ps handling.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()

[BUG]
With my local branch to enable bs > ps support for btrfs, sometimes I
hit the following ASSERT() inside submit_one_sector():

ASSERT(block_start != EXTENT_MAP_HOLE);

Please note that it's not yet possible to hit this ASSERT() in the wild
yet, as it requires btrfs bs > ps support, which is not even in the
development branch.

But on the other hand, there is also a very low chance to hit above
ASSERT() with bs < ps cases, so this is an existing bug affect not only
the incoming bs > ps support but also the existing bs < ps support.

[CAUSE]
Firstly that ASSERT() means we're trying to submit a dirty block but
without a real extent map nor ordered extent map backing it.

Furthermore with extra debugging, the folio triggering such ASSERT() is
always larger than the fs block size in my bs > ps case.
(8K block size, 4K page size)

After some more debugging, the ASSERT() is trigger by the following
sequence:

extent_writepage()
|  We got a 32K folio (4 fs blocks) at file offset 0, and the fs block
|  size is 8K, page size is 4K.
|  And there is another 8K folio at file offset 32K, which is also
|  dirty.
|  So the filemap layout looks like the following:
|
|  "||" is the filio boundary in the filemap.
|  "//| is the dirty range.
|
|  0        8K       16K        24K         32K       40K
|  |////////|        |//////////////////////||////////|
|
|- writepage_delalloc()
|  |- find_lock_delalloc_range() for [0, 8K)
|  |  Now range [0, 8K) is properly locked.
|  |
|  |- find_lock_delalloc_range() for [16K, 40K)
|  |  |- btrfs_find_delalloc_range() returned range [16K, 40K)
|  |  |- lock_delalloc_folios() locked folio 0 successfully
|  |  |
|  |  |  The filemap range [32K, 40K) got dropped from filemap.
|  |  |
|  |  |- lock_delalloc_folios() failed with -EAGAIN on folio 32K
|  |  |  As the folio at 32K is dropped.
|  |  |
|  |  |- loops = 1;
|  |  |- max_bytes = PAGE_SIZE;
|  |  |- goto again;
|  |  |  This will re-do the lookup for dirty delalloc ranges.
|  |  |
|  |  |- btrfs_find_delalloc_range() called with @max_bytes == 4K
|  |  |  This is smaller than block size, so
|  |  |  btrfs_find_delalloc_range() is unable to return any range.
|  |  \- return false;
|  |
|  \- Now only range [0, 8K) has an OE for it, but for dirty range
|     [16K, 32K) it's dirty without an OE.
|     This breaks the assumption that writepage_delalloc() will find
|     and lock all dirty ranges inside the folio.
|
|- extent_writepage_io()
    |- submit_one_sector() for [0, 8K)
    |  Succeeded
    |
    |- submit_one_sector() for [16K, 24K)
       Triggering the ASSERT(), as there is no OE, and the original
       extent map is a hole.

Please note that, this also exposed the same problem for bs < ps
support. E.g. with 64K page size and 4K block size.

If we failed to lock a folio, and falls back into the "loops = 1;"
branch, we will re-do the search using 64K as max_bytes.
Which may fail again to lock the next folio, and exit early without
handling all dirty blocks inside the folio.

[FIX]
Instead of using the fixed size PAGE_SIZE as @max_bytes, use
@sectorsize, so that we are ensured to find and lock any remaining
blocks inside the folio.

And since we're here, add an extra ASSERT() to
before calling btrfs_find_delalloc_range() to make sure the @max_bytes is
at least no smaller than a block to avoid false negative.

Cc: stable@vger.kernel.org # 5.15+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove pointless key offset setup in create_pending_snapshot()

There's no point in setting the key's offset to (u64)-1 since we never
use it before setting it to the current transaction's ID. So remove the
assignment of (u64)-1 to the key's offset and move the remainder of the
key initialization close to where it's used.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: annotate btrfs_is_testing() as unlikely and make it return bool

We can annotate btrfs_is_testing() as unlikely since that's the most
expected scenario and it's desirable for the compiler to optimize for
the case we are not running the self tests. So add the annotation to
btrfs_is_testing() and while at it also make it return bool instead of
int.

Also make two of the existing callers use btrfs_is_testing() directly
instead of storing its result in a local variable.

On x86_64 with Debian's gcc 14.2.0-19 this resulted in a very tiny object
code reduction.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1913263 161567   15592 2090422 1fe5b6 fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text    data     bss     dec     hex filename
  1913257 161567   15592 2090416 1fe5b0 fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: make the rule checking more readable for should_cow_block()

It's quite hard and unreadable the way the rule checks are organized in
should_cow_block(). We have a single if statement that returns 0 (false)
and it checks several conditions, with one them being a negated compound
condition which is particularly hard to reason immediately.

Improve on this by using multiple if statements, each checking a single
condition and returning immediately. Also change the return type from an
integer to a boolean, since all we need is to return true or false.

At least on x86_64 with Debian's gcc 14.2.0-19, this also reduces the
object code size by 64 bytes.

Before this change:

   $ size fs/btrfs/btrfs.ko
      text    data     bss     dec     hex filename
   1913327 161567   15592 2090486 1fe5f6 fs/btrfs/btrfs.ko

After this change:

   $ size fs/btrfs/btrfs.ko
      text    data     bss     dec     hex filename
   1913263 161567   15592 2090422 1fe5b6 fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: simplify inline extent end calculation at replay_one_extent()

There is no need to store the extent's ram_bytes in two variables,
further more one of them, named 'size', is used only for the extent's end
offset calculation. So remove the 'size' variable and use 'nbytes' only.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix comment about nbytes increase at replay_one_extent()

The comment is wrong about the part where it says a prealloc extent does
not contribute to an inode's nbytes - it does. Only holes don't contribute
and that's what we are checking for, as prealloc extents always have a
disk_bytenr different from 0. So fix the comment and re-organize the code
to not set nbytes twice and set it to the extent item's number of bytes
only if it doesn't represent a hole - in case it's a hole we have already
initialized nbytes to 0 when we declared it.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: return any hit error from extent_writepage_io()

Since the support of bs < ps support, extent_writepage_io() will submit
multiple blocks inside the folio.

But if we hit error submitting one sector, but the next sector can still
be submitted successfully, the function extent_writepage_io() will still
return 0.

This will make btrfs to silently ignore the error without setting error
flag for the filemap.

Fix it by recording the first error hit, and always return that value.

Fixes: 8bf334beb349 ("btrfs: fix double accounting race when extent_writepage_io() failed")
Reviewed-by: Daniel Vacek <neelx@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: mark leaf space and overflow checks as unlikely on insert and extension

We have several sanity checks when inserting or extending items in a btree
that verify we didn't overflow the leaf or access a slot beyond the last
one. These are cases that are never expected to be hit so mark them as
unlikely, allowing the compiler to potentially generate better code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: mark as unlikely not uptodate extent buffer checks when navigating btrees

We expect that after attempting to read an extent buffer we had no errors
therefore the extent buffer is up to date, so mark the checks for a not up
to date extent buffer as unlikely and allow the compiler to pontentially
generate better code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: mark extent buffer alignment checks as unlikely

We are not expecting to ever fail the extent buffer alignment checks, so
mark them as unlikely to allow the compiler to potentially generate more
optimized code.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: store and use node size in local variable in check_eb_alignment()

Instead of dereferencing fs_info every time we need to access the node
size, store in a local variable to make the code less verbose and avoid
a line split too.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print key types as human readable strings

Looking at a leaf dump from the kernel's print-tree implementation is not
so friendly to analyze since key types are printed as numbers. Improve on
this by printing key types as strings that are a diminutive of the macro
names for key types, just like we do in btrfs-progs.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: move code for processing file extent item into helper

The code for processing file extent items is quite large and it's better
to have it in a dedicated helper rather than in a huge switch statement,
just like we do in btrfs-progs.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print compression type for file extent items

We are not printing anything about the compression type, so add that
useful information in the same format as btrfs-progs.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print correct inline extent data size

We are advertising the ram_bytes of an inline extent as its data size, but
that is not true for compressed extents. The ram_bytes corresponds to the
uncompressed data size while the data size (compressed data) is given by
btrfs_file_extent_inline_item_len(). So fix this and print both values in
the same format as in btrfs-progs.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print range information for extent csum items

Currently we don't print anything for extent csum items other than the
generic line with the key, item offset and item size. While one can still
determine the range the extent csum covers by doing a few simple
computations, it makes it more time consuming to analyse a leaf dump.
So add a line that prints information about the range covered by the
checksum using the same format as btrfs-progs. This is useful when
debugging log tree issues since we log extent csum items for new extents.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print information about dir log items

We currently don't print information about dir log items (other than the
key, item offset and item size), which is useful to look at when debugging
problems with a log tree. So print their specific information (currently
they only have an end index number) in a format similar to btrfs-progs.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print information about inode extref items

Currently we ignore inode extref items, we just print their key, item
offset in the leaf and their size, no information about their content
like the index number, parent inode, name length and name.

Improve on this by printing the index, parent and name length in the same
format as btrfs-progs. Note that we don't print the name, as that would
require some processing and escaping like we do in btrfs-progs, and that
could expose sensitive information for some users in case they share their
dmesg/syslog and it contains a leaf dump. So for now leave names out.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print information about inode ref items

Currently we ignore inode ref items, we just print their key, item offset
in the leaf and their size, no information about their content like the
index number, name length and name.

Improve on this by printing the index and name length in the same format
as btrfs-progs. Note that we don't print the name, as that would require
some processing and escaping like we do in btrfs-progs, and that could
expose sensitive information for some users in case they share their
dmesg/syslog and it contains a leaf dump. So for now leave names out.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print dir items for dir index and xattr keys too

Currently we only print the dir items for BTRFS_DIR_ITEM_KEY keys, but
we also have dir items for BTRFS_DIR_INDEX_KEY and BTRFS_XATTR_ITEM_KEY
keys too. So print them for those keys too.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print more information about dir items

Currently we only print the object id component of the location key from a
dir item and the flags. We are missing the whole key, transid and the name
and data lengths. We are also ignoring the fact that we can have multiple
dir item objects encoded in a single item for a BTRFS_DIR_ITEM_KEY key, so
what we print is only for the first item.

Improve on this by iterating on all dir items and print the missing
information. This is done with the same format as in btrfs-progs, what
we miss is printing the names and data since not only that would require
some processing and escaping like in btrfs-progs, but it would also reveal
information that may be sensitive and users may not want to share that in
case that get a leaf dumped in dmesg.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: print-tree: print missing fields for inode items

We are not dumping a lot of fields for an inode item which are useful for
debugging whenever we dump a leaf (log replay failure for example), so add
them and make it as close as possible to the print tree implementation in
btrfs-progs (things like converting timespecs to human readable dates and
converting flags to strings are missing since they are not so practical to
do in the kernel).

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: tree-checker: add inode extref checks

Like inode refs, inode extrefs have a variable length name, which means
we have to do a proper check to make sure no header nor name can exceed
the item limits.

The check itself is very similar to check_inode_ref(), just a different
structure (btrfs_inode_extref vs btrfs_inode_ref).

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: send: index backref cache by node number instead of by sector number

We now have a nodesize_bits member in fs_info so we can index an extent
buffer in the backref cache by node number instead of by sector number.
While this allows for a denser index space with the possibility of using
less maple tree nodes, in practice it's unlikely to hit such benefits
since we currently limit the maximum number of keys in the cache to 128,
so unless all extent buffers are contiguous we are unlikely to see a
memory usage reduction in the backing maple tree due to fewer nodes.
Nevertheless it doesn't cost anything to index by node number and it's
more logical.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: dump detailed info and specific messages on log replay failures

Currently debugging log replay failures can be harder than needed, since
all we do now is abort a transaction, which gives us a line number, a
stack trace and an error code. But that is most of the times not enough
to give some clue about what went wrong. So add a new helper to abort
log replay and provide contextual information:

1) Dump the current leaf of the log tree being processed and print the
   slot we are currently at and the key at that slot;

2) Dump the current subvolume tree leaf if we have any;

3) Print the current stage of log replay;

4) Print the id of the subvolume root associated with the log tree we
   are currently processing (as we can have multiple);

5) Print some error message to mention what we were trying to do when we
   got an error.

Replace all transaction abort calls (btrfs_abort_transaction()) with the
new helper btrfs_abort_log_replay(), which besides dumping all that extra
information, it also aborts the current transaction.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: abort transaction if we fail to update inode in log replay dir fixup

If we fail to update the inode at link_to_fixup_dir(), we don't abort the
transaction and propagate the error up the call chain, which makes it hard
to pinpoint the error to the inode update. So abort the transaction if the
inode update call fails, so that if it happens we known immediately.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: abort transaction if we fail to find dir item during log replay

At __add_inode_ref() if we get an error when trying to lookup a dir item
we don't abort the transaction and propagate the error up the call chain,
so that somewhere else up in the call chain the transaction is aborted.
This however makes it hard to know that the failure comes from looking up
a dir item, so add a transaction abort in case we fail there, so that we
immediately pinpoint where the problem comes from during log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove pointless inode lookup when processing extrefs during log replay

At unlink_extrefs_not_in_log() we do an inode lookup of the directory but
we already have the directory inode accessible as a function argument, so
the lookup is redudant. Remove it and use the directory inode passed in as
an argument.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: stop passing inode object IDs to __add_inode_ref() in log replay

There's no point in passing the inode and parent inode object IDs to
__add_inode_ref() and its helpers because we can get them by calling
btrfs_ino() against the inode and the directory inode, and we pass both
inodes to __add_inode_ref() and its helpers. So remove the object IDs
parameters to reduce arguments passed and to make things less confusing.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add path for subvolume tree changes to struct walk_control

While replaying log trees we need to do searches and updates to subvolume
trees and for that we use a path that we allocate in replay_one_buffer()
and then pass it as a parameter to other functions deeper in the log
replay call chain. Instead of passing it as parameter, add it to struct
walk_control since we pass a pointer to that structure for every log
replay function.

This reduces the number of arguments passed to the functions and it will
be needed and important for an upcoming changes that improves error
reporting for log replay. Also name the new filed in struct walk_control
to 'subvol_path' - while that is longer to type, the naming makes it clear
it's used for subvolume tree operations since many of the log replay
functions operate both on subvolume and log trees, and for the log tree
searches we have struct walk_control::log_leaf to also make it obvious
it's an extent buffer for a log tree extent buffer.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant path release when overwriting item during log replay

At overwrite_item() we have a redundant btrfs_release_path() just before
failing with -ENOMEM, as the caller who passed in the path will free it
and therefore also release any refcounts and locks on the extent buffers
of the path. So remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: remove redundant path release when processing dentry during log replay

At replay_one_one() we have a redundant btrfs_release_path() just before
calling insert_one_name(), as some lines above we have already released
the path with another btrfs_release_path() call. So remove it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid unnecessary path allocation when replaying a dir item

There's no need to allocate 'fixup_path' at replay_one_dir_item(), as the
path passed as an argument is unused by the time link_to_fixup_dir() is
called (replay_one_name() releases the path before it returns).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid path allocations when dropping extents during log replay

We can avoid a path allocation in the btrfs_drop_extents() calls we have
at replay_one_extent() and replay_one_buffer() by passing the path we
already have in those contextes as it's unused by the time they call
btrfs_drop_extents().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: avoid unnecessary path allocation at fixup_inode_link_count()

There's no need to allocate a path as our single caller already has a
path that we can use. So pass the caller's path and use it.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add current log leaf, key and slot to struct walk_control

A lot of the log replay functions get passed the current log leaf being
processed as well as the current slot and the key at that slot. Instead
of passing them as parameters, add them to struct walk_control so that
we reduce the numbers of parameters. This is also going to be needed to
further changes that improve error reporting during log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use the inode item boolean everywhere in overwrite_item()

We have this boolean 'inode_item' to tell if we are processing an inode
item key and we use it in a couple of places while in another two places
we open code by checking if the key type matches the inode item type.
Make this consistent and use the boolean everywhere. Also rename it from
'inode_item' to 'is_inode_item', which makes it more clear that it's a
boolean and not an instance of struct btrfs_inode_item, and make it const
too.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use level argument in log tree walk callback replay_one_buffer()

We already have the extent buffer's level in an argument, there's no need
to first ensure the extent buffer's data is loaded (by calling
btrfs_read_extent_buffer()) and then call btrfs_header_level() to check
the level. So use the level argument and do the check before calling
btrfs_read_extent_buffer().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use level argument in log tree walk callback process_one_buffer()

We already have the extent buffer's level in an argument, there's no need
to call btrfs_header_level(). So use the level argument and make the code
shorter.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to overwrite_item()

Instead of passing the transaction and subvolume root as arguments to
overwrite_item(), pass the walk_control structure as we can grab them
from the structure. This reduces the number of arguments passed and it's
going to be needed by an incoming change that improves error reporting
for log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to drop_one_dir_item() and helpers

Instead of passing the transaction as an argument to drop_one_dir_item()
and its helpers (link_to_fixup_dir() and unlink_inode_for_log_replay()),
pass the walk_control structure as we can access the transaction from it
and the subvolume root. This is going to be needed by an incoming change
that improves error reporting for log replay and also reduces the number
of arguments passed to link_to_fixup_dir().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to replay_one_dir_item() and replay_one_name()

Instead of passing the transaction and subvolume root and log tree as
arguments, pass the walk_control structure as we can grab all of those
from the structure. This reduces the number of arguments passed and it's
going to be needed by an incoming change that improves error reporting
for log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to add_inode_ref() and helpers

Instead of passing the transaction, subvolume root and log tree as
arguments to add_inode_ref() and its helpers (__add_inode_ref(),
unlink_refs_not_in_log(), unlink_extrefs_not_in_log() and
unlink_old_inode_refs()), pass the walk_control structure as we can
access all of those from the structure. This reduces the number of
arguments passed and it's going to be needed by an incoming change
that improves error reporting for log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to replay_one_extent()

Instead of passing the transaction and subvolume root as arguments to
replay_one_extent(), pass the walk_control structure as we can grab all
of those from the structure. This reduces the number of arguments passed
and it's going to be needed by an incoming change that improves error
reporting for log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to check_item_in_log()

Instead of passing the transaction and log tree as arguments to
check_item_in_log(), pass the walk_control structure as we can grab those
from the structure. This reduces the number of arguments passed and it's
going to be needed by an incoming change that improves error reporting for
log replay. Notice that a NULL log root argument to check_item_in_log()
makes it unconditionally delete a directory entry, so since the
walk_control always has a non-NULL log root, we add an extra boolean to
check_item_in_log() to tell it if it should unconditionally delete a
directory entry, preserving the behaviour and also making it a bit more
clear.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to replay_dir_deletes()

Instead of passing the transaction, subvolume root and log tree as
arguments to replay_dir_deletes(), pass the walk_control structure as
we can grab all of those from the structure. This reduces the number of
arguments passed and it's going to be needed by an incoming change that
improves error reporting for log replay. This also requires changing
fixup_inode_link_counts() and fixup_inode_link_count() to take that
structure as an argument since fixup_inode_link_count() makes a call
to replay_dir_deletes().

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: move up the definition of struct walk_control

In upcoming changes we need to pass struct walk_control as an argument to
replay_dir_deletes() and link_to_fixup_dir() so we need to move its
definition above the prototypes of those functions. So move it up right
below the enum that defines log replay stages and before any functions and
function prototypes are declared. Also fixup the comments while moving it
so that they comply with the preferred code style (capitalize the first
word in a sentence, end sentences with punctuation, makes lines wider and
closer to the 80 characters limit).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: pass walk_control structure to replay_xattr_deletes()

Instead of passing the transaction, subvolume root and log tree as
arguments to replay_xattr_deletes(), pass the walk_control structure as
we can grab all of those from the structure. This reduces the number of
arguments passed and it's going to be needed by an incoming change that
improves error reporting for log replay.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: always drop log root tree reference in btrfs_replay_log()

Currently we have this odd behaviour:

1) At btrfs_replay_log() we drop the reference of the log root tree if
   the call to btrfs_recover_log_trees() failed;

2) But if the call to btrfs_recover_log_trees() did not fail, we don't
   drop the reference in btrfs_replay_log() - we expect that
   btrfs_recover_log_trees() does it in case it returns success.

Let's simplify this and make btrfs_replay_log() always drop the reference
on the log root tree, not only this simplifies code as it's what makes
sense since it's btrfs_replay_log() who grabbed the reference in the first
place.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: stop setting log_root_tree->log_root to NULL in btrfs_recover_log_trees()

There's no point in setting log_root_tree->log_root to NULL as this is
already NULL, we never assigned anything to it before and it's meaningless
as a log root never has a value other than NULL for the ->log_root field,
that can be not NULL only for non log roots.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: stop passing transaction parameter to log tree walk functions

It's unncessary to pass a transaction parameter since struct walk_control
already has a member that points to the transaction, so we can make the
functions access the structure.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: deduplicate log root free in error paths from btrfs_recover_log_trees()

Instead of duplicating the dropping of a log tree in case we jump to the
'error' label, move the dropping under the 'error' label and get rid of the
the unnecessary setting of the log root to NULL since we return immediately
after.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: add and use a log root field to struct walk_control

Instead of passing an extra log root parameter for the log tree walk
functions and callbacks, add the log tree to struct walk_control and
make those functions and callbacks extract the log root from that
structure, reducing the number of parameters. This also simplifies
further upcoming changes to report log tree replay failures.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename root to log in walk_down_log_tree() and walk_up_log_tree()

Everywhere we have a log root we name it as 'log' or 'log_root' except in
walk_down_log_tree() and walk_up_log_tree() where we name it as 'root',
which not only it's inconsistent, it's also confusing since we typically
use 'root' when naming variables that refer to a subvolume tree. So for
clairty and consistency rename the 'root' argument to 'log'.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename replay_dest member of struct walk_control to root

Everywhere else we refer to a subvolume root we are replaying to simply
as 'root', so rename from 'replay_dest' to 'root' for consistency and
having a more meaningful and shorter name. While at it also update the
comment to be more detailed and comply to preferred style (first word in
a sentence is capitalized and sentence ends with punctuation).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: use booleans in walk control structure for log replay

The 'free' and 'pin' member of struct walk_control, used during log replay
and when freeing a log tree, are defined as integers but in practice are
used as booleans. Change their type to bool and while at it update their
comments to be more detailed and comply with the preferred comment style
(first word in a sentence is capitalized, sentences end with punctuation
and the comment opening (/*) is on a line of its own).

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cache max and min order inside btrfs_fs_info

Inside btrfs_fs_info we cache several bits shift like sectorsize_bits.

Apply this to max and min folio orders so that every time mapping order
needs to be applied we can skip the calculation.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: introduce btrfs_bio_for_each_block_all() helper

Currently if we want to iterate all blocks inside a bio, we do something
like this:

bio_for_each_segment_all(bvec, bio, iter_all) {
for (off = 0; off < bvec->bv_len; off += sectorsize) {
/* Iterate blocks using bv + off */
}
}

That's fine for now, but it will not handle future bs > ps, as
bio_for_each_segment_all() is a single-page iterator, it will always
return a bvec that's no larger than a page.

But for bs > ps cases, we need a full folio (which covers at least one
block) so that we can work on the block.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block_all()
  This helper will create a local bvec_iter, which has the size of the
  target bio. Then grab the current physical address of the current
  location, then advance the iterator by block size.

- Use btrfs_bio_for_each_block_all() to replace existing call sites
  Including:

  * set_bio_pages_uptodate() in raid56
  * verify_bio_data_sectors() in raid56

  Both will result much easier to read code.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: introduce btrfs_bio_for_each_block() helper

Currently if we want to iterate a bio in block unit, we do something
like this:

while (iter->bi_size) {
struct bio_vec bv = bio_iter_iovec();

/* Do something with using the bv */

bio_advance_iter_single(&bbio->bio, iter, sectorsize);
}

That's fine for now, but it will not handle future bs > ps, as
bio_iter_iovec() returns a single-page bvec, meaning the bv_len will not
exceed page size.

This means the code using that bv can only handle a block if bs <= ps.

To address this problem and handle future bs > ps cases better:

- Introduce a helper btrfs_bio_for_each_block()
  Instead of bio_vec, which has single and multiple page version and
  multiple page version has quite some limits, use my favorite way to
  represent a block, phys_addr_t.

  For bs <= ps cases, nothing is changed, except we will do a very
  small overhead to convert phys_addr_t to a folio, then use the proper
  folio helpers to handle the possible highmem cases.

  For bs > ps cases, all blocks will be backed by large folios, meaning
  every folio will cover at least one block. And still use proper folio
  helpers to handle highmem cases.

  With phys_addr_t, we will handle both large folio and highmem
  properly. So there is no better single variable to present a btrfs
  block than phys_addr_t.

- Extract the data block csum calculation into a helper
  The new helper, btrfs_calculate_block_csum() will be utilized by
  btrfs_csum_one_bio().

- Use btrfs_bio_for_each_block() to replace existing call sites
  Including:

  * index_one_bio() from raid56.c
    Very straight-forward.

  * btrfs_check_read_bio()
    Also update repair_one_sector() to grab the folio using phys_addr_t,
    and do extra checks to make sure the folio covers at least one
    block.
    We do not need to bother bv_len at all now.

  * btrfs_csum_one_bio()
    Now we can move the highmem handling into a dedicated helper,
    calculate_block_csum(), and use btrfs_bio_for_each_block() helper.

There is one exception in btrfs_decompress_buf2page(), which is copying
decompressed data into the original bio, which is not iterating using
block size thus we don't need to bother.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: concentrate highmem handling for data verification

Currently for btrfs checksum verification, we do it in the following
pattern:

kaddr = kmap_local_*();
ret = btrfs_check_csum_csum(kaddr);
kunmap_local(kaddr);

It's OK for now, but it's still not following the patterns of helpers
inside linux/highmem.h, which never requires a virt memory address.

In those highmem helpers, they mostly accept a folio, some offset/length
inside the folio, and in the implementation they check if the folio
needs partial kmap, and do the handling.

Inspired by those formal highmem helpers, enhance the highmem handling
of data checksum verification by:

- Rename btrfs_check_sector_csum() to btrfs_check_block_csum()
  To follow the more common term "block" used in all other major
  filesystems.

- Pass a physical address into btrfs_check_block_csum() and
  btrfs_data_csum_ok()
  The physical address is always available even for a highmem page.
  Since it's page frame number << PAGE_SHIFT + offset in page.

  And with that physical address, we can grab the folio covering the
  page, and do extra checks to ensure it covers at least one block.

  This also allows us to do the kmap inside btrfs_check_block_csum().
  This means all the extra HIGHMEM handling will be concentrated into
  btrfs_check_block_csum(), and no callers will need to bother highmem
  by themselves.

- Properly zero out the block if csum mismatch
  Since btrfs_data_csum_ok() only got a paddr, we can not and should not
  use memzero_bvec(), which only accepts single page bvec.
  Instead use paddr to grab the folio and call folio_zero_range()

Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: support all block sizes which is no larger than page size

Currently if block size < page size, btrfs only supports one single
config, 4K.

This is mostly to reduce the test configurations, as 4K is going to be
the default block size for all architectures.

However all other major filesystems have no artificial limits on the
support block size, and some are already supporting block size > page
sizes.

Since the btrfs subpage block support has been there for a long time,
it's time for us to enable all block size <= page size support.

So here enable all block sizes support as long as it's no larger than
page size for experimental builds.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: scrub: replace max_t()/min_t() with clamp() in scrub_throttle_dev_io()

Replace max_t() followed by min_t() with a single clamp().

As was pointed by David Laight in
https://lore.kernel.org/linux-btrfs/20250906122458.75dfc8f0@pumpkin/
the calculation may overflow u32 when the input value is too large, so
clamp_t() is not used. In practice the expected values are in range of
megabytes to gigabytes (throughput limit) so the bug would not happen.

Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: David Sterba <dsterba@suse.com>
[ Use clamp() and add explanation. ]
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: fix typos in comments and strings

Annual typo fixing pass. Strangely codespell found only about 30% of
what is in this patch, the rest was done manually using text
spellchecker with a custom dictionary of acceptable terms.

Reviewed-by: Neal Gompa <neal@gompa.dev>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: reduce compression workspace buffer space to block size

Currently the compression workspace buffer size is always based on
PAGE_SIZE, but btrfs has support subpage sized block size for some time.

This means for one-shot compression algorithm like lzo, we're wasting
quite some memory if the block size is smaller than page size, as the
LZO only works on one block (thus one-shot).

On 64K page sized systems with 4K block size, it means we only need at
most 8K buffer space for lzo, but in reality we're allocating 64K
buffer.

So to reduce the memory usage, change all workspace buffer to base its
size based on block size other than page size.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: rename btrfs_compress_op to btrfs_compress_levels

Since all workspace managers are per-fs, there is no need nor no way to
store them inside btrfs_compress_op::wsm anymore.

With that said, we can do the following modifications:

- Remove zstd_workspace_mananger::ops
Zstd always grab the global btrfs_compress_op[].
- Remove btrfs_compress_op::wsm member
- Rename btrfs_compress_op to btrfs_compress_levels

This should make it more clear that btrfs_compress_levels structures are
only to indicate the levels of each compress algorithm.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>

btrfs: cleanup the per-module compression workspace managers

Since all workspaces are handled by the per-fs workspace managers, we
can safely remove the old per-module managers.

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>