Qu Wenruo [Fri, 27 Feb 2026 03:33:44 +0000 (14:03 +1030)]
btrfs: move shutdown and remove_bdev callbacks out of experimental features
These two new callbacks were introduced in v6.19, so by v7.1 they have
been in the tree for two releases.
During that time no bugs related to these two features have surfaced,
thus it's time to expose them to end users.
It's especially important to expose the remove_bdev callback to end users.
That new callback makes btrfs automatically shut down or go degraded
when a device goes missing (depending on whether the fs can maintain RW),
which directly affects end users.
We want some feedback from early adopters.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Yochai Eisenrich [Sun, 22 Mar 2026 06:39:35 +0000 (08:39 +0200)]
btrfs: fix btrfs_ioctl_space_info() slot_count TOCTOU which can lead to info-leak
btrfs_ioctl_space_info() has a TOCTOU race between two passes over the
block group RAID type lists. The first pass counts entries to determine
the allocation size, then the second pass fills the buffer. The
groups_sem rwlock is released between passes, allowing concurrent block
group removal to reduce the entry count.
When the second pass fills fewer entries than the first pass counted,
copy_to_user() copies the full alloc_size bytes including trailing
uninitialized kmalloc bytes to userspace.
Fix by copying only total_spaces entries (the actually-filled count from
the second pass) instead of alloc_size bytes, and switch to kzalloc so
any future copy size mismatch cannot leak heap data.
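A minimal sketch of the fixed tail of btrfs_ioctl_space_info(); total_spaces, alloc_size, kzalloc and the copy_to_user() change come from the description, while 'arg', 'dest' and the space_args header size are assumptions about the surrounding ioctl plumbing:

/* kzalloc: even if a future bug copies too many bytes, userspace
 * receives zeros instead of stale heap data. */
dest = kzalloc(alloc_size, GFP_KERNEL);
if (!dest)
        return -ENOMEM;

/* ... second pass fills 'total_spaces' entries under groups_sem ... */

/* Copy only what was actually filled, not the first-pass estimate. */
if (copy_to_user(arg + sizeof(space_args), dest,
                 total_spaces * sizeof(struct btrfs_ioctl_space_info)))
        ret = -EFAULT;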
Fixes: 7fde62bffb57 ("Btrfs: buffer results in the space_info ioctl") CC: stable@vger.kernel.org # 3.0 Signed-off-by: Yochai Eisenrich <echelonh@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 18 Mar 2026 13:39:51 +0000 (13:39 +0000)]
btrfs: avoid taking the device_list_mutex in btrfs_run_dev_stats()
btrfs_run_dev_stats() is called during the critical section of a
transaction commit and it takes the device_list_mutex, which is also
acquired by fitrim, which does discard operations while holding that
mutex. Most of the time, if we are on a healthy filesystem, we don't have
new stat updates to persist in the device tree, so blocking on the
device_list_mutex is just wasting time and making any tasks that need to
start a new transaction wait longer than necessary.
Since the device list is RCU safe/protected, make btrfs_run_dev_stats()
do an initial check for device stat updates using RCU and quit without
taking the device_list_mutex in case there are no new device stats that
need to be persisted in the device tree.
Also note that adding/removing devices also requires starting a
transaction, and since btrfs_run_dev_stats() is called from the critical
section of a transaction commit, no one can be concurrently adding or
removing a device while btrfs_run_dev_stats() is called.
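A sketch of the early exit, assuming a hypothetical per-device predicate (stats_need_update) for "this device has unsaved stat updates":

struct btrfs_device *device;
bool found = false;

rcu_read_lock();
list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
        if (stats_need_update(device)) {        /* hypothetical helper */
                found = true;
                break;
        }
}
rcu_read_unlock();

/* Nothing to persist: skip the device_list_mutex entirely. */
if (!found)
        return 0;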
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Leo Martins [Thu, 19 Mar 2026 23:49:08 +0000 (16:49 -0700)]
btrfs: avoid GFP_ATOMIC allocations in qgroup free paths
When qgroups are enabled, __btrfs_qgroup_release_data() and
qgroup_free_reserved_data() pass an extent_changeset to
btrfs_clear_record_extent_bits() to track how many bytes had their
EXTENT_QGROUP_RESERVED bits cleared. Inside the extent IO tree spinlock,
add_extent_changeset() calls ulist_add() with GFP_ATOMIC to record each
changed range. If this allocation fails, it hits a BUG_ON and panics the
kernel.
However, both of these callers only read changeset.bytes_changed
afterwards — the range_changed ulist is populated and immediately freed
without ever being iterated. The GFP_ATOMIC allocation is entirely
unnecessary for these paths.
Introduce extent_changeset_init_bytes_only() which uses a sentinel value
(EXTENT_CHANGESET_BYTES_ONLY) on the ulist's prealloc field to signal
that only bytes_changed should be tracked. add_extent_changeset() checks
for this sentinel and returns early after updating bytes_changed,
skipping the ulist_add() call entirely. This eliminates the GFP_ATOMIC
allocation and makes the BUG_ON unreachable for these paths.
Callers that need range tracking (qgroup_reserve_data,
qgroup_unreserve_range, btrfs_qgroup_check_reserved_leak) continue to
use extent_changeset_init() and are unaffected.
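A sketch of the bytes-only early return inside add_extent_changeset(); the sentinel comparison against the ulist's prealloc field follows the description, and the ulist_add() call is the existing one:

changeset->bytes_changed += state->end - state->start + 1;

/* Bytes-only mode: skip the GFP_ATOMIC ulist addition entirely,
 * making its allocation-failure BUG_ON unreachable. */
if (changeset->range_changed.prealloc == EXTENT_CHANGESET_BYTES_ONLY)
        return 0;

return ulist_add(&changeset->range_changed, state->start, state->end,
                 GFP_ATOMIC);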
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: decrease indentation of find_free_extent_update_loop
Decrease the indentation of find_free_extent_update_loop(), by inverting
the check if the loop state is smaller than LOOP_NO_EMPTY_SIZE.
This also allows for an early return from find_free_extent_update_loop(),
in case LOOP_NO_EMPTY_SIZE is already set at this point.
While at it change a

  if () {
  }
  else if
  else

pattern to use curly braces everywhere and be consistent with the rest of
the btrfs code.
Also change 'int exists' to 'bool have_trans' giving it a more meaningful
name and type.
No functional changes intended.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 17 Mar 2026 19:35:58 +0000 (19:35 +0000)]
btrfs: unexport btrfs_qgroup_reserve_meta()
There's only one caller of btrfs_qgroup_reserve_meta() outside qgroup.c,
and btrfs_qgroup_reserve_meta_prealloc() is a wrapper around that
function. Make that caller use btrfs_qgroup_reserve_meta_prealloc()
and unexport btrfs_qgroup_reserve_meta(), simplifying the external API.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 17 Mar 2026 19:19:43 +0000 (19:19 +0000)]
btrfs: collapse __btrfs_qgroup_reserve_meta() into btrfs_qgroup_reserve_meta_prealloc()
Since __btrfs_qgroup_reserve_meta() is only called by
btrfs_qgroup_reserve_meta_prealloc(), which is a simple inline wrapper,
get rid of the wrapper and rename __btrfs_qgroup_reserve_meta() to
btrfs_qgroup_reserve_meta_prealloc().
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 17 Mar 2026 19:09:15 +0000 (19:09 +0000)]
btrfs: collapse __btrfs_qgroup_free_meta() into btrfs_qgroup_free_meta_prealloc()
Since __btrfs_qgroup_free_meta() is only called by
btrfs_qgroup_free_meta_prealloc(), which is a simple inline wrapper, get
rid of the wrapper and rename __btrfs_qgroup_free_meta() to
btrfs_qgroup_free_meta_prealloc().
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 18:36:36 +0000 (18:36 +0000)]
btrfs: optimize clearing all bits from first extent record in an io tree
When we are clearing all the bits from the first record that contains the
target range and that record ends at or before our target range but starts
before our target range, we are doing a lot of unnecessary work:
1) Allocating a prealloc state if we don't have one already;
2) Adjust that record's start offset to the start of our range and
make the prealloc state have a range going from the original start
offset of that first record to the start offset of our target range,
and with the same bits as that first record. Then we insert the
prealloc extent in the rbtree - this is done in split_state();
3) Remove our adjusted first state from the rbtree since all the bits
were cleared - this is done in clear_state_bit().
This is all wasted work, since we can simply trim that first record so
that it represents the range from its original start offset to the start
offset of our target range. So optimize for that case and avoid the
prealloc state allocation, insertion and deletion from the rbtree.
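A sketch of the shortcut in the clearing path; clears_all_bits() is a hypothetical stand-in for "the clear mask covers all bits set on this record", and the label name is also an assumption:

if (state->start < start && state->end <= end &&
    clears_all_bits(state, bits)) {     /* hypothetical predicate */
        /* Trim in place: no prealloc allocation, no split_state()
         * insertion and no removal from the rbtree. */
        state->end = start - 1;
        goto next;                      /* hypothetical label */
}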
This patch is the last patch of a patchset comprised of the following
patches (in descending order):
btrfs: optimize clearing all bits from first extent record in an io tree
btrfs: panic instead of warn when splitting extent state not in the tree
btrfs: free cached state outside critical section in wait_extent_bit()
btrfs: avoid unnecessary wake ups on io trees when there are no waiters
btrfs: remove wake parameter from clear_state_bit()
btrfs: change last argument of add_extent_changeset() to boolean
btrfs: use extent_io_tree_panic() instead of BUG_ON()
btrfs: make add_extent_changeset() only return errors or success
btrfs: tag as unlikely branches that call extent_io_tree_panic()
btrfs: turn extent_io_tree_panic() into a macro for better error reporting
btrfs: optimize clearing all bits from the last extent record in an io tree
The following fio script was used to measure performance before and after
applying all the patches:
Also, using bpftrace to collect the duration (in nanoseconds) of all the
btrfs_clear_extent_bit_changeset() calls done during that fio test and
then building a histogram from that data yielded the following results:
Filipe Manana [Wed, 11 Mar 2026 16:15:59 +0000 (16:15 +0000)]
btrfs: panic instead of warn when splitting extent state not in the tree
We are never expected to split an extent state record that is not in
the rbtree, as every record we pass to split_state() was found by
iterating the rbtree, so if that ever happens it means we are not holding
the extent io tree's spinlock or we have some memory corruption.
Instead of simply warning in case the extent state record passed to
split_state() is not in the rbtree, panic as this is a serious problem.
Also tag as unlikely the case where the record is not in the rbtree.
This also makes a tiny reduction in the btrfs module's text size.
Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss      dec     hex filename
2000080  174328   15592  2190000  216ab0 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
   text    data     bss      dec     hex filename
2000064  174328   15592  2189984  216aa0 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 16 Mar 2026 11:38:36 +0000 (11:38 +0000)]
btrfs: free cached state outside critical section in wait_extent_bit()
There's no need to free the cached extent state record while holding the
io tree's spinlock, it's just making the critical section longer than it
needs to be. So just do it after unlocking the io tree.
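A sketch of the reordering, using the existing free_extent_state() helper (surrounding code simplified):

spin_unlock(&tree->lock);
/* Free after unlocking: the free does not need the tree's lock and
 * only made the critical section longer. */
if (cached_state)
        free_extent_state(cached_state);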
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 15:07:11 +0000 (15:07 +0000)]
btrfs: avoid unnecessary wake ups on io trees when there are no waiters
Whenever clearing the extent lock bits of an extent state record, we
unconditionally call wake_up() on the state's waitqueue. Most of the
time there are no waiters on the queue so we are just wasting time calling
wake_up(), since that requires locking and unlocking the queue's spinlock,
disabling and re-enabling interrupts, function calls, and other minor
overhead, all while we are inside a critical section delimited by the
extent io tree's spinlock.
So call wake_up() only if there are waiters on an extent state's wait
queue.
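A minimal sketch; waitqueue_active() is the stock kernel helper for this check, and its memory-ordering requirement is satisfied here because the io tree's spinlock is held:

/* Only pay the wake_up() cost when someone is actually waiting. */
if (waitqueue_active(&state->wq))
        wake_up(&state->wq);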
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 12:50:03 +0000 (12:50 +0000)]
btrfs: remove wake parameter from clear_state_bit()
There's no need to pass the 'wake' parameter, we can determine if we have
to wake up waiters by checking if EXTENT_LOCK_BITS is set in the bits to
clear. So simplify things and remove the parameter.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 12:35:33 +0000 (12:35 +0000)]
btrfs: use extent_io_tree_panic() instead of BUG_ON()
There's no need to call BUG_ON(), instead call extent_io_tree_panic(),
which also calls BUG(), but it prints an additional error message with
some useful information before hitting BUG().
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 12:17:03 +0000 (12:17 +0000)]
btrfs: make add_extent_changeset() only return errors or success
Currently add_extent_changeset() always returns the return value from its
call to ulist_add(), which can return an error, 0 or 1. No caller cares
about the difference between 0 and 1: all but one check only for negative
values and ignore anything else, while the remaining caller
(btrfs_clear_extent_bit_changeset()) must set its 'ret' variable to 0
after calling add_extent_changeset(), so that it does not return an
unexpected value of 1 to its caller.
So change add_extent_changeset() to only return errors or 0, sparing
that caller (and any future callers) from having to deal with a return
value of 1.
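A sketch of the normalized return; ulist_add() really can return a negative errno, 0 (already present) or 1 (newly added):

ret = ulist_add(&changeset->range_changed, state->start, state->end,
                GFP_ATOMIC);
if (ret < 0)
        return ret;
/* Collapse ulist_add()'s 0/1 success values into plain 0. */
return 0;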
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 12:07:03 +0000 (12:07 +0000)]
btrfs: tag as unlikely branches that call extent_io_tree_panic()
It's unexpected to ever call extent_io_tree_panic() so surround with
'unlikely' every if statement condition that leads to it, making it
explicit to a reader and to hint the compiler to potentially generate
better code.
On x86_64, using gcc 14.2.0-19 from Debian, this resulted in a slight
decrease of the btrfs module's text size.
Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss      dec     hex filename
1999832  174320   15592  2189744  2169b0 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
   text    data     bss      dec     hex filename
1999768  174320   15592  2189680  216970 fs/btrfs/btrfs.ko
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 11 Mar 2026 12:00:31 +0000 (12:00 +0000)]
btrfs: turn extent_io_tree_panic() into a macro for better error reporting
When extent_io_tree_panic() is called we get a stack trace that is not
very useful, since the error message reports the location inside the
extent_io_tree_panic() function and not its caller. So turn
extent_io_tree_panic() into a macro, making the reported location point
at the call site instead.
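A sketch of such a macro; btrfs_panic() does take (fs_info, errno, fmt, ...), while the fs_info accessor and the message format here are assumptions:

/* A macro expands at the call site, so the location reported by
 * btrfs_panic() points at the caller, not at a shared helper. */
#define extent_io_tree_panic(tree, state, opname, err)                  \
        btrfs_panic(extent_io_tree_to_fs_info(tree), (err),             \
                    "extent io tree error: op=%s start=%llu end=%llu",  \
                    (opname), (state)->start, (state)->end)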
Filipe Manana [Tue, 10 Mar 2026 15:29:33 +0000 (15:29 +0000)]
btrfs: optimize clearing all bits from the last extent record in an io tree
When we are clearing all the bits from the last record that contains the
target range (i.e. the record starts before our target range and ends
beyond it), we are doing a lot of unnecessary work:
1) Allocating a prealloc state if we don't have one already;
2) Adjust that last record's start offset to the end of our range and
make the prealloc state have a range going from the original start
offset of that last record to the end offset of our target range and
the same bits as the last record. Then we insert the prealloc extent
in the rbtree - this is done in split_state();
3) Remove our prealloc state from the rbtree since all the bits were
cleared - this is done in clear_state_bit().
This is all wasted work, since we can simply trim the last record so
that its start offset is adjusted to the end of the target range. So
optimize for that case and avoid the prealloc state allocation, insertion
and deletion from the rbtree.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sun, 15 Mar 2026 21:38:24 +0000 (08:08 +1030)]
btrfs: remove atomic parameter from btrfs_buffer_uptodate()
That parameter was introduced by commit b9fab919b748 ("Btrfs: avoid
sleeping in verify_parent_transid while atomic").
At that time we needed to lock the extent buffer range inside the io tree
to avoid content changes, thus it could sleep.
But that behavior is no longer there, as later commit 9e2aff90fc2a
("btrfs: stop using lock_extent in btrfs_buffer_uptodate") dropped the
io tree lock.
We can remove the @atomic parameter safely now.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 14 Mar 2026 00:00:39 +0000 (10:30 +1030)]
btrfs: output more info when duplicated ordered extent is found
During development of a new feature, I triggered the btrfs_panic()
inside insert_ordered_extent() and spent quite some unnecessary time
before noticing I was passing incorrect flags when creating a new ordered
extent. Unfortunately the existing error message does not provide much
help.
Enhance the output to provide file offset, num bytes and flags of both
existing and new ordered extents.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 14 Mar 2026 00:00:38 +0000 (10:30 +1030)]
btrfs: check type flags in alloc_ordered_extent()
Unlike other flags used in btrfs, the BTRFS_ORDERED_* macros cannot be
used directly as flags: they are defined as bit numbers, thus they must
be used with bit operations, not directly with logical operations.
Unfortunately sometimes I forgot this, passed incorrect flags to
alloc_ordered_extent() and hit weird bugs.
Enhance the type checks in alloc_ordered_extent():
- Make sure there is one and only one bit set for exclusive type flags
There are four exclusive type flags, REGULAR, NOCOW, PREALLOC and
COMPRESSED.
So introduce a new macro, BTRFS_ORDERED_EXCLUSIVE_FLAGS, to cover
above flags.
Add an ASSERT() to check that one and only one of those exclusive
flags is set for alloc_ordered_extent() (see the sketch after this
list).
- Re-order the type bit numbers to the end of the enum
This makes it much harder for an invalid value to pass the check (a
false negative).
E.g., with the old code, where BTRFS_ORDERED_REGULAR starts at zero,
the following bit numbers would pass the bit uniqueness check when
passed as flags:
* BTRFS_ORDERED_NOCOW
  Treated as BTRFS_ORDERED_REGULAR (1 == 1UL << 0).
* BTRFS_ORDERED_PREALLOC
  Treated as BTRFS_ORDERED_NOCOW (2 == 1UL << 1).
* BTRFS_ORDERED_DIRECT
  Treated as BTRFS_ORDERED_PREALLOC (4 == 1UL << 2).
Now all those types start at 8, so passing any of those bit numbers
directly as flags will not pass the ASSERT().
- Add a static assert to avoid overflow
To make sure all BTRFS_ORDERED_* flags can fit into an unsigned long.
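A sketch of the resulting checks in alloc_ordered_extent(); is_power_of_2() gives "exactly one bit set", while the BTRFS_ORDERED_LAST name in the static assert is hypothetical:

#define BTRFS_ORDERED_EXCLUSIVE_FLAGS ((1UL << BTRFS_ORDERED_REGULAR) |   \
                                       (1UL << BTRFS_ORDERED_NOCOW) |     \
                                       (1UL << BTRFS_ORDERED_PREALLOC) |  \
                                       (1UL << BTRFS_ORDERED_COMPRESSED))

/* Every bit number must fit into the unsigned long flags word. */
static_assert(BTRFS_ORDERED_LAST < BITS_PER_LONG);      /* hypothetical */

/* Exactly one exclusive type must be set. */
ASSERT(is_power_of_2(flags & BTRFS_ORDERED_EXCLUSIVE_FLAGS));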
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
ZhengYuan Huang [Fri, 13 Mar 2026 09:19:23 +0000 (17:19 +0800)]
btrfs: revalidate cached tree blocks on the uptodate path
read_extent_buffer_pages_nowait() returns immediately when an extent
buffer is already marked uptodate. On that cache-hit path,
the caller-supplied btrfs_tree_parent_check is not re-run.
This can let read_tree_root_path() accept a cached tree block whose
actual header level/owner does not match the expected value derived from
the parent.
E.g. a corrupted root item can point to a tree block which doesn't even
belong to that root and has a mismatching level/owner. If that tree block
was already read and cached, a later read of the corrupted root item
from disk will hit the cached tree block.
Fix this by re-validating cached extent buffers against the supplied
btrfs_tree_parent_check on the uptodate path, and make
read_tree_root_path() pass its check to btrfs_buffer_uptodate().
This makes cache hits and fresh reads follow the same tree-parent
verification rules, and turns the corruption into a read failure instead
of constructing an inconsistent root object.
Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com>
[ Resolve the conflict with extent_buffer_uptodate() helper, handle
  transid mismatch case ]
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Philipp Hahn [Tue, 10 Mar 2026 11:48:28 +0000 (12:48 +0100)]
btrfs: prefer IS_ERR_OR_NULL() over manual NULL check
Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
check.
IS_ERR_OR_NULL() already uses likely(!ptr) internally. checkpatch does
not like nesting it:
> WARNING: nested (un)?likely() calls, IS_ERR_OR_NULL already uses
> unlikely() internally
Remove the explicit use of likely().
Change generated with coccinelle.
Signed-off-by: Philipp Hahn <phahn-oss@avm.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
ZhengYuan Huang [Tue, 10 Mar 2026 21:56:10 +0000 (08:26 +1030)]
btrfs: tree-checker: introduce checks for FREE_SPACE_BITMAP
Introduce checks for FREE_SPACE_BITMAP item, which include:
- Key alignment check
Same as FREE_SPACE_EXTENT, the objectid is the logical bytenr of the
free space, and offset is the length of the free space, so both
should be aligned to the fs block size.
- Non-zero range check
A zero key->offset would describe an empty bitmap, which is invalid.
- Item size check (see the sketch after this list)
The item must hold exactly DIV_ROUND_UP(key->offset >> sectorsize_bits,
BITS_PER_BYTE) bytes. A mismatch indicates a truncated or otherwise
corrupt bitmap item; without this check, the bitmap loading path would
walk past the end of the leaf and trigger a NULL dereference in
assert_eb_folio_uptodate().
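A sketch of the item size check; error reporting is simplified to a bare -EUCLEAN, while the real tree-checker goes through its reporting helpers:

u32 item_size = btrfs_item_size(leaf, slot);
u32 bitmap_size = DIV_ROUND_UP_ULL(key->offset >> fs_info->sectorsize_bits,
                                   BITS_PER_BYTE);

/* A mismatch means a truncated or corrupt bitmap item; loading it
 * would walk past the end of the leaf. */
if (unlikely(item_size != bitmap_size))
        return -EUCLEAN;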
Signed-off-by: ZhengYuan Huang <gality369@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 9 Mar 2026 22:19:26 +0000 (08:49 +1030)]
btrfs: tree-checker: introduce checks for FREE_SPACE_EXTENT
Introduce FREE_SPACE_EXTENT checks, which include:
- The key alignment check
The objectid is the logical bytenr of the free space, and offset is the
length of the free space, thus they should all be aligned to the fs
block size.
- The item size check
The FREE_SPACE_EXTENT item should have a size of zero.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 9 Mar 2026 22:19:25 +0000 (08:49 +1030)]
btrfs: tree-checker: introduce checks for FREE_SPACE_INFO
Introduce checks for FREE_SPACE_INFO item, which include:
- Key alignment check
The objectid is the logical bytenr of the chunk/bg, and offset is the
length of the chunk/bg, thus they should all be aligned to the fs
block size.
- Item size check
The FREE_SPACE_INFO item should have a fixed size.
- Flags check
The flags member should have no other flags than
BTRFS_FREE_SPACE_USING_BITMAPS.
For future expansion, introduce a new macro
BTRFS_FREE_SPACE_FLAGS_MASK for such checks.
And since we're here, BTRFS_FREE_SPACE_USING_BITMAPS should not be
defined as unsigned long long, as the flags member is only 32 bits
wide. So fix it to use unsigned long.
- Extent count check
That member shows how many free space bitmap/extent items there are
inside the chunk/bg.
We know the chunk size (from key->offset), thus there should be at
most (key->offset >> sectorsize_bits) blocks inside the chunk.
Use that value as the upper limit; if the counter is larger than
that, there is a high chance it's a bitflip in the high bits (see the
sketch after this list).
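A sketch of the flags and extent count checks; btrfs_free_space_flags() and btrfs_free_space_extent_count() follow the usual on-disk accessor pattern, but treat the exact names and error reporting as assumptions:

#define BTRFS_FREE_SPACE_FLAGS_MASK (BTRFS_FREE_SPACE_USING_BITMAPS)

fsi = btrfs_item_ptr(leaf, slot, struct btrfs_free_space_info);

/* No flag other than BTRFS_FREE_SPACE_USING_BITMAPS is valid. */
if (unlikely(btrfs_free_space_flags(leaf, fsi) &
             ~BTRFS_FREE_SPACE_FLAGS_MASK))
        return -EUCLEAN;

/* A chunk of key->offset bytes holds at most this many fs blocks, so
 * a larger count is very likely a bitflip in the high bits. */
if (unlikely(btrfs_free_space_extent_count(leaf, fsi) >
             key->offset >> fs_info->sectorsize_bits))
        return -EUCLEAN;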
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: limit number of zones reclaimed in flush_space()
Limit the number of zones reclaimed in flush_space()'s RECLAIM_ZONES
state.
This prevents possibly long-running reclaim sweeps from blocking other
tasks in the system, which is already under pressure anyway, causing
those tasks to hang.
An example of this can be seen here, triggered by fstests generic/551:
generic/551 [ 27.042349] run fstests generic/551 at 2026-02-27 11:05:30
BTRFS: device fsid 78c16e29-20d9-4c8e-bc04-7ba431be38ff devid 1 transid 8 /dev/vdb (254:16) scanned by mount (806)
BTRFS info (device vdb): first mount of filesystem 78c16e29-20d9-4c8e-bc04-7ba431be38ff
BTRFS info (device vdb): using crc32c checksum algorithm
BTRFS info (device vdb): host-managed zoned block device /dev/vdb, 64 zones of 268435456 bytes
BTRFS info (device vdb): zoned mode enabled with zone size 268435456
BTRFS info (device vdb): checking UUID tree
BTRFS info (device vdb): enabling free space tree
INFO: task kworker/u38:1:90 blocked for more than 120 seconds.
Not tainted 7.0.0-rc1+ #345
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u38:1 state:D stack:0 pid:90 tgid:90 ppid:2 task_flags:0x4208060 flags:0x00080000
Workqueue: events_unbound btrfs_async_reclaim_data_space
Call Trace:
<TASK>
__schedule+0x34f/0xe70
schedule+0x41/0x140
schedule_timeout+0xa3/0x110
? mark_held_locks+0x40/0x70
? lockdep_hardirqs_on_prepare+0xd8/0x1c0
? trace_hardirqs_on+0x18/0x100
? lockdep_hardirqs_on+0x84/0x130
? _raw_spin_unlock_irq+0x33/0x50
wait_for_completion+0xa4/0x150
? __flush_work+0x24c/0x550
__flush_work+0x339/0x550
? __pfx_wq_barrier_func+0x10/0x10
? wait_for_completion+0x39/0x150
flush_space+0x243/0x660
? find_held_lock+0x2b/0x80
? kvm_sched_clock_read+0x11/0x20
? local_clock_noinstr+0x17/0x110
? local_clock+0x15/0x30
? lock_release+0x1b7/0x4b0
do_async_reclaim_data_space+0xe8/0x160
btrfs_async_reclaim_data_space+0x19/0x30
process_one_work+0x20a/0x5f0
? lock_is_held_type+0xcd/0x130
worker_thread+0x1e2/0x3c0
? __pfx_worker_thread+0x10/0x10
kthread+0x103/0x150
? __pfx_kthread+0x10/0x10
ret_from_fork+0x20d/0x320
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Showing all locks held in the system:
1 lock held by khungtaskd/67:
#0: ffffffff824d58e0 (rcu_read_lock){....}-{1:3}, at: debug_show_all_locks+0x3d/0x194
2 locks held by kworker/u38:1/90:
#0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0
#1: ffffc90000c17e58 ((work_completion)(&fs_info->async_data_reclaim_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0
5 locks held by kworker/u39:1/191:
#0: ffff8881000aa158 ((wq_completion)events_unbound){+.+.}-{0:0}, at: process_one_work+0x3c4/0x5f0
#1: ffffc90000dfbe58 ((work_completion)(&fs_info->reclaim_bgs_work)){+.+.}-{0:0}, at: process_one_work+0x1c0/0x5f0
#2: ffff888101da0420 (sb_writers#9){.+.+}-{0:0}, at: process_one_work+0x20a/0x5f0
#3: ffff88811040a648 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: btrfs_reclaim_bgs_work+0x1de/0x770
#4: ffff888110408a18 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x95a/0x20f0
1 lock held by aio-dio-write-v/980:
#0: ffff888110093008 (&sb->s_type->i_mutex_key#15){++++}-{4:4}, at: btrfs_inode_lock+0x51/0xb0
=============================================
To prevent these long-running reclaims from blocking the system, only
reclaim 5 block groups in the RECLAIM_ZONES state of flush_space(). Also,
as these reclaims are now constrained, this opens up the use of a
synchronous call to btrfs_reclaim_block_groups(), eliminating the need
to place the reclaim task on a workqueue and then flush the workqueue
again.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Create a function btrfs_reclaim_block_groups() that gets called from the
block-group reclaim worker.
This allows introducing synchronous block group reclaim later on.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: move reclaiming of a single block group into its own function
The main work of reclaiming a single block-group in
btrfs_reclaim_bgs_work() is done inside the loop iterating over all the
block_groups in the fs_info->reclaim_bgs list.
Factor out reclaim of a single block group from the loop to improve
readability.
No functional change intended.
Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 4 Mar 2026 05:18:44 +0000 (15:48 +1030)]
btrfs: extract inlined creation into a dedicated delalloc helper
Currently we call cow_file_range_inline() in different situations, from
regular cow_file_range() to compress_file_range().
This is because inline extent creation has different conditions based on
whether it's a compressed one or not.
But on the other hand, inline extent creation shouldn't be so
scattered; we can just have a dedicated branch in
btrfs_run_delalloc_range().
This is most obvious for compressed inline cases: it makes no sense
to go through the whole complex async extent mechanism just to
inline a single block.
So here we introduce a dedicated run_delalloc_inline() helper, and
remove all inline related handling from cow_file_range() and
compress_file_range().
There is a special update to inode_need_compress(): a new
@check_inline parameter is introduced.
This allows inline-specific checks to be done inside
run_delalloc_inline(), which permits single block compression, while
all other call sites keep rejecting single block compression.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 3 Mar 2026 08:15:10 +0000 (18:45 +1030)]
btrfs: move the mapping_set_error() out of the loop in end_bbio_data_write()
Previously we had to call mapping_set_error() inside the
bio_for_each_folio_all() loop, because we did not have a better way to
grab the inode other than through folio->mapping.
But nowadays every btrfs_bio has its inode member populated, thus we can
grab the inode and its i_mapping easily, without the help of a folio.
Now we can move that mapping_set_error() out of the loop and use
bbio->inode to grab the i_mapping.
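A sketch of the result; btrfs_bio's inode member is a struct btrfs_inode, so the mapping comes from its embedded VFS inode, while the per-folio handling is elided:

static void end_bbio_data_write(struct btrfs_bio *bbio)
{
        int error = blk_status_to_errno(bbio->bio.bi_status);
        struct folio_iter fi;

        /* Set the mapping error once per bbio instead of per folio. */
        if (error)
                mapping_set_error(bbio->inode->vfs_inode.i_mapping, error);

        bio_for_each_folio_all(fi, &bbio->bio) {
                /* ... per-folio writeback completion, unchanged ... */
        }
}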
Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 3 Mar 2026 08:15:09 +0000 (18:45 +1030)]
btrfs: remove the alignment check in end_bbio_data_write()
The check is not necessary because:
- There is already assert_bbio_alignment() at btrfs_submit_bbio()
- There is also btrfs_subpage_assert() for all btrfs_folio_*() helpers
- The original commit mentions the check may go away in the future
Commit 17a5adccf3fd01 ("btrfs: do away with non-whole_page extent
I/O") introduced the check first, and in the commit message:
I've replaced the whole_page computations with warnings, just to be
sure that we're not issuing partial page reads or writes. The
warnings should probably just go away some time.
- No similar check in all other endio functions
No matter if it's data read, compressed read or write.
- There has been no such report for a very long time
I do not even remember if there was ever such a report.
Thus the justification for this check in end_bbio_data_write() is very
weak, and we can just get rid of it.
Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Leo Martins [Thu, 26 Feb 2026 09:51:08 +0000 (01:51 -0800)]
btrfs: add tracepoint for search slot restart tracking
Add a btrfs_search_slot_restart tracepoint that fires at each restart
site in btrfs_search_slot(), recording the root, tree level, and
reason for the restart. This enables tracking search slot restarts
which contribute to COW amplification under memory pressure.
The four restart reasons are:
- write_lock: insufficient write lock level, need to restart with
higher lock
- setup_nodes: node setup returned -EAGAIN
- slot_zero: insertion at slot 0 requires higher write lock level
- read_block: read_block_for_search returned -EAGAIN (block not
cached or lock contention)
COW counts are already tracked by the existing trace_btrfs_cow_block()
tracepoint. The per-restart-site tracepoint avoids counter overhead
in the critical path when tracepoints are disabled, and provides
richer per-event information that bpftrace scripts can aggregate into
counts, histograms, and per-root breakdowns.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
Leo Martins [Thu, 26 Feb 2026 09:51:07 +0000 (01:51 -0800)]
btrfs: inhibit extent buffer writeback to prevent COW amplification
Inhibit writeback on COW'd extent buffers for the lifetime of the
transaction handle, preventing background writeback from setting
BTRFS_HEADER_FLAG_WRITTEN and causing unnecessary re-COW.
COW amplification occurs when background writeback flushes an extent
buffer that a transaction handle is still actively modifying. When
lock_extent_buffer_for_io() transitions a buffer from dirty to
writeback, it sets BTRFS_HEADER_FLAG_WRITTEN, marking the block as
having been persisted to disk at its current bytenr. Once WRITTEN is
set, should_cow_block() must either COW the block again or overwrite
it in place, both of which are unnecessary overhead when the buffer
is still being modified by the same handle that allocated it. By
inhibiting background writeback on actively-used buffers, WRITTEN is
never set while a transaction handle holds a reference to the buffer,
avoiding this overhead entirely.
Add an atomic_t writeback_inhibitors counter to struct extent_buffer,
which fits in an existing 6-byte hole without increasing struct size.
When a buffer is COW'd in btrfs_force_cow_block(), call
btrfs_inhibit_eb_writeback() to store the eb in the transaction
handle's writeback_inhibited_ebs xarray (keyed by eb->start), take a
reference, and increment writeback_inhibitors. The function handles
dedup (same eb inhibited twice by the same handle) and replacement
(different eb at the same logical address). Allocation failure is
graceful: the buffer simply falls back to the pre-existing behavior
where it may be written back and re-COW'd.
Also inhibit writeback in should_cow_block() when COW is skipped,
so that every transaction handle that reuses an already-COW'd buffer
also inhibits its writeback. Without this, if handle A COWs a block
and inhibits it, and handle B later reuses the same block without
inhibiting, handle A's uninhibit on end_transaction leaves the buffer
unprotected while handle B is still using it. This ensures all handles
that access a COW'd buffer contribute to the inhibitor count, and the
buffer remains protected until the last handle releases it.
In lock_extent_buffer_for_io(), when writeback_inhibitors is non-zero
and the writeback mode is WB_SYNC_NONE, skip the buffer. WB_SYNC_NONE
is used by the VM flusher threads for background and periodic
writeback, which are the only paths that cause COW amplification by
opportunistically writing out dirty extent buffers mid-transaction.
Skipping these is safe because the buffers remain dirty in the page
cache and will be written out at transaction commit time.
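A sketch of the check in lock_extent_buffer_for_io(); writeback_inhibitors is the new atomic_t counter described above, while the function's exact signature and return convention are simplified:

/* Background (WB_SYNC_NONE) writeback must not touch inhibited
 * buffers: they stay dirty and are written at commit time.
 * WB_SYNC_ALL (fsync/log tree, commit) always proceeds. */
if (wbc->sync_mode == WB_SYNC_NONE &&
    atomic_read(&eb->writeback_inhibitors))
        return false;   /* skip this extent buffer */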
WB_SYNC_ALL must always proceed regardless of writeback_inhibitors.
This is required for correctness in the fsync path: btrfs_sync_log()
writes log tree blocks via filemap_fdatawrite_range() (WB_SYNC_ALL)
while the transaction handle that inhibited those same blocks is still
active. Without the WB_SYNC_ALL bypass, those inhibited log tree
blocks would be silently skipped, resulting in an incomplete log on
disk and corruption on replay. btrfs_write_and_wait_transaction()
also uses WB_SYNC_ALL via filemap_fdatawrite_range(); for that path,
inhibitors are already cleared beforehand, but the bypass ensures
correctness regardless.
Uninhibit in __btrfs_end_transaction() before atomic_dec(num_writers)
to prevent a race where the committer proceeds while buffers are still
inhibited. Also uninhibit in btrfs_commit_transaction() before writing
and in cleanup_transaction() for the error path.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Sun YangKai <sunk67188@gmail.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 25 Feb 2026 19:22:41 +0000 (19:22 +0000)]
btrfs: remove pointless error check in btrfs_check_dir_item_collision()
We're under the IS_ERR() branch, so we know that 'ret', which got
assigned the value of PTR_ERR(di), is always negative, and there's no
point in checking if it's negative.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 19 Feb 2026 16:05:39 +0000 (16:05 +0000)]
btrfs: remove duplicated uuid tree existence check in btrfs_uuid_tree_add()
There's no point in checking if the uuid root exists in
btrfs_uuid_tree_add(), since we already do it in btrfs_uuid_tree_lookup().
We can just remove the check from btrfs_uuid_tree_add() and make
btrfs_uuid_tree_lookup() return -EINVAL instead of -ENOENT in case the
uuid tree does not exist.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 24 Feb 2026 15:13:32 +0000 (15:13 +0000)]
btrfs: stop checking for -EEXIST return value from btrfs_uuid_tree_add()
We never return -EEXIST from btrfs_uuid_tree_add(), if the item already
exists we extend it, so it's pointless to check for such return value.
Furthermore, in create_pending_snapshot(), the logic is completely broken.
The goal was to not error out and abort the transaction in case of -EEXIST
but we left 'ret' with the -EEXIST value, so we end up setting
pending->error to -EEXIST and return that error up the call chain up to
btrfs_commit_transaction(), which will abort the transaction.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Commit 347b7042fb26 ("Merge patch series "fs: generic file IO error
reporting"") introduced a common framework for reporting errors to
fsnotify in a standard way.
One of the functions introduced is fserror_report_shutdown(). Combined
with the experimental support for shutdown in btrfs, it means that
user space can also easily detect whenever a btrfs filesystem has been
marked as shut down.
Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Commit 2932ba8d9c99 ("slab: Introduce kmalloc_obj() and family")
introduced, among many others, the kzalloc_objs() helper, which has some
benefits over kcalloc(). Namely, internal introspection of the allocated
type now becomes possible, allowing for future alignment-aware choices
to be made by the allocator, and for future hardening work that can be
type sensitive. Dropping the 'sizeof' also comes as a nice side effect.
Moreover, this also allows us to be in line with the recent tree-wide
migration to the kmalloc_obj() and family of helpers. See
commit 69050f8d6d07 ("treewide: Replace kmalloc with kmalloc_obj for
non-scalar types").
Reviewed-by: Kees Cook <kees@kernel.org> Signed-off-by: Miquel Sabaté Solà <mssola@mssola.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 19 Feb 2026 23:43:38 +0000 (10:13 +1030)]
btrfs: do compressed bio size roundup and zeroing in one go
Currently we zero out all the remaining bytes of the last folio of
the compressed bio, then round the bio size to fs block boundary.
But that is done in two different functions, zero_last_folio() to zero
the remaining bytes of the last folio, and round_up_last_block() to
round up the bio to fs block boundary.
There are some minor problems:
- zero_last_folio() is zeroing ranges we won't submit
This is mostly affecting block size < page size cases, where we can
have a large folio (e.g. 64K), but the fs block size is only 4K.
In that case, we may only want to submit the first 4K of the folio,
the remaining range won't matter, but we still zero them all.
This causes unnecessary CPU usage just to zero out some bytes we won't
utilized.
- compressed_bio_last_folio() is called twice in two different functions
In theory we only need to call it once.
Enhance the situation by:
- Only zero out bytes up to the fs block boundary
Thus this will reduce some overhead for bs < ps cases.
- Move the folio_zero_range() call into round_up_last_block()
So that we can reuse the same folio returned by
compressed_bio_last_folio().
Reviewed-by: Anand Jain <asj@kernel.org> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 20 Feb 2026 03:41:51 +0000 (14:11 +1030)]
btrfs: reduce the size of compressed_bio
The member compressed_bio::compressed_len can be replaced by the bio
size, as we always submit the full compressed data without any partial
read/write.
Furthermore we already have enough ASSERT()s making sure the bio size
matches the ordered extent or the extent map.
btrfs: remove redundant nowait check in lock_extent_direct()
The nowait flag is always false in this context, making the conditional
check unnecessary. Simplify the code by directly assigning -ENOTBLK.
Found by Linux Verification Center (linuxtesting.org) with SVACE.
Signed-off-by: Alexey Velichayshiy <a.velichayshiy@ispras.ru> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Mark Harmstone [Wed, 18 Feb 2026 12:41:00 +0000 (12:41 +0000)]
btrfs: fix placement of unlikely() in btrfs_insert_one_raid_extent()
Fix the unlikely() added to btrfs_insert_one_raid_extent() by commit
a929904cf73b65 ("btrfs: add unlikely annotations to branches leading to
transaction abort"): the exclamation point is in the wrong place, so we
are telling the compiler that allocation failure is actually expected.
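A before/after sketch of the fix; the pointer name is hypothetical, the placement of the exclamation point is what matters:

/* Wrong: hints to the compiler that allocation usually fails. */
if (!unlikely(io_stripe))
        return -ENOMEM;

/* Fixed: failure is the unlikely case. */
if (unlikely(!io_stripe))
        return -ENOMEM;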
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Mark Harmstone <mark@harmstone.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 18 Feb 2026 15:07:09 +0000 (15:07 +0000)]
btrfs: avoid unnecessary root node COW during snapshotting
There's no need to COW the root node of the subvolume we are snapshotting
because we then call btrfs_copy_root(), which creates a copy of the root
node and sets its generation to the current transaction. So remove this
redundant COW right before calling btrfs_copy_root(), saving one extent
allocation, memory allocation, copying things, etc, and making the code
less confusing. Also rename the extent buffer variable from "old" to
"root_eb" since that name no longer makes any sense after removing the
unnecessary COW operation.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Chen Guan Jie [Mon, 16 Feb 2026 22:16:32 +0000 (06:16 +0800)]
btrfs: check snapshot_force_cow earlier in can_nocow_file_extent()
When a snapshot is being created, the atomic counter snapshot_force_cow
is incremented to force incoming writes to fallback to COW. This is a
critical mechanism to protect the consistency of the snapshot being taken.
Currently, can_nocow_file_extent() checks this counter only after
performing several checks, most notably the expensive cross-reference
check via btrfs_cross_ref_exist(). btrfs_cross_ref_exist() releases the
path and performs a search in the extent tree or backref cache, which
involves btree traversals and locking overhead.
Move the snapshot_force_cow check to the very beginning of
can_nocow_file_extent().
This reordering is safe and beneficial because:
1. args->writeback_path is invariant for the duration of the call (set
by caller run_delalloc_nocow).
2. is_freespace_inode is a static property of the inode.
3. The state of snapshot_force_cow is driven by the btrfs_mksnapshot()
process. Checking it earlier does not change the outcome of the NOCOW
decision, but effectively prunes the expensive code path when a
fallback to COW is inevitable.
By failing fast when a snapshot is pending, we avoid the unnecessary
overhead of btrfs_cross_ref_exist() and other extent item checks in the
scenario where NOCOW is already known to be impossible.
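A sketch of the fail-fast check; args->writeback_path, is_freespace_inode and the atomic snapshot_force_cow counter come from the description, while the return convention is an assumption:

/* Fail fast: a pending snapshot forces a fallback to COW anyway, so
 * skip btrfs_cross_ref_exist() and the other extent item checks. */
if (args->writeback_path && !is_freespace_inode &&
    atomic_read(&root->snapshot_force_cow))
        return 0;       /* assumed convention: 0 means "must COW" */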
Signed-off-by: Chen Guan Jie <jk.chen1095@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Please note that this behavior existed even before commit 59615e2c1f63
("btrfs: reject single block sized compression early").
[CAUSE]
At compress_file_range(), after the btrfs_compress_folios() call, we try
making an inline extent by calling cow_file_range_inline().
But cow_file_range_inline() calls can_cow_file_range_inline(), which has
more accurate checks on whether the range can be inlined.
One of the user configurable conditions is the "max_inline=" mount
option. If that value is set low (like the example, 4 bytes, which
cannot store any header), or the compressed content is just slightly
larger than 2K (the default value, meaning a 50% compression ratio),
cow_file_range_inline() will return 1 immediately.
And since we're here only to try inlining the compressed data, the range
is no larger than a single fs block.
Thus compression is never going to be a win there, and we unavoidably
fall back to marking the inode incompressible.
[FIX]
Just add an extra check after the inline attempt, so that if the inline
attempt failed we do not set the nocompress flag, as there is no way to
remove that flag later, and the default 50% compression ratio is way too
strict for the whole inode.
CC: stable@vger.kernel.org # 6.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 12 Feb 2026 09:13:56 +0000 (19:43 +1030)]
btrfs: remove folio parameter from ordered io related functions
Both functions btrfs_finish_ordered_extent() and
btrfs_mark_ordered_io_finished() are accepting an optional folio
parameter.
That @folio is passed into can_finish_ordered_extent(), which later will
test and clear the ordered flag for the involved range.
However I do not think there is any other call site that can clear the
ordered flag of a page cache folio and affect
can_finish_ordered_extent().
There are only a few *_clear_ordered() callers outside of the
can_finish_ordered_extent() function:
- btrfs_migrate_folio()
This is completely unrelated, it's just migrating the ordered flag to
the new folio.
- btrfs_cleanup_ordered_extents()
We manually clean the ordered flags of all involved folios, then call
btrfs_mark_ordered_io_finished() without a @folio parameter.
So it doesn't need and didn't pass a @folio parameter in the first
place.
- btrfs_writepage_fixup_worker()
This function is going to be removed soon, and we should not hit that
function anymore.
- btrfs_invalidate_folio()
This is the real call site we need to bother with.
If we already have a bio running, btrfs_finish_ordered_extent() in
end_bbio_data_write() will be executed first, as
btrfs_invalidate_folio() will wait for the writeback to finish.
Thus if there is a running bio, btrfs_invalidate_folio() will not see
the ordered flag on the range and will just skip to the next range.
If there is no bio running, it means the ordered extent was created but
the folio is not yet submitted.
In that case btrfs_invalidate_folio() will manually clear the folio
ordered range, but then manually finish the ordered extent with
btrfs_dec_test_ordered_pending(), without touching the folio ordered
flags.
Meaning an OE range whose folio ordered flags are set will be finished
manually, without the need to call can_finish_ordered_extent().
This means all can_finish_ordered_extent() call sites should get a range
that has the folio ordered flag set, thus the old "return false" branch
should never be triggered.
Now we can:
- Remove the @folio parameter from involved functions
* btrfs_mark_ordered_io_finished()
* btrfs_finish_ordered_extent()
For call sites passing a @folio into those functions, let them
manually clear the ordered flag of involved folios.
- Move btrfs_finish_ordered_extent() out of the loop in
end_bbio_data_write()
We only need to call btrfs_finish_ordered_extent() once per bbio,
not per folio.
- Add an ASSERT() to make sure all folio ranges have ordered flags
It's only for end_bbio_data_write().
And we already have enough safety nets to catch over-accounting of
ordered extents.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 11 Feb 2026 21:14:21 +0000 (07:44 +1030)]
btrfs: remove out-of-date comments in btree_writepages()
There is a lengthy comment introduced in commit b3ff8f1d380e ("btrfs:
Don't submit any btree write bio if the fs has errors") and commit c9583ada8cc4 ("btrfs: avoid double clean up when submit_one_bio()
failed"), explaining two things:
- Why we don't want to submit metadata write if the fs has errors
- Why we re-set @ret to 0 if it's positive
However it is no longer up to date, for the following reasons:
- We have better checks nowadays
Commit 2618849f31e7 ("btrfs: ensure no dirty metadata is written back
for an fs with errors") has introduced better checks, that if the
fs is in an error state, metadata writes will not result in any bio
but instead complete immediately.
That covers all metadata writes better.
- Mentioned incorrect function name
The commit c9583ada8cc4 ("btrfs: avoid double clean up when
submit_one_bio() failed") introduced this ret > 0 handling, but at that
time the function name submit_extent_page() was already incorrect.
It was submit_eb_page() that could return >0 at that time,
and submit_extent_page() could only return 0 or <0 for errors, never >0.
Later commit b35397d1d325 ("btrfs: convert submit_extent_page() to use
a folio") changed "submit_extent_page()" to "submit_extent_folio()" in
the comment, but it doesn't make any difference since the function name
is wrong from day 1.
Finally commit 5e121ae687b8 ("btrfs: use buffer xarray for extent
buffer writeback operations") completely reworked how metadata
writeback works, and removed submit_eb_page(), leaving only the wrong
function name in the comment.
Furthermore the function submit_extent_folio() still exists in the
latest code base, but is never utilized for metadata writeback, causing
more confusion.
Just remove the lengthy comment, and replace the "if (ret > 0)" check
with an ASSERT(), since only btrfs_check_meta_write_pointer() can modify
@ret and it returns 0 or <0 for errors.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sun, 8 Feb 2026 18:42:13 +0000 (18:42 +0000)]
btrfs: remove bogus root search condition in load_extent_tree_free()
There's no need to pass the maximum between the block group's start offset
and BTRFS_SUPER_INFO_OFFSET (64K) since we can't have any block groups
allocated in the first megabyte, as that's reserved space. Furthermore,
even if we could, the correct thing to do was to pass the block group's
start offset anyway - and that's precisely what we do for block groups
that happen to contain a superblock mirror (the range for the super block
is never marked as free and it's marked as dirty in the
fs_info->excluded_extents io tree).
So simplify this and get rid of that maximum expression.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Chen Ni [Wed, 11 Feb 2026 04:40:32 +0000 (12:40 +0800)]
btrfs: remove duplicate include of delayed-inode.h in disk-io.c
Remove duplicate inclusion of delayed-inode.h in disk-io.c to clean up
redundant code.
Signed-off-by: Chen Ni <nichen@iscas.ac.cn> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 10 Feb 2026 12:18:50 +0000 (12:18 +0000)]
btrfs: pass literal booleans to functions that take boolean arguments
We have several functions with parameters defined as booleans but then we
have callers passing integers, 0 or 1, instead of false and true. While
this isn't a bug since 0 and 1 are converted to false and true, it is odd
and less readable. Change the callers to pass true and false literals
instead.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 10 Feb 2026 03:24:30 +0000 (13:54 +1030)]
btrfs: make add_pending_csums() take an ordered extent as parameter
The structure btrfs_ordered_extent has a lot of list heads for different
purposes, and passing a bare list_head pointer is never a good idea: if
the wrong list is passed in, the type casting goes wrong and the fs gets
screwed up.
Instead pass the btrfs_ordered_extent pointer, and grab the csum_list
inside add_pending_csums() to make it a little safer.
Since we're here, also update the comments to follow the current style.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 10 Feb 2026 03:24:29 +0000 (13:54 +1030)]
btrfs: rename btrfs_ordered_extent::list to csum_list
That list head records all pending checksums for that ordered
extent. And unlike other lists, we just use the name "list", which can
be very confusing for readers.
Rename it to "csum_list" which follows the remaining lists, showing the
purpose of the list.
And since we're here, remove a comment inside
btrfs_finish_ordered_zoned() where we have
"ASSERT(!list_empty(&ordered->csum_list))" to make sure the OE has
pending csums.
That comment is only here to make sure we do not call list_first_entry()
before checking BTRFS_ORDERED_PREALLOC.
But since we already have that bit checked and even have a dedicated
ASSERT(), there is no need for that comment anymore.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: change return type of cache_save_setup to void
None of the callers of `cache_save_setup` care about the return value,
as the space cache is purely an optimization.
Also, the free space cache is a deprecated feature that is being phased
out.
Change the return type of `cache_save_setup` to void to reflect this.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 9 Feb 2026 19:45:06 +0000 (19:45 +0000)]
btrfs: avoid starting new transaction and commit in relocate_block_group()
We join a transaction with the goal of catching the current transaction
and then commit it to get rid of pinned extents and reclaim free space,
but a join can create a new transaction if there isn't any running, and if
right before we did the join the current transaction happened to be
committed by someone else (like the transaction kthread for example),
we end up starting and committing a new transaction, causing rotation of
the super block backup roots as well as extra and useless IO.
So instead of doing a transaction join followed by a commit, use the
helper btrfs_commit_current_transaction() which ensures no transaction is
created if there isn't any running.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 9 Feb 2026 10:54:27 +0000 (10:54 +0000)]
btrfs: remove redundant extent_buffer_uptodate() checks after read_tree_block()
We have several places that call extent_buffer_uptodate() after reading a
tree block with read_tree_block(), but that is redundant since we already
call extent_buffer_uptodate() in the call chain of read_tree_block():
read_tree_block()
  btrfs_read_extent_buffer()
    read_extent_buffer_pages()
      returns -EIO if extent_buffer_uptodate() returns false
So remove those redundant checks.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 9 Feb 2026 10:44:16 +0000 (10:44 +0000)]
btrfs: use the helper extent_buffer_uptodate() everywhere
Instead of open coding testing the uptodate bit on the extent buffer's
flags, use the existing helper extent_buffer_uptodate() (which is even
shorter to type). Also change the helper's return value from int to bool,
since we always use it in a boolean context.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: add zone reclaim flush state for DATA space_info
On zoned block devices, DATA block groups can accumulate large amounts
of zone_unusable space (space between the write pointer and zone end).
When zone_unusable reaches high levels (e.g., 98% of total space), new
allocations fail with ENOSPC even though space could be reclaimed by
relocating data and resetting zones.
The existing flush states don't handle this scenario effectively - they
either try to free cached space (which doesn't exist for zone_unusable)
or reset empty zones (which doesn't help when zones contain valid data
mixed with zone_unusable space).
Add a new RECLAIM_ZONES flush state that triggers the block group
reclaim machinery. This state:
- Calls btrfs_reclaim_sweep() to identify reclaimable block groups
- Calls btrfs_reclaim_bgs() to queue reclaim work
- Waits for reclaim_bgs_work to complete via flush_work()
- Commits the transaction to finalize changes
The reclaim work (btrfs_reclaim_bgs_work) safely relocates valid data
from fragmented block groups to other locations before resetting zones,
converting zone_unusable space back into usable space.
Insert RECLAIM_ZONES before RESET_ZONES in data_flush_states so that
we attempt to reclaim partially-used block groups before falling back
to resetting completely empty ones.
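A hedged sketch of what the new state does in the flushing state machine
(the exact placement and transaction handling are illustrative):
  case RECLAIM_ZONES:
          /* Queue reclaim of block groups that are mostly zone_unusable. */
          btrfs_reclaim_sweep(fs_info);
          btrfs_reclaim_bgs(fs_info);
          /* Wait for the relocation work to finish... */
          flush_work(&fs_info->reclaim_bgs_work);
          /* ...and commit so the reset zones become allocatable again. */
          ret = btrfs_commit_current_transaction(fs_info->tree_root);
          break;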
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: move partially zone_unusable block groups to reclaim list
On zoned block devices, block groups accumulate zone_unusable space
(space between the write pointer and zone end that cannot be allocated
until the zone is reset). When a block group becomes mostly
zone_unusable but still contains some valid data, and it gets added to
the unused_bgs list, it can never be deleted because it's not actually
empty.
The deletion code (btrfs_delete_unused_bgs) skips these block groups
due to the btrfs_is_block_group_used() check, leaving them on the
unused_bgs list indefinitely. This causes two problems:
1. The block groups are never reclaimed, permanently wasting space
2. Eventually leads to ENOSPC even though reclaimable space exists
Fix by detecting block groups where zone_unusable exceeds 50% of the
block group size. Move these to the reclaim_bgs list instead of
skipping them. This triggers btrfs_reclaim_bgs_work() which:
1. Marks the block group read-only
2. Relocates the remaining valid data via btrfs_relocate_chunk()
3. Removes the emptied block group
4. Resets the zones, converting zone_unusable back to usable space
The 50% threshold ensures we only reclaim block groups where most space
is unusable, making relocation worthwhile. Block groups with less
zone_unusable are left on unused_bgs to potentially become fully empty
through normal deletion.
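The detection itself is simple; a minimal sketch (exact code may
differ):
  /* In btrfs_delete_unused_bgs(), instead of just skipping used BGs: */
  spin_lock(&block_group->lock);
  reclaim = btrfs_is_zoned(fs_info) &&
            block_group->zone_unusable > block_group->length / 2;
  spin_unlock(&block_group->lock);
  if (reclaim) {
          /* Hand it to the reclaim machinery instead of leaving it. */
          btrfs_mark_bg_to_reclaim(block_group);
          continue;
  }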
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: cap delayed refs metadata reservation to avoid overcommit
On zoned filesystems metadata space accounting can become overly optimistic
due to delayed refs reservations growing without a hard upper bound.
The delayed_refs_rsv block reservation is allowed to speculatively grow and
is only backed by actual metadata space when refilled. On zoned devices this
can result in delayed_refs_rsv reserving a large portion of metadata space
that is already effectively unusable due to zone write pointer constraints.
As a result, space_info->may_use can grow far beyond the usable metadata
capacity, causing the allocator to believe space is available when it is not.
This leads to premature ENOSPC failures and "cannot satisfy tickets" reports
even though commits would be able to make progress by flushing delayed refs.
Analysis of "-o enospc_debug" dumps using a Python debug script
confirmed that delayed_refs_rsv was responsible for the majority of
metadata overcommit on zoned devices. By correlating space_info counters
(total, used, may_use, zone_unusable) across transactions, the analysis
showed that may_use continued to grow even after usable metadata space
was exhausted, with delayed refs refills accounting for the excess
reservations.
Here's the output of the analysis:
  ======================================================================
  Space Type: METADATA
  ======================================================================
  Percentages:
    Zone Unusable:  0.07% of total
    May Use:       99.80% of total
Fix this by adding a zoned-specific cap in btrfs_delayed_refs_rsv_refill():
Before reserving additional metadata bytes, limit the delayed refs
reservation based on the usable metadata space (total bytes minus
zone_unusable). If the reservation would exceed this cap, return -EAGAIN
to trigger the existing flush/commit logic instead of overcommitting
metadata space.
This preserves the existing reservation and flushing semantics while
preventing metadata overcommit on zoned devices. The change is limited to
metadata space and does not affect non-zoned filesystems.
This patch addresses premature metadata ENOSPC conditions on zoned devices
and ensures delayed refs are throttled before exhausting usable metadata.
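A minimal sketch of the cap (field usage follows the description above;
exact code may differ):
  /* In btrfs_delayed_refs_rsv_refill(), before reserving num_bytes: */
  if (btrfs_is_zoned(fs_info)) {
          struct btrfs_block_rsv *rsv = &fs_info->delayed_refs_rsv;
          struct btrfs_space_info *sinfo = rsv->space_info;
          u64 usable = sinfo->total_bytes - sinfo->bytes_zone_unusable;

          /* Let the existing flush/commit logic make progress instead. */
          if (rsv->size + num_bytes > usable)
                  return -EAGAIN;
  }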
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 9 Feb 2026 00:21:09 +0000 (10:51 +1030)]
btrfs: fix the inline compressed extent check in inode_need_compress()
[BUG]
Since commit 59615e2c1f63 ("btrfs: reject single block sized compression
early"), the following script results in the inode getting the
NOCOMPRESS flag, while old kernels don't set it:
This makes a lot of files no longer eligible for compression.
[CAUSE]
The old compressed inline check looks like this:
  if (total_compressed <= blocksize &&
      (start > 0 || end + 1 < inode->disk_i_size))
          goto cleanup_and_bail_uncompressed;
That inline part check is equal to "!(start == 0 && end + 1 >=
inode->disk_i_size)", but the new check no longer has that disk_i_size
check.
Thus it means any single block sized write at file offset 0 will pass
the inline check, which is wrong.
Furthermore, since we have merged the old check into
inode_need_compress(), there is no disk_i_size based inline check
anymore, so we always try compressing that single block at file offset
0, only to later find out it's not a net win and jump to the
mark_incompressible tag.
This results in the inode getting the NOCOMPRESS flag.
[FIX]
Add back the missing disk_i_size based check into inode_need_compress().
Now the same script will no longer cause NOCOMPRESS flag.
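The restored check looks roughly like this ('actual_end' is a stand-in
for however the range length is computed in inode_need_compress()):
  /*
   * Single block sized writes are only worth compressing when the
   * result can become an inline extent, i.e. the range starts at
   * offset 0 and covers at least the current disk_i_size.
   */
  if (actual_end - start <= blocksize &&
      (start > 0 || end + 1 < inode->disk_i_size))
          return false;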
Fixes: 59615e2c1f63 ("btrfs: reject single block sized compression early") Reported-by: Chris Mason <clm@meta.com> Link: https://lore.kernel.org/linux-btrfs/20260208183840.975975-1-clm@meta.com/ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 20:02:18 +0000 (20:02 +0000)]
btrfs: set written super flag once in write_all_supers()
In case we have multiple devices, there is no point in setting the written
flag in the super block on every iteration over the device list. Just do
it once before the loop.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 19:45:14 +0000 (19:45 +0000)]
btrfs: remove max_mirrors argument from write_all_supers()
There's no need to pass max_mirrors to write_all_supers() since from the
given transaction handle we can infer if we are in a transaction commit
or fsync context, so we can determine how many mirrors we need to use.
So remove the max_mirrors argument from write_all_supers() and stop
adjusting it in the callees write_dev_supers() and wait_dev_supers(),
simplifying them as well as write_all_supers()'s parameter list.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 17:47:33 +0000 (17:47 +0000)]
btrfs: tag error branches as unlikely during super block writes
Mark all the unexpected error checks as unlikely, to make it more clear
they are unexpected and to allow the compiler to potentially generate
better code.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 17:31:58 +0000 (17:31 +0000)]
btrfs: abort transaction on error in write_all_supers()
We are in a transaction context and have a handle, so instead of using
the less preferred btrfs_handle_fs_error(), abort the transaction and
log an error to give some context information.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 17:23:20 +0000 (17:23 +0000)]
btrfs: pass transaction handle to write_all_supers()
In every context we call write_all_supers() we are holding a transaction,
so pass the transaction handle instead of fs_info to it. This will allow
us to abort the transaction in write_all_supers() instead of calling
btrfs_handle_fs_error() in a later patch.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 16:11:42 +0000 (16:11 +0000)]
btrfs: mark all error and warning checks as unlikely in btrfs_validate_super()
When validating a super block, either when mounting or every time we write
a super block to disk, we do many checks for errors and warnings and we
don't expect to hit any. So mark each one as unlikely to reflect that and
allow the compiler to potentially generate better code.
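The pattern is the usual unlikely() annotation; an illustrative check
(the exact message and check are a sketch):
  if (unlikely(btrfs_super_nodesize(sb) < 4096)) {
          btrfs_err(fs_info, "nodesize too small: %u < 4096",
                    btrfs_super_nodesize(sb));
          ret = -EINVAL;
  }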
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 2 Feb 2026 16:47:02 +0000 (16:47 +0000)]
btrfs: update comment for BTRFS_RESERVE_NO_FLUSH
The comment is incomplete as BTRFS_RESERVE_NO_FLUSH is used for more
reasons than currently holding a transaction handle open. Update the
comment with all the other reasons and give some details.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 2 Feb 2026 16:35:01 +0000 (16:35 +0000)]
btrfs: don't allow log trees to consume global reserve or overcommit metadata
For an fsync we never reserve space in advance: we just start a
transaction without reserving space and use an empty block reserve for
the log tree.
We reserve space as we need it while updating a log tree, and we end up
in btrfs_use_block_rsv() when reserving space for the allocation of a
log tree extent buffer, where we first attempt to reserve without flushing,
and if that fails we attempt to consume from the global reserve or
overcommit metadata. This makes us consume space that may be the last
resort for a transaction commit to succeed, therefore increasing the
chances for a transaction abort with -ENOSPC.
So make btrfs_use_block_rsv() fail if we can't reserve metadata space for
a log tree extent buffer allocation without flushing, making the fsync
fallback to a transaction commit and avoid using critical space that could
be the only resort for a transaction commit to succeed when we are in a
critical space situation.
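Conceptually the change looks like this (a sketch; 'using_log_rsv' is a
placeholder for however the log tree's block reserve is identified):
  /* In btrfs_use_block_rsv(), after the initial reservation attempt: */
  ret = btrfs_reserve_metadata_bytes(fs_info, block_rsv->space_info,
                                     blocksize, BTRFS_RESERVE_NO_FLUSH);
  if (ret && using_log_rsv) {
          /*
           * Don't dip into the global reserve or overcommit for log
           * tree blocks; failing here makes the fsync fall back to a
           * full transaction commit.
           */
          return ERR_PTR(ret);
  }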
Reviewed-by: Leo Martins <loemra.dev@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 2 Feb 2026 15:12:29 +0000 (15:12 +0000)]
btrfs: be less aggressive with metadata overcommit when we can do full flushing
Over the years we often get reports of some -ENOSPC failure while updating
metadata that leads to a transaction abort. I have seen this happen for
filesystems of all sizes and with workloads that are very user/customer
specific and impossible to reproduce, but Aleksandar recently reported a
simple way to reproduce this with a 1G filesystem and using the bonnie++
benchmark tool. The following test script reproduces the failure:
$ cat test.sh
#!/bin/bash
# Create and use a 1G null block device, memory backed, otherwise
# the test takes a very long time.
modprobe null_blk nr_devices="0"
null_dev="/sys/kernel/config/nullb/nullb0"
mkdir "$null_dev"
size=$((1 * 1024)) # in MB
echo 2 > "$null_dev/submit_queues"
echo "$size" > "$null_dev/size"
echo 1 > "$null_dev/memory_backed"
echo 1 > "$null_dev/discard"
echo 1 > "$null_dev/power"
When running this, bonnie++ fails in the phase where it deletes the test
directories and files:
$ ./test.sh
(...)
Using uid:0, gid:0.
Writing a byte at a time...done
Writing intelligently...done
Rewriting...done
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...Can't sync directory, turning off dir-sync.
Can't delete file 9Bq7sr0000000338
Cleaning up test directory after error.
Bonnie: drastic I/O error (rmdir): Read-only file system
And in the syslog/dmesg we can see the following transaction abort trace:
[161915.501506] BTRFS warning (device nullb0): Skipping commit of aborted transaction.
[161915.502983] ------------[ cut here ]------------
[161915.503832] BTRFS: Transaction aborted (error -28)
[161915.504748] WARNING: fs/btrfs/transaction.c:2045 at btrfs_commit_transaction+0xa21/0xd30 [btrfs], CPU#11: bonnie++/3377975
[161915.506786] Modules linked in: btrfs dm_zero dm_snapshot (...)
[161915.518759] CPU: 11 UID: 0 PID: 3377975 Comm: bonnie++ Tainted: G W 6.19.0-rc7-btrfs-next-224+ #4 PREEMPT(full)
[161915.520857] Tainted: [W]=WARN
[161915.521405] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[161915.523414] RIP: 0010:btrfs_commit_transaction+0xa24/0xd30 [btrfs]
[161915.524630] Code: 48 8b 7c 24 (...)
[161915.526982] RSP: 0018:ffffd3fe8206fda8 EFLAGS: 00010292
[161915.527707] RAX: 0000000000000002 RBX: ffff8f4886d3c000 RCX: 0000000000000000
[161915.528723] RDX: 0000000002040001 RSI: 00000000ffffffe4 RDI: ffffffffc088f780
[161915.529691] RBP: ffff8f4f5adae7e0 R08: 0000000000000000 R09: ffffd3fe8206fb90
[161915.530842] R10: ffff8f4f9c1fffa8 R11: 0000000000000003 R12: 00000000ffffffe4
[161915.532027] R13: ffff8f4ef2cf2400 R14: ffff8f4f5adae708 R15: ffff8f4f62d18000
[161915.533229] FS: 00007ff93112a780(0000) GS:ffff8f4ff63ee000(0000) knlGS:0000000000000000
[161915.534611] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[161915.535575] CR2: 00005571b3072000 CR3: 0000000176080005 CR4: 0000000000370ef0
[161915.536758] Call Trace:
[161915.537185] <TASK>
[161915.537575] btrfs_sync_file+0x431/0x530 [btrfs]
[161915.538473] do_fsync+0x39/0x80
[161915.539042] __x64_sys_fsync+0xf/0x20
[161915.539750] do_syscall_64+0x50/0xf20
[161915.540396] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[161915.541301] RIP: 0033:0x7ff930ca49ee
[161915.541904] Code: 08 0f 85 f5 (...)
[161915.544830] RSP: 002b:00007ffd94291f38 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[161915.546152] RAX: ffffffffffffffda RBX: 00007ff93112a780 RCX: 00007ff930ca49ee
[161915.547263] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
[161915.548383] RBP: 0000000000000dab R08: 0000000000000000 R09: 0000000000000000
[161915.549853] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd94291fb0
[161915.551196] R13: 00007ffd94292350 R14: 0000000000000001 R15: 00007ffd94292340
[161915.552161] </TASK>
[161915.552457] ---[ end trace 0000000000000000 ]---
[161915.553232] BTRFS info (device nullb0 state A): dumping space info:
[161915.553236] BTRFS info (device nullb0 state A): space_info DATA (sub-group id 0) has 12582912 free, is not full
[161915.553239] BTRFS info (device nullb0 state A): space_info total=12582912, used=0, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553243] BTRFS info (device nullb0 state A): space_info METADATA (sub-group id 0) has -5767168 free, is full
[161915.553245] BTRFS info (device nullb0 state A): space_info total=53673984, used=6635520, pinned=46956544, reserved=16384, may_use=5767168, readonly=65536 zone_unusable=0
[161915.553251] BTRFS info (device nullb0 state A): space_info SYSTEM (sub-group id 0) has 8355840 free, is not full
[161915.553254] BTRFS info (device nullb0 state A): space_info total=8388608, used=16384, pinned=16384, reserved=0, may_use=0, readonly=0 zone_unusable=0
[161915.553257] BTRFS info (device nullb0 state A): global_block_rsv: size 5767168 reserved 5767168
[161915.553261] BTRFS info (device nullb0 state A): trans_block_rsv: size 0 reserved 0
[161915.553263] BTRFS info (device nullb0 state A): chunk_block_rsv: size 0 reserved 0
[161915.553265] BTRFS info (device nullb0 state A): remap_block_rsv: size 0 reserved 0
[161915.553268] BTRFS info (device nullb0 state A): delayed_block_rsv: size 0 reserved 0
[161915.553270] BTRFS info (device nullb0 state A): delayed_refs_rsv: size 0 reserved 0
[161915.553272] BTRFS: error (device nullb0 state A) in cleanup_transaction:2045: errno=-28 No space left
[161915.554463] BTRFS info (device nullb0 state EA): forced readonly
The problem is that we allow for a very aggressive metadata overcommit,
about 1/8th of the currently available space, even when the task
attempting the reservation allows for full flushing. Over time this allows
more and more tasks to overcommit without getting a transaction commit to
release pinned extents, all joining the same transaction and eventually
leading to a transaction abort when attempting some tree update, as the extent
allocator is not able to find any available metadata extent and it's not
able to allocate a new metadata block group either (not enough unallocated
space for that).
Fix this by allowing the overcommit to be up to 1/64th of the available
(unallocated) space instead and for that limit to apply to both types of
full flushing, BTRFS_RESERVE_FLUSH_ALL and BTRFS_RESERVE_FLUSH_ALL_STEAL.
This way we get more frequent transaction commits to release pinned
extents in case our caller is in a context where full flushing is allowed.
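In terms of code, the overcommit calculation changes roughly like this
(sketch of calc_available_free_space()):
  /* Before: avail >>= 3, i.e. up to 1/8th, for full flushing. */
  if (flush == BTRFS_RESERVE_FLUSH_ALL ||
      flush == BTRFS_RESERVE_FLUSH_ALL_STEAL)
          avail >>= 6;    /* now at most 1/64th of unallocated space */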
Note that the space infos dump in the dmesg/syslog right after the
transaction abort gives the wrong idea that we have plenty of unallocated
space when the abort happened. During the bonnie++ workload we had a
metadata chunk allocation attempt and it failed with -ENOSPC because at
that time we had a bunch of data block groups allocated, which then became
empty and got deleted by the cleaner kthread after the metadata chunk
allocation failed with -ENOSPC and before the transaction abort happened
and dumped the space infos.
The custom tracing (some trace_printk() calls spread in strategic places)
was used to confirm that:
536870912 is 512M, the standard 256M metadata chunk size times 2 because
of the DUP profile for metadata.
'max_avail' is what find_free_dev_extent() returns to us in
gather_device_info().
As a result, gather_device_info() sets ctl->ndevs to 0, making
decide_stripe_size() fail with -ENOSPC, and therefore metadata chunk
allocation fails while we are attempting to run delayed items during
the transaction commit.
In the syslog/dmesg pasted above, which happened after the transaction was
aborted, the space info dumps did not account for all these data block
groups that were allocated during bonnie++'s workload. And that is because
after the metadata chunk allocation failed with -ENOSPC and before the
transaction abort happened, most of the data block groups had become empty
and got deleted by the cleaner kthread - when the abort happened, we
had bonnie++ in the middle of deleting the files it created.
But by dumping the space infos right after the metadata chunk allocation
fails, through adding a call to btrfs_dump_space_info_for_trans_abort()
in decide_stripe_size() when it returns -ENOSPC, we get:
[29972.409295] BTRFS info (device nullb0): dumping space info:
[29972.409300] BTRFS info (device nullb0): space_info DATA (sub-group id 0) has 673341440 free, is not full
[29972.409303] BTRFS info (device nullb0): space_info total=948568064, used=0, pinned=275226624, reserved=0, may_use=0, readonly=0 zone_unusable=0
[29972.409305] BTRFS info (device nullb0): space_info METADATA (sub-group id 0) has 3915776 free, is not full
[29972.409306] BTRFS info (device nullb0): space_info total=53673984, used=163840, pinned=42827776, reserved=147456, may_use=6553600, readonly=65536 zone_unusable=0
[29972.409308] BTRFS info (device nullb0): space_info SYSTEM (sub-group id 0) has 7979008 free, is not full
[29972.409310] BTRFS info (device nullb0): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=393216, readonly=0 zone_unusable=0
[29972.409311] BTRFS info (device nullb0): global_block_rsv: size 5767168 reserved 5767168
[29972.409313] BTRFS info (device nullb0): trans_block_rsv: size 0 reserved 0
[29972.409314] BTRFS info (device nullb0): chunk_block_rsv: size 393216 reserved 393216
[29972.409315] BTRFS info (device nullb0): remap_block_rsv: size 0 reserved 0
[29972.409316] BTRFS info (device nullb0): delayed_block_rsv: size 0 reserved 0
So here we see there's ~904.6M of data space, ~51.2M of metadata space and
8M of system space, making a total of 963.8M.
As can_overcommit() simply takes the unallocated space and applies the
profile factor to calculate the allocatable metadata chunk size, it
results in 25.5GiB of available space.
But in reality we can only allocate a single 1GiB RAID1 chunk; the
remaining 49GiB on devid 2 can never be utilized to fulfill a RAID1
chunk.
This leads to various ENOSPC related transaction aborts and flips the fs
read-only.
Now use the per-profile available space in calc_available_free_space(),
and only when that fails do we fall back to the old factor based
estimation.
For zoned devices, or in the very unlikely case of a temporary memory
allocation failure, we still fall back to the factor based estimation,
but hopefully that is rare in reality.
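The fallback structure, as a sketch (the helper's exact signature and
the factor based fallback name are assumptions based on the description
above):
  u64 avail;

  /* Prefer the device layout aware, per-profile estimation. */
  if (btrfs_get_per_profile_avail(fs_info, profile, &avail))
          return avail;

  /*
   * Fall back to the factor based estimation (zoned devices, or the
   * rare case where the estimation was marked unreliable).
   */
  avail = factor_based_estimation(fs_info, profile);  /* hypothetical */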
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 4 Feb 2026 02:54:07 +0000 (13:24 +1030)]
btrfs: update per-profile available estimation
The estimation is updated at the following times:
- Chunk allocation
- Chunk removal
- After mount
- New device
- Device removal
- Device shrink
- Device enlarge
And since the function btrfs_update_per_profile_avail() will not return
an error, this doesn't add any new error handling paths.
However, when btrfs_update_per_profile_avail() fails internally (only
-ENOSPC is possible) it marks the per-profile available estimation as
unreliable, so later btrfs_get_per_profile_avail() will return false and
require the caller to have a fallback solution.
The function btrfs_update_per_profile_avail() is executed with the
chunk_mutex held, thus it will slightly slow down the involved
functions, but not by a lot.
As all the core workload is just various u64 calculations inside a loop,
without any tree search, the overhead should be acceptable even for all
9 supported profiles.
With 4 disks (which exercises all 9 profiles), the execution time of the
function is still less than 10 us.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Wed, 4 Feb 2026 02:54:06 +0000 (13:24 +1030)]
btrfs: introduce the device layout aware per-profile available space
[BUG]
There is a long known bug that if metadata is using RAID1 on two disks
with unbalanced sizes, there is a very high chance to hit ENOSPC related
transaction abort.
[CAUSE]
The root cause is in the available space estimation code:
- Factor based calculation
Just use all the unallocated space and divide it by the profile factor.
One obvious user is can_overcommit().
Say the two disks have 1GiB and 50GiB unallocated: with the factor based
estimation we can use (1GiB + 50GiB) / 2 = 25.5GiB of free space for
metadata.
Thus we can continue allocating metadata (over-commit) way beyond the
1GiB limit.
But this estimation is completely wrong: in reality we can only allocate
one single 1GiB RAID1 block group, thus if we continue to over-commit,
at some point we will hit ENOSPC in some critical path and flip the fs
read-only.
[SOLUTION]
This patch introduces per-profile available space estimation, which
provides chunk-allocator like behavior to give a (mostly) accurate
result, with some under-estimating corner cases.
There are some differences between the estimation and real chunk
allocator:
- No consideration on hole size
It's fine for most cases, as all data/metadata stripes are 1GiB in size,
thus there should not be any hole wasting much space.
And chunk allocator is able to use smaller stripes when there is
really no other choice.
Although in theory this means it can lead to some over-estimation, it
should not cause too much hassle in the real world.
The other benefit of such behavior is that we avoid dev-extent tree
searches completely, thus the overhead is very small.
- No true balance for certain cases
If we have 3 disks RAID1, and each device has 2GiB unallocated space,
we can load balance the chunk allocation so that we can allocate 3GiB
RAID1 chunks, and that's what chunk allocator will do.
But the current estimation code uses the largest available space to do
a single allocation, meaning the estimation will be 2GiB, thus an
under-estimate.
Such under-estimation is fine, and after the first chunk allocation the
estimation will be updated and still give a correct 2GiB
estimation.
So this only means the estimation will be a little conservative, which
is safer for call sites like metadata over-commit check.
With that facility, for above 1GiB + 50GiB case, it will give a RAID1
estimation of 1GiB, instead of the incorrect 25.5GiB.
Or for a more complex example:
  devid 1 unallocated: 1T
  devid 2 unallocated: 1T
  devid 3 unallocated: 10T
We will get an array of:
  RAID10:  2T
  RAID1:   2T
  RAID1C3: 1T
  RAID1C4: 0 (not enough devices)
  DUP:     6T
  RAID0:   3T
  SINGLE:  12T
  RAID5:   2T
  RAID6:   1T
[IMPLEMENTATION]
For each profile, we do a chunk-allocator-level calculation; the pseudo
code looks like this:
  clear_virtual_used_space_of_all_rw_devices();
  do {
          /*
           * The same as the chunk allocator: besides used space, we
           * also take virtual used space into consideration.
           */
          sort_device_with_virtual_free_space();

          /*
           * Unlike the chunk allocator, we don't need to bother with
           * hole/stripe size, so we use the smallest device to make
           * sure we can allocate as many stripes as the regular chunk
           * allocator.
           */
          stripe_size = device_with_smallest_free->avail_space;
          stripe_size = min(stripe_size, to_alloc / ndevs);

          /*
           * Allocate a virtual chunk. An allocated virtual chunk
           * increases the virtual used space, allowing the next
           * iteration to properly emulate the chunk allocator's
           * behavior.
           */
          ret = alloc_virtual_chunk(stripe_size, &allocated_size);
          if (ret == 0)
                  avail += allocated_size;
  } while (ret == 0);
This minimal available space based calculation is not perfect, but the
important part is that the estimation never exceeds the real available
space.
This patch just introduces the infrastructure, no hooks are executed
yet.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Jiasheng Jiang [Wed, 14 Jan 2026 14:44:50 +0000 (14:44 +0000)]
btrfs: zoned: remove redundant space_info lock and variable in do_allocation_zoned()
In do_allocation_zoned(), the code acquires space_info->lock before
block_group->lock. However, the critical section does not access or
modify any members of the space_info structure. Thus, the lock is
redundant as it provides no necessary synchronization here.
This change simplifies the locking logic and aligns the function with
other zoned paths, such as __btrfs_add_free_space_zoned(), which only
rely on block_group->lock. Since the 'space_info' local variable is
no longer used after removing the lock calls, it is also removed.
Removing this unnecessary lock reduces contention on the global
space_info lock, improving concurrency in the zoned allocation path.
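In outline, the change is:
  /* Before: a global lock taken around a purely local critical section. */
  spin_lock(&space_info->lock);
  spin_lock(&block_group->lock);
  /* ... only block_group members are read/written here ... */
  spin_unlock(&block_group->lock);
  spin_unlock(&space_info->lock);

  /* After: */
  spin_lock(&block_group->lock);
  /* ... */
  spin_unlock(&block_group->lock);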
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 15:22:54 +0000 (15:22 +0000)]
btrfs: move min sys chunk array size check to validate_sys_chunk_array()
We check the minimum size of the sys chunk array in btrfs_validate_super()
but we have a better place for that, the helper validate_sys_chunk_array()
which we use for every other sys chunk array check. So move it there, also
converting the return error from -EINVAL to -EUCLEAN, which is a better
fit and also consistent with the other checks.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 3 Feb 2026 15:14:01 +0000 (15:14 +0000)]
btrfs: remove duplicate system chunk array max size overflow check
We check it twice, once in validate_sys_chunk_array() and then again in
its caller, btrfs_validate_super(), right after it calls
validate_sys_chunk_array(). So remove the duplicated check from
btrfs_validate_super().
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 2 Feb 2026 16:50:18 +0000 (16:50 +0000)]
btrfs: pass boolean literals as the last argument to inc_block_group_ro()
The last argument of inc_block_group_ro() is defined as a boolean, but
every caller is passing an integer literal, 0 or 1 for false and true
respectively. While this is not incorrect, as 0 and 1 are converted to
false and true, it's less readable and somewhat awkward since the
argument is defined as boolean. Replace 0 and 1 with false and true.
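For example:
  ret = inc_block_group_ro(cache, 1);     /* before */
  ret = inc_block_group_ro(cache, true);  /* after */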
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Wed, 4 Feb 2026 07:31:25 +0000 (16:31 +0900)]
btrfs: tests: zoned: add test cases for zoned code
Add a test function for the zoned code; for now it tests
btrfs_load_block_group_by_raid_type() with various test cases. The
load_zone_info_tests[] array defines the test cases.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Merge tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Paul Walmsley:
- Fix a CONFIG_SPARSEMEM crash on RV32 by avoiding early phys_to_page()
- Prevent runtime const infrastructure from being used by modules,
similar to what was done for x86
- Avoid problems when shutting down ACPI systems with IOMMUs by adding
a device dependency between IOMMU and devices that use it
- Fix a bug where the CPU pointer masking state isn't properly reset
when tagged addresses aren't enabled for a task
- Fix some incorrect register assignments, and add some missing ones,
in kgdb support code
- Fix compilation of non-kernel code that uses the ptrace uapi header
by replacing BIT() with _BITUL()
- Fix compilation of the validate_v_ptrace kselftest by working around
kselftest macro expansion issues
* tag 'riscv-for-linus-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
ACPI: RIMT: Add dependency between iommu and devices
selftests: riscv: Add braces around EXPECT_EQ()
riscv: use _BITUL macro rather than BIT() in ptrace uapi and kselftests
riscv: Reset pmm when PR_TAGGED_ADDR_ENABLE is not set
riscv: make runtime const not usable by modules
riscv: patch: Avoid early phys_to_page()
riscv: kgdb: fix several debug register assignment bugs
Merge tag 'x86-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
* tag 'x86-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/platform/geode: Fix on-stack property data use-after-return bug
x86/kexec: Disable KCOV instrumentation after load_segments()
Merge tag 'perf-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Ingo Molnar:
- Fix potential bad container_of() in intel_pmu_hw_config() (Ian
Rogers)
* tag 'perf-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86: Fix potential bad container_of in intel_pmu_hw_config
Merge tag 'irq-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fix from Ingo Molnar:
- Fix RISC-V APLIC irqchip driver setup errors on ACPI systems (Jessica
Liu)
* tag 'irq-urgent-2026-04-05' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/riscv-aplic: Restrict genpd notifier to device tree only
i915: don't use a vma that didn't match the context VM
In eb_lookup_vma(), the code checks that the context VM matches before
incrementing the i915 vma usage count, but in the non-matching case it
didn't clear the non-matching vma pointer, so that pointer would then
mistakenly be returned, causing potential use-after-free and refcount
issues.
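The essence of the fix, as a hedged sketch (the lookup details are
illustrative):
  vma = lookup_cached_vma(eb, obj);       /* hypothetical cache lookup */
  if (vma && unlikely(vma->vm != eb->context->vm)) {
          /*
           * Previously the code only skipped the usage count bump but
           * kept the stale pointer around; forget the vma entirely so
           * a fresh one is looked up instead.
           */
          vma = NULL;
  }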
Merge tag 'mips-fixes_7.0_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux
Pull MIPS fixes from Thomas Bogendoerfer:
- Fix TLB uniquification for systems with TLB not initialised by
firmware
- Fix allocation in TLB uniquification
- Fix SiByte cache initialisation
- Check uart parameters from firmware on Loongson64 systems
- Fix clock id mismatch for Ralink SoCs
- Fix GCC version check for __multi3 workaround
* tag 'mips-fixes_7.0_1' of git://git.kernel.org/pub/scm/linux/kernel/git/mips/linux:
mips: mm: Allocate tlb_vpn array atomically
MIPS: mm: Rewrite TLB uniquification for the hidden bit feature
MIPS: mm: Suppress TLB uniquification on EHINV hardware
MIPS: Always record SEGBITS in cpu_data.vmbits
MIPS: Fix the GCC version check for `__multi3' workaround
MIPS: SiByte: Bring back cache initialisation
mips: ralink: update CPU clock index
MIPS: Loongson64: env: Check UARTs passed by LEFI cautiously
Merge tag 'char-misc-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc/iio driver fixes from Greg KH:
"Here are a relativly large number of small char/misc/iio and other
driver fixes for 7.0-rc7. There's a bunch, but overall they are all
small fixes for issues that people have been having that I finally
caught up with getting merged due to delays on my end.
The "largest" change overall is just some documentation updates to the
security-bugs.rst file to hopefully tell the AI tools (and any users
that actually read the documentation), how to send us better security
bug reports as the quantity of reports these past few weeks has
increased dramatically due to tools getting better at "finding"
things.
Included in here are:
- lots of small IIO driver fixes for issues reported in 7.0-rc
- gpib driver fixes
- comedi driver fixes
- interconnect driver fix
- nvmem driver fixes
- mei driver fix
- counter driver fix
- binder rust driver fixes
- some other small misc driver fixes
All of these have been in linux-next this week with no reported issues"
* tag 'char-misc-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (63 commits)
Documentation: fix two typos in latest update to the security report howto
Documentation: clarify the mandatory and desirable info for security reports
Documentation: explain how to find maintainers addresses for security reports
Documentation: minor updates to the security contacts
.get_maintainer.ignore: add myself
nvmem: zynqmp_nvmem: Fix buffer size in DMA and memcpy
nvmem: imx: assign nvmem_cell_info::raw_len
misc: fastrpc: check qcom_scm_assign_mem() return in rpmsg_probe
misc: fastrpc: possible double-free of cctx->remote_heap
comedi: dt2815: add hardware detection to prevent crash
comedi: runflags cannot determine whether to reclaim chanlist
comedi: Reinit dev->spinlock between attachments to low-level drivers
comedi: me_daq: Fix potential overrun of firmware buffer
comedi: me4000: Fix potential overrun of firmware buffer
comedi: ni_atmio16d: Fix invalid clean-up after failed attach
gpib: fix use-after-free in IO ioctl handlers
gpib: lpvo_usb: fix memory leak on disconnect
gpib: Fix fluke driver s390 compile issue
lis3lv02d: Omit IRQF_ONESHOT if no threaded handler is provided
lis3lv02d: fix kernel-doc warnings
...
Merge tag 'tty-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty fixes from Greg KH:
"Here are two small tty vt fixes for 7.0-rc7 to resolve some reported
issues with the resize ability of the alt screen buffer. Both of these
have been in linux-next all week with no reported issues"
* tag 'tty-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
vt: resize saved unicode buffer on alt screen exit after resize
vt: discard stale unicode buffer on alt screen exit after resize
Merge tag 'usb-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/Thunderbolt fixes from Greg KH:
"Here are a bunch of USB and Thunderbolt fixes (most all are USB) for
7.0-rc7. More than I normally like this late in the release cycle,
partly due to my recent travels, and partly due to people banging away
on the USB gadget interfaces and apis more than normal (big shoutout
to Android for getting the vendors to actually work upstream on this,
that's a huge win overall for everyone here)
Included in here are:
- Small thunderbolt fix
- new USB serial driver ids added
- typec driver fixes
- gadget driver fixes for some disconnect issues
- other usb gadget driver fixes for reported problems with binding
and unbinding devices as happens when a gadget device connects /
disconnects from a system it is plugged into (or it switches device
mode at a user's request, these things are complex little
beasts...)
- usb offload fixes (where USB audio tunnels through the controller
while the main CPU is asleep) for when EMP spikes hit the system
causing disconnects to happen (as often happens with static
electricity in the winter months). This has been much reported by
at least one vendor, and resolves the issues they have been seeing
with this codepath. Can't wait for the "formal methods are the
answer!" people to try to model that one properly...
- Other small usb driver fixes for issues reported.
All of these have been in linux-next this week, and before, with no
reported issues, and I've personally been stressing these harder than
normal on my systems here with no problems"
* tag 'usb-7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (39 commits)
usb: gadget: f_hid: move list and spinlock inits from bind to alloc
usb: host: xhci-sideband: delegate offload_usage tracking to class drivers
usb: core: use dedicated spinlock for offload state
usb: cdns3: gadget: fix state inconsistency on gadget init failure
usb: dwc3: imx8mp: fix memory leak on probe failure path
usb: gadget: f_uac1_legacy: validate control request size
usb: ulpi: fix double free in ulpi_register_interface() error path
usb: misc: usbio: Fix URB memory leak on submit failure
USB: core: add NO_LPM quirk for Razer Kiyo Pro webcam
usb: cdns3: gadget: fix NULL pointer dereference in ep_queue
usb: core: phy: avoid double use of 'usb3-phy'
USB: serial: option: add MeiG Smart SRM825WN
usb: gadget: f_rndis: Fix net_device lifecycle with device_move
usb: gadget: f_subset: Fix net_device lifecycle with device_move
usb: gadget: f_eem: Fix net_device lifecycle with device_move
usb: gadget: f_ecm: Fix net_device lifecycle with device_move
usb: gadget: u_ncm: Add kernel-doc comments for struct f_ncm_opts
usb: gadget: f_rndis: Protect RNDIS options with mutex
usb: gadget: f_subset: Fix unbalanced refcnt in geth_free
dt-bindings: connector: add pd-disable dependency
...