Lijuan Li [Thu, 29 Feb 2024 08:30:07 +0000 (16:30 +0800)]
btrfs: mark btrfs_put_caching_control() static
btrfs_put_caching_control() is only used in block-group.c, so mark it
static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Chengming Zhou [Sat, 24 Feb 2024 13:47:09 +0000 (13:47 +0000)]
btrfs: remove SLAB_MEM_SPREAD flag use
The SLAB_MEM_SPREAD flag used to be implemented in SLAB, which was
removed as of v6.8-rc1, so it became a dead flag since the commit 16a1d968358a ("mm/slab: remove mm/slab.c and slab_def.h"). And the
series[1] went on to mark it obsolete to avoid confusion for users.
Here we can just remove all its users, which has no functional change.
Qu Wenruo [Fri, 23 Feb 2024 07:43:38 +0000 (18:13 +1030)]
btrfs: qgroup: always free reserved space for extent records
[BUG]
If qgroup is marked inconsistent (e.g. caused by operations needing full
subtree rescan, like creating a snapshot and assigning it to a higher
level qgroup), btrfs would immediately start leaking its data reserved
space.
# This snapshot creation would mark qgroup inconsistent,
# as the ownership involves a different higher level qgroup, thus
# we have to rescan both source and snapshot, which can be very
# time consuming, thus here btrfs just chooses to mark qgroup
# inconsistent, and lets users determine when to do the rescan.
btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
# Now this write would lead to qgroup rsv leak.
xfs_io -f -c "pwrite 0 64k" $mnt/file1
# And at unmount time, btrfs would report 64K DATA rsv space leaked.
umount $mnt
And we would have the following dmesg output for the unmount:
BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
[CAUSE]
Since commit e15e9f43c7ca ("btrfs: introduce
BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
we introduce a mode for btrfs qgroup to skip the time consuming
backref walk, if the qgroup is already inconsistent.
But this skip also covered the data reserved freeing, thus the qgroup
reserved space for each newly created data extent would not be freed,
thus causing the leak.
[FIX]
Make the data extent reserved space freeing mandatory.
The qgroup reserved space handling is way cheaper compared to the
backref walking part, and we always have the super sensitive leak
detector, thus it's definitely worth always freeing the qgroup
reserved data space.
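A minimal Python model of the leak and the fix described above, not kernel
code. The names Qgroup, account_extents, leaked_after_write and the
always_free switch are hypothetical and exist only for this illustration.
The numbers mirror the reproducer: one 64K data extent written after the
qgroup was marked inconsistent.

```python
from dataclasses import dataclass


@dataclass
class Qgroup:
    """Toy stand-in for a qgroup; only the data reservation is modelled."""
    reserved: int = 0            # bytes of reserved data space
    inconsistent: bool = False   # models the "skip accounting" runtime state
    backref_walks: int = 0       # how many expensive walks were performed


def account_extents(qg, extent_sizes, always_free):
    """Process newly created data extent records.

    always_free=False models the buggy behaviour: the early skip for an
    inconsistent qgroup also skips releasing the reservation.
    always_free=True models the fix: the cheap release always happens and
    only the expensive backref walk is skipped.
    """
    for size in extent_sizes:
        if qg.inconsistent and not always_free:
            continue                  # skips the walk AND the release
        if not qg.inconsistent:
            qg.backref_walks += 1     # expensive part, still skippable
        qg.reserved -= size           # cheap part: release the reservation


def leaked_after_write(always_free):
    """Return the reservation left over after one 64K write is accounted."""
    qg = Qgroup(inconsistent=True)
    qg.reserved += 64 * 1024          # reservation taken by "pwrite 0 64k"
    account_extents(qg, [64 * 1024], always_free)
    return qg.reserved
```

In this model leaked_after_write(False) leaves 65536 bytes reserved, the
same amount as in the unmount warning, while leaked_after_write(True)
leaves nothing behind and still performs no backref walk.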
[ANALYZE]
The root cause is still unclear, but there are some clues already:
- Unaligned eb bytenr
The block bytenr is 8550954455682405139, which is not even aligned to
2.
This bytenr is fetched from extent buffer header, not from eb->start.
This means, at the initial time of read, eb header bytenr is still
correct (the very basic check needed to continue the read), but later
something went wrong and at least the first page got corrupted.
Thus we got such an obviously incorrect value.
- Invalid extent buffer header owner
The read itself is triggered for subvolume 256, but the eb header
owner is 11858205567642294356, which is not really possible.
The problem here is, subvolume id is limited to ((1 << 48) - 1),
and this one definitely goes beyond that limit.
So this value is also garbage.
We already got two garbage values from an extent buffer, which passed the
initial bytenr and csum checks, but later the contents became garbage at
some point.
This looks like a page lifespan problem (e.g. we didn't properly hold the
page).
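The two clues above can be checked with plain integer arithmetic. The
sketch below is illustrative Python, not tree-checker code. The helper
names are hypothetical, and only the subvolume id rule quoted in the text
is modelled. Shift binds looser than subtraction in both C and Python, so
the limit needs explicit parentheses.

```python
SUBVOL_ID_BITS = 48  # the limit quoted in the text

# Parentheses are required: 1 << 48 - 1 would be parsed as 1 << 47.
MAX_SUBVOL_ID = (1 << SUBVOL_ID_BITS) - 1


def is_aligned(bytenr, alignment):
    """True if bytenr is a multiple of alignment (a power of two)."""
    return (bytenr & (alignment - 1)) == 0


def owner_is_plausible(owner):
    """True if owner fits in the subvolume id range."""
    return owner <= MAX_SUBVOL_ID


# The two values reported in the analysis.
bad_bytenr = 8550954455682405139
bad_owner = 11858205567642294356
```

With these definitions is_aligned(bad_bytenr, 2) and
owner_is_plausible(bad_owner) are both false, while the subvolume that
triggered the read (256) passes the owner check.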
[ENHANCEMENT]
The current tree-checker only outputs things from the extent buffer,
nothing about the page status.
So this patch would enhance the tree-checker output by also dumping the
first page, which would look like this:
From the dump we can see some extra info that can help us do
extra cross-checks:
- Page refcount
if it's too low, it definitely means something bad.
- Page aops
Any mapped eb page should have btree_aops with inode number 1.
- Page index
Since a mapped eb page should have its bytenr matching the page
position, (index << PAGE_SHIFT) should match the bytenr from the
critical line.
- Page Private flags
A mapped eb page should have Private flag set to indicate it's managed
by btrfs.
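The four cross-checks listed above could be applied to a dump roughly as
follows. This is a hedged Python sketch: PageInfo and cross_check are
hypothetical names, PAGE_SHIFT = 12 assumes 4K pages, and the refcount
threshold is an assumed parameter because the text only says "too low".

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # assumption: 4K pages


@dataclass
class PageInfo:
    """Hypothetical summary of the fields a page dump would show."""
    refcount: int
    index: int
    is_btree_aops: bool
    inode_number: int
    private: bool


def cross_check(page, eb_bytenr, min_refcount=1):
    """Return the names of the failed checks, empty if all of them pass."""
    problems = []
    if page.refcount < min_refcount:          # assumed threshold
        problems.append("refcount")
    if not page.is_btree_aops or page.inode_number != 1:
        problems.append("aops")
    if (page.index << PAGE_SHIFT) != eb_bytenr:
        problems.append("index")
    if not page.private:
        problems.append("private")
    return problems
```

A page whose index is the bytenr shifted right by PAGE_SHIFT, with btree
aops on inode 1 and the Private flag set, yields an empty list. Comparing
any page index against the garbage bytenr from the analysis reports
"index".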
Qu Wenruo [Thu, 22 Feb 2024 03:30:25 +0000 (14:00 +1030)]
btrfs: compression: remove dead comments in btrfs_compress_heuristic()
Since commit a440d48c7f93 ("Btrfs: heuristic: implement sampling
logic"), btrfs_compress_heuristic() is no longer a simple "return true",
but more complex to determine if we should compress.
Thus the comment is dead and can be confusing, just remove it.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Sat, 17 Feb 2024 06:29:49 +0000 (16:59 +1030)]
btrfs: subpage: make reader lock utilize bitmap
Currently btrfs_subpage utilizes its atomic member @reader to manage the
reader counter. However it is only utilized to prevent the page from being
released/unlocked when we still have reads underway.
In that use case, we don't really allow multiple readers on the same
subpage sector. So here we can introduce a new locked bitmap to
represent exactly which subpage range is locked for read.
In theory we can remove btrfs_subpage::reader as it's just the number of
set bits of the new locked bitmap. But unfortunately bitmap doesn't provide
such a handy API yet, so we still keep the reader counter.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
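For the subpage reader-lock change above, the relation between the locked
bitmap and the reader counter can be modelled in a few lines of Python.
This is a toy model, not the kernel implementation: SubpageLock and its
methods are hypothetical names, and it is single-threaded, so nothing is
atomic.

```python
class SubpageLock:
    """Toy model of a per-page read-lock bitmap, one bit per sector."""

    def __init__(self, sectors_per_page):
        self.sectors_per_page = sectors_per_page
        self.locked = 0    # bit i set means sector i is locked for read
        self.readers = 0   # kept alongside the bitmap, as in the text

    def _mask(self, start, nr):
        if start < 0 or nr <= 0 or start + nr > self.sectors_per_page:
            raise ValueError("range outside the page")
        return ((1 << nr) - 1) << start

    def start_reader(self, start, nr):
        """Lock nr sectors from start; a sector cannot have two readers."""
        mask = self._mask(start, nr)
        if self.locked & mask:
            raise RuntimeError("sector already locked for read")
        self.locked |= mask
        self.readers += nr

    def end_reader(self, start, nr):
        """Unlock the range; return True when no readers are left."""
        mask = self._mask(start, nr)
        if (self.locked & mask) != mask:
            raise RuntimeError("range was not locked")
        self.locked &= ~mask
        self.readers -= nr
        return self.readers == 0

    def popcount(self):
        """Number of set bits in the locked bitmap."""
        return bin(self.locked).count("1")
```

As long as every lock and unlock goes through these two methods, readers
always equals popcount(), which is why the counter is redundant in theory.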
Qu Wenruo [Sat, 17 Feb 2024 06:29:48 +0000 (16:59 +1030)]
btrfs: unexport btrfs_subpage_start_writer() and btrfs_subpage_end_and_test_writer()
Both functions were introduced in commit 1e1de38792e0 ("btrfs: make
process_one_page() to handle subpage locking"), but they have never
been utilized out of subpage code. So just unexport them.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Feb 2024 08:56:17 +0000 (09:56 +0100)]
btrfs: merge btrfs_del_delalloc_inode() helpers
The helpers btrfs_del_delalloc_inode() and __btrfs_del_delalloc_inode()
don't follow the pattern where the "__" helper handles a special case and
are in fact reversed regarding the naming. We can merge them into one as
there's only one place that needs to be open coded.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 22 Feb 2024 08:35:54 +0000 (09:35 +0100)]
btrfs: handle transaction commit errors in flush_reservations()
Other errors in flush_reservations() are handled, both there and in the
caller. Ignoring the commit error might make some sense, as the commit is
called right after join only to poke the whole commit machinery into
freeing space.
However for consistency return the error. The caller
btrfs_quota_disable() would try to start the transaction which would
in turn fail too so there's no effective change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Kunwu Chan [Tue, 20 Feb 2024 09:06:44 +0000 (17:06 +0800)]
btrfs: use KMEM_CACHE() to create delayed ref caches
Use the KMEM_CACHE() macro instead of kmem_cache_create() to simplify
the creation of SLAB caches related to delayed refs when the default
values are used.
Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 16 Feb 2024 13:27:28 +0000 (14:27 +0100)]
btrfs: uninline some static inline helpers from delayed-ref.h
The helpers do initialization or release work, none of which is so
performance critical that it would require a static inline, so move
them to the .c file.
David Sterba [Fri, 16 Feb 2024 13:03:08 +0000 (14:03 +0100)]
btrfs: uninline some static inline helpers from tree-log.h
The helpers do initialization or release work, none of which is so
performance critical that it would require a static inline, so move
them to the .c file.
David Sterba [Fri, 16 Feb 2024 12:59:06 +0000 (13:59 +0100)]
btrfs: drop static inline specifiers from tree-mod-log.c
Using static inline in a .c file should be justified, e.g. when
functions are on a hot path but none of the affected functions seem to
be. As it's all in one compilation unit let the compiler decide.
David Sterba [Fri, 16 Feb 2024 12:36:13 +0000 (13:36 +0100)]
btrfs: uninline some static inline helpers from backref.h
There are many helpers doing simple things but not simple enough to
justify the static inline. None of them seems to be on a hot path so
move them to .c.
David Sterba [Fri, 16 Feb 2024 14:53:25 +0000 (15:53 +0100)]
btrfs: open code btrfs_backref_get_eb()
The helper is trivial, we can inline it. It's safe to remove the 'if' as
the iterator is always valid when used, the potential NULL was never
checked anyway.
Naohiro Aota [Mon, 5 Feb 2024 13:01:16 +0000 (22:01 +0900)]
btrfs: introduce offload_csum_mode to tweak checksum offloading behavior
We disable offloading checksum to workqueues and do it synchronously when
the checksum algorithm is fast. However, as reported in the link below,
RAID0 with multiple devices may suffer from the sync checksum, because
"fast checksum" is still not fast enough to catch up with RAID0 writing.
We don't have an effective way to determine whether to offload or not,
for now add a sysfs knob so this can be debugged. This is intentionally
under CONFIG_BTRFS_DEBUG so it's not exposed to users as it may be
removed again in the future.
Introduce fs_devices->offload_csum_mode, so that a btrfs developer can
change the behavior by writing to /sys/fs/btrfs/<uuid>/offload_csum. The
default is "auto" which is the same as the previous behavior. Or, you
can set "on" or "off" (or "y" or "n", or whatever else kstrtobool()
accepts) to always/never offload checksum.
More benchmarks need to be collected with this knob to implement proper
criteria to enable/disable checksum offloading.
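The three modes can be summarized with a small Python model. It is a
sketch under stated assumptions, not the sysfs handler: all function names
are hypothetical, parse_bool is only a simplified approximation of
kstrtobool() (first one or two characters), and the "auto" rule is reduced
to the single condition named in the text (offload unless the checksum
algorithm is fast).

```python
def parse_bool(text):
    """Simplified approximation of kstrtobool()."""
    s = text.strip()
    if not s:
        raise ValueError("empty value")
    first = s[0].lower()
    if first in ("y", "t", "1"):
        return True
    if first in ("n", "f", "0"):
        return False
    if first == "o" and len(s) > 1:
        second = s[1].lower()
        if second == "n":      # "on"
            return True
        if second == "f":      # "off"
            return False
    raise ValueError("not a boolean: %r" % text)


def parse_offload_csum_mode(text):
    """Map a value written to the knob to 'auto', 'on' or 'off'."""
    if text.strip() == "auto":
        return "auto"
    return "on" if parse_bool(text) else "off"


def should_offload(mode, csum_is_fast):
    """Decide whether checksumming goes to a workqueue."""
    if mode == "on":
        return True
    if mode == "off":
        return False
    return not csum_is_fast    # "auto": the previous behavior
```

With mode "auto" and a fast checksum the model computes checksums
synchronously, which is the case the RAID0 report is about. Writing "on"
forces offloading regardless of the algorithm.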
Qu Wenruo [Fri, 26 Jan 2024 03:21:32 +0000 (13:51 +1030)]
btrfs: raid56: extra debugging for raid6 syndrome generation
[BUG]
I have got at least two crash reports for RAID6 syndrome generation, no
matter if it's AVX2 or SSE2, they all seem to have a similar
calltrace with corrupted RAX:
[CAUSE]
The cause is not known. Recently I also hit this in AVX512 path, and
that's even in v5.15 backport, which doesn't have any of my RAID56
rework.
Filipe Manana [Mon, 19 Feb 2024 12:51:25 +0000 (12:51 +0000)]
btrfs: avoid unnecessary ref initialization when freeing log tree block
At btrfs_free_tree_block(), we are always initializing a delayed reference
to drop the given extent buffer but we only use it if it does not belong to a
log root tree. So we are doing unnecessary work here and increasing the
duration of a critical section as this is normally called while holding a
lock on the parent tree block (if any) and while holding a log transaction
open.
So initialize the delayed reference only if the extent buffer is not from
a log tree, avoiding unnecessary work and making the code also a bit
easier to follow.
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sat, 17 Feb 2024 22:23:02 +0000 (22:23 +0000)]
btrfs: send: avoid duplicated search for last extent when sending hole
During an incremental send, before determining if we need to send a hole
(write operations full of zeroes) we will search for the last extent's
end offset if we are at the first slot of a leaf and the last processed
extent's end offset is smaller than the current extent's start offset.
However we are repeating this search in case we had the last extent's end
offset undefined (set to the (u64)-1 value) when we entered
maybe_send_hole(), wasting time.
So avoid this duplicated search by combining the two conditions that
trigger a search for the last extent's end offset into a single if
statement.
Reviewed-by: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
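The duplicated search in the send change above can be illustrated by
counting lookups in a toy Python model. SendCtx, lookup_before and
lookup_after are hypothetical names, and the tree search is replaced by a
stub that just returns a preset value.

```python
U64_MAX = (1 << 64) - 1  # models the (u64)-1 "undefined" marker


class SendCtx:
    """Hypothetical stand-in for the send context."""

    def __init__(self, last_extent_end, real_last_extent_end):
        self.last_extent_end = last_extent_end
        self._real = real_last_extent_end
        self.searches = 0

    def find_last_extent_end(self):
        """Stub for the tree search; every call is counted."""
        self.searches += 1
        self.last_extent_end = self._real


def lookup_before(ctx, slot, extent_start):
    """Two separate checks: both can fire during one call."""
    if ctx.last_extent_end == U64_MAX:
        ctx.find_last_extent_end()
    if slot == 0 and ctx.last_extent_end < extent_start:
        ctx.find_last_extent_end()


def lookup_after(ctx, slot, extent_start):
    """Single combined check: at most one search per call."""
    if ctx.last_extent_end == U64_MAX or (
            slot == 0 and ctx.last_extent_end < extent_start):
        ctx.find_last_extent_end()
```

Starting from an undefined end offset at the first slot of a leaf,
lookup_before searches twice and lookup_after once, and both end up with
the same last extent end offset.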
David Sterba [Wed, 14 Feb 2024 09:54:31 +0000 (10:54 +0100)]
btrfs: factor out validation of btrfs_ioctl_vol_args_v2::name
The validation of vol args v2 name in snapshot and device remove ioctls
is not done properly. A terminating NUL is written to the end of the
buffer unconditionally, assuming that this would be the last place in
case the buffer is used completely. This does not communicate back the
actual error (either an invalid or too long path).
Factor out all such cases and use a helper to do the verification,
simply look for NUL in the buffer. There's no expected practical
change, the size of buffer is 4088, this is enough for most paths or
names.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 14 Feb 2024 09:32:47 +0000 (10:32 +0100)]
btrfs: factor out validation of btrfs_ioctl_vol_args::name
The validation of vol args name in several ioctls is not done properly.
A terminating NUL is written to the end of the buffer unconditionally,
assuming that this would be the last place in case the buffer is used
completely. This does not communicate back the actual error (either an
invalid or too long path).
Factor out all such cases and use a helper to do the verification,
simply look for NUL in the buffer. There's no expected practical change,
the size of buffer is 4088, this is enough for most paths or names.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
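The difference between the old and the new name handling can be sketched
in Python with a byte buffer. This is an illustration, not the kernel
helper: the function names are hypothetical, and -ENAMETOOLONG is an
assumed error code for the sketch, since the text only says the actual
error should be communicated back.

```python
import errno

NAME_BUF_SIZE = 4088  # buffer size quoted in the text


def terminate_unconditionally(buf):
    """Old pattern: force a NUL into the last byte and carry on."""
    out = bytearray(buf)
    out[-1] = 0
    return bytes(out[:out.index(0)])   # the name the kernel would then use


def check_name(buf):
    """New pattern: require a NUL somewhere in the buffer.

    Returns 0 on success or a negative errno (assumed: -ENAMETOOLONG).
    """
    if b"\x00" not in buf:
        return -errno.ENAMETOOLONG
    return 0
```

A buffer filled completely with non-NUL bytes is silently shortened by one
byte by the old pattern, while check_name() rejects it. A properly
terminated name passes both unchanged.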
Filipe Manana [Tue, 13 Feb 2024 15:23:35 +0000 (15:23 +0000)]
btrfs: remove no longer used btrfs_transaction_in_commit()
The function btrfs_transaction_in_commit() is no longer used, its last
use was removed in commit 11aeb97b45ad ("btrfs: don't arbitrarily slow
down delalloc if we're committing"), so just remove it.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Neal Gompa [Mon, 12 Feb 2024 01:34:44 +0000 (20:34 -0500)]
btrfs: sysfs: drop unnecessary double logical negation in acl_show()
The IS_ENABLED() macro already guarantees the result will be a
suitable boolean return value ("1" for enabled, and "0" for disabled).
Thus, it seems that the "!!" used right before is unnecessary to force
the 0/1 values.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Neal Gompa <neal@gompa.dev> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 7 Feb 2024 02:24:06 +0000 (03:24 +0100)]
btrfs: delete BUG_ON in btrfs_init_locked_inode()
The purpose of the BUG_ON is not clear. The helper btrfs_grab_root()
could return a NULL in case args->root would be a NULL or if there are
zero references. Then we check if the root pointer stored in the inode
still exists.
David Sterba [Tue, 6 Feb 2024 22:20:53 +0000 (23:20 +0100)]
btrfs: delete pointless BUG_ONs on extent item size
Checking extent item size in add_inline_refs() is redundant, we do that
already in tree-checker after reading the extent buffer and it won't
change under normal circumstances. It was added long ago in 8da6d5815c592b ("Btrfs: added btrfs_find_all_roots()") and does not seem
to have a clear purpose.
Similar case in extent_from_logical(), added in a542ad1bafc7df ("btrfs:
added helper functions to iterate backrefs").
David Sterba [Tue, 6 Feb 2024 22:20:53 +0000 (23:20 +0100)]
btrfs: delete pointless BUG_ON check on quota root in btrfs_qgroup_account_extent()
The BUG_ON is deep in the qgroup code where we can expect that the quota
root exists. A NULL pointer would cause a crash.
It was added long ago in 550d7a2ed5db35 ("btrfs: qgroup: Add new qgroup
calculation function btrfs_qgroup_account_extents()."). It may have made
sense back then as the quota enable/disable state machine was not as
robust as it is nowadays, so we can just delete it.
David Sterba [Tue, 6 Feb 2024 22:06:46 +0000 (23:06 +0100)]
btrfs: change BUG_ONs to assertions in btrfs_qgroup_trace_subtree()
The only caller do_walk_down() of btrfs_qgroup_trace_subtree() validates
the value of level and uses it several times before it's passed as an
argument. Same for root_eb that's called 'next' in the caller.
Change both BUG_ONs to assertions as this is to assure proper interface
use rather than real errors.
David Sterba [Tue, 6 Feb 2024 21:47:13 +0000 (22:47 +0100)]
btrfs: send: handle unexpected data in header buffer in begin_cmd()
Change BUG_ON to a proper error handling in the unlikely case of seeing
data when the command is started. This is supposed to be reset when the
command is finished (send_cmd, send_encoded_extent).
David Sterba [Wed, 24 Jan 2024 21:58:01 +0000 (22:58 +0100)]
btrfs: handle invalid root reference found in may_destroy_subvol()
The may_destroy_subvol() looks up a root by a key, which allows an
inexact search when key->offset is -1. It's never expected to find such
an item, as it would break the allowed range of a root id.
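The "inexact search with offset -1" pattern, shared by several commits in
this series, can be modelled with a sorted list in Python. This is a toy
replacement for a b-tree leaf, not btrfs code: find_last_item is a
hypothetical name, and returning -EUCLEAN for the impossible exact match
is an assumption of the sketch (117 is the common Linux value of EUCLEAN).

```python
import bisect

U64_MAX = (1 << 64) - 1   # "offset -1" as an unsigned 64-bit value
EUCLEAN = 117             # common Linux errno value, assumed here


def find_last_item(sorted_keys, objectid, item_type):
    """Find the last key with (objectid, item_type) via an offset -1 search.

    sorted_keys is a sorted list of (objectid, type, offset) tuples.
    Returns the key, None if there is no such item, or -EUCLEAN if an
    item with offset -1 really exists.
    """
    search_key = (objectid, item_type, U64_MAX)
    slot = bisect.bisect_left(sorted_keys, search_key)
    if slot < len(sorted_keys) and sorted_keys[slot] == search_key:
        return -EUCLEAN   # exact match can only come from corruption
    if slot == 0:
        return None       # nothing sorts before the search key
    prev = sorted_keys[slot - 1]
    if prev[0] != objectid or prev[1] != item_type:
        return None       # previous item belongs to something else
    return prev
```

The search key sorts after every valid item of the same objectid and type,
so the slot just before the insertion point holds the last real item. An
exact hit means an item with the maximum offset exists, which the commit
messages treat as a broken invariant.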
David Sterba [Wed, 24 Jan 2024 21:49:02 +0000 (22:49 +0100)]
btrfs: handle invalid extent item reference found in find_first_extent_item()
The find_first_extent_item() helper looks up an extent item by a key,
which allows an inexact search when key->offset is -1. It's never
expected to find such an item, as it would break the allowed range of an
extent item offset.
David Sterba [Wed, 24 Jan 2024 21:41:01 +0000 (22:41 +0100)]
btrfs: handle invalid extent item reference found in extent_from_logical()
The extent_from_logical() helper looks up an extent item by a key,
which allows an inexact search when key->offset is -1. It's never
expected to find such an item, as it would break the allowed range of an
extent item offset.
The same error is already handled in btrfs_backref_iter_start() so add a
comment for consistency.
David Sterba [Wed, 24 Jan 2024 21:29:46 +0000 (22:29 +0100)]
btrfs: update comment and drop assertion in extent item lookup in find_parent_nodes()
The same comment was added for this type of error elsewhere, so unify it
and drop the assertion as we'd find out quickly that something is wrong
after returning -EUCLEAN.
David Sterba [Wed, 24 Jan 2024 16:26:25 +0000 (17:26 +0100)]
btrfs: push errors up from add_async_extent()
The memory allocation error in add_async_extent() is not handled
properly, return an error and push the BUG_ON to the caller. Handling it
there is not trivial so at least make it visible.
Filipe Manana [Fri, 9 Feb 2024 12:42:28 +0000 (12:42 +0000)]
btrfs: remove do_list variable at btrfs_clear_delalloc_extent()
The "do_list" variable has a rather confusing name, so remove it and
directly use btrfs_is_free_space_inode() instead.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 9 Feb 2024 12:35:20 +0000 (12:35 +0000)]
btrfs: remove do_list variable at btrfs_set_delalloc_extent()
The "do_list" variable is only used once, plus its name/meaning is a bit
confusing, so remove it and directly use btrfs_is_free_space_inode().
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 9 Feb 2024 12:25:43 +0000 (12:25 +0000)]
btrfs: use assertion instead of BUG_ON when adding/removing to delalloc list
When adding or removing an inode to/from the root's delalloc list,
instead of using a BUG_ON() to validate list emptiness, use ASSERT()
since this is to check logic errors rather than real errors.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 9 Feb 2024 12:19:55 +0000 (12:19 +0000)]
btrfs: add lockdep assertion to remaining delalloc callbacks
The merge and split callbacks for an inode's io tree are supposed to be
called while the io tree's spinlock is being held, so that the given
extent_state records are stable, not modified or freed while the callbacks
are using them. So add lockdep assertions in the callbacks.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 9 Feb 2024 10:37:10 +0000 (10:37 +0000)]
btrfs: reduce inode lock critical section when setting and clearing delalloc
When setting and clearing a delalloc range, at btrfs_set_delalloc_extent()
and btrfs_clear_delalloc_extent(), we are adding/removing the inode
to/from the root's list of delalloc inodes while under the protection of
the inode's lock. This however is not needed, we can add and remove the
inode to the root's list without holding the inode's lock because here
we are under the protection of the io tree's lock, reducing the size of
the critical section delimited by the inode's lock. The inode's lock is
used in many other places such as when finishing an ordered extent (when
calling btrfs_update_inode_bytes() or btrfs_delalloc_release_metadata(),
or decreasing the number of outstanding extents) or when reserving space
when doing a buffered or direct IO write (calls to functions from
delalloc-space.c).
So move the inode add/remove operations to the root's list of delalloc
inodes to outside the critical section delimited by the inode's lock.
This also allows us to get rid of the BTRFS_INODE_IN_DELALLOC_LIST flag
since we can rely on the inode's delalloc bytes counter to determine if
the inode is or is not in the list.
The following fio based test, that exercises IO to multiple files in the
same subvolume, was used to test:
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
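For the critical-section change above, the idea of deriving list
membership from the delalloc bytes counter instead of a dedicated flag can
be shown with a single-threaded Python model. All names are hypothetical,
locks are only indicated in comments, and callers are assumed never to
clear more bytes than were set.

```python
class Root:
    def __init__(self):
        self.delalloc_inodes = []   # models the root's delalloc inode list


class Inode:
    def __init__(self, root):
        self.root = root
        self.delalloc_bytes = 0


def in_delalloc_list(inode):
    """Membership follows from the counter, no separate flag is needed."""
    return inode.delalloc_bytes > 0


def set_delalloc(inode, length):
    """Account a new delalloc range of length bytes."""
    # -- inode lock would be held from here --
    was_empty = inode.delalloc_bytes == 0
    inode.delalloc_bytes += length
    # -- inode lock would be dropped here --
    if was_empty and length > 0:
        inode.root.delalloc_inodes.append(inode)   # outside the inode lock


def clear_delalloc(inode, length):
    """Drop length bytes of delalloc accounting."""
    # -- inode lock would be held from here --
    inode.delalloc_bytes -= length
    now_empty = inode.delalloc_bytes == 0
    # -- inode lock would be dropped here --
    if now_empty and length > 0:
        inode.root.delalloc_inodes.remove(inode)   # outside the inode lock
```

The inode enters the list when its counter leaves zero and leaves the list
when the counter returns to zero, so in_delalloc_list() and actual list
membership stay in step in this model.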
Filipe Manana [Thu, 8 Feb 2024 22:08:34 +0000 (22:08 +0000)]
btrfs: rename btrfs_add_delalloc_inodes() to singular form
The function btrfs_add_delalloc_inodes() adds a single inode to its root's
list of delalloc inodes, so it doesn't make any sense at all for the
function's name to be plural. Rename it to the singular form
btrfs_add_delalloc_inode() to avoid any confusion.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 8 Feb 2024 22:03:31 +0000 (22:03 +0000)]
btrfs: assert root delalloc lock is held at __btrfs_del_delalloc_inode()
This function requires the delalloc lock of the inode's root to be held,
so assert it's held.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 8 Feb 2024 21:55:42 +0000 (21:55 +0000)]
btrfs: stop passing root argument to __btrfs_del_delalloc_inode()
There's no need to pass a root argument to __btrfs_del_delalloc_inode()
and btrfs_del_delalloc_inode(), we can just pass the inode since the root
is always the root associated to that inode. So remove the root argument
from these functions.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 8 Feb 2024 15:32:36 +0000 (15:32 +0000)]
btrfs: stop passing root argument to btrfs_add_delalloc_inodes()
There's no need to pass a root argument to btrfs_add_delalloc_inodes(), we
can just pass the inode since the root is always the root associated to
the inode in the context it's called. So remove it and have the single
caller pass only the inode.
Reviewed-by: Boris Burkov <boris@bur.io> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 14 Sep 2023 14:45:41 +0000 (16:45 +0200)]
btrfs: add helper to get fs_info from struct inode pointer
Add a convenience helper to get a fs_info from a VFS inode pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct btrfs_inode,
btrfs_root or btrfs_fs_info.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 14 Sep 2023 14:24:43 +0000 (16:24 +0200)]
btrfs: add helpers to get fs_info from page/folio pointers
Add convenience helpers to get a fs_info from a page or folio pointer
instead of open coding the chain or using btrfs_sb() that in some cases
does one more pointer hop. This is implemented as a macro (still with
type checking) so we don't need full definitions of struct page, folio,
btrfs_root and btrfs_fs_info. The latter can't be static inlines as this
would create a loop between ctree.h <-> fs.h, or the headers would have to
be restructured.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 13 Sep 2023 14:11:29 +0000 (16:11 +0200)]
btrfs: add helpers to get inode from page/folio pointers
Add convenience helpers to get a struct btrfs_inode from a page or folio
pointer instead of open coding the chain or intermediate BTRFS_I. This
is implemented as a macro (still with type checking) so we don't need
full definitions of struct page or address_space.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Lijuan Li [Tue, 6 Feb 2024 01:56:00 +0000 (09:56 +0800)]
btrfs: mark __btrfs_add_free_space static
__btrfs_add_free_space is only used in free-space-cache.c,
so mark it static.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Lijuan Li <lilijuan@iscas.ac.cn> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 23:23:49 +0000 (00:23 +0100)]
btrfs: move transaction abort to the error site btrfs_rebuild_free_space_tree()
The recommended pattern for transaction abort after error is to place it
right after the error is handled. That way it's easier to locate where
it failed, which helps debugging.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 23:23:49 +0000 (00:23 +0100)]
btrfs: move transaction abort to the error site in btrfs_create_free_space_tree()
The recommended pattern for transaction abort after error is to place it
right after the error is handled. That way it's easier to locate where
it failed, which helps debugging.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 23:23:49 +0000 (00:23 +0100)]
btrfs: move transaction abort to the error site in btrfs_delete_free_space_tree()
The recommended pattern for transaction abort after error is to place it
right after the error is handled. That way it's easier to locate where
it failed, which helps debugging.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 24 Jan 2024 14:59:36 +0000 (15:59 +0100)]
btrfs: unify handling of return values of btrfs_insert_empty_items()
The error values returned by btrfs_insert_empty_items() are following
the common pattern of 0/-errno, but some callers check for a value > 0,
which can't happen. Document that and update calls to not expect
positive values.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 24 Jan 2024 16:23:11 +0000 (17:23 +0100)]
btrfs: change BUG_ON to assertion in reset_balance_state()
The balance state machine is complex so it's good to verify the
assumptions in helpers, however reset_balance_state() is used
at the end of balance and fs_info::balance_ctl is properly set up before
and protected by the exclusive op ownership in btrfs_balance().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 24 Jan 2024 15:18:11 +0000 (16:18 +0100)]
btrfs: change BUG_ON to assertion when verifying root in btrfs_alloc_reserved_file_extent()
The file extents are normally reserved in subvolume roots but could be
also in the data reloc tree. Change the BUG_ON to assertions as this
verifies the usage assumptions.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 22:09:18 +0000 (23:09 +0100)]
btrfs: change BUG_ON to assertion when verifying lockdep class setup
The BUG_ON in btrfs_set_buffer_lockdep_class() is a sanity check of the
level which is verified in callers, e.g. when initializing an extent
buffer or reading from an eb header. Change it to an assertion as this
would not happen unless things are really bad and would fail elsewhere
too.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 24 Jan 2024 00:09:46 +0000 (01:09 +0100)]
btrfs: change BUG_ON to assertion in btrfs_read_roots()
There's one caller of btrfs_read_roots() and that already uses the
tree_root pointer, it's pointless to BUG_ON on it. As it's an assumption
of the initialization helpers make it an assert instead.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 19 Jan 2024 19:15:41 +0000 (20:15 +0100)]
btrfs: defrag: change BUG_ON to assertion in btrfs_defrag_leaves()
The BUG_ON verifies a condition that should be guaranteed by the correct
use of the path search (with keep_locks and lowest_level set), an
assertion is the suitable check.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Sat, 20 Jan 2024 01:22:37 +0000 (02:22 +0100)]
btrfs: delayed-inode: drop pointless BUG_ON in __btrfs_remove_delayed_item()
There's a BUG_ON checking for a valid pointer of fs_info::delayed_root
but it is valid since init_mount_fs_info() and has the same lifetime as
fs_info.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 19 Jan 2024 20:19:18 +0000 (21:19 +0100)]
btrfs: export: handle invalid inode or root reference in btrfs_get_parent()
The get_parent handler looks up a parent of a given dentry, this can be
either a subvolume or a directory. The search is set up with offset -1
but it's never expected to find such an item, as it would break the allowed
range of an inode number or a root id. This means it's a corruption (ext4
also returns this error code).
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 24 Jan 2024 14:37:59 +0000 (15:37 +0100)]
btrfs: handle invalid extent item reference found in check_committed_ref()
The check_committed_ref() helper looks up an extent item by a key,
allowing to do an inexact search when key->offset is -1. It's never
expected to find such an item, as it would break the allowed range of an
extent item offset.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 22:42:29 +0000 (23:42 +0100)]
btrfs: handle chunk tree lookup error in btrfs_relocate_sys_chunks()
The unhandled case in btrfs_relocate_sys_chunks() loop is a corruption,
as it could be caused only by two impossible conditions:
- at first the search key is set up to look for a chunk tree item, with
offset -1, this is an inexact search and the key->offset will contain
the correct offset upon a successful search, a valid chunk tree item
cannot have an offset -1
- after first successful search, the found_key corresponds to a chunk
item, the offset is decremented by 1 before the next loop, it's
impossible to find a chunk item there due to alignment and size
constraints
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 22:34:57 +0000 (23:34 +0100)]
btrfs: handle invalid root reference found in btrfs_init_root_free_objectid()
The btrfs_init_root_free_objectid() looks up a root by a key, allowing
to do an inexact search when key->offset is -1. It's never expected to
find such item, as it would break the allowed range of a root id.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 22:28:24 +0000 (23:28 +0100)]
btrfs: handle invalid root reference found in btrfs_find_root()
The btrfs_find_root() looks up a root by a key, allowing to do an
inexact search when key->offset is -1. It's never expected to find such
item, as it would break the allowed range of a root id.
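
The inexact search pattern used by these helpers can be sketched over a sorted array instead of a b-tree. This is a simplified illustration under stated assumptions: struct sketch_key, key_cmp() and find_last_for_objectid() are hypothetical, and -EUCLEAN is assumed as the error code for the corruption case.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117	/* Linux value, defined here only for portability */
#endif

/* Hypothetical two-field key, ordered by objectid and then offset. */
struct sketch_key {
	uint64_t objectid;
	uint64_t offset;
};

static inline int key_cmp(const struct sketch_key *a,
			  const struct sketch_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Find the last item with @objectid by searching for (objectid, -1).
 * Returns the item index, -ENOENT if there is no such item, or -EUCLEAN
 * if an item with offset -1 really exists, which no valid item may have.
 */
static inline long find_last_for_objectid(const struct sketch_key *items,
					  size_t n, uint64_t objectid)
{
	struct sketch_key key = { objectid, UINT64_MAX };
	size_t lo = 0, hi = n;

	/* Lower bound: first index whose key is not smaller than @key. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (key_cmp(&items[mid], &key) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* An exact match on offset -1 is treated as corruption. */
	if (lo < n && key_cmp(&items[lo], &key) == 0)
		return -EUCLEAN;

	/* Otherwise the wanted item is the one just before the slot. */
	if (lo == 0 || items[lo - 1].objectid != objectid)
		return -ENOENT;
	return (long)(lo - 1);
}
```

The point of the sketch is that the exact-match branch is handled as an ordinary error return rather than a crash.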
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 23 Jan 2024 22:19:19 +0000 (23:19 +0100)]
btrfs: handle root deletion lookup error in btrfs_del_root()
We're deleting a root and looking it up by key does not succeed, this
is an inconsistent state and we can't do anything. All callers handle
errors and abort a transaction.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Sat, 20 Jan 2024 01:17:03 +0000 (02:17 +0100)]
btrfs: handle block group lookup error when it's being removed
The unlikely case of lookup error in btrfs_remove_block_group() can be
handled properly, in its caller this would lead to a transaction abort.
We can't do anything else, a block group must have been loaded first.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 19 Jan 2024 19:44:57 +0000 (20:44 +0100)]
btrfs: handle invalid range and start in merge_extent_mapping()
Turn a BUG_ON into a properly handled error and update the error message
in the caller. It is expected that @em_in and @start passed to
btrfs_add_extent_mapping() overlap. Besides tests, the only caller
btrfs_get_extent() makes sure this is true.
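
The overlap expectation can be written as a plain range check that returns an error instead of crashing. This is a minimal sketch: struct sketch_extent_map and check_map_start() are hypothetical, and -EINVAL is an assumed error code.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical, reduced extent map: a byte range [start, start + len). */
struct sketch_extent_map {
	uint64_t start;
	uint64_t len;
};

static inline uint64_t sketch_em_end(const struct sketch_extent_map *em)
{
	return em->start + em->len;
}

/*
 * Return 0 when @map_start falls inside the extent map's range, or
 * -EINVAL (assumed) when it does not, so the caller can report it.
 */
static inline int check_map_start(const struct sketch_extent_map *em,
				  uint64_t map_start)
{
	if (map_start < em->start || map_start >= sketch_em_end(em))
		return -EINVAL;
	return 0;
}
```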
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 19 Jan 2024 19:23:56 +0000 (20:23 +0100)]
btrfs: handle directory and dentry mismatch in btrfs_may_delete()
The helper btrfs_may_delete() is a copy of generic fs/namei.c:may_delete()
to verify various conditions before deletion. There's a BUG_ON added
before linux.git started, we can turn it to a proper error handling
at least in our local helper. A mismatch between directory and the
deleted dentry is clearly invalid.
This probably won't ever be hit due to the way the parameters are
set from the caller btrfs_ioctl_snap_destroy(), using a VFS helper
lookup_one().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Fri, 2 Feb 2024 04:23:28 +0000 (13:23 +0900)]
btrfs: use READ/WRITE_ONCE for fs_devices->read_policy
Since we can read/modify the value from the sysfs interface concurrently,
it would be better to protect it from compiler optimizations.
Currently, there is only one read policy BTRFS_READ_POLICY_PID available,
so no actual problem can happen now. This is a preparation for the future
expansion.
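
A user-space approximation of the idea, assuming a GCC or Clang compiler for __typeof__: the volatile access forces the compiler to perform exactly one load or store of the field. The SKETCH_* macros, the enum and the struct are hypothetical stand-ins, not the kernel definitions.

```c
/* Hypothetical equivalents of READ_ONCE()/WRITE_ONCE() for scalar fields. */
#define SKETCH_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define SKETCH_WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

/* Assumed policy enumeration with a single available policy. */
enum sketch_read_policy {
	SKETCH_READ_POLICY_PID = 0,
	SKETCH_READ_POLICY_NR,
};

struct sketch_fs_devices {
	enum sketch_read_policy read_policy;
};

/* Writer side, e.g. a sysfs store handler. */
static inline void policy_store(struct sketch_fs_devices *devs,
				enum sketch_read_policy policy)
{
	SKETCH_WRITE_ONCE(devs->read_policy, policy);
}

/* Reader side: the value is loaded once and then used from a local copy. */
static inline enum sketch_read_policy policy_load(struct sketch_fs_devices *devs)
{
	return SKETCH_READ_ONCE(devs->read_policy);
}
```

This only constrains the compiler; it does not provide ordering against other memory accesses.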
Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 26 Jan 2024 12:59:23 +0000 (12:59 +0000)]
btrfs: preallocate temporary extent buffer for inode logging when needed
When logging an inode and we need to copy items from subvolume leaves
to the log tree, we clone each subvolume leaf and then use that clone to
copy items to the log tree. This is required to avoid possible deadlocks
as stated in commit 796787c978ef ("btrfs: do not modify log tree while
holding a leaf from fs tree locked").
The cloning requires allocating an extent buffer (struct extent_buffer)
and then allocating pages (folios) to attach to the extent buffer. This
may be slow in case we are under memory pressure, and since we are doing
the cloning while holding a read lock on a subvolume leaf, it means we
can be blocking other operations on that leaf for significant periods of
time, which can increase latency on operations like creating other files,
renaming files, etc. Similarly because we're under a log transaction, we
may also cause extra delay on other tasks doing an fsync, because syncing
the log requires waiting for tasks that joined a log transaction to exit
the transaction.
So to improve this, for any inode logging operation that needs to copy
items from a subvolume leaf ("full sync" or "copy everything" bit set
in the inode), preallocate a dummy extent buffer before locking any
extent buffer from the subvolume tree, and even before joining a log
transaction, add it to the log context and then use it when we need to
copy items from a subvolume leaf to the log tree. This avoids making
other operations get extra latency when waiting to lock a subvolume
leaf that is used during inode logging and we are under heavy memory
pressure.
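
The ordering described above, allocate first and only copy while the leaf is locked, can be sketched as follows. This is a toy model: the reader counter stands in for a real lock, and all sketch_* and log_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_LEAF_SIZE 4096

/* Toy leaf: @readers stands in for a real read lock. */
struct sketch_leaf {
	int readers;
	unsigned char data[SKETCH_LEAF_SIZE];
};

/* Toy log context carrying the preallocated scratch buffer. */
struct sketch_log_ctx {
	unsigned char *scratch;
};

static inline void leaf_read_lock(struct sketch_leaf *leaf)
{
	leaf->readers++;
}

static inline void leaf_read_unlock(struct sketch_leaf *leaf)
{
	leaf->readers--;
}

/* Step 1: the possibly slow allocation happens before any lock is taken. */
static inline int log_ctx_prepare(struct sketch_log_ctx *ctx, bool needs_copy)
{
	ctx->scratch = NULL;
	if (!needs_copy)
		return 0;
	ctx->scratch = malloc(SKETCH_LEAF_SIZE);
	return ctx->scratch ? 0 : -ENOMEM;
}

/* Step 2: while the leaf is locked, only a memcpy is done. */
static inline int log_clone_leaf(struct sketch_log_ctx *ctx,
				 struct sketch_leaf *leaf)
{
	if (!ctx->scratch)
		return -EINVAL;
	leaf_read_lock(leaf);
	memcpy(ctx->scratch, leaf->data, SKETCH_LEAF_SIZE);
	leaf_read_unlock(leaf);
	return 0;
}

static inline void log_ctx_release(struct sketch_log_ctx *ctx)
{
	free(ctx->scratch);
	ctx->scratch = NULL;
}
```

With this split, memory pressure can slow down log_ctx_prepare() but not the section that holds the leaf lock.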
The following test script with bonnie++ was used to test this:
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 25 Jan 2024 09:53:26 +0000 (09:53 +0000)]
btrfs: add comment about list_is_singular() use at btrfs_delete_unused_bgs()
At btrfs_delete_unused_bgs(), the use of the list_is_singular() check on
a block group may not be immediately obvious. It is there to prevent
losing raid profile information for a block group type (data, metadata or
system), as that information is removed from
fs_info->avail_[data|metadata|system]_alloc_bits when the last block group
of a given type is deleted. So deleting the block group would later result
in creating block groups of that type with a single profile (because
fs_info->avail_*_alloc_bits would have a value of 0).
This check was added in commit aefbe9a633b5 ("btrfs: Fix lost-data-profile
caused by auto removing bg").
So add a comment mentioning the need for the check.
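
What the list_is_singular() check tests can be shown with a small circular doubly linked list. This is a user-space sketch: the sketch_list_* helpers mirror the usual kernel list semantics but are reimplemented here, and may_delete_unused_bg() is a hypothetical simplification that looks at the per-type list head.

```c
#include <stdbool.h>

/* Minimal circular doubly linked list in the style of the kernel lists. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

static inline void sketch_list_init(struct sketch_list_head *head)
{
	head->next = head;
	head->prev = head;
}

static inline void sketch_list_add_tail(struct sketch_list_head *entry,
					struct sketch_list_head *head)
{
	struct sketch_list_head *last = head->prev;

	entry->next = head;
	entry->prev = last;
	last->next = entry;
	head->prev = entry;
}

static inline bool sketch_list_empty(const struct sketch_list_head *head)
{
	return head->next == head;
}

/* True when the list holds exactly one entry. */
static inline bool sketch_list_is_singular(const struct sketch_list_head *head)
{
	return !sketch_list_empty(head) && head->next == head->prev;
}

/*
 * Hypothetical policy helper: refuse to delete the last block group of a
 * type, so the profile information for that type is not lost.
 */
static inline bool may_delete_unused_bg(const struct sketch_list_head *bgs)
{
	return !sketch_list_is_singular(bgs);
}
```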
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 25 Jan 2024 09:53:23 +0000 (09:53 +0000)]
btrfs: document what the spinlock unused_bgs_lock protects
Add some comments to struct btrfs_fs_info to explicitly document which
members are protected by the spinlock unused_bgs_lock. It is currently
used to protect two linked lists, the reclaim_bgs and unused_bgs lists.
So add an explicit comment on top of each list to mention it's protected
by unused_bgs_lock, as well as comment on top of unused_bgs_lock to
mention the lists it protects.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 12 Jan 2024 17:45:24 +0000 (18:45 +0100)]
btrfs: return errors from unpin_extent_range()
Handle the lookup failure of the block group to unpin, this is a logic
error as the block group must exist at this point. If not, something else
must have freed it, like clean_pinned_extents() would do without locking
the unused_bg_unpin_mutex.
Push the errors to the callers, proper handling will be done in followup
patches.
David Sterba [Fri, 12 Jan 2024 17:31:40 +0000 (18:31 +0100)]
btrfs: handle errors returned from unpin_extent_cache()
We've had numerous attempts to let function unpin_extent_cache() return
void as it only returns 0. There are still error cases to handle so do
that, in addition to the verbose messages. The only caller
btrfs_finish_one_ordered() will now abort the transaction, previously it
let it continue which could lead to further problems.
Qu Wenruo [Tue, 23 Jan 2024 03:03:30 +0000 (13:33 +1030)]
btrfs: zstd: fix and simplify the inline extent decompression (v2)
Note: this is a fixed version that was previously reverted as e01a83e12604 ("Revert "btrfs: zstd: fix and simplify the inline extent
decompression""), with fixed parameters to memzero_page().
[BUG]
If we have a filesystem with 4k sectorsize, and an inlined compressed
extent created like this:
[CAUSE]
In zstd_decompress(), we don't treat @start_byte as just a page offset,
but also use it as an indicator on whether we should error out, without
any proper explanation (this is copied from other decompression code).
In reality, for subpage cases, although @start_byte can be non-zero,
we should never switch input/output buffer nor error out, since the whole
input/output buffer should never exceed one sector, thus we should not
need to do any buffer switch.
Thus the current code using @start_byte as a condition to switch
input/output buffer or finish the decompression is completely incorrect.
[FIX]
The fix involves several modifications:
- Rename @start_byte to @dest_pgoff to properly express its meaning
- Use @sectorsize rather than PAGE_SIZE to properly initialize the
output buffer size
- Use correct destination offset inside the destination page
- Simplify the main loop
Since the input/output buffer should never switch, we only need one
zstd_decompress_stream() call.
- Consider early end as an error
After the fix, even on 64K page sized aarch64, above reflink now
works as expected:
David Sterba [Thu, 25 Jan 2024 16:44:47 +0000 (17:44 +0100)]
btrfs: remove unused included headers
With help of neovim, LSP and clangd we can identify header files that
are not actually needed to be included in the .c files. This is focused
only on removal (with minor fixups), further cleanups are possible but
will require doing the header files properly with forward declarations,
minimized includes and include-what-you-use care.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 16 Jan 2024 17:17:14 +0000 (18:17 +0100)]
btrfs: replace i_blocksize by fs_info::sectorsize
The block size calculated by i_blocksize from inode is the same as what
we have in fs_info, initialized in inode_init_always(). Unify that to use
the fs_info value everywhere.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 16 Jan 2024 16:33:20 +0000 (17:33 +0100)]
btrfs: replace sb::s_blocksize by fs_info::sectorsize
The block size stored in the super block is used by subsystems outside
of btrfs and it's a copy of fs_info::sectorsize. Unify that to always
use our sectorsize, with the exception of mount where we first need to
use fixed values (4K) until we read the super block and can set the
sectorsize.
Replace all uses, in most cases it's fewer pointer indirections.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove duplicate recording of physical address
Remove the duplicate recording of the original write's physical
address in case of a single device write.
This duplicated code is most likely present due to a rebase error.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>