Filipe Manana [Wed, 13 Sep 2023 11:23:18 +0000 (12:23 +0100)]
btrfs: remove pointless loop from btrfs_update_block_group()
When an extent is allocated or freed, we call btrfs_update_block_group()
to update its block group and space info. An extent always belongs to a
single block group, it can never span multiple block groups, so the loop
we have at btrfs_update_block_group() is pointless, as it always has a
single iteration. The loop was added in the very early days, 2007, when
the block group code was added in commit 9078a3e1e4e4 ("Btrfs: start of
block group code"), but even back then it seemed pointless.
So remove the loop and assert the block group containing the start offset
of the extent also contains the whole extent.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 12 Sep 2023 12:04:31 +0000 (13:04 +0100)]
btrfs: mark transaction id check as unlikely at btrfs_mark_buffer_dirty()
At btrfs_mark_buffer_dirty(), having a transaction id mismatch is never
expected to happen and it usually means there's a bug or some memory
corruption due to a bitflip for example. So mark the condition as unlikely
to optimize code generation as well as to make it obvious for human
readers that it is a very unexpected condition.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 12 Sep 2023 12:04:30 +0000 (13:04 +0100)]
btrfs: use btrfs_crit at btrfs_mark_buffer_dirty()
There's no need to use WARN() at btrfs_mark_buffer_dirty() to print an
error message, as we have the fs_info pointer we can use btrfs_crit()
which prints device information and makes the message have a more uniform
format. As we are already aborting the transaction we already have a stack
trace printed as well. So replace the use of WARN() with btrfs_crit().
Also slightly reword the message to use 'logical' instead of 'block' as
it's what is used in other error/warning messages.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 12 Sep 2023 12:04:29 +0000 (13:04 +0100)]
btrfs: abort transaction on generation mismatch when marking eb as dirty
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: scan but don't register device on single device filesystem
After the commit 5f58d783fd78 ("btrfs: free device in btrfs_close_devices
for a single device filesystem") we unregister the device from the kernel
memory upon unmounting for a single device.
So, device registration that was performed before mounting if any is no
longer in the kernel memory.
However, in fact, note that device registration is unnecessary for a
single-device btrfs filesystem unless it's a seed device.
So for commands like 'btrfs device scan' or 'btrfs device ready' with a
non-seed single-device btrfs filesystem, they can return success just
after superblock verification and without the actual device scan. When
'device scan --forget' is called on such device no error is returned.
The seed device must remain in the kernel memory to allow the sprout
device to mount without the need to specify the seed device explicitly.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Fri, 8 Sep 2023 19:05:47 +0000 (21:05 +0200)]
btrfs: rename errno identifiers to error
We sync the kernel files to userspace and the 'errno' symbol is defined
by standard library, which does not matter in kernel but the parameters
or local variables could clash. Rename them all.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:38 +0000 (18:20 +0100)]
btrfs: always reserve space for delayed refs when starting transaction
When starting a transaction (or joining an existing one with
btrfs_start_transaction()), we reserve space for the number of items we
want to insert in a btree, but we don't do it for the delayed refs we
will generate while using the transaction to modify (COW) extent buffers
in a btree or allocate new extent buffers. Basically how it works:
1) When we start a transaction we reserve space for the number of items
the caller wants to be inserted/modified/deleted in a btree. This space
goes to the transaction block reserve;
2) If the delayed refs block reserve is not full, its size is greater
than the amount of its reserved space, and the flush method is
BTRFS_RESERVE_FLUSH_ALL, then we attempt to reserve more space for
it corresponding to the number of items the caller wants to
insert/modify/delete in a btree;
3) The size of the delayed refs block reserve is increased when a task
creates delayed refs after COWing an extent buffer, allocating a new
one or deleting (freeing) an extent buffer. This happens after the
the task started or joined a transaction, whenever it calls
btrfs_update_delayed_refs_rsv();
4) The delayed refs block reserve is then refilled by anyone calling
btrfs_delayed_refs_rsv_refill(), either during unlink/truncate
operations or when someone else calls btrfs_start_transaction() with
a 0 number of items and flush method BTRFS_RESERVE_FLUSH_ALL;
5) As a task COWs or allocates extent buffers, it consumes space from the
transaction block reserve. When the task releases its transaction
handle (btrfs_end_transaction()) or it attempts to commit the
transaction, it releases any remaining space in the transaction block
reserve that it did not use, as not all space may have been used (due
to pessimistic space calculation) by calling btrfs_block_rsv_release()
which will try to add that unused space to the delayed refs block
reserve (if its current size is greater than its reserved space).
That transferred space may not be enough to completely fulfill the
delayed refs block reserve.
Plus we have some tasks that will attempt do modify as many leaves
as they can before getting -ENOSPC (and then reserving more space and
retrying), such as hole punching and extent cloning which call
btrfs_replace_file_extents(). Such tasks can generate therefore a
high number of delayed refs, for both metadata and data (we can't
know in advance how many file extent items we will find in a range
and therefore how many delayed refs for dropping references on data
extents we will generate);
6) If a transaction starts its commit before the delayed refs block
reserve is refilled, for example by the transaction kthread or by
someone who called btrfs_join_transaction() before starting the
commit, then when running delayed references if we don't have enough
reserved space in the delayed refs block reserve, we will consume
space from the global block reserve.
Now this doesn't make a lot of sense because:
1) We should reserve space for delayed references when starting the
transaction, since we have no guarantees the delayed refs block
reserve will be refilled;
2) If no refill happens then we will consume from the global block reserve
when running delayed refs during the transaction commit;
3) If we have a bunch of tasks calling btrfs_start_transaction() with a
number of items greater than zero and at the time the delayed refs
reserve is full, then we don't reserve any space at
btrfs_start_transaction() for the delayed refs that will be generated
by a task, and we can therefore end up using a lot of space from the
global reserve when running the delayed refs during a transaction
commit;
4) There are also other operations that result in bumping the size of the
delayed refs reserve, such as creating and deleting block groups, as
well as the need to update a block group item because we allocated or
freed an extent from the respective block group;
5) If we have a significant gap between the delayed refs reserve's size
and its reserved space, two very bad things may happen:
1) The reserved space of the global reserve may not be enough and we
fail the transaction commit with -ENOSPC when running delayed refs;
2) If the available space in the global reserve is enough it may result
in nearly exhausting it. If the fs has no more unallocated device
space for allocating a new block group and all the available space
in existing metadata block groups is not far from the global
reserve's size before we started the transaction commit, we may end
up in a situation where after the transaction commit we have too
little available metadata space, and any future transaction commit
will fail with -ENOSPC, because although we were able to reserve
space to start the transaction, we were not able to commit it, as
running delayed refs generates some more delayed refs (to update the
extent tree for example) - this includes not even being able to
commit a transaction that was started with the goal of unlinking a
file, removing an empty data block group or doing reclaim/balance,
so there's no way to release metadata space.
In the worst case the next time we mount the filesystem we may
also fail with -ENOSPC due to failure to commit a transaction to
cleanup orphan inodes. This later case was reported and hit by
someone running a SLE (SUSE Linux Enterprise) distribution for
example - where the fs had no more unallocated space that could be
used to allocate a new metadata block group, and the available
metadata space was about 1.5M, not enough to commit a transaction
to cleanup an orphan inode (or do relocation of data block groups
that were far from being full).
So improve on this situation by always reserving space for delayed refs
when calling start_transaction(), and if the flush method is
BTRFS_RESERVE_FLUSH_ALL, also try to refill the delayed refs block
reserve if it's not full. The space reserved for the delayed refs is added
to a local block reserve that is part of the transaction handle, and when
a task updates the delayed refs block reserve size, after creating a
delayed ref, the space is transferred from that local reserve to the
global delayed refs reserve (fs_info->delayed_refs_rsv). In case the
local reserve does not have enough space, which may happen for tasks
that generate a variable and potentially large number of delayed refs
(such as the hole punching and extent cloning cases mentioned before),
we transfer any available space and then rely on the current behaviour
of hoping some other task refills the delayed refs reserve or fallback
to the global block reserve.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:37 +0000 (18:20 +0100)]
btrfs: stop doing excessive space reservation for csum deletion
Currently when reserving space for deleting the csum items for a data
extent, when adding or updating a delayed ref head, we determine how
many leaves of csum items we can have and then pass that number to the
helper btrfs_calc_delayed_ref_bytes(). This helper is used for calculating
space for all tree modifications we need when running delayed references,
however the amount of space it computes is excessive for deleting csum
items because:
1) It uses btrfs_calc_insert_metadata_size() which is excessive because
we only need to delete csum items from the csum tree, we don't need
to insert any items, so btrfs_calc_metadata_size() is all we need (as
it computes space needed to delete an item);
2) If the free space tree is enabled, it doubles the amount of space,
which is pointless for csum deletion since we don't need to touch the
free space tree or any other tree other than the csum tree.
So improve on this by tracking how many csum deletions we have and using
a new helper to calculate space for csum deletions (just a wrapper around
btrfs_calc_metadata_size() with a comment). This reduces the amount of
space we need to reserve for csum deletions by a factor of 4, and it helps
reduce the number of times we have to block space reservations and have
the reclaim task enter the space flushing algorithm (flush delayed items,
flush delayed refs, etc) in order to satisfy tickets.
For example this results in a total time decrease when unlinking (or
truncating) files with many extents, as we end up having to block on space
metadata reservations less often. Example test:
$ cat test.sh
#!/bin/bash
DEV=/dev/nullb0
MNT=/mnt/test
umount $DEV &> /dev/null
mkfs.btrfs -f $DEV
# Use compression to quickly create files with a lot of extents
# (each with a size of 128K).
mount -o compress=lzo $DEV $MNT
# 100G gives at least 983040 extents with a size of 128K.
xfs_io -f -c "pwrite -S 0xab -b 1M 0 120G" $MNT/foobar
# Flush all delalloc and clear all metadata from memory.
umount $MNT
mount -o compress=lzo $DEV $MNT
Filipe Manana [Fri, 8 Sep 2023 17:20:36 +0000 (18:20 +0100)]
btrfs: remove pointless initialization at btrfs_delayed_refs_rsv_release()
There's no point in initializing to 0 the local variable 'released' as
we don't use it before the next assignment to it. So remove the
initialization. This may help avoid some warnings with clang tools such
as the one reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:35 +0000 (18:20 +0100)]
btrfs: reserve space for delayed refs on a per ref basis
Currently when reserving space for delayed refs we do it on a per ref head
basis. This is generally enough because most back refs for an extent end
up being inlined in the extent item - with the default leaf size of 16K we
can have at most 33 inline back refs (this is calculated by the macro
BTRFS_MAX_EXTENT_ITEM_SIZE()). The amount of bytes reserved for each ref
head is given by btrfs_calc_delayed_ref_bytes(), which basically
corresponds to a single path for insertion into the extent tree plus
another path for insertion into the free space tree if it's enabled.
However if we have reached the limit of inline refs or we have a mix of
inline and non-inline refs, then we will need to insert a non-inline ref
and update the existing extent item to update the total number of
references for the extent. This implies we need reserved space for two
insertion paths in the extent tree, but we only reserved for one path.
The extent item and the non-inline ref item may be located in different
leaves, or even if they are located in the same leaf, after updating the
extent item and before inserting the non-inline ref item, the extent
buffers in the btree path may have been written (due to memory pressure
for e.g.), in which case we need to COW the entire path again. In this
case since we have not reserved enough space for the delayed refs block
reserve, we will use the global block reserve.
If we are in a situation where the fs has no more unallocated space enough
to allocate a new metadata block group and available space in the existing
metadata block groups is close to the maximum size of the global block
reserve (512M), we may end up consuming too much of the free metadata
space to the point where we can't commit any future transaction because it
will fail, with -ENOSPC, during its commit when trying to allocate an
extent for some COW operation (running delayed refs generated by running
delayed refs or COWing the root tree's root node at commit_cowonly_roots()
for example). Such dramatic scenario can happen if we have many delayed
refs that require the insertion of non-inline ref items, due to too many
reflinks or snapshots. We also have situations where we use the global
block reserve because we could not in advance know that we will need
space to update some trees (block group creation for example), so this
all adds up to increase the chances of exhausting the global block reserve
and making any future transaction commit to fail with -ENOSPC and turn
the fs into RO mode, or fail the mount operation in case the mount needs
to start and commit a transaction, such as when we have orphans to cleanup
for example - such case was reported and hit by someone running a SLE
(SUSE Linux Enterprise) distribution for example - where the fs had no
more unallocated space that could be used to allocate a new metadata block
group, and the available metadata space was about 1.5M, not enough to
commit a transaction to cleanup an orphan inode (or do relocation of data
block groups that were far from being full).
So reserve space for delayed refs by individual refs and not by ref heads,
as we may need to COW multiple extent tree paths due to non-inline ref
items.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:34 +0000 (18:20 +0100)]
btrfs: allow to run delayed refs by bytes to be released instead of count
When running delayed references, through btrfs_run_delayed_refs(), we can
specify how many to run, run all existing delayed references and keep
running delayed references while we can find any. This is controlled with
the value of the 'count' argument, where a value of 0 means to run all
delayed references that exist by the time btrfs_run_delayed_refs() is
called, (unsigned long)-1 means to keep running delayed references while
we are able find any, and any other value to run that exact number of
delayed references.
Typically a specific value other than 0 or -1 is used when flushing space
to try to release a certain amount of bytes for a ticket. In this case
we just simply calculate how many delayed reference heads correspond to a
specific amount of bytes, with calc_delayed_refs_nr(). However that only
takes into account the space reserved for the reference heads themselves,
and does not account for the space reserved for deleting checksums from
the csum tree (see add_delayed_ref_head() and update_existing_head_ref())
in case we are going to delete a data extent. This means we may end up
running more delayed references than necessary in case we process delayed
references for deleting a data extent.
So change the logic of btrfs_run_delayed_refs() to take a bytes argument
to specify how many bytes of delayed references to run/release, using the
special values of 0 to mean all existing delayed references and U64_MAX
(or (u64)-1) to keep running delayed references while we can find any.
This prevents running more delayed references than necessary, when we have
delayed references for deleting data extents, but also makes the upcoming
changes/patches simpler and it's preparatory work for them.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:33 +0000 (18:20 +0100)]
btrfs: simplify check for extent item overrun at lookup_inline_extent_backref()
At lookup_inline_extent_backref() we can simplify the check for an overrun
of the extent item by making the while loop's condition to be "ptr < end"
and then check after the loop if an overrun happened ("ptr > end"). This
reduces indentation and makes the loop condition more clear. So move the
check out of the loop and change the loop condition accordingly, while
also adding the 'unlikely' tag to the check since it's not supposed to be
triggered.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:32 +0000 (18:20 +0100)]
btrfs: return -EUCLEAN if extent item is missing when searching inline backref
At lookup_inline_extent_backref() when trying to insert an inline backref,
if we don't find the extent item we log an error and then return -EIO.
This error code is confusing because there was actually no IO error, and
this means we have some corruption, either caused by a bug or something
like a memory bitflip for example. So change the error code from -EIO to
-EUCLEAN.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:31 +0000 (18:20 +0100)]
btrfs: use a single variable for return value at lookup_inline_extent_backref()
At lookup_inline_extent_backref(), instead of using a 'ret' and an 'err'
variable for tracking the return value, use a single one ('ret'). This
simplifies the code, makes it comply with most of the existing code and
it's less prone for logic errors as time has proven over and over in the
btrfs code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:30 +0000 (18:20 +0100)]
btrfs: use a single variable for return value at run_delayed_extent_op()
Instead of using a 'ret' and an 'err' variable at run_delayed_extent_op()
for tracking the return value, use a single one ('ret'). This simplifies
the code, makes it comply with most of the existing code and it's less
prone for logic errors as time has proven over and over in the btrfs code.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:28 +0000 (18:20 +0100)]
btrfs: remove pointless 'ref_root' variable from run_delayed_data_ref()
The 'ref_root' variable, at run_delayed_data_ref(), is not really needed
as we can always use ref->root directly, plus its initialization to 0 is
completely pointless as we assign it ref->root before its first use.
So just drop that variable and use ref->root directly.
This may help avoid some warnings with clang tools such as the one
reported/fixed by commit 966de47ff0c9 ("btrfs: remove redundant
initialization of variables in log_new_ancestors").
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:27 +0000 (18:20 +0100)]
btrfs: initialize key where it's used when running delayed data ref
At run_delayed_data_ref() we are always initializing a key but the key
is only needed and used if we are inserting a new extent. So move the
declaration and initialization of the key to 'if' branch where it's used.
Also rename the key from 'ins' to 'key', as it's a more clear name.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:26 +0000 (18:20 +0100)]
btrfs: remove refs_to_drop argument from __btrfs_free_extent()
Currently the 'refs_to_drop' argument of __btrfs_free_extent() always
matches the value of node->ref_mod, so remove the argument and use
node->ref_mod at __btrfs_free_extent().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:25 +0000 (18:20 +0100)]
btrfs: remove refs_to_add argument from __btrfs_inc_extent_ref()
Currently the 'refs_to_add' argument of __btrfs_inc_extent_ref() always
matches the value of node->ref_mod, so remove the argument and use
node->ref_mod at __btrfs_inc_extent_ref().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:22 +0000 (18:20 +0100)]
btrfs: remove the refcount warning/check at btrfs_put_delayed_ref()
At btrfs_put_delayed_ref(), it's pointless to have a WARN_ON() to check if
the refcount of the delayed ref is zero. Such check is already done by the
refcount_t module and refcount_dec_and_test(), which loudly complains if
we try to decrement a reference count that is currently 0.
The WARN_ON() dates back to the time when used a regular atomic_t type
for the reference counter, before we switched to the refcount_t type.
The main goal of the refcount_t type/module is precisely to catch such
types of bugs and loudly complain if they happen.
This also reduces a bit the module's text size.
Before this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1612483 167145 16864 1796492 1b698c fs/btrfs/btrfs.ko
After this change:
$ size fs/btrfs/btrfs.ko
text data bss dec hex filename 1612371 167073 16864 1796308 1b68d4 fs/btrfs/btrfs.ko
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:21 +0000 (18:20 +0100)]
btrfs: remove unnecessary logic when running new delayed references
When running delayed references, at btrfs_run_delayed_refs(), we have this
logic to run any new delayed references that might have been added just
after we ran all delayed references. This logic grabs the first delayed
reference, then locks it to wait for any contention on it before running
all new delayed references. This however is pointless and not necessary
because at __btrfs_run_delayed_refs() when we start running delayed
references, we pick the first reference with btrfs_obtain_ref_head() and
then we will lock it (with btrfs_delayed_ref_lock()).
So remove the duplicate and unnecessary logic at btrfs_run_delayed_refs().
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Fri, 8 Sep 2023 17:20:20 +0000 (18:20 +0100)]
btrfs: pass a space_info argument to btrfs_reserve_metadata_bytes()
We are passing a block reserve argument to btrfs_reserve_metadata_bytes()
which is not really used, all we need is to pass the space_info associated
to the block reserve, we don't change the block reserve at all.
Not only it's pointless to pass the block reserve, it's also confusing as
one might think that the reserved bytes will end up being added to the
passed block reserve, when that's not the case. The pattern for reserving
space and adding it to a block reserve is to first reserve space with
btrfs_reserve_metadata_bytes() and if that succeeds, then add the space to
a block reserve by calling btrfs_block_rsv_add_bytes().
Also the reverse of btrfs_reserve_metadata_bytes(), which is
btrfs_space_info_free_bytes_may_use(), takes a space_info argument and
not a block reserve, so one more reason to pass a space_info and not a
block reserve to btrfs_reserve_metadata_bytes().
So change btrfs_reserve_metadata_bytes() and its callers to pass a
space_info argument instead of a block reserve argument.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the need_raid_map parameter from btrfs_map_block()
The parameter @need_raid_map is mostly a legacy from the old days where
we don't yet have a solid definition on the @mirror_num, and only
check-integrity was using that parameter, while all other call sites
just pass 1 for that parameter.
Now since we have removed check-integrity functionality, we can also
remove the @need_raid_map parameter.
This change will also remove the ability to read P/Q stripe directly
when passing 0 as @need_raid_map.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: check-integrity: remove btrfsic_unmount() function
The function btrfsic_mount() is part of the deprecated check-integrity
functionality.
Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: check-integrity: remove btrfsic_mount() function
The function btrfsic_mount() is part of the deprecated check-integrity
functionality.
Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: check-integrity: remove btrfsic_check_bio() function
The function btrfsic_check_bio() is part of the deprecated
check-integrity functionality.
Now let's remove the main entry point of check-integrity, and thankfully
most of the check-integrity code is self-contained inside
check-integrity.c, we can safely remove the function without huge
changes to btrfs code base.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 7 Sep 2023 23:09:42 +0000 (01:09 +0200)]
btrfs: move extent_buffer::lock_owner to debug section
The lock_owner is used for a rare corruption case and we haven't seen
any reports in years. Move it to the debugging section of eb. To close
the holes also move log_index so the final layout looks like:
David Sterba [Thu, 7 Sep 2023 23:09:40 +0000 (01:09 +0200)]
btrfs: reduce size of struct btrfs_ref
We can reduce two members' size that in turn reduce size of struct
btrfs_ref from 64 to 56 bytes. As the structure is often used as a local
variable several functions reduce their stack usage.
- make enum btrfs_ref_type packed, there are only 4 values
David Sterba [Thu, 7 Sep 2023 23:09:38 +0000 (01:09 +0200)]
btrfs: reduce size and reorder compression members in struct btrfs_inode
Currently the compression type values are bounded and fit to an u8, we
can pack the btrfs_inode a bit by reordering them to the space created
by the location key. This reduces size from 1112 to 1104.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 7 Sep 2023 23:09:35 +0000 (01:09 +0200)]
btrfs: reduce size of prelim_ref::level
The values of level are bounded and fit into a byte so let's use it for
the structure to reduce size from 88 to 80 bytes on a release build,
which increases number of objects in the default 8K slab from 93 to 102.
David Sterba [Thu, 7 Sep 2023 23:09:33 +0000 (01:09 +0200)]
btrfs: reduce arguments of helpers space accounting root item
There are two helpers to increase used bytes of root items that add or
subtract one node size, we don't need to pass the argument for that.
Rename the function so it matches the root item member that gets
changed.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 7 Sep 2023 23:09:29 +0000 (01:09 +0200)]
btrfs: reduce parameters of btrfs_pin_reserved_extent
There is only one caller of btrfs_pin_reserved_extent that expands the
parameters to extent buffer members. We can simply pass the extent
buffer instead.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 7 Sep 2023 23:09:25 +0000 (01:09 +0200)]
btrfs: reformat remaining kdoc style comments
Function name in the comment does not bring much value to code not
exposed as API and we don't stick to the kdoc format anymore. Update
formatting of parameter descriptions.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: qgroup: remove unused helpers for ulist aux data
These functions are defined in the qgroup.c file, but not called
anymore since commit "btrfs: qgroup: use qgroup_iterator_nested to in
qgroup_update_refcnt()" so we can delete them.
fs/btrfs/qgroup.c:149:19: warning: unused function 'qgroup_to_aux'.
fs/btrfs/qgroup.c:154:36: warning: unused function 'unode_aux_to_qgroup'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6566 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: qgroup: prealloc btrfs_qgroup_list for __add_relation_rb()
Currently we go GFP_ATOMIC allocation for qgroup relation add, this
includes the following 3 call sites:
- btrfs_read_qgroup_config()
This is not really needed, as at that time we're still in single
thread mode, and no spin lock is held.
- btrfs_add_qgroup_relation()
This one is holding a spinlock, but we're ensured to add at most one
relation, thus we can easily do a preallocation and use the
preallocated memory to avoid GFP_ATOMIC.
- btrfs_qgroup_inherit()
This is a little more tricky, as we may have as many relationships as
inherit::num_qgroups.
Thus we have to properly allocate an array then preallocate all the
memory.
This patch would remove the GFP_ATOMIC allocation for above involved
call sites, by doing preallocation before holding the spinlock, and let
__add_relation_rb() to handle the freeing of the structure.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()
The ulist @qgroups is utilized to record all involved qgroups from both
old and new roots inside btrfs_qgroup_account_extent().
Due to the fact that qgroup_update_refcnt() itself is already utilizing
qgroup_iterator, here we have to introduce another list_head,
btrfs_qgroup::nested_iterator, allowing nested iteration.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: qgroup: iterate qgroups without memory allocation for qgroup_reserve()
Qgroup heavily relies on ulist to go through all the involved
qgroups, but since we're using ulist inside fs_info->qgroup_lock
spinlock, this means we're doing a lot of GFP_ATOMIC allocations.
This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by
eliminating the memory allocation completely.
This is done by moving the needed memory to btrfs_qgroup::iterator
list_head, so that we can put all the involved qgroup into a on-stack
list, thus eliminating the need to allocate memory while holding
spinlock.
The only cost is the slightly higher memory usage, but considering the
reduce GFP_ATOMIC during a hot path, it should still be acceptable.
Function qgroup_reserve() is the perfect start point for this
conversion.
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:30 +0000 (16:19 -0400)]
btrfs: remove extraneous includes from ctree.h
We don't need any of these includes in the ctree.h header file for the
header file itself, remove them to clean up ctree.h a little bit.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:29 +0000 (16:19 -0400)]
btrfs: include linux/security.h in super.c
We use some of the security related code in here, include it in super.c
so we can remove the include from ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:28 +0000 (16:19 -0400)]
btrfs: include trace header in where necessary
If we no longer include the tracepoints from ctree.h we fail to compile
because we have the dependency in some of the header files and source
files. Add the include where we have these dependencies to allow us to
remove the include from ctree.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:27 +0000 (16:19 -0400)]
btrfs: add btrfs_delayed_ref_head declaration to extent-tree.h
extent-tree.h uses btrfs_delayed_ref_head in a function argument but
doesn't pull it's declaration from anywhere, add it to the top of the
header.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:26 +0000 (16:19 -0400)]
btrfs: add fscrypt related dependencies to respective headers
These headers have struct fscrypt_str as function arguments, so add
struct fscrypt_str to the theader, and include linux/fscrypt.h in
btrfs_inode.h as it also needs the definition of struct fscrypt_name for
the new inode args.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:25 +0000 (16:19 -0400)]
btrfs: include linux/iomap.h in file.c
We use the iomap code in file.c, include it so we have our dependencies.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:23 +0000 (16:19 -0400)]
btrfs: include asm/unaligned.h in accessors.h
We use the unaligned helpers directly in accessors.h, add the include
here.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:22 +0000 (16:19 -0400)]
btrfs: move btrfs_name_hash to dir-item.h
This is related to the name hashing for dir items, move it into
dir-item.h.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:21 +0000 (16:19 -0400)]
btrfs: move btrfs_extref_hash into inode-item.h
Ideally this would be un-inlined, but that is a cleanup for later. For
now move this into inode-item.h, which is where the extref code lives.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:20 +0000 (16:19 -0400)]
btrfs: remove btrfs_crc32c wrapper
This simply sends the same arguments into crc32c(), and is just used in
a few places. Remove this wrapper and directly call crc32c() in these
instances.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Fri, 25 Aug 2023 20:19:19 +0000 (16:19 -0400)]
btrfs: move btrfs_crc32c_final into free-space-cache.c
This is the only place this helper is used, take it out of ctree.h and
move it into free-space-cache.c.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Tue, 29 Aug 2023 07:14:25 +0000 (15:14 +0800)]
btrfs: do not require EXTENT_NOWAIT for btrfs_redirty_list_add()
The flag EXTENT_NOWAIT is a special flag to notify extent-io-tree code
that this operation should not sleep for the extent state preallocation.
However for btrfs_redirty_list_add(), all callers are able to sleep:
- clean_log_buffer()
Just 2 lines before, we call btrfs_pin_reserved_extent(), which calls
pin_down_extent(), and that function does not require EXTENT_NOWAIT.
Thus we're safe to call it without EXTENT_NOWAIT.
- btrfs_free_tree_block()
This function have several call sites which trigger tree read, e.g.
walk_up_proc(), thus we're safe to call it without EXTENT_NOWAIT.
Thus there is no need to require EXTENT_NOWAIT flag.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 23 Aug 2023 14:52:13 +0000 (22:52 +0800)]
btrfs: sipmlify uuid parameters of alloc_fs_devices()
Among all the callers, only the device_list_add() function uses the
second argument of alloc_fs_devices(). It passes metadata_uuid when
available, otherwise, it passes NULL. And in turn, alloc_fs_devices()
is designed to copy either metadata_uuid or fsid into
fs_devices::metadata_uuid.
So remove the second argument in alloc_fs_devices(), and always copy the
fsid. In the caller device_list_add() function, we will overwrite it
with metadata_uuid when it is available.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 28 Aug 2023 07:38:36 +0000 (08:38 +0100)]
btrfs: update comment for reservation of metadata space for delayed items
The second comment at btrfs_delayed_item_reserve_metadata() refers to a
field named "index_items_size" of a delayed inode, however that field
does not exists - it existed in a previous patch version, but then it
split into the fields "curr_index_batch_size" and "index_item_leaves"
in the final patch version that was picked. So update the comment.
Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Wed, 11 Oct 2023 20:58:32 +0000 (13:58 -0700)]
Merge tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"A revert of recent mount option parsing fix, this breaks mounts with
security options.
The second patch is a flexible array annotation"
* tag 'for-6.6-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()
Revert "btrfs: reject unknown mount options early"
Linus Torvalds [Wed, 11 Oct 2023 20:46:56 +0000 (13:46 -0700)]
Merge tag 'ata-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ata fixes from Damien Le Moal:
- Three fixes for the pata_parport driver to address a typo in the
code, a missing operation implementation and port reset handling in
the presence of slave devices (Ondrej)
- Fix handling of ATAPI devices reset with the fit3 protocol driver of
the pata_parport driver (Ondrej)
- A follow up fix for the recent suspend/resume corrections to avoid
attempting rescanning on resume the scsi device associated with an
ata disk when the request queue of the scsi device is still suspended
(in addition to not doing the rescan if the scsi device itself is
still suspended) (me)
* tag 'ata-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
scsi: Do not rescan devices with a suspended queue
ata: pata_parport: fit3: implement IDE command set registers
ata: pata_parport: add custom version of wait_after_reset
ata: pata_parport: implement set_devctl
ata: pata_parport: fix pata_parport_devchk
Linus Torvalds [Wed, 11 Oct 2023 20:27:44 +0000 (13:27 -0700)]
Merge tag 'for-linus-2023101101' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Benjamin Tissoires:
- regression fix for i2c-hid when used on DT platforms (Johan Hovold)
- kernel crash fix on removal of the Logitech USB receiver (Hans de
Goede)
* tag 'for-linus-2023101101' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: logitech-hidpp: Fix kernel crash on receiver USB disconnect
HID: i2c-hid: fix handling of unpopulated devices
btrfs: add __counted_by for struct btrfs_delayed_item and use struct_size()
Prepare for the coming implementation by GCC and Clang of the __counted_by
attribute. Flexible array members annotated with __counted_by can have
their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
functions).
While there, use struct_size() helper, instead of the open-coded
version, to calculate the size for the allocation of the whole
flexible structure, including of course, the flexible-array member.
This code was found with the help of Coccinelle, and audited and
fixed manually.
Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Tue, 10 Oct 2023 18:31:42 +0000 (11:31 -0700)]
Merge tag 'xsa441-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fix from Juergen Gross:
"A fix for the xen events driver:
Closing of an event channel in the Linux kernel can result in a
deadlock. This happens when the close is being performed in parallel
to an unrelated Xen console action and the handling of a Xen console
interrupt in an unprivileged guest.
The closing of an event channel is e.g. triggered by removal of a
paravirtual device on the other side. As this action will cause
console messages to be issued on the other side quite often, the
chance of triggering the deadlock is not negligible"
* tag 'xsa441-6.6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/events: replace evtchn_rwlock with RCU
Static calls invocations aren't well supported from module __init and
__exit functions. Especially the static call from cleanup_trusted() led
to a crash on x86 kernel with CONFIG_DEBUG_VIRTUAL=y.
However, the usage of static call invocations for trusted_key_init()
and trusted_key_exit() don't add any value from either a performance or
security perspective. Hence switch to use indirect function calls instead.
Note here that although it will fix the current crash report, ultimately
the static call infrastructure should be fixed to either support its
future usage from module __init and __exit functions or not.
Linus Torvalds [Tue, 10 Oct 2023 18:14:07 +0000 (11:14 -0700)]
Merge tag 'irq-urgent-2023-10-10-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"A set of updates for interrupt chip drivers:
- Fix the fail of the Qualcomm PDC driver on v3.2 hardware which is
caused by a control bit being moved to a different location
- Update the SM8150 device tree PDC resource so the version register
can be read
- Make the Renesas RZG2L driver correct for interrupts which are
outside of the LSB in the TSSR register by using the proper macro
for calculating the mask
- Document the Renesas RZ2GL device tree binding correctly and update
them for a few devices which faul to boot otherwise
- Use the proper accessor in the RZ2GL driver instead of blindly
dereferencing an unchecked pointer
- Make GICv3 handle the dma-non-coherent attribute correctly
- Ensure that all interrupt controller nodes on RISCV are marked as
initialized correctly
Maintainer changes:
- Add a new entry for GIC interrupt controllers and assign Marc
Zyngier as the maintainer
- Remove Marc Zyngier from the core and driver maintainer entries as
he is burried in work and short of time to handle that.
Thanks to Marc for all the great work he has done in the past couple
of years!
Also note that commit 5873d380f4c0 ("irqchip/qcom-pdc: Add support for
v3.2 HW") has a incorrect SOB chain.
The real author is Neil. His patch was posted by Dmitry once and Neil
picked it up from the list and reposted it with the bogus SOB chain.
Not a big deal, but worth to mention. I wanted to fix that up, but
then got distracted and Marc piled more changes on top. So I decided
to leave it as is instead of rebasing world"
* tag 'irq-urgent-2023-10-10-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
MAINTAINERS: Remove myself from the general IRQ subsystem maintenance
MAINTAINERS: Add myself as the ARM GIC maintainer
irqchip/renesas-rzg2l: Convert to irq_data_get_irq_chip_data()
irqchip/stm32-exti: add missing DT IRQ flag translation
irqchip/riscv-intc: Mark all INTC nodes as initialized
irqchip/gic-v3: Enable non-coherent redistributors/ITSes DT probing
irqchip/gic-v3-its: Split allocation from initialisation of its_node
dt-bindings: interrupt-controller: arm,gic-v3: Add dma-noncoherent property
dt-bindings: interrupt-controller: renesas,irqc: Add r8a779f0 support
dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Document RZ/G2UL SoC
irqchip: renesas-rzg2l: Fix logic to clear TINT interrupt source
dt-bindings: interrupt-controller: renesas,rzg2l-irqc: Update description for '#interrupt-cells' property
arm64: dts: qcom: sm8150: extend the size of the PDC resource
irqchip/qcom-pdc: Add support for v3.2 HW
Linus Torvalds [Tue, 10 Oct 2023 18:01:21 +0000 (11:01 -0700)]
Merge tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyperv fixes from Wei Liu:
- fixes for Hyper-V VTL code (Saurabh Sengar and Olaf Hering)
- fix hv_kvp_daemon to support keyfile based connection profile
(Shradha Gupta)
* tag 'hyperv-fixes-signed-20231009' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
hv/hv_kvp_daemon:Support for keyfile based connection profile
hyperv: reduce size of ms_hyperv_info
x86/hyperv: Add common print prefix "Hyper-V" in hv_init
x86/hyperv: Remove hv_vtl_early_init initcall
x86/hyperv: Restrict get_vtl to only VTL platforms
Linus Torvalds [Tue, 10 Oct 2023 17:33:21 +0000 (10:33 -0700)]
Merge tag 'sound-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of pending fixes since a couple of weeks ago, which
became slightly bigger than usual due to my vacation.
Most of changes are about ASoC device-specific fixes while USB- and
HD-audio received quirks as usual. All fixes, including two ASoC core
changes, are reasonably small and safe to apply"
* tag 'sound-6.6-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (23 commits)
ALSA: usb-audio: Fix microphone sound on Nexigo webcam.
ALSA: hda/realtek: Change model for Intel RVP board
ALSA: usb-audio: Fix microphone sound on Opencomm2 Headset
ALSA: hda: cs35l41: Cleanup and fix double free in firmware request
ASoC: dt-bindings: fsl,micfil: Document #sound-dai-cells
ASoC: amd: yc: Fix non-functional mic on Lenovo 82YM
ASoC: tlv320adc3xxx: BUG: Correct micbias setting
ASoC: rt5682: Fix regulator enable/disable sequence
ASoC: hdmi-codec: Fix broken channel map reporting
ASoC: core: Do not call link_exit() on uninitialized rtd objects
ASoC: core: Print component name when printing log
ASoC: SOF: amd: fix for firmware reload failure after playback
ASoC: fsl-asoc-card: use integer type for fll_id and pll_id
ASoC: fsl_sai: Don't disable bitclock for i.MX8MP
dt-bindings: ASoC: rockchip: Add compatible for RK3128 spdif
ASoC: soc-generic-dmaengine-pcm: Fix function name in comment
ALSA: hda/realtek - ALC287 merge RTK codec with CS CS35L41 AMP
ASoC: simple-card: fixup asoc_simple_probe() error handling
ASoC: simple-card-utils: fixup simple_util_startup() error handling
ASoC: Intel: sof_sdw: add support for SKU 0B14
...
The patch breaks mounts with security mount options like
$ mount -o context=system_u:object_r:root_t:s0 /dev/sdX /mn
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdX, missing codepage or helper program, ...
We cannot reject all unknown options in btrfs_parse_subvol_options() as
intended, the security options can be present at this point and it's not
possible to enumerate them in a future proof way. This means unknown
mount options are silently accepted like before when the filesystem is
mounted with either -o subvol=/path or as followup mounts of the same
device.
Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com Signed-off-by: David Sterba <dsterba@suse.com>
Damien Le Moal [Wed, 4 Oct 2023 08:50:49 +0000 (17:50 +0900)]
scsi: Do not rescan devices with a suspended queue
Commit ff48b37802e5 ("scsi: Do not attempt to rescan suspended devices")
modified scsi_rescan_device() to avoid attempting rescanning a suspended
device. However, the modification added a check to verify that a SCSI
device is in the running state without checking if the device request
queue (in the case of block device) is also running, thus allowing the
exectuion of internal requests. Without checking the device request
queue, commit ff48b37802e5 fix is incomplete and deadlocks on resume can
still happen. Use blk_queue_pm_only() to check if the device request
queue allows executing commands in addition to checking the SCSI device
state.
Reported-by: Petr Tesarik <petr@tesarici.cz> Fixes: ff48b37802e5 ("scsi: Do not attempt to rescan suspended devices") Cc: stable@vger.kernel.org Tested-by: Petr Tesarik <petr@tesarici.cz> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Ondrej Zary [Thu, 5 Oct 2023 20:55:59 +0000 (22:55 +0200)]
ata: pata_parport: fit3: implement IDE command set registers
fit3 protocol driver does not support accessing IDE control registers
(device control/altstatus). The DOS driver does not use these registers
either (as observed from DOSEMU trace). But the HW seems to be capable
of accessing these registers - I simply tried bit 3 and it works!
The control register is required to properly reset ATAPI devices or
they will be detected only once (after a power cycle).
Tested with EXP Computer CD-865 with MC-1285B EPP cable and
TransDisk 3000.
Signed-off-by: Ondrej Zary <linux@zary.sk> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Ondrej Zary [Thu, 5 Oct 2023 20:55:58 +0000 (22:55 +0200)]
ata: pata_parport: add custom version of wait_after_reset
Some parallel adapters (e.g. EXP Computer MC-1285B EPP Cable) return
bogus values when there's no master device present. This can cause
reset to fail, preventing the lone slave device (such as EXP Computer
CD-865) from working.
Add custom version of wait_after_reset that ignores master failure when
a slave device is present. The custom version is also needed because
the generic ata_sff_wait_after_reset uses direct port I/O for slave
device detection.
Signed-off-by: Ondrej Zary <linux@zary.sk> Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru> Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
Shradha Gupta [Mon, 9 Oct 2023 10:38:40 +0000 (03:38 -0700)]
hv/hv_kvp_daemon:Support for keyfile based connection profile
Ifcfg config file support in NetworkManger is deprecated. This patch
provides support for the new keyfile config format for connection
profiles in NetworkManager. The patch modifies the hv_kvp_daemon code
to generate the new network configuration in keyfile
format(.ini-style format) along with a ifcfg format configuration.
The ifcfg format configuration is also retained to support easy
backward compatibility for distro vendors. These configurations are
stored in temp files which are further translated using the
hv_set_ifconfig.sh script. This script is implemented by individual
distros based on the network management commands supported.
For example, RHEL's implementation could be found here:
https://gitlab.com/redhat/centos-stream/src/hyperv-daemons/-/blob/c9s/hv_set_ifconfig.sh
Debian's implementation could be found here:
https://github.com/endlessm/linux/blob/master/debian/cloud-tools/hv_set_ifconfig
The next part of this support is to let the Distro vendors consume
these modified implementations to the new configuration format.
Tested-on: Rhel9(Hyper-V, Azure)(nm and ifcfg files verified) Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Reviewed-by: Saurabh Sengar <ssengar@linux.microsoft.com> Reviewed-by: Ani Sinha <anisinha@redhat.com> Signed-off-by: Wei Liu <wei.liu@kernel.org> Link: https://lore.kernel.org/r/1696847920-31125-1-git-send-email-shradhagupta@linux.microsoft.com
John Ogness [Fri, 6 Oct 2023 08:21:50 +0000 (10:21 +0200)]
printk: flush consoles before checking progress
Commit 9e70a5e109a4 ("printk: Add per-console suspended state")
removed console lock usage during resume and replaced it with
the clearly defined console_list_lock and srcu mechanisms.
However, the console lock usage had an important side-effect
of flushing the consoles. After its removal, consoles were no
longer flushed before checking their progress.
Add the console_lock/console_unlock dance to the beginning
of __pr_flush() to actually flush the consoles before checking
their progress. Also add comments to clarify this additional
usage of the console lock.
Note that console_unlock() does not guarantee flushing all messages
since the commit dbdda842fe96f89 ("printk: Add console owner and waiter
logic to load balance console writes").
Reported-by: Todd Brandt <todd.e.brandt@intel.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217955 Fixes: 9e70a5e109a4 ("printk: Add per-console suspended state") Co-developed-by: Petr Mladek <pmladek@suse.com> Signed-off-by: Petr Mladek <pmladek@suse.com> Signed-off-by: John Ogness <john.ogness@linutronix.de> Link: https://lore.kernel.org/r/20231006082151.6969-2-pmladek@suse.com
Juergen Gross [Mon, 28 Aug 2023 06:09:47 +0000 (08:09 +0200)]
xen/events: replace evtchn_rwlock with RCU
In unprivileged Xen guests event handling can cause a deadlock with
Xen console handling. The evtchn_rwlock and the hvc_lock are taken in
opposite sequence in __hvc_poll() and in Xen console IRQ handling.
Normally this is no problem, as the evtchn_rwlock is taken as a reader
in both paths, but as soon as an event channel is being closed, the
lock will be taken as a writer, which will cause read_lock() to block:
read_lock(evtchn_rwlock)
spin_lock(hvc_lock)
write_lock(evtchn_rwlock)
[blocks]
spin_lock(hvc_lock)
[blocks]
read_lock(evtchn_rwlock)
[blocks due to writer waiting,
and not in_interrupt()]
This issue can be avoided by replacing evtchn_rwlock with RCU in
xen_free_irq(). Note that RCU is used only to delay freeing of the
irq_info memory. There is no RCU based dereferencing or replacement of
pointers involved.
In order to avoid potential races between removing the irq_info
reference and handling of interrupts, set the irq_info pointer to NULL
only when freeing its memory. The IRQ itself must be freed at that
time, too, as otherwise the same IRQ number could be allocated again
before handling of the old instance would have been finished.
This is XSA-441 / CVE-2023-34324.
Fixes: 54c9de89895e ("xen/events: add a new "late EOI" evtchn framework") Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com> Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: Julien Grall <jgrall@amazon.com> Signed-off-by: Juergen Gross <jgross@suse.com>
Christos Skevis [Fri, 6 Oct 2023 15:53:30 +0000 (17:53 +0200)]
ALSA: usb-audio: Fix microphone sound on Nexigo webcam.
I own an external usb Webcam, model NexiGo N930AF, which had low mic volume and
inconsistent sound quality. Video works as expected.
(snip)
[ +0.047857] usb 5-1: new high-speed USB device number 2 using xhci_hcd
[ +0.003406] usb 5-1: New USB device found, idVendor=1bcf, idProduct=2283, bcdDevice=12.17
[ +0.000007] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ +0.000004] usb 5-1: Product: NexiGo N930AF FHD Webcam
[ +0.000003] usb 5-1: Manufacturer: SHENZHEN AONI ELECTRONIC CO., LTD
[ +0.000004] usb 5-1: SerialNumber: 20201217011
[ +0.003900] usb 5-1: Found UVC 1.00 device NexiGo N930AF FHD Webcam (1bcf:2283)
[ +0.025726] usb 5-1: 3:1: cannot get usb sound sample rate freq at ep 0x86
[ +0.071482] usb 5-1: 3:2: cannot get usb sound sample rate freq at ep 0x86
[ +0.004679] usb 5-1: 3:3: cannot get usb sound sample rate freq at ep 0x86
[ +0.051607] usb 5-1: Warning! Unlikely big volume range (=4096), cval->res is probably wrong.
[ +0.000005] usb 5-1: [7] FU [Mic Capture Volume] ch = 1, val = 0/4096/1
Set up quirk cval->res to 16 for 256 levels,
Set GET_SAMPLE_RATE quirk flag to stop trying to get the sample rate.
Confirmed that happened anyway later due to the backoff mechanism, after 3 failures
All audio stream on device interfaces share the same values,
apart from wMaxPacketSize and tSamFreq :
Based on the usb data about manufacturer, SPCA2281B3 is the most likely controller IC
Manufacturer does not provide link for datasheet nor detailed specs.
No way to confirm if the firmware supports any other way of getting the sample rate.
Testing patch provides consistent good sound recording quality and volume range.
(snip)
[ +0.045764] usb 5-1: new high-speed USB device number 2 using xhci_hcd
[ +0.106290] usb 5-1: New USB device found, idVendor=1bcf, idProduct=2283, bcdDevice=12.17
[ +0.000006] usb 5-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[ +0.000004] usb 5-1: Product: NexiGo N930AF FHD Webcam
[ +0.000003] usb 5-1: Manufacturer: SHENZHEN AONI ELECTRONIC CO., LTD
[ +0.000004] usb 5-1: SerialNumber: 20201217011
[ +0.043700] usb 5-1: set resolution quirk: cval->res = 16
[ +0.002585] usb 5-1: Found UVC 1.00 device NexiGo N930AF FHD Webcam (1bcf:2283)
Linus Torvalds [Sun, 8 Oct 2023 17:10:52 +0000 (10:10 -0700)]
Merge tag '6.6-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
"Six SMB3 server fixes for various races found by RO0T Lab of Huawei:
- Fix oops when racing between oplock break ack and freeing file
- Simultaneous request fixes for parallel logoffs, and for parallel
lock requests
- Fixes for tree disconnect race, session expire race, and close/open
race"
* tag '6.6-rc4-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix race condition between tree conn lookup and disconnect
ksmbd: fix race condition from parallel smb2 lock requests
ksmbd: fix race condition from parallel smb2 logoff requests
ksmbd: fix uaf in smb20_oplock_break_ack
ksmbd: fix race condition with fp
ksmbd: fix race condition between session lookup and expire
Linus Torvalds [Sun, 8 Oct 2023 16:57:59 +0000 (09:57 -0700)]
Merge tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc scheduler fixes from Ingo Molnar:
- Two EEVDF fixes: one to fix sysctl_sched_base_slice propagation, and
to fix an avg_vruntime() corner-case.
- A cpufreq frequency scaling fix
* tag 'sched-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpufreq: schedutil: Update next_freq when cpufreq_limits change
sched/eevdf: Fix avg_vruntime()
sched/eevdf: Also update slice on placement
Linus Torvalds [Sun, 8 Oct 2023 16:27:20 +0000 (09:27 -0700)]
Merge tag 'x86-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Fix SEV-SNP guest crashes that may happen on NMIs
- Fix a potential SEV platform memory setup overflow
* tag 'x86-urgent-2023-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Change npages to unsigned long in snp_accept_memory()
x86/sev: Use the GHCB protocol when available for SNP CPUID requests
Linus Torvalds [Sat, 7 Oct 2023 20:05:43 +0000 (13:05 -0700)]
Merge tag 'parisc-for-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull parisc fixes from Helge Deller:
- fix random faults in mmap'd memory on pre PA8800 processors
- fix boot crash with nr_cpus=1 on kernel command line
* tag 'parisc-for-6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Restore __ldcw_align for PA-RISC 2.0 processors
parisc: Fix crash with nr_cpus=1 option
parisc: Restore __ldcw_align for PA-RISC 2.0 processors
Back in 2005, Kyle McMartin removed the 16-byte alignment for
ldcw semaphores on PA 2.0 machines (CONFIG_PA20). This broke
spinlocks on pre PA8800 processors. The main symptom was random
faults in mmap'd memory (e.g., gcc compilations, etc).
Unfortunately, the errata for this ldcw change is lost.
The issue is the 16-byte alignment required for ldcw semaphore
instructions can only be reduced to natural alignment when the
ldcw operation can be handled coherently in cache. Only PA8800
and PA8900 processors actually support doing the operation in
cache.
Aligning the spinlock dynamically adds two integer instructions
to each spinlock.
Linus Torvalds [Sat, 7 Oct 2023 17:44:28 +0000 (10:44 -0700)]
Merge tag '6.6-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- protect cifs/smb3 socket connect from BPF address overwrite
- fix case when directory leases disabled but wasting resources with
unneeded thread on each mount
* tag '6.6-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: do not start laundromat thread on nohandlecache
smb: use kernel_connect() and kernel_bind()
Linus Torvalds [Sat, 7 Oct 2023 17:30:35 +0000 (10:30 -0700)]
Merge tag 'xfs-6.6-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Chandan Babu:
- Prevent filesystem hang when executing fstrim operations on large and
slow storage
* tag 'xfs-6.6-fixes-3' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: abort fstrim if kernel is suspending
xfs: reduce AGF hold times during fstrim operations
xfs: move log discard work to xfs_discard.c
Linus Torvalds [Sat, 7 Oct 2023 17:17:48 +0000 (10:17 -0700)]
Merge tag 'for-6.6/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix memory leak when freeing dm zoned target device
- Update dm-devel mailing list address in MAINTAINERS
* tag 'for-6.6/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
MAINTAINERS: update the dm-devel mailing list
dm zoned: free dmz->ddev array in dmz_put_zoned_devices
Linus Torvalds [Sat, 7 Oct 2023 16:21:09 +0000 (09:21 -0700)]
Merge tag 'gpio-fixes-for-v6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
"Another round of driver one-liners from the GPIO subsystem:
- disable pin control on MMP GPIOs in gpio-pxa
- fix the GPIO number passed to one of the pinctrl callbacks in
gpio-aspeed"
* tag 'gpio-fixes-for-v6.6-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: aspeed: fix the GPIO number passed to pinctrl_gpio_set_config()
gpio: pxa: disable pinctrl calls for MMP_GPIO
Linus Torvalds [Sat, 7 Oct 2023 16:16:23 +0000 (09:16 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"This includes a fix for a significant security miss in checking the
RDMA_NLDEV_CMD_SYS_SET operation.
Summary:
- UAF in SRP
- Error unwind failure in siw connection management
- Missing error checks
- NULL/ERR_PTR confusion in erdma
- Possible string truncation in CMA configfs and mlx4
- Data ordering issue in bnxt_re
- Missing stats decrement on object destroy in bnxt_re
- Mlx5 bugs in this merge window:
* Incorrect access_flag in the new mkey cache
* Missing unlock on error in flow steering
* lockdep possible deadlock on new mkey cache destruction (Plus a
fix for this too)
- Don't leak kernel stack memory to userspace in the CM
- Missing permission validation for RDMA_NLDEV_CMD_SYS_SET"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/core: Require admin capabilities to set system parameters
RDMA/mlx5: Remove not-used cache disable flag
RDMA/cma: Initialize ib_sa_multicast structure to 0 when join
RDMA/mlx5: Fix mkey cache possible deadlock on cleanup
RDMA/mlx5: Fix NULL string error
RDMA/mlx5: Fix mutex unlocking on error flow for steering anchor creation
RDMA/mlx5: Fix assigning access flags to cache mkeys
IB/mlx4: Fix the size of a buffer in add_port_entries()
RDMA/bnxt_re: Decrement resource stats correctly
RDMA/bnxt_re: Fix the handling of control path response data
RDMA/cma: Fix truncation compilation warning in make_cma_ports
RDMA/erdma: Fix NULL pointer access in regmr_cmd
RDMA/erdma: Fix error code in erdma_create_scatter_mtt()
RDMA/uverbs: Fix typo of sizeof argument
RDMA/cxgb4: Check skb value for failure to allocate
RDMA/siw: Fix connection failure handling
RDMA/srp: Do not call scsi_done() from srp_abort()
Marc Zyngier [Mon, 2 Oct 2023 14:13:02 +0000 (15:13 +0100)]
MAINTAINERS: Remove myself from the general IRQ subsystem maintenance
It is pretty obvious that I haven't done much on the IRQ side
for a while, and it is unlikely that I'll have more bandwidth
for it any time soon. People keep sending me patches that
I end-up reviewing in a cursory manner, which isn't great for
anyone.
So in everyone's interest, I'm removing myself from the list
of maintainers and leave the irqchip and irqdomain subsystems
in Thomas' capable hands.