git.ipfire.org Git - thirdparty/linux.git/log

ext4: implement linear-like traversal across order xarrays

Although we now perform ordered traversal within an xarray, this is
currently limited to a single xarray. However, we have multiple such
xarrays, which prevents us from guaranteeing a linear-like traversal
where all groups on the right are visited before all groups on the left.

For example, suppose we have 128 block groups, with a target group of 64,
a target length corresponding to an order of 1, and available free groups
of 16 (order 1) and group 65 (order 8):

For linear traversal, when no suitable free block is found in group 64, it
will search in the next block group until group 127, then start searching
from 0 up to block group 63. It ensures continuous forward traversal, which
is consistent with the unidirectional rotation behavior of HDD platters.

Additionally, the block group lock contention during freeing block is
unavoidable. The goal increasing from 0 to 64 indicates that previously
scanned groups (which had no suitable free space and are likely to free
blocks later) and skipped groups (which are currently in use) have newly
freed some used blocks. If we allocate blocks in these groups, the
probability of competing with other processes increases.

For non-linear traversal, we first traverse all groups in order_1. If only
group 16 has free space in this list, we first traverse [63, 128), then
traverse [0, 64) to find the available group 16, and then allocate blocks
in group 16. Therefore, it cannot guarantee continuous traversal in one
direction, thus increasing the probability of contention.

So refactor ext4_mb_scan_groups_xarray() to ext4_mb_scan_groups_xa_range()
to only traverse a fixed range of groups, and move the logic for handling
wrap around to the caller. The caller first iterates through all xarrays
in the range [start, ngroups) and then through the range [0, start). This
approach simulates a linear scan, which reduces contention between freeing
blocks and allocating blocks.

Assume we have the following groups, where "|" denotes the xarray traversal
start position:

order_1_groups: AB | CD
order_2_groups: EF | GH

Traversal order:
Before: C > D > A > B > G > H > E > F
After:  C > D > G > H > A > B > E > F

Performance test data follows:

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19555 | 20049 (+2.5%)  | 315636 | 316724 (-0.3%) |
|mb_optimize_scan=1 | 15496 | 19342 (+24.8%) | 323569 | 328324 (+1.4%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53192 | 52125 (-2.0%)  | 212678 | 215136 (+1.1%) |
|mb_optimize_scan=1 | 37636 | 50331 (+33.7%) | 214189 | 209431 (-2.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-18-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: refactor choose group to scan group

This commit converts the `choose group` logic to `scan group` using
previously prepared helper functions. This allows us to leverage xarrays
for ordered non-linear traversal, thereby mitigating the "bouncing" issue
inherent in the `choose group` mechanism.

This also decouples linear and non-linear traversals, leading to cleaner
and more readable code.

Key changes:

* ext4_mb_choose_next_group() is refactored to ext4_mb_scan_groups().

* Replaced ext4_mb_good_group() with ext4_mb_scan_group() in non-linear
   traversals, and related functions now return error codes instead of
   group info.

* Added ext4_mb_scan_groups_linear() for performing linear scans starting
   from a specific group for a set number of times.

* Linear scans now execute up to sbi->s_mb_max_linear_groups times,
   so ac_groups_linear_remaining is removed as it's no longer used.

* ac->ac_criteria is now used directly instead of passing cr around.
   Also, ac->ac_criteria is incremented directly after groups scan fails
   for the corresponding criteria.

* Since we're now directly scanning groups instead of finding a good group
   then scanning, the following variables and flags are no longer needed,
   s_bal_cX_groups_considered is sufficient.

    s_bal_p2_aligned_bad_suggestions
    s_bal_goal_fast_bad_suggestions
    s_bal_best_avail_bad_suggestions
    EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED
    EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED
    EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-17-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: convert free groups order lists to xarrays

While traversing the list, holding a spin_lock prevents load_buddy, making
direct use of ext4_try_lock_group impossible. This can lead to a bouncing
scenario where spin_is_locked(grp_A) succeeds, but ext4_try_lock_group()
fails, forcing the list traversal to repeatedly restart from grp_A.

In contrast, linear traversal directly uses ext4_try_lock_group(),
avoiding this bouncing. Therefore, we need a lockless, ordered traversal
to achieve linear-like efficiency.

Therefore, this commit converts both average fragment size lists and
largest free order lists into ordered xarrays.

In an xarray, the index represents the block group number and the value
holds the block group information; a non-empty value indicates the block
group's presence.

While insertion and deletion complexity remain O(1), lookup complexity
changes from O(1) to O(nlogn), which may slightly reduce single-threaded
performance.

Additionally, xarray insertions might fail, potentially due to memory
allocation issues. However, since we have linear traversal as a fallback,
this isn't a major problem. Therefore, we've only added a warning message
for insertion failures here.

A helper function ext4_mb_find_good_group_xarray() is added to find good
groups in the specified xarray starting at the specified position start,
and when it reaches ngroups-1, it wraps around to 0 and then to start-1.
This ensures an ordered traversal within the xarray.

Performance test results are as follows: Single-process operations
on an empty disk show negligible impact, while multi-process workloads
demonstrate a noticeable performance gain.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20097 | 19555 (-2.6%)  | 316141 | 315636 (-0.2%) |
|mb_optimize_scan=1 | 13318 | 15496 (+16.3%) | 325273 | 323569 (-0.5%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53603 | 53192 (-0.7%)  | 214243 | 212678 (-0.7%) |
|mb_optimize_scan=1 | 20887 | 37636 (+80.1%) | 213632 | 214189 (+0.2%) |

[ Applied spelling fixes per discussion on the ext4-list see thread
  referened in the Link tag. --tytso]

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-16-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: factor out ext4_mb_scan_group()

Extract ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-15-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: factor out ext4_mb_might_prefetch()

Extract ext4_mb_might_prefetch() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-14-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: factor out __ext4_mb_scan_group()

Extract __ext4_mb_scan_group() to make the code clearer and to
prepare for the later conversion of 'choose group' to 'scan groups'.
No functional changes.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-13-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix largest free orders lists corruption on mb_optimize_scan switch

The grp->bb_largest_free_order is updated regardless of whether
mb_optimize_scan is enabled. This can lead to inconsistencies between
grp->bb_largest_free_order and the actual s_mb_largest_free_orders list
index when mb_optimize_scan is repeatedly enabled and disabled via remount.

For example, if mb_optimize_scan is initially enabled, largest free
order is 3, and the group is in s_mb_largest_free_orders[3]. Then,
mb_optimize_scan is disabled via remount, block allocations occur,
updating largest free order to 2. Finally, mb_optimize_scan is re-enabled
via remount, more block allocations update largest free order to 1.

At this point, the group would be removed from s_mb_largest_free_orders[3]
under the protection of s_mb_largest_free_orders_locks[2]. This lock
mismatch can lead to list corruption.

To fix this, whenever grp->bb_largest_free_order changes, we now always
attempt to remove the group from its old order list. However, we only
insert the group into the new order list if `mb_optimize_scan` is enabled.
This approach helps prevent lock inconsistencies and ensures the data in
the order lists remains reliable.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-12-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix zombie groups in average fragment size lists

Groups with no free blocks shouldn't be in any average fragment size list.
However, when all blocks in a group are allocated(i.e., bb_fragments or
bb_free is 0), we currently skip updating the average fragment size, which
means the group isn't removed from its previous s_mb_avg_fragment_size[old]
list.

This created "zombie" groups that were always skipped during traversal as
they couldn't satisfy any block allocation requests, negatively impacting
traversal efficiency.

Therefore, when a group becomes completely full, bb_avg_fragment_size_order
is now set to -1. If the old order was not -1, a removal operation is
performed; if the new order is not -1, an insertion is performed.

Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
CC: stable@vger.kernel.org
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-11-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: merge freed extent with existing extents before insertion

Attempt to merge ext4_free_data with already inserted free extents prior
to adding new ones. This strategy drastically cuts down the number of
times locks are held.

For example, if prev, new, and next extents are all mergeable, the existing
code (before this patch) requires acquiring the s_md_lock three times:

  prev merge into new and free prev // hold lock
  next merge into new and free next // hold lock
  insert new // hold lock

After the patch, it only needs to be acquired once:

  new merge into next and free new // no lock
  next merge into prev and free next // hold lock

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 20043 | 20097 (+0.2%)  | 314331 | 316141 (+0.5%) |
|mb_optimize_scan=1 | 7290  | 13318 (+87.4%) | 324226 | 325273 (+0.3%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 54999 | 53603 (-2.5%)  | 214380 | 214243 (-0.06%)|
|mb_optimize_scan=1 | 13497 | 20887 (+54.6%) | 216276 | 213632 (-1.2%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-10-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: convert sbi->s_mb_free_pending to atomic_t

Previously, s_md_lock was used to protect s_mb_free_pending during
modifications, while smp_mb() ensured fresh reads, so s_md_lock just
guarantees the atomicity of s_mb_free_pending. Thus we optimized it by
converting s_mb_free_pending into an atomic variable, thereby eliminating
s_md_lock and minimizing lock contention. This also prepares for future
lockless merging of free extents.

Following this modification, s_md_lock is exclusively responsible for
managing insertions and deletions within s_freed_data_list, along with
operations involving list_splice.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 19628 | 20043 (+2.1%)  | 320885 | 314331 (-2.0%) |
|mb_optimize_scan=1 | 7129  | 7290  (+2.2%)  | 321275 | 324226 (+0.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 53760 | 54999 (+2.3%)  | 213145 | 214380 (+0.5%) |
|mb_optimize_scan=1 | 12716 | 13497 (+6.1%)  | 215262 | 216276 (+0.4%) |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-9-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix typo in CR_GOAL_LEN_SLOW comment

Remove the superfluous "find_".

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-8-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: get rid of some obsolete EXT4_MB_HINT flags

Since nobody has used these EXT4_MB_HINT flags for ages,
let's remove them.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-7-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: utilize multiple global goals to reduce contention

When allocating data blocks, if the first try (goal allocation) fails and
stream allocation is on, it tries a global goal starting from the last
group we used (s_mb_last_group). This helps cluster large files together
to reduce free space fragmentation, and the data block contiguity also
accelerates write-back to disk.

However, when multiple processes allocate blocks, having just one global
goal means they all fight over the same group. This drastically lowers
the chances of extents merging and leads to much worse file fragmentation.

To mitigate this multi-process contention, we now employ multiple global
goals, with the number of goals being the minimum between the number of
possible CPUs and one-quarter of the filesystem's total block group count.

To ensure a consistent goal for each inode, we select the corresponding
goal by taking the inode number modulo the total number of goals.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 9636  | 19628 (+103%)  | 337597 | 320885 (-4.9%) |
|mb_optimize_scan=1 | 4834  | 7129  (+47.4%) | 341440 | 321275 (-5.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 22341 | 53760 (+140%)  | 219707 | 213145 (-2.9%) |
|mb_optimize_scan=1 | 9177  | 12716 (+38.5%) | 215732 | 215262 (+0.2%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-6-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: remove unnecessary s_md_lock on update s_mb_last_group

After we optimized the block group lock, we found another lock
contention issue when running will-it-scale/fallocate2 with multiple
processes. The fallocate's block allocation and the truncate's block
release were fighting over the s_md_lock. The problem is, this lock
protects totally different things in those two processes: the list of
freed data blocks (s_freed_data_list) when releasing, and where to start
looking for new blocks (mb_last_group) when allocating.

Now we only need to track s_mb_last_group and no longer need to track
s_mb_last_start, so we don't need the s_md_lock lock to ensure that the
two are consistent. Since s_mb_last_group is merely a hint and doesn't
require strong synchronization, READ_ONCE/WRITE_ONCE is sufficient.

Besides, the s_mb_last_group data type only requires ext4_group_t
(i.e., unsigned int), rendering unsigned long superfluous.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80           |            P1           |
|Memory: 512GB      |------------------------|-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 4821  | 9636  (+99.8%) | 314065 | 337597 (+7.4%) |
|mb_optimize_scan=1 | 4784  | 4834  (+1.04%) | 316344 | 341440 (+7.9%) |

|CPU: AMD 9654 * 2  |          P96           |             P1          |
|Memory: 1536GB     |------------------------|-------------------------|
|960GB SSD (1GB/s)  | base  |    patched     | base   |    patched     |
|-------------------|-------|----------------|--------|----------------|
|mb_optimize_scan=0 | 15371 | 22341 (+45.3%) | 205851 | 219707 (+6.7%) |
|mb_optimize_scan=1 | 6101  | 9177  (+50.4%) | 207373 | 215732 (+4.0%) |

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-5-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: remove unnecessary s_mb_last_start

Since stream allocation does not use ac->ac_f_ex.fe_start, it is set to -1
by default, so the no longer needed sbi->s_mb_last_start is removed.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-4-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: separate stream goal hits from s_bal_goals for better tracking

In ext4_mb_regular_allocator(), after the call to ext4_mb_find_by_goal()
fails to achieve the inode goal, allocation continues with the stream
allocation global goal. Currently, hits for both are combined in
sbi->s_bal_goals, hindering accurate optimization.

This commit separates global goal hits into sbi->s_bal_stream_goals. Since
stream allocation doesn't use ac->ac_g_ex.fe_start, set fe_start to -1.
This prevents stream allocations from being counted in s_bal_goals. Also
clear EXT4_MB_HINT_TRY_GOAL to avoid calling ext4_mb_find_by_goal again.

After adding `stream_goal_hits`, `/proc/fs/ext4/sdx/mb_stats` will show:

mballoc:
reqs: 840347
success: 750992
groups_scanned: 1230506
cr_p2_aligned_stats:
hits: 21531
groups_considered: 411664
extents_scanned: 21531
useless_loops: 0
bad_suggestions: 6
cr_goal_fast_stats:
hits: 111222
groups_considered: 1806728
extents_scanned: 467908
useless_loops: 0
bad_suggestions: 13
cr_best_avail_stats:
hits: 36267
groups_considered: 1817631
extents_scanned: 156143
useless_loops: 0
bad_suggestions: 204
cr_goal_slow_stats:
hits: 106396
groups_considered: 5671710
extents_scanned: 22540056
useless_loops: 123747
cr_any_free_stats:
hits: 138071
groups_considered: 724692
extents_scanned: 23615593
useless_loops: 585
extents_scanned: 46804261
goal_hits: 1307
stream_goal_hits: 236317
len_goal_hits: 155549
2^n_hits: 21531
breaks: 225096
lost: 35062
buddies_generated: 40/40
buddies_time_used: 48004
preallocated: 5962467
discarded: 4847560

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-3-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: add ext4_try_lock_group() to skip busy groups

When ext4 allocates blocks, we used to just go through the block groups
one by one to find a good one. But when there are tons of block groups
(like hundreds of thousands or even millions) and not many have free space
(meaning they're mostly full), it takes a really long time to check them
all, and performance gets bad. So, we added the "mb_optimize_scan" mount
option (which is on by default now). It keeps track of some group lists,
so when we need a free block, we can just grab a likely group from the
right list. This saves time and makes block allocation much faster.

But when multiple processes or containers are doing similar things, like
constantly allocating 8k blocks, they all try to use the same block group
in the same list. Even just two processes doing this can cut the IOPS in
half. For example, one container might do 300,000 IOPS, but if you run two
at the same time, the total is only 150,000.

Since we can already look at block groups in a non-linear way, the first
and last groups in the same list are basically the same for finding a block
right now. Therefore, add an ext4_try_lock_group() helper function to skip
the current group when it is locked by another process, thereby avoiding
contention with other processes. This helps ext4 make better use of having
multiple block groups.

Also, to make sure we don't skip all the groups that have free space
when allocating blocks, we won't try to skip busy groups anymore when
ac_criteria is CR_ANY_FREE.

Performance test data follows:

Test: Running will-it-scale/fallocate2 on CPU-bound containers.
Observation: Average fallocate operations per container per second.

|CPU: Kunpeng 920   |          P80            |
|Memory: 512GB      |-------------------------|
|960GB SSD (0.5GB/s)| base  |    patched      |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 2667  | 4821  (+80.7%)  |
|mb_optimize_scan=1 | 2643  | 4784  (+81.0%)  |

|CPU: AMD 9654 * 2  |          P96            |
|Memory: 1536GB     |-------------------------|
|960GB SSD (1GB/s)  | base  |    patched      |
|-------------------|-------|-----------------|
|mb_optimize_scan=0 | 3450  | 15371 (+345%)   |
|mb_optimize_scan=1 | 3209  | 6101  (+90.0%)  |

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250714130327.1830534-2-libaokun1@huawei.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: initialize superblock fields in the kballoc-test.c kunit tests

Various changes in the "ext4: better scalability for ext4 block
allocation" patch series have resulted in kunit test failures, most
notably in the test_new_blocks_simple and the test_mb_mark_used tests.
The root cause of these failures is that various in-memory ext4 data
structures were not getting initialized, and while previous versions
of the functions exercised by the unit tests didn't use these
structure members, this was arguably a test bug.

Since one of the patches in the block allocation scalability patches
is a fix which is has a cc:stable tag, this commit also has a
cc:stable tag.

CC: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20250714130327.1830534-1-libaokun1@huawei.com
Link: https://patch.msgid.link/20250725021550.3177573-1-yi.zhang@huaweicloud.com
Link: https://patch.msgid.link/20250725021654.3188798-1-yi.zhang@huaweicloud.com
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/linux-ext4/b0635ad0-7ebf-4152-a69b-58e7e87d5085@roeck-us.net/
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: refactor the inline directory conversion and new directory codepaths

There was a lot of common code in the codepaths used to convert an
inline directory and to creaet a new directory. To address this,
rename ext4_init_dot_dotdot() to ext4_init_dirblock() and then move
common code into that function.

This reduces the lines of code count in fs/ext4/inline.c and
fs/ext4/namei.c, as well as reducing the size of their object files.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-3-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: use memcpy() instead of strcpy()

The strcpy() function is considered dangerous and eeeevil by people
who are using sophisticated code analysis tools such as "grep".  This
is true even when a quick inspection would show that the source is a
constant string ("." or "..") and the destination is a fixed array
which is guaranteed to have enough space.  Make the "grep" code
analysis tool happy by using memcpy() isstead of strcpy().  :-)

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-2-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: replace strcmp with direct comparison for '.' and '..'

In a discussion over a proposed patch, "ext4: replace strcpy() with
'.' assignment"[1], I had asserted that directory entries in ext4 were
not NUL terminated, and hence it was safe to replace strcpy() with a
direct assignment.  As it turns out, this was incorrect.  It's true
for all all directory entries *except* for '.' and '..' where the
kernel was using strcmp() and where e2fsck actually checks and offers
to fix things if '.'  and '..' are not NUL terminated.

[1] https://lore.kernel.org/r/202505191316.JJMnPobO-lkp@intel.com

We can't change this without breaking old kernel versions, but in the
spirit of "be liberal in what you receive", use direct comparison of
de->name_len and de->name[0,1] instead of strcmp().  This has the side
benefit of reducing the compiled text size by 96 bytes on x86_64.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://patch.msgid.link/20250712181249.434530-1-tytso@mit.edu
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: Make sure BH_New bit is cleared in ->write_end handler

Currently we clear BH_New bit in case of error and also in the standard
ext4_write_end() handler (in block_commit_write()). However
ext4_journalled_write_end() misses this clearing and thus we are leaving
stale BH_New bits behind. Generally ext4_block_write_begin() clears
these bits before any harm can be done but in case blocksize < pagesize
and we hit some error when processing a page with these stale bits,
we'll try to zero buffers with these stale BH_New bits and jbd2 will
complain (as buffers were not prepared for writing in this transaction).
Fix the problem by clearing BH_New bits in ext4_journalled_write_end()
and WARN if ext4_block_write_begin() sees stale BH_New bits.

Reported-by: Baolin Liu <liubaolin12138@163.com>
Reported-by: Zhi Long <longzhi@sangfor.com.cn>
Fixes: 3910b513fcdf ("ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers")
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250709084831.23876-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix inode use after free in ext4_end_io_rsv_work()

In ext4_io_end_defer_completion(), check if io_end->list_vec is empty to
avoid adding an io_end that requires no conversion to the
i_rsv_conversion_list, which in turn prevents starting an unnecessary
worker. An ext4_emergency_state() check is also added to avoid attempting
to abort the journal in an emergency state.

Additionally, ext4_put_io_end_defer() is refactored to call
ext4_io_end_defer_completion() directly instead of being open-coded.
This also prevents starting an unnecessary worker when EXT4_IO_END_FAILED
is set but data_err=abort is not enabled.

This ensures that the check in ext4_put_io_end_defer() is consistent with
the check in ext4_end_bio(). Otherwise, we might add an io_end to the
i_rsv_conversion_list and then call ext4_finish_bio(), after which the
inode could be freed before ext4_end_io_rsv_work() is called, triggering
a use-after-free issue.

Fixes: ce51afb8cc5e ("ext4: abort journal on data writeback failure if in data_err=abort mode")
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250708111504.3208660-1-libaokun@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: Refactor breaking condition for xattr_find_entry()

Refactor the condition for breaking the loop within xattr_find_entry().
Elimate the usage of "<=" and take condition shortcut when "!cmp" is
true.

Originally, the condition was "(cmp <= 0 && (sorted || cmp == 0))", which
means after it knows "cmp <= 0" is true, it has to check the value of
"sorted" and "cmp". The checking of "cmp" here would be redundant since
it has already checked it.

Observing from the logic, when "cmp == 0" the branch is going to be true,
no need to check "cmp == 0" again, so we only need to take shortcut when
"cmp == 0", on the other hand, we'll check "sorted" when "cmp < 0".

The refactor can shrink the generated code size by 44 bytes. Numerous
instructions can be saved thus should also benefit execution efficiency
as well.

$ ./scripts/bloat-o-meter vmlinux_old vmlinux_new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-44 (-44)
Function old new delta
xattr_find_entry 300 256 -44
Total: Before=22989434, After=22989390, chg -0.00%

The test is done on kernel version 6.16 with x86_64 defconfig
and gcc 13.3.0.

Signed-off-by: I Hsin Cheng <richard120310@gmail.com>
Link: https://patch.msgid.link/20250708020013.175728-1-richard120310@gmail.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: limit the maximum folio order

In environments with a page size of 64KB, the maximum size of a folio
can reach up to 128MB. Consequently, during the write-back of folios,
the 'rsv_blocks' will be overestimated to 1,577, which can make
pressure on the journal space where the journal is small. This can
easily exceed the limit of a single transaction. Besides, an excessively
large folio is meaningless and will instead increase the overhead of
traversing the bhs within the folio. Therefore, limit the maximum order
of a folio to 2048 filesystem blocks.

Reported-by: Naresh Kamboju <naresh.kamboju@linaro.org>
Reported-by: Joseph Qi <jiangqi903@gmail.com>
Closes: https://lore.kernel.org/linux-ext4/CA+G9fYsyYQ3ZL4xaSg1-Tt5Evto7Zd+hgNWZEa9cQLbahA1+xg@mail.gmail.com/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-12-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix insufficient credits calculation in ext4_meta_trans_blocks()

The calculation of journal credits in ext4_meta_trans_blocks() should
include pextents, as each extent separately may be allocated from a
different group and thus need to update different bitmap and group
descriptor block.

Fixes: 0e32d8617012 ("ext4: correct the journal credits calculations of allocating blocks")
Reported-by: Jan Kara <jack@suse.cz>
Closes: https://lore.kernel.org/linux-ext4/nhxfuu53wyacsrq7xqgxvgzcggyscu2tbabginahcygvmc45hy@t4fvmyeky33e/
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-11-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: replace ext4_writepage_trans_blocks()

After ext4 supports large folios, the semantics of reserving credits in
pages is no longer applicable. In most scenarios, reserving credits in
extents is sufficient. Therefore, introduce ext4_chunk_trans_extent()
to replace ext4_writepage_trans_blocks(). move_extent_per_page() is the
only remaining location where we are still processing extents in pages.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-10-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: reserved credits for one extent during the folio writeback

After ext4 supports large folios, reserving journal credits for one
maximum-ordered folio based on the worst case cenario during the
writeback process can easily exceed the maximum transaction credits.
Additionally, reserving journal credits for one page is also no
longer appropriate.

Currently, the folio writeback process can either extend the journal
credits or initiate a new transaction if the currently reserved journal
credits are insufficient. Therefore, it can be modified to reserve
credits for only one extent at the outset. In most cases involving
continuous mapping, these credits are generally adequate, and we may
only need to perform some basic credit expansion. However, in extreme
cases where the block size and folio size differ significantly, or when
the folios are sufficiently discontinuous, it may be necessary to
restart a new transaction and resubmit the folios.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-9-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: correct the reserved credits for extent conversion

Now, we reserve journal credits for converting extents in only one page
to written state when the I/O operation is complete. This is
insufficient when large folio is enabled.

Fix this by reserving credits for converting up to one extent per block in
the largest 2MB folio, this calculation should only involve extents index
and leaf blocks, so it should not estimate too many credits.

Fixes: 7ac67301e82f ("ext4: enable large folio for regular file")
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-8-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: enhance tracepoints during the folios writeback

After mpage_map_and_submit_extent() supports restarting handle if
credits are insufficient during allocating blocks, it is more likely to
exit the current mapping iteration and continue to process the current
processing partially mapped folio again. The existing tracepoints are
not sufficient to track this situation, so enhance the tracepoints to
track the writeback position and the return value before and after
submitting the folios.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-7-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: restart handle if credits are insufficient during allocating blocks

After large folios are supported on ext4, writing back a sufficiently
large and discontinuous folio may consume a significant number of
journal credits, placing considerable strain on the journal. For
example, in a 20GB filesystem with 1K block size and 1MB journal size,
writing back a 2MB folio could require thousands of credits in the
worst-case scenario (when each block is discontinuous and distributed
across different block groups), potentially exceeding the journal size.
This issue can also occur in ext4_write_begin() and ext4_page_mkwrite()
when delalloc is not enabled.

Fix this by ensuring that there are sufficient journal credits before
allocating an extent in mpage_map_one_extent() and
ext4_block_write_begin(). If there are not enough credits, return
-EAGAIN, exit the current mapping loop, restart a new handle and a new
transaction, and allocating blocks on this folio again in the next
iteration.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-6-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: refactor the block allocation process of ext4_page_mkwrite()

The block allocation process and error handling in ext4_page_mkwrite()
is complex now. Refactor it by introducing a new helper function,
ext4_block_page_mkwrite(). It will call ext4_block_write_begin() to
allocate blocks instead of directly calling block_page_mkwrite().
Preparing to implement retry logic in a subsequent patch to address
situations where the reserved journal credits are insufficient.
Additionally, this modification will help prevent potential deadlocks
that may occur when waiting for folio writeback while holding the
transaction handle.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-5-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: fix stale data if it bail out of the extents mapping loop

During the process of writing back folios, if
mpage_map_and_submit_extent() exits the extent mapping loop due to an
ENOSPC or ENOMEM error, it may result in stale data or filesystem
inconsistency in environments where the block size is smaller than the
folio size.

When mapping a discontinuous folio in mpage_map_and_submit_extent(),
some buffers may have already be mapped. If we exit the mapping loop
prematurely, the folio data within the mapped range will not be written
back, and the file's disk size will not be updated. Once the transaction
that includes this range of extents is committed, this can lead to stale
data or filesystem inconsistency.

Fix this by submitting the current processing partially mapped folio.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-4-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: move the calculation of wbc->nr_to_write to mpage_folio_done()

mpage_folio_done() should be a more appropriate place than
mpage_submit_folio() for updating the wbc->nr_to_write after we have
submitted a fully mapped folio. Preparing to make mpage_submit_folio()
allows to submit partially mapped folio that is still under processing.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250707140814.542883-3-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: process folios writeback in bytes

Since ext4 supports large folios, processing writebacks in pages is no
longer appropriate, it can be modified to process writebacks in bytes.

Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://patch.msgid.link/20250707140814.542883-2-yi.zhang@huaweicloud.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: remove unused EXT_STATS macro from ext4_extents.h

The EXT_STATS macro in fs/ext4/ext4_extents.h has been defined
but never used in the codebase since its introduction. This patch
removes it.

Analysis:
1. No references found in fs/ext4/ or other kernel code.
2. No impact on compilation or functionality.
3. Git history shows it was never utilized.

Signed-off-by: Baolin Liu <liubaolin@kylinos.cn>
Reviewed-by: Baokun Li <libaokun1@huawei.com>
Link: https://patch.msgid.link/20250527053805.1550912-1-liubaolin12138@163.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: remove unnecessary duplicate check in ext4_map_blocks()

The previous lines ensure that EXT4_GET_BLOCKS_QUERY_LAST_IN_LEAF is
set so remove this duplicate check.

Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Link: https://patch.msgid.link/aDCdjUhpzxB64vkD@stanley.mountain
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

ext4: remove duplicate check for EXT4_FC_REPLAY

EXT4_FC_REPLAY will be checked in ext4_es_lookup_extent(). If it is
set, ext4_es_lookup_extent() will return 0.

Remove the repeated check for EXT4_FC_REPLAY in ext4_map_blocks()
to simplify the code.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Link: https://patch.msgid.link/20250429111722.294975-1-alexjlzheng@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Linux 6.16-rc4

Merge tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging driver fix from Greg KH:
"Here is a single staging driver fix for 6.16-rc4. It resolves a build
  error in the rtl8723bs driver for some versions of clang on arm64 when
  checking the frame size with -Wframe-larger-than.

  It has been in linux-next for a while now with no reported issues"

* tag 'staging-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
  staging: rtl8723bs: Avoid memset() in aes_cipher() and aes_decipher()

Merge tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

Pull tty/serial driver fixes from Greg KH:
"Here are five small serial and tty and vt fixes for 6.16-rc4. Included
  in here are:

   - kerneldoc fixes for recent vt changes

   - imx serial driver fix

   - of_node sysfs fix for a regression

   - vt missing notification fix

   - 8250 dt bindings fix

  All of these have been in linux-next for a while with no reported issues"

* tag 'tty-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  dt-bindings: serial: 8250: Make clocks and clock-frequency exclusive
  serial: imx: Restore original RXTL for console to fix data loss
  serial: core: restore of_node information in sysfs
  vt: fix kernel-doc warnings in ucs_get_fallback()
  vt: add missing notification when switching back to text mode

Merge tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras

Pull EDAC fix from Borislav Petkov:

- Consider secondary address mask registers in amd64_edac in order to
get the correct total memory size of the system

* tag 'edac_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/amd64: Fix size calculation for Non-Power-of-Two DIMMs

Merge tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 fixes from Borislav Petkov:

- Make sure DR6 and DR7 are initialized to their architectural values
   and not accidentally cleared, leading to misconfigurations

* tag 'x86_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/traps: Initialize DR7 by writing its architectural reset value
  x86/traps: Initialize DR6 by writing its architectural reset value

Merge tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull perf fix from Borislav Petkov:

- Make sure an AUX perf event is really disabled when it overruns

* tag 'perf_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/aux: Fix pending disable flow when the AUX ring buffer overruns

Merge tag 'locking_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking fix from Borislav Petkov:

- Make sure the new futex phash is not copied during fork in order to
avoid a double-free

* tag 'locking_urgent_for_v6.16_rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
futex: Initialize futex_phash_new during fork().

Merge tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux

Pull i2c fixes from Wolfram Sang:

- imx: fix SMBus protocol compliance during block read

- omap: fix error handling path in probe

- robotfuzz, tiny-usb: prevent zero-length reads

- x86, designware, amdisp: fix build error when modules are disabled
   (agreed to go in via i2c)

- scx200_acb: fix build error because of missing HAS_IOPORT

* tag 'i2c-for-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
  i2c: scx200_acb: depends on HAS_IOPORT
  i2c: omap: Fix an error handling path in omap_i2c_probe()
  platform/x86: Use i2c adapter name to fix build errors
  i2c: amd-isp: Initialize unique adapter name
  i2c: designware: Initialize adapter name only when not set
  i2c: tiny-usb: disable zero-length read messages
  i2c: robotfuzz-osif: disable zero-length read messages
  i2c: imx: fix emulated smbus block read

Merge tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace

Pull tracing fix from Steven Rostedt:

- Fix possible UAF on error path in filter_free_subsystem_filters()

   When freeing a subsystem filter, the filter for the subsystem is
   passed in to be freed and all the events within the subsystem will
   have their filter freed too. In order to free without waiting for RCU
   synchronization, list items are allocated to hold what is going to be
   freed to free it via a call_rcu(). If the allocation of these items
   fails, it will call the synchronization directly and free after that
   (causing a bit of delay for the user).

   The subsystem filter is first added to this list and then the filters
   for all the events under the subsystem. The bug is if one of the
   allocations of the list items for the event filters fail to allocate,
   it jumps to the "free_now" label which will free the subsystem
   filter, then all the items on the allocated list, and then the event
   filters that were not added to the list yet. But because the
   subsystem filter was added first, it gets freed twice.

   The solution is to add the subsystem filter after the events, and
   then if any of the allocations fail it will not try to free any of
   them twice

* tag 'trace-v6.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
  tracing: Fix filter logic error

Merge tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson

Pull LoongArch fixes from Huacai Chen:

- replace __ASSEMBLY__ with __ASSEMBLER__ in headers like others

- fix build warnings about export.h

- reserve the EFI memory map region for kdump

- handle __init vs inline mismatches

- fix some KVM bugs

* tag 'loongarch-fixes-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
  LoongArch: KVM: Disable updating of "num_cpu" and "feature"
  LoongArch: KVM: Check validity of "num_cpu" from user space
  LoongArch: KVM: Check interrupt route from physical CPU
  LoongArch: KVM: Fix interrupt route update with EIOINTC
  LoongArch: KVM: Add address alignment check for IOCSR emulation
  LoongArch: KVM: Avoid overflow with array index
  LoongArch: Handle KCOV __init vs inline mismatches
  LoongArch: Reserve the EFI memory map region
  LoongArch: Fix build warnings about export.h
  LoongArch: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers

Merge tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6

Pull smb client fixes from Steve French:

- Multichannel reconnect lock ordering deadlock fix

- Fix for regression in handling native Windows symlinks

- Three smbdirect fixes:
     - oops in RDMA response processing
     - smbdirect memcpy issue
     - fix smbdirect regression with large writes (smbdirect test cases
       now all passing)

- Fix for "FAILED_TO_PARSE" warning in trace-cmd report output

* tag 'v6.16-rc3-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code
  cifs: Fix the smbd_response slab to allow usercopy
  smb: client: fix potential deadlock when reconnecting channels
  smb: client: remove \t from TP_printk statements
  smb: client: let smbd_post_send_iter() respect the peers max_send_size and transmit all data
  smb: client: fix regression with native SMB symlinks

Merge tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

Pull misc fixes from Andrew Morton:
"16 hotfixes.

  6 are cc:stable and the remainder address post-6.15 issues or aren't
  considered necessary for -stable kernels. 5 are for MM"

* tag 'mm-hotfixes-stable-2025-06-27-16-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
  MAINTAINERS: add Lorenzo as THP co-maintainer
  mailmap: update Duje Mihanović's email address
  selftests/mm: fix validate_addr() helper
  crashdump: add CONFIG_KEYS dependency
  mailmap: correct name for a historical account of Zijun Hu
  mailmap: add entries for Zijun Hu
  fuse: fix runtime warning on truncate_folio_batch_exceptionals()
  scripts/gdb: fix dentry_name() lookup
  mm/damon/sysfs-schemes: free old damon_sysfs_scheme_filter->memcg_path on write
  mm/alloc_tag: fix the kmemleak false positive issue in the allocation of the percpu variable tag->counters
  lib/group_cpus: fix NULL pointer dereference from group_cpus_evenly()
  mm/hugetlb: remove unnecessary holding of hugetlb_lock
  MAINTAINERS: add missing files to mm page alloc section
  MAINTAINERS: add tree entry to mm init block
  mm: add OOM killer maintainer structure
  fs/proc/task_mmu: fix PAGE_IS_PFNZERO detection for the huge zero folio

Merge tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V Fixes for 5.16-rc4

- .rodata is no longer linkd into PT_DYNAMIC.

   It was not supposed to be there in the first place and resulted in
   invalid (but unused) entries. This manifests as at least warnings in
   llvm-readelf

- A fix for runtime constants with all-0 upper 32-bits. This should
   only manifest on MMU=n kernels

- A fix for context save/restore on systems using the T-Head vector
   extensions

- A fix for a conflicting "+r"/"r" register constraint in the VDSO
   getrandom syscall wrapper, which is undefined behavior in clang

- A fix for a missing register clobber in the RVV raid6 implementation.

   This manifests as a NULL pointer reference on some compilers, but
   could trigger in other ways

- Misaligned accesses from userspace at faulting addresses are now
   handled correctly

- A fix for an incorrect optimization that allowed access_ok() to mark
   invalid addresses as accessible, which can result in userspace
   triggering BUG()s

- A few fixes for build warnings, and an update to Drew's email address

* tag 'riscv-for-linus-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: export boot_cpu_hartid
  Revert "riscv: Define TASK_SIZE_MAX for __access_ok()"
  riscv: Fix sparse warning in vendor_extensions/sifive.c
  Revert "riscv: misaligned: fix sleeping function called during misaligned access handling"
  MAINTAINERS: Update Drew Fustini's email address
  RISC-V: uaccess: Wrap the get_user_8 uaccess macro
  raid6: riscv: Fix NULL pointer dereference caused by a missing clobber
  RISC-V: vDSO: Correct inline assembly constraints in the getrandom syscall wrapper
  riscv: vector: Fix context save/restore with xtheadvector
  riscv: fix runtime constant support for nommu kernels
  riscv: vdso: Exclude .rodata from the PT_DYNAMIC segment

Merge tag 'pci-v6.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci

Pull PCI fix from Bjorn Helgaas:

- Fix a PTM debugfs build error with CONFIG_DEBUG_FS=n &&
CONFIG_PCIE_PTM=y (Manivannan Sadhasivam)

* tag 'pci-v6.16-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI/PTM: Build debugfs code only if CONFIG_DEBUG_FS is enabled

Merge tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel

Pull drm fixes from Dave Airlie:
"Regular weekly drm updates, nothing out of the ordinary, amdgpu, xe,
  i915 and a few misc bits. Seems about right for this time in the
  release cycle.

  core:
   - fix drm_writeback_connector_cleanup function signature
   - use correct HDMI audio bridge in drm_connector_hdmi_audio_init

  bridge:
   - SN65DSI86: fix HPD

  amdgpu:
   - Cleaner shader support for additional GFX9 GPUs
   - MES firmware compatibility fixes
   - Discovery error reporting fixes
   - SDMA6/7 userq fixes
   - Backlight fix
   - EDID sanity check

  i915:
   - Fix for SNPS PHY HDMI for 1080p@120Hz
   - Correct DP AUX DPCD probe address
   - Followup build fix for GCOV and AutoFDO enabled config

  xe:
   - Missing error check
   - Fix xe_hwmon_power_max_write
   - Move flushes
   - Explicitly exit CT safe mode on unwind
   - Process deferred GGTT node removals on device unwind"

* tag 'drm-fixes-2025-06-28' of https://gitlab.freedesktop.org/drm/kernel:
  drm/xe: Process deferred GGTT node removals on device unwind
  drm/xe/guc: Explicitly exit CT safe mode on unwind
  drm/xe: move DPT l2 flush to a more sensible place
  drm/xe: Move DSB l2 flush to a more sensible place
  drm/bridge: ti-sn65dsi86: Add HPD for DisplayPort connector type
  drm/i915: fix build error some more
  drm/xe/hwmon: Fix xe_hwmon_power_max_write
  drm/xe/display: Add check for alloc_ordered_workqueue()
  drm/amd/display: Add sanity checks for drm_edid_raw()
  drm/amd/display: Fix AMDGPU_MAX_BL_LEVEL value
  drm/amdgpu/sdma7: add ucode version checks for userq support
  drm/amdgpu/sdma6: add ucode version checks for userq support
  drm/amd: Adjust output for discovery error handling
  drm/amdgpu/mes: add compatibility checks for set_hw_resource_1
  drm/amdgpu/gfx9: Add Cleaner Shader Support for GFX9.x GPUs
  drm/bridge-connector: Fix bridge in drm_connector_hdmi_audio_init()
  drm/dp: Change AUX DPCD probe address from DPCD_REV to LANE0_1_STATUS
  drm/i915/snps_hdmi_pll: Fix 64-bit divisor truncation by using div64_u64
  drm: writeback: Fix drm_writeback_connector_cleanup signature

Merge tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl

Pull Compute Express Link (CXL) fixes from Dave Jiang:
"These fixes address a few issues in the CXL subsystem, including
  dealing with some bugs in the CXL EDAC and RAS drivers:

   - Fix return value of cxlctl_validate_set_features()

   - Fix min_scrub_cycle of a region miscaculation and add additional
     documentation

   - Fix potential memory leak issues for CXL EDAC

   - Fix CPER handler device confusion for CXL RAS

   - Fix using wrong repair type to check DRAM event record"

* tag 'cxl-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl:
  cxl/edac: Fix using wrong repair type to check dram event record
  cxl/ras: Fix CPER handler device confusion
  cxl/edac: Fix potential memory leak issues
  cxl/Documentation: Add more description about min/max scrub cycle
  cxl/edac: Fix the min_scrub_cycle of a region miscalculation
  cxl: fix return value in cxlctl_validate_set_features()

Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux

Pull crypto library fix from Eric Biggers:
"Fix a regression where the purgatory code sometimes fails to build"

* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: sha256: Mark sha256_choose_blocks as __always_inline

Merge tag 'drm-misc-fixes-2025-06-26' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes

drm-misc-fixes for v6.16-rc4:
- Fix function signature of drm_writeback_connector_cleanup.
- Use correct HDMI audio bridge in drm_connector_hdmi_audio_init.
- Make HPD work on SN65DSI86.

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Link: https://lore.kernel.org/r/3dd1d5e1-73b6-4b0c-a208-f7d6235cf530@linux.intel.com

tracing: Fix filter logic error

If the processing of the tr->events loop fails, the filter that has been
added to filter_head will be released twice in free_filter_list(&head->rcu)
and __free_filter(filter).

After adding the filter of tr->events, add the filter to the filter_head
process to avoid triggering uaf.

Link: https://lore.kernel.org/tencent_4EF87A626D702F816CD0951CE956EC32CD0A@qq.com
Fixes: a9d0aab5eb33 ("tracing: Fix regression of filter waiting a long time on RCU synchronization")
Reported-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=daba72c4af9915e9c894
Tested-by: syzbot+daba72c4af9915e9c894@syzkaller.appspotmail.com
Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Edward Adam Davis <eadavis@qq.com>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>

Merge tag 'acpi-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull ACPI fix from Rafael Wysocki:
"Revert a commit that attempted to fix a memory leak in an error code
path and introduced a different issue (Zhe Qiao)"

* tag 'acpi-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
Revert "PCI/ACPI: Fix allocated memory release on error in pci_acpi_scan_root()"

Merge tag 'block-6.16-20250626' of git://git.kernel.dk/linux

Pull block fixes from Jens Axboe:

- Fixes for ublk:
      - fix C++ narrowing warnings in the uapi header
      - update/improve UBLK_F_SUPPORT_ZERO_COPY comment in uapi header
      - fix for the ublk ->queue_rqs() implementation, limiting a batch
        to just the specific task AND ring
      - ublk_get_data() error handling fix
      - sanity check more arguments in ublk_ctrl_add_dev()
      - selftest addition

- NVMe pull request via Christoph:
      - reset delayed remove_work after reconnect
      - fix atomic write size validation

- Fix for a warning introduced in bdev_count_inflight_rw() in this
   merge window

* tag 'block-6.16-20250626' of git://git.kernel.dk/linux:
  block: fix false warning in bdev_count_inflight_rw()
  ublk: sanity check add_dev input for underflow
  nvme: fix atomic write size validation
  nvme: refactor the atomic write unit detection
  nvme: reset delayed remove_work after reconnect
  ublk: setup ublk_io correctly in case of ublk_get_data() failure
  ublk: update UBLK_F_SUPPORT_ZERO_COPY comment in UAPI header
  ublk: fix narrowing warnings in UAPI header
  selftests: ublk: don't take same backing file for more than one ublk devices
  ublk: build batch from IOs in same io_ring_ctx and io task

Merge tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux

Pull io_uring fixes from Jens Axboe:

- Two tweaks for a recent fix: fixing a memory leak if multiple iovecs
   were initially mapped but only the first was used and hence turned
   into a UBUF rathan than an IOVEC iterator, and catching a case where
   a retry would be done even if the previous segment wasn't full

- Small series fixing an issue making the vm unhappy if debugging is
   turned on, hitting a VM_BUG_ON_PAGE()

- Fix a resource leak in io_import_dmabuf() in the error handling case,
   which is a regression in this merge window

- Mark fallocate as needing to be write serialized, as is already done
   for truncate and buffered writes

* tag 'io_uring-6.16-20250626' of git://git.kernel.dk/linux:
  io_uring/kbuf: flag partial buffer mappings
  io_uring/net: mark iov as dynamically allocated even for single segments
  io_uring: fix resource leak in io_import_dmabuf()
  io_uring: don't assume uaddr alignment in io_vec_fill_bvec
  io_uring/rsrc: don't rely on user vaddr alignment
  io_uring/rsrc: fix folio unpinning
  io_uring: make fallocate be hashed work

Merge tag 'ata-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux

Pull ata fix from Niklas Cassel:

- Use the correct DMI identifier for ASUSPRO-D840SA LPM quirk such that
the quirk actually gets applied (me)

* tag 'ata-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/libata/linux:
ata: ahci: Use correct DMI identifier for ASUSPRO-D840SA LPM quirk

Merge tag 's390-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux

Pull s390 fixes from Alexander Gordeev:

- Fix incorrectly dropped dereferencing of the stack nth entry
   introduced with a previous KASAN false positive fix

- Use a proper memdup_array_user() helper to prevent overflow in a
   protected key size calculation

* tag 's390-6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/ptrace: Fix pointer dereferencing in regs_get_kernel_stack_nth()
  s390/pkey: Prevent overflow in size calculation for memdup_user()

Merge tag 'sound-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"A collection of small fixes again:

   - A regression fix for hibernation bug in ASoC SoundWire

   - Fixes for the new Qualcomm USB offload stuff

   - A potential OOB access fix in USB-audio

   - A potential memleadk fix in ASoC Intel

   - Quirks for HD-audio and ASoC AMD ACP"

* tag 'sound-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek: Fix built-in mic on ASUS VivoBook X507UAR
  ALSA: usb: qcom: fix NULL pointer dereference in qmi_stop_session
  ASoC: SOF: Intel: hda: Use devm_kstrdup() to avoid memleak.
  ASoC: rt721-sdca: fix boost gain calculation error
  ALSA: qc_audio_offload: Fix missing error code in prepare_qmi_response()
  ALSA: hda/realtek: Add mic-mute LED setup for ASUS UM5606
  ALSA: usb-audio: Fix out-of-bounds read in snd_usb_get_audioformat_uac3()
  ALSA: hda/realtek: fix mute/micmute LEDs for HP EliteBook 6 G1a
  ASoC: amd: ps: fix for soundwire failures during hibernation exit sequence
  ASoC: amd: yc: Add DMI quirk for Lenovo IdeaPad Slim 5 15
  ASoC: amd: yc: add quirk for Acer Nitro ANV15-41 internal mic
  ASoC: qcom: sm8250: Fix possibly undefined reference
  ALSA: hda/realtek - Enable mute LED on HP Pavilion Laptop 15-eg100
  ALSA: hda/realtek: Add quirks for some Clevo laptops

i2c: scx200_acb: depends on HAS_IOPORT

It already depends on X86_32, but that's also set for ARCH=um.
Recent changes made UML no longer have IO port access since
it's not needed, but this driver uses it. Build it only for
HAS_IOPORT. This is pretty much the same as depending on X86,
but on the off-chance that HAS_IOPORT will ever be optional
on x86 HAS_IOPORT is the real prerequisite.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>

LoongArch: KVM: Disable updating of "num_cpu" and "feature"

Property "num_cpu" and "feature" are read-only once eiointc is created,
which are set with KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL attr group before
device creation.

Attr group KVM_DEV_LOONGARCH_EXTIOI_GRP_SW_STATUS is to update register
and software state for migration and reset usage, property "num_cpu" and
"feature" can not be update again if it is created already.

Here discard write operation with property "num_cpu" and "feature" in
attr group KVM_DEV_LOONGARCH_EXTIOI_GRP_CTRL.

Cc: stable@vger.kernel.org
Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Check validity of "num_cpu" from user space

The maximum supported cpu number is EIOINTC_ROUTE_MAX_VCPUS about
irqchip EIOINTC, here add validation about cpu number to avoid array
pointer overflow.

Cc: stable@vger.kernel.org
Fixes: 1ad7efa552fd ("LoongArch: KVM: Add EIOINTC user mode read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Check interrupt route from physical CPU

With EIOINTC interrupt controller, physical CPU ID is set for irq route.
However the function kvm_get_vcpu() is used to get destination vCPU when
delivering irq. With API kvm_get_vcpu(), the logical CPU ID is used.

With API kvm_get_vcpu_by_cpuid(), vCPU ID can be searched from physical
CPU ID.

Cc: stable@vger.kernel.org
Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Fix interrupt route update with EIOINTC

With function eiointc_update_sw_coremap(), there is forced assignment
like val = *(u64 *)pvalue. Parameter pvalue may be pointer to char type
or others, there is problem with forced assignment with u64 type.

Here the detailed value is passed rather address pointer.

Cc: stable@vger.kernel.org
Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: KVM: Add address alignment check for IOCSR emulation

IOCSR instruction supports 1/2/4/8 bytes access, the address should be
naturally aligned with its access size. Here address alignment check is
added in the EIOINTC kernel emulation.

Cc: stable@vger.kernel.org
Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

Merge tag 'i2c-host-fixes-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current

i2c-host fixes for v6.16-rc4

- imx: fix SMBus protocol compliance during block read
- omap: fix error handling path in probe
- robotfuzz, tiny-usb: prevent zero-length reads
- x86, designware, amdisp: fix build error when modules are
disabled

Merge tag 'v6.16-p6' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6

Pull crypto fix from Herbert Xu:
"This fixes a regression where wp512 can no longer be used with hmac"

* tag 'v6.16-p6' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: wp512 - Use API partial block handling

Merge tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs

Pull bcachefs fixes from Kent Overstreet:

- Lots of small check/repair fixes, primarily in subvol loop and
   directory structure loop (when involving snapshots).

- Fix a few 6.16 regressions: rare UAF in the foreground allocator path
   when taking a transaction restart from the transaction bump
   allocator, and some small fallout from the change to log the error
   being corrected in the journal when repairing errors, also some
   fallout from the btree node read error logging improvements.

   (Alan, Bharadwaj)

- New option: journal_rewind

   This lets the entire filesystem be reset to an earlier point in time.

   Note that this is only a disaster recovery tool, and right now there
   are major caveats to using it (discards should be disabled, in
   particular), but it successfully restored the filesystem of one of
   the users who was bit by the subvolume deletion bug and didn't have
   backups. I'll likely be making some changes to the discard path in
   the future to make this a reliable recovery tool.

- Some new btree iterator tracepoints, for tracking down some
   livelock-ish behaviour we've been seeing in the main data write path.

* tag 'bcachefs-2025-06-26' of git://evilpiepirate.org/bcachefs: (51 commits)
  bcachefs: Plumb correct ip to trans_relock_fail tracepoint
  bcachefs: Ensure we rewind to run recovery passes
  bcachefs: Ensure btree node scan runs before checking for scanned nodes
  bcachefs: btree_root_unreadable_and_scan_found_nothing should not be autofix
  bcachefs: fix bch2_journal_keys_peek_prev_min() underflow
  bcachefs: Use wait_on_allocator() when allocating journal
  bcachefs: Check for bad write buffer key when moving from journal
  bcachefs: Don't unlock the trans if ret doesn't match BCH_ERR_operation_blocked
  bcachefs: Fix range in bch2_lookup_indirect_extent() error path
  bcachefs: fix spurious error_throw
  bcachefs: Add missing bch2_err_class() to fileattr_set()
  bcachefs: Add missing key type checks to check_snapshot_exists()
  bcachefs: Don't log fsck err in the journal if doing repair elsewhere
  bcachefs: Fix *__bch2_trans_subbuf_alloc() error path
  bcachefs: Fix missing newlines before ero
  bcachefs: fix spurious error in read_btree_roots()
  bcachefs: fsck: Fix oops in key_visible_in_snapshot()
  bcachefs: fsck: fix unhandled restart in topology repair
  bcachefs: fsck: Fix check_directory_structure when no check_dirents
  bcachefs: Fix restart handling in btree_node_scrub_work()
  ...

Merge tag 'hid-for-linus-2025062701' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid

Pull HID fixes from Jiri Kosina:

- fix for stalls during suspend/resume cycles with hid-nintendo (Daniel
   J. Ogorchock)

- memory leak and reference count fixes in hid-wacom and in-appletb-kdb
   (Qasim Ijaz)

- race condition (leading to kernel crash) fix during device removal in
   hid-wacom (Thomas Zeitlhofer)

- fix for missed interrupt in intel-thc-hid (Intel-thc-hid:)

- support for a bunch of new device IDs

* tag 'hid-for-linus-2025062701' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
  HID: lenovo: Add support for ThinkPad X1 Tablet Thin Keyboard Gen2
  HID: appletb-kbd: fix "appletb_backlight" backlight device reference counting
  HID: wacom: fix crash in wacom_aes_battery_handler()
  HID: intel-ish-hid: ipc: Add Wildcat Lake PCI device ID
  hid: intel-ish-hid: Use PCI_DEVICE_DATA() macro for ISH device table
  HID: lenovo: Restrict F7/9/11 mode to compact keyboards only
  HID: Add IGNORE quirk for SMARTLINKTECHNOLOGY
  HID: input: lower message severity of 'No inputs registered, leaving' to debug
  HID: quirks: Add quirk for 2 Chicony Electronics HP 5MP Cameras
  HID: Intel-thc-hid: Intel-quicki2c: Enhance QuickI2C reset flow
  HID: nintendo: avoid bluetooth suspend/resume stalls
  HID: wacom: fix kobject reference count leak
  HID: wacom: fix memory leak on sysfs attribute creation failure
  HID: wacom: fix memory leak on kobject creation failure

Merge tag 'drm-xe-fixes-2025-06-26' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

UAPI Changes:
Driver Changes:
- Missing error check (Haoxiang Li)
- Fix xe_hwmon_power_max_write (Karthik)
- Move flushes (Maarten and Matthew Auld)
- Explicitly exit CT safe mode on unwind (Michal)
- Process deferred GGTT node removals on device unwind (Michal)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Thomas Hellstrom <thomas.hellstrom@linux.intel.com>
Link: https://lore.kernel.org/r/aF1T6EzzC3xj4K4H@fedora

Merge tag 'drm-intel-fixes-2025-06-26' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Fix for SNPS PHY HDMI for 1080p@120Hz
- Correct DP AUX DPCD probe address
- Followup build fix for GCOV and AutoFDO enabled config

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://lore.kernel.org/r/aFzsHR9WLYsxg8jy@jlahtine-mobl

Merge tag 'amd-drm-fixes-6.16-2025-25-25' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-6.16-2025-25-25:

amdgpu:
- Cleaner shader support for additional GFX9 GPUs
- MES firmware compatibility fixes
- Discovery error reporting fixes
- SDMA6/7 userq fixes
- Backlight fix
- EDID sanity check

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://lore.kernel.org/r/20250625151734.11537-1-alexander.deucher@amd.com

Merge tag 'devicetree-fixes-for-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux

Pull devicetree fixes from Rob Herring:

- Convert altr,uart-1.0 and altr,juart-1.0 to DT schema. These were
   applied for nios2, but never sent upstream.

- Fix extra '/' in fsl,ls1028a-reset '$id' path

- Fix warnings in ti,sn65dsi83 schema due to unnecessary $ref.

* tag 'devicetree-fixes-for-6.16-1' of git://git.kernel.org/pub/scm/linux/kernel/git/robh/linux:
  dt-bindings: serial: Convert altr,uart-1.0 to DT schema
  dt-bindings: serial: Convert altr,juart-1.0 to DT schema
  dt-bindings: soc: fsl,ls1028a-reset: Drop extra "/" in $id
  dt-bindings: drm/bridge: ti-sn65dsi83: drop $ref to fix lvds-vod* warnings

io_uring/kbuf: flag partial buffer mappings

A previous commit aborted mapping more for a non-incremental ring for
bundle peeking, but depending on where in the process this peeking
happened, it would not necessarily prevent a retry by the user. That can
create gaps in the received/read data.

Add struct buf_sel_arg->partial_map, which can pass this information
back. The networking side can then map that to internal state and use it
to gate retry as well.

Since this necessitates a new flag, change io_sr_msg->retry to a
retry_flags member, and store both the retry and partial map condition
in there.

Cc: stable@vger.kernel.org
Fixes: 26ec15e4b0c1 ("io_uring/kbuf: don't truncate end buffer for multiple buffer peeks")
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'net-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from Paolo Abeni:
"Including fixes from bluetooth and wireless.

  Current release - regressions:

   - bridge: fix use-after-free during router port configuration

  Current release - new code bugs:

   - eth: wangxun: fix the creation of page_pool

  Previous releases - regressions:

   - netpoll: initialize UDP checksum field before checksumming

   - wifi: mac80211: finish link init before RCU publish

   - bluetooth: fix use-after-free in vhci_flush()

   - eth:
      - ionic: fix DMA mapping test
      - bnxt: properly flush XDP redirect lists

  Previous releases - always broken:

   - netlink: specs: enforce strict naming of properties

   - unix: don't leave consecutive consumed OOB skbs.

   - vsock: fix linux/vm_sockets.h userspace compilation errors

   - selftests: fix TCP packet checksum"

* tag 'net-6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
  net: libwx: fix the creation of page_pool
  net: selftests: fix TCP packet checksum
  atm: Release atm_dev_mutex after removing procfs in atm_dev_deregister().
  netlink: specs: enforce strict naming of properties
  netlink: specs: tc: replace underscores with dashes in names
  netlink: specs: rt-link: replace underscores with dashes in names
  netlink: specs: mptcp: replace underscores with dashes in names
  netlink: specs: ovs_flow: replace underscores with dashes in names
  netlink: specs: devlink: replace underscores with dashes in names
  netlink: specs: dpll: replace underscores with dashes in names
  netlink: specs: ethtool: replace underscores with dashes in names
  netlink: specs: fou: replace underscores with dashes in names
  netlink: specs: nfsd: replace underscores with dashes in names
  net: enetc: Correct endianness handling in _enetc_rd_reg64
  atm: idt77252: Add missing `dma_map_error()`
  bnxt: properly flush XDP redirect lists
  vsock/uapi: fix linux/vm_sockets.h userspace compilation errors
  wifi: mac80211: finish link init before RCU publish
  wifi: iwlwifi: mvm: assume '1' as the default mac_config_cmd version
  selftest: af_unix: Add tests for -ECONNRESET.
  ...

cifs: Fix reading into an ITER_FOLIOQ from the smbdirect code

When performing a file read from RDMA, smbd_recv() prints an "Invalid msg
type 4" error and fails the I/O. This is due to the switch-statement there
not handling the ITER_FOLIOQ handed down from netfslib.

Fix this by collapsing smbd_recv_buf() and smbd_recv_page() into
smbd_recv() and just using copy_to_iter() instead of memcpy(). This
future-proofs the function too, in case more ITER_* types are added.

Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: Stefan Metzmacher <metze@samba.org>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Tom Talpey <tom@talpey.com>
cc: Paulo Alcantara (Red Hat) <pc@manguebit.com>
cc: Matthew Wilcox <willy@infradead.org>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

cifs: Fix the smbd_response slab to allow usercopy

The handling of received data in the smbdirect client code involves using
copy_to_iter() to copy data from the smbd_reponse struct's packet trailer
to a folioq buffer provided by netfslib that encapsulates a chunk of
pagecache.

If, however, CONFIG_HARDENED_USERCOPY=y, this will result in the checks
then performed in copy_to_iter() oopsing with something like the following:

CIFS: Attempting to mount //172.31.9.1/test
CIFS: VFS: RDMA transport established
usercopy: Kernel memory exposure attempt detected from SLUB object 'smbd_response_0000000091e24ea1' (offset 81, size 63)!
------------[ cut here ]------------
kernel BUG at mm/usercopy.c:102!
...
RIP: 0010:usercopy_abort+0x6c/0x80
...
Call Trace:
  <TASK>
  __check_heap_object+0xe3/0x120
  __check_object_size+0x4dc/0x6d0
  smbd_recv+0x77f/0xfe0 [cifs]
  cifs_readv_from_socket+0x276/0x8f0 [cifs]
  cifs_read_from_socket+0xcd/0x120 [cifs]
  cifs_demultiplex_thread+0x7e9/0x2d50 [cifs]
  kthread+0x396/0x830
  ret_from_fork+0x2b8/0x3b0
  ret_from_fork_asm+0x1a/0x30

The problem is that the smbd_response slab's packet field isn't marked as
being permitted for usercopy.

Fix this by passing parameters to kmem_slab_create() to indicate that
copy_to_iter() is permitted from the packet region of the smbd_response
slab objects, less the header space.

Fixes: ee4cdf7ba857 ("netfs: Speed up buffered reading")
Reported-by: Stefan Metzmacher <metze@samba.org>
Link: https://lore.kernel.org/r/acb7f612-df26-4e2a-a35d-7cd040f513e1@samba.org/
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Stefan Metzmacher <metze@samba.org>
Tested-by: Stefan Metzmacher <metze@samba.org>
cc: Paulo Alcantara <pc@manguebit.com>
cc: linux-cifs@vger.kernel.org
cc: netfs@lists.linux.dev
cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Steve French <stfrench@microsoft.com>

smb: client: fix potential deadlock when reconnecting channels

Fix cifs_signal_cifsd_for_reconnect() to take the correct lock order
and prevent the following deadlock from happening

======================================================
WARNING: possible circular locking dependency detected
6.16.0-rc3-build2+ #1301 Tainted: G S      W
------------------------------------------------------
cifsd/6055 is trying to acquire lock:
ffff88810ad56038 (&tcp_ses->srv_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x134/0x200

but task is already holding lock:
ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #2 (&ret_buf->chan_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_setup_session+0x81/0x4b0
       cifs_get_smb_ses+0x771/0x900
       cifs_mount_get_session+0x7e/0x170
       cifs_mount+0x92/0x2d0
       cifs_smb3_do_mount+0x161/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #1 (&ret_buf->ses_lock){+.+.}-{3:3}:
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_match_super+0x101/0x320
       sget+0xab/0x270
       cifs_smb3_do_mount+0x1e0/0x460
       smb3_get_tree+0x55/0x90
       vfs_get_tree+0x46/0x180
       do_new_mount+0x1b0/0x2e0
       path_mount+0x6ee/0x740
       do_mount+0x98/0xe0
       __do_sys_mount+0x148/0x180
       do_syscall_64+0xa4/0x260
       entry_SYSCALL_64_after_hwframe+0x76/0x7e

-> #0 (&tcp_ses->srv_lock){+.+.}-{3:3}:
       check_noncircular+0x95/0xc0
       check_prev_add+0x115/0x2f0
       validate_chain+0x1cf/0x270
       __lock_acquire+0x60e/0x780
       lock_acquire.part.0+0xb4/0x1f0
       _raw_spin_lock+0x2f/0x40
       cifs_signal_cifsd_for_reconnect+0x134/0x200
       __cifs_reconnect+0x8f/0x500
       cifs_handle_standard+0x112/0x280
       cifs_demultiplex_thread+0x64d/0xbc0
       kthread+0x2f7/0x310
       ret_from_fork+0x2a/0x230
       ret_from_fork_asm+0x1a/0x30

other info that might help us debug this:

Chain exists of:
  &tcp_ses->srv_lock --> &ret_buf->ses_lock --> &ret_buf->chan_lock

Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(&ret_buf->chan_lock);
                               lock(&ret_buf->ses_lock);
                               lock(&ret_buf->chan_lock);
  lock(&tcp_ses->srv_lock);

*** DEADLOCK ***

3 locks held by cifsd/6055:
#0: ffffffff857de398 (&cifs_tcp_ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x7b/0x200
#1: ffff888119c64060 (&ret_buf->ses_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0x9c/0x200
#2: ffff888119c64330 (&ret_buf->chan_lock){+.+.}-{3:3}, at: cifs_signal_cifsd_for_reconnect+0xcf/0x200

Cc: linux-cifs@vger.kernel.org
Reported-by: David Howells <dhowells@redhat.com>
Fixes: d7d7a66aacd6 ("cifs: avoid use of global locks for high contention data")
Reviewed-by: David Howells <dhowells@redhat.com>
Tested-by: David Howells <dhowells@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.org>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>

block: fix false warning in bdev_count_inflight_rw()

While bdev_count_inflight is interating all cpus, if some IOs are issued
from traversed cpu and then completed from the cpu that is not traversed
yet:

cpu0
cpu1
bdev_count_inflight
//for_each_possible_cpu
// cpu0 is 0
infliht += 0
// issue a io
blk_account_io_start
// cpu0 inflight ++

cpu2
// the io is done
blk_account_io_done
// cpu2 inflight --
// cpu 1 is 0
inflight += 0
// cpu2 is -1
inflight += -1
...

In this case, the total inflight will be -1, causing lots of false
warning. Fix the problem by removing the warning.

Noted there is still a valid warning for nvme-mpath(From Yi) that is not
fixed yet.

Fixes: f5482ee5edb9 ("block: WARN if bdev inflight counter is negative")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#mae89155a5006463d0a21a4a2c35ae0034b26a339
Reported-and-tested-by: Calvin Owens <calvin@wbinvd.org>
Closes: https://lore.kernel.org/linux-block/aFtUXy-lct0WxY2w@mozart.vkv.me/T/#m1d935a00070bf95055d0ac84e6075158b08acaef
Reported-by: Dave Chinner <david@fromorbit.com>
Closes: https://lore.kernel.org/linux-block/aFuypjqCXo9-5_En@dread.disaster.area/
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20250626115743.1641443-1-yukuai3@huawei.com
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'nvme-6.16-2025-06-26' of git://git.infradead.org/nvme into block-6.16

Pull NVMe fixes from Christoph:

" - reset delayed remove_work after reconnect (Keith Busch)
  - fix atomic write size validation (Christoph Hellwig)"

* tag 'nvme-6.16-2025-06-26' of git://git.infradead.org/nvme:
  nvme: fix atomic write size validation
  nvme: refactor the atomic write unit detection
  nvme: reset delayed remove_work after reconnect

ublk: sanity check add_dev input for underflow

Add additional checks that queue depth and number of queues are
non-zero.

Signed-off-by: Ronnie Sahlberg <rsahlberg@whamcloud.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20250626022046.235018-1-ronniesahlberg@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>

drm/xe: Process deferred GGTT node removals on device unwind

While we are indirectly draining our dedicated workqueue ggtt->wq
that we use to complete asynchronous removal of some GGTT nodes,
this happends as part of the managed-drm unwinding (ggtt_fini_early),
which could be later then manage-device unwinding, where we could
already unmap our MMIO/GMS mapping (mmio_fini).

This was recently observed during unsuccessful VF initialization:

[ ] xe 0000:00:02.1: probe with driver xe failed with error -62
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747340 __xe_bo_unpin_map_no_vm (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747540 __xe_bo_unpin_map_no_vm (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747240 __xe_bo_unpin_map_no_vm (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747040 tiles_fini (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e746840 mmio_fini (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e747f40 xe_bo_pinned_fini (16 bytes)
[ ] xe 0000:00:02.1: DEVRES REL ffff88811e746b40 devm_drm_dev_init_release (16 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] drmres release begin
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef81640 __fini_relay (8 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80d40 guc_ct_fini (8 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80040 __drmm_mutex_release (8 bytes)
[ ] xe 0000:00:02.1: [drm:drm_managed_release] REL ffff88810ef80140 ggtt_fini_early (8 bytes)

and this was leading to:

[ ] BUG: unable to handle page fault for address: ffffc900058162a0
[ ] #PF: supervisor write access in kernel mode
[ ] #PF: error_code(0x0002) - not-present page
[ ] Oops: Oops: 0002 [#1] SMP NOPTI
[ ] Tainted: [W]=WARN
[ ] Workqueue: xe-ggtt-wq ggtt_node_remove_work_func [xe]
[ ] RIP: 0010:xe_ggtt_set_pte+0x6d/0x350 [xe]
[ ] Call Trace:
[ ]  <TASK>
[ ]  xe_ggtt_clear+0xb0/0x270 [xe]
[ ]  ggtt_node_remove+0xbb/0x120 [xe]
[ ]  ggtt_node_remove_work_func+0x30/0x50 [xe]
[ ]  process_one_work+0x22b/0x6f0
[ ]  worker_thread+0x1e8/0x3d

Add managed-device action that will explicitly drain the workqueue
with all pending node removals prior to releasing MMIO/GSM mapping.

Fixes: 919bb54e989c ("drm/xe: Fix missing runtime outer protection for ggtt_remove_node")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://lore.kernel.org/r/20250612220937.857-2-michal.wajdeczko@intel.com
(cherry picked from commit 89d2835c3680ab1938e22ad81b1c9f8c686bd391)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe/guc: Explicitly exit CT safe mode on unwind

During driver probe we might be briefly using CT safe mode, which
is based on a delayed work, but usually we are able to stop this
once we have IRQ fully operational.  However, if we abort the probe
quite early then during unwind we might try to destroy the workqueue
while there is still a pending delayed work that attempts to restart
itself which triggers a WARN.

This was recently observed during unsuccessful VF initialization:

[ ] xe 0000:00:02.1: probe with driver xe failed with error -62
[ ] ------------[ cut here ]------------
[ ] workqueue: cannot queue safe_mode_worker_func [xe] on wq xe-g2h-wq
[ ] WARNING: CPU: 9 PID: 0 at kernel/workqueue.c:2257 __queue_work+0x287/0x710
[ ] RIP: 0010:__queue_work+0x287/0x710
[ ] Call Trace:
[ ]  delayed_work_timer_fn+0x19/0x30
[ ]  call_timer_fn+0xa1/0x2a0

Exit the CT safe mode on unwind to avoid that warning.

Fixes: 09b286950f29 ("drm/xe/guc: Allow CTB G2H processing without G2H IRQ")
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Link: https://lore.kernel.org/r/20250612220937.857-3-michal.wajdeczko@intel.com
(cherry picked from commit 2ddbb73ec20b98e70a5200cb85deade22ccea2ec)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe: move DPT l2 flush to a more sensible place

Only need the flush for DPT host updates here. Normal GGTT updates don't
need special flush.

Fixes: 01570b446939 ("drm/xe/bmg: implement Wa_16023588340")
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Cc: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Link: https://lore.kernel.org/r/20250606104546.1996818-4-matthew.auld@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 35db1da40c8cfd7511dc42f342a133601eb45449)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

drm/xe: Move DSB l2 flush to a more sensible place

Flushing l2 is only needed after all data has been written.

Fixes: 01570b446939 ("drm/xe/bmg: implement Wa_16023588340")
Signed-off-by: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: stable@vger.kernel.org # v6.12+
Reviewed-by: Matthew Auld <matthew.auld@intel.com>
Signed-off-by: Matthew Auld <matthew.auld@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
Link: https://lore.kernel.org/r/20250606104546.1996818-3-matthew.auld@intel.com
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 0dd2dd0182bc444a62652e89d08c7f0e4fde15ba)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>

LoongArch: KVM: Avoid overflow with array index

The variable index is modified and reused as array index when modify
register EIOINTC_ENABLE. There will be array index overflow problem.

Cc: stable@vger.kernel.org
Fixes: 3956a52bc05b ("LoongArch: KVM: Add EIOINTC read and write functions")
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Handle KCOV __init vs inline mismatches

When the KCOV is enabled all functions get instrumented, unless
the __no_sanitize_coverage attribute is used. To prepare for
__no_sanitize_coverage being applied to __init functions, we have to
handle differences in how GCC's inline optimizations get resolved.
For LoongArch this exposed several places where __init annotations
were missing but ended up being "accidentally correct". So fix these
cases.

Signed-off-by: Kees Cook <kees@kernel.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Reserve the EFI memory map region

The EFI memory map at 'boot_memmap' is crucial for kdump to understand
the primary kernel's memory layout. This memory region, typically part
of EFI Boot Services (BS) data, can be overwritten after ExitBootServices
if not explicitly preserved by the kernel.

This commit addresses this by:
1. Calling memblock_reserve() to reserve the entire physical region
   occupied by the EFI memory map (header + descriptors). This prevents
   the primary kernel from reallocating and corrupting this area.
2. Setting the EFI_PRESERVE_BS_REGIONS flag in efi.flags. This indicates
   that efforts have been made to preserve critical BS code/data regions
   which can be useful for other kernel subsystems or debugging.

These changes ensure the original EFI memory map data remains intact,
improving kdump reliability and potentially aiding other EFI-related
functionalities that might rely on preserved BS code/data.

Signed-off-by: Ming Wang <wangming01@loongson.cn>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Fix build warnings about export.h

After commit a934a57a42f64a4 ("scripts/misc-check: check missing #include
<linux/export.h> when W=1") and 7d95680d64ac8e836c ("scripts/misc-check:
check unnecessary #include <linux/export.h> when W=1"), we get some build
warnings with W=1:

arch/loongarch/kernel/acpi.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/alternative.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/kfpu.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/traps.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/unwind_guess.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/unwind_orc.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/unwind_prologue.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/lib/crc32-loongarch.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/lib/csum.c: warning: EXPORT_SYMBOL() is used, but #include <linux/export.h> is missing
arch/loongarch/kernel/elf.c: warning: EXPORT_SYMBOL() is not used, but #include <linux/export.h> is present
arch/loongarch/kernel/paravirt.c: warning: EXPORT_SYMBOL() is not used, but #include <linux/export.h> is present
arch/loongarch/pci/pci.c: warning: EXPORT_SYMBOL() is not used, but #include <linux/export.h> is present

So fix these build warnings for LoongArch.

Reviewed-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

LoongArch: Replace __ASSEMBLY__ with __ASSEMBLER__ in headers

While the GCC and Clang compilers already define __ASSEMBLER__
automatically when compiling assembler code, __ASSEMBLY__ is a macro
that only gets defined by the Makefiles in the kernel. This is bad
since macros starting with two underscores are names that are reserved
by the C language. It can also be very confusing for the developers
when switching between userspace and kernelspace coding, or when
dealing with uapi headers that rather should use __ASSEMBLER__ instead.
So let's now standardize on the __ASSEMBLER__ macro that is provided
by the compilers.

This is almost a completely mechanical patch (done with a simple
"sed -i" statement), with one comment tweaked manually in the
arch/loongarch/include/asm/cpu.h file (it was missing the trailing
underscores).

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>

nvme: fix atomic write size validation

Don't mix the namespace and controller values, and validate the
per-controller limit when probing the controller. This avoid spurious
failures for controllers with namespaces that have different namespaces
with different logical block sizes, or report the per-namespace values
only for some namespaces.

It also fixes a missing queue_limits_cancel_update in an error path by
removing that error path.

Fixes: 8695f060a029 ("nvme: all namespaces in a subsystem must adhere to a common atomic write size")
Reported-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>
Tested-by: Yi Zhang <yi.zhang@redhat.com>

nvme: refactor the atomic write unit detection

Move all the code out of nvme_update_disk_info into the helper, and
rename the helper to have a somewhat less clumsy name.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: John Garry <john.g.garry@oracle.com>

nvme: reset delayed remove_work after reconnect

The remove_work will proceed with permanently disconnecting on the
initial final path failure if the head shows no paths after the delay.
If a new path connects while the remove_work is pending, and if that new
path happens to disconnect before that remove_work executes, the delayed
removal should reset based on the most recent path disconnect time, but
queue_delayed_work() won't do anything if the work is already pending.
Attempt to cancel the delayed work when a new path connects, and use
mod_delayed_work() in case the remove_work remains pending anyway.

Signed-off-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Nilay Shroff <nilay@linux.ibm.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

net: libwx: fix the creation of page_pool

'rx_ring->size' means the count of ring descriptors multiplied by the
size of one descriptor. When increasing the count of ring descriptors,
it may exceed the limit of pool size.

[ 864.209610] page_pool_create_percpu() gave up with errno -7
[ 864.209613] txgbe 0000:11:00.0: Page pool creation failed: -7

Fix to set the pool_size to the count of ring descriptors.

Fixes: 850b971110b2 ("net: libwx: Allocate Rx and Tx resources")
Cc: stable@vger.kernel.org
Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/434C72BFB40E350A+20250625023924.21821-1-jiawenwu@trustnetic.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

net: selftests: fix TCP packet checksum

The length in the pseudo header should be the length of the L3 payload
AKA the L4 header+payload. The selftest code builds the packet from
the lower layers up, so all the headers are pushed already when it
constructs L4. We need to subtract the lower layer headers from skb->len.

Fixes: 3e1e58d64c3d ("net: add generic selftest support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Gerhard Engleder <gerhard@engleder-embedded.com>
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Tested-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://patch.msgid.link/20250624183258.3377740-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>

perf/aux: Fix pending disable flow when the AUX ring buffer overruns

If an AUX event overruns, the event core layer intends to disable the
event by setting the 'pending_disable' flag. Unfortunately, the event
is not actually disabled afterwards.

In commit:

ca6c21327c6a ("perf: Fix missing SIGTRAPs")

the 'pending_disable' flag was changed to a boolean. However, the
AUX event code was not updated accordingly. The flag ends up holding a
CPU number. If this number is zero, the flag is taken as false and the
IRQ work is never triggered.

Later, with commit:

2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")

a new IRQ work 'pending_disable_irq' was introduced to handle event
disabling. The AUX event path was not updated to kick off the work queue.

To fix this bug, when an AUX ring buffer overrun is detected, call
perf_event_disable_inatomic() to initiate the pending disable flow.

Also update the outdated comment for setting the flag, to reflect the
boolean values (0 or 1).

Fixes: 2b84def990d3 ("perf: Split __perf_pending_irq() out of perf_pending_irq()")
Fixes: ca6c21327c6a ("perf: Fix missing SIGTRAPs")
Signed-off-by: Leo Yan <leo.yan@arm.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: James Clark <james.clark@linaro.org>
Reviewed-by: Yeoreum Yun <yeoreum.yun@arm.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Liang Kan <kan.liang@linux.intel.com>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: linux-perf-users@vger.kernel.org
Link: https://lore.kernel.org/r/20250625170737.2918295-1-leo.yan@arm.com