git.ipfire.org Git - thirdparty/kernel/stable.git/log

selftests/cgroup: avoid OOM in test_swapin_nozswap

test_swapin_nozswap can hit OOM before reaching its assertions on some
setups.  The test currently sets memory.max=8M and then allocates/reads
32M with memory.zswap.max=0, which may over-constrain reclaim and kill the
workload process.

Replace hardcoded sizes with PAGE_SIZE-based values:
  - control_allocation_size = PAGE_SIZE * 512
  - memory.max = control_allocation_size * 3 / 4
  - minimum expected swap = control_allocation_size / 4

This keeps the test pressure model intact (allocate/read beyond memory.max
to force swap-in/out) while making it more robust across different
environments.

The test intent is unchanged: confirm that swapping occurs while zswap remains
unused when memory.zswap.max=0.

=== Error Logs ===

  # ./test_zswap
  TAP version 13
  1..7
  ok 1 test_zswap_usage
  not ok 2 test_swapin_nozswap
  ...

  # dmesg
  [271641.879153] test_zswap invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
  [271641.879168] CPU: 1 UID: 0 PID: 177372 Comm: test_zswap Kdump: loaded Not tainted 6.12.0-211.el10.ppc64le #1 VOLUNTARY
  [271641.879171] Hardware name: IBM,9009-41A POWER9 (architected) 0x4e0202 0xf000005 of:IBM,FW940.02 (UL940_041) hv:phyp pSeries
  [271641.879173] Call Trace:
  [271641.879174] [c00000037540f730] [c00000000127ec44] dump_stack_lvl+0x88/0xc4 (unreliable)
  [271641.879184] [c00000037540f760] [c0000000005cc594] dump_header+0x5c/0x1e4
  [271641.879188] [c00000037540f7e0] [c0000000005cb464] oom_kill_process+0x324/0x3b0
  [271641.879192] [c00000037540f860] [c0000000005cbe48] out_of_memory+0x118/0x420
  [271641.879196] [c00000037540f8f0] [c00000000070d8ec] mem_cgroup_out_of_memory+0x18c/0x1b0
  [271641.879200] [c00000037540f990] [c000000000713888] try_charge_memcg+0x598/0x890
  [271641.879204] [c00000037540fa70] [c000000000713dbc] charge_memcg+0x5c/0x110
  [271641.879207] [c00000037540faa0] [c0000000007159f8] __mem_cgroup_charge+0x48/0x120
  [271641.879211] [c00000037540fae0] [c000000000641914] alloc_anon_folio+0x2b4/0x5a0
  [271641.879215] [c00000037540fb60] [c000000000641d58] do_anonymous_page+0x158/0x6b0
  [271641.879218] [c00000037540fbd0] [c000000000642f8c] __handle_mm_fault+0x4bc/0x910
  [271641.879221] [c00000037540fcf0] [c000000000643500] handle_mm_fault+0x120/0x3c0
  [271641.879224] [c00000037540fd40] [c00000000014bba0] ___do_page_fault+0x1c0/0x980
  [271641.879228] [c00000037540fdf0] [c00000000014c44c] hash__do_page_fault+0x2c/0xc0
  [271641.879232] [c00000037540fe20] [c0000000001565d8] do_hash_fault+0x128/0x1d0
  [271641.879236] [c00000037540fe50] [c000000000008be0] data_access_common_virt+0x210/0x220
  [271641.879548] Tasks state (memory values in pages):
  ...
  [271641.879550] [  pid  ]   uid  tgid total_vm      rss rss_anon rss_file rss_shmem pgtables_bytes swapents oom_score_adj name
  [271641.879555] [ 177372]     0 177372      571        0        0        0         0    51200       96             0 test_zswap
  [271641.879562] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/no_zswap_test,task_memcg=/no_zswap_test,task=test_zswap,pid=177372,uid=0
  [271641.879578] Memory cgroup out of memory: Killed process 177372 (test_zswap) total-vm:36544kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB, UID:0 pgtables:50kB oom_score_adj:0

Link: https://lore.kernel.org/20260424040059.12940-3-li.wang@linux.dev
Signed-off-by: Li Wang <li.wang@linux.dev>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/cgroup: skip test_zswap if zswap is globally disabled

Patch series "selftests/cgroup: improve zswap tests robustness and support
large page sizes", v7.

This patchset aims to fix various spurious failures and improve the
overall robustness of the cgroup zswap selftests.

The primary motivation is to make the tests compatible with architectures
that use non-4K page sizes (such as 64K on ppc64le and arm64).  Currently,
the tests rely heavily on hardcoded 4K page sizes and fixed memory limits.
On 64K page size systems, these hardcoded values lead to sub-page
granularity accesses, incorrect page count calculations, and insufficient
memory pressure to trigger zswap writeback, ultimately causing the tests
to fail.

Additionally, this series addresses OOM kills occurring in
test_swapin_nozswap by dynamically scaling memory limits, and prevents
spurious test failures when zswap is built into the kernel but globally
disabled.

This patch (of 8):

test_zswap currently only checks whether zswap is present by testing
/sys/module/zswap.  This misses the runtime global state exposed in
/sys/module/zswap/parameters/enabled.

When zswap is built/loaded but globally disabled, the zswap cgroup
selftests run in an invalid environment and may fail spuriously.

Check the runtime enabled state before running the tests:
  - skip if zswap is not configured,
  - fail if the enabled knob cannot be read,
  - skip if zswap is globally disabled.

Also print a hint in the skip message on how to enable zswap.

Link: https://lore.kernel.org/20260424040059.12940-1-li.wang@linux.dev
Link: https://lore.kernel.org/20260424040059.12940-2-li.wang@linux.dev
Signed-off-by: Li Wang <li.wang@linux.dev>
Acked-by: Yosry Ahmed <yosry@kernel.org>
Acked-by: Nhat Pham <nphamcs@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Jiayuan Chen <jiayuan.chen@linux.dev>
Cc: Waiman Long <longman@redhat.com>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/damon/sysfs.py: test failed region quota charge ratio

Extend sysfs.py DAMON selftest to setup DAMOS action failed region quota
charge ratio and assert the setup is made into DAMON internal state.

Link: https://lore.kernel.org/20260428013402.115171-12-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/damon/drgn_dump_damon_status: support failed region quota charge ratio

Extend drgn_dump_damon_status.py to dump DAMON internal state for DAMOS
action failed regions quota charge ratio, to be able to show if the
internal state for the feature is working, with future DAMON selftests.

Link: https://lore.kernel.org/20260428013402.115171-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/damon/_damon_sysfs: support failed region quota charge ratio

Extend _damon_sysfs.py for DAMOS action failed regions quota charge ratio
setup, so that we can add kselftest for the new feature.

Link: https://lore.kernel.org/20260428013402.115171-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/tests/core-kunit: test fail_charge_{num,denom} committing

Extend damos_test_commit_quotas() kunit test to ensure
damos_commit_quota() handles fail_charge_{num,denom} parameters.

Link: https://lore.kernel.org/20260428013402.115171-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Docs/ABI/damon: document fail_charge_{num,denom}

Update DAMON ABI document for the DAMOS action failed regions quota charge
ratio control sysfs files.

Link: https://lore.kernel.org/20260428013402.115171-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Docs/admin-guide/mm/damon/usage: document fail_charge_{num,denom} files

Update DAMON usage document for the DAMOS action failed regions quota
charge ratio control sysfs files.

Link: https://lore.kernel.org/20260428013402.115171-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Docs/mm/damon/design: document fail_charge_{num,denom}

Update DAMON design document for the DAMOS action failed region quota
charge ratio.

Link: https://lore.kernel.org/20260428013402.115171-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/sysfs-schemes: implement fail_charge_{num,denom} files

Implement the user-space ABI for the DAMOS action failed region
quota-charge ratio setup. For this, add two new sysfs files under the
DAMON sysfs interface for DAMOS quotas. Names of the files are
fail_charge_num and fail_charge_denom, and work for reading and setting
the numerator and denominator of the failed regions charge ratio.

Link: https://lore.kernel.org/20260428013402.115171-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/core: introduce failed region quota charge ratio

DAMOS quota is charged to all DAMOS action application attempted memory,
regardless of how much of the memory the action was successful and failed.
This makes understanding quota behavior without DAMOS stat but only with
end level metrics (e.g., increased amount of free memory for DAMOS_PAGEOUT
action) difficult. Also, charging action-failed memory same as
action-successful memory is somewhat unfair, as successful action
application will induce more overhead in most cases.

Introduce DAMON core API for setting the charge ratio for such
action-failed memory. It allows API callers to specify the ratio in a
flexible way, by setting the numerator and the denominator.

Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/core: merge regions after applying DAMOS schemes

damos_apply_scheme() could split the given region if applying the scheme's
action to the entire region can result in violating the quota-set upper
limit.  Keeping regions that are created by such split operations is
unnecessary overhead.

The overhead would be negligible in the common case because such split
operations could happen only up to the number of installed schemes per
scheme apply interval.  The following commit could make the impact larger,
though.  The following commit will allow the action-failed region to be
charged in a different ratio.  If both the ratio and the remaining quota
is quite small while the region to apply the scheme is quite large and the
action is nearly always failing, a high number of split operations could
happen.

Remove the unnecessary overhead by merging regions after applying schemes
is done for each region.  The merge operation is made only if it will not
lose monitoring information and keep min_nr_regions constraint.  In the
worst case, the max_nr_regions could still be violated until the next
per-aggregation interval merge operation is made.

Link: https://lore.kernel.org/20260428013402.115171-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/core: handle <min_region_sz remaining quota as empty

Patch series "mm/damon: introduce DAMOS failed region quota charge ratio".

Let users set different DAMOS quota charge ratios for DAMOS action failed
regions, for deterministic and consistent DAMOS action progress.

Common Reports: Unexpectedly Slow DAMOS
=======================================

One common issue report that we get from DAMON users is that DAMOS action
applying progress speed is sometimes much slower than expected.  And one
common root cause is that the DAMOS quota is exceeded by the action
applying failed memory regions.

For example, a group of users tried to run DAMOS-based proactive memory
reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota.  They ran
it on a system having no active workload which means all memory of the
system is cold.  The expectation was that the system will show 100 MiB per
second reclamation until (nearly) all memory is reclaimed.  But what they
found is that the speed is quite inconsistent and sometimes it becomes
very slower than the expectation, sometimes even no reclamation at all for
about tens of seconds.  The upper limit of the speed (100 MiB per second)
was being kept as expected, though.

By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS
stat, we found DAMOS quota is always exceeded when the speed is slow.  By
monitoring sz_tried and sz_applied (the total amount of DAMOS action tried
memory and succeeded memory) DAMOS stats together, we found the
reclamation attempts nearly always failed when the speed is slow.

DAMOS quota charges DAMOS action tried regions regardless of the
successfulness of the try.  Hence in the example reported case, there was
unreclaimable memory spread around the system memory.  Sometimes nearly
100 MiB of memory that DAMOS tried to reclaim in the given quota interval
was reclaimable, and therefore showed nearly 100 MiB per second speed.
Sometimes nearly 99 MiB of memory that DAMOS was trying to reclaim in the
given quota interval was unreclaimable, and therefore showing only about 1
MiB per second reclaim speed.

We explained it is an expected behavior of the feature rather than a bug,
as DAMOS quota is there for only the upper-limit of the speed.  The users
agreed and later reported a huge win from the adoption of DAMON_RECLAIM on
their products.

It is Not a Bug but a Feature; But...
=====================================

So nothing is broken.  DAMOS quota is working as intended, as the upper
limit of the speed.  It also provides its behavior observability via DAMOS
stat.  In the real world production environment that runs long term active
workloads and matters stability, the speed sometimes being slow is not a
real problem.

But, the non-deterministic behavior is sometimes annoying, especially in
lab environments.  Even in a realistic production environment, when there
is a huge amount of DAMOS action unapplicable memory, the speed could be
problematically slow.  Let's suppose a virtual machines provider that
setup 99% of the host memory as hugetlb pages that cannot be reclaimed, to
give it to virtual machines.  Also, when aim-oriented DAMOS auto-tuning is
applied, this could also make the internal feedback loop confused.

The intention of the current behavior was that trying DAMOS action to
regions would anyway impose some overhead, and therefore somehow be
charged.  But in the real world, the overhead for failed action is much
lighter than successful action.  Charging those at the same ratio may be
unfair, or at least suboptimum in some environments.

DAMOS Action Failed Region Quota Charge Ratio
=============================================

Let users set the charge ratio for the action-failed memory, for more
optimal and deterministic use of DAMOS.  It allows users to specify the
numerator and the denominator of the ratio for flexible setup.  For
example, let's suppose the numerator and the denominator are set to 1 and
4,096, respectively.  The ratio is 1 / 4,096.  A DAMOS scheme action is
applied to 5 GiB memory.  For 1 GiB of the memory, the action is
succeeded.  For the rest (4 GiB), the action is failed.  Then, only 1 GiB
and 1 MiB quota is charged.

The optimal charge ratio will depend on the use case and system/workload.
I'd recommend starting from setting the nominator as 1 and the denominator
as PAGE_SIZE and tune based on the results, because many DAMOS actions are
applied at page level.

Tests
=====

I tested this feature in the steps below.

1. Allocate 50% of system memory and mlock() it using a test program.
2. Fill up the page cache to exhaust nearly all free memory.
3. Start DAMON-based proactive reclamation with 100 MiB/second DAMOS
   hard-quota.  Auto-tune the DAMOS soft-quota under the hard-quota for
   achieving 40% free memory of the system with 'temporal' tuner.

For step 1, I run a simple C program that is written by Gemini.  It is
quite straightforward, so I'm not sharing the code here.

For step 2, I use dd command like below:

   dd if=/dev/zero of=foo bs=1M count=$50_percent_of_system_memory

For step 3, I use the latest version of DAMON user-space tool (damo) like
below.

    sudo damo start --damos_action pageout \
            ` # Do the pageout only up to 100 MiB per second ` \
            --damos_quota_space 100M --damos_quota_interval 1s \
            ` # Auto-tune the quota below the hard quota aiming` \
            ` # 40% free memory of the node 0 ` \
            ` # (entire node of the test system)` \
            --damos_quota_goal node_mem_free_bp 40% 0 \
            ` # use temporal tuner, which is easy to understnd ` \
            --damos_quota_goal_tuner temporal

As expected, the progress of the reclamation is not consistent, because
the quota is exceeded for the failed reclamation of the unreclaimable
memory.

I do this again, but with the failed region charge ratio feature.  For
this, the above 'damo' command is used, after appending command line
option for setup of the charge ratio like below.  Note that the option was
added to 'damo' after v3.1.9.

    sudo ./damo start --damos_action pageout \
            [...]
            ` # quota-charge only 1/4096 for pageout-failed regions ` \
            --damos_quota_fail_charge_ratio 1 4096

The progress of the reclamation was nearly 100 MiB per second until the
goal was achieved, meeting the expectation.

Patches Sequence
================

The first two patches make preparational changes.  Patch 1 updates fully
charged quota check to handle <min_region_sz remaining quota, which will
be able to exist after this series is applied.  Patch 2 merges regions
after applying schemes is done as long as it is ok to do, since regions
split operations for quota could happen much more frequently under a
corner case that this series will make available.

Patch 3 implements the feature and exposes it via DAMON core API.  Patch 4
implements DAMON sysfs ABI for the feature.  Three following patches (5-7)
document the feature and ABI on design, usage, and ABI documents,
respectively.  Four patches for testing of the new feature follow.  Patch
8 implements a kunit test for the feature.  Patches 9 and 10 extend DAMON
selftest helpers for DAMON sysfs control and internal state dumping for
adding a new selftest for the feature.  Patch 11 extends existing DAMON
sysfs interface selftest to test the new feature using the extended helper
scripts.

This patch (of 11):

Less than min_region_sz remaining quota effectively means the quota is
fully charged.  In other words, no remaining quota.  This is because DAMOS
actions are applied in the region granularity, and each region should have
min_region_sz or larger size.  However the existing fully charged quota
check, which is also used for setting charge_target_from and
charge_addr_from of the quota, is not aware of the case.  For the reason,
charge_target_from and charge_addr_from of the quota will not be updated
in the case.  This can result in DAMOS action being applied more
frequently to a specific area of the memory.

The case is unreal because quota charging is also made in the region
granularity.  It could be changed in future, though.  Actually, the
following commit will make the change, by allowing users to set arbitrary
quota charging ratio for action-failed regions.  To be prepared for the
change, update the fully charged quota checks to treat having less than
min_region_sz remaining quota as fully charged.

Link: https://lore.kernel.org/20260428013402.115171-1-sj@kernel.org
Link: https://lore.kernel.org/20260428013402.115171-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon: add node_eligible_mem_bp goal metric

Background and Motivation
=========================

In heterogeneous memory systems, controlling memory distribution across
NUMA nodes is essential for performance optimization.  This patch enables
system-wide page distribution with target-state goals such as "maintain
60% of scheme-eligible memory on DRAM" using PA-mode DAMON schemes.

Rather than using absolute thresholds, this metric tracks the ratio of
memory that matches each scheme's access pattern filters on a target node,
enabling the quota system to automatically adjust migration aggressiveness
to maintain the desired distribution.

What This Metric Measures
=========================

node_eligible_mem_bp:
    scheme_eligible_bytes_on_node / total_scheme_eligible_bytes * 10000

Two-Scheme Setup for Hot Page Distribution
==========================================

For maintaining 60% of hot memory on DRAM (node 0) and 40% on CXL
(node 1):

    PULL scheme: migrate_hot to node 0
      goal: node_eligible_mem_bp, nid=0, target=6000
      addr filter: node 1 address range (only migrate FROM CXL)
      "Move hot pages to DRAM if less than 60% of hot data is in DRAM"

    PUSH scheme: migrate_hot to node 1
      goal: node_eligible_mem_bp, nid=1, target=4000
      addr filter: node 0 address range (only migrate FROM DRAM)
      "Move hot pages to CXL if less than 40% of hot data is in CXL"

Each scheme independently measures its own eligible memory and adjusts its
quota to achieve its target ratio.  The schemes work in concert through
DAMON's unified monitoring context, with the quota autotuner balancing
their relative aggressiveness.

Implementation Details
======================

The implementation adds a new quota goal metric type
DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP to the existing DAMOS quota goal
framework.  When this metric is configured for a scheme:

1. During each quota adjustment cycle, damos_get_node_eligible_mem_bp()
   is called to calculate the current memory distribution.

2. The function iterates through all regions that match the scheme's
   access pattern (via __damos_valid_target()) and calculates:
   - Total eligible bytes across all nodes
   - Eligible bytes specifically on the target node (goal->nid)

3. For each eligible region, damos_calc_eligible_bytes() walks through
   the physical address range, using damon_get_folio() to look up
   each folio and determine its NUMA node via folio_nid().

4. Large folios are handled by calculating the exact overlap between
   the region boundaries and folio boundaries, ensuring accurate
   byte counts even when regions partially span folios.

5. The ratio (node_eligible / total_eligible * 10000) is returned
   as basis points, which the quota autotuner uses to adjust the
   scheme's effective quota size (esz).

The implementation requires CONFIG_DAMON_PADDR since damon_get_folio()
is only available for physical address space monitoring.

Testing Results
===============

Functionally tested on a two-node heterogeneous memory system with DRAM
(node 0) and CXL memory (node 1).  A PUSH+PULL scheme configuration using
migrate_hot actions was used to reach a target hot memory ratio between
the two tiers.

With the TEMPORAL tuner, the system converges quickly to the target
distribution.  The tuner drives esz to maximum when under goal and to zero
once the goal is met, forming a simple on/off feedback loop that
stabilizes at the desired ratio.

With the CONSIST tuner, the scheme still converges but more slowly, as it
migrates and then throttles itself based on quota feedback.  The time to
reach the goal varies depending on workload intensity.

Note: This metric works with both TEMPORAL and CONSIST goal tuners.

Link: https://lore.kernel.org/20260428030520.701-1-ravis.opensrc@gmail.com
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com>
Suggested-by: SeongJae Park <sj@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Honggyu Kim <honggyu.kim@sk.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Yunjeong Mun <yunjeong.mun@sk.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/damon/core: make charge_addr_from aware of end-address exclusivity

DAMON region end address is exclusive one, but charge_addr_from is
assigned assuming the end address is inclusive.  As a result, DAMOS action
to next up to min_region_sz memory can be skipped.  This is quite
negligible user impact.  But, the bug is a bug that can be very simply
fixed.  Fix the wrong assignment to respect the exclusiveness of the
address.

The issue was discovered [1] by Sashiko.

Link: https://lore.kernel.org/20260428042942.118230-1-sj@kernel.org
Link: https://lore.kernel.org/20260428032324.115663-1-sj@kernel.org
Fixes: 50585192bc2e ("mm/damon/schemes: skip already charged targets and regions")
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: <stable@vger.kernel.org> # 5.16.x
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory: update stale locking comments for fault handlers

Update the comments for wp_page_copy(), do_wp_page(), do_swap_page(),
do_anonymous_page(), __do_fault(), do_fault(), handle_pte_fault(),
__handle_mm_fault(), and handle_mm_fault() to concisely clarify that they
can be entered holding either the mmap_lock or the VMA lock, and that the
lock may be released upon returning VM_FAULT_RETRY.

Additionally, make the following corrections:
- In do_anonymous_page(), correct the outdated claim that the function
  is entered with the PTE "mapped but not yet locked". Since
  handle_pte_fault() unmaps the empty PTE before routing to
  do_pte_missing(), the comment now correctly states it is entered
  with the PTE unmapped and unlocked.
- In __do_fault(), update the stale reference from __lock_page_retry()
  to __folio_lock_or_retry().

Link: https://lore.kernel.org/20260424092217.263648-1-adi.sharma@zohomail.in
Signed-off-by: Aditya Sharma <adi.sharma@zohomail.in>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/gup: cleanup pgtable entry accessors

PMD and PUD entries revalidation has the same semantics as PTE entry
revalidation. Convert the remaining direct entry dereferences to the
corresponding accessors.

The PTE validation in gup_fast_pte_range() is inconsistent with the prior
value acquisition in the sense that it drops the lockless access
semantics.

Use the lockless accessor not only for the PTE, but also for the PMD
validation, which is likewise inconsistent with the prior value
acquisition in gup_fast_pmd_range().

Link: https://lore.kernel.org/20260421051754.1691221-1-agordeev@linux.ibm.com
Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: optimize __free_contig_frozen_range()

Apply the same batch-freeing optimization from free_contig_range() to the
frozen page path.  The previous __free_contig_frozen_range() freed each
order-0 page individually via free_frozen_pages(), which is slow for the
same reason the old free_contig_range() was: each page goes to the order-0
pcp list rather than being coalesced into higher-order blocks.

Rewrite __free_contig_frozen_range() to call free_pages_prepare() for each
order-0 page, then batch the prepared pages into the largest possible
power-of-2 aligned chunks via free_prepared_contig_range().  If
free_pages_prepare() fails (e.g.  HWPoison, bad page) the page is
deliberately not freed; it should not be returned to the allocator.

I've tested CMA through debugfs.  The test allocates 16384 pages per
allocation for several iterations.  There is 3.5x improvement.

Before: 1406 usec per iteration
After:   402 usec per iteration

Before:

    70.89%     0.69%  cma              [kernel.kallsyms]      [.] free_contig_frozen_range
            |
            |--70.20%--free_contig_frozen_range
            |          |
            |          |--46.41%--__free_frozen_pages
            |          |          |
            |          |           --36.18%--free_frozen_page_commit
            |          |                     |
            |          |                      --29.63%--_raw_spin_unlock_irqrestore
            |          |
            |          |--8.76%--_raw_spin_trylock
            |          |
            |          |--7.03%--__preempt_count_dec_and_test
            |          |
            |          |--4.57%--_raw_spin_unlock
            |          |
            |          |--1.96%--__get_pfnblock_flags_mask.isra.0
            |          |
            |           --1.15%--free_frozen_page_commit
            |
             --0.69%--el0t_64_sync

After:

    23.57%     0.00%  cma              [kernel.kallsyms]      [.] free_contig_frozen_range
            |
            ---free_contig_frozen_range
               |
               |--20.45%--__free_contig_frozen_range
               |          |
               |          |--17.77%--free_pages_prepare
               |          |
               |           --0.72%--free_prepared_contig_range
               |                     |
               |                      --0.55%--__free_frozen_pages
               |
                --3.12%--free_pages_prepare

Link: https://lore.kernel.org/20260401101634.2868165-4-usama.anjum@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Suggested-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

vmalloc: optimize vfree with free_pages_bulk()

Whenever vmalloc allocates high order pages (e.g.  for a huge mapping) it
must immediately split_page() to order-0 so that it remains compatible
with users that want to access the underlying struct page.  Commit
a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") recently made it much more likely for vmalloc to allocate high
order pages which are subsequently split to order-0.

Unfortunately this had the side effect of causing performance regressions
for tight vmalloc/vfree loops (e.g.  test_vmalloc.ko benchmarks).  See
Closes: tag. This happens because the high order pages must be gotten
from the buddy but then because they are split to order-0, when they are
freed they are freed to the order-0 pcp.  Previously allocation was for
order-0 pages so they were recycled from the pcp.

It would be preferable if when vmalloc allocates an (e.g.) order-3 page
that it also frees that order-3 page to the order-3 pcp, then the
regression could be removed.

So let's do exactly that; update stats separately first as coalescing is
hard to do correctly without complexity.  Use free_pages_bulk() which uses
the new __free_contig_range() API to batch-free contiguous ranges of pfns.
This not only removes the regression, but significantly improves
performance of vfree beyond the baseline.

A selection of test_vmalloc benchmarks running on arm64 server class
system.  mm-new is the baseline.  Commit a06157804399 ("mm/vmalloc:
request large order pages from buddy allocator") was added in v6.19-rc1
where we see regressions.  Then with this change performance is much
better.  (>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):

+-----------------+----------------------------------------------------------+-------------------+--------------------+
| Benchmark       | Result Class                                             |   mm-new          |  this series       |
+=================+==========================================================+===================+====================+
| micromm/vmalloc | fix_align_alloc_test: p:1, h:0, l:500000 (usec)          |        1331843.33 |         (I) 67.17% |
|                 | fix_size_alloc_test: p:1, h:0, l:500000 (usec)           |         415907.33 |             -5.14% |
|                 | fix_size_alloc_test: p:4, h:0, l:500000 (usec)           |         755448.00 |         (I) 53.55% |
|                 | fix_size_alloc_test: p:16, h:0, l:500000 (usec)          |        1591331.33 |         (I) 57.26% |
|                 | fix_size_alloc_test: p:16, h:1, l:500000 (usec)          |        1594345.67 |         (I) 68.46% |
|                 | fix_size_alloc_test: p:64, h:0, l:100000 (usec)          |        1071826.00 |         (I) 79.27% |
|                 | fix_size_alloc_test: p:64, h:1, l:100000 (usec)          |        1018385.00 |         (I) 84.17% |
|                 | fix_size_alloc_test: p:256, h:0, l:100000 (usec)         |        3970899.67 |         (I) 77.01% |
|                 | fix_size_alloc_test: p:256, h:1, l:100000 (usec)         |        3821788.67 |         (I) 89.44% |
|                 | fix_size_alloc_test: p:512, h:0, l:100000 (usec)         |        7795968.00 |         (I) 82.67% |
|                 | fix_size_alloc_test: p:512, h:1, l:100000 (usec)         |        6530169.67 |        (I) 118.09% |
|                 | full_fit_alloc_test: p:1, h:0, l:500000 (usec)           |         626808.33 |             -0.98% |
|                 | kvfree_rcu_1_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         532145.67 |             -1.68% |
|                 | kvfree_rcu_2_arg_vmalloc_test: p:1, h:0, l:500000 (usec) |         537032.67 |             -0.96% |
|                 | long_busy_list_alloc_test: p:1, h:0, l:500000 (usec)     |        8805069.00 |         (I) 74.58% |
|                 | pcpu_alloc_test: p:1, h:0, l:500000 (usec)               |         500824.67 |              4.35% |
|                 | random_size_align_alloc_test: p:1, h:0, l:500000 (usec)  |        1637554.67 |         (I) 76.99% |
|                 | random_size_alloc_test: p:1, h:0, l:500000 (usec)        |        4556288.67 |         (I) 72.23% |
|                 | vm_map_ram_test: p:1, h:0, l:500000 (usec)               |         107371.00 |             -0.70% |
+-----------------+----------------------------------------------------------+-------------------+--------------------+

Link: https://lore.kernel.org/20260401101634.2868165-3-usama.anjum@arm.com
Fixes: a06157804399 ("mm/vmalloc: request large order pages from buddy allocator")
Closes: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: optimize free_contig_range()

Patch series "mm: Free contiguous order-0 pages efficiently", v6.

A recent change to vmalloc caused some performance benchmark regressions
(see [1]).  I'm attempting to fix that (and at the same time significantly
improve beyond the baseline) by freeing a contiguous set of order-0 pages
as a batch.

At the same time I observed that free_contig_range() was essentially doing
the same thing as vfree() so I've fixed it there too.  While at it,
optimize the __free_contig_frozen_range() as well.

Check that the contiguous range falls in the same section.  If they aren't
enabled, the if conditions get optimized out by the compiler as
memdesc_section() returns 0.  See num_pages_contiguous() for more details
about it.

This patch (of 3):

Decompose the range of order-0 pages to be freed into the set of largest
possible power-of-2 size and aligned chunks and free them to the pcp or
buddy.  This improves on the previous approach which freed each order-0
page individually in a loop.  Testing shows performance to be improved by
more than 10x in some cases.

Since each page is order-0, we must decrement each page's reference count
individually and only consider the page for freeing as part of a high
order chunk if the reference count goes to zero.  Additionally
free_pages_prepare() must be called for each individual order-0 page too,
so that the struct page state and global accounting state can be
appropriately managed.  But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.

This significantly speeds up the free operation but also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.

vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with contiguous
order-0 pages.  These can now be freed much more efficiently.

The execution time of the following function was measured in a server
class arm64 machine:

static int page_alloc_high_order_test(void)
{
unsigned int order = HPAGE_PMD_ORDER;
struct page *page;
int i;

for (i = 0; i < 100000; i++) {
page = alloc_pages(GFP_KERNEL, order);
if (!page)
return -1;
split_page(page, order);
free_contig_range(page_to_pfn(page), 1UL << order);
}

return 0;
}

Execution time before: 4097358 usec
Execution time after:   729831 usec

Perf trace before:

    99.63%     0.00%  kthreadd         [kernel.kallsyms]      [.] kthread
            |
            ---kthread
               0xffffb33c12a26af8
               |
               |--98.13%--0xffffb33c12a26060
               |          |
               |          |--97.37%--free_contig_range
               |          |          |
               |          |          |--94.93%--___free_pages
               |          |          |          |
               |          |          |          |--55.42%--__free_frozen_pages
               |          |          |          |          |
               |          |          |          |           --43.20%--free_frozen_page_commit
               |          |          |          |                     |
               |          |          |          |                      --35.37%--_raw_spin_unlock_irqrestore
               |          |          |          |
               |          |          |          |--11.53%--_raw_spin_trylock
               |          |          |          |
               |          |          |          |--8.19%--__preempt_count_dec_and_test
               |          |          |          |
               |          |          |          |--5.64%--_raw_spin_unlock
               |          |          |          |
               |          |          |          |--2.37%--__get_pfnblock_flags_mask.isra.0
               |          |          |          |
               |          |          |           --1.07%--free_frozen_page_commit
               |          |          |
               |          |           --1.54%--__free_frozen_pages
               |          |
               |           --0.77%--___free_pages
               |
                --0.98%--0xffffb33c12a26078
                          alloc_pages_noprof

Perf trace after:

     8.42%     2.90%  kthreadd         [kernel.kallsyms]         [k] __free_contig_range
            |
            |--5.52%--__free_contig_range
            |          |
            |          |--5.00%--free_prepared_contig_range
            |          |          |
            |          |          |--1.43%--__free_frozen_pages
            |          |          |          |
            |          |          |           --0.51%--free_frozen_page_commit
            |          |          |
            |          |          |--1.08%--_raw_spin_trylock
            |          |          |
            |          |           --0.89%--_raw_spin_unlock
            |          |
            |           --0.52%--free_pages_prepare
            |
             --2.90%--ret_from_fork
                       kthread
                       0xffffae1c12abeaf8
                       0xffffae1c12abe7a0
                       |
                        --2.69%--vfree
                                  __free_contig_range

Link: https://lore.kernel.org/20260401101634.2868165-1-usama.anjum@arm.com
Link: https://lore.kernel.org/20260401101634.2868165-2-usama.anjum@arm.com
Link: https://lore.kernel.org/all/66919a28-bc81-49c9-b68f-dd7c73395a0d@arm.com
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Sterba <dsterba@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/vmscan: add balance_pgdat begin/end tracepoints

Vmscan has six main reclaim entry points: try_to_free_pages() for
direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
for node reclaim, shrink_all_memory() for hibernation reclaim, and
balance_pgdat() for kswapd reclaim.

All of them, except for shrink_all_memory() and balance_pgdat(),
already have begin/end tracepoints.  This makes it harder to trace
which reclaim path is responsible for memory reclaim activity, because
kswapd reclaim cannot be identified as cleanly as other reclaim entry
points, even though it is the main background reclaim path under memory
pressure.  There may be no need to trace shrink_all_memory() as it is
primarily used during hibernation.  So this patch adds the missing
tracepoint pair for balance_pgdat().

The begin tracepoint records the node id, requested reclaim order, and
the requested classzone bound (highest_zoneidx).  The end tracepoint
records the node id, the reclaim order that balance_pgdat() finished
with, the requested classzone bound, and nr_reclaimed.  Together, they
show the requested reclaim order and classzone bound, whether reclaim
fell back to a lower order, and how much reclaim work was done.

The end tracepoint also records highest_zoneidx even though it does not
change within a balance_pgdat() invocation.  This keeps the end event
self-contained, so users can analyze reclaim results directly from end
events without depending on begin/end correlation, which is less
convenient when tracing is filtered or records are dropped.  It also
makes it straightforward to relate nr_reclaimed and the final reclaim
order to the requested classzone bound.

Link: https://lore.kernel.org/20260424031418.174597-1-b.suvonov@sjtu.edu.cn
Link: https://lore.kernel.org/20260423103753.546582-1-b.suvonov@sjtu.edu.cn
Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Wei Xu <weixugc@google.com>
Cc: Yuanchu Xie <yuanchu@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm: convert vmemmap_p?d_populate() to static functions

Since the vmemmap_p?d_populate functions are unused outside the mm
subsystem, we can remove their external declarations and convert them to
static functions.

Link: https://lore.kernel.org/20260423101441.7089-1-kaitao.cheng@linux.dev
Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn>
Acked-by: David Hildenbrand (arm) <david@kernel.org>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/huge_memory: fix outdated comment about freeing subpages in __folio_split

The comment appears to be outdated. add_to_swap() no longer exists,
and the explanation of why we need to call put_page() after splitting
could be made more general.

Link: https://lore.kernel.org/20260423034917.8234-1-baohua@kernel.org
Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Youngjun Park <youngjun.park@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

Revert "tmpfs: don't enable large folios if not supported"

This reverts commit 5a90c155defa684f3a21f68c3f8e40c056e6114c.

Currently, when shmem mounts are initialized, they only use 'sbinfo->huge'
to determine whether the shmem mount supports large folios. However, for
anonymous shmem, whether it supports large folios can be dynamically
configured via sysfs interfaces, so setting or not setting
mapping_set_large_folios() during initialization cannot accurately reflect
whether anonymous shmem actually supports large folios, which has already
caused some confusion[1].

Moreover, for tmpfs mounts, relying on 'sbinfo->huge' cannot keep the
mapping_set_large_folios() setting consistent across all mappings in the
entire tmpfs mount. In other words, under the same tmpfs mount, after
remount, we might end up with some mappings supporting large folios
(calling mapping_set_large_folios()) while others don't.

After some investigation, I found that the write performance regression
addressed by commit 5a90c155defa has already been fixed by the following
commit 665575cff098b ("filemap: move prefaulting out of hot write path").
See the following test data:

Base:
dd if=/dev/zero of=/mnt/tmpfs/test bs=400K count=10485 (3.2 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=800K count=5242 (3.2 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=1600K count=2621 (3.1 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=2200K count=1906 (3.0 GB/s )
dd if=/dev/zero of=/mnt/tmpfs/test bs=3000K count=1398 (3.0 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=4500K count=932 (3.1 GB/s)

Base + revert 5a90c155defa:
dd if=/dev/zero of=/mnt/tmpfs/test bs=400K count=10485 (3.3 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=800K count=5242 (3.3 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=1600K count=2621 (3.2 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=2200K count=1906 (3.1 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/testbs=3000K count=1398 (3.0 GB/s)
dd if=/dev/zero of=/mnt/tmpfs/test bs=4500K count=932 (3.1 GB/s)

The data is basically consistent with minor fluctuation noise. So we can now
safely revert commit 5a90c155defa to set mapping_set_large_folios() for all
shmem mounts unconditionally.

Link: https://lore.kernel.org/b2c7deee259a94b0d00a7c320d8d24d2c421f761.1776908112.git.baolin.wang@linux.alibaba.com
Link: https://lore.kernel.org/all/ec927492-4577-4192-8fad-85eb1bb43121@linux.alibaba.com/
Link: https://lore.kernel.org/all/116df9f9-4db7-40d4-a4a4-30a87c0feffa@linux.alibaba.com/
Fixes: 5a90c155defa ("tmpfs: don't enable large folios if not supported")
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/page_alloc: replace kernel_init_pages() with batch page clearing

When init_on_alloc is enabled, kernel_init_pages() clears every page one
at a time via clear_highpage_kasan_tagged(), which incurs per-page
kmap_local_page()/kunmap_local() overhead and prevents the architecture
clearing primitive from operating on contiguous ranges.

Introduce clear_highpages_kasan_tagged() as a static batch clearing helper
in page_alloc.c that calls clear_pages() for the full contiguous range on
!HIGHMEM systems, bypassing the per-page kmap overhead and allowing a
single invocation of the arch clearing primitive across the entire
allocation.  The HIGHMEM path falls back to per-page clearing since those
pages require kmap.

Replace kernel_init_pages() with direct calls to the new helper, as it
becomes a trivial wrapper.

Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:

  Before: 0.445s
  After:  0.166s  (-62.7%, 2.68x faster)

Kernel time (sys) reduction per workload with init_on_alloc=1:

  Workload            Before       After       Change
  Graph500 64C128T    30m 41.8s    15m 14.8s   -50.3%
  Graph500 16C32T     15m 56.7s     9m 43.7s   -39.0%
  Pagerank 32T         1m 58.5s     1m 12.8s   -38.5%
  Pagerank 128T        2m 36.3s     1m 40.4s   -35.7%

[hsalunke@amd.com: move clear_highpages_kasan_tagged() to page_alloc.c]
Link: https://lore.kernel.org/20260504063942.553438-1-hsalunke@amd.com
Link: https://lore.kernel.org/20260422102729.166599-1-hsalunke@amd.com
Signed-off-by: Hrushikesh Salunke <hsalunke@amd.com>
Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Acked-by: Pankaj Gupta <pankaj.gupta@amd.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Lorenzo Stoakes <ljs@kernel.org>
Cc: Ankur Arora <ankur.a.arora@oracle.com>
Cc: Bharata B Rao <bharata@amd.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: suppress compiler error in liburing check

When building the mm selftests on a system without liburing development
headers, check_config.sh leaks a raw compiler error:

  /tmp/tmp.kIIOIqwe3n.c:2:10: fatal error: liburing.h: No such file or directory
      2 | #include <liburing.h>
        |          ^~~~~~~~~~~~

Since this is an expected failure during the configuration probe,
redirect the compiler output to /dev/null to hide it.

And the build system prints a clear warning when this occurs:

  Warning: missing liburing support. Some tests will be skipped.

Because the user is properly notified about the missing dependency, the
raw compiler error is redundant and only confuse users.

Additionally, update the Makefile to use $(Q) and $(call msg,...) for the
check_config.sh execution.  This aligns the probe with standard kbuild
output formatting, providing a clean "CHK" message instead of printing the
raw command during the build.

Link: https://lore.kernel.org/20260422080446.26020-3-wangli.ahau@gmail.com
Signed-off-by: Li Wang <wangli.ahau@gmail.com>
Tested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

selftests/mm: respect build verbosity settings for 32/64-bit targets

Patch series "selftests/mm: clean up build output and verbosity", v3.

Currently, the build process for the mm selftests is unnecessarily noisy.

First, it leaks raw compiler errors during the liburing feature probe if
the headers are missing, which is confusing since the build system already
handles this gracefully with a clear warning.

Second, the specific 32-bit and 64-bit compilation targets ignore the
standard kbuild verbosity settings, always printing their full compiler
commands even during a default quiet build.

This patch (of 2):

The 32-bit and 64-bit compilation rules invoke $(CC) directly, bypassing
the $(Q) quiet prefix and $(call msg,...) helper used by the rest of the
selftests build system.  This causes these rules to always print the full
compiler command line, even when V=0 (the default).

Wrap the commands with $(Q) and $(call msg,CC,,$@) to match the convention
used by lib.mk, so that quiet and verbose builds behave consistently
across all targets.

==== Build logs ====
  ...
  CC       merge
  CC       rmap
  CC       soft-dirty
  gcc -Wall -O2 -I /usr/src/25/tools/testing/selftests/../../..
                -isystem /usr/src/25/tools/testing/selftests/../../../usr/include
                -isystem /usr/src/25/tools/testing/selftests/../../../tools/include/uapi
                -Wunreachable-code -U_FORTIFY_SOURCE -no-pie -D_GNU_SOURCE=
                -I/usr/src/25/tools/testing/selftests/../../../tools/testing/selftests
                -m32 -mxsave  protection_keys.c vm_util.c thp_settings.c pkey_util.c
                -lrt -lpthread -lm -lrt -ldl -lm
                -o /usr/src/25/tools/testing/selftests/mm/protection_keys_32
  gcc -Wall -O2 -I /usr/src/25/tools/testing/selftests/../../..
                -isystem /usr/src/25/tools/testing/selftests/../../../usr/include
                -isystem /usr/src/25/tools/testing/selftests/../../../tools/include/uapi
                -Wunreachable-code -U_FORTIFY_SOURCE -no-pie -D_GNU_SOURCE=
                -I/usr/src/25/tools/testing/selftests/../../../tools/testing/selftests
                -m32 -mxsave  pkey_sighandler_tests.c vm_util.c thp_settings.c pkey_util.c
                -lrt -lpthread -lm -lrt -ldl -lm
                -o /usr/src/25/tools/testing/selftests/mm/pkey_sighandler_tests_32
  ...

Link: https://lore.kernel.org/20260422080446.26020-1-wangli.ahau@gmail.com
Link: https://lore.kernel.org/20260422080446.26020-2-wangli.ahau@gmail.com
Signed-off-by: Li Wang <wangli.ahau@gmail.com>
Reported-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/cma: fix reserved page leak on activation failure

If cma_activate_area() fails after allocating only part of the range
bitmaps, the cleanup path still has to release the reserved pages when
CMA_RESERVE_PAGES_ON_ERROR is clear.

That is still worth doing even in this __init path.  A bitmap_zalloc()
failure does not necessarily mean the system cannot make further progress:
freeing the reserved CMA pages can return a substantial amount of memory
to the buddy allocator and may relieve the temporary memory shortage that
caused the allocation failure in the first place.

However, the cleanup path currently uses the bitmap-freeing bound for page
release as well.  That is only correct for ranges whose bitmap allocation
already succeeded.  The failed range and all later ranges still keep their
reserved pages, so a partial bitmap allocation failure can permanently
leak them.

Fix this by releasing reserved pages for all ranges.  Use the saved
early_pfn[] value for ranges whose bitmap allocation already succeeded and
for the failed range, and use cmr->early_pfn for later ranges whose bitmap
allocation was never attempted.

Link: https://lore.kernel.org/20260523060123.2207992-1-songmuchun@bytedance.com
Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Acked-by: Usama Arif <usama.arif@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison

Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can
trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock
when racing with a concurrent unmap:

  thread#0                              thread#1
  --------                              --------
  madvise(folio, MADV_HWPOISON)
    -> poisons the folio successfully
  madvise(folio, MADV_HWPOISON)         unmap(folio)
    try_memory_failure_hugetlb
      get_huge_page_for_hwpoison
        spin_lock_irq(&hugetlb_lock)    <- held
        __get_huge_page_for_hwpoison
          hugetlb_update_hwpoison()
            -> MF_HUGETLB_FOLIO_PRE_POISONED
          goto out:
            folio_put()
              refcount: 1 -> 0
              free_huge_folio()
                spin_lock_irqsave(&hugetlb_lock)
                  -> AA DEADLOCK!

The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop
the GUP reference while the hugetlb_lock is still held by the hugetlb.c
wrapper get_huge_page_for_hwpoison().  If concurrent unmap has released
the page table mapping reference, folio_put() drops the folio refcount to
zero, triggering free_huge_folio() which attempts to re-acquire the
non-recursive hugetlb_lock.

Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper
into get_huge_page_for_hwpoison().  Place spin_unlock_irq() before the
folio_put() at the out: label so the folio is always released outside the
lock.

[akpm@linux-foundation.org: fix race, rename label per Miaohe]
Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com
Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com
Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com
Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Acked-by: Muchun Song <muchun.song@linux.dev>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Acked-by: Miaohe Lin <linmiaohe@huawei.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam Howlett <liam.howlett@oracle.com>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Naoya Horiguchi <nao.horiguchi@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/hugetlb: restore reservation on error in hugetlb folio copy paths

Two sites in mm/hugetlb.c allocate a hugetlb folio via
alloc_hugetlb_folio() (consuming a VMA reservation) and then call
copy_user_large_folio(), which became int-returning in commit 1cb9dc4b475c
("mm: hwpoison: support recovery from HugePage copy-on-write faults") and
can now fail (e.g.  -EHWPOISON on a hwpoisoned source page).  On the
failure path, folio_put() restores the global hugetlb pool count through
free_huge_folio(), but the per-VMA reservation map entry is left marked
consumed:

  - hugetlb_mfill_atomic_pte() resubmission path (UFFDIO_COPY)
  - copy_hugetlb_page_range() fork-time CoW path when
    hugetlb_try_dup_anon_rmap() fails (rare: pinned hugetlb anon
    folio under fork)

User-visible effect: on UFFDIO_COPY into a private hugetlb VMA where the
resubmission copy fails, the reservation for that address is leaked from
the VMA's reserve map.  A subsequent fault at the same address takes the
no-reservation path, and under hugetlb pool pressure the task is SIGBUSed
at an address it had previously reserved.  The fork-time CoW path leaks
the same way in the child VMA's reserve map, though it requires the much
rarer combination of pinned hugetlb anon page + hwpoisoned source.

Add the missing restore_reserve_on_error() call before folio_put() on both
error paths.

Link: https://lore.kernel.org/20260520044912.6751-1-devnexen@gmail.com
Fixes: 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults")
Signed-off-by: David Carlier <devnexen@gmail.com>
Reviewed-by: Muchun Song <muchun.song@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Mina Almasry <almasrymina@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: yuehaibing <yuehaibing@huawei.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/cma_debug: fix invalid accesses for inactive CMA areas

cma_activate_area() can fail after allocating range bitmaps.  Its cleanup
path frees those bitmaps, but only clears cma->count and
cma->available_count.  It leaves cma->nranges and each range's count in
place, so cma_debugfs_init() can still register debugfs files for an area
that never activated successfully.

That exposes two problems.  Reading the bitmap file can make debugfs walk
a freed range bitmap and trigger an invalid memory access.  Reading
maxchunk can also take cma->lock even though that lock is initialized only
on the successful activation path.

Fix this by creating debugfs entries only for CMA areas that reached
CMA_ACTIVATED.

c009da4258f9 introduced the invalid access to bitmap file.  2e32b947606d
introduced the invalid access to cma->lock.  This change applies to both
issues.  So I added two Fixes tags.

Link: https://lore.kernel.org/20260520061025.3971821-1-songmuchun@bytedance.com
Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested")
Fixes: 2e32b947606d ("mm: cma: add functions to get region pages counters")
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Dmitry Safonov <0x7f454c46@gmail.com>
Cc: Frank van der Linden <fvdl@google.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Michal Nazarewicz <mina86@mina86.com>
Cc: Stefan Strogin <stefan.strogin@gmail.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

memcg: use round-robin victim selection in refill_stock

Harry Yoo reported that get_random_u32_below() is not safe to call in the
nmi context and memcg charge draining can happen in nmi context.

More specifically get_random_u32_below() is neither reentrant- nor
NMI-safe: it acquires a per-cpu local_lock via local_lock_irqsave() on the
batched_entropy_u32 state.  An NMI that lands on a CPU mid-update of the
ChaCha batch state and recurses into the random subsystem would corrupt
that state.  The memcg_stock local_trylock prevents re-entry on the percpu
stock itself, but cannot protect an unrelated subsystem's per-cpu lock.

Replace the random pick with a per-cpu round-robin counter stored in
memcg_stock_pcp and serialized by the same local_trylock that already
guards cached[] and nr_pages[].  No atomics, no random calls, no extra
locks needed.

Link: https://lore.kernel.org/20260521223751.3794625-1-shakeel.butt@linux.dev
Fixes: f735eebe55f8f ("memcg: multi-memcg percpu charge cache")
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reported-by: Harry Yoo <harry@kernel.org>
Closes: https://lore.kernel.org/4e20f643-6983-4b6e-b12d-c6c4eb20ae0c@kernel.org/
Acked-by: Harry Yoo (Oracle) <harry@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

mm/hugetlb: avoid false positive lockdep assertion

Commit 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split,
not before") changed the locking model around hugetlbfs PMD unsharing on
VMA split, but did not update the function which asserts the locks,
hugetlb_vma_assert_locked().

This function asserts that either the hugetlb VMA lock is held (if a
shared mapping) or that the reservation map lock is held (if private).

If you get an unfortunate race between something which results in one of
these locks being released and a hugetlb VMA split and you have
CONFIG_LOCKDEP enabled, you can therefore see a false positive assertion
arise when there is in fact no issue.

Since this change introduced a new take_locks parameter to
hugetlb_unshare_pmds(), which, when set to false, indicates that locking
is sufficient, simply pass this to the unsharing logic and predicate the
lock assertions on this.

This is safe, as we already asserted the file rmap lock and the VMA write
lock prior to this (implying exclusive mmap write lock), so we cannot be
raced by either rmap or page fault page table walkers which the asserted
locks are intended to protect against (we don't mind GUP-fast).

Separate out huge_pmd_unshare() into __huge_pmd_unshare() to add a
check_locks parameter, and update hugetlb_unshare_pmds() to pass this
parameter to it.

This leaves all other callers of huge_pmd_unshare() still correctly
asserting the locks.

The below reproducer will trigger the assert in a kernel with
CONFIG_LOCKDEP enabled by racing process teardown (which will release the
hugetlb lock) against a hugetlb split.

void execute_one(void)
{
void *ptr;
pid_t pid;

/*
* Create a hugetlb mapping spanning a PUD entry.
*
* We force the hugetlb page allocation with populate and
* noreserve.
*
* |---------------------|
* |                     |
* |---------------------|
* 0                 PUD boundary
*/
ptr = mmap(0, PUD_SIZE, PROT_READ | PROT_WRITE,
   MAP_FIXED | MAP_SHARED | MAP_ANON |
   MAP_NORESERVE | MAP_HUGETLB | MAP_POPULATE,
   -1, 0);
if (ptr == MAP_FAILED) {
perror("mmap");
exit(EXIT_FAILURE);
}

/*
* Fork but with a bogus stack pointer so we try to execute code in
* a non-VM_EXEC VMA, causing segfault + teardown via exit_mmap().
*
* The clone will cause PMD page table sharing between the
* processes first via:
* copy_process() -> ... -> huge_pte_alloc() -> huge_pmd_share()
*
* Then tear down and release the hugetlb 'VMA' lock via:
* exit_mmap() -> ... -> vma_close() -> hugetlb_vma_lock_free()
*/
pid = syscall(__NR_clone, 0, 2 * PMD_SIZE, 0, 0, 0);
if (pid < 0) {
perror("clone");
exit(EXIT_FAILURE);
} if (pid == 0) {
/* Pop stack... */
return;
}

/*
* We are the parent process.
*
* Race the child process's teardown with a PMD unshare.
*
* We do this by triggering:
*
* __split_vma() -> hugetlb_split() -> hugetlb_unshare_pmds()
*
* Which, importantly, doesn't hold the hugetlb VMA lock (nor can
* it), meaning we assert in hugetlb_vma_assert_locked().
*
*            .
* |----------.----------|
* |          .          |
* |----------.----------|
* 0          .     PUD boundary
*/
mmap(0, PUD_SIZE / 2, PROT_READ | PROT_WRITE,
     MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
}

int main(void)
{
int i;

/* Kick off fork children. */
for (i = 0; i < NUM_FORKS; i++) {
pid_t pid = fork();

if (pid < 0) {
perror("fork");
exit(EXIT_FAILURE);
}

/* Fork children do their work and exit. */
if (!pid) {
int j;

for (j = 0; j < NUM_ITERS; j++)
execute_one();
return EXIT_SUCCESS;
}
}

/* If we succeeded, wait on children. */
for (i = 0; i < NUM_FORKS; i++)
wait(NULL);

return EXIT_SUCCESS;
}

[ljs@kernel.org: account for the !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING case]
Link: https://lore.kernel.org/agWZsPGYid08uU6O@lucifer
Link: https://lore.kernel.org/20260513085658.45264-1-ljs@kernel.org
Fixes: 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split, not before")
Signed-off-by: Lorenzo Stoakes <ljs@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Acked-by: Oscar Salvador <osalvador@suse.de>
Cc: Jann Horn <jannh@google.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

bpf: replace pop/push emptiness check with bpf_list_empty()

Simplify fq_flows_is_empty() by replacing the pop/push based emptiness
check with a direct call to bpf_list_empty().
This avoids unnecessary list mutation and simplifies the code while
preserving correctness.

Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com>
Changes since v1:
- Removed unused variable node
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Link: https://lore.kernel.org/r/20260524025853.13786-1-suchitkarunakaran@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Fix race between bpf_map_new_fd() and close_fd()

Because there is time gap between bpf_map_new_fd() and close_fd(), a
concurrent thread is able to close the new fd and opens a new, unrelated
file with the exact same fd number. Thereafter, this close_fd() might
inadvertently close the unrelated file.

To avoid such regression, do finalize log before security_bpf_map_create().

However, in order to achieve it, move bpf_get_file_flag(),
security_bpf_map_create(), bpf_map_alloc_id(), and bpf_map_new_fd() from
__map_create() to map_create(). And, rename __map_create() to
map_create_alloc() meanwhile.

Then, in order to reuse the map and token when all checks pass in
map_create_alloc(), pass "struct bpf_map **" and "struct bpf_token **" to
map_create_alloc().

Fixes: 49f9b2b2a18c ("bpf: Add syscall common attributes support for map_create")
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260521142909.95818-1-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

ring-buffer: Show persistent buffer dropped events in trace_pipe file

When the persistent ring buffer is validated on boot up, if a subbuffer is
deemed invalid, it resets the buffer and continues. Have the code preserve
the RB_MISSED_EVENTS flag in the commit portion of the subbuffer header
and pass that back so that the trace_pipe file can show the missed events
like the trace file does.

For example:

   <...>-1242    [005] d....  4429.120116: page_fault_user: address=0x7ffaebb6e728 ip=0x7ffaeb9d4960 error_code=0x7
   <...>-1242    [005] .....  4429.120124: mm_page_alloc: page=00000000055254f3 pfn=0x1373bd order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
   <...>-1242    [005] d..2.  4429.120132: tlb_flush: pages:1 reason:local MM shootdown (3)
CPU:5 [LOST EVENTS]
   <...>-1242    [005] d....  4429.120661: page_fault_user: address=0x55ba7c2d0944 ip=0x55ba7c20cd02 error_code=0x7
   <...>-1242    [005] .....  4429.120669: mm_page_alloc: page=0000000005a02500 pfn=0x12b6e4 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
   <...>-1242    [005] d..2.  4429.120680: tlb_flush: pages:1 reason:local MM shootdown (3)

Link: https://patch.msgid.link/20260522171052.156419479@kernel.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Show persistent buffer dropped events in trace file

When the persistent ring buffer is validated on boot up, if a subbuffer is
deemed invalid, it resets the buffer and continues. Currently, these lost
events are not shown in the trace file output.

Have the trace iterator look for subbuffers that have the RB_MISSED_EVENTS
set and set the iter->missed_events flag when it is detected. This will
then have the trace file shows "LOST EVENTS" when it reads across a
subbuffer that was corrupted and invalidated.

For example:

<...>-1016 [005] ...1. 6230.660403: preempt_disable: caller=__mod_memcg_state+0x1c8/0x200 parent=__mod_memcg_state+0x1c8/0x200
CPU:5 [LOST EVENTS]
<...>-1016 [005] ..... 6230.660673: kmem_cache_alloc: call_site=__anon_vma_prepare+0x1ad/0x1e0 ptr=000000006e40294c name=anon_vma bytes_req=200 bytes_alloc=208 gfp_flags=GFP_KERNEL node=-1 accounted=true

Link: https://patch.msgid.link/20260522171052.006276604@kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Have dropped subbuffers be persistent across reboots

When the persistent ring buffer detects a corrupted subbuffer, it will
zero its size and report dropped pages in the dmesg, then it continues
normally.

But if a reboot happens without clearing or restarting tracing on the
persistent ring buffer, the next boot will show no pages are dropped.

If the persistent ring buffer is still the same, then it should still
report dropped pages so the user knows that the buffer has missing events.

Add the RB_MISSED_EVENTS flag to the commit value of the subbuffer so that
the next boot will still show that pages were dropped.

Link: https://patch.msgid.link/20260522171051.860780286@kernel.org
Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Cleanup buffer_data_page related code

Code cleanup related to buffer_data_page for readability,
which includes:
- Introduce rb_data_page_commit() and rb_data_page_size()
- Use 'dpage' for buffer_data_page, instead of 'bpage' because
'bpage' is used for buffer_page.

Link: https://patch.msgid.link/20260522171051.722645963@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Cleanup persistent ring buffer validation

Cleanup rb_meta_validate_events() function to make it easier to read.
This includes the following cleanups:
- Introduce rb_validatation_state to hold working variables in
validation.
- Move repleated validation state updates into rb_validate_buffer().
- Move reader_page injection code outside of rb_meta_validate_events().

Link: https://patch.msgid.link/20260522171051.577231395@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Show commit numbers in buffer_meta file

In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)

Link: https://patch.msgid.link/20260522171051.424411323@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Add persistent ring buffer invalid-page inject test

Add a self-corrupting test for the persistent ring buffer.

This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.

This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.

The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.

To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:

reserve_mem=20M:2M:trace trace_instance=ptracingtest^traceoff@trace
panic=1

Run the following commands after the 1st boot:

cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger

After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.

Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)

Link: https://patch.msgid.link/20260522171051.260140328@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer

Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.

To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.

Link: https://patch.msgid.link/20260522171051.091265852@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[ SDR: Have reader_page still get evaluated if header_page fails ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer

Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).

If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.

Link: https://lore.kernel.org/all/20260520185018.051228084@kernel.org/
Link: https://patch.msgid.link/20260522171050.914418536@kernel.org
Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[SDR: Fixed max_loops in rb_iter_peek() as well ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>

Merge tag 'amd-drm-fixes-7.1-2026-05-28' of https://gitlab.freedesktop.org/agd5f/linux into drm-fixes

amd-drm-fixes-7.1-2026-05-28:

amdgpu:
- GEM_OP warning fix
- GEM_OP locking fix
- Userq fixes
- DCN 2.1 refclk fix
- SI fix
- HMM fixes

amdkfd:
- svm_range_set_attr locking fix
- CRIU restore fix
- KFD debugger fix

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Alex Deucher <alexander.deucher@amd.com>
Link: https://patch.msgid.link/20260528211843.893681-1-alexander.deucher@amd.com

x86/cpu: Fix a F00F bug warning and clean up surrounding code

On x86 SMP systems with the F00F bug present, do_clear_cpu_cap()
rightfully warns that the code clears the X86_BUG_F00F flag after
alternatives have been patched.

X86_BUG_F00F is first cleared in intel_workarounds() and then set for
the affected models. This sequence works fine on the BSP but on AP
bringup, where alternatives have already been patched and clearing the
flag there triggers the warning.

There is no technical reason for clearing the flag before setting it. It
is mainly an artifact of introducing the X86_BUG_F00F flag in

e2604b49e8a8 ("x86, cpu: Convert F00F bug detection").

Remove the unnecessary clearing of the flag.

While at it, remove the kernel notification and the surrounding logic to
inform the user about the workaround exactly once. If needed, the
presence of the F00F bug can be determined through /proc/cpuinfo.

Additionally, the F00F bug was the last remaining user of clear_cpu_bug().
With no users left, get rid of this helper as well.

[ bp: Massage commit message. ]

Co-developed-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Richard Weinberger <richard@nod.at>
Signed-off-by: Sohil Mehta <sohil.mehta@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Ahmed S. Darwish <darwi@linutronix.de>
Link: https://patch.msgid.link/20260528184826.3642051-1-sohil.mehta@intel.com

Merge tag 'drm-xe-fixes-2026-05-28' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes

- Restore IDLEDLY regiter on engine reset (Bala)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Rodrigo Vivi <rodrigo.vivi@intel.com>
Link: https://patch.msgid.link/ahhBUt8fDqjB-mQq@intel.com

Merge tag 'qcom-clk-fixes-for-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into clk-fixes

Pull Qualcomm clk driver fixes from Bjorn Andersson:

The parking of shared RCGs during registration parks the MDP source
clock, disabling display until the msm display driver successfully
initializes the hardware again. Mark this clock on Makena and Hamoa as
"no_init_park" to retain a working recovery console etc.

* tag 'qcom-clk-fixes-for-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
clk: qcom: dispcc-sc8280xp: Don't park mdp_clk_src at registration time
clk: qcom: x1e80100-dispcc: Stop disp_cc_mdss_mdp_clk_src from getting parked

Merge tag 'samsung-clk-fixes-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into clk-fixes

Pull a Samsung clk driver fix from Krzysztof Kozlowski:

Google GS101: Correct the register name for saving and restoring state
during system suspend and resume. Lack of proper save/restore leads to
incorrect clock values after system resume.

* tag 'samsung-clk-fixes-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
clk: samsung: gs101: Fix missing USI7_USI DIV clock in peric0_clk_regs

Merge tag 'drm-intel-fixes-2026-05-27' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes

- Fix potential UAF in TTM object purge (Janusz Krzysztofik)
- Use polling when irqs are unavailable [aux] (Michał Grzelak)
- Fix HDR pre-CSC LUT programming loop [color] (Pranay Samala)
- Block DC states on vblank enable when Panel Replay supported [psr] (Jouni Högander)
- Use DC_OFF wake reference to block DC6 on vblank enable [psr] (Jouni Högander)

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Tvrtko Ursulin <tursulin@igalia.com>
Link: https://patch.msgid.link/ahajfPMJE_qKzig5@linux

Merge branch 'docs-page_pool-tweaks-and-updates'

Jakub Kicinski says:

====================
docs: page_pool: tweaks and updates

I'm hoping to start feeding our docs into the AI review tools, instead
of maintaining a separate repo with review prompts. To experiment with
that we have to refresh the docs a little bit.

This set exclusively focuses on the page pool API. First patch is
a straightforward fix for information which is now out of date.
Second one attempts to clarify the NAPI linking requirements.
Third drops the dedicated section about the stats; the document
is primarily developer-facing and the stats should require no
development effort in most cases. Last but not least minor
API cleanup.
====================

Link: https://patch.msgid.link/20260526155722.2790742-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: make page_pool_get_stats() void

The kdoc for page_pool_get_stats() is missing a Returns: statement.
Looking at this function, I have no idea what is the purpose of
the bool it returns. My guess was that maybe the static inline
stub returns false if CONFIG_PAGE_POOL_STATS=n but such static
inline helper doesn't exist at all. All callers pass a pointer
to a struct on the stack. Make this function void.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260526155722.2790742-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: page_pool: drop the mention of the legacy stats API

The Netlink support for querying page pool stats has been
proven out in production, let's remove the mention of the
helper meant for dumping page pool stats into ethtool -S
from the docs.

Call out in the kdoc that this API is deprecated.
Some drivers may not be able to use the Netlink API
(if page pool is shared across netdevs). So the old API
is not _completely_ dead. But we shouldn't advertise it.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260526155722.2790742-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: clarify page pool NAPI consumer requirement

The comment about requirements when to set the NAPI pointer
may not be super clear. Add more words.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260526155722.2790742-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: net: page_pool: drop reference to removed PP_FLAG_PAGE_FRAG

The flag was removed in commit 09d96ee5674a ("page_pool: remove
PP_FLAG_PAGE_FRAG"), but the documentation still mentions it when
describing fragment usage. Drop the stale reference; the fragment
API does not require any opt-in flag.

Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Link: https://patch.msgid.link/20260526155722.2790742-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: pcs: pcs-mtk-lynxi: fix bpi-r3 serdes configuration

Commit 8871389da151 introduces common pcs dts properties which writes
rx=normal,tx=normal polarity to register SGMSYS_QPHY_WRAP_CTRL of switch.
This is initialized with tx-bit set and so change inverts polarity
compared to before.

It looks like mt7531 has tx polarity inverted in hardware and set tx-bit
by default to restore the normal polarity.

The MT7531 datasheet quite clearly states:
Register 000050EC QPHY_WRAP_CTRL -- QPHY wrapper control
Reset value: 0x00000501

BIT 1 RX_BIT_POLARITY -- RX bit polarity control
1'b0: normal
1'b1: inverted

BIT 0 TX_BIT_POLARITY -- TX bit polarity control (TX default inversed
in MT7531)
1'b0: normal
1'b1: inverted

Till this patch the register write was only called when mediatek,pnswap
property was set which cannot be done for switch because the fw-node param
was always NULL from switch driver in the mtk_pcs_lynxi_create call.

Do not configure switch side like it's done before.

Fixes: 8871389da151 ("net: pcs: pcs-mtk-lynxi: deprecate "mediatek,pnswap"")
Signed-off-by: Frank Wunderlich <frank-w@public-files.de>
Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://patch.msgid.link/20260526153239.30194-1-linux@fw-web.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: Use named initializer for zorro_device_id arrays

Using named initializers is more explicit and thus easier to parse for a
human.

While touching these arrays, drop explicit zeros from the list terminator.

This change doesn't introduce changes to the compiled zorro_device_id
arrays.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Link: https://patch.msgid.link/2243204c0f57f79750de00c072914354d4f65707.1779803053.git.u.kleine-koenig@baylibre.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'remove-unused-support-for-crypto-tfm-cloning'

Eric Biggers says:

====================
Remove unused support for crypto tfm cloning

This series is targeting net-next because it depends on
"net/tcp: Remove tcp_sigpool".  So far no commits in cryptodev conflict
with this, so I suggest that this be taken through net-next for 7.2.

This series removes support for transformation cloning from the crypto
API.  Now that the TCP-AO and TCP-MD5 code no longer uses it, it no
longer has a user.  And it's unlikely that a new one will appear, as the
library API solves the problem in a much simpler and more efficient way.

This feature also regressed performance for all crypto API users, since
it changed crypto transformation objects into reference-counted objects.
That added expensive atomic operations.  The refcount is reverted by
this series, thus fixing the performance regression.

A subset of this was previously sent in
https://lore.kernel.org/r/20260307224341.5644-1-ebiggers@kernel.org
Compared to that version, this version is a bit more comprehensive.
====================

Link: https://patch.msgid.link/20260522053028.91165-1-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: api - Fold crypto_alloc_tfmmem() into crypto_create_tfm_node()

Fold crypto_alloc_tfmmem() into its only remaining caller,
crypto_create_tfm_node(). Previously crypto_alloc_tfmmem() was called
by crypto_clone_tfm(), but crypto_clone_tfm() was removed.

This rolls back the refactoring that was done in commit 3c3a24cb0ae4
("crypto: api - Add crypto_clone_tfm").

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20260522053028.91165-7-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: api - Fold __crypto_alloc_tfmgfp() into __crypto_alloc_tfm()

This reverts commit fa3b3565f3ac ("crypto: api - Add
__crypto_alloc_tfmgfp").

Fold __crypto_alloc_tfmgfp() into its only remaining caller,
__crypto_alloc_tfm(). Previously __crypto_alloc_tfmgfp() was called by
crypto_clone_cipher(), but crypto_clone_cipher() was removed.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20260522053028.91165-6-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: api - Remove per-tfm refcount

This reverts commit ae131f4970f0 ("crypto: api - Add crypto_tfm_get").

The refcount in struct crypto_tfm was added solely to support
crypto_clone_tfm(). Before then it was a simple non-refcounted object.

Since crypto_clone_tfm() has been removed, remove the refcount as well.

Note that this eliminates an expensive atomic operation from every tfm
freeing operation. So this revert doesn't just remove unused code, but
it also fixes a performance regression.

Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20260522053028.91165-5-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: api - Remove crypto_clone_tfm()

Since all callers of crypto_clone_tfm() have been removed, remove it.

Note that no tests need to be removed, as this function had no tests.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20260522053028.91165-4-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: cipher - Remove crypto_clone_cipher()

Since the only caller of crypto_clone_cipher() was cmac_clone_tfm()
which has been removed, remove crypto_clone_cipher() as well.

Note that no tests need to be removed, as this function had no tests.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20260522053028.91165-3-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

crypto: hash - Remove support for cloning hash tfms

Hash transformation cloning no longer has a user, and there's a good
chance no new one will appear because the library API solves the problem
in a much simpler and more efficient way. Remove support for it.

Note that no tests need to be removed, as this feature had no tests.

Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Signed-off-by: Eric Biggers <ebiggers@kernel.org>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://patch.msgid.link/20260522053028.91165-2-ebiggers@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

gpu: nova: separate driver type from driver data

Split NovaDriver into a unit struct for trait implementations and a
separate Nova struct for the private driver data.

Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260525225838.276108-6-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: gsp: replace ARef<Device> with &'a Device in sequencer

GspSequencer, GspSeqIter, and GspSequencerParams are already
lifetime-parameterized; the ARef<Device> is unnecessary -- a plain
&'a Device reference suffices.

Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260525225838.276108-5-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: replace ARef<Device> with &'bound Device in SysmemFlush

Now that SysmemFlush is lifetime-parameterized, the ARef<Device> is
unnecessary -- a plain &'bound Device reference suffices.

Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260525225838.276108-4-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: unregister sysmem flush page from Drop

Now that SysmemFlush can borrow the Bar via HRT lifetime, store a
&'bound Bar0 reference and implement Drop to automatically unregister
the sysmem flush page. This removes the need for manual unregister()
calls and the Gpu::unbind() method.

Reported-by: Eliot Courtney <ecourtney@nvidia.com>
Closes: https://lore.kernel.org/all/20260409-fix-systemflush-v1-1-a1d6c968f17c@nvidia.com/
Fixes: 6554ad65b589 ("gpu: nova-core: register sysmem flush page")
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260525225838.276108-3-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

gpu: nova-core: use lifetime for Bar

Take advantage of the lifetime-parameterized pci::Bar<'bound> to hold
the BAR mapping directly in NovaCore<'bound>, and pass a borrowed
reference to Gpu<'bound>.

This eliminates the Arc<Devres<Bar0>> indirection, removes runtime
revocation checks for BAR access, and simplifies Gpu::unbind().

Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Tested-by: Alexandre Courbot <acourbot@nvidia.com>
Link: https://patch.msgid.link/20260525225838.276108-2-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

Merge tag 'wireless-next-2026-05-28' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next

Johannes Berg says:

====================
Mostly driver updates:
- iwlwifi
   - more UHR support
   - NAN (multicast, schedule improvements, multi-station)
   - cleanups, etc.
- ath12k
   - thermal throttling/cooling device support
   - 6 GHz incumbent interference detection
   - channel 177 in 5 GHz
- hwsim: S1G fixes
- mac80211: NAN channel handling improvements

* tag 'wireless-next-2026-05-28' of https://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (143 commits)
  wifi: cfg80211: use strscpy in cfg80211_wext_giwname
  wifi: mac80211: fix channel evacuation logic
  wifi: mac80211: refactor ieee80211_nan_try_evacuate
  wifi: mac80211: add an option to filter out a channel in combinations check
  wifi: mac80211_hwsim: add debug messages for link changes
  wifi: nl80211: re-check wiphy netns in testmode and vendor dump continuations
  wifi: mac80211_hwsim: modernise S1G channel list
  wifi: mac80211_hwsim: don't run RC update on new STA on S1G vif
  wifi: mwifiex: remove an unnecessary check
  wifi: mac80211: add KUnit coverage for negotiated TTLM parser
  wifi: ath12k: fix error unwind on arch_init() failure in PCI probe
  wifi: iwlwifi: mld: fix indentation in iwl_mld_fill_supp_rates()
  wifi: iwlwifi: transport: add memory read under NIC access
  wifi: iwlwifi: dbg: remove unused 'range_len' arg from dump
  wifi: iwlwifi: fw: separate out old-style dump code
  wifi: iwlwifi: fw: dbg: always use non-tracing PRPH access
  wifi: iwlwifi: fw: separate ini dump allocation
  wifi: iwlwifi: fw: move struct iwl_fw_ini_dump_entry to dbg.c
  wifi: iwlwifi: clean up location format/BW encoding
  wifi: iwlwifi: Add names for Killer BE1735x and BE1730x
  ...
====================

Link: https://patch.msgid.link/20260528123911.284536-26-johannes@sipsolutions.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'for-net-2026-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth

Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

- hci_core: Rework hci_dev_do_reset() to use hci_sync functions
- hci_conn: Fix memory leak in hci_le_big_terminate()
- hci_sync: Set HCI_CMD_DRAIN_WORKQUEUE during device close
- hci_sync: Reset device counters in hci_dev_close_sync()
- hci_sync: fix UAF in hci_le_create_cis_sync
- L2CAP: Fix possible crash on l2cap_ecred_conn_rsp
- L2CAP: fix chan ref leak in l2cap_chan_timeout() on !conn
- L2CAP: use chan timer to close channels in cleanup_listen()
- L2CAP: clear chan->ident on ECRED reconfiguration success
- ISO: fix UAF in iso_recv_frame
- ISO: serialize iso_sock_clear_timer with socket lock
- HIDP: fix missing length checks in hidp_input_report()
- 6lowpan: check skb_clone() return value in send_mcast_pkt()
- btusb: Allow firmware re-download when version matches
- hci_qca: Use 100 ms SSR delay for rampatch and NVM loading

* tag 'for-net-2026-05-28' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: hci_sync: Reset device counters in hci_dev_close_sync()
  Bluetooth: hci_sync: Set HCI_CMD_DRAIN_WORKQUEUE during device close
  Bluetooth: hci_core: Rework hci_dev_do_reset() to use hci_sync functions
  Bluetooth: ISO: serialize iso_sock_clear_timer with socket lock
  Bluetooth: ISO: fix UAF in iso_recv_frame
  Bluetooth: L2CAP: Fix possible crash on l2cap_ecred_conn_rsp
  Bluetooth: l2cap: clear chan->ident on ECRED reconfiguration success
  Bluetooth: hci_qca: Use 100 ms SSR delay for rampatch and NVM loading
  Bluetooth: hci_sync: fix UAF in hci_le_create_cis_sync
  Bluetooth: 6lowpan: check skb_clone() return value in send_mcast_pkt()
  Bluetooth: btusb: Allow firmware re-download when version matches
  Bluetooth: HIDP: fix missing length checks in hidp_input_report()
  Bluetooth: L2CAP: use chan timer to close channels in cleanup_listen()
  Bluetooth: L2CAP: fix chan ref leak in l2cap_chan_timeout() on !conn
  Bluetooth: hci_conn: Fix memory leak in hci_le_big_terminate()
====================

Link: https://patch.msgid.link/20260528131839.462344-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'selftests-mptcp-reduce-bufferbloat-and-cleanup'

Matthieu Baerts says:

====================
selftests: mptcp: reduce bufferbloat and cleanup

Bufferbloat is baaaad, even in our selftests: let's kill it (or at least
reduce it). By doing that, the tests (seem to) have a more stable
transfer, and are then less unstable. That's what patches 1-2 are doing,
and they can be backported up to 5.10.

Patch 3 is not related: a small fix in the selftests to remove temp
files that were not deleted in some conditions, since v5.13.
====================

Link: https://patch.msgid.link/20260527-net-mptcp-sft-bufferbloat-exit-v1-0-9afc4e742090@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: sockopt: set EXIT trap earlier

Set the EXIT trap for cleanup immediately after creating temporary file
variables, before init and make_file, to ensure cleanup runs on any
failure or interruption during the early setup phase.

Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn>
Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260527-net-mptcp-sft-bufferbloat-exit-v1-3-9afc4e742090@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: simult_flows: adapt limits

Avoid using a fixed limit, no matter the setup. This was causing too
high bufferbloat in some situations, e.g. with a low bandwidth and very
low delay because the default limit was too high for this case.

Instead, use more appropriated limits. Note that unbalanced bandwidth
modes seem to require slightly higher limits to cope with the different
bursts.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260527-net-mptcp-sft-bufferbloat-exit-v1-2-9afc4e742090@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: simult_flows: disable GSO

Netem is used to apply a rate limit, and its 'limit' option is per
packet.

Disable GSO on both sides to work with packets of a specific size. That
increases the number of packets, but stabilise the throughput. As a
consequence, limits are more adapted, and the bufferbloat is reduced.

Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Link: https://patch.msgid.link/20260527-net-mptcp-sft-bufferbloat-exit-v1-1-9afc4e742090@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: fix race between sctp_wait_for_connect and peeloff

sctp_wait_for_connect() drops and re-acquires the socket lock while
waiting for the association to reach ESTABLISHED state. During this
window, another thread can peeloff the association to a new socket via
getsockopt(SCTP_SOCKOPT_PEELOFF), changing asoc->base.sk. After
re-acquiring the old socket lock, sctp_wait_for_connect() returns
success without noticing the migration — the caller then accesses
the association under the wrong lock in sctp_datamsg_from_user().

Add the same sk != asoc->base.sk check that sctp_wait_for_sndbuf()
already has, returning an error if the association was migrated while
we slept.

Fixes: 668c9beb9020 ("sctp: implement assign_number for sctp_stream_interleave")
Signed-off-by: Zhenghang Xiao <kipreyyy@gmail.com>
Acked-by: Xin Long <lucien.xin@gmail.com>
Link: https://patch.msgid.link/20260527032411.60959-1-kipreyyy@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-mana-fix-null-dereferences-during-teardown-after-attach-failure'

Dipayaan Roy says:

====================
net: mana: Fix NULL dereferences during teardown after attach failure

When mana_attach() fails (e.g. during queue allocation), the error
cleanup frees apc->tx_qp and apc->rxqs and sets them to NULL. Multiple
subsequent teardown paths can then dereference these NULL pointers,
causing kernel panics.

Patch 1 adds NULL guards in the low-level teardown functions
(mana_fence_rqs, mana_destroy_vport, mana_dealloc_queues) so they are
safe to call regardless of queue initialization state. This covers all
callers: mana_remove(), mana_change_mtu() recovery, and internal error
paths in mana_alloc_queues().

Patch 2 adds an early exit in mana_detach() for already-detached ports,
making it safe for non-close callers. This allows the queue reset
handler to safely retry mana_attach() without redundant teardown.
====================

Link: https://patch.msgid.link/20260525081129.1230035-1-dipayanroy@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mana: Skip redundant detach on already-detached port

When mana_per_port_queue_reset_work_handler() runs after a previous
detach succeeded but attach failed, the port is left in a detached
state with apc->tx_qp and apc->rxqs already freed. Calling
mana_detach() again unconditionally leads to NULL pointer dereferences
during queue teardown.

Add an early exit in mana_detach() when the port is already in
detached state (!netif_device_present) for non-close callers, making
it safe to call idempotently. This allows the queue reset handler and
other recovery paths to simply retry mana_attach() without redundant
teardown.

Fixes: 3b194343c250 ("net: mana: Implement ndo_tx_timeout and serialize queue resets per port.")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20260525081129.1230035-3-dipayanroy@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mana: Add NULL guards in teardown path to prevent panic on attach failure

When queue allocation fails partway through, the error cleanup frees
and NULLs apc->tx_qp and apc->rxqs. Multiple teardown paths such as
mana_remove(), mana_change_mtu() recovery, and internal error handling
in mana_alloc_queues() can subsequently call into functions that
dereference these pointers without NULL checks:

- mana_chn_setxdp() dereferences apc->rxqs[0], causing a NULL pointer
dereference panic (CR2: 0000000000000000 at mana_chn_setxdp+0x26).
- mana_destroy_vport() iterates apc->rxqs without a NULL check.
- mana_fence_rqs() iterates apc->rxqs without a NULL check.
- mana_dealloc_queues() iterates apc->tx_qp without a NULL check.

Add NULL guards for apc->rxqs in mana_fence_rqs(),
mana_destroy_vport(), and before the mana_chn_setxdp() call. Add a
NULL guard for apc->tx_qp in mana_dealloc_queues() to skip TX queue
draining when TX queues were never allocated or already freed.

Fixes: ca9c54d2d6a5 ("net: mana: Add a driver for Microsoft Azure Network Adapter (MANA)")
Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
Link: https://patch.msgid.link/20260525081129.1230035-2-dipayanroy@linux.microsoft.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

arm64: dts: rockchip: Add EL2 virtual timer interrupt

The ARMv8.2 based CPUs used in a number of Rockchip SoCs are missing
the EL2 virtual timer interrupt. Add it.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Link: https://patch.msgid.link/20260523140242.586031-16-maz@kernel.org
Signed-off-by: Heiko Stuebner <heiko@sntech.de>

rust: devres: add 'static bound to Devres<T>

Devres::new() registers a callback with the C devres subsystem via
devres_node_add(). If the Devres is leaked (e.g. via
core::mem::forget(), which is safe), its Drop impl never runs, and the
devres release callback will revoke the inner Revocable on device
unbind, which drops T in place. If T contains non-'static references,
those may be dangling by that point.

Add a 'static bound to prevent storing types with borrowed data in
Devres.

Fixes: 76c01ded724b ("rust: add devres abstraction")
Reviewed-by: Alexandre Courbot <acourbot@nvidia.com>
Reviewed-by: Eliot Courtney <ecourtney@nvidia.com>
Link: https://patch.msgid.link/20260526000447.350558-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

Merge tag 'dd-lifetimes-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core into drm-rust-next

Higher-Ranked Lifetime Types for Rust device drivers

Replace drvdata() with registration data on the auxiliary bus. Private
data is now scoped to the registration object, removing the ordering
constraints and lifetime complications that came with drvdata().

Add Higher-Ranked Lifetime Types (HRT) so driver structs can borrow
device resources like pci::Bar and IoMem directly, tied to the device
binding scope. This removes the need for Devres indirection and
ARef<Device> in most driver code.

This is a stable tag for other trees to merge.

Signed-off-by: Danilo Krummrich <dakr@kernel.org>

i2c: designware: Add ACPI ID LECA0003 for LECARC SoCs

Add ACPI ID "LECA0003" for LECARC SoCs that integrate
the DesignWare I2C controller.
Also add corresponding ACPI description in acpi_apd.c.

Signed-off-by: Thomas Lin <thomas_lin@lecomputing.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260526-lecarc-i2c-acpi-id-v1-1-f0942bd491d2@lecomputing.com

i2c: designware: Handle active target cleanly

When the I2C controller attempts a new transaction while the target
controller is shutting down or restarting, it can lead to bus lockups
and system bootloops if the hardware enters an inconsistent state.

Address this by ensuring that the internal state machines are properly
cleared when disabling the controller if target activity is detected.

If the controller remains active after disabling, perform a bus recovery
to reset it to a known good state.

Signed-off-by: William A. Kennington III <william@wkennington.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260527-dw-i2c-v5-4-3483057f8d67@wkennington.com

i2c: designware: Convert platform driver to use shutdown hook

Convert the platform driver to use the new i2c_dw_shutdown() hook,
allowing the controller to gracefully NACK controllers requests during
system shutdown.

Signed-off-by: William A. Kennington III <william@wkennington.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260527-dw-i2c-v5-3-3483057f8d67@wkennington.com

i2c: designware: Convert PCI driver to use shutdown hook

Convert the PCI driver to use the new i2c_dw_shutdown() hook, allowing
the controller to gracefully NACK controller requests during system
shutdown.

Signed-off-by: William A. Kennington III <william@wkennington.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260527-dw-i2c-v5-2-3483057f8d67@wkennington.com

i2c: designware: Introduce shutdown exported function

Introduce an exported shutdown function to safely shutdown the
DesignWare I2C controller.

This shutdown hook gracefully sets the target disable bit before disabling
the controller. This guarantees that any incoming requests from the
controller are immediately NACKed during shutdown, preventing the bus from
hanging.

Signed-off-by: William A. Kennington III <william@wkennington.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20260527-dw-i2c-v5-1-3483057f8d67@wkennington.com

Merge patch series "rust: device: Higher-Ranked Lifetime Types for device drivers"

Danilo Krummrich <dakr@kernel.org> says:

Currently, Rust device drivers access device resources such as PCI BAR mappings
and I/O memory regions through Devres<T>.

Devres::access() provides zero-overhead access by taking a &Device<Bound>
reference as proof that the device is still bound. Since a &Device<Bound> is
available in almost all contexts by design, Devres is mostly a type-system level
proof that the resource is valid, but it can also be used from scopes without
this guarantee through its try_access() accessor.

This works well in general, but has a few limitations:

  - Every access to a device resource goes through Devres::access(), which
    despite zero cost, adds boilerplate to every access site.

  - Destructors do not receive a &Device<Bound>, so they must use try_access(),
    which can fail. In practice the access succeeds if teardown ordering is
    correct, but the type system can't express this, forcing drivers to handle a
    failure path that should never be taken.

  - Sharing a resource across components (e.g. passing a BAR to a sub-component)
    requires Arc<Devres<T>>.

  - Device references must be stored as ARef<Device> rather than plain &Device
    borrows.

These limitations stem from the driver's bus device private data being 'static
-- the driver struct cannot borrow from the device reference it receives in
probe(), even though it structurally cannot outlive the device binding.

This series introduces Higher-Ranked Lifetime Types (HRT) for Rust device
drivers. An HRT is a type that is generic over a lifetime -- it does not have a
fixed lifetime, but can be instantiated with any lifetime chosen by the caller.

Bus driver traits use a Generic Associated Type (GAT) type Data<'bound> to
introduce the lifetime on the private data, rather than parameterizing the
Driver trait itself. This avoids a driver trait global lifetime and avoids the
need for ForLt for bus device private data, making the bus implementations much
simpler. ForLt is only needed for auxiliary registration data, where the
lifetime is not introduced by a trait callback but must be threaded through
Registration.

With HRT, driver structs carry a lifetime parameter tied to the device binding
scope -- the interval of a bus device being bound to a driver. Device resources
like pci::Bar<'bound> and IoMem<'bound> are handed out with this lifetime, so
the compiler enforces at build time that they do not escape the binding scope.

Before:

struct MyDriver {
    pdev: ARef<pci::Device>,
    bar: Devres<pci::Bar<BAR_SIZE>>,
}

let io = self.bar.access(dev)?;
io.read32(OFFSET);

After:

struct MyDriver<'bound> {
    pdev: &'bound pci::Device,
    bar: pci::Bar<'bound, BAR_SIZE>,
}

self.bar.read32(OFFSET);

Lifetime-parameterized device resources can be put into a Devres at any point
via Bar::into_devres() / IoMem::into_devres(), providing the exact same
semantics as before. This is useful for resources shared across subsystem
boundaries where revocation is needed.

This also synergizes with the upcoming self-referential initialization support
in pin-init, which allows one field of the driver struct to borrow another
during initialization without unsafe code.

The same pattern is applied to auxiliary device registration data as a first
example beyond bus device private data. Registration<F: ForLt> can hold
lifetime-parameterized data tied to the parent driver's binding scope. Since the
auxiliary bus guarantees that the parent remains bound while the auxiliary
device is registered, the registration data can safely borrow the parent's
device resources.

More generally, binding resource lifetimes to a registration scope applies to
every registration that is scoped to a driver binding -- auxiliary devices,
class devices, IRQ handlers, workqueues.

A follow-up series extends this to class device registrations, starting with
DRM, so that class device callbacks (IOCTLs, etc.) can safely access device
resources through the separate registration data bound to the registration's
lifetime without Devres indirection.

Thanks to Gary for coming up with the ForLt implementation; thanks to Alice for
the early discussions around lifetime-parameterized private data that helped
shape the direction of this work.

Link: https://patch.msgid.link/20260525202921.124698-1-dakr@kernel.org
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

arm: boot: ep93xx: don't rely on machine_is_*() for removed board files

Code in misc-ep93xx.h relies on machine_is_*() macros for several
boards that no longer have legacy board files. They were removed in
commit e5ef574dda70 ("ARM: ep93xx: delete all boardfiles"). This
prevents the removal of machine IDs no longer used by the kernel from
mach-types. To resolve this issue, create local copies of these macros.
(The checks themselves are still valid because the IDs are still passed
in by the bootloader on these machines.) Also take the opportunity to
remove three repeated checks for the same ID.

Signed-off-by: Ethan Nelson-Moore <enelsonmoore@gmail.com>
Acked-by: Alexander Sverdlin <alexander.sverdlin@gmail.com>
Signed-off-by: Alexander Sverdlin <asv@kernel.org>

rust: drm: add FEAT_RENDER flag for render node support

Add FEAT_RENDER bool constant to the Driver trait to control
render node support. When enabled, the driver exposes /dev/dri/renderDXX
render nodes to userspace. The flag defaults to false, drivers can opt
in by setting it to true in their Driver implementation.

This is then enabled in the Tyr driver, while it's left disabled for
Nova for the time being.

Co-developed-by: Daniel Almeida <daniel.almeida@collabora.com>
Signed-off-by: Daniel Almeida <daniel.almeida@collabora.com>
Reviewed-by: Alice Ryhl <aliceryhl@google.com>
Signed-off-by: Laura Nao <laura.nao@collabora.com>
Link: https://patch.msgid.link/20260507080914.95478-2-laura.nao@collabora.com
Signed-off-by: Danilo Krummrich <dakr@kernel.org>

i2c: icy: Use named initializer for zorro_device_id arrays

Using named initializers is more explicit and thus easier to parse for a
human.

While touching this array, drop explicit zeros from the list terminator.

This change doesn't introduce changes to the compiled zorro_device_id
array.

Signed-off-by: Uwe Kleine-König (The Capable Hub) <u.kleine-koenig@baylibre.com>
Reviewed-by: Max Staudt <max@enpas.org>
Reviewed-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/3d7690c7a8948f977d6c50bd0c8010efb715fbdc.1779803053.git.u.kleine-koenig@baylibre.com

arm64: tegra: Add #{address,size}-cells to Chromium-based /firmware

Chromium/Depthcharge bootloaders may dynamically add a few device nodes
to a system's DTB under a /firmware node. A typical DT looks something
like the following:

/ {
        firmware {
                ranges;

                coreboot {
                        compatible = "coreboot";
                        reg = <...>;
                        ...;
                };
        };
};

Notably, the /firmware node has an empty 'ranges', but does not have
address/size-cells.

Commit 6e5773d52f4a ("of/address: Fix WARN when attempting translating
non-translatable addresses") started requiring #address-cells for a
device's parent if we want to use the reg resource in a device node.
This leads to errors like the following:

[    7.763870] coreboot_table firmware:coreboot: probe with driver coreboot_table failed with error -22

Add appropriate #{address,size}-cells to work around the problem.

Note that Google has also patched the Depthcharge bootloader source to
add {address,size}-cells [1], but bootloader updates are typically
delivered only via Google OS updates. Not all users install Google
software updates, and even if they do, Google may not produce updated
binaries for all/older devices.

[1] https://lore.kernel.org/all/20241209092809.GA3246424@google.com/
    https://crrev.com/c/6051580 ("coreboot: Insert #address-cells and
    #size-cells for firmware node")

Closes: https://lore.kernel.org/all/aeKlYzTiL0OB1y3g@google.com/
Fixes: 6e5773d52f4a ("of/address: Fix WARN when attempting translating non-translatable addresses")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: tegra: Configure Tegra114 power domains

Add power domains found in Tegra114 and configure operating-points-v2 for
supported devices accordingly.

Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Reviewed-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: tegra: Add DC interconnections for Tegra114

Add DC interconnections to Tegra114 device tree to reflect connections
between MC, EMC and DC.

Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Reviewed-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: tegra: Add EMC OPP and ICC properties to Tegra114 EMC and ACTMON device-tree nodes

Add EMC OPP tables and interconnect paths that will be used for dynamic
memory bandwidth scaling based on memory utilization statistics.

Signed-off-by: Svyatoslav Ryhel <clamor95@gmail.com>
Reviewed-by: Mikko Perttunen <mperttunen@nvidia.com>
Signed-off-by: Thierry Reding <treding@nvidia.com>

ARM: tegra: Add #{address,size}-cells to Chromium-based /firmware

Chromium/Depthcharge bootloaders may dynamically add a few device nodes
to a system's DTB under a /firmware node. A typical DT looks something
like the following:

/ {
        firmware {
                ranges;

                coreboot {
                        compatible = "coreboot";
                        reg = <...>;
                        ...;
                };
        };
};

Notably, the /firmware node has an empty 'ranges', but does not have
address/size-cells.

Commit 6e5773d52f4a ("of/address: Fix WARN when attempting translating
non-translatable addresses") started requiring #address-cells for a
device's parent if we want to use the reg resource in a device node.
This leads to errors like the following:

[    7.763870] coreboot_table firmware:coreboot: probe with driver coreboot_table failed with error -22

Add appropriate #{address,size}-cells to work around the problem.

Note that Google has also patched the Depthcharge bootloader source to
add {address,size}-cells [1], but bootloader updates are typically
delivered only via Google OS updates. Not all users install Google
software updates, and even if they do, Google may not produce updated
binaries for all/older devices.

[1] https://lore.kernel.org/all/20241209092809.GA3246424@google.com/
    https://crrev.com/c/6051580 ("coreboot: Insert #address-cells and
    #size-cells for firmware node")

Closes: https://lore.kernel.org/all/aeKlYzTiL0OB1y3g@google.com/
Fixes: 6e5773d52f4a ("of/address: Fix WARN when attempting translating non-translatable addresses")
Signed-off-by: Brian Norris <briannorris@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>

Merge tag 'memory-controller-drv-tegra-dt-bindings-7.2' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux-mem-ctrl into for-7.2/arm/dt

DT bindings for memory controller drivers Tegra SoC

Devicetree bindings used by memory controller drivers and DTS for Tegra
SoCs.

i2c: sis630: Replace dev_err() with dev_err_probe() in probe function

This simplifies the code while improving log.

Signed-off-by: Enrico Zanda <e.zanda1@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20250520194400.341079-11-e.zanda1@gmail.com

i2c: sis96x: Replace dev_err() with dev_err_probe() in probe function

This simplifies the code while improving log.

Signed-off-by: Enrico Zanda <e.zanda1@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20250520194400.341079-10-e.zanda1@gmail.com

i2c: sprd: Replace dev_err() with dev_err_probe() in probe function

This simplifies the code while improving log.

Signed-off-by: Enrico Zanda <e.zanda1@gmail.com>
Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
Link: https://lore.kernel.org/r/20250520194400.341079-9-e.zanda1@gmail.com