Kairui Song [Sun, 17 May 2026 15:39:51 +0000 (23:39 +0800)]
mm, swap: merge zeromap into swap table
By allocating one additional bit in the swap table entry's flags field
alongside the count, we can store the zeromap inline
For 64 bit systems, zeromap will store in the swap table, avoiding zeromap
allocation. It reduces the allocated memory. That is the happy path.
For certain 32-bit archs, there might not be enough bits in the swap table
to contain both PFN and flags. Therefore, conditionally let each cluster
have a zeromap field at build time, and use that instead. If the swapfile
cluster is not fully used, it will still save memory for zeromap. The
empty cluster does not allocate a zeromap. In the worst case, all cluster
are fully populated. We will use memory similar to the previous zeromap
implementation.
A few macros were moved to different headers for build time struct
definition.
[akpm@linux-foundation.org: swap_cluster_alloc_table(): remove unused local `ret]
[akpm@linux-foundation.org: fix unused label `err_free'] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-12-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Youngjun Park <youngjun.park@lge.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:50 +0000 (23:39 +0800)]
mm/memcg: remove no longer used swap cgroup array
Now all swap cgroup records are stored in the swap cluster directly, the
static array is no longer needed.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-11-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:49 +0000 (23:39 +0800)]
mm/memcg, swap: store cgroup id in cluster table directly
Drop the usage of the swap_cgroup_ctrl, and use the dynamic cluster table
instead.
The per-cluster memcg table is 1024 / 512 bytes on most archs, and does
not need RCU protection: the cgroup data is only read and written under
the cluster lock. That keeps things simple, lets the allocation use plain
kmalloc with immediate kfree (no deferred free), and keeps fragmentation
acceptable.
[akpm@linux-foundation.org: memcgv1: don't compile swap functions when CONFIG_SWAP=n] Link: https://lore.kernel.org/202605281711.bSeZlErK-lkp@intel.com
[akpm@linux-foundation.org: fix CONFIG_SWAP=n build] Link: https://lore.kernel.org/20260517-swap-table-p4-v5-10-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:48 +0000 (23:39 +0800)]
mm, swap: consolidate cluster allocation helpers
Swap cluster table management is spread across several narrow helpers. As
a result, the allocation and fallback sequences are open-coded in multiple
places.
A few more per-cluster tables will be added soon, so avoid duplicating
these sequences per table type. Fold the existing pairs into
cluster-oriented helpers, and rename for consistency.
No functional change, only a few sanity checks are slightly adjusted.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-9-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:47 +0000 (23:39 +0800)]
mm, swap: delay and unify memcg lookup and charging for swapin
Instead of checking the cgroup private ID during page table walk in
swap_pte_batch(), move the memcg lookup into __swap_cache_add_check()
under the cluster lock.
The first pre-alloc check is speculative and skips the memcg check since
the post-alloc stable check ensures all slots covered by the folio belong
to the same memcg. It is very rare for contiguous and aligned entries
across a contiguous region of a page table of the same process or shmem
mapping to belong to different memcgs.
This also prepares for recording the memcg info in the cluster's table.
Also make the order check and fallback more compact.
There should be no user-observable behavior change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-8-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:46 +0000 (23:39 +0800)]
mm, swap: support flexible batch freeing of slots in different memcgs
Instead of requiring the caller to ensure all slots are in the same memcg,
make the function handle different memcgs at once.
This is both a micro optimization and required for removing the memcg
lookup in the page table layer, so it can be unified at the swap layer.
We are not removing the memcg lookup in the page table in this commit. It
has to be done after the memcg lookup is deferred to the swap layer.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-7-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:45 +0000 (23:39 +0800)]
mm/memcg, swap: tidy up cgroup v1 memsw swap helpers
The cgroup v1 swap helpers always operate on swap cache folios whose swap
entry is stable: the folio is locked and in the swap cache. There is no
need to pass the swap entry or page count as separate parameters when they
can be derived from the folio itself.
Simplify the redundant parameters and add sanity checks to document the
required preconditions.
Also rename memcg1_swapout to __memcg1_swapout to indicate it requires
special calling context: the folio must be isolated and dying, and the
call must be made with interrupts disabled.
No functional change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-6-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:44 +0000 (23:39 +0800)]
mm, swap: unify large folio allocation
Now that direct large order allocation is supported in the swap cache,
both anon and shmem can use it instead of implementing their own methods.
This unifies the fallback and swap cache check, which also reduces the
TOCTOU race window of swap cache state: previously, high order swapin
required checking swap cache states first, then allocating and falling
back separately. Now all these steps happen in the same compact loop.
Order fallback and statistics are also unified, callers just need to check
and pass the acceptable order bitmask.
There is basically no behavior change. This only makes things more
unified and prepares for later commits. Cgroup and zero map checks can
also be moved into the compact loop, further reducing race windows and
redundancy
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-5-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:43 +0000 (23:39 +0800)]
mm, swap: add support for stable large allocation in swap cache directly
To make it possible to allocate large folios directly in swap cache,
provide a new infrastructure helper to handle the swap cache status check,
allocation, and order fallback in the swap cache layer
The new helper replaces the existing swap_cache_alloc_folio. Based on
this, all the separate swap folio allocation that is being done by anon /
shmem before is converted to use this helper directly, unifying folio
allocation for anon, shmem, and readahead.
This slightly consolidates how allocation is synchronized, making it more
stable and less prone to errors. The slot-count and cache-conflict check
is now always performed with the cluster lock held before allocation, and
repeated under the same lock right before cache insertion. This double
check produces a stable result compared to the previous anon and shmem
mTHP allocation implementation, avoids the false-negative conflict checks
that the lockless path can return — large allocations no longer have to
be unwound because the range turned out to be occupied — and aborts
early for already-freed slots, which helps ordinary swapin and especially
readahead, with only a marginal increase in cluster-lock contention (the
lock is very lightly contended and stays local in the first place).
Hence, callers of swap_cache_alloc_folio() no longer need to check the
swap slot count or swap cache status themselves.
And now whoever first successfully allocates a folio in the swap cache
will be the one who charges it and performs the swap-in. The race window
of swapping is also reduced since the loop is much more compact.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-4-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:42 +0000 (23:39 +0800)]
mm/huge_memory: move THP gfp limit helper into header
Shmem has some special requirements for THP GFP and has to limit it in
certain zones or provide a more lenient fallback.
We'll use this helper for generic swap THP allocation, which needs to
support shmem. For a typical GFP_HIGHUSER_MOVABLE swap-in, this helper is
basically a no-op. But it's necessary for certain shmem users, mostly
drivers.
No feature change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-3-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Sun, 17 May 2026 15:39:41 +0000 (23:39 +0800)]
mm, swap: move common swap cache operations into standalone helpers
Move a few swap cache checking, adding, and deletion operations into
standalone helpers to be used later. And while at it, add proper kernel
doc.
No feature or behavior change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-2-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Youngjun Park <youngjun.park@lge.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This series unifies the allocation and charging of anon and shmem swap in
folios, provides better synchronization, consolidates the metadata
management, hence dropping the static array and map, and improves the
performance. The static metadata overhead is now close to zero, and
workload performance is slightly improved.
For example, mounting a 1TB swap device saves about 512MB of memory:
Before:
free -m
total used free shared buff/cache available
Mem: 1464 805 346 1 382 658
Swap: 1048575 0 1048575
After:
free -m
total used free shared buff/cache available
Mem: 1464 277 899 1 356 1187
Swap: 1048575 0 1048575
Memory usage is ~512M lower, and we now have a close to 0 static overhead.
It was about 2 bytes per slot before, now roughly 0.09375 bytes per slot
(48 bytes ci info per cluster, which is 512 slots).
Performance test is also looking good, testing Redis in a 2G VM using 6G
ZRAM as swap:
This series also reduces memory thrashing, I no longer see any: "Huh
VM_FAULT_OOM leaked out to the #PF handler. Retrying PF", it was shown
several times during stress testing before this series when under great
pressure:
Instead of trying to return the existing folio if the entry is already
cached in swap_cache_alloc_folio, simply return an error pointer if the
allocation failed, and drop the output argument that indicates what kind
of folio is actually returned.
And a proper wrapper swap_cache_read_folio that decouples and handles the
actual requirement - read in the folio, or return the already read folio
in cache. This is what async swapin and readahead actually required.
As for zswap swap out, the caller just needs to abort if the allocation
fails because the entry is gone or already cached, so removing simplifies
the return argument, making it cleaner.
No feature change.
Link: https://lore.kernel.org/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com Link: https://lore.kernel.org/20260517-swap-table-p4-v5-1-88ae43e064c7@tencent.com Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Zi Yan <ziy@nvidia.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: swap_cgroup: fix NULL deref in lookup_swap_cgroup_id on swapless host
lookup_swap_cgroup_id() passes swap_cgroup_ctrl[type].map to
__swap_cgroup_id_lookup() without checking that the type was ever
registered via swap_cgroup_swapon(). On a swapless host every ctrl->map
is NULL, so __swap_cgroup_id_lookup() dereferences NULL + a scaled
swp_offset().
Since commit bea67dcc5eea ("mm: attempt to batch free swap entries for
zap_pte_range()"), zap_pte_range() -> swap_pte_batch() calls
lookup_swap_cgroup_id() on any non-present, non-none PTE that decodes as a
real swap entry, without first validating it against swap_info[]. A
single PTE corrupted into a type-0 swap entry takes the host down at
process exit.
We hit this in production on a swapless 6.12.58 host: ~1s of
"get_swap_device: Bad swap file entry 3f800204222bb" (do_swap_page() being
correctly defensive about the same entry) followed by
BUG: unable to handle page fault for address: 000003f800204220
RIP: 0010:lookup_swap_cgroup_id+0x2b/0x60
Call Trace:
swap_pte_batch+0xbf/0x230
zap_pte_range+0x4c8/0x780
unmap_page_range+0x190/0x3e0
exit_mmap+0xd9/0x3c0
do_exit+0x20c/0x4b0
syzbot has reported the identical stack.
The source of the PTE corruption is a separate bug; this change makes the
teardown path as robust as the fault path already is. Every other caller
of lookup_swap_cgroup_id() is downstream of a get_swap_device() that has
already validated the entry, so the new branch is cold.
Link: https://lore.kernel.org/20260504-swap-cgroup-fix-7-0-v1-1-f53ff41ee553@linux.dev Fixes: bea67dcc5eea ("mm: attempt to batch free swap entries for zap_pte_range()") Signed-off-by: Jose Fernandez (Anthropic) <jose.fernandez@linux.dev> Reported-by: syzbot+e12bd9ca48157add237a@syzkaller.appspotmail.com Link: https://lore.kernel.org/r/69859728.050a0220.3b3015.0033.GAE@google.com Assisted-by: Claude:unspecified Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <ryncsn@gmail.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Tue, 19 May 2026 14:17:58 +0000 (14:17 +0000)]
mm/page_alloc: document that alloc_pages_nolock() uses RCU
The allocator interacts with cgroups which rely on RCU. RCU does not work
everywhere, so the "any context" claim is slightly overstated here.
This should already be enforced by objtool, since this function is not
marked noinstr the x86 build should fail if you call it from a place where
RCU is not watching. But, expecting readers to make that connection for
themselves seems a bit cruel (I don't think there is even any
documentation of what noinstr means at all, let alone the connection with
RCU).
Note this is not claiming that any cgroup code called from the allocator
would actually break if this restriction was violated, it could very well
be that there's no real way for the allocator to act on a cgroup that can
disappear concurrently. But, since it's likely nobody has verified this
one way or another, better to just be safe and declare that RCU is
required. Allocating from an RCU-unsafe context seems a bit crazy anyway.
Link: https://lore.kernel.org/20260519-nolock-rcu-comment-v1-1-4a630c8794e5@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Suggested-by: Junaid Shahid <junaids@google.com> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Wed, 13 May 2026 12:35:15 +0000 (12:35 +0000)]
mm: rejig pageblock mask definitions
- Add a PAGEBLOCK_ prefix to the names to avoid polluting the "global
namespace" too much.
- This new prefix makes MIGRATETYPE_AND_ISO_MASK look pretty long. Well,
that global mask only exists for quite a specific purpose, and is
quite a weird thing to have a name for anyway. So drop it and take
advantage of the newly-defined PAGEBLOCK_ISO_MASK.
Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-3-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Wed, 13 May 2026 12:35:14 +0000 (12:35 +0000)]
mm/page_alloc: don't overload migratetype in find_suitable_fallback()
This function currently returns a signed integer that encodes status
in-band, as negative numbers, along with a migratetype. Switch to a more
explicit/verbose style that encodes the status and migratetype separately.
In the spirit of making things more explicit, also create an enum to avoid
using magic integer literals with special meanings. This enables
documenting the values at their definition instead of in one of the
callers.
Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-2-dacdf5402be8@google.com Signed-off-by: Brendan Jackman <jackmanb@google.com> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Brendan Jackman [Wed, 13 May 2026 12:35:13 +0000 (12:35 +0000)]
mm: introduce for_each_free_list()
Patch series "mm: misc cleanups from __GFP_UNMAPPED series".
In v2 of the __GFP_UNMAPPED series [0], we realised that some of the
patches could potentially be merged as independent cleanups.
These are all independent of one another, if you think some are useful
cleanups and others are pointless churn, it should be fine to just pick
whatever subset you prefer.
No functional change intended.
This patch (of 4):
There are a couple of places that iterate over the freelists with
awareness of the data structures' layout.
It seems ideally, code outside of mm should not be aware of the page
allocator's freelists at all. But, this patch just doesn't hide them
completely, it's just a meek incremental step in that direction: provide a
macro to iterate over it without needing to be aware of the actual struct
fields.
Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-0-dacdf5402be8@google.com Link: https://lore.kernel.org/20260513-page_alloc-unmapped-prep-v1-1-dacdf5402be8@google.com Link: https://lore.kernel.org/all/20260320-page_alloc-unmapped-v2-0-28bf1bd54f41@google.com/ Signed-off-by: Brendan Jackman <jackmanb@google.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Len Brown <lenb@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Tue, 12 May 2026 20:45:59 +0000 (16:45 -0400)]
mm/filemap: fix page_cache_prev_miss() when no hole is found
page_cache_prev_miss() is documented to return a value outside the
searched range when no gap is found. However, the no-gap-found path
returns xas.xa_index, which after a successful loop is the first index in
the range. As such, that index is misreported as a gap.
The sole caller, page_cache_sync_ra(), uses the return value to estimate
the cached run preceding a sequential read. In some cases, the buggy
return value can undercount the contiguous range by one, shrinking the
readahead window or pushing borderline requests into the small-random-read
branch.
Fix this by returning the start of the range - 1 when no hole is found.
Update page_cache_next_miss() for clarity as well.
Both helpers were previously fixed together in commit 9425c591e06a ("page
cache: fix page_cache_next/prev_miss off by one"), but the fix was
reverted because it caused a hugetlb performance regression. hugetlb no
longer uses these functions and next_miss was subsequently refixed in
commit 901a269ff3d5 ("filemap: fix page_cache_next_miss() when no hole
found") and commit bbcaee20e03e ("readahead: fix return value of
page_cache_next_miss() when no hole is found"), but prev_miss was not
addressed.
This was found by pointing Claude Opus 4.7 at mm/filemap.c.
Link: https://lore.kernel.org/20260512-prev_miss_fix-v2-1-4af8e5c1ae62@columbia.edu Fixes: 0d3f92966629 ("page cache: Convert hole search to XArray") Assisted-by: Claude:claude-opus-4-7 Signed-off-by: Tal Zussman <tz2294@columbia.edu> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Vishal Moola <vishal.moola@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Wed, 13 May 2026 02:21:18 +0000 (10:21 +0800)]
tools/mm/page-types: fix kpageflags option argument in getopt_long
The --kpageflags option requires an argument to specify the kpageflags
file path, but has_arg was set to 0 (no_argument) in the long options
table. Change it to 1 (required_argument) so getopt_long correctly parses
the argument.
Link: https://lore.kernel.org/20260513022120.58033-4-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Wed, 13 May 2026 02:21:17 +0000 (10:21 +0800)]
tools/mm/page-types: fix ternary operator precedence in sigbus handler
The ternary operator (?:) has lower precedence than addition (+), so the
expression `off + sigbus_addr ? sigbus_addr - ptr : 0` was parsed as
`(off + sigbus_addr) ? (sigbus_addr - ptr) : 0` rather than the intended
`off + (sigbus_addr ? sigbus_addr - ptr : 0)`. Add explicit parentheses
to ensure the correct evaluation order.
Link: https://lore.kernel.org/20260513022120.58033-3-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Acked-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand (Arm) <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Wed, 13 May 2026 02:21:16 +0000 (10:21 +0800)]
tools/mm/page-types: fix typo in madvise() error message
Patch series "tools/mm/page-types: Fix misc bugs".
This series fixes three issues in tools/mm/page-types.c:
1. Fix two typos in madvise() error messages ("madvice" -> "madvise")
2. Fix operator precedence bug in the sigbus handler where the ternary
operator binds looser than addition, producing incorrect offset
calculation when sigbus_addr is non-NULL
3. Fix --kpageflags option declaration in getopt_long: has_arg should
be 1 (required_argument) since the option requires a file path
This patch (of 3):
Two error messages incorrectly spelled the madvise() function name as
"madvice". Fix the typo in both occurrences.
wangxuewen [Wed, 13 May 2026 07:52:14 +0000 (15:52 +0800)]
mm/shrinker: simplify shrinker_memcg_alloc() using guard()
Use guard(mutex) to automatically handle shrinker_mutex locking and
unlocking in shrinker_memcg_alloc(). This removes the explicit
mutex_unlock() call, the goto-based error path, and the redundant ret
variable, resulting in cleaner and more concise code.
Link: https://lore.kernel.org/20260513075214.2655710-1-18810879172@163.com Signed-off-by: wangxuewen <wangxuewen@kylinos.cn> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Xuewen Wang <wangxuewen@kylinos.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mremap_userfaultfd_prep() increments ctx->mmap_changing to stall
concurrent operations, but mremap_userfaultfd_fail() does not
decrement it before dropping the context reference.
If an mremap operation fails, ctx->mmap_changing remains elevated. This
will causes subsequent userfaultfd operations like a UFFDIO_COPY to fail
with -EAGAIN.
Decrement ctx->mmap_changing in mremap_userfaultfd_fail().
Link: https://sashiko.dev/#/patchset/20260430113512.115938-1-rppt@kernel.org Link: https://lore.kernel.org/20260513081416.495963-1-rppt@kernel.org Fixes: df2cc96e7701 ("userfaultfd: prevent non-cooperative events vs mcopy_atomic races") Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Peter Xu <peterx@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hao Ge [Wed, 13 May 2026 08:25:25 +0000 (16:25 +0800)]
lib/test_hmm: use kvfree() to free kvcalloc() allocations
Coccinelle scripts/coccinelle/api/kfree_mismatch.cocci reports
the following warnings:
lib/test_hmm.c:1256:15-16: WARNING kvmalloc is used to allocate this memory at line 1191
lib/test_hmm.c:1257:15-16: WARNING kvmalloc is used to allocate this memory at line 1196
Fix this by replacing kfree() with kvfree() to correctly handle the
vmalloc() fallback path of kvcalloc().
Link: https://lore.kernel.org/20260513082525.154036-1-hao.ge@linux.dev Fixes: 775465fd26a3 ("lib/test_hmm: add zone device private THP test infrastructure") Signed-off-by: Hao Ge <hao.ge@linux.dev> Acked-by: Balbir Singh <balbirs@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Leon Romanovsky <leon@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Wed, 13 May 2026 09:21:11 +0000 (17:21 +0800)]
mm, swap: avoid leaving unused extend table after alloc race
Allocating an extend table requires dropping the ci lock first. While the
lock is dropped, a concurrent put can decrease the slot's swap count to a
value that is no longer maxed out, so the extend table is no longer
required. The current allocation path still attach the new extend table
to the cluster anyway, leaving it unused.
The next maxed out count on the same cluster may still reuse the table,
and frees it properly. But swapoff could leak it indeed.
To eliminate the waste, re-check under the ci lock that the extend table
is still needed before publishing it, and free the local allocation
otherwise.
Also close the check window by ensuring every count decrement that brings
a slot below SWP_TB_COUNT_MAX - 1 runs swap_extend_table_try_free(), not
just the MAX to MAX - 1 transition. With this, a freshly published extend
table that becomes redundant due to a racing put is freed on the very next
decrement, restoring the invariant that an empty cluster never has a
non-NULL ci->extend_table.
The added overhead is ignorable.
[kasong@tencent.com: v2] Link: https://lore.kernel.org/20260515-swap-extend-table-fix-v2-1-833d72ad53e5@tencent.com Link: https://lore.kernel.org/20260513-swap-extend-table-fix-v1-1-a71dea851fb3@tencent.com Fixes: 0d6af9bcf383 ("mm, swap: use the swap table to track the swap count") Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: Breno Leitao <leitao@debian.org> Closes: https://lore.kernel.org/linux-mm/agG6Dp0umhs6O1SY@gmail.com/ Tested-by: Breno Leitao <leitao@debian.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Frederick Mayle [Fri, 8 May 2026 18:12:31 +0000 (11:12 -0700)]
mm/readahead: no PG_readahead on EOF
When readahead pulls in all the remaining pages for a file, setting the
readahead bit is counter productive. The async readahead it would trigger
would almost certainly be a no-op. Additionally, for mmap'd file IO, the
readahead bit limits the fault around [1], causing an extra minor fault
when the page is accessed.
This was discovered when looking at /sys/kernel/tracing/events/readahead
traces for a simple program. With the patch applied, fewer
page_cache_ra_unbounded calls are observed.
[1] do_fault_around calls filemap_map_pages, which finds eligible pages
by calling next_uptodate_folio [2]. next_uptodate_folio skips pages
with PG_readahead set [3].
Sang-Heon Jeon [Sun, 3 May 2026 08:42:25 +0000 (17:42 +0900)]
mm/hugetlb_cma: restrict hugetlb_cma parameter to gigantic-page alignment
Existing hugetlb_cma parameter handling logic rejects sizes smaller than
one gigantic page, but rounds up larger sizes that are not a multiple of
it. The two behaviors are inconsistent and neither is documented.
To remove existing inconsistent and undefined behavior, restrict
hugetlb_cma parameter to only accept multiples of the gigantic page size.
After this restriction, the redundant round_up() in the allocation loop
can be removed.
The new restriction is also documented in kernel-parameters.txt.
Also, including other minor changes for readability improvement with no
functional change.
Link: https://lore.kernel.org/20260503084225.415980-1-ekffu200098@gmail.com Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Suggested-by: Muchun Song <muchun.song@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Update write() checks to properly detect and handle partial writes.
Previously, the write() calls used <= 0 to detect failure. This condition
is never true for partial writes (ret > 0 but ret < len), so partial
writes were silently treated as success.
Fix this by verifying that write() returns the full expected length and
treating any mismatch as failure.
Link: https://lore.kernel.org/20260504081638.683223-1-agarwal.vineet2006@gmail.com Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
As pointed out by Dan Carpenter, test_kmemcache() was using a bitwise AND
on two bools instead of a boolean AND. Fix this for the sake of code
cleanliness.
Link: https://lore.kernel.org/20260504100637.1535762-1-glider@google.com Fixes: 5015a300a522 ("lib: introduce test_meminit module") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Dan Carpenter <error27@gmail.com> Closes: https://lore.kernel.org/kernel-janitors/afOcIan1ap9kD26M@stanley.mountain/ Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
`read_pages` ensures that `ractl->_nr_pages` is zero before it returns, so
the `ractl->_nr_pages` term in these expressions contributes nothing.
This seems to have been true since the statements were introduced in
commit f615bd5c4725f ("mm/readahead: Handle ractl nr_pages being
modified").
The new expression has an intuitive explanation. When filesystems perform
readahead, they increment `ractl->_index` by the number of pages
processed, so, after `read_pages` returns, `ractl->_index` points to the
first page after those already processed. `index` points to the first
page considered in the loop. So, `ractl->_index - index` is the number of
pages processed by the loop so far.
Steven Rostedt [Tue, 12 May 2026 21:56:23 +0000 (17:56 -0400)]
maple_tree: document that "last" in mtree_insert_range() is inclusive
The kernel doc of mtree_insert_range() does not state if the address
represented by the "last" parameter is inclusive or exclusive. This can
lead to bugs by code that assumes it is exclusive. Explicitly state that
the parameter is inclusive.
Link: https://lore.kernel.org/20260512175623.4c5ca8d2@gandalf.local.home Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: "Liam R. Howlett" <liam@infradead.org> Acked-by: SeongJae Park <sj@kernel.org> Cc: Alice Ryhl <aliceryhl@google.com> Cc: Andrew Ballance <andrewjballance@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Carlier [Sun, 10 May 2026 18:37:00 +0000 (19:37 +0100)]
mm/shrinker: avoid out-of-bounds read in set_shrinker_bit()
set_shrinker_bit() reads info->unit[shrinker_id_to_index(shrinker_id)]
before checking shrinker_id against info->map_nr_max, so an id past the
currently visible map_nr_max reads past the unit[] array before the
WARN_ON_ONCE() catches it.
Determined from code inspection.
Move the load into the bounded branch.
Link: https://lore.kernel.org/20260510183700.102475-1-devnexen@gmail.com Fixes: 307bececcd12 ("mm: shrinker: add a secondary array for shrinker_info::{map, nr_deferred}") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Qi Zheng <qi.zheng@linux.dev> Acked-by: Muchun Song <muchun.song@linux.dev> Cc: Dave Chinner <david@fromorbit.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Mon, 11 May 2026 02:54:07 +0000 (10:54 +0800)]
mm/khugepaged: fix inconsistent MMF_VM_HUGEPAGE flag due to allocation failure order
__khugepaged_enter() sets MMF_VM_HUGEPAGE before allocating the
corresponding mm_slot. If mm_slot_alloc() fails, the function returns
with the flag set but without inserting the mm into the khugepaged
tracking structures, leaving the mm in an inconsistent state where future
registration attempts are skipped.
Fix this by reordering: allocate the mm_slot first, then check and set the
flag. If the flag is already set, free the allocated slot and return.
This ensures the flag is only set when the mm is successfully registered
in the khugepaged tracking structures.
Link: https://lore.kernel.org/20260511025408.54035-1-ye.liu@linux.dev Fixes: 16618670276a ("mm: khugepaged: avoid pointless allocation for "struct mm_slot"") Signed-off-by: Ye Liu <liuye@kylinos.cn> Suggested-by: David Hildenbrand <david@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Xin Hao <xhao@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zenghongling [Mon, 11 May 2026 07:03:09 +0000 (15:03 +0800)]
mm/percpu-internal.h: optimise pcpu_chunk struct to save memory
Using pahole, we can see that there are some padding holes in the current
pcpu_chunk structure,Adjusting the layout of pcpu_chunk can reduce these
holes,decreasing its size from 192 bytes to 128 bytes and eliminating a
wasted cache line.
Liew Rui Yan [Fri, 1 May 2026 01:37:50 +0000 (09:37 +0800)]
mm/damon/reclaim: validate min_region_size to be power of 2
Problem
=======
When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_RECLAIM,
'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx()
correctly detects this and returns -EINVAL, it sets the
'maybe_corrupted' flag during this process.
This flag causes the running kdamond to terminate. While the termination
is a safety measure, it is suboptimal in this case because the error is
just a simple invalid input from the user, which shouldn't neccessitate
stopping the kdamond.
Reproduction
============
1. Enable DAMON_RECLAIM
2. Set addr_unit=3
3. Commit inputs via 'commit_inputs'
4. Observe kdamond termination
Solution
========
Add an early validation in damon_reclaim_apply_parameters() to check
'min_region_sz' before any state change occurs. If it is non-power-of-2,
return -EINVAL immediately, preventing 'maybe_corrupted' from being set.
Liew Rui Yan [Fri, 1 May 2026 01:37:49 +0000 (09:37 +0800)]
mm/damon/lru_sort: validate min_region_size to be power of 2
Patch series "mm/damon: validate min_region_size to be power of 2", v5.
Problem
=======
When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_LRU_SORT or
DAMON_RECLAIM, 'min_region_sz' becomes a non-power-of-2 value. While
damon_commit_ctx() correctly detects this and returns -EINVAL, it sets
the 'maybe_corrupted' flag during this process.
This flag causes the running kdamond to terminate. While the termination
is a safety measure, it is suboptimal in this case because the error is
just a simple invalid input from the user, which shouldn't neccessitate
stopping the kdamond.
Solution
========
Add an early validation in damon_lru_sort_apply_parameters() and
damon_reclaim_apply_parameters() to check 'min_region_sz' before any
state change occurs. If it is non-power-of-2, return -EINVAL immediately,
preventing 'maybe_corrupted' from being set.
Patch 1 fixes the issue for DAMON_LRU_SORT.
Patch 2 fixes the issue for DAMON_RECLAIM.
This patch (of 2):
Problem
=======
When a user sets an invalid 'addr_unit' (e.g., 3) via DAMON_LRU_SORT,
'min_region_sz' becomes a non-power-of-2 value. While damon_commit_ctx()
correctly detects this and returns -EINVAL, it sets the
'maybe_corrupted' flag during this process.
This flag causes the running kdamond to terminate. While the termination
is a safety measure, it is suboptimal in this case because the error is
just a simple invalid input from the user, which shouldn't neccessitate
stopping the kdamond.
Reproduction
============
1. Enable DAMON_LRU_SORT
2. Set addr_unit=3
3. Commit inputs via 'commit_inputs'
4. Observe kdamond termination
Solution
========
Add an early validation in damon_lru_sort_apply_parameters() to check
'min_region_sz' before any state change occurs. If it is non-power-of-2,
return -EINVAL immediately, preventing 'maybe_corrupted' from being set.
Vineet Agarwal [Tue, 12 May 2026 04:11:57 +0000 (09:41 +0530)]
mm/damon/sysfs-schemes: fix double increment of nr_regions
damos_sysfs_populate_region_dir() increments sysfs_regions->nr_regions
twice when adding a new region: once explicitly before
kobject_init_and_add(), and once again through the post-increment used for
the kobject name.
As a result, nr_regions no longer matches the actual number of live
regions, and region directory names skip numbers (1, 3, 5, ...).
Use the already incremented value for naming instead of incrementing
nr_regions a second time.
Link: https://lore.kernel.org/20260512041157.109845-1-agarwal.vineet2006@gmail.com Fixes: 66178e4ec30a ("mm/damon/sysfs: use damos_walk() for update_schemes_tried_{bytes,regions}") Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 12 May 2026 07:26:35 +0000 (15:26 +0800)]
drivers/base/memory: make memory block get/put explicit
Rename the memory block lookup helper to make the acquired reference
explicit, add memory_block_put() to wrap put_device(), remove
find_memory_block(), and use memory_block_get() as the single block-id
based lookup interface.
This makes it clearer to callers that a successful lookup holds a
reference that must be dropped, reducing the chance of forgetting the
matching put and leaking the memory block device reference.
Link: https://lore.kernel.org/linux-mm/7887915D-E598-42B3-9AFE-BFFBACE8DE2D@linux.dev/#t Link: https://lore.kernel.org/20260512072635.3969576-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Tested-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Sumanth Korikkar <sumanthk@linux.ibm.com> #s390 Cc: Richard Cheng <icheng@nvidia.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Doug Anderson <dianders@chromium.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Kees Cook <kees@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
register_page_bootmem_info_node() essentially only calls
register_page_bootmem_memmap(). However, on powerpc that function is a
nop. So there is not benefit in using CONFIG_HAVE_BOOTMEM_INFO_NODE
anymore, let's just drop it.
We can stop including bootmem_info.h.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-8-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/bootmem_info: stop marking mem_section_usage as MIX_SECTION_INFO
We never free the ms->usage data for boot memory sections (see
section_deactivate()). And to identify whether ms->usage was allocated
from memblock, we simply identify it by looking at PG_reserved.
Consequently, there is no need to mark ms->usage as MIX_SECTION_INFO.
Let's just stop doing that.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-6-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/bootmem_info: stop marking the pgdat as NODE_INFO
We removed the last user of NODE_INFO in commit 119c31caa59e ("mm/sparse:
remove !CONFIG_SPARSEMEM_VMEMMAP leftovers for CONFIG_MEMORY_HOTPLUG").
But it really was never used it besides for safety-checks ever since it was
introduced in commit 04753278769f ("memory hotplug: register section/node
id to free"), where we had the comment:
5) The node information like pgdat has similar issues. But, this
will be able to be solved too by this.
(Not implemented yet, but, remembering node id in the pages.)
Of course, that never happened, and we are not planning on freeing the
node data (pgdat/pglist_data), during memory hotunplug.
So let's just stop marking the pgdat as NODE_INFO.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-5-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/bootmem_info: remove call to kmemleak_free_part_phys()
The call to kmemleak_free_part_phys() was added in 2022 in
commit dd0ff4d12dd2 ("bootmem: remove the vmemmap pages from kmemleak in
put_page_bootmem").
In 2025, commit b2aad24b5333 ("mm/memmap: prevent double scanning of memmap
by kmemleak") started to use MEMBLOCK_ALLOC_NOLEAKTRACE when allocating
the memmap to skip the kmemleak_alloc_phys() in the buddy.
So remove the call to kmemleak_free_part_phys(). If this would still
be required for other purposes, either free_reserved_page() should take
care of it, or selected users.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-4-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nobody checks PG_private for these pages, and we can happily use
set_page_private() without setting PG_private. So let's just stop
setting/clearing PG_private.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-3-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In the past, we used to store the type in page->lru.next, introduced by
commit 5f24ce5fd34c ("thp: remove PG_buddy"). The location changed over
the years; ever since commit 0386aaa6e9c8 ("bootmem: stop using
page->index"), we store it alongside the info in page->private.
Consequently, there is no need to reset page->lru anymore.
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-2-3fb0be6fc688@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: remove CONFIG_HAVE_BOOTMEM_INFO_NODE (Part 1)".
We want to remove CONFIG_HAVE_BOOTMEM_INFO_NODE. As a first step, let's
limit the remaining harm to x86 and core code, removing sparc, ppc and
s390 leftovers, starting the stepwise removal by removing and simplifying
some code.
Once a related x86 vmemmap fix [1] is in, we can merge part 2 that will
remove CONFIG_HAVE_BOOTMEM_INFO_NODE entirely.
Tested on x86-64 with hugetlb vmemmap optimization in combination with
KMEMLEAK, making sure that the problem reported in dd0ff4d12dd2 ("bootmem:
remove the vmemmap pages from kmemleak in put_page_bootmem") does not
reappear -- hoping I managed to trigger the original problem.
This patch (of 8):
sparc does not select CONFIG_HAVE_BOOTMEM_INFO_NODE, therefore,
register_page_bootmem_info_node() is a nop.
Let's just get rid of register_page_bootmem_info().
Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-0-3fb0be6fc688@kernel.org Link: https://lore.kernel.org/20260511-bootmem_info_prep-v1-1-3fb0be6fc688@kernel.org Link: https://lore.kernel.org/r/20260429-vmemmap-v2-1-8dfcacffd877@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hongfu Li [Tue, 12 May 2026 10:13:05 +0000 (18:13 +0800)]
selftests/mm: fix mmap() return value check in run_migration_benchmark
mmap() returns MAP_FAILED on error, not NULL. The current check uses
!buffer->ptr, which evaluates to false when mmap() fails (since MAP_FAILED
is (void *)-1, not 0), so the error path is never taken.
Link: https://lore.kernel.org/20260512101305.139509-1-lihongfu@kylinos.cn Signed-off-by: Hongfu Li <lihongfu@kylinos.cn> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Mon, 11 May 2026 08:43:07 +0000 (16:43 +0800)]
mm/memory_hotplug: factor out altmap freeing checks
Use a small helper to centralize altmap freeing after verifying that all
vmemmap pages were released. This keeps the check consistent between the
normal teardown path and the memory hotplug error paths.
Link: https://lore.kernel.org/20260511084307.1827127-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Suggested-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Donet Tom <donettom@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Hao Ge [Sat, 9 May 2026 00:56:31 +0000 (08:56 +0800)]
proc/meminfo: expose per-node balloon pages in node meminfo
Commit 835de37603ef ("meminfo: add a per node counter for balloon
drivers") added NR_BALLOON_PAGES and exposed it in /proc/meminfo.
However, the per-node view at /sys/devices/system/node/nodeX/meminfo was
not updated, even though the counter is already tracked per-node.
Add it to node_read_meminfo() so users can see balloon usage per NUMA node
without having to parse the raw vmstat file.
Link: https://lore.kernel.org/20260509005631.17183-1-hao.ge@linux.dev Signed-off-by: Hao Ge <hao.ge@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiayuan Chen [Sat, 9 May 2026 01:18:15 +0000 (18:18 -0700)]
mm/damon: replace damon_rand() with a per-ctx lockless PRNG
damon_rand() on the sampling_addr hot path called get_random_u32_below(),
which takes a local_lock_irqsave() around a per-CPU batched entropy pool
and periodically refills it with ChaCha20. At elevated nr_regions counts
(20k+), the lock_acquire / local_lock pair plus __get_random_u32_below()
dominate kdamond perf profiles.
Replace the helper with a lockless lfsr113 generator (struct rnd_state)
held per damon_ctx and seeded from get_random_u64() in damon_new_ctx().
kdamond is the single consumer of a given ctx, so no synchronization is
required. Range mapping uses traditional reciprocal multiplication,
similar as get_random_u32_below(); for spans larger than U32_MAX (only
reachable on 64-bit) the slow path combines two u32 outputs and uses
mul_u64_u64_shr() at 64-bit width. On 32-bit the slow path is dead code
and gets eliminated by the compiler.
The new helper takes a ctx parameter; damon_split_regions_of() and the
kunit tests that call it directly are updated accordingly.
lfsr113 is a linear PRNG and MUST NOT be used for anything
security-sensitive. DAMON's sampling_addr is not exposed to userspace and
is only consumed as a probe point for PTE accessed-bit sampling, so a
non-cryptographic PRNG is appropriate here.
Tested with paddr monitoring and max_nr_regions=20000: kdamond CPU usage
reduced from ~72% to ~50% of one core.
Han Gao [Tue, 31 Mar 2026 17:12:48 +0000 (01:12 +0800)]
riscv: dts: sophgo: Add dma-coherent to SG2042 PCIe controllers
SG2042's PCIe root complexes are cache-coherent with the CPU. Mark all
four PCIe controller nodes (pcie_rc0 through pcie_rc3) as dma-coherent
so the kernel uses coherent DMA mappings instead of non-coherent bounce
buffering.
Andrew Morton [Tue, 2 Jun 2026 22:06:49 +0000 (15:06 -0700)]
Merge branch 'mm-hotfixes-stable' into mm-stable to pick up the series
"userfaultfd: verify VMA state across UFFDIO_COPY retry", which is a
prerequisite for mm-unnstable's series "userfaultfd: merge
fs/userfaultfd.c into mm/userfaultfd.c".
dsa_port_from_netdev() may return a valid port from a different switch
chip. Programming another chip's port index into the local hardware
causes redirection to the wrong port, or an out-of-bounds access if the
index exceeds the local chip's port count.
Apply a minimal fix that adds a check to catch this case and adjusts the
extack message. When cls->common.skip_sw is not set, the operation could
instead redirect to the upstream port and let the software or upstream
switch(es) handle the forward, but that is not addressed here.
Tejun Heo [Mon, 1 Jun 2026 19:22:37 +0000 (09:22 -1000)]
sched_ext: Don't warn on NULL cgrp_moving_from in scx_cgroup_move_task()
A WARN fires when systemd's user manager writes "+cpu +memory +pids" to
its own subtree_control while a sched_ext scheduler is loaded:
WARNING: at kernel/sched/ext.c:3227 scx_cgroup_move_task+0xa8/0xb0
scx_cgroup_move_task+0xa8/0xb0
sched_move_task+0x134/0x290
cpu_cgroup_attach+0x39/0x70
cgroup_migrate_execute+0x37d/0x450
cgroup_update_dfl_csses+0x1e3/0x270
cgroup_subtree_control_write+0x3e7/0x440
scx_cgroup_can_attach() arms cgrp_moving_from only when a task's cpu
cgroup changes. It can still be NULL when scx_cgroup_move_task() runs,
through this sequence:
Step Result
--------------------------------- ----------------------------------
1. cpu enabled on cgroup G cpu css = A
2. cpu toggled off then on for G A killed, B created (same cgroup)
3. an exiting task keeps A alive migration skips it, A now stale
4. +memory migrates G stale A vs current B pulls cpu in
5. cpu attach runs for all tasks hits a live, cpu-unchanged task
6. scx_cgroup_move_task() on it cgrp_moving_from NULL -> WARN
The mismatch is that scx_cgroup_can_attach() keys on cgroup identity
while migration drives the move on css identity, so a NULL cgrp_moving_from
here is a legitimate css-only migration, not a missing prep.
The call is already gated on cgrp_moving_from, so just drop the warning.
ops.cgroup_prep_move() and ops.cgroup_move() stay paired.
Rosen Penev [Sun, 31 May 2026 00:00:42 +0000 (17:00 -0700)]
net: fec_mpc52xx: add missing kernel-doc for @may_sleep
Add the missing @may_sleep parameter description to the
mpc52xx_fec_stop kernel-doc comment.
Signed-off-by: Rosen Penev <rosenp@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20260531000042.369043-1-rosenp@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhao Zhang [Sat, 30 May 2026 15:57:14 +0000 (23:57 +0800)]
sctp: diag: reject stale associations in dump_one path
The SCTP exact sock_diag lookup can hold a transport reference, block on
lock_sock(sk), and then resume after sctp_association_free() has marked
the association dead and freed its bind address list.
When that happens, inet_assoc_attr_size() and
inet_diag_msg_sctpasoc_fill() can still dereference association state
that is no longer valid for reporting. In particular,
inet_diag_msg_sctpasoc_fill() may read an empty bind-address list as a
real sctp_sockaddr_entry and trigger an out-of-bounds read from
unrelated association memory.
Reject the association after taking the socket lock if it has been
reaped or detached from the endpoint, and report the lookup as stale.
This keeps the exact dump-one path from formatting torn association
state.
Fixes: 8f840e47f190 ("sctp: add the sctp_diag.c file") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Zhengchuan Liang <zcliangcn@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Zhao Zhang <zzhan461@ucr.edu> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/fac6043fa20a2ff68e12958c431836f692c51268.1780113823.git.zzhan461@ucr.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Minxi Hou [Sat, 30 May 2026 02:14:43 +0000 (10:14 +0800)]
selftests: openvswitch: add dec_ttl action support and test
Add dec_ttl action support to the OVS kernel datapath selftest
framework:
- Add dec_ttl nested NLA class to ovs-dpctl.py with proper
OVS_DEC_TTL_ATTR_ACTION sub-attribute handling
- Add parse support for dec_ttl(le_1(<inner_actions>)) action
string, consistent with the odp-util.c format where le_1()
holds the actions taken when TTL reaches 1
- Add dpstr output formatting for dec_ttl actions
- Add test_dec_ttl() to openvswitch.sh that verifies:
* Normal TTL packets are forwarded after decrement
* TTL=1 packets are dropped (TTL expiry)
* Graceful skip via ksft_skip if kernel lacks dec_ttl support
The dec_ttl class uses late-binding type resolution to reference
ovsactions for its inner action list, avoiding circular references
at class definition time.
Tapio Reijonen [Fri, 29 May 2026 06:18:57 +0000 (06:18 +0000)]
net: fec: fix pinctrl default state restore order on resume
In fec_resume(), fec_enet_clk_enable() is called before
pinctrl_pm_select_default_state() in the non-WoL path, inverting the
ordering used in fec_suspend() which correctly switches to the sleep
pinctrl state before disabling clocks.
For PHYs with the PHY_RST_AFTER_CLK_EN flag (e.g. TI DP83848 or
SMSC LAN87xx), fec_enet_clk_enable() triggers a hardware reset pulse
via the phy-reset GPIO. With the GPIO pin still in sleep pinctrl state
at that point, the GPIO write has no physical effect and the PHY never
receives the required reset after clock enable, leading to unreliable
link establishment after system resume.
Fix by restoring the default pinctrl state before enabling clocks,
making resume the proper mirror of suspend. The call is made
unconditionally: fec_suspend() only switches to the sleep pinctrl state
on the non-WoL path and leaves the pins in the default state when WoL
is enabled, so on a WoL resume the device is already in the default
state and pinctrl_pm_select_default_state() is a no-op.
Mark Brown [Fri, 24 Apr 2026 12:45:21 +0000 (13:45 +0100)]
selftests/rseq: Add config fragment
Currently there is no config fragment for the rseq selftests but there are
a couple of configuration options which are required for running them:
- CONFIG_RSEQ is required for obvious reasons, it is enabled by default
but it doesn't hurt to specify it in case the user is usinsg a
defconfig that disables it.
- CONFIG_RSEQ_SLICE_EXTENSION is tested by the slice_test test, the
test will fail without it.
Add a configuration fragment which enables these options, helping encourage
CI systems and people doing manual testing to run the tests with all the
features. This also requires CONFIG_EXPERT since it is a dependency for
slice extension.
====================
net/mlx5: Avoid payload in skb's linear part for better GRO-processing
This is V7 of a series originally submitted by Christoph.
When LRO is enabled on the MLX, mlx5e_skb_from_cqe_mpwrq_nonlinear
copies parts of the payload to the linear part of the skb.
This triggers suboptimal processing in GRO, causing slow throughput.
This patch series addresses this by using eth_get_headlen to compute the
size of the protocol headers and only copy those bits. This results in a
significant throughput improvement (detailed results in the specific
patch).
====================
net/mlx5e: Avoid copying payload to the skb's linear part
mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
bytes from the page-pool to the skb's linear part. Those 256 bytes
include part of the payload.
When attempting to do GRO in skb_gro_receive, if headlen > data_offset
(and skb->head_frag is not set), we end up aggregating packets in the
frag_list.
This is of course not good when we are CPU-limited. Also causes a worse
skb->len/truesize ratio,...
So, let's avoid copying parts of the payload to the linear part. We use
eth_get_headlen() to parse the headers and compute the length of the
protocol headers, which will be used to copy the relevant bits of the
skb's linear part.
We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
stack needs to call pskb_may_pull() later on, we don't need to reallocate
memory.
This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
LRO enabled):
Additional tests across a larger range of parameters w/ and w/o LRO, w/
and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
better performance with this patch.
For XDP pull at most ETH_HLEN bytes in the linear area so that XDP_PASS
can also benefit from this improvement and keep things simple when
dealing with skb geometry changes from the XDP program.
Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Christoph Paasch <cpaasch@openai.com> Signed-off-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Link: https://patch.msgid.link/20260601061522.398044-3-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Neil Armstrong [Tue, 2 Jun 2026 08:39:20 +0000 (10:39 +0200)]
media: qcom: iris: vdec: update find_format to handle 8bit and 10bit formats
The 10bit pixel format can be only used when the decoder identifies the
stream as decoding into 10bit pixel format buffers, so update the
find_format helper to filter the formats and only allow the proper
formats when setting or trying a capture format.
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com> Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org> Tested-by: Wangao Wang <wangao.wang@oss.qualcomm.com> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Signed-off-by: Bryan O'Donoghue <bod@kernel.org>
The P010 (YUV format with 16-bits per pixel with interleaved UV)
and QC10C (P010 compressed mode similar to QC08C) requires specific
buffer calculations to allocate the right buffer size for the DPB
(decoded picture buffer) frames and frames consumed by userspace.
Similar to 8bit, the 10bit DPB frames uses QC10C format.
Reviewed-by: Bryan O'Donoghue <bryan.odonoghue@linaro.org> Tested-by: Wangao Wang <wangao.wang@oss.qualcomm.com> Signed-off-by: Neil Armstrong <neil.armstrong@linaro.org> Signed-off-by: Bryan O'Donoghue <bod@kernel.org>
David Thompson [Fri, 29 May 2026 21:03:00 +0000 (21:03 +0000)]
net: lan743x: permit VLAN-tagged packets up to configured MTU
VLAN-tagged interfaces on lan743x devices were previously unreachable via
SSH and failed to respond to large ping packets (e.g. "ping -s 1469" given
MTU=1500). In these scenarios, "ethtool -S" reports non-zero "RX Oversize
Frame Errors". According to Microchip AN2948, the MAC_RX FSE (VLAN field
size enforcement) bit determines whether frames with VLAN tags exceeding
the base MTU plus tag length are discarded.
The driver must set the MAC_RX.FSE bit before setting MAC_RX.RXEN to allow
VLAN-tagged frames up to the interface MTU, preventing them from being
treated as oversized. As a result, both the base and VLAN-tagged interfaces
can use the same MTU without receive errors.
Fixes: 23f0703c125b ("lan743x: Add main source files for new lan743x driver") Signed-off-by: David Thompson <davthompson@nvidia.com> Reviewed-by: Thangaraj Samynathan <Thangaraj.s@microchip.com> Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de> Tested-by: Nicolai Buchwitz <nb@tipi-net.de> # lan7430 on arm64 (RevPi Link: https://patch.msgid.link/20260529210300.433135-1-davthompson@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Thorsten Blum [Mon, 4 May 2026 18:19:46 +0000 (20:19 +0200)]
x86/platform/uv: Use str_enabled_disabled() in uv_nmi_setup_hubless_intr()
Replace hard-coded strings with the str_enabled_disabled() helper. This
unifies the output and helps the linker with deduplication, which can result
in a smaller binary. Additionally, address the following Coccinelle/coccicheck
warning reported by string_choices.cocci:
opportunity for str_enabled_disabled(uv_pch_intr_now_enabled)
Lyude Paul [Thu, 7 May 2026 21:59:21 +0000 (17:59 -0400)]
rust/drm/gem: Use DeviceContext with GEM objects
Now that we have the ability to represent the context in which a DRM device
is in at compile-time, we can start carrying around this context with GEM
object types in order to allow a driver to safely create GEM objects before
a DRM device has registered with userspace.
arm64: dts: rockchip: Fix vcc_sdio regulator max voltage on Pinebook Pro
The vcc_sdio regulator supports 1.8V to 3.4V output range according to
its datasheet.
The current DT incorrectly limits the max voltage to 3.0V. This limit
causes issues issues downstream with u-boot, which refuses to apply the
out-of range value, and falls back to the minimum in that range: 1.8V.
This is insufficient to power the SD card, so driver initialisation
fails and booting from it does not work.
Set regulator-max-microvolt to 3400000 µV to match hardware capability.
This matches the rk3399-orangepi for the same regulator.
Jonas Karlman [Fri, 29 May 2026 19:03:51 +0000 (21:03 +0200)]
arm64: dts: rockchip: Add USB nodes for RK3528
Rockchip RK3528 has one USB 3.0 DWC3 controller and oneUSB 2.0 EHCI/OHCI
controller and uses an Innosilicon-USB2PHY for USB 2.0. The DWC3
controller additionally uses the Naneng Combo PHY for USB3.
Add device tree nodes to describe these USB controllers along with the
USB 2.0 PHYs.
[moved snps,dis_u2_susphy_quirk here from individual boards,
describe both usb2+3 default phy connections, usb2 boards can override]
Yuqi Xu [Fri, 29 May 2026 13:01:44 +0000 (21:01 +0800)]
net: rds: clear i_sends on setup unwind
The RDS IB connection teardown path is written so it can run during
partial startup and on repeated shutdown attempts. It uses NULL
pointers to distinguish resources that are still owned from resources
that have already been released.
When rds_ib_setup_qp() fails after allocating i_sends but before
allocating i_recvs, the sends_out path frees i_sends without clearing
the pointer. A later shutdown pass can still treat that stale pointer
as a live send ring allocation.
Clear i_sends after vfree() in the error unwind path so the existing
shutdown logic continues to use the correct ownership state.
Ji'an Zhou [Tue, 2 Jun 2026 09:12:04 +0000 (09:12 +0000)]
futex/requeue: Prevent NULL pointer dereference in remove_waiter() on self-deadlock
When FUTEX_CMP_REQUEUE_PI requeues a non-top waiter that already owns the
target PI futex, task_blocks_on_rt_mutex() returns -EDEADLK before setting
waiter->task.
The subsequent remove_waiter() in rt_mutex_start_proxy_lock() dereferences
the NULL waiter->task, causing a kernel crash.
Add a self-deadlock check for non-top waiters before calling
rt_mutex_start_proxy_lock(), analogous to the top-waiter check in
futex_lock_pi_atomic().
Fixes: 3bfdc63936dd4773109b7b8c280c0f3b5ae7d349 ("rtmutex: Use waiter::task instead of current in remove_waiter()") Signed-off-by: Ji'an Zhou <eilaimemedsnaimel@gmail.com> Signed-off-by: Thomas Gleixner <tglx@kernel.org> Cc: stable@vger.kernel.org
====================
net: airoha: Preliminary patches to support multiple net_devices connected to the same GDM port
EN7581 or AN7583 SoCs support connecting multiple external SerDes (e.g.
Ethernet or USB SerDes) to GDM3 or GDM4 ports via a hw arbiter that
manages the traffic in a TDM manner. As a result multiple net_devices can
connect to the same GDM{3,4} port and there is a theoretical "1:n"
relation between GDM ports and net_devices.
This is a preliminary series to introduce support for multiple net_devices
connected to the same Frame Engine (FE) GDM port (GDM3 or GDM4) via an
external hw arbiter.
====================
Lorenzo Bianconi [Wed, 27 May 2026 10:21:20 +0000 (12:21 +0200)]
net: airoha: Rename airoha_set_gdm2_loopback in airoha_enable_gdm2_loopback
This is a preliminary patch in order to allow the user to select if the
configured device will be used as hw lan or wan.
Please not this patch does not introduce any logical changes, just
cosmetic ones.
Lorenzo Bianconi [Wed, 27 May 2026 10:21:19 +0000 (12:21 +0200)]
net: airoha: Move {cpu,fwd}_tx_packets in airoha_gdm_dev struct
Since now multiple net_devices connected to different QDMA blocks can
share the same GDM port, cpu_tx_packets and fwd_tx_packets fields can
be overwritten with the value from a different QDMA block. In order to
fix the issue move cpu_tx_packets and fwd_tx_packets fields from
airoha_gdm_port struct to airoha_gdm_dev one.
Lorenzo Bianconi [Wed, 27 May 2026 10:21:18 +0000 (12:21 +0200)]
net: airoha: Move qos_sq_bmap in airoha_gdm_dev struct
Since now multiple net_devices connected to different QDMA blocks can
share the same GDM port, qos_sq_bmap field can be overwritten with the
configuration obtained from a net_device connected to a different QDMA
block. In order to fix the issue move qos_sq_bmap field from
airoha_gdm_port struct to airoha_gdm_dev one.
Add qos_channel_map bitmap in airoha_qdma struct to track if a shared
QDMA channel is already in use by another net_device.
Lorenzo Bianconi [Wed, 27 May 2026 10:21:17 +0000 (12:21 +0200)]
net: airoha: Rely on airoha_gdm_dev pointer in airoha_is_lan_gdm_port()
Rename airoha_is_lan_gdm_port in airoha_is_lan_gdm_dev. Moreover, rely
on airoha_gdm_dev pointer in airoha_is_lan_gdm_dev() instead of
airoha_gdm_port one.
This is a preliminary patch to support multiple net_devices connected to
the same GDM{3,4} port via an external hw arbiter.
Lorenzo Bianconi [Wed, 27 May 2026 10:21:16 +0000 (12:21 +0200)]
net: airoha: Move airoha_qdma pointer in airoha_gdm_dev struct
Move airoha_qdma pointer from airoha_gdm_port struct to airoha_gdm_dev
one since the QDMA block used depends on the particular net_device
WAN/LAN configuration and in the current codebase net_device pointer is
associated to airoha_gdm_dev struct.
This is a preliminary patch to support multiple net_devices connected
to the same GDM{3,4} port via an external hw arbiter.
Lorenzo Bianconi [Wed, 27 May 2026 10:21:15 +0000 (12:21 +0200)]
net: airoha: Introduce airoha_gdm_dev struct
EN7581 and AN7583 SoCs support connecting multiple external SerDes to GDM3
or GDM4 ports via a hw arbiter that manages the traffic in a TDM manner.
As a result multiple net_devices can connect to the same GDM{3,4} port
and there is a theoretical "1:n" relation between GDM port and
net_devices.
Introduce airoha_gdm_dev struct to collect net_device related info (e.g.
net_device and external phy pointer). Please note this is just a
preliminary patch and we are still supporting a single net_device for
each GDM port. Subsequent patches will add support for multiple net_devices
connected to the same GDM port.
====================
netdevsim: psp: fix issues with stats collection
It has come to my attention via a sashiko review of my net-next series
for aes-gcm in netdevsim [1] that there were preexisting issues with
netdevsim's implementation of psp statistics.
API usage issues:
1. not calling u64_stats_init() on the u64_stats_sync object during
init
2. not serializing usage of the writer side API during stats update
Logical Bugs:
1. We were incrementing rx stats on the sending devices stats
counters.
Fix the first set of issues by removing the u64_stats_t api entirely,
and keep track of stats with atomics. Fix the second issue by charging
events to the right netdevsim object.
TAP version 13
1..28
ok 1 psp.data_basic_send_v0_ip4
ok 2 psp.data_basic_send_v0_ip6
ok 3 psp.data_basic_send_v1_ip4
ok 4 psp.data_basic_send_v1_ip6
ok 5 psp.data_basic_send_v2_ip4
ok 6 psp.data_basic_send_v2_ip6
ok 7 psp.data_basic_send_v3_ip4
ok 8 psp.data_basic_send_v3_ip6
ok 9 psp.data_mss_adjust_ip4
ok 10 psp.data_mss_adjust_ip6
ok 11 psp.dev_list_devices
ok 12 psp.dev_get_device
ok 13 psp.dev_get_device_bad
ok 14 psp.dev_rotate
ok 15 psp.dev_rotate_spi
ok 16 psp.assoc_basic
ok 17 psp.assoc_bad_dev
ok 18 psp.assoc_sk_only_conn
ok 19 psp.assoc_sk_only_mismatch
ok 20 psp.assoc_sk_only_mismatch_tx
ok 21 psp.assoc_sk_only_unconn
ok 22 psp.assoc_version_mismatch
ok 23 psp.assoc_twice
ok 24 psp.data_send_bad_key
ok 25 psp.data_send_disconnect
ok 26 psp.data_stale_key
ok 27 psp.removal_device_rx
ok 28 psp.removal_device_bi
# Totals: pass:28 fail:0 xfail:0 xpass:0 skip:0 error:0
Dump stats on both devs tx on one should match rx on other:
local dev:
id=5 ifindex=2 stats={'dev-id': 5, 'key-rotations': 0,
'stale-events': 0, 'rx-packets': 1226, 'rx-bytes': 39244,
'rx-auth-fail': 0, 'rx-error': 0, 'rx-bad': 0, 'tx-packets': 1931,
'tx-bytes': 2478908, 'tx-error': 0}
Daniel Zahka [Fri, 29 May 2026 13:15:28 +0000 (06:15 -0700)]
netdevsim: psp: use atomic64 for psp stats counters
The existing u64_stats_t-based psp counters had two preexisting api
usage bugs: u64_stats_init() was never called on the syncp object, and
the writer side of the u64_stats_update_begin()/end() api was not
serialized. Switch the counters to atomic64_t instead. Atomics need
no initialization and are inherently safe against concurrent writers,
eliminating both bugs at once.
Use atomic64_t rather than atomic_long_t so byte counters don't wrap
at 4 GiB on 32-bit builds.
Fixes: 178f0763c5f3 ("netdevsim: implement psp device stats") Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260529-fix-psp-stats-v2-2-3a194eacf18e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Zahka [Fri, 29 May 2026 13:15:27 +0000 (06:15 -0700)]
netdevsim: psp: update rx stats on the peer netdevsim
nsim_do_psp() handles both tx and rx psp processing in the sending
device's nsim_start_xmit() path. The existing code has a logical bug,
where we erroneously increment rx_bytes and rx_packets on the sending
devices stats, instead of the peer device.
Additionally, compute psp_len after psp_dev_encapsulate() and before
psp_dev_rcv(), which modifies the header region of the skb. The
existing calculation was actually correct, because psp_dev_rcv()
leaves skb_inner_transport_header pointing at the tcp header, but this
is fragile and confusing as there is no actual inner transport header
after psp_dev_rcv has removed udp encapsulation.
Fixes: 178f0763c5f3 ("netdevsim: implement psp device stats") Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20260529-fix-psp-stats-v2-1-3a194eacf18e@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zijing Yin [Fri, 29 May 2026 13:57:17 +0000 (06:57 -0700)]
netdevsim: fib: fix use-after-free of FIB data via debugfs
Writing to the netdevsim debugfs file
"netdevsim/netdevsimN/fib/nexthop_bucket_activity" enters
nsim_nexthop_bucket_activity_write(), which looks up a nexthop in
data->nexthop_ht under rtnl_lock(). If a network namespace teardown,
devlink reload or device deletion runs concurrently, nsim_fib_destroy()
frees that rhashtable (and the surrounding nsim_fib_data) while the
write is still in flight, leading to a slab-use-after-free:
BUG: KASAN: slab-use-after-free in nsim_nexthop_bucket_activity_write+0xb9e/0xdf0
Read of size 4 at addr ff1100001a379808 by task syz.0.11967/27894
The buggy address belongs to the object at ff1100001a379800
which belongs to the cache kmalloc-1k of size 1024
The freed 1k object is the bucket table of data->nexthop_ht. Shortly
after, the dangling table is dereferenced again and the machine also
takes a GPF in __rht_bucket_nested() from the same call site.
The root cause is a lifetime mismatch: the debugfs files reference
nsim_fib_data (the writer dereferences data->nexthop_ht), but the
interface is not bracketed around the lifetime of that data.
nsim_fib_destroy() freed both rhashtables and only removed the debugfs
directory afterwards, and nsim_fib_create() created the debugfs files
before the rhashtables were initialized and, on the error path, freed
them before removing the files. debugfs keeps the file itself alive
across a ->write() via debugfs_file_get()/debugfs_file_put()
(fs/debugfs/file.c), but it does not keep data->nexthop_ht alive, so the
in-flight writer dereferenced freed memory. rtnl_lock() in the writer
does not help, because the teardown path does not take rtnl around
rhashtable_free_and_destroy().
Fix it by bracketing the debugfs interface around the data it exposes,
keeping nsim_fib_create() and nsim_fib_destroy() symmetric:
- In nsim_fib_destroy(), tear down the debugfs files before the data
structures they reference. debugfs_remove_recursive() drops the
initial active-user reference and then waits for every in-flight
->write() to drop its reference before returning, and rejects new
opens (__debugfs_file_removed(), fs/debugfs/inode.c). Once it returns,
no debugfs accessor can reach the FIB data, so the rhashtables and
nsim_fib_data can be destroyed safely. This also covers the bool knobs
in the same directory, which store pointers into the same
nsim_fib_data, and the final kfree(data).
- In nsim_fib_create(), create the debugfs files after the rhashtables
and notifiers are set up. This closes the same race on the
error-unwind path, where a concurrent writer could otherwise observe a
half-constructed instance or a table that the unwind has already
freed. (With only the destroy-side change, a writer racing the create
window instead dereferences an uninitialized data->nexthop_ht.)
This is reproducible by racing, in a loop, writes to
/sys/kernel/debug/netdevsim/netdevsimN/fib/nexthop_bucket_activity
against a teardown of the same netdevsim instance -- a devlink reload
("devlink dev reload netdevsim/netdevsimN"), destroying the network
namespace it lives in, or "echo N > /sys/bus/netdevsim/del_device". It
was found with syzkaller; a syzkaller reproducer is available. A
standalone C reproducer does not trigger it reliably because the race
needs the netns-teardown/reload path.
Cc: <stable+noautosel@kernel.org> # netdevsim is a test harness, it's never loaded on production systems Signed-off-by: Zijing Yin <yzjaurora@gmail.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20260529135718.1804031-1-yzjaurora@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>