mm/damon: support MADV_COLLAPSE via DAMOS_COLLAPSE scheme action
This patch set introces a new action: DAMOS_COLLAPSE.
For DAMOS_HUGEPAGE and DAMOS_NOHUGEPAGE to work, khugepaged should be
working, since it relies on hugepage_madvise to add a new slot. This slot
should be picked up by khugepaged and eventually collapse (or not, if we
are using DAMOS_NOHUGEPAGE) the pages. If THP is not enabled, khugepaged
will not be working, and therefore no collapse will happen.
DAMOS_COLLAPSE eventually calls madvise_collapse, which will collapse the
address range synchronously. In cases where there is a large VMA
(databases, for example), DAMOS_COLLAPSE allows us to collapse only the
hot region, and not the entire VMA.
This new action may be required to support autotuning with hugepage
as a goal[1].
=========
Benchmarks:
=========
MySQL
=====
Tests were performed in an ARM physical server with MariaDB 10.5 and
sysbench. Read only benchmark was perform with gaussian row hitting,
which follows a normal distribution.
T n, D h: THP set to never, DAMON action set to hugepage
T m, D h: THP set to madvise, DAMON action set to hugepage
T n, D c: THP set to never, DAMON action set to collapse
Memory consumption. Lower is better.
+------------------+----------+----------+----------+
| | T n, D h | T m, D h | T n, D c |
+------------------+----------+----------+----------+
| Total memory use | 2.13 | 2.20 | 2.20 |
| Huge pages | 0 | 1.3 | 1.27 |
+------------------+----------+----------+----------+
Performance in TPS (Transactions Per Second). Higher is better.
T n, D h: 18225.58
T m, D h 18252.93
T n, D c: 18270.21
Performance counter
I got the number of L1 D/I TLB accesses and the number a D/I TLB
accesses that triggered a page walk. I divided the second by the
first to get the percentage of page walkes per TLB access. The
lower the better.
I used masim with the "demo" configuration, but changing the times
to 100 seconds for the initial phase and 50 seconds for the rest of
the phases.
Memory consumption:
+------------------+----------+----------+----------+
| | T n, D h | T m, D h | T n, D c |
+------------------+----------+----------+----------+
| Total memory use | 2.38 GB | 2.36 GB | 2.37 GB |
| Huge pages | 0 | 190 MB | 188 MB |
+------------------+----------+----------+----------+
Performance:
THP never, DAMOS_HUGEPAGE
initial phase: 40,491 accesses/msec, 100001 msecs run
low phase 0: 39,658 accesses/msec, 50002 msecs run
high phase 0: 41,678 accesses/msec, 50000 msecs run
low phase 1: 39,625 accesses/msec, 50003 msecs run
high phase 1: 41,658 accesses/msec, 50002 msecs run
low phase 2: 39,642 accesses/msec, 50002 msecs run
high phase 2: 41,640 accesses/msec, 50001 msecs run
THP madvise, DAMOS_HUGEPAGE
initial phase: 51,977 accesses/msec, 100000 msecs run
low phase 0: 86,953 accesses/msec, 50000 msecs run
high phase 0: 94,812 accesses/msec, 50000 msecs run
low phase 1: 101,017 accesses/msec, 50000 msecs run
high phase 1: 94,841 accesses/msec, 50000 msecs run
low phase 2: 100,993 accesses/msec, 50000 msecs run
high phase 2: 94,791 accesses/msec, 50001 msecs run
THP never, DAMOS_COLLAPSE
initial phase: 93,678 accesses/msec, 100001 msecs run
low phase 0: 101,475 accesses/msec, 50000 msecs run
high phase 0: 98,589 accesses/msec, 50000 msecs run
low phase 1: 101,531 accesses/msec, 50001 msecs run
high phase 1: 98,506 accesses/msec, 50001 msecs run
low phase 2: 101,458 accesses/msec, 50001 msecs run
high phase 2: 98,555 accesses/msec, 50000 msecs run
- We can see that DAMOS "hugepage" action works only when THP is set
to madvise. "collapse" action works even when THP is set to never.
- Performance for "collapse" action is slightly lower than "hugepage"
action and THP madvise. This is due to the fact that collapases
occur synchronously. With "hugepage" they may occur during page
faults.
- Memory consumption is slighly lower for "collapse" than "hugepage"
with THP madvise. This is due to the khugepage collapses all VMAs,
while "collapse" action only collapses the VMAs in the hot region.
- There is an improvement in TLB utilization when collapse through
"hugepage" or "collapse" actions are triggered. The amount of
TLB misses is lower.
- "collapse" action is performance synchronously, which means that
page collapses happen earlier and more rapidly. This can be
useful or not, depending on the scenario.
- "hugepage" action may trigger a VMA split in some scenarios, since
it needs to change the flag of the VMA to THP enabled. This may
lead to additional overhead.
Collapse action just adds a new option to chose the correct system
balance.
Link: https://lore.kernel.org/20260426231619.107231-5-sj@kernel.org Link: https://lore.kernel.org/damon/20260313000816.79933-1-sj@kernel.org/ Signed-off-by: Asier Gutierrez <gutierrez.asier@huawei-partners.com> Signed-off-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Cheng-Han Wu <hank20010209@gmail.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Liew Rui Yan <aethernet65535@gmail.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon: add synchronous commit for commit_inputs
Problem
=======
Writing invalid parameters to sysfs followed by 'commit_inputs=Y' fails
silently (no error returned to shell), because the validation happens
asynchronously in the kdamond.
Solution
========
To fix this, the commit_inputs_store() callback now uses damon_call() to
synchronously commit parameters in the kdamond thread's safe context.
This ensures that validation errors are returned immediately to
userspace, following the pattern used by DAMON_SYSFS.
Changes
=======
1. Added commit_inputs_store() and commit_inputs_fn() to commit
synchronously.
2. Removed handle_commit_inputs().
This change is motivated from another discussion [1].
Link: https://lore.kernel.org/20260426231619.107231-4-sj@kernel.org Link: https://lore.kernel.org/20260318153731.97470-1-aethernet65535@gmail.com Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Asier Gutierrez <gutierrez.asier@huawei-partners.com> Cc: Cheng-Han Wu <hank20010209@gmail.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon/ops-common: optimize damon_hot_score() using ilog2()
Patch series "mm/damon: repost non-hotfix reviewed patches in damon/next
tree", v2.
The first patch from Liew Rui Yan add a minor performance optimization
using ilog2() instead of inefficient manual implementation of the
functionality.
The second patch from Cheng-Han Wu fixes a minor typo:
s/parametrs/parameters/.
The third patch from Liew Rui Yan make commit_inputs operation of
DAMON_RECLAIM and DAMON_LRU_SORT synchronous to improve the user
experience.
The fourth patch from Asier Gutierrez adds a new DAMOS action,
DAMOS_COLLAPSE for deterministic DAMOS-based access-aware THP system.
This patch (of 4):
The current implementation of damon_hot_score() uses a manual for-loop to
calculate the value of 'age_in_log'. This can be efficiently replaced by
ilog2(), which is semantically more appropriate for calculating the
logarithmic value of age.
In a simulated-kernel-module performance test with 10,000,000 iterations,
this optimization showed a significant reduction in latency (average
latency reduced from ~12ns to ~1ns).
Muchun Song [Tue, 28 Apr 2026 08:18:55 +0000 (16:18 +0800)]
mm/mm_init: fix uninitialized struct pages for ZONE_DEVICE
If DAX memory is hotplugged into an unoccupied subsection of an early
section, section_activate() reuses the unoptimized boot memmap. However,
compound_nr_pages() still assumes that vmemmap optimization is in effect
and initializes only the reduced number of struct pages. As a result, the
remaining tail struct pages are left uninitialized, which can later lead
to unexpected behavior or crashes.
Fix this by treating early sections as unoptimized when calculating how
many struct pages to initialize.
Link: https://lore.kernel.org/20260428081855.1249045-7-songmuchun@bytedance.com Fixes: 6fd3620b3428 ("mm/page_alloc: reuse tail struct pages for compound devmaps") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Apr 2026 08:18:54 +0000 (16:18 +0800)]
mm/mm_init: fix pageblock migratetype for ZONE_DEVICE compound pages
The memmap_init_zone_device() function only initializes the migratetype of
the first pageblock of a compound page. If the compound page size exceeds
pageblock_nr_pages (e.g., 1GB hugepages with 2MB pageblocks), subsequent
pageblocks in the compound page remain uninitialized.
Move the migratetype initialization out of __init_zone_device_page() and
into a separate pageblock_migratetype_init_range() function. This
iterates over the entire PFN range of the memory, ensuring that all
pageblocks are correctly initialized.
Also remove the stale confusing comment about MEMINIT_HOTPLUG above the
migratetype setting since it is an obsolete relic from commit 966cf44f637e
("mm: defer ZONE_DEVICE page initialization to the point where we init
pgmap") and no longer makes sense here.
Link: https://lore.kernel.org/20260428081855.1249045-6-songmuchun@bytedance.com Fixes: c4386bd8ee3a ("mm/memremap: add ZONE_DEVICE support for compound pages") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Apr 2026 08:18:53 +0000 (16:18 +0800)]
mm/sparse-vmemmap: fix DAX vmemmap accounting with optimization
When vmemmap optimization is enabled for DAX, the nr_memmap_pages counter
in /proc/vmstat is incorrect. The current code always accounts for the
full, non-optimized vmemmap size, but vmemmap optimization reduces the
actual number of vmemmap pages by reusing tail pages. This causes the
system to overcount vmemmap usage, leading to inaccurate page statistics
in /proc/vmstat.
Fix this by introducing section_nr_vmemmap_pages(), which returns the
exact vmemmap page count for a given pfn range based on whether
optimization is in effect.
Link: https://lore.kernel.org/20260428081855.1249045-5-songmuchun@bytedance.com Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Apr 2026 08:18:52 +0000 (16:18 +0800)]
mm/sparse-vmemmap: pass @pgmap argument to memory deactivation paths
Currently, the memory hot-remove call chain -- arch_remove_memory(),
__remove_pages(), sparse_remove_section() and section_deactivate() -- does
not carry the struct dev_pagemap pointer. This prevents the lower levels
from knowing whether the section was originally populated with vmemmap
optimizations (e.g., DAX with vmemmap optimization enabled).
Without this information, we cannot call vmemmap_can_optimize() to
determine if the vmemmap pages were optimized. As a result, the vmemmap
page accounting during teardown will mistakenly assume a non-optimized
allocation, leading to incorrect memmap statistics.
To lay the groundwork for fixing the vmemmap page accounting, we need to
pass the @pgmap pointer down to the deactivation location. Plumb the
@pgmap argument through the APIs of arch_remove_memory(), __remove_pages()
and sparse_remove_section(), mirroring the corresponding *_activate()
paths.
Link: https://lore.kernel.org/20260428081855.1249045-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Tue, 28 Apr 2026 08:18:51 +0000 (16:18 +0800)]
mm/memory_hotplug: fix incorrect altmap passing in error path
In create_altmaps_and_memory_blocks(), when arch_add_memory() succeeds
with memmap_on_memory enabled, the vmemmap pages are allocated from
params.altmap. If create_memory_block_devices() subsequently fails, the
error path calls arch_remove_memory() with a NULL altmap instead of
params.altmap.
This is a bug that could lead to memory corruption. Since altmap is NULL,
vmemmap_free() falls back to freeing the vmemmap pages into the system
buddy allocator via free_pages() instead of the altmap.
arch_remove_memory() then immediately destroys the physical linear mapping
for this memory. This injects unowned pages into the buddy allocator,
causing machine checks or memory corruption if the system later attempts
to allocate and use those freed pages.
Fix this by passing params.altmap to arch_remove_memory() in the error
path.
Link: https://lore.kernel.org/20260428081855.1249045-3-songmuchun@bytedance.com Fixes: 6b8f0798b85a ("mm/memory_hotplug: split memmap_on_memory requests across memblocks") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Reviewed-by: Georgi Djakov <georgi.djakov@oss.qualcomm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: Fix vmemmap optimization accounting and initialization",
v8.
The series fixes several bugs in vmemmap optimization, mainly around
incorrect page accounting and memmap initialization in DAX and memory
hotplug paths. It also fixes pageblock migratetype initialization and
struct page initialization for ZONE_DEVICE compound pages.
Patches 1-4 fix vmemmap accounting issues. Patch 1 fixes an accounting
underflow in the section activation failure path by moving vmemmap page
accounting into the lower-level allocation and freeing helpers. Patch 2
fixes incorrect altmap passing in the memory hotplug error path. Patch 3
passes pgmap through memory deactivation paths so the teardown side can
determine whether vmemmap optimization was in effect. Patch 4 uses that
information to account the optimized DAX vmemmap size correctly.
Patches 5-6 fix initialization issues in mm/mm_init. One makes sure all
pageblocks in ZONE_DEVICE compound pages get their migratetype
initialized. The other fixes a case where DAX memory hotplug reuses an
unoptimized early-section memmap while compound_nr_pages() still assumes
vmemmap optimization, leaving tail struct pages uninitialized.
This patch (of 6):
In section_activate(), if populate_section_memmap() fails, the error
handling path calls section_deactivate() to roll back the state. This
causes a vmemmap accounting imbalance.
Since commit c3576889d87b ("mm: fix accounting of memmap pages"), memmap
pages are accounted for only after populate_section_memmap() succeeds.
However, the failure path unconditionally calls section_deactivate(),
which decreases the vmemmap count. Consequently, a failure in
populate_section_memmap() leads to an accounting underflow, incorrectly
reducing the system's tracked vmemmap usage.
Fix this more thoroughly by moving all accounting calls into the lower
level functions that actually perform the vmemmap allocation and freeing:
- populate_section_memmap() accounts for newly allocated vmemmap pages -
depopulate_section_memmap() unaccounts when vmemmap is freed
This ensures proper accounting in all code paths, including error handling
and early section cases.
Link: https://lore.kernel.org/20260428081855.1249045-1-songmuchun@bytedance.com Link: https://lore.kernel.org/20260428081855.1249045-2-songmuchun@bytedance.com Fixes: c3576889d87b ("mm: fix accounting of memmap pages") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Liam R. Howlett <liam@infradead.org> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> Cc: Joao Martins <joao.m.martins@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dev Jain [Wed, 15 Apr 2026 04:45:09 +0000 (10:15 +0530)]
selftests/mm: simplify byte pattern checking in mremap_test
The original version of mremap_test (7df666253f26: "kselftests: vm: add
mremap tests") validated remapped contents byte-by-byte and printed a
mismatch index in case the bytes streams didn't match. That was rather
inefficient, especially also if the test passed.
Later, commit 7033c6cc9620 ("selftests/mm: mremap_test: optimize execution
time from minutes to seconds using chunkwise memcmp") used memcmp() on
bigger chunks, to fallback to byte-wise scanning to detect the problematic
index only if it discovered a problem.
However, the implementation is overly complicated (e.g., get_sqrt() is
currently not optimal) and we don't really have to report the exact index:
whoever debugs the failing test can figure that out.
Let's simplify by just comparing both byte streams with memcmp() and not
detecting the exact failed index.
Link: https://lore.kernel.org/20260415044509.579428-1-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Reported-by: Sarthak Sharma <sarthak.sharma@arm.com> Tested-by: Sarthak Sharma <sarthak.sharma@arm.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: David Laight <david.laight.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 23 Feb 2026 20:15:16 +0000 (12:15 -0800)]
dax/kmem: account for partial discontiguous resource upon removal
When dev_dax_kmem_probe() partially succeeds (at least one range is
mapped) but a subsequent range fails request_mem_region() or
add_memory_driver_managed(), the probe silently continues, ultimately
returning success, but with the corresponding range resource NULL'ed out.
dev_dax_kmem_remove() iterates over all dax_device ranges regardless of if
the underlying resource exists. When remove_memory() is called later, it
returns 0 because the memory was never added which causes
dev_dax_kmem_remove() to incorrectly assume the (nonexistent) resource can
be removed and attempts cleanup on a NULL pointer.
Fix this by skipping these ranges altogether, noting that these cases are
considered success, such that the cleanup is still reached when all
actually-added ranges are successfully removed.
Link: https://lore.kernel.org/20260223201516.1517657-1-dave@stgolabs.net Fixes: 60e93dc097f7 ("device-dax: add dis-contiguous resource support") Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com> Reviewed-by: Alison Schofield <alison.schofield@intel.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In alloc_hugetlb_folio(), a single h_cg pointer is used for both the rsvd
and non-rsvd hugetlb cgroup charges. When map_chg is set,
hugetlb_cgroup_charge_cgroup_rsvd() stores the charged cgroup in h_cg, but
the immediately following hugetlb_cgroup_charge_cgroup() overwrites h_cg
with the non-rsvd cgroup pointer.
As a result, hugetlb_cgroup_commit_charge_rsvd() stores the wrong
(non-rsvd) cgroup pointer into the folio's rsvd slot.
When the folio is later freed, free_huge_folio() unconditionally calls
both hugetlb_cgroup_uncharge_folio() and
hugetlb_cgroup_uncharge_folio_rsvd(). The rsvd uncharge reads back the
wrong cgroup from the folio and decrements a counter that was never
charged for that cgroup, causing a page_counter underflow:
page_counter underflow: -512 nr_pages=512
WARNING: mm/page_counter.c:61 at page_counter_cancel
Fix this by introducing a separate h_cg_rsvd pointer exclusively for the
rsvd charge path, keeping the rsvd and non-rsvd charges fully independent
through their charge, commit, and error uncharge paths.
Link: https://lore.kernel.org/20260328065534.346053-1-kartikey406@gmail.com Fixes: 08cf9faf7558 ("hugetlb_cgroup: support noreserve mappings") Reported-by: syzbot+226c1f947186f8fef796@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=226c1f947186f8fef796 Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mina Almasry <almasrymina@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/lruvec: preemptively free dead folios during lru_add drain
Of all observable lruvec lock contention in our fleet, we find that ~24%
occurs when dead folios are present in lru_add batches at drain time.
This is wasteful in the sense that the folio is added to the LRU just to
be immediately removed via folios_put_refs(), incurring two unnecessary
lock acquisitions.
Eliminate this overhead by preemptively cleaning up dead folios before
they make it into the LRU. Use folio_ref_freeze() to filter folios whose
only remaining refcount is the batch ref. When dead folios are found,
move them off the add batch and onto a temporary batch to be freed.
PG_active may be set on a batched folio as well as PG_unevictable (via
migration path). Since filtered folios bypass the normal lru_add()
cleanup, both flags must be cleared before freeing.
During A/B testing on one of our prod instagram workloads (high-frequency
short-lived requests), the patch intercepted almost all dead folios before
they entered the LRU. Data collected using the mm_lru_insertion
tracepoint shows the effectiveness of the patch:
Per-host LRU add averages at 95% CPU load
(60 hosts each side, 3 x 60s intervals)
dead folios/min total folios/min dead %
unpatched: 1,297,785 19,341,986 6.7097%
patched: 14 19,039,996 0.0001%
Within this workload, we save ~2.6M lock acquisitions per minute per host
as a result.
System-wide memory stats improved on the patched side also at 95% CPU load:
- direct reclaim scanning reduced 7%
- allocation stalls reduced 5.2%
- compaction stalls reduced 12.3%
- page frees reduced 4.9%
No regressions were observed in requests served per second or request tail
latency (p99). Both metrics showed directional improvement at higher CPU
utilization (comparing 85% to 95%).
Note that tests were performed using classic LRU.
Link: https://lore.kernel.org/20260425053417.351146-1-jp.kobryn@linux.dev Signed-off-by: JP Kobryn (Meta) <jp.kobryn@linux.dev> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Chris Li <chrisl@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Rik van Riel <riel@surriel.com> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Rientjes [Tue, 31 Mar 2026 01:20:57 +0000 (18:20 -0700)]
mm, page_alloc: reintroduce page allocation stall warning
Previously, we had warnings when a single page allocation took longer than
reasonably expected. This was introduced in commit 63f53dea0c98 ("mm:
warn about allocations which stall for too long").
The warning was subsequently reverted in commit 400e22499dd9 ("mm: don't
warn about allocations which stall for too long") because it was possible
to generate memory pressure that would effectively stall further progress
through printk execution.
Page allocation stalls in excess of 10 seconds are always useful to debug
because they can result in severe userspace unresponsiveness. Adding this
artifact can be used to correlate with userspace going out to lunch and to
understand the state of memory at the time.
There should be a reasonable expectation that this warning will never
trigger given it is very passive, it will only be emitted when a page
allocation takes longer than 10 seconds. If it does trigger, this reveals
an issue that should be fixed: a single page allocation should never loop
for more than 10 seconds without oom killing to make memory available.
Unlike the original implementation, this implementation only reports
stalls once for the system every 10 seconds. Otherwise, many concurrent
reclaimers could spam the kernel log unnecessarily. Stalls are only
reported when calling into direct reclaim.
Link: https://lore.kernel.org/371c86c8-1d47-bd70-b74c-769842718b1f@google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Julian Braha [Tue, 31 Mar 2026 07:07:30 +0000 (08:07 +0100)]
mm/thp: dead code cleanup in Kconfig
There is already an 'if TRANSPARENT_HUGEPAGE' condition wrapping several
config options e.g. 'READ_ONLY_THP_FOR_FS', making the 'depends on'
statement for each of these a duplicate dependency (dead code).
I propose leaving the outer 'if TRANSPARENT_HUGEPAGE...endif' and removing
the individual 'depends on TRANSPARENT_HUGEPAGE' statement from each
option.
This dead code was found by kconfirm, a static analysis tool for Kconfig.
Link: https://lore.kernel.org/20260331070730.33915-1-julianbraha@gmail.com Signed-off-by: Julian Braha <julianbraha@gmail.com> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Carlier [Thu, 2 Apr 2026 07:26:50 +0000 (08:26 +0100)]
mm/page_io: rename swap_iocb fields for clarity
swap_iocb->pages tracks the number of bvec entries (folios), not base
pages. Rename the array from bvec to bvecs and the counter from pages to
nr_bvecs to accurately reflect their purpose.
Link: https://lore.kernel.org/20260402072650.48811-1-devnexen@gmail.com Signed-off-by: David Carlier <devnexen@gmail.com> Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: NeilBrown <neil@brown.name> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmpressure: skip socket pressure for costly order reclaim
When reclaim is triggered by high order allocations on a fragmented
system, vmpressure() can report poor reclaim efficiency even though the
system has plenty of free memory. This is because many pages are scanned,
but few are found to actually reclaim - the pages are actively in use and
don't need to be freed. The resulting scan:reclaim ratio causes
vmpressure() to assert socket pressure, throttling TCP throughput
unnecessarily.
Costly order allocations (above PAGE_ALLOC_COSTLY_ORDER) rely heavily on
compaction to succeed, so poor reclaim efficiency at these orders does not
necessarily indicate memory pressure. The kernel already treats this
order as the boundary where reclaim is no longer expected to succeed and
compaction may take over.
Make vmpressure() order-aware through an additional parameter sourced from
scan_control at existing call sites. Socket pressure is now only asserted
when order <= PAGE_ALLOC_COSTLY_ORDER.
Memcg reclaim is unaffected since try_to_free_mem_cgroup_pages() always
uses order 0, which passes the filter unconditionally. Similarly,
vmpressure_prio() now passes order 0 internally when calling vmpressure(),
ensuring critical pressure from low reclaim priority is not suppressed by
the order filter.
The patch was motivated by a case of impacted net throughput in
production. On one affected host, the memory state at the time showed
~15GB available, zero cgroup pressure, and the following buddyinfo state:
mm: huge_memory: refactor defrag_show() to use defrag_flags[]
Replace the hardcoded if/else chain of test_bit() calls and string
literals in defrag_show() with a loop over defrag_flags[] and
defrag_mode_strings[] arrays introduced in the previous commit.
This makes defrag_show() consistent with defrag_store() and eliminates the
duplicated mode name strings.
Link: https://lore.kernel.org/20260408-thp_defrag-v2-2-bc544c1bde4e@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: huge_memory: use sysfs_match_string() in defrag_store()
Patch series "mm: huge_memory: clean up defrag sysfs with shared", v2.
Refactor defrag_store() and defrag_show() to use shared data tables
instead of duplicated if/else chains.
Patch 1 introduces an enum defrag_mode, a defrag_mode_strings[] table, and
a defrag_flags[] mapping array, then rewrites defrag_store() to use
sysfs_match_string() with a loop over defrag_flags[].
Patch 2 refactors defrag_show() to use the same arrays, replacing its
hardcoded if/else chain of test_bit() calls and string literals.
This follows the same pattern applied to anon_enabled_store() in commit 522dfb4ba71f ("mm: huge_memory: refactor anon_enabled_store() with
change_anon_orders()").
This patch (of 2):
Replace the if/else chain of sysfs_streq() calls in defrag_store() with
sysfs_match_string() and a defrag_mode_strings[] table.
Introduce enum defrag_mode and defrag_flags[] array mapping each mode to
its corresponding transparent_hugepage_flag. The store function now loops
over defrag_flags[], setting the bit for the selected mode and clearing
the others. When mode is DEFRAG_NEVER (index 4), no index in the
4-element defrag_flags[] matches, so all flags are cleared.
Note that the enum ordering (always, defer, defer+madvise, madvise, never)
differs from the original if/else chain order in defrag_store() (always,
defer+madvise, defer, madvise, never). This is intentional to match the
display order used by defrag_show().
This is a follow-up cleanup to commit 522dfb4ba71f ("mm: huge_memory:
refactor anon_enabled_store() with change_anon_orders()") which applied
the same sysfs_match_string() pattern to anon_enabled_store().
Link: https://lore.kernel.org/20260408-thp_defrag-v2-0-bc544c1bde4e@debian.org Link: https://lore.kernel.org/20260408-thp_defrag-v2-1-bc544c1bde4e@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Tested-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Tested-by: Zi Yan <ziy@nvidia.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Liam Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Thu, 9 Apr 2026 01:43:22 +0000 (09:43 +0800)]
mm/khugepaged: use ALIGN helpers for PMD alignment
PMD alignment in khugepaged is currently implemented using a mix of
rounding helpers and open-coded bitmask operations.
Use ALIGN() and ALIGN_DOWN() consistently for PMD-sized address range
alignment, matching the preferred style for address and size handling.
No functional change intended.
Link: https://lore.kernel.org/20260409014323.2385982-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam@infradead.org> Cc: Liu Ye <liuye@kylinos.cn> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Fri, 10 Apr 2026 07:47:39 +0000 (15:47 +0800)]
mm/memory-failure: use bool for forcekill state
'forcekill' is used as a boolean flag to control whether processes should
be forcibly killed. It is only assigned from boolean expressions and
never used in arithmetic or bitmask operations.
Convert it from int to bool.
No functional change intended.
Link: https://lore.kernel.org/20260410074740.2524718-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Reviewed-by: SeongJae Park <sj@kernel.org> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Liu Ye <liuye@kylinos.cn> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 9bdac9142407 ("sparsemem: Put mem map for one node together.")
introduced a mechanism to pre-allocate a large memory block to hold all
memmaps for a NUMA node upfront.
However, the original commit message did not clearly state the actual
benefits or the necessity of explicitly pre-allocating a single chunk for
all memmap areas of a given node.
One of the concerns about removing this pre-allocation is that the
subsequent per-section memmap allocations could become scattered around,
and might turn too many memory blocks/sections into an "un-offlinable"
state. However, tests show that even without the explicit node-wide
pre-allocation, memblock still allocates memory closely and back-to-back.
When tracing vmemmap_set_pmd allocations, the physical chunks allocated by
memblock are strictly adjacent to each other in a single contiguous
physical range (mapped top-down). Because they are packed tightly
together naturally, they will at most consume or pollute the exact same
number of memory blocks as the explicit pre-allocation did.
Another concern is the boot performance impact of calling memmap_alloc()
multiple times compared to one large node-wide allocation. Tests on a
256GB VM showed that memmap allocation time increased from 199,555 ns to
741,292 ns. Even though it is 3.7x slower, on a 1TB machine, the entire
memory allocation time would only take a few milliseconds. This boot
performance difference is completely negligible.
Since no negative impact on memory offlining behavior or noticeable boot
performance regression was found, this patch proposes removing the
explicit node-wide memmap pre-allocation mechanism to reduce the
maintenance burden.
Link: https://lore.kernel.org/20260410092419.2446420-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhen Ni [Tue, 14 Apr 2026 07:58:13 +0000 (15:58 +0800)]
mm/page_owner: fix %pGp format specifier argument type
The %pGp format specifier expects an argument of type 'unsigned long *',
but page->flags is now of type 'memdesc_flags_t' (a struct containing an
unsigned long member 'f') after the introduction of memdesc_flags_t.
Fix the type mismatch by passing &page->flags.f instead of &page->flags,
which matches the expected type.
Link: https://lore.kernel.org/20260414075813.3425968-1-zhen.ni@easystack.cn Fixes: 53fbef56e07d ("mm: introduce memdesc_flags_t") Signed-off-by: Zhen Ni <zhen.ni@easystack.cn> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Anthony Yznaga [Thu, 16 Apr 2026 03:39:39 +0000 (20:39 -0700)]
selftests/mm: run the MAP_DROPPABLE selftest
The test was not being run by the selftest framework so it was never
noticed that it would fail with an assertion failure on configs without
support for MAP_DROPPABLE. Update the test so that it is skipped instead
when MAP_DROPPABLE is not supported, and add it to the mmap category so
that the test is run by the framework.
Link: https://lore.kernel.org/20260416033939.49981-4-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Mark Brown <broonie@kernel.org> Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Anthony Yznaga [Thu, 16 Apr 2026 03:39:38 +0000 (20:39 -0700)]
selftests/mm: verify droppable mappings cannot be locked
For configs that support MAP_DROPPABLE verify that a mapping created with
MAP_DROPPABLE cannot be locked via mlock(), and that it will not be locked
if it's created after mlockall(MCL_FUTURE).
Link: https://lore.kernel.org/20260416033939.49981-3-anthony.yznaga@oracle.com Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Mark Brown <broonie@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka (SUSE) <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Anthony Yznaga [Thu, 16 Apr 2026 03:39:37 +0000 (20:39 -0700)]
mm: fix mmap errno value when MAP_DROPPABLE is not supported
Patch series "fix MAP_DROPPABLE not supported errno", v4.
Mark Brown reported seeing a regression in -next on 32 bit arm with the
mlock selftests. Before exiting and marking the tests failed, the
following message was logged after an attempt to create a MAP_DROPPABLE
mapping:
Bail out! mmap error: Unknown error 524
It turns out error 524 is ENOTSUPP which is an error that userspace is not
supposed to see, but it indicates in this instance that MAP_DROPPABLE is
not supported.
The first patch changes the errno returned to EOPNOTSUPP. The second
patch is a second version of a prior patch to introduce selftests to
verify locking behavior with droppable mappings with the additional change
to skip the tests when MAP_DROPPABLE is not supported. The third patch
fixes the MAP_DROPPABLE selftest so that it is run by the framework and
skips if MAP_DROPPABLE is not supported.
This patch (of 3):
On configs where MAP_DROPPABLE is not supported (currently any 32-bit
config except for PPC32), mmap fails with errno set to ENOTSUPP. However,
ENOTSUPP is not a standard error value that userspace knows about. The
acceptable userspace-visible errno to use is EOPNOTSUPP. checkpatch.pl
has a warning to this effect.
Link: https://lore.kernel.org/20260416033939.49981-1-anthony.yznaga@oracle.com Link: https://lore.kernel.org/20260416033939.49981-2-anthony.yznaga@oracle.com Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings") Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reported-by: Mark Brown <broonie@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Liam Howlett <liam@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
- Line 112: "zome_reclaim_mode" -> "zone_reclaim_mode"
- Line 6208: "prioities" -> "priorities"
- Line 7067: "that that high" -> "that the high" (duplicated word)
mm/sparse: remove unnecessary NULL check before allocating mem_section
Commit 850ed20539a4 ("mm: move array mem_section init code out of
memory_present()") moved mem_section allocation logic into
memblocks_present().
Before that move, memory_present() could be called multiple times, so
unlikely() matched the common case, where most calls found mem_section
already allocated.
After that move, memblocks_present() is called exactly once from
sparse_init(). Under CONFIG_SPARSEMEM_EXTREME, mem_section is always NULL
when it is called.
So remove unnecessary NULL check before allocating mem_section. No
functional change.
Link: https://lore.kernel.org/20260419144225.2875654-1-ekffu200098@gmail.com Signed-off-by: Sang-Heon Jeon <ekffu200098@gmail.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Reviewed by: Donet Tom <donettom@linux.ibm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 24 Apr 2026 04:00:59 +0000 (12:00 +0800)]
selftests/cgroup: test_zswap: wait for asynchronous writeback
zswap writeback is asynchronous, but test_zswap.c checks writeback
counters immediately after reclaim/trigger paths. On some platforms (e.g.
ppc64le), this can race with background writeback and cause spurious
failures even when behavior is correct.
Add wait_for_writeback() to poll get_cg_wb_count() with a bounded
timeout, and use it in:
test_zswap_writeback_one() when writeback is expected
test_no_invasive_cgroup_shrink() for the wb_group check
This keeps the original before/after assertion style while making the
tests robust against writeback completion latency.
No test behavior change, selftest stability improvement only.
Link: https://lore.kernel.org/20260424040059.12940-9-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 24 Apr 2026 04:00:58 +0000 (12:00 +0800)]
selftest/cgroup: fix zswap attempt_writeback() on 64K pagesize system
In attempt_writeback(), a memsize of 4M only covers 64 pages on 64K page
size systems. When memory.reclaim is called, the kernel prefers
reclaiming clean file pages (binary, libc, linker, etc.) over swapping
anonymous pages. With only 64 pages of anonymous memory, the reclaim
target can be largely or entirely satisfied by dropping file pages,
resulting in very few or zero anonymous pages being pushed into zswap.
This causes zswap_usage to be extremely small or zero, making
zswap_usage/4 insufficient to create meaningful writeback pressure. The
test then fails because no writeback is triggered.
On 4K page size systems this is not an issue because 4M covers 1024
pages, and file pages are a small fraction of the reclaim target.
Fix this by:
- Always allocating 1024 pages regardless of page size. This ensures
enough anonymous pages to reliably populate zswap and trigger
writeback, while keeping the original 4M allocation on 4K systems.
- Setting zswap.max to zswap_usage/4 instead of zswap_usage/2 to
create stronger writeback pressure, ensuring reclaim reliably
triggers writeback even on large page size systems.
Li Wang [Fri, 24 Apr 2026 04:00:57 +0000 (12:00 +0800)]
selftest/cgroup: fix zswap test_no_invasive_cgroup_shrink on large pagesize system
test_no_invasive_cgroup_shrink sets up two cgroups: wb_group, which is
expected to trigger zswap writeback, and a control group (renamed to
zw_group), which should only have pages sitting in zswap without any
writeback.
There are two problems with the current test:
1) The data patterns are reversed. wb_group uses allocate_bytes(), which
writes only a single byte per page — trivially compressible,
especially by zstd — so compressed pages fit within zswap.max and
writeback is never triggered. Meanwhile, the control group uses
getrandom() to produce hard-to-compress data, but it is the group
that does *not* need writeback.
2) The test uses fixed sizes (10K zswap.max, 10MB allocation) that are
too small on systems with large PAGE_SIZE (e.g. 64K), failing to
build enough memory pressure to trigger writeback reliably.
Fix both issues by:
- Swapping the data patterns: fill wb_group pages with partially
random data (getrandom for page_size/4 bytes) to resist compression
and trigger writeback, and fill zw_group pages with simple repeated
data to stay compressed in zswap.
- Making all size parameters PAGE_SIZE-aware: set allocation size to
PAGE_SIZE * 1024, memory.zswap.max to PAGE_SIZE, and memory.max to
allocation_size / 2 for both cgroups.
- Allocating memory inline instead of via cg_run() so the pages
remain resident throughout the test.
=== Error Log ===
# getconf PAGESIZE
65536
# ./test_zswap
TAP version 13
...
ok 5 test_zswap_writeback_disabled
ok 6 # SKIP test_no_kmem_bypass
not ok 7 test_no_invasive_cgroup_shrink
Link: https://lore.kernel.org/20260424040059.12940-7-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 24 Apr 2026 04:00:56 +0000 (12:00 +0800)]
selftests/cgroup: replace hardcoded page size values in test_zswap
test_zswap uses hardcoded values of 4095 and 4096 throughout as page
stride and page size, which are only correct on systems with a 4K page
size. On architectures with larger pages (e.g., 64K on arm64 or ppc64),
these constants cause memory to be touched at sub-page granularity,
leading to inefficient access patterns and incorrect page count
calculations, which can cause test failures.
Replace all hardcoded 4095 and 4096 values with a global pagesize variable
initialized from sysconf(_SC_PAGESIZE) at startup, and remove the
redundant local sysconf() calls scattered across individual functions. No
functional change on 4K page size systems.
Link: https://lore.kernel.org/20260424040059.12940-6-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Acked-by: Yosry Ahmed <yosry@kernel.org> Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 24 Apr 2026 04:00:55 +0000 (12:00 +0800)]
selftests/cgroup: rename PAGE_SIZE to BUF_SIZE in cgroup_util
The cgroup utility code defines a local PAGE_SIZE macro hardcoded to 4096,
which is used primarily as a generic buffer size for reading cgroup and
proc files. This naming is misleading because the value has nothing to do
with the actual page size of the system. On architectures with larger
pages (e.g., 64K on arm64 or ppc64), the name suggests a relationship that
does not exist. Additionally, the name can shadow or conflict with
PAGE_SIZE definitions from system headers, leading to confusion or subtle
bugs.
To resolve this, rename the macro to BUF_SIZE to accurately reflect its
purpose as a general I/O buffer size.
Furthermore, test_memcontrol currently relies on this hardcoded 4K value
to stride through memory and trigger page faults. Update this logic to
use the actual system page size dynamically. This micro-optimizes the
memory faulting process by ensuring it iterates correctly and efficiently
based on the underlying architecture's true page size. (This part from
Waiman)
Link: https://lore.kernel.org/20260424040059.12940-5-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Yosry Ahmed <yosry@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 24 Apr 2026 04:00:54 +0000 (12:00 +0800)]
selftests/cgroup: use runtime page size for zswpin check
test_zswapin compares memory.stat:zswpin (counted in pages) against a byte
threshold converted with PAGE_SIZE. In cgroup selftests, PAGE_SIZE is
hardcoded to 4096, which makes the conversion wrong on systems with non-4K
base pages (e.g. 64K).
As a result, the test requires too many pages to pass and fails spuriously
even when zswap is working.
Use sysconf(_SC_PAGESIZE) for the zswpin threshold conversion so the check
matches the actual system page size.
Link: https://lore.kernel.org/20260424040059.12940-4-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Reviewed-by: Yosry Ahmed <yosry@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Fri, 24 Apr 2026 04:00:53 +0000 (12:00 +0800)]
selftests/cgroup: avoid OOM in test_swapin_nozswap
test_swapin_nozswap can hit OOM before reaching its assertions on some
setups. The test currently sets memory.max=8M and then allocates/reads
32M with memory.zswap.max=0, which may over-constrain reclaim and kill the
workload process.
This keeps the test pressure model intact (allocate/read beyond memory.max
to force swap-in/out) while making it more robust across different
environments.
The test intent is unchanged: confirm that swapping occurs while zswap remains
unused when memory.zswap.max=0.
=== Error Logs ===
# ./test_zswap
TAP version 13
1..7
ok 1 test_zswap_usage
not ok 2 test_swapin_nozswap
...
Li Wang [Fri, 24 Apr 2026 04:00:52 +0000 (12:00 +0800)]
selftests/cgroup: skip test_zswap if zswap is globally disabled
Patch series "selftests/cgroup: improve zswap tests robustness and support
large page sizes", v7.
This patchset aims to fix various spurious failures and improve the
overall robustness of the cgroup zswap selftests.
The primary motivation is to make the tests compatible with architectures
that use non-4K page sizes (such as 64K on ppc64le and arm64). Currently,
the tests rely heavily on hardcoded 4K page sizes and fixed memory limits.
On 64K page size systems, these hardcoded values lead to sub-page
granularity accesses, incorrect page count calculations, and insufficient
memory pressure to trigger zswap writeback, ultimately causing the tests
to fail.
Additionally, this series addresses OOM kills occurring in
test_swapin_nozswap by dynamically scaling memory limits, and prevents
spurious test failures when zswap is built into the kernel but globally
disabled.
This patch (of 8):
test_zswap currently only checks whether zswap is present by testing
/sys/module/zswap. This misses the runtime global state exposed in
/sys/module/zswap/parameters/enabled.
When zswap is built/loaded but globally disabled, the zswap cgroup
selftests run in an invalid environment and may fail spuriously.
Check the runtime enabled state before running the tests:
- skip if zswap is not configured,
- fail if the enabled knob cannot be read,
- skip if zswap is globally disabled.
Also print a hint in the skip message on how to enable zswap.
Link: https://lore.kernel.org/20260424040059.12940-1-li.wang@linux.dev Link: https://lore.kernel.org/20260424040059.12940-2-li.wang@linux.dev Signed-off-by: Li Wang <li.wang@linux.dev> Acked-by: Yosry Ahmed <yosry@kernel.org> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Jiayuan Chen <jiayuan.chen@linux.dev> Cc: Waiman Long <longman@redhat.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 28 Apr 2026 01:33:59 +0000 (18:33 -0700)]
selftests/damon/drgn_dump_damon_status: support failed region quota charge ratio
Extend drgn_dump_damon_status.py to dump DAMON internal state for DAMOS
action failed regions quota charge ratio, to be able to show if the
internal state for the feature is working, with future DAMON selftests.
Link: https://lore.kernel.org/20260428013402.115171-11-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Implement the user-space ABI for the DAMOS action failed region
quota-charge ratio setup. For this, add two new sysfs files under the
DAMON sysfs interface for DAMOS quotas. Names of the files are
fail_charge_num and fail_charge_denom, and work for reading and setting
the numerator and denominator of the failed regions charge ratio.
Link: https://lore.kernel.org/20260428013402.115171-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 28 Apr 2026 01:33:52 +0000 (18:33 -0700)]
mm/damon/core: introduce failed region quota charge ratio
DAMOS quota is charged to all DAMOS action application attempted memory,
regardless of how much of the memory the action was successful and failed.
This makes understanding quota behavior without DAMOS stat but only with
end level metrics (e.g., increased amount of free memory for DAMOS_PAGEOUT
action) difficult. Also, charging action-failed memory same as
action-successful memory is somewhat unfair, as successful action
application will induce more overhead in most cases.
Introduce DAMON core API for setting the charge ratio for such
action-failed memory. It allows API callers to specify the ratio in a
flexible way, by setting the numerator and the denominator.
Link: https://lore.kernel.org/20260428013402.115171-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 28 Apr 2026 01:33:51 +0000 (18:33 -0700)]
mm/damon/core: merge regions after applying DAMOS schemes
damos_apply_scheme() could split the given region if applying the scheme's
action to the entire region can result in violating the quota-set upper
limit. Keeping regions that are created by such split operations is
unnecessary overhead.
The overhead would be negligible in the common case because such split
operations could happen only up to the number of installed schemes per
scheme apply interval. The following commit could make the impact larger,
though. The following commit will allow the action-failed region to be
charged in a different ratio. If both the ratio and the remaining quota
is quite small while the region to apply the scheme is quite large and the
action is nearly always failing, a high number of split operations could
happen.
Remove the unnecessary overhead by merging regions after applying schemes
is done for each region. The merge operation is made only if it will not
lose monitoring information and keep min_nr_regions constraint. In the
worst case, the max_nr_regions could still be violated until the next
per-aggregation interval merge operation is made.
Link: https://lore.kernel.org/20260428013402.115171-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 28 Apr 2026 01:33:50 +0000 (18:33 -0700)]
mm/damon/core: handle <min_region_sz remaining quota as empty
Patch series "mm/damon: introduce DAMOS failed region quota charge ratio".
Let users set different DAMOS quota charge ratios for DAMOS action failed
regions, for deterministic and consistent DAMOS action progress.
Common Reports: Unexpectedly Slow DAMOS
=======================================
One common issue report that we get from DAMON users is that DAMOS action
applying progress speed is sometimes much slower than expected. And one
common root cause is that the DAMOS quota is exceeded by the action
applying failed memory regions.
For example, a group of users tried to run DAMOS-based proactive memory
reclamation (DAMON_RECLAIM) with 100 MiB per second DAMOS quota. They ran
it on a system having no active workload which means all memory of the
system is cold. The expectation was that the system will show 100 MiB per
second reclamation until (nearly) all memory is reclaimed. But what they
found is that the speed is quite inconsistent and sometimes it becomes
very slower than the expectation, sometimes even no reclamation at all for
about tens of seconds. The upper limit of the speed (100 MiB per second)
was being kept as expected, though.
By monitoring the qt_exceeds (number of DAMOS quota exceed events) DAMOS
stat, we found DAMOS quota is always exceeded when the speed is slow. By
monitoring sz_tried and sz_applied (the total amount of DAMOS action tried
memory and succeeded memory) DAMOS stats together, we found the
reclamation attempts nearly always failed when the speed is slow.
DAMOS quota charges DAMOS action tried regions regardless of the
successfulness of the try. Hence in the example reported case, there was
unreclaimable memory spread around the system memory. Sometimes nearly
100 MiB of memory that DAMOS tried to reclaim in the given quota interval
was reclaimable, and therefore showed nearly 100 MiB per second speed.
Sometimes nearly 99 MiB of memory that DAMOS was trying to reclaim in the
given quota interval was unreclaimable, and therefore showing only about 1
MiB per second reclaim speed.
We explained it is an expected behavior of the feature rather than a bug,
as DAMOS quota is there for only the upper-limit of the speed. The users
agreed and later reported a huge win from the adoption of DAMON_RECLAIM on
their products.
It is Not a Bug but a Feature; But...
=====================================
So nothing is broken. DAMOS quota is working as intended, as the upper
limit of the speed. It also provides its behavior observability via DAMOS
stat. In the real world production environment that runs long term active
workloads and matters stability, the speed sometimes being slow is not a
real problem.
But, the non-deterministic behavior is sometimes annoying, especially in
lab environments. Even in a realistic production environment, when there
is a huge amount of DAMOS action unapplicable memory, the speed could be
problematically slow. Let's suppose a virtual machines provider that
setup 99% of the host memory as hugetlb pages that cannot be reclaimed, to
give it to virtual machines. Also, when aim-oriented DAMOS auto-tuning is
applied, this could also make the internal feedback loop confused.
The intention of the current behavior was that trying DAMOS action to
regions would anyway impose some overhead, and therefore somehow be
charged. But in the real world, the overhead for failed action is much
lighter than successful action. Charging those at the same ratio may be
unfair, or at least suboptimum in some environments.
DAMOS Action Failed Region Quota Charge Ratio
=============================================
Let users set the charge ratio for the action-failed memory, for more
optimal and deterministic use of DAMOS. It allows users to specify the
numerator and the denominator of the ratio for flexible setup. For
example, let's suppose the numerator and the denominator are set to 1 and
4,096, respectively. The ratio is 1 / 4,096. A DAMOS scheme action is
applied to 5 GiB memory. For 1 GiB of the memory, the action is
succeeded. For the rest (4 GiB), the action is failed. Then, only 1 GiB
and 1 MiB quota is charged.
The optimal charge ratio will depend on the use case and system/workload.
I'd recommend starting from setting the nominator as 1 and the denominator
as PAGE_SIZE and tune based on the results, because many DAMOS actions are
applied at page level.
Tests
=====
I tested this feature in the steps below.
1. Allocate 50% of system memory and mlock() it using a test program.
2. Fill up the page cache to exhaust nearly all free memory.
3. Start DAMON-based proactive reclamation with 100 MiB/second DAMOS
hard-quota. Auto-tune the DAMOS soft-quota under the hard-quota for
achieving 40% free memory of the system with 'temporal' tuner.
For step 1, I run a simple C program that is written by Gemini. It is
quite straightforward, so I'm not sharing the code here.
For step 3, I use the latest version of DAMON user-space tool (damo) like
below.
sudo damo start --damos_action pageout \
` # Do the pageout only up to 100 MiB per second ` \
--damos_quota_space 100M --damos_quota_interval 1s \
` # Auto-tune the quota below the hard quota aiming` \
` # 40% free memory of the node 0 ` \
` # (entire node of the test system)` \
--damos_quota_goal node_mem_free_bp 40% 0 \
` # use temporal tuner, which is easy to understnd ` \
--damos_quota_goal_tuner temporal
As expected, the progress of the reclamation is not consistent, because
the quota is exceeded for the failed reclamation of the unreclaimable
memory.
I do this again, but with the failed region charge ratio feature. For
this, the above 'damo' command is used, after appending command line
option for setup of the charge ratio like below. Note that the option was
added to 'damo' after v3.1.9.
sudo ./damo start --damos_action pageout \
[...]
` # quota-charge only 1/4096 for pageout-failed regions ` \
--damos_quota_fail_charge_ratio 1 4096
The progress of the reclamation was nearly 100 MiB per second until the
goal was achieved, meeting the expectation.
Patches Sequence
================
The first two patches make preparational changes. Patch 1 updates fully
charged quota check to handle <min_region_sz remaining quota, which will
be able to exist after this series is applied. Patch 2 merges regions
after applying schemes is done as long as it is ok to do, since regions
split operations for quota could happen much more frequently under a
corner case that this series will make available.
Patch 3 implements the feature and exposes it via DAMON core API. Patch 4
implements DAMON sysfs ABI for the feature. Three following patches (5-7)
document the feature and ABI on design, usage, and ABI documents,
respectively. Four patches for testing of the new feature follow. Patch
8 implements a kunit test for the feature. Patches 9 and 10 extend DAMON
selftest helpers for DAMON sysfs control and internal state dumping for
adding a new selftest for the feature. Patch 11 extends existing DAMON
sysfs interface selftest to test the new feature using the extended helper
scripts.
This patch (of 11):
Less than min_region_sz remaining quota effectively means the quota is
fully charged. In other words, no remaining quota. This is because DAMOS
actions are applied in the region granularity, and each region should have
min_region_sz or larger size. However the existing fully charged quota
check, which is also used for setting charge_target_from and
charge_addr_from of the quota, is not aware of the case. For the reason,
charge_target_from and charge_addr_from of the quota will not be updated
in the case. This can result in DAMOS action being applied more
frequently to a specific area of the memory.
The case is unreal because quota charging is also made in the region
granularity. It could be changed in future, though. Actually, the
following commit will make the change, by allowing users to set arbitrary
quota charging ratio for action-failed regions. To be prepared for the
change, update the fully charged quota checks to treat having less than
min_region_sz remaining quota as fully charged.
Link: https://lore.kernel.org/20260428013402.115171-1-sj@kernel.org Link: https://lore.kernel.org/20260428013402.115171-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Background and Motivation
=========================
In heterogeneous memory systems, controlling memory distribution across
NUMA nodes is essential for performance optimization. This patch enables
system-wide page distribution with target-state goals such as "maintain
60% of scheme-eligible memory on DRAM" using PA-mode DAMON schemes.
Rather than using absolute thresholds, this metric tracks the ratio of
memory that matches each scheme's access pattern filters on a target node,
enabling the quota system to automatically adjust migration aggressiveness
to maintain the desired distribution.
What This Metric Measures
=========================
Two-Scheme Setup for Hot Page Distribution
==========================================
For maintaining 60% of hot memory on DRAM (node 0) and 40% on CXL
(node 1):
PULL scheme: migrate_hot to node 0
goal: node_eligible_mem_bp, nid=0, target=6000
addr filter: node 1 address range (only migrate FROM CXL)
"Move hot pages to DRAM if less than 60% of hot data is in DRAM"
PUSH scheme: migrate_hot to node 1
goal: node_eligible_mem_bp, nid=1, target=4000
addr filter: node 0 address range (only migrate FROM DRAM)
"Move hot pages to CXL if less than 40% of hot data is in CXL"
Each scheme independently measures its own eligible memory and adjusts its
quota to achieve its target ratio. The schemes work in concert through
DAMON's unified monitoring context, with the quota autotuner balancing
their relative aggressiveness.
Implementation Details
======================
The implementation adds a new quota goal metric type
DAMOS_QUOTA_NODE_ELIGIBLE_MEM_BP to the existing DAMOS quota goal
framework. When this metric is configured for a scheme:
1. During each quota adjustment cycle, damos_get_node_eligible_mem_bp()
is called to calculate the current memory distribution.
2. The function iterates through all regions that match the scheme's
access pattern (via __damos_valid_target()) and calculates:
- Total eligible bytes across all nodes
- Eligible bytes specifically on the target node (goal->nid)
3. For each eligible region, damos_calc_eligible_bytes() walks through
the physical address range, using damon_get_folio() to look up
each folio and determine its NUMA node via folio_nid().
4. Large folios are handled by calculating the exact overlap between
the region boundaries and folio boundaries, ensuring accurate
byte counts even when regions partially span folios.
5. The ratio (node_eligible / total_eligible * 10000) is returned
as basis points, which the quota autotuner uses to adjust the
scheme's effective quota size (esz).
The implementation requires CONFIG_DAMON_PADDR since damon_get_folio()
is only available for physical address space monitoring.
Testing Results
===============
Functionally tested on a two-node heterogeneous memory system with DRAM
(node 0) and CXL memory (node 1). A PUSH+PULL scheme configuration using
migrate_hot actions was used to reach a target hot memory ratio between
the two tiers.
With the TEMPORAL tuner, the system converges quickly to the target
distribution. The tuner drives esz to maximum when under goal and to zero
once the goal is met, forming a simple on/off feedback loop that
stabilizes at the desired ratio.
With the CONSIST tuner, the scheme still converges but more slowly, as it
migrates and then throttles itself based on quota feedback. The time to
reach the goal varies depending on workload intensity.
Note: This metric works with both TEMPORAL and CONSIST goal tuners.
Link: https://lore.kernel.org/20260428030520.701-1-ravis.opensrc@gmail.com Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@gmail.com> Suggested-by: SeongJae Park <sj@kernel.org> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Honggyu Kim <honggyu.kim@sk.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yunjeong Mun <yunjeong.mun@sk.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Tue, 28 Apr 2026 04:29:40 +0000 (21:29 -0700)]
mm/damon/core: make charge_addr_from aware of end-address exclusivity
DAMON region end address is exclusive one, but charge_addr_from is
assigned assuming the end address is inclusive. As a result, DAMOS action
to next up to min_region_sz memory can be skipped. This is quite
negligible user impact. But, the bug is a bug that can be very simply
fixed. Fix the wrong assignment to respect the exclusiveness of the
address.
mm/memory: update stale locking comments for fault handlers
Update the comments for wp_page_copy(), do_wp_page(), do_swap_page(),
do_anonymous_page(), __do_fault(), do_fault(), handle_pte_fault(),
__handle_mm_fault(), and handle_mm_fault() to concisely clarify that they
can be entered holding either the mmap_lock or the VMA lock, and that the
lock may be released upon returning VM_FAULT_RETRY.
Additionally, make the following corrections:
- In do_anonymous_page(), correct the outdated claim that the function
is entered with the PTE "mapped but not yet locked". Since
handle_pte_fault() unmaps the empty PTE before routing to
do_pte_missing(), the comment now correctly states it is entered
with the PTE unmapped and unlocked.
- In __do_fault(), update the stale reference from __lock_page_retry()
to __folio_lock_or_retry().
Link: https://lore.kernel.org/20260424092217.263648-1-adi.sharma@zohomail.in Signed-off-by: Aditya Sharma <adi.sharma@zohomail.in> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
PMD and PUD entries revalidation has the same semantics as PTE entry
revalidation. Convert the remaining direct entry dereferences to the
corresponding accessors.
The PTE validation in gup_fast_pte_range() is inconsistent with the prior
value acquisition in the sense that it drops the lockless access
semantics.
Use the lockless accessor not only for the PTE, but also for the PMD
validation, which is likewise inconsistent with the prior value
acquisition in gup_fast_pmd_range().
Link: https://lore.kernel.org/20260421051754.1691221-1-agordeev@linux.ibm.com Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Peter Xu <peterx@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Apply the same batch-freeing optimization from free_contig_range() to the
frozen page path. The previous __free_contig_frozen_range() freed each
order-0 page individually via free_frozen_pages(), which is slow for the
same reason the old free_contig_range() was: each page goes to the order-0
pcp list rather than being coalesced into higher-order blocks.
Rewrite __free_contig_frozen_range() to call free_pages_prepare() for each
order-0 page, then batch the prepared pages into the largest possible
power-of-2 aligned chunks via free_prepared_contig_range(). If
free_pages_prepare() fails (e.g. HWPoison, bad page) the page is
deliberately not freed; it should not be returned to the allocator.
I've tested CMA through debugfs. The test allocates 16384 pages per
allocation for several iterations. There is 3.5x improvement.
Before: 1406 usec per iteration
After: 402 usec per iteration
Ryan Roberts [Wed, 1 Apr 2026 10:16:20 +0000 (11:16 +0100)]
vmalloc: optimize vfree with free_pages_bulk()
Whenever vmalloc allocates high order pages (e.g. for a huge mapping) it
must immediately split_page() to order-0 so that it remains compatible
with users that want to access the underlying struct page. Commit a06157804399 ("mm/vmalloc: request large order pages from buddy
allocator") recently made it much more likely for vmalloc to allocate high
order pages which are subsequently split to order-0.
Unfortunately this had the side effect of causing performance regressions
for tight vmalloc/vfree loops (e.g. test_vmalloc.ko benchmarks). See Closes: tag. This happens because the high order pages must be gotten
from the buddy but then because they are split to order-0, when they are
freed they are freed to the order-0 pcp. Previously allocation was for
order-0 pages so they were recycled from the pcp.
It would be preferable if when vmalloc allocates an (e.g.) order-3 page
that it also frees that order-3 page to the order-3 pcp, then the
regression could be removed.
So let's do exactly that; update stats separately first as coalescing is
hard to do correctly without complexity. Use free_pages_bulk() which uses
the new __free_contig_range() API to batch-free contiguous ranges of pfns.
This not only removes the regression, but significantly improves
performance of vfree beyond the baseline.
A selection of test_vmalloc benchmarks running on arm64 server class
system. mm-new is the baseline. Commit a06157804399 ("mm/vmalloc:
request large order pages from buddy allocator") was added in v6.19-rc1
where we see regressions. Then with this change performance is much
better. (>0 is faster, <0 is slower, (R)/(I) = statistically significant
Regression/Improvement):
Ryan Roberts [Wed, 1 Apr 2026 10:16:19 +0000 (11:16 +0100)]
mm/page_alloc: optimize free_contig_range()
Patch series "mm: Free contiguous order-0 pages efficiently", v6.
A recent change to vmalloc caused some performance benchmark regressions
(see [1]). I'm attempting to fix that (and at the same time significantly
improve beyond the baseline) by freeing a contiguous set of order-0 pages
as a batch.
At the same time I observed that free_contig_range() was essentially doing
the same thing as vfree() so I've fixed it there too. While at it,
optimize the __free_contig_frozen_range() as well.
Check that the contiguous range falls in the same section. If they aren't
enabled, the if conditions get optimized out by the compiler as
memdesc_section() returns 0. See num_pages_contiguous() for more details
about it.
This patch (of 3):
Decompose the range of order-0 pages to be freed into the set of largest
possible power-of-2 size and aligned chunks and free them to the pcp or
buddy. This improves on the previous approach which freed each order-0
page individually in a loop. Testing shows performance to be improved by
more than 10x in some cases.
Since each page is order-0, we must decrement each page's reference count
individually and only consider the page for freeing as part of a high
order chunk if the reference count goes to zero. Additionally
free_pages_prepare() must be called for each individual order-0 page too,
so that the struct page state and global accounting state can be
appropriately managed. But once this is done, the resulting high order
chunks can be freed as a unit to the pcp or buddy.
This significantly speeds up the free operation but also has the side
benefit that high order blocks are added to the pcp instead of each page
ending up on the pcp order-0 list; memory remains more readily available
in high orders.
vmalloc will shortly become a user of this new optimized
free_contig_range() since it aggressively allocates high order
non-compound pages, but then calls split_page() to end up with contiguous
order-0 pages. These can now be freed much more efficiently.
The execution time of the following function was measured in a server
class arm64 machine:
static int page_alloc_high_order_test(void)
{
unsigned int order = HPAGE_PMD_ORDER;
struct page *page;
int i;
for (i = 0; i < 100000; i++) {
page = alloc_pages(GFP_KERNEL, order);
if (!page)
return -1;
split_page(page, order);
free_contig_range(page_to_pfn(page), 1UL << order);
}
return 0;
}
Execution time before: 4097358 usec
Execution time after: 729831 usec
Vmscan has six main reclaim entry points: try_to_free_pages() for
direct reclaim, try_to_free_mem_cgroup_pages() for memcg reclaim,
mem_cgroup_shrink_node() for memcg soft limit reclaim, node_reclaim()
for node reclaim, shrink_all_memory() for hibernation reclaim, and
balance_pgdat() for kswapd reclaim.
All of them, except for shrink_all_memory() and balance_pgdat(),
already have begin/end tracepoints. This makes it harder to trace
which reclaim path is responsible for memory reclaim activity, because
kswapd reclaim cannot be identified as cleanly as other reclaim entry
points, even though it is the main background reclaim path under memory
pressure. There may be no need to trace shrink_all_memory() as it is
primarily used during hibernation. So this patch adds the missing
tracepoint pair for balance_pgdat().
The begin tracepoint records the node id, requested reclaim order, and
the requested classzone bound (highest_zoneidx). The end tracepoint
records the node id, the reclaim order that balance_pgdat() finished
with, the requested classzone bound, and nr_reclaimed. Together, they
show the requested reclaim order and classzone bound, whether reclaim
fell back to a lower order, and how much reclaim work was done.
The end tracepoint also records highest_zoneidx even though it does not
change within a balance_pgdat() invocation. This keeps the end event
self-contained, so users can analyze reclaim results directly from end
events without depending on begin/end correlation, which is less
convenient when tracing is filtered or records are dropped. It also
makes it straightforward to relate nr_reclaimed and the final reclaim
order to the requested classzone bound.
Link: https://lore.kernel.org/20260424031418.174597-1-b.suvonov@sjtu.edu.cn Link: https://lore.kernel.org/20260423103753.546582-1-b.suvonov@sjtu.edu.cn Signed-off-by: Bunyod Suvonov <b.suvonov@sjtu.edu.cn> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: convert vmemmap_p?d_populate() to static functions
Since the vmemmap_p?d_populate functions are unused outside the mm
subsystem, we can remove their external declarations and convert them to
static functions.
Link: https://lore.kernel.org/20260423101441.7089-1-kaitao.cheng@linux.dev Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Acked-by: David Hildenbrand (arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam@infradead.org> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/huge_memory: fix outdated comment about freeing subpages in __folio_split
The comment appears to be outdated. add_to_swap() no longer exists,
and the explanation of why we need to call put_page() after splitting
could be made more general.
Link: https://lore.kernel.org/20260423034917.8234-1-baohua@kernel.org Signed-off-by: Barry Song (Xiaomi) <baohua@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Chris Li <chrisl@kernel.org> Cc: Kairui Song <kasong@tencent.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Youngjun Park <youngjun.park@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Currently, when shmem mounts are initialized, they only use 'sbinfo->huge'
to determine whether the shmem mount supports large folios. However, for
anonymous shmem, whether it supports large folios can be dynamically
configured via sysfs interfaces, so setting or not setting
mapping_set_large_folios() during initialization cannot accurately reflect
whether anonymous shmem actually supports large folios, which has already
caused some confusion[1].
Moreover, for tmpfs mounts, relying on 'sbinfo->huge' cannot keep the
mapping_set_large_folios() setting consistent across all mappings in the
entire tmpfs mount. In other words, under the same tmpfs mount, after
remount, we might end up with some mappings supporting large folios
(calling mapping_set_large_folios()) while others don't.
After some investigation, I found that the write performance regression
addressed by commit 5a90c155defa has already been fixed by the following
commit 665575cff098b ("filemap: move prefaulting out of hot write path").
See the following test data:
The data is basically consistent with minor fluctuation noise. So we can now
safely revert commit 5a90c155defa to set mapping_set_large_folios() for all
shmem mounts unconditionally.
mm/page_alloc: replace kernel_init_pages() with batch page clearing
When init_on_alloc is enabled, kernel_init_pages() clears every page one
at a time via clear_highpage_kasan_tagged(), which incurs per-page
kmap_local_page()/kunmap_local() overhead and prevents the architecture
clearing primitive from operating on contiguous ranges.
Introduce clear_highpages_kasan_tagged() as a static batch clearing helper
in page_alloc.c that calls clear_pages() for the full contiguous range on
!HIGHMEM systems, bypassing the per-page kmap overhead and allowing a
single invocation of the arch clearing primitive across the entire
allocation. The HIGHMEM path falls back to per-page clearing since those
pages require kmap.
Replace kernel_init_pages() with direct calls to the new helper, as it
becomes a trivial wrapper.
Allocating 8192 x 2MB HugeTLB pages (16GB) with init_on_alloc=1:
Li Wang [Wed, 22 Apr 2026 08:04:46 +0000 (16:04 +0800)]
selftests/mm: suppress compiler error in liburing check
When building the mm selftests on a system without liburing development
headers, check_config.sh leaks a raw compiler error:
/tmp/tmp.kIIOIqwe3n.c:2:10: fatal error: liburing.h: No such file or directory
2 | #include <liburing.h>
| ^~~~~~~~~~~~
Since this is an expected failure during the configuration probe,
redirect the compiler output to /dev/null to hide it.
And the build system prints a clear warning when this occurs:
Warning: missing liburing support. Some tests will be skipped.
Because the user is properly notified about the missing dependency, the
raw compiler error is redundant and only confuse users.
Additionally, update the Makefile to use $(Q) and $(call msg,...) for the
check_config.sh execution. This aligns the probe with standard kbuild
output formatting, providing a clean "CHK" message instead of printing the
raw command during the build.
Link: https://lore.kernel.org/20260422080446.26020-3-wangli.ahau@gmail.com Signed-off-by: Li Wang <wangli.ahau@gmail.com> Tested-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li Wang [Wed, 22 Apr 2026 08:04:45 +0000 (16:04 +0800)]
selftests/mm: respect build verbosity settings for 32/64-bit targets
Patch series "selftests/mm: clean up build output and verbosity", v3.
Currently, the build process for the mm selftests is unnecessarily noisy.
First, it leaks raw compiler errors during the liburing feature probe if
the headers are missing, which is confusing since the build system already
handles this gracefully with a clear warning.
Second, the specific 32-bit and 64-bit compilation targets ignore the
standard kbuild verbosity settings, always printing their full compiler
commands even during a default quiet build.
This patch (of 2):
The 32-bit and 64-bit compilation rules invoke $(CC) directly, bypassing
the $(Q) quiet prefix and $(call msg,...) helper used by the rest of the
selftests build system. This causes these rules to always print the full
compiler command line, even when V=0 (the default).
Wrap the commands with $(Q) and $(call msg,CC,,$@) to match the convention
used by lib.mk, so that quiet and verbose builds behave consistently
across all targets.
Muchun Song [Sat, 23 May 2026 06:01:23 +0000 (14:01 +0800)]
mm/cma: fix reserved page leak on activation failure
If cma_activate_area() fails after allocating only part of the range
bitmaps, the cleanup path still has to release the reserved pages when
CMA_RESERVE_PAGES_ON_ERROR is clear.
That is still worth doing even in this __init path. A bitmap_zalloc()
failure does not necessarily mean the system cannot make further progress:
freeing the reserved CMA pages can return a substantial amount of memory
to the buddy allocator and may relieve the temporary memory shortage that
caused the allocation failure in the first place.
However, the cleanup path currently uses the bitmap-freeing bound for page
release as well. That is only correct for ranges whose bitmap allocation
already succeeded. The failed range and all later ranges still keep their
reserved pages, so a partial bitmap allocation failure can permanently
leak them.
Fix this by releasing reserved pages for all ranges. Use the saved
early_pfn[] value for ranges whose bitmap allocation already succeeded and
for the failed range, and use cmr->early_pfn for later ranges whose bitmap
allocation was never attempted.
Link: https://lore.kernel.org/20260523060123.2207992-1-songmuchun@bytedance.com Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: Usama Arif <usama.arif@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Frank van der Linden <fvdl@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wupeng Ma [Fri, 22 May 2026 01:03:05 +0000 (09:03 +0800)]
mm/memory-failure: fix hugetlb_lock AA deadlock in get_huge_page_for_hwpoison
Two concurrent madvise(MADV_HWPOISON) calls on the same hugetlb page can
trigger a recursive spinlock self-deadlock (AA deadlock) on hugetlb_lock
when racing with a concurrent unmap:
The out: path in __get_huge_page_for_hwpoison() calls folio_put() to drop
the GUP reference while the hugetlb_lock is still held by the hugetlb.c
wrapper get_huge_page_for_hwpoison(). If concurrent unmap has released
the page table mapping reference, folio_put() drops the folio refcount to
zero, triggering free_huge_folio() which attempts to re-acquire the
non-recursive hugetlb_lock.
Fix this by moving hugetlb_lock acquisition from the hugetlb.c wrapper
into get_huge_page_for_hwpoison(). Place spin_unlock_irq() before the
folio_put() at the out: label so the folio is always released outside the
lock.
[akpm@linux-foundation.org: fix race, rename label per Miaohe] Link: https://sashiko.dev/#/patchset/20260522010305.4099834-1-mawupeng1@huawei.com Link: https://lore.kernel.org/f39f405e-4b4b-8f79-70fe-a2b5b62114eb@huawei.com Link: https://lore.kernel.org/20260522010305.4099834-1-mawupeng1@huawei.com Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: Wupeng Ma <mawupeng1@huawei.com> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Carlier [Wed, 20 May 2026 04:49:12 +0000 (05:49 +0100)]
mm/hugetlb: restore reservation on error in hugetlb folio copy paths
Two sites in mm/hugetlb.c allocate a hugetlb folio via
alloc_hugetlb_folio() (consuming a VMA reservation) and then call
copy_user_large_folio(), which became int-returning in commit 1cb9dc4b475c
("mm: hwpoison: support recovery from HugePage copy-on-write faults") and
can now fail (e.g. -EHWPOISON on a hwpoisoned source page). On the
failure path, folio_put() restores the global hugetlb pool count through
free_huge_folio(), but the per-VMA reservation map entry is left marked
consumed:
- hugetlb_mfill_atomic_pte() resubmission path (UFFDIO_COPY)
- copy_hugetlb_page_range() fork-time CoW path when
hugetlb_try_dup_anon_rmap() fails (rare: pinned hugetlb anon
folio under fork)
User-visible effect: on UFFDIO_COPY into a private hugetlb VMA where the
resubmission copy fails, the reservation for that address is leaked from
the VMA's reserve map. A subsequent fault at the same address takes the
no-reservation path, and under hugetlb pool pressure the task is SIGBUSed
at an address it had previously reserved. The fork-time CoW path leaks
the same way in the child VMA's reserve map, though it requires the much
rarer combination of pinned hugetlb anon page + hwpoisoned source.
Add the missing restore_reserve_on_error() call before folio_put() on both
error paths.
Link: https://lore.kernel.org/20260520044912.6751-1-devnexen@gmail.com Fixes: 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: yuehaibing <yuehaibing@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Muchun Song [Wed, 20 May 2026 06:10:25 +0000 (14:10 +0800)]
mm/cma_debug: fix invalid accesses for inactive CMA areas
cma_activate_area() can fail after allocating range bitmaps. Its cleanup
path frees those bitmaps, but only clears cma->count and
cma->available_count. It leaves cma->nranges and each range's count in
place, so cma_debugfs_init() can still register debugfs files for an area
that never activated successfully.
That exposes two problems. Reading the bitmap file can make debugfs walk
a freed range bitmap and trigger an invalid memory access. Reading
maxchunk can also take cma->lock even though that lock is initialized only
on the successful activation path.
Fix this by creating debugfs entries only for CMA areas that reached
CMA_ACTIVATED.
c009da4258f9 introduced the invalid access to bitmap file. 2e32b947606d
introduced the invalid access to cma->lock. This change applies to both
issues. So I added two Fixes tags.
Link: https://lore.kernel.org/20260520061025.3971821-1-songmuchun@bytedance.com Fixes: c009da4258f9 ("mm, cma: support multiple contiguous ranges, if requested") Fixes: 2e32b947606d ("mm: cma: add functions to get region pages counters") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Acked-by: Oscar Salvador (SUSE) <osalvador@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: Frank van der Linden <fvdl@google.com> Cc: Liam R. Howlett <liam@infradead.org> Cc: Lorenzo Stoakes <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Stefan Strogin <stefan.strogin@gmail.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Shakeel Butt [Thu, 21 May 2026 22:37:51 +0000 (15:37 -0700)]
memcg: use round-robin victim selection in refill_stock
Harry Yoo reported that get_random_u32_below() is not safe to call in the
nmi context and memcg charge draining can happen in nmi context.
More specifically get_random_u32_below() is neither reentrant- nor
NMI-safe: it acquires a per-cpu local_lock via local_lock_irqsave() on the
batched_entropy_u32 state. An NMI that lands on a CPU mid-update of the
ChaCha batch state and recurses into the random subsystem would corrupt
that state. The memcg_stock local_trylock prevents re-entry on the percpu
stock itself, but cannot protect an unrelated subsystem's per-cpu lock.
Replace the random pick with a per-cpu round-robin counter stored in
memcg_stock_pcp and serialized by the same local_trylock that already
guards cached[] and nr_pages[]. No atomics, no random calls, no extra
locks needed.
Link: https://lore.kernel.org/20260521223751.3794625-1-shakeel.butt@linux.dev Fixes: f735eebe55f8f ("memcg: multi-memcg percpu charge cache") Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reported-by: Harry Yoo <harry@kernel.org> Closes: https://lore.kernel.org/4e20f643-6983-4b6e-b12d-c6c4eb20ae0c@kernel.org/ Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Commit 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split,
not before") changed the locking model around hugetlbfs PMD unsharing on
VMA split, but did not update the function which asserts the locks,
hugetlb_vma_assert_locked().
This function asserts that either the hugetlb VMA lock is held (if a
shared mapping) or that the reservation map lock is held (if private).
If you get an unfortunate race between something which results in one of
these locks being released and a hugetlb VMA split and you have
CONFIG_LOCKDEP enabled, you can therefore see a false positive assertion
arise when there is in fact no issue.
Since this change introduced a new take_locks parameter to
hugetlb_unshare_pmds(), which, when set to false, indicates that locking
is sufficient, simply pass this to the unsharing logic and predicate the
lock assertions on this.
This is safe, as we already asserted the file rmap lock and the VMA write
lock prior to this (implying exclusive mmap write lock), so we cannot be
raced by either rmap or page fault page table walkers which the asserted
locks are intended to protect against (we don't mind GUP-fast).
Separate out huge_pmd_unshare() into __huge_pmd_unshare() to add a
check_locks parameter, and update hugetlb_unshare_pmds() to pass this
parameter to it.
This leaves all other callers of huge_pmd_unshare() still correctly
asserting the locks.
The below reproducer will trigger the assert in a kernel with
CONFIG_LOCKDEP enabled by racing process teardown (which will release the
hugetlb lock) against a hugetlb split.
void execute_one(void)
{
void *ptr;
pid_t pid;
/*
* Create a hugetlb mapping spanning a PUD entry.
*
* We force the hugetlb page allocation with populate and
* noreserve.
*
* |---------------------|
* | |
* |---------------------|
* 0 PUD boundary
*/
ptr = mmap(0, PUD_SIZE, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_SHARED | MAP_ANON |
MAP_NORESERVE | MAP_HUGETLB | MAP_POPULATE,
-1, 0);
if (ptr == MAP_FAILED) {
perror("mmap");
exit(EXIT_FAILURE);
}
/*
* Fork but with a bogus stack pointer so we try to execute code in
* a non-VM_EXEC VMA, causing segfault + teardown via exit_mmap().
*
* The clone will cause PMD page table sharing between the
* processes first via:
* copy_process() -> ... -> huge_pte_alloc() -> huge_pmd_share()
*
* Then tear down and release the hugetlb 'VMA' lock via:
* exit_mmap() -> ... -> vma_close() -> hugetlb_vma_lock_free()
*/
pid = syscall(__NR_clone, 0, 2 * PMD_SIZE, 0, 0, 0);
if (pid < 0) {
perror("clone");
exit(EXIT_FAILURE);
} if (pid == 0) {
/* Pop stack... */
return;
}
/*
* We are the parent process.
*
* Race the child process's teardown with a PMD unshare.
*
* We do this by triggering:
*
* __split_vma() -> hugetlb_split() -> hugetlb_unshare_pmds()
*
* Which, importantly, doesn't hold the hugetlb VMA lock (nor can
* it), meaning we assert in hugetlb_vma_assert_locked().
*
* .
* |----------.----------|
* | . |
* |----------.----------|
* 0 . PUD boundary
*/
mmap(0, PUD_SIZE / 2, PROT_READ | PROT_WRITE,
MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
}
int main(void)
{
int i;
/* Kick off fork children. */
for (i = 0; i < NUM_FORKS; i++) {
pid_t pid = fork();
if (pid < 0) {
perror("fork");
exit(EXIT_FAILURE);
}
/* Fork children do their work and exit. */
if (!pid) {
int j;
/* If we succeeded, wait on children. */
for (i = 0; i < NUM_FORKS; i++)
wait(NULL);
return EXIT_SUCCESS;
}
[ljs@kernel.org: account for the !CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING case] Link: https://lore.kernel.org/agWZsPGYid08uU6O@lucifer Link: https://lore.kernel.org/20260513085658.45264-1-ljs@kernel.org Fixes: 081056dc00a2 ("mm/hugetlb: unshare page tables during VMA split, not before") Signed-off-by: Lorenzo Stoakes <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
bpf: replace pop/push emptiness check with bpf_list_empty()
Simplify fq_flows_is_empty() by replacing the pop/push based emptiness
check with a direct call to bpf_list_empty().
This avoids unnecessary list mutation and simplifies the code while
preserving correctness.
Leon Hwang [Thu, 21 May 2026 14:29:09 +0000 (22:29 +0800)]
bpf: Fix race between bpf_map_new_fd() and close_fd()
Because there is time gap between bpf_map_new_fd() and close_fd(), a
concurrent thread is able to close the new fd and opens a new, unrelated
file with the exact same fd number. Thereafter, this close_fd() might
inadvertently close the unrelated file.
To avoid such regression, do finalize log before security_bpf_map_create().
However, in order to achieve it, move bpf_get_file_flag(),
security_bpf_map_create(), bpf_map_alloc_id(), and bpf_map_new_fd() from
__map_create() to map_create(). And, rename __map_create() to
map_create_alloc() meanwhile.
Then, in order to reuse the map and token when all checks pass in
map_create_alloc(), pass "struct bpf_map **" and "struct bpf_token **" to
map_create_alloc().
Steven Rostedt [Fri, 22 May 2026 17:09:06 +0000 (13:09 -0400)]
ring-buffer: Show persistent buffer dropped events in trace_pipe file
When the persistent ring buffer is validated on boot up, if a subbuffer is
deemed invalid, it resets the buffer and continues. Have the code preserve
the RB_MISSED_EVENTS flag in the commit portion of the subbuffer header
and pass that back so that the trace_pipe file can show the missed events
like the trace file does.
Steven Rostedt [Fri, 22 May 2026 17:09:05 +0000 (13:09 -0400)]
ring-buffer: Show persistent buffer dropped events in trace file
When the persistent ring buffer is validated on boot up, if a subbuffer is
deemed invalid, it resets the buffer and continues. Currently, these lost
events are not shown in the trace file output.
Have the trace iterator look for subbuffers that have the RB_MISSED_EVENTS
set and set the iter->missed_events flag when it is detected. This will
then have the trace file shows "LOST EVENTS" when it reads across a
subbuffer that was corrupted and invalidated.
Steven Rostedt [Fri, 22 May 2026 17:09:04 +0000 (13:09 -0400)]
ring-buffer: Have dropped subbuffers be persistent across reboots
When the persistent ring buffer detects a corrupted subbuffer, it will
zero its size and report dropped pages in the dmesg, then it continues
normally.
But if a reboot happens without clearing or restarting tracing on the
persistent ring buffer, the next boot will show no pages are dropped.
If the persistent ring buffer is still the same, then it should still
report dropped pages so the user knows that the buffer has missing events.
Add the RB_MISSED_EVENTS flag to the commit value of the subbuffer so that
the next boot will still show that pages were dropped.
ring-buffer: Cleanup buffer_data_page related code
Code cleanup related to buffer_data_page for readability,
which includes:
- Introduce rb_data_page_commit() and rb_data_page_size()
- Use 'dpage' for buffer_data_page, instead of 'bpage' because
'bpage' is used for buffer_page.
ring-buffer: Cleanup persistent ring buffer validation
Cleanup rb_meta_validate_events() function to make it easier to read.
This includes the following cleanups:
- Introduce rb_validatation_state to hold working variables in
validation.
- Move repleated validation state updates into rb_validate_buffer().
- Move reader_page injection code outside of rb_meta_validate_events().
ring-buffer: Show commit numbers in buffer_meta file
In addition to the index number, show the commit numbers of
each data page in the per_cpu buffer_meta file.
This is useful for understanding the current status of the
persistent ring buffer. (Note that this file is shown
only for persistent ring buffer and its backup instance)
ring-buffer: Add persistent ring buffer invalid-page inject test
Add a self-corrupting test for the persistent ring buffer.
This will inject an erroneous value to some sub-buffer pages (where
the index is even or multiples of 5) in the persistent ring buffer
when the kernel panics, and checks whether the number of detected
invalid pages and the total entry_bytes are the same as the recorded
values after reboot.
This ensures that the kernel can correctly recover a partially
corrupted persistent ring buffer after a reboot or panic.
The test only runs on the persistent ring buffer whose name is
"ptracingtest". The user has to fill it with events before a
kernel panic.
To run the test, enable CONFIG_RING_BUFFER_PERSISTENT_INJECT
and add the following kernel cmdline:
cd /sys/kernel/tracing/instances/ptracingtest
echo 1 > tracing_on
echo 1 > events/enable
sleep 3
echo c > /proc/sysrq-trigger
After panic message, the kernel will reboot and run the verification
on the persistent ring buffer, e.g.
Ring buffer meta [2] invalid buffer page detected
Ring buffer meta [2] is from previous boot! (318 pages discarded)
Ring buffer testing [2] invalid pages: PASSED (318/318)
Ring buffer testing [2] entry_bytes: PASSED (1300476/1300476)
ring-buffer: Skip invalid sub-buffers when rewinding persistent ring buffer
Skip invalid sub-buffers when rewinding the persistent ring buffer
instead of stopping the rewinding the ring buffer. The skipped
buffers are cleared.
To ensure the rewinding stops at the unused page, this also clears
buffer_data_page::time_stamp when tracing resets the buffer. This
allows us to identify unused pages and empty pages.
Link: https://patch.msgid.link/20260522171051.091265852@kernel.org Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
[ SDR: Have reader_page still get evaluated if header_page fails ] Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
ring-buffer: Skip invalid sub-buffers when validating persistent ring buffer
Skip invalid sub-buffers when validating the persistent ring buffer
instead of discarding the entire ring buffer. Only skipped buffers
are invalidated (cleared).
If the cache data in memory fails to be synchronized during a reboot,
the persistent ring buffer may become partially corrupted, but other
sub-buffers may still contain readable event data. Only discard the
subbuffers that are found to be corrupted.
net/9p: fix infinite loop in p9_client_rpc on fatal signal
When p9_client_rpc() is called with type P9_TFLUSH and the transport
has no peer (e.g. fd transport backed by pipes with no 9p server),
a fatal signal causes an infinite loop:
if (err == -ERESTARTSYS && c->status == Connected &&
type == P9_TFLUSH) {
sigpending = 1;
clear_thread_flag(TIF_SIGPENDING);
goto again;
}
clear_thread_flag() clears TIF_SIGPENDING before jumping back to
io_wait_event_killable(). signal_pending_state() checks TIF_SIGPENDING,
finds it zero, and the task goes to sleep again. The task can only wake
on the next signal delivery that calls signal_wake_up() and sets
TIF_SIGPENDING again. When that happens the loop repeats, clears
TIF_SIGPENDING, and sleeps again indefinitely.
This is triggered in practice by coredump_wait(): when a thread in a
multi-threaded process causes a coredump (e.g. via SIGSYS from Syscall
User Dispatch), coredump_wait() sends SIGKILL to all other threads and
waits for them to call mm_release(). If one of those threads is blocked
in p9_client_rpc() over an fd transport with no peer, it enters the
P9_TFLUSH loop and never calls mm_release(), so coredump_wait() stalls
forever:
Fix: check fatal_signal_pending() before clearing TIF_SIGPENDING in the
P9_TFLUSH retry loop. At that point TIF_SIGPENDING is still set, so
fatal_signal_pending() works correctly. If a fatal signal is pending,
jump to recalc_sigpending to restore TIF_SIGPENDING and return
-ERESTARTSYS to the caller.
The same defect is present in stable kernels back to 5.4. On those
kernels the infinite loop is broken earlier by a second SIGKILL from
the parent process (e.g. kill_and_wait() retrying after a timeout),
resulting in a zombie process and a shutdown delay rather than a
permanent D-state hang, but the underlying flaw is the same.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: 91b8534fa8f5 ("9p: make rpc code common and rework flush code") Closes: https://syzkaller.appspot.com/bug?extid=3ce7863f8fc836a427e7 Cc: stable@vger.kernel.org Signed-off-by: Vasiliy Kovalev <kovalev@altlinux.org>
Message-ID: <20260415155237.182891-1-kovalev@altlinux.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Aayush Patil [Sun, 10 May 2026 18:28:56 +0000 (23:58 +0530)]
docs/filesystems/9p: fix broken external links
The xcpu.org links for xcpu-talk, kvmfs, and cellfs-talk are dead
with no archived snapshots available on the Wayback Machine, so
remove them. The PROSE I/O link redirects to a dead server; replace
it with an archived version from web.archive.org.
Stephen Boyd [Fri, 29 May 2026 01:44:37 +0000 (18:44 -0700)]
Merge tag 'renesas-clk-for-v7.2-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers into clk-renesas
Pull Renesas clk driver updates from Geert Uytterhoeven:
- Add Ethernet, GPIO, CPU core, watchdog, serial, I2C, sound, and SPI
clocks and resets on Renesas RZ/G3L
- Add the timer (MTU3) clock on Renesas RZ/T2H and RZ/N2H
- Add Coresight trace clocks on Renesas R-Mobile A1 and APE6
- Add display clocks and resets on Renesas RZ/G3E
* tag 'renesas-clk-for-v7.2-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers: (29 commits)
clk: renesas: r8a73a4: Add ZT/ZTR trace clocks
dt-bindings: clock: renesas,cpg-clocks: Document ZT/ZTR trace clock on R-Mobile APE6
clk: renesas: r9a08g046: Add RSPI clocks and resets
clk: renesas: r9a08g046: Add SSIF-2 clocks and resets
clk: renesas: r9a08g046: Add RSCI clocks and resets
clk: renesas: cpg-mssr: Add number of clock cells check
clk: renesas: rzg2l: Refactor rzg3l_cpg_pll_clk_endisable()
clk: renesas: rzg2l: Consolidate DEF_MUX() and DEF_MUX_FLAGS()
clk: renesas: r9a08g046: Add IA55_PCLK to critical module clocks
clk: renesas: r9a09g047: Add support for LCDC{0,1} clocks and resets
clk: renesas: r9a09g047: Add support for DSI clocks and resets
clk: renesas: r9a09g047: Add support for SMUX2_DSI{0,1}_CLK
clk: renesas: r9a09g047: Add CLK_PLLDSI{0,1}_CSDIV clocks
clk: renesas: r9a09g047: Add CLK_PLLDSI{0,1}_DIV7 clocks
clk: renesas: r9a09g047: Add CLK_PLLDSI{0,1} clocks
clk: renesas: r9a09g047: Add CLK_PLLETH_LPCLK support
clk: renesas: rzv2h: Add PLLDSI clk mux support
clk: renesas: r8a7740: Add ZT/ZTR trace clocks
dt-bindings: clock: renesas,cpg-clocks: Document ZT/ZTR trace clock on R-Mobile A1
clk: renesas: r9a09g077: Add MTU3 module clock
...
Sohil Mehta [Thu, 28 May 2026 18:48:25 +0000 (11:48 -0700)]
x86/cpu: Fix a F00F bug warning and clean up surrounding code
On x86 SMP systems with the F00F bug present, do_clear_cpu_cap()
rightfully warns that the code clears the X86_BUG_F00F flag after
alternatives have been patched.
X86_BUG_F00F is first cleared in intel_workarounds() and then set for
the affected models. This sequence works fine on the BSP but on AP
bringup, where alternatives have already been patched and clearing the
flag there triggers the warning.
There is no technical reason for clearing the flag before setting it. It
is mainly an artifact of introducing the X86_BUG_F00F flag in
While at it, remove the kernel notification and the surrounding logic to
inform the user about the workaround exactly once. If needed, the
presence of the F00F bug can be determined through /proc/cpuinfo.
Additionally, the F00F bug was the last remaining user of clear_cpu_bug().
With no users left, get rid of this helper as well.
[ bp: Massage commit message. ]
Co-developed-by: Richard Weinberger <richard@nod.at> Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Sohil Mehta <sohil.mehta@intel.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Ahmed S. Darwish <darwi@linutronix.de> Link: https://patch.msgid.link/20260528184826.3642051-1-sohil.mehta@intel.com
Randy Dunlap [Thu, 16 Apr 2026 22:48:33 +0000 (15:48 -0700)]
spmi: clean up kernel-doc in spmi.h
Fix all kernel-doc warnings in spmi.h:
Warning: include/linux/spmi.h:114 function parameter 'ctrl' not described
in 'spmi_controller_put'
Warning: include/linux/spmi.h:144 struct member 'shutdown' not described
in 'spmi_driver'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
spmi: spmi-pmic-arb: add support for PMIC arbiter v8.5
PMIC arbiter v8.5 is an extension of PMIC arbiter v8 that updated
the definition of the channel status register bit fields. Add support
to handle this difference.
dt-bindings: spmi: glymur-spmi-pmic-arb: Add compatible for Qualcomm Hawi SoC
The PMIC arbiter in the Qualcomm Hawi SoC is version v8.5, which
introduces parity and CRC checks for data received from the PMIC,
as well as NACK checks for command sequences except for read.
All other features in PMIC arbiter remain the same as the one in
the Qualcomm Glymur SoC, with the only differences being some
additional error status checks.
Therefore, add a string for "qcom,hawi-spmi-pmic-arb" as a compatible
entry for "qcom,glymur-spmi-pmic-arb".
Signed-off-by: Fenglin Wu <fenglin.wu@oss.qualcomm.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Signed-off-by: Stephen Boyd <sboyd@kernel.org>
Stephen Boyd [Fri, 29 May 2026 01:21:54 +0000 (18:21 -0700)]
Merge tag 'qcom-clk-fixes-for-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux into clk-fixes
Pull Qualcomm clk driver fixes from Bjorn Andersson:
The parking of shared RCGs during registration parks the MDP source
clock, disabling display until the msm display driver successfully
initializes the hardware again. Mark this clock on Makena and Hamoa as
"no_init_park" to retain a working recovery console etc.
* tag 'qcom-clk-fixes-for-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/qcom/linux:
clk: qcom: dispcc-sc8280xp: Don't park mdp_clk_src at registration time
clk: qcom: x1e80100-dispcc: Stop disp_cc_mdss_mdp_clk_src from getting parked
Stephen Boyd [Fri, 29 May 2026 01:19:26 +0000 (18:19 -0700)]
Merge tag 'samsung-clk-fixes-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into clk-fixes
Pull a Samsung clk driver fix from Krzysztof Kozlowski:
Google GS101: Correct the register name for saving and restoring state
during system suspend and resume. Lack of proper save/restore leads to
incorrect clock values after system resume.
* tag 'samsung-clk-fixes-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
clk: samsung: gs101: Fix missing USI7_USI DIV clock in peric0_clk_regs
Dave Airlie [Fri, 29 May 2026 01:12:48 +0000 (11:12 +1000)]
Merge tag 'drm-intel-fixes-2026-05-27' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Fix potential UAF in TTM object purge (Janusz Krzysztofik)
- Use polling when irqs are unavailable [aux] (Michał Grzelak)
- Fix HDR pre-CSC LUT programming loop [color] (Pranay Samala)
- Block DC states on vblank enable when Panel Replay supported [psr] (Jouni Högander)
- Use DC_OFF wake reference to block DC6 on vblank enable [psr] (Jouni Högander)
Jakub Kicinski [Fri, 29 May 2026 01:10:05 +0000 (18:10 -0700)]
Merge branch 'docs-page_pool-tweaks-and-updates'
Jakub Kicinski says:
====================
docs: page_pool: tweaks and updates
I'm hoping to start feeding our docs into the AI review tools, instead
of maintaining a separate repo with review prompts. To experiment with
that we have to refresh the docs a little bit.
This set exclusively focuses on the page pool API. First patch is
a straightforward fix for information which is now out of date.
Second one attempts to clarify the NAPI linking requirements.
Third drops the dedicated section about the stats; the document
is primarily developer-facing and the stats should require no
development effort in most cases. Last but not least minor
API cleanup.
====================