Liew Rui Yan [Sun, 15 Mar 2026 16:29:44 +0000 (09:29 -0700)]
Docs/mm/damon: document exclusivity of special-purpose modules
Add a section in design.rst to explain that DAMON special-purpose kernel
modules (LRU_SORT, RECLAIM, STAT) run in an exclusive manner and return
-EBUSY if another is already running.
Update lru_sort.rst, reclaim.rst and stat.rst by adding cross-references
to this exclusivity rule at the end of their respective Example sections.
This change is motivated by another discussion [1].
Link: https://lkml.kernel.org/r/20260315162945.80994-1-sj@kernel.org Link: https://lore.kernel.org/damon/20260314002119.79742-1-sj@kernel.org/T/#t Signed-off-by: Liew Rui Yan <aethernet65535@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When read_from_bdev_async() fails to chain a bio, for instance when it
fails to allocate a request or bio, we need to propagate the error
condition so that the upper layer is aware of it. zram already does that
by setting ->bi_status to BLK_STS_IOERR, but only for sync reads. Change
the async read path to return its error status so that async errors are
also handled.
Link: https://lkml.kernel.org/r/20260316015354.114465-1-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Brian Geffon <bgeffon@google.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Richard Chang <richardycc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
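As a rough illustration of the change, the sketch below simulates the error propagation in plain userspace C. fake_bio, chain_to_bdev() and the numeric BLK_STS_* values are simplified stand-ins invented for this sketch, not the actual zram/block-layer code:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for kernel types; the real code uses struct bio
 * and blk_status_t from the block layer. */
typedef int blk_status_t;
#define BLK_STS_OK    0
#define BLK_STS_IOERR 10

struct fake_bio {
    blk_status_t bi_status;
    bool completed;
};

/* Pretend to chain a child bio to the backing device; may fail, e.g. on
 * request allocation failure. */
static int chain_to_bdev(bool alloc_fails)
{
    return alloc_fails ? -1 : 0;
}

/* Before the fix (conceptually), the async path returned without touching
 * bi_status on chaining failure, so the upper layer saw a "successful"
 * read. After the fix, the error is propagated just as the sync path
 * already did. */
static void read_from_bdev_async(struct fake_bio *parent, bool alloc_fails)
{
    if (chain_to_bdev(alloc_fails)) {
        parent->bi_status = BLK_STS_IOERR;  /* propagate the error */
        parent->completed = true;           /* end the parent bio */
        return;
    }
    parent->bi_status = BLK_STS_OK;
    parent->completed = true;
}
```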
Calling `LZ4_loadDict()` repeatedly in zram causes significant overhead
due to its internal dictionary pre-processing. This commit introduces a
template stream mechanism that pre-processes the dictionary only once,
when the dictionary is initially set or modified, and then efficiently
copies this state for subsequent compressions.
Verification Test Items:
Test Platform: android16-6.12
1. Collect Anonymous Page Dataset
1) Apply the following patch:
static bool zram_meta_alloc(struct zram *zram, u64 disksize)
if (!huge_class_size)
- huge_class_size = zs_huge_class_size(zram->mem_pool);
+ huge_class_size = 0;
2) Install multiple apps and run monkey testing until SwapFree is close to 0.
3) Execute the following command to export the data:
dd if=/dev/block/zram0 of=/data/samples/zram_dump.img bs=4K
2. Train Dictionary
Since LZ4 does not have a dedicated dictionary training tool, the zstd
tool can be used for training [1]. The command is as follows:
zstd --train /data/samples/* --split=4096 --maxdict=64KB -o /vendor/etc/dict_data
3. Test Code
adb shell "dd if=/data/samples/zram_dump.img of=/dev/test_pattern bs=4096 count=131072 conv=fsync"
adb shell "swapoff /dev/block/zram0"
adb shell "echo 1 > /sys/block/zram0/reset"
adb shell "echo lz4 > /sys/block/zram0/comp_algorithm"
adb shell "echo dict=/vendor/etc/dict_data > /sys/block/zram0/algorithm_params"
adb shell "echo 6G > /sys/block/zram0/disksize"
echo "Start Compression"
adb shell "taskset 80 dd if=/dev/test_pattern of=/dev/block/zram0 bs=4096 count=131072 conv=fsync"
echo
echo "Start Decompression"
adb shell "taskset 80 dd if=/dev/block/zram0 of=/dev/output_result bs=4096 count=131072 conv=fsync"
echo "mm_stat:"
adb shell "cat /sys/block/zram0/mm_stat"
echo
Note: To ensure stable test results, it is best to lock the CPU frequency
before executing the test.
LZ4 supports dictionaries up to 64KB. Below are the measured compression
speeds at various dictionary sizes:
dict_size    base      patch
 4KB         156M/s    219M/s
 8KB         136M/s    217M/s
16KB          98M/s    214M/s
32KB          66M/s    225M/s
64KB          38M/s    224M/s
When an LZ4 compression dictionary is enabled, compression speed is
negatively impacted by the dictionary's size; larger dictionaries result
in slower compression. This patch eliminates the influence of dictionary
size on compression speed, ensuring consistent performance regardless of
dictionary scale.
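The template-stream idea can be illustrated with a self-contained simulation: an expensive dictionary pre-processing step run once, and a cheap memcpy() of the resulting state per compression. The hash-table layout below is invented for illustration and is not LZ4's actual internal format:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative stand-in for LZ4's internal stream state: a small hash
 * table indexing positions in the dictionary. Building it (the
 * LZ4_loadDict() analogue) is the expensive step. */
#define TABLE_SIZE 4096

struct stream_state {
    uint32_t table[TABLE_SIZE];
};

/* Expensive: scan the dictionary and index it. Done ONCE, when the
 * dictionary is set or changed. */
static void build_template(struct stream_state *tmpl,
                           const uint8_t *dict, size_t len)
{
    memset(tmpl, 0, sizeof(*tmpl));
    for (size_t i = 0; i + 4 <= len; i++) {
        uint32_t h;
        memcpy(&h, dict + i, 4);
        /* Fibonacci hashing into 12 bits (0..4095). */
        tmpl->table[(h * 2654435761u) >> 20] = (uint32_t)i;
    }
}

/* Cheap: each compression starts from a copy of the template instead of
 * re-running the dictionary pre-processing. */
static void init_stream_from_template(struct stream_state *s,
                                      const struct stream_state *tmpl)
{
    memcpy(s, tmpl, sizeof(*s));
}
```

With this split, the per-compression cost no longer depends on dictionary size, which is consistent with the flat "patch" column in the speed table above.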
Nico Pache [Wed, 25 Mar 2026 11:40:22 +0000 (05:40 -0600)]
mm/khugepaged: unify khugepaged and madv_collapse with collapse_single_pmd()
The khugepaged daemon and madvise_collapse have two different
implementations that do almost the same thing. Create collapse_single_pmd
to increase code reuse and create an entry point to these two users.
Refactor madvise_collapse and collapse_scan_mm_slot to use the new
collapse_single_pmd function. To help reduce confusion around the
mmap_locked variable, we rename mmap_locked to lock_dropped in the
collapse_scan_mm_slot() function and remove the redundant mmap_locked in
madvise_collapse(); this further unifies the code and improves
readability. The SCAN_PTE_MAPPED_HUGEPAGE enum is no longer reachable in
the madvise_collapse() function, so we drop it from the list of
"continuing" enums.
This introduces a minor behavioral change that is most likely an
undiscovered bug. The current implementation of khugepaged tests
collapse_test_exit_or_disable() before calling collapse_pte_mapped_thp,
but we weren't doing it in the madvise_collapse case. By unifying these
two callers madvise_collapse now also performs this check. We also modify
the return value to be SCAN_ANY_PROCESS which properly indicates that this
process is no longer valid to operate on.
By moving the madvise_collapse writeback-retry logic into the helper
function we can also avoid having to revalidate the VMA.
We guard the khugepaged_pages_collapsed variable to ensure it is only
incremented for khugepaged.
As requested we also convert a VM_BUG_ON to a VM_WARN_ON.
Link: https://lkml.kernel.org/r/20260325114022.444081-6-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil 
Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nico Pache [Wed, 25 Mar 2026 11:40:21 +0000 (05:40 -0600)]
mm/khugepaged: rename hpage_collapse_* to collapse_*
The hpage_collapse prefix describes functions used by both
madvise_collapse and khugepaged. Remove the unnecessary hpage prefix to
shorten the function names.
Link: https://lkml.kernel.org/r/20260325114022.444081-5-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström 
<thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Nico Pache [Wed, 25 Mar 2026 11:40:20 +0000 (05:40 -0600)]
mm/khugepaged: define KHUGEPAGED_MAX_PTES_LIMIT as HPAGE_PMD_NR - 1
The value (HPAGE_PMD_NR - 1) is used often in the khugepaged code to
signify the limit of the max_ptes_* values. Add a define for this to
increase code readability and reuse.
Link: https://lkml.kernel.org/r/20260325114022.444081-4-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif 
<usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
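A minimal sketch of the new define. HPAGE_PMD_NR's value is arch-dependent; 512 assumes x86-64 with 4K pages, and validate_max_ptes() is a hypothetical user added here for illustration, not kernel code:

```c
#include <assert.h>

/* Illustrative value: on x86-64 with 4K pages a PMD maps 512 PTEs. */
#define HPAGE_PMD_NR 512

/* The new define: the max_ptes_* tunables may be at most one less than a
 * full PMD's worth of PTEs. */
#define KHUGEPAGED_MAX_PTES_LIMIT (HPAGE_PMD_NR - 1)

/* Hypothetical validation, as a sysfs store handler might do it. */
static int validate_max_ptes(unsigned int val)
{
    return val <= KHUGEPAGED_MAX_PTES_LIMIT ? 0 : -1;
}
```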
Nico Pache [Wed, 25 Mar 2026 11:40:19 +0000 (05:40 -0600)]
mm: introduce is_pmd_order helper
In order to add mTHP support to khugepaged, we will often be checking if a
given order is (or is not) the PMD order. Some places in the kernel
already use this check, so let's create a simple helper function to keep
the code clean and readable.
Link: https://lkml.kernel.org/r/20260325114022.444081-3-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Suggested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) 
<tiwai@suse.de> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
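A plausible shape for the helper, assuming x86-64 with 4K pages where HPAGE_PMD_ORDER is 9; this standalone sketch is illustrative only:

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: x86-64 with 4K pages, where a PMD covers 2^9 = 512 pages. */
#define HPAGE_PMD_ORDER 9

/* The helper replaces open-coded `order == HPAGE_PMD_ORDER` checks. */
static inline bool is_pmd_order(unsigned int order)
{
    return order == HPAGE_PMD_ORDER;
}
```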
Nico Pache [Wed, 25 Mar 2026 11:40:18 +0000 (05:40 -0600)]
mm: consolidate anonymous folio PTE mapping into helpers
Patch series "mm: khugepaged cleanups and mTHP prerequisites", v4.
The following series contains cleanups and prerequisites for my work on
khugepaged mTHP support [1]. These have been separated out to ease
review.
The first patch in the series refactors the page fault folio-to-PTE
mapping and follows a convention similar to that defined by
map_anon_folio_pmd_(no)pf(). This not only cleans up the current
implementation of do_anonymous_page(), but will also allow for reuse later
in the khugepaged mTHP implementation.
The second patch adds a small is_pmd_order() helper to check if an order
is the PMD order. This check is open-coded in a number of places. This
patch aims to clean this up and will be used more in the khugepaged mTHP
work. The third patch also adds a small DEFINE for (HPAGE_PMD_NR - 1)
which is used often across the khugepaged code.
The fourth and fifth patch come from the khugepaged mTHP patchset [1].
These two patches include the rename of function prefixes, and the
unification of khugepaged and madvise_collapse via a new
collapse_single_pmd function.
Patch 1: refactor do_anonymous_page into map_anon_folio_pte_(no)pf
Patch 2: add is_pmd_order helper
Patch 3: Add define for (HPAGE_PMD_NR - 1)
Patch 4: Refactor/rename hpage_collapse
Patch 5: Refactoring to combine madvise_collapse and khugepaged
A big thanks to everyone that has reviewed, tested, and participated in
the development process.
This patch (of 5):
The anonymous page fault handler in do_anonymous_page() open-codes the
sequence to map a newly allocated anonymous folio at the PTE level:
- construct the PTE entry
- add rmap
- add to LRU
- set the PTEs
- update the MMU cache.
Introduce two helpers to consolidate this duplicated logic, mirroring the
existing map_anon_folio_pmd_nopf() pattern for PMD-level mappings:
map_anon_folio_pte_nopf(): constructs the PTE entry, takes folio
references, and adds the anon rmap and LRU entries. This function also
handles the uffd_wp case that can occur in the pf variant. The future
khugepaged mTHP code will call this to map the newly collapsed mTHP's
folio.
map_anon_folio_pte_pf(): extends the nopf variant to handle MM_ANONPAGES
counter updates, and mTHP fault allocation statistics for the page fault
path.
The zero-page read path in do_anonymous_page() is also untangled from the
shared setpte label, since it does not allocate a folio and should not
share the same mapping sequence as the write path. We can now leave
nr_pages undeclared at function initialization, and use the single-page
update_mmu_cache() function to handle the zero-page update.
This refactoring will also help reduce code duplication between
mm/memory.c and mm/khugepaged.c, and provides a clean API for PTE-level
anonymous folio mapping that can be reused by future callers (like
khugepaged mTHP support).
Link: https://lkml.kernel.org/r/20260325114022.444081-1-npache@redhat.com Link: https://lkml.kernel.org/r/20260325114022.444081-2-npache@redhat.com Link: https://lore.kernel.org/all/20260122192841.128719-1-npache@redhat.com Signed-off-by: Nico Pache <npache@redhat.com> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Rientjes <rientjes@google.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shivank Garg <shivankg@amd.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Takashi Iwai (SUSE) <tiwai@suse.de> Cc: 
Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vishal Moola (Oracle) <vishal.moola@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Will Deacon <will@kernel.org> Cc: Yang Shi <yang@os.amperecomputing.com> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
In mfill_atomic_hugetlb(), linear_page_index() is used to calculate the
page index for hugetlb_fault_mutex_hash(). However, linear_page_index()
returns the index in PAGE_SIZE units, while hugetlb_fault_mutex_hash()
expects the index in huge page units. This mismatch means that different
addresses within the same huge page can produce different hash values,
leading to the use of different mutexes for the same huge page. This can
cause races between faulting threads, which can corrupt the reservation
map and trigger the BUG_ON in resv_map_release().
Fix this by introducing hugetlb_linear_page_index(), which returns the
page index in huge page granularity, and using it in place of
linear_page_index().
Link: https://lkml.kernel.org/r/20260310110526.335749-1-jianhuizzzzz@gmail.com Fixes: a08c7193e4f1 ("mm/filemap: remove hugetlb special casing in filemap.c") Signed-off-by: Jianhui Zhou <jianhuizzzzz@gmail.com> Reported-by: syzbot+f525fd79634858f478e7@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=f525fd79634858f478e7 Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Jane Chu <jane.chu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: JonasZhou <JonasZhou@zhaoxin.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Cc: SeongJae Park <sj@kernel.org> Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
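The index mismatch can be demonstrated with simple arithmetic. The PAGE_SHIFT/HPAGE_SHIFT values below assume 4K base pages and 2M huge pages, and the two helpers are simplified stand-ins for linear_page_index() and the new hugetlb_linear_page_index(), not the kernel implementations:

```c
#include <assert.h>

#define PAGE_SHIFT  12UL   /* 4K base pages */
#define HPAGE_SHIFT 21UL   /* assumption: 2M huge pages */

/* PAGE_SIZE-granularity index, as linear_page_index() returns it (VMA
 * offset terms dropped for brevity). Two addresses within the same huge
 * page can yield different values, so hugetlb_fault_mutex_hash() could
 * pick different mutexes for the same huge page. */
static unsigned long page_index(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}

/* Huge-page-granularity index, which hugetlb_fault_mutex_hash() expects
 * and which the new hugetlb_linear_page_index() provides: all addresses
 * within one huge page map to the same value. */
static unsigned long huge_page_index(unsigned long addr)
{
    return addr >> HPAGE_SHIFT;
}
```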
SeongJae Park [Wed, 11 Mar 2026 05:29:26 +0000 (22:29 -0700)]
mm/damon/lru_sort: respect addr_unit on default monitoring region setup
In the past, damon_set_region_biggest_system_ram_default(), the core
function for setting the default monitoring target region of
DAMON_LRU_SORT, didn't support addr_unit. Hence DAMON_LRU_SORT silently
ignored the user input for addr_unit when the user didn't explicitly set
the monitoring target regions and the default target region was therefore
used. No real problem from that ignorance has been reported so far, but
the implicit rule only makes things confusing. Also, the default target
region setup function has now been updated to support addr_unit, so there
is no reason to keep ignoring it. Respect the user-passed addr_unit for
the default monitoring target region use case.
SeongJae Park [Wed, 11 Mar 2026 05:29:25 +0000 (22:29 -0700)]
mm/damon/reclaim: respect addr_unit on default monitoring region setup
In the past, damon_set_region_biggest_system_ram_default(), the core
function for setting the default monitoring target region of
DAMON_RECLAIM, didn't support addr_unit. Hence DAMON_RECLAIM silently
ignored the user input for addr_unit when the user didn't explicitly set
the monitoring target regions and the default target region was therefore
used. No real problem from that ignorance has been reported so far, but
the implicit rule only makes things confusing. Also, the default target
region setup function has now been updated to support addr_unit, so there
is no reason to keep ignoring it. Respect the user-passed addr_unit for
the default monitoring target region use case.
SeongJae Park [Wed, 11 Mar 2026 05:29:24 +0000 (22:29 -0700)]
mm/damon/core: receive addr_unit on damon_set_region_biggest_system_ram_default()
damon_find_biggest_system_ram() did not support addr_unit in the past.
Hence its caller, damon_set_region_biggest_system_ram_default(), did not
support addr_unit either. The previous commit updated the inner function
to support addr_unit, so there is no longer a reason for
damon_set_region_biggest_system_ram_default() not to support it; keeping
it this way only creates unnecessary inconsistency in addr_unit support.
Update it to receive addr_unit and handle it internally.
SeongJae Park [Wed, 11 Mar 2026 05:29:23 +0000 (22:29 -0700)]
mm/damon/core: support addr_unit on damon_find_biggest_system_ram()
damon_find_biggest_system_ram() sets an 'unsigned long' variable with a
'resource_size_t' value. This is fundamentally wrong. On environments
such as 32-bit ARM machines with LPAE (Large Physical Address Extension),
which DAMON supports, the size of 'unsigned long' may be smaller than that
of 'resource_size_t'. It is safe, though, since we restrict the walk to
be done only up to ULONG_MAX.
DAMON supports this address size gap using 'addr_unit'. We didn't add
that support to the function, just to keep the initial support change
small. Now the support is reasonably settled, and this kind of gap only
makes the code inconsistent and confusing. Add 'addr_unit' support to the
function by letting callers pass 'addr_unit' and handling it in the
function. All callers pass 'addr_unit' of 1 for now, though, to keep the
old behavior.
SeongJae Park [Wed, 11 Mar 2026 05:29:22 +0000 (22:29 -0700)]
mm/damon/core: fix wrong end address assignment on walk_system_ram()
Patch series "mm/damon: support addr_unit on default monitoring targets
for modules".
DAMON_RECLAIM and DAMON_LRU_SORT support 'addr_unit' parameters only when
the monitoring target address range is explicitly set. This was
intentional for making the initial 'addr_unit' support change small. Now
'addr_unit' support is being quite stabilized. Having the corner case of
the support is only making the code inconsistent with implicit rules. The
inconsistency makes it easy to confuse [1] readers. After all, there is
no real reason to keep 'addr_unit' support incomplete. Add the support
for the case to improve the readability and more completely support
'addr_unit'.
This series is constructed with five patches. The first one (patch 1)
fixes a small bug that mistakenly assigns inclusive end address to open
end address, which was found from this work. The second and third ones
(patches 2 and 3) extend the default monitoring target setting functions
in the core layer one by one, to support the 'addr_unit' while making no
visible changes. The final two patches (patches 4 and 5) update
DAMON_RECLAIM and DAMON_LRU_SORT to support 'addr_unit' for the default
monitoring target address ranges, by passing the user input to the core
functions.
This patch (of 5):
'struct damon_addr_range' and 'struct resource' represent different types
of address ranges: 'damon_addr_range' is for half-open ranges ([start,
end)), while 'resource' is for fully-closed ranges ([start, end]). But
walk_system_ram() assigns resource->end to damon_addr_range->end without
the inclusiveness adjustment. As a result, the function returns an
address range that is missing the last byte.
The function is used to find and set the biggest system RAM as the
default monitoring target for DAMON_RECLAIM and DAMON_LRU_SORT. Missing
the last byte of such a big range shouldn't be a real problem for real
use cases. That said, the loss is definitely unintended behavior. Do the
correct adjustment.
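The adjustment amounts to a single +1 when converting between the two range conventions. A standalone sketch (closed_range/open_range and closed_to_open() are illustrative stand-ins for 'struct resource', 'struct damon_addr_range' and the fixed assignment, not the kernel code):

```c
#include <assert.h>

/* 'struct resource' uses a fully-closed range [start, end];
 * 'struct damon_addr_range' uses a half-open range [start, end). */
struct closed_range { unsigned long start, end; };
struct open_range   { unsigned long start, end; };

/* The fix: converting closed to half-open needs a +1 on the end,
 * otherwise the last byte of the resource is lost. */
static void closed_to_open(const struct closed_range *res,
                           struct open_range *r)
{
    r->start = res->start;
    r->end = res->end + 1;
}
```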
mm/mremap: check map count under mmap write lock and abstract
We are checking the mmap count in check_mremap_params(), prior to
obtaining an mmap write lock, which means that accesses to
current->mm->map_count might race with this field being updated.
Resolve this by only checking this field after the mmap write lock is held.
Additionally, abstract this check into a helper function with extensive
ASCII documentation of what's going on.
Link: https://lkml.kernel.org/r/18be0b48eaa8e8804eb745974ee729c3ade0c687.1773249037.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reported-by: Jianzhou Zhao <luckd0g@163.com> Closes: https://lore.kernel.org/all/1a7d4c26.6b46.19cdbe7eaf0.Coremail.luckd0g@163.com/ Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: abstract reading sysctl_max_map_count, and READ_ONCE()
Concurrent reads and writes of sysctl_max_map_count are possible, so we
should READ_ONCE() and WRITE_ONCE().
The sysctl procfs logic already enforces WRITE_ONCE(), so abstract the
read side with get_sysctl_max_map_count().
While we're here, also move the field to mm/internal.h and add the getter
there; since only mm interacts with it, there's no need for anybody else
to have access.
Finally, update the VMA userland tests to reflect the change.
Link: https://lkml.kernel.org/r/0715259eb37cbdfde4f9e5db92a20ec7110a1ce5.1773249037.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Jianzhou Zhao <luckd0g@163.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
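A userspace sketch of the accessor. READ_ONCE() here is a simplified stand-in for the kernel macro, and 65530 is the kernel's default max_map_count value:

```c
#include <assert.h>

/* Userspace stand-in for the kernel's READ_ONCE(): force a single read
 * through a volatile-qualified pointer so the compiler cannot tear,
 * cache, or re-load the value. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

static int sysctl_max_map_count = 65530;  /* kernel default */

/* The new accessor: all mm-internal readers go through this, pairing
 * with the WRITE_ONCE() the sysctl procfs code already performs. */
static inline int get_sysctl_max_map_count(void)
{
    return READ_ONCE(sysctl_max_map_count);
}
```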
Firstly, in mremap(), it appears that our map count checks have been overly
conservative - there is simply no reason to require headroom of 4 mappings
prior to moving the VMA; we have only needed headroom of 2 VMAs since
commit 659ace584e7a ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()").
Likely the original headroom of 4 mappings was a mistake, and 3 was
actually intended.
Next, we access sysctl_max_map_count in a number of places without being
all that careful about how we do so.
We introduce a simple helper that READ_ONCE()'s the field
(get_sysctl_max_map_count()) to ensure that the field is accessed
correctly. The WRITE_ONCE() side is already handled by the sysctl procfs
code in proc_int_conv().
We also move this field to internal.h as there's no reason for anybody
else to access it outside of mm. Unfortunately we have to maintain the
extern variable, as mmap.c implements the procfs code.
Finally, we are accessing current->mm->map_count without holding the mmap
write lock, which is also not correct, so this series ensures the lock is
held before we access it.
We also abstract the check to a helper function, and add ASCII diagrams to
explain why we're doing what we're doing.
This patch (of 3):
We currently check whether moving a VMA during mremap() might violate the
vm.max_map_count sysctl limit.
This was introduced in the mists of time prior to 2.6.12.
At this point in time, as now, the move_vma() operation would copy the VMA
(+1 mapping if not merged), then potentially split the source VMA upon
unmap.
Prior to commit 659ace584e7a ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()"), a VMA split would check whether
mm->map_count >= sysctl_max_map_count prior to a split before it ran.
On unmap of the source VMA, if we are moving a partial VMA, we might split
the VMA twice.
This would mean, on invocation of split_vma() (as was), we'd check whether
mm->map_count >= sysctl_max_map_count with a map count elevated by one,
then again with a map count elevated by two, ending up with a map count
elevated by three.
At this point we'd reduce the map count on unmap.
At the start of move_vma(), there was a check that has remained throughout
mremap()'s history of mm->map_count >= sysctl_max_map_count - 3 (which
implies mm->map_count + 4 > sysctl_max_map_count - that is, we must have
headroom for 4 additional mappings).
After mm->map_count is elevated by 3, it is decremented by one once the
unmap completes. The mmap write lock is held, so nothing else will observe
mm->map_count > sysctl_max_map_count.
It appears this check was always incorrect - it should have been either
'mm->map_count > sysctl_max_map_count - 3' or 'mm->map_count >=
sysctl_max_map_count - 2'.
After commit 659ace584e7a ("mmap: don't return ENOMEM when mapcount is
temporarily exceeded in munmap()"), the map count check on split is
eliminated in the newly introduced __split_vma(), which the unmap path
uses, and has that path check whether mm->map_count >=
sysctl_max_map_count.
This is valid since, net, an unmap can only cause an increase in map count
of 1 (split both sides, unmap middle).
Since we only copy a VMA and (if MREMAP_DONTUNMAP is not set) unmap
afterwards, the maximum number of additional mappings that will actually be
subject to any check will be 2.
Therefore, update the check to assert this corrected value. Additionally,
update the check introduced by commit ea2c3f6f5545 ("mm,mremap: bail out
earlier in mremap_to under map pressure") to account for this.
While we're here, clean up the comment prior to that.
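The corrected headroom check can be sketched as plain C; the function name and the exact comparison form are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Moving a VMA needs headroom for at most 2 additional mappings --
 * the copied VMA plus, net, one split on unmap -- rather than the 4
 * that the historical check demanded.
 */
static bool map_count_allows_move(int map_count, int max_map_count)
{
	return map_count + 2 <= max_map_count;
}
```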
Link: https://lkml.kernel.org/r/cover.1773249037.git.ljs@kernel.org Link: https://lkml.kernel.org/r/73e218c67dcd197c5331840fb011e2c17155bfb0.1773249037.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Jianzhou Zhao <luckd0g@163.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kexin Sun [Thu, 12 Mar 2026 05:38:12 +0000 (13:38 +0800)]
kasan: update outdated comment
kmalloc_large() was renamed kmalloc_large_noprof() by commit 7bd230a26648
("mm/slab: enable slab allocation tagging for kmalloc and friends"), and
subsequently renamed __kmalloc_large_noprof() by commit a0a44d9175b3 ("mm,
slab: don't wrap internal functions with alloc_hooks()"), making it an
internal implementation detail.
Large kmalloc allocations are now performed through the public kmalloc()
interface directly, making the reference to KMALLOC_MAX_SIZE also stale
(KMALLOC_MAX_CACHE_SIZE would be more accurate). Remove the references to
kmalloc_large() and KMALLOC_MAX_SIZE, and rephrase the description for
large kmalloc allocations.
Usama Arif [Thu, 12 Mar 2026 10:47:23 +0000 (03:47 -0700)]
mm: migrate: requeue destination folio on deferred split queue
During folio migration, __folio_migrate_mapping() removes the source folio
from the deferred split queue, but the destination folio is never
re-queued. This causes underutilized THPs to escape the shrinker after
NUMA migration, since they silently drop off the deferred split list.
Fix this by recording whether the source folio was on the deferred split
queue and its partially mapped state before move_to_new_folio() unqueues
it, and re-queuing the destination folio after a successful migration if
it was.
By the time migrate_folio_move() runs, partially mapped folios without a
pin have already been split by migrate_pages_batch(). So only two cases
remain on the deferred list at this point:
1. Partially mapped folios with a pin (split failed).
2. Fully mapped but potentially underused folios. The recorded
partially_mapped state is forwarded to deferred_split_folio() so that
the destination folio is correctly re-queued in both cases.
Because THPs are removed from the deferred_list, the THP shrinker cannot
split the underutilized THPs in time. As a result, users will see
less free memory than before.
Link: https://lkml.kernel.org/r/20260312104723.1351321-1-usama.arif@linux.dev Fixes: dafff3f4c850 ("mm: split underused THPs") Signed-off-by: Usama Arif <usama.arif@linux.dev> Reported-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nico Pache <npache@redhat.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Waiman Long [Wed, 11 Mar 2026 20:05:26 +0000 (16:05 -0400)]
selftest: memcg: skip memcg_sock test if address family not supported
The test_memcg_sock test in memcontrol.c sets up an IPv6 socket and sends
data over it to consume memory and verify that memory.stat.sock and
memory.current values are close.
On systems where IPv6 isn't enabled or not configured to support
SOCK_STREAM, the test_memcg_sock test always fails. When the socket()
call fails, there is no way we can test the memory consumption and verify
the above claim. I believe it is better to just skip the test in this
case instead of reporting a test failure hinting that there may be
something wrong with the memcg code.
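The skip decision might look roughly like the following user-space sketch; the helper name is hypothetical, and KSFT_SKIP/KSFT_FAIL are the standard kselftest exit codes:

```c
#include <assert.h>
#include <errno.h>

#define KSFT_FAIL	1
#define KSFT_SKIP	4

/*
 * A failed socket() due to an unsupported address family means the
 * environment cannot exercise the test at all, so report a skip
 * rather than a memcg failure. Illustrative only; the real selftest
 * lives in tools/testing/selftests/cgroup.
 */
static int classify_socket_error(int err)
{
	if (err == EAFNOSUPPORT || err == ESOCKTNOSUPPORT)
		return KSFT_SKIP;	/* IPv6/SOCK_STREAM unavailable */
	return KSFT_FAIL;		/* anything else is a real failure */
}
```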
Link: https://lkml.kernel.org/r/20260311200526.885899-1-longman@redhat.com Fixes: 5f8f019380b8 ("selftests: cgroup/memcontrol: add basic test for socket accounting") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Michal Koutný <mkoutny@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Breno Leitao [Tue, 17 Mar 2026 15:33:59 +0000 (08:33 -0700)]
mm: ratelimit min_free_kbytes adjustment messages
The "raising min_free_kbytes" pr_info message in
set_recommended_min_free_kbytes() and the "min_free_kbytes is not updated
to" pr_warn in calculate_min_free_kbytes() can spam the kernel log when
called repeatedly.
Switch the pr_info in set_recommended_min_free_kbytes() and the pr_warn in
calculate_min_free_kbytes() to their _ratelimited variants to prevent the
log spam for this message.
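The ratelimiting idea can be illustrated with a minimal user-space sketch (the kernel's pr_info_ratelimited() tracks jiffies internally; here time is passed in explicitly to keep the sketch testable):

```c
#include <assert.h>
#include <stdbool.h>

/* Allow at most `burst` messages per `interval` time units. */
struct ratelimit {
	long interval;	/* window length */
	int burst;	/* messages allowed per window */
	long begin;	/* start of current window */
	int printed;	/* messages emitted in current window */
};

static bool ratelimit_allow(struct ratelimit *rl, long now)
{
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;	/* start a new window */
		rl->printed = 0;
	}
	if (rl->printed >= rl->burst)
		return false;		/* suppress the message */
	rl->printed++;
	return true;
}
```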
Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-4-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Zi Yan <ziy@nvidia.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Breno Leitao [Tue, 17 Mar 2026 15:33:58 +0000 (08:33 -0700)]
mm: huge_memory: refactor enabled_store() with set_global_enabled_mode()
Refactor enabled_store() to use a new set_global_enabled_mode() helper.
Introduce a separate enum global_enabled_mode and
global_enabled_mode_strings[], mirroring the anon_enabled_mode pattern
from the previous commit.
A separate enum is necessary because the global THP setting does not
support "inherit", only "always", "madvise", and "never". Reusing
anon_enabled_mode would leave a NULL gap in the string array, causing
sysfs_match_string() to stop early and fail to match entries after the
gap.
The helper uses the same loop pattern as set_anon_enabled_mode(),
iterating over an array of flag bit positions and using
test_and_set_bit()/test_and_clear_bit() to track whether the state
actually changed.
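The NULL-gap failure mode can be demonstrated with a small sketch that, like sysfs_match_string(), stops at the first NULL entry; names are illustrative:

```c
#include <assert.h>
#include <string.h>

/* Like sysfs_match_string(), treats the first NULL as end-of-array. */
static int match_string_sketch(const char *const *arr, const char *s)
{
	for (int i = 0; arr[i]; i++)	/* stops at the NULL gap */
		if (!strcmp(arr[i], s))
			return i;
	return -1;
}

/* "inherit" is per-order only, leaving a gap in a shared table. */
static const char *const gapped[] = {
	"always", NULL /* inherit: unsupported globally */,
	"madvise", "never", NULL,
};
```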
Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-3-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Breno Leitao [Tue, 17 Mar 2026 15:33:57 +0000 (08:33 -0700)]
mm: huge_memory: refactor anon_enabled_store() with set_anon_enabled_mode()
Consolidate the repeated spin_lock/set_bit/clear_bit pattern in
anon_enabled_store() into a new set_anon_enabled_mode() helper that loops
over an orders[] array, setting the bit for the selected mode and clearing
the others.
Introduce enum anon_enabled_mode and anon_enabled_mode_strings[] for the
per-order anon THP setting.
Use sysfs_match_string() with the anon_enabled_mode_strings[] table to
replace the if/else chain of sysfs_streq() calls.
The helper uses __test_and_set_bit()/__test_and_clear_bit() to track
whether the state actually changed, so start_stop_khugepaged() is only
called when needed. When the mode is unchanged,
set_recommended_min_free_kbytes() is called directly to preserve the
watermark recalculation behavior of the original code.
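The change-detection loop can be sketched in user-space C; the non-atomic helpers stand in for __test_and_set_bit()/__test_and_clear_bit(), and all names are illustrative:

```c
#include <assert.h>
#include <stdbool.h>

static bool test_and_set_bit_sketch(unsigned long *word, int bit)
{
	bool was_set = *word & (1UL << bit);

	*word |= 1UL << bit;
	return was_set;
}

static bool test_and_clear_bit_sketch(unsigned long *word, int bit)
{
	bool was_set = *word & (1UL << bit);

	*word &= ~(1UL << bit);
	return was_set;
}

/*
 * Set the selected mode's bit, clear the others, and report whether
 * anything actually flipped -- so expensive follow-up work (like
 * start_stop_khugepaged()) can be skipped when nothing changed.
 */
static bool set_mode(unsigned long *flags, const int *bits, int nbits,
		     int selected)
{
	bool changed = false;

	for (int i = 0; i < nbits; i++) {
		if (i == selected)
			changed |= !test_and_set_bit_sketch(flags, bits[i]);
		else
			changed |= test_and_clear_bit_sketch(flags, bits[i]);
	}
	return changed;
}
```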
Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-2-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: thp: reduce unnecessary start_stop_khugepaged()", v7.
Writing to /sys/kernel/mm/transparent_hugepage/enabled causes
start_stop_khugepaged() to be called regardless of whether anything
changed. start_stop_khugepaged() then spams the printk ring buffer with
the exact same message, even when nothing changes.
For instance, if you have a custom vm.min_free_kbytes, just touching
/sys/kernel/mm/transparent_hugepage/enabled causes a printk message.
Example:
# sysctl -w vm.min_free_kbytes=112382
# for i in $(seq 100); do echo never > /sys/kernel/mm/transparent_hugepage/enabled ; done
and you have 100 WARN messages like the following, which is pretty dull:
khugepaged: min_free_kbytes is not updated to 112381 because user defined value 112382 is preferred
A similar message shows up when setting thp to "always":
# for i in $(seq 100); do
# echo 1024 > /proc/sys/vm/min_free_kbytes
# echo always > /sys/kernel/mm/transparent_hugepage/enabled
# done
And then, we have 100 messages like:
khugepaged: raising min_free_kbytes from 1024 to 67584 to help transparent hugepage allocations
This is more common when you have a configuration management system that
writes the THP configuration without an extra read, assuming that nothing
will happen if there is no change in the configuration, but it prints
these annoying messages.
For instance, at Meta's fleet, ~10K servers were producing 3.5M of these
messages per day.
Fix this by making the sysfs _store helpers easier to digest and
ratelimiting the message.
This patch (of 4):
Make set_recommended_min_free_kbytes() callable from outside khugepaged.c
by removing the static qualifier and adding a declaration in
mm/internal.h.
This allows callers that change THP settings to recalculate watermarks
without going through start_stop_khugepaged().
Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-0-31eb98fa5a8b@debian.org Link: https://lkml.kernel.org/r/20260317-thp_logs-v7-1-31eb98fa5a8b@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Suggested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Add a new DAMON sysfs interface file, namely 'goal_tuner' under the DAMOS
quotas directory. It is connected to the damos_quota->goal_tuner field.
Users can therefore select their favorite goal-based quotas tuning
algorithm by writing the name of the tuner to the file. Reading the file
returns the name of the currently selected tuner.
Introduce a new goal-based DAMOS quota auto-tuning algorithm, namely
DAMOS_QUOTA_GOAL_TUNER_TEMPORAL (temporal in short). The algorithm aims
to trigger the DAMOS action only for a temporal time, to achieve the goal
as soon as possible. For the temporal period, it uses as much quota as
allowed. Once the goal is achieved, it sets the quota to zero,
effectively deactivating the scheme.
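The temporal tuner's decision rule reduces to a one-liner; this is an illustrative sketch with arbitrary units, not the actual DAMON code:

```c
#include <assert.h>

/*
 * As long as the goal is under-achieved, use all the quota allowed;
 * once it is achieved, set the quota to zero, deactivating the scheme.
 */
static unsigned long temporal_tune(unsigned long current_value,
				   unsigned long target_value,
				   unsigned long max_quota)
{
	return current_value < target_value ? max_quota : 0;
}
```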
SeongJae Park [Tue, 10 Mar 2026 01:05:18 +0000 (18:05 -0700)]
mm/damon/core: allow quota goals set zero effective size quota
User-explicit quotas (size and time quotas) having zero value means the
quotas are unset. And, effective size quota is set as the minimum value
of the explicit quotas. When quota goals are set, the goal-based quota
tuner can make it lower. But the sole existing tuner never sets
the effective size quota to zero. Because of this, DAMON core assumes a
zero effective quota means the user has set no quota.
Multiple tuners are now allowed, though. In the future, some tuners might
want to set a zero effective size quota. There is no reason to restrict
that. Meanwhile, because of the current implementation, it will only
deactivate all quotas and make the scheme work at its full speed.
Introduce a dedicated function for checking whether no quota is set. The
function checks whether the user-set explicit quotas are zero and no goal
is installed. It is decoupled from the zero effective quota, and hence
allows future tuners to set a zero effective quota to intentionally
deactivate the scheme.
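The predicate might be sketched as follows; the struct and field names are illustrative, not DAMON's actual damos_quota layout:

```c
#include <assert.h>
#include <stdbool.h>

struct quota_sketch {
	unsigned long size_quota;	/* user-set size quota, 0 = unset */
	unsigned long time_quota;	/* user-set time quota, 0 = unset */
	int nr_goals;			/* installed quota goals */
};

/*
 * Quotas count as unset only when both user-specified quotas are zero
 * AND no goal is installed, so a tuner-produced zero *effective*
 * quota no longer reads as "no limit".
 */
static bool no_quota_set(const struct quota_sketch *q)
{
	return !q->size_quota && !q->time_quota && !q->nr_goals;
}
```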
SeongJae Park [Tue, 10 Mar 2026 01:05:17 +0000 (18:05 -0700)]
mm/damon/core: introduce damos_quota_goal_tuner
Patch series "mm/damon: support multiple goal-based quota tuning
algorithms".
Aim-oriented DAMOS quota auto-tuning uses a single tuning algorithm. The
algorithm is designed to find a quota value that should be consistently
kept for achieving the aimed goal for long term. It is useful and
reliable at automatically operating systems that have dynamic environments
in the long term.
As always, however, no single algorithm fits all. When the environment
has static characteristics or there are control towers in not only the
kernel space but also the user space, the algorithm shows some
limitations. In such environments, users want the kernel to work in a
more short-term, deterministic way. Indeed, there have been at least two
reports [1,2] of
such cases.
Extend DAMOS quotas goal to support multiple quota tuning algorithms that
users can select. Keep the current algorithm as the default one, to not
break the old users. Also give it a name, "consist", as it is designed to
"consistently" apply the DAMOS action. And introduce a new tuning
algorithm, namely "temporal". It is designed to apply the DAMOS action
only temporally, in a deterministic way. In more detail, as long as the
goal is under-achieved, it uses the maximum quota available. Once the
goal is over-achieved, it sets the quota zero.
Tests
=====
I confirmed the feature is working as expected using the latest version of
DAMON user-space tool, like below.
Note that >=3.1.8 version of DAMON user-space tool supports this feature
(--damos_quota_goal_tuner). As expected, DAMOS stopped reclaiming memory
as soon as the goal amount of free memory was made. When the 'consist'
tuner was used, the reclamation continued even after the goal amount of
free memory was made, resulting in more than the goal amount of free
memory, as expected.
Patch Sequence
==============
First four patches implement the features. Patch 1 extends core API to
allow multiple tuners and makes the current tuner the default and only
available tuner, namely 'consist'. Patch 2 allows future tuners to set a
zero effective quota. Patch 3 introduces the second tuner, namely
'temporal'. Patch 4 further extends DAMON sysfs API to let users use
that.
Three following patches (patches 5-7) update design, usage, and ABI
documents, respectively.
Final four patches (patches 8-11) are for adding tests. The eighth patch
(patch 8) extends the kunit test for online parameters commit for
validating the goal_tuner. The ninth and the tenth patches (patches 9-10)
extend the testing-purpose DAMON sysfs control helper and DAMON status
dumping tool to support the newly added feature. The final eleventh one
(patch 11) extends the existing online commit selftest to cover the new
feature.
This patch (of 11):
DAMOS quota goal feature utilizes a single feedback loop based algorithm
for automatic tuning of the effective quota. It is useful in dynamic
environments where systems are operated by the kernel alone in the long
term. But no single algorithm fits all. It is not easy to control in
environments that have more controlled characteristics and user-space
control towers. We have in fact received multiple reports [1,2] of use
cases where the algorithm is not optimal.
Introduce a new field of 'struct damos_quota', namely 'goal_tuner'. It
specifies what tuning algorithm the given scheme should use, and allows
DAMON API callers to set it as they want. Nonetheless, this commit
introduces no new tuning algorithm but only the interface. This commit
hence makes no behavioral change. A new algorithm will be added by the
following commit.
Hui Zhu [Tue, 10 Mar 2026 01:56:57 +0000 (09:56 +0800)]
mm/swap: strengthen locking assertions and invariants in cluster allocation
swap_cluster_alloc_table() requires several locks to be held by its
callers: ci->lock, the per-CPU swap_cluster lock, and, for non-solid-state
devices (non-SWP_SOLIDSTATE), the si->global_cluster_lock.
While most call paths (e.g., via cluster_alloc_swap_entry() or
alloc_swap_scan_list()) correctly acquire these locks before invocation,
the path through swap_reclaim_work() -> swap_reclaim_full_clusters() ->
isolate_lock_cluster() is distinct. This path operates exclusively on
si->full_clusters, where the swap allocation tables are guaranteed to be
already allocated. Consequently, isolate_lock_cluster() should never
trigger a call to swap_cluster_alloc_table() for these clusters.
Strengthen the locking and state assertions to formalize these invariants:
1. Add a lockdep_assert_held() for si->global_cluster_lock in
swap_cluster_alloc_table() for non-SWP_SOLIDSTATE devices.
2. Reorder existing lockdep assertions in swap_cluster_alloc_table() to
match the actual lock acquisition order (per-CPU lock, then global lock,
then cluster lock).
3. Add a VM_WARN_ON_ONCE() in isolate_lock_cluster() to ensure that table
allocations are only attempted for clusters being isolated from the
free list. Attempting to allocate a table for a cluster from other
lists (like the full list during reclaim) indicates a violation of
subsystem invariants.
These changes ensure locking consistency and help catch potential
synchronization or logic issues during development.
Anthony Yznaga [Tue, 10 Mar 2026 15:58:20 +0000 (08:58 -0700)]
mm: prevent droppable mappings from being locked
Droppable mappings must not be lockable. There is a check for VMAs with
VM_DROPPABLE set in mlock_fixup() along with checks for other types of
unlockable VMAs which ensures this when calling mlock()/mlock2().
For mlockall(MCL_FUTURE), the check for unlockable VMAs is different. In
apply_mlockall_flags(), if the flags parameter has MCL_FUTURE set, the
current task's mm's default VMA flag field mm->def_flags has VM_LOCKED
applied to it. VM_LOCKONFAULT is also applied if MCL_ONFAULT is also set.
When these flags are set as default in this manner they are cleared in
__mmap_complete() for new mappings that do not support mlock. A check for
VM_DROPPABLE in __mmap_complete() is missing resulting in droppable
mappings created with VM_LOCKED set. To fix this and reduce the chance
of similar bugs in the future, introduce and use vma_supports_mlock().
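The helper's shape can be sketched like this; the flag values and the exact set of excluded flags are illustrative stand-ins, not the kernel's VM_* definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits; the kernel's VM_* values differ. */
#define VM_IO		(1UL << 0)
#define VM_PFNMAP	(1UL << 1)
#define VM_DROPPABLE	(1UL << 2)

/*
 * One predicate that both the mlock()/mlock2() path and mapping
 * completion can share, so droppable (and other special) mappings can
 * never end up with VM_LOCKED set.
 */
static bool vma_supports_mlock_sketch(unsigned long vm_flags)
{
	return !(vm_flags & (VM_IO | VM_PFNMAP | VM_DROPPABLE));
}
```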
Link: https://lkml.kernel.org/r/20260310155821.17869-1-anthony.yznaga@oracle.com Fixes: 9651fcedf7b9 ("mm: add MAP_DROPPABLE for designating always lazily freeable mappings") Signed-off-by: Anthony Yznaga <anthony.yznaga@oracle.com> Suggested-by: David Hildenbrand <david@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Tested-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
zram: unify and harden algo/priority params handling
We have two functions that accept algo= and priority= params -
algorithm_params_store() and recompress_store(). This patch unifies and
hardens handling of those parameters.
There are 4 possible cases:
- only priority= provided [recommended]
We need to verify that the provided priority value is
within permitted range for each particular function.
- both algo= and priority= provided
We cannot prioritize one over another. All we should
do is to verify that zram is configured in the way
that user-space expects it to be. Namely that zram
indeed has compressor algo= setup at given priority=.
- only algo= provided [not recommended]
We should lookup priority in compressors list.
- none provided [not recommended]
Just use function's defaults.
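The four cases above can be sketched as a single resolution helper; the names, the -1 error convention, and the table layout are illustrative:

```c
#include <assert.h>
#include <string.h>

#define MAX_COMPS 4

/*
 * Resolve algo=/priority= params against the per-priority algorithm
 * table `comps`. Returns the resolved priority or -1 on error.
 */
static int resolve_prio(const char *const comps[MAX_COMPS],
			const char *algo, int prio, int def_prio)
{
	if (algo && prio >= 0) {
		/* Both given: verify they agree with the device setup. */
		if (prio >= MAX_COMPS || !comps[prio] ||
		    strcmp(comps[prio], algo))
			return -1;
		return prio;
	}
	if (prio >= 0)			/* priority only: range-check it */
		return (prio < MAX_COMPS && comps[prio]) ? prio : -1;
	if (algo) {			/* algo only: look up its priority */
		for (int i = 0; i < MAX_COMPS; i++)
			if (comps[i] && !strcmp(comps[i], algo))
				return i;
		return -1;
	}
	return def_prio;		/* neither: use the default */
}
```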
Link: https://lkml.kernel.org/r/20260311084312.1766036-7-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Chained recompression has unpredictable behavior and is not useful in
practice.
First, systems usually configure just one alternative recompression
algorithm, which has slower compression/decompression but better
compression ratio. A single alternative algorithm doesn't need chaining.
Second, even with multiple recompression algorithms, chained recompression
is suboptimal. If a lower priority algorithm succeeds, the page is never
attempted with a higher priority algorithm, leading to worse memory
savings. If a lower priority algorithm fails, the page is still attempted
with a higher priority algorithm, wasting resources on the failed lower
priority attempt.
In either case, the system would be better off targeting a specific
priority directly.
Chained recompression also significantly complicates the code. Remove it.
Link: https://lkml.kernel.org/r/20260311084312.1766036-6-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Emphasize usage of the `priority` parameter for recompression and explain
why `algo` parameter can lead to unexpected behavior and thus is not
recommended.
Link: https://lkml.kernel.org/r/20260311084312.1766036-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
It's not entirely correct to use ->num_active_comps for the max-prio limit, as
->num_active_comps just tells the number of configured algorithms, not the
max configured priority. For instance, in the following theoretical
example:
[lz4] [nil] [nil] [deflate]
->num_active_comps is 2, while the actual max-prio is 3.
Drop ->num_active_comps and use ZRAM_MAX_COMPS instead.
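The mismatch is easy to demonstrate in a sketch; ZRAM_MAX_COMPS here is just an illustrative constant:

```c
#include <assert.h>
#include <stddef.h>

#define ZRAM_MAX_COMPS 4

/* Counts configured slots -- what ->num_active_comps tracked. */
static int num_active_comps(const char *const comps[ZRAM_MAX_COMPS])
{
	int n = 0;

	for (int i = 0; i < ZRAM_MAX_COMPS; i++)
		if (comps[i])
			n++;
	return n;
}

/* Highest configured priority -- what the limit check actually needs. */
static int max_active_prio(const char *const comps[ZRAM_MAX_COMPS])
{
	int max = -1;

	for (int i = 0; i < ZRAM_MAX_COMPS; i++)
		if (comps[i])
			max = i;
	return max;
}
```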
Link: https://lkml.kernel.org/r/20260311084312.1766036-4-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Suggested-by: Minchan Kim <minchan@kernel.org> Cc: Brian Geffon <bgeffon@google.com> Cc: gao xu <gaoxu2@honor.com> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "zram: recompression cleanups and tweaks", v2.
This series is a somewhat random mix of fixups, recompression cleanups and
improvements partly based on internal conversations. A few patches in the
series remove unexpected or confusing behaviour, e.g. auto correction of
bad priority= param for recompression, which should have always been just
an error. Then it also removes "chain recompression" which has a tricky,
unexpected and confusing behaviour at times. We also unify and harden the
handling of algo/priority params. There is also an addition of missing
device lock in algorithm_params_store() which previously permitted
modification of algo params while the device is active.
This patch (of 6):
First, algorithm_params_store(), like any sysfs handler, should grab
device lock.
Second, like any write() sysfs handler, it should grab device lock in
exclusive mode.
Third, it should not permit change of algos' parameters after device init,
as this doesn't make sense - we cannot compress with one C/D dict and then
just change C/D dict to a different one, for example.
Another thing to notice is that algorithm_params_store() accesses device's
->comp_algs for algo priority lookup, which should be protected by device
lock in exclusive mode in general.
Pratyush Yadav [Mon, 9 Mar 2026 12:34:07 +0000 (12:34 +0000)]
kho: drop restriction on maximum page order
KHO currently restricts the maximum order of a restored page to the
maximum order supported by the buddy allocator. While this works fine for
much of the data passed across kexec, it is possible to have pages larger
than MAX_PAGE_ORDER.
For one, it is possible to get a larger order when using
kho_preserve_pages() if the number of pages is large enough, since it
tries to combine multiple aligned 0-order preservations into one higher
order preservation.
For another, upcoming support for hugepages can have gigantic hugepages
being preserved over KHO.
There is no real reason for this limit. The KHO preservation machinery
can handle any page order. Remove this artificial restriction on max page
order.
kho: make sure preservations do not span multiple NUMA nodes
The KHO restoration machinery is not capable of dealing with preservations
that span multiple NUMA nodes. kho_preserve_folio() guarantees the
preservation will only span one NUMA node since folios can't span multiple
nodes.
This leaves kho_preserve_pages(). Semantically, kho_preserve_pages()
only deals with 0-order pages, so all preservations should be single
pages; in practice, however, it combines preservations into higher orders
for efficiency. This can result in a preservation spanning multiple nodes.
Break up the preservations into a smaller order if that happens.
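The splitting rule can be sketched as follows; node_of_pfn() is a hypothetical stand-in for a real PFN-to-node lookup, using a toy node boundary:

```c
#include <assert.h>

/* Toy memory layout: node boundary at pfn 1024. */
static int node_of_pfn(unsigned long pfn)
{
	return pfn < 1024 ? 0 : 1;
}

/*
 * If a higher-order block of pages starting at `pfn` would cross a
 * NUMA node boundary, keep halving the order until the block fits
 * entirely inside one node.
 */
static int fit_order(unsigned long pfn, int order)
{
	while (order > 0 &&
	       node_of_pfn(pfn) != node_of_pfn(pfn + (1UL << order) - 1))
		order--;	/* block spans two nodes: halve it */
	return order;
}
```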
vma_mmu_pagesize() is also queried on non-hugetlb VMAs and does not really
belong in hugetlb.c.
PPC64 provides a custom override with CONFIG_HUGETLB_PAGE, see
arch/powerpc/mm/book3s64/slice.c, so we cannot easily make this a static
inline function.
So let's move it to vma.c and add some proper kerneldoc.
To make vma tests happy, add a simple vma_kernel_pagesize() stub in
tools/testing/vma/include/custom.h.
Link: https://lkml.kernel.org/r/20260309151901.123947-3-david@kernel.org Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: move vma_kernel_pagesize() from hugetlb to mm.h
Patch series "mm: move vma_(kernel|mmu)_pagesize() out of hugetlb.c", v2.
Looking into vma_(kernel|mmu)_pagesize(), I realized that there is one
scenario where DAX would not do the right thing when the kernel is not
compiled with hugetlb support.
Without hugetlb support, vma_(kernel|mmu)_pagesize() will always return
PAGE_SIZE instead of using the ->pagesize() result provided by dax-device
code.
Fix that by moving vma_kernel_pagesize() to core MM code, where it
belongs. I don't think this is stable material, but am not 100% sure.
Also, move vma_mmu_pagesize() while at it. Remove the unnecessary
hugetlb.h inclusion from KVM code.
This patch (of 4):
In the past, only hugetlb had special "vma_kernel_pagesize()"
requirements, so it provided its own implementation.
In commit 05ea88608d4e ("mm, hugetlbfs: introduce ->pagesize() to
vm_operations_struct") we generalized that approach by providing a
vm_ops->pagesize() callback to be used by device-dax.
Once device-dax started using that callback in commit c1d53b92b95c
("device-dax: implement ->pagesize() for smaps to report MMUPageSize") it
was missed that CONFIG_DEV_DAX does not depend on hugetlb support.
So building a kernel with CONFIG_DEV_DAX but without CONFIG_HUGETLBFS
would not pick up that value.
Fix it by moving vma_kernel_pagesize() to mm.h, providing only a single
implementation. While at it, improve the kerneldoc a bit.
Ideally, we'd move vma_mmu_pagesize() to the header as well. However, its
__weak symbol might be overridden by a PPC variant in hugetlb code. So
let's leave it there for now, as it really only matters for some hugetlb
oddities.
This was found by code inspection.
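In effect, the generic helper prefers a driver-provided ->pagesize() callback and only falls back to PAGE_SIZE. A minimal user-space sketch of that dispatch (types and constants simplified, not the actual kernel definitions):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL /* illustrative; the real value is arch-specific */

struct vm_area_struct;

/* simplified stand-in for the kernel's vm_operations_struct */
struct vm_operations_struct {
	unsigned long (*pagesize)(struct vm_area_struct *vma);
};

struct vm_area_struct {
	const struct vm_operations_struct *vm_ops;
};

/*
 * Sketch of the generic helper: consult the driver-provided ->pagesize()
 * callback when present (as device-dax provides one), fall back to
 * PAGE_SIZE otherwise -- independent of hugetlb support.
 */
static unsigned long vma_kernel_pagesize(struct vm_area_struct *vma)
{
	if (vma->vm_ops && vma->vm_ops->pagesize)
		return vma->vm_ops->pagesize(vma);
	return PAGE_SIZE;
}

/* e.g. a dax-like mapping backed by 2 MiB pages */
static unsigned long dax_pagesize(struct vm_area_struct *vma)
{
	(void)vma;
	return 2UL << 20;
}

static const struct vm_operations_struct dax_vm_ops = {
	.pagesize = dax_pagesize,
};
```

With the helper in core MM code, a dax-backed VMA reports 2 MiB here even when hugetlb support is compiled out.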
Link: https://lkml.kernel.org/r/20260309151901.123947-1-david@kernel.org Link: https://lkml.kernel.org/r/20260309151901.123947-2-david@kernel.org Fixes: c1d53b92b95c ("device-dax: implement ->pagesize() for smaps to report MMUPageSize") Signed-off-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: "Christophe Leroy (CS GROUP)" <chleroy@kernel.org> Cc: Jann Horn <jannh@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Pedro Falcato <pfalcato@suse.de> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Akinobu Mita [Tue, 10 Mar 2026 15:18:37 +0000 (00:18 +0900)]
docs: mm: fix typo in numa_memory_policy.rst
Fix a typo: MPOL_INTERLEAVED -> MPOL_INTERLEAVE.
Link: https://lkml.kernel.org/r/20260310151837.5888-1-akinobu.mita@gmail.com Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:53:54 +0000 (11:53 -0800)]
Docs/mm/damon/maintainer-profile: use flexible review cadence
The document mentions that the maintainer works in the usual 9-5 fashion.
The maintainer nowadays prefers working in a more flexible way. Update
the document so contributors do not get wrong expectations about response
times.
Link: https://lkml.kernel.org/r/20260307195356.203753-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: wang lian <lianux.mm@gmail.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:53:51 +0000 (11:53 -0800)]
mm/damon/core: clarify damon_set_attrs() usages
damon_set_attrs() is called for multiple purposes from multiple places.
Calling it in an unsafe context can pollute DAMON's internal state and
result in unexpected behaviors. Clarify when it is safe to call, and
where it is being called.
Link: https://lkml.kernel.org/r/20260307195356.203753-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: wang lian <lianux.mm@gmail.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:53:49 +0000 (11:53 -0800)]
mm/damon/core: use mult_frac()
Patch series "mm/damon: improve/fixup/update ratio calculation, test and
documentation".
Yet another batch of misc/minor improvements and fixups. Use mult_frac()
instead of open-coded rate calculations (patch 1). Add a test for a
previously found and fixed bug (patch 2). Improve and update comments and
documentation for easier code review and up-to-date information (patches
3-6). Finally, fix an obvious typo (patch 7).
This patch (of 7):
There are multiple places in core code that open-code rate calculations.
Use mult_frac(), which is designed for doing exactly that in a way that is
safer against overflow and precision loss.
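For reference, mult_frac(x, num, denom) computes x * num / denom without forming the full x * num intermediate. A user-space sketch of the same idea (simplified from the kernel's macro into a standalone function; illustrative only):

```c
#include <assert.h>

/*
 * Split x into quotient and remainder w.r.t. denom, so the full
 * x * num product never has to be formed:
 *   x * num / denom == (x / denom) * num + (x % denom) * num / denom
 * (r * num can still overflow for very large num, as with the kernel
 * macro, but the common overflow case is avoided.)
 */
static unsigned long long mult_frac(unsigned long long x,
				    unsigned long long num,
				    unsigned long long denom)
{
	unsigned long long q = x / denom;
	unsigned long long r = x % denom;

	return q * num + r * num / denom;
}
```

Computing 3/4 of 2^63 this way works, while the naive (x * num) / denom would wrap the intermediate product.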
Link: https://lkml.kernel.org/r/20260307195356.203753-1-sj@kernel.org Link: https://lkml.kernel.org/r/20260307195356.203753-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Acked-by: wang lian <lianux.mm@gmail.com> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Shuah Khan <skhan@linuxfoundation.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:49:14 +0000 (11:49 -0800)]
mm/damon/core: use time_after_eq() in kdamond_fn()
damon_ctx->passed_sample_intervals and damon_ctx->next_*_sis are unsigned
long. They are compared in kdamond_fn() using plain comparison
operators, which are not overflow-safe. Use time_after_eq() instead,
which is overflow-safe when correctly used.
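The trick behind these helpers is comparing the signed difference, which stays correct across a wraparound as long as the two values are within LONG_MAX counts of each other. A user-space sketch of the pattern (same idea as the jiffies helpers in include/linux/jiffies.h):

```c
#include <assert.h>
#include <limits.h>

/*
 * a is at or after b iff the (wrapping) difference a - b, reinterpreted
 * as signed, is non-negative. This remains correct even if the counter
 * wrapped around between b and a.
 */
#define time_after_eq(a, b) ((long)((a) - (b)) >= 0)
#define time_before(a, b)   ((long)((a) - (b)) < 0)
```

For example, if the next-operation time is ULONG_MAX - 3 and the counter has since advanced 10 steps (wrapping to 6), time_after_eq(6, ULONG_MAX - 3) is true, while the plain `6 >= ULONG_MAX - 3` comparison gets it wrong.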
SeongJae Park [Sat, 7 Mar 2026 19:49:13 +0000 (11:49 -0800)]
mm/damon/core: use time_before() for next_apply_sis
damon_ctx->passed_sample_intervals and damos->next_apply_sis are unsigned
long, and are compared via plain comparison operators, which are not
overflow-safe. Use time_before() instead, which is overflow-safe when
correctly used.
Patch series "mm/damon/core: make passed_sample_intervals comparisons
overflow-safe".
DAMON accounts time using its own jiffies-like time counter, namely
damon_ctx->passed_sample_intervals. The counter is incremented on each
iteration of the kdamond_fn() main loop, which sleeps for at least one
sampling interval; hence the name.
DAMON has time-periodic operations, including monitoring results
aggregation and DAMOS action application. DAMON sets the next time to do
each such operation in passed_sample_intervals units, does the operation
when the counter becomes equal to or larger than the pre-set value, and
then updates the next time for the operation. Note that the operation is
done not only when the values exactly match but also when the time has
already passed, because the values can be updated by online-committed
DAMON parameters.
The counter is of 'unsigned long' type, and the comparison is done using
plain comparison operators, which are not overflow-safe. This can cause
rare and limited but odd situations.
Suppose there is an operation that should be executed every 20 sampling
intervals, and the passed_sample_intervals value for the next execution of
the operation is ULONG_MAX - 3. Once passed_sample_intervals reaches
ULONG_MAX - 3, the operation will be executed, and the next time value for
the operation becomes 16 (ULONG_MAX - 3 + 20 wraps around), since overflow
happens. In the next iteration of the kdamond_fn() main loop,
passed_sample_intervals is larger than the next operation time value, so
the operation will be executed again. It will keep executing the
operation on every iteration, until passed_sample_intervals also
overflows.
Note that this will not be common or problematic in the real world. The
sampling interval, which is how long each passed_sample_intervals
increment takes, is 5 ms by default, and is usually [auto-]tuned to
hundreds of
milliseconds. That means it takes about 248 days or 4,971 days to have
the overflow on 32 bit machines when the sampling interval is 5 ms and 100
ms, respectively (1<<32 * sampling_interval_in_seconds / 3600 / 24). On
64 bit machines, the numbers become 2924712086.77536 and 58494241735.5072
years. So the real user impact is negligible. But it is still better to
fix this, as long as the fix is simple and efficient.
Fix this by simply replacing the overflow-unsafe native comparison
operators with the existing overflow-safe time comparison helpers.
The first patch only cleans up the next DAMOS action application time
setup for consistency and reduced code. The second and the third patches
update DAMOS action application time setup and rest, respectively.
This patch (of 3):
There is a function for damos->next_apply_sis setup, but some places
open-code it. Consistently use the helper.
SeongJae Park [Sat, 7 Mar 2026 19:42:21 +0000 (11:42 -0800)]
Docs/mm/damon/design: document the power-of-two limitation for addr_unit
min_region_sz is set as max(DAMON_MIN_REGION_SZ / addr_unit, 1).
DAMON_MIN_REGION_SZ is the same as PAGE_SIZE, and addr_unit is a value
that the user can arbitrarily set. Commit c80f46ac228b ("mm/damon/core:
disallow non-power of two min_region_sz") made min_region_sz always a
power of two. Hence, addr_unit should be a power of two when it is
smaller than PAGE_SIZE. While 'addr_unit' is a user-exposed parameter,
the rule is not documented. This can confuse users. Specifically, if the
user sets addr_unit to a value that is smaller than PAGE_SIZE and not a
power of two, the setup will explicitly fail.
Document the rule on the design document. Usage documents reference the
design document for detail, so updating only the design document should
suffice.
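The constraint can be illustrated with the usual power-of-two bit trick; the following user-space sketch (constants and helper names simplified, not the kernel code) shows why a non-power-of-two addr_unit below PAGE_SIZE yields a non-power-of-two min_region_sz:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL /* DAMON_MIN_REGION_SZ equals PAGE_SIZE */

/* a power of two has exactly one bit set */
static int is_power_of_2(unsigned long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* sketch of min_region_sz = max(DAMON_MIN_REGION_SZ / addr_unit, 1) */
static unsigned long min_region_sz(unsigned long addr_unit)
{
	unsigned long sz = PAGE_SIZE / addr_unit;

	return sz > 1 ? sz : 1;
}
```

addr_unit = 4 gives 1024 (a power of two, accepted); addr_unit = 3 gives 1365 (rejected); any addr_unit >= PAGE_SIZE clamps to 1, which is trivially a power of two.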
Link: https://lkml.kernel.org/r/20260307194222.202075-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 7 Mar 2026 19:42:20 +0000 (11:42 -0800)]
mm/damon/tests/core-kunit: add a test for damon_commit_ctx()
Patch series "mm/damon: test and document power-of-2 min_region_sz
requirement".
Since commit c80f46ac228b ("mm/damon/core: disallow non-power of two
min_region_sz"), min_region_sz is always restricted to be a power of two.
Add a kunit test to confirm the functionality. Also, the change adds a
restriction to addr_unit parameter. Clarify it on the document.
This patch (of 2):
Add a kunit test confirming that the change made in commit c80f46ac228b
("mm/damon/core: disallow non-power of two min_region_sz") functions as
expected.
Link: https://lkml.kernel.org/r/20260307194222.202075-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: David Hildenbrand <david@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: SeongJae Park <sj@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
At the time of damon_reset_aggregated(), aggregation for the interval
should be completed, and hence nr_accesses and nr_accesses_bp should
match. I found a few bugs that broke this in the past, caused by online
parameters updates and complicated nr_accesses handling changes. Add a
sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-9-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_merge_regions_of() should be called only after aggregation is
finished and therefore each region's nr_accesses and nr_accesses_bp match.
There were bugs that broke the assumption, during development of online
DAMON parameter updates and monitoring results handling changes. Add a
sanity check for that under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-7-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Data corruption could cause damon_merge_two_regions() to create
zero-length DAMON regions. Add a sanity check for that under
CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-6-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
damon_target->nr_regions is introduced to get the number of regions
quickly, without always having to iterate over them. Add a sanity check
for its consistency under CONFIG_DAMON_DEBUG_SANITY.
Link: https://lkml.kernel.org/r/20260306152914.86303-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Brendan Higgins <brendan.higgins@linux.dev> Cc: David Gow <davidgow@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Fri, 6 Mar 2026 15:29:04 +0000 (07:29 -0800)]
mm/damon: add CONFIG_DAMON_DEBUG_SANITY
Patch series "mm/damon: add optional debugging-purpose sanity checks".
DAMON code has a few assumptions that can be critical if violated.
Validating the assumptions in code can be useful for finding such critical
bugs. I was actually adding some such additional sanity checks in my
personal tree, and those were useful for finding bugs that I made during
the development of new patches. We also found [1] that the assumptions
are sometimes misunderstood. The validation can work as good
documentation for such cases.
Add some such debugging-purpose sanity checks. Because these additional
checks can impose more overhead, make them optional via a new config,
CONFIG_DAMON_DEBUG_SANITY, which is recommended only for development and
test setups. As recommended, enable it for DAMON kunit tests and
selftests.
Note that the verification only does WARN_ON() for each violated
assumption. Developers or testers may want to also set panic_on_oops,
like damon-tests/corr does [2].
This patch (of 10):
Add a new build config that will enable additional DAMON sanity checks.
It is recommended to be enabled only on development and test setups, since
it can impose additional overhead.
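Config-gated checks of this kind typically compile away entirely when the option is off. A user-space sketch of the pattern (the macro name and counter here are hypothetical stand-ins, not the actual DAMON code):

```c
#include <assert.h>
#include <stdio.h>

/* pretend the (real) Kconfig option is enabled in this build */
#define CONFIG_DAMON_DEBUG_SANITY

static int warn_count; /* user-space stand-in for WARN_ON()'s effect */

#ifdef CONFIG_DAMON_DEBUG_SANITY
/* when enabled: warn (once per call site hit) if the condition fails */
#define damon_sanity_check(cond)					\
	do {								\
		if (!(cond)) {						\
			warn_count++;					\
			fprintf(stderr, "sanity failed: %s\n", #cond);	\
		}							\
	} while (0)
#else
/* when disabled: compiles to nothing, so production pays no overhead */
#define damon_sanity_check(cond) do { } while (0)
#endif
```

A passing condition leaves warn_count untouched; a failing one warns and increments it, mirroring how a WARN_ON()-based check surfaces a violated assumption without stopping the kernel.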
Usama Arif [Mon, 9 Mar 2026 21:25:02 +0000 (14:25 -0700)]
mm/migrate_device: document folio_get requirement before frozen PMD split
split_huge_pmd_address() with freeze=true splits a PMD migration entry
into PTE migration entries, consuming one folio reference in the process.
The folio_get() before it provides this reference.
Add a comment explaining this relationship. The expected folio refcount
at the start of migrate_vma_split_unmapped_folio() is 1.
Link: https://lkml.kernel.org/r/20260309212502.3922825-1-usama.arif@linux.dev Signed-off-by: Usama Arif <usama.arif@linux.dev> Suggested-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Nico Pache <npache@redhat.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Ying Huang <ying.huang@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Arnd Bergmann [Fri, 6 Mar 2026 15:05:49 +0000 (16:05 +0100)]
ubsan: turn off kmsan inside of ubsan instrumentation
The structure initialization in the two type mismatch handling functions
causes a call to __msan_memset() to be generated inside of a UACCESS
block, which in turn leads to an objtool warning about possibly leaking
uaccess-enabled state:
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch+0xda: call to __msan_memset() with UACCESS enabled
lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1+0xf4: call to __msan_memset() with UACCESS enabled
Most likely __msan_memset() is safe to be called here and could be added
to the uaccess_safe_builtin[] list of safe functions, but seeing that the
ubsan file already has kasan, ubsan and kcsan instrumentation disabled, it
is probably a good idea to also turn off kmsan here. In particular, this
also avoids the risk of recursion between ubsan and kmsan checks in other
functions of this file.
I saw this happen while testing randconfig builds with clang-22, but did
not try older versions, or attempt to see which kernel change introduced
the warning.
Byungchul Park [Tue, 24 Feb 2026 05:13:47 +0000 (14:13 +0900)]
mm: introduce a new page type for page pool in page type
Currently, the condition 'page->pp_magic == PP_SIGNATURE' is used to
determine if a page belongs to a page pool. However, with the planned
removal of @pp_magic, we should instead leverage the page_type in struct
page, such as PGTY_netpp, for this purpose.
Introduce and use the page type APIs e.g. PageNetpp(), __SetPageNetpp(),
and __ClearPageNetpp() instead, and remove the existing APIs accessing
@pp_magic e.g. page_pool_page_is_pp(), netmem_or_pp_magic(), and
netmem_clear_pp_magic().
Plus, add @page_type to struct net_iov at the same offset as in struct
page, so the page_type APIs can be used for struct net_iov as well.
While at it, reorder @type and @owner in struct net_iov to avoid a hole
and an increase in the struct size.
Chengkaitao [Sun, 1 Feb 2026 06:35:31 +0000 (14:35 +0800)]
sparc: use vmemmap_populate_hugepages for vmemmap_populate
Change sparc's implementation of vmemmap_populate() to use
vmemmap_populate_hugepages(), streamlining the code. Another benefit is
that it allows us to eliminate the external declarations of the
vmemmap_p?d_populate functions and convert them to static functions.
Note that vmemmap_populate_hugepages() may fall back to
vmemmap_populate_basepages(), which differs from sparc's original
implementation. Per the v1 discussion with Mike Rapoport, sparc uses base
pages in the kernel page tables, so it should be able to use them in
vmemmap as well. Consequently, no additional special handling is
required.
1. In the SPARC architecture, reimplement vmemmap_populate using
vmemmap_populate_hugepages.
2. Allow the SPARC arch to fall back to vmemmap_populate_basepages()
when vmemmap_alloc_block() returns NULL.
Link: https://lkml.kernel.org/r/20260201063532.44807-2-pilgrimtao@gmail.com Signed-off-by: Chengkaitao <chengkaitao@kylinos.cn> Tested-by: Andreas Larsson <andreas@gaisler.com> Acked-by: Andreas Larsson <andreas@gaisler.com> Cc: David Hildenbrand <david@kernel.org> Cc: David S. Miller <davem@davemloft.net> Cc: Kevin Brodsky <kevin.brodsky@arm.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
tools/testing/vma: add test for vma_flags_test(), vma_desc_test()
Now that we have helpers which test a single VMA flag - vma_flags_test()
and vma_desc_test() - add a test to explicitly assert that these behave
as expected.
[ljs@kernel.org: test_vma_flags_test(): use struct initializer, per David] Link: https://lkml.kernel.org/r/f6f396d2-1ba2-426f-b756-d8cc5985cc7c@lucifer.local Link: https://lkml.kernel.org/r/376a39eb9e134d2c8ab10e32720dd292970b080a.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: reintroduce vma_desc_test() as a singular flag test
Similar to vma_flags_test(), we previously renamed vma_desc_test() to
vma_desc_test_any(). Now that this is in place, we can reintroduce
vma_desc_test() to explicitly check for a single VMA flag.
As with vma_flags_test(), this is useful as often flag tests are against a
single flag, and vma_desc_test_any(flags, VMA_READ_BIT) reads oddly and
potentially causes confusion.
As with vma_flags_test() a combination of sparse and vma_flags_t being a
struct means that users cannot misuse this function without it getting
flagged.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/3a65ca23defb05060333f0586428fe279a484564.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: reintroduce vma_flags_test() as a singular flag test
Since we've now renamed vma_flags_test() to vma_flags_test_any() to be
very clear as to what we are in fact testing, we now have the opportunity
to bring vma_flags_test() back, but for explicitly testing a single VMA
flag.
This is useful, as often flag tests are against a single flag, and
vma_flags_test_any(flags, VMA_READ_BIT) reads oddly and potentially causes
confusion.
We use sparse to enforce that users won't accidentally pass vm_flags_t to
this function without it being flagged so this should make it harder to
get this wrong.
Of course, passing vma_flags_t to the function is impossible, as it is a
struct.
Also update the VMA tests to reflect this change.
Link: https://lkml.kernel.org/r/f33f8d7f16c3f3d286a1dc2cba12c23683073134.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: always inline __mk_vma_flags() and invoked functions
Be explicit about __mk_vma_flags() (which is used by the mk_vma_flags()
macro) always being inline, as we rely on the compiler to evaluate the
loop in this function and determine that it can replace the code with an
equivalent constant value.
Most likely a plain 'inline' will suffice for this, but be as explicit as
we can be.
Also update all of the functions __mk_vma_flags() ultimately invokes to be
always inline too.
Note that test_bitmap_const_eval() asserts that the relevant bitmap
functions result in build time constant values.
Additionally, vma_flag_set() operates on a vma_flags_t type, so it is
inconsistently named versus other VMA flags functions.
We only use vma_flag_set() in __mk_vma_flags(), so we don't need to worry
about its new name being rather cumbersome; rename it to
vma_flags_set_flag() to disambiguate it from vma_flags_set().
Also update the VMA test headers to reflect the changes.
Link: https://lkml.kernel.org/r/241f49c52074d436edbb9c6a6662a8dc142a8f43.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
erofs and zonefs use vma_desc_test_any() twice to check whether both
VMA_SHARED_BIT and VMA_MAYWRITE_BIT are set. This is silly, so add
vma_desc_test_all() to test that all given flags are set, and update
erofs and zonefs to use it.
While we're here, update the helper function comments to be more
consistent.
Also add the same to the VMA test headers.
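The _any/_all distinction is the usual bitmask one; a plain-C sketch of the semantics (the real helpers operate on vma_flags_t, a struct type, so the unsigned long representation and helper names here are illustrative only):

```c
#include <assert.h>

/* illustrative bit values, not the real vma_flags_t representation */
#define VMA_SHARED_BIT   (1UL << 0)
#define VMA_MAYWRITE_BIT (1UL << 1)

/* _any semantics: at least one bit of mask is set in flags */
static int desc_test_any(unsigned long flags, unsigned long mask)
{
	return (flags & mask) != 0;
}

/* _all semantics: every bit of mask is set in flags */
static int desc_test_all(unsigned long flags, unsigned long mask)
{
	return (flags & mask) == mask;
}
```

So the former double vma_desc_test_any() call collapses into one _all test of VMA_SHARED_BIT | VMA_MAYWRITE_BIT.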
Link: https://lkml.kernel.org/r/568c8f8d6a84ff64014f997517cba7a629f7eed6.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The ongoing work around introducing non-system word VMA flags has
introduced a number of helper functions and macros to make life easier
when working with these flags and to make conversions from the legacy use
of VM_xxx flags more straightforward.
This series improves these to reduce confusion as to what they do and to
improve consistency and readability.
Firstly the series renames vma_flags_test() to vma_flags_test_any() to
make it abundantly clear that this function tests whether any of the flags
are set (as opposed to vma_flags_test_all()).
It then renames vma_desc_test_flags() to vma_desc_test_any() for the same
reason. Note that we drop the 'flags' suffix here, as
vma_desc_test_any_flags() would be cumbersome and 'test' implies a flag
test.
Similarly, we rename vma_test_all_flags() to vma_test_all() for
consistency.
Next, we have a couple of instances (erofs, zonefs) where we are now
testing for vma_desc_test_any(desc, VMA_SHARED_BIT) &&
vma_desc_test_any(desc, VMA_MAYWRITE_BIT).
This is silly, so this series introduces vma_desc_test_all() so these
callers can instead invoke vma_desc_test_all(desc, VMA_SHARED_BIT,
VMA_MAYWRITE_BIT).
We then observe that quite a few instances of vma_flags_test_any() and
vma_desc_test_any() are in fact only testing against a single flag.
Using the _any() variant here is just confusing - 'any' of a single item
reads strangely and is liable to mislead.
So in these instances the series reintroduces vma_flags_test() and
vma_desc_test() as helpers which test against a single flag.
The fact that vma_flags_t is a struct and that vma_flag_t utilises sparse
to avoid confusion with vm_flags_t makes it impossible for a user to
misuse these helpers without it getting flagged somewhere.
The series also updates __mk_vma_flags() and the functions it invokes to
explicitly mark them always-inline, to match expectations and to be
consistent with other VMA flag helpers.
It also renames vma_flag_set() to vma_flags_set_flag() (a function only
used by __mk_vma_flags()) to be consistent with other VMA flag helpers.
Finally it updates the VMA tests for each of these changes, and introduces
explicit tests for vma_flags_test() and vma_desc_test() to assert that
they behave as expected.
This patch (of 6):
On reflection, it's confusing to have vma_flags_test() and
vma_desc_test_flags() test whether any comma-separated VMA flag bit is
set, while also having vma_flags_test_all() and vma_test_all_flags()
separately test whether all flags are set.
Firstly, rename vma_flags_test() to vma_flags_test_any() to eliminate this
confusion.
Secondly, since the VMA descriptor flag functions are becoming rather
cumbersome, prefer vma_desc_test*() to vma_desc_test_flags*(), and also
rename vma_desc_test_flags() to vma_desc_test_any().
Finally, rename vma_test_all_flags() to vma_test_all() to keep the
VMA-specific helper consistent with the VMA descriptor naming convention
and to help avoid confusion vs. vma_flags_test_all().
While we're here, also update whitespace to be consistent in helper
functions.
Link: https://lkml.kernel.org/r/cover.1772704455.git.ljs@kernel.org Link: https://lkml.kernel.org/r/0f9cb3c511c478344fac0b3b3b0300bb95be95e9.1772704455.git.ljs@kernel.org Signed-off-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Suggested-by: Pedro Falcato <pfalcato@suse.de> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Pedro Falcato <pfalcato@suse.de> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Babu Moger <babu.moger@amd.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chao Yu <chao@kernel.org> Cc: Chatre, Reinette <reinette.chatre@intel.com> Cc: Chunhai Guo <guochunhai@vivo.com> Cc: Damien Le Maol <dlemoal@kernel.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Dave Martin <dave.martin@arm.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Hongbo Li <lihongbo22@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: James Morse <james.morse@arm.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Konstantin Komarov <almaz.alexandrovich@paragon-software.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Naohiro Aota <naohiro.aota@wdc.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Sandeep Dhavale <dhavale@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Yue Hu <zbestahu@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andrey Ryabinin [Thu, 5 Mar 2026 18:56:59 +0000 (19:56 +0100)]
kasan: fix bug type classification for SW_TAGS mode
kasan_non_canonical_hook() derives orig_addr from kasan_shadow_to_mem(),
but the pointer tag may remain in the top byte. In SW_TAGS mode this
tagged address is compared against PAGE_SIZE and TASK_SIZE, which leads to
incorrect bug classification.
As a result, NULL pointer dereferences may be reported as
"wild-memory-access".
Strip the tag before performing these range checks and use the untagged
value when reporting addresses in these ranges.
Before:
[ ] Unable to handle kernel paging request at virtual address ffef800000000000
[ ] KASAN: maybe wild-memory-access in range [0xff00000000000000-0xff0000000000000f]
After:
[ ] Unable to handle kernel paging request at virtual address ffef800000000000
[ ] KASAN: null-ptr-deref in range [0x0000000000000000-0x000000000000000f]
Link: https://lkml.kernel.org/r/20260305185659.20807-1-ryabinin.a.a@gmail.com Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Maciej Wieczor-Retman <maciej.wieczor-retman@intel.com> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Bing Jiao [Tue, 3 Mar 2026 05:25:17 +0000 (05:25 +0000)]
mm/vmscan: fix unintended mtc->nmask mutation in alloc_demote_folio()
In alloc_demote_folio(), mtc->nmask is set to NULL for the first
allocation. If that succeeds, it returns without restoring mtc->nmask to
allowed_mask. For subsequent allocations from the migrate_pages() batch,
mtc->nmask will be NULL. If the target node then becomes full, the
fallback allocation will use nmask = NULL, allocating from any node
allowed by the task cpuset, which for kswapd is all nodes.
To address this issue, use a local copy of the mtc structure with nmask =
NULL for the first allocation attempt specifically, ensuring the original
mtc remains unmodified.
Link: https://lkml.kernel.org/r/20260303052519.109244-1-bingjiao@google.com Fixes: 320080272892 ("mm/demotion: demote pages according to allocation fallback order") Signed-off-by: Bing Jiao <bingjiao@google.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuvraj Sakshith [Tue, 3 Mar 2026 11:30:32 +0000 (03:30 -0800)]
mm/page_reporting: change page_reporting_order to PAGE_REPORTING_ORDER_UNSPECIFIED
When uninitialised, page_reporting_order holds the magic number -1.
Since we now maintain PAGE_REPORTING_ORDER_UNSPECIFIED as -1, which also
serves as a flag, set page_reporting_order to this flag instead.
Link: https://lkml.kernel.org/r/20260303113032.3008371-6-yuvraj.sakshith@oss.qualcomm.com Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuvraj Sakshith [Tue, 3 Mar 2026 11:30:31 +0000 (03:30 -0800)]
mm/page_reporting: change PAGE_REPORTING_ORDER_UNSPECIFIED to -1
PAGE_REPORTING_ORDER_UNSPECIFIED is currently set to zero. This means
pages of order zero cannot be reported to a client/driver, as zero is
used to signal a fallback to MAX_PAGE_ORDER.
Change PAGE_REPORTING_ORDER_UNSPECIFIED to (-1), so that zero can be used
as a valid order with which pages can be reported.
Link: https://lkml.kernel.org/r/20260303113032.3008371-5-yuvraj.sakshith@oss.qualcomm.com Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuvraj Sakshith [Tue, 3 Mar 2026 11:30:30 +0000 (03:30 -0800)]
hv_balloon: set unspecified page reporting order
Explicitly request the default page reporting order by using the
PAGE_REPORTING_ORDER_UNSPECIFIED fallback value.
Link: https://lkml.kernel.org/r/20260303113032.3008371-4-yuvraj.sakshith@oss.qualcomm.com Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yuvraj Sakshith [Tue, 3 Mar 2026 11:30:29 +0000 (03:30 -0800)]
virtio_balloon: set unspecified page reporting order
The virtio_balloon page reporting order is implicitly set to
MAX_PAGE_ORDER, as vb->prdev.order is never initialised and so defaults
to zero. Explicitly request the default page order by using the
PAGE_REPORTING_ORDER_UNSPECIFIED fallback value.
Link: https://lkml.kernel.org/r/20260303113032.3008371-3-yuvraj.sakshith@oss.qualcomm.com Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "Allow order zero pages in page reporting", v4.
Today, page reporting sets page_reporting_order in two ways:
(1) page_reporting.page_reporting_order cmdline parameter
(2) The driver can pass an order when registering itself.
In both cases, order zero is ignored by free page reporting because it is
used to set page_reporting_order to a default value, like MAX_PAGE_ORDER.
In some cases we might want page_reporting_order to be zero.
For instance, when virtio-balloon runs inside a guest with tiny memory
(say, 16MB), it might not be able to find an order-1 page (or, in the
worst case, an order-MAX_PAGE_ORDER page) after some uptime. Page reporting should
be able to return order zero pages back for optimal memory relinquishment.
This patch changes the default fallback value from '0' to '-1' in all
possible clients of free page reporting (hv_balloon and virtio-balloon)
together with allowing '0' as a valid order in page_reporting_register().
This patch (of 5):
Drivers can pass the order of pages to be reported when registering
themselves. Today, an unspecified order is a magic number, 0.
Label this with PAGE_REPORTING_ORDER_UNSPECIFIED and check for it when the
driver is being registered.
This macro will be used in relevant drivers next.
[akpm@linux-foundation.org: tweak whitespace, per David] Link: https://lkml.kernel.org/r/20260303113032.3008371-1-yuvraj.sakshith@oss.qualcomm.com Link: https://lkml.kernel.org/r/20260303113032.3008371-2-yuvraj.sakshith@oss.qualcomm.com Signed-off-by: Yuvraj Sakshith <yuvraj.sakshith@oss.qualcomm.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Dexuan Cui <decui@microsoft.com> Cc: Eugenio Pérez <eperezma@redhat.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Long Li <longli@microsoft.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Liu <wei.liu@kernel.org> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Mon, 2 Mar 2026 19:50:18 +0000 (14:50 -0500)]
mm: memcg: separate slab stat accounting from objcg charge cache
Cgroup slab metrics are cached per-cpu the same way as the sub-page charge
cache. However, the intertwined code to manage those dependent caches
right now is quite difficult to follow.
Specifically, cached slab stat updates occur in consume() if there was
enough charge cache to satisfy the new object. If that fails, whole pages
are reserved, and slab stats are updated when the remainder of those
pages, after subtracting the size of the new slab object, are put into the
charge cache. This already juggles a delicate mix of the object size, the
page charge size, and the remainder to put into the byte cache. Doing
slab accounting in this path as well is fragile, and has recently caused a
bug where the input parameters between the two caches were mixed up.
Refactor the consume() and refill() paths into unlocked and locked
variants that only do charge caching. Then let the slab path manage its
own lock section and open-code charging and accounting.
This makes the slab stat cache subordinate to the charge cache:
__refill_obj_stock() is called first to prepare it; __account_obj_stock()
follows to hitch a ride.
This results in a minor behavioral change: previously, a mismatching
percpu stock would always be drained for the purpose of setting up slab
account caching, even if there was no byte remainder to put into the
charge cache. Now, the stock is left alone, and slab accounting takes the
uncached path if there is a mismatch. This is exceedingly rare, and it
was probably never worth draining the whole stock just to cache the slab
stat update.
Link: https://lkml.kernel.org/r/20260302195305.620713-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Hao Li <hao.li@linux.dev> Acked-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Cc: Johannes Weiner <jweiner@meta.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Johannes Weiner [Mon, 2 Mar 2026 19:50:14 +0000 (14:50 -0500)]
mm: memcg: factor out trylock_stock() and unlock_stock()
Patch series "memcg: obj stock and slab stat caching cleanups".
This is a follow-up to `[PATCH] memcg: fix slab accounting in
refill_obj_stock() trylock path`. The way the slab stat cache and the
objcg charge cache interact appears a bit too fragile. This series
factors those paths apart as much as practical.
This patch (of 5):
Consolidate the local lock acquisition and the local stock lookup. This
allows subsequent patches to use !!stock as an easy way to disambiguate
the locked vs. contended cases through the callstack.
Link: https://lkml.kernel.org/r/20260302195305.620713-1-hannes@cmpxchg.org Link: https://lkml.kernel.org/r/20260302195305.620713-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: Vlastimil Babka (SUSE) <vbabka@kernel.org> Reviewed-by: Hao Li <hao.li@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:42 +0000 (14:43 +0800)]
arm64: mm: implement the architecture-specific test_and_clear_young_ptes()
Implement the Arm64 architecture-specific test_and_clear_young_ptes() to
enable batched checking of young flags, improving performance during large
folio reclamation when MGLRU is enabled.
While we're at it, simplify ptep_test_and_clear_young() by calling
test_and_clear_young_ptes(). Since callers guarantee that PTEs are
present before calling these functions, we can use pte_cont() to check the
CONT_PTE flag instead of pte_valid_cont().
Performance testing:
Enable MGLRU, then allocate 10G clean file-backed folios by mmap() in a
memory cgroup, and try to reclaim 8G file-backed folios via the
memory.reclaim interface. I can observe 60%+ performance improvement on
my Arm64 32-core server (and about 15% improvement on my X86 machine).
W/o patchset:
real 0m0.470s
user 0m0.000s
sys 0m0.470s
W/ patchset:
real 0m0.180s
user 0m0.001s
sys 0m0.179s
Link: https://lkml.kernel.org/r/7f891d42a720cc2e57862f3b79e4f774404f313c.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Reviewed-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Fri, 6 Mar 2026 06:43:41 +0000 (14:43 +0800)]
mm: support batched checking of the young flag for MGLRU
Use the batched helper test_and_clear_young_ptes_notify() to check and
clear the young flag to improve the performance during large folio
reclamation when MGLRU is enabled.
Meanwhile, we can also support batched checking the young and dirty flag
when MGLRU walks the mm's pagetable to update the folios' generation
counter. Since MGLRU also checks the PTE dirty bit, use
folio_pte_batch_flags() with FPB_MERGE_YOUNG_DIRTY set to detect batches
of PTEs for a large folio.
Then we can remove the ptep_test_and_clear_young_notify() since it has no
users now.
Note that we also update the 'young' counter and 'mm_stats[MM_LEAF_YOUNG]'
counter with the batched count in the lru_gen_look_around() and
walk_pte_range(). However, the batched operations may inflate these two
counters, because in a large folio not all PTEs may have been accessed.
(Additionally, tracking how many PTEs have been accessed within a large
folio is not very meaningful, since the mm core actually tracks
access/dirty on a per-folio basis, not per page). The impact analysis is
as follows:
1. The 'mm_stats[MM_LEAF_YOUNG]' counter has no functional impact and
is mainly for debugging.
2. The 'young' counter is used to decide whether to place the current
PMD entry into the bloom filters by suitable_to_scan() (so that next
time we can check whether it has been accessed again), which may set
the hash bit in the bloom filters for a PMD entry that hasn't seen much
access. However, bloom filters inherently allow some error, so this
effect appears negligible.
Link: https://lkml.kernel.org/r/378f4acf7d07410aa7c2e4b49d56bb165918eb34.1772778858.git.baolin.wang@linux.alibaba.com Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Rik van Riel <riel@surriel.com> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Cc: Alistair Popple <apopple@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Barry Song <baohua@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Jann Horn <jannh@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes (Oracle) <ljs@kernel.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Xu <weixugc@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yuanchu Xie <yuanchu@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>