Jinjiang Tu [Fri, 27 Jun 2025 12:57:46 +0000 (20:57 +0800)]
mm/vmscan: fix hwpoisoned large folio handling in shrink_folio_list
In shrink_folio_list(), the hwpoisoned folio may be large folio, which
can't be handled by unmap_poisoned_folio(). For THP, try_to_unmap_one()
must be passed with TTU_SPLIT_HUGE_PMD to split huge PMD first and then
retry. Without TTU_SPLIT_HUGE_PMD, we will trigger null-ptr deref of
pvmw.pte. Even we passed TTU_SPLIT_HUGE_PMD, we will trigger a
WARN_ON_ONCE due to the page isn't in swapcache.
Since UCE is rare in real world, and race with reclaimation is more rare,
just skipping the hwpoisoned large folio is enough. memory_failure() will
handle it if the UCE is triggered again.
This happens when memory reclaim for large folio races with
memory_failure(), and will lead to kernel panic. The race is as
follows:
cpu0 cpu1
shrink_folio_list memory_failure
TestSetPageHWPoison
unmap_poisoned_folio
--> trigger BUG_ON due to
unmap_poisoned_folio couldn't
handle large folio
[tujinjiang@huawei.com: add comment to unmap_poisoned_folio()] Link: https://lkml.kernel.org/r/69fd4e00-1b13-d5f7-1c82-705c7d977ea4@huawei.com Link: https://lkml.kernel.org/r/20250627125747.3094074-2-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Fixes: 1b0449544c64 ("mm/vmscan: don't try to reclaim hwpoison folio") Reported-by: syzbot+3b220254df55d8ca8a61@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/68412d57.050a0220.2461cf.000e.GAE@google.com/ Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Thu, 10 Jul 2025 05:51:41 +0000 (13:51 +0800)]
docs: update docs after introducing delaytop
The "getdelays" can only display the latency of a single task by
specifying a PID, we introduce the "delaytop" with the following
capabilities:
1. system view: monitors latency metrics (CPU, I/O, memory, IRQ, etc.)
for all system processes
2. supports field-based sorting (e.g., default sort by CPU latency in
descending order)
3. dynamic interactive interface: focus on specific processes with
--pid; limit displayed entries with --processes 20; control monitoring
duration with --iterations;
Link: https://lkml.kernel.org/r/2025071013514177028RdjISjqeIOnTCRvGAwy@zte.com.cn Signed-off-by: Fan Yu <fan.yu9@zte.com.cn> Signed-off-by: Wang Yaxin <wang.yaxin@zte.com.cn> Signed-off-by: Jiang Kun <jiang.kun2@zte.com.cn> Cc: Balbir Singh <bsingharora@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Peilin He <he.peilin@zte.com.cn> Cc: Qiang Tu <tu.qiang35@zte.com.cn> Cc: wangyong <wang.yong12@zte.com.cn> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: Yunkai Zhang <zhang.yunkai@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wang Yaxin [Thu, 10 Jul 2025 05:54:51 +0000 (13:54 +0800)]
delaytop: add psi info to show system delay
Support showing whole delay of system by reading PSI, just like the first
few lines of information output by the top command. the output of
delaytop includes both system-wide delay and delay of individual tasks,
providing a more comprehensive reflection of system latency status.
selftests/thermal: remove duplicate newlines in perror calls
perror() automatically appends a newline character, so the explicit '\n'
in the format strings is redundant and results in duplicate newlines in
the output.
Remove the redundant '\n' characters from perror() calls in
workload_hint_test.c to fix the formatting.
Kuan-Wei Chiu [Fri, 6 Jun 2025 13:47:58 +0000 (21:47 +0800)]
riscv: optimize gcd() performance on RISC-V without Zbb extension
The binary GCD implementation uses FFS (find first set), which benefits
from hardware support for the ctz instruction, provided by the Zbb
extension on RISC-V. Without Zbb, this results in slower
software-emulated behavior.
Previously, RISC-V always used the binary GCD, regardless of actual
hardware support. This patch improves runtime efficiency by disabling the
efficient_ffs_key static branch when Zbb is either not enabled in the
kernel (config) or not supported on the executing CPU. This selects the
odd-even GCD implementation, which is faster in the absence of efficient
FFS.
This change ensures the most suitable GCD algorithm is chosen dynamically
based on actual hardware capabilities.
Link: https://lkml.kernel.org/r/20250606134758.1308400-4-visitorckw@gmail.com Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kuan-Wei Chiu [Fri, 6 Jun 2025 13:47:57 +0000 (21:47 +0800)]
riscv: optimize gcd() code size when CONFIG_RISCV_ISA_ZBB is disabled
The binary GCD implementation depends on efficient ffs(), which on RISC-V
requires hardware support for the Zbb extension. When
CONFIG_RISCV_ISA_ZBB is not enabled, the kernel will never use binary GCD,
as runtime logic will always fall back to the odd-even implementation.
To avoid compiling unused code and reduce code size, select
CONFIG_CPU_NO_EFFICIENT_FFS when CONFIG_RISCV_ISA_ZBB is not set.
$ ./scripts/bloat-o-meter ./lib/math/gcd.o.old ./lib/math/gcd.o.new
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-274 (-274)
Function old new delta
gcd 360 86 -274
Total: Before=384, After=110, chg -71.35%
Link: https://lkml.kernel.org/r/20250606134758.1308400-3-visitorckw@gmail.com Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Acked-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kuan-Wei Chiu [Fri, 6 Jun 2025 13:47:56 +0000 (21:47 +0800)]
lib/math/gcd: use static key to select implementation at runtime
Patch series "Optimize GCD performance on RISC-V by selecting
implementation at runtime", v3.
The current implementation of gcd() selects between the binary GCD and the
odd-even GCD algorithm at compile time, depending on whether
CONFIG_CPU_NO_EFFICIENT_FFS is set. On platforms like RISC-V, however,
this compile-time decision can be misleading: even when the compiler emits
ctz instructions based on the assumption that they are efficient (as is
the case when CONFIG_RISCV_ISA_ZBB is enabled), the actual hardware may
lack support for the Zbb extension. In such cases, ffs() falls back to a
software implementation at runtime, making the binary GCD algorithm
significantly slower than the odd-even variant.
To address this, we introduce a static key to allow runtime selection
between the binary and odd-even GCD implementations. On RISC-V, the
kernel now checks for Zbb support during boot. If Zbb is unavailable, the
static key is disabled so that gcd() consistently uses the more efficient
odd-even algorithm in that scenario. Additionally, to further reduce code
size, we select CONFIG_CPU_NO_EFFICIENT_FFS automatically when
CONFIG_RISCV_ISA_ZBB is not enabled, avoiding compilation of the unused
binary GCD implementation entirely on systems where it would never be
executed.
This series ensures that the most efficient GCD algorithm is used in
practice and avoids compiling unnecessary code based on hardware
capabilities and kernel configuration.
This patch (of 3):
On platforms like RISC-V, the compiler may generate hardware FFS
instructions even if the underlying CPU does not actually support them.
Currently, the GCD implementation is chosen at compile time based on
CONFIG_CPU_NO_EFFICIENT_FFS, which can result in suboptimal behavior on
such systems.
Introduce a static key, efficient_ffs_key, to enable runtime selection
between the binary GCD (using ffs) and the odd-even GCD implementation.
This allows the kernel to default to the faster binary GCD when FFS is
efficient, while retaining the ability to fall back when needed.
Link: https://lkml.kernel.org/r/20250606134758.1308400-1-visitorckw@gmail.com Link: https://lkml.kernel.org/r/20250606134758.1308400-2-visitorckw@gmail.com Co-developed-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Yu-Chun Lin <eleanor15x@gmail.com> Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Ching-Chun (Jim) Huang <jserv@ccns.ncku.edu.tw> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ivan Pravdin [Tue, 8 Jul 2025 02:06:40 +0000 (22:06 -0400)]
ocfs2: avoid potential ABBA deadlock by reordering tl_inode lock
In ocfs2_move_extent(), tl_inode is currently locked after the global
bitmap inode. However, in ocfs2_flush_truncate_log(), the lock order is
reversed: tl_inode is locked first, followed by the global bitmap inode.
This creates a classic ABBA deadlock scenario if two threads attempt these
operations concurrently and acquire the locks in different orders.
To prevent this, move the tl_inode locking earlier in ocfs2_move_extent(),
so that it always precedes the global bitmap inode lock.
No functional changes beyond lock ordering.
Link: https://lkml.kernel.org/r/20250708020640.387741-1-ipravdin.official@gmail.com Reported-by: syzbot+6bf948e47f9bac7aacfa@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67d5645c.050a0220.1dc86f.0004.GAE@google.com/ Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ivan Pravdin [Tue, 8 Jul 2025 00:10:09 +0000 (20:10 -0400)]
ocfs2: avoid NULL pointer dereference in dx_dir_lookup_rec()
When a directory entry is not found, ocfs2_dx_dir_lookup_rec() prints an
error message that unconditionally dereferences the 'rec' pointer.
However, if 'rec' is NULL, this leads to a NULL pointer dereference and a
kernel panic.
Add an explicit check empty extent list to avoid dereferencing NULL
'rec' pointer.
Link: https://lkml.kernel.org/r/20250708001009.372263-1-ipravdin.official@gmail.com Reported-by: syzbot+20282c1b2184a857ac4c@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/67cd7e29.050a0220.e1a89.0007.GAE@google.com/ Signed-off-by: Ivan Pravdin <ipravdin.official@gmail.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
init/main.c: add warning when file specified in rdinit is inaccessible
Avoid silently ignoring the initramfs when the file specified in rdinit is
not usable. This prints an error that clearly explains the issue (file
was not found, vs initramfs was not found).
Zi Li [Fri, 27 Jun 2025 07:29:24 +0000 (15:29 +0800)]
samples: enhance hung_task detector test with read-write semaphore support
Extend the hung_task detector test module to include read-write semaphore
support alongside existing mutex and semaphore tests. This module now
creates additional debugfs files under <debugfs>/hung_task, namely
'rw_semaphore_read' and 'rw_semaphore_write', in addition to 'mutex' and
'semaphore'. Reading these files with multiple processes triggers a
prolonged sleep (256 seconds) while holding the respective lock, enabling
hung_task detector testing for various locking mechanisms.
This change builds on the extensible hung_task_tests module, adding
read-write semaphore functionality to improve test coverage for kernel
locking primitives. The implementation ensures proper lock handling and
includes checks to prevent redundant data reads.
Usage is:
> cd /sys/kernel/debug/hung_task
> cat mutex & cat mutex # Test mutex blocking
> cat semaphore & cat semaphore # Test semaphore blocking
> cat rw_semaphore_write \
& cat rw_semaphore_read # Test rwsem blocking
> cat rw_semaphore_write \
& cat rw_semaphore_write # Test rwsem blocking
Update the Kconfig description to reflect the addition of read-write
semaphore debugfs files.
Link: https://lkml.kernel.org/r/20250627072924.36567-4-lance.yang@linux.dev Signed-off-by: Zi Li <zi.li@linux.dev> Suggested-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lance Yang [Fri, 27 Jun 2025 07:29:23 +0000 (15:29 +0800)]
hung_task: extend hung task blocker tracking to rwsems
Inspired by mutex blocker tracking[1], and having already extended it to
semaphores, let's now add support for reader-writer semaphores (rwsems).
The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while
waiting for an rwsem, we just call hung_task_set_blocker(). The hung task
detector can then query the rwsem's owner to identify the lock holder.
Tracking works reliably for writers, as there can only be a single writer
holding the lock, and its task struct is stored in the owner field.
The main challenge lies with readers. The owner field points to only one
of many concurrent readers, so we might lose track of the blocker if that
specific reader unlocks, even while others remain. This is not a
significant issue, however. In practice, long-lasting lock contention is
almost always caused by a writer. Therefore, reliably tracking the writer
is the primary goal of this patch series ;)
With this change, the hung task detector can now show blocker task's info
like below:
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
Lance Yang [Fri, 27 Jun 2025 07:29:22 +0000 (15:29 +0800)]
locking/rwsem: make owner helpers globally available
Patch series "extend hung task blocker tracking to rwsems".
Inspired by mutex blocker tracking[1], and having already extended it to
semaphores, let's now add support for reader-writer semaphores (rwsems).
The approach is simple: when a task enters TASK_UNINTERRUPTIBLE while
waiting for an rwsem, we just call hung_task_set_blocker(). The hung task
detector can then query the rwsem's owner to identify the lock holder.
Tracking works reliably for writers, as there can only be a single writer
holding the lock, and its task struct is stored in the owner field.
The main challenge lies with readers. The owner field points to only one
of many concurrent readers, so we might lose track of the blocker if that
specific reader unlocks, even while others remain. This is not a
significant issue, however. In practice, long-lasting lock contention is
almost always caused by a writer. Therefore, reliably tracking the writer
is the primary goal of this patch series ;)
With this change, the hung task detector can now show blocker task's info
like below:
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 blocked for more than 122 seconds.
[Fri Jun 27 15:21:34 2025] Tainted: G S 6.16.0-rc3 #8
[Fri Jun 27 15:21:34 2025] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Jun 27 15:21:34 2025] task:cat state:D stack:0 pid:28631 tgid:28631 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? policy_nodemask+0x215/0x340
[Fri Jun 27 15:21:34 2025] ? _raw_spin_lock_irq+0x8a/0xe0
[Fri Jun 27 15:21:34 2025] ? __pfx__raw_spin_lock_irq+0x10/0x10
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_preempt_disabled+0x15/0x30
[Fri Jun 27 15:21:34 2025] rwsem_down_read_slowpath+0x55e/0xe10
[Fri Jun 27 15:21:34 2025] ? __pfx_rwsem_down_read_slowpath+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx___might_resched+0x10/0x10
[Fri Jun 27 15:21:34 2025] down_read+0xc9/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_down_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __debugfs_file_get+0x14d/0x700
[Fri Jun 27 15:21:34 2025] ? __pfx___debugfs_file_get+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? handle_pte_fault+0x52a/0x710
[Fri Jun 27 15:21:34 2025] ? selinux_file_permission+0x3a9/0x590
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_read+0x4a/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f3f8faefb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffdeda5ab98 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f3f8faefb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 00000000010fa000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 00000000010fa000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffdeda59fe0 R11: 0000000000000246 R12: 00000000010fa000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
[Fri Jun 27 15:21:34 2025] INFO: task cat:28631 <reader> blocked on an rw-semaphore likely owned by task cat:28630 <writer>
[Fri Jun 27 15:21:34 2025] task:cat state:S stack:0 pid:28630 tgid:28630 ppid:28501 task_flags:0x400000 flags:0x00004000
[Fri Jun 27 15:21:34 2025] Call Trace:
[Fri Jun 27 15:21:34 2025] <TASK>
[Fri Jun 27 15:21:34 2025] __schedule+0x7c7/0x1930
[Fri Jun 27 15:21:34 2025] ? __pfx___schedule+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __mod_timer+0x304/0xa80
[Fri Jun 27 15:21:34 2025] schedule+0x6a/0x180
[Fri Jun 27 15:21:34 2025] schedule_timeout+0xfb/0x230
[Fri Jun 27 15:21:34 2025] ? __pfx_schedule_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? __pfx_process_timeout+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? down_write+0xc4/0x140
[Fri Jun 27 15:21:34 2025] msleep_interruptible+0xbe/0x150
[Fri Jun 27 15:21:34 2025] read_dummy_rwsem_write+0x54/0x90
[Fri Jun 27 15:21:34 2025] full_proxy_read+0xff/0x1c0
[Fri Jun 27 15:21:34 2025] ? rw_verify_area+0x6d/0x410
[Fri Jun 27 15:21:34 2025] vfs_read+0x177/0xa50
[Fri Jun 27 15:21:34 2025] ? __pfx_vfs_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] ? fdget_pos+0x1cf/0x4c0
[Fri Jun 27 15:21:34 2025] ksys_read+0xfc/0x1d0
[Fri Jun 27 15:21:34 2025] ? __pfx_ksys_read+0x10/0x10
[Fri Jun 27 15:21:34 2025] do_syscall_64+0x66/0x2d0
[Fri Jun 27 15:21:34 2025] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[Fri Jun 27 15:21:34 2025] RIP: 0033:0x7f8f288efb40
[Fri Jun 27 15:21:34 2025] RSP: 002b:00007ffffb631038 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[Fri Jun 27 15:21:34 2025] RAX: ffffffffffffffda RBX: 0000000000010000 RCX: 00007f8f288efb40
[Fri Jun 27 15:21:34 2025] RDX: 0000000000010000 RSI: 000000002a4b5000 RDI: 0000000000000003
[Fri Jun 27 15:21:34 2025] RBP: 000000002a4b5000 R08: 0000000000000000 R09: 0000000000010fff
[Fri Jun 27 15:21:34 2025] R10: 00007ffffb630460 R11: 0000000000000246 R12: 000000002a4b5000
[Fri Jun 27 15:21:34 2025] R13: 0000000000000003 R14: 0000000000000000 R15: 0000000000000fff
[Fri Jun 27 15:21:34 2025] </TASK>
This patch (of 3):
In preparation for extending blocker tracking to support rwsems, make the
rwsem_owner() and is_rwsem_reader_owned() helpers globally available for
determining if the blocker is a writer or one of the readers.
Additionally, a stale owner pointer in a reader-owned rwsem can lead to
false positives in blocker tracking when CONFIG_DETECT_HUNG_TASK_BLOCKER
is enabled. To mitigate this, clear the owner field on the reader unlock
path, similar to what CONFIG_DEBUG_RWSEMS does. A NULL owner is better
than a stale one for diagnostics.
Link: https://lkml.kernel.org/r/20250627072924.36567-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627072924.36567-2-lance.yang@linux.dev Link: https://lore.kernel.org/all/174046694331.2194069.15472952050240807469.stgit@mhiramat.tok.corp.google.com/ Signed-off-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Cc: Anna Schumaker <anna.schumaker@oracle.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: John Stultz <jstultz@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Tomasz Figa <tfiga@chromium.org> Cc: Waiman Long <longman@redhat.com> Cc: Will Deacon <will@kernel.org> Cc: Yongliang Gao <leonylgao@tencent.com> Cc: Zi Li <zi.li@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
coccinelle: misc: secs_to_jiffies: implement context and report modes
As requested by Ricardo and Jakub, implement report and context modes for
the secs_to_jiffies Coccinelle script. While here, add the option to look
for opportunities to use secs_to_jiffies() in headers.
Link: https://lkml.kernel.org/r/20250703225145.152288-1-eahariha@linux.microsoft.com Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com> Closes: https://lore.kernel.org/all/20250129-secs_to_jiffles-v1-1-35a5e16b9f03@chromium.org/ Closes: https://lore.kernel.org/all/20250221162107.409ae333@kernel.org/ Tested-by: Ricardo Ribalda <ribalda@chromium.org> Cc: Julia Lawall <julia.lawall@inria.fr> Cc: Nicolas Palix <nicolas.palix@imag.fr> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Ricardo Ribalda <ribalda@chromium.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
panic: add note that panic_print sysctl interface is deprecated
Add a dedicated core parameter 'panic_console_replay' for controlling
console replay, and add note that 'panic_print' sysctl interface will be
obsoleted by 'panic_sys_info' and 'panic_console_replay'. When it
happens, the SYS_INFO_PANIC_CONSOLE_REPLAY can be removed as well.
Link: https://lkml.kernel.org/r/20250703021004.42328-6-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
panic: add 'panic_sys_info' sysctl to take human readable string parameter
Bitmap definition for 'panic_print' is hard to remember and decode. Add
'panic_sys_info='sysctl to take human readable string like
"tasks,mem,timers,locks,ftrace,..." and translate it into bitmap.
The detailed mapping is:
SYS_INFO_TASKS "tasks"
SYS_INFO_MEM "mem"
SYS_INFO_TIMERS "timers"
SYS_INFO_LOCKS "locks"
SYS_INFO_FTRACE "ftrace"
SYS_INFO_ALL_CPU_BT "all_bt"
SYS_INFO_BLOCKED_TASKS "blocked_tasks"
[nathan@kernel.org: add __maybe_unused to sys_info_avail] Link: https://lkml.kernel.org/r/20250708-fix-clang-sys_info_avail-warning-v1-1-60d239eacd64@kernel.org Link: https://lkml.kernel.org/r/20250703021004.42328-4-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
panic: generalize panic_print's function to show sys info
'panic_print' was introduced to help debugging kernel panic by dumping
different kinds of system information like tasks' call stack, memory,
ftrace buffer, etc. Actually this function could also be used to help
debugging other cases like task-hung, soft/hard lockup, etc. where user
may need the snapshot of system info at that time.
Extract system info dump function related code from panic.c to separate
file sys_info.[ch], for wider usage by other kernel parts for debugging.
Also modify the macro names about singulars/plurals.
Link: https://lkml.kernel.org/r/20250703021004.42328-3-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "generalize panic_print's dump function to be used by other
kernel parts", v3.
When working on kernel stability issues, panic, task-hung and
software/hardware lockup are frequently met. And to debug them, user may
need lots of system information at that time, like task call stacks, lock
info, memory info etc.
panic case already has panic_print_sys_info() for this purpose, and has a
'panic_print' bitmask to control what kinds of information is needed,
which is also helpful to debug other task-hung and lockup cases.
So this patchset extracts the function out to a new file 'lib/sys_info.c',
and makes it available for other cases which also need to dump system info
for debugging.
Also as suggested by Petr Mladek, add 'panic_sys_info=' interface to take
human readable string like "tasks,mem,locks,timers,ftrace,....", and
eventually obsolete the current 'panic_print' bitmap interface.
In RFC and V1 version, hung_task and SW/HW watchdog modules are enabled
with the new sys_info dump interface. In v2, they are kept out for better
review of current change, and will be posted later.
Locally these have been used in our bug chasing for stability issues and
was proven helpful.
Many thanks to Petr Mladek for great suggestions on both the code and
architectures!
This patch (of 5):
Currently the panic_print_sys_info() was called twice with different
parameters to handle console replay case, which is kind of confusing.
Add panic_console_replay() explicitly and rename
'PANIC_PRINT_ALL_PRINTK_MSG' to 'PANIC_CONSOLE_REPLAY', to make the code
straightforward. The related kernel document is also updated.
Link: https://lkml.kernel.org/r/20250703021004.42328-1-feng.tang@linux.alibaba.com Link: https://lkml.kernel.org/r/20250703021004.42328-2-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: John Ogness <john.ogness@linutronix.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <lance.yang@linux.dev> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Bohac [Thu, 12 Jun 2025 10:20:04 +0000 (12:20 +0200)]
x86: implement crashkernel cma reservation
Implement the crashkernel CMA reservation for x86:
- enable parsing of the cma suffix by parse_crashkernel()
- reserve memory with reserve_crashkernel_cma()
- add the CMA-reserved ranges to the e820 map for the crash kernel
- exclude the CMA-reserved ranges from vmcore
Link: https://lkml.kernel.org/r/aEqp1LD2og4QeBw9@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Bohac [Thu, 12 Jun 2025 10:18:40 +0000 (12:18 +0200)]
kdump: wait for DMA to finish when using CMA
When re-using the CMA area for kdump there is a risk of pending DMA into
pinned user pages in the CMA area.
Pages residing in CMA areas can usually not get long-term pinned and are
instead migrated away from the CMA area, so long-term pinning is typically
not a concern. (BUGs in the kernel might still lead to long-term pinning
of such pages if everything goes wrong.)
Pages pinned without FOLL_LONGTERM remain in the CMA and may possibly be
the source or destination of a pending DMA transfer.
Although there is no clear specification how long a page may be pinned
without FOLL_LONGTERM, pinning without the flag shows an intent of the
caller to only use the memory for short-lived DMA transfers, not a
transfer initiated by a device asynchronously at a random time in the
future.
Add a delay of CMA_DMA_TIMEOUT_SEC seconds before starting the kdump
kernel, giving such short-lived DMA transfers time to finish before the
CMA memory is re-used by the kdump kernel.
Set CMA_DMA_TIMEOUT_SEC to 10 seconds - chosen arbitrarily as both a huge
margin for a DMA transfer, yet not increasing the kdump time too
significantly.
Link: https://lkml.kernel.org/r/aEqpgDIBndZ5LXSo@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Describe the new crashkernel ",cma" suffix in Documentation/
Link: https://lkml.kernel.org/r/aEqpQwUy6gqSiUkV@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Bohac [Thu, 12 Jun 2025 10:16:39 +0000 (12:16 +0200)]
kdump: implement reserve_crashkernel_cma
reserve_crashkernel_cma() reserves CMA ranges for the crash kernel. If
allocating the requested size fails, try to reserve in smaller blocks.
Store the reserved ranges in the crashk_cma_ranges array and the number of
ranges in crashk_cma_cnt.
Link: https://lkml.kernel.org/r/aEqpBwOy_ekm0gw9@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jiri Bohac [Thu, 12 Jun 2025 10:13:21 +0000 (12:13 +0200)]
Add a new optional ",cma" suffix to the crashkernel= command line option
Patch series "kdump: crashkernel reservation from CMA", v5.
This series implements a way to reserve additional crash kernel memory
using CMA.
Currently, all the memory for the crash kernel is not usable by the 1st
(production) kernel. It is also unmapped so that it can't be corrupted by
the fault that will eventually trigger the crash. This makes sense for
the memory actually used by the kexec-loaded crash kernel image and initrd
and the data prepared during the load (vmcoreinfo, ...). However, the
reserved space needs to be much larger than that to provide enough
run-time memory for the crash kernel and the kdump userspace. Estimating
the amount of memory to reserve is difficult. Being too careful makes
kdump likely to end in OOM, being too generous takes even more memory from
the production system. Also, the reservation only allows reserving a
single contiguous block (or two with the "low" suffix). I've seen systems
where this fails because the physical memory is fragmented.
By reserving additional crashkernel memory from CMA, the main crashkernel
reservation can be just large enough to fit the kernel and initrd image,
minimizing the memory taken away from the production system. Most of the
run-time memory for the crash kernel will be memory previously available
to userspace in the production system. As this memory is no longer
wasted, the reservation can be done with a generous margin, making kdump
more reliable. Kernel memory that we need to preserve for dumping is
normally not allocated from CMA, unless it is explicitly allocated as
movable. Currently this is only the case for memory ballooning and zswap.
Such movable memory will be missing from the vmcore. User data is
typically not dumped by makedumpfile. When dumping of user data is
intended this new CMA reservation cannot be used.
There are five patches in this series:
The first adds a new ",cma" suffix to the recenly introduced generic
crashkernel parsing code. parse_crashkernel() takes one more argument to
store the cma reservation size.
The second patch implements reserve_crashkernel_cma() which performs the
reservation. If the requested size is not available in a single range,
multiple smaller ranges will be reserved.
The third patch updates Documentation/, explicitly mentioning the
potential DMA corruption of the CMA-reserved memory.
The fourth patch adds a short delay before booting the kdump kernel,
allowing pending DMA transfers to finish.
The fifth patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as this
series is merged.
With this series applied, specifying
crashkernel=100M craskhernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.
An additional 1G will be reserved from CMA, still usable by the production
system. The crash kernel will have 1.1G memory available. The 100M can
be reliably predicted based on the size of the kernel and initrd.
The new cma suffix is completely optional. When no
crashkernel=size,cma is specified, everything works as before.
This patch (of 5):
Add a new cma_size parameter to parse_crashkernel(). When not NULL, call
__parse_crashkernel to parse the CMA reservation size from
"crashkernel=size,cma" and store it in cma_size.
Set cma_size to NULL in all calls to parse_crashkernel().
Link: https://lkml.kernel.org/r/aEqnxxfLZMllMC8I@dwarf.suse.cz Link: https://lkml.kernel.org/r/aEqoQckgoTQNULnh@dwarf.suse.cz Signed-off-by: Jiri Bohac <jbohac@suse.cz> Cc: Baoquan He <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Donald Dutile <ddutile@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Philipp Rudo <prudo@redhat.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Tao Liu <ltao@redhat.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/page_owner: convert set_page_owner_migrate_reason() to folios
Both callers of set_page_owner_migrate_reason() use folios. Convert the
function to take a folio directly and move the &folio->page conversion
inside __set_page_owner_migrate_reason().
Link: https://lkml.kernel.org/r/20250711145910.90135-1-sidhartha.kumar@oracle.com Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Muchun Song <muchun.song@linux.dev> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/memfd: replace deprecated strcpy() with memcpy() in alloc_name()
strcpy() is deprecated; use memcpy() instead.
Not copying the NUL terminator is safe because strncpy_from_user() would
overwrite it anyway by appending uname to the destination buffer at index
MFD_NAME_PREFIX_LEN.
SeongJae Park [Sat, 12 Jul 2025 19:50:14 +0000 (12:50 -0700)]
mm/damon/core: destroy targets when kdamond_fn() finish
When kdamond_fn() completes, the targets are kept. Those are kept to let
callers do additional cleanups if they need. There are no such additional
cleanups though. DAMON sysfs interface deallocates those in
before_terminate() callback, to reduce unnecessary memory usage, for
[f]vaddr use case. Just destroy the targets for every case in the core
layer. This saves more memory and simplifies the logic.
The function was introduced for putting pids and deallocating unnecessary
targets. Hence it is called before damon_destroy_ctx(). Now vaddr puts
pid for each target destruction (cleanup_target()). damon_destroy_ctx()
deallocates the targets anyway. So damon_sysfs_destroy_targets() has no
reason to exist. Remove it.
SeongJae Park [Sat, 12 Jul 2025 19:50:12 +0000 (12:50 -0700)]
mm/damon/vaddr: put pid in cleanup_target()
Implement cleanup_target() callback for [f]vaddr, which calls put_pid()
for each target that will be destroyed. Also remove redundant put_pid()
calls in core, sysfs and sample modules, which were required to be done
redundantly due to the lack of such self cleanup in vaddr.
SeongJae Park [Sat, 12 Jul 2025 19:50:11 +0000 (12:50 -0700)]
mm/damon/core: add cleanup_target() ops callback
Some DAMON operation sets may need additional cleanup per target. For
example, [f]vaddr need to put pids of each target. Each user and core
logic is doing that redundantly. Add another DAMON ops callback that will
be used for doing such cleanups in operations set layer.
SeongJae Park [Sat, 12 Jul 2025 19:50:10 +0000 (12:50 -0700)]
mm/damon/core: do not call ops.cleanup() when destroying targets
damon_operations.cleanup() is documented to be called for kdamond
termination, but also being called for targets destruction, which is done
for any damon_ctx destruction. Nobody is using the callback for now,
though. Remove the cleanup() call under the destruction.
SeongJae Park [Sat, 12 Jul 2025 19:50:07 +0000 (12:50 -0700)]
mm/damon/lru_sort: use damon_call() repeat mode instead of damon_callback
DAMON_LRU_SORT uses damon_callback for periodically reading and writing
DAMON internal data and parameters. Use its alternative, damon_call()
repeat mode.
SeongJae Park [Sat, 12 Jul 2025 19:50:06 +0000 (12:50 -0700)]
mm/damon/reclaim: use damon_call() repeat mode instead of damon_callback
DAMON_RECLAIM uses damon_callback for periodically reading and writing
DAMON internal data and parameters. Use its alternative, damon_call()
repeat mode.
SeongJae Park [Sat, 12 Jul 2025 19:50:04 +0000 (12:50 -0700)]
mm/damon/core: introduce repeat mode damon_call()
damon_call() can be useful for reading or writing DAMON internal data for
one time. A common pattern of DAMON core usage from DAMON modules is
doing such reads and writes repeatedly, for example, to periodically
update the DAMOS stats. To do that with damon_call(), callers should call
damon_call() repeatedly, with their own delay loop. Each caller doing
that is repetitive. Introduce a repeat mode damon_call(). Callers can
use the mode by setting a new field in damon_call_control. If the mode is
turned on, damon_call() returns success immediately, and DAMON repeats
invoking the callback function inside the kdamond main loop.
SeongJae Park [Sat, 12 Jul 2025 19:50:03 +0000 (12:50 -0700)]
mm/damon: accept parallel damon_call() requests
Patch series "mm/damon: remove damon_callback".
damon_callback was the only way for communicating with DAMON for contexts
running on its worker thread. The interface is flexible and simple. But
as DAMON evolves with more features, damon_callback has become somewhat
too old. With runtime parameters update, for example, its lack of
synchronization support was found to be inconvenient. Arguably it is also
not easy to use correctly since the callers should understand when each
callback is called, and implication of the return values from the
callbacks.
To replace it, damon_call() and damos_walk() are introduced. And those
replaced a few damon_callback use cases. Some use cases of damon_callback
such as parallel or repetitive DAMON internal data reading and additional
cleanups cannot simply be replaced by damon_call() and damos_walk(),
though.
To allow those replaceable, extend damon_call() for parallel and/or
repeated callbacks and modify the core/ops layers for additional resources
cleanup. With the updates, replace the remaining damon_callback usages
and finally say goodbye to damon_callback.
This patch (of 14):
Calling damon_call() while it is serving for another parallel thread
immediately fails with -EBUSY. The caller should call it again, later.
Each caller implementing such retry logic would be redundant. Accept
parallel damon_call() requests and do the wait instead of the caller.
Chi Zhiling [Thu, 10 Jul 2025 06:04:51 +0000 (14:04 +0800)]
readahead: use folio_nr_pages() instead of shift operation
folio_nr_pages() is faster helper function to get the number of pages when
NR_PAGES_IN_LARGE_FOLIO is enabled.
Link: https://lkml.kernel.org/r/20250710060451.3535957-1-chizhiling@163.com Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This adds support for allowing proactive reclaim in general on a NUMA
system. A per-node interface extends support for beyond a memcg-specific
interface, respecting the current semantics of memory.reclaim: respecting
aging LRU and not supporting artificially triggering eviction on nodes
belonging to non-bottom tiers.
One of the premises for this is to semantically align as best as possible
with memory.reclaim. During a brief time memcg did support nodemask until 55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for which
semantics around reclaim (eviction) vs demotion were not clear, rendering
charging expectations to be broken.
With this approach:
1. Users who do not use memcg can benefit from proactive reclaim. The
memcg interface is not NUMA aware and there are usecases that are
focusing on NUMA balancing rather than workload memory footprint.
2. Proactive reclaim on top tiers will trigger demotion, for which
memory is still byte-addressable. Reclaiming on the bottom nodes will
trigger evicting to swap (the traditional sense of reclaim). This
follows the semantics of what is today part of the aging process on
tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.
3. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Further per-node interface is less exposed to
"free up memory in my container" usecases, where eviction is intended.
4. Users that *really* want to free up memory can use proactive
reclaim on nodes knowingly to be on the bottom tiers to force eviction
in a natural way - higher access latencies are still better than swap.
If compelled, while no guarantees and perhaps not worth the effort,
users could also also potentially follow a ladder-like approach to
eventually free up the memory. Alternatively, perhaps an 'evict'
option could be added to the parameters for both memory.reclaim and
per-node interfaces to force this action unconditionally.
[akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED contention, per Roman]
[dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel] Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 23 Jun 2025 18:58:50 +0000 (11:58 -0700)]
mm/vmscan: make __node_reclaim() more generic
As this will be called from non page allocator paths for proactive
reclaim, allow users to pass the sc and nr of pages, and adjust the return
value as well. No change in semantics.
Link: https://lkml.kernel.org/r/20250623185851.830632-4-dave@stgolabs.net Signed-off-by: Davidlohr Bueso <dave@stgolabs.net> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Davidlohr Bueso [Mon, 23 Jun 2025 18:58:48 +0000 (11:58 -0700)]
mm/vmscan: respect psi_memstall region in node reclaim
Patch series "mm: per-node proactive reclaim", v2.
This adds support for allowing proactive reclaim in general on a NUMA
system. A per-node interface extends support for beyond a memcg-specific
interface, respecting the current semantics of memory.reclaim: respecting
aging LRU and not supporting artificially triggering eviction on nodes
belonging to non-bottom tiers.
One of the premises for this is to semantically align as best as possible
with memory.reclaim. During a brief time memcg did support nodemask until 55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"), for which
semantics around reclaim (eviction) vs demotion were not clear, rendering
charging expectations to be broken.
With this approach:
1. Users who do not use memcg can benefit from proactive reclaim.
2. Proactive reclaim on top tiers will trigger demotion, for which
memory is still byte-addressable. Reclaiming on the bottom nodes will
trigger evicting to swap (the traditional sense of reclaim). This
follows the semantics of what is today part of the aging process on
tiered memory, mirroring what every other form of reclaim does
(reactive and memcg proactive reclaim). Furthermore per-node proactive
reclaim is not as susceptible to the memcg charging problem mentioned
above.
3. Unlike memcg, there should be no surprises of callers expecting
reclaim but instead got a demotion. Essentially relying on behavior of
shrink_folio_list() after 6b426d071419 ("mm: disable top-tier fallback
to reclaim on proactive reclaim"), without the expectations of
try_to_free_mem_cgroup_pages().
4. Unlike the nodes= arg, this interface avoids confusing semantics,
such as what exactly the user wants when mixing top-tier and low-tier
nodes in the nodemask. Further per-node interface is less exposed to
"free up memory in my container" usecases, where eviction is intended.
5. Users that *really* want to free up memory can use proactive
reclaim on nodes knowingly to be on the bottom tiers to force eviction
in a natural way - higher access latencies are still better than swap.
If compelled, while no guarantees and perhaps not worth the effort,
users could also also potentially follow a ladder-like approach to
eventually free up the memory. Alternatively, perhaps an 'evict'
option could be added to the parameters for both memory.reclaim and
per-node interfaces to force this action unconditionally.
mm/damon/vaddr: apply filters in migrate_{hot/cold}
The paddr versions of migrate_{hot/cold} filter out folios from migration
based on the scheme's filters. This patch does the same for the vaddr
versions of those schemes.
The filtering code is mostly the same for the paddr and vaddr versions.
The exception is the young filter. paddr determines if a page is young by
doing a folio rmap walk to find the page table entries corresponding to
the folio. However, vaddr schemes have easier access to the page tables,
so we add some logic to avoid the extra work.
Link: https://lkml.kernel.org/r/20250709005952.17776-14-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon: move folio filtering from paddr to ops-common
This patch moves damos_pa_filter_match and the functions it calls to
ops-common, renaming it to damos_folio_filter_match. Doing so allows us
to share the filtering logic for the vaddr version of the
migrate_{hot,cold} schemes.
Link: https://lkml.kernel.org/r/20250709005952.17776-13-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon/vaddr: use damos->migrate_dests in migrate_{hot,cold}
damos->migrate_dests provides a list of nodes the migrate_{hot,cold}
actions should migrate to, as well as the weights which specify the ratio
pages should be migrated to each destination node.
This patch interleaves pages in the migrate_{hot,cold} actions according
to the information provided in damos->migrate_dests if it is used. The
interleaving algorithm used is similar to the one used in
weighted_interleave_nid(). If damos->migration_dests is not provided, the
actions migrate pages to the node specified in damos->target_nid as
before.
Link: https://lkml.kernel.org/r/20250709005952.17776-12-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon/vaddr: add vaddr versions of migrate_{hot,cold}
migrate_{hot,cold} are paddr schemes that are used to migrate hot/cold
data to a specified node. However, these schemes are only available when
doing physical address monitoring. This patch adds an implementation for
them virtual address monitoring as well.
Link: https://lkml.kernel.org/r/20250709005952.17776-10-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/damon: move migration helpers from paddr to ops-common
This patch moves the damon_pa_migrate_pages function along with its
corresponding helper functions from paddr to ops-common. The function
prefix of "damon_pa_" was also changed to just "damon_" accordingly.
This patch will allow page migration to be available to vaddr schemes as
well as paddr schemes.
Link: https://lkml.kernel.org/r/20250709005952.17776-9-bijan311@gmail.com Co-developed-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
DAMOS_MIGRATE_{HOT,COLD} can have multiple action destinations and their
weights. Implement sysfs directory named 'dests' under each scheme
directory to let DAMON sysfs ABI users utilize the feature. The interface
is similar to other multiple parameters directory like kdamonds or
filters. The directory contains only nr_dests file initially. Writing a
number of desired destinations to nr_dests creates directories of the
number. Each of the created directories has two files named id and
weight. Users can then write the destination's identifier (node id in
case of DAMOS_MIGRATE_*) and weight to the files.
Link: https://lkml.kernel.org/r/20250709005952.17776-4-bijan311@gmail.com Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Bijan Tabatabai <bijantabatab@micron.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 9 Jul 2025 00:59:32 +0000 (19:59 -0500)]
mm/damon/core: add damos->migrate_dests field
Add a new field to 'struct damos', namely migrate_dests, to allow DAMON
API callers specify multiple migration destination nodes and their
weights. Also update 'struct damos' creation and destruction functions
accordingly to initialize the new field and free up the API
caller-allocated buffers on those, respectively.
Link: https://lkml.kernel.org/r/20250709005952.17776-3-bijan311@gmail.com Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 9 Jul 2025 00:59:31 +0000 (19:59 -0500)]
mm/damon: add struct damos_migrate_dests
Patch series "mm/damon/vaddr: Allow interleaving in migrate_{hot,cold}
actions", v4.
A recent patchset automatically sets the interleave weight for each node
according to the node's maximum bandwidth [1]. In another thread, the
patch set's author, Joshua Hahn, wondered if/how thes weights should be
changed if the bandwidth utilization of the system changes [2].
This patch set adds the mechanism for dynamically changing how application
data is interleaved across nodes while leaving the policy of what the
interleave weights should be to userspace. It does this by having the
migrate_{hot,cold} operating schemes interleave application data according
to the list of migration nodes and weights passed in via the DAMON sysfs
interface. This functionality can be used to dynamically adjust how
folios are interleaved by having a userspace process adjust those weights.
If no specific destination nodes or weights are provided, the
migrate_{hot,cold} actions will only migrate folios to damos->target_nid
as before.
The algorithm used to interleave the folios is similar to the one used for
the weighted interleave mempolicy [3]. It uses the offset from which a
folio is mapped into a VMA to determine the node the folio should be
placed in. This method is convenient because for a given set of
interleave weights, a folio has only one valid node it can be placed in,
limitng the amount of unnecessary data movement. However, finding out how
a folio is mapped inside of a VMA requires a costly rmap walk when using a
paddr scheme. As such, we have decided that this functionality makes more
sense as a vaddr scheme [4]. To this end, this patch set also adds vaddr
versions of the migrate_{hot,cold}.
Motivation
==========
There have been prior discussions about how changing the interleave
weights in response to the system's bandwidth utilization can be
beneficial [2]. However, currently the interleave weights only are
applied when data is allocated. Migrating already allocated pages
according to the dynamically changing weights will better help balance the
bandwidth utilization across nodes.
As a toy example, imagine some application that uses 75% of the local
bandwidth. Assuming sufficient capacity, when running alone, we want to
keep that application's data in local memory. However, if a second
instance of that application begins, using the same amount of bandwidth,
it would be best to interleave the data of both processes to alleviate the
bandwidth pressure from the local node. Likewise, when one of the
processes ends, the data should be moves back to local memory.
We imagine there would be a userspace application that would monitor
system performance characteristics, such as bandwidth utilization or
memory access latency, and uses that information to tune the interleave
weights. Others seem to have come to a similar conclusion in previous
discussions [5]. We are currently working on a userspace program that
does this, but it is not quite ready to be published yet.
After the userspace application tunes the interleave weights, there must
be some mechanism that actually migrates pages to be consistent with those
weights. This patchset is what provides this mechanism.
We believe DAMON is the correct venue for the interleaving mechanism for a
few reasons. First, we noticed that we don't have to migrate all of the
application's pages to improve performance. we just need to migrate the
frequently accessed pages. DAMON's existing hotness traching is very
useful for this. Second, DAMON's quota system can be used to ensure we
are not using too much bandwidth for migrations. Finally, as Ying pointed
out [6], a complete solution must also handle when a memory node is at
capacity. The existing migrate_cold action can be used in conjunction
with the functionality added in this patch set to provide that complete
solution.
Functionality Test
==================
Below is an example of this new functionality in use to confirm that these
patches behave as intended.
In this example, the user starts an application, alloc_data, which
allocates 1GB using the default memory policy (i.e. allocate to local
memory) then sleeps. Afterwards, we start DAMON to interleave the data at
a 1:1 ratio. Using numastat, we show that DAMON has migrated the
application's data to match the new interleave ratio.
For this example, I modified the userspace damo tool [8] to write to the
migration_dest sysfs files. I plan to upstream these changes when these
patches are merged.
$ # Allocate the data initially
$ ./alloc_data 1G &
[1] 6587
$ numastat -c -p alloc_data
Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
Node 0 Node 1 Total
------ ------ -----
Huge 0 0 0
Heap 0 0 0
Stack 0 0 0
Private 1027 0 1027
------- ------ ------ -----
Total 1027 0 1027
$ # Start DAMON to interleave data at a 1:1 ratio
$ cat ./interleave_vaddr.yaml
kdamonds:
- contexts:
- ops: vaddr
addr_unit: null
targets:
- pid: 6587
regions: []
intervals:
sample_us: 500 ms
aggr_us: 5 s
ops_update_us: 20 s
intervals_goal:
access_bp: 0 %
aggrs: '0'
min_sample_us: 0 ns
max_sample_us: 0 ns
nr_regions:
min: '20'
max: '50'
schemes:
- action: migrate_hot
dests:
- nid: 0
weight: 1
- nid: 1
weight: 1
access_pattern:
sz_bytes:
min: 0 B
max: max
nr_accesses:
min: 0 %
max: 100 %
age:
min: 0 ns
max: max
$ sudo ./damo/damo interleave_vaddr.yaml
$ # Verify that DAMON has migrated data to match the 1:1 ratio
$ numastat -c -p alloc_data
Per-node process memory usage (in MBs) for PID 6587 (alloc_data)
Node 0 Node 1 Total
------ ------ -----
Huge 0 0 0
Heap 0 0 0
Stack 0 0 0
Private 514 514 1027
------- ------ ------ -----
Total 514 514 1027
Performance Test
================
Below is a simple example showing that interleaving application data using
these patches can improve application performance. To do this, we run a
bandwidth intensive embedding reduction application [7]. This workload is
useful for this test because it reports the time it takes each iteration
to run and each iteration reuses the same allocation, allowing us to see
the benefits of the migration.
We evaluate this on a 128 core/256 thread AMD CPU with 72GB/s of local DDR
bandwidth and 26 GB/s of CXL bandwidth.
Before we start the workload, the system bandwidth utilization is low, so
we start with the interleave weights of 1:0, i.e. allocating all data to
local memory. When the workload beings, it saturates the local bandwidth,
making the page placement suboptimal. To alleviate this, we modify the
interleave weights, triggering DAMON to migrate the workload's data.
We use the same interleave_vaddr.yaml file to setup DAMON, except we
configure it to begin with a 1:0 interleave ratio, and attach it to the
shell and its children processes.
REPEAT # 0 Baseline Total time : 7323.54 ms
REPEAT # 1 Baseline Total time : 7624.56 ms
REPEAT # 2 Baseline Total time : 7619.61 ms
REPEAT # 3 Baseline Total time : 7617.12 ms
REPEAT # 4 Baseline Total time : 7638.64 ms
REPEAT # 5 Baseline Total time : 7611.27 ms
REPEAT # 6 Baseline Total time : 7629.32 ms
REPEAT # 7 Baseline Total time : 7695.63 ms
# Interleave weights set to 3:1
REPEAT # 8 Baseline Total time : 7077.5 ms
REPEAT # 9 Baseline Total time : 5633.23 ms
REPEAT # 10 Baseline Total time : 5644.6 ms
REPEAT # 11 Baseline Total time : 5627.66 ms
REPEAT # 12 Baseline Total time : 5629.76 ms
REPEAT # 13 Baseline Total time : 5633.05 ms
REPEAT # 14 Baseline Total time : 5641.24 ms
REPEAT # 15 Baseline Total time : 5631.18 ms
REPEAT # 16 Baseline Total time : 5631.33 ms
Updating the interleave weights and having DAMON migrate the workload data
according to the weights resulted in an approximarely 25% speedup.
Patches Sequence
================
Patches 1-7 extend the DAMON API to specify multiple destination nodes and
weights for the migrate_{hot,cold} actions. These patches are from SJ'S
RFC [8].
Patches 8-10 add a vaddr implementation of the migrate_{hot,cold} schemes.
Patch 11 modifies the vaddr migrate_{hot,cold} schemes to interleave data
according to the weights provided by damos->migrate_dest.
Patches 12-13 allow the vaddr migrate_{hot,cold} implementation to filter
out folios like the paddr version.
This patch (of 13):
Introduce a new struct, namely damos_migrate_dests, for specifying
multiple DAMOS' migration destination nodes and their weights.
When committing new scheme parameters from the sysfs, the target_nid field
of the damos struct would not be copied. This would result in the
target_nid field to retain its original value, despite being updated in
the sysfs interface.
This patch fixes this issue by copying target_nid in damos_commit().
Link: https://lkml.kernel.org/r/20250709004729.17252-1-bijan311@gmail.com Fixes: 83dc7bbaecae ("mm/damon/sysfs: use damon_commit_ctx()") Signed-off-by: Bijan Tabatabai <bijantabatab@micron.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Ravi Shankar Jonnalagadda <ravis.opensrc@micron.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Yunjeong Mun [Mon, 7 Jul 2025 23:59:18 +0000 (08:59 +0900)]
samples/damon: support automatic node address detection
This patch adds a new knob `detect_node_addresses`, which determines
whether the physical address range is set manually using the existing
knobs or automatically by the mtier module. When `detect_node_addresses`
set to 'Y', mtier automatically converts node0 and node1 to their physical
addresses. If set to 'N', it uses the existing 'node#_start_addr' and
'node#_end_addr' to define regions as before.
Link: https://lkml.kernel.org/r/20250707235919.513-1-yunjeong.mun@sk.com Signed-off-by: Yunjeong Mun <yunjeong.mun@sk.com> Suggested-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
However, other sample modules of damon use "enable" parameter knobs so
it'd be better to rename them from "enable" to "enabled" to keep the
consistency with other damon modules.
Vlastimil Babka [Wed, 25 Jun 2025 15:51:52 +0000 (17:51 +0200)]
mm, vmstat: remove the NR_WRITEBACK_TEMP node_stat_item counter
The only user of the counter (FUSE) was removed in commit 0c58a97f919c
("fuse: remove tmp folio for writebacks and internal rb tree") so follow
the established pattern of removing the counter and hardcoding 0 in
meminfo output, as done recently with NR_BOUNCE. Update documentation for
procfs, including for the value for Bounce that was missed when removing
its counter.
Also remove the mention of NR_WRITEBACK_TEMP implications from a comment
in wb_position_ratio(). The rest of the comment there about fuse setting
bdi->max_ratio to 1% is still correct.
[vbabka@suse.cz: v2] Link: https://lkml.kernel.org/r/5a848e15-6a57-4ecb-a015-d4f358b8a5d3@suse.cz Link: https://lkml.kernel.org/r/20250625-nr_writeback_removal-v1-1-7f2a0df70faa@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Brendan Jackman <jackmanb@google.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Joanne Koong <joannelkoong@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Maxim Patlasov <mpatlasov@parallels.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zach O'Keefe <zokeefe@google.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm/vmstat: utilize designated initializers for the vmstat_text array
The vmstat_text array defines labels for counters displayed in
/proc/vmstat. The current definition of the array implies a specific
order of the counters in their enums, making it fragile.
To make it clear which counter the label is for, use designated
initializers.
Link: https://lkml.kernel.org/r/20250603084556.113975-1-kirill.shutemov@linux.intel.com Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
The vmstat_text array contains labels for counters displayed in
/proc/vmstat. It is important to keep the labels in sync with the
counters.
There is a BUILD_BUG_ON() check in vmstat_start() that ensures the size of
the vmstat_text is not smaller than VM_EVENT_COUNTERS. This helps to
catch cases where a new counter is added but the label is not. However,
it does not help if a counter is removed but the label remains.
It would be nice to make the BUILD_BUG_ON() check more strict to catch
such cases. However, when compiling with MEMCG enabled but
VM_EVENT_COUNTERS disabled, the vmstat_text array is larger than
NR_VMSTAT_ITEMS.
This issue arises because some elements of the vmstat_text array are
present when either MEMCG or VM_EVENT_COUNTERS is enabled, but
NR_VMSTAT_ITEMS only accounts for these elements if VM_EVENT_COUNTERS is
enabled.
Instead of adjusting the NR_VMSTAT_ITEMS definition to account for MEMCG,
make MEMCG select VM_EVENT_COUNTERS. VM_EVENT_COUNTERS is enabled in most
configurations anyway.
Link: https://lkml.kernel.org/r/20250604095111.533783-1-kirill.shutemov@linux.intel.com Fixes: ebc5d83d0443 ("mm/memcontrol: use vmstat names for printing statistics") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Cc: Konstantin Khlebnikov <koct9i@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: remove boolean output parameters from folio_pte_batch_ext()
Instead, let's just allow for specifying through flags whether we want to
have bits merged into the original PTE.
For the madvise() case, simplify by having only a single parameter for
merging young+dirty. For madvise_cold_or_pageout_pte_range() merging the
dirty bit is not required, but also not harmful. This code is not that
performance critical after all to really force all micro-optimizations.
As we now have two pte_t * parameters, use PageTable() to make sure we are
actually given a pointer at a copy of the PTE, not a pointer into an
actual page table.
Link: https://lkml.kernel.org/r/20250702104926.212243-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: split folio_pte_batch() into folio_pte_batch() and folio_pte_batch_flags()
Many users (including upcoming ones) don't really need the flags etc, and
can live with the possible overhead of a function call.
So let's provide a basic, non-inlined folio_pte_batch(), to avoid code
bloat while still providing a variant that optimizes out all flag checks
at runtime. folio_pte_batch_flags() will get inlined into
folio_pte_batch(), optimizing out any conditionals that depend on input
flags.
folio_pte_batch() will behave like folio_pte_batch_flags() when no flags
are specified. It's okay to add new users of folio_pte_batch_flags(), but
using folio_pte_batch() if applicable is preferred.
So, before this change, folio_pte_batch() was inlined into the C file
optimized by propagating constants within the resulting object file.
With this change, we now also have a folio_pte_batch() that is optimized
by propagating all constants. But instead of having one instance per
object file, we have a single shared one.
In zap_present_ptes(), where we care about performance, the compiler
already seem to generate a call to a common inlined folio_pte_batch()
variant, shared with fork() code. So calling the new non-inlined variant
should not make a difference.
While at it, drop the "addr" parameter that is unused.
Link: https://lkml.kernel.org/r/20250702104926.212243-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/linux-mm/20250503182858.5a02729fcffd6d4723afcfc2@linux-foundation.org/ Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "mm: folio_pte_batch() improvements", v2.
Ever since we added folio_pte_batch() for fork() + munmap() purposes, a
lot more users appeared (and more are being proposed), and more
functionality was added.
Most of the users only need basic functionality, and could benefit from a
non-inlined version.
So let's clean up folio_pte_batch() and split it into a basic
folio_pte_batch() (no flags) and a more advanced folio_pte_batch_ext().
Using either variant will now look much cleaner.
This series will likely conflict with some changes in some (old+new)
folio_pte_batch() users, but conflicts should be trivial to resolve.
This patch (of 4):
Respecting these PTE bits is the exception, so let's invert the meaning.
With this change, most callers don't have to pass any flags. This is a
preparation for splitting folio_pte_batch() into a non-inlined variant
that doesn't consume any flags.
Long-term, we want folio_pte_batch() to probably ignore most common PTE
bits (e.g., write/dirty/young/soft-dirty) that are not relevant for most
page table walkers: uffd-wp and protnone might be bits to consider in the
future. Only walkers that care about them can opt-in to respect them.
No functional change intended.
Link: https://lkml.kernel.org/r/20250702104926.212243-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Dev Jain <dev.jain@arm.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Jann Horn <jannh@google.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Rik van Riel <riel@surriel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Mon, 7 Jul 2025 06:57:11 +0000 (06:57 +0000)]
mm/migrate: remove the -EEXIST conversion for move_pages()
The -EEXIST conversion is introduced in commit 65462462ffb2 ("mm/gup:
follow_pfn_pte(): -EEXIST cleanup"), since follow_page() may call
follow_pfn_pte() which may return -EEXIST.
But after commit 7dff875c9436 ("mm/migrate: convert
add_page_for_migration() from follow_page() to folio_walk"), it use
folio_walk instead. This limit the error code and won't return -EEXIST.
Remove the error code conversion here.
Link: https://lkml.kernel.org/r/20250707065711.18056-1-richard.weiyang@gmail.com Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: Ying Huang <ying.huang@linux.alibaba.com> Cc: Alistair Popple <apopple@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sat, 5 Jul 2025 17:50:00 +0000 (10:50 -0700)]
Docs/mm/damon/maintainer-profile: update for mm-new tree
Recently a new mm tree for new patches, namely mm-new, has been added.
Update DAMON maintainer's profile doc for DAMON patches life cycle, which
depend on those of mm trees.
SeongJae Park [Sat, 5 Jul 2025 17:49:59 +0000 (10:49 -0700)]
mm/damon/sysfs: don't hold kdamond_lock in before_terminate()
damon_sysfs_before_terminate() is a DAMON callback that is executed from
the kdamond's context. Hence it is safe to access DAMON context internal
data. But the function is unnecessarily holding kdamond_lock of the
context. It is just unnecessary. Remove the locking code.
SeongJae Park [Sat, 5 Jul 2025 17:49:58 +0000 (10:49 -0700)]
mm/damon/sysfs: use DAMON core API damon_is_running()
DAMON core implements a static function to see if a given DAMON context is
running. DAMON sysfs interface is implementing the same one on its own.
Make the core function non-static and reuse it from the DAMON sysfs
interface.
SeongJae Park [Sat, 5 Jul 2025 17:49:57 +0000 (10:49 -0700)]
samples/damon/mtier: rename to have damon_sample_ prefix
DAMON sample module, mtier has its name 'mtier'. It could conflict with
future modules, and not very easy to identify it by name. Use a prefix,
"damon_sample_" for the name.
Note that this could break users if they depend on the old name. But it
is just a sample, so no such usage is expected, or known. Even if such
usage exists, updating it for the new name should be straightforward.
SeongJae Park [Sat, 5 Jul 2025 17:49:56 +0000 (10:49 -0700)]
samples/damon/prcl: rename to have damon_sample_ prefix
DAMON sample module, prcl has its name 'prcl'. It could conflict with
future modules, and not very easy to identify it by name. Use a prefix,
"damon_sample_" for the name.
Note that this could break users if they depend on the old name. But it
is just a sample, so no such usage is expected, or known. Even if such
usage exists, updating it for the new name should be straightforward.
SeongJae Park [Sat, 5 Jul 2025 17:49:55 +0000 (10:49 -0700)]
samples/damon/wsse: rename to have damon_sample_ prefix
Patch series "mm/damon: misc cleanups".
Yet another round of miscellaneous DAMON cleanups.
This patch (of 6):
DAMON sample module, wsse has its name 'wsse'. It could conflict with
future modules, and not very easy to identify it by name. Use a prefix,
"damon_sample_" for the name.
Note that this could break users if they depend on the old name. But it
is just a sample, so no such usage is expected, or known. Even if such
usage exists, updating it for the new name should be straightforward.
Baolin Wang [Fri, 4 Jul 2025 03:19:26 +0000 (11:19 +0800)]
mm: fault in complete folios instead of individual pages for tmpfs
After commit acd7ccb284b8 ("mm: shmem: add large folio support for
tmpfs"), tmpfs can also support large folio allocation (not just PMD-sized
large folios).
However, when accessing tmpfs via mmap(), although tmpfs supports large
folios, we still establish mappings at the base page granularity, which is
unreasonable.
We can map multiple consecutive pages of tmpfs folios at once according to
the size of the large folio. On one hand, this can reduce the overhead of
page faults; on the other hand, it can leverage hardware architecture
optimizations to reduce TLB misses, such as contiguous PTEs on the ARM
architecture.
Moreover, tmpfs mount will use the 'huge=' option to control large folio
allocation explicitly. So it can be understood that the process's RSS
statistics might increase, and I think this will not cause any obvious
effects for users.
Performance test:
I created a 1G tmpfs file, populated with 64K large folios, and write-accessed it
sequentially via mmap(). I observed a significant performance improvement:
Before the patch:
real 0m0.158s
user 0m0.008s
sys 0m0.150s
After the patch:
real 0m0.021s
user 0m0.004s
sys 0m0.017s
Link: https://lkml.kernel.org/r/440940e78aeb7430c5cc8b6d2088ae98265b9809.1751599072.git.baolin.wang@linux.alibaba.com Fixes: acd7ccb284b8 ("mm: shmem: add large folio support for tmpfs") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dev Jain <dev.jain@arm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mariano Pache <npache@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'efi-fixes-for-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI fix from Ard Biesheuvel:
- Fix potential memory leak reported by kmemleak
* tag 'efi-fixes-for-v6.16-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efivarfs: Fix memory leak of efivarfs_fs_info in fs_context error paths
Daniel Almeida [Mon, 14 Jul 2025 23:29:58 +0000 (20:29 -0300)]
rust: bits: add support for bits/genmask macros
In light of bindgen being unable to generate bindings for macros, and
owing to the widespread use of these macros in drivers, manually define
the bit and genmask C macros in Rust.
The *_checked version of the functions provide runtime checking while
the const version performs compile-time assertions on the arguments via
the build_assert!() macro.
Reviewed-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Alexandre Courbot <acourbot@nvidia.com> Signed-off-by: Daniel Almeida <daniel.almeida@collabora.com> Reviewed-by: Danilo Krummrich <dakr@kernel.org> Link: https://lore.kernel.org/r/20250714-topics-tyr-genmask2-v9-1-9e6422cbadb6@collabora.com
[ `expect`ed Clippy warning in doctests, hid single `use`, grouped
examples. Reworded title. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Replace `ListLinksSelfPtr::LIST_LINKS_SELF_PTR_OFFSET` with `unsafe fn
raw_get_self_ptr` which returns a pointer to the field rather than
requiring the caller to do pointer arithmetic.
Implement `HasListLinks::raw_get_list_links` in `impl_has_list_links!`,
narrowing the interface of `HasListLinks` and replacing pointer
arithmetic with `container_of!`.
Modify `impl_list_item` to also invoke `impl_has_list_links!` or
`impl_has_list_links_self_ptr!`. This is necessary to allow
`impl_list_item` to see more of the tokens used by
`impl_has_list_links{,_self_ptr}!`.
A similar API change was discussed on the hrtimer series[1].
There's a comprehensive example in `rust/kernel/list.rs` but it doesn't
exercise the `using ListLinksSelfPtr` variant nor the generic cases. Add
that here. Generalize `impl_has_list_links_self_ptr` to handle nested
fields in the same manner as `impl_has_list_links`.
Tested-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Tamir Duberstein <tamird@gmail.com> Link: https://lore.kernel.org/r/20250709-list-no-offset-v4-5-a429e75840a9@gmail.com
[ Fixed Rust < 1.82 build by enabling the `offset_of_nested`
feature. - Miguel ] Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Refer to the type parameters of `impl_has_list_links{,_self_ptr}!` by
the same name used in `impl_list_item!`. Capture type parameters of
`impl_list_item!` as `tt` using `{}` to match the style of all other
macros that work with generics.
Reviewed-by: Christian Schrefl <chrisi.schrefl@gmail.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Signed-off-by: Tamir Duberstein <tamird@gmail.com> Link: https://lore.kernel.org/r/20250709-list-no-offset-v4-2-a429e75840a9@gmail.com Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
Miguel Ojeda [Sat, 19 Jul 2025 18:36:49 +0000 (20:36 +0200)]
rust: list: undo unintended replacement of method name
When we renamed `Opaque::raw_get` to `cast_into`, there was one
replacement that was not supposed to be there.
It does not cause an issue so far because it is inside a macro rule (the
`ListLinksSelfPtr` one) that is unused so far. However, it will start
to be used soon.
Steven Rostedt [Sat, 19 Jul 2025 02:31:58 +0000 (22:31 -0400)]
tracing: Add down_write(trace_event_sem) when adding trace event
When a module is loaded, it adds trace events defined by the module. It
may also need to modify the modules trace printk formats to replace enum
names with their values.
If two modules are loaded at the same time, the adding of the event to the
ftrace_events list can corrupt the walking of the list in the code that is
modifying the printk format strings and crash the kernel.
The addition of the event should take the trace_event_sem for write while
it adds the new event.
Also add a lockdep_assert_held() on that semaphore in
__trace_add_event_dirs() as it iterates the list.
Merge tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix handling of migration disabled tasks in default idle selection
- update_locked_rq() called __this_cpu_write() spuriously with NULL
when @rq was not locked. As the writes were spurious, it didn't break
anything directly. However, the function could be called in a
preemptible leading to a context warning in __this_cpu_write(). Skip
the spurious NULL writes.
- Selftest fix on UP
* tag 'sched_ext-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: idle: Handle migration-disabled tasks in idle selection
sched/ext: Prevent update_locked_rq() calls with NULL rq
selftests/sched_ext: Fix exit selftest hang on UP