]> git.ipfire.org Git - thirdparty/kernel/linux.git/log
thirdparty/kernel/linux.git
4 weeks agoriscv: add platform-specific double word shifts for riscv32
Dmitry Antipov [Tue, 19 May 2026 17:22:57 +0000 (20:22 +0300)] 
riscv: add platform-specific double word shifts for riscv32

Add riscv32-specific '__ashldi3()', '__ashrdi3()', and '__lshrdi3()'.
Initially it was intended to fix the following link error observed when
building EFI-enabled kernel with CONFIG_EFI_STUB=y and
CONFIG_EFI_GENERIC_STUB=y:

riscv32-linux-gnu-ld: ./drivers/firmware/efi/libstub/lib-cmdline.stub.o: in function `__efistub_.L49':
__efistub_cmdline.c:(.init.text+0x1f2): undefined reference to `__efistub___ashldi3'
riscv32-linux-gnu-ld: __efistub_cmdline.c:(.init.text+0x202): undefined reference to `__efistub___lshrdi3'

Reported at [1] trying to build
https://patchew.org/linux/20260212164413.889625-1-dmantipov@yandex.ru,
tested with 'qemu-system-riscv32 -M virt' only.

Link: https://lore.kernel.org/20260519172259.908980-7-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202603041925.KLKqpK6N-lkp@intel.com [1]
Suggested-by: Ard Biesheuvel <ardb@kernel.org>
Tested-by: Charlie Jenkins <thecharlesjenkins@gmail.com>
Assisted-by: Gemini:gemini-3.1-pro-preview sashiko
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andriy Shevchenko <andriy.shevchenko@intel.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/cmdline: adjust a few comments to fix kernel-doc -Wreturn warnings
Dmitry Antipov [Tue, 19 May 2026 17:22:56 +0000 (20:22 +0300)] 
lib/cmdline: adjust a few comments to fix kernel-doc -Wreturn warnings

Fix 'get_option()', 'memparse()' and 'parse_option_str()' comments to
match the commonly used style as suggested by kernel-doc -Wreturn.

Link: https://lore.kernel.org/20260519172259.908980-6-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Charlie Jenkins <thecharlesjenkins@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/cmdline_kunit: add test case for memparse()
Dmitry Antipov [Tue, 19 May 2026 17:22:55 +0000 (20:22 +0300)] 
lib/cmdline_kunit: add test case for memparse()

Better late than never, now there is a long-awaited basic test for
'memparse()' which is provided by cmdline.c.

Link: https://lore.kernel.org/20260519172259.908980-5-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Charlie Jenkins <thecharlesjenkins@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib: add more string to 64-bit integer conversion overflow tests
Dmitry Antipov [Tue, 19 May 2026 17:22:54 +0000 (20:22 +0300)] 
lib: add more string to 64-bit integer conversion overflow tests

Add a few more string to 64-bit integer conversion tests to check whether
'kstrtoull()', 'kstrtoll()', 'kstrtou64()' and 'kstrtos64()' can handle
overflows reported by '_parse_integer_limit()'.

Link: https://lore.kernel.org/20260519172259.908980-4-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Suggested-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Charlie Jenkins <thecharlesjenkins@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib: fix memparse() to handle overflow
Dmitry Antipov [Tue, 19 May 2026 17:22:53 +0000 (20:22 +0300)] 
lib: fix memparse() to handle overflow

Since '_parse_integer_limit()' (and so 'simple_strtoull()') is now capable
to handle overflow, adjust 'memparse()' to handle overflow (denoted by
ULLONG_MAX) returned from 'simple_strtoull()'.  Also use
'check_shl_overflow()' to catch an overflow possibly caused by processing
size suffix and denote it with ULLONG_MAX as well.

Link: https://lore.kernel.org/20260519172259.908980-3-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Charlie Jenkins <thecharlesjenkins@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib: fix _parse_integer_limit() to handle overflow
Dmitry Antipov [Tue, 19 May 2026 17:22:52 +0000 (20:22 +0300)] 
lib: fix _parse_integer_limit() to handle overflow

Patch series "lib and lib/cmdline enhancements", v11.

This series is a merge of the recently posted [1] and [2].  The first one
is intended to adjust '_parse_integer_limit()' and 'memparse()' to not
ignore overflows, extend string to 64-bit integer conversion tests, add
KUnit-based test for 'memparse()' and fix kernel-doc glitches found in
lib/cmdline.c.  The second one was originated from RISCV-specific build
fixes needed to integrate the former and now aims to provide
platform-specific double-word shifts and corresponding KUnit test.

Getting feedback from RISCV core maintainers would be very helpful.

Special thanks to Andy Shevchenko, Charlie Jenkins, and Andrew Morton.

This patch (of 8):

In '_parse_integer_limit()', adjust native integer arithmetic with
near-to-overflow branch where 'check_mul_overflow()' and
'check_add_overflow()' are used to check whether an intermediate result
goes out of range, and denote such a case with ULLONG_MAX, thus making the
function more similar to standard C library's 'strtoull()'.  Adjust
comment to kernel-doc style as well.

Link: https://lore.kernel.org/20260519172259.908980-1-dmantipov@yandex.ru
Link: https://lore.kernel.org/20260519172259.908980-2-dmantipov@yandex.ru
Link: https://lore.kernel.org/linux-riscv/20260403103338.1122415-1-dmantipov@yandex.ru
Link: https://lore.kernel.org/linux-riscv/20260427090105.705529-1-dmantipov@yandex.ru
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andriy Shevchenko <andriy.shevchenko@intel.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Charlie Jenkins <thecharlesjenkins@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/tests: extend cmdline KUnit with next_arg() tests
Shuvam Pandey [Mon, 16 Mar 2026 10:12:27 +0000 (15:57 +0545)] 
lib/tests: extend cmdline KUnit with next_arg() tests

The cmdline KUnit suite covers get_option() and get_options(), but it
does not exercise next_arg().

Extend the suite with one test for a quoted value containing spaces and
one regression test for a bare quote token after a normal parameter.

The regression test covers the bare quote token path fixed by commit
9847f21225c4 ("lib/cmdline: avoid page fault in next_arg").

[shuvampandey1@gmail.com: extend cmdline next_arg() coverage with mixed tokens]
Link: https://lore.kernel.org/20260316211249.88601-1-shuvampandey1@gmail.com
Link: https://lore.kernel.org/20260316101227.15807-1-shuvampandey1@gmail.com
Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com>
Cc: Neel Natu <neelnatu@google.com>
Cc: Dmitry Antipov <dmantipov@yandex.ru>
Cc: "Darrick J. Wong" <djwong@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/perf_events: fix mmap() error check in sigtrap_threads
Hongfu Li [Wed, 13 May 2026 02:58:38 +0000 (10:58 +0800)] 
selftests/perf_events: fix mmap() error check in sigtrap_threads

In sigtrap_threads(), the return value of mmap() is checked against NULL.
mmap() returns MAP_FAILED, which is (void *)-1, not NULL, when it fails.
Since MAP_FAILED is non-zero and non-NULL, the condition "p == NULL" will
never be true on failure, causing the program to proceed with an invalid
pointer and segfault if mmap() actually fails under memory pressure.

Link: https://lore.kernel.org/20260513025838.594945-1-lihongfu@kylinos.cn
Signed-off-by: Hongfu Li <lihongfu@kylinos.cn>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Mickael Salaun <mic@digikod.net>
Cc: SeongJae Park <sj@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Kyle Huey <khuey@kylehuey.com>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/bug: cleanup comment style, types and modernize logging
Lucas Poupeau [Mon, 4 May 2026 20:16:07 +0000 (22:16 +0200)] 
lib/bug: cleanup comment style, types and modernize logging

Improve the overall code quality of lib/bug.c by:
- Reformatting the main documentation block to follow the standard
  kernel multi-line comment style.
- Replacing 'unsigned' with the preferred 'unsigned int'.
- Converting legacy printk() calls to modern pr_warn() and pr_info()
  macros to include proper facility levels and satisfy checkpatch.

Link: https://lore.kernel.org/20260504201607.56932-1-lucasp.linux@gmail.com
Signed-off-by: Lucas Poupeau <lucasp.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoscripts/bloat-o-meter: ignore _sdata
Yury Norov [Mon, 4 May 2026 20:36:05 +0000 (16:36 -0400)] 
scripts/bloat-o-meter: ignore _sdata

_sdata is a linker symbol, but bloat-o-meter may consider it as a real
variable:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 7/1 grow/shrink: 0/0 up/down: 3437/-4096 (-659)
Function                                     old     new   delta
crc32table_le                                  -    1024   +1024
crc32table_be                                  -    1024   +1024
crc32ctable_le                                 -    1024   +1024
byte_rev_table                                 -     256    +256
crc32_be                                       -      39     +39
crc32c                                         -      35     +35
crc32_le                                       -      35     +35
_sdata                                      4096       -   -4096
Total: Before=8592564398, After=8592563739, chg -0.00%

With the patch:

$ scripts/bloat-o-meter vmlinux.orig vmlinux
add/remove: 7/0 grow/shrink: 0/0 up/down: 3437/0 (3437)
Function                                     old     new   delta
crc32table_le                                  -    1024   +1024
crc32table_be                                  -    1024   +1024
crc32ctable_le                                 -    1024   +1024
byte_rev_table                                 -     256    +256
crc32_be                                       -      39     +39
crc32c                                         -      35     +35
crc32_le                                       -      35     +35
Total: Before=8592560302, After=8592563739, chg +0.00%

Link: https://lore.kernel.org/20260504203606.427972-1-ynorov@nvidia.com
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Cc: Valtteri Koskivuori <vkoskiv@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: validate inline xattr header before reflinking inline xattrs
ZhengYuan Huang [Fri, 8 May 2026 08:59:14 +0000 (16:59 +0800)] 
ocfs2: validate inline xattr header before reflinking inline xattrs

[BUG]
A corrupt inline xattr header can make ocfs2_reflink_xattr_inline() lock,
copy, and reflink xattr state from an unchecked ibody xattr header.

[CAUSE]
The inline reflink path still trusted di->i_xattr_inline_size to compute
header_off, xh, and new_xh before handing the source header to the reflink
allocator and copy logic.

[FIX]
Validate the source inode's inline xattr header with the shared helper
first, then derive the reflink copy offsets from the validated inline
size/header. This keeps the reflink path from traversing corrupt ibody
xattr geometry.

Link: https://lore.kernel.org/20260508085914.61647-6-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: validate inline xattr header before inline refcount attach
ZhengYuan Huang [Fri, 8 May 2026 08:59:13 +0000 (16:59 +0800)] 
ocfs2: validate inline xattr header before inline refcount attach

[BUG]
A corrupt inline xattr header can make ocfs2_xattr_inline_attach_refcount()
feed an unchecked header into the refcount-attachment walk for inline
xattr values.

[CAUSE]
The inline refcount-attach path still derived the header directly from
di->i_xattr_inline_size and then passed it to code that iterates xh_count
and xattr entries.

[FIX]
Use the shared ibody header helper before attaching refcounts to inline
xattr values so corrupt header geometry is rejected with -EFSCORRUPTED
instead of being traversed.

Link: https://lore.kernel.org/20260508085914.61647-5-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: validate inline xattr header before ibody remove
ZhengYuan Huang [Fri, 8 May 2026 08:59:12 +0000 (16:59 +0800)] 
ocfs2: validate inline xattr header before ibody remove

[BUG]
A corrupt inline xattr header can make ocfs2_xattr_ibody_remove() pass an
unchecked header into ocfs2_remove_value_outside() during inode xattr
teardown.

[CAUSE]
ocfs2_xattr_ibody_remove() still rebuilt the ibody xattr header directly
from di->i_xattr_inline_size and then handed it to code that iterates
xh_count and entry geometry.

[FIX]
Validate the inline xattr header with the shared helper before handing it
to the outside-value removal path, and propagate -EFSCORRUPTED on bad
metadata instead of traversing the unchecked header.

Link: https://lore.kernel.org/20260508085914.61647-4-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: validate inline xattr header before checking outside values
ZhengYuan Huang [Fri, 8 May 2026 08:59:11 +0000 (16:59 +0800)] 
ocfs2: validate inline xattr header before checking outside values

[BUG]
A corrupt inline xattr header can make
ocfs2_has_inline_xattr_value_outside() walk xh_count from an unchecked
header while refcount-tree teardown decides whether inline xattrs still
point outside the inode body.

[CAUSE]
ocfs2_has_inline_xattr_value_outside() still computed the inline header
directly from di->i_xattr_inline_size and immediately iterated xh_count.
That is the same unchecked metadata boundary as the ibody lookup bug.

[FIX]
Reuse the shared inline-header helper before iterating xh_count. Because
this helper returns a boolean-style answer to its caller, treat a corrupt
header conservatively as "has outside values" instead of walking it.

Link: https://lore.kernel.org/20260508085914.61647-3-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: validate inline xattr header before ibody lookups
ZhengYuan Huang [Fri, 8 May 2026 08:59:10 +0000 (16:59 +0800)] 
ocfs2: validate inline xattr header before ibody lookups

Patch series "ocfs2: validate inline xattr header consumers".

Corrupt i_xattr_inline_size can move the computed inode-body xattr header
outside the dinode block. Several OCFS2 paths then trust xh_count or
xattr entry geometry from that unchecked header.

The reported KASAN splat hits the ibody lookup path:

  BUG: KASAN: use-after-free in ocfs2_xattr_find_entry+0x37b/0x3a0
  ocfs2_xattr_ibody_get()
  ocfs2_xattr_get_nolock()
  ocfs2_calc_xattr_init()

The same unchecked header derivation also exists in the outside-value
probe, ibody remove, inline refcount attach, and inline reflink paths.

This series factors the existing ibody list validation into a shared
helper and then converts the remaining inline-header consumers one at a
time.

Patch layout:

1. validate ibody get/find and reuse the helper in ibody list
2. validate the outside-value probe
3. validate ibody remove
4. validate inline refcount attach
5. validate inline reflink

This patch (of 5):

[BUG]
mknodat() can read past the end of a dinode block when ACL inheritance
walks a corrupted inode-body xattr header. Another report shows the same
unchecked lookup later faulting in the VFS open path after create
returns a garbage status.

KASAN: use-after-free in
ocfs2_xattr_find_entry+0x37b/0x3a0 fs/ocfs2/xattr.c:1078
Read of size 2 at addr ffff88801c520300 by task syz.0.10/360

Trace:
 ...
 ocfs2_xattr_find_entry+0x37b/0x3a0 fs/ocfs2/xattr.c:1078
 ocfs2_xattr_ibody_get fs/ocfs2/xattr.c:1178 [inline]
 ocfs2_xattr_get_nolock+0x2ee/0x1110 fs/ocfs2/xattr.c:1309
 ocfs2_calc_xattr_init+0x716/0xac0 fs/ocfs2/xattr.c:628
 ocfs2_mknod+0x935/0x2400 fs/ocfs2/namei.c:333
 ocfs2_create+0x158/0x390 fs/ocfs2/namei.c:676
 vfs_create fs/namei.c:3493 [inline]
 vfs_create+0x445/0x6f0 fs/namei.c:3477
 do_mknodat+0x2d8/0x5e0 fs/namei.c:4372
 __do_sys_mknodat fs/namei.c:4400 [inline]
 __se_sys_mknodat fs/namei.c:4397 [inline]
 __x64_sys_mknodat+0xb6/0xf0 fs/namei.c:4397
 ...

Another report:
 BUG: unable to handle page fault for address: fffffbfff3e40ec0
 RIP: 0010:__d_entry_type include/linux/dcache.h:414 [inline]
 RIP: 0010:d_can_lookup include/linux/dcache.h:429 [inline]
 RIP: 0010:d_is_dir include/linux/dcache.h:439 [inline]
 RIP: 0010:path_openat+0xe2f/0x2ce0 fs/namei.c:4134

Trace:
 ...
 do_filp_open+0x1f6/0x430 fs/namei.c:4161
 do_sys_openat2+0x117/0x1c0 fs/open.c:1437
 __x64_sys_openat+0x15b/0x220 fs/open.c:1463
 ...

[CAUSE]
ocfs2_xattr_ibody_list() already validates the inline xattr size and
entry count, but ocfs2_xattr_ibody_get() and ocfs2_xattr_ibody_find()
still derive the inline header directly from di->i_xattr_inline_size and
then trust xh_count. A corrupted inline size or entry count can therefore
move the computed header outside the dinode block before get/find start
walking it. That can either make ocfs2_xattr_find_entry() dereference
xs->header->xh_count outside the block or make ocfs2_xattr_get_nolock()
bubble a garbage status back through ocfs2_calc_xattr_init() into the
create/open path.

[FIX]
Factor the existing ibody header geometry checks into a shared helper.
Use it in ocfs2_xattr_ibody_get() and ocfs2_xattr_ibody_find(), and have
ocfs2_xattr_ibody_list() reuse the same helper instead of open-coding
the validation. Reject corrupt ibody metadata with -EFSCORRUPTED before
the lookup path can walk bogus xattr geometry or return a garbage status.

Link: https://lore.kernel.org/20260508085914.61647-1-gality369@gmail.com
Link: https://lore.kernel.org/20260508085914.61647-2-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Jia-Ju Bai <baijiaju1990@gmail.com>
Cc: Zixuan Fu <r33s3n6@gmail.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: don't BUG_ON an invalid journal dinode
ZhengYuan Huang [Tue, 12 May 2026 02:41:15 +0000 (10:41 +0800)] 
ocfs2: don't BUG_ON an invalid journal dinode

[BUG]
A fuzzed OCFS2 image can corrupt the current slot journal dinode while
mount is still in progress. The mount path first reports the invalid
journal block and then crashes in shutdown:

kernel BUG at fs/ocfs2/journal.c:1034!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:ocfs2_journal_toggle_dirty+0x2d6/0x340 fs/ocfs2/journal.c:1034
Call Trace:
 ocfs2_journal_shutdown+0x414/0xc30 fs/ocfs2/journal.c:1116
 ocfs2_mount_volume fs/ocfs2/super.c:1785 [inline]
 ocfs2_fill_super+0x30a9/0x3cd0 fs/ocfs2/super.c:1083
 get_tree_bdev_flags+0x38b/0x640 fs/super.c:1698
 get_tree_bdev+0x24/0x40 fs/super.c:1721
 ocfs2_get_tree+0x21/0x30 fs/ocfs2/super.c:1184
 vfs_get_tree+0x9a/0x370 fs/super.c:1758
 fc_mount fs/namespace.c:1199 [inline]
 do_new_mount_fc fs/namespace.c:3642 [inline]
 do_new_mount fs/namespace.c:3718 [inline]
 path_mount+0x5b8/0x1ea0 fs/namespace.c:4028
 do_mount fs/namespace.c:4041 [inline]
 __do_sys_mount fs/namespace.c:4229 [inline]
 __se_sys_mount fs/namespace.c:4206 [inline]
 __x64_sys_mount+0x282/0x320 fs/namespace.c:4206
 ...

[CAUSE]
ocfs2_journal_toggle_dirty() used to return -EIO when journal->j_bh no
longer contained a valid dinode, because the startup and shutdown paths
already handled that failure. Commit 10995aa2451a
("ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.") changed
the check to a BUG_ON() under the assumption that the journal dinode had
already been validated. That turns an unexpected invalid journal dinode
during mount teardown into a kernel crash instead of a normal mount
failure.

[FIX]
Replace the BUG_ON() with WARN_ON() and return -EIO. This keeps the
invariant warning for debugging, but restores the original behavior of
failing startup or shutdown cleanly instead of panicking the kernel.

Link: https://lore.kernel.org/20260512024115.4036371-1-gality369@gmail.com
Fixes: 10995aa2451a ("ocfs2: Morph the haphazard OCFS2_IS_VALID_DINODE() checks.")
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: reject inconsistent inode size before truncate
ZhengYuan Huang [Tue, 12 May 2026 02:16:00 +0000 (10:16 +0800)] 
ocfs2: reject inconsistent inode size before truncate

[BUG]
openat(..., O_WRONLY|O_CREAT|O_TRUNC) can hit:

kernel BUG at fs/ocfs2/file.c:454!
Oops: invalid opcode: 0000 [#1] SMP KASAN NOPTI
RIP: 0010:ocfs2_truncate_file+0x1204/0x13c0 fs/ocfs2/file.c:454
Call Trace:
 ocfs2_setattr+0xa6d/0x1fd0 fs/ocfs2/file.c:1212
 notify_change+0x4b5/0x1030 fs/attr.c:546
 do_truncate+0x1d2/0x230 fs/open.c:68
 handle_truncate fs/namei.c:3596 [inline]
 do_open fs/namei.c:3979 [inline]
 path_openat+0x260f/0x2ce0 fs/namei.c:4134
 do_filp_open+0x1f6/0x430 fs/namei.c:4161
 do_sys_openat2+0x117/0x1c0 fs/open.c:1437
 do_sys_open fs/open.c:1452 [inline]
 __do_sys_openat fs/open.c:1468 [inline]
 __se_sys_openat fs/open.c:1463 [inline]
 __x64_sys_openat+0x15b/0x220 fs/open.c:1463
 ...

[CAUSE]
ocfs2_truncate_file() treats di_bh->i_size matching inode->i_size as an
internal code invariant and BUGs if it is broken.

That assumption is too strong for corrupted metadata. The dinode block can
still be structurally valid enough to pass ocfs2_read_inode_block() while
no longer matching an already-instantiated VFS inode. On local mounts,
ocfs2_inode_lock_update() skips refresh entirely, so truncate can
observe the mismatch directly and crash instead of rejecting the
corruption.

[FIX]
Turn the BUG_ON into normal OCFS2 corruption handling. If truncate sees
di_bh->i_size disagree with inode->i_size, report it with ocfs2_error() and
abort before touching truncate state.

This keeps the fix at the first boundary that actually requires the
sizes to match and avoids widening checks into hotter generic
inode-lock paths

Link: https://lore.kernel.org/20260512021601.3936417-1-gality369@gmail.com
Signed-off-by: ZhengYuan Huang <gality369@gmail.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agogcov: use atomic counter updates to fix concurrent access crashes
Konstantin Khorenko [Mon, 11 May 2026 10:50:52 +0000 (12:50 +0200)] 
gcov: use atomic counter updates to fix concurrent access crashes

GCC's GCOV instrumentation can merge global branch counters with loop
induction variables as an optimization.  In inflate_fast(), the inner copy
loops get transformed so that the GCOV counter value is loaded multiple
times to compute the loop base address, start index, and end bound.  Since
GCOV counters are global (not per-CPU), concurrent execution on different
CPUs causes the counter to change between loads, producing inconsistent
values and out-of-bounds memory writes.

The crash manifests during IPComp (IP Payload Compression) processing when
inflate_fast() runs concurrently on multiple CPUs:

  BUG: unable to handle page fault for address: ffffd0a3c0902ffa
  RIP: inflate_fast+1431
  Call Trace:
   zlib_inflate
   __deflate_decompress
   crypto_comp_decompress
   ipcomp_decompress [xfrm_ipcomp]
   ipcomp_input [xfrm_ipcomp]
   xfrm_input

At the crash point, the compiler generated three loads from the same
global GCOV counter (__gcov0.inflate_fast+216) to compute base, start, and
end for an indexed loop.  Another CPU modified the counter between loads,
making the values inconsistent - the write went 3.4 MB past a 65 KB
buffer.

Add -fprofile-update=prefer-atomic to CFLAGS_GCOV at the global level in
the top-level Makefile, guarded by a try-run compile test.  The test
compiles a minimal program with and without -fprofile-update=prefer-atomic
using the full KBUILD_CFLAGS, then compares undefined symbols in the
resulting object files.  If prefer-atomic introduces new undefined
references (such as __atomic_fetch_add_8 on i386 or __aarch64_ldadd8_relax
on arm64 with outline-atomics), the flag is not added -- the kernel does
not link against libatomic.

On architectures where GCC inlines 64-bit atomic counter updates (x86_64,
s390, ...) the test passes and the flag is enabled, preventing the
compiler from merging counters with loop induction variables and fixing
the observed concurrent-access crash.

On architectures where the flag would introduce libatomic dependencies, it
is silently omitted and behaviour is no worse than before this patch.

Move the CFLAGS_GCOV block from its original position (before the arch
Makefile include) to after the core KBUILD_CFLAGS assignments but before
the scripts/Makefile.gcc-plugins include.  This placement ensures the
try-run test sees arch-specific flags (-m32, -march=,
-mno-outline-atomics) while avoiding GCC plugin flags (-fplugin=) that
would break the test on clean builds when plugin shared objects do not yet
exist.

Link: https://lore.kernel.org/20260511105052.417187-2-khorenko@virtuozzo.com
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Tested-by: Arnd Bergmann <arnd@arndb.de>
Tested-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Reviewed-by: Peter Oberparleiter <oberpar@linux.ibm.com>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Miguel Ojeda <ojeda@kernel.org>
Cc: Mikhail Zaslonko <zaslonko@linux.ibm.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Thomas Weißschuh <linux@weissschuh.net>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agollist: make locking comments consistent
Philipp Stanner [Thu, 7 May 2026 09:49:19 +0000 (11:49 +0200)] 
llist: make locking comments consistent

llist's locking requirement table has a legend which claims that all
operations not needing a lock a marked with '-', whereas in truth for some
table entries just a whitespace is used.

Add the '-' to all appropriate places.

Link: https://lore.kernel.org/20260507094918.23910-2-phasta@kernel.org
Signed-off-by: Philipp Stanner <phasta@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: "Paul E . McKenney" <paulmck@kernel.org>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokfence: fix KASAN HW tags bypass via runtime sample_interval change
Alexander Potapenko [Thu, 7 May 2026 09:52:37 +0000 (11:52 +0200)] 
kfence: fix KASAN HW tags bypass via runtime sample_interval change

If a user writes a non-zero value to the sample_interval module parameter
at runtime, the missing KASAN HW tags check in the late init path allows
KFENCE to be enabled alongside KASAN HW tags, bypassing the boot
restriction.

This patch adds the missing check to param_set_sample_interval() to reject
the parameter change if KASAN HW tags are enabled.

Link: https://lore.kernel.org/20260507095237.741017-1-glider@google.com
Fixes: 09833d99db36 ("mm/kfence: disable KFENCE upon KASAN HW tags enablement")
Signed-off-by: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Pimyn Girgis <pimyn@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoMAINTAINERS/CREDITS: remove inactive checkpatch reviewers
Joe Perches [Thu, 7 May 2026 17:41:46 +0000 (10:41 -0700)] 
MAINTAINERS/CREDITS: remove inactive checkpatch reviewers

Dwaipayan Ray and Lukas Bulwahn have not commented on checkpatch in
several years.

Lukas is still active on MAINTAINERS.
Create an entry in CREDITS for Dwaipayan.

Link: https://lore.kernel.org/64f057d1d7f247583eb616337b89b3ff7bcc627f.camel@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agorapidio/tsi721: prevent a bad dereference in tsi721_db_dpc()
Dan Carpenter [Fri, 8 May 2026 07:51:56 +0000 (10:51 +0300)] 
rapidio/tsi721: prevent a bad dereference in tsi721_db_dpc()

With a list_for_each() loop, if we don't find the item we are looking for
in the list, then the loop exits with the iterator, which is "dbell" in
this loop, pointing to invalid memory.

This code uses the "found" variable to determine if we have found the
doorbell we are looking for or not.  However, the problem that the "found"
variable needs to be set to false at the start of each iteration,
otherwise after the first correct doorbell, then everything is marked as
found.

Reset the "found" to false at the start of the iteration and move the
variable inside the loop.

Link: https://lore.kernel.org/af2WHMZiqMwdYveO@stanley.mountain
Fixes: 48618fb4e522 ("RapidIO: add mport driver for Tsi721 bridge")
Signed-off-by: Dan Carpenter <error27@gmail.com>
Cc: Alexandre Bounine <alex.bou9@gmail.com>
Cc: Chul Kim <chul.kim@idt.com>
Cc: Matt Porter <mporter@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokcov: refactor common handle ID into kcov_common_handle_id
Jann Horn [Thu, 30 Apr 2026 14:15:33 +0000 (16:15 +0200)] 
kcov: refactor common handle ID into kcov_common_handle_id

Store common handle IDs in "struct kcov_common_handle_id", which consumes
no space in non-KCOV builds.

This cleanup removes #ifdef boilerplate code from subsystems that
integrate with KCOV (in particular in usbip_common.h and skbuff.h, see the
diffstat).

This should also make it easier to add KCOV remote coverage to more
subsystems in the future.

Link: https://lore.kernel.org/20260430-kcov-refactor-common-handle-v1-1-23a0c7a0ba38@google.com
Signed-off-by: Jann Horn <jannh@google.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Eugenio Pérez <eperezma@redhat.com>
Cc: Hongren (Zenithal) Zheng <i@zenithal.me>
Cc: Jann Horn <jannh@google.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Valentina Manea <valentina.manea.m@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agouaccess: minimize INLINE_COPY_USER-related ifdefery
Yury Norov [Sat, 25 Apr 2026 02:08:57 +0000 (22:08 -0400)] 
uaccess: minimize INLINE_COPY_USER-related ifdefery

Now that we've got the same config selecting inline vs outline
copy_to_user() and copy_from_user(), we can simplify the corresponding
logic in the uaccess.h.

Link: https://lore.kernel.org/20260425020857.356850-4-ynorov@nvidia.com
Fixes: 1f9a8286bc0c ("uaccess: always export _copy_[from|to]_user with CONFIG_RUST")
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Tested-by: Alice Ryhl <aliceryhl@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agouaccess: unify inline vs outline copy_{from,to}_user() selection
Yury Norov [Sat, 25 Apr 2026 02:08:56 +0000 (22:08 -0400)] 
uaccess: unify inline vs outline copy_{from,to}_user() selection

The kernel allows arches to select between inline and outline
implementations of the copy_{from,to}_user() by defining individual
INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER, correspondingly.  However,
all arches enable or disable them always together.

Without the real use-case for one helper being inlined while the other
outlined, having independent controls is excessive and error prone.

Switch the codebase to the single unified INLINE_COPY_USER control.

Link: https://lore.kernel.org/20260425020857.356850-3-ynorov@nvidia.com
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Tested-by: Alice Ryhl <aliceryhl@google.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Viktor Malik <vmalik@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agorust: uaccess: use INLINE_COPY_TO_USER to guard copy_to_user()
Yury Norov [Sat, 25 Apr 2026 02:08:55 +0000 (22:08 -0400)] 
rust: uaccess: use INLINE_COPY_TO_USER to guard copy_to_user()

Patch series "uaccess: unify inline vs outline copy_{from,to}_user()
selection", v2.

The kernel allows arches to select between inline and outline
implementations of the copy_{from,to}_user() by defining individual
INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER, correspondingly.  However,
all arches enable or disable them always together.

Without the real use-case for one helper being inlined while the other
outlined, having independent controls is excessive and error prone.

The first patch of the series fixes rust/uaccess coppy_to_user() wrapper
guarded with INLINE_COPY_FROM_USER.  The 2nd patch switches codebase to
the unified INLINE_COPY_USER.  And the last patch cleans up ifdefery in
the include/linux/uaccess.h

This patch (of 3):

The copy_to_user() rust helper is only needed when the main kernel inlines
the function.  It is controlled by INLINE_COPY_TO_USER, but the rust
helper is protected with INLINE_COPY_FROM_USER.

Fix that.

Link: https://lore.kernel.org/20260425020857.356850-1-ynorov@nvidia.com
Link: https://lore.kernel.org/20260425020857.356850-2-ynorov@nvidia.com
Fixes: d99dc586ca7c7 ("uaccess: decouple INLINE_COPY_FROM_USER and CONFIG_RUST")
Signed-off-by: Yury Norov <ynorov@nvidia.com>
Reported-by: Christophe Leroy (CS GROUP) <chleroy@kernel.org>
Closes: https://lore.kernel.org/all/746c9c50-20c4-4dc9-a539-bf1310ff9414@kernel.org/
Cc: Alice Ryhl <aliceryhl@google.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Viktor Malik <vmalik@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokselftest/filelock: add a .gitignore file
Mark Brown [Thu, 26 Feb 2026 16:05:27 +0000 (16:05 +0000)] 
kselftest/filelock: add a .gitignore file

Tell git to ignore the generated binary for the test.

Link: https://lore.kernel.org/20260226-selftest-filelock-ktap-v4-3-db8ae192ff42@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokselftest/filelock: report each test in oftlocks separately
Mark Brown [Thu, 26 Feb 2026 16:05:26 +0000 (16:05 +0000)] 
kselftest/filelock: report each test in oftlocks separately

The filelock test checks four different things but only reports an overall
status, convert to use ksft_test_result() for these individual tests.
Each test depends on the previous ones so we still bail out if any of them
fail but we get a bit more information from UIs parsing the results.

Link: https://lore.kernel.org/20260226-selftest-filelock-ktap-v4-2-db8ae192ff42@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokselftest/filelock: use ksft_perror()
Mark Brown [Thu, 26 Feb 2026 16:05:25 +0000 (16:05 +0000)] 
kselftest/filelock: use ksft_perror()

Patch series "selftests/filelock: Make output more kselftestish", v4.

This series makes the output from the ofdlocks test a bit easier for
tooling to work with, and also ignores the generated file while we're
here.

This patch (of 3):

The ofdlocks test reports some errors via perror() which does not produce
KTAP output, convert to ksft_perror() which does.

Link: https://lore.kernel.org/20260226-selftest-filelock-ktap-v4-0-db8ae192ff42@kernel.org
Link: https://lore.kernel.org/20260226-selftest-filelock-ktap-v4-1-db8ae192ff42@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/base64: fix copy-pasted @padding doc in base64_decode()
Josh Law [Tue, 24 Mar 2026 22:32:10 +0000 (22:32 +0000)] 
lib/base64: fix copy-pasted @padding doc in base64_decode()

The @padding kernel-doc for base64_decode() says "whether to append '='
padding characters", which was copy-pasted from base64_encode().  In the
decode context, it controls whether the input is expected to include
padding, not whether to append it.

Link: https://lore.kernel.org/20260324223210.47676-3-objecting@objecting.org
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/base64: validate before writing in decode tail path
Josh Law [Tue, 24 Mar 2026 22:32:09 +0000 (22:32 +0000)] 
lib/base64: validate before writing in decode tail path

Patch series "lib/base64: decode fixes", v2.

Two small fixes for lib/base64.c:

1. base64_decode() writes a decoded byte to the output buffer before
   validating the input in the trailing-bytes path. Move the validity
   checks before any writes so dst is untouched on invalid input.

2. The @padding kernel-doc for base64_decode() was copy-pasted from
   base64_encode() and describes the wrong direction.

This patch (of 2):

The trailing-bytes path in base64_decode() writes a decoded byte to the
output buffer before checking whether the input characters are valid.  If
the input is malformed, garbage is written to dst before the function
returns -1.

Move the validity checks before any writes so the output buffer is left
untouched on invalid input.

Link: https://lore.kernel.org/20260324223210.47676-1-objecting@objecting.org
Link: https://lore.kernel.org/20260324223210.47676-2-objecting@objecting.org
Fixes: 9c7d3cf94d33 ("lib/base64: rework encode/decode for speed and stricter validation")
Signed-off-by: Josh Law <objecting@objecting.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoinit.h: discard exitcall symbols early
Arnd Bergmann [Tue, 31 Mar 2026 14:28:38 +0000 (16:28 +0200)] 
init.h: discard exitcall symbols early

Any __exitcall() and built-in module_exit() handler is marked as __used,
which leads to the code being included in the object file and later
discarded at link time.

As far as I can tell, this was originally added at the same time as
initcalls were marked the same way, to prevent them from getting dropped
with gcc-3.4, but it was never actaully necessary to keep exit functions
around.

Mark them as __maybe_unused instead, which lets the compiler treat the
exitcalls as entirely unused, and make better decisions about dropping
specializing static functions called from these.

Link: https://lore.kernel.org/all/acruxMNdnUlyRHiy@google.com/
Link: https://lore.kernel.org/20260331142846.3187706-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Nicolas Schier <nsc@kernel.org>
Cc: Andriy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@kernel.org>
Cc: Kees Cook <kees@kernel.org>
Cc: Marco Elver <elver@google.com>
Cc: Nathan Chancellor <nathan@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/tests: string_helpers: don't use "proxy" headers
Andy Shevchenko [Mon, 6 Apr 2026 19:32:48 +0000 (21:32 +0200)] 
lib/tests: string_helpers: don't use "proxy" headers

Update header inclusions to follow IWYU (Include What You Use) principle.

Link: https://lore.kernel.org/20260406193425.1534197-3-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agolib/tests: string_helpers: decouple unescape and escape cases
Andy Shevchenko [Mon, 6 Apr 2026 19:32:47 +0000 (21:32 +0200)] 
lib/tests: string_helpers: decouple unescape and escape cases

Patch series "lib/tests: string_helpers: Slight improvements".

Two ad-hoc patches to improve the test module.  It was induced by another
patch that poorly tried to add (existing) test cases and make me revisit
string_helpers_kunit.c.

This patch (of 2):

Currently the escape and unescape test cases go in one step.  Decouple
them for the better granularity and understanding test coverage in the
results.

Link: https://lore.kernel.org/20260406193425.1534197-1-andriy.shevchenko@linux.intel.com
Link: https://lore.kernel.org/20260406193425.1534197-2-andriy.shevchenko@linux.intel.com
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agotreewide: fix indentation and whitespace in Kconfig files
Anand Moon [Tue, 7 Apr 2026 05:39:34 +0000 (11:09 +0530)] 
treewide: fix indentation and whitespace in Kconfig files

Clean up inconsistent indentation (mixing tabs and spaces) and remove
extraneous whitespace in several Kconfig files across the tree.  This is a
purely cosmetic change to improve readability.

Adjust indentation from spaces to tab (+optional two spaces) as in
coding style with command like:

$ sed -e 's/^        /\t/' -i */Kconfig

Link: https://lore.kernel.org/20260407053945.14116-1-linux.amoon@gmail.com
Signed-off-by: Anand Moon <linux.amoon@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz> [fs]
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> [mm]
Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> [mm]
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoget_maintainer: add --json output mode
Sasha Levin [Wed, 8 Apr 2026 19:45:42 +0000 (15:45 -0400)] 
get_maintainer: add --json output mode

Add a --json flag to get_maintainer.pl that emits structured JSON output,
making results machine-parseable for CI systems, IDE integrations, and
AI-assisted development tools.

The JSON output includes a maintainers array with structured name, email,
and role fields, plus optional arrays for scm, status, subsystem, web, and
bug information when those flags are enabled.

Normal text output behavior is completely unchanged when --json is not
specified.

Assisted-by: Claude:claude-opus-4-6
Link: https://lore.kernel.org/20260408194542.1354549-1-sashal@kernel.org
Signed-off-by: Sasha Levin <sashal@kernel.org>
Acked-by: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoseq_buf: export seq_buf_putmem_hex() and add KUnit tests
Shuvam Pandey [Wed, 8 Apr 2026 20:23:51 +0000 (02:08 +0545)] 
seq_buf: export seq_buf_putmem_hex() and add KUnit tests

The seq_buf KUnit suite does not exercise seq_buf_putmem_hex().

Add one test for the len > 8 chunking path and one overflow test where a
later chunk no longer fits in the buffer.

Export seq_buf_putmem_hex() as well so SEQ_BUF_KUNIT_TEST=m links cleanly.
Without the export, modpost reports seq_buf_putmem_hex as undefined when
seq_buf_kunit is built as a module.

Link: https://lore.kernel.org/20260408202351.21829-1-shuvampandey1@gmail.com
Signed-off-by: Shuvam Pandey <shuvampandey1@gmail.com>
Acked-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Cc: David Gow <david@davidgow.net>
Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/acct: add taskstats TGID retention test
Yiyang Chen [Mon, 13 Apr 2026 15:45:45 +0000 (23:45 +0800)] 
selftests/acct: add taskstats TGID retention test

Add a kselftest for the taskstats TGID aggregation fix.

The test creates a worker thread, snapshots TGID taskstats while the
worker is still alive, lets the worker exit, and then verifies that the
TGID CPU total does not regress after the thread has been reaped.

The pass/fail check intentionally keys off ac_utime + ac_stime only, which
is the primary user-visible regression fixed by the taskstats change and
is less sensitive to scheduling noise than context-switch counters.

Link: https://lore.kernel.org/0d55354911c54cd1b9f10a09f6fd378af85c8d43.1776094300.git.cyyzero16@gmail.com
Signed-off-by: Yiyang Chen <cyyzero16@gmail.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agotaskstats: retain dead thread stats in TGID queries
Yiyang Chen [Mon, 13 Apr 2026 15:45:44 +0000 (23:45 +0800)] 
taskstats: retain dead thread stats in TGID queries

Patch series "taskstats: fix TGID dead-thread stat retention", v3.

This series fixes a taskstats TGID aggregation bug where fields added in
the TGID query path were not preserved after thread exit, and adds a
kselftest covering the regression.

The first patch keeps the cached TGID aggregate used for dead threads in
step with the fields already accumulated for live threads, and also fixes
the final TGID exit notification emitted when group_dead is true.

The second patch adds a kselftest that verifies TGID CPU stats do not
regress after a worker thread exits and has been reaped.

This patch (of 2):

fill_stats_for_tgid() builds TGID stats from two sources: the cached
aggregate in signal->stats and a scan of the live threads in the group.

However, fill_tgid_exit() only accumulates delay accounting into
signal->stats.  This means that once a thread exits, TGID queries lose the
fields that fill_stats_for_tgid() adds for live threads.

This gap was introduced incrementally by two earlier changes that extended
fill_stats_for_tgid() but did not make the corresponding update to
fill_tgid_exit():

- commit 8c733420bdd5 ("taskstats: add e/u/stime for TGID command")
  added ac_etime, ac_utime, and ac_stime to the TGID query path.
- commit b663a79c1915 ("taskstats: add context-switch counters")
  added nvcsw and nivcsw to the TGID query path.

As a result, those fields were accounted for live threads in TGID queries,
but were dropped from the cached TGID aggregate after thread exit.  The
final TGID exit notification emitted when group_dead is true also copies
that cached aggregate, so it loses the same fields.

Factor the per-task TGID accumulation into tgid_stats_add_task() and use
it in both fill_stats_for_tgid() and fill_tgid_exit().  This keeps the
cached aggregate used for dead threads aligned with the live-thread
accumulation used by TGID queries.

Link: https://lore.kernel.org/cover.1776094300.git.cyyzero16@gmail.com
Link: https://lore.kernel.org/abd2a15d33343636ab5ba43d540bcfe508bd66c7.1776094300.git.cyyzero16@gmail.com
Fixes: 8c733420bdd5 ("taskstats: add e/u/stime for TGID command")
Fixes: b663a79c1915 ("taskstats: add context-switch counters")
Signed-off-by: Yiyang Chen <cyyzero16@gmail.com>
Acked-by: Balbir Singh <balbirs@nvidia.com>
Cc: Dr. Thomas Orgis <thomas.orgis@uni-hamburg.de>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoclang-format: fix formatting of guard() and scoped_guard() statements
Bart Van Assche [Mon, 13 Apr 2026 18:23:48 +0000 (11:23 -0700)] 
clang-format: fix formatting of guard() and scoped_guard() statements

Without this patch clang-format formats guard() and scoped_guard()
statements as follows:

guard(...)(...)
{
...
}

With this patch clang-format formats guard() and scoped_guard()
statements as follows:

guard(...)(...) {
...
}

Link: https://lore.kernel.org/20260413182348.1865138-1-bvanassche@acm.org
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokunit: fat: test cluster and directory i_pos layout helpers
Adi Nata [Sun, 5 Apr 2026 01:19:20 +0000 (09:19 +0800)] 
kunit: fat: test cluster and directory i_pos layout helpers

Add KUnit coverage for fat_clus_to_blknr() and fat_get_blknr_offset()
using stub msdos_sb_info values so cluster-to-sector and i_pos split math
stays correct.

Link: https://lore.kernel.org/20260405011920.28622-1-adinata.softwareengineer@gmail.com
Signed-off-by: Adi Nata <adinata.softwareengineer@gmail.com>
Acked-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Cc: Christian Brauner <brauner@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agocheckpatch: add check for function pointer arrays in declarations
Joe Perches [Sat, 18 Apr 2026 17:34:59 +0000 (10:34 -0700)] 
checkpatch: add check for function pointer arrays in declarations

checkpatch did not allow function pointer arrays when testing
declaration blocks.

Add it.

Link: https://lore.kernel.org/eb62763085eb42193a611bca00a62d6f0ae72e1e.1776530118.git.joe@perches.com
Signed-off-by: Joe Perches <joe@perches.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Dan Carpenter <dan.carpenter@linaro.org>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoocfs2: use kzalloc for quota recovery bitmap allocation
Tristan Madani [Sat, 18 Apr 2026 13:10:48 +0000 (13:10 +0000)] 
ocfs2: use kzalloc for quota recovery bitmap allocation

ocfs2 quota recovery allocates a bitmap buffer with kmalloc and does not
fully initialize it.  This can lead to use of uninitialized bits during
quota recovery from a corrupted filesystem image.

Use kzalloc instead to ensure the bitmap is zero-initialized.

Link: https://lore.kernel.org/20260418131048.1052507-1-tristmd@gmail.com
Reported-by: syzbot+7ea0b96c4ddb49fd1a70@syzkaller.appspotmail.com
Signed-off-by: Tristan Madani <tristan@talencesecurity.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Cc: Heming Zhao <heming.zhao@suse.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agotools/accounting/getdelays: fix -Wformat-truncation warning in format_timespec
Yiyang Chen [Thu, 23 Apr 2026 15:11:39 +0000 (23:11 +0800)] 
tools/accounting/getdelays: fix -Wformat-truncation warning in format_timespec

Reproduce with GCC 13.3.0:

  $ cd tools/accounting
  $ make

This emits:

getdelays.c: In function `format_timespec':
getdelays.c:218:67: warning: `:' directive output may be truncated writing 1 byte into a region of size between 0 and 16 [-Wformat-truncation=]
  218 |         snprintf(buffer, sizeof(buffer), "%04d-%02d-%02dT%02d:%02d:%02d",
      |
getdelays.c:218:9: note: `snprintf' output between 20 and 72 bytes into a destination of size 32

The problem is that %04d and %02d specify minimum field widths only. GCC
cannot prove that formatting tm_year + 1900 and the other struct tm
fields will always fit in the fixed 32-byte buffer, so it warns about
possible truncation.

Fix this by replacing the manual snprintf() formatting with
strftime("%Y-%m-%dT%H:%M:%S", ...). That matches the data we already have
in struct tm, keeps the intended timestamp format, and avoids the warning
when building tools/accounting with GCC.

Link: https://lore.kernel.org/87d9723e0b59d816ee2e4bd7cddd58a54c6c9f91.1776956545.git.cyyzero16@gmail.com
Signed-off-by: Yiyang Chen <cyyzero16@gmail.com>
Cc: Fan Yu <fan.yu9@zte.com.cn>
Cc: Wang Yaxin <wang.yaxin@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoproc: use strnlen() for name validation in __proc_create
Thorsten Blum [Tue, 21 Apr 2026 12:26:47 +0000 (14:26 +0200)] 
proc: use strnlen() for name validation in __proc_create

Replace strlen(fn) with strnlen(fn, NAME_MAX + 1) when validating the
final path component in __proc_create().

This preserves the existing name limit while bounding the length scan to
one byte past the maximum name length.  Handle empty names separately, and
treat names longer than NAME_MAX as too long.

Link: https://lore.kernel.org/20260421122648.56723-2-thorsten.blum@linux.dev
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Thorsten Blum <thorsten.blum@linux.dev>
Cc: wangzijie <wangzijie1@honor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agocheckpatch: add option to not force /* */ for SPDX
Petr Vorel [Tue, 21 Apr 2026 21:14:07 +0000 (23:14 +0200)] 
checkpatch: add option to not force /* */ for SPDX

Add option --spdx-cxx-comments to not force C comments (/* */) for SPDX,
but allow also C++ comments (//).

As documented in aa19a176df95d6, this is required for some old toolchains
still have older assembler tools which cannot handle C++ style comments.
This avoids forcing this for projects which vendored checkpatch.pl (e.g.
LTP or u-boot).

Link: https://lore.kernel.org/20260421211408.383972-2-pvorel@suse.cz
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Reviewed-by: Simon Glass <sjg@chromium.org>
Acked-by: Joe Perches <joe@perches.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agocheckpatch: allow passing config directory
Petr Vorel [Tue, 21 Apr 2026 21:14:06 +0000 (23:14 +0200)] 
checkpatch: allow passing config directory

checkpatch.pl searches for .checkpatch.conf in $CWD, $HOME and
$CWD/.scripts.  Allow passing a single directory via CHECKPATCH_CONFIG_DIR
environment variable (empty value is ignored).  This allows to directly
use project configuration file for projects which vendored checkpatch.pl
(e.g.  LTP or u-boot).

Although it'd be more convenient for user to have --conf-dir option
(instead of using environment variable), code would get ugly because
options from the configuration file needs to be read before processing
command line options with Getopt::Long.

While at it, document directories and environment variable in -h help
and HTML doc.

Link: https://lore.kernel.org/20260421211408.383972-1-pvorel@suse.cz
Signed-off-by: Petr Vorel <pvorel@suse.cz>
Reviewed-by: Simon Glass <sjg@chromium.org>
Acked-by: Joe Perches <joe@perches.com>
Cc: Dwaipayan Ray <dwaipayanray1@gmail.com>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoproc: rewrite next_tgid()
Alexey Dobriyan [Wed, 22 Apr 2026 19:17:45 +0000 (22:17 +0300)] 
proc: rewrite next_tgid()

* deduplicate "iter.tgid += 1" line,
  Right now it is done once inside next_tgid() itself and second time
  inside "for" loop.

* deduplicate next_tgid() call itself with different loop style:

auto it = make_tgid_iter();
while (next_tgid(&it)) {
...
}

  gcc seems to inline it twice:

   $ ./scripts/bloat-o-meter ../vmlinux-000 ../obj/vmlinux
add/remove: 0/1 grow/shrink: 1/0 up/down: 100/-245 (-145)
Function                                     old     new   delta
proc_pid_readdir                             531     631    +100
next_tgid                                    245       -    -245

* make tgid_iter.pid_ns const
  it never changes during readdir anyway

[akpm@linux-foundation.org: remove newline]
Link: https://lore.kernel.org/20260422191745.435556-2-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoproc: add tgid_iter.pid_ns member
Alexey Dobriyan [Wed, 22 Apr 2026 19:17:44 +0000 (22:17 +0300)] 
proc: add tgid_iter.pid_ns member

next_tgid() accepts pid namespace as an argument, but it never changes
during readdir (which would be unthinkable thing to do anyway).

Move it inside iterator type and hide from direct usage.

Link: https://lore.kernel.org/20260422191745.435556-1-adobriyan@gmail.com
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/mm: add kmemleak verbose dedup test
Breno Leitao [Wed, 6 May 2026 12:58:25 +0000 (05:58 -0700)] 
selftests/mm: add kmemleak verbose dedup test

Add a regression test for the per-scan verbose dedup added in the
preceding commit.  The test loads samples/kmemleak's helper module
(CONFIG_SAMPLE_KMEMLEAK=m) to generate orphan allocations, several of
which share an allocation backtrace, runs four kmemleak scans with verbose
printing enabled, then walks dmesg looking for two "unreferenced object"
reports within a single scan that share an identical backtrace - which
would mean dedup failed to collapse them.

The test is intentionally permissive on detection but strict on
regressions:

 - PASS when no duplicates are observed, regardless of whether the
   dedup summary line ("... and N more object(s) with the same
   backtrace") was actually emitted. Per-CPU chunk reuse, slab
   freelist pointers, kernel stack residue and CONFIG_DEBUG_KMEMLEAK_
   AUTO_SCAN can all keep most of the orphans "still referenced" or
   reported across many separate scans, so the dedup path may have
   nothing to fold within one scan. That is not a regression.

 - PASS reports whether dedup actually fired, so a passing run on a
   well-behaved environment is still informative.

 - FAIL when two same-backtrace reports land in a single scan (clear
   dedup regression).

 - FAIL when kmemleak's own per-scan tally counts leaks but the
   verbose path emits zero "unreferenced object" lines - that catches
   a regression in the verbose printer itself, which would otherwise
   pass the duplicate check trivially.

 - SKIP when kmemleak is absent, disabled at runtime, or the helper
   module is not built.

The dmesg parser anchors stack-frame matching to the indentation kmemleak
uses for them (4+ spaces under "kmemleak: ") so unrelated kmemleak
warnings landing between reports do not get lumped into the backtrace key
and mask a duplicate.

Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-2-2d36aafc34da@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/kmemleak: dedupe verbose scan output by allocation backtrace
Breno Leitao [Wed, 6 May 2026 12:58:24 +0000 (05:58 -0700)] 
mm/kmemleak: dedupe verbose scan output by allocation backtrace

Patch series "mm/kmemleak: dedupe verbose scan output", v3.

I am starting to run with kmemleak in verbose enabled in some "probe
points" across the my employers fleet so that suspected leaks land in
dmesg without needing a separate read of /sys/kernel/debug/kmemleak.

The downside is that workloads which leak many objects from a single
allocation site flood the console with byte-for-byte identical backtraces.
Hundreds of duplicates per scan are common, drowning out distinct leaks
and unrelated kernel messages, while adding no signal beyond the first
occurrence.

This series collapses those duplicates inside kmemleak itself.  Each
unique stackdepot trace_handle prints once per scan, followed by a short
summary line when more than one object shares it:

  kmemleak: unreferenced object 0xff110001083beb00 (size 192):
  kmemleak:   comm "modprobe", pid 974, jiffies 4294754196
  kmemleak:   ...
  kmemleak:   backtrace (crc 6f361828):
  kmemleak:     __kmalloc_cache_noprof+0x1af/0x650
  kmemleak:     ...
  kmemleak:   ... and 71 more object(s) with the same backtrace

The "N new suspected memory leaks" tally and the contents of
/sys/kernel/debug/kmemleak are unchanged - the per-object detail is still
available on demand, only the verbose (dmesg) output is collapsed.

Patch 1 is the kmemleak change.

Patch 2 adds a selftest that loads samples/kmemleak's CONFIG_SAMPLE
kmemleak-test module to generate ten leaks sharing one call site and
checks that the printed count is strictly less than the reported leak
total.  Not sure if Patch 2 is useful or not, if not, it is easier to
discard.

This patch (of 2):

In kmemleak's verbose mode, every unreferenced object found during a scan
is logged with its full header, hex dump and 16-frame backtrace.
Workloads that leak many objects from a single allocation site flood dmesg
with byte-for-byte identical backtraces, drowning out distinct leaks and
other kernel messages.

Dedupe within each scan using stackdepot's trace_handle as the key: for
every leaked object with a recorded stack trace, look up the
representative kmemleak_object in a per-scan xarray keyed by trace_handle.
The first sighting stores the object pointer (with a get_object()
reference) and sets object->dup_count to 1; later sightings just bump
dup_count on the representative.  After the scan, walk the xarray once and
emit each unique backtrace, followed by a single summary line when more
than one object shares it.

Leaks whose trace_handle is 0 (early-boot allocations tracked before
kmemleak_init() set up object_cache, or stack_depot_save() failures under
memory pressure) cannot be deduped, so they are still printed inline via
the same locked OBJECT_ALLOCATED-checked helper.  The contents of
/sys/kernel/debug/kmemleak are unchanged - only the verbose console output
is collapsed.

Safety notes:

 - The xarray store happens outside object->lock: object->lock is a
   raw spinlock, while xa_store() may grab xa_node slab locks at a
   higher wait-context level which lockdep flags as invalid.
   trace_handle is captured under object->lock (which serialises with
   kmemleak_update_trace()'s writer), so it is safe to use after
   dropping the lock.

 - get_object() pins the kmemleak_object metadata across
   rcu_read_unlock(), but the underlying tracked allocation can still
   be freed concurrently. The deferred print path therefore re-acquires
   object->lock and re-checks OBJECT_ALLOCATED via print_leak_locked()
   before touching object->pointer; __delete_object() clears that flag
   under the same lock before the user memory goes away. The same
   helper is used by the trace_handle == 0 and xa_store() failure
   fallbacks, so every printer in the new path has identical safety
   guarantees.

 - If get_object() fails after we set OBJECT_REPORTED, the object is
   already being torn down (use_count hit zero); the leak count is
   still accurate but the verbose line is dropped, which is correct
   - the memory was freed concurrently and is no longer a leak.

 - If xa_store() fails to allocate an xa_node under memory pressure,
   we fall back to printing inline via print_leak_locked() instead of
   silently dropping the leak.

 - The hex dump is skipped for coalesced entries (dup_count > 1):
   bytes would differ across objects sharing a backtrace anyway, and
   skipping it removes the only remaining read of object->pointer's
   contents in the deferred path. The representative's reported size
   may also differ from the coalesced objects' sizes; the printed
   trace_handle reflects the representative's current value rather
   than the value used as the dedup key, which is normally - but not
   strictly - identical.

Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-0-2d36aafc34da@debian.org
Link: https://lore.kernel.org/20260506-kmemleak_dedup-v3-1-2d36aafc34da@debian.org
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/swap: add cond_resched() in swap_reclaim_full_clusters to prevent softlockup
Zijiang Huang [Wed, 6 May 2026 13:09:19 +0000 (21:09 +0800)] 
mm/swap: add cond_resched() in swap_reclaim_full_clusters to prevent softlockup

We hit a real softlockup in an internal stress test environment.  The
workload was LTP memory/swap stress on a large arm64 machine, with 320
CPUs, about 1TB memory and an 8.6GB swap device.  The system was under
heavy load and the swap device had a large number of full clusters.  The
softlockup was triggered during a stress test after about 3 days.

So, add periodic cond_resched() calls during large full_clusters
reclaim operations to prevent softlockup issues.

Detailed call trace as follow:

PID: 3817773  TASK: ffff0883bb28b780  CPU: 48   COMMAND: "kworker/48:7"
   #0 [ffff800080183d10] __crash_kexec at ffffa4c1361e5de4
   #1 [ffff800080183d90] panic at ffffa4c1360d5e9c
   #2 [ffff800080183e20] watchdog_timer_fn at ffffa4c136231fa8
   ...
  #16 [ffff8000c4ad3cb0] swap_cache_del_folio at ffffa4c1363e1614
  #17 [ffff8000c4ad3ce0] __try_to_reclaim_swap at ffffa4c1363e4bfc
  #18 [ffff8000c4ad3d40] swap_reclaim_full_clusters at ffffa4c1363e5474
  #19 [ffff8000c4ad3da0] swap_reclaim_work at ffffa4c1363e550c
  #20 [ffff8000c4ad3dc0] process_one_work at ffffa4c136102edc
  #21 [ffff8000c4ad3e10] worker_thread at ffffa4c136103398
  #22 [ffff8000c4ad3e70] kthread at ffffa4c13610d95c

Link: https://lore.kernel.org/20260506130919.2298807-1-kerayhuang@tencent.com
Fixes: 5168a68eb78f ("mm, swap: avoid over reclaim of full clusters")
Signed-off-by: Zijiang Huang <kerayhuang@tencent.com>
Reviewed-by: Kairui Song <kasong@tencent.com>
Reviewed-by: Hao Peng <flyingpeng@tencent.com>
Reviewed-by: albinwyang <albinwyang@tencent.com>
Reviewed-by: Baoquan He <baoquan.he@linux.dev>
Acked-by: Chris Li <chrisl@kernel.org>
Cc: Barry Song <baohua@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kemeng Shi <shikemeng@huaweicloud.com>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: Youngjun Park <youngjun.park@lge.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agohighmem-internal.h: fix typo in the comment for kunmap_atomic()
Zhouyi Zhou [Tue, 5 May 2026 02:11:25 +0000 (02:11 +0000)] 
highmem-internal.h: fix typo in the comment for kunmap_atomic()

Replace `PREEMP_RT` with `PREEMPT_RT` in the header comment to match the
correct kernel configuration name.

Link: https://lore.kernel.org/20260505021125.1941691-1-zhouzhouyi@gmail.com
Signed-off-by: Zhouyi Zhou <zhouzhouyi@gmail.com>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/admin-guide/mm/damon/stat: document kdamond_pid parameter
SeongJae Park [Sat, 2 May 2026 02:05:04 +0000 (19:05 -0700)] 
Docs/admin-guide/mm/damon/stat: document kdamond_pid parameter

Update DAMON_STAT usage document for newly added kdamond_pid parameter.

Link: https://lore.kernel.org/20260502020505.80822-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/stat: add a parameter for reading kdamond pid
SeongJae Park [Sat, 2 May 2026 02:05:03 +0000 (19:05 -0700)] 
mm/damon/stat: add a parameter for reading kdamond pid

Patch series "mm/damon/stat: add kdamond_pid parameter".

DAMON_STAT doesn't provide the pid of its kdamond, unlike DAMON_RECLAIM
and DAMON_LRU_SORT.  This makes user-space management of DAMON_STAT
unnecessarily complicated.  Provide the information via a new parameter,
namely kdamond_pid, and document it.

This patch (of 2):

Knowing the pid of the kdamonds can help user-space management including
monitoring of DAMON's system resource consumption.  To make it easier,
DAMON_SYSFS, DAMON_RECLAIM and DAMON_LRU_SORT provide the pid information.
DAMON_STAT is not providing it, though.  Expose the pid of DAMON_STAT
kdamond via a new read-only module parameter, namely kdamond_pid.  This
also makes DAMON modules usage more standardized, because DAMON_RECLAIM
and DAMON_LRU_SORT also provide the information via their read-only
parameters of the same name.

Link: https://lore.kernel.org/20260502020505.80822-1-sj@kernel.org
Link: https://lore.kernel.org/20260502020505.80822-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/admin-guide/mm/damon/reclaim: update for autotune_monitoring_intervals
SeongJae Park [Fri, 1 May 2026 01:17:39 +0000 (18:17 -0700)] 
Docs/admin-guide/mm/damon/reclaim: update for autotune_monitoring_intervals

Update DAMON_RECLAIM usage document for the newly added monitoring
intervals auto-tuning enablement parameter.

Link: https://lore.kernel.org/20260501011740.81988-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/reclaim: add autotune_monitoring_intervals parameter
SeongJae Park [Fri, 1 May 2026 01:17:38 +0000 (18:17 -0700)] 
mm/damon/reclaim: add autotune_monitoring_intervals parameter

Patch series "mm/damon/reclaim: support monitoring intervals auto-tuning".

The monitoring intervals auto-tuning feature of DAMON has proven to be
useful in multiple environments.  Add a new DAMON_RECLAIM parameter for
supporting the feature, and update the document for the new parameter.

This patch (of 2):

DAMON's monitoring intervals auto-tuning feature has proven to be useful
in multiple environments.  DAMON_RECLAIM is still asking users to do the
manual tuning of the intervals.  Add a module parameter for utilizing the
auto-tuning feature with the suggested default setup.

Note that use of the auto-tuning overrides the manually entered monitoring
intervals.  Also, note that the 'min_age' will dynamically changed
proportional to auto-tuned intervals.  It is recommended to use 'min_age'
short enough and use 'quota_mem_pressure_us' like coldness threshold
auto-tuning features together.

Link: https://lore.kernel.org/20260501011740.81988-1-sj@kernel.org
Link: https://lore.kernel.org/20260501011740.81988-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/cgroup: include slab in test_percpu_basic memory check
Li Wang [Fri, 1 May 2026 02:20:58 +0000 (10:20 +0800)] 
selftests/cgroup: include slab in test_percpu_basic memory check

test_percpu_basic() currently compares memory.current against only
memory.stat:percpu after creating 1000 child cgroups.

Observed failure:
  #./test_kmem
  ok 1 test_kmem_basic
  ok 2 test_kmem_memcg_deletion
  ok 3 test_kmem_proc_kpagecgroup
  ok 4 test_kmem_kernel_stacks
  ok 5 test_kmem_dead_cgroups
  memory.current 11530240
  percpu 8440000
  not ok 6 test_percpu_basic

That assumption is too strict: child cgroup creation also allocates
slab-backed metadata, so memory.current is expected to be larger than
percpu alone. One visible path is:

  cgroup_mkdir()
    cgroup_create()
      cgroup_addrm_file()
        cgroup_add_file()
          __kernfs_create_file()
            __kernfs_new_node()
              kmem_cache_zalloc()

These kernfs allocations are charged as slab and show up in
memory.stat:slab.

Update the check to compare memory.current against (percpu + slab)
within MAX_VMSTAT_ERROR, and print slab/delta in the failure message to
improve diagnostics.

Link: https://lore.kernel.org/20260501022058.18024-3-li.wang@linux.dev
Signed-off-by: Li Wang <li.wang@linux.dev>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Sayali Patil <sayalip@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/cgroup: fix hardcoded page size in test_percpu_basic
Li Wang [Fri, 1 May 2026 02:20:57 +0000 (10:20 +0800)] 
selftests/cgroup: fix hardcoded page size in test_percpu_basic

Patch series "selftests/cgroup: Fix false positive failures in
test_percpu_basic", v2.

This patch series addresses two separate issues that cause false
positive failures in the test_percpu_basic test within the cgroup
kmem selftests.

The first issue stems from a hardcoded assumption about the system
page size, which breaks the test on architectures with larger page
sizes.

The second issue is an overly strict memory check that fails to
account for the slab metadata allocated during cgroup creation.

This patch (of 2):

MAX_VMSTAT_ERROR uses a hardcoded page size of 4096, which assumes 4K
pages.  This causes test_percpu_basic to fail on systems where the kernel
is configured with a larger page size, such as aarch64 systems using 16K
or 64K pages, where the maximum permissible discrepancy between
memory.current and percpu charges is proportionally larger.

Replace the hardcoded 4096 with sysconf(_SC_PAGESIZE) to correctly derive
the page size at runtime regardless of the underlying architecture or
kernel configuration.

Link: https://lore.kernel.org/20260501022058.18024-1-li.wang@linux.dev
Link: https://lore.kernel.org/20260501022058.18024-2-li.wang@linux.dev
Signed-off-by: Li Wang <li.wang@linux.dev>
Acked-by: Waiman Long <longman@redhat.com>
Reviewed-by: Sayali Patil <sayalip@linux.ibm.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits
fujunjie [Tue, 28 Apr 2026 01:59:44 +0000 (01:59 +0000)] 
mm/filemap: do not count FAULT_FLAG_TRIED retries as mmap hits

A fault that starts synchronous mmap readahead can return VM_FAULT_RETRY
after dropping mmap_lock.  The retry may then map the folio brought in by
that same miss.

Do not let this retry decrement mmap_miss.  The retry still maps the folio
from the page cache; it just does not count as a useful mmap readahead
hit.

Link: https://lore.kernel.org/tencent_22E6B8849EC1141FE7773C64467E6F1E2C09@qq.com
Signed-off-by: fujunjie <fujunjie1@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Vishal Moola <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/filemap: count only the faulting address as a mmap hit
fujunjie [Tue, 28 Apr 2026 01:59:43 +0000 (01:59 +0000)] 
mm/filemap: count only the faulting address as a mmap hit

Patch series "mm/filemap: tighten mmap_miss hit accounting", v3.

mmap_miss is increased when synchronous mmap readahead is needed, and
decreased when filemap_map_pages() maps folios that are already in the
page cache.  The decrease side can over-credit hits in two cases:

  - fault-around installs nearby PTEs even though the fault only proves
    that the faulting address was accessed;
  - after synchronous mmap readahead returns VM_FAULT_RETRY, the retry
    can find the folio brought in by the same miss and immediately
    cancel that miss.

Current evidence comes from a local KVM/data-disk microbenchmark using
mmap_miss_probe, with an 8 GiB guest, 2 vCPUs, 8192 KiB read_ahead_kb,
cold page cache before each run, 1% of the file accessed, and medians of 3
runs.

mmap_miss_probe mmap()s a prepared file with MADV_NORMAL and then touches
one byte at selected base-page offsets.  The access order is random,
sequential, or a fixed page stride.  The harness drops caches before each
run and samples /proc/vmstat around that access loop.

The 20 GiB case below is a larger-than-memory file case in an 8 GiB guest.
No separate memory hog was used.  The 4 GiB case uses the same 8 GiB
guest but keeps the file fit-in-memory.

Each case used a fresh temporary qcow2 data disk, seen by the guest as
/dev/vda, formatted as ext4 and mounted at /mnt/mmap-matrix.

Each result is "pgpgin GiB / elapsed seconds".  "pgpgin GiB" is the delta
of the guest /proc/vmstat pgpgin counter, converted from KiB to GiB; it is
used here as an approximate block input counter, not as resident memory or
exact application IO.  "Elapsed seconds" is the wall-clock runtime of the
whole mmap_miss_probe access pass, not per-access latency.

For the 20 GiB larger-than-memory case:

        workload       before                after
        random         223.377 GiB/101.293s  1.010 GiB/4.790s
        stride1021     204.214 GiB/97.557s   204.208 GiB/108.086s
        stride2053     409.584 GiB/193.700s  0.970 GiB/3.685s
        stride4099     406.452 GiB/134.241s  0.975 GiB/3.499s
        sequential       0.212 GiB/0.050s    0.212 GiB/0.057s

For the 4 GiB fit-in-memory case:

        workload       before              after
        random         3.987 GiB/1.960s    0.980 GiB/1.221s
        stride1021     4.002 GiB/1.838s    4.002 GiB/1.851s
        stride2053     3.991 GiB/1.835s    0.811 GiB/0.985s
        stride4099     4.001 GiB/1.836s    0.819 GiB/1.037s
        sequential     0.056 GiB/0.013s    0.056 GiB/0.018s

The 20 GiB setup also has an ablation.  P1 is only the faulting-address
hit accounting change.  P2-only is only the FAULT_FLAG_TRIED retry
filter.  P1+P2 is the combined accounting change:

        workload    variant   result
        random      baseline  223.377 GiB/101.293s
        random      P1        223.268 GiB/98.481s
        random      P2-only   223.257 GiB/100.091s
        random      P1+P2     1.010 GiB/4.790s
        stride2053  baseline  409.584 GiB/193.700s
        stride2053  P1        409.584 GiB/197.645s
        stride2053  P2-only   15.722 GiB/5.485s
        stride2053  P1+P2     0.970 GiB/3.685s
        sequential  baseline  0.212 GiB/0.050s
        sequential  P1        0.212 GiB/0.046s
        sequential  P2-only   0.212 GiB/0.050s
        sequential  P1+P2     0.212 GiB/0.057s

After the v2 implementation refactor, only the final P1+P2 shape was rerun
in the same setup.  The numbers stayed in line with the v1 P1+P2 rows
above:

        workload       larger-than-memory case    fit-in-memory case
                       20 GiB file, 1% access    4 GiB file, 1% access
        random           1.010 GiB/4.383s          0.980 GiB/1.088s
        stride1021     204.216 GiB/105.601s        4.001 GiB/1.783s
        stride2053       0.970 GiB/3.760s          0.810 GiB/0.908s
        stride4099       0.975 GiB/3.410s          0.818 GiB/0.870s
        sequential       0.212 GiB/0.060s          0.056 GiB/0.016s

This does not claim to solve every sparse pattern.  The stride1021 rows
are intentionally shown as a boundary: with 8192 KiB read_ahead_kb,
file->f_ra.ra_pages is 2048 base pages, and synchronous mmap read-around
uses a 2048-page window centered around the fault, roughly [index - 1024,
index + 1023].  stride1021 is 1021 * 4 KiB = 4084 KiB, so the next access
lands inside the previous read-around window.  About every other access
can be a real faulting-address page-cache hit, and the other half can each
read about 8 MiB.  For about 52k accesses in the 20 GiB/1% run, half of
them times 8 MiB is about 205 GiB, matching the observed 204 GiB.

This patch (of 2):

filemap_map_pages() reduces file->f_ra.mmap_miss when fault-around maps
folios that are already present in the page cache.  That hit accounting is
too generous because fault-around can install PTEs around the faulting
address even though the fault only proves that the faulting address was
accessed.

Move the mmap_miss update back into filemap_map_pages(), drop the
mmap_miss argument from the helper functions, and decrement mmap_miss only
when the helper return value shows that the faulting address was mapped.
Keep the existing workingset-folio behavior unchanged.

Link: https://lore.kernel.org/tencent_AA501E9A238337BD167E5C2ACF948A1AF308@qq.com
Link: https://lore.kernel.org/tencent_756F151FE66F3D80479A6F982C0AB8569F09@qq.com
Signed-off-by: fujunjie <fujunjie1@qq.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Vishal Moola <vishal.moola@gmail.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in __offline_isolated_pages()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:13 +0000 (12:02 +0000)] 
mm: use zone lock guard in __offline_isolated_pages()

Use spinlock_irqsave zone lock guard in __offline_isolated_pages() to
replace the explicit lock/unlock pattern with automatic scope-based
cleanup.

Link: https://lore.kernel.org/13149be4f8151e18eb5f1eb4f3241ab3cffb373e.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in free_pcppages_bulk()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:12 +0000 (12:02 +0000)] 
mm: use zone lock guard in free_pcppages_bulk()

Use spinlock_irqsave zone lock guard in free_pcppages_bulk() to replace
the explicit lock/unlock pattern with automatic scope-based cleanup.

Link: https://lore.kernel.org/aafc2d660057a91eb40417f8ff4645b0a8c525e2.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in put_page_back_buddy()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:11 +0000 (12:02 +0000)] 
mm: use zone lock guard in put_page_back_buddy()

Use spinlock_irqsave zone lock guard in put_page_back_buddy() to replace
the explicit lock/unlock pattern with automatic scope-based cleanup.

Link: https://lore.kernel.org/b0fceedca37139da36aa626ac72eb9840b641021.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in take_page_off_buddy()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:10 +0000 (12:02 +0000)] 
mm: use zone lock guard in take_page_off_buddy()

Use spinlock_irqsave zone lock guard in take_page_off_buddy() to replace
the explicit lock/unlock pattern with automatic scope-based cleanup.

This also allows to return directly from the loop, removing the 'ret'
variable.

Link: https://lore.kernel.org/a981721632a981f148c63e3f7df3d1116a0c3f6d.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in set_migratetype_isolate()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:09 +0000 (12:02 +0000)] 
mm: use zone lock guard in set_migratetype_isolate()

Use spinlock_irqsave scoped lock guard in set_migratetype_isolate() to
replace the explicit lock/unlock pattern with automatic scope-based
cleanup.  The scoped variant is used to keep dump_page() outside the
locked section to avoid a lockdep splat.

Link: https://lore.kernel.org/6883351ad7f74d20875fff30e0e3214a089cea97.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in unreserve_highatomic_pageblock()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:08 +0000 (12:02 +0000)] 
mm: use zone lock guard in unreserve_highatomic_pageblock()

Use spinlock_irqsave zone lock guard in unreserve_highatomic_pageblock()
to replace the explicit lock/unlock pattern with automatic scope-based
cleanup.

Link: https://lore.kernel.org/69db814cd178915cb5615334a29304678f960963.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in unset_migratetype_isolate()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:07 +0000 (12:02 +0000)] 
mm: use zone lock guard in unset_migratetype_isolate()

Use spinlock_irqsave zone lock guard in unset_migratetype_isolate() to
replace the explicit lock/unlock and goto pattern with automatic
scope-based cleanup.

Link: https://lore.kernel.org/815c0905ea77828ed32bf56ff0a6d3c6548eb3a2.1777462630.git.d@ilvokhin.com
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Zi Yan <ziy@nvidia.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: use zone lock guard in reserve_highatomic_pageblock()
Dmitry Ilvokhin [Wed, 29 Apr 2026 12:02:06 +0000 (12:02 +0000)] 
mm: use zone lock guard in reserve_highatomic_pageblock()

Patch series "mm: use spinlock guards for zone lock", v3.

This series uses spinlock guard for zone lock across several mm functions
to replace explicit lock/unlock patterns with automatic scope-based
cleanup.

This simplifies the control flow by removing 'flags' variables, goto
labels, and redundant unlock calls.

Patches are ordered by decreasing value.  The first six patches simplify
the control flow by removing gotos, multiple unlock paths, or 'ret'
variables.  The last two are simpler lock/unlock pair conversions that
only remove 'flags' and can be dropped if considered unnecessary churn.

Binary size increase is +39 bytes, with Peter Zijlstra's fix for guards
[1] applied.  This is due to the compiler not being able to deduplicate
epilogue and eliminate redundant NULL check.  See discussion [2] for more
details.  I proposed a patch [3] that fixes this, but until it is merged
we need to assume +39 bytes will stay (though it is compiler dependent).

This patch (of 8):

Use the spinlock_irqsave zone lock guard in reserve_highatomic_pageblock()
to replace the explicit lock/unlock and goto out_unlock pattern with
automatic scope-based cleanup.

Link: https://lore.kernel.org/cover.1777462630.git.d@ilvokhin.com
Link: https://lore.kernel.org/3657e1144e2ffc1ca0eb57d57d89bfec4073d8c6.1777462630.git.d@ilvokhin.com
Link: https://lore.kernel.org/all/20260309164516.GE606826@noisy.programming.kicks-ass.net/
Link: https://lore.kernel.org/all/afC5C6fylF4AsITV@shell.ilvokhin.com/
Link: https://lore.kernel.org/all/20260427165037.205337-1-d@ilvokhin.com/
Signed-off-by: Dmitry Ilvokhin <d@ilvokhin.com>
Suggested-by: Steven Rostedt <rostedt@goodmis.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/ABI/damon: mark schemes/<S>/filters/ deprecated
SeongJae Park [Wed, 29 Apr 2026 15:03:06 +0000 (08:03 -0700)] 
Docs/ABI/damon: mark schemes/<S>/filters/ deprecated

Now the 'filters/' directory is deprecated.  Update ABI document to also
announce the fact.  Also update the descriptions of the files to be based
on 'core_filter/' directory, to make the old descriptions ready to be
removed when the time arrives.

Link: https://lore.kernel.org/20260429150309.82282-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/admin-guide/mm/damon/usage: mark scheme filters sysfs dir as deprecated
SeongJae Park [Wed, 29 Apr 2026 15:03:05 +0000 (08:03 -0700)] 
Docs/admin-guide/mm/damon/usage: mark scheme filters sysfs dir as deprecated

Patch series "mm/damon/sysfs: document filters/ directory as deprecated".

Commit ab71d2d30121 ("mm/damon/sysfs-schemes: let
damon_sysfs_scheme_set_filters() be used for different named directories")
introduced alternatives of 'filters' directory, namely core_filters/ and
'ops_filters/ directories.  Now the alternatives are well stabilized and
ready for all users.  All filters/ directory use cases are expected to be
able to be migrated to the alternatives.  An LTS kernel having the
alternatives, namely 6.18.y, is also released.  Existence of filters/
directory is only confusing.

It would be better not immediately removing the directory, though.  There
could be users that need time before migrating to the alternatives.  There
might be unexpected use cases that the alternatives cannot support.  Doing
the deprecation step by step across multiple years like DAMON debugfs
deprecation would be safer.  Start the deprecation changes by announcing
the deprecation on the documents.

Every year, one more action for completely removing the directory will be
followed, like DAMON debugfs deprecation did.  Following yearly actions
are currently expected.  In 2027, deprecation warning kernel messages will
be printed once, for use of filters/ directory.  In 2028, filters/
directory will be renamed to filters_DEPRECATED/.  In 2029,
filters_DEPRECATED/ directory will be removed.

This patch (of 2):

The alternatives of 'filters/' directory, namely 'core_filters/' and
'ops_filters/', can fully support all the features 'filters/' directory
can do, and provide better user experience.  Having 'filters/' directory
is only confusing to users.  Announce it as deprecated on the usage
document.

Link: https://lore.kernel.org/20260429150309.82282-1-sj@kernel.org
Link: https://lore.kernel.org/20260429150309.82282-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/khugepaged: return -EAGAIN for SCAN_PAGE_HAS_PRIVATE in MADV_COLLAPSE
Vineet Agarwal [Wed, 29 Apr 2026 14:04:34 +0000 (19:34 +0530)] 
mm/khugepaged: return -EAGAIN for SCAN_PAGE_HAS_PRIVATE in MADV_COLLAPSE

MADV_COLLAPSE uses errno values to provide actionable feedback to
userspace.  Temporary resource constraints are mapped to -EAGAIN so the
caller may retry, while intrinsic failures of the specified range are
mapped to -EINVAL.

collapse_file() returns SCAN_PAGE_HAS_PRIVATE when filemap_release_folio()
fails while isolating file-backed folios for collapse.  This currently
falls through the default case in madvise_collapse_errno() and is reported
to userspace as -EINVAL.

However, filemap_release_folio() failure commonly reflects temporary folio
state rather than a permanently uncollapsible range.

For example, ext4 returns false when a folio still has dirty journalled
data, btrfs returns false for dirty or writeback folios before extent
state release, and NFS may return false while reclaiming
filesystem-private folio state.

In such cases, retrying MADV_COLLAPSE after writeback, reclaim or journal
progress may succeed.  This matches the existing -EAGAIN handling for
SCAN_PAGE_DIRTY_OR_WRITEBACK and other transient collapse failures more
closely than -EINVAL.

Therefore, map SCAN_PAGE_HAS_PRIVATE to -EAGAIN so userspace receives
retryable feedback for this temporary failure path.

Link: https://lore.kernel.org/20260429140434.439456-1-agarwal.vineet2006@gmail.com
Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com>
Reviewed-by: Dev Jain <dev.jain@arm.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/mm: khugepaged: initialize file contents via mmap
Vineet Agarwal [Wed, 29 Apr 2026 11:58:16 +0000 (17:28 +0530)] 
selftests/mm: khugepaged: initialize file contents via mmap

file_setup_area() currently allocates anonymous memory, fills it, and
writes it into the backing file used for collapse testing.

Instead of copying data through write(), resize the file with ftruncate(),
map it directly with MAP_SHARED, and initialize the mapped area in place.

This simplifies the setup path and avoids the need for explicit partial
write handling.

Link: https://lore.kernel.org/20260429115816.98824-1-agarwal.vineet2006@gmail.com
Signed-off-by: Vineet Agarwal <agarwal.vineet2006@gmail.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Tested-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Barry Song <baohua@kernel.org>
Cc: Dev Jain <dev.jain@arm.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Nico Pache <npache@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/admin-guide/mm/damon/lru_sort: update for entire memory monitoring
SeongJae Park [Wed, 29 Apr 2026 04:12:29 +0000 (21:12 -0700)] 
Docs/admin-guide/mm/damon/lru_sort: update for entire memory monitoring

Update DAMON_LRU_SORT usage document for the changed default monitoring
target region selection.

Link: https://lore.kernel.org/20260429041232.90257-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/admin-guide/mm/damon/reclaim: update for entire memory monitoring
SeongJae Park [Wed, 29 Apr 2026 04:12:28 +0000 (21:12 -0700)] 
Docs/admin-guide/mm/damon/reclaim: update for entire memory monitoring

Update DAMON_RECLAIM usage document for the changed default monitoring
target region selection.

Link: https://lore.kernel.org/20260429041232.90257-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/stat: use damon_set_region_system_rams_default()
SeongJae Park [Wed, 29 Apr 2026 04:12:27 +0000 (21:12 -0700)] 
mm/damon/stat: use damon_set_region_system_rams_default()

damon_stat_set_moniotirng_region() is nearly a duplicate of the core
function, damon_set_region_system_rams_default().  Use the core
implementation.

Link: https://lore.kernel.org/20260429041232.90257-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/core: remove damon_set_region_biggest_system_ram_default()
SeongJae Park [Wed, 29 Apr 2026 04:12:26 +0000 (21:12 -0700)] 
mm/damon/core: remove damon_set_region_biggest_system_ram_default()

Now nobody is using damon_set_region_biggest_system_ram_default().  Remove
it.

Link: https://lore.kernel.org/20260429041232.90257-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/lru_sort: cover all system rams
SeongJae Park [Wed, 29 Apr 2026 04:12:25 +0000 (21:12 -0700)] 
mm/damon/lru_sort: cover all system rams

DAMON_LRU_SORT allows users to set the physical address range to monitor
and do the work on.  When users don't explicitly set the range, the
biggest system ram resource of the system is selected as the monitoring
target address range.  The intention was to reduce the overhead from
monitoring non-System RAM areas because monitoring non-System RAM may be
meaningless.  However, because of the sampling based access check and
adaptive regions adjustment, the overhead should be negligible.  It makes
more sense to just cover all system rams of the system.  Do so.

Link: https://lore.kernel.org/20260429041232.90257-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/reclaim: cover all system rams
SeongJae Park [Wed, 29 Apr 2026 04:12:24 +0000 (21:12 -0700)] 
mm/damon/reclaim: cover all system rams

DAMON_RECLAIM allows users to set the physical address range to monitor
and do the work on.  When users don't explicitly set the range, the
biggest System RAM resource of the system is selected as the monitoring
target address range.  The intention was to reduce the overhead from
monitoring non-System RAM areas because monitoring of non-System RAM may
be meaningless.  However, because of the sampling based access check and
adaptive regions adjustment, the overhead should be negligible.  It makes
more sense to just cover all system rams of the system.  Do so.

Link: https://lore.kernel.org/20260429041232.90257-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon: introduce damon_set_region_system_rams_default()
SeongJae Park [Wed, 29 Apr 2026 04:12:23 +0000 (21:12 -0700)] 
mm/damon: introduce damon_set_region_system_rams_default()

Patch series "mm/damon/reclaim,lru_sort: monitor all system rams by
default".

DAMON_RECLAIM and DAMON_LRU_SORT set the biggest 'System RAM' resource of
the system as the default monitoring target address range.  The main
intention behind the design is to minimize the overhead coming from
monitoring of non-System RAM areas.

This could result in an odd setup when there are multiple discrete System
RAMs of considerable sizes.  For example, there are System RAMs each
having 500 GiB size.  In this case, only the first 500 GiB will be set as
the monitoring region by default.  This is particularly common on NUMA
systems.  Hence the modules allow users to set the monitoring target
address range using the module parameters if the default setup doesn't
work for them.  In other words, the current design trades ease of setup
for lower overhead.

However, because DAMON utilizes the sampling based access check and the
adaptive regions adjustment mechanisms, the overhead from the monitoring
of non-System RAM areas should be negligible in most setups.  Meanwhile,
the setup complexity is causing real headaches for users who need to run
those modules on various types of systems.  That is, the current tradeoff
is not a good deal.

Set the physical address range that can cover all System RAM areas of the
system as the default monitoring regions for DAMON_RECLAIM and
DAMON_LRU_SORT.

Technically speaking, this is changing documented behavior.  However, it
makes no sense to believe there is a real use case that really depends on
the old weird default behavior.  If the old default behavior was working
for them in the reasonable way, this change will only add a negligible
amount of monitoring overhead.  If it didn't work, the users may already
be using manual monitoring regions setup, and they will not be affected by
this change.

Patches Sequence
================

Patch 1 introduces a new core function that will be used for the new
default monitoring target region setup.  Patch 2 and 3 update
DAMON_RECLAIM and DAMON_LRU_SORT to use the new function instead of the
old one, respectively.  Patch 4 removes the old core function that was
replaced by the new one, as there is no more user of it.  Patch 5 updates
DAMON_STAT to use the new one instead of its in-house nearly-duplicate
self implementation of the functionality.  Finally patches 6 and 7 update
the DAMON_RECLAIM and DAMON_LRU_SORT user documentation for the new
behaviors, respectively.

This patch (of 7):

damon_set_region_biggest_system_ram_default() sets the monitoring target
region as the caller requested.  If the caller didn't specify the region,
it finds the biggest System RAM of the system and sets it as the target
region.  When there are more than one considerable size of System RAM
resources in the system, the default target setup makes no sense.
Introduce a variant, namely damon_set_region_system_rams_default().  It
sets a physical address range that covers all System RAM resources as the
default target region.

Link: https://lore.kernel.org/20260429041232.90257-1-sj@kernel.org
Link: https://lore.kernel.org/20260429041232.90257-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: skip KASAN tagging for page-allocated page tables
Muhammad Usama Anjum [Wed, 29 Apr 2026 10:27:04 +0000 (15:57 +0530)] 
mm: skip KASAN tagging for page-allocated page tables

Page tables are always accessed via the linear mapping with a match-all
tag, so HW-tag KASAN never checks them.  For page-allocated tables (PTEs
and PGDs etc), avoid the tag setup and poisoning overhead by using
__GFP_SKIP_KASAN.  SLUB-backed page tables are unchanged for now.  (They
aren't widely used and require more SLUB related skip logic.  Leave it
later.)

Link: https://lore.kernel.org/20260429102704.680174-4-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Segall <bsegall@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agokasan: skip HW tagging for all kernel thread stacks
Muhammad Usama Anjum [Wed, 29 Apr 2026 10:27:03 +0000 (15:57 +0530)] 
kasan: skip HW tagging for all kernel thread stacks

HW-tag KASAN never checks kernel stacks because stack pointers carry the
match-all tag, so setting/poisoning tags is pure overhead.

- Add __GFP_SKIP_KASAN to THREADINFO_GFP so every stack allocator that
  uses it skips tagging (fork path plus arch users)
- Add __GFP_SKIP_KASAN to GFP_VMAP_STACK for the fork-specific vmap
  stacks.
- When reusing cached vmap stacks, skip kasan_unpoison_range() if HW tags
  are enabled.

Software KASAN is unchanged; this only affects tag-based KASAN.

Link: https://lore.kernel.org/20260429102704.680174-3-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand (Arm) <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agovmalloc: add __GFP_SKIP_KASAN support
Muhammad Usama Anjum [Wed, 29 Apr 2026 10:27:02 +0000 (15:57 +0530)] 
vmalloc: add __GFP_SKIP_KASAN support

Patch series "kasan: hw_tags: Disable tagging for stack and page-tables",
v4.

Stacks and page tables are always accessed with the match-all tag, so
assigning a new random tag every time at allocation and setting invalid
tag at deallocation time, just adds overhead without improving the
detection.

With __GFP_SKIP_KASAN the page keeps its poison tag and KASAN_TAG_KERNEL
(match-all tag) is stored in the page flags while keeping the poison tag
in the hardware.  The benefit of it is that 256 tag setting instruction
per 4 kB page aren't needed at allocation and deallocation time.

Thus match-all pointers still work, while non-match tags (other than
poison tag) still fault.

__GFP_SKIP_KASAN only skips for KASAN_HW_TAGS mode, so coverage is
unchanged.

Benchmark:
The benchmark has two modes. In thread mode, the child process forks
and creates N threads. In pgtable mode, the parent maps and faults a
specified memory size and then forks repeatedly with children exiting
immediately.

Thread benchmark:
2000 iterations, 2000 threads: 2.575 s â†’ 2.229 s (~13.4% faster)

The pgtable samples:
- 2048 MB, 2000 iters 19.08 s â†’ 17.62 s (~7.6% faster)

This patch (of 3):

For allocations that will be accessed only with match-all pointers (e.g.,
kernel stacks), setting tags is wasted work.  If the caller already set
__GFP_SKIP_KASAN, skip tag setting of vmalloc pages.

Before this patch, __GFP_SKIP_KASAN wasn't being used with vmalloc APIs.
So it wasn't being checked.  Now its being checked and acted upon.  Other
KASAN modes are unchanged because __GFP_SKIP_KASAN is ignored for them in
the page allocator, and in vmalloc too we ignore this flag for them.

This is a preparatory patch for optimizing kernel stack allocations.

Link: https://lore.kernel.org/20260429102704.680174-1-dev.jain@arm.com
Link: https://lore.kernel.org/20260429102704.680174-2-dev.jain@arm.com
Signed-off-by: Muhammad Usama Anjum <usama.anjum@arm.com>
Co-developed-by: Ryan Roberts <ryan.roberts@arm.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
Co-developed-by: Dev Jain <dev.jain@arm.com>
Signed-off-by: Dev Jain <dev.jain@arm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Ben Segall <bsegall@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Kees Cook <kees@kernel.org>
Cc: K Prateek Nayak <kprateek.nayak@amd.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com>
Cc: Valentin Schneider <vschneid@redhat.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/memcontrol: hoist pstatc_pcpu assignment out of CPU loop
Hui Zhu [Wed, 29 Apr 2026 08:42:16 +0000 (16:42 +0800)] 
mm/memcontrol: hoist pstatc_pcpu assignment out of CPU loop

In mem_cgroup_alloc(), the assignment of pstatc_pcpu is invariant with
respect to the for_each_possible_cpu() loop: both the 'parent' pointer and
'parent->vmstats_percpu' remain constant throughout all iterations.

The original code redundantly re-evaluated the 'if (parent)' condition and
reassigned pstatc_pcpu on every CPU iteration, then repeated the same
ternary check 'parent ?  pstatc_pcpu : NULL' when storing into
statc->parent_pcpu.

Move the single conditional assignment of pstatc_pcpu to before the loop,
resolving both the loop-invariant placement issue and the duplicated null
check.  On systems with a large number of possible CPUs, this eliminates
repeated branch evaluation with no functional change.

No functional change intended.

Link: https://lore.kernel.org/20260429084216.186238-1-hui.zhu@linux.dev
Signed-off-by: Hui Zhu <zhuhui@kylinos.cn>
Reviewed-by: SeongJae Park <sj@kernel.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/migrate: rename PAGE_ migration flags to FOLIO_
Shivank Garg [Tue, 24 Mar 2026 19:07:09 +0000 (19:07 +0000)] 
mm/migrate: rename PAGE_ migration flags to FOLIO_

These flags only track folio-specific state during migration and are not
used for movable_ops pages.  Rename the enum values and the old_page_state
variable to match.

No functional change.

Link: https://lore.kernel.org/20260324190706.964555-4-shivankg@amd.com
Signed-off-by: Shivank Garg <shivankg@amd.com>
Suggested-by: David Hildenbrand <david@kernel.org>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Lance Yang <lance.yang@linux.dev>
Reviewed-by: Huang Ying <ying.huang@linux.alibaba.com>
Cc: Alistair Popple <apopple@nvidia.com>
Cc: Byungchul Park <byungchul@sk.com>
Cc: Gregory Price <gourry@gourry.net>
Cc: Joshua Hahn <joshua.hahnjy@gmail.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Rakie Kim <rakie.kim@sk.com>
Cc: Shivank Garg <shivankg@amd.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/damon/sysfs.py: pause DAMON before dumping status
SeongJae Park [Mon, 27 Apr 2026 15:12:29 +0000 (08:12 -0700)] 
selftests/damon/sysfs.py: pause DAMON before dumping status

The sysfs.py test commits DAMON parameters, dump the internal DAMON state,
and show if the parameters are committed as expected using the dumped
state.  While the dumping is ongoing, DAMON is alive.  It can make
internal changes including addition and removal of regions.  It can
therefore make a race that can result in false test results.  Pause DAMON
execution during the state dumping to avoid such races.

Link: https://lore.kernel.org/20260427151231.113429-11-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/damon/sysfs.py: check pause on assert_ctx_committed()
SeongJae Park [Mon, 27 Apr 2026 15:12:28 +0000 (08:12 -0700)] 
selftests/damon/sysfs.py: check pause on assert_ctx_committed()

Extend sysfs.py tests to confirm damon_ctx->pause can be set using the
pause sysfs file.

Link: https://lore.kernel.org/20260427151231.113429-10-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/damon/drgn_dump_damon_status: dump pause
SeongJae Park [Mon, 27 Apr 2026 15:12:27 +0000 (08:12 -0700)] 
selftests/damon/drgn_dump_damon_status: dump pause

drgn_dump_damon_status is not dumping the damon_ctx->pause parameter
value, so it cannot be tested.  Dump it for future tests.

Link: https://lore.kernel.org/20260427151231.113429-9-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoselftests/damon/_damon_sysfs: support pause file staging
SeongJae Park [Mon, 27 Apr 2026 15:12:26 +0000 (08:12 -0700)] 
selftests/damon/_damon_sysfs: support pause file staging

DAMON test-purpose sysfs interface control Python module, _damon_sysfs, is
not supporting the newly added pause file.  Add the support of the file,
for future test and use of the feature.

Link: https://lore.kernel.org/20260427151231.113429-8-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/tests/core-kunit: test pause commitment
SeongJae Park [Mon, 27 Apr 2026 15:12:25 +0000 (08:12 -0700)] 
mm/damon/tests/core-kunit: test pause commitment

Add a kunit test for commitment of damon_ctx->pause parameter that can be
done using damon_commit_ctx().

Link: https://lore.kernel.org/20260427151231.113429-7-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/ABI/damon: update for pause sysfs file
SeongJae Park [Mon, 27 Apr 2026 15:12:24 +0000 (08:12 -0700)] 
Docs/ABI/damon: update for pause sysfs file

Update DAMON ABI document for the DAMON context execution pause/resume
feature.

Link: https://lore.kernel.org/20260427151231.113429-6-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/admin-guide/mm/damon/usage: update for pause file
SeongJae Park [Mon, 27 Apr 2026 15:12:23 +0000 (08:12 -0700)] 
Docs/admin-guide/mm/damon/usage: update for pause file

Update DAMON usage document for the DAMON context execution pause/resume
feature.

Link: https://lore.kernel.org/20260427151231.113429-5-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agoDocs/mm/damon/design: update for context pause/resume feature
SeongJae Park [Mon, 27 Apr 2026 15:12:22 +0000 (08:12 -0700)] 
Docs/mm/damon/design: update for context pause/resume feature

Update DAMON design document for the context execution pause/resume
feature.

Link: https://lore.kernel.org/20260427151231.113429-4-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/sysfs: add pause file under context dir
SeongJae Park [Mon, 27 Apr 2026 15:12:21 +0000 (08:12 -0700)] 
mm/damon/sysfs: add pause file under context dir

Add pause DAMON sysfs file under the context directory.  It exposes the
damon_ctx->pause API parameter to the users so that they can use the
pause/resume feature.

Link: https://lore.kernel.org/20260427151231.113429-3-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/damon/core: introduce damon_ctx->paused
SeongJae Park [Mon, 27 Apr 2026 15:12:20 +0000 (08:12 -0700)] 
mm/damon/core: introduce damon_ctx->paused

Patch series "mm/damon: let DAMON be paused and resumed", v2.

DAMON utilizes a few mechanisms that enhance itself over time.  Adaptive
regions adjustment, goal-based DAMOS quota auto-tuning and monitoring
intervals auto-tuning like self-training mechanisms are such examples.  It
also adds access frequency stability information (age) to the monitoring
results, which makes it enhanced over time.

Sometimes users have to stop DAMON.  In this case, DAMON internal state
that enhanced over the time of the last execution simply goes away.
Restarted DAMON have to train itself and enhance its output from the
scratch.  This makes DAMON less useful in such cases.  Introducing three
such use cases below.

Investigation of DAMON.  It is best to do the investigation online,
especially when it is a production environment.  DAMON therefore provides
features for such online investigations, including DAMOS stats, monitoring
result snapshot exposure, and multiple tracepoints.  When those are
insufficient, and there are additional clues that could be interfered by
DAMON, users have to temporarily stop DAMON to collect the additional
clues.  It is not very useful since many of DAMON internal clues are gone
when DAMON is stopped.  The loss of the monitoring results that improved
over time is also problematic, especially in production environments.

Monitoring of workloads that have different user-known phases.  For
example, in Android, applications are known to have very different access
patterns and behaviors when they are running on the foreground and the
background.  It can therefore be useful to separate monitoring of apps
based on whether they are running on the foreground and on the background.
Having two DAMON threads per application that paused and resumed for the
apps foreground/background switches can be useful for the purpose.  But
such pause/resume of the execution is not supported.

Tests of DAMON.  A few DAMON selftests are using drgn to dump the internal
DAMON status.  The tests show if the dumped status is the same as what the
test code expected.  Because DAMON keeps running and modifying its
internal status, there are chances of data races that can cause false test
results.  Stopping DAMON can avoid the race.  But, since the internal
state of DAMON is dropped, the test coverage will be limited.

Let DAMON execution be paused and resumed without loss of the internal
state, to overhaul the limitations.  For this, introduce a new DAMON
context parameter, namely 'pause'.  API callers can update it while the
context is running, using the online parameters update functions
(damon_commit_ctx() and damon_call()).  Once it is set, kdamond_fn() main
loop will do only limited works excluding the monitoring and DAMOS works,
while sleeping sampling intervals per the work.  The limited works include
handling of the online parameters update.  Hence users can unset the
'pause' parameter again.  Once it is unset, kdamond_fn() main loop will do
all the work again (resumed).  Under the paused state, it also does stop
condition checks and handling of it, so that paused DAMON can also be
stopped if needed.  Expose the feature to the user space via DAMON sysfs
interface.  Also, update existing drgn-based tests to test and use the
feature.

Tests
=====

I confirmed the feature functionality using real time tracing ('perf
trace' or 'trace-cmd stream') of damon:damon_aggregated DAMON tracepoint.
By pausing and resuming the DAMON execution, I was able to see the trace
stops and continued as expected.  Note that the pause feature support is
added to DAMON user-space tool (damo) after v3.1.9.  Users can use
'--pause_ctx' command line option of damo for that, and I actually used it
for my test.  The extended drgn-based selftests are also testing a part of
the functionality.

Patches Sequence
================

Patch 1 introduces the new core API for the pause feature.  Patch 2 extend
DAMON sysfs interface for the new parameter.  Patches 3-5 update design,
usage and ABI documents for the new sysfs file, respectively.  The
following five patches are for tests.  Patch 6 implements a new kunit test
for the pause parameter online commitment.  Patches 7 and 8 extend DAMON
selftest helpers to support the new feature.  Patch 9 extends selftest to
test the commitment of the feature.  Finally, patch 10 updates existing
selftest to be safe from the race condition using the pause/resume
feature.

This patch (of 10):

DAMON supports only start and stop of the execution.  When it is stopped,
its internal data that it self-trained goes away.  It will be useful if
the execution can be paused and resumed with the previous self-trained
data.

Introduce per-context API parameter, 'paused', for the purpose.  The
parameter can be set and unset while DAMON is running and paused, using
the online parameters commit helper functions (damon_commit_ctx() and
damon_call()).  Once 'paused' is set, the kdamond_fn() main loop does only
limited works with sampling interval sleep during the works.  The limited
works include the handling of the online parameters update, so that users
can unset the 'pause' and resume the execution when they want.  It also
keep checking DAMON stop conditions and handling of it, so that DAMON can
be stopped while paused if needed.

Link: https://lore.kernel.org/20260427151231.113429-1-sj@kernel.org
Link: https://lore.kernel.org/20260427151231.113429-2-sj@kernel.org
Signed-off-by: SeongJae Park <sj@kernel.org>
Cc: Brendan Higgins <brendan.higgins@linux.dev>
Cc: David Hildenbrand <david@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: limit filemap_fault readahead to VMA boundaries
Frederick Mayle [Mon, 27 Apr 2026 03:01:47 +0000 (20:01 -0700)] 
mm: limit filemap_fault readahead to VMA boundaries

When a file mapping covers a strict subset of a file, an access to the
mapping can trigger readahead of file pages outside the mapped region.
Readahead is meant to prefetch pages likely to be accessed soon, but these
pages aren't accessible via the same means, so it fair to say we don't
have a good indicator they'll be accessed soon.  Take an ELF file for
example: an access to the end of a program's read-only segment isn't a
sign that nearby file contents will be accessed next (they are likely to
be mapped discontiguously, or not at all).  The pressure from loading
these pages into the cache can evict more useful pages.

To improve the behavior, make three changes:

* Introduce a new readahead_control field, max_index, as a hard limit on
  the readahead. The existing file_ra_state->size can't be used as a
  limit, it is more of a hint and can be increased by various
  heuristics.
* Set readahead_control->max_index to the end of the VMA in all of the
  readahead paths that can be triggered from a fault on a file mapping
  (both "sync" and "async" readahead).
* Limit the read-around range start to the VMA's start.

Note that these changes only affect readahead triggered in the context of
a fault, they do not affect readahead triggered by read syscalls.  If a
user mixes the two types of accesses, the behavior is expected to be the
following: if a fault causes readahead and places a PG_readahead marker
and then a read(2) syscall hits the PG_readahead marker, the resulting
async readahead *will not* be limited to the VMA end.  Conversely, if a
read(2) syscall places a PG_readahead marker and then a fault hits the
marker, the async readahead *will* be limited to the VMA end.

There is an edge case that the above motivation glosses over: A single
file mapping might be backed by multiple VMAs.  For example, a whole file
could be mapped RW, then part of the mapping made RO using mprotect.  This
patch would hurt performance of a sequential faulted read of such a
mapping, the degree depending on how fragmented the VMAs are.  A usage
pattern like that is likely rare and already suffering from sub-optimal
performance because, e.g., the fragmented VMAs limit the fault-around, so
each VMA boundary in a sequential faulted read would cause a minor fault.
Still, this patch would make it worse.  See a previous discussion of this
topic at [1].

Tested by mapping and reading a small subset of a large file, then using
the cachestat syscall to verify the number of cached pages didn't exceed
the mapping size.

In practical scenarios, the effect depends on the specific file and usage.
Sometimes there is no effect at all, but, for some ELF files in Android,
we see ~20% fewer pages pulled into the cache.

A comprehensive performance evaluation hasn't been done, but, in addition
to the anecdontal memory savings mentioned above, a benchmark was run with
fio 3.38, showing neutral looking results:

    /data/local/tmp/fio --version

    fio --name=mmap_test --ioengine=mmap --rw=read --bs=4k \
        --offset=1G --size=1G --filesize=3G --numjobs=1 \
        --filename=testfile.bin

        Before: 4366.6 MiB/s (avg of 3459, 4592, 4613, 4697, 4472)
        After:  4444.0 MiB/s (avg of 4633, 4655, 4511, 4571, 3850)
                +1.7%

    Same, with --ioengine=mmap --rw=randread

        Before: 445.6 MiB/s  (avg of 446, 447, 442, 452, 441)
        After:  447.0 MiB/s  (avg of 447, 446, 446, 451, 445)
                +0.3%

    Same, with --ioengine=psync --rw=read

        Before: 3086.6 MiB/s (avg of 3122, 3094, 3066, 3094, 3057)
        After:  3084.6 MiB/s (avg of 3039, 3103, 3103, 3084, 3094)
                -0.06%

    Same, with --ioengine=psync --rw=randread

        Before: 2226.4 MiB/s (avg of 2256, 2183, 2207, 2265, 2221)
        After:  2231.4 MiB/s (avg of 2236, 2241, 2236, 2193, 2251)
                +0.2%

Link: https://lore.kernel.org/20260427030148.653228-1-fmayle@google.com
Link: https://lore.kernel.org/all/ivnv2crd3et76p2nx7oszuqhzzah756oecn5yuykzqfkqzoygw@yvnlkhjjssoz/
Signed-off-by: Frederick Mayle <fmayle@google.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Cc: David Hildenbrand <david@kernel.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm/madvise: reject invalid process_madvise() advice for zero-length vectors
fujunjie [Mon, 4 May 2026 10:39:57 +0000 (10:39 +0000)] 
mm/madvise: reject invalid process_madvise() advice for zero-length vectors

process_madvise() used to validate the advice while walking each imported
iovec.  If the vector has zero total length, vector_madvise() does not
enter the loop and can return success without checking whether the advice
value is valid.

For a local mm, such as process_madvise(PIDFD_SELF, ...), the remote-only
process_madvise_remote_valid() check is skipped.  As a result, an invalid
advice can be reported as success when the vector has zero total length.
This differs from madvise(), which rejects an invalid advice before
returning success for a zero-length range.

Validate the generic madvise behavior at the syscall-facing entry points
before any vector walk.  In process_madvise(), do this before the
remote-only advice restriction so unsupported advice is rejected with the
same priority for local and remote mm.

Use an errno-returning helper for address/length validation, and handle
zero-length ranges explicitly at the call sites.  Requests with valid
advice and zero total length remain a noop and continue to return 0.  Add
a selftest that covers invalid advice with a zero-length iovec and an
empty vector, while also checking that a request with valid advice and
zero length still succeeds.

Link: https://lore.kernel.org/tencent_C3AEB0E769C5F4F9370F9411B69B7F8B2907@qq.com
Fixes: 021781b01275 ("mm/madvise: unrestrict process_madvise() for current process")
Signed-off-by: fujunjie <fujunjie1@qq.com>
Acked-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: SeongJae Park <sj@kernel.org>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Liam Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Vlastimil Babka <vbabka@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agomm: remove page_mapped()
David Hildenbrand (Arm) [Mon, 27 Apr 2026 11:43:16 +0000 (13:43 +0200)] 
mm: remove page_mapped()

Let's replace the last user of page_mapped() by folio_mapped() so we can
get rid of page_mapped().

Replace the remaining occurrences of page_mapped() in rmap documentation
by folio_mapped().

Link: https://lore.kernel.org/20260427-page_mapped-v1-3-e89c3592c74c@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Harry Yoo <harry@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agobpf: arena: use page_ref_count() instead of page_mapped() in arena_free_pages()
David Hildenbrand (Arm) [Mon, 27 Apr 2026 11:43:15 +0000 (13:43 +0200)] 
bpf: arena: use page_ref_count() instead of page_mapped() in arena_free_pages()

Pages that BPF arena code maps are allocated through
bpf_map_alloc_pages(), which does not allocate folios but pages.

In the future, pages will not have a mapcount, only folios will.
Converting the code to use folios and rely on folio_mapped() sounds like
the wrong approach.

Should BPF arena code allocate folios and use folio_mapped() here?  But
likely we would not want to use folios here longterm, as we don't really
need folio information.

Hard to tell.  But in the meantime, we can simply use the page refcount
instead, as a heuristic whether the page might be mapped to user space and
we would want to try zapping it, so we can get rid of page_mapped().

Page allocation will give us a page with a refcount of 1.  Any user space
mapping adds a page reference.  While there can be references from other
subsystems (e.g., GUP), in the common case for this test here relying on
the page count is good enough.

Link: https://lore.kernel.org/20260427-page_mapped-v1-2-e89c3592c74c@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Harry Yoo <harry@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
4 weeks agosh: use folio_mapped() instead of page_mapped() in sh4_flush_cache_page()
David Hildenbrand (Arm) [Mon, 27 Apr 2026 11:43:14 +0000 (13:43 +0200)] 
sh: use folio_mapped() instead of page_mapped() in sh4_flush_cache_page()

Patch series "mm: remove page_mapped()".

While preparing my slides for an LSF/MM talk, I realized that I did not
yet remove page_mapped().

So let's do that.  In the BPF arena code it's unclear which memdesc we
would want to allocate in the future: certainly something with a refcount,
but likely none with a mapcount.  So let's just rely on the page refcount
instead to decide whether we want to try zapping the page from user page
tables.

This patch (of 3):

We already have the folio in our hands, so let's just use folio_mapped().

Link: https://lore.kernel.org/20260427-page_mapped-v1-0-e89c3592c74c@kernel.org
Link: https://lore.kernel.org/20260427-page_mapped-v1-1-e89c3592c74c@kernel.org
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Cc: Eduard Zingerman <eddyz87@gmail.com>
Cc: Harry Yoo <harry@kernel.org>
Cc: Jann Horn <jannh@google.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Kumar Kartikeya Dwivedi <memxor@gmail.com>
Cc: Liam R. Howlett <liam@infradead.org>
Cc: Lorenzo Stoakes <ljs@kernel.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Rik van Riel <riel@surriel.com>
Cc: Song Liu <song@kernel.org>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Vlastimil Babka <vbabka@kernel.org>
Cc: Yonghong Song <yonghong.song@linux.dev>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>