The KHO framework uses a notifier chain as the mechanism for clients to
participate in the finalization process. While this works for a single,
central state machine, it is too restrictive for kernel-internal
components like pstore/reserve_mem or IMA. These components need a
simpler, direct way to register their state for preservation (e.g., during
their initcall) without being part of a complex, shutdown-time notifier
sequence. The notifier model forces all participants into a single
finalization flow and makes direct preservation from an arbitrary context
difficult. This patch refactors the client participation model by
removing the notifier chain and introducing a direct API for managing FDT
subtrees.
The core kho_finalize() and kho_abort() state machine remains, but clients
now register their data with KHO beforehand.
Link: https://lkml.kernel.org/r/20251101142325.1326536-3-pasha.tatashin@soleen.com Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Co-developed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Alexander Graf <graf@amazon.com> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Pratyush Yadav <pratyush@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Simon Horman <horms@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pasha Tatashin [Sat, 1 Nov 2025 14:23:17 +0000 (10:23 -0400)]
kho: make debugfs interface optional
Patch series "liveupdate: Rework KHO for in-kernel users", v9.
This series refactors the KHO framework to better support in-kernel users
like the upcoming LUO. The current design, which relies on a notifier
chain and debugfs for control, is too restrictive for direct programmatic
use.
The core of this rework is the removal of the notifier chain in favor of a
direct registration API. This decouples clients from the shutdown-time
finalization sequence, allowing them to manage their preserved state more
flexibly and at any time.
In support of this new model, this series also:
- Makes the debugfs interface optional.
- Introduces APIs to unpreserve memory and fixes a bug in the abort
path where client state was being incorrectly discarded. Note that
this is an interim step, as a more comprehensive fix is planned as
part of the stateless KHO work [1].
- Moves all KHO code into a new kernel/liveupdate/ directory to
consolidate live update components.
This patch (of 9):
Currently, KHO is controlled via debugfs interface, but once LUO is
introduced, it can control KHO, and the debug interface becomes optional.
Add a separate config CONFIG_KEXEC_HANDOVER_DEBUGFS that enables the
debugfs interface, and allows to inspect the tree.
Move all debugfs related code to a new file to keep the .c files clear of
ifdefs.
Link: https://lkml.kernel.org/r/20251101142325.1326536-1-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20251101142325.1326536-2-pasha.tatashin@soleen.com Link: https://lore.kernel.org/all/20251020100306.2709352-1-jasonmiu@google.com Co-developed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Pratyush Yadav <pratyush@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Changyuan Lyu <changyuanl@google.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Simon Horman <horms@kernel.org> Cc: Zhu Yanjun <yanjun.zhu@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
selftests: complete kselftest include centralization
This follow-up patch completes centralization of kselftest.h and
ksefltest_harness.h includes in remaining seltests files, replacing all
relative paths with a non-relative paths using shared -I include path in
lib.mk
Tested with gcc-13.3 and clang-18.1, and cross-compiled successfully on
riscv, arm64, x86_64 and powerpc arch.
[reddybalavignesh9979@gmail.com: add selftests include path for kselftest.h] Link: https://lkml.kernel.org/r/20251017090201.317521-1-reddybalavignesh9979@gmail.com Link: https://lkml.kernel.org/r/20251016104409.68985-1-reddybalavignesh9979@gmail.com Signed-off-by: Bala-Vignesh-Reddy <reddybalavignesh9979@gmail.com> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Link: https://lore.kernel.org/lkml/20250820143954.33d95635e504e94df01930d0@linux-foundation.org/ Reviewed-by: Wei Yang <richard.weiyang@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Günther Noack <gnoack@google.com> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mickael Salaun <mic@digikod.net> Cc: Ming Lei <ming.lei@redhat.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kiryl Shutsemau [Thu, 20 Nov 2025 16:14:11 +0000 (16:14 +0000)]
mm/filemap: fix logic around SIGBUS in filemap_map_pages()
Chris noticed that filemap_map_pages() calculates can_map_large only once
for the first page in the fault around range. The value is not valid for
the following pages in the range and must be recalculated.
Instead of recalculating can_map_large on each iteration, pass down
file_end to filemap_map_folio_range() and let it make the decision on what
can be mapped.
Link: https://lkml.kernel.org/r/20251120161411.859078-1-kirill@shutemov.name Fixes: 74207de2ba10 ("mm/memory: do not populate page table entries beyond i_size")h Signed-off-by: Kiryl Shutsemau <kas@kernel.org> Reported-by: Chris Mason <clm@meta.com> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Mason <clm@meta.com> Cc: Christian Brauner <brauner@kernel.org> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Wei Yang [Wed, 19 Nov 2025 23:53:02 +0000 (23:53 +0000)]
mm/huge_memory: fix NULL pointer deference when splitting folio
Commit c010d47f107f ("mm: thp: split huge page to any lower order pages")
introduced an early check on the folio's order via mapping->flags before
proceeding with the split work.
This check introduced a bug: for shmem folios in the swap cache and
truncated folios, the mapping pointer can be NULL. Accessing
mapping->flags in this state leads directly to a NULL pointer dereference.
This commit fixes the issue by moving the check for mapping != NULL before
any attempt to access mapping->flags.
Link: https://lkml.kernel.org/r/20251119235302.24773-1-richard.weiyang@gmail.com Fixes: c010d47f107f ("mm: thp: split huge page to any lower order pages") Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Pratyush Yadav [Tue, 18 Nov 2025 18:24:15 +0000 (19:24 +0100)]
MAINTAINERS: add test_kho to KHO's entry
Commit b753522bed0b7 ("kho: add test for kexec handover") introduced the
KHO test but missed adding it to KHO's MAINTAINERS entry. Add it so the
KHO maintainers can get patches for its test.
Link: https://lkml.kernel.org/r/20251118182416.70660-1-pratyush@kernel.org Fixes: b753522bed0b7 ("kho: add test for kexec handover") Signed-off-by: Pratyush Yadav <pratyush@kernel.org> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Alexander Graf <graf@amazon.com> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Carlos Llamas [Thu, 13 Nov 2025 03:46:22 +0000 (03:46 +0000)]
selftests/mm: fix division-by-zero in uffd-unit-tests
Commit 4dfd4bba8578 ("selftests/mm/uffd: refactor non-composite global
vars into struct") moved some of the operations previously implemented in
uffd_setup_environment() earlier in the main test loop.
The calculation of nr_pages, which involves a division by page_size, now
occurs before checking that default_huge_page_size() returns a non-zero
This leads to a division-by-zero error on systems with !CONFIG_HUGETLB.
Fix this by relocating the non-zero page_size check before the nr_pages
calculation, as it was originally implemented.
Link: https://lkml.kernel.org/r/20251113034623.3127012-1-cmllamas@google.com Fixes: 4dfd4bba8578 ("selftests/mm/uffd: refactor non-composite global vars into struct") Signed-off-by: Carlos Llamas <cmllamas@google.com> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Ujwal Kundur <ujwal.kundur@gmail.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Liam R. Howlett [Tue, 11 Nov 2025 21:56:05 +0000 (16:56 -0500)]
mm/mmap_lock: reset maple state on lock_vma_under_rcu() retry
The retry in lock_vma_under_rcu() drops the rcu read lock before
reacquiring the lock and trying again. This may cause a use-after-free if
the maple node the maple state was using was freed.
The maple state is protected by the rcu read lock. When the lock is
dropped, the state cannot be reused as it tracks pointers to objects that
may be freed during the time where the lock was not held.
Any time the rcu read lock is dropped, the maple state must be
invalidated. Resetting the address and state to MA_START is the safest
course of action, which will result in the next operation starting from
the top of the tree.
Prior to commit 0b16f8bed19c ("mm: change vma_start_read() to drop RCU
lock on failure"), vma_start_read() would drop rcu read lock and return
NULL, so the retry would not have happened. However, now that
vma_start_read() drops rcu read lock on failure followed by a retry, we
may end up using a freed maple tree node cached in the maple state.
[surenb@google.com: changelog alteration] Link: https://lkml.kernel.org/r/CAJuCfpEWMD-Z1j=nPYHcQW4F7E2Wka09KTXzGv7VE7oW1S8hcw@mail.gmail.com Link: https://lkml.kernel.org/r/20251111215605.1721380-1-Liam.Howlett@oracle.com Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Fixes: 0b16f8bed19c ("mm: change vma_start_read() to drop RCU lock on failure") Reported-by: syzbot+131f9eb2b5807573275c@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=131f9eb2b5807573275c Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
When allocating hugetlb folios for memfd, three initialization steps are
missing:
1. Folios are not zeroed, leading to kernel memory disclosure to userspace
2. Folios are not marked uptodate before adding to page cache
3. hugetlb_fault_mutex is not taken before hugetlb_add_to_page_cache()
The memfd allocation path bypasses the normal page fault handler
(hugetlb_no_page) which would handle all of these initialization steps.
This is problematic especially for udmabuf use cases where folios are
pinned and directly accessed by userspace via DMA.
Fix by matching the initialization pattern used in hugetlb_no_page():
- Zero the folio using folio_zero_user() which is optimized for huge pages
- Mark it uptodate with folio_mark_uptodate()
- Take hugetlb_fault_mutex before adding to page cache to prevent races
The folio_zero_user() change also fixes a potential security issue where
uninitialized kernel memory could be disclosed to userspace through read()
or mmap() operations on the memfd.
Link: https://lkml.kernel.org/r/20251112145034.2320452-1-kartikey406@gmail.com Fixes: 89c1905d9c14 ("mm/gup: introduce memfd_pin_folios() for pinning memfd folios") Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Reported-by: syzbot+f64019ba229e3a5c411b@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/20251112031631.2315651-1-kartikey406@gmail.com/ Closes: https://syzkaller.appspot.com/bug?extid=f64019ba229e3a5c411b Suggested-by: Oscar Salvador <osalvador@suse.de> Suggested-by: David Hildenbrand <david@redhat.com> Tested-by: syzbot+f64019ba229e3a5c411b@syzkaller.appspotmail.com Acked-by: Oscar Salvador <osalvador@suse.de> Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Vivek Kasireddy <vivek.kasireddy@intel.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Jason Gunthorpe <jgg@nvidia.com> (v2) Cc: Christoph Hellwig <hch@lst.de> (v6) Cc: Dave Airlie <airlied@redhat.com> Cc: Gerd Hoffmann <kraxel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Youngjun Park [Sun, 2 Nov 2025 08:24:56 +0000 (17:24 +0900)]
mm: swap: remove duplicate nr_swap_pages decrement in get_swap_page_of_type()
After commit 4f78252da887, nr_swap_pages is decremented in
swap_range_alloc(). Since cluster_alloc_swap_entry() calls
swap_range_alloc() internally, the decrement in get_swap_page_of_type()
causes double-decrementing.
As a representative userspace-visible runtime example of the impact,
/proc/meminfo reports increasingly inaccurate SwapFree values. The
discrepancy grows with each swap allocation, and during hibernation
when large amounts of memory are written to swap, the reported value
can deviate significantly from actual available swap space, misleading
users and monitoring tools.
Remove the duplicate decrement.
Link: https://lkml.kernel.org/r/20251102082456.79807-1-youngjun.park@lge.com Fixes: 4f78252da887 ("mm: swap: move nr_swap_pages counter decrement from folio_alloc_swap() to swap_range_alloc()") Signed-off-by: Youngjun Park <youngjun.park@lge.com> Acked-by: Chris Li <chrisl@kernel.org> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Kairui Song <kasong@tencent.com> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baoquan He <bhe@redhat.com> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: <stable@vger.kernel.org> [6.17+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: mark inode bad upon validation failure during read
A VFS cache inconsistency, potentially triggered by sequences like
buffered writes followed by open(O_DIRECT), can result in an invalid
on-disk inode block (e.g., bad signature). OCFS2 detects this corruption
when reading the inode block via ocfs2_validate_inode_block(), logs
"Invalid dinode", and often switches the filesystem to read-only mode.
The VFS open(O_DIRECT) operation appears to incorrectly clear the inode's
I_DIRTY flag without ensuring the dirty metadata (reflecting the earlier
buffered write, e.g., an updated i_size) is flushed to disk. This leaves
the in-memory VFS inode object "in limbo" with an updated size (e.g.,
38639 from the write) but marked clean, while its on-disk counterpart
remains stale (e.g., size 0) or invalid.
Currently, the function reading the inode block
(ocfs2_read_inode_block_full()) fails to call make_bad_inode() upon
detecting the validation error. Because the in-memory inode is not marked
bad, subsequent operations (like ftruncate) proceed erroneously. They
eventually reach code (e.g., ocfs2_truncate_file()) that compares the
inconsistent in-memory size (38639) against the invalid/stale on-disk size
(0), leading to kernel crashes via BUG_ON.
Fix this by calling make_bad_inode(inode) within the error handling path
of ocfs2_read_inode_block_full() immediately after a block read or
validation error occurs. This ensures VFS is properly notified about the
corrupt inode at the point of detection. Marking the inode bad allows VFS
to correctly fail subsequent operations targeting this inode early,
preventing kernel panics caused by operating on known inconsistent inode
states.
Link: https://lkml.kernel.org/r/20251118001833.423470-2-eraykrdg1@gmail.com Link: https://lore.kernel.org/all/20251029225748.11361-2-eraykrdg1@gmail.com/T/ Signed-off-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Signed-off-by: Ahmet Eray Karadag <eraykrdg1@gmail.com> Reported-by: syzbot+b93b65ee321c97861072@syzkaller.appspotmail.com Link: https://syzkaller.appspot.com/bug?extid=b93b65ee321c97861072 Reviewed-by: Heming Zhao <heming.zhao@suse.com> Co-developed-by: Albin Babu Varghese <albinbabuvarghese20@gmail.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: David Hunter <david.hunter.linux@gmail.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thorsten Blum [Tue, 18 Nov 2025 18:53:45 +0000 (19:53 +0100)]
ocfs2: replace deprecated strcpy with strscpy
strcpy() has been deprecated [1] because it performs no bounds checking on
the destination buffer, which can lead to buffer overflows. Replace it
with the safer strscpy(), and copy directly into '->rf_signature' instead
of using the start of the struct as the destination buffer.
Thorsten Blum [Tue, 18 Nov 2025 18:53:44 +0000 (19:53 +0100)]
ocfs2: replace deprecated strcpy in ocfs2_create_xattr_block
strcpy() has been deprecated [1] because it performs no bounds checking on
the destination buffer, which can lead to buffer overflows. Replace it
with the safer strscpy(), and copy directly into '->xb_signature' instead
of using the start of the struct as the destination buffer.
Sourabh Jain [Tue, 18 Nov 2025 07:10:23 +0000 (12:40 +0530)]
crash: export crashkernel CMA reservation to userspace
Add a sysfs entry /sys/kernel/kexec_crash_cma_ranges to expose all CMA
crashkernel ranges.
This allows userspace tools configuring kdump to determine how much memory
is reserved for crashkernel. If CMA is used, tools can warn users when
attempting to capture user pages with CMA reservation.
The new sysfs hold the CMA ranges in below format:
The reason for not including Crash CMA Ranges in /proc/iomem is to avoid
conflicts. It has been observed that contiguous memory ranges are
sometimes shown as two separate System RAM entries in /proc/iomem. If a
CMA range overlaps two System RAM ranges, adding crashk_res to /proc/iomem
can create a conflict. Reference [1] describes one such instance on the
PowerPC architecture.
Link: https://lkml.kernel.org/r/20251118071023.1673329-1-sourabhjain@linux.ibm.com Link: https://lore.kernel.org/all/20251016142831.144515-1-sourabhjain@linux.ibm.com/ Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Aditya Gupta <adityag@linux.ibm.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shivang Upadhyay <shivangu@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Sourabh Jain [Mon, 17 Nov 2025 03:51:53 +0000 (09:21 +0530)]
Documentation/ABI: add kexec and kdump sysfs interface
Add an ABI document for following kexec and kdump sysfs interface:
- /sys/kernel/kexec_loaded
- /sys/kernel/kexec_crash_loaded
- /sys/kernel/kexec_crash_size
- /sys/kernel/crash_elfcorehdr_size
Link: https://lkml.kernel.org/r/20251117035153.1199665-1-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Aditya Gupta <adityag@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Shivang Upadhyay <shivangu@linux.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guan-Chun Wu [Fri, 14 Nov 2025 06:02:40 +0000 (14:02 +0800)]
ceph: replace local base64 helpers with lib/base64
Remove the ceph_base64_encode() and ceph_base64_decode() functions and
replace their usage with the generic base64_encode() and base64_decode()
helpers from lib/base64.
This eliminates the custom implementation in Ceph, reduces code
duplication, and relies on the shared Base64 code in lib. The helpers
preserve RFC 3501-compliant Base64 encoding without padding, so there are
no functional changes.
This change also improves performance: encoding is about 2.7x faster and
decoding achieves 43-52x speedups compared to the previous local
implementation.
Link: https://lkml.kernel.org/r/20251114060240.89965-1-409411716@gms.tku.edu.tw Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Xiubo Li <xiubli@redhat.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: "Theodore Y. Ts'o" <tytso@mit.edu> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: David Laight <david.laight.linux@gmail.com> Cc: Yu-Sheng Huang <home7438072@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guan-Chun Wu [Fri, 14 Nov 2025 06:02:21 +0000 (14:02 +0800)]
fscrypt: replace local base64url helpers with lib/base64
Replace the base64url encoding and decoding functions in fscrypt with the
generic base64_encode() and base64_decode() helpers from lib/base64.
This removes the custom implementation in fscrypt, reduces code
duplication, and relies on the shared Base64 implementation in lib. The
helpers preserve RFC 4648-compliant URL-safe Base64 encoding without
padding, so there are no functional changes.
This change also improves performance: encoding is about 2.7x faster and
decoding achieves 43-52x speedups compared to the previous implementation.
Link: https://lkml.kernel.org/r/20251114060221.89734-1-409411716@gms.tku.edu.tw Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com> Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Cc: Christoph Hellwig <hch@lst.de> Cc: David Laight <david.laight.linux@gmail.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: "Theodore Y. Ts'o" <tytso@mit.edu> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yu-Sheng Huang <home7438072@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guan-Chun Wu [Fri, 14 Nov 2025 06:01:57 +0000 (14:01 +0800)]
lib: add KUnit tests for base64 encoding/decoding
Add a KUnit test suite to validate the base64 helpers. The tests cover
both encoding and decoding, including padded and unpadded forms as defined
by RFC 4648 (standard base64), and add negative cases for malformed inputs
and padding errors.
The test suite also validates other variants (URLSAFE, IMAP) to ensure
their correctness.
In addition to functional checks, the suite includes simple
microbenchmarks which report average encode/decode latency for small (64B)
and larger (1KB) inputs. These numbers are informational only and do not
gate the tests.
Kconfig (BASE64_KUNIT) and lib/tests/Makefile are updated accordingly.
Sample KUnit output:
KTAP version 1
# Subtest: base64
# module: base64_kunit
1..4
# base64_performance_tests: [64B] encode run : 32ns
# base64_performance_tests: [64B] decode run : 35ns
# base64_performance_tests: [1KB] encode run : 510ns
# base64_performance_tests: [1KB] decode run : 530ns
ok 1 base64_performance_tests
ok 2 base64_std_encode_tests
ok 3 base64_std_decode_tests
ok 4 base64_variant_tests
# base64: pass:4 fail:0 skip:0 total:4
# Totals: pass:4 fail:0 skip:0 total:4
Link: https://lkml.kernel.org/r/20251114060157.89507-1-409411716@gms.tku.edu.tw Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: David Laight <david.laight.linux@gmail.com> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: "Theodore Y. Ts'o" <tytso@mit.edu> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yu-Sheng Huang <home7438072@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Guan-Chun Wu [Fri, 14 Nov 2025 06:01:32 +0000 (14:01 +0800)]
lib/base64: rework encode/decode for speed and stricter validation
The old base64 implementation relied on a bit-accumulator loop, which was
slow for larger inputs and too permissive in validation. It would accept
extra '=', missing '=', or even '=' appearing in the middle of the input,
allowing malformed strings to pass. This patch reworks the internals to
improve performance and enforce stricter validation.
Changes:
- Encoder:
* Process input in 3-byte blocks, mapping 24 bits into four 6-bit
symbols, avoiding bit-by-bit shifting and reducing loop iterations.
* Handle the final 1-2 leftover bytes explicitly and emit '=' only when
requested.
- Decoder:
* Based on the reverse lookup tables from the previous patch, decode
input in 4-character groups.
* Each group is looked up directly, converted into numeric values, and
combined into 3 output bytes.
* Explicitly handle padded and unpadded forms:
- With padding: input length must be a multiple of 4, and '=' is
allowed only in the last two positions. Reject stray or early '='.
- Without padding: validate tail lengths (2 or 3 chars) and require
unused low bits to be zero.
* Removed the bit-accumulator style loop to reduce loop iterations.
Kuan-Wei Chiu [Fri, 14 Nov 2025 06:01:07 +0000 (14:01 +0800)]
lib/base64: optimize base64_decode() with reverse lookup tables
Replace the use of strchr() in base64_decode() with precomputed reverse
lookup tables for each variant. This avoids repeated string scans and
improves performance. Use -1 in the tables to mark invalid characters.
Kuan-Wei Chiu [Fri, 14 Nov 2025 06:00:45 +0000 (14:00 +0800)]
lib/base64: add support for multiple variants
Patch series " lib/base64: add generic encoder/decoder, migrate users", v5.
This series introduces a generic Base64 encoder/decoder to the kernel
library, eliminating duplicated implementations and delivering significant
performance improvements.
The Base64 API has been extended to support multiple variants (Standard,
URL-safe, and IMAP) as defined in RFC 4648 and RFC 3501. The API now
takes a variant parameter and an option to control padding. As part of
this series, users are migrated to the new interface while preserving
their specific formats: fscrypt now uses BASE64_URLSAFE, Ceph uses
BASE64_IMAP, and NVMe is updated to BASE64_STD.
On the encoder side, the implementation processes input in 3-byte blocks,
mapping 24 bits directly to 4 output symbols. This avoids bit-by-bit
streaming and reduces loop overhead, achieving about a 2.7x speedup
compared to previous implementations.
On the decoder side, replace strchr() lookups with per-variant reverse
tables and process input in 4-character groups. Each group is mapped to
numeric values and combined into 3 bytes. Padded and unpadded forms are
validated explicitly, rejecting invalid '=' usage and enforcing tail
rules. This improves throughput by ~43-52x.
This patch (of 6):
Extend the base64 API to support multiple variants (standard, URL-safe,
and IMAP) as defined in RFC 4648 and RFC 3501. The API now takes a
variant parameter and an option to control padding. Update NVMe auth code
to use the new interface with BASE64_STD.
Link: https://lkml.kernel.org/r/20251114055829.87814-1-409411716@gms.tku.edu.tw Link: https://lkml.kernel.org/r/20251114060045.88792-1-409411716@gms.tku.edu.tw Signed-off-by: Kuan-Wei Chiu <visitorckw@gmail.com> Co-developed-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Signed-off-by: Guan-Chun Wu <409411716@gms.tku.edu.tw> Reviewed-by: David Laight <david.laight.linux@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Eric Biggers <ebiggers@kernel.org> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Keith Busch <kbusch@kernel.org> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: "Theodore Y. Ts'o" <tytso@mit.edu> Cc: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com> Cc: Xiubo Li <xiubli@redhat.com> Cc: Yu-Sheng Huang <home7438072@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Feng Tang [Thu, 13 Nov 2025 11:10:39 +0000 (19:10 +0800)]
sys_info: add a default kernel sys_info mask
Which serves as a global default sys_info mask. When users want the same
system information for many error cases (panic, hung, lockup ...), they
can chose to set this global knob only once, while not setting up each
individual sys_info knobs.
This just adds a 'lazy' option, and doesn't change existing kernel
behavior as the mask is 0 by default.
Link: https://lkml.kernel.org/r/20251113111039.22701-5-feng.tang@linux.alibaba.com Suggested-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Feng Tang [Thu, 13 Nov 2025 11:10:38 +0000 (19:10 +0800)]
watchdog: add sys_info sysctls to dump sys info on system lockup
When soft/hard lockup happens, developers may need different kinds of
system information (call-stacks, memory info, locks, etc.) to help
debugging.
Add 'softlockup_sys_info' and 'hardlockup_sys_info' sysctl knobs to take
human readable string like "tasks,mem,timers,locks,ftrace,...", and when
system lockup happens, all requested information will be printed out.
(refer kernel/sys_info.c for more details).
Link: https://lkml.kernel.org/r/20251113111039.22701-4-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Petr Mladek <pmladek@suse.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Feng Tang [Thu, 13 Nov 2025 11:10:37 +0000 (19:10 +0800)]
hung_task: add hung_task_sys_info sysctl to dump sys info on task-hung
When task-hung happens, developers may need different kinds of system
information (call-stacks, memory info, locks, etc.) to help debugging.
Add 'hung_task_sys_info' sysctl knob to take human readable string like
"tasks,mem,timers,locks,ftrace,...", and when task-hung happens, all
requested information will be dumped. (refer kernel/sys_info.c for more
details).
Meanwhile, the newly introduced sys_info() call is used to unify some
existing info-dumping knobs.
[feng.tang@linux.alibaba.com: maintain consistecy established behavior, per Lance and Petr] Link: https://lkml.kernel.org/r/aRncJo1mA5Zk77Hr@U-2FWC9VHC-2323.local Link: https://lkml.kernel.org/r/20251113111039.22701-3-feng.tang@linux.alibaba.com Signed-off-by: Feng Tang <feng.tang@linux.alibaba.com> Suggested-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Lance Yang <lance.yang@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Feng Tang [Thu, 13 Nov 2025 11:10:36 +0000 (19:10 +0800)]
docs: panic: correct some sys_ifo names in sysctl doc
Patch series "Enable hung_task and lockup cases to dump system info on
demand", v2.
When working on kernel stability issues: panic, task-hung and soft/hard
lockup are frequently met. And to debug them, user may need lots of
system information at that time, like task call stacks, lock info, memory
info, ftrace dump, etc.
panic case already uses sys_info() for this purpose, and has a
'panic_sys_info' sysctl(also support cmdline setup) interface to take
human readable string like "tasks,mem,timers,locks,ftrace,..." to control
what kinds of information is needed. Which is also helpful to debug
task-hung and lockup cases.
So this patchset introduces the similar sys_info sysctl interface for
task-hung and lockup cases.
his is mainly for debugging and the info dumping could be intrusive, like
dumping call stack for all tasks when system has huge number of tasks,
similarly for ftrace dump (we may add tracing_stop() and tracing_start()
around it)
Locally these have been used in our bug chasing for stability issues and
were helpful.
As Andrew suggested, add a configurable global 'kernel_sys_info' knob.
When error scenarios like panic/hung-task/lockup etc doesn't setup their
own sys_info knob and calls sys_info() with parameter "0", this global
knob will take effect. It could be used for other kernel cases like OOM,
which may not need one dedicated sys_info knob.
This patch (of 4):
Some sys_info names wered forgotten to change in patch iterations, while
the right names are defined in kernel/sys_info.c.
The introduction of WRITE_ONCE() calls for the 'prev' and 'next' variables
inside plist_check_list() was a misapplication. WRITE_ONCE() is
fundamentally a compiler barrier designed to prevent compiler
optimizations (like caching or reordering) on shared memory locations.
However, the variables 'prev' and 'next' are local, stack-allocated
pointers accessed only by the current thread's invocation of the function.
Since these pointers are thread-local and are never accessed concurrently,
applying WRITE_ONCE() to them is semantically incorrect and unnecessary.
Furthermore, the use of WRITE_ONCE() on local variables prevents the
compiler from performing standard optimizations, such as keeping these
variables cached solely in CPU registers throughout the loop, potentially
introducing performance overhead. Restore the conventional C assignment
for local loop variables, allowing the compiler to generate optimal code.
Xie Yuanbin [Sun, 9 Nov 2025 08:37:15 +0000 (16:37 +0800)]
include/linux/once_lite.h: fix judgment in WARN_ONCE with clang
For c code:
```c
extern int xx;
void test(void)
{
if (WARN_ONCE(xx, "x"))
__asm__ volatile ("nop":::);
}
```
Clang will generate the following assembly code:
```assemble
test:
movl xx(%rip), %eax // Assume xx == 0 (likely case)
testl %eax, %eax // judge once
je .LBB0_3 // jump to .LBB0_3
testb $1, test.__already_done(%rip)
je .LBB0_2
.LBB0_3:
testl %eax, %eax // judge again
je .LBB0_5 // jump to .LBB0_5
.LBB0_4:
nop
.LBB0_5:
retq
// omit
```
In the above code, `xx == 0` should be a likely case, but in this case,
xx has been judged twice.
Test info:
1. kernel source:
linux-next
commit 9c0826a5d9aa4d52206d ("Add linux-next specific files for 20251107")
2. compiler:
clang: Debian clang version 21.1.4 (8) with
Debian LLD 21.1.4 (compatible with GNU linkers)
3. config:
base on default x86_64_defconfig, and setting:
CONFIG_MITIGATION_RETHUNK=n
CONFIG_STACKPROTECTOR=n
Add unlikely to __ret_cond to help the compiler optimize correctly.
[akpm@linux-foundation.org: undo whitespace changes] Link: https://lkml.kernel.org/r/20251109083715.24495-1-qq570070308@gmail.com Signed-off-by: Xie Yuanbin <qq570070308@gmail.com> Cc: Bill Wendling <morbo@google.com> Cc: Jan Kara <jack@suse.cz> Cc: Justin Stitt <justinstitt@google.com> Cc: Maninder Singh <maninder1.s@samsung.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ryusuke Konishi [Fri, 7 Nov 2025 15:32:49 +0000 (00:32 +0900)]
MAINTAINERS: update nilfs2 entry
Viacheslav has kindly offered to help with the maintenance of nilfs2 by
upstreaming patches, similar to the HFS/HFS+ tree. I've accepted his
offer, and will therefore add him as a co-maintainer and switch the
project's git tree for that role.
At the same time, change the outdated status field to Maintained to
reflect the current state.
zhang jiao [Thu, 6 Nov 2025 01:07:34 +0000 (09:07 +0800)]
fs/proc/page: remove unused KPMBITS
KPMBITS is never referenced in the code. Just remove it.
Link: https://lkml.kernel.org/r/20251106010735.1603-1-zhangjiao2@cmss.chinamobile.com Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Ye <liuye@kylinos.cn> Cc: Luiz Capitulino <luizcap@redhat.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
scripts/gdb/symbols: make BPF debug info available to GDB
One can debug BPF programs with QEMU gdbstub by setting a breakpoint on
bpf_prog_kallsyms_add(), waiting for a hit with a matching aux.name, and
then setting a breakpoint on bpf_func. This is tedious, error-prone, and
also lacks line numbers.
Automate this in a way similar to the existing support for modules in
lx-symbols.
Enumerate and monitor changes to both BPF kallsyms and JITed progs. For
each ksym, generate and compile a synthetic .s file containing the name,
code, and size. In addition, if this ksym is also a prog, and not a
trampoline, add line number information.
Ensure that this is a no-op if the kernel is built without BPF support or
if "as" is missing. In theory the "as" dependency may be dropped by
generating the synthetic .o file manually, but this is too much complexity
for too little benefit.
Now one can debug BPF progs out of the box like this:
(gdb) lx-symbols -bpf
(gdb) b bpf_prog_4e612a6a881a086b_arena_list_add
Breakpoint 2 (bpf_prog_4e612a6a881a086b_arena_list_add) pending.
# ./test_progs -t arena_list
Thread 4 hit Breakpoint 2, bpf_prog_4e612a6a881a086b_arena_list_add ()
at linux/tools/testing/selftests/bpf/progs/arena_list.c:51
51 list_head = &global_head;
(gdb) n
bpf_prog_4e612a6a881a086b_arena_list_add () at linux/tools/testing/selftests/bpf/progs/arena_list.c:53
53 for (i = zero; i < cnt && can_loop; i++) {
Patch series "scripts/gdb/symbols: make BPF debug info available to GDB",
v2.
This series greatly simplifies debugging BPF progs when using QEMU gdbstub
by providing symbol names, sizes, and line numbers to GDB.
Patch 1 adds radix tree iteration, which is necessary for parsing
prog_idr. Patch 2 is the actual implementation; its description contains
some details on how to use this.
This patch (of 2):
Add a function and a command to iterate over radix tree contents.
Duplicate the C implementation in Python, but drop support for tagging.
David Laight [Wed, 5 Nov 2025 20:10:34 +0000 (20:10 +0000)]
lib: mul_u64_u64_div_u64(): optimise the divide code
Replace the bit by bit algorithm with one that generates 16 bits per
iteration on 32bit architectures and 32 bits on 64bit ones.
On my zen 5 this reduces the time for the tests (using the generic code)
from ~3350ns to ~1000ns.
Running the 32bit algorithm on 64bit x86 takes ~1500ns. It'll be slightly
slower on a real 32bit system, mostly due to register pressure.
The savings for 32bit x86 are much higher (tested in userspace). The
worst case (lots of bits in the quotient) drops from ~900 clocks to ~130
(pretty much independant of the arguments). Other 32bit architectures may
see better savings.
It is possibly to optimise for divisors that span less than
__LONG_WIDTH__/2 bits. However I suspect they don't happen that often and
it doesn't remove any slow cpu divide instructions which dominate the
result.
Typical improvements for 64bit random divides:
old new
sandy bridge: 470 150
haswell: 400 144
piledriver: 960 467 I think rdpmc is very slow.
zen5: 244 80
(Timing is 'rdpmc; mul_div(); rdpmc' with the multiply depending on the
first rdpmc and the second rdpmc depending on the quotient.)
Object code (64bit x86 test program): old 0x173 new 0x141.
Link: https://lkml.kernel.org/r/20251105201035.64043-9-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:33 +0000 (20:10 +0000)]
lib: mul_u64_u64_div_u64(): optimise multiply on 32bit x86
gcc generates horrid code for both ((u64)u32_a * u32_b) and (u64_a +
u32_b). As well as the extra instructions it can generate a lot of spills
to stack (including spills of constant zeros and even multiplies by
constant zero).
mul_u32_u32() already exists to optimise the multiply. Add a similar
add_u64_32() for the addition. Disable both for clang - it generates
better code without them.
Move the 64x64 => 128 multiply into a static inline helper function for
code clarity. No need for the a/b_hi/lo variables, the implicit casts on
the function calls do the work for us. Should have minimal effect on the
generated code.
Use mul_u32_u32() and add_u64_u32() in the 64x64 => 128 multiply in
mul_u64_add_u64_div_u64().
Link: https://lkml.kernel.org/r/20251105201035.64043-8-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:32 +0000 (20:10 +0000)]
lib: test_mul_u64_u64_div_u64(): test both generic and arch versions
Change the #if in div64.c so that test_mul_u64_u64_div_u64.c can compile
and test the generic version (including the 'long multiply') on
architectures (eg amd64) that define their own copy.
Test the kernel version and the locally compiled version on all arch.
Output the time taken (in ns) on the 'test completed' trace.
For reference, on my zen 5, the optimised version takes ~220ns and the
generic version ~3350ns. Using the native multiply saves ~200ns and
adding back the ilog2() 'optimisation' test adds ~50ms.
Link: https://lkml.kernel.org/r/20251105201035.64043-7-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:31 +0000 (20:10 +0000)]
lib: add tests for mul_u64_u64_div_u64_roundup()
Replicate the existing mul_u64_u64_div_u64() test cases with round up.
Update the shell script that verifies the table, remove the comment
markers so that it can be directly pasted into a shell.
Rename the divisor from 'c' to 'd' to match mul_u64_add_u64_div_u64().
It any tests fail then fail the module load with -EINVAL.
Link: https://lkml.kernel.org/r/20251105201035.64043-6-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:30 +0000 (20:10 +0000)]
lib: add mul_u64_add_u64_div_u64() and mul_u64_u64_div_u64_roundup()
The existing mul_u64_u64_div_u64() rounds down, a 'rounding up' variant
needs 'divisor - 1' adding in between the multiply and divide so cannot
easily be done by a caller.
Add mul_u64_add_u64_div_u64(a, b, c, d) that calculates (a * b + c)/d and
implement the 'round down' and 'round up' using it.
Update the x86-64 asm to optimise for 'c' being a constant zero.
Add kerndoc definitions for all three functions.
Link: https://lkml.kernel.org/r/20251105201035.64043-5-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:29 +0000 (20:10 +0000)]
lib: mul_u64_u64_div_u64(): simplify check for a 64bit product
If the product is only 64bits div64_u64() can be used for the divide.
Replace the pre-multiply check (ilog2(a) + ilog2(b) <= 62) with a simple
post-multiply check that the high 64bits are zero.
This has the advantage of being simpler, more accurate and less code. It
will always be faster when the product is larger than 64bits.
Most 64bit cpu have a native 64x64=128 bit multiply, this is needed (for
the low 64bits) even when div64_u64() is called - so the early check gains
nothing and is just extra code.
32bit cpu will need a compare (etc) to generate the 64bit ilog2() from two
32bit bit scans - so that is non-trivial. (Never mind the mess of x86's
'bsr' and any oddball cpu without fast bit-scan instructions.) Whereas the
additional instructions for the 128bit multiply result are pretty much one
multiply and two adds (typically the 'adc $0,%reg' can be run in parallel
with the instruction that follows).
The only outliers are 64bit systems without 128bit mutiply and simple in
order 32bit ones with fast bit scan but needing extra instructions to get
the high bits of the multiply result. I doubt it makes much difference to
either, the latter is definitely not mainstream.
If anyone is worried about the analysis they can look at the generated
code for x86 (especially when cmov isn't used).
Link: https://lkml.kernel.org/r/20251105201035.64043-4-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:28 +0000 (20:10 +0000)]
lib: mul_u64_u64_div_u64(): combine overflow and divide by zero checks
Since the overflow check always triggers when the divisor is zero
move the check for divide by zero inside the overflow check.
This means there is only one test in the normal path.
Link: https://lkml.kernel.org/r/20251105201035.64043-3-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
David Laight [Wed, 5 Nov 2025 20:10:27 +0000 (20:10 +0000)]
lib: mul_u64_u64_div_u64(): rename parameter 'c' to 'd'
Patch series "Implement mul_u64_u64_div_u64_roundup()", v5.
The pwm-stm32.c code wants a 'rounding up' version of
mul_u64_u64_div_u64(). This can be done simply by adding 'divisor - 1' to
the 128bit product. Implement mul_u64_add_u64_div_u64(a, b, c, d) = (a *
b + c)/d based on the existing code. Define mul_u64_u64_div_u64(a, b, d)
as mul_u64_add_u64_div_u64(a, b, 0, d) and mul_u64_u64_div_u64_roundup(a,
b, d) as mul_u64_add_u64_div_u64(a, b, d-1, d).
Only x86-64 has an optimsed (asm) version of the function. That is
optimised to avoid the 'add c' when c is known to be zero. In all other
cases the extra code will be noise compared to the software divide code.
The test module has been updated to test mul_u64_u64_div_u64_roundup() and
also enhanced it to verify the C division code on x86-64 and the 32bit
division code on 64bit.
This patch (of 9):
Change to prototype from mul_u64_u64_div_u64(u64 a, u64 b, u64 c) to
mul_u64_u64_div_u64(u64 a, u64 b, u64 d). Using 'd' for 'divisor' makes
more sense.
An upcoming change adds a 'c' parameter to calculate (a * b + c)/d.
Link: https://lkml.kernel.org/r/20251105201035.64043-1-david.laight.linux@gmail.com Link: https://lkml.kernel.org/r/20251105201035.64043-2-david.laight.linux@gmail.com Signed-off-by: David Laight <david.laight.linux@gmail.com> Reviewed-by: Nicolas Pitre <npitre@baylibre.com> Cc: Biju Das <biju.das.jz@bp.renesas.com> Cc: Borislav Betkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Li RongQing <lirongqing@baidu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Cc: Uwe Kleine-König <u.kleine-koenig@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Tue, 4 Nov 2025 18:38:34 +0000 (19:38 +0100)]
util_macros.h: fix kernel-doc for u64_to_user_ptr()
The added documentation to u64_to_user_ptr() misspelled the function name.
Fix it.
Link: https://lkml.kernel.org/r/20251104183834.1046584-1-andriy.shevchenko@linux.intel.com Fixes: 029c896c4105 ("kernel.h: move PTR_IF() and u64_to_user_ptr() to util_macros.h") Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Alexandru Ardelean <aardelean@baylibre.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Carlos López [Fri, 31 Oct 2025 11:19:09 +0000 (12:19 +0100)]
checkpatch: add IDR to the deprecated list
As of commit 85656ec193e9, the IDR interface is marked as deprecated in
the documentation, but no checks are made in that regard for new code.
Add the existing IDR initialization APIs to the deprecated list in
checkpatch, so that if new code is introduced using these APIs, a warning
is emitted.
Link: https://lkml.kernel.org/r/20251031111908.2266077-2-clopez@suse.de Signed-off-by: Carlos López <clopez@suse.de> Suggested-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Joe Perches <joe@perches.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
ocfs2: validate cl_bpc in allocator inodes to prevent divide-by-zero
The chain allocator field cl_bpc (blocks per cluster) is read from disk
and used in division operations without validation. A corrupted
filesystem image with cl_bpc=0 causes a divide-by-zero crash in the
kernel:
This patch adds validation in ocfs2_validate_inode_block() to ensure
cl_bpc matches the expected value calculated from the superblock's cluster
size and block size for chain allocator inodes (identified by
OCFS2_CHAIN_FL).
Moving the validation to inode validation time (rather than allocation time)
has several benefits:
- Validates once when the inode is read, rather than on every allocation
- Protects all code paths that use cl_bpc (allocation, resize, etc.)
- Follows the existing pattern of inode validation in OCFS2
- Centralizes validation logic
The validation catches both:
- Zero values that cause divide-by-zero crashes
- Non-zero but incorrect values indicating filesystem corruption or
mismatched filesystem geometry
With this fix, mounting a corrupted filesystem produces:
[dmantipov@yandex.ru: combine into the series and tweak the message to fit the commonly used style] Link: https://lkml.kernel.org/r/20251030153003.1934585-2-dmantipov@yandex.ru Link: https://lore.kernel.org/ocfs2-devel/20251026132625.12348-1-kartikey406@gmail.com/T/#u Link: https://lore.kernel.org/all/20251027124131.10002-1-kartikey406@gmail.com/T/ Signed-off-by: Deepanshu Kartikey <kartikey406@gmail.com> Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+fd8af97c7227fe605d95@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fd8af97c7227fe605d95 Tested-by: syzbot+fd8af97c7227fe605d95@syzkaller.appspotmail.com Suggested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Antipov [Thu, 30 Oct 2025 15:30:02 +0000 (18:30 +0300)]
ocfs2: add extra consistency checks for chain allocator dinodes
When validating chain allocator dinode in 'ocfs2_validate_inode_block()',
add an extra checks whether a) the maximum amount of chain records in
'struct ocfs2_chain_list' matches the value calculated based on the
filesystem block size, and b) the next free slot index is within the valid
range.
Link: https://lkml.kernel.org/r/20251030153003.1934585-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+77026564530dbc29b854@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=77026564530dbc29b854 Reported-by: syzbot+5054473a31f78f735416@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5054473a31f78f735416 Suggested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Jun Piao <piaojun@huawei.com> Cc: Deepanshu Kartikey <kartikey406@gmail.com> Cc: Heming Zhao <heming.zhao@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Andy Shevchenko [Thu, 30 Oct 2025 11:44:20 +0000 (12:44 +0100)]
panic: sys_info: rewrite a fix for a compilation error (`make W=1`)
Compiler was not happy about dead variable in use:
lib/sys_info.c:52:19: error: variable 'sys_info_avail' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
52 | static const char sys_info_avail[] = "tasks,mem,timers,locks,ftrace,all_bt,blocked_tasks";
| ^~~~~~~~~~~~~~
This was fixed by adding __maybe_unused attribute that just hides the
issue and didn't actually fix the root cause. Rewrite the fix by moving
the local variable from stack to a heap.
As a side effect this drops unneeded "synchronisation" of duplicative info
and also makes code ready for the further refactoring.
Andy Shevchenko [Thu, 30 Oct 2025 11:44:18 +0000 (12:44 +0100)]
panic: sys_info: align constant definition names with parameters
Align constant definition names with parameters to make it easier to map.
It's also better to maintain and extend the names while keeping their
uniqueness.
Andy Shevchenko [Thu, 30 Oct 2025 11:44:17 +0000 (12:44 +0100)]
panic: sys_info: capture si_bits_global before iterating over it
Patch series "panic: sys_info: Refactor and fix a potential issue", v3.
While targeting the compilation issue due to dangling variable, I have
noticed more opportunities for refactoring that helps to avoid above
mentioned compilation issue in a cleaner way and also fixes a potential
problem with global variable access.
This patch (of 6):
The for-loop might re-read the content of the memory the si_bits_global
points to on each iteration. Instead, just capture it for the sake of
consistency and use that instead.
Kairui Song [Tue, 11 Nov 2025 13:36:08 +0000 (21:36 +0800)]
mm, swap: fix potential UAF issue for VMA readahead
Since commit 78524b05f1a3 ("mm, swap: avoid redundant swap device
pinning"), the common helper for allocating and preparing a folio in the
swap cache layer no longer tries to get a swap device reference
internally, because all callers of __read_swap_cache_async are already
holding a swap entry reference. The repeated swap device pinning isn't
needed on the same swap device.
Caller of VMA readahead is also holding a reference to the target entry's
swap device, but VMA readahead walks the page table, so it might encounter
swap entries from other devices, and call __read_swap_cache_async on
another device without holding a reference to it.
So it is possible to cause a UAF when swapoff of device A raced with
swapin on device B, and VMA readahead tries to read swap entries from
device A. It's not easy to trigger, but in theory, it could cause real
issues.
Make VMA readahead try to get the device reference first if the swap
device is a different one from the target entry.
Link: https://lkml.kernel.org/r/20251111-swap-fix-vma-uaf-v1-1-41c660e58562@tencent.com Fixes: 78524b05f1a3 ("mm, swap: avoid redundant swap device pinning") Suggested-by: Huang Ying <ying.huang@linux.alibaba.com> Signed-off-by: Kairui Song <kasong@tencent.com> Acked-by: Chris Li <chrisl@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Barry Song <baohua@kernel.org> Cc: Kemeng Shi <shikemeng@huaweicloud.com> Cc: Nhat Pham <nphamcs@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ankit Khushwaha [Thu, 6 Nov 2025 09:55:32 +0000 (15:25 +0530)]
selftests/user_events: fix type cast for write_index packed member in perf_test
Accessing 'reg.write_index' directly triggers a -Waddress-of-packed-member
warning due to potential unaligned pointer access:
perf_test.c:239:38: warning: taking address of packed member 'write_index'
of class or structure 'user_reg' may result in an unaligned pointer value
[-Waddress-of-packed-member]
239 | ASSERT_NE(-1, write(self->data_fd, ®.write_index,
| ^~~~~~~~~~~~~~~
Since write(2) works with any alignment. Casting '®.write_index'
explicitly to 'void *' to suppress this warning.
Zi Yan [Wed, 5 Nov 2025 16:29:10 +0000 (11:29 -0500)]
mm/huge_memory: fix folio split check for anon folios in swapcache
Both uniform and non uniform split check missed the check to prevent
splitting anon folios in swapcache to non-zero order.
Splitting anon folios in swapcache to non-zero order can cause data
corruption since swapcache only support PMD order and order-0 entries.
This can happen when one use split_huge_pages under debugfs to split
anon folios in swapcache.
In-tree callers do not perform such an illegal operation. Only debugfs
interface could trigger it. I will put adding a test case on my TODO
list.
Fix the check.
Link: https://lkml.kernel.org/r/20251105162910.752266-1-ziy@nvidia.com Fixes: 58729c04cf10 ("mm/huge_memory: add buddy allocator like (non-uniform) folio_split()") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: "David Hildenbrand (Red Hat)" <david@kernel.org> Closes: https://lore.kernel.org/all/dc0ecc2c-4089-484f-917f-920fdca4c898@kernel.org/ Acked-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Barry Song <baohua@kernel.org> Cc: Dev Jain <dev.jain@arm.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Nico Pache <npache@redhat.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
MAINTAINERS: update David Hildenbrand's email address
Switch to kernel.org email address as I will be leaving Red Hat. The old
address will remain active until end of January 2026, so performing the
change now should make sure that most mails will reach me.
If crashkernel is then shrunk to 50MB (echo 52428800 >
/sys/kernel/kexec_crash_size), /proc/iomem still shows 256MB reserved: af000000-beffffff : Crash kernel
This happens because __crash_shrink_memory()/kernel/crash_core.c
incorrectly updates the crashk_res resource object even when
crashk_low_res should be updated.
Fix this by ensuring the correct crashkernel resource object is updated
when shrinking crashkernel memory.
Link: https://lkml.kernel.org/r/20251101193741.289252-1-sourabhjain@linux.ibm.com Fixes: 16c6006af4d4 ("kexec: enable kexec_crash_size to support two crash kernel regions") Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Zhen Lei <thunder.leizhen@huawei.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: fix MAX_FOLIO_ORDER on powerpc configs with hugetlb
In the past, CONFIG_ARCH_HAS_GIGANTIC_PAGE indicated that we support
runtime allocation of gigantic hugetlb folios. In the meantime it evolved
into a generic way for the architecture to state that it supports gigantic
hugetlb folios.
In commit fae7d834c43c ("mm: add __dump_folio()") we started using
CONFIG_ARCH_HAS_GIGANTIC_PAGE to decide MAX_FOLIO_ORDER: whether we could
have folios larger than what the buddy can handle. In the context of that
commit, we started using MAX_FOLIO_ORDER to detect page corruptions when
dumping tail pages of folios. Before that commit, we assumed that we
cannot have folios larger than the highest buddy order, which was
obviously wrong.
In commit 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes
when registering hstate"), we used MAX_FOLIO_ORDER to detect
inconsistencies, and in fact, we found some now.
Powerpc allows for configs that can allocate gigantic folio during boot
(not at runtime), that do not set CONFIG_ARCH_HAS_GIGANTIC_PAGE and can
exceed PUD_ORDER.
To fix it, let's make powerpc select CONFIG_ARCH_HAS_GIGANTIC_PAGE with
hugetlb on powerpc, and increase the maximum folio size with hugetlb to 16
GiB on 64bit (possible on arm64 and powerpc) and 1 GiB on 32 bit
(powerpc). Note that on some powerpc configurations, whether we actually
have gigantic pages depends on the setting of CONFIG_ARCH_FORCE_MAX_ORDER,
but there is nothing really problematic about setting it unconditionally:
we just try to keep the value small so we can better detect problems in
__dump_folio() and inconsistencies around the expected largest folio in
the system.
Ideally, we'd have a better way to obtain the maximum hugetlb folio size
and detect ourselves whether we really end up with gigantic folios. Let's
defer bigger changes and fix the warnings first.
While at it, handle gigantic DAX folios more clearly: DAX can only end up
creating gigantic folios with HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD.
Add a new Kconfig option HAVE_GIGANTIC_FOLIOS to make both cases clearer.
In particular, worry about ARCH_HAS_GIGANTIC_PAGE only with HUGETLB_PAGE.
Note: with enabling CONFIG_ARCH_HAS_GIGANTIC_PAGE on powerpc, we will now
also allow for runtime allocations of folios in some more powerpc configs.
I don't think this is a problem, but if it is we could handle it through
__HAVE_ARCH_GIGANTIC_PAGE_RUNTIME_SUPPORTED.
While __dump_page()/__dump_folio was also problematic (not handling
dumping of tail pages of such gigantic folios correctly), it doesn't seem
critical enough to mark it as a fix.
Link: https://lkml.kernel.org/r/20251114214920.2550676-1-david@kernel.org Fixes: 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when registering hstate") Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Closes: https://lore.kernel.org/r/3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu/ Reported-by: Sourabh Jain <sourabhjain@linux.ibm.com> Closes: https://lore.kernel.org/r/94377f5c-d4f0-4c0f-b0f6-5bf1cd7305b1@linux.ibm.com/ Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Donet Tom <donettom@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Thorsten Blum [Thu, 30 Oct 2025 15:46:43 +0000 (00:46 +0900)]
nilfs2: replace vmalloc + copy_from_user with vmemdup_user
Replace vmalloc() followed by copy_from_user() with vmemdup_user() to
improve nilfs_ioctl_clean_segments() and nilfs_ioctl_set_suinfo(). Use
kvfree() to free the buffers created by vmemdup_user().
Use u64_to_user_ptr() instead of manually casting the pointers and
remove the obsolete 'out_free' label.
Oleg Nesterov [Sun, 26 Oct 2025 14:31:40 +0000 (15:31 +0100)]
release_task: kill unnecessary rcu_read_lock() around dec_rlimit_ucounts()
rcu_read_lock() was added to shut RCU-lockdep up when this code used
__task_cred()->rcu_dereference(), but after the commit 21d1c5e386bc
("Reimplement RLIMIT_NPROC on top of ucounts") it is no longer needed:
task_ucounts()->task_cred_xxx() takes rcu_read_lock() itself.
NOTE: task_ucounts() returns the pointer to another rcu-protected data,
struct ucounts. So it should either be used when task->real_cred and thus
task->real_cred->ucounts is stable (release_task, copy_process,
copy_creds), or it should be called under rcu_read_lock(). In both cases
it is pointless to take rcu_read_lock() to read the cred->ucounts pointer.
xxh32_reset() and xxh32_copy_state() are unused, and with those gone, the
xxh32_state struct is also unused.
xxh64_copy_state() is also unused.
Remove them all.
(Also fixes a comment above the xxh64_state that referred to it as
xxh32_state).
Link: https://lkml.kernel.org/r/20251024205120.454508-1-linux@treblig.org Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Suggested-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Bin [Sat, 25 Oct 2025 08:00:03 +0000 (16:00 +0800)]
dynamic_debug: add support for print stack
In practical problem diagnosis, especially during the boot phase, it is
often desirable to know the call sequence. However, currently, apart from
adding print statements and recompiling the kernel, there seems to be no
good alternative. If dynamic_debug supported printing the call stack, it
would be very helpful for diagnosing issues. This patch add support '+d'
for dump stack.
Dmitry Antipov [Thu, 23 Oct 2025 14:16:50 +0000 (17:16 +0300)]
ocfs2: add inline inode consistency check to ocfs2_validate_inode_block()
In 'ocfs2_validate_inode_block()', add an extra check whether an inode
with inline data (i.e. self-contained) has no clusters, thus preventing
an invalid inode from being passed to 'ocfs2_evict_inode()' and below.
Link: https://lkml.kernel.org/r/20251023141650.417129-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+c16daba279a1161acfb0@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=c16daba279a1161acfb0 Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Cc: Heming Zhao <heming.zhao@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Joseph Qi [Sat, 25 Oct 2025 12:32:18 +0000 (20:32 +0800)]
ocfs2: convert to host endian in ocfs2_validate_inode_block
Convert to host endian when checking OCFS2_VALID_FL to keep consistent
with other checks.
Link: https://lkml.kernel.org/r/20251025123218.3997866-2-joseph.qi@linux.alibaba.com Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Joel Becker <jlbec@evilplan.org> Cc: Jun Piao <piaojun@huawei.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Mark Fasheh <mark@fasheh.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Antipov [Mon, 13 Oct 2025 06:28:26 +0000 (09:28 +0300)]
ocfs2: add boundary check to ocfs2_check_dir_entry()
In 'ocfs2_check_dir_entry()', add extra check whether at least the
smallest possible dirent may be located at the specified offset within
bh's data, thus preventing an out-of-bounds accesses below.
Link: https://lkml.kernel.org/r/20251013062826.122586-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reported-by: syzbot+b20bbf680bb0f2ecedae@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=b20bbf680bb0f2ecedae Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
uaccess: decouple INLINE_COPY_FROM_USER and CONFIG_RUST
Commit 1f9a8286bc0c ("uaccess: always export _copy_[from|to]_user with
CONFIG_RUST") exports _copy_{from,to}_user() unconditionally, if RUST is
enabled. This pollutes exported symbols namespace, and spreads RUST
ifdefery in core files.
It's better to declare a corresponding helper under the rust/helpers,
similarly to how non-underscored copy_{from,to}_user() is handled.
[yury.norov@gmail.com: drop rust part of comment for _copy_from_user(), per Alice] Link: https://lkml.kernel.org/r/20251024154754.99768-1-yury.norov@gmail.com Link: https://lkml.kernel.org/r/20251023171607.1171534-1-yury.norov@gmail.com Signed-off-by: Yury Norov (NVIDIA) <yury.norov@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Miguel Ojeda <ojeda@kernel.org> Reviewed-by: Alice Ryhl <aliceryhl@google.com> Tested-by: Alice Ryhl <aliceryhl@google.com> Cc: Alex Gaynor <alex.gaynor@gmail.com> Cc: Andreas Hindborg <a.hindborg@kernel.org> Cc: Björn Roy Baron <bjorn3_gh@protonmail.com> Cc: Boqun Feng <boqun.feng@gmail.com> Cc: Danilo Krummrich <dakr@kernel.org> Cc: Gary Guo <gary@garyguo.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Trevor Gross <tmgross@umich.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Douglas Anderson [Thu, 23 Oct 2025 18:33:05 +0000 (11:33 -0700)]
init/main.c: wrap long kernel cmdline when printing to logs
The kernel cmdline length is allowed to be longer than what printk can
handle. When this happens the cmdline that's printed to the kernel ring
buffer at bootup is cutoff and some kernel cmdline options are "hidden"
from the logs. This undercuts the usefulness of the log message.
Specifically, grepping for COMMAND_LINE_SIZE shows that 2048 is common and
some architectures even define it as 4096. s390 allows a CONFIG-based
maximum up to 1MB (though it's not expected that anyone will go over the
default max of 4096 [1]).
The maximum message pr_notice() seems to be able to handle (based on
experiment) is 1021 characters. This appears to be based on the current
value of PRINTKRB_RECORD_MAX as 1024 and the fact that pr_notice() spends
2 characters on the loglevel prefix and we have a '\n' at the end.
While it would be possible to increase the limits of printk() (and
therefore pr_notice()) somewhat, it doesn't appear possible to increase it
enough to fully include a 2048-character cmdline without breaking
userspace. Specifically on at least two tested userspaces (ChromeOS plus
the Debian-based distro I'm typing this message on) the `dmesg` tool reads
lines from `/dev/kmsg` in 2047-byte chunks. As per
`Documentation/ABI/testing/dev-kmsg`:
Every read() from the opened device node receives one record
of the kernel's printk buffer.
...
Messages in the record ring buffer get overwritten as whole,
there are never partial messages received by read().
We simply can't fit a 2048-byte cmdline plus the "Kernel command line:"
prefix plus info about time/log_level/etc in a 2047-byte read.
The above means that if we want to avoid the truncation we need to do some
type of wrapping of the cmdline when printing.
Add wrapping to the printout of the kernel command line. By default, the
wrapping is set to 1021 characters to avoid breaking anyone, but allow
wrapping to be set lower by a Kconfig knob
"CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN". Any tools that are correctly parsing
the cmdline today (because it is less than 1021 characters) will see no
difference in their behavior. The format of wrapped output is designed to
be matched by anyone using "grep" to search for the cmdline and also to be
easy for tools to handle. Anyone who is sure their tools (if any) handle
the wrapped format can choose a lower wrapping value and have prettier
output.
Setting CONFIG_CMDLINE_LOG_WRAP_IDEAL_LEN to 0 fully disables the wrapping
logic. This means that long command lines will be truncated again, but
this config could be set if command lines are expected to be long and
userspace is known not to handle parsing logs with the wrapping.
Wrapping is based on spaces, ignoring quotes. All lines are prefixed with
"Kernel command line: " and lines that are not the last line have a " \"
suffix added to them. The prefix and suffix count towards the line length
for wrapping purposes. The ideal length will be exceeded if no
appropriate place to wrap is found.
The wrapping function added here is fairly generic and could be made a
library function (somewhat like print_hex_dump()) if it's needed elsewhere
in the kernel. However, having printk() directly incorporate this
wrapping would be unlikely to be a good idea since it would break
printouts into more than one record without any obvious common line prefix
to tie lines together. It would also be extra overhead when, in general,
kernel log message should simply be kept smaller than 1021 bytes. For
some discussion on this topic, see responses to the v1 posting of this
patch [2].
[akpm@linux-foundation.org: make print_kernel_cmdline __init]
[dianders@chromium.org: v4] Link: https://lkml.kernel.org/r/20251027082204.v4.1.I095f1e2c6c27f9f4de0b4841f725f356c643a13f@changeid Link: https://lkml.kernel.org/r/20251023113257.v3.1.I095f1e2c6c27f9f4de0b4841f725f356c643a13f@changeid Link: https://lore.kernel.org/r/20251021131633.26700Dd6-hca@linux.ibm.com Link: https://lore.kernel.org/r/CAD=FV=VNyt1zG_8pS64wgV8VkZWiWJymnZ-XCfkrfaAhhFSKcA@mail.gmail.com Signed-off-by: Douglas Anderson <dianders@chromium.org> Tested-by: Paul E. McKenney <paulmck@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andrew Chant <achant@google.com> Cc: Brian Gerst <brgerst@gmail.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Francesco Valla <francesco@valla.it> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: guoweikang <guoweikang.kernel@gmail.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Jan Hendrik Farr <kernel@jfarr.cc> Cc: Jeff Xu <jeffxu@chromium.org> Cc: Kees Cook <kees@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleinxer <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vlad Kulikov [Tue, 21 Oct 2025 18:13:39 +0000 (21:13 +0300)]
ipc: create_ipc_ns: drop mqueue mount on sysctl setup failure
If setup_mq_sysctls(ns) fails after mq_init_ns(ns) succeeds, the error
path skipped releasing the internal kernel mqueue mount kept in
ns->mq_mnt. That leaves the vfsmount/superblock referenced until final
namespace teardown, i.e. a resource leak on this rare failure edge.
Unwind it by calling mntput(ns->mq_mnt) before dropping user_ns and
freeing the IPC namespace. This mirrors the normal ordering used in
free_ipc_ns().
Link: https://lkml.kernel.org/r/20251021181341.670297-1-vlad_kulikov_c@pm.me Signed-off-by: Vlad Kulikov <vlad_kulikov_c@pm.me> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Ma Wupeng <mawupeng1@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Antipov [Mon, 13 Oct 2025 10:37:09 +0000 (13:37 +0300)]
ocfs2: add directory size check to ocfs2_find_dir_space_id()
Fix a null-pointer-deref which was detected by UBSAN:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 5317 Comm: syz-executor310 Not tainted 6.15.0-syzkaller-12141-gec7714e49479 #0 PREEMPT(full)
In 'ocfs2_find_dir_space_id()', add extra check whether the directory data
block is large enough to hold at least one directory entry, and raise
'ocfs2_error()' if the former is unexpectedly small.
Link: https://lkml.kernel.org/r/20251013103709.146001-1-dmantipov@yandex.ru Reported-by: syzbot+ded9116588a7b73c34bc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ded9116588a7b73c34bc Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Petr Pavlu [Wed, 22 Oct 2025 08:28:04 +0000 (10:28 +0200)]
taint/module: remove unnecessary taint_flag.module field
The TAINT_RANDSTRUCT and TAINT_FWCTL flags are mistakenly set in the
taint_flags table as per-module flags. While this can be trivially
corrected, the issue can be avoided altogether by removing the
taint_flag.module field.
This is possible because, since commit 7fd8329ba502 ("taint/module: Clean
up global and module taint flags handling") in 2016, the handling of
module taint flags has been fully generic. Specifically,
module_flags_taint() can print all flags, and the required output buffer
size is properly defined in terms of TAINT_FLAGS_COUNT. The actual
per-module flags are always those added to module.taints by calls to
add_taint_module().
Link: https://lkml.kernel.org/r/20251022082938.26670-1-petr.pavlu@suse.com Signed-off-by: Petr Pavlu <petr.pavlu@suse.com> Acked-by: Petr Mladek <pmladek@suse.com> Reviewed-by: Randy Dunlap <rdunlap@infradead.org> Cc: Aaron Tomlin <atomlin@atomlin.com> Cc: Luis Chamberalin <mcgrof@kernel.org> Cc: Petr Pavlu <petr.pavlu@suse.com> Cc: Sami Tolvanen <samitolvanen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Randy Dunlap [Wed, 15 Oct 2025 22:16:26 +0000 (15:16 -0700)]
taint: add reminder about updating docs and scripts
Sometimes people update taint-related pieces of the kernel without
updating the supporting documentation or scripts. Add a reminder to do
this.
Link: https://lkml.kernel.org/r/20251015221626.1126156-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Suggested-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Cc: David Gow <davidgow@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
This warning occurs due to a conflict between crashkernel and System RAM
iomem resources.
The generic crashkernel reservation adds the crashkernel memory range to
/proc/iomem during early initialization. Later, all memblock ranges are
added to /proc/iomem as System RAM. If the crashkernel region overlaps
with any memblock range, it causes a conflict while adding those memblock
regions as iomem resources, triggering the above warning. The conflicting
memblock regions are then omitted from /proc/iomem.
For example, if the following crashkernel region is added to /proc/iomem: 20000000-11fffffff : Crash kernel
then the following memblock regions System RAM regions fail to be inserted: 00000000-7fffffff : System RAM 80000000-257fffffff : System RAM
Fix this by not adding the crashkernel memory to /proc/iomem on powerpc.
Introduce an architecture hook to let each architecture decide whether to
export the crashkernel region to /proc/iomem.
For more info checkout commit c40dd2f766440 ("powerpc: Add System RAM
to /proc/iomem") and commit bce074bdbc36 ("powerpc: insert System RAM
resource to prevent crashkernel conflict")
Note: Before switching to the generic crashkernel reservation, powerpc
never exported the crashkernel region to /proc/iomem.
Link: https://lkml.kernel.org/r/20251016142831.144515-1-sourabhjain@linux.ibm.com Fixes: e3185ee438c2 ("powerpc/crash: use generic crashkernel reservation"). Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reported-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Closes: https://lore.kernel.org/all/90937fe0-2e76-4c82-b27e-7b8a7fe3ac69@linux.ibm.com/ Tested-by: Venkat Rao Bagalkote <venkat88@linux.ibm.com> Cc: Baoquan he <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Madhavan Srinivasan <maddy@linux.ibm.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
WangYuli [Tue, 14 Oct 2025 05:07:47 +0000 (13:07 +0800)]
.mailmap: add entry for WangYuli
Map my old, obsolete work email address to my current email address.
My current work email may not be ideal for timely communication, as
it requires a secure network environment for access due to security
policies.
Therefore, associate both my previous and current work email addresses
with an email address provided to me by AOSC Linux community. During
work hours, my commits will likely still be authored using my company
email address.
Link: https://lkml.kernel.org/r/20251014050747.527357-1-wangyuli@aosc.io Signed-off-by: WangYuli <wangyl5933@chinaunicom.cn> Signed-off-by: WangYuli <wangyuli@aosc.io> Cc: Carlos Bilbao <carlos.bilbao@kernel.org> Cc: Jarkko Sakkinen <jarkko@kernel.org> Cc: Shannon Nelson <sln@onemain.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Li RongQing [Wed, 15 Oct 2025 06:36:15 +0000 (14:36 +0800)]
hung_task: panic when there are more than N hung tasks at the same time
The hung_task_panic sysctl is currently a blunt instrument: it's all or
nothing.
Panicking on a single hung task can be an overreaction to a transient
glitch. A more reliable indicator of a systemic problem is when
multiple tasks hang simultaneously.
Extend hung_task_panic to accept an integer threshold, allowing the
kernel to panic only when N hung tasks are detected in a single scan.
This provides finer control to distinguish between isolated incidents
and system-wide failures.
The accepted values are:
- 0: Don't panic (unchanged)
- 1: Panic on the first hung task (unchanged)
- N > 1: Panic after N hung tasks are detected in a single scan
The original behavior is preserved for values 0 and 1, maintaining full
backward compatibility.
[lance.yang@linux.dev: new changelog] Link: https://lkml.kernel.org/r/20251015063615.2632-1-lirongqing@baidu.com Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Reviewed-by: Lance Yang <lance.yang@linux.dev> Tested-by: Lance Yang <lance.yang@linux.dev> Acked-by: Andrew Jeffery <andrew@codeconstruct.com.au> [aspeed_g5_defconfig] Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Hildenbrand <david@redhat.com> Cc: Florian Wesphal <fw@strlen.de> Cc: Jakub Kacinski <kuba@kernel.org> Cc: Jason A. Donenfeld <jason@zx2c4.com> Cc: Joel Granados <joel.granados@kernel.org> Cc: Joel Stanley <joel@jms.id.au> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kees Cook <kees@kernel.org> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Phil Auld <pauld@redhat.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Shuah Khan <shuah@kernel.org> Cc: Simon Horman <horms@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Antipov [Tue, 7 Oct 2025 09:46:26 +0000 (12:46 +0300)]
ocfs2: add extra consistency check to ocfs2_dx_dir_lookup_rec()
In 'ocfs2_dx_dir_lookup_rec()', check whether an extent list length of the
directory indexing block matches the one configured via the superblock
parameters established at mount, thus preventing an out-of-bounds accesses
while iterating over the extent records below.
Link: https://lkml.kernel.org/r/20251007094626.196143-1-dmantipov@yandex.ru Reported-by: syzbot+30b53487d00b4f7f0922@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=30b53487d00b4f7f0922 Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com>> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Dmitry Antipov [Tue, 7 Oct 2025 12:35:26 +0000 (15:35 +0300)]
ocfs2: annotate flexible array members with __counted_by_le()
Annotate flexible array members of 'struct ocfs2_extent_list',
'struct ocfs2_chain_list', 'struct ocfs2_truncate_log',
'struct ocfs2_dx_entry_list', 'ocfs2_refcount_list' and
'struct ocfs2_xattr_header' with '__counted_by_le()'
attribute to improve array bounds checking when
CONFIG_UBSAN_BOUNDS is enabled.
[dmantipov@yandex.ru: fix __counted_by_le() usage in ocfs2_expand_inline_dx_root()] Link: https://lkml.kernel.org/r/20251014070324.130313-1-dmantipov@yandex.ru Link: https://lkml.kernel.org/r/20251007123526.213150-1-dmantipov@yandex.ru Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com> Reviewed-by: Heming Zhao <heming.zhao@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <gechangwei@live.cn> Cc: Jun Piao <piaojun@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lukas Bulwahn [Fri, 10 Oct 2025 08:21:38 +0000 (10:21 +0200)]
treewide: drop outdated compiler version remarks in Kconfig help texts
As of writing, Documentation/Changes states the minimal versions of GNU C
being 8.1, Clang being 15.0.0 and binutils being 2.30. A few Kconfig help
texts are pointing out that specific GCC and Clang versions are needed,
but by now, those pointers to versions, such later than 4.0, later than
4.4, or clang later than 5.0, are obsolete and unlikely to be found by
users configuring their kernel builds anyway.
Drop these outdated remarks in Kconfig help texts referring to older
compiler and binutils versions. No functional change.
Link: https://lkml.kernel.org/r/20251010082138.185752-1-lukas.bulwahn@redhat.com Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Cc: Bill Wendling <morbo@google.com> Cc: Justin Stitt <justinstitt@google.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Russel King <linux@armlinux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Zhichi Lin [Sat, 11 Oct 2025 08:22:22 +0000 (16:22 +0800)]
scs: fix a wrong parameter in __scs_magic
__scs_magic() needs a 'void *' variable, but a 'struct task_struct *' is
given. 'task_scs(tsk)' is the starting address of the task's shadow call
stack, and '__scs_magic(task_scs(tsk))' is the end address of the task's
shadow call stack. Here should be '__scs_magic(task_scs(tsk))'.
The user-visible effect of this bug is that when CONFIG_DEBUG_STACK_USAGE
is enabled, the shadow call stack usage checking function
(scs_check_usage) would scan an incorrect memory range. This could lead
to:
1. **Inaccurate stack usage reporting**: The function would calculate
wrong usage statistics for the shadow call stack, potentially showing
incorrect value in kmsg.
2. **Potential kernel crash**: If the value of __scs_magic(tsk)is
greater than that of __scs_magic(task_scs(tsk)), the for loop may
access unmapped memory, potentially causing a kernel panic. However,
this scenario is unlikely because task_struct is allocated via the slab
allocator (which typically returns lower addresses), while the shadow
call stack returned by task_scs(tsk) is allocated via vmalloc(which
typically returns higher addresses).
However, since this is purely a debugging feature
(CONFIG_DEBUG_STACK_USAGE), normal production systems should be not
unaffected. The bug only impacts developers and testers who are actively
debugging stack usage with this configuration enabled.
Link: https://lkml.kernel.org/r/20251011082222.12965-1-zhichi.lin@vivo.com Fixes: 5bbaf9d1fcb9 ("scs: Add support for stack usage debugging") Signed-off-by: Jiyuan Xie <xiejiyuan@vivo.com> Signed-off-by: Zhichi Lin <zhichi.lin@vivo.com> Reviewed-by: Sami Tolvanen <samitolvanen@google.com> Acked-by: Will Deacon <will@kernel.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Kees Cook <keescook@chromium.org> Cc: Marco Elver <elver@google.com> Cc: Will Deacon <will@kernel.org> Cc: Yee Lee <yee.lee@mediatek.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
kexec_core: remove superfluous page offset handling in segment loading
During kexec_segment loading, when copying the content of the segment
(i.e. kexec_segment::kbuf or kexec_segment::buf) to its associated pages,
kimage_load_{cma,normal,crash}_segment handle the case where the physical
address of the segment is not page aligned, e.g. in
kimage_load_normal_segment:
(similar logic is present in kimage_load_{cma,crash}_segment).
This is actually not needed because, prior to their loading, all
kexec_segments first go through a vetting step in
`sanity_check_segment_list`, which rejects any segment that is not
page-aligned:
for (i = 0; i < nr_segments; i++) {
unsigned long mstart, mend;
mstart = image->segment[i].mem;
mend = mstart + image->segment[i].memsz;
// ...
if ((mstart & ~PAGE_MASK) || (mend & ~PAGE_MASK))
return -EADDRNOTAVAIL;
// ...
}
In case `sanity_check_segment_list` finds a non-page aligned the whole
kexec load is aborted and no segment is loaded.
This means that `kimage_load_{cma,normal,crash}_segment` never actually
have to handle non page-aligned segments and `(maddr & ~PAGE_MASK) == 0`
is always true no matter if the segment is coming from a file (i.e.
`kexec_file_load` syscall), from a user-space buffer (i.e. `kexec_load`
syscall) or created by the kernel through `kexec_add_buffer`. In the
latter case, `kexec_add_buffer` actually enforces the page alignment:
[jbouron@amazon.com: v3] Link: https://lkml.kernel.org/r/20251024155009.39502-1-jbouron@amazon.com Link: https://lkml.kernel.org/r/20250929160220.47616-1-jbouron@amazon.com Signed-off-by: Justinien Bouron <jbouron@amazon.com> Reviewed-by: Gunnar Kudrjavets <gunnarku@amazon.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Alexander Graf <graf@amazon.com> Cc: Marcos Paulo de Souza <mpdesouza@suse.com> Cc: Mario Limonciello <mario.limonciello@amd.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>