wangzijie [Sat, 7 Jun 2025 02:13:53 +0000 (10:13 +0800)]
proc: use the same treatment to check proc_lseek as ones for proc_read_iter et.al
Check pde->proc_ops->proc_lseek directly may cause UAF in rmmod scenario.
It's a gap in proc_reg_open() after commit 654b33ada4ab("proc: fix UAF in
proc_get_inode()"). Followed by AI Viro's suggestion, fix it in same
manner.
Link: https://lkml.kernel.org/r/20250607021353.1127963-1-wangzijie1@honor.com Fixes: 3f61631d47f1 ("take care to handle NULL ->proc_lseek()") Signed-off-by: wangzijie <wangzijie1@honor.com> Reviewed-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com> Cc: Kirill A. Shuemov <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Sat, 7 Jun 2025 22:01:50 +0000 (10:01 +1200)]
mm: use per_vma lock for MADV_DONTNEED
Certain madvise operations, especially MADV_DONTNEED, occur far more
frequently than other madvise options, particularly in native and Java
heaps for dynamic memory management.
Currently, the mmap_lock is always held during these operations, even when
unnecessary. This causes lock contention and can lead to severe priority
inversion, where low-priority threads—such as Android's
HeapTaskDaemon— hold the lock and block higher-priority threads.
This patch enables the use of per-VMA locks when the advised range lies
entirely within a single VMA, avoiding the need for full VMA traversal.
In practice, userspace heaps rarely issue MADV_DONTNEED across multiple
VMAs.
Tangquan's testing shows that over 99.5% of memory reclaimed by Android
benefits from this per-VMA lock optimization. After extended runtime,
217,735 madvise calls from HeapTaskDaemon used the per-VMA path, while
only 1,231 fell back to mmap_lock.
To simplify handling, the implementation falls back to the standard
mmap_lock if userfaultfd is enabled on the VMA, avoiding the complexity of
userfaultfd_remove().
Many thanks to Lorenzo's work[1] on "mm/madvise: support VMA read locks
for MADV_DONTNEED[_LOCKED]"
Then use this mechanism to permit VMA locking to be done later in the
madvise() logic and also to allow altering of the locking mode to permit
falling back to an mmap read lock if required."
One important point, as pointed out by Jann[2], is that
untagged_addr_remote() requires holding mmap_lock. This is because
address tagging on x86 and RISC-V is quite complex.
Until untagged_addr_remote() becomes atomic—which seems unlikely in the
near future—we cannot support per-VMA locks for remote processes. So
for now, only local processes are supported.
Lance said:
: Just to put some numbers on it, I ran a micro-benchmark with 100
: parallel threads, where each thread calls madvise() on its own 1GiB
: chunk of 64KiB mTHP-backed memory. The performance gain is huge:
:
: 1) MADV_DONTNEED saw its average time drop from 0.0508s to 0.0270s
: (~47% faster)
:
: 2) MADV_FREE saw its average time drop from 0.3078s to 0.1095s (~64%
: faster)
Tal Zussman [Fri, 20 Jun 2025 01:24:26 +0000 (21:24 -0400)]
userfaultfd: remove UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET
UFFD_CLOEXEC, UFFD_NONBLOCK, and UFFD_FLAGS_SET have been unused since
they were added in commit 932b18e0aec6 ("userfaultfd:
linux/userfaultfd_k.h"). Remove them and the associated BUILD_BUG_ON()
checks.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-4-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Peter Xu <peterx@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Fri, 20 Jun 2025 01:24:25 +0000 (21:24 -0400)]
userfaultfd: remove (VM_)BUG_ON()s
BUG_ON() is deprecated [1]. Convert all the BUG_ON()s and VM_BUG_ON()s to
use VM_WARN_ON_ONCE().
There are a few additional cases that are converted or modified:
- Convert the printk(KERN_WARNING ...) in handle_userfault() to use
pr_warn().
- Convert the WARN_ON_ONCE()s in move_pages() to use VM_WARN_ON_ONCE(),
as the relevant conditions are already checked in validate_range() in
move_pages()'s caller.
- Convert the VM_WARN_ON()'s in move_pages() to VM_WARN_ON_ONCE(). These
cases should never happen and are similar to those in mfill_atomic()
and mfill_atomic_hugetlb(), which were previously BUG_ON()s.
move_pages() was added later than those functions and makes use of
VM_WARN_ON() as a replacement for the deprecated BUG_ON(), but.
VM_WARN_ON_ONCE() is likely a better direct replacement.
- Convert the WARN_ON() for !VM_MAYWRITE in userfaultfd_unregister() and
userfaultfd_register_range() to VM_WARN_ON_ONCE(). This condition is
enforced in userfaultfd_register() so it should never happen, and can
be converted to a debug check.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-3-a7274d3bd5e4@columbia.edu Signed-off-by: Tal Zussman <tz2294@columbia.edu> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Fri, 20 Jun 2025 01:24:24 +0000 (21:24 -0400)]
userfaultfd: prevent unregistering VMAs through a different userfaultfd
Currently, a VMA registered with a uffd can be unregistered through a
different uffd associated with the same mm_struct.
The existing behavior is slightly broken and may incorrectly reject
unregistering some VMAs due to the following check:
if (!vma_can_userfault(cur, cur->vm_flags, wp_async))
goto out_unlock;
where wp_async is derived from ctx, not from cur. For example, a
file-backed VMA registered with wp_async enabled and UFFD_WP mode cannot
be unregistered through a uffd that does not have wp_async enabled.
Rather than fix this and maintain this odd behavior, make unregistration
stricter by requiring VMAs to be unregistered through the same uffd they
were registered with. Additionally, reorder the BUG() checks to avoid the
aforementioned wp_async issue in them. Convert the existing check to
VM_WARN_ON_ONCE() as BUG_ON() is deprecated.
This change slightly modifies the ABI. It should not be backported to
-stable. It is expected that no one depends on this behavior, and no such
cases are known.
While at it, correct the comment for the no userfaultfd case. This seems
to be a copy-paste artifact from the analogous userfaultfd_register()
check.
Link: https://lkml.kernel.org/r/20250619-uffd-fixes-v3-2-a7274d3bd5e4@columbia.edu Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization") Signed-off-by: Tal Zussman <tz2294@columbia.edu> Acked-by: David Hildenbrand <david@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jason A. Donenfeld <Jason@zx2c4.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tal Zussman [Fri, 20 Jun 2025 01:24:23 +0000 (21:24 -0400)]
userfaultfd: correctly prevent registering VM_DROPPABLE regions
Patch series "mm: userfaultfd: assorted fixes and cleanups", v3.
Two fixes and two cleanups for userfaultfd.
Note that the third patch yields a small change in the ABI, but we seem to
have concluded that it is acceptable in this case.
This patch (of 4):
vma_can_userfault() masks off non-userfaultfd VM flags from vm_flags. The
vm_flags & VM_DROPPABLE test will then always be false, incorrectly
allowing VM_DROPPABLE regions to be registered with userfaultfd.
Additionally, vm_flags is not guaranteed to correspond to the actual VMA's
flags. Fix this test by checking the VMA's flags directly.
Donet Tom [Wed, 28 May 2025 17:18:04 +0000 (12:18 -0500)]
drivers/base/node: rename __register_one_node() to register_one_node()
The register_one_node() function was a simple wrapper around
__register_one_node(). To simplify the code, register_one_node() has been
removed, and __register_one_node() has been renamed to
register_one_node().
Donet Tom [Wed, 28 May 2025 17:18:03 +0000 (12:18 -0500)]
drivers/base/node: rename register_memory_blocks_under_node() and remove context argument
The function register_memory_blocks_under_node() is now only called from
the memory hotplug path, as register_memory_blocks_under_node_early()
handles registration during early boot. Therefore, the context argument
used to differentiate between early boot and hotplug is no longer needed
and was removed.
Since the function is only called from the hotplug path, we renamed
register_memory_blocks_under_node() to
register_memory_blocks_under_node_hotplug()
Donet Tom [Wed, 28 May 2025 17:18:02 +0000 (12:18 -0500)]
drivers/base/node: remove register_memory_blocks_under_node() function call from register_one_node
register_one_node() is now only called via cpu_up() →
__try_online_node() during CPU hotplug operations to online a node.
At this stage, the node has not yet had any memory added. As a result,
there are no memory blocks to walk or register, so calling
register_memory_blocks_under_node() is unnecessary.
Therefore, the call to register_memory_blocks_under_node() has been
removed from register_one_node().
The function register_mem_block_under_node_early() is no longer used, as
register_memory_blocks_under_node_early() now handles memory block
registration during early boot.
Removed register_mem_block_under_node_early() and get_nid_for_pfn(), the
latter was only used by the former.
Donet Tom [Wed, 28 May 2025 17:18:00 +0000 (12:18 -0500)]
drivers/base/node: optimize memory block registration to reduce boot time
Patch series "drivers/base/node.c: optimization and cleanups", v7.
This patch (of 7)
During node device initialization, `memory blocks` are registered under
each NUMA node. The `memory blocks` to be registered are identified using
the node's start and end PFNs, which are obtained from the node's pg_data
However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes. Additionally, due to the
discontiguous nature of physical memory, certain sections within a `memory
block` may be absent.
As a result, `memory blocks` that fall between a node's start and end PFNs
may span across multiple nodes, and some sections within those blocks may
be missing. `Memory blocks` have a fixed size, which is architecture
dependent.
Due to these considerations, the memory block registration is currently
performed as follows:
Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node. For each `memory
block` in this range, we inspect all PFNs it contains and check their
associated NUMA node ID. If a PFN within the block matches the current
node, the memory block is registered under that node.
If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN. If it is not enabled, the node ID is retrieved directly
from the struct page.
On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match
is found. When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
additional overhead of the binary search increases the execution time
significantly, potentially leading to soft lockups during boot.
In this patch, we iterate over `memblock region` to identify the `memory
blocks` that belong to the current NUMA node. `memblock regions` are
contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.
for_each_memory_region(r): // r => region
if (!node_online(r->nid)):
continue;
else
for_each_memory_block_between(r->base, r->base + r->size - 1):
do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);
We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on
the region's start and end PFNs. We then register all the memory blocks
within that range under the region node.
Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.
Without this patch
------------------
Startup finished in 1min 16.528s (kernel)
With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement
2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.
Without this patch
------------------
Startup finished in 28.320s (kernel)
With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement
Chi Zhiling [Thu, 5 Jun 2025 05:49:35 +0000 (13:49 +0800)]
readahead: fix return value of page_cache_next_miss() when no hole is found
max_scan in page_cache_next_miss always decreases to zero when no hole is
found, causing the return value to be index + 0.
Fix this by preserving the max_scan value throughout the loop.
Jan said "From what I know and have seen in the past, wrong responses
from page_cache_next_miss() can lead to readahead window reduction and
thus reduced read speeds."
Link: https://lkml.kernel.org/r/20250605054935.2323451-1-chizhiling@163.com Fixes: 901a269ff3d5 ("filemap: fix page_cache_next_miss() when no hole found") Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Richard Chang [Thu, 5 Jun 2025 07:25:32 +0000 (07:25 +0000)]
mm/cma: pair the trace_cma_alloc_start/finish
In the bad input validation cases, there is no trace_cma_alloc_finish to
match the trace_cma_alloc_start. Move the trace_cma_alloc_start event
after the validations.
Link: https://lkml.kernel.org/r/20250605072532.972081-1-richardycc@google.com Signed-off-by: Richard Chang <richardycc@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Martin Liu <liumartin@google.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Barry Song [Thu, 5 Jun 2025 08:31:44 +0000 (20:31 +1200)]
mm: madvise: use walk_page_range_vma() instead of walk_page_range()
We've already found the VMA within madvise_walk_vmas() before calling
specific madvise behavior functions like madvise_free_single_vma(). So
calling walk_page_range() and doing find_vma() again seems unnecessary.
It also prevents potential optimizations in those madvise callbacks,
particularly the use of dedicated per-VMA locking.
[v-songbaohua@oppo.com: revert the walk_page_range_vma change for MADV_GUARD_INSTALL] Link: https://lkml.kernel.org/r/20250609105513.10901-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20250605083144.43046-1-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Reviewed-by: Dev Jain <dev.jain@arm.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Tested-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: remove the for_reclaim field from struct writeback_control
This field is now only set to one in the i915 gem code that only calls
writeback_iter on it, which ignores the flag. All other checks are thuse
dead code and the field can be removed.
Link: https://lkml.kernel.org/r/20250610054959.2057526-7-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: stop passing a writeback_control structure to __swap_writepage
__swap_writepage only needs the swap_iocb cookie from the
writeback_control structure, so pass it explicitly and remove the now
unused swap_iocb member from struct writeback_control.
Link: https://lkml.kernel.org/r/20250610054959.2057526-5-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Nhat Pham <nphamcs@gmail.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
mm: stop passing a writeback_control structure to shmem_writeout
shmem_writeout only needs the swap_iocb cookie and the split folio list.
Pass those explicitly and remove the now unused list member from struct
writeback_control.
Link: https://lkml.kernel.org/r/20250610054959.2057526-3-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Patch series "stop passing a writeback_control to swap/shmem writeout",
v3.
This series was intended to remove the last remaining users of
AOP_WRITEPAGE_ACTIVATE after my other pending patches removed the rest,
but spectacularly failed at that.
But instead it nicely improves the code, and removes two pointers from
struct writeback_control.
This patch (of 6):
Move the code to write back swap / shmem folios into a self-contained
helper to keep prepare for refactoring it.
Link: https://lkml.kernel.org/r/20250610054959.2057526-1-hch@lst.de Link: https://lkml.kernel.org/r/20250610054959.2057526-2-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Baolin Wang <baolin.wang@linux.aibaba.com> Cc: Chengming Zhou <chengming.zhou@linux.dev> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Nhat Pham <nphamcs@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Kairui Song [Mon, 26 May 2025 18:06:38 +0000 (02:06 +0800)]
mm, list_lru: refactor the locking code
Cocci is confused by the try lock then release RCU and return logic here.
So separate the try lock part out into a standalone helper. The code is
easier to follow too.
No feature change, fixes:
cocci warnings: (new ones prefixed by >>)
>> mm/list_lru.c:82:3-9: preceding lock on line 77
>> mm/list_lru.c:82:3-9: preceding lock on line 77
mm/list_lru.c:82:3-9: preceding lock on line 75
mm/list_lru.c:82:3-9: preceding lock on line 75
Link: https://lkml.kernel.org/r/20250526180638.14609-1-ryncsn@gmail.com Signed-off-by: Kairui Song <kasong@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Julia Lawall <julia.lawall@inria.fr> Closes: https://lore.kernel.org/r/202505252043.pbT1tBHJ-lkp@intel.com/ Reviewed-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kairui Song <kasong@tencent.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Especially once we hit one of the assertions in
sanity_check_pinned_pages(), observing follow-up assertions failing in
other code can give good clues about what went wrong, so use
VM_WARN_ON_ONCE instead.
While at it, let's just convert all VM_BUG_ON to VM_WARN_ON_ONCE as well.
Add one comment for the pfn_valid() check.
We have to introduce VM_WARN_ON_ONCE_VMA() to make that fly.
Drop the BUG_ON after mmap_read_lock_killable(), if that ever returns
something > 0 we're in bigger trouble. Convert the other BUG_ON's into
VM_WARN_ON_ONCE as well, they are in a similar domain "should never
happen", but more reasonable to check for during early testing.
[david@redhat.com: use the _FOLIO variant where possible, per Lorenzo] Link: https://lkml.kernel.org/r/844bd929-a551-48e3-a12e-285cd65ba580@redhat.com Link: https://lkml.kernel.org/r/20250604140544.688711-1-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: SeongJae Park <sj@kernel.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Wed, 4 Jun 2025 18:31:26 +0000 (11:31 -0700)]
mm/damon/stat: calculate and expose idle time percentiles
Knowing how much memory is how cold can be useful for understanding
coldness and utilization efficiency of memory. The raw form of DAMON's
monitoring results has the information. Convert the raw results into the
per-byte idle time distributions and expose it as percentiles metric to
users, as a read-only DAMON_STAT parameter.
In detail, the metrics are calculated as follows. First, DAMON's
per-region access frequency and age information is converted into per-byte
idle time. If access frequency of a region is higher than zero, every
byte of the region has zero idle time. If the access frequency of a
region is zero, every byte of the region has idle time as the age of the
region. Then the logic sorts the per-byte idle times and provides the
value at 0/100, 1/100, ..., 99/100 and 100/100 location of the sorted
array.
The metric can be easily aggregated and compared on large scale production
systems. For example, if an average of 75-th percentile idle time of
machines that collected on similar time is two minutes, it means the
system's 25 percent memory is not accessed at all for two minutes or more
on average. If a workload considers two minutes as unit work time, we can
conclude its working set size is only 75 percent of the memory. If the
system utilizes proactive reclamation and it supports coldness-based
thresholds like DAMON_RECLAIM, the idle time percentiles can be used to
find a more safe or aggressive coldness threshold for aimed memory saving.
SeongJae Park [Wed, 4 Jun 2025 18:31:25 +0000 (11:31 -0700)]
mm/damon/stat: calculate and expose estimated memory bandwidth
The raw form of DAMON's monitoring results captures many details of the
information. However, not every bit of the information is always required
for understanding practical access patterns. Especially on real world
production systems of high scale time and size, the raw form is difficult
to be aggregated and compared.
Convert the raw monitoring results into a single number metric, namely
estimated memory bandwidth and expose it to users as a read-only
DAMON_STAT parameter. The metric represents access intensiveness
(hotness) of the system. It can easily be aggregated and compared for
high level understanding of the access pattern on large systems.
SeongJae Park [Wed, 4 Jun 2025 18:31:24 +0000 (11:31 -0700)]
mm/damon: introduce DAMON_STAT module
Patch series "mm/damon: introduce DAMON_STAT for simple and practical
access monitoring", v2.
DAMON-based access monitoring is not simple due to required DAMON control
and results visualizations. Introduce a static kernel module for making
it simple. The module can be enabled without manual setup and provides
access pattern metrics that easy to fetch and understand the practical
access pattern information, namely estimated memory bandwidth and memory
idle time percentiles.
Background and Problems
=======================
DAMON can be used for monitoring data access patterns of the system and
workloads. Specifically, users can start DAMON to monitor access events
on specific address space with fine controls including address ranges to
monitor and time intervals between samplings and aggregations. The
resulting access information snapshot contains access frequency
(nr_accesses) and how long the frequency was kept (age) for each byte.
The monitoring usage is not simple and practical enough for production
usage. Users should first start DAMON with a number of parameters, and
wait until DAMON's monitoring results capture a reasonable amount of the
time data (age). In production, such manual start and wait is impractical
to capture useful information from a high number of machines in a timely
manner.
The monitoring result is also too detailed to be used on production
environments. The raw results are hard to be aggregated and/or compared
for production environments having a large scale of time, space and
machines fleet.
Users have to implement and use their own automation of DAMON control and
results processing. It is repetitive and challenging since there is no
good reference or guideline for such automation.
Solution: DAMON_STAT
====================
Implement such automation in kernel space as a static kernel module,
namely DAMON_STAT. It can be enabled at build, boot, or run time via its
build configuration or module parameter. It monitors the entire physical
address space with monitoring intervals that auto-tuned for a reasonable
amount of access observations and minimum overhead. It converts the raw
monitoring results into simpler metrics that can easily be aggregated and
compared, namely estimated memory bandwidth and idle time percentiles.
Understanding of the metrics and the user interface of DAMON_STAT is
essential. Refer to the commit messages of the second and the third
patches of this patch series for more details about the metrics. For the
user interface, the standard module parameters system is used. Refer to
the fourth patch of this patch series for details of the user interface.
Discussions
===========
The module aims to be useful on production environments constructed with a
large number of machines that run a long time. The auto-tuned monitoring
intervals ensure a reasonable quality of the outputs. The auto-tuning
also ensures its overhead be reasonable and low enough to be enabled
always on the production. The simplified monitoring results metrics can
be useful for showing both coldness (idle time percentiles) and hotness
(memory bandwidth) of the system's access pattern. We expect the
information can be useful for assessing system memory utilization and
inspiring optimizations or investigations on both kernel and user space
memory management logics for large scale fleets.
We hence expect the module is good enough to be just used in most
environments. For special cases that require a custom access monitoring
automation, users will still benefit by using DAMON_STAT as a reference or
a guideline for their specialized automation.
This patch (of 4):
To use DAMON for monitoring access patterns of the system, users should
manually start DAMON via DAMON sysfs ABI with a number of parameters for
specifying the monitoring target address space, address ranges, and
monitoring intervals. After that, users should also wait until desired
amount of time data is captured into DAMON's monitoring results. It is
bothersome and take a long time to be practical for access monitoring on
large fleet level production environments.
For access-aware system operations use cases like proactive cold memory
reclamation, similar problems existed. We we solved those by introducing
dedicated static kernel modules such as DAMON_RECLAIM.
Implement such static kernel module for access monitoring, namely
DAMON_STAT. It monitors the entire physical address space with auto-tuned
monitoring intervals. The auto-tuning is set to capture 4 % of observable
access events in each snapshot while keeping the sampling intervals 5
milliseconds in minimum and 10 seconds in maximum. From a few production
environments, we confirmed this setup provides high quality monitoring
results with minimum overheads. The module therefore receives only one
user input, whether to enable or disable it. It can be set on build or
boot time via build configuration or kernel boot command line. It can
also be overridden at runtime.
Note that this commit only implements the DAMON control part of the
module. Users could get the monitoring results via damon:damon_aggregated
tracepoint, but that's of course not the recommended way. Following
commits will implement convenient and optimized ways for serving the
monitoring results to users.
The vma_mas_szero and vma_store tracepoints are unused since commit fbcc3104b843 ("mmap: convert __vma_adjust() to use vma iterator"). Remove
them so they are no longer listed as available tracepoints.
Link: https://lkml.kernel.org/r/20250411161746.1043239-1-csander@purestorage.com Signed-off-by: Caleb Sander Mateos <csander@purestorage.com> Reported-by: Eric Mueller <emueller@purestorage.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Jann Horn <jannh@google.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: "Masami Hiramatsu (Google)" <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Paul Menzel [Tue, 3 Jun 2025 06:13:03 +0000 (08:13 +0200)]
mm: Kconfig: use verb *use* in plural form in description
*workloads* is plural requiring the verb *use* in plural form.
Link: https://lkml.kernel.org/r/20250603061303.479551-2-pmenzel@molgen.mpg.de Fixes: e13e7922d034 ("mm: add CONFIG_PAGE_BLOCK_ORDER to select page block order") Signed-off-by: Paul Menzel <pmenzel@molgen.mpg.de> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 29 May 2025 17:15:48 +0000 (18:15 +0100)]
tools/testing/selftests: add VMA merge tests for KSM merge
Add test to assert that we have now allowed merging of VMAs when KSM
merging-by-default has been set by prctl(PR_SET_MEMORY_MERGE, ...).
We simply perform a trivial mapping of adjacent VMAs expecting a merge,
however prior to recent changes implementing this mode earlier than
before, these merges would not have succeeded.
Assert that we have fixed this!
Link: https://lkml.kernel.org/r/6dec7aabf062c6b121cfac992c9c716cefdda00c.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Tested-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Cc: Xu Xin <xu.xin16@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 29 May 2025 17:15:47 +0000 (18:15 +0100)]
mm: prevent KSM from breaking VMA merging for new VMAs
If a user wishes to enable KSM mergeability for an entire process and all
fork/exec'd processes that come after it, they use the prctl()
PR_SET_MEMORY_MERGE operation.
This defaults all newly mapped VMAs to have the VM_MERGEABLE VMA flag set
(in order to indicate they are KSM mergeable), as well as setting this
flag for all existing VMAs and propagating this across fork/exec.
However it also breaks VMA merging for new VMAs, both in the process and
all forked (and fork/exec'd) child processes.
This is because when a new mapping is proposed, the flags specified will
never have VM_MERGEABLE set. However all adjacent VMAs will already have
VM_MERGEABLE set, rendering VMAs unmergeable by default.
To work around this, we try to set the VM_MERGEABLE flag prior to
attempting a merge. In the case of brk() this can always be done.
However on mmap() things are more complicated - while KSM is not supported
for MAP_SHARED file-backed mappings, it is supported for MAP_PRIVATE
file-backed mappings.
These mappings may have deprecated .mmap() callbacks specified which
could, in theory, adjust flags and thus KSM eligibility.
So we check to determine whether this is possible. If not, we set
VM_MERGEABLE prior to the merge attempt on mmap(), otherwise we retain the
previous behaviour.
This fixes VMA merging for all new anonymous mappings, which covers the
majority of real-world cases, so we should see a significant improvement
in VMA mergeability.
For MAP_PRIVATE file-backed mappings, those which implement the
.mmap_prepare() hook and shmem are both known to be safe, so we allow
these, disallowing all other cases.
Also add stubs for newly introduced function invocations to VMA userland
testing.
[lorenzo.stoakes@oracle.com: correctly invoke late KSM check after mmap hook] Link: https://lkml.kernel.org/r/5861f8f6-cf5a-4d82-a062-139fb3f9cddb@lucifer.local Link: https://lkml.kernel.org/r/3ba660af716d87a18ca5b4e635f2101edeb56340.1748537921.git.lorenzo.stoakes@oracle.com Fixes: d7597f59d1d3 ("mm: add new api to enable ksm per process") # please no backport! Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 29 May 2025 17:15:46 +0000 (18:15 +0100)]
mm: ksm: refer to special VMAs via VM_SPECIAL in ksm_compatible()
There's no need to spell out all the special cases, also doing it this way
makes it absolutely clear that we preclude unmergeable VMAs in general,
and puts the other excluded flags in stark and clear contrast.
Link: https://lkml.kernel.org/r/c8be5b055163b164c8824020164076ee3b9389bd.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lorenzo Stoakes [Thu, 29 May 2025 17:15:45 +0000 (18:15 +0100)]
mm: ksm: have KSM VMA checks not require a VMA pointer
Patch series "mm: ksm: prevent KSM from breaking merging of new VMAs", v3.
When KSM-by-default is established using prctl(PR_SET_MEMORY_MERGE), this
defaults all newly mapped VMAs to having VM_MERGEABLE set, and thus makes
them available to KSM for samepage merging. It also sets VM_MERGEABLE in
all existing VMAs.
However this causes an issue upon mapping of new VMAs - the initial flags
will never have VM_MERGEABLE set when attempting a merge with adjacent
VMAs (this is set later in the mmap() logic), and adjacent VMAs will
ALWAYS have VM_MERGEABLE set.
This renders all newly mapped VMAs unmergeable.
To avoid this, this series performs the check for PR_SET_MEMORY_MERGE far
earlier in the mmap() logic, prior to the merge being attempted.
However we run into complexity with the depreciated .mmap() callback - if
a driver hooks this, it might change flags which adjust KSM merge
eligibility.
We have to worry about this because, while KSM is only applicable to
private mappings, this includes both anonymous and MAP_PRIVATE-mapped
file-backed mappings.
This isn't a problem for brk(), where the VMA must be anonymous. However
in mmap() we must be conservative - if the VMA is anonymous then we can
always proceed, however if not, we permit only shmem mappings (whose .mmap
hook does not affect KSM eligibility) and drivers which implement
.mmap_prepare() (invoked prior to the KSM eligibility check).
If we can't be sure of the driver changing things, then we maintain the
same behaviour of performing the KSM check later in the mmap() logic (and
thus losing new VMA mergeability).
A great many use-cases for this logic will use anonymous mappings any
rate, so this change should already cover the majority of actual KSM
use-cases.
This patch (of 4):
In subsequent commits we are going to determine KSM eligibility prior to a
VMA being constructed, at which point we will of course not yet have
access to a VMA pointer.
It is trivial to boil down the check logic to be parameterised on
mm_struct, file and VMA flags, so do so.
As a part of this change, additionally expose and use file_is_dax() to
determine whether a file is being mapped under a DAX inode.
Link: https://lkml.kernel.org/r/cover.1748537921.git.lorenzo.stoakes@oracle.com Link: https://lkml.kernel.org/r/36ad13eb50cdbd8aac6dcfba22c65d5031667295.1748537921.git.lorenzo.stoakes@oracle.com Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xu Xin <xu.xin16@zte.com.cn> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jann Horn <jannh@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Ye Liu [Fri, 30 May 2025 05:58:55 +0000 (13:58 +0800)]
tools/mm: add script to display page state for a given PID and VADDR
Introduces a new drgn script, `show_page_info.py`, which allows users
to analyze the state of a page given a process ID (PID) and a virtual
address (VADDR). This can help kernel developers or debuggers easily
inspect page-related information in a live kernel or vmcore.
The script extracts information such as the page flags, mapping, and
other metadata relevant to diagnosing memory issues.
Output example:
sudo ./show_page_info.py 1 0x7fc988181000
PID: 1 Comm: systemd mm: 0xffff8d22c4089700
RAW: 0017ffffc000416cfffff939062ff708fffff939062ffe08ffff8d23062a12a8
RAW: 0000000000000000ffff8d2323438f600000002500000007ffff8d23203ff500
Page Address: 0xfffff93905664e00
Page Flags: PG_referenced|PG_uptodate|PG_lru|PG_head|PG_active|
PG_private|PG_reported|PG_has_hwpoisoned
Page Size: 4096
Page PFN: 0x159938
Page Physical: 0x159938000
Page Virtual: 0xffff8d2319938000
Page Refcount: 37
Page Mapcount: 7
Page Index: 0x0
Page Memcg Data: 0xffff8d23203ff500
Memcg Name: init.scope
Memcg Path: /sys/fs/cgroup/memory/init.scope
Page Mapping: 0xffff8d23062a12a8
Page Anon/File: File
Page VMA: 0xffff8d22e06e0e40
VMA Start: 0x7fc988181000
VMA End: 0x7fc988185000
This page is part of a compound page.
This page is the head page of a compound page.
Head Page: 0xfffff93905664e00
Compound Order: 2
Number of Pages: 4
Link: https://lkml.kernel.org/r/20250530055855.687067-1-ye.liu@linux.dev Signed-off-by: Ye Liu <liuye@kylinos.cn> Tested-by: SeongJae Park <sj@kernel.org> Reviewed-by: Stephen Brennan <stephen.s.brennan@oracle.com> Cc: Florian Weimer <fweimer@redhat.com> Cc: Omar Sandoval <osandov@osandov.com> Cc: "Paul E . McKenney" <paulmck@kernel.org> Cc: Sweet Tea Dorminy <sweettea-kernel@dorminy.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Koichiro Den [Fri, 30 May 2025 16:23:53 +0000 (01:23 +0900)]
mm: vmscan: apply proportional reclaim pressure for memcg when MGLRU is enabled
The scan implementation for MGLRU was missing proportional reclaim
pressure for memcg, which contradicts the description in
Documentation/admin-guide/cgroup-v2.rst (memory.{low,min} section).
This issue can be observed in kselftest cgroup:test_memcontrol
(specifically test_memcg_min and test_memcg_low). The following table
shows the actual values observed in my local test env (on xfs) and the
error "e", which is the symmetric absolute percentage error from the ideal
values of 29M for c[0] and 21M for c[1].
Factor out the proportioning logic to a new function and have MGLRU reuse
it. While at it, update the eviction behavior via debugfs 'lru_gen'
interface ('-' command with an explicit 'nr_to_reclaim' parameter) to
ensure eviction is limited to the specified number.
Link: https://lkml.kernel.org/r/20250530162353.541882-1-den@valinux.co.jp Signed-off-by: Koichiro Den <koichiro.den@canonical.com> Reviewed-by: Yuanchu Xie <yuanchu@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull /proc/sys dcache lookup fix from Al Viro:
"Fix for the breakage spotted by Neil in the interplay between
/proc/sys ->d_compare() weirdness and parallel lookups"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix proc_sys_compare() handling of in-lookup dentries
Merge tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Fix the calculation of the deadline server task's runtime as this
mishap was preventing realtime tasks from running
- Avoid a race condition during migrate-swapping two tasks
- Fix the string reported for the "none" dynamic preemption option
* tag 'sched_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/deadline: Fix dl_server runtime calculation formula
sched/core: Fix migrate_swap() vs. hotplug
sched: Fix preemption string of preempt_dynamic_none
Merge tag 'objtool_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix from Borislav Petkov:
- Fix the compilation of an x86 kernel on a big engian machine due to a
missed endianness conversion
* tag 'objtool_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Add missing endian conversion to read_annotate()
Merge tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Borislav Petkov:
- Revert uprobes to using CAP_SYS_ADMIN again as currently they can
destructively modify kernel code from an unprivileged process
- Move a warning to where it belongs
* tag 'perf_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Revert to requiring CAP_SYS_ADMIN for uprobes
perf/core: Fix the WARN_ON_ONCE is out of lock protected region
Merge tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
- Make sure AMD SEV guests using secure TSC, include a TSC_FACTOR which
prevents their TSCs from going skewed from the hypervisor's
* tag 'x86_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sev: Use TSC_FACTOR for Secure TSC frequency calculation
Merge tag 'locking_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking fixes from Borislav Petkov:
- Disable FUTEX_PRIVATE_HASH for this cycle due to a performance
regression
- Add a selftests compilation product to the corresponding .gitignore
file
* tag 'locking_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
selftests/futex: Add futex_numa to .gitignore
futex: Temporary disable FUTEX_PRIVATE_HASH
Merge tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS fixes from Borislav Petkov:
- Do not remove the MCE sysfs hierarchy if thresholding sysfs nodes
init fails due to new/unknown banks present, which in itself is not
fatal anyway; add default names for new banks
- Make sure MCE polling settings are honored after CMCI storms
- Make sure MCE threshold limit is reset after the thresholding
interrupt has been serviced
- Clean up properly and disable CMCI banks on shutdown so that a
second/kexec-ed kernel can rediscover those banks again
* tag 'ras_urgent_for_v6.16_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mce: Make sure CMCI banks are cleared during shutdown on Intel
x86/mce/amd: Fix threshold limit reset
x86/mce/amd: Add default names for MCA banks and blocks
x86/mce: Ensure user polling settings are honored when restarting timer
x86/mce: Don't remove sysfs if thresholding sysfs init fails
Merge tag 'hid-for-linus-2025070502' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- Memory corruption fixes in hid-appletb-kbd driver (Qasim Ijaz)
- New device ID in hid-elecom driver (Leonard Dizon)
- Fixed several HID debugfs contants (Vicki Pfau)
* tag 'hid-for-linus-2025070502' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid:
HID: appletb-kbd: fix slab use-after-free bug in appletb_kbd_probe
HID: Fix debug name for BTN_GEAR_DOWN, BTN_GEAR_UP, BTN_WHEEL
HID: elecom: add support for ELECOM HUGE 019B variant
HID: appletb-kbd: fix memory corruption of input_handler_list
Merge tag 'v6.16-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- Two reconnect fixes including one for a reboot/reconnect race
- Fix for incorrect file type that can be returned by SMB3.1.1 POSIX
extensions
- tcon initialization fix
- Fix for resolving Windows symlinks with absolute paths
* tag 'v6.16-rc4-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: fix native SMB symlink traversal
smb: client: fix race condition in negotiate timeout by using more precise timing
cifs: all initializations for tcon should happen in tcon_info_alloc
smb: client: fix warning when reconnecting channel
smb: client: fix readdir returning wrong type with POSIX extensions
Merge tag 'pm-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address system suspend failures under memory pressure in some
configurations, fix up RAPL handling on platforms where PL1 cannot be
disabled, and fix a documentation typo:
- Prevent the Intel RAPL power capping driver from allowing PL1 to be
exceeded by mistake on systems when PL1 cannot be disabled (Zhang
Rui)
- Fix a typo in the ABI documentation (Sumanth Gavini)
- Allow swap to be used a bit longer during system suspend and
hibernation to avoid suspend failures under memory pressure (Mario
Limonciello)"
* tag 'pm-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: sleep: docs: Replace "diasble" with "disable"
powercap: intel_rapl: Do not change CLAMPING bit if ENABLE bit cannot be changed
PM: Restrict swap use to later in the suspend sequence
Merge tag 'soc-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull SoC fixes from Arnd Bergmann:
"A couple of fixes for firmware drivers have come up, addressing kernel
side bugs in op-tee and ff-a code, as well as compatibility issues
with exynos-acpm and ff-a protocols.
The only devicetree fixes are for the Apple platform, addressing
issues with conformance to the bindings for the wlan, spi and mipi
nodes"
* tag 'soc-fixes-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: apple: Move touchbar mipi {address,size}-cells from dtsi to dts
arm64: dts: apple: Drop {address,size}-cells from SPI NOR
arm64: dts: apple: t8103: Fix PCIe BCM4377 nodename
optee: ffa: fix sleep in atomic context
firmware: exynos-acpm: fix timeouts on xfers handling
arm64: defconfig: update renamed PHY_SNPS_EUSB2
firmware: arm_ffa: Fix the missing entry in struct ffa_indirect_msg_hdr
firmware: arm_ffa: Replace mutex with rwlock to avoid sleep in atomic context
firmware: arm_ffa: Move memory allocation outside the mutex locking
firmware: arm_ffa: Fix memory leak by freeing notifier callback node
Merge tag 'riscv-for-linus-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- kCFI is restricted to clang-17 or newer, as earlier versions have
known bugs
- sbi_hsm_hart_start is now staticly allocated, to avoid tripping up
the SBI HSM page mapping on sparse systems.
* tag 'riscv-for-linus-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: cpu_ops_sbi: Use static array for boot_data
riscv: Require clang-17 or newer for kCFI
Merge tag 'regulator-fix-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A few driver fixes (the GPIO one being potentially nasty, though it
has been there for a while without anyone reporting it), and one core
fix for the rarely used combination of coupled regulators and
unbinding"
* tag 'regulator-fix-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: gpio: Fix the out-of-bounds access to drvdata::gpiods
regulator: mp886x: Fix ID table driver_data
regulator: sy8824x: Fix ID table driver_data
regulator: tps65219: Fix devm_kmalloc size allocation
regulator: core: fix NULL dereference on unbind due to stale coupling data
Merge tag 'spi-fix-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"As well as a few driver specific fixes we've got a core change here
which raises the hard coded limit on the number of devices we can
support on one SPI bus since some FPGA based systems are running into
the existing limit. This is not a good solution but it's one suitable
for this point in the release cycle, we should dynamically size the
relevant data structures which I hope will happen in the next couple
of merge windows.
We also pull in a MTD fix for the Qualcomm SNAND driver, the two fixes
cover the same issue and merging them together minimises bisection
issues"
* tag 'spi-fix-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: cadence-quadspi: fix cleanup of rx_chan on failure paths
spi: spi-fsl-dspi: Clear completion counter before initiating transfer
spi: Raise limit on number of chip selects to 24
mtd: nand: qpic_common: prevent out of bounds access of BAM arrays
spi: spi-qpic-snand: reallocate BAM transactions
Merge tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform drivers fixes from Ilpo Järvinen:
"Mostly a few lines fixed here and there except amd/isp4 which improves
swnodes relationships but that is a new driver not in any stable
kernels yet. The think-lmi driver changes also look relatively large
but there are just many fixes to it.
The i2c/piix4 change is a effectively a revert of the commit 7e173eb82ae9 ("i2c: piix4: Make CONFIG_I2C_PIIX4 dependent on
CONFIG_X86") but that required moving the header out from arch/x86
under include/linux/platform_data/
Summary:
- amd/isp4: Improve swnode graph (new driver exception)
- asus-nb-wmi: Use duo keyboard quirk for Zenbook Duo UX8406CA
- dell-lis3lv02d: Add Latitude 5500 accelerometer address
- dell-wmi-sysman: Fix WMI data block retrieval and class dev unreg
- hp-bioscfg: Fix class device unregistration
- i2c: piix4: Re-enable on non-x86 + move FCH header under platform_data/
- intel/hid: Wildcat Lake support
- mellanox:
- mlxbf-pmc: Fix duplicate event ID
- mlxbf-tmfifo: Fix vring_desc.len assignment
- mlxreg-lc: Fix bit-not-set logic check
- nvsw-sn2201: Fix bus number in error message & spelling errors
- portwell-ec: Move watchdog device under correct platform hierarchy
- think-lmi: Error handling fixes (sysfs, kset, kobject, class dev unreg)
* tag 'platform-drivers-x86-v6.16-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (22 commits)
platform/x86: think-lmi: Fix sysfs group cleanup
platform/x86: think-lmi: Fix kobject cleanup
platform/x86: think-lmi: Create ksets consecutively
platform/mellanox: mlxreg-lc: Fix logic error in power state check
i2c: Re-enable piix4 driver on non-x86
Move FCH header to a location accessible by all archs
platform/x86/intel/hid: Add Wildcat Lake support
platform/x86: dell-wmi-sysman: Fix class device unregistration
platform/x86: think-lmi: Fix class device unregistration
platform/x86: hp-bioscfg: Fix class device unregistration
platform/x86: Update swnode graph for amd isp4
platform/x86: dell-wmi-sysman: Fix WMI data block retrieval in sysfs callbacks
platform/x86: wmi: Update documentation of WCxx/WExx ACPI methods
platform/x86: wmi: Fix WMI event enablement
platform/mellanox: nvsw-sn2201: Fix bus number in adapter error message
platform/mellanox: Fix spelling and comment clarity in Mellanox drivers
platform/mellanox: mlxbf-pmc: Fix duplicate event ID for CACHE_DATA1
platform/x86: thinkpad_acpi: handle HKEY 0x1402 event
platform/x86: asus-nb-wmi: add DMI quirk for ASUS Zenbook Duo UX8406CA
platform/x86: dell-lis3lv02d: Add Latitude 5500
...
Merge tag 'usb-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some USB driver fixes for 6.16-rc5. I originally wanted this
to get into -rc4, but there were some regressions that had to be
handled first. Now all looks good. Included in here are the following
fixes:
- cdns3 driver fixes
- xhci driver fixes
- typec driver fixes
- USB hub fixes (this is what took the longest to get right)
- new USB driver quirks added
- chipidea driver fixes
All of these have been in linux-next for a while and now we have no
more reported problems with them"
* tag 'usb-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (21 commits)
usb: hub: Fix flushing of delayed work used for post resume purposes
xhci: dbc: Flush queued requests before stopping dbc
xhci: dbctty: disable ECHO flag by default
xhci: Disable stream for xHC controller with XHCI_BROKEN_STREAMS
usb: xhci: quirk for data loss in ISOC transfers
usb: dwc3: gadget: Fix TRB reclaim logic for short transfers and ZLPs
usb: hub: Fix flushing and scheduling of delayed work that tunes runtime pm
usb: typec: displayport: Fix potential deadlock
usb: typec: altmodes/displayport: do not index invalid pin_assignments
usb: cdnsp: Fix issue with CV Bad Descriptor test
usb: typec: tcpm: apply vbus before data bringup in tcpm_src_attach
Revert "usb: xhci: Implement xhci_handshake_check_state() helper"
usb: xhci: Skip xhci_reset in xhci_resume if xhci is being removed
usb: gadget: u_serial: Fix race condition in TTY wakeup
Revert "usb: gadget: u_serial: Add null pointer check in gs_start_io"
usb: chipidea: udc: disconnect/reconnect from host when do suspend/resume
usb: acpi: fix device link removal
usb: hub: fix detection of high tier USB3 devices behind suspended hubs
Logitech C-270 even more broken
usb: dwc3: Abort suspend on soft disconnect failure
...
Merge tag 'input-for-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input updates from Dmitry Torokhov:
- support for Acer NGR 200 Controller added to xpad driver
- xpad driver will no longer log errors about URBs at sudden disconnect
- a fix for potential NULL dereference in cs40l50-vibra driver
- several drivers have been switched to using scnprintf() to suppress
warnings about potential output truncation
* tag 'input-for-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: cs40l50-vibra - fix potential NULL dereference in cs40l50_upload_owt()
Input: alps - use scnprintf() to suppress truncation warning
Input: iqs7222 - explicitly define number of external channels
Input: xpad - support Acer NGR 200 Controller
Input: xpad - return errors from xpad_try_sending_next_out_packet() up
Input: xpad - adjust error handling for disconnect
Input: apple_z2 - drop default ARCH_APPLE in Kconfig
Input: Fully open-code compatible for grepping
dt-bindings: HID: i2c-hid: elan: Introduce Elan eKTH8D18
Input: psmouse - switch to use scnprintf() to suppress truncation warning
Input: lifebook - switch to use scnprintf() to suppress truncation warning
Input: alps - switch to use scnprintf() to suppress truncation warning
Input: atkbd - switch to use scnprintf() to suppress truncation warning
Input: fsia6b - suppress buffer truncation warning for phys
Input: iqs626a - replace snprintf() with scnprintf()
Merge tag 'drm-fixes-2025-07-04' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes, bit of a bumper crop, the usual amdgpu/xe/i915
suspects, then there is a large scattering of fixes across core and
drivers. I think the simple panel lookup fix is probably the largest,
the sched race fix is also fun, but I don't see anything standing out
too badly.
dma-buf:
- fix timeout handling
gem:
- fix framebuffer object references
sched:
- fix spsc queue job count race
bridge:
- fix aux hpd bridge of node
- panel: move missing flag handling
- samsung-dsim: fix %pK usage to %p
amdkfd:
- mtype fix for ext coherent system memory
- MMU notifier fix
- gfx7/8 fix
xe:
- Fix chunking the PTE updates and overflowing the maximum number of
dwords with with MI_STORE_DATA_IMM
- Move WA BB to the LRC BO to mitigate hangs on context switch
- Fix frequency/flush WAs for BMG
- Fix kconfig prompt title and description
- Do not require kunit
- Extend 14018094691 WA to BMG
- Fix wedging the device on signal
i915:
- Make mei interrupt top half irq disabled to fix RT builds
- Fix timeline left held on VMA alloc error
- Fix NULL pointer deref in vlv_dphy_param_init()
- Fix selftest mock_request() to avoid NULL deref
exynos:
- switch to using %p instead of %pK
- fix vblank NULL ptr race
- fix lockup on samsung peach-pit/pi chromebooks
vesadrm:
- NULL ptr fix
vmwgfx:
- fix encrypted memory allocation bug
v3d:
- fix irq enabled during reset"
* tag 'drm-fixes-2025-07-04' of https://gitlab.freedesktop.org/drm/kernel: (41 commits)
drm/xe: Do not wedge device on killed exec queues
drm/xe: Extend WA 14018094691 to BMG
drm/v3d: Disable interrupts before resetting the GPU
drm/gem: Acquire references on GEM handles for framebuffers
drm/sched: Increment job count before swapping tail spsc queue
drm/xe: Allow dropping kunit dependency as built-in
drm/xe: Fix kconfig prompt
drm/xe/bmg: Update Wa_22019338487
drm/xe/bmg: Update Wa_14022085890
drm/xe: Split xe_device_td_flush()
drm/xe/xe_guc_pc: Lock once to update stashed frequencies
drm/xe/guc_pc: Add _locked variant for min/max freq
drm/xe: Make WA BB part of LRC BO
drm/xe: Fix out-of-bounds field write in MI_STORE_DATA_IMM
drm/i915/gsc: mei interrupt top half should be in irq disabled context
drm/i915/gt: Fix timeline left held on VMA alloc error
drm/vmwgfx: Fix guests running with TDX/SEV
drm/amd/display: Don't allow OLED to go down to fully off
drm/amd/display: Added case for when RR equals panel's max RR using freesync
drm/amdkfd: add hqd_sdma_get_doorbell callbacks for gfx7/8
...
Merge tag 'iommu-fixes-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux
Pull iommu fixes from Joerg Roedel:
- Rockchip: fix infinite loop caused by probing race condition
- Intel VT-d: assign devtlb cache tag on ATS enablement
* tag 'iommu-fixes-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux:
iommu/vt-d: Assign devtlb cache tag on ATS enablement
iommu/rockchip: prevent iommus dead loop when two masters share one IOMMU
Merge tag 'block-6.16-20250704' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- NVMe fixes via Christoph:
- fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
- fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
- refresh visible attrs after being checked (Eugen Hristev)
- fix suspicious RCU usage warning in the multipath code (Geliang Tang)
- correctly account for namespace head reference counter (Nilay Shroff)
- Fix for a regression introduced in ublk in this cycle, where it would
attempt to queue a canceled request.
- brd RCU sleeping fix, also introduced in this cycle. Bare bones fix,
should be improved upon for the next release.
* tag 'block-6.16-20250704' of git://git.kernel.dk/linux:
brd: fix sleeping function called from invalid context in brd_insert_page()
ublk: don't queue request if the associated uring_cmd is canceled
nvme-multipath: fix suspicious RCU usage warning
nvme-pci: refresh visible attrs after being checked
nvmet: fix memory leak of bio integrity
nvme: correctly account for namespace head reference counter
nvme: Fix incorrect cdw15 value in passthru error logging
Wolfram Sang [Fri, 4 Jul 2025 16:31:22 +0000 (18:31 +0200)]
Merge tag 'i2c-host-fixes-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/andi.shyti/linux into i2c/for-current
i2c-host-fixes for v6.16-rc5
designware: initialise msg_write_idx during transfer
microchip: check return value from core xfer call
realtek: add 'reg' property constraint to the device tree
Merge tag 'bcachefs-2025-07-03' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:
"The 'opts.casefold_disabled' patch is non critical, but would be a
6.15 backport; it's to address the casefolding + overlayfs
incompatibility that was discovvered late.
It's late because I was hoping that this would be addressed on the
overlayfs side (and will be in 6.17), but user reports keep coming in
on this one (lots of people are using docker these days)"
* tag 'bcachefs-2025-07-03' of git://evilpiepirate.org/bcachefs:
bcachefs: opts.casefold_disabled
bcachefs: Work around deadlock to btree node rewrites in journal replay
bcachefs: Fix incorrect transaction restart handling
bcachefs: fix btree_trans_peek_prev_journal()
bcachefs: mark invalid_btree_id autofix
Merge tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull vfs fixes from Christian Brauner:
- Fix a regression caused by the anonymous inode rework. Making them
regular files causes various places in the kernel to tip over
starting with io_uring.
Revert to the former status quo and port our assertion to be based on
checking the inode so we don't lose the valuable VFS_*_ON_*()
assertions that have already helped discover weird behavior our
outright bugs.
- Fix the the upper bound calculation in fuse_fill_write_pages()
- Fix priority inversion issues in the eventpoll code
- Make secretmen use anon_inode_make_secure_inode() to avoid bypassing
the LSM layer
- Fix a netfs hang due to missing case in final DIO read result
collection
- Fix a double put of the netfs_io_request struct
- Provide some helpers to abstract out NETFS_RREQ_IN_PROGRESS flag
wrangling
- Fix infinite looping in netfs_wait_for_pause/request()
- Fix a netfs ref leak on an extra subrequest inserted into a request's
list of subreqs
- Fix various cifs RPC callbacks to set NETFS_SREQ_NEED_RETRY if a
subrequest fails retriably
- Fix a cifs warning in the workqueue code when reconnecting a channel
- Fix the updating of i_size in netfs to avoid a race between testing
if we should have extended the file with a DIO write and changing
i_size
- Merge the places in netfs that update i_size on write
- Fix coredump socket selftests
* tag 'vfs-6.16-rc5.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
anon_inode: rework assertions
netfs: Update tracepoints in a number of ways
netfs: Renumber the NETFS_RREQ_* flags to make traces easier to read
netfs: Merge i_size update functions
netfs: Fix i_size updating
smb: client: set missing retry flag in cifs_writev_callback()
smb: client: set missing retry flag in cifs_readv_callback()
smb: client: set missing retry flag in smb2_writev_callback()
netfs: Fix ref leak on inserted extra subreq in write retry
netfs: Fix looping in wait functions
netfs: Provide helpers to perform NETFS_RREQ_IN_PROGRESS flag wangling
netfs: Fix double put of request
netfs: Fix hang due to missing case in final DIO read result collection
eventpoll: Fix priority inversion problem
fuse: fix fuse_fill_write_pages() upper bound calculation
fs: export anon_inode_make_secure_inode() and fix secretmem LSM bypass
selftests/coredump: Fix "socket_detect_userspace_client" test failure
sched/deadline: Fix dl_server runtime calculation formula
In our testing with 6.12 based kernel on a big.LITTLE system, we were
seeing instances of RT tasks being blocked from running on the LITTLE
cpus for multiple seconds of time, apparently by the dl_server. This
far exceeds the default configured 50ms per second runtime.
This is due to the fair dl_server runtime calculation being scaled
for frequency & capacity of the cpu.
Consider the following case under a Big.LITTLE architecture:
Assume the runtime is: 50,000,000 ns, and Frequency/capacity
scale-invariance defined as below:
Frequency scale-invariance: 100
Capacity scale-invariance: 50
First by Frequency scale-invariance,
the runtime is scaled to 50,000,000 * 100 >> 10 = 4,882,812
Then by capacity scale-invariance,
it is further scaled to 4,882,812 * 50 >> 10 = 238,418.
So it will scaled to 238,418 ns.
This smaller "accounted runtime" value is what ends up being
subtracted against the fair-server's runtime for the current period.
Thus after 50ms of real time, we've only accounted ~238us against the
fair servers runtime. This 209:1 ratio in this example means that on
the smaller cpu the fair server is allowed to continue running,
blocking RT tasks, for over 10 seconds before it exhausts its supposed
50ms of runtime. And on other hardware configurations it can be even
worse.
For the fair deadline_server, to prevent realtime tasks from being
unexpectedly delayed, we really do want to use fixed time, and not
scaled time for smaller capacity/frequency cpus. So remove the scaling
from the fair server's accounting to fix this.
Fixes: a110a81c52a9 ("sched/deadline: Deferrable dl server") Suggested-by: Peter Zijlstra <peterz@infradead.org> Suggested-by: John Stultz <jstultz@google.com> Signed-off-by: kuyo chang <kuyo.chang@mediatek.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Juri Lelli <juri.lelli@redhat.com> Acked-by: John Stultz <jstultz@google.com> Tested-by: John Stultz <jstultz@google.com> Link: https://lore.kernel.org/r/20250702021440.2594736-1-kuyo.chang@mediatek.com
Lu Baolu [Sat, 28 Jun 2025 10:03:51 +0000 (18:03 +0800)]
iommu/vt-d: Assign devtlb cache tag on ATS enablement
Commit <4f1492efb495> ("iommu/vt-d: Revert ATS timing change to fix boot
failure") placed the enabling of ATS in the probe_finalize callback. This
occurs after the default domain attachment, which is when the ATS cache
tag is assigned. Consequently, the device TLB cache tag is missed when the
domain is attached, leading to the device TLB not being invalidated in the
iommu_unmap paths.
Fix this by assigning the CACHE_TAG_DEVTLB cache tag when ATS is enabled.
Input: cs40l50-vibra - fix potential NULL dereference in cs40l50_upload_owt()
The cs40l50_upload_owt() function allocates memory via kmalloc()
without checking for allocation failure, which could lead to a
NULL pointer dereference.
Return -ENOMEM in case allocation fails.
Signed-off-by: Yunshui Jiang <jiangyunshui@kylinos.cn> Fixes: c38fe1bb5d21 ("Input: cs40l50 - Add support for the CS40L50 haptic driver") Link: https://lore.kernel.org/r/20250704024010.2353841-1-jiangyunshui@kylinos.cn Cc: stable@vger.kernel.org Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Al Viro [Mon, 30 Jun 2025 06:52:13 +0000 (02:52 -0400)]
fix proc_sys_compare() handling of in-lookup dentries
There's one case where ->d_compare() can be called for an in-lookup
dentry; usually that's nothing special from ->d_compare() point of
view, but... proc_sys_compare() is weird.
The thing is, /proc/sys subdirectories can look differently for
different processes. Up to and including having the same name
resolve to different dentries - all of them hashed.
The way it's done is ->d_compare() refusing to admit a match unless
this dentry is supposed to be visible to this caller. The information
needed to discriminate between them is stored in inode; it is set
during proc_sys_lookup() and until it's done d_splice_alias() we really
can't tell who should that dentry be visible for.
Normally there's no negative dentries in /proc/sys; we can run into
a dying dentry in RCU dcache lookup, but those can be safely rejected.
However, ->d_compare() is also called for in-lookup dentries, before
they get positive - or hashed, for that matter. In case of match
we will wait until dentry leaves in-lookup state and repeat ->d_compare()
afterwards. In other words, the right behaviour is to treat the
name match as sufficient for in-lookup dentries; if dentry is not
for us, we'll see that when we recheck once proc_sys_lookup() is
done with it.
While we are at it, fix the misspelled READ_ONCE and WRITE_ONCE there.
Fixes: d9171b934526 ("parallel lookups machinery, part 4 (and last)") Reported-by: NeilBrown <neilb@brown.name> Reviewed-by: Christian Brauner <brauner@kernel.org> Reviewed-by: NeilBrown <neil@brown.name> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Dave Airlie [Fri, 4 Jul 2025 00:01:49 +0000 (10:01 +1000)]
Merge tag 'drm-xe-fixes-2025-07-03' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Fix chunking the PTE updates and overflowing the maximum number of
dwords with with MI_STORE_DATA_IMM (Jia Yao)
- Move WA BB to the LRC BO to mitigate hangs on context switch (Matthew
Brost)
- Fix frequency/flush WAs for BMG (Vinay / Lucas)
- Fix kconfig prompt title and description (Lucas)
- Do not require kunit (Harry Austen / Lucas)
- Extend 14018094691 WA to BMG (Daniele)
- Fix wedging the device on signal (Matthew Brost)
Paulo Alcantara [Thu, 3 Jul 2025 20:57:19 +0000 (17:57 -0300)]
smb: client: fix native SMB symlink traversal
We've seen customers having shares mounted in paths like /??/C:/ or
/??/UNC/foo.example.com/share in order to get their native SMB
symlinks successfully followed from different mounts.
After commit 12b466eb52d9 ("cifs: Fix creating and resolving absolute NT-style symlinks"),
the client would then convert absolute paths from "/??/C:/" to "/mnt/c/"
by default. The absolute paths would vary depending on the value of
symlinkroot= mount option.
Fix this by restoring old behavior of not trying to convert absolute
paths by default. Only do this if symlinkroot= was _explicitly_ set.
Before patch:
$ mount.cifs //w22-fs0/test2 /mnt/1 -o vers=3.1.1,username=xxx,password=yyy
$ ls -l /mnt/1/symlink2
lrwxr-xr-x 1 root root 15 Jun 20 14:22 /mnt/1/symlink2 -> /mnt/c/testfile
$ mkdir -p /??/C:; echo foo > //??/C:/testfile
$ cat /mnt/1/symlink2
cat: /mnt/1/symlink2: No such file or directory
Wang Zhaolong [Thu, 3 Jul 2025 13:29:52 +0000 (21:29 +0800)]
smb: client: fix race condition in negotiate timeout by using more precise timing
When the SMB server reboots and the client immediately accesses the mount
point, a race condition can occur that causes operations to fail with
"Host is down" error.
Reproduction steps:
# Mount SMB share
mount -t cifs //192.168.245.109/TEST /mnt/ -o xxxx
ls /mnt
# Immediate access fails
ls /mnt
ls: cannot access '/mnt': Host is down
# But works if there is a delay
The issue is caused by a race condition between negotiate and reconnect.
The 20-second negotiate timeout mechanism can interfere with the normal
recovery process when both are triggered simultaneously.
The server_unresponsive() timeout triggers cifs_reconnect(), which aborts
ongoing mid requests and causes the ls command to receive -EAGAIN, leading
to -EHOSTDOWN.
Fix this by introducing a dedicated `neg_start` field to
precisely tracks when the negotiate process begins. The timeout check
now uses this accurate timestamp instead of `lstrp`, ensuring that:
1. Timeout is only triggered after negotiate has actually run for 20s
2. The mechanism doesn't interfere with concurrent recovery processes
3. Uninitialized timestamps (value 0) don't trigger false timeouts
Fixes: 7ccc1465465d ("smb: client: fix hang in wait_for_response() for negproto") Signed-off-by: Wang Zhaolong <wangzhaolong@huaweicloud.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Dave Airlie [Thu, 3 Jul 2025 23:40:17 +0000 (09:40 +1000)]
Merge tag 'samsung-dsim-fixes-for-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
- Fixed raw pointer leakage and unsafe behavior in printk()
. Switch from %pK to %p for pointer formatting, as %p is now safer
and prevents issues like raw pointer leakage and acquiring sleeping
locks in atomic contexts.
Dave Airlie [Thu, 3 Jul 2025 23:37:57 +0000 (09:37 +1000)]
Merge tag 'exynos-drm-fixes-for-v6.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/daeinki/drm-exynos into drm-fixes
Fixups
- Fixed raw pointer leakage and unsafe behavior in printk()
. Switch from %pK to %p for pointer formatting, as %p is now safer
and prevents issues like raw pointer leakage and acquiring sleeping
locks in atomic contexts.
- Fixed kernel panic during boot
. A NULL pointer dereference issue occasionally occurred
when the vblank interrupt handler was called before
the DRM driver was fully initialized during boot.
So this patch fixes the issue by adding a check in the interrupt handler
to ensure the DRM driver is properly initialized.
- Fixed a lockup issue on Samsung Peach-Pit/Pi Chromebooks
. The issue occurred after commit c9b1150a68d9 changed
the call order of CRTC enable/disable and bridge pre_enable/post_disable
methods, causing fimd_dp_clock_enable() to be called
before the FIMD device was activated. To fix this,
runtime PM guards were added to fimd_dp_clock_enable()
to ensure proper operation even when CRTC is not enabled.
Dave Airlie [Thu, 3 Jul 2025 23:26:57 +0000 (09:26 +1000)]
Merge tag 'drm-intel-fixes-2025-07-03' of https://gitlab.freedesktop.org/drm/i915/kernel into drm-fixes
- Make mei interrupt top half irq disabled to fix RT builds
- Fix timeline left held on VMA alloc error
- Fix NULL pointer deref in vlv_dphy_param_init()
- Fix selftest mock_request() to avoid NULL deref
Merge tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
- tree-log fixes:
- fixes of log tracking of directories and subvolumes
- fix iteration and error handling of inode references
during log replay
- fix free space tree rebuild (reported by syzbot)
* tag 'for-6.16-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: use btrfs_record_snapshot_destroy() during rmdir
btrfs: propagate last_unlink_trans earlier when doing a rmdir
btrfs: record new subvolume in parent dir earlier to avoid dir logging races
btrfs: fix inode lookup error handling during log replay
btrfs: fix iteration of extrefs during log replay
btrfs: fix missing error handling when searching for inode refs during log replay
btrfs: fix failure to rebuild free space tree using multiple transactions
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Driver fixes plus core sd.c fix are all small and obvious.
The larger change to hosts.c is less obvious, but required to avoid
data corruption caused by bio splitting"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: core: Fix spelling of a sysfs attribute name
scsi: core: Enforce unlimited max_segment_size when virt_boundary_mask is set
scsi: RDMA/srp: Don't set a max_segment_size when virt_boundary_mask is set
scsi: sd: Fix VPD page 0xb7 length check
scsi: qla4xxx: Fix missing DMA mapping error in qla4xxx_alloc_pdu()
scsi: qla2xxx: Fix DMA mapping test in qla24xx_get_port_database()
Merge tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - new code bugs:
- eth:
- txgbe: fix the issue of TX failure
- ngbe: specify IRQ vector when the number of VFs is 7
Previous releases - regressions:
- sched: always pass notifications when child class becomes empty
- ipv4: fix stat increase when udp early demux drops the packet
- bluetooth: prevent unintended pause by checking if advertising is active
- virtio: fix error reporting in virtqueue_resize
- eth:
- virtio-net:
- ensure the received length does not exceed allocated size
- fix the xsk frame's length check
- lan78xx: fix WARN in __netif_napi_del_locked on disconnect
- eth:
- idpf: convert control queue mutex to a spinlock
- dpaa2: fix xdp_rxq_info leak
- amd-xgbe: align CL37 AN sequence as per databook"
* tag 'net-6.16-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (38 commits)
vsock/vmci: Clear the vmci transport packet properly when initializing it
dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the example
net: ngbe: specify IRQ vector when the number of VFs is 7
net: wangxun: revert the adjustment of the IRQ vector sequence
net: txgbe: request MISC IRQ in ndo_open
virtio_net: Enforce minimum TX ring size for reliability
virtio_net: Cleanup '2+MAX_SKB_FRAGS'
virtio_ring: Fix error reporting in virtqueue_resize
virtio-net: xsk: rx: fix the frame's length check
virtio-net: use the check_mergeable_len helper
virtio-net: remove redundant truesize check with PAGE_SIZE
virtio-net: ensure the received length does not exceed allocated size
net: ipv4: fix stat increase when udp early demux drops the packet
net: libwx: fix the incorrect display of the queue number
amd-xgbe: do not double read link status
net/sched: Always pass notifications when child class becomes empty
nui: Fix dma_mapping_error() check
rose: fix dangling neighbour pointers in rose_rt_device_down()
enic: fix incorrect MTU comparison in enic_change_mtu()
amd-xgbe: align CL37 AN sequence as per databook
...
Merge tag 'xfs-fixes-6.16-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
- Fix umount hang with unflushable inodes (and add new tracepoint used
for debugging this)
- Fix ABBA deadlock in xfs_reclaim_inode() vs xfs_ifree_cluster()
- Fix dquot buffer pin deadlock
* tag 'xfs-fixes-6.16-rc5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: add FALLOC_FL_ALLOCATE_RANGE to supported flags mask
xfs: fix unmount hang with unflushable inodes stuck in the AIL
xfs: factor out stale buffer item completion
xfs: rearrange code in xfs_buf_item.c
xfs: add tracepoints for stale pinned inode state debug
xfs: avoid dquot buffer pin deadlock
xfs: catch stale AGF/AGF metadata
xfs: xfs_ifree_cluster vs xfs_iflush_shutdown_abort deadlock
xfs: actually use the xfs_growfs_check_rtgeom tracepoint
xfs: Improve error handling in xfs_mru_cache_create()
xfs: move xfs_submit_zoned_bio a bit
xfs: use xfs_readonly_buftarg in xfs_remount_rw
xfs: remove NULL pointer checks in xfs_mru_cache_insert
xfs: check for shutdown before going to sleep in xfs_select_zone
Merge tag 'nvme-6.16-2025-07-03' of git://git.infradead.org/nvme into block-6.16
Pull NVMe fixes from Christoph:
"- fix incorrect cdw15 value in passthru error logging (Alok Tiwari)
- fix memory leak of bio integrity in nvmet (Dmitry Bogdanov)
- refresh visible attrs after being checked (Eugen Hristev)
- fix suspicious RCU usage warning in the multipath code (Geliang Tang)
- correctly account for namespace head reference counter (Nilay Shroff)"
* tag 'nvme-6.16-2025-07-03' of git://git.infradead.org/nvme:
nvme-multipath: fix suspicious RCU usage warning
nvme-pci: refresh visible attrs after being checked
nvmet: fix memory leak of bio integrity
nvme: correctly account for namespace head reference counter
nvme: Fix incorrect cdw15 value in passthru error logging
Merge tag 'apple-soc-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/sven/linux into arm/fixes
Apple SoC fixes for 6.16
One devicetree fix for a dtbs_warning that's been present for a while:
- Rename the PCIe BCM4377 node to conform to the devicetree binding
schema
Two devicetree fixes for W=1 warnings that have been introduced recently:
- Drop {address,size}-cells from SPI NOR which doesn't have any child
nodes such that these don't make sense
- Move touchbar mipi {address,size}-cells from the dtsi file where the
node is disabled and has no children to the dts file where it's
enabled and its children are declared
Signed-off-by: Sven Peter <sven@kernel.org>
* tag 'apple-soc-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/sven/linux:
arm64: dts: apple: Move touchbar mipi {address,size}-cells from dtsi to dts
arm64: dts: apple: Drop {address,size}-cells from SPI NOR
arm64: dts: apple: t8103: Fix PCIe BCM4377 nodename
Merge tag 'samsung-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux into arm/fixes
Samsung SoC fixes for v6.16
1. Correct CONFIG option in arm64 defconfig enabling the Qualcomm SoC
SNPS EUSB2 phy driver, because Kconfig entry was renamed when
changing the driver to a common one, shared with Samsung SoC, thus
defconfig lost that driver effectively.
2. Exynos ACPM: Fix timeouts happening with multiple requests.
* tag 'samsung-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/krzk/linux:
firmware: exynos-acpm: fix timeouts on xfers handling
arm64: defconfig: update renamed PHY_SNPS_EUSB2
Note that this is a GSC WA and we don't load the GSC on BMG, so
extending the WA to BMG won't do anything right now. However, it helps
future-proof the driver so that if we ever turn the GSC on we won't have
to remember to extend this WA.
v2: don't use VERSION_RANGE from 2001 to 2004 (Matt)
Merge tag 'ffa-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux into arm/fixes
Arm FF-A fixes for v6.16
Couple of fixes to address:
1. The safety and memory issues in the FF-A notification callback handler:
The fixes replaces a mutex with an rwlock to prevent sleeping in atomic
context, resolving kernel warnings. Memory allocation is moved outside
the lock to support this transition safely. Additionally, a memory leak
in the notifier unregistration path is fixed by properly freeing the
callback node.
2. The missing entry in struct ffa_indirect_msg_hdr:
The fix adds the missing 32 bit reserved entry in the structure as
required by the FF-A specification.
* tag 'ffa-fixes-6.16' of https://git.kernel.org/pub/scm/linux/kernel/git/sudeep.holla/linux:
firmware: arm_ffa: Fix the missing entry in struct ffa_indirect_msg_hdr
firmware: arm_ffa: Replace mutex with rwlock to avoid sleep in atomic context
firmware: arm_ffa: Move memory allocation outside the mutex locking
firmware: arm_ffa: Fix memory leak by freeing notifier callback node
regulator: gpio: Fix the out-of-bounds access to drvdata::gpiods
drvdata::gpiods is supposed to hold an array of 'gpio_desc' pointers. But
the memory is allocated for only one pointer. This will lead to
out-of-bounds access later in the code if 'config::ngpios' is > 1. So
fix the code to allocate enough memory to hold 'config::ngpios' of GPIO
descriptors.
While at it, also move the check for memory allocation failure to be below
the allocation to make it more readable.
Cc: stable@vger.kernel.org # 5.0 Fixes: d6cd33ad7102 ("regulator: gpio: Convert to use descriptors") Signed-off-by: Manivannan Sadhasivam <mani@kernel.org> Link: https://patch.msgid.link/20250703103549.16558-1-mani@kernel.org Signed-off-by: Mark Brown <broonie@kernel.org>
Revert "ACPI: battery: negate current when discharging"
Revert commit 234f71555019 ("ACPI: battery: negate current when
discharging") breaks not one but several userspace implementations
of battery monitoring: Steam and MangoHud. Perhaps it breaks more,
but those are the two that have been tested.
Reported-by: Matthew Schwartz <matthew.schwartz@linux.dev> Closes: https://lore.kernel.org/linux-acpi/87C1B2AF-D430-4568-B620-14B941A8ABA4@linux.dev/ Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
vsock/vmci: Clear the vmci transport packet properly when initializing it
In vmci_transport_packet_init memset the vmci_transport_packet before
populating the fields to avoid any uninitialised data being left in the
structure.
Cc: Bryan Tan <bryan-bt.tan@broadcom.com> Cc: Vishnu Dasa <vishnu.dasa@broadcom.com> Cc: Broadcom internal kernel review list Cc: Stefano Garzarella <sgarzare@redhat.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Simon Horman <horms@kernel.org> Cc: virtualization@lists.linux.dev Cc: netdev@vger.kernel.org Cc: stable <stable@kernel.org> Signed-off-by: HarshaVardhana S A <harshavardhana.sa@broadcom.com> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Fixes: d021c344051a ("VSOCK: Introduce VM Sockets") Acked-by: Stefano Garzarella <sgarzare@redhat.com> Link: https://patch.msgid.link/20250701122254.2397440-1-gregkh@linuxfoundation.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
dt-bindings: net: sophgo,sg2044-dwmac: Drop status from the example
Examples should be complete and should not have a 'status' property,
especially a disabled one because this disables the dt_binding_check of
the example against the schema. Dropping 'status' property shows
missing other properties - phy-mode and phy-handle.
Fixes: 114508a89ddc ("dt-bindings: net: Add support for Sophgo SG2044 dwmac") Cc: <stable@vger.kernel.org> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@gmail.com> Reviewed-by: Chen Wang <unicorn_wang@outlook.com> Link: https://patch.msgid.link/20250701063621.23808-2-krzysztof.kozlowski@linaro.org Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Thu, 3 Jul 2025 09:51:41 +0000 (11:51 +0200)]
Merge branch 'fix-irq-vectors'
Jiawen Wu says:
====================
Fix IRQ vectors
The interrupt vector order was adjusted by [1]commit 937d46ecc5f9 ("net:
wangxun: add ethtool_ops for channel number") in Linux-6.8. Because at
that time, the MISC interrupt acts as the parent interrupt in the GPIO
IRQ chip. When the number of Rx/Tx ring changes, the last MISC
interrupt must be reallocated. Then the GPIO interrupt controller would
be corrupted. So the initial plan was to adjust the sequence of the
interrupt vectors, let MISC interrupt to be the first one and do not
free it.
Later, irq_domain was introduced in [2]commit aefd013624a1 ("net: txgbe:
use irq_domain for interrupt controller") to avoid this problem.
However, the vector sequence adjustment was not reverted. So there is
still one problem that has been left unresolved.
Due to hardware limitations of NGBE, queue IRQs can only be requested
on vector 0 to 7. When the number of queues is set to the maximum 8,
the PCI IRQ vectors are allocated from 0 to 8. The vector 0 is used by
MISC interrupt, and althrough the vector 8 is used by queue interrupt,
it is unable to receive packets. This will cause some packets to be
dropped when RSS is enabled and they are assigned to queue 8.
net: ngbe: specify IRQ vector when the number of VFs is 7
For NGBE devices, the queue number is limited to be 1 when SRIOV is
enabled. In this case, IRQ vector[0] is used for MISC and vector[1] is
used for queue, based on the previous patches. But for the hardware
design, the IRQ vector[1] must be allocated for use by the VF[6] when
the number of VFs is 7. So the IRQ vector[0] should be shared for PF
MISC and QUEUE interrupts.
net: wangxun: revert the adjustment of the IRQ vector sequence
Due to hardware limitations of NGBE, queue IRQs can only be requested
on vector 0 to 7. When the number of queues is set to the maximum 8,
the PCI IRQ vectors are allocated from 0 to 8. The vector 0 is used by
MISC interrupt, and althrough the vector 8 is used by queue interrupt,
it is unable to receive packets. This will cause some packets to be
dropped when RSS is enabled and they are assigned to queue 8.
So revert the adjustment of the MISC IRQ location, to make it be the
last one in IRQ vectors.
Move the creating of irq_domain for MISC IRQ from .probe to .ndo_open,
and free it in .ndo_stop, to maintain consistency with the queue IRQs.
This it for subsequent adjustments to the IRQ vectors.
Fixes: aefd013624a1 ("net: txgbe: use irq_domain for interrupt controller") Cc: stable@vger.kernel.org Signed-off-by: Jiawen Wu <jiawenwu@trustnetic.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Link: https://patch.msgid.link/20250701063030.59340-2-jiawenwu@trustnetic.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
virtio: Fixes for TX ring sizing and resize error reporting
This patch series contains two fixes and a cleanup for the virtio subsystem.
The first patch fixes an error reporting bug in virtio_ring's
virtqueue_resize() function. Previously, errors from internal resize
helpers could be masked if the subsequent re-enabling of the virtqueue
succeeded. This patch restores the correct error propagation, ensuring that
callers of virtqueue_resize() are properly informed of underlying resize
failures.
The second patch does a cleanup of the use of '2+MAX_SKB_FRAGS'
The third patch addresses a reliability issue in virtio_net where the TX
ring size could be configured too small, potentially leading to
persistently stopped queues and degraded performance. It enforces a
minimum TX ring size to ensure there's always enough space for at least one
maximally-fragmented packet plus an additional slot.
====================
Laurent Vivier [Wed, 21 May 2025 09:22:36 +0000 (11:22 +0200)]
virtio_net: Enforce minimum TX ring size for reliability
The `tx_may_stop()` logic stops TX queues if free descriptors
(`sq->vq->num_free`) fall below the threshold of (`MAX_SKB_FRAGS` + 2).
If the total ring size (`ring_num`) is not strictly greater than this
value, queues can become persistently stopped or stop after minimal
use, severely degrading performance.
A single sk_buff transmission typically requires descriptors for:
- The virtio_net_hdr (1 descriptor)
- The sk_buff's linear data (head) (1 descriptor)
- Paged fragments (up to MAX_SKB_FRAGS descriptors)
This patch enforces that the TX ring size ('ring_num') must be strictly
greater than (MAX_SKB_FRAGS + 2). This ensures that the ring is
always large enough to hold at least one maximally-fragmented packet
plus at least one additional slot.
Reported-by: Lei Yang <leiyang@redhat.com> Signed-off-by: Laurent Vivier <lvivier@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://patch.msgid.link/20250521092236.661410-4-lvivier@redhat.com Tested-by: Lei Yang <leiyang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>