Peter Zijlstra [Sat, 12 Jul 2025 03:33:49 +0000 (03:33 +0000)]
sched: Start blocked_on chain processing in find_proxy_task()
Start to flesh out the real find_proxy_task() implementation,
but avoid the migration cases for now, in those cases just
deactivate the donor task and pick again.
To ensure the donor task or other blocked tasks in the chain
aren't migrated away while we're running the proxy, also tweak
the fair class logic to avoid migrating donor or mutex blocked
tasks.
[jstultz: This change was split out from the larger proxy patch] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-9-jstultz@google.com
Proxy execution forms atomic pairs of tasks: The waiting donor
task (scheduling context) and a proxy (execution context). The
donor task, along with the rest of the blocked chain, follows
the proxy wrt CPU placement.
They can be the same task, in which case push/pull doesn't need any
modification. When they are different, however,
FIFO1 & FIFO42:
,-> RT42
| | blocked-on
| v
blocked_donor | mutex
| | owner
| v
`-- RT1
RT1 is eligible to be pushed to CPU1, but should that happen it will
"carry" RT42 along. Clearly here neither RT1 nor RT42 must be seen as
push/pullable.
Unfortunately, only the donor task is usually dequeued from the rq,
and the proxy'ed execution context (rq->curr) remains on the rq.
This can cause RT1 to be selected for migration from logic like the
rt pushable_list.
Thus, adda a dequeue/enqueue cycle on the proxy task before __schedule
returns, which allows the sched class logic to avoid adding the now
current task to the pushable_list.
Furthermore, tasks becoming blocked on a mutex don't need an explicit
dequeue/enqueue cycle to be made (push/pull)able: they have to be running
to block on a mutex, thus they will eventually hit put_prev_task().
Signed-off-by: Valentin Schneider <valentin.schneider@arm.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-8-jstultz@google.com
John Stultz [Sat, 12 Jul 2025 03:33:47 +0000 (03:33 +0000)]
sched: Add an initial sketch of the find_proxy_task() function
Add a find_proxy_task() function which doesn't do much.
When we select a blocked task to run, we will just deactivate it
and pick again. The exception being if it has become unblocked
after find_proxy_task() was called.
This allows us to validate keeping blocked tasks on the runqueue
and later deactivating them is working ok, stressing the failure
cases for when a proxy isn't found.
Greatly simplified from patch by:
Peter Zijlstra (Intel) <peterz@infradead.org>
Juri Lelli <juri.lelli@redhat.com>
Valentin Schneider <valentin.schneider@arm.com>
Connor O'Brien <connoro@google.com>
[jstultz: Split out from larger proxy patch and simplified
for review and testing.] Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-7-jstultz@google.com
Without proxy-exec, we normally charge the "current" task for
both its vruntime as well as its sum_exec_runtime.
With proxy, however, we have two "current" contexts: the
scheduler context and the execution context. We want to charge
the execution context rq->curr (ie: proxy/lock holder) execution
time to its sum_exec_runtime (so it's clear to userland the
rq->curr task *is* running), as well as its thread group.
However the rest of the time accounting (such a vruntime and
cgroup accounting), we charge against the scheduler context
(rq->donor) task, because it is from that task that the time
is being "donated".
If the donor and curr tasks are the same, then it's the same as
without proxy.
Peter Zijlstra [Sat, 12 Jul 2025 03:33:43 +0000 (03:33 +0000)]
locking/mutex: Rework task_struct::blocked_on
Track the blocked-on relation for mutexes, to allow following this
relation at schedule time.
task
| blocked-on
v
mutex
| owner
v
task
This all will be used for tracking blocked-task/mutex chains
with the prox-execution patch in a similar fashion to how
priority inheritance is done with rt_mutexes.
For serialization, blocked-on is only set by the task itself
(current). And both when setting or clearing (potentially by
others), is done while holding the mutex::wait_lock.
[minor changes while rebasing]
[jstultz: Fix blocked_on tracking in __mutex_lock_common in error paths] Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Connor O'Brien <connoro@google.com> Signed-off-by: John Stultz <jstultz@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lkml.kernel.org/r/20250712033407.2383110-3-jstultz@google.com
John Stultz [Sat, 12 Jul 2025 03:33:42 +0000 (03:33 +0000)]
sched: Add CONFIG_SCHED_PROXY_EXEC & boot argument to enable/disable
Add a CONFIG_SCHED_PROXY_EXEC option, along with a boot argument
sched_proxy_exec= that can be used to disable the feature at boot
time if CONFIG_SCHED_PROXY_EXEC was enabled.
Also uses this option to allow the rq->donor to be different from
rq->curr.
Support for overlapping domains added in commit e3589f6c81e4 ("sched:
Allow for overlapping sched_domain spans") also allowed forcefully
setting SD_OVERLAP for !NUMA domains via FORCE_SD_OVERLAP sched_feat().
Since NUMA domains had to be presumed overlapping to ensure correct
behavior, "sched_domain_topology_level::flags" was introduced. NUMA
domains added the SDTL_OVERLAP flag would ensure SD_OVERLAP was always
added during build_sched_domains() for these domains, even when
FORCE_SD_OVERLAP was off.
Condition for adding the SD_OVERLAP flag at the aforementioned commit
was as follows:
if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
sd->flags |= SD_OVERLAP;
The FORCE_SD_OVERLAP debug feature was removed in commit af85596c74de
("sched/topology: Remove FORCE_SD_OVERLAP") which left the NUMA domains
as the exclusive users of SDTL_OVERLAP, SD_OVERLAP, and SD_NUMA flags.
Get rid of SDTL_OVERLAP and SD_OVERLAP as they have become redundant
and instead rely on SD_NUMA to detect the only overlapping domain
currently supported. Since SDTL_OVERLAP was the only user of
"tl->flags", get rid of "sched_domain_topology_level::flags" too.
Li Chen [Thu, 10 Jul 2025 10:57:09 +0000 (18:57 +0800)]
x86/smpboot: moves x86_topology to static initialize and truncate
The #ifdeffery and the initializers in build_sched_topology() are just
disgusting.
Statically initialize the domain levels in the topology array and let
build_sched_topology() invalidate the package domain level when NUMA in
package is available.
Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-4-me@linux.beauty
Li Chen [Thu, 10 Jul 2025 10:57:08 +0000 (18:57 +0800)]
x86/smpboot: remove redundant CONFIG_SCHED_SMT
On x86 CONFIG_SCHED_SMT is default y if SMP is enabled, so let's
simply drop CONFIG_SCHED_SMT.
Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-3-me@linux.beauty
Li Chen [Thu, 10 Jul 2025 10:57:07 +0000 (18:57 +0800)]
smpboot: introduce SDTL_INIT() helper to tidy sched topology setup
Define a small SDTL_INIT(maskfn, flagsfn, name) macro and use it to build the
sched_domain_topology_level array. Purely a cleanup; behaviour is unchanged.
Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: K Prateek Nayak <kprateek.nayak@amd.com> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://lore.kernel.org/r/20250710105715.66594-2-me@linux.beauty
Juri Lelli [Fri, 27 Jun 2025 11:51:17 +0000 (13:51 +0200)]
tools/sched: Add root_domains_dump.py which dumps root domains info
Root domains information is somewhat hard to access at runtime. Even
with sched_debug and sched_verbose, such information is only printed
on kernel console when domains are modified.
Add a simple drgn script to more easily retrieve root domains
information at runtime.
Since tools/sched is a new directory, add it to MAINTAINERS as well.
Juri Lelli [Fri, 27 Jun 2025 11:51:16 +0000 (13:51 +0200)]
sched/deadline: Fix accounting after global limits change
A global limits change (sched_rt_handler() logic) currently leaves stale
and/or incorrect values in variables related to accounting (e.g.
extra_bw).
Properly clean up per runqueue variables before implementing the change
and rebuild scheduling domains (so that accounting is also properly
restored) after such a change is complete.
Juri Lelli [Fri, 27 Jun 2025 11:51:15 +0000 (13:51 +0200)]
sched/deadline: Reset extra_bw to max_bw when clearing root domains
dl_clear_root_domain() doesn't take into account the fact that per-rq
extra_bw variables retain values computed before root domain changes,
resulting in broken accounting.
Fix it by resetting extra_bw to max_bw before restoring back dl-servers
contributions.
Fixes: 2ff899e351643 ("sched/deadline: Rebuild root domain accounting after every update") Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-3-juri.lelli@redhat.com
Juri Lelli [Fri, 27 Jun 2025 11:51:14 +0000 (13:51 +0200)]
sched/deadline: Initialize dl_servers after SMP
dl-servers are currently initialized too early at boot when CPUs are not
fully up (only boot CPU is). This results in miscalculation of per
runqueue DEADLINE variables like extra_bw (which needs a stable CPU
count).
Move initialization of dl-servers later on after SMP has been
initialized and CPUs are all online, so that CPU count is stable and
DEADLINE variables can be computed correctly.
Fixes: d741f297bceaf ("sched/fair: Fair server interface") Reported-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> Signed-off-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Waiman Long <longman@redhat.com> Tested-by: Marcel Ziswiler <marcel.ziswiler@codethink.co.uk> # nuc & rock5b Link: https://lore.kernel.org/r/20250627115118.438797-2-juri.lelli@redhat.com
sched: Change nr_uninterruptible type to unsigned long
The commit e6fe3f422be1 ("sched: Make multiple runqueue task counters
32-bit") changed nr_uninterruptible to an unsigned int. But the
nr_uninterruptible values for each of the CPU runqueues can grow to
large numbers, sometimes exceeding INT_MAX. This is valid, if, over
time, a large number of tasks are migrated off of one CPU after going
into an uninterruptible state. Only the sum of all nr_interruptible
values across all CPUs yields the correct result, as explained in a
comment in kernel/sched/loadavg.c.
Change the type of nr_uninterruptible back to unsigned long to prevent
overflows, and thus the miscalculation of load average.
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Fixes for a few clk drivers and bindings:
- Add a missing property to the Mediatek MT8188 clk binding to
keep binding checks happy
- Avoid an OOB by setting the correct number of parents in
dispmix_csr_clk_dev_data
- Allocate clk_hw structs early in probe to avoid an ordering
issue where clk_parent_data points to an unallocated clk_hw
when the child clk is registered before the parent clk in the
SCMI clk driver
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
dt-bindings: clock: mediatek: Add #reset-cells property for MT8188
clk: imx: Fix an out-of-bounds access in dispmix_csr_clk_dev_data
clk: scmi: Handle case where child clocks are initialized before their parents
Merge tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Borislav Petkov:
- Fix a case of recursive locking in the MSI code
- Fix a randconfig build failure in armada-370-xp irqchip
* tag 'irq_urgent_for_v6.16_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
irqchip/irq-msi-lib: Fix build with PCI disabled
PCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()
Merge tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"19 hotfixes. A whopping 16 are cc:stable and the remainder address
post-6.15 issues or aren't considered necessary for -stable kernels.
14 are for MM. Three gdb-script fixes and a kallsyms build fix"
* tag 'mm-hotfixes-stable-2025-07-11-16-16' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
Revert "sched/numa: add statistics of numa balance task"
mm: fix the inaccurate memory statistics issue for users
mm/damon: fix divide by zero in damon_get_intervals_score()
samples/damon: fix damon sample mtier for start failure
samples/damon: fix damon sample wsse for start failure
samples/damon: fix damon sample prcl for start failure
kasan: remove kasan_find_vm_area() to prevent possible deadlock
scripts: gdb: vfs: support external dentry names
mm/migrate: fix do_pages_stat in compat mode
mm/damon/core: handle damon_call_control as normal under kdmond deactivation
mm/rmap: fix potential out-of-bounds page table access during batched unmap
mm/hugetlb: don't crash when allocating a folio if there are no resv
scripts/gdb: de-reference per-CPU MCE interrupts
scripts/gdb: fix interrupts.py after maple tree conversion
maple_tree: fix mt_destroy_walk() on root leaf node
mm/vmalloc: leave lazy MMU mode on PTE mapping error
scripts/gdb: fix interrupts display after MCP on x86
lib/alloc_tag: do not acquire non-existent lock in alloc_tag_top_users()
kallsyms: fix build without execinfo
Merge tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs
Pull erofs fixes from Gao Xiang:
"Fix for a cache aliasing issue by adding missing flush_dcache_folio(),
which causes execution failures on some arm32 setups.
Fix for large compressed fragments, which could be generated by
-Eall-fragments option (but should be rare) and was rejected by
mistake due to an on-disk hardening commit.
The remaining ones are small fixes. Summary:
- Address cache aliasing for mappable page cache folios
- Allow readdir() to be interrupted
- Fix large fragment handling which was errored out by mistake
- Add missing tracepoints
- Use memcpy_to_folio() to replace copy_to_iter() for inline data"
* tag 'erofs-for-6.16-rc6-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/xiang/erofs:
erofs: fix large fragment handling
erofs: allow readdir() to be interrupted
erofs: address D-cache aliasing
erofs: use memcpy_to_folio() to replace copy_to_iter()
erofs: fix to add missing tracepoint in erofs_read_folio()
erofs: fix to add missing tracepoint in erofs_readahead()
Merge tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix use after free in lease break
- small fix for freeing rdma transport (fixes missing logging of
cm_qp_destroy)
- fix write count leak
* tag 'v6.16-rc5-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix potential use-after-free in oplock/lease break ack
ksmbd: fix a mount write count leak in ksmbd_vfs_kern_path_locked()
smb: server: make use of rdma_destroy_qp()
Merge tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull PCI fixes from Bjorn Helgaas:
- Track apple Root Ports explicitly and look up the driver data from
the struct device instead of using dev->driver_data, which is used by
pci_host_common_init() for the generic host bridge pointer (Marc
Zyngier)
- Set dev->driver_data before pci_host_common_init() calls
gen_pci_init() because some drivers need it to set up ECAM mappings;
this fixes a regression on MicroChip MPFS Icicle (Geert Uytterhoeven)
- Revert the now-unnecessary use of ECAM pci_config_window.priv to
store a copy of dev->driver_data (Marc Zyngier)
* tag 'pci-v6.16-fixes-3' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
Revert "PCI: ecam: Allow cfg->priv to be pre-populated from the root port device"
PCI: host-generic: Set driver_data before calling gen_pci_init()
PCI: apple: Add tracking of probed root ports
Core Changes:
- fix race in gem_handle_create_tail
- fixup handle_count fb refcount regression from -rc5, popular with
reports ...
- call rust dtor for drm_device release
Driver Changes:
- nouveau: magic 50ms suspend fix, acpi leak fix
- tegra: dma api error in nvdec
- pvr: fix device reset
- habanalbs maintainer update
- intel display: fix some dsi mipi sequences
- xe fixes: SRIOV fixes, small GuC fixes, disable indirect ring due
to issues, compression fix for fragmented BO, doc update
* tag 'drm-fixes-2025-07-12' of https://gitlab.freedesktop.org/drm/kernel: (22 commits)
drm/xe/guc: Default log level to non-verbose
drm/xe/bmg: Don't use WA 16023588340 and 22019338487 on VF
drm/xe/guc: Recommend GuC v70.46.2 for BMG, LNL, DG2
drm/xe/pm: Correct comment of xe_pm_set_vram_threshold()
drm/xe: Release runtime pm for error path of xe_devcoredump_read()
drm/xe/pm: Restore display pm if there is error after display suspend
drm/i915/bios: Apply vlv_fixup_mipi_sequences() to v2 mipi-sequences too
drm/gem: Fix race in drm_gem_handle_create_tail()
drm/framebuffer: Acquire internal references on GEM handles
agp/amd64: Check AGP Capability before binding to unsupported devices
drm/xe/bmg: fix compressed VRAM handling
Revert "drm/xe/xe2: Enable Indirect Ring State support for Xe2"
drm/xe: Allocate PF queue size on pow2 boundary
drm/xe/pf: Clear all LMTT pages on alloc
drm/nouveau/gsp: fix potential leak of memory used during acpi init
rust: drm: remove unnecessary imports
MAINTAINERS: Change habanalabs maintainer
drm/imagination: Fix kernel crash when hard resetting the GPU
drm/tegra: nvdec: Fix dma_alloc_coherent error check
rust: drm: device: drop_in_place() the drm::Device in release()
...
I haven't figured out what the actual bug in this commit is, but I did
spend a lot of time chasing it down and eventually succeeded in
bisecting it down to this.
For some reason, this eventpoll commit ends up causing delays and stuck
user space processes, but it only happens on one of my machines, and
only during early boot or during the flurry of initial activity when
logging in.
I must be triggering some very subtle timing issue, but once I figured
out the behavior pattern that made it reasonably reliable to trigger, it
did bisect right to this, and reverting the commit fixes the problem.
Of course, that was only after I had failed at bisecting it several
times, and had flailed around blaming both the drm people and the
netlink people for the odd problems. The most obvious of which happened
at the time of the first graphical login (the most common symptom being
that some gnome app aborted due to a 30s timeout, often leading to the
whole session then failing if it was some critical component like
gnome-shell or similar).
Acked-by: Nam Cao <namcao@linutronix.de> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fragments aren't limited by Z_EROFS_PCLUSTER_MAX_DSIZE. However, if
a fragment's logical length is larger than Z_EROFS_PCLUSTER_MAX_DSIZE
but the fragment is not the whole inode, it currently returns
-EOPNOTSUPP because m_flags has the wrong EROFS_MAP_ENCODED flag set.
It is not intended by design but should be rare, as it can only be
reproduced by mkfs with `-Eall-fragments` in a specific case.
Let's normalize fragment m_flags using the new EROFS_MAP_FRAGMENT.
Merge tag 'block-6.16-20250710' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
- MD changes via Yu:
- fix UAF due to stack memory used for bio mempool (Jinchao)
- fix raid10/raid1 nowait IO error path (Nigel and Qixing)
- fix kernel crash from reading bitmap sysfs entry (HĂĄkon)
- Fix for a UAF in the nbd connect error path
- Fix for blocksize being bigger than pagesize, if THP isn't enabled
* tag 'block-6.16-20250710' of git://git.kernel.dk/linux:
block: reject bs > ps block devices when THP is disabled
nbd: fix uaf in nbd_genl_connect() error path
md/md-bitmap: fix GPF in bitmap_get_stats()
md/raid1,raid10: strip REQ_NOWAIT from member bios
raid10: cleanup memleak at raid10_make_request
md/raid1: Fix stack memory use after return in raid1_reshape
Merge tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
- Remove a pointless warning in the zcrx code
- Fix for MSG_RING commands, where the allocated io_kiocb
needs to be freed under RCU as well
- Revert the work-around we had in place for the anon inodes
pretending to be regular files. Since that got reworked
upstream, the work-around is no longer needed
* tag 'io_uring-6.16-20250710' of git://git.kernel.dk/linux:
Revert "io_uring: gate REQ_F_ISREG on !S_ANON_INODE as well"
io_uring/msg_ring: ensure io_kiocb freeing is deferred for RCU
io_uring/zcrx: fix pp destruction warnings
Merge tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull more networking fixes from Jakub Kicinski
"Big chunk of fixes for WiFi, Johannes says probably the last for the
release.
The Netlink fixes (on top of the tree) restore operation of iw (WiFi
CLI) which uses sillily small recv buffer, and is the reason for this
'emergency PR'.
The GRE multicast fix also stands out among the user-visible
regressions.
Current release - fix to a fix:
- netlink: make sure we always allow at least one skb to be queued,
even if the recvbuf is (mis)configured to be tiny
Previous releases - regressions:
- gre: fix IPv6 multicast route creation
Previous releases - always broken:
- wifi: prevent A-MSDU attacks in mesh networks
- wifi: cfg80211: fix S1G beacon head validation and detection
- wifi: mac80211:
- always clear frame buffer to prevent stack leak in cases which
hit a WARN()
- fix monitor interface in device restart
- wifi: mwifiex: discard erroneous disassoc frames on STA interface
- wifi: mt76:
- prevent null-deref in mt7925_sta_set_decap_offload()
- add missing RCU annotations, and fix sleep in atomic
- fix decapsulation offload
- fixes for scanning
- phy: microchip: improve link establishment and reset handling
- eth: mlx5e: fix race between DIM disable and net_dim()
- bnxt_en: correct DMA unmap len for XDP_REDIRECT"
* tag 'net-6.16-rc6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (44 commits)
netlink: make sure we allow at least one dump skb
netlink: Fix rmem check in netlink_broadcast_deliver().
bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT
bnxt_en: Flush FW trace before copying to the coredump
bnxt_en: Fix DCB ETS validation
net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()
net/mlx5e: Add new prio for promiscuous mode
net/mlx5e: Fix race between DIM disable and net_dim()
net/mlx5: Reset bw_share field when changing a node's parent
can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
selftests: net: lib: fix shift count out of range
selftests: Add IPv6 multicast route generation tests for GRE devices.
gre: Fix IPv6 multicast route creation.
net: phy: microchip: limit 100M workaround to link-down events on LAN88xx
net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof
net: appletalk: Fix device refcount leak in atrtr_create()
netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto()
wifi: mac80211: add the virtual monitor after reconfig complete
wifi: mac80211: always initialize sdata::key_list
...
Merge tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix performance regression when setting values of multiple GPIO lines
at once
- make sure the GPIO OF xlate code doesn't end up passing an
uninitialized local variable to GPIO core
- update MAINTAINERS
* tag 'gpio-fixes-for-v6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
MAINTAINERS: remove bouncing address for Nandor Han
gpio: of: initialize local variable passed to the .of_xlate() callback
gpiolib: fix performance regression when using gpio_chip_get_multiple()
Merge tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux
Pull dma-mapping fix from Marek Szyprowski:
- small fix relevant to arm64 server and custom CMA configuration (Feng
Tang)
* tag 'dma-mapping-6.16-2025-07-11' of git://git.kernel.org/pub/scm/linux/kernel/git/mszyprowski/linux:
dma-contiguous: hornor the cma address limit setup by user
Jakub Kicinski [Fri, 11 Jul 2025 00:11:21 +0000 (17:11 -0700)]
netlink: make sure we allow at least one dump skb
Commit under Fixes tightened up the memory accounting for Netlink
sockets. Looks like the accounting is too strict for some existing
use cases, Marek reported issues with nl80211 / WiFi iw CLI.
To reduce number of iterations Netlink dumps try to allocate
messages based on the size of the buffer passed to previous
recvmsg() calls. If user space uses a larger buffer in recvmsg()
than sk_rcvbuf we will allocate an skb we won't be able to queue.
Make sure we always allow at least one skb to be queued.
Same workaround is already present in netlink_attachskb().
Alternative would be to cap the allocation size to
rcvbuf - rmem_alloc
but as I said, the workaround is already present in other places.
Jakub Kicinski [Fri, 11 Jul 2025 14:28:36 +0000 (07:28 -0700)]
Merge branch 'bnxt_en-3-bug-fixes'
Michael Chan says:
====================
bnxt_en: 3 bug fixes
The first one fixes a possible failure when setting DCB ETS.
The second one fixes the ethtool coredump (-W 2) not containing
all the FW traces. The third one fixes the DMA unmap length when
transmitting XDP_REDIRECT packets.
====================
bnxt_en: Set DMA unmap len correctly for XDP_REDIRECT
When transmitting an XDP_REDIRECT packet, call dma_unmap_len_set()
with the proper length instead of 0. This bug triggers this warning
on a system with IOMMU enabled:
bnxt_en: Flush FW trace before copying to the coredump
bnxt_fill_drv_seg_record() calls bnxt_dbg_hwrm_log_buffer_flush()
to flush the FW trace buffer. This needs to be done before we
call bnxt_copy_ctx_mem() to copy the trace data.
Without this fix, the coredump may not contain all the FW
traces.
Fixes: 3c2179e66355 ("bnxt_en: Add FW trace coredump segments to the coredump") Reviewed-by: Kalesh AP <kalesh-anakkur.purayil@broadcom.com> Signed-off-by: Shruti Parab <shruti.parab@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-3-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
In bnxt_ets_validate(), the code incorrectly loops over all possible
traffic classes to check and add the ETS settings. Fix it to loop
over the configured traffic classes only.
The unconfigured traffic classes will default to TSA_ETS with 0
bandwidth. Looping over these unconfigured traffic classes may
cause the validation to fail and trigger this error message:
"rejecting ETS config starving a TC\n"
The .ieee_setets() will then fail.
Fixes: 7df4ae9fe855 ("bnxt_en: Implement DCBNL to support host-based DCBX.") Reviewed-by: Sreekanth Reddy <sreekanth.reddy@broadcom.com> Signed-off-by: Shravya KN <shravya.k-n@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Link: https://patch.msgid.link/20250710213938.1959625-2-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: ll_temac: Fix missing tx_pending check in ethtools_set_ringparam()
The function ll_temac_ethtools_set_ringparam() incorrectly checked
rx_pending twice, once correctly for RX and once mistakenly in place
of tx_pending. This caused tx_pending to be left unchecked against
TX_BD_NUM_MAX.
As a result, invalid TX ring sizes may have been accepted or valid
ones wrongly rejected based on the RX limit, leading to potential
misconfiguration or unexpected results.
This patch corrects the condition to properly validate tx_pending.
Jianbo Liu [Thu, 10 Jul 2025 13:53:44 +0000 (16:53 +0300)]
net/mlx5e: Add new prio for promiscuous mode
An optimization for promiscuous mode adds a high-priority steering
table with a single catch-all rule to steer all traffic directly to
the TTC table.
However, a gap exists between the creation of this table and the
insertion of the catch-all rule. Packets arriving in this brief window
would miss as no rule was inserted yet, unnecessarily incrementing the
'rx_steer_missed_packets' counter and dropped.
This patch resolves the issue by introducing a new prio for this
table, placing it between MLX5E_TC_PRIO and MLX5E_NIC_PRIO. By doing
so, packets arriving during the window now fall through to the next
prio (at MLX5E_NIC_PRIO) instead of being dropped.
Fixes: 1c46d7409f30 ("net/mlx5e: Optimize promiscuous mode") Signed-off-by: Jianbo Liu <jianbol@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-4-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/mlx5e: Fix race between DIM disable and net_dim()
There's a race between disabling DIM and NAPI callbacks using the dim
pointer on the RQ or SQ.
If NAPI checks the DIM state bit and sees it still set, it assumes
`rq->dim` or `sq->dim` is valid. But if DIM gets disabled right after
that check, the pointer might already be set to NULL, leading to a NULL
pointer dereference in net_dim().
Fix this by calling `synchronize_net()` before freeing the DIM context.
This ensures all in-progress NAPI callbacks are finished before the
pointer is cleared.
net/mlx5: Reset bw_share field when changing a node's parent
When changing a node's parent, its scheduling element is destroyed and
re-created with bw_share 0. However, the node's bw_share field was not
updated accordingly.
Set the node's bw_share to 0 after re-creation to keep the software
state in sync with the firmware configuration.
Fixes: 9c7bbf4c3304 ("net/mlx5: Add support for setting parent of nodes") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Link: https://patch.msgid.link/1752155624-24095-2-git-send-email-tariqt@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 11 Jul 2025 14:07:56 +0000 (07:07 -0700)]
Merge tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2025-07-11
Sean Nyekjaer's patch targets the m_can driver and demotes the "msg
lost in rx" message to debug level to prevent flooding the kernel log
with error messages.
* tag 'linux-can-fixes-for-6.16-20250711' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: m_can: m_can_handle_lost_msg(): downgrade msg lost in rx message to debug level
====================
Merge tag 'drm-misc-fixes-2025-07-10' of https://gitlab.freedesktop.org/drm/misc/kernel into drm-fixes
drm-misc-fixes for v6.16-rc6 or final:
- Fix nouveau fail on debugfs errors.
- Magic 50 ms to fix nouveau suspend.
- Call rust destructor on drm device release.
- Fix DMA api error handling in tegra/nvdec.
- Fix PVR device reset.
- Habanalabs maintainer update.
- Small memory leak fix when nouveau acpi init fails.
- Do not attempt to bind to any PCI device with AGP capability.
- Make FB's acquire handles on backing object, same as i915/xe already does.
- Fix race in drm_gem_handle_create_tail.
Merge tag 'drm-xe-fixes-2025-07-11' of https://gitlab.freedesktop.org/drm/xe/kernel into drm-fixes
Driver Changes:
- Clear LMTT page to avoid leaking data from one VF to another
- Align PF queue size to power of 2
- Disable Indirect Ring State to avoid intermittent issues on context
switch: feature is not currently needed, so can be disabled for now.
- Fix compression handling when the BO pages are very fragmented
- Restore display pm on error path
- Fix runtime pm handling in xe devcoredump
- Fix xe_pm_set_vram_threshold() doc
- Recommend new minor versions of GuC firmware
- Drop some workarounds on VF
- Do not use verbose GuC logging by default: it should be only for
debugging
Lucas De Marchi [Fri, 13 Jun 2025 20:00:37 +0000 (13:00 -0700)]
drm/xe/guc: Default log level to non-verbose
Currently xe sets the guc log level to a verbose level since it's useful
to debug hangs and general development. However the verbose level may
already be too much and affect performance.
Michal Mrozek did some tests with the L0 compute stack for submission
latency with ULLS disabled. Below are the normalized numbers with log
level 3 (the current default) as baseline for each test:
Log level 2 is the first "verbose level" for GuC, where the biggest
difference happens. Keep log level 3 for CONFIG_DRM_XE_DEBUG, but switch
to 1, i.e. GUC_LOG_LEVEL_NON_VERBOSE, for "normal" builds.
These workarounds are not applicable for use by the VFs.
Signed-off-by: Michal Wajdeczko <michal.wajdeczko@intel.com> Tested-by: Jakub Kolakowski <jakub1.kolakowski@intel.com> Reviewed-by: Satyanarayana K V P <satyanarayana.k.v.p@intel.com> Signed-off-by: Jakub Kolakowski <jakub1.kolakowski@intel.com> Link: https://lore.kernel.org/r/20250710103040.375610-2-jakub1.kolakowski@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
(cherry picked from commit 1d2e2503e506ddc499cbb7afdc8b70bcf6fe241f) Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Shuicheng Lin [Mon, 7 Jul 2025 00:49:14 +0000 (00:49 +0000)]
drm/xe: Release runtime pm for error path of xe_devcoredump_read()
xe_pm_runtime_put() is missed to be called for the error path in
xe_devcoredump_read().
Add function description comments for xe_devcoredump_read() to help
understand it.
v2: more detail function comments and refine goto logic (Matt)
Hangbin Liu [Wed, 9 Jul 2025 09:12:44 +0000 (09:12 +0000)]
selftests: net: lib: fix shift count out of range
I got the following warning when writing other tests:
+ handle_test_result_pass 'bond 802.3ad' '(lacp_active off)'
+ local 'test_name=bond 802.3ad'
+ shift
+ local 'opt_str=(lacp_active off)'
+ shift
+ log_test_result 'bond 802.3ad' '(lacp_active off)' ' OK '
+ local 'test_name=bond 802.3ad'
+ shift
+ local 'opt_str=(lacp_active off)'
+ shift
+ local 'result= OK '
+ shift
+ local retmsg=
+ shift
/net/tools/testing/selftests/net/forwarding/../lib.sh: line 315: shift: shift count out of range
This happens because an extra shift is executed even after all arguments
have been consumed. Remove the last shift in log_test_result() to avoid
this warning.
When fixing IPv6 link-local address generation on GRE devices with
commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), I accidentally broke the default IPv6 multicast route
creation on these GRE devices.
Fix that in patch 1, making the GRE specific code yet a bit closer to
the generic code used by most other network interface types.
Then extend the selftest in patch 2 to cover this case.
====================
selftests: Add IPv6 multicast route generation tests for GRE devices.
The previous patch fixes a bug that prevented the creation of the
default IPv6 multicast route (ff00::/8) for some GRE devices. Now let's
extend the GRE IPv6 selftests to cover this case.
Also, rename check_ipv6_ll_addr() to check_ipv6_device_config() and
adapt comments and script output to take into account the fact that
we're not limited to link-local address generation.
Use addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config() so that we don't just get the inet6_dev, but also
install the default ff00::/8 multicast route.
Before commit 3e6a0243ff00 ("gre: Fix again IPv6 link-local address
generation."), the multicast route was created at the end of the
function by addrconf_add_mroute(). But this code path is now only taken
in one particular case (gre devices not bound to a local IP address and
in EUI64 mode). For all other cases, the function exits early and
addrconf_add_mroute() is not called anymore.
Using addrconf_add_dev() instead of ipv6_find_idev() in
addrconf_gre_config(), fixes the problem as it will create the default
multicast route for all gre devices. This also brings
addrconf_gre_config() a bit closer to the normal netdevice IPv6
configuration code (addrconf_dev_config()).
Cc: stable@vger.kernel.org Fixes: 3e6a0243ff00 ("gre: Fix again IPv6 link-local address generation.") Reported-by: Aiden Yang <ling@moedove.com> Closes: https://lore.kernel.org/netdev/CANR=AhRM7YHHXVxJ4DmrTNMeuEOY87K2mLmo9KMed1JMr20p6g@mail.gmail.com/ Reviewed-by: Gary Guo <gary@garyguo.net> Tested-by: Gary Guo <gary@garyguo.net> Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/027a923dcb550ad115e6d93ee8bb7d310378bd01.1752070620.git.gnault@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch series improves the reliability of the Microchip LAN88xx
PHYs, particularly in edge cases involving fixed link configurations or
forced speed modes.
Patch 1 assigns genphy_soft_reset() to the .soft_reset hook to ensure
that stale link partner advertisement (LPA) bits are properly cleared
during reconfiguration. Without this, outdated autonegotiation bits may
remain visible in some parallel detection cases.
Patch 2 restricts the 100 Mbps workaround (originally intended to handle
cable length switching) to only run when the link transitions to the
PHY_NOLINK state. This prevents repeated toggling that can confuse
autonegotiating link partners such as the Intel i350, leading to
unstable link cycles.
Both patches were tested on a LAN7850 (with integrated LAN88xx PHY)
against an Intel I350 NIC. The full test suite - autonegotiation, fixed
link, and parallel detection - passed successfully.
====================
net: phy: microchip: limit 100M workaround to link-down events on LAN88xx
Restrict the 100Mbit forced-mode workaround to link-down transitions
only, to prevent repeated link reset cycles in certain configurations.
The workaround was originally introduced to improve signal reliability
when switching cables between long and short distances. It temporarily
forces the PHY into 10 Mbps before returning to 100 Mbps.
However, when used with autonegotiating link partners (e.g., Intel i350),
executing this workaround on every link change can confuse the partner
and cause constant renegotiation loops. This results in repeated link
down/up transitions and the PHY never reaching a stable state.
Limit the workaround to only run during the PHY_NOLINK state. This ensures
it is triggered only once per link drop, avoiding disruptive toggling
while still preserving its intended effect.
Note: I am not able to reproduce the original issue that this workaround
addresses. I can only confirm that 100 Mbit mode works correctly in my
test setup. Based on code inspection, I assume the workaround aims to
reset some internal state machine or signal block by toggling speeds.
However, a PHY reset is already performed earlier in the function via
phy_init_hw(), which may achieve a similar effect. Without a reproducer,
I conservatively keep the workaround but restrict its conditions.
Fixes: e57cf3639c32 ("net: lan78xx: fix accessing the LAN7800's internal phy specific registers from the MAC driver") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250709130753.3994461-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: microchip: Use genphy_soft_reset() to purge stale LPA bits
Enable .soft_reset for the LAN88xx PHY driver by assigning
genphy_soft_reset() to ensure that the phylib core performs a proper
soft reset during reconfiguration.
Previously, the driver left .soft_reset unimplemented, so calls to
phy_init_hw() (e.g., from lan88xx_link_change_notify()) did not fully
reset the PHY. As a result, stale contents in the Link Partner Ability
(LPA) register could persist, causing the PHY to incorrectly report
that the link partner advertised autonegotiation even when it did not.
Using genphy_soft_reset() guarantees a clean reset of the PHY and
corrects the false autoneg reporting in these scenarios.
Fixes: ccb989e4d1ef ("net: phy: microchip: Reset LAN88xx PHY to ensure clean link state on LAN7800/7850") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20250709130753.3994461-2-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mingming Cao [Wed, 9 Jul 2025 15:33:32 +0000 (08:33 -0700)]
ibmvnic: Fix hardcoded NUM_RX_STATS/NUM_TX_STATS with dynamic sizeof
The previous hardcoded definitions of NUM_RX_STATS and
NUM_TX_STATS were not updated when new fields were added
to the ibmvnic_{rx,tx}_queue_stats structures. Specifically,
commit 2ee73c54a615 ("ibmvnic: Add stat for tx direct vs tx
batched") added a fourth TX stat, but NUM_TX_STATS remained 3,
leading to a mismatch.
This patch replaces the static defines with dynamic sizeof-based
calculations to ensure the stat arrays are correctly sized.
This fixes incorrect indexing and prevents incomplete stat
reporting in tools like ethtool.
Fixes: 2ee73c54a615 ("ibmvnic: Add stat for tx direct vs tx batched") Signed-off-by: Mingming Cao <mmc@linux.ibm.com> Reviewed-by: Dave Marquardt <davemarq@linux.ibm.com> Reviewed-by: Haren Myneni <haren@linux.ibm.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250709153332.73892-1-mmc@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kito Xu [Wed, 9 Jul 2025 03:52:51 +0000 (03:52 +0000)]
net: appletalk: Fix device refcount leak in atrtr_create()
When updating an existing route entry in atrtr_create(), the old device
reference was not being released before assigning the new device,
leading to a device refcount leak. Fix this by calling dev_put() to
release the old device reference before holding the new one.
The armada-370-xp irqchip fails in some randconfig builds because
of a missing declaration:
In file included from drivers/irqchip/irq-armada-370-xp.c:23:
include/linux/irqchip/irq-msi-lib.h:25:39: error: 'struct msi_domain_info' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
Add a forward declaration for the msi_domain_info structure.
[ tglx: Fixed up the subsystem prefix. Is it really that hard to get right? ]
Fixes: e51b27438a10 ("irqchip: Make irq-msi-lib.h globally available") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/all/20250710080021.2303640-1-arnd@kernel.org
PCI/MSI: Prevent recursive locking in pci_msix_write_tph_tag()
pci_msix_write_tph_tag() takes the per device MSI descriptor mutex and then
invokes msi_domain_get_virq(), which takes the same mutex again. That
obviously results in a system hang which is exposed by a softlockup or
lockdep warning.
Move the lock guard after the invocation of msi_domain_get_virq() to fix
this.
[ tglx: Massage changelog by adding a proper explanation and removing the
not really useful stacktrace ]
Fixes: d5124a9957b2 ("PCI/MSI: Provide a sane mechanism for TPH") Reported-by: Jorge Lopez <jorge.jo.lopez@oracle.com> Suggested-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Jorge Lopez <jorge.jo.lopez@oracle.com> Link: https://lore.kernel.org/all/20250708222530.1041477-1-himanshu.madhani@oracle.com
Merge tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Paolo Abeni:
"Including fixes from Bluetooth.
Current release - regressions:
- tcp: refine sk_rcvbuf increase for ooo packets
- bluetooth: fix attempting to send HCI_Disconnect to BIS handle
- rxrpc: fix over large frame size warning
- eth: bcmgenet: initialize u64 stats seq counter
Previous releases - regressions:
- tcp: correct signedness in skb remaining space calculation
- sched: abort __tc_modify_qdisc if parent class does not exist
- vsock: fix transport_{g2h,h2g} TOCTOU
- rxrpc: fix bug due to prealloc collision
- tipc: fix use-after-free in tipc_conn_close().
- bluetooth: fix not marking Broadcast Sink BIS as connected
- phy: qca808x: fix WoL issue by utilizing at8031_set_wol()
- eth: am65-cpsw-nuss: fix skb size by accounting for skb_shared_info
Previous releases - always broken:
- netlink: fix wraparounds of sk->sk_rmem_alloc.
- atm: fix infinite recursive call of clip_push().
- eth:
- stmmac: fix interrupt handling for level-triggered mode in DWC_XGMAC2
- rtsn: fix a null pointer dereference in rtsn_probe()"
* tag 'net-6.16-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (37 commits)
net/sched: sch_qfq: Fix null-deref in agg_dequeue
rxrpc: Fix oops due to non-existence of prealloc backlog struct
rxrpc: Fix bug due to prealloc collision
MAINTAINERS: remove myself as netronome maintainer
selftests/net: packetdrill: add tcp_ooo-before-and-after-accept.pkt
tcp: refine sk_rcvbuf increase for ooo packets
net/sched: Abort __tc_modify_qdisc if parent class does not exist
net: ethernet: ti: am65-cpsw-nuss: Fix skb size by accounting for skb_shared_info
net: thunderx: avoid direct MTU assignment after WRITE_ONCE()
selftests/tc-testing: Create test case for UAF scenario with DRR/NETEM/BLACKHOLE chain
atm: clip: Fix NULL pointer dereference in vcc_sendmsg()
atm: clip: Fix infinite recursive call of clip_push().
atm: clip: Fix memory leak of struct clip_vcc.
atm: clip: Fix potential null-ptr-deref in to_atmarpd().
net: phy: smsc: Fix link failure in forced mode with Auto-MDIX
net: phy: smsc: Force predictable MDI-X state on LAN87xx
net: phy: smsc: Fix Auto-MDIX configuration when disabled by strap
net: stmmac: Fix interrupt handling for level-triggered mode in DWC_XGMAC2
rxrpc: Fix over large frame size warning
net: airoha: Fix an error handling path in airoha_probe()
...
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"Many patches, pretty much all of them small, that accumulated while I
was on vacation.
ARM:
- Remove the last leftovers of the ill-fated FPSIMD host state
mapping at EL2 stage-1
- Fix unexpected advertisement to the guest of unimplemented S2 base
granule sizes
- Gracefully fail initialising pKVM if the interrupt controller isn't
GICv3
- Also gracefully fail initialising pKVM if the carveout allocation
fails
- Fix the computing of the minimum MMIO range required for the host
on stage-2 fault
- Fix the generation of the GICv3 Maintenance Interrupt in nested
mode
x86:
- Reject SEV{-ES} intra-host migration if one or more vCPUs are
actively being created, so as not to create a non-SEV{-ES} vCPU in
an SEV{-ES} VM
- Use a pre-allocated, per-vCPU buffer for handling de-sparsification
of vCPU masks in Hyper-V hypercalls; fixes a "stack frame too
large" issue
- Allow out-of-range/invalid Xen event channel ports when configuring
IRQ routing, to avoid dictating a specific ioctl() ordering to
userspace
- Conditionally reschedule when setting memory attributes to avoid
soft lockups when userspace converts huge swaths of memory to/from
private
- Add back MWAIT as a required feature for the MONITOR/MWAIT selftest
- Add a missing field in struct sev_data_snp_launch_start that
resulted in the guest-visible workarounds field being filled at the
wrong offset
- Skip non-canonical address when processing Hyper-V PV TLB flushes
to avoid VM-Fail on INVVPID
- Advertise supported TDX TDVMCALLs to userspace
- Pass SetupEventNotifyInterrupt arguments to userspace
- Fix TSC frequency underflow"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM: x86: avoid underflow when scaling TSC frequency
KVM: arm64: Remove kvm_arch_vcpu_run_map_fp()
KVM: arm64: Fix handling of FEAT_GTG for unimplemented granule sizes
KVM: arm64: Don't free hyp pages with pKVM on GICv2
KVM: arm64: Fix error path in init_hyp_mode()
KVM: arm64: Adjust range correctly during host stage-2 faults
KVM: arm64: nv: Fix MI line level calculation in vgic_v3_nested_update_mi()
KVM: x86/hyper-v: Skip non-canonical addresses during PV TLB flush
KVM: SVM: Add missing member in SNP_LAUNCH_START command structure
Documentation: KVM: Fix unexpected unindent warnings
KVM: selftests: Add back the missing check of MONITOR/MWAIT availability
KVM: Allow CPU to reschedule while setting per-page memory attributes
KVM: x86/xen: Allow 'out of range' event channel ports in IRQ routing table.
KVM: x86/hyper-v: Use preallocated per-vCPU buffer for de-sparsified vCPU masks
KVM: SVM: Initialize vmsa_pa in VMCB to INVALID_PAGE if VMSA page is NULL
KVM: SVM: Reject SEV{-ES} intra host migration if vCPU creation is in-flight
KVM: TDX: Report supported optional TDVMCALLs in TDX capabilities
KVM: TDX: Exit to userspace for SetupEventNotifyInterrupt
lib/smp_processor_id: Make migration check unconditional of SMP
Commit cac5cefbade90 ("sched/smp: Make SMP unconditional")
migrate_disable() even on UP builds.
Commit 06ddd17521bf1 ("sched/smp: Always define is_percpu_thread() and
scheduler_ipi()") made is_percpu_thread() check the affinity mask
instead replying always true for UP mask.
As a consequence smp_processor_id() now complains if invoked within a
migrate_disable() section because is_percpu_thread() checks its mask and
the migration check is left out.
Make migration check unconditional of SMP.
Fixes: cac5cefbade90 ("sched/smp: Make SMP unconditional") Closes: https://lore.kernel.org/oe-lkp/202507100448.6b88d6f1-lkp@intel.com Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Chen Yu <yu.c.chen@intel.com> Link: https://lore.kernel.org/r/20250710082748.-DPO1rjO@linutronix.de
Sequence 2 - MIPI_SEQ_INIT_OTP
GPIO index 9, source 0, set 0 (0x00)
Delay: 50000 us
GPIO index 9, source 0, set 1 (0x01)
Delay: 6000 us
GPIO index 9, source 0, set 0 (0x00)
Delay: 6000 us
GPIO index 9, source 0, set 1 (0x01)
Delay: 25000 us
Send DCS: Port A, VC 0, LP, Type 39, Length 5, Data ff aa 55 a5 80
Send DCS: Port A, VC 0, LP, Type 39, Length 3, Data 6f 11 00
...
Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 29
Delay: 120000 us
Sequence 4 - MIPI_SEQ_DISPLAY_OFF
Send DCS: Port A, VC 0, LP, Type 05, Length 1, Data 28
Delay: 105000 us
Send DCS: Port A, VC 0, LP, Type 05, Length 2, Data 10 00
Delay: 10000 us
Sequence 5 - MIPI_SEQ_ASSERT_RESET
Delay: 10000 us
GPIO index 9, source 0, set 0 (0x00)
Notice how there is no MIPI_SEQ_DEASSERT_RESET, instead the deassert
is done at the beginning of MIPI_SEQ_INIT_OTP, which is exactly what
the fixup from vlv_fixup_mipi_sequences() fixes up.
Extend it to also apply to v2 sequences, this fixes the panel not working
on the Acer Iconia One 8 A1-840.
wifi: mac80211: add the virtual monitor after reconfig complete
In reconfig we add the virtual monitor in 2 cases:
1. If we are resuming (it was deleted on suspend)
2. If it was added after an error but before the reconfig
(due to the last non-monitor interface removal).
In the second case, the removal of the non-monitor interface will succeed
but the addition of the virtual monitor will fail, so we add it in the
reconfig.
The problem is that we mislead the driver to think that this is an existing
interface that is getting re-added - while it is actually a completely new
interface from the drivers' point of view.
Some drivers act differently when a interface is re-added. For example, it
might not initialize things because they were already initialized.
Such drivers will - in this case - be left with a partialy initialized vif.
To fix it, add the virtual monitor after reconfig_complete, so the
driver will know that this is a completely new interface.
This is currently not initialized for a virtual monitor, leading to a
NULL pointer dereference when - for example - iterating over all the
keys of all the vifs.
Xiang Mei [Sat, 5 Jul 2025 21:21:43 +0000 (14:21 -0700)]
net/sched: sch_qfq: Fix null-deref in agg_dequeue
To prevent a potential crash in agg_dequeue (net/sched/sch_qfq.c)
when cl->qdisc->ops->peek(cl->qdisc) returns NULL, we check the return
value before using it, similar to the existing approach in sch_hfsc.c.
To avoid code duplication, the following changes are made:
1. Changed qdisc_warn_nonwc(include/net/pkt_sched.h) into a static
inline function.
2. Moved qdisc_peek_len from net/sched/sch_hfsc.c to
include/net/pkt_sched.h so that sch_qfq can reuse it.
3. Applied qdisc_peek_len in agg_dequeue to avoid crashing.
In a quick slow device, readdir() may loop for long time in large
directory, let's give a chance to allow it to be interrupted by
userspace.
Signed-off-by: Chao Yu <chao@kernel.org> Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> Link: https://lore.kernel.org/r/20250710073619.4083422-1-chao@kernel.org
[ Gao Xiang: move cond_resched() to the end of the while loop. ] Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Flush the D-cache before unlocking folios for compressed inodes, as
they are dirtied during decompression.
Avoid calling flush_dcache_folio() on every CPU write, since it's more
like playing whack-a-mole without real benefit.
It has no impact on x86 and arm64/risc-v: on x86, flush_dcache_folio()
is a no-op, and on arm64/risc-v, PG_dcache_clean (PG_arch_1) is clear
for new page cache folios. However, certain ARM boards are affected,
as reported.
erofs: fix to add missing tracepoint in erofs_read_folio()
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readpage()
tracepoint in the meantime, let's add it back.
erofs: fix to add missing tracepoint in erofs_readahead()
Commit 771c994ea51f ("erofs: convert all uncompressed cases to iomap")
converts to use iomap interface, it removed trace_erofs_readahead()
tracepoint in the meantime, let's add it back.
This commit introduces per-memcg/task NUMA balance statistics, but
unfortunately it introduced a NULL pointer exception due to the following
race condition: After a swap task candidate was chosen, its mm_struct
pointer was set to NULL due to task exit. Later, when performing the
actual task swapping, the p->mm caused the problem.
CPU0 CPU1
:
...
task_numa_migrate
task_numa_find_cpu
task_numa_compare
# a normal task p is chosen
env->best_task = p
migrate_swap_stop
__migrate_swap_task((arg->src_task, arg->dst_cpu)
count_memcg_event_mm(p->mm, NUMA_TASK_SWAP)# p->mm is NULL
task_lock() should be held and the PF_EXITING flag needs to be checked to
prevent this from happening. After discussion, the conclusion was that
adding a lock is not worthwhile for some statistics calculations. Revert
the change and rely on the tracepoint for this purpose.
Link: https://lkml.kernel.org/r/20250704135620.685752-1-yu.c.chen@intel.com Link: https://lkml.kernel.org/r/20250708064917.BBD13C4CEED@smtp.kernel.org Fixes: ad6b26b6a0a7 ("sched/numa: add statistics of numa balance task") Signed-off-by: Chen Yu <yu.c.chen@intel.com> Reported-by: Jirka Hladky <jhladky@redhat.com> Closes: https://lore.kernel.org/all/CAE4VaGBLJxpd=NeRJXpSCuw=REhC5LWJpC29kDy-Zh2ZDyzQZA@mail.gmail.com/ Reported-by: Srikanth Aithal <Srikanth.Aithal@amd.com> Reported-by: Suneeth D <Suneeth.D@amd.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Hladky <jhladky@redhat.com> Cc: Libo Chen <libo.chen@oracle.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Baolin Wang [Thu, 5 Jun 2025 12:58:29 +0000 (20:58 +0800)]
mm: fix the inaccurate memory statistics issue for users
On some large machines with a high number of CPUs running a 64K pagesize
kernel, we found that the 'RES' field is always 0 displayed by the top
command for some processes, which will cause a lot of confusion for users.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
875525 root 20 0 12480 0 0 R 0.3 0.0 0:00.08 top
1 root 20 0 172800 0 0 S 0.0 0.0 0:04.52 systemd
The main reason is that the batch size of the percpu counter is quite
large on these machines, caching a significant percpu value, since
converting mm's rss stats into percpu_counter by commit f1a7941243c1 ("mm:
convert mm's rss stats into percpu_counter"). Intuitively, the batch
number should be optimized, but on some paths, performance may take
precedence over statistical accuracy. Therefore, introducing a new
interface to add the percpu statistical count and display it to users,
which can remove the confusion. In addition, this change is not expected
to be on a performance-critical path, so the modification should be
acceptable.
In addition, the 'mm->rss_stat' is updated by using add_mm_counter() and
dec/inc_mm_counter(), which are all wrappers around
percpu_counter_add_batch(). In percpu_counter_add_batch(), there is
percpu batch caching to avoid 'fbc->lock' contention. This patch changes
task_mem() and task_statm() to get the accurate mm counters under the
'fbc->lock', but this should not exacerbate kernel 'mm->rss_stat' lock
contention due to the percpu batch caching of the mm counters. The
following test also confirm the theoretical analysis.
I run the stress-ng that stresses anon page faults in 32 threads on my 32
cores machine, while simultaneously running a script that starts 32
threads to busy-loop pread each stress-ng thread's /proc/pid/status
interface. From the following data, I did not observe any obvious impact
of this patch on the stress-ng tests.
w/o patch:
stress-ng: info: [6848] 4,399,219,085,152 CPU Cycles 67.327 B/sec
stress-ng: info: [6848] 1,616,524,844,832 Instructions 24.740 B/sec (0.367 instr. per cycle)
stress-ng: info: [6848] 39,529,792 Page Faults Total 0.605 M/sec
stress-ng: info: [6848] 39,529,792 Page Faults Minor 0.605 M/sec
w/patch:
stress-ng: info: [2485] 4,462,440,381,856 CPU Cycles 68.382 B/sec
stress-ng: info: [2485] 1,615,101,503,296 Instructions 24.750 B/sec (0.362 instr. per cycle)
stress-ng: info: [2485] 39,439,232 Page Faults Total 0.604 M/sec
stress-ng: info: [2485] 39,439,232 Page Faults Minor 0.604 M/sec
On comparing a very simple app which just allocates & touches some
memory against v6.1 (which doesn't have f1a7941243c1) and latest Linus
tree (4c06e63b9203) I can see that on latest Linus tree the values for
VmRSS, RssAnon and RssFile from /proc/self/status are all zeroes while
they do report values on v6.1 and a Linus tree with this patch.
Link: https://lkml.kernel.org/r/f4586b17f66f97c174f7fd1f8647374fdb53de1c.1749119050.git.baolin.wang@linux.alibaba.com Fixes: f1a7941243c1 ("mm: convert mm's rss stats into percpu_counter") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Aboorva Devarajan <aboorvad@linux.ibm.com> Tested-by: Aboorva Devarajan <aboorvad@linux.ibm.com>
Tested-by Donet Tom <donettom@linux.ibm.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: SeongJae Park <sj@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: David Hildenbrand <david@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Suren Baghdasaryan <surenb@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Honggyu Kim [Wed, 2 Jul 2025 00:02:04 +0000 (09:02 +0900)]
mm/damon: fix divide by zero in damon_get_intervals_score()
The current implementation allows having zero size regions with no special
reasons, but damon_get_intervals_score() gets crashed by divide by zero
when the region size is zero.
This patch fixes the bug, but does not disallow zero size regions to keep
the backward compatibility since disallowing zero size regions might be a
breaking change for some users.
In addition, the same crash can happen when intervals_goal.access_bp is
zero so this should be fixed in stable trees as well.
Link: https://lkml.kernel.org/r/20250702000205.1921-5-honggyu.kim@sk.com Fixes: f04b0fedbe71 ("mm/damon/core: implement intervals auto-tuning") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Honggyu Kim [Wed, 2 Jul 2025 00:02:02 +0000 (09:02 +0900)]
samples/damon: fix damon sample wsse for start failure
The damon_sample_wsse_start() can fail so we must reset the "enable"
parameter to "false" again for proper rollback.
In such cases, setting Y to "enable" then N triggers the similar crash
with wsse because damon sample start failed but the "enable" stays as Y.
Link: https://lkml.kernel.org/r/20250702000205.1921-3-honggyu.kim@sk.com Fixes: b757c6cfc696 ("samples/damon/wsse: start and stop DAMON as the user requests") Signed-off-by: Honggyu Kim <honggyu.kim@sk.com> Reviewed-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Illia Ostapyshyn [Sun, 29 Jun 2025 00:38:11 +0000 (02:38 +0200)]
scripts: gdb: vfs: support external dentry names
d_shortname of struct dentry only reserves D_NAME_INLINE_LEN characters
and contains garbage for longer names. Use d_name instead, which always
references the valid name.
Link: https://lore.kernel.org/all/20250525213709.878287-2-illia@yshyn.com/ Link: https://lkml.kernel.org/r/20250629003811.2420418-1-illia@yshyn.com Fixes: 79300ac805b6 ("scripts/gdb: fix dentry_name() lookup") Signed-off-by: Illia Ostapyshyn <illia@yshyn.com> Tested-by: Florian Fainelli <florian.fainelli@broadcom.com> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Jan Kara <jack@suse.cz> Cc: Jan Kiszka <jan.kiszka@siemens.com> Cc: Kieran Bingham <kbingham@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Christoph Berg [Tue, 24 Jun 2025 14:44:27 +0000 (16:44 +0200)]
mm/migrate: fix do_pages_stat in compat mode
For arrays with more than 16 entries, the old code would incorrectly
advance the pages pointer by 16 words instead of 16 compat_uptr_t. Fix by
doing the pointer arithmetic inside get_compat_pages_array where pages32
is already a correctly-typed pointer.
Discovered while working on PostgreSQL 18's new NUMA introspection code.
Link: https://lkml.kernel.org/r/aGREU0XTB48w9CwN@msg.df7cb.de Fixes: 5b1b561ba73c ("mm: simplify compat_sys_move_pages") Signed-off-by: Christoph Berg <myon@debian.org> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reported-by: Tomas Vondra <tomas@vondra.me> Closes: https://www.postgresql.org/message-id/flat/6342f601-77de-4ee0-8c2a-3deb50ceac5b%40vondra.me#86402e3d80c031788f5f55b42c459471 Cc: Alistair Popple <apopple@nvidia.com> Cc: Byungchul Park <byungchul@sk.com> Cc: Gregory Price <gourry@gourry.net> Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> Cc: Joshua Hahn <joshua.hahnjy@gmail.com> Cc: Mathew Brost <matthew.brost@intel.com> Cc: Rakie Kim <rakie.kim@sk.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
SeongJae Park [Sun, 29 Jun 2025 20:49:14 +0000 (13:49 -0700)]
mm/damon/core: handle damon_call_control as normal under kdmond deactivation
DAMON sysfs interface internally uses damon_call() to update DAMON
parameters as users requested, online. However, DAMON core cancels any
damon_call() requests when it is deactivated by DAMOS watermarks.
As a result, users cannot change DAMON parameters online while DAMON is
deactivated. Note that users can turn DAMON off and on with different
watermarks to work around. Since deactivated DAMON is nearly same to
stopped DAMON, the work around should have no big problem. Anyway, a bug
is a bug.
There is no real good reason to cancel the damon_call() request under
DAMOS deactivation. Fix it by simply handling the request as normal,
rather than cancelling under the situation.
Link: https://lkml.kernel.org/r/20250629204914.54114-1-sj@kernel.org Fixes: 42b7491af14c ("mm/damon/core: introduce damon_call()") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: <stable@vger.kernel.org> [6.14+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Lance Yang [Fri, 27 Jun 2025 06:23:19 +0000 (14:23 +0800)]
mm/rmap: fix potential out-of-bounds page table access during batched unmap
As pointed out by David[1], the batched unmap logic in
try_to_unmap_one() may read past the end of a PTE table when a large
folio's PTE mappings are not fully contained within a single page
table.
While this scenario might be rare, an issue triggerable from userspace
must be fixed regardless of its likelihood. This patch fixes the
out-of-bounds access by refactoring the logic into a new helper,
folio_unmap_pte_batch().
The new helper correctly calculates the safe batch size by capping the
scan at both the VMA and PMD boundaries. To simplify the code, it also
supports partial batching (i.e., any number of pages from 1 up to the
calculated safe maximum), as there is no strong reason to special-case
for fully mapped folios.
Link: https://lkml.kernel.org/r/20250701143100.6970-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250630011305.23754-1-lance.yang@linux.dev Link: https://lkml.kernel.org/r/20250627062319.84936-1-lance.yang@linux.dev Link: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com Fixes: 354dffd29575 ("mm: support batched unmap for lazyfree large folios during reclamation") Signed-off-by: Lance Yang <lance.yang@linux.dev> Suggested-by: David Hildenbrand <david@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Closes: https://lore.kernel.org/linux-mm/a694398c-9f03-4737-81b9-7e49c857fcbe@redhat.com Suggested-by: Barry Song <baohua@kernel.org> Acked-by: Barry Song <baohua@kernel.org> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Harry Yoo <harry.yoo@oracle.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: "Huang, Ying" <huang.ying.caritas@gmail.com> Cc: Kairui Song <kasong@tencent.com> Cc: Lance Yang <lance.yang@linux.dev> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mingzhe Yang <mingzhe.yang@ly.com> Cc: Rik van Riel <riel@surriel.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Tangquan Zheng <zhengtangquan@oppo.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Vivek Kasireddy [Thu, 26 Jun 2025 19:11:16 +0000 (12:11 -0700)]
mm/hugetlb: don't crash when allocating a folio if there are no resv
There are cases when we try to pin a folio but discover that it has not
been faulted-in. So, we try to allocate it in memfd_alloc_folio() but
there is a chance that we might encounter a fatal crash/failure
(VM_BUG_ON(!h->resv_huge_pages) in alloc_hugetlb_folio_reserve()) if there
are no active reservations at that instant. This issue was reported by
syzbot:
Therefore, prevent the above crash by removing the VM_BUG_ON() as there is
no need to crash the system in this situation and instead we could just
fail the allocation request.
Furthermore, as described above, the specific situation where this happens
is when we try to pin memfd folios before they are faulted-in. Although,
this is a valid thing to do, it is not the regular or the common use-case.
Let us consider the following scenarios:
3) is the most common use-case where first a memfd is allocated followed
by mmap(), user writes/updates and then the relevant folios are pinned
(memfd_pin_folios()). The BUG this patch is fixing occurs in 2) because
we try to pin the folios before hugetlbfs_file_mmap() is called. So, in
this situation we try to allocate the folios before pinning them but since
we did not make any reservations, resv_huge_pages would be 0, leading to
this issue.
Link: https://lkml.kernel.org/r/20250626191116.1377761-1-vivek.kasireddy@intel.com Fixes: 26a8ea80929c ("mm/hugetlb: fix memfd_pin_folios resv_huge_pages leak") Reported-by: syzbot+a504cb5bae4fe117ba94@syzkaller.appspotmail.com Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Closes: https://syzkaller.appspot.com/bug?extid=a504cb5bae4fe117ba94 Closes: https://lore.kernel.org/all/677928b5.050a0220.3b53b0.004d.GAE@google.com/T/ Acked-by: Oscar Salvador <osalvador@suse.de> Cc: Steve Sistare <steven.sistare@oracle.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: David Hildenbrand <david@redhat.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>