Linus Torvalds [Thu, 14 May 2026 21:30:01 +0000 (14:30 -0700)]
Merge tag 'hid-for-linus-2026051401' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid
Pull HID fixes from Jiri Kosina:
- fixes for a few OOB/UAF in several HID drivers (Florian Pradines, Lee
Jones, Michael Zaidman, Rosalie Wanders, Sangyun Kim and Tomasz
Pakuła)
- more general sanitation of input data, dealing with potentially
malicious hardware in hid-core (Benjamin Tissoires)
- a few device-specific quirks and fixups
* tag 'hid-for-linus-2026051401' of git://git.kernel.org/pub/scm/linux/kernel/git/hid/hid: (22 commits)
HID: logitech-hidpp: Add support for newer Bluetooth keyboards
HID: pidff: Fix integer overflow in pidff_rescale
HID: i2c-hid: add reset quirk for BLTP7853 touchpad
HID: core: introduce hid_safe_input_report()
HID: pass the buffer size to hid_report_raw_event
HID: google: hammer: stop hardware on devres action failure
HID: appletb-kbd: run inactivity autodim from workqueues
HID: appletb-kbd: fix UAF in inactivity-timer cleanup path
HID: playstation: Clamp num_touch_reports
HID: magicmouse: Prevent out-of-bounds (OOB) read during DOUBLE_REPORT_ID
HID: mcp2221: fix OOB write in mcp2221_raw_event()
HID: quirks: really enable the intended work around for appledisplay
HID: hid-sjoy: race between init and usage
HID: uclogic: Fix regression of input name assignment
HID: intel-thc-hid: Intel-quickspi: Fix some error codes
HID: hid-lenovo-go-s: restore OS_TYPE after resume from s2idle
HID: elan: Add support for ELAN SB974D touchpad
HID: sony: add missing size validation for Rock Band 3 Pro instruments
HID: sony: add missing size validation for SMK-Link remotes
HID: sony: remove unneeded WARN_ON() in sony_leds_init()
...
Linus Torvalds [Thu, 14 May 2026 21:06:31 +0000 (14:06 -0700)]
Merge tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI support fixes from Rafael Wysocki:
"These fix several platform drivers that use the ACPI companion of the
given platform device without checking its presence, which may lead to
a NULL pointer dereference or other kind of malfunction if the driver
is forced to match a device without an ACPI companion via driver
override, and restore debug log level for some messages in the ACPI
CPPC library:
- Check ACPI_COMPANION() against NULL during probe in several core
ACPI device drivers (Rafael Wysocki)
- Restore log level of messages in amd_set_max_freq_ratio() (Mario
Limonciello)"
* tag 'acpi-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: PAD: xen: Check ACPI_COMPANION() against NULL
ACPI: driver: Check ACPI_COMPANION() against NULL during probe
Revert "ACPI: CPPC: Adjust debug messages in amd_set_max_freq_ratio() to warn"
- ipv6: flowlabel: enforce per-netns limit for unprivileged callers
- tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
- smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
- sctp: revalidate list cursor after sctp_sendmsg_to_asoc() in SCTP_SENDALL
- batman-adv:
- reject new tp_meter sessions during teardown
- purge non-released claims
- eth:
- i40e: cleanup PTP registration on probe failure
- idpf: fix double free and use-after-free in aux device error paths
- ena: fix potential use-after-free in get_timestamp"
* tag 'net-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
net: phy: DP83TC811: add reading of abilities
net: tls: prevent chain-after-chain in plain text SG
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
macsec: introduce dedicated workqueue for SA crypto cleanup
net: net_failover: Fix the deadlock in slave register
MAINTAINERS: update atlantic driver maintainer
selftests/tc-testing: Add QFQ/CBS qlen underflow test
net/sched: sch_cbs: Call qdisc_reset for child qdisc
FDDI: defza: Sanitise the reset safety timer
net: ethernet: ravb: Do not check URAM suspension when WoL is active
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
net: atm: fix skb leak in sigd_send() default branch
net: ethtool: phy: avoid NULL deref when PHY driver is unbound
net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
net: shaper: reject QUEUE scope handle with missing id
...
Linus Torvalds [Thu, 14 May 2026 15:53:24 +0000 (08:53 -0700)]
Merge tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fixes from Paul Moore:
- Correctly log the inheritable capabilities
- Honor AUDIT_LOCKED in the AUDIT_TRIM and AUDIT_MAKE_EQUIV commands
* tag 'audit-pr-20260513' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
audit: fix incorrect inheritable capability in CAPSET records
Linus Torvalds [Wed, 13 May 2026 18:37:18 +0000 (11:37 -0700)]
ptrace: slightly saner 'get_dumpable()' logic
The 'dumpability' of a task is fundamentally about the memory image of
the task - the concept comes from whether it can core dump or not - and
makes no sense when you don't have an associated mm.
And almost all users do in fact use it only for the case where the task
has a mm pointer.
But we have one odd special case: ptrace_may_access() uses 'dumpable' to
check various other things entirely independently of the MM (typically
explicitly using flags like PTRACE_MODE_READ_FSCREDS). Including for
threads that no longer have a VM (and maybe never did, like most kernel
threads).
It's not what this flag was designed for, but it is what it is.
The ptrace code does check that the uid/gid matches, so you do have to
be uid-0 to see kernel thread details, but this means that the
traditional "drop capabilities" model doesn't make any difference for
this all.
Make it all make a *bit* more sense by saying that if you don't have a
MM pointer, we'll use a cached "last dumpability" flag if the thread
ever had a MM (it will be zero for kernel threads since it is never
set), and require a proper CAP_SYS_PTRACE capability to override.
Sven Schuchmann [Tue, 12 May 2026 07:19:47 +0000 (09:19 +0200)]
net: phy: DP83TC811: add reading of abilities
At this time the driver does not list any speeds it supports; for the
DP83TC811 this should be ETHTOOL_LINK_MODE_100baseT1_Full_BIT.
Add the missing call for phylib to read the abilities.
Fixes: b753a9faaf9a ("net: phy: DP83TC811: Introduce support for the DP83TC811 phy")
Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Sven Schuchmann <schuchmann@schleissheimer.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://patch.msgid.link/20260512071949.6218-1-schuchmann@schleissheimer.de
[pabeni@redhat.com: dropped revision history]
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:18 +0000 (10:49 -0700)]
net: tls: prevent chain-after-chain in plain text SG
Sashiko points out that if end = 0 (start != 0) the current
code will create a chain link to the content type right after
the wrap link. This would create a chain where the wrap link points
directly to another chain link; the scatterlist sg_next() iterator
does not recursively resolve consecutive chain links, meaning this
is illegal input to crypto.
The wrapping link is unnecessary if end = 0. end is the entry after
the last one used so end = 0 means there's nothing pushed after
the wrap:
end start i
v v v
[ ]...[ ][ d ][ d ][ d ][ d ][rsv for wrap]
Skip the wrapping in this case.
TLS 1.3 can use the "wrapping slot" for its chaining if end = 0.
This avoids the chain-after-chain.
Move the wrap chaining before marking END and chaining off the content
type; that feels like a more logical ordering to me, but should not
matter from a functional perspective.
Reported-by: Sashiko <sashiko-bot@kernel.org>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://patch.msgid.link/20260511174920.433155-3-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Mon, 11 May 2026 17:49:17 +0000 (10:49 -0700)]
net: tls: fix off-by-one in sg_chain entry count for wrapped sk_msg ring
When an sk_msg scatterlist ring wraps (sg.end < sg.start),
tls_push_record() chains the tail portion of the ring to the head
using sg_chain(). An extra entry in the sg array is reserved for
this:
struct sk_msg_sg {
	[...]
	/* The extra two elements:
	 * 1) used for chaining the front and sections when the list becomes
	 *    partitioned (e.g. end < start). The crypto APIs require the
	 *    chaining;
	 * 2) to chain tailer SG entries after the message.
	 */
	struct scatterlist data[MAX_MSG_FRAGS + 2];
};
The current code uses MAX_SKB_FRAGS + 1 as the ring size:
instead of the true last entry. This is likely due to a "race" of
the commit under Fixes landing close to
commit 031097d9e079 ("bpf: sk_msg, zap ingress queue on psock down")
Convert to ARRAY_SIZE and drop the data[start] / - start (as suggested
by Sabrina).
Reported-by: 钱一铭 <yimingqian591@gmail.com>
Fixes: 9aaaa56845a0 ("bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260511174920.433155-2-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Xiang Mei [Mon, 11 May 2026 06:21:38 +0000 (23:21 -0700)]
net/smc: reject CHID-0 ACCEPT that matches an empty ism_dev slot
On the SMC-D client, slot 0 of ini->ism_dev[]/ini->ism_chid[] is
reserved for an SMC-Dv1 device. smc_find_ism_v2_device_clnt()
populates V2 entries starting at index 1, so when no V1 device is
selected slot 0 is left in its kzalloc()'ed state with ism_dev[0] ==
NULL and ism_chid[0] == 0.
smc_v2_determine_accepted_chid() then matches the peer's CHID against
the array starting from index 0 using the CHID alone. A malicious
peer replying to a SMC-Dv2-only proposal with d1.chid == 0 matches
the empty slot, ini->ism_selected becomes 0, and the subsequent
ism_dev[0]->lgr_lock dereference in smc_conn_create() faults at
offsetof(struct smcd_dev, lgr_lock) == 0x68:
BUG: KASAN: null-ptr-deref in _raw_spin_lock_bh+0x79/0xe0
Write of size 4 at addr 0000000000000068 by task exploit/144
Call Trace:
_raw_spin_lock_bh
smc_conn_create (net/smc/smc_core.c:1997)
__smc_connect (net/smc/af_smc.c:1447)
smc_connect (net/smc/af_smc.c:1720)
__sys_connect
__x64_sys_connect
do_syscall_64
Require ism_dev[i] to be non-NULL before accepting a CHID match.
Fixes: a7c9c5f4af7f ("net/smc: CLC accept / confirm V2")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Assisted-by: Claude:claude-opus-4-7
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Link: https://patch.msgid.link/20260511062138.2839584-1-xmei5@asu.edu
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
====================
macsec: use rcu_work to fix crypto cleanup in softirq context
From: Jinliang Zheng <alexjlzheng@tencent.com>
crypto_free_aead() can internally call vunmap() (e.g. via dma_free_attrs()
in hardware crypto drivers like hisi_sec2), which must not be invoked from
softirq context. Both free_rxsa() and free_txsa() are RCU callbacks that
run in softirq, causing a kernel crash on affected hardware.
This series fixes the issue by deferring the actual cleanup to a workqueue
using rcu_work, which combines the RCU grace period and workqueue dispatch
into a single primitive.
Two design decisions worth noting:
1. rcu_work instead of schedule_work() + synchronize_rcu()
An alternative would be to call schedule_work() directly from
macsec_rxsa_put()/macsec_txsa_put(), then call synchronize_rcu() at
the start of the work handler to replace the grace period previously
provided by call_rcu(). However, synchronize_rcu() blocks the worker
thread for the duration of a full RCU grace period. Under high SA
churn (e.g. tearing down an interface with many SAs), each SA would
occupy a worker thread while waiting, and multiple concurrent calls
cannot share the same grace period — leading to unnecessary latency
and resource waste.
rcu_work uses call_rcu_hurry() internally, which is fully asynchronous:
the worker thread is only dispatched after the grace period has elapsed,
and multiple concurrent queue_rcu_work() calls naturally batch under the
same grace period via the RCU subsystem's existing coalescing mechanism.
2. Dedicated workqueue instead of system_wq
Using a dedicated workqueue (macsec_wq) allows macsec_exit() to drain
exactly the work items belonging to this module — by calling
destroy_workqueue() after rcu_barrier(). If system_wq were used,
flush_scheduled_work() would drain all pending work items across the
entire system, creating unnecessary coupling with unrelated subsystems
and potentially causing unexpected delays. The dedicated workqueue
provides a clean, contained teardown path.
====================
Jinliang Zheng [Mon, 11 May 2026 15:31:00 +0000 (23:31 +0800)]
macsec: use rcu_work to defer TX SA crypto cleanup out of softirq
free_txsa() is an RCU callback running in softirq context, but calls
crypto_free_aead() which can invoke vunmap() internally on hardware
crypto drivers (e.g. hisi_sec2), triggering a kernel crash.
Use rcu_work to defer the cleanup to a workqueue, for the same reasons
as the analogous fix to free_rxsa() in the previous patch.
Jinliang Zheng [Mon, 11 May 2026 15:30:59 +0000 (23:30 +0800)]
macsec: use rcu_work to defer RX SA crypto cleanup out of softirq
crypto_free_aead() can internally invoke vunmap() (e.g. via
dma_free_attrs() in hardware crypto drivers such as hisi_sec2).
vunmap() must not be called from softirq context, but free_rxsa()
is an RCU callback that runs in softirq, leading to a kernel crash:
Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches
the worker asynchronously after the RCU grace period, so no thread
blocks waiting, and concurrent releases of multiple SAs naturally
share the same grace period.
Jinliang Zheng [Mon, 11 May 2026 15:30:58 +0000 (23:30 +0800)]
macsec: introduce dedicated workqueue for SA crypto cleanup
Introduce a dedicated ordered workqueue, macsec_wq, which will be used
by subsequent patches to defer SA crypto cleanup (crypto_free_aead and
related teardown) out of softirq context.
Using a dedicated workqueue instead of system_wq allows macsec_exit()
to drain exactly the work items belonging to this module via
destroy_workqueue(), without interfering with unrelated work items on
system_wq or causing unexpected delays elsewhere.
rcu_barrier() in macsec_exit() ensures all in-flight rcu_work callbacks
have enqueued their work items before destroy_workqueue() drains and
destroys the queue, making the two-step teardown correct and complete.
The same sequence is kept in the error path of macsec_init() as a
precaution, to mirror macsec_exit() and stay safe if work ever becomes
queueable before this point in the future.
While at it, rename the error labels in macsec_init() from the
resource-named style (rtnl:, notifier:, wq:) to the err_xxx: style
(err_rtnl:, err_notifier:, err_destroy_wq:) to align with the broader
kernel convention.
Faicker Mo [Mon, 11 May 2026 14:05:51 +0000 (22:05 +0800)]
net: net_failover: Fix the deadlock in slave register
There is netdev_lock_ops() before the NETDEV_REGISTER notifier
in register_netdevice(), so use the non-locking functions
in net_failover_slave_register().
The failover_slave_register() call in failover_existing_slave_register()
adds the lock and unlock ops too.
Fixes: 4c975fd70002 ("net: hold instance lock during NETDEV_REGISTER/UP")
Signed-off-by: Faicker Mo <faicker.mo@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Victor Nogueira [Mon, 11 May 2026 18:30:58 +0000 (14:30 -0400)]
selftests/tc-testing: Add QFQ/CBS qlen underflow test
Since CBS was not calling reset for its child qdisc, there are scenarios
where it could cause an underflow on its parent's qlen/backlog. When the
parent is QFQ, a null-ptr deref could occur.
Add a test case that reproduces the underflow followed by a null-ptr
deref scenario.
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jamal Hadi Salim [Mon, 11 May 2026 18:30:57 +0000 (14:30 -0400)]
net/sched: sch_cbs: Call qdisc_reset for child qdisc
During a reset, CBS is not calling reset on its child qdisc, which
might cause qlen/backlog accounting issues. For example, if we have CBS
with a QFQ parent and a netem child with delay, we can create a scenario
where the parent's qlen underflows. QFQ, specifically, uses qlen to
check whether it should dereference a pointer, so this scenario may cause
a null-ptr deref in QFQ:
Fix this by calling qdisc_reset for CBS' child qdisc.
Sashiko caught an issue which could result in a null-ptr deref if
qdisc_create_dflt() is invoked on an uninitialised cbs qdisc, which is
exposed by this patch. We add an early return if the qdisc is null to
address this. This is similar to the approach used by two other
fixes [1][2].
The proper fix for this specific issue elucidated by Sashiko is to remove
the call to qdisc_reset() when qdisc_create_dflt() fails. Since the dflt
qdisc isn't attached anywhere yet at that point, calling the reset callback
doesn't make much sense (and, as stated, has been a source of two other
bugs). We plan on submitting this fix in a later patch.
[1] https://lore.kernel.org/netdev/20221018063201.306474-2-shaozhengchao@huawei.com/
[2] https://lore.kernel.org/netdev/20221018063201.306474-4-shaozhengchao@huawei.com/
Fixes: 585d763af09c ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
Reported-by: Junyoung Jang <graypanda.inzag@gmail.com>
Tested-by: Junyoung Jang <graypanda.inzag@gmail.com>
Tested-by: Victor Nogueira <victor@mojatatu.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The reset actions of the DEFZA adapters are exceedingly slow: the device
spec allows up to 30 seconds for completion, and around 10 seconds is
typical in reality, as required for the device RTOS to boot; still quite
a lot. Therefore an interrupt-driven state machine is used, but a safety
mechanism is required in case of adapter malfunction, so that the
situation is taken care of if no state change interrupt has arrived in
time.
The safety mechanism depends on the origin of the reset. For regular
adapter initialisation at the device probe time a sleep is requested.
However a reset is also required by the device spec when the adapter has
transitioned into the halted state, such as in response to a PC Trace
event in the course of ring fault recovery, possibly a common network
event. In that case no sleep is possible as a device halt is reported
at the hardirq level.
A timer is therefore set up to ensure progress in case no adapter state
change interrupt has arrived in time, but as from commit 168f6b6ffbee
("timers: Use del_timer_sync() even on UP") a warning is issued as the
timer is deleted in the hardirq handler upon an expected state change:
The immediate origin of the new warning is the switch away from aliasing
del_timer_sync() to del_timer() (timer_delete_sync() to timer_delete()
in terms of current function names) for UP configurations, which however
is the only choice for this driver anyway as no SMP hardware supports
the TURBOchannel bus this device interfaces to. Therefore there is a
very remote issue only this is a sign of.
Specifically: an adapter reset issued upon a transition to the halted
state times out and first triggers fza_reset_timer() for another reset
assertion, which then schedules fza_reset_timer() for reset deassertion.
That second call is pre-empted after poking at the hardware but before
the timer has been rearmed, and, owing to exceedingly high scheduling
latency under high system load, control is not handed back before a
transition to the uninitialised state has caused the timer to be deleted
even before it has been started. Then fza_reset_timer() will be called
yet again and issue another reset, even though by then the adapter has
already recovered.
Prevent this situation from happening by switching to timer_delete() for
the transition to the halted state and protect the code region affected
with a spinlock, also to make sure add_timer() has not been called twice
in a row due to an execution race between the interrupt handler and the
timer handler (though it could only happen on SMP, but let's keep the
driver clean). It's a very unlikely sequence of events to happen and
therefore there's no point in trying to be overly clever about it, such
as by placing printk() calls outside the protection. For the transition
to the uninitialised state switch to timer_delete_sync_try() instead, so
that a timer isn't deleted that's just been rearmed by the timer handler
and needs to watch for the device to come out of reset again (again, an
SMP scenario only).
Retain timer_delete_sync() invocations outside the hardirq context for a
stray timer not to fire once device structures have been released.
Fixes: 61414f5ec9834 ("FDDI: defza: Add support for DEC FDDIcontroller 700 TURBOchannel adapter")
Signed-off-by: Maciej W. Rozycki <macro@orcam.me.uk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Wed, 13 May 2026 22:00:40 +0000 (15:00 -0700)]
Merge tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
"The bulk of this is hardening of the new sub-scheduler infrastructure.
- UAFs and lifecycle bugs on the sub-sched attach/detach paths:
parent sub_kset freed under a racing child, list_del_rcu on an
uninitialized list head, ops->priv stomped by concurrent
attach/detach, and a UAF in the init-failure error path
- Task state-machine reorg closing concurrent enable-vs-dead races: a
task exiting during the unlocked init window could trip NULL ops
derefs or skip exit_task() cleanup
- A scx_link_sched() self-deadlock on scx_sched_lock
- isolcpus: stop dereferencing the now-RCU-protected HK_TYPE_DOMAIN
cpumask without RCU, and stop rejecting BPF schedulers when only
cpuset isolated partitions are active
- PREEMPT_RT: disable irq_work runs in hardirq context so dumps show
the failing task rather than the irq_work kthread
- Assorted !CONFIG_EXT_SUB_SCHED, randconfig, and selftest build
fixes"
* tag 'sched_ext-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
sched_ext: Fix ops->priv clobber on concurrent attach/detach
selftests/sched_ext: Fix build error in dequeue selftest
sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths
sched_ext: Close sub-sched init race with post-init DEAD recheck
sched_ext: Close root-enable vs sched_ext_dead() race with SCX_TASK_INIT_BEGIN
sched_ext: Replace SCX_TASK_OFF_TASKS flag with SCX_TASK_DEAD state
sched_ext: Inline scx_init_task() and move RESET_RUNNABLE_AT into scx_set_task_state()
sched_ext: Cleanups in preparation for the SCX_TASK_INIT_BEGIN/DEAD work
sched_ext: Use IRQ_WORK_INIT_HARD() to initialize sch->disable_irq_work
sched_ext: Fix !CONFIG_EXT_SUB_SCHED build warnings
sched_ext: Drop unused scx_find_sub_sched() stub
sched_ext: Move scx_error() out of scx_link_sched()'s lock region
Linus Torvalds [Wed, 13 May 2026 21:56:31 +0000 (14:56 -0700)]
Merge tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- cpuset fixes:
- Partition invalidation could return CPUs still in use by sibling
partitions, producing overlapping effective_cpus
- cpuset_can_attach() over-reserved DL bandwidth on moves that
stayed within the same root domain
- Pending DL migration state leaked into later attaches when a
later can_attach() check failed
- Reorder PF_EXITING and __GFP_HARDWALL checks so dying tasks can
allocate from any node and exit quickly
- dmem: propagate -ENOMEM instead of spinning forever when the fallback
pool allocation also fails
- selftests/cgroup: percpu test error-path leak, bogus numeric
comparison of cpuset strings, and a zero-length read() that silently
passed OOM-kill tests
* tag 'cgroup-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Return only actually allocated CPUs during partition invalidation
selftests/cgroup: Fix error path leaks in test_percpu_basic
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cgroup/cpuset: Reset DL migration state on can_attach() failure
selftests/cgroup: Fix string comparison in write_test
selftests/cgroup: Fix cg_read_strcmp() empty string comparison
cgroup/dmem: Return -ENOMEM on failed pool preallocation
cgroup/cpuset: move PF_EXITING check before __GFP_HARDWALL in cpuset_current_node_allowed()
Linus Torvalds [Wed, 13 May 2026 21:49:13 +0000 (14:49 -0700)]
Merge tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue fixes from Tejun Heo:
- Plug a wq->cpu_pwq leak on the WQ_UNBOUND allocation failure path
- Fix a cancel_delayed_work_sync() livelock against drain_workqueue()
caused by the drain/destroy reject path leaving WORK_STRUCT_PENDING
set with no owner
* tag 'wq-for-7.1-rc3-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path
workqueue: Release PENDING in __queue_work() drain/destroy reject path
Andrea Righi [Wed, 13 May 2026 11:24:38 +0000 (13:24 +0200)]
sched_ext: Use HK_TYPE_DOMAIN_BOOT to detect isolcpus= domain isolation
scx_enable() refuses to attach a BPF scheduler when isolcpus=domain is
in effect by comparing housekeeping_cpumask(HK_TYPE_DOMAIN) against
cpu_possible_mask.
Since commit 27c3a5967f05 ("sched/isolation: Convert housekeeping
cpumasks to rcu pointers"), HK_TYPE_DOMAIN's cpumask is RCU protected
and dereferencing it requires either RCU read lock, the cpu_hotplug
write lock, or the cpuset lock; scx_enable() holds none of these, so
booting with isolcpus=domain and attaching any BPF scheduler triggers
the following lockdep splat:
In addition, commit 03ff73510169 ("cpuset: Update HK_TYPE_DOMAIN cpumask
from cpuset") made HK_TYPE_DOMAIN include cpuset isolated partitions as
well, which means the current check also rejects BPF schedulers when a
cpuset partition is active. That contradicts the original intent of
commit 9f391f94a173 ("sched_ext: Disallow loading BPF scheduler if
isolcpus= domain isolation is in effect"), which explicitly noted that
cpuset partitions are honored through per-task cpumasks and should not
be rejected.
Switch to housekeeping_enabled(HK_TYPE_DOMAIN_BOOT), which reads only
the housekeeping flag bit (no RCU dereference) and reflects exactly the
boot-time isolcpus= configuration that the error message refers to.
where xcpus = user_xcpus(cs) which returns cs->exclusive_cpus (if set)
or cs->cpus_allowed. When exclusive_cpus is not set, user_xcpus(cs) can
contain CPUs that were never actually granted to the partition due to
sibling exclusion in compute_excpus(). Consequently, the invalidation
may return CPUs to the parent that remain in use by sibling partitions,
causing overlapping effective_cpus and triggering the
WARN_ON_ONCE(1) in generate_sched_domains().
Use cs->effective_xcpus instead, which reflects the CPUs actually
granted to this partition.
# b1 becomes partition root with CPUs 1-2, but sibling exclusion
# reduces its effective_xcpus to CPU 2 only
echo "1-2" > b1/cpuset.cpus
echo "root" > b1/cpuset.cpus.partition
Linus Torvalds [Wed, 13 May 2026 18:53:51 +0000 (11:53 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"arm64:
- Add the pKVM side of the workaround for ARM's erratum 4193714,
provided that the EL3 firmware does its part of the job. KVM will
refuse to initialise otherwise
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0
- Correctly deal with permission faults in guest_memfd memslots
- Fix the steal-time selftest after the infrastructure was reworked
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events
- Handle the current vcpu being NULL on EL2 panic
- Fix the selftest_vcpu memcache being empty at the point of donation
or sharing
- Check that the memcache has enough capacity before engaging on the
share/donate path
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context
s390:
- Fix array overrun with large amounts of PCI devices
x86:
- Never use L0's PAUSE loop exiting while L2 is running, since it's
unlikely that a nested guest will help solve the hypervisor's
spinlock contention
- Fix emulation of MOVNTDQA
- Fix typo in Xen hypercall tracepoint
- Add back an optimization that was left behind when recently fixing
a bug
- Add module parameter to disable CET, whose implementation seems to
have issues. For now it remains enabled by default
Generic:
- Reject offset causing an unsigned overflow in kvm_reset_dirty_gfn()
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (22 commits)
KVM: VMX: introduce module parameter to disable CET
KVM: x86: Swap the dst and src operand for MOVNTDQA
KVM: x86: use again the flush argument of __link_shadow_page()
KVM: selftests: Ensure gmem file sizes are multiple of host page size
Documentation: kvm: update links in the references section of AMD Memory Encryption
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
KVM: x86: Fix Xen hypercall tracepoint argument assignment
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
KVM: arm64: Pre-check vcpu memcache for host->guest donate
KVM: arm64: Pre-check vcpu memcache for host->guest share
KVM: arm64: Seed pkvm_ownership_selftest vcpu memcache
KVM: arm64: Fix __deactivate_fgt macro parameter typo
KVM: arm64: Guard against NULL vcpu on VHE hyp panic path
KVM: arm64: Make EL2 exception entry and exit context-synchronization events
MAINTAINERS: Add Steffen as reviewer for KVM/arm64
KVM: arm64: Remove potential UB on nvhe tracing clock update
KVM: selftests: arm64: Fix steal_time test after UAPI refactoring
KVM: arm64: Handle permission faults with guest_memfd
KVM: arm64: nv: Consider the DS bit when translating TCR_EL2
KVM: arm64: Work around C1-Pro erratum 4193714 for protected guests
...
Yu Miao [Wed, 13 May 2026 02:39:07 +0000 (10:39 +0800)]
selftests/cgroup: Fix error path leaks in test_percpu_basic
When cg_name_indexed() returns NULL partway through the child creation
loop, the code returned -1 without running cleanup_children and cleanup.
That left the `parent` pathname allocation unreleased and did not remove
child cgroup directories already created under the parent. Fix by jumping
to cleanup_children instead of returning.
When cg_create() fails, `child` (the pathname from cg_name_indexed())
was not freed before cleanup_children. Fix by freeing `child` before
branching to cleanup_children.
Paolo Bonzini [Tue, 12 May 2026 14:58:48 +0000 (16:58 +0200)]
KVM: VMX: introduce module parameter to disable CET
There have been reports of host hangs caused by CET virtualization.
Until these are analyzed further, introduce a module parameter that
makes it possible to easily disable it.
Niklas Söderlund [Sun, 10 May 2026 10:30:17 +0000 (12:30 +0200)]
net: ethernet: ravb: Do not check URAM suspension when WoL is active
When the driver was updated to match the latest datasheet and suspend
access to URAM while suspending DMA transfers, a corner case was missed:
URAM access will not be suspended if WoL is enabled. This led to the
error message (correctly) being triggered, as URAM access is not
suspended even though it's requested as part of stopping DMA.
Avoid checking if URAM access is suspended and printing the error
message if WoL is enabled when we suspend the system, as we know it will
not be.
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org> Closes: https://lore.kernel.org/all/CAMuHMdWnjV%3DHGE1o08zLhUfTgOSene5fYx1J5GG10mB%2BToq8qg@mail.gmail.com/ Fixes: 353d8e7989b6 ("net: ethernet: ravb: Suspend and resume the transmission flow") Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Reviewed-by: Sai Krishna <saikrishnag@marvell.com> Tested-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Chenguang Zhao [Mon, 11 May 2026 01:43:43 +0000 (09:43 +0800)]
ethtool: fix ethnl_bitmap32_not_zero() bit interval semantics
ethnl_bitmap32_not_zero() should return true if some bit in [start, end)
is set:
- Fix inverted memchr_inv() sense: return true when the scan finds a
non-zero byte, not when the middle words are all zero.
- Return false for an empty interval (end <= start).
- When end is 32-bit aligned, indices in [start, end) do not include any
bits from map[end_word]; return false after earlier checks found no
non-zero data.
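The intended semantics can be pinned down with a naive reference implementation. This bit-by-bit loop is an illustrative model of the semantics described above (not the word-wise ethnl code), which the fixed implementation must agree with:

```c
#include <stdbool.h>
#include <stdint.h>

/* Reference semantics: true iff any bit with index in [start, end)
 * is set in the u32 bitmap. Returns false for an empty interval. */
static bool bitmap32_any_set(const uint32_t *map, unsigned int start,
                             unsigned int end)
{
    if (end <= start)
        return false;
    for (unsigned int i = start; i < end; i++)
        if (map[i / 32] & (1u << (i % 32)))
            return true;
    return false;
}
```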
Xiang Mei [Sun, 10 May 2026 22:26:40 +0000 (15:26 -0700)]
net/smc: avoid NULL deref of conn->lnk in smc_msg_event tracepoint
The smc_msg_event tracepoint class, shared by smc_tx_sendmsg and
smc_rx_recvmsg, unconditionally dereferences smc->conn.lnk:
__string(name, smc->conn.lnk->ibname)
conn->lnk is only set for SMC-R; for SMC-D it is NULL. Other code on
these paths already handles this (e.g. !conn->lnk in
SMC_STAT_RMB_TX_SIZE_SMALL()). With the tracepoint enabled, the first
sendmsg()/recvmsg() on an SMC-D socket crashes:
Oops: general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [...]
RIP: 0010:strlen+0x1e/0xa0
Call Trace:
trace_event_raw_event_smc_msg_event (net/smc/smc_tracepoint.h:44)
smc_rx_recvmsg (net/smc/smc_rx.c:515)
smc_recvmsg (net/smc/af_smc.c:2859)
__sys_recvfrom (net/socket.c:2315)
__x64_sys_recvfrom (net/socket.c:2326)
do_syscall_64
The faulting address 0x3e0 is offsetof(struct smc_link, ibname),
confirming the NULL ->lnk deref. Enabling the tracepoint requires
root, but the trigger itself is unprivileged: socket(AF_SMC, ...) has
no capability check, and SMC-D negotiation needs no admin step on
s390 or on x86 with the loopback ISM device loaded.
Log an empty device name for SMC-D instead of dereferencing NULL.
Fixes: aff3083f10bf ("net/smc: Introduce tracepoints for tx and rx msg") Reported-by: Weiming Shi <bestswngs@gmail.com> Signed-off-by: Xiang Mei <xmei5@asu.edu> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Reviewed-by: Sidraya Jayagond <sidraya@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nicolò Coccia [Sun, 10 May 2026 16:34:13 +0000 (12:34 -0400)]
net/smc: fix sleep-inside-lock in __smc_setsockopt() causing local DoS
A logic flaw in __smc_setsockopt() allows a local unprivileged user to
cause a Denial of Service (DoS) by holding the socket lock indefinitely.
The function __smc_setsockopt() calls copy_from_sockptr() while holding
lock_sock(sk). By passing a userfaultfd-monitored memory page (or
FUSE-backed memory on systems where unprivileged userfaultfd is disabled)
as the optval, an attacker can halt execution during the copy operation,
keeping the lock held.
Combined with asynchronous tear-down operations like shutdown(), this
exhausts the kernel wq (kworkers) and triggers the hung task watchdog.
[ 240.123456] INFO: task kworker/u8:2 blocked for more than 120 seconds.
[ 240.123489] Call Trace:
[ 240.123501] smc_shutdown+...
[ 240.123512] lock_sock_nested+...
This patch moves the user-space copy outside the lock_sock() critical
section to prevent the issue.
Fixes: a6a6fe27bab4 ("net/smc: Dynamic control handshake limitation by socket options") Signed-off-by: Nicolò Coccia <n.coccia96@gmail.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
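The shape of the fix is the classic copy-before-lock pattern. A userland model (the lock and the user copy are simulated with a flag and memcpy; none of the names below are the real SMC code):

```c
#include <string.h>

/* Model of the fix: the user copy happens before lock_sock(), so a
 * fault during the copy can no longer stall other tasks that need the
 * socket lock. lock_held stands in for the socket lock. */
static int lock_held;

static int copy_from_user_model(int *dst, const int *src)
{
    memcpy(dst, src, sizeof(*dst));  /* may fault/sleep in the kernel */
    return 0;
}

static int setsockopt_fixed(const int *optval, int *stored)
{
    int val;

    /* copy first, while the socket lock is NOT held */
    if (copy_from_user_model(&val, optval))
        return -14;                  /* -EFAULT */

    lock_held = 1;                   /* lock_sock(sk) */
    *stored = val;                   /* short, non-sleeping section */
    lock_held = 0;                   /* release_sock(sk) */
    return 0;
}
```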
Wei Yang [Sat, 9 May 2026 12:23:58 +0000 (20:23 +0800)]
net: atm: fix skb leak in sigd_send() default branch
The default branch in sigd_send() calls sock_put() and returns -EINVAL
without freeing the skb, while all other exit paths do so. Add the
missing dev_kfree_skb() before sock_put() to fix the leak.
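The invariant the fix restores is that every exit path releases the skb exactly once before dropping the socket reference. Modeled with counters (all names are stand-ins, not the atm code):

```c
/* Counters stand in for skb frees and sock_put() calls. */
static int skbs_freed, socks_put;

static void fake_kfree_skb(void) { skbs_freed++; }
static void fake_sock_put(void)  { socks_put++; }

static int sigd_send_model(int msg_type)
{
    switch (msg_type) {
    case 0:                      /* a recognized message type */
        fake_kfree_skb();
        fake_sock_put();
        return 0;
    default:                     /* unknown type */
        fake_kfree_skb();        /* the free the fix adds */
        fake_sock_put();
        return -22;              /* -EINVAL */
    }
}
```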
phy_remove() clears phydev->drv but doesn't call phy_detach(), so the
phy_device stays in the link topology xarray and ethnl_req_get_phydev()
still hands it back. ETHTOOL_MSG_PHY_GET then oopses dereferencing the
NULL phydev->drv.
drvname is already treated as optional by phy_reply_size(),
phy_fill_reply() and phy_cleanup_data(), so just skip the allocation
when there is no driver bound.
Fixes: 9dd2ad5e92b9 ("net: ethtool: phy: Convert the PHY_GET command to generic phy dump") Cc: stable@vger.kernel.org # 6.13.x Signed-off-by: David Carlier <devnexen@gmail.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20260509215046.107157-1-devnexen@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zoran Ilievski [Mon, 11 May 2026 06:40:02 +0000 (08:40 +0200)]
net: atlantic: preserve PCI wake-from-D3 on shutdown when WOL enabled
The shutdown handler aq_pci_shutdown() unconditionally calls
pci_wake_from_d3(pdev, false), clearing the PCI PME_En bit even when
wake-on-LAN has been configured. While aq_nic_shutdown() correctly
programs the NIC firmware via aq_nic_set_power() to listen for magic
packets, the PCI subsystem will not propagate the resulting PME wake
event from D3, so the system never wakes after poweroff.
WOL from suspend (S3) is unaffected because aq_suspend_common() does
not touch pci_wake_from_d3() and relies on the PM core's wake
configuration via device_may_wakeup().
This affects all atlantic-supported NICs (AQC107/108/111/112/113);
users have reported that WOL works if the atlantic driver is never
loaded, but breaks once it has run its shutdown path.
Pass the configured WOL state to pci_wake_from_d3() instead of a
literal false, so the PCI PME_En bit is preserved when the user has
armed WOL via ethtool.
Tejun Heo [Mon, 11 May 2026 23:18:23 +0000 (13:18 -1000)]
sched_ext: Defer sub_kset base put to scx_sched_free_rcu_work
scx_sub_enable_workfn() pins parent->kobj before dropping scx_sched_lock,
but that does not pin parent->sub_kset. Concurrent disable can
kset_unregister and free sub_kset before scx_alloc_and_add_sched()
dereferences it.
Split sub_kset teardown: kobject_del() at disable keeps sysfs removal; defer
kobject_put() to scx_sched_free_rcu_work so the memory survives. A racing
child sees state_in_sysfs=0 with valid memory, sysfs_create_dir() fails, and
the existing exit_kind gate in scx_link_sched() turns it away with -ENOENT.
Tejun Heo [Mon, 11 May 2026 23:18:19 +0000 (13:18 -1000)]
sched_ext: INIT_LIST_HEAD() &sch->all in scx_alloc_and_add_sched()
On scx_link_sched() error paths (parent disabled, hash insert failure),
&sch->all is never added to scx_sched_all. The cleanup path runs
scx_unlink_sched() unconditionally, which calls list_del_rcu(&sch->all) on a
list_head that was never initialized, triggering a corruption warning.
Initialize &sch->all.
Fixes: 54be8de4236a ("sched_ext: Factor out scx_link_sched() and scx_unlink_sched()") Signed-off-by: Tejun Heo <tj@kernel.org>
Tejun Heo [Tue, 12 May 2026 20:30:00 +0000 (10:30 -1000)]
sched_ext: Drop NONE early return in scx_disable_and_exit_task()
d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from
paths") skipped the trailing scx_set_task_sched(p, NULL) on NONE tasks.
After scx_fail_parent() parks a task at NONE/sched=parent and the parent
is later freed via queue_rcu_work() during root_disable, the preserved
p->scx.sched dangles - print_scx_info() from sched_show_task() reads
sch->ops.name from freed memory.
Drop the early return. __scx_disable_and_exit_task() already short-
circuits on NONE and the SUB_INIT block was cleared by
scx_fail_parent()'s earlier call, so clearing p->scx.sched is the only
work left - and the one thing the path actually needs.
v2: Extend the SUB_INIT block comment to note that the flag is only
set on the sub-enable path, so it's always clear on the NONE
re-entry (Andrea).
Fixes: d3e73a0808dd ("sched_ext: Handle SCX_TASK_NONE in disable/switched_from paths") Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andrea Righi <arighi@nvidia.com>
KVM: x86: Swap the dst and src operand for MOVNTDQA
Swap the MOVNTDQA operands, as MOVNTDQA does NOT in fact have "the same
characteristics as 0F E7 (MOVNTDQ)"; MOVNTDQA loads from memory and stores
to registers, while MOVNTDQ loads from registers and stores to memory.
Per the SDM:
MOVNTDQ - Move packed integer values in xmm1 to m128 using non-temporal
hint.
MOVNTDQA - Move double quadword from m128 to xmm1 using non-temporal hint
if WC memory type.
Reported-by: Josh Eads <josheads@google.com> Fixes: c57d9bafbd0b ("KVM: x86: Add support for emulating MOVNTDQA") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260506213514.2781948-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Sun, 3 May 2026 21:09:17 +0000 (23:09 +0200)]
KVM: x86: use again the flush argument of __link_shadow_page()
Except in the case of parentless nested-TDP pages, mmu_page_zap_pte()
clears the SPTE but leaves the invalid_list empty. In this case, using
kvm_flush_remote_tlbs() as kvm_mmu_remote_flush_or_zap() does is overkill.
Avoid flushing the entirety of the remote TLBs unless the invalid_list
was populated: instead, use a more efficient gfn-targeting flush (if
available) and skip it altogether if the caller guarantees that a TLB
flush is not necessary.
Based-on: <20260503201029.106481-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-ID: <20260503210917.121840-1-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: Ensure gmem file sizes are multiple of host page size
When creating a guest_memfd file and associated memslot to validate shared
guest memory, size the file+memslot to the maximum of the host or guest
page size. Attempting to allocate a single guest page will fail if the
host page size is greater than the guest page size, as KVM requires that
the size of memslots and guest_memfd files are a multiple of the host page
size.
For simplicity, verify the entire file can be shared between guest and host,
e.g. instead of trying to validate "partial" mappings.
Fixes: 42188667be38 ("KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory") Reported-by: Zenghui Yu <zenghui.yu@linux.dev> Closes: https://lore.kernel.org/all/0064952b-048c-455d-ad89-e27e5cb82591@linux.dev Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20260512155634.772602-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 12 May 2026 20:19:20 +0000 (22:19 +0200)]
Merge tag 'kvmarm-fixes-7.1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
KVM/arm64 fixes for 7.1, take #2
- Add the pKVM side of the workaround for ARM's erratum 4193714, provided
that the EL3 firmware does its part of the job. KVM will refuse to
initialise otherwise.
- Correctly handle 52bit VAs for guest EL2 stage-1 translations when
running under NV with E2H==0.
- Correctly deal with permission faults in guest_memfd memslots.
- Fix the steal-time selftest after the infrastructure was reworked.
- Make sure the host cannot pass a non-sensical clock update to the
EL2 tracing infrastructure.
- Appoint Steffen Eiden as a reviewer in anticipation of the KVM/s390
ability to run arm64 guests, which will inevitably lead to arm64
code being directly used on s390.
- Make sure that EL2 is configured with both exception entry and exit
being Context Synchronization Events.
- Handle the current vcpu being NULL on EL2 panic.
- Fix the selftest_vcpu memcache being empty at the point of donation or
sharing.
- Check that the memcache has enough capacity before engaging on the
share/donate path.
- Fix __deactivate_fgt() to use its parameter rather than a variable
in the macro context.
KVM: nSVM: Never use L0's PAUSE loop exiting while L2 is running
Never use L0's (KVM's) PAUSE loop exiting controls while L2 is running,
and instead always configure vmcb02 according to L1's exact capabilities
and desires.
The purpose of intercepting PAUSE after N attempts is to detect when the
vCPU may be stuck waiting on a lock, so that KVM can schedule in a
different vCPU that may be holding said lock. Barring a very interesting
setup, L1 and L2 do not share locks, and it's extremely unlikely that an
L1 vCPU would hold a spinlock while running L2. I.e. having a vCPU
executing in L1 yield to a vCPU running in L2 will not allow the L1 vCPU
to make forward progress, and vice versa.
While teaching KVM's "on spin" logic to only yield to other vCPUs in L2 is
doable, in all likelihood it would do more harm than good for most setups.
KVM has limited visibility into which L2 "vCPUs" belong to the same VM,
and thus share a locking domain. And even if L2 vCPUs are in the same
VM, KVM has no visibility into L2 vCPUs that are scheduled out by the
L1 hypervisor.
Furthermore, KVM doesn't actually steal PAUSE exits from L1. If L1 is
intercepting PAUSE, KVM will route PAUSE exits to L1, not L0, as
nested_svm_intercept() gives priority to the vmcb12 intercept. As such,
overriding the count/threshold fields in vmcb02 with vmcb01's values is
nonsensical, as doing so clobbers all the training/learning that has been
done in L1.
Even worse, if L1 is not intercepting PAUSE, i.e. KVM is handling PAUSE
exits, then KVM will adjust the PLE knobs based on L2 behavior, which could
very well be detrimental to L1, e.g. due to essentially poisoning L1 PLE
training with bad data.
And copying the count from vmcb02 to vmcb01 on a nested VM-Exit makes even
less sense, because again, the purpose of PLE is to detect spinning vCPUs.
Whether or not a vCPU is spinning in L2 at the time of a nested VM-Exit
has no relevance as to the behavior of the vCPU when it executes in L1.
The only scenarios where any of this actually works is if at least one
of KVM or L1 is NOT intercepting PAUSE for the guest. Per the original
changelog, those were the only scenarios considered to be supported.
Disabling KVM's use of PLE makes it so the VM is always in a "supported"
mode.
Last, but certainly not least, using KVM's count/threshold instead of the
values provided by L1 is a blatant violation of the SVM architecture.
Fixes: 74fd41ed16fd ("KVM: x86: nSVM: support PAUSE filtering when L0 doesn't intercept PAUSE") Cc: Maxim Levitsky <mlevitsk@redhat.com> Tested-by: David Kaplan <david.kaplan@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Link: https://patch.msgid.link/20260508213321.373309-1-seanjc@google.com/ Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Aaron Sacks [Tue, 12 May 2026 06:07:42 +0000 (02:07 -0400)]
KVM: Reject wrapped offset in kvm_reset_dirty_gfn()
kvm_reset_dirty_gfn() guards the gfn range with
if (!memslot || (offset + __fls(mask)) >= memslot->npages)
return;
but offset is u64 and the addition is unchecked. The check can be
silently bypassed by a u64 wrap.
The dirty ring backing those entries is MAP_SHARED at
KVM_DIRTY_LOG_PAGE_OFFSET of the vcpu fd, so the VMM can rewrite the
slot and offset fields of any entry between when the kernel pushes
them and when KVM_RESET_DIRTY_RINGS consumes them. On reset,
kvm_dirty_ring_reset() re-reads the values via READ_ONCE() and feeds
them straight back into this check; only the flags handshake is
treated as the handover, the slot/offset payload is taken on trust.
A crafted offset such as 0xffffffffffffffc1, adjacent to an entry with
offset 0, makes the coalescing loop in kvm_dirty_ring_reset() compute
delta = (s64)(0 - 0xffffffffffffffc1) = 63
which falls in [0, BITS_PER_LONG), so it folds entry[i+1] into the
existing mask by setting bit 63. The trailing kvm_reset_dirty_gfn()
call then sees offset = 0xffffffffffffffc1 and __fls(mask) = 63;
the sum is 0 in u64 and the bounds check passes.
That offset propagates into kvm_arch_mmu_enable_log_dirty_pt_masked()
unchanged. On the legacy MMU path -- kvm_memslots_have_rmaps() ==
true, i.e. shadow paging, any VM that has allocated shadow roots, or
a write-tracked slot -- it reaches gfn_to_rmap(), which indexes
slot->arch.rmap[0][] with a near-U64_MAX gfn. That is an
out-of-bounds load of a kvm_rmap_head, followed by a conditional
clear of PT_WRITABLE_MASK in whatever the loaded pointer points at.
The path is reachable from any process holding /dev/kvm.
Range-check offset on its own first, so the addition cannot wrap.
memslot->npages is bounded well below U64_MAX, so once offset <
npages holds, offset + __fls(mask) (with __fls(mask) < BITS_PER_LONG)
stays in range.
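The hardening pattern here is ordering-dependent bounds checking: validate the raw offset on its own before forming the sum. A standalone model of the check, with kvm types replaced by stdint ones:

```c
#include <stdbool.h>
#include <stdint.h>

/* Model of the hardened check: reject offset before computing
 * offset + fls_bit, so the u64 addition cannot wrap. Since npages is
 * bounded far below UINT64_MAX and fls_bit < 64, once offset < npages
 * holds, the sum cannot overflow. */
static bool gfn_range_ok(uint64_t offset, unsigned int fls_bit,
                         uint64_t npages)
{
    if (offset >= npages)           /* rejects wrapped offsets too */
        return false;
    return offset + fls_bit < npages;
}
```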
Sergio Correia [Tue, 12 May 2026 13:28:59 +0000 (14:28 +0100)]
audit: enforce AUDIT_LOCKED for AUDIT_TRIM and AUDIT_MAKE_EQUIV
AUDIT_ADD_RULE and AUDIT_DEL_RULE correctly check for AUDIT_LOCKED
and return -EPERM, but AUDIT_TRIM and AUDIT_MAKE_EQUIV do not. This
allows a process with CAP_AUDIT_CONTROL to modify directory tree
watches and equivalence mappings even when the audit configuration
has been locked, undermining the purpose of the lock.
Add AUDIT_LOCKED checks to both commands.
Cc: stable@vger.kernel.org Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Sergio Correia [Tue, 12 May 2026 13:28:33 +0000 (14:28 +0100)]
audit: fix incorrect inheritable capability in CAPSET records
__audit_log_capset() records the effective capability set into the
inheritable field due to a copy-paste error. Every CAPSET audit
record therefore reports cap_pi (process inheritable) with the value
of cap_effective instead of cap_inheritable.
This silently corrupts audit data used for compliance and forensic
analysis: an attacker who modifies inheritable capabilities to
prepare for a privilege-escalating exec would have the change masked
in the audit trail.
The bug has been present since the original introduction of CAPSET
audit records in 2008.
Cc: stable@vger.kernel.org Fixes: e68b75a027bb ("When the capset syscall is used it is not possible for audit to record the actual capbilities being added/removed. This patch adds a new record type which emits the target pid and the eff, inh, and perm cap sets.") Reviewed-by: Ricardo Robaina <rrobaina@redhat.com> Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Sergio Correia <scorreia@redhat.com> Signed-off-by: Paul Moore <paul@paul-moore.com>
Linus Torvalds [Tue, 12 May 2026 17:18:02 +0000 (10:18 -0700)]
Merge tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull probes fixes from Masami Hiramatsu:
- kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
Since ftrace adds its NOPs in the .kprobes.text section (which stores
an array), a wrong entry is added when loading a module that uses the
"__kprobes" attribute.
To solve this, add "notrace" to __kprobes functions
- test_kprobes: clear kprobes between test runs
Clear all kprobes in the test program after running a test set,
because the KUnit test can run several times
- fprobe: Fix unregister_fprobe() to wait for RCU grace period
Since the fprobe data structure is removed with hlist_del_rcu(), it
should wait for the RCU grace period. If the caller waits for RCU, we
can use the async variant (e.g. eBPF)
* tag 'probes-fixes-v7.1-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
fprobe: Fix unregister_fprobe() to wait for RCU grace period
test_kprobes: clear kprobes between test runs
kprobes: skip non-symbol addresses in kprobe_add_ksym_blacklist()
ACPI: PAD: xen: Check ACPI_COMPANION() against NULL
Every platform driver can be forced to match a device that doesn't match
its list of device IDs because of device_match_driver_override(), so
platform drivers that rely on the existence of a device's ACPI companion
object need to verify its presence.
Accordingly, add a requisite ACPI_COMPANION() check against NULL to the
Xen variant of the ACPI processor aggregator device (PAD) driver.
Tomasz Pakuła [Sun, 10 May 2026 12:23:52 +0000 (14:23 +0200)]
HID: pidff: Fix integer overflow in pidff_rescale
Rescaling values close to the max (U16_MAX) temporarily creates values
that exceed the s32 range. This caused a value overflow when, for
example, a periodic effect phase was higher than 180 degrees. In turn,
the rescale function could return values outside the logical range of
the HID field.
Fix by using a 64-bit signed integer to store the value during the
calculation while still returning a 32-bit integer.
Closes: https://github.com/JacKeTUs/universal-pidff/issues/116 Fixes: 224ee88fe395 ("Input: add force feedback driver for PID devices") Cc: stable@vger.kernel.org Signed-off-by: Tomasz Pakuła <tomasz.pakula.oficjalny@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
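The arithmetic of the fix can be sketched as widening the intermediate product to s64 and narrowing only at the end. The formula below is an assumption modeled on a generic linear rescale, not the driver's literal code:

```c
#include <stdint.h>

/* Map i in [0, 0xffff] onto [min, max]. The product i * (max - min)
 * can approach 2^32 and overflow s32; computing it in s64 and casting
 * only at the end keeps the result inside the field's logical range. */
static int32_t rescale_s64(int32_t i, int32_t min, int32_t max)
{
    return (int32_t)(min + (int64_t)i * (max - min) / 0xffff);
}
```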
Xu Rao [Sat, 9 May 2026 08:21:32 +0000 (16:21 +0800)]
HID: i2c-hid: add reset quirk for BLTP7853 touchpad
The BLTP7853 I2C HID touchpad may fail to probe after reboot or
reprobe because reset completion is not signalled to the host. The
driver then waits for the reset-complete interrupt until it times out
and the device probe fails:
i2c_hid i2c-BLTP7853:00: failed to reset device.
i2c_hid i2c-BLTP7853:00: can't add hid device: -61
i2c_hid: probe of i2c-BLTP7853:00 failed with error -61
Add I2C_HID_QUIRK_NO_IRQ_AFTER_RESET for the device so i2c-hid does
not wait for a reset interrupt that may never arrive.
hid_input_report() is used in too many places to have a commit that
doesn't cross subsystem borders. Instead of changing the API, introduce
a new one where it matters, in the transport layers:
- usbhid
- i2chid
This effectively reverts to the old behavior for those two transport
layers.
commit 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing
bogus memset()") enforced the provided data to be at least the size of
the declared buffer in the report descriptor to prevent a buffer
overflow. However, we can try to be smarter by providing both the buffer
size and the data size, meaning that hid_report_raw_event() can make a
better decision on whether to plainly reject the buffer (a buffer
overflow attempt) or to safely memset it to 0 and pass it to the
rest of the stack.
Fixes: 0a3fe972a7cb ("HID: core: Mitigate potential OOB by removing bogus memset()") Cc: stable@vger.kernel.org Signed-off-by: Benjamin Tissoires <bentiss@kernel.org> Acked-by: Johan Hovold <johan@kernel.org> Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Myeonghun Pak [Fri, 24 Apr 2026 12:50:41 +0000 (21:50 +0900)]
HID: google: hammer: stop hardware on devres action failure
hammer_probe() starts the HID hardware before registering the devres
action that stops it. If devm_add_action() fails, probe returns an
error with the hardware still started because the cleanup action was
never registered and the driver's remove callback is not called after a
failed probe.
Use devm_add_action_or_reset() so the stop action runs immediately on
registration failure while preserving the existing devres-managed cleanup
path for later probe failures and remove.
Signed-off-by: Myeonghun Pak <mhun512@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
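The reason devm_add_action_or_reset() closes the gap: on registration failure it invokes the action immediately, so the hardware is stopped even though remove() never runs after a failed probe. A userland model of that contract (the devm machinery is simulated; names are stand-ins):

```c
#include <stdbool.h>

static int hw_started;

static void stop_hw(void *data) { (void)data; hw_started = 0; }

/* Models devm_add_action_or_reset(): if registration fails, run the
 * cleanup action right away instead of leaking the started state. */
static int add_action_or_reset(bool fail, void (*action)(void *),
                               void *data)
{
    if (fail) {
        action(data);   /* run cleanup immediately, as the helper does */
        return -12;     /* -ENOMEM */
    }
    return 0;           /* action registered for devres teardown */
}

static int probe_model(bool devres_fails)
{
    hw_started = 1;                       /* hid_hw_start() */
    return add_action_or_reset(devres_fails, stop_hw, 0);
}
```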
Sangyun Kim [Mon, 20 Apr 2026 05:13:18 +0000 (14:13 +0900)]
HID: appletb-kbd: run inactivity autodim from workqueues
The autodim code in hid-appletb-kbd takes backlight_device->ops_lock
via backlight_device_set_brightness() -> mutex_lock() from two
different atomic contexts:
* appletb_inactivity_timer() is a struct timer_list callback, so it
runs in softirq context. Every expiry triggers
BUG: sleeping function called from invalid context at kernel/locking/mutex.c:591
Call Trace:
<IRQ>
__might_resched
__mutex_lock
backlight_device_set_brightness
appletb_inactivity_timer
call_timer_fn
run_timer_softirq
* reset_inactivity_timer() is called from appletb_kbd_hid_event() and
appletb_kbd_inp_event(). On real USB hardware these run in
softirq/IRQ context (URB completion and input-event dispatch).
When the Touch Bar has already been dimmed or turned off, the
reset path calls backlight_device_set_brightness() directly to
restore brightness, producing the same warning.
Both call sites hit the same mutex_lock()-from-atomic bug. Fix them
together by moving the blocking work onto the system workqueue:
* Convert the inactivity timer from struct timer_list to
struct delayed_work; the callback (appletb_inactivity_work) now
runs in process context where mutex_lock() is legal.
* Add a dedicated struct work_struct restore_brightness_work and have
reset_inactivity_timer() schedule it instead of calling
backlight_device_set_brightness() directly.
Cancel both works synchronously during driver tear-down alongside the
existing backlight reference drop.
The semantics are unchanged (same delays, same state transitions on
dim, turn-off and user activity); only the execution context of the
sleeping call changes. The timer field and callback are renamed to
match their new type; reset_inactivity_timer() keeps its name because
it is invoked from input event paths that read naturally as "reset
the inactivity timer".
Fixes: 93a0fc489481 ("HID: hid-appletb-kbd: add support for automatic brightness control while using the touchbar") Cc: stable@vger.kernel.org Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Sangyun Kim [Mon, 20 Apr 2026 05:13:17 +0000 (14:13 +0900)]
HID: appletb-kbd: fix UAF in inactivity-timer cleanup path
Commit 38224c472a03 ("HID: appletb-kbd: fix slab use-after-free bug in
appletb_kbd_probe") added timer_delete_sync(&kbd->inactivity_timer) to
both the probe close_hw error path and appletb_kbd_remove(), but the
way it was wired in left the inactivity timer reachable during driver
tear-down via two distinct windows.
Window A -- put_device() before timer_delete_sync():
The inactivity_timer softirq reads kbd->backlight_dev and calls
backlight_device_set_brightness() -> mutex_lock(&ops_lock). If a
concurrent hid_appletb_bl unbind drops the last devm reference
between these two calls, the backlight_device is freed and the
mutex_lock() touches freed memory.
Window B -- backlight cleanup before hid_hw_stop():
if (kbd->backlight_dev) {
timer_delete_sync(...);
put_device(...);
}
hid_hw_close(hdev);
hid_hw_stop(hdev);
Even after Window A is closed, hid_hw_close()/hid_hw_stop() still run
afterwards, so a late ".event" callback from the HID core (USB URB
completion on real Apple hardware) can arrive after
timer_delete_sync() drained the softirq but before put_device() drops
the reference. That callback reaches reset_inactivity_timer(), which
calls mod_timer() and re-arms the timer. The freshly re-armed timer
can then fire on the about-to-be-freed backlight_device.
Both windows produce the same KASAN slab-use-after-free:
BUG: KASAN: slab-use-after-free in __mutex_lock+0x1aab/0x21c0
Read of size 8 at addr ffff88803ee9a108 by task swapper/0/0
Call Trace:
<IRQ>
__mutex_lock
backlight_device_set_brightness
appletb_inactivity_timer
call_timer_fn
run_timer_softirq
handle_softirqs
Allocated by task N:
devm_backlight_device_register
appletb_bl_probe
Freed by task M:
(concurrent hid_appletb_bl unbind path)
Close both windows at once by reworking the tear-down in
appletb_kbd_remove() and in the probe close_hw error path so that
1) hid_hw_close()/hid_hw_stop() run before the backlight cleanup,
guaranteeing no further .event callback can fire and re-arm the
timer, and
2) inside the "if (kbd->backlight_dev)" block, timer_delete_sync()
runs before put_device(), so the softirq is drained before the
final reference is dropped.
Fixes: 38224c472a03 ("HID: appletb-kbd: fix slab use-after-free bug in appletb_kbd_probe") Cc: stable@vger.kernel.org Signed-off-by: Sangyun Kim <sangyun.kim@snu.ac.kr> Signed-off-by: Jiri Kosina <jkosina@suse.com>
T.J. Mercier [Fri, 17 Apr 2026 15:47:02 +0000 (08:47 -0700)]
HID: playstation: Clamp num_touch_reports
A device would never lie about the number of touch reports, would it?
If it does, the loop in dualshock4_parse_report will read off the end of
the touch_reports array, up to about 2 KiB for the maximum of 256
loop iterations. The data that is read is emitted via evdev if the
DS4_TOUCH_POINT_INACTIVE bit happens to be set. Protect against this by
clamping the num_touch_reports value provided by the device to the
maximum size of the touch_reports array.
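The fix reduces to bounding a device-supplied count by the capacity of the local array before looping. A sketch of that clamp; the array size of 3 is illustrative, not the driver's actual layout:

```c
/* Never trust a device-supplied element count: clamp it to the
 * capacity of the destination array before iterating. */
#define TOUCH_REPORTS_MAX 3  /* illustrative capacity */

static unsigned int clamp_num_touch_reports(unsigned int num)
{
    return num > TOUCH_REPORTS_MAX ? TOUCH_REPORTS_MAX : num;
}
```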
Lee Jones [Thu, 16 Apr 2026 13:16:54 +0000 (14:16 +0100)]
HID: magicmouse: Prevent out-of-bounds (OOB) read during DOUBLE_REPORT_ID
It is currently possible for a malicious or misconfigured USB device to
cause an out-of-bounds (OOB) read when submitting reports using
DOUBLE_REPORT_ID by specifying a large report length and providing a
smaller one.
Let's prevent that by comparing the specified report length with the
actual size of the data read in from userspace. If the actual data
length ends up being smaller than specified, we'll politely warn the
user and prevent any further processing.
Signed-off-by: Lee Jones <lee@kernel.org> Reviewed-by: Günther Noack <gnoack@google.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
HID: mcp2221: fix OOB write in mcp2221_raw_event()
mcp2221_raw_event() copies device-supplied data into mcp->rxbuf at
offset rxbuf_idx without checking that the copy fits within the
destination buffer. A device responding with up to 60 bytes to a
small I2C/SMBus read can overflow the buffer.
Add a rxbuf_size field to struct mcp2221, set it alongside rxbuf in
mcp_i2c_smbus_read(), and check rxbuf_idx + data[3] <= rxbuf_size
before the memcpy.
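The shape of the fix: record the destination capacity next to the buffer pointer and refuse any device-supplied chunk that would run past it. The field names below mirror the description above, but the struct itself is an illustrative model:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct rx {
    unsigned char *buf;
    size_t idx;     /* models rxbuf_idx  */
    size_t size;    /* models rxbuf_size */
};

/* Reject the copy unless idx + len <= size; written to avoid any
 * overflow in the comparison itself. */
static bool rx_append(struct rx *rx, const unsigned char *data, size_t len)
{
    if (len > rx->size || rx->idx > rx->size - len)
        return false;                   /* would overflow the buffer */
    memcpy(rx->buf + rx->idx, data, len);
    rx->idx += len;
    return true;
}
```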
Lukas Bulwahn [Thu, 5 Feb 2026 08:11:31 +0000 (09:11 +0100)]
HID: quirks: really enable the intended work around for appledisplay
Commit c7fabe4ad921 ("HID: quirks: work around VID/PID conflict for
appledisplay") intends to add a quirk for kernels built with Apple Cinema
Display support, but it refers to the non-existing config option
CONFIG_APPLEDISPLAY, whereas the config option for Apple Cinema Display
support is named CONFIG_USB_APPLEDISPLAY.
Refer to the intended config option CONFIG_USB_APPLEDISPLAY in the ifdef
directive.
Fixes: c7fabe4ad921 ("HID: quirks: work around VID/PID conflict for appledisplay") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@redhat.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Oliver Neukum [Tue, 3 Mar 2026 09:48:54 +0000 (10:48 +0100)]
HID: hid-sjoy: race between init and usage
The driver uses an initial IO to set the device to a default
state. That initialization is currently being done after the device
node has been created. That means that the single buffer used
for output can be altered while IO is in progress.
Move the initialization before announcing the device to user space.
Fixes: fac733f029251 ("HID: force feedback support for SmartJoy PLUS PS2/USB adapter") Signed-off-by: Oliver Neukum <oneukum@suse.com> Signed-off-by: Jiri Kosina <jkosina@suse.com>
Paolo Abeni [Tue, 12 May 2026 14:15:03 +0000 (16:15 +0200)]
Merge branch 'net-shaper-fix-various-minor-bugs'
Jakub Kicinski says:
====================
net: shaper: fix various minor bugs
Fix various minor bugs in the net shaper API.
First 2 patches deal with ordering issues around inserting
and publishing new shapers. Shapers are inserted "tentatively"
and marked valid only after the HW op succeeds; this used to
be slightly racy.
Only other patch of note is patch 8. We want to add a Netlink
policy check on the handle ID. This necessitates patch 7.
Jakub Kicinski [Sun, 10 May 2026 19:29:04 +0000 (12:29 -0700)]
net: shaper: reject QUEUE scope handle with missing id
net_shaper_parse_handle() does not enforce that the user provides
the handle ID. For NODE the ID defaults to UNSPEC; for all other
cases it defaults to 0.
For NETDEV, 0 is the only option. For QUEUE, defaulting to 0 makes
less intuitive sense, specifically because the behavior should
(IMHO) be the same for all cases where there may be more than
one ID (QUEUE and NODE).
We should either document this as intentional or reject.
I picked the latter with no strong conviction.
Jakub Kicinski [Sun, 10 May 2026 19:29:03 +0000 (12:29 -0700)]
net: shaper: enforce singleton NETDEV scope with id 0
The NETDEV scope represents a singleton root shaper in the per-device
hierarchy. All code assumes NETDEV shapers have id 0:
net_shaper_default_parent() hardcodes parent->id = 0 when returning
the NETDEV parent for QUEUE/NODE children, and the UAPI documentation
describes NETDEV scope as "the main shaper" (singular, not plural).
net_shaper_parse_handle() reads the user-supplied handle ID via
nla_get_u32(), accepting the full u32 range. However, the xarray key
is built by net_shaper_handle_to_index() using
FIELD_PREP(NET_SHAPER_ID_MASK, handle->id), where NET_SHAPER_ID_MASK
is GENMASK(25, 0) - only 26 bits wide. FIELD_PREP silently masks off
the upper bits at runtime. A user-supplied NODE id like 0x04000123
becomes id 0x123.
Additionally, a user-supplied id equal to NET_SHAPER_ID_UNSPEC
(0x03FFFFFF, which is NET_SHAPER_ID_MASK itself) would collide with
the sentinel used internally by the group operation to signal
"allocate a new NODE id".
Reject user-supplied IDs >= NET_SHAPER_ID_MASK (i.e., >= 0x03FFFFFF)
in the policy.
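The truncation and the policy bound can be demonstrated with a small standalone model. The mask value matches GENMASK(25, 0) from the commit message; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* GENMASK(25, 0): the xarray key only has 26 bits for the id. */
#define NET_SHAPER_ID_MASK ((1u << 26) - 1)

/* What FIELD_PREP(NET_SHAPER_ID_MASK, id) silently keeps. */
static uint32_t masked_id(uint32_t user_id)
{
	return user_id & NET_SHAPER_ID_MASK;
}

/* The policy fix: reject ids that would lose bits, and the mask
 * value itself, which collides with NET_SHAPER_ID_UNSPEC. */
static int id_is_valid(uint32_t user_id)
{
	return user_id < NET_SHAPER_ID_MASK;
}
```

With the check in place, 0x04000123 is rejected up front rather than being silently folded into id 0x123.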
Jakub Kicinski [Sun, 10 May 2026 19:29:01 +0000 (12:29 -0700)]
tools: ynl: add scope qualifier for definitions
Using definitions in kernel policies is awkward right now.
On one hand we want defines for max values and such.
On the other we don't have a way of adding kernel-only defines.
Adding unnecessary defines to uAPI is a bad idea, we won't
be able to delete them. And when it comes to policy user
space should just query it via the policy dump, not use
hard coded defines.
Add a "scope" property to definitions, which will let us tell
the codegen that a definition is for kernel use only. Support the
following values:
- uapi: render into the uAPI header (default, today's behavior)
- kernel: render to kernel header only
- user: same as kernel but for the user-side generated header
Definitions may have a header property (the definition is "external",
provided by an existing header). Extend the scope to headers, too.
If a definition has both scope and header properties, we will only
generate the includes in the right scope.
Jakub Kicinski [Sun, 10 May 2026 19:29:00 +0000 (12:29 -0700)]
net: shaper: fix undersized reply skb allocation in GROUP command
net_shaper_group_send_reply() writes both the NET_SHAPER_A_IFINDEX
attribute (via net_shaper_fill_binding()) and the nested
NET_SHAPER_A_HANDLE attribute (via net_shaper_fill_handle()), but
the reply skb at the call site in net_shaper_nl_group_doit() is
allocated using net_shaper_handle_size(), which only accounts for
the nested handle.
The allocation is therefore short by nla_total_size(sizeof(u32))
(8 bytes) for the IFINDEX attribute. In practice the slab allocator
rounds up the small allocation so the bug is latent, but the size
accounting is wrong and could bite if the reply grew further.
Introduce net_shaper_group_reply_size() that accounts for the full
reply payload and use it both at the genlmsg_new() call site and in
the defensive WARN_ONCE message.
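The size accounting can be sketched with a rough model of netlink attribute sizing. The exact layout of the nested handle attribute is an assumption here; the point is only that the old size omitted the 8 bytes for the IFINDEX attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Rough model of netlink sizing: each attribute is a 4-byte
 * header plus the payload, padded to a 4-byte boundary. */
#define NLA_ALIGN_MODEL(len) (((len) + 3u) & ~3u)

static size_t nla_total_size_model(size_t payload)
{
	return NLA_ALIGN_MODEL(4 + payload);
}

/* Old sizing: only the nested handle (assumed here to carry a
 * u32 scope and a u32 id inside one nest). */
static size_t handle_size(void)
{
	return nla_total_size_model(nla_total_size_model(4) +
				    nla_total_size_model(4));
}

/* The fix: also account for the NET_SHAPER_A_IFINDEX u32. */
static size_t group_reply_size(void)
{
	return nla_total_size_model(4) + handle_size();
}
```

The difference between the two sizes is exactly the nla_total_size(sizeof(u32)) == 8 bytes the commit message describes.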
Jakub Kicinski [Sun, 10 May 2026 19:28:57 +0000 (12:28 -0700)]
net: shaper: reject duplicate leaves in GROUP request
net_shaper_nl_group_doit() does not deduplicate NET_SHAPER_A_LEAVES
entries. When userspace supplies the same leaf handle twice, the same
old-parent pointer lands twice in old_nodes[]. The cleanup loop double
frees the parent. Of course the same parent may still be in old_nodes[]
twice if we are moving multiple of its leaves.
Note that this patch also implicitly fixes the fact that the
i >= leaves_count path forgets to set ret.
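A deduplication pass over the leaves array can be sketched as follows; the struct and function names are invented for this model, but the idea matches the fix: reject a request with repeated handles before any parent pointers land in old_nodes[].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal handle: scope plus id, compared field-wise. */
struct handle_model {
	unsigned int scope;
	unsigned int id;
};

/* O(n^2) scan is fine for the small leaves arrays involved. */
static int leaves_have_duplicates(const struct handle_model *leaves,
				  size_t count)
{
	for (size_t i = 0; i < count; i++)
		for (size_t j = i + 1; j < count; j++)
			if (leaves[i].scope == leaves[j].scope &&
			    leaves[i].id == leaves[j].id)
				return 1;
	return 0;
}
```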
Jakub Kicinski [Sun, 10 May 2026 19:28:55 +0000 (12:28 -0700)]
net: shaper: flip the polarity of the valid flag
The usual way of inserting entries which are not yet fully ready
into an XArray is to have a VALID flag. The shaper code has a NOT_VALID
flag. Since the XArray code does not let us create entries with marks
already set, the creation of entries is currently not atomic.
Flip the polarity of the VALID flag. This closes the tiny race
in net_shaper_pre_insert() of entries being created without
the NOT_VALID flag.
net: ethernet: cs89x0: remove stale CONFIG_MACH_MX31ADS reference
The legacy ARM board file for MACH_MX31ADS was removed in commit
c93197b0041d ("ARM: imx: Remove i.MX31 board files"), but a reference
to it remained in the cs89x0 driver. Drop this unused code.
Linus Walleij [Fri, 8 May 2026 22:13:38 +0000 (00:13 +0200)]
net: ethernet: cortina: Carry over frag counter
The gmac_rx() NAPI poll function assembles packets in an
SKB from a ring buffer.
If the ring buffer gets completely emptied during a poll cycle,
we exit gmac_rx() even though the packet is not yet completely
assembled in the SKB, yet the fragment counter frag_nr is
reset to zero on the next invocation.
Solve this by making the RX fragment counter a part of the
port struct, and carry it over between invocations.
Reset the fragment counter only right after calling
napi_gro_frags(), on error (after calling napi_free_frags())
or if stopping the port.
Reset it in some places where it is not strictly necessary, just
to emphasize what is going on.
This was found by Sashiko during normal patch review.
Linus Walleij [Fri, 8 May 2026 22:13:37 +0000 (00:13 +0200)]
net: ethernet: cortina: Make RX SKB per-port
The SKB used to assemble packets from fragments in gmac_rx()
is static local, but the Gemini has two ethernet ports, meaning
there can be races between the ports on a bad day if a device
is using both.
Make the RX SKB a per-port variable and carry it over between
invocations in the port struct instead.
Zero the pointer once we call napi_gro_frags(), on error (after
calling napi_free_frags()) or if the port is stopped.
Zero it in some places where it is not strictly necessary, just
to emphasize what is going on.
This was found by Sashiko during normal patch review.
====================
vsock/virtio: fix vsockmon tap skb construction
While reviewing the patch posted by Yiqi Sun [1] to fix an issue in
virtio_transport_build_skb(), I discovered another issue related to
the offset and length of the payload to be copied in the new skb.
This was introduced when we did the skb conversion, and fixed by
patch 1.
Patch 2 fixes the issue found by Yiqi Sun in a different way: using
iov_iter_kvec() to properly initialize all the iov_iter fields and
removing the linear vs non-linear split like we already do in
vhost-vsock.
It could have been a single patch, but since there were two affected
commits, I decided to keep the fixes separate.
vsock/virtio: fix empty payload in tap skb for non-linear buffers
For non-linear skbs, virtio_transport_build_skb() goes through
virtio_transport_copy_nonlinear_skb() to copy the original payload
in the new skb to be delivered to the vsockmon tap device.
This manually initializes an iov_iter but does not set iov_iter.count.
Since the iov_iter is zero-initialized, the copy length is zero and no
payload is actually copied to the monitor interface, leaving the
data uninitialized.
Fix this by removing the linear vs non-linear split and using
skb_copy_datagram_iter() with iov_iter_kvec() for all cases, as
vhost-vsock already does. This handles both linear and non-linear skbs,
properly initializes the iov_iter, and removes the now unused
virtio_transport_copy_nonlinear_skb().
While touching this code, let's also check the return value of
skb_copy_datagram_iter(), even though it's unlikely to fail.
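Why the payload came out empty can be shown with a toy model of the iterator: a copy bounded by a zero count moves nothing, and the init helper is what sets the count. This is a deliberately simplified stand-in for iov_iter, not the kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model: only the count field matters for this bug. */
struct iter_model {
	size_t count;
};

/* Copies are bounded by the iterator's count. */
static size_t copy_to_iter_model(struct iter_model *it, size_t len)
{
	return len < it->count ? len : it->count;
}

/* What iov_iter_kvec() takes care of: setting count from the
 * total length, which the manual initialization forgot. */
static void iter_init_model(struct iter_model *it, size_t len)
{
	it->count = len;
}
```

A zero-initialized iterator yields a zero-length copy; going through the init helper yields the full payload.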
Fixes: 4b0bf10eb077 ("vsock/virtio: non-linear skb handling for tap") Reported-by: Yiqi Sun <sunyiqixm@gmail.com> Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-3-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
vsock/virtio: fix length and offset in tap skb for split packets
virtio_transport_build_skb() builds a new skb to be delivered to the
vsockmon tap device. To build the new skb, it uses the original skb
data length as payload length, but as the comment notes, the original
packet stored in the skb may have been split in multiple packets, so we
need to use the length in the header, which is correctly updated before
the packet is delivered to the tap, and the offset for the data.
This is similar to what we did before commit 71dc9ec9ac7d
("virtio/vsock: replace virtio_vsock_pkt with sk_buff"); we probably
missed something during the skb conversion.
Also update the comment above, which was left stale by the skb
conversion and still mentioned a buffer pointer that no longer exists.
Fixes: 71dc9ec9ac7d ("virtio/vsock: replace virtio_vsock_pkt with sk_buff") Signed-off-by: Stefano Garzarella <sgarzare@redhat.com> Reviewed-by: Bobby Eshleman <bobbyeshleman@meta.com> Reviewed-by: Arseniy Krasnov <avkrasnov@rulkc.org> Link: https://patch.msgid.link/20260508164411.261440-2-sgarzare@redhat.com Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Quan Sun [Fri, 8 May 2026 12:46:36 +0000 (20:46 +0800)]
net: hsr: fix NULL pointer dereference in hsr_get_node_data()
In the HSR (High-availability Seamless Redundancy) protocol, node
information is maintained in the node_db. When a supervision frame is
received, node->addr_B_port is updated to track the receiving port type
(e.g., HSR_PT_SLAVE_B).
If the underlying physical interface associated with this slave port is
removed (e.g., via `ip link del`), hsr_del_port() frees the hsr_port
object. However, the stale node->addr_B_port reference is kept in the
node_db until the node ages out.
Subsequently, if userspace queries the node status via the Netlink
command HSR_C_GET_NODE_STATUS, the kernel calls hsr_get_node_data().
This function unconditionally dereferences the pointer returned by
hsr_port_get_hsr():
	if (node->addr_B_port != HSR_PT_NONE) {
		port = hsr_port_get_hsr(hsr, node->addr_B_port);
		*addr_b_ifindex = port->dev->ifindex; // <-- NULL deref
	}
If the slave port has been deleted, hsr_port_get_hsr() returns NULL,
resulting in a kernel panic.
Oops: general protection fault, probably for non-canonical address
KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
RIP: 0010:hsr_get_node_data+0x7b6/0x9e0
Call Trace:
<TASK>
hsr_get_node_status+0x445/0xa40
Fix this by adding a proper NULL pointer check. If the port lookup fails
due to a stale port type, gracefully treat it as if no valid port exists
and assign -1 to the interface index.
Steps to reproduce:
1. Create an HSR interface with two slave devices.
2. Receive a supervision frame to populate node_db with
addr_B_port assigned to SLAVE_B.
3. Delete the underlying slave device B.
4. Send an HSR_C_GET_NODE_STATUS Netlink message.
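The shape of the fix can be sketched with stub types standing in for the kernel structures (the stand-ins below are illustrative, not the real definitions): a failed lookup is treated the same as "no valid port".

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types named in the oops. */
struct net_device {
	int ifindex;
};

struct hsr_port {
	struct net_device *dev;
};

/* Sketch of the fix: when the port lookup returned NULL because
 * addr_B_port went stale after the slave was deleted, report -1
 * instead of dereferencing the NULL pointer. */
static int addr_b_ifindex_safe(const struct hsr_port *port)
{
	return port ? port->dev->ifindex : -1;
}
```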
qed: fix division by zero in qed_init_wfq_param when all vports are configured
In qed_init_wfq_param(), variable non_requested_count can become zero
when the number of vports with the configured flag set (including the
current vport being configured) equals total num_vports. This happens
when configuring the last unconfigured vport or when re-configuring
an already configured vport.
The function then calculates left_rate_per_vp = total_left_rate /
non_requested_count, which causes division by zero.
Fix this by skipping the division when non_requested_count is zero.
In that case, there is no remaining bandwidth to distribute, so just
record the configuration for the current vport and return success.
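The guard is simple enough to state directly; the variable names follow the commit message and the helper is invented for illustration.

```c
#include <assert.h>

/* Sketch of the fix: when every vport already has a configured
 * rate, non_requested_count is 0 and there is no remainder to
 * spread, so skip the division instead of crashing on it. */
static unsigned int left_rate_per_vp(unsigned int total_left_rate,
				     unsigned int non_requested_count)
{
	if (!non_requested_count)
		return 0;	/* nothing left to distribute */
	return total_left_rate / non_requested_count;
}
```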
Davide Caratti [Fri, 8 May 2026 17:05:10 +0000 (19:05 +0200)]
net/sched: dualpi2: initialize timer earlier in dualpi2_init()
'pi2_timer' needs to be initialized in all error paths of dualpi2_init():
otherwise, a failure in qdisc_create_dflt() causes the following crash in
dualpi2_destroy():
net: ena: PHC: Check return code before setting timestamp output
ena_phc_gettimex64() is setting the output parameter regardless
of whether ena_com_phc_get_timestamp() succeeded or failed.
When ena_com_phc_get_timestamp() returns an error, the timestamp
parameter may contain uninitialized stack memory (e.g., when PHC is
disabled or in blocked state) or invalid hardware values. Passing
these to userspace via the PTP ioctl is both a security issue
(information leak) and a correctness bug.
Fix by checking the return code after releasing the lock and only
setting the output timestamp on success.
Fixes: e0ea34158ee8 ("net: ena: Add PHC support in the ENA driver") Cc: stable@vger.kernel.org Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20260507003518.22554-1-akiyano@amazon.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net/rds: reset op_nents when zerocopy page pin fails
When iov_iter_get_pages2() fails in rds_message_zcopy_from_user(),
the pinned pages are released with put_page(), and
rm->data.op_mmp_znotifier is cleared. But we fail to properly
clear rm->data.op_nents.
Later when rds_message_purge() is called from rds_sendmsg() the
cleanup loop iterates over the incorrectly non-zero number of
op_nents and frees them again.
Fix this by properly resetting op_nents in
rds_message_zcopy_from_user() when the page pin fails.
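The double free can be illustrated with a toy accounting model: if the error path releases the pages but leaves the count in place, the later purge walks the same entries again. The struct and helpers below are invented for the model.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: op_nents drives the purge loop, frees counts how
 * many entries have been released in total. */
struct data_model {
	size_t op_nents;
	size_t frees;
};

/* Error path in rds_message_zcopy_from_user(): releases the
 * pinned pages itself, so it must also zero op_nents (the fix). */
static void error_path(struct data_model *d)
{
	d->frees += d->op_nents;	/* put_page() on everything pinned */
	d->op_nents = 0;		/* leave nothing for the purge loop */
}

/* rds_message_purge() cleanup loop, driven by op_nents. */
static void purge(struct data_model *d)
{
	d->frees += d->op_nents;
	d->op_nents = 0;
}
```

With the reset in place, running the error path and then the purge releases each entry exactly once.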
Linus Torvalds [Mon, 11 May 2026 22:38:49 +0000 (15:38 -0700)]
Merge tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah Khan:
"Fix to decouple KUNIT_DEBUGFS and KUNIT_ALL_TESTS options and fix
KUNIT_DEBUGFS dependencies so it depends on DEBUG_FS without which it
will not be useful"
* tag 'linux_kselftest-kunit-fixes-7.1-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: config: KUNIT_DEBUGFS should depend on DEBUG_FS
kunit: config: Enable KUNIT_DEBUGFS by default
Tejun Heo [Mon, 11 May 2026 22:05:48 +0000 (12:05 -1000)]
sched_ext: Avoid UAF in scx_root_enable_workfn() init failure path
In scx_root_enable_workfn(), put_task_struct(p) is called before scx_error()
dereferences p->comm and p->pid. If the iterator's reference is the last
drop, the task is freed synchronously and the deref becomes a UAF.
Guopeng Zhang [Sat, 9 May 2026 10:20:31 +0000 (18:20 +0800)]
cgroup/cpuset: Reserve DL bandwidth only for root-domain moves
cpuset_can_attach() currently adds the bandwidth of all migrating
SCHED_DEADLINE tasks to sum_migrate_dl_bw. If the source and destination
cpuset effective CPU masks do not overlap, the whole sum is then
reserved in the destination root domain.
set_cpus_allowed_dl(), however, subtracts bandwidth from the source
root domain only when the affinity change really moves the task between
root domains. A DL task can move between cpusets that are still in the
same root domain, so including that task in sum_migrate_dl_bw can reserve
destination bandwidth without a matching source-side subtraction.
Share the root-domain move test with set_cpus_allowed_dl(). Keep
nr_migrate_dl_tasks counting all migrating deadline tasks for cpuset DL
task accounting, but add to sum_migrate_dl_bw only for tasks that need a
root-domain bandwidth move. Keep using the destination cpuset effective
CPU mask and leave the broader can_attach()/attach() transaction model
unchanged.
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Juri Lelli <juri.lelli@redhat.com> Tested-by: Juri Lelli <juri.lelli@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org>
ACPI: driver: Check ACPI_COMPANION() against NULL during probe
Since every platform driver can be forced to match a device that doesn't
match its list of device IDs because of device_match_driver_override(),
platform drivers that rely on the existence of a device's ACPI companion
object should verify its presence.
Accordingly, add requisite ACPI_COMPANION() or ACPI_HANDLE() checks
against NULL to 13 platform drivers handling core ACPI devices.
Also change the value returned by the ACPI thermal zone driver when
the device's ACPI companion is not present to -ENODEV for consistency
with the other drivers.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Reviewed-by: Hans de Goede <johannes.goede@oss.qualcomm.com> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Link: https://patch.msgid.link/4516068.ejJDZkT8p0@rafael.j.wysocki Cc: 7.0+ <stable@vger.kernel.org> # 7.0+
Jann Horn [Mon, 11 May 2026 15:55:11 +0000 (08:55 -0700)]
exit: prevent preemption of oopsing TASK_DEAD task
When an already-exiting task oopses, make_task_dead() currently calls
do_task_dead() with preemption enabled. That is forbidden:
do_task_dead() calls __schedule(), which has a comment saying "WARNING:
must be called with preemption disabled!".
If an oopsing task is preempted in do_task_dead(), between becoming
TASK_DEAD and entering the scheduler explicitly, bad things happen:
finish_task_switch() assumes that once the scheduler has switched away
from a TASK_DEAD task, the task can never run again and its stack is no
longer needed; but that assumption apparently doesn't hold if the dead
task was preempted (the SM_PREEMPT case).
This means that the scheduler ends up repeatedly dropping references on
the dead task's stack, which can lead to use-after-free or double-free
of the entire task stack; in other words, two tasks can end up running
on the same stack, resulting in various kinds of memory corruption.
(This does not just affect "recursively oopsing" tasks; it is enough
to oops once during task exit, for example in a
file_operations::release handler.)
Fixes: 7f80a2fd7db9 ("exit: Stop poorly open coding do_task_dead in make_task_dead") Cc: stable@kernel.org Signed-off-by: Jann Horn <jannh@google.com> Acked-by: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fprobe: Fix unregister_fprobe() to wait for RCU grace period
Commit 4346ba1604093 ("fprobe: Rewrite fprobe on function-graph tracer")
changed fprobe to register struct fprobe to an rcu-hlist, but it forgot
to wait for RCU GP. Thus there can be use-after-free if the fprobe is
released right after unregistering. This can happen with the fprobe
event and the sample module code.
To fix this issue, add synchronize_rcu() in unregister_fprobe().
Note that BPF is OK because fprobe is used as a part of
bpf_kprobe_multi_link. This unregisters its fprobe in
bpf_kprobe_multi_link_release() and it is deallocated via
bpf_kprobe_multi_link_dealloc(), which is invoked from
bpf_link_defer_dealloc_rcu_gp() RCU callback.
For BPF, this also introduced unregister_fprobe_async() which does
NOT wait for the RCU grace period.
Andrea Righi [Mon, 11 May 2026 08:31:30 +0000 (10:31 +0200)]
sched_ext: Clear ops->priv on scx_alloc_and_add_sched() error paths
scx_alloc_and_add_sched() can fail after @sch has been assigned to
ops->priv. In those cases @sch is torn down (either via kfree() through
the err_free_* chain or via kobject_put() -> scx_kobj_release() -> RCU
work), but @ops->priv is left pointing at the about-to-be-freed pointer.
With the recent -EBUSY gate in scx_root_enable_workfn() and
scx_sub_enable_workfn() that rejects an attach when @ops->priv is still
non-NULL, see commit bbf30b383cf6 ("sched_ext: Fix ops->priv clobber on
concurrent attach/detach"), a dangling @ops->priv permanently locks the
kdata out: every future attach attempt sees a stale binding and returns
-EBUSY even though no scheduler is actually attached.
Clear @ops->priv on the post-assign failure paths so that the kdata
returns to its pre-attach state when the function returns ERR_PTR().
Guopeng Zhang [Sat, 9 May 2026 10:20:30 +0000 (18:20 +0800)]
cgroup/cpuset: Reset DL migration state on can_attach() failure
cpuset_can_attach() accumulates temporary SCHED_DEADLINE migration
state in the destination cpuset while walking the taskset.
If a later task_can_attach() or security_task_setscheduler() check
fails, cgroup_migrate_execute() treats cpuset as the failing subsystem
and does not call cpuset_cancel_attach() for it. The partially
accumulated state is then left behind and can be consumed by a later
attach, corrupting cpuset DL task accounting and pending DL bandwidth
accounting.
Reset the pending DL migration state from the common error exit when
ret is non-zero. Successful can_attach() keeps the state for
cpuset_attach() or cpuset_cancel_attach().
Fixes: 2ef269ef1ac0 ("cgroup/cpuset: Free DL BW in case can_attach() fails") Cc: stable@vger.kernel.org # v6.10+ Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn> Signed-off-by: Tejun Heo <tj@kernel.org> Reviewed-by: Chen Ridong <chenridong@huaweicloud.com> Reviewed-by: Waiman Long <longman@redhat.com>
Andrea Righi [Mon, 11 May 2026 06:18:12 +0000 (08:18 +0200)]
sched_ext: Fix ops->priv clobber on concurrent attach/detach
Under heavy concurrent attach/detach operations, scx_claim_exit() can
trigger a NULL pointer dereference. This can be reproduced running the
reload_loop kselftests inside a virtme-ng session:
T1 acquires scx_enable_mutex inside scx_root_disable()'s mutex_unlock
window and starts a fresh attach on the same kdata, assigning sch_a800
to @ops->priv. T2 then continues out of scx_disable()/flush_disable_work
and clobbers @ops->priv to NULL, leaking sch_a800; the bpf_link is gone
but state stays SCX_ENABLED, so all future attaches fail with -EBUSY
permanently. The next bpf_scx_unreg() on that kdata then reads NULL
@ops->priv and dereferences it in scx_claim_exit().
Make @ops->priv the lifecycle binding: in scx_root_enable_workfn() and
scx_sub_enable_workfn(), after the existing state check and still under
scx_enable_mutex, refuse with -EBUSY if @ops->priv is non-NULL. This
rejects an attempt to reuse a kdata that is still bound to a previous
scheduler instance, closing the race without changing the unreg side.
Andrea Righi [Sun, 10 May 2026 17:52:11 +0000 (19:52 +0200)]
selftests/sched_ext: Fix build error in dequeue selftest
Building the dequeue selftest with newer compilers (e.g., gcc 16)
triggers the following error:
dequeue.c:28:22: error: variable 'sum' set but not used
The 'volatile' qualifier prevents the writes from being optimized
away, but does not silence the warning: 'sum' is indeed only written
and never read.
Consume 'sum' via an empty asm() with a register input constraint. This
forces the compiler to keep the accumulated value (preserving the CPU
stress loop) and avoids the build error.
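The pattern looks roughly like this (a compiler-specific GNU extended-asm sketch, not the selftest source; the function and iteration count are illustrative):

```c
#include <assert.h>

/* The empty asm has a register input constraint on sum, so the
 * compiler must materialize the accumulated value (keeping the
 * stress loop alive) while emitting no actual instructions. */
static unsigned long stress_loop(unsigned long iters)
{
	unsigned long sum = 0;

	for (unsigned long i = 0; i < iters; i++)
		sum += i;

	__asm__ __volatile__("" : : "r"(sum));	/* consume sum */
	return sum;
}
```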
Fixes: 658ad2259b3e ("selftests/sched_ext: Add test to validate ops.dequeue() semantics") Signed-off-by: Andrea Righi <arighi@nvidia.com> Signed-off-by: Tejun Heo <tj@kernel.org>