Jiayuan Chen [Tue, 26 May 2026 02:55:29 +0000 (10:55 +0800)]
net/sched: cls_bpf: prevent unbounded recursion in offload rollback
Quan Sun reported [1] a stack overflow in cls_bpf_offload_cmd().
Reproducer on netdevsim: add a skip_sw cls_bpf filter, set the
bpf_tc_accept debugfs knob to 0, then `tc filter replace`. The replace
calls tc_setup_cb_replace() which fails. cls_bpf_offload_cmd() then
swaps prog/oldprog and recursively calls itself to roll back. But
bpf_tc_accept=0 makes the rollback fail too, which triggers yet another
rollback frame with the same arguments, and so on until the stack is
exhausted.
bpf_tc_accept is just a convenient knob for the reproducer. Any driver
whose tc_setup_cb_replace() fails twice in a row can hit the same loop,
so this is not a netdevsim-only issue.
Two ways to fix it:
1) Have the rollback call tc_setup_cb_add() on oldprog instead of
re-entering cls_bpf_offload_cmd().
2) Mark the rollback frame with a flag and skip a second-level
rollback from inside it.
Go with (2). It is the smaller change and keeps the original behaviour:
the rollback still goes through tc_setup_cb_replace(), so the driver
gets one real chance to restore its state. If that attempt also fails,
we just return the original error instead of recursing.
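
The guard chosen in option (2) can be illustrated in plain userspace C. This is only a sketch under stated assumptions, not the actual patch: fake_driver, fake_replace(), offload_cmd() and the is_rollback parameter are hypothetical names, and the driver callback is simulated by a counter.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a driver reached via tc_setup_cb_replace().
 * While `accept` is false every request fails, which mimics the
 * bpf_tc_accept=0 reproducer. */
struct fake_driver {
	bool accept;
	int calls;	/* how many times the driver was asked */
	int last_prog;	/* program id of the most recent request */
};

static int fake_replace(struct fake_driver *drv, int prog, int oldprog)
{
	(void)oldprog;
	drv->calls++;
	drv->last_prog = prog;
	return drv->accept ? 0 : -EOPNOTSUPP;
}

/* Option (2): the rollback re-enters the same function with prog and
 * oldprog swapped, but the frame is flagged so that a failing rollback
 * cannot start another rollback. */
static int offload_cmd(struct fake_driver *drv, int prog, int oldprog,
		       bool is_rollback)
{
	int err;

	assert(drv != NULL);

	err = fake_replace(drv, prog, oldprog);
	if (err && !is_rollback) {
		/* One real attempt to restore the old state; its result is
		 * ignored because the caller needs the original error. */
		(void)offload_cmd(drv, oldprog, prog, true);
	}
	return err;
}
```

With a driver that rejects everything, the simulated callback is reached exactly twice (the replace and one rollback) and the caller sees the error from the first attempt.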
Jakub Kicinski [Thu, 28 May 2026 00:42:18 +0000 (17:42 -0700)]
Merge branch 'ethtool-more-bug-fixes'
Jakub Kicinski says:
====================
ethtool: more bug fixes
Last week I sent two patch sets - one fixing bugs in RSS handling,
and one fixing CMIS / module handling. This set contains the remaining
fixes. There's a concentration of fixes around PHY and timestamp config
handling but not enough to break those out as separate sets.
====================
Jakub Kicinski [Tue, 26 May 2026 15:35:33 +0000 (08:35 -0700)]
ethtool: eeprom: add more safeties to EEPROM Netlink fallback
The Netlink fallback path for reading module EEPROM
(fallback_set_params()) validates that offset < eeprom_len,
but does not check that offset + length stays within eeprom_len.
The ioctl equivalent (ethtool_get_any_eeprom() in ioctl.c) has
always enforced both bounds:
    if (eeprom.offset + eeprom.len > total_len)
        return -EINVAL;
This could lead to surprises in both drivers and device FW.
Add the missing offset + length validation to fallback_set_params(),
mirroring the ioctl.
Similarly, the ethtool core in general, and ethtool_get_any_eeprom()
in particular, try to zero-init all buffers passed to the drivers
to spare drivers the extra work of zeroing things out. eeprom_fallback()
uses a plain kmalloc(), change it to zalloc.
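
The two checks and the zeroed allocation described above can be sketched in userspace C. This is a minimal illustration, not the kernel code: eeprom_range_ok() and eeprom_buf_alloc() are hypothetical helpers, calloc() stands in for a zeroing kernel allocation, and the length check is written as a subtraction (an assumption of this sketch) so that it cannot wrap around.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: returns 0 if [offset, offset + len) lies inside an
 * EEPROM of total_len bytes, -EINVAL otherwise. */
static int eeprom_range_ok(uint32_t offset, uint32_t len, uint32_t total_len)
{
	/* First bound: the start must be inside the EEPROM. */
	if (offset >= total_len)
		return -EINVAL;
	/* Second bound: equivalent to offset + len > total_len, but cannot
	 * overflow because offset < total_len was checked above. */
	if (len > total_len - offset)
		return -EINVAL;
	return 0;
}

/* Zero-initialised buffer, a userspace analogue of a zeroing allocation. */
static uint8_t *eeprom_buf_alloc(uint32_t len)
{
	assert(len > 0);
	return calloc(len, 1);
}
```

A request that starts in range but runs past the end, for example offset 128 with length 129 on a 256-byte EEPROM, is rejected by the second check only.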
Fixes: 96d971e307cc ("ethtool: Add fallback to get_module_eeprom from netlink command")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:32 +0000 (08:35 -0700)]
ethtool: eeprom: add missing ethnl_ops_begin() / _complete() during fallback
All ethtool driver op calls should be sandwiched between
ethnl_ops_begin() / ethnl_ops_complete(). In Netlink eeprom code,
if the paged access failed we fall back to old API, but we
first call _complete() and the fallback never does its own
ethnl_ops_begin(). Move the fallback into the _begin() / _complete()
section.
Fixes: 96d971e307cc ("ethtool: Add fallback to get_module_eeprom from netlink command")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:31 +0000 (08:35 -0700)]
ethtool: strset: fix header attribute index in ethnl_req_get_phydev()
strset_prepare_data() passes ETHTOOL_A_HEADER_FLAGS (3) as the header
attribute to ethnl_req_get_phydev(). This is incorrect, in the main
attr space 3 is ETHTOOL_A_STRSET_COUNTS_ONLY, not the request
header attr. The correct constant is ETHTOOL_A_STRSET_HEADER (1).
ethnl_req_get_phydev() only uses this value for the extack,
so this is not a "functionally visible"(?) bug.
Fixes: e96c93aa4be9 ("net: ethtool: strset: Allow querying phy stats by index")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:30 +0000 (08:35 -0700)]
ethtool: tsinfo: don't pass ERR_PTR to genlmsg_cancel on prepare failure
The goto err label leads to:
genlmsg_cancel(skb, ehdr);
return ret;
If ethnl_tsinfo_prepare_dump() failed, it has not started a genlmsg.
There's nothing to cancel, and passing an error pointer to
genlmsg_cancel() would cause a crash.
Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:29 +0000 (08:35 -0700)]
ethtool: tsinfo: fix uninitialized stats on the by-PHC path
tsinfo_prepare_data() has two code paths: a "by-PHC" path for
user-specified hardware timestamping providers, and the old path.
Commit 89e281ebff72 ("ethtool: init tsinfo stats if requested") added
ethtool_stats_init() to mark stat slots as ETHTOOL_STAT_NOT_SET before
the driver callback populates them, but placed the call inside the
old-path block.
When commit b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to
support several hwtstamp by net topology") added the by-PHC early
return, it landed above the stats initialization. On that path
the stats array retains the zero-fill from ethnl_init_reply_data()'s
zalloc. This leads to the reply including a stats nest with four
zero-valued attributes that should have been absent.
Reject GET requests for stats with HWTSTAMP_PROVIDER or dump.
Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:27 +0000 (08:35 -0700)]
ethtool: pse-pd: fix missing ethnl_ops_complete()
pse_prepare_data() is missing ethnl_ops_complete() if
ethnl_req_get_phydev() returned an error. Move getting
phydev up so that we don't have to worry about this
(similar order to linkstate_prepare_data()).
Note that phydev may still be NULL (this is checked in
pse_get_pse_attributes()), the goal isn't really to avoid
the _begin() / _complete() calls, only to simplify the error
handling.
While at it propagate the original error. Why this code
overrides the error with -ENODEV but !phydev generates
-EOPNOTSUPP is unclear to me...
Fixes: 31748765bed3 ("net: ethtool: pse-pd: Target the command to the requested PHY")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:26 +0000 (08:35 -0700)]
ethtool: linkstate: fix unbalanced ethnl_ops_complete() on PHY lookup error
linkstate_prepare_data() calls ethnl_req_get_phydev() before
ethnl_ops_begin(), but routes its error path through "goto out"
which calls ethnl_ops_complete().
Fixes: fe55b1d401c6 ("ethtool: linkstate: migrate linkstate functions to support multi-PHY setups")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:25 +0000 (08:35 -0700)]
ethtool: tsconfig: fix reply error handling
A couple of trivial bugs in error handling in tsconfig_send_reply().
If we failed to allocate rskb we need to set the error.
If we did allocate it but failed to send it - we need to remember
to free it.
Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:24 +0000 (08:35 -0700)]
ethtool: coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES
ethnl_update_profile() walks the ETHTOOL_A_PROFILE_IRQ_MODERATION
nest list with an index 'i' and writes new_profile[i++] without
bounding i. The destination is kmemdup()'d at NET_DIM_PARAMS_NUM_PROFILES
entries (5), but the Netlink nest count is entirely user-controlled.
Netlink policies do not have support for constraining the number
of nested entries (or number of multi-attr entries).
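
A bounded version of such a loop might look like the sketch below. It is an illustration only: demo_moder and demo_update_profile() are hypothetical, the constant mirrors the value 5 quoted above, and rejecting an over-long list with -EINVAL (rather than, say, ignoring the extra entries) is an assumption of this sketch, since the text above does not say how the actual patch reacts.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_NUM_PROFILES 5	/* mirrors NET_DIM_PARAMS_NUM_PROFILES (5) */

struct demo_moder {
	unsigned int usec;
	unsigned int pkts;
};

/* Copy user-supplied entries into a fixed-size destination. The count
 * n_src is untrusted, so the write index is checked before every store. */
static int demo_update_profile(struct demo_moder *dst,
			       const struct demo_moder *src, size_t n_src)
{
	size_t i = 0;

	assert(dst != NULL && src != NULL);

	for (size_t k = 0; k < n_src; k++) {
		if (i >= DEMO_NUM_PROFILES)
			return -EINVAL;	/* more entries than slots */
		dst[i++] = src[k];
	}
	return 0;
}
```

The check sits inside the loop because, as noted above, the policy layer cannot limit how many nested entries a request carries.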
Zhao Dongdong [Tue, 26 May 2026 06:51:56 +0000 (14:51 +0800)]
net: page_pool: silence static analysis warnings in page_pool_nl_stats_fill()
nla_nest_start() can return NULL if the skb runs out of space.
Jakub:
There is no bug here, if nla_nest_start() failed there's no space
left in the message. Next nla_put_uint() will also fail and we will
exit via nla_nest_cancel() which handles NULL just fine.
Various people keep sending us this patch so let's commit this.
Eric Dumazet [Tue, 26 May 2026 14:55:29 +0000 (14:55 +0000)]
ipv6: frags: cleanup __IP6_INC_STATS() confusion
After commits e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original
netdev") and bdb7cc643fc9 ("ipv6: Count interface receive statistics
on the ingress netdev") net/ipv6/reassembly.c uses three different
ways to reach idev in various __IP6_INC_STATS() calls.
Let's centralize this from ipv6_frag_rcv() and use __in6_dev_stats_get().
Note that ipv6_frag_rcv() tests if skb->dev could be NULL already, so
I chose to also guard against NULL, but we probably can remove the
tests in a followup patch, because I do not think skb->dev could be NULL.
iif = skb->dev ? skb->dev->ifindex : 0;
idev can be NULL, __IP6_INC_STATS() deals with this possibility.
Small code size reduction as a bonus.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-145 (-145)
Function                                     old     new   delta
ipv6_frag_rcv                               2399    2362     -37
ip6_frag_reasm                               705     597    -108
Total: Before=31455552, After=31455407, chg -0.00%
Eric Dumazet [Tue, 26 May 2026 14:55:28 +0000 (14:55 +0000)]
ipv6: guard against possible NULL deref in __in6_dev_stats_get()
dev_get_by_index_rcu() could return NULL if the original physical
device is unregistered.
Found by Sashiko.
Fixes: e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original netdev")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526145529.3587126-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 28 May 2026 00:23:07 +0000 (17:23 -0700)]
Merge branch 'bridge-fix-sleep-in-atomic-context'
Ido Schimmel says:
====================
bridge: Fix sleep in atomic context
Under certain circumstances the bridge driver can call
dev_set_promiscuity() while holding the bridge spin lock. This is a
problem as dev_set_promiscuity() might sleep.
Patches #1-#2 fix the problem in the netlink and sysfs configuration
paths by only taking the lock where it is actually needed, thereby
avoiding calling dev_set_promiscuity() from an atomic context.
Patch #3 adds test cases for both configuration paths in rtnetlink.sh
which already includes test cases for similar issues.
Note that dev_set_promiscuity() can sleep either when it takes the net
device mutex or when calling netif_rx_mode_sync(). I encountered the
problem with the latter, but blamed the former since it came earlier.
====================
Add two test cases that always pass, but trigger sleeping in atomic
context BUGs without "bridge: Fix sleep in atomic context in netlink
path" and "bridge: Fix sleep in atomic context in sysfs path".
Ido Schimmel [Tue, 26 May 2026 06:48:17 +0000 (09:48 +0300)]
bridge: Fix sleep in atomic context in sysfs path
Since the start of the git history, brport_store() always acquired the
bridge lock. Back then this decision made sense: The bridge lock
protects the STP state of the bridge and its ports and at that time the
function was only used by two STP related attributes (cost and
priority).
Nowadays, brport_store() processes a lot more attributes and most of
them do not need the bridge lock:
* Bridge flags: Only require RTNL. Read locklessly by the data path.
Annotations can be added in net-next.
* FDB port flushing: Only requires the FDB lock.
* Multicast attributes: Only require the multicast lock.
* Group forward mask: Only requires RTNL. Read locklessly by the data
path. Annotations can be added in net-next.
* Backup port: Only requires RTNL. Read locklessly by the data path.
This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].
Fix this by reducing the scope of the bridge lock and only take it when
processing the two STP related attributes that require it. Remove the
now stale comment from br_switchdev_set_port_flag(). The
SWITCHDEV_F_DEFER flag can be removed in net-next.
Ido Schimmel [Tue, 26 May 2026 06:48:16 +0000 (09:48 +0300)]
bridge: Fix sleep in atomic context in netlink path
Since the introduction of the netlink configuration path for bridge
ports in commit 25c71c75ac87 ("bridge: bridge port parameters over
netlink"), br_setport() was always called with the bridge lock held
around it. Back then this decision made sense: The bridge lock protects
the STP state of the bridge and its ports and at that time the function
only processed three STP related netlink attributes (cost, priority and
state).
Nowadays, br_setport() processes a lot more attributes and most of them
do not need the bridge lock:
* Bridge flags: Only require RTNL. Read locklessly by the data path.
Annotations can be added in net-next.
* FDB port flushing: Only requires the FDB lock.
* Multicast attributes: Only require the multicast lock.
* Group forward mask: Only requires RTNL. Read locklessly by the data
path. Annotations can be added in net-next.
* Backup port and NHID: Only require RTNL. Read locklessly by the data
path.
This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].
Fix this by reducing the scope of the bridge lock and only take it when
processing the three STP related attributes that require it. This is
consistent with the multicast attributes where each attribute acquires
the multicast lock instead of having one critical section for all
relevant attributes.
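
The idea of narrowing the critical section can be modelled in userspace C. This is a toy sketch, not bridge code: demo_bridge, demo_may_sleep() and demo_setport() are hypothetical, a boolean stands in for the bridge spin lock, and a counter records any call to a sleeping function made while that "lock" is held.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_attr { DEMO_ATTR_COST, DEMO_ATTR_FLAGS };

struct demo_bridge {
	bool lock_held;		/* models the bridge spin lock being held */
	int sleep_in_atomic;	/* counts "sleeping while atomic" violations */
	int cost;
	unsigned int flags;
};

/* Stand-in for dev_set_promiscuity(): it may sleep, so calling it with
 * the lock held is recorded as a bug. */
static void demo_may_sleep(struct demo_bridge *br)
{
	if (br->lock_held)
		br->sleep_in_atomic++;
}

/* Each attribute takes only the protection it needs instead of the whole
 * function running inside one critical section. */
static void demo_setport(struct demo_bridge *br, enum demo_attr attr, int val)
{
	assert(br != NULL);

	switch (attr) {
	case DEMO_ATTR_COST:
		/* STP related: still updated under the lock. */
		br->lock_held = true;
		br->cost = val;
		br->lock_held = false;
		break;
	case DEMO_ATTR_FLAGS:
		/* Flag change: no lock, so the sleeping call is safe. */
		br->flags = (unsigned int)val;
		demo_may_sleep(br);
		break;
	}
}
```

Had the whole switch been wrapped in one locked region, the flag update would have incremented sleep_in_atomic, which is the situation the patch removes.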
KVM: TDX: Move external page table freeing to TDX code
Move the freeing of external page tables into the reclaim operation that
lives in TDX code.
The TDP MMU supports traversing the TDP without holding locks. Page tables
need to be freed via RCU to prevent walking one that gets freed.
While none of these lockless walk operations actually happen for the mirror
page table, the TDP MMU nonetheless frees the mirror page table in the same
way, and (because it's a handy place to plug it in) the external page table
as well.
However, the external page table definitely can't be walked once the page
table pages are reclaimed from the TDX module. The TDX module releases the
page for the host VMM to use, so this RCU-time free is unnecessary for the
external page table.
So move the free_page() call to TDX code. Create a
tdp_mmu_free_unused_sp() to allow for freeing external page tables that
have never left the TDP MMU code (i.e. don't need to be freed in a special
way).
Move the logic for TDX's specific need to leak pages when reclaim
fails inside the free_external_spt() op, so this can be done in TDX
specific code and not the generic MMU.
Do this by passing in "sp" instead of the external page table pointer.
This way, TDX code can set sp->external_spt to NULL. Since the error is now
handled internally in TDX code (by triggering KVM_BUG_ON() or
TDX_BUG_ON_3(), which warn and stop the VM on any error), change the op to
return void. This way it also operates like a normal free in that success
is guaranteed from the caller's perspective.
Opportunistically, drop the unused level and gfn args while adjusting the
sp arg.
[ Rick: Re-wrote log and massaged op name ]
Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[ Yan: Updated patch log/function comment, dropped unused param in op ]
Co-developed-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075730.4354-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop kvm_x86_ops.remove_external_spte(), and instead handle the removal of
leaf SPTEs in the S-EPT (a.k.a. external page table) in
kvm_x86_ops.set_external_spte(). This will also allow extending
tdx_sept_set_private_spte() to support splitting a huge S-EPT entry without
needing yet another kvm_x86_ops hook.
Now all changes for removing leaf mirror SPTEs are propagated through
kvm_x86_ops.set_external_spte().
- When removing leaf mirror SPTEs under shared mmu_lock (though currently
no path can trigger this scenario and TDX does not support this
scenario), tdx_sept_remove_private_spte() may produce a warning due to
lockdep_assert_held_write() or may return -EIO and trigger TDX_BUG_ON()
due to concurrent BLOCK, TRACK, REMOVE.
- When removing leaf mirror SPTEs under exclusive mmu_lock, all errors are
unexpected. If any error occurs in this scenario,
tdx_sept_remove_private_spte() will return -EIO and trigger KVM_BUG_ON().
A redundant KVM_BUG_ON() call will also be triggered in TDP MMU core in
handle_changed_spte(), which is benign (the WARN will fire if and only if
the VM isn't already bugged).
Arrange tdx_sept_remove_private_spte() (and its tdx_track() helper) to be
above tdx_sept_set_private_spte() in anticipation of routing all S-EPT
writes (with the exception of reclaiming non-leaf pages) through the "set"
API.
Rick Edgecombe [Sat, 9 May 2026 07:56:47 +0000 (15:56 +0800)]
KVM: x86/mmu: Drop KVM_BUG_ON() on shared lock to zap child external PTEs
Drop the KVM_BUG_ON() in the KVM MMU core before zapping child external
PTEs, since requiring zapping PTEs to be protected by exclusive mmu_lock is
TDX's specific requirement.
No need to plumb the shared/exclusive info into the remove_external_spte()
op or move the KVM_BUG_ON() to TDX, because
- There's already an assertion of exclusive mmu_lock protection in TDX.
- The KVM_BUG_ON() is a bit redundant given that if there's any bug causing
zapping of leaf PTEs in S-EPT under shared mmu_lock, SEAMCALL failures
due to contention would result in TDX_BUG_ON() in TDX.
KVM: x86/tdp_mmu: Centrally propagate to-present/atomic zap updates to external PTEs
Move propagation of to-present changes and atomic zap changes to external
PTEs from function __tdp_mmu_set_spte_atomic() to function
__handle_changed_spte(), which centrally handles changes of SPTEs.
When setting a PTE to present in the mirror page tables, the update needs
to be propagated to the external page tables (in TDX parlance, the S-EPT).
Today this is handled by special mirror page tables logic/branching in
__tdp_mmu_set_spte_atomic(), which is the only place where present PTEs are
set for TDX.
The current approach obviously works, but is a bit hacked on. The hook for
setting present leaf PTEs is added only where TDX happens to need it. For
example, TDX does not support any of the operations that use the non-atomic
variant, tdp_mmu_set_spte(), to set present PTEs. Since the hook is missing
there, it is very hard to understand the code from a non-TDX lens. If the
reader doesn't know the TDX specifics it could look like the external SPTE
update is missing.
In addition to being confusing, it also litters the TDP MMU with "external"
update callbacks. This is especially unfortunate because there is already a
central place to react to TDP updates, handle_changed_spte().
Begin the process of moving towards a model where all mirror page table
updates are forwarded to TDX code where the TDX-specific logic can live
with a more proper separation of concerns. Do this by adding a helper
__handle_changed_spte() and teaching it how to return error codes, such
that it can propagate the failures that may come from TDX external page
table updates. Make the original handle_changed_spte() a no-fail version of
__handle_changed_spte(), so it handles no-fail changes which are under
exclusive mmu_lock or under the no-fail path handle_removed_pt(),
triggering KVM_BUG_ON() on error returns.
Instead of having __tdp_mmu_set_spte_atomic() do the frozen mirror SPTE
dance and trigger propagation to external PTEs, make
__tdp_mmu_set_spte_atomic() a simple helper of try_cmpxchg64() and hoist
the frozen mirror SPTE dance up a level to tdp_mmu_set_spte_atomic(). Then,
the propagation of changes to present to the external PTEs can be
centralized to __handle_changed_spte(). Aging external SPTEs is not yet
supported for the mirror page table, so just warn on mirror usage in
kvm_tdp_mmu_age_spte() and invoke __tdp_mmu_set_spte_atomic() directly
without the frozen dance. No need to warn on installing FROZEN_SPTE as a
long-term value in kvm_tdp_mmu_age_spte() since removing the accessed bit
is mutually exclusive with installing FROZEN_SPTE (FROZEN_SPTE has the
accessed bit set on all x86 platforms).
Since tdp_mmu_set_spte_atomic() can also be invoked to atomically zap SPTEs
(though there's no path to trigger atomic zap on the mirror page table up
to now), also leverage set_external_spte() op to propagate the atomic zaps
when tdp_mmu_set_spte_atomic() zaps leaf SPTEs directly. (When
tdp_mmu_set_spte_atomic() zaps a non-leaf SPTE, zaps of the child leaf
SPTEs are propagated via the remove_external_spte() op).
Note: tdp_mmu_set_spte_atomic() invokes __handle_changed_spte() to handle
changes to new_spte while the mirror SPTE is frozen, so
(1) the update of the external PTEs and statistics, or
(2) the update of child mirror SPTEs, child external PTEs and corresponding
statistics,
now occur before the mirror SPTE is actually set to new_spte.
(1) is ok since if it fails, the mirror SPTE will be restored to its
original value. (2) is also ok since handle_removed_pt() is no-fail.
Sagi Shahar [Thu, 5 Mar 2026 22:26:27 +0000 (22:26 +0000)]
KVM: SEV: Restrict userspace return codes for KVM_HC_MAP_GPA_RANGE
To align with the updated TDX API that allows userspace to request
that guests retry MAP_GPA operations, make sure that userspace returns
only EINVAL or EAGAIN as error codes.
KVM: TDX: Allow userspace to return errors to guest for MAPGPA
A MAPGPA request from a TDX VM is split into chunks by KVM using a loop
of userspace exits until the complete range is handled.
In some cases the userspace VMM might decide to break off the MAPGPA
operation and continue it later. For example, in the case of intrahost
migration userspace might decide to continue the MAPGPA operation after
the migration is completed.
Allow userspace to signal to TDX guests that the MAPGPA operation should
be retried the next time the guest is scheduled.
This is potentially a breaking change since if userspace sets
hypercall.ret to a value other than EBUSY or EINVAL an EINVAL error code
will be returned to userspace. As of now QEMU never sets hypercall.ret
to a non-zero value after handling KVM_EXIT_HYPERCALL so this change
should be safe.
Reviewed-by: Michael Roth <michael.roth@amd.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Sagi Shahar <sagis@google.com>
Link: https://patch.msgid.link/20260305222627.4193305-2-sagis@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yuyang Huang [Sun, 24 May 2026 02:24:56 +0000 (11:24 +0900)]
ipv6: mcast: annotate data-races around mca_users
/proc/net/igmp6 walks IPv6 multicast memberships under RCU and prints
mca_users without holding idev->mc_lock, while multicast join and leave
paths update the field while holding idev->mc_lock. Annotate this
intentional lockless snapshot with READ_ONCE() and the matching writers
with WRITE_ONCE().
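
The annotation pattern can be sketched in userspace C. The macros below are simplified approximations of the kernel's READ_ONCE()/WRITE_ONCE() built on a volatile access, and they rely on __typeof__, a GNU C extension available in GCC and Clang. demo_mc_entry and the three helpers are hypothetical and only mirror the mca_users field named above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified userspace approximations of the kernel macros: force a
 * single, untorn access that the compiler may not merge or repeat. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

struct demo_mc_entry {
	int mca_users;
};

/* Writers: in the kernel these run with idev->mc_lock held, so the plain
 * read of the old value on the right-hand side is serialised. */
static void demo_join(struct demo_mc_entry *mc)
{
	WRITE_ONCE(mc->mca_users, mc->mca_users + 1);
}

static void demo_leave(struct demo_mc_entry *mc)
{
	WRITE_ONCE(mc->mca_users, mc->mca_users - 1);
}

/* Lockless reader, as in the /proc/net/igmp6 walk: takes one snapshot. */
static int demo_snapshot(const struct demo_mc_entry *mc)
{
	assert(mc != NULL);
	return READ_ONCE(mc->mca_users);
}
```

The annotations do not add ordering or mutual exclusion; they document that the race is intentional and keep the compiler from tearing or re-reading the value.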
Oliver Hartkopp [Tue, 26 May 2026 19:33:19 +0000 (21:33 +0200)]
bonding: refuse to enslave CAN devices
syzbot reported a kernel paging request crash in
can_rx_unregister() inside net/can/af_can.c. The crash occurs
because a virtual CAN device (vxcan) is being enslaved to a
bonding master.
During the enslavement process, the bonding driver modifies the
network device state to fit an Ethernet-like aggregation
model. However, CAN devices operate on a completely
different Layer 2 architecture, relying on the CAN mid-layer
private data structure (can_ml_priv) instead of standard
Ethernet structures. Since bonding does not initialize or
maintain these CAN structures, subsequent operations on the
half-enslaved interface (such as closing associated sockets
via isotp_release) lead to a null-pointer dereference when
accessing the CAN receiver lists.
Bonding CAN interfaces is architecturally invalid as CAN lacks
MAC addresses, ARP capabilities, and standard Ethernet
link-layer mechanisms. While generic loopback devices are
blocked globally in net/core/dev.c, virtual CAN devices
bypass this check because they do not carry the IFF_LOOPBACK
flag, despite acting as local software-loopbacks.
Fix this by explicitly blocking network devices of type
ARPHRD_CAN from being enslaved at the very beginning of
bond_enslave(). This prevents illegal state mutations,
eliminates the resulting KASAN crashes, and avoids potential
memory leaks from incomplete socket cleanups.
As CAN support was added long after bonding, the Fixes tag
points to the introduction of ARPHRD_CAN, which would have
needed specific handling in bonding_main.c.
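
The early type check can be sketched as follows. This is an illustration, not the actual bond_enslave() change: demo_netdev and demo_enslave() are hypothetical, the two constants reuse the numeric values of ARPHRD_ETHER and ARPHRD_CAN from the Linux UAPI headers, and the -EOPNOTSUPP return code is an assumption of this sketch, since the text above does not name the error returned.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same numeric values as ARPHRD_ETHER and ARPHRD_CAN in linux/if_arp.h. */
#define DEMO_ARPHRD_ETHER	1
#define DEMO_ARPHRD_CAN		280

struct demo_netdev {
	const char *name;
	unsigned short type;	/* hardware type, like net_device->type */
};

/* Reject CAN devices before any state of the would-be slave is touched. */
static int demo_enslave(const struct demo_netdev *slave)
{
	assert(slave != NULL);

	if (slave->type == DEMO_ARPHRD_CAN)
		return -EOPNOTSUPP;

	/* ... the real enslave steps would follow here ... */
	return 0;
}
```

Doing the check first means a rejected device is never left in the half-enslaved state described above.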
Fixes: cd05acfe65ed ("[CAN]: Allocate protocol numbers for PF_CAN")
Reported-by: syzbot+8ed98cbd0161632bce95@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=8ed98cbd0161632bce95
Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Acked-by: Jay Vosburgh <jv@jvosburgh.net>
Link: https://patch.msgid.link/20260526-bonding-candev-v1-1-ba1df400918a@hartkopp.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selinux: hooks: use __getname() to allocate path buffer
selinux_genfs_get_sid() allocates memory for a path with __get_free_page()
although there is a dedicated helper for allocation of file paths:
__getname().
Replace __get_free_page() for allocation of a path buffer with __getname().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Signed-off-by: Paul Moore <paul@paul-moore.com>
selinux: use k[mz]alloc() to allocate temporary buffers
Several functions in selinuxfs.c allocate temporary buffers using
__get_free_page() or get_zeroed_page().
These buffers are used either to store a string generated by snprintf() (in
sel_make_bools()) or to copy data from user (sel_read_avc_hash_stats() and
sel_read_sidtab_hash_stats()).
Such usage does not require struct page access and it is better to allocate
these buffers with kzalloc()/kmalloc() that provide better scalability and
more debugging possibilities.
Replace use of get_zeroed_page() with kzalloc() and usage of
__get_free_page() with kmalloc().
Xuanqing Shi [Wed, 27 May 2026 02:26:17 +0000 (19:26 -0700)]
KVM: VMX: Handle bad values on proxied writes to LBR MSRs
Use the "safe" WRMSR API when writing LBRs on behalf of the guest (or host
userspace), and propagate any errors back to the instigator, as the value
being written is untrusted. E.g. if the guest (or host userspace) attempts
to set reserved bits in LBR_SELECT, then KVM needs to return an error, and
not WARN on the bad value.
Continue using the "unsafe" version of RDMSR, as it should be impossible to
reach the helper with a completely bogus MSR, i.e. WARNing on RDMSR failure
is very desirable, e.g. to make KVM bugs more visible.
Fixes: 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE")
Cc: stable@vger.kernel.org
Signed-off-by: Xuanqing Shi <1356292400@qq.com>
[sean: rework changelog, only modify WRMSR path, tag for stable@]
Link: https://patch.msgid.link/20260527022617.3973884-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Ricardo Robaina [Wed, 27 May 2026 23:15:34 +0000 (19:15 -0400)]
audit: fix recursive locking deadlock in audit_dupe_exe()
A deadlock occurs in the audit subsystem when duplicating
executable-related rules.
When a file is moved (e.g., via do_renameat2()), the VFS layer locks
the parent directory (I_MUTEX_PARENT), which synchronously triggers an
fsnotify_move event. If an existing executable audit rule matches the
file being moved, the audit subsystem catches this event and calls
audit_dupe_exe() to duplicate the watch and update the rule. Then,
audit_alloc_mark() would call kern_path_parent() to resolve the path,
leading to a blind attempt to acquire the exact same I_MUTEX_PARENT lock
already held by the task, resulting in the following recursive locking
deadlock:
============================================
WARNING: possible recursive locking detected
6.12.0-55.27.1.el10_0.x86_64+debug #1 Not tainted
--------------------------------------------
mv/5099 is trying to acquire lock: ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
at: __kern_path_locked+0x10a/0x2f0
but task is already holding lock: ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
at: lock_two_directories+0x13f/0x2b0
other info that might help us debug this:
Possible unsafe locking scenario:
The aforementioned deadlock can be consistently reproduced by running
the script below:
audit-dupe-exe-deadlock.sh
--------------------------
#!/bin/bash
auditctl -D
mkdir -p /tmp/foo
touch /tmp/file
auditctl -a always,exit -F exe=/tmp/file -F path=/tmp/file -S all -k dr
mv /tmp/file /tmp/foo/file
rm -Rf /tmp/foo
This patch fixes the issue by introducing struct audit_watch_ctx to pass
the fsnotify event context down to audit_alloc_mark(). By utilizing the
already-resolved directory inode provided by the event, we bypass the
kern_path_parent() path resolution entirely, safely avoiding the
recursive lock. Furthermore, it explicitly allows duplicate fsnotify
marks (allow_dups = 1) during the rename update, allowing the new rule's
mark to safely coexist with the old rule's mark until the old rule is
freed.
P.S.: This issue was identified and reproduced during a comprehensive
code coverage analysis of the audit subsystem. The full report is
available at the link below:
P.P.S: With the permission of both Ricardo and Nathan, I've squashed a
fixup patch from Nathan that addresses a compile time error when
CONFIG_AUDITSYSCALL=n.
Cc: stable@kernel.org
Fixes: 34d99af52ad4 ("audit: implement audit by executable")
Acked-by: Waiman Long <longman@redhat.com>
Acked-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
[PM: move link metadata into the msg, apply fix from NC]
Signed-off-by: Paul Moore <paul@paul-moore.com>
KVM: x86/mmu: Plumb "sp" _pointer_ into the TDP MMU's handle_changed_spte()
Plumb the "sp" pointer into handle_changed_spte() to allow checking of
is_mirror_sp(sp) in handle_changed_spte(). This will allow consolidating
all S-EPT updates into a single kvm_x86_ops hook.
[Yan: Remove unused "as_id" param in tdp_mmu_set_spte() ]
Rick Edgecombe [Sat, 9 May 2026 07:56:09 +0000 (15:56 +0800)]
KVM: x86/tdp_mmu: Morph !is_frozen_spte() check into a KVM_MMU_WARN_ON()
Remove the conditional logic for handling the setting of mirror page table
to frozen in __tdp_mmu_set_spte_atomic() and add it as a warning for both
mirror and direct cases.
The mirror page table needs to propagate PTE changes to the external page
table. This presents a problem for atomic updates which can't update both
page tables at once. So a special value, FROZEN_SPTE, is used as a
temporary state during these updates to prevent concurrent operations on
the PTE. If the TDP MMU tried to install FROZEN_SPTE as a long-term value,
it would confuse these updates.
On the other hand, it would also confuse other threads if FROZEN_SPTE is
installed as a long-term value for direct page tables (e.g., causing
another thread working on atomic zap to wait for a !FROZEN_SPTE value
endlessly).
Therefore, add the warning for installing FROZEN_SPTE as a long-term value
in __tdp_mmu_set_spte_atomic() without differentiating whether it's a
mirror or direct page table.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075609.4242-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rick Edgecombe [Sat, 9 May 2026 07:55:57 +0000 (15:55 +0800)]
KVM: TDX: Move lockdep assert in __tdp_mmu_set_spte_atomic() to TDX code
Move the MMU lockdep assert in __tdp_mmu_set_spte_atomic() into the TDX
specific op because the assert is TDX specific in intention.
The TDP MMU has many lockdep asserts for various scenarios, and in fact
the callchains that are used for TDX already have a lockdep assert which
covers the case in __tdp_mmu_set_spte_atomic(). However, these asserts are
for management of the TDP root owned by KVM. In the
__tdp_mmu_set_spte_atomic() assert case, it is helping with a scheme to
avoid contention in the TDX module during zap operations. That is very
TDX specific.
One option would be to just remove the assert in
__tdp_mmu_set_spte_atomic() and rely on the other ones in the TDP MMU. But
that assert is for a different intention, and too far away from the
SEAMCALL that needs it. So just move it to TDX code.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075557.4226-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Rick Edgecombe [Sat, 9 May 2026 07:55:44 +0000 (15:55 +0800)]
KVM: TDX: Move KVM_BUG_ON()s in __tdp_mmu_set_spte_atomic() to TDX code
Drop some KVM_BUG_ON()s that are guarding against TDP MMU attempting to
propagate unsupported changes to the external page table through
__tdp_mmu_set_spte_atomic(). Have TDX code trigger them instead.
Now that TDP MMU logically allows propagating an atomic zapping operation to
the external page table through the set_external_spte() op in
__tdp_mmu_set_spte_atomic(), TDX code will trigger the KVM_BUG_ON() on the
atomic zapping request instead. (Note: non-atomic zapping is not propagated
via the set_external_spte() op yet).
Despite the generic naming, external page table ops are designed completely
around TDX. They hook the bare minimum of what is needed, and exclude the
operations that are not supported by TDX. To help keep track of which changes
can be handled by which operations, warnings and KVM_BUG_ON()s exist in
the code. These warnings and KVM_BUG_ON()s put the burden of understanding
which operations should be forwarded to TDX code on TDP MMU developers, who
often read the code without TDX context.
Future changes will transition the encapsulation of this domain knowledge
to TDX code by funneling the external page table updates through a central
update mechanism. In this paradigm, the central update mechanism can
encapsulate the special knowledge, but will not have as much knowledge
about what operation is in progress.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Link: https://patch.msgid.link/20260509075544.4210-1-yan.y.zhao@intel.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: x86/mmu: Plumb param "old_spte" into kvm_x86_ops.set_external_spte()
If tdp_mmu_set_spte_atomic() triggers an atomic zap on a mirror SPTE
(though currently no paths trigger it), the change is propagated via the
set_external_spte() op. Plumb the old SPTE into the set_external_spte() op,
so TDX code rather than TDP MMU code can warn if the atomic zap isn't
allowed, i.e. to let TDX enforce TDX's rules (inasmuch as possible).
Rename mirror_spte to new_spte to follow the TDP MMU's naming, and to make
it more obvious what value the parameter holds.
Opportunistically tweak the ordering of parameters to match the pattern of
most TDP MMU functions, which do "old, new, level".
KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() in
anticipation of propagating *all* changes (like atomic zap) triggered by
tdp_mmu_set_spte_atomic() to the external PTEs.
KVM: TDX: Wrap mapping of leaf and non-leaf S-EPT entries into helpers
Add a helper, tdx_sept_map_leaf_spte(), to wrap and isolate PAGE.ADD and
PAGE.AUG operations. Rename tdx_sept_link_private_spt() to
tdx_sept_map_nonleaf_spte() to wrap SEPT.ADD for symmetry.
Thus, transition tdx_sept_set_private_spte() into a "dispatch" routine for
setting/writing S-EPT entries.
Drop the dedicated .link_external_spt() for linking S-EPT pages, and
instead funnel everything through .set_external_spte() for mapping S-EPT
entries. Using separate hooks doesn't help prevent TDP MMU details from
bleeding into TDX, and vice versa; to the contrary, dedicated callbacks
will result in _more_ pollution when hugepage support is added, e.g. will
require the TDP MMU to know details about the splitting rules for TDX that
aren't all that relevant to the TDP MMU.
Ideally, KVM would provide a single pair of hooks to set S-EPT entries,
one hook for setting SPTEs under write-lock and another for setting SPTEs
under read-lock (e.g. to ensure the entire operation is "atomic", to allow
for failure, etc.). Sadly, TDX's requirement that all child S-EPT entries
are removed before the parent makes that impractical: the TDP MMU
deliberately prunes non-leaf SPTEs and _then_ processes its children, thus
making it quite important for the TDP MMU to differentiate between zapping
leaf and non-leaf S-EPT entries.
However, that's the _only_ case that's truly special, and even that case
could be shoehorned into a single hook; it just wouldn't be a net positive.
mshv: support 1G hugepages by passing them as 2M-aligned chunks
The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.
Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
hypercall has no 1G stride, so 1G folios are processed as a
sequence of 2M chunks. Folios whose order is less than PMD_ORDER
(e.g. mTHP) fall back to single-page stride; mapping them as 2M
would fail in the hypervisor anyway.
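The stride rule above can be sketched in plain C as follows. This is a minimal userspace illustration, not the driver's code: the helper names are hypothetical, and SKETCH_PMD_ORDER assumes 4 KiB base pages, so that a 2M chunk has order 9 and a 1G folio has order 18.

```c
/* Assumed value: with 4 KiB base pages, 2 MiB / 4 KiB = 512 = 2^9. */
#define SKETCH_PMD_ORDER 9u

/*
 * Hypothetical helper: order of the stride used to feed a folio of the
 * given order to the map GPA hypercall.
 */
static unsigned int sketch_stride_order(unsigned int folio_order)
{
	/* 2M and 1G folios both use a 2M stride; there is no 1G stride. */
	if (folio_order >= SKETCH_PMD_ORDER)
		return SKETCH_PMD_ORDER;

	/* Smaller folios (e.g. mTHP) fall back to single pages. */
	return 0;
}

/* Hypothetical helper: number of strides needed to cover one folio. */
static unsigned long sketch_stride_count(unsigned int folio_order)
{
	return 1ul << (folio_order - sketch_stride_order(folio_order));
}
```

Under these assumptions a 1G folio is passed as 512 chunks of 2M each, which the hypervisor may then coalesce when alignment permits.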
Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Dexuan Cui [Thu, 7 May 2026 21:28:38 +0000 (14:28 -0700)]
Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
screen.lfb_base being zero [1], there is an MMIO conflict between the
drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
hv_allocate_config_window() calls vmbus_allocate_mmio() to get an
MMIO range, typically it gets a 32-bit MMIO range that overlaps with the
framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
error message "PCI Pass-through VSP failed D0 Entry with status" since
the host thinks that PCI devices must not use MMIO space that the
host has assigned to the framebuffer.
This is especially an issue if pci-hyperv is built-in and hyperv-drm is
built as a module. Consequently, the kdump/kexec kernel fails to detect
PCI devices via pci-hyperv, and may fail to mount the root file system,
which may reside on an NVMe disk. The issue described here has existed
for SR-IOV VF NICs since day one of the pci-hyperv driver, and has been
worked around on x64 when possible. With the recent introduction of
ARM64 VMs that boot from NVMe, there is no workaround, so we need a
formal fix.
On Gen2 VMs, if the screen.lfb_base is 0 in the kdump/kexec kernel [1],
fall back to the low MMIO base, which should be equal to the framebuffer
MMIO base [2] (the statement is true according to my testing on x64
Windows Server 2016, and on x64 and ARM64 Windows Server 2025 and on
Azure. I checked with the Hyper-V team and they said the statement should
continue to be true for Gen2 VMs). In the first kernel, screen.lfb_base
is not 0; if the user specifies a very high resolution, it's not enough
to only reserve 8MB: let's always reserve half of the space below 4GB,
but cap the reservation to 128MB, which is the required framebuffer size
of the highest resolution 7680*4320 supported by Hyper-V.
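The sizing rule can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, "half of the space below 4GB" is read here as half of the low MMIO window, and 4 bytes per pixel is an assumption that the text above does not state.

```c
#include <stdint.h>

#define SKETCH_MB (1024ull * 1024ull)

/*
 * Hypothetical helper: reserve half of the low MMIO window, capped at
 * 128 MB.
 */
static uint64_t sketch_fb_reserve_size(uint64_t low_mmio_size)
{
	uint64_t size = low_mmio_size / 2;
	uint64_t cap = 128 * SKETCH_MB;

	return size < cap ? size : cap;
}

/* Bytes needed by a width x height framebuffer (bytes_per_pixel assumed). */
static uint64_t sketch_fb_bytes(uint64_t width, uint64_t height,
				uint64_t bytes_per_pixel)
{
	return width * height * bytes_per_pixel;
}
```

With 4 bytes per pixel, 7680*4320 needs 132,710,400 bytes, which fits under the 128 MB cap of 134,217,728 bytes.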
While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
the > to >=. Here the 'end' is an inclusive end (typically, it's
0xFFFF_FFFF for the low MMIO range).
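Why the strict comparison misses a case with an inclusive end can be seen in a small sketch. The helpers below are hypothetical and are not the driver's code; they only contrast the two comparisons against an arbitrary boundary address.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * 'end' is the last valid byte of the range (inclusive). The range reaches
 * 'boundary' as soon as end == boundary, so the check must be >=.
 */
static bool sketch_reaches_boundary(uint64_t end, uint64_t boundary)
{
	return end >= boundary;
}

/* The strict form, which only fires one byte too late. */
static bool sketch_reaches_boundary_strict(uint64_t end, uint64_t boundary)
{
	return end > boundary;
}
```

A range whose last byte is exactly the boundary address is caught by the first helper and missed by the second.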
Note: vmbus_reserve_fb() now also reserves an MMIO range at the beginning
of the low MMIO range on CVMs, which have no framebuffers (the
'screen.lfb_base' in vmbus_reserve_fb() is 0 for CVMs), just in case the
host might treat the beginning of the low MMIO range specially [3]. BTW,
the OpenHCL kernel is not affected by the change, because that kernel
boots with DeviceTree rather than ACPI (so vmbus_reserve_fb() won't run
there), and there is no framebuffer device for that kernel.
Note: normally Gen1 VMs don't have the MMIO conflict issue because the
framebuffer MMIO range (which is hardcoded to base=4GB-128MB and
size=64MB for Gen1 VMs by the host) is always reported via the legacy PCI
graphics device's BAR, so the kdump/kexec kernel can reserve the 64MB
MMIO range; however, if the VM is configured to use a very high resolution
and the required framebuffer size exceeds 64MB (AFAIK, in practice, this
isn't a typical configuration by users), the hyperv-drm driver may need to
allocate an MMIO range above 4GB and change the framebuffer MMIO location
to the allocated MMIO range -- in this case, there can still be issues [4]
which can't be easily fixed: any possible affected Gen1 users would have
to use a resolution whose framebuffer size is <= 64MB, or switch to Gen2
VMs.
x86/ftrace: Relocate %rip-relative percpu refs in dynamic trampolines
With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform
(e.g. Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline
crashes on the first call into the traced function:
Monitoring the crash under GDB points to the exact instruction in charge of
incrementing the call depth:
sarq $5, %gs:__x86_call_depth(%rip)
This instruction matches the one inserted by ftrace_regs_caller from
ftrace_64.S. The emitted code was likely working fine until the introduction
of commit
59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"),
which made the call depth accounting address relative to %rip instead of
being based on an absolute address.
As the exact location of this code depends on where the trampoline lives in
memory, the corresponding displacement needs to be adjusted at runtime to
correctly find the per-cpu __x86_call_depth value; otherwise the targeted
address is wrong, leading to the page fault seen above.
Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT
instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(),
as it is done for example by the x86 BPF JIT compiler through
x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots,
in ftrace_caller and ftrace_regs_caller.
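The displacement adjustment itself is simple arithmetic. The sketch below is not text_poke_apply_relocation(); it is a hypothetical helper that shows, under the assumption that the result still fits in 32 bits, how the displacement of one %rip-relative operand changes when the instruction bytes are copied from src to dst.

```c
#include <stdint.h>

/*
 * A %rip-relative operand addresses: next_ip + disp. After copying the
 * instruction from src to dst, next_ip moves by (dst - src), so the
 * displacement must move by (src - dst) to keep the same target:
 *
 *   src + insn_len + disp == dst + insn_len + new_disp
 */
static int32_t sketch_reloc_disp(int32_t disp, uint64_t src, uint64_t dst)
{
	/* Assumes the adjusted value still fits in a signed 32-bit field. */
	return (int32_t)((int64_t)disp + (int64_t)(src - dst));
}
```

For example, an instruction at 0x1000 with displacement 0x200 that is copied to 0x5000 needs displacement -0x3E00 to reach the same address.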
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the redefinition of __cleanup with
__maybe_unused added to it is unnecessary because the referenced LLVM
change is present in all supported LLVM versions. Drop it.
kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the check added to CC_HAS_ASM_GOTO_OUTPUT by
commit e2ffa15b9baa ("kbuild: Disable CC_HAS_ASM_GOTO_OUTPUT on clang <
17") can be removed, as the issue it detects is guaranteed to be fixed.
x86/build: Drop unnecessary '-ffreestanding' addition to KBUILD_CFLAGS
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the addition of '-ffreestanding' to
KBUILD_CFLAGS for 32-bit x86 is unnecessary, as the linked LLVM bug is
resolved in all supported LLVM versions.
16cb16e0d285 ("x86/build: Remove -ffreestanding on i386 with GCC")
intended to make the addition of '-ffreestanding' clang only but due to
a bug in the adjusted check from
d70da12453ac ("hardening: Enable i386 FORTIFY_SOURCE on Clang 16+")
it has been applied for all versions of GCC and clang < 16.0.0. There
are no known problems with removing this for GCC but if one surfaces, it
can be restored under a CONFIG_CC_IS_GCC block.
scripts/Makefile.warn: Drop -Wformat handling for clang < 16
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the block dealing with -Wformat with clang
prior to 16 can be removed since the condition for its inclusion is
always false.
riscv: Drop tautological condition from TOOLCHAIN_NEEDS_OLD_ISA_SPEC
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the Clang dependency part of
CONFIG_TOOLCHAIN_NEEDS_OLD_ISA_SPEC is always false, so it can be
removed. Adjust the help text to remove mention of Clang < 17, as it is
irrelevant for the kernel after the minimum supported bump.
riscv: Remove tautological condition from selection of ARCH_SUPPORTS_CFI
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the condition of the selection of
CONFIG_ARCH_SUPPORTS_CFI is always true, so it can be removed.
ARM: Drop tautological ld.lld conditions from ARCH_MULTI_V4{,T}
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!ld.lld || ld.lld >= 16' dependency of
CONFIG_ARCH_MULTI_V4{,T} is always true, so it can be removed from both
symbols.
arch/Kconfig: Remove tautological condition from AUTOFDO_CLANG
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the clang version check in
CONFIG_AUTOFDO_CLANG can be removed because it is always true.
arch/Kconfig: Remove tautological conditions from HAS_LTO_CLANG
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, two dependency lines in CONFIG_HAS_LTO_CLANG
are always true because Clang will always be newer than 17.0.0, so they
can be removed.
security/Kconfig.hardening: Remove tautological condition from CC_HAS_RANDSTRUCT
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!Clang || Clang >= 16' dependency for
CONFIG_CC_HAS_RANDSTRUCT is always true, so it can be removed.
security/Kconfig.hardening: Remove tautological condition from FORTIFY_SOURCE
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!X86_32 || !Clang || Clang > 16'
dependency of CONFIG_FORTIFY_SOURCE is always true, so it can be
removed.
security/Kconfig.hardening: Remove tautological condition from CC_HAS_ZERO_CALL_USED_REGS
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!Clang || Clang > 15.0.6' dependency for
CONFIG_CC_HAS_ZERO_CALL_USED_REGS is always true, so it can be removed.
kbuild: Bump minimum version of LLVM for building the kernel to 17.0.1
The current minimum version of LLVM for building the kernel is 15.0.0.
However, there are two deficiencies compared to GCC that were fixed in
LLVM 17 that are starting to become more noticeable.
The first was a bug in LLVM's scope checker [1], where all labels in a
function were validated as potential targets of an asm goto statement,
even if they were not listed in the asm goto statement as targets. This
becomes particularly problematic when the cleanup attribute is used, as
    asm goto(... : label_a);
    ...
  label_a:
    ...
    int var __free(foo);
    asm goto(... : label_b);
    ...
  label_b:
    ...
will trigger an error since the scope checker will complain that the
cleanup variable would be skipped when jumping from the first asm goto
to label_b (which obviously cannot happen). This issue was the catalyst
for commit e2ffa15b9baa ("kbuild: Disable CC_HAS_ASM_GOTO_OUTPUT on
clang < 17"). Unfortunately, this issue is reproducible with regular asm
goto in addition to asm goto with outputs, so that change was not
entirely sufficient to avoid the issue altogether. As asm goto has
effectively been required since commit a0a12c3ed057 ("asm goto:
eradicate CC_HAS_ASM_GOTO") and the usage of the cleanup attribute
continues to grow across the tree, raising the minimum to a version that
avoids this issue altogether is a better long term solution than
attempting to work around it at every spot where it happens.
The second issue is an incompatibility with GCC 8.1+ around variables
marked with const being valid constant expressions for _Static_assert
and other macros [2]. With GCC 8.1 being the minimum supported version
since commit 118c40b7b503 ("kbuild: require gcc-8 and binutils-2.30"),
this incompatibility becomes more of a maintenance burden since only
clang-15 and clang-16 are affected by it.
Looking at the clang version of various major distributions through
Docker images, no one should be left behind as a result of this bump, as
the old ones cannot clear the current minimum of 15.0.0.
  archlinux:latest            clang version 22.1.3
  debian:oldoldstable-slim    Debian clang version 11.0.1-2
  debian:oldstable-slim       Debian clang version 14.0.6
  debian:stable-slim          Debian clang version 19.1.7 (3+b1)
  debian:testing-slim         Debian clang version 21.1.8 (3+b1)
  debian:unstable-slim        Debian clang version 21.1.8 (7+b1)
  fedora:42                   clang version 20.1.8 (Fedora 20.1.8-4.fc42)
  fedora:latest               clang version 21.1.8 (Fedora 21.1.8-4.fc43)
  fedora:44                   clang version 22.1.1 (Fedora 22.1.1-2.fc44)
  fedora:rawhide              clang version 22.1.3 (Fedora 22.1.3-1.fc45)
  opensuse/leap:latest        clang version 17.0.6
  opensuse/tumbleweed:latest  clang version 21.1.8
  ubuntu:jammy                Ubuntu clang version 14.0.0-1ubuntu1.1
  ubuntu:noble                Ubuntu clang version 18.1.3 (1ubuntu1)
  ubuntu:questing             Ubuntu clang version 20.1.8 (0ubuntu4)
  ubuntu:resolute             Ubuntu clang version 21.1.8 (6ubuntu1)
17.0.1 is chosen as the minimum instead of 17.0.0 to ensure that the
particular version of LLVM 17 has the two aforementioned bugs fixed, as
the second was fixed during the 17.0.0 release candidate phase and it
was not until LLVM 18 that LLVM adopted the scheme of x.0.0 being a
prerelease version and x.1.0 being a release version [3] to help with
scenarios such as this.
Steve French [Fri, 22 May 2026 23:28:49 +0000 (18:28 -0500)]
smb: client: fix uninitialized variable in smb2_writev_callback
Compiling with W=2 pointed out that "written" may be used uninitialized.
Fixes: 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref")
Cc: stable@vger.kernel.org
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Jeremy Erazo [Wed, 20 May 2026 18:23:31 +0000 (18:23 +0000)]
smb: client: detect short folioq copy in cifs_copy_folioq_to_iter()
cifs_copy_folioq_to_iter() copies a requested number of bytes from
a folio queue into the destination iterator. Since the encrypted
SMB2 READ path was changed to pass the server-declared payload
length (data_len) instead of the larger folioq buffer length, the
caller can ask for fewer bytes than the folio queue holds.
In that case the helper continues walking the remaining folios after
data_size has reached zero and calls copy_folio_to_iter() with
len = 0, which is unnecessary work.
The helper also returns 0 (success) when the folio queue is
exhausted before data_size bytes have been copied. The caller has
no way to distinguish that from a full copy and the reported
transfer count ends up larger than the amount of data placed in the
iterator.
Add an early exit when data_size reaches zero, and return an error
when the folio queue is exhausted before all requested bytes have
been copied.
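The two behaviours described above can be sketched in userspace C. This is not the cifs helper: the chunk array stands in for the folio queue, memcpy() stands in for copy_folio_to_iter(), and the -EIO error code is an assumption made for the illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one folio in the queue. */
struct sketch_chunk {
	const char *data;
	size_t len;
};

/*
 * Copy exactly data_size bytes from the chunks into dst. Returns 0 on
 * success, or -EIO if the chunks run out first. *visited reports how many
 * chunks were actually examined.
 */
static int sketch_copy_chunks(const struct sketch_chunk *chunks, size_t nr,
			      char *dst, size_t data_size, size_t *visited)
{
	size_t i;

	*visited = 0;
	for (i = 0; i < nr; i++) {
		size_t len;

		/* Early exit: nothing left to copy, skip remaining chunks. */
		if (data_size == 0)
			break;

		(*visited)++;
		len = chunks[i].len < data_size ? chunks[i].len : data_size;
		memcpy(dst, chunks[i].data, len);
		dst += len;
		data_size -= len;
	}

	/* Chunks exhausted before the requested amount was copied. */
	return data_size ? -EIO : 0;
}
```

With three 4-byte chunks, a 6-byte request touches only the first two chunks, while a 13-byte request fails instead of silently reporting success.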
Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix this while meeting these requirements:
* It must be possible to include the MSHV root driver without the
VMBus driver. In such case, the MSHV root driver can be built-in
to the kernel image, or it can be built as a separate module.
* If both the MSHV root driver and the VMBus driver are present, the
MSHV root driver and VMBus driver can both be built-in, or they can
both be separate modules. Or the MSHV root driver can be a module
while the VMBus driver can be built-in, but the reverse is
disallowed. Regardless of the build choices, the VMBus driver must
be loaded before the MSHV driver in order for the SynIC to be
managed properly (see comments in the MSHV SynIC code).
The fix has two parts:
* Add a Kconfig entry for MSHV_ROOT to depend on HYPERV_VMBUS if
HYPERV_VMBUS is present. The entry disallows MSHV_ROOT being
built-in when HYPERV_VMBUS is a module, but without requiring that
HYPERV_VMBUS be built.
* Add a stub implementation of hv_vmbus_exists() for when the
VMBus driver is not present so that the MSHV root driver has
no module dependency on VMBus. When the VMBus driver *is*
present, the module dependency ensures that the VMBus driver
loads first when both are built as modules.
Existing code ensures that the VMBus driver loads first if it is
built-in. The VMBus driver uses subsys_initcall(), which is
initcall level 4. The MSHV root driver uses module_init(), which
becomes device_init() when built-in, and device_init() is
initcall level 6.
Dexuan Cui [Wed, 27 May 2026 19:21:01 +0000 (12:21 -0700)]
hyperv: Clean up and fix the guest ID comment in hvgdk.h
Change the "64 bit" to "64-bit", and the "Os" to "OS".
Remove the obsolete paragraph since the guideline has been
published in the Hypervisor Top Level Functional Specification
for many years.
The "OS Type" is 0x1 for Linux, not 0x100.
No functional change.
Fixes: 83ba0c4f3f31 ("Drivers: hv: Cleanup the guest ID computation")
Signed-off-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Hamza Mahfooz <hamzamahfooz@linux.microsoft.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
Tejun Heo [Wed, 27 May 2026 19:26:32 +0000 (09:26 -1000)]
bpf: Fix bpf_arena_handle_page_fault() redefinition without CONFIG_BPF_SYSCALL
On configs with CONFIG_BPF=y but CONFIG_BPF_SYSCALL=n (e.g. arm
multi_v7_defconfig), kernel/bpf/core.c defines a __weak
bpf_arena_handle_page_fault() while bpf_defs.h already supplies a static
inline stub for it, causing a redefinition error. Build the __weak
definition only under CONFIG_BPF_SYSCALL, matching the bpf_defs.h
declaration and the CONFIG_BPF_SYSCALL-gated strong definition in arena.c.
Fixes: dc11a4dba246 ("bpf: Recover arena kernel faults with scratch page")
Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Song Liu <song@kernel.org>
Link: https://lore.kernel.org/r/20260527192632.2109419-1-tj@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Shuai Zhang [Mon, 25 May 2026 06:51:56 +0000 (14:51 +0800)]
Bluetooth: hci_qca: Use 100 ms SSR delay for rampatch and NVM loading
When bt_en is pulled high by hardware, the host does not re-download
the firmware after SSR. The controller loads the rampatch and NVM
internally.
On the HMT chip, the rampatch is ~264 KB and the NVM is ~9.4 KB. The
loading process takes approximately 70 ms. The previous 50 ms delay is
too short, causing the controller to not respond to the reset command
sent by the host, which leads to BT initialization failure:
Bluetooth: hci0: QCA memdump Done, received 458752, total 458752
Bluetooth: hci0: mem_dump_status: 2
Bluetooth: hci0: Opcode 0x0c03 failed: -110
Increase the delay to 100 ms, which was confirmed as a safe value by
the controller, to ensure the controller has finished loading the
firmware before the host sends commands.
Steps to reproduce:
1. Trigger SSR and wait for SSR to complete:
hcitool cmd 0x3f 0c 26
2. Run "bluetoothctl power on" and observe that BT fails to start.
Fixes: fce1a9244a0f ("Bluetooth: hci_qca: Fix SSR (SubSystem Restart) fail when BT_EN is pulled up by hw")
Cc: stable@vger.kernel.org
Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@oss.qualcomm.com>
Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Doruk Tan Ozturk [Mon, 25 May 2026 16:24:38 +0000 (18:24 +0200)]
Bluetooth: hci_sync: fix UAF in hci_le_create_cis_sync
hci_le_create_cis_sync() dereferences conn->conn_timeout after releasing
both rcu_read_lock() and hci_dev_lock(hdev). The conn pointer was
obtained from an RCU-protected iteration over hdev->conn_hash.list and
is not valid once these locks are dropped. A concurrent disconnect can
free the hci_conn between the unlock and the dereference, causing a
use-after-free read.
The cancellation mechanism in hci_conn_del() cannot prevent this because
hci_le_create_cis_pending() queues hci_create_cis_sync with data=NULL:
Since NULL != conn, the lookup in _hci_cmd_sync_lookup_entry() never
matches, and the pending work item is not cancelled.
Fix this by saving conn->conn_timeout into a local variable while the
locks are still held, so the stale conn pointer is never dereferenced
after unlock.
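The shape of the fix can be sketched with a toy model. Nothing below is the hci_sync code: the struct and function names are hypothetical, and the boolean "lock" merely stands in for rcu_read_lock() and hci_dev_lock() to mark the region in which the connection may be dereferenced.

```c
#include <stdbool.h>
#include <stddef.h>

struct sketch_conn {
	unsigned int conn_timeout;
};

struct sketch_dev {
	bool locked;              /* toy lock, single-threaded illustration */
	struct sketch_conn *conn; /* only valid while the lock is held */
};

static void sketch_lock(struct sketch_dev *d)   { d->locked = true; }
static void sketch_unlock(struct sketch_dev *d) { d->locked = false; }

/* Returns the timeout to wait with, or 0 if no connection was found. */
static unsigned int sketch_get_timeout(struct sketch_dev *d)
{
	unsigned int timeout = 0;

	sketch_lock(d);
	if (d->conn)
		timeout = d->conn->conn_timeout; /* copy while protected */
	sketch_unlock(d);

	/*
	 * From here on only the local copy is used; in the real code the
	 * connection could be freed by another context at this point.
	 */
	return timeout;
}
```

The point of the pattern is that the value outlives the lock, while the pointer does not need to.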
This is the same class of bug as the one fixed by commit 035c25007c9e
("Bluetooth: hci_sync: Fix UAF on le_read_features_complete") which
addressed the identical pattern in a different function.
This vulnerability was identified using 0sec.ai, an open-source
automated security auditing platform (https://github.com/0sec-labs).
Fixes: c09b80be6ffc ("Bluetooth: hci_conn: Fix not waiting for HCI_EVT_LE_CIS_ESTABLISHED")
Cc: stable@vger.kernel.org
Reported-by: Doruk Tan Ozturk <doruk@0sec.ai>
Signed-off-by: Doruk Tan Ozturk <doruk@0sec.ai>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Zhao Dongdong [Tue, 26 May 2026 03:21:39 +0000 (11:21 +0800)]
Bluetooth: 6lowpan: check skb_clone() return value in send_mcast_pkt()
The skb_clone() function can return NULL if memory allocation fails.
send_mcast_pkt() calls skb_clone() without checking the return value, which
can lead to a NULL pointer dereference in send_pkt() when it dereferences
skb->data.
Add a NULL check after skb_clone() and skip the peer if the clone fails.
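A userspace sketch of the check-and-skip pattern follows. It is not the 6lowpan code: sketch_clone() is a hypothetical stand-in for skb_clone() built on malloc(), and the fail_mask parameter exists only to simulate allocation failures for particular peers.

```c
#include <stdlib.h>
#include <string.h>

struct sketch_pkt {
	size_t len;
	unsigned char data[64];
};

/* Stand-in for skb_clone(): returns NULL when the allocation fails. */
static struct sketch_pkt *sketch_clone(const struct sketch_pkt *src, int fail)
{
	struct sketch_pkt *copy = fail ? NULL : malloc(sizeof(*copy));

	if (copy)
		memcpy(copy, src, sizeof(*copy));
	return copy;
}

/*
 * Send src to every peer. Bit i of fail_mask simulates a failed clone for
 * peer i. Returns the number of peers that were actually served.
 */
static int sketch_send_mcast(const struct sketch_pkt *src, int nr_peers,
			     unsigned int fail_mask)
{
	int sent = 0;
	int i;

	for (i = 0; i < nr_peers; i++) {
		struct sketch_pkt *copy = sketch_clone(src, (fail_mask >> i) & 1);

		/* Skip this peer instead of dereferencing a NULL clone. */
		if (!copy)
			continue;

		/* A real sender would transmit copy->data here. */
		sent++;
		free(copy);
	}
	return sent;
}
```

A failed clone costs one peer its copy of the packet but leaves the remaining peers unaffected.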
Fixes: 18722c247023 ("Bluetooth: Enable 6LoWPAN support for BT LE devices")
Signed-off-by: Zhao Dongdong <zhaodongdong@kylinos.cn>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Shuai Zhang [Thu, 21 May 2026 05:25:47 +0000 (13:25 +0800)]
Bluetooth: btusb: Allow firmware re-download when version matches
The Bluetooth host decides whether to download firmware by reading the
controller firmware download completion flag and firmware version
information.
If a USB error occurs during the firmware download process (for example
due to a USB disconnect), the download is aborted immediately. An
incomplete firmware transfer does not cause the controller to set the
download completion flag, but the firmware version information may be
updated at an early stage of the download process.
In this case, after USB reconnection, the host attempts to re-download
the firmware because the download completion flag is not set. However,
since the controller reports the same firmware version as the target
firmware, the download is skipped. This ultimately results in the
firmware not being properly updated on the controller.
This change removes the restriction that skips firmware download when
the versions are equal. It covers scenarios where the USB connection
can be disconnected at any time and ensures that firmware download can
be retriggered after USB reconnection, allowing the Bluetooth firmware
to be correctly and completely updated.
Fixes: 3267c884cefa ("Bluetooth: btusb: Add support for QCA ROME chipset family")
Cc: stable@vger.kernel.org
Signed-off-by: Shuai Zhang <shuai.zhang@oss.qualcomm.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Muhammad Bilal [Wed, 20 May 2026 22:56:43 +0000 (18:56 -0400)]
Bluetooth: HIDP: fix missing length checks in hidp_input_report()
hidp_input_report() reads keyboard and mouse payload data from an skb
without first verifying that skb->len contains enough data.
hidp_recv_intr_frame() pulls the 1-byte HIDP header before dispatching
to hidp_input_report(). If a paired device sends a truncated packet,
the handler reads beyond the valid skb data, resulting in an
out-of-bounds read of skb data. The OOB bytes may be interpreted as
phantom key presses or spurious mouse movement.
Replace the open-coded length tracking and pointer arithmetic with
skb_pull_data() calls. skb_pull_data() returns NULL if the requested
bytes are not present, eliminating the need for a manual size variable
and the separate skb->len guard.
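The behaviour relied on here can be modelled in a few lines. The helper below is a hypothetical userspace analogue of skb_pull_data(), written only to show the contract described above: consume the requested bytes from the front of the buffer, or return NULL and leave the buffer untouched when too few bytes remain.

```c
#include <stddef.h>

/* Hypothetical stand-in for the part of an skb that matters here. */
struct sketch_buf {
	const unsigned char *data;
	size_t len;
};

/*
 * Consume len bytes from the front of the buffer and return a pointer to
 * them, or NULL if fewer than len bytes are present.
 */
static const void *sketch_pull_data(struct sketch_buf *b, size_t len)
{
	const void *p = b->data;

	if (b->len < len)
		return NULL;

	b->data += len;
	b->len -= len;
	return p;
}
```

A caller that checks for NULL after each pull cannot read past the end of a truncated packet, which removes the need for a separately maintained size variable.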
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Cc: stable@vger.kernel.org
Signed-off-by: Muhammad Bilal <meatuni001@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Siwei Zhang [Thu, 21 May 2026 02:12:20 +0000 (22:12 -0400)]
Bluetooth: L2CAP: use chan timer to close channels in cleanup_listen()
l2cap_chan_close() removes the channel from conn->chan_l, which
must be done under conn->lock. cleanup_listen() runs under the
parent sk_lock, so acquiring conn->lock would invert the
established conn->lock -> chan->lock -> sk_lock order.
Instead of calling l2cap_chan_close() directly, schedule
l2cap_chan_timeout with delay 0 to close the channel
asynchronously. The timeout handler already acquires conn->lock
and chan->lock in the correct order.
The timer is only armed when chan->conn is still set: if it is
already NULL, l2cap_conn_del() has already processed this channel
(l2cap_chan_del + l2cap_sock_teardown_cb + l2cap_sock_close_cb),
so there is nothing left to do. If l2cap_conn_del() races in
after the timer is armed, __clear_chan_timer() inside
l2cap_chan_del() cancels it; if the timer has already fired, the
handler returns harmlessly because chan->conn was cleared.
Fixes: 3df91ea20e74 ("Bluetooth: Revert to mutexes from RCU list")
Cc: <stable@vger.kernel.org> # 0b58004: Bluetooth: fix UAF in l2cap_sock_cleanup_listen() vs l2cap_conn_del()
Signed-off-by: Siwei Zhang <oss@fourdim.xyz>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Siwei Zhang [Thu, 21 May 2026 02:30:36 +0000 (22:30 -0400)]
Bluetooth: L2CAP: fix chan ref leak in l2cap_chan_timeout() on !conn
__set_chan_timer() takes an l2cap_chan reference via l2cap_chan_hold()
before scheduling the delayed work. The normal path in
l2cap_chan_timeout() drops this reference with l2cap_chan_put() at the
end, but the early return when chan->conn is NULL skips the put,
leaking the reference.
Add the missing l2cap_chan_put() before the early return.
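The leak can be pictured with a small userspace model. This is only a
sketch under stated assumptions: struct chan, chan_hold(), chan_put(),
set_timer() and both timeout handlers are hypothetical stand-ins, not the
kernel's L2CAP code, and the reference count is a plain int rather than a
kref.

```c
#include <stddef.h>

/* Hypothetical userspace stand-in for struct l2cap_chan. */
struct chan {
	int refcnt;	/* models the channel reference count */
	void *conn;	/* NULL once the connection is gone */
	int closed;	/* set when the timeout closes the channel */
};

static void chan_hold(struct chan *c) { c->refcnt++; }
static void chan_put(struct chan *c) { c->refcnt--; }

/* Arming the timer takes a reference on behalf of the pending work. */
static void set_timer(struct chan *c)
{
	chan_hold(c);
}

/* Buggy shape: the early return skips the put. */
static void timeout_buggy(struct chan *c)
{
	if (!c->conn)
		return;		/* reference taken by set_timer() leaks */

	c->closed = 1;
	chan_put(c);
}

/* Fixed shape: every exit drops the reference held by the timer. */
static void timeout_fixed(struct chan *c)
{
	if (!c->conn) {
		chan_put(c);
		return;
	}

	c->closed = 1;
	chan_put(c);
}
```

In this model, arming the timer on a channel whose conn is NULL and then
running timeout_buggy() leaves refcnt one higher than before, while
timeout_fixed() restores the original count.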
Fixes: adf0398cee86 ("Bluetooth: l2cap: fix null-ptr-deref in l2cap_chan_timeout")
Cc: stable@vger.kernel.org
Signed-off-by: Siwei Zhang <oss@fourdim.xyz>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Pavitra Jha [Thu, 21 May 2026 08:04:14 +0000 (04:04 -0400)]
Bluetooth: hci_conn: Fix memory leak in hci_le_big_terminate()
hci_le_big_terminate() allocates iso_list_data via kzalloc_obj but
returns 0 without freeing it when neither pa_sync_term nor big_sync_term
flags are set after evaluating the PA and BIG sync connection state.
This early-return path was introduced when hci_le_big_terminate() was
refactored to take struct hci_conn instead of raw u8 parameters, adding
PA/BIG flag evaluation logic. The existing kfree() on hci_cmd_sync_queue
failure does not cover this path.
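A userspace sketch of the leaking path and one way to close it is given
here. Everything in it is an assumption made for illustration: struct
iso_data, data_alloc(), data_free(), queue_cmd() and big_terminate() are
hypothetical names, calloc()/free() stand in for the kernel allocator, and
freeing on the early return is assumed to be the shape of the fix.

```c
#include <errno.h>
#include <stdlib.h>

static int live_allocs;	/* outstanding allocations in this model */

/* Hypothetical stand-in for the iso_list_data fields that matter here. */
struct iso_data {
	int pa_sync_term;
	int big_sync_term;
};

static struct iso_data *data_alloc(void)
{
	struct iso_data *d = calloc(1, sizeof(*d));

	if (d)
		live_allocs++;
	return d;
}

static void data_free(struct iso_data *d)
{
	if (d) {
		free(d);
		live_allocs--;
	}
}

/* Queue stand-in: on success the queued work owns and frees the data. */
static int queue_cmd(struct iso_data *d)
{
	data_free(d);
	return 0;
}

static int big_terminate(int pa_term, int big_term)
{
	struct iso_data *d = data_alloc();
	int ret;

	if (!d)
		return -ENOMEM;

	d->pa_sync_term = pa_term;
	d->big_sync_term = big_term;

	/* Nothing to terminate: free before returning, or d leaks. */
	if (!d->pa_sync_term && !d->big_sync_term) {
		data_free(d);
		return 0;
	}

	ret = queue_cmd(d);
	if (ret)
		data_free(d);	/* pre-existing cleanup on queue failure */
	return ret;
}
```

With the data_free() call in the early-return branch removed,
big_terminate(0, 0) would return 0 while leaving live_allocs at 1.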
Fixes: a7bcffc673de ("Bluetooth: Add PA_LINK to distinguish BIG sync and PA sync connections")
Cc: stable@vger.kernel.org
Signed-off-by: Pavitra Jha <jhapavitra98@gmail.com>
Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Zicheng Qu [Wed, 27 May 2026 09:38:50 +0000 (17:38 +0800)]
tools/sched_ext: Fix scx_show_state per-scheduler state reads
scx_show_state.py still reads scx_aborting and scx_bypass_depth as
global symbols. Those symbols no longer exist after the state was moved
into struct scx_sched, so the drgn script fails when it reaches either
field.
Read aborting and bypass_depth from scx_root instead. This preserves the
script's current root-scheduler view: with sub-scheduler support, the
reported values are for the root scheduler and sub-schedulers are not
enumerated.
Fixes: 5c8d98a1b4de ("sched_ext: Move bypass state into scx_sched")
Fixes: c1743da43cf5 ("sched_ext: Move aborting flag to per-scheduler field")
Signed-off-by: Zicheng Qu <quzicheng@huawei.com>
Reviewed-by: Andrea Righi <arighi@nvidia.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Sun Shaojie [Wed, 27 May 2026 07:05:09 +0000 (15:05 +0800)]
cgroup/cpuset: Add test cases for sibling CPU exclusion on partition update
When sibling CPU exclusion occurs, a partition's effective_xcpus may be
a subset of its user_xcpus. The partcmd_update path must use
effective_xcpus instead of user_xcpus when calculating CPUs to return
to or request from the parent.
Add two test cases to verify this behavior:
1) Narrowing cpuset.cpus to only the sibling-excluded CPUs should not
return CPUs to parent that the partition never actually owned.
2) Expanding cpuset.cpus after a sibling becomes a member should
correctly request the additional CPUs from parent.
Co-developed-by: Zhang Guopeng <zhangguopeng@kylinos.cn>
Signed-off-by: Zhang Guopeng <zhangguopeng@kylinos.cn>
Signed-off-by: Sun Shaojie <sunshaojie@kylinos.cn>
Reviewed-by: Waiman Long <longman@redhat.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
Sun Shaojie [Wed, 27 May 2026 06:43:28 +0000 (14:43 +0800)]
cgroup/cpuset: Use effective_xcpus in partcmd_update add/del mask calculation
When sibling CPU exclusion occurs, a partition's user_xcpus may contain
CPUs that were never actually granted to it. These CPUs are present in
user_xcpus(cs) but not in cs->effective_xcpus.
The partcmd_update path in update_parent_effective_cpumask() uses
user_xcpus(cs) (via the local variable xcpus) to compute the addmask
(CPUs to return to parent) and delmask (CPUs to request from parent).
This is incorrect:
1) When newmask removes a CPU that was previously excluded by a
sibling, addmask incorrectly includes that CPU and tries to return
it to the parent even though the partition never actually owned it,
causing CPU overlap with sibling partitions and triggering warnings
in generate_sched_domains().
2) When newmask adds a previously excluded CPU that is now available,
delmask fails to request it from the parent because user_xcpus(cs)
already includes it.
Fix this by using cs->effective_xcpus instead of user_xcpus(cs) in all
partcmd_update paths that calculate addmask or delmask, including the
PERR_NOCPUS error handling paths.
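The two failure modes above can be reproduced with plain bitmask
arithmetic. The following is a simplified model, not the kernel code: CPUs
are bits in a uint64_t, struct masks and calc_update() are hypothetical
names, and the model ignores every other constraint that
update_parent_effective_cpumask() applies.

```c
#include <stdint.h>

/* Hypothetical result pair for this model. */
struct masks {
	uint64_t addmask;	/* CPUs to return to the parent */
	uint64_t delmask;	/* CPUs to request from the parent */
};

/*
 * xcpus is the set the partition is assumed to own before the update,
 * newmask is the requested set after the update.
 */
static struct masks calc_update(uint64_t xcpus, uint64_t newmask)
{
	struct masks m;

	m.addmask = xcpus & ~newmask;	/* owned before, not wanted now */
	m.delmask = newmask & ~xcpus;	/* wanted now, not owned before */
	return m;
}
```

Take user_xcpus = 0xf (CPUs 0-3) with CPU 3 excluded by a sibling, so
effective_xcpus = 0x7. Narrowing to newmask = 0x7 yields addmask = 0x8
when computed from user_xcpus, returning a CPU that was never owned, and
addmask = 0 when computed from effective_xcpus. Expanding to newmask = 0xf
once CPU 3 is free yields delmask = 0 from user_xcpus, so CPU 3 is never
requested, and delmask = 0x8 from effective_xcpus.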
Reproducers:
Example 1 - Removing a sibling-excluded CPU incorrectly returns it:
Commit bf9e4e30f353 ("x86/mm: use pagetable_free()") switched from
freeing non-boot page tables through __free_pages() to
pagetable_free().
However, free_pagetable() is also called to free vmemmap pages.
Given that vmemmap pages are not page tables, the page_ptdesc(page)
conversion is already wrong. But worse, pagetable_free() calls:
__free_pages(page, compound_order(page));
Since vmemmap pages are not compound pages (see vmemmap_alloc_block())
-- except for HVO, which doesn't apply here -- only the first page of a
PMD-sized vmemmap page is freed, leaking the other ones.
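The size of the leak follows from simple arithmetic. The sketch below
assumes x86-64 with 4 KiB base pages and 2 MiB PMD mappings; the helper
names are hypothetical, and pages_freed() only models the fact that
freeing at a given order releases 1 << order pages.

```c
/* Assumed x86-64 geometry: 4 KiB base pages, 2 MiB PMD mappings. */
#define PAGE_SHIFT	12
#define PMD_SHIFT	21

/* Number of base pages released when freeing at the given order. */
static unsigned long pages_freed(unsigned int order)
{
	return 1UL << order;
}

/* Number of base pages backing one PMD-sized vmemmap block. */
static unsigned long pmd_pages(void)
{
	return 1UL << (PMD_SHIFT - PAGE_SHIFT);
}

/* Pages left behind when a PMD-sized block is freed at this order. */
static unsigned long pages_leaked(unsigned int order)
{
	return pmd_pages() - pages_freed(order);
}
```

A non-compound page reports order 0, so under these assumptions
pages_leaked(0) is 511 of the 512 pages; freeing at order 9 leaks none.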
Fix it by properly decoupling pagetable and vmemmap freeing.
free_pagetable() no longer has to mess with SECTION_INFO, as only the
vmemmap is marked like that in register_page_bootmem_memmap().
The indentation in remove_pmd_table() is messed up. Fix that while
touching it.
Bootmem info handling will soon be fixed up. For now, handle it
similarly to free_pagetable(), just avoiding the ifdef.
[ dhansen: changelog munging. More imperative voice ]
QA output created by 637
entries 7 and 8 have duplicate d_off 8
Found unlinked files in open dir (see xfstests-dev/results//generic/637.full for details)
Like HFS+, HFS currently has very complicated and fragile logic
for correcting rd->file->f_pos in hfs_delete_cat().
This patch removes that logic and stores the current
position in hfs_readdir_data instead. Finally, if rd->pos == ctx->pos,
hfs_readdir() looks up the position in the
b-tree node by means of hfs_cat_key. That position is
used to restart the traversal of the folder's contents.
Breno Leitao [Sun, 24 May 2026 15:19:56 +0000 (08:19 -0700)]
workqueue: drop spurious '*' from print_worker_info() fn declaration
print_worker_info() declares its local 'fn' as work_func_t * but
worker->current_func has type work_func_t (a function pointer). The
extra level of indirection is wrong and only happens to be harmless
today because every supported Linux architecture has
sizeof(work_func_t) == sizeof(work_func_t *):
copy_from_kernel_nofault() reads the correct number of bytes by
accident, and %ps still resolves the printed address because the
stored value is the function address regardless of declared type.
On any future ABI where sizeof(void (*)()) differs from
sizeof(void *), the nofault copy would transfer the wrong number of
bytes and the subsequent %ps would print an incorrect address.
Match the field type so the intent is explicit and the code does not
silently rely on equal pointer sizes.
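The type mismatch can be illustrated outside the kernel. In this sketch
struct worker is reduced to the one field that matters, sample_work() and
read_current_func() are hypothetical names, and memcpy() stands in for
copy_from_kernel_nofault().

```c
#include <stddef.h>
#include <string.h>

struct work_struct;	/* opaque in this sketch */
typedef void (*work_func_t)(struct work_struct *);

/* Hypothetical reduced worker: only the field under discussion. */
struct worker {
	work_func_t current_func;
};

static void sample_work(struct work_struct *work)
{
	(void)work;
}

/*
 * The local has the same type as the field, so sizeof(fn) is by
 * construction the number of bytes stored in current_func. Declaring
 * it as work_func_t * would copy sizeof(work_func_t *) bytes instead,
 * which is only correct where both pointer kinds have the same size.
 */
static work_func_t read_current_func(const struct worker *w)
{
	work_func_t fn = NULL;

	memcpy(&fn, &w->current_func, sizeof(fn));
	return fn;
}
```

For a worker whose current_func is sample_work, read_current_func()
returns that same function address.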
Fixes: 3d1cb2059d93 ("workqueue: include workqueue info when printing debug dump of a worker task")
Signed-off-by: Breno Leitao <leitao@debian.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
Jim Mattson [Wed, 27 May 2026 17:43:47 +0000 (10:43 -0700)]
KVM: selftests: Update hwcr_msr_test for CPUID faulting bit
Add BIT_ULL(35) (CpuidUserDis) to the valid mask in hwcr_msr_test, now that
KVM accepts writes to this bit when the guest CPUID advertises
CpuidUserDis.
Jim Mattson [Wed, 27 May 2026 17:43:46 +0000 (10:43 -0700)]
KVM: x86: Virtualize AMD CPUID faulting
On AMD CPUs, CPUID faulting support is advertised via
CPUID.80000021H:EAX.CpuidUserDis[bit 17] and enabled by setting
HWCR.CpuidUserDis[bit 35].
Advertise the feature to userspace regardless of host CPU support. Allow
writes to HWCR to set bit 35 when the guest CPUID advertises
CpuidUserDis. Update cpuid_fault_enabled() to check HWCR.CpuidUserDis as
well as MSR_FEATURE_ENABLES.CPUID_GP_ON_CPL_GT_0.
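The resulting checks can be modelled roughly as follows. This is a sketch
and not KVM's code: struct vcpu_model, set_hwcr() and the macro names are
hypothetical, the model validates only bit 35 of HWCR and ignores every
other bit, and bit 0 for the Intel enable is an assumption about the MSR
layout.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n)			(1ULL << (n))
#define HWCR_CPUID_USER_DIS		BIT_ULL(35)	/* per the text above */
#define FEATURE_ENABLES_CPUID_GP	BIT_ULL(0)	/* assumed position */

/* Hypothetical reduced vCPU state for this model. */
struct vcpu_model {
	uint64_t hwcr;
	uint64_t feature_enables;
	bool guest_has_cpuid_user_dis;	/* guest CPUID advertises it */
};

/* Faulting is on if either vendor's enable bit is set. */
static bool cpuid_fault_enabled(const struct vcpu_model *v)
{
	return (v->feature_enables & FEATURE_ENABLES_CPUID_GP) ||
	       (v->hwcr & HWCR_CPUID_USER_DIS);
}

/* Returns 0 on success and 1 when the write should be rejected. */
static int set_hwcr(struct vcpu_model *v, uint64_t data)
{
	if ((data & HWCR_CPUID_USER_DIS) && !v->guest_has_cpuid_user_dis)
		return 1;

	v->hwcr = data;
	return 0;
}
```

In this model a write that sets bit 35 is rejected until the guest CPUID
advertises CpuidUserDis, after which the same write succeeds and
cpuid_fault_enabled() starts returning true.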
Unlike VMX, SVM prioritizes the CPUID intercept over the #GP induced by
CPUID faulting.[1] This behavior has been confirmed on a Turin CPU (F/M/S
1AH/2/1).
Jim Mattson [Wed, 27 May 2026 17:43:45 +0000 (10:43 -0700)]
KVM: x86: Remove supports_cpuid_fault() helper
The function, supports_cpuid_fault(), tests specifically for guest support
of Intel's CPUID faulting feature. It does not test for guest support of
AMD's CPUID faulting feature.
To avoid confusion, remove the helper.
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Jim Mattson <jmattson@google.com>
Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>
Link: https://patch.msgid.link/20260527174347.2356165-4-jmattson@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
Jim Mattson [Wed, 27 May 2026 17:43:44 +0000 (10:43 -0700)]
KVM: x86: Prioritize CPUID faulting over CPUID VM-exits in nested VMX
Per the Intel SDM, "Certain exceptions have priority over VM exits. These
include invalid-opcode exceptions, faults based on privilege level, and
general-protection exceptions that are based on checking I/O permission
bits in the task-state segment (TSS)."
Ensure that when L2 executes CPUID at CPL > 0 while L1 has enabled CPUID
faulting, KVM intercepts the exit in L0 and queues #GP rather than
forwarding the CPUID VM-exit to L1.
Empirical testing confirms that this #GP has higher precedence than a CPUID
VM-exit on Granite Rapids (F/M/S 6/0xad/1).
KVM: x86: Consolidate CPUID fault handling for emulator and interception logic
Extract the logic for emulating CPUID faulting (where CPUID #GPs at CPL>0
outside of SMM) into a dedicated helper and use the helper for both the
full emulator and the intercepted-CPUID paths.
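The condition such a shared helper has to encode is small enough to state
as a predicate. The function below is a hypothetical standalone model that
takes the relevant state as plain arguments; it does not reproduce the
signature or body of the real helper.

```c
#include <stdbool.h>

/*
 * CPUID is refused (the caller would inject #GP) only when faulting is
 * enabled, the current privilege level is above 0, and the vCPU is not
 * in SMM. In every other case the instruction is allowed to proceed.
 */
static bool cpuid_allowed(bool fault_enabled, int cpl, bool in_smm)
{
	return !(fault_enabled && cpl > 0 && !in_smm);
}
```

Having one predicate for both the emulator path and the intercept path
keeps the two from drifting apart on the SMM or CPL conditions.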
Opportunistically drop kvm_require_cpl(), as kvm_emulate_cpuid() was the
one and only user.
No functional change intended.
[jim: Add EXPORT_STATIC_CALL_GPL(kvm_x86_get_cpl) so that KVM vendor
modules can call kvm_is_cpuid_allowed(). Fix typo in commit message.]
Cheng-Yang Chou [Mon, 25 May 2026 17:22:31 +0000 (01:22 +0800)]
sched_ext: idle: Fix errno loss in scx_idle_init()
|| is a boolean operator, so any nonzero (error) return short-circuits
the expression, which then evaluates to 1 rather than the actual errno.
The caller in scx_init() logs and propagates this value, so the wrong
code reaches upper layers.
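The effect is easy to demonstrate in isolation. In the sketch below
step_ok(), step_fail(), init_buggy() and init_fixed() are hypothetical
names; the fixed variant shows one common way to keep the errno and is not
necessarily the form the patch uses.

```c
#include <errno.h>

static int step_ok(void)
{
	return 0;
}

static int step_fail(void)
{
	return -ENOMEM;
}

/* Buggy shape: the result of || is always 0 or 1, never an errno. */
static int init_buggy(void)
{
	return step_fail() || step_ok();
}

/* Fixed shape: check each step and return its own error code. */
static int init_fixed(void)
{
	int ret;

	ret = step_fail();
	if (ret)
		return ret;

	return step_ok();
}
```

Here init_buggy() returns 1, while init_fixed() returns -ENOMEM, the value
the failing step actually reported.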