Ville Syrjälä [Fri, 22 May 2026 20:03:40 +0000 (23:03 +0300)]
drm/i915/bw: Fix 'deinterleave' rounding direction
For some reason we're rounding up when calculating the deinterleave
value. But the spec says we should round down. Fix it.
But I suppose this doesn't actually matter since the deinterleave
values should always be a power of two. The only exception is therefore
the deinterleave==1 case, which gets handled by the max(..., 1).
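The claim that the rounding direction rarely matters can be checked with a small sketch. The helper names and the divisor passed in are illustrative assumptions, not the driver's actual expression; only the round-up versus round-down choice and the max(..., 1) clamp come from the text above.

```c
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

unsigned int max_uint(unsigned int a, unsigned int b)
{
	return a > b ? a : b;
}

/* Old behaviour: round the quotient up, then clamp to at least 1. */
unsigned int deinterleave_round_up(unsigned int val, unsigned int div)
{
	return max_uint(DIV_ROUND_UP(val, div), 1);
}

/* New behaviour: round down (plain integer division), then clamp. */
unsigned int deinterleave_round_down(unsigned int val, unsigned int div)
{
	return max_uint(val / div, 1);
}
```

For power-of-two inputs divided by 2 the two helpers agree, including val == 1 where the clamp supplies the result. They only diverge for inputs such as 3, which should not occur.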
Ville Syrjälä [Fri, 22 May 2026 20:03:38 +0000 (23:03 +0300)]
drm/i915/bw: Fix DCLK rounding mess
Fix up the total mess when calculating the DCLK
frequency. Some codepaths are trying to do both DIV_ROUND_UP()
and an open coded "round to nearest" at the same time. The
MTL+ codepath was the only one that was correct (using
DIV_ROUND_CLOSEST()).
Let's unify all of them, and borrow the actual '100/6'
approach from adl_calc_psf_bw() so that we get even fewer
rounding errors.
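A minimal sketch of why the '100/6' form helps, assuming the DCLK is a ratio multiplied by 16.666... MHz (that multiplier and the function names are assumptions for illustration; the macros are local stand-ins for the kernel ones, valid for non-negative values):

```c
#define DIV_ROUND_UP(n, d)      (((n) + (d) - 1) / (d))
#define DIV_ROUND_CLOSEST(n, d) (((n) + (d) / 2) / (d))

/* Decimal approximation 16667/1000, rounded up: overshoots exact results. */
unsigned int dclk_mhz_round_up(unsigned int ratio)
{
	return DIV_ROUND_UP(16667 * ratio, 1000);
}

/* 16.666... expressed exactly as 100/6, rounded to nearest. */
unsigned int dclk_mhz_closest(unsigned int ratio)
{
	return DIV_ROUND_CLOSEST(ratio * 100, 6);
}
```

With a ratio of 6 the exact answer is 100 MHz. The first helper yields 101 because the 16667 approximation already overshoots before the round-up, while the second yields 100.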
Ville Syrjälä [Fri, 22 May 2026 20:03:37 +0000 (23:03 +0300)]
drm/i915/bw: Fix num_planes handling on TGL+
The TGL+ bw code has an off by one error on the num_planes
calculation, and tgl_max_bw_index() incorrectly bumps
the num_planes to 1 from 0.
That approach made sense on ICL where num_planes is more or
less a minimum number of planes to consider for the group,
but on TGL+ num_planes really is a maximum number of planes,
so these adjustments no longer make any sense there.
Introduce an event relayer mechanism. Instead of each open file
registering directly with `ec_dev->event_notifier`, the platform device
registers a single relayer notifier. Individual files then register
with a local subscribers list in `chardev_pdata`.
This allows the driver to safely disconnect from the event chain
`ec_dev->event_notifier` during cros_ec_chardev_remove(), preventing
events from being delivered to open files after the device is removed,
while still allowing those files to be closed safely later.
Introduce struct chardev_pdata to hold platform driver data.
The platform driver data is allocated by kzalloc() instead of the devm
variant, allowing for managed cleanup that can eventually extend beyond
device removal if files are still open.
hdrlen is __u8. For n >= 127 the result exceeds 255 and silently
truncates. With n=127 (cmpri=15, cmpre=15, pad=0, hdrlen=16):
(128 * 16) >> 3 = 256, truncated to 0 as __u8
The caller in ipv6_rpl_srh_rcv() then places the compressed header
at buf + ((ohdr->hdrlen + 1) << 3). With hdrlen=0 this is buf + 8,
but the decompressed region occupies buf[0..2055] (8-byte header
plus 128 full addresses). The compressed header overlaps the
decompressed data, and ipv6_rpl_srh_compress() writes into this
overlap, corrupting the routing header of the forwarded packet.
The existing guard at exthdrs.c:546 checks (n + 1) > 255, which
prevents n+1 from overflowing unsigned char (the segments_left
field), but does not prevent the computed hdrlen from overflowing
__u8. n=127 passes because 128 <= 255, yet hdrlen=256 does not
fit.
Tighten the bound to (n + 1) > 127. This caps n at 126, giving
hdrlen = (127 * 16) >> 3 = 254, which fits in __u8. The compressed
header then lands at buf + ((254 + 1) << 3) = buf + 2040, exactly
past the decompressed region (buf[0..2039]). No overlap. 127
segments is well beyond any realistic RPL deployment.
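The arithmetic above can be reproduced with a short sketch. The function names are hypothetical; the formulas are the ones quoted in this message (hdrlen = ((n + 1) * 16) >> 3 stored in 8 bits, compressed header placed at (hdrlen + 1) << 3, decompressed data of 8 bytes plus n + 1 full addresses).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* hdrlen as stored in the 8-bit field: ((n + 1) * 16) >> 3, truncated. */
uint8_t rpl_hdrlen(unsigned int n)
{
	return (uint8_t)(((n + 1) * 16) >> 3);
}

/* Bytes used by the decompressed header: 8-byte header + (n + 1) addresses. */
size_t rpl_decompressed_len(unsigned int n)
{
	return 8 + (size_t)(n + 1) * 16;
}

/* Offset at which the caller places the compressed header. */
size_t rpl_compressed_off(unsigned int n)
{
	return ((size_t)rpl_hdrlen(n) + 1) << 3;
}

/* True when the compressed header would start inside the decompressed data. */
bool rpl_overlaps(unsigned int n)
{
	return rpl_compressed_off(n) < rpl_decompressed_len(n);
}
```

For n = 127 the stored hdrlen wraps to 0 and the regions overlap; for n = 126 the compressed header starts at offset 2040, exactly at the end of the decompressed data.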
Jim Mattson [Wed, 27 May 2026 23:47:07 +0000 (23:47 +0000)]
KVM: x86/pmu: Allow Host-Only/Guest-Only bits with nSVM and mediated PMU
Now that KVM correctly handles Host-Only and Guest-Only bits in the
event selector MSRs, allow the guest to set them if the vCPU advertises
SVM and uses the mediated PMU.
Yosry Ahmed [Wed, 27 May 2026 23:47:06 +0000 (23:47 +0000)]
KVM: x86/pmu: Reprogram Host/Guest-Only counters on nested transitions
Reprogram PMU counters on nested transitions for the mediated PMU, to
re-evaluate Host-Only and Guest-Only bits and enable/disable the PMU
counters accordingly. For example, if Host-Only is set and Guest-Only is
cleared, a counter should be disabled when entering guest mode and
enabled when exiting guest mode.
According to the APM, when EFER.SVME is cleared, setting Host-Only or
Guest-Only disables the counter, so also trigger counter reprogramming
when EFER.SVME is toggled.
Counters setting any of Host-Only and Guest-Only bits are already being
tracked in pmc_has_mode_specific_enables; use the bitmap to reprogram
these counters.
Reprogram the counters synchronously on nested VMRUN/#VMEXIT and
EFER.SVME toggling. This is necessary as these instructions are counted
based on the new CPU state (after the instruction is retired in
hardware). Hence, the PMU needs to be updated before instruction
emulation is completed and kvm_pmu_instruction_retired() is called.
Defer reprogramming the counters when force leaving guest mode through
svm_leave_nested() to avoid potentially reading stale state (e.g.
incorrect EFER). All flows that force leaving nested are non-architectural,
so accuracy is irrelevant.
Refactor a helper out of kvm_pmu_request_reprogram_counters() that
accepts a boolean allowing synchronous vs deferred reprogramming, and
use that from SVM code to support both scenarios.
Yosry Ahmed [Wed, 27 May 2026 23:47:05 +0000 (23:47 +0000)]
KVM: x86/pmu: Track mediated PMU counters with mode-specific enables
Instead of always checking if a counter needs to be disabled for
mode-specific reasons (e.g. Host-Only/Guest-Only bits in SVM), add a
bitmap to track such counters. Set the bit for counters using either
Host-Only or Guest-Only bits in EVENTSEL on SVM.
This bitmap will also be reused in following changes to selectively
apply changes to such counters.
Yosry Ahmed [Wed, 27 May 2026 23:47:04 +0000 (23:47 +0000)]
KVM: x86/pmu: Disable counters based on Host-Only/Guest-Only bits in SVM
Introduce an optional per-vendor PMU callback for checking if a counter
is disabled in the current mode, and register a callback on AMD to
disable a counter based on the vCPU's setting of Host-Only or Guest-Only
EVENT_SELECT bits with the mediated PMU.
If EFER.SVME is set, all events are counted if both bits are set or
cleared. If only one bit is set, the counter is disabled if the vCPU
context does not match the set bit.
If EFER.SVME is cleared, the counter is disabled if any of the bits is
set, otherwise all events are counted. Note that a Linux guest correctly
handles this and clears Host-Only when EFER.SVME is cleared, see commit
1018faa6cf23 ("perf/x86/kvm: Fix Host-Only/Guest-Only counting with SVM
disabled").
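The rules in the two paragraphs above reduce to a small truth table. The following is a sketch only: the function name and boolean parameters are hypothetical and do not reflect the actual callback signature in KVM.

```c
#include <stdbool.h>

/* Returns true when the counter must be disabled in the current context. */
bool pmc_disabled_in_mode(bool svme, bool host_only, bool guest_only,
			  bool in_guest)
{
	/* Neither bit set: all events are counted. */
	if (!host_only && !guest_only)
		return false;

	/* EFER.SVME cleared: any set bit disables the counter. */
	if (!svme)
		return true;

	/* Both bits set: all events are counted. */
	if (host_only && guest_only)
		return false;

	/* Exactly one bit set: disabled when the context does not match it. */
	return host_only ? in_guest : !in_guest;
}
```

For example, with EFER.SVME set and only Host-Only set, the counter is disabled while the vCPU is in guest mode and enabled otherwise.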
The callback is made from pmc_is_locally_enabled(), which is used for
the mediated PMU when updating eventsel_hw in
kvm_mediated_pmu_refresh_eventsel_hw(), as well as when checking what
PMCs count instructions/branches for emulation in
kvm_pmu_recalc_pmc_emulation().
Host-Only and Guest-Only bits are currently reserved, so this change is
a noop, but the bits will be allowed with mediated PMU in a following
change when fully supported.
Yosry Ahmed [Wed, 27 May 2026 23:47:03 +0000 (23:47 +0000)]
KVM: x86/pmu: Add support for KVM_X86_PMU_OP_OPTIONAL_RET0
Add definitions for KVM_X86_PMU_OP_OPTIONAL_RET0() to resolve to
__static_call_return0, similar to KVM_X86_OP_OPTIONAL_RET0(). Move the
definition of kvm_pmu_call() to pmu.h, and add declarations for the
static PMU calls in the header to allow making callbacks from the header
in following changes.
Yosry Ahmed [Wed, 27 May 2026 23:47:02 +0000 (23:47 +0000)]
KVM: x86/pmu: Check mediated PMU counter enablement before event filters
If the guest disables the counter (by clearing
ARCH_PERFMON_EVENTSEL_ENABLE), KVM still performs the PMU filter lookup,
even though it doesn't end up changing eventsel_hw. Check if the
counter is enabled by the guest before doing the potentially expensive
PMU filter lookup.
Yosry Ahmed [Wed, 27 May 2026 23:47:00 +0000 (23:47 +0000)]
KVM: x86/pmu: Rename reprogram_counters() to clarify usage
Rename reprogram_counters() to kvm_pmu_request_counters_reprogram(),
clarifying that it is more similar to
kvm_pmu_request_counter_reprogram(), and less similar to
reprogram_counter(). The kvm_pmu_* prefix is also appropriate as the
function is exposed in the header.
Opportunistically rename the argument from 'diff' to 'counters'.
Yosry Ahmed [Wed, 27 May 2026 23:46:59 +0000 (23:46 +0000)]
KVM: x86: Move enable_pmu/enable_mediated_pmu to pmu.h and pmu.c
The declaration and definition of enable_pmu/enable_mediated_pmu
semantically belong in pmu.h and pmu.c, and more importantly, pmu.h
uses enable_mediated_pmu and relies on the caller including x86.h.
There is already precedent for other module params defined outside of
x86.c, so move enable_pmu/enable_mediated_pmu to pmu.c.
Yosry Ahmed [Wed, 27 May 2026 23:46:58 +0000 (23:46 +0000)]
KVM: nSVM: Move VMRUN instruction retirement after entering guest mode
A successful VMRUN retires in guest mode and should be counted by the
PMU as a guest instruction. Move the call to
kvm_pmu_instruction_retired() after potentially entering guest mode,
such that VMRUN is counted correctly.
The PMU event will be matched against L2's CPL, but otherwise this does
not change the behavior in terms of guest vs. host, because KVM does
not virtualize Host-Only/Guest-Only PMC controls yet, so all
instructions are counted regardless of the vCPU's host/guest state. But
this change is needed for the incoming support for Host-Only/Guest-Only
controls to count VMRUN correctly.
Yosry Ahmed [Wed, 27 May 2026 23:46:57 +0000 (23:46 +0000)]
KVM: nSVM: Unify RIP and PMU handling calls when emulating VMRUN
The code paths for advancing RIP and retiring the instruction for RIP
are very similar whether or not caching vmcb12 succeeds. The only
difference is handling mapping failures (i.e. EFAULT).
Pull the mapping failure handling out and unify the calls to
svm_skip_emulated_instruction() and kvm_pmu_instruction_retired(), but
return immediately after if copying and caching vmcb12 failed. A nice
side effect of this is that the FIXME comment is now above the only code
path calling svm_skip_emulated_instruction().
Yosry Ahmed [Wed, 27 May 2026 23:46:56 +0000 (23:46 +0000)]
KVM: nSVM: Bail early out of VMRUN emulation if advancing RIP fails
If svm_skip_emulated_instruction() fails, then RIP could not be
advanced correctly (e.g. decode failure when NextRIP is not available).
KVM will exit to userspace to handle the emulation failure, but only
after stuffing the wrong RIP into vmcb01 and entering guest mode.
Bail early and exit to userspace before committing any side-effects of
emulating the VMRUN (e.g. entering guest mode).
Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-3-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Wed, 27 May 2026 23:46:55 +0000 (23:46 +0000)]
KVM: nSVM: Stop leaking single-stepping on VMRUN into L2
According to the APM, TF on VMRUN causes a #DB after VMRUN completes on
the _host_ side. However, KVM injects a #DB in L2 context instead (or
exits to userspace if KVM_GUESTDBG_SINGLESTEP is set) in
kvm_skip_emulated_instruction().
Avoid single-step handling on VMRUN by open-coding the rest of
kvm_skip_emulated_instruction() in nested_svm_vmrun(). This doesn't look
pretty, but following changes will need to open-code
kvm_pmu_instruction_retired() anyway, and will clean up the code. This
ignores TF on VMRUN instead of injecting a spurious exception into
L2. Document this virtualization hole with a FIXME.
Note that a failed VMRUN would have been correctly single-stepped, but
now TF is always ignored for consistency and simplicity purposes. VMX
does not support TF on a successful VMLAUNCH/VMRESUME, so it's unlikely
that single-stepping VMRUN properly is important, especially if it's
only for failed VMRUNs.
Fixes: c8e16b78c614 ("x86: KVM: svm: eliminate hardcoded RIP advancement from vmrun_interception()")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260527234711.4175166-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Willem de Bruijn [Tue, 26 May 2026 13:40:37 +0000 (09:40 -0400)]
net: sch_fq: update flow delivery time on earlier EDT packet
When inserting an EDT packet with time before flow->time_next_packet,
update the flow and possibly queue next delivery time.
Reinsert the flow into the q->delayed rb-tree to position correctly
and to have fq_check_throttled set wake-up at the right next time.
Factor RB tree insertion out of fq_flow_set_throttled to avoid open
coding twice.
EDT packets do not take precedence over queue rate limit. Skip this
new step if a queue limit is set. EDT packets do take precedence over
per-socket rate limits, as can be seen from fq_dequeue reading
sk_pacing_rate if !skb->tstamp.
With this change the so_txtime selftest sends packets in the expected
order.
====================
Introduce Airoha AN8801R series Gigabit Ethernet PHY driver
This series introduces the Airoha AN8801R Gigabit Ethernet PHY initial
support.
The Airoha AN8801R is a low power single-port Ethernet PHY Transceiver
with Single-port serdes interface for 1000Base-X/RGMII.
This chip is compliant with 10Base-T, 100Base-TX and 1000Base-T IEEE
802.3(u,ab) and supports:
- Energy Efficient Ethernet (802.3az)
- Full Duplex Flow Control (802.3x)
- auto-negotiation
- crossover detect and autocorrection
- Wake-on-LAN with Magic Packet
- Jumbo Frame up to 9 Kilobytes
This PHY also supports up to three user-configurable LEDs, which are
usually used for LAN Activity, 100M, 1000M indication.
The series provides the devicetree binding and the driver that have been
written by AngeloGioacchino Del Regno, based on downstream
implementation ([1]). The driver allows setting up PHY LEDs, 10/100M,
1000M speeds, and Wake on LAN and PHY interrupts.
Since v2, the series also adds the air_phy_lib library, whose goal is to
share common code between air_en8811h and air_an8801 drivers, and its use
in them. The first shared functions are the existing BuckPbus register
accessors and air_phy_read/write_page functions coming from air_en8811h
driver.
The series is based on net-next kernel tree (sha1: 90d03ee2c5dc) and
I have tested it on Mediatek Genio 720-EVK board (that integrates an
Airoha AN8801RIN/A Ethernet PHY) with early board hardware enablement
patches.
net: phy: air_an8801: ensure maximum available speed link use
To ensure that the Airoha AN8801R PHY uses the maximum available link
speed, an additional register write is needed to configure the function
mode for either 1G or 100M/10M operation after link detection.
So, in air_an8801 driver, implement a custom read_status callback, that
after genphy_read_status determines the link speed, sets the bit 0 of
the link mode register (REG_LINK_MODE) if the detected speed is 1Gbps,
or unsets it otherwise.
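The register update described above amounts to a single-bit read-modify-write. In this sketch the macro and function names are hypothetical, and the register is modelled as a plain value rather than accessed through the PHY library; only "bit 0 set for 1Gbps, cleared otherwise" comes from the text.

```c
#include <stdint.h>

#define LINK_MODE_1G	(1u << 0)	/* bit 0 of the link mode register */

/* Compute the new link mode register value for the detected speed. */
uint32_t link_mode_for_speed(uint32_t reg, int speed_mbps)
{
	if (speed_mbps == 1000)
		return reg | LINK_MODE_1G;	/* 1G function mode */

	return reg & ~LINK_MODE_1G;		/* 100M/10M function mode */
}
```

All other bits of the register are preserved, so the update can be applied after every link status read without disturbing unrelated settings.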
Introduce a driver for the Airoha AN8801R Series Gigabit Ethernet
PHY; this currently supports setting up PHY LEDs, 10/100M, 1000M
speeds, and Wake on LAN and PHY interrupts.
net: phy: Rename Airoha common BuckPBus register accessors
Rename the BuckPBus register accessor functions present in air_phy_lib
and their calls in air_en8811h driver, so all exported functions start
with the same prefix.
In preparation of Airoha AN8801R PHY support, move the BuckPBus
register accessors and definitions, present in air_en8811h driver,
into the Airoha PHY shared code (air_phy_lib), so they will be usable
by the new driver without duplicating them.
In preparation of Airoha AN8801R PHY support, split out the interface
functions that will be common between the already present air_en8811h
driver and the new one, and put them into a new library named
air_phy_lib.
Jiayuan Chen [Tue, 26 May 2026 02:55:29 +0000 (10:55 +0800)]
net/sched: cls_bpf: prevent unbounded recursion in offload rollback
Quan Sun reported [1] a stack overflow in cls_bpf_offload_cmd().
Reproducer on netdevsim: add a skip_sw cls_bpf filter, set the
bpf_tc_accept debugfs knob to 0, then `tc filter replace`. The replace
calls tc_setup_cb_replace() which fails. cls_bpf_offload_cmd() then
swaps prog/oldprog and recursively calls itself to roll back. But
bpf_tc_accept=0 makes the rollback fail too, which triggers yet another
rollback frame with the same arguments, and so on until the stack is
exhausted.
bpf_tc_accept is just a convenient knob for the reproducer. Any driver
whose tc_setup_cb_replace() fails twice in a row can hit the same loop,
so this is not a netdevsim-only issue.
Two ways to fix it:
1) Have the rollback call tc_setup_cb_add() on oldprog instead of
re-entering cls_bpf_offload_cmd().
2) Mark the rollback frame with a flag and skip a second-level
rollback from inside it.
Go with (2). It is the smaller change and keeps the original behaviour:
the rollback still goes through tc_setup_cb_replace(), so the driver
gets one real chance to restore its state. If that attempt also fails,
we just return the original error instead of recursing.
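Approach (2) can be sketched in isolation as follows. Everything here is a stand-in: the struct, the fake driver callback, and the is_rollback parameter are hypothetical names used to model the idea, not the actual cls_bpf code.

```c
#include <stdbool.h>
#include <stddef.h>

struct prog { int id; };

int driver_calls;	/* how many times the fake driver was invoked */
bool driver_accepts;	/* stands in for the bpf_tc_accept knob */

/* Fake driver callback: returns -1 whenever driver_accepts is false. */
int fake_setup_cb_replace(struct prog *prog, struct prog *oldprog)
{
	(void)prog;
	(void)oldprog;
	driver_calls++;
	return driver_accepts ? 0 : -1;
}

int offload_cmd(struct prog *prog, struct prog *oldprog, bool is_rollback)
{
	int err = fake_setup_cb_replace(prog, oldprog);

	/*
	 * Roll back once with prog/oldprog swapped. The rollback frame is
	 * flagged, so a failure inside it cannot trigger another rollback.
	 */
	if (err && !is_rollback)
		offload_cmd(oldprog, prog, true);

	return err;	/* always the original error */
}
```

With a driver that rejects every request, the callback runs exactly twice (the original attempt plus one rollback) and the caller sees the first error, instead of recursing until the stack is exhausted.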
Jakub Kicinski [Thu, 28 May 2026 00:42:18 +0000 (17:42 -0700)]
Merge branch 'ethtool-more-bug-fixes'
Jakub Kicinski says:
====================
ethtool: more bug fixes
Last week I sent two patch sets - one fixing bugs in RSS handling,
and one fixing CMIS / module handling. This set contains the remaining
fixes. There's a concentration of fixes around PHY and timestamp config
handling but not enough to break those out as separate sets.
====================
Jakub Kicinski [Tue, 26 May 2026 15:35:33 +0000 (08:35 -0700)]
ethtool: eeprom: add more safeties to EEPROM Netlink fallback
The Netlink fallback path for reading module EEPROM
(fallback_set_params()) validates that offset < eeprom_len,
but does not check that offset + length stays within eeprom_len.
The ioctl equivalent (ethtool_get_any_eeprom() in ioctl.c) has
always enforced both bounds:
	if (eeprom.offset + eeprom.len > total_len)
		return -EINVAL;
This could lead to surprises in both drivers and device FW.
Add the missing offset + length validation to fallback_set_params(),
mirroring the ioctl.
Similarly, ethtool core in general, and ethtool_get_any_eeprom()
in particular, tries to zero-init all buffers passed to the drivers
to avoid any extra work of zeroing things out. eeprom_fallback()
uses a plain kmalloc(); change it to zalloc.
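A minimal sketch of the combined validation, assuming 32-bit offset and length values (the function name and parameter types are hypothetical). It is written as a subtraction so that the sum cannot wrap around, which is a design choice of this sketch rather than a statement about the actual patch.

```c
#include <errno.h>
#include <stdint.h>

int eeprom_check_range(uint32_t offset, uint32_t length, uint32_t total_len)
{
	/* Existing check: the start must lie inside the EEPROM. */
	if (offset >= total_len)
		return -EINVAL;

	/* Added check: offset + length must not run past the end. */
	if (length > total_len - offset)
		return -EINVAL;

	return 0;
}
```

A request for 129 bytes at offset 128 of a 256-byte EEPROM passes the first check but is rejected by the second.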
Fixes: 96d971e307cc ("ethtool: Add fallback to get_module_eeprom from netlink command")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-11-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:32 +0000 (08:35 -0700)]
ethtool: eeprom: add missing ethnl_ops_begin() / _complete() during fallback
All ethtool driver op calls should be sandwiched between
ethnl_ops_begin() / ethnl_ops_complete(). In Netlink eeprom code,
if the paged access failed we fall back to old API, but we
first call _complete() and the fallback never does its own
ethnl_ops_begin(). Move the fallback into the _begin() / _complete()
section.
Fixes: 96d971e307cc ("ethtool: Add fallback to get_module_eeprom from netlink command")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-10-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:31 +0000 (08:35 -0700)]
ethtool: strset: fix header attribute index in ethnl_req_get_phydev()
strset_prepare_data() passes ETHTOOL_A_HEADER_FLAGS (3) as the header
attribute to ethnl_req_get_phydev(). This is incorrect, in the main
attr space 3 is ETHTOOL_A_STRSET_COUNTS_ONLY, not the request
header attr. The correct constant is ETHTOOL_A_STRSET_HEADER (1).
ethnl_req_get_phydev() only uses this value for the extack,
so this is not a "functionally visible"(?) bug.
Fixes: e96c93aa4be9 ("net: ethtool: strset: Allow querying phy stats by index")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-9-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:30 +0000 (08:35 -0700)]
ethtool: tsinfo: don't pass ERR_PTR to genlmsg_cancel on prepare failure
The goto err label leads to:
	genlmsg_cancel(skb, ehdr);
	return ret;
If ethnl_tsinfo_prepare_dump() failed, it has not started a genlmsg.
There's nothing to cancel, and passing an error pointer to
genlmsg_cancel() would cause a crash.
Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-8-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:29 +0000 (08:35 -0700)]
ethtool: tsinfo: fix uninitialized stats on the by-PHC path
tsinfo_prepare_data() has two code paths: a "by-PHC" path for
user-specified hardware timestamping providers, and the old path.
Commit 89e281ebff72 ("ethtool: init tsinfo stats if requested") added
ethtool_stats_init() to mark stat slots as ETHTOOL_STAT_NOT_SET before
the driver callback populates them, but placed the call inside the
old-path block.
When commit b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to
support several hwtstamp by net topology") added the by-PHC early
return, it landed above the stats initialization. On that path
the stats array retains the zero-fill from ethnl_init_reply_data()'s
zalloc. This leads to the reply including a stats nest with four
zero-valued attributes that should have been absent.
Reject GET requests for stats with HWTSTAMP_PROVIDER or dump.
Fixes: b9e3f7dc9ed9 ("net: ethtool: tsinfo: Enhance tsinfo to support several hwtstamp by net topology")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:27 +0000 (08:35 -0700)]
ethtool: pse-pd: fix missing ethnl_ops_complete()
pse_prepare_data() is missing ethnl_ops_complete() if
ethnl_req_get_phydev() returned an error. Move getting
phydev up so that we don't have to worry about this
(similar order to linkstate_prepare_data()).
Note that phydev may still be NULL (this is checked in
pse_get_pse_attributes()), the goal isn't really to avoid
the _begin() / _complete() calls, only to simplify the error
handling.
While at it propagate the original error. Why this code
overrides the error with -ENODEV but !phydev generates
-EOPNOTSUPP is unclear to me...
Fixes: 31748765bed3 ("net: ethtool: pse-pd: Target the command to the requested PHY")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:26 +0000 (08:35 -0700)]
ethtool: linkstate: fix unbalanced ethnl_ops_complete() on PHY lookup error
linkstate_prepare_data() calls ethnl_req_get_phydev() before
ethnl_ops_begin(), but routes its error path through "goto out"
which calls ethnl_ops_complete().
Fixes: fe55b1d401c6 ("ethtool: linkstate: migrate linkstate functions to support multi-PHY setups")
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:25 +0000 (08:35 -0700)]
ethtool: tsconfig: fix reply error handling
A couple of trivial bugs in error handling in tsconfig_send_reply().
If we failed to allocate rskb we need to set the error.
If we did allocate it but failed to send it - we need to remember
to free it.
Fixes: 6e9e2eed4f39 ("net: ethtool: Add support for tsconfig command to get/set hwtstamp config")
Reviewed-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Reviewed-by: Kory Maincent <kory.maincent@bootlin.com>
Link: https://patch.msgid.link/20260526153533.2779187-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 26 May 2026 15:35:24 +0000 (08:35 -0700)]
ethtool: coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES
ethnl_update_profile() walks the ETHTOOL_A_PROFILE_IRQ_MODERATION
nest list with an index 'i' and writes new_profile[i++] without
bounding i. The destination is kmemdup()'d at NET_DIM_PARAMS_NUM_PROFILES
entries (5), but the Netlink nest count is entirely user-controlled.
Netlink policies do not have support for constraining the number
of nested entries (or number of multi-attr entries).
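Since the policy cannot bound the nest count, the bound has to be enforced in the loop itself. The sketch below models that with an array standing in for the Netlink nest list; the struct, the function name, and the -EINVAL return value are assumptions for illustration, and only the limit of 5 entries comes from the text above.

```c
#include <errno.h>
#include <stddef.h>

#define NUM_PROFILES 5	/* NET_DIM_PARAMS_NUM_PROFILES per the text above */

struct profile {
	unsigned int usec;
	unsigned int pkts;
};

/*
 * Copy user-supplied entries into dst, which holds NUM_PROFILES slots.
 * n_src is user-controlled, so the index is checked before every write.
 * dst models a scratch copy that the caller discards on error.
 */
int update_profiles(struct profile *dst, const struct profile *src,
		    size_t n_src)
{
	size_t i = 0;

	for (size_t k = 0; k < n_src; k++) {
		if (i >= NUM_PROFILES)
			return -EINVAL;	/* more entries than slots */
		dst[i++] = src[k];
	}

	return 0;
}
```

Five entries are accepted; a sixth is rejected before it can be written past the end of the destination.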
Zhao Dongdong [Tue, 26 May 2026 06:51:56 +0000 (14:51 +0800)]
net: page_pool: silence static analysis warnings in page_pool_nl_stats_fill()
nla_nest_start() can return NULL if the skb runs out of space.
Jakub:
There is no bug here, if nla_nest_start() failed there's no space
left in the message. Next nla_put_uint() will also fail and we will
exit via nla_nest_cancel() which handles NULL just fine.
Various people keep sending us this patch so let's commit this.
Eric Dumazet [Tue, 26 May 2026 14:55:29 +0000 (14:55 +0000)]
ipv6: frags: cleanup __IP6_INC_STATS() confusion
After commits e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original
netdev") and bdb7cc643fc9 ("ipv6: Count interface receive statistics
on the ingress netdev") net/ipv6/reassembly.c uses three different
ways to reach idev in various __IP6_INC_STATS() calls.
Let's centralize this from ipv6_frag_rcv() and use __in6_dev_stats_get().
Note that ipv6_frag_rcv() tests if skb->dev could be NULL already, so
I chose to also guard against NULL, but we probably can remove the
tests in a followup patch, because I do not think skb->dev could be NULL.
iif = skb->dev ? skb->dev->ifindex : 0;
idev can be NULL, __IP6_INC_STATS() deals with this possibility.
Small code size reduction as a bonus.
$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-145 (-145)
Function                                     old     new   delta
ipv6_frag_rcv                               2399    2362     -37
ip6_frag_reasm                               705     597    -108
Total: Before=31455552, After=31455407, chg -0.00%
Eric Dumazet [Tue, 26 May 2026 14:55:28 +0000 (14:55 +0000)]
ipv6: guard against possible NULL deref in __in6_dev_stats_get()
dev_get_by_index_rcu() could return NULL if the original physical
device is unregistered.
Found by Sashiko.
Fixes: e1ae5c2ea478 ("vrf: Increment Icmp6InMsgs on the original netdev")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260526145529.3587126-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 28 May 2026 00:23:07 +0000 (17:23 -0700)]
Merge branch 'bridge-fix-sleep-in-atomic-context'
Ido Schimmel says:
====================
bridge: Fix sleep in atomic context
Under certain circumstances the bridge driver can call
dev_set_promiscuity() while holding the bridge spin lock. This is a
problem as dev_set_promiscuity() might sleep.
Patches #1-#2 fix the problem in the netlink and sysfs configuration
paths by only taking the lock where it is actually needed, thereby
avoiding calling dev_set_promiscuity() from an atomic context.
Patch #3 adds test cases for both configuration paths in rtnetlink.sh
which already includes test cases for similar issues.
Note that dev_set_promiscuity() can sleep either when it takes the net
device mutex or when calling netif_rx_mode_sync(). I encountered the
problem with the latter, but blamed the former since it came earlier.
====================
Add two test cases that always pass, but trigger sleeping in atomic
context BUGs without "bridge: Fix sleep in atomic context in netlink
path" and "bridge: Fix sleep in atomic context in sysfs path".
Ido Schimmel [Tue, 26 May 2026 06:48:17 +0000 (09:48 +0300)]
bridge: Fix sleep in atomic context in sysfs path
Since the start of the git history, brport_store() always acquired the
bridge lock. Back then this decision made sense: The bridge lock
protects the STP state of the bridge and its ports and at that time the
function was only used by two STP related attributes (cost and
priority).
Nowadays, brport_store() processes a lot more attributes and most of
them do not need the bridge lock:
* Bridge flags: Only require RTNL. Read locklessly by the data path.
  Annotations can be added in net-next.
* FDB port flushing: Only requires the FDB lock.
* Multicast attributes: Only require the multicast lock.
* Group forward mask: Only requires RTNL. Read locklessly by the data
  path. Annotations can be added in net-next.
* Backup port: Only requires RTNL. Read locklessly by the data path.
This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].
Fix this by reducing the scope of the bridge lock and only take it when
processing the two STP related attributes that require it. Remove the
now stale comment from br_switchdev_set_port_flag(). The
SWITCHDEV_F_DEFER flag can be removed in net-next.
Ido Schimmel [Tue, 26 May 2026 06:48:16 +0000 (09:48 +0300)]
bridge: Fix sleep in atomic context in netlink path
Since the introduction of the netlink configuration path for bridge
ports in commit 25c71c75ac87 ("bridge: bridge port parameters over
netlink"), br_setport() was always called with the bridge lock held
around it. Back then this decision made sense: The bridge lock protects
the STP state of the bridge and its ports and at that time the function
only processed three STP related netlink attributes (cost, priority and
state).
Nowadays, br_setport() processes a lot more attributes and most of them
do not need the bridge lock:
* Bridge flags: Only require RTNL. Read locklessly by the data path.
  Annotations can be added in net-next.
* FDB port flushing: Only requires the FDB lock.
* Multicast attributes: Only require the multicast lock.
* Group forward mask: Only requires RTNL. Read locklessly by the data
  path. Annotations can be added in net-next.
* Backup port and NHID: Only require RTNL. Read locklessly by the data
  path.
This is a problem as the bridge calls dev_set_promiscuity() when certain
bridge port flags change and this function can sleep since the commit
cited below, resulting in a splat such as [1].
Fix this by reducing the scope of the bridge lock and only take it when
processing the three STP related attributes that require it. This is
consistent with the multicast attributes where each attribute acquires
the multicast lock instead of having one critical section for all
relevant attributes.
KVM: TDX: Move external page table freeing to TDX code
Move the freeing of external page tables into the reclaim operation that
lives in TDX code.
The TDP MMU supports traversing the TDP without holding locks. Page tables
need to be freed via RCU to prevent walking one that gets freed.
While none of these lockless walk operations actually happen for the mirror
page table, the TDP MMU nonetheless frees the mirror page table in the same
way, and (because it's a handy place to plug it in) the external page table
as well.
However, the external page table definitely can't be walked once the page
table pages are reclaimed from the TDX module. The TDX module releases the
page for the host VMM to use, so this RCU-time free is unnecessary for the
external page table.
So move the free_page() call to TDX code. Create a
tdp_mmu_free_unused_sp() to allow for freeing external page tables that
have never left the TDP MMU code (i.e. don't need to be freed in a special
way).
Move the logic for TDX's specific need to leak pages when reclaim
fails inside the free_external_spt() op, so this can be done in TDX
specific code and not the generic MMU.
Do this by passing in "sp" instead of the external page table pointer.
This way, TDX code can set sp->external_spt to NULL. Since the error is now
handled internally in TDX code (by triggering KVM_BUG_ON() or
TDX_BUG_ON_3(), which warn and stop the VM on any error), change the op to
return void. This way it also operates like a normal free in that success
is guaranteed from the caller's perspective.
Opportunistically, drop the unused level and gfn args while adjusting the
sp arg.
[ Rick: Re-wrote log and massaged op name ] Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
[ Yan: Updated patch log/function comment, dropped unused param in op ] Co-developed-by: Yan Zhao <yan.y.zhao@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://patch.msgid.link/20260509075730.4354-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Drop kvm_x86_ops.remove_external_spte(), and instead handle the removal of
leaf SPTEs in the S-EPT (a.k.a. external page table) in
kvm_x86_ops.set_external_spte(). This will also allow extending
tdx_sept_set_private_spte() to support splitting a huge S-EPT entry without
needing yet another kvm_x86_ops hook.
Now all changes for removing leaf mirror SPTEs are propagated through
kvm_x86_ops.set_external_spte().
- When removing leaf mirror SPTEs under shared mmu_lock (though currently
no path can trigger this scenario and TDX does not support this
scenario), tdx_sept_remove_private_spte() may produce a warning due to
lockdep_assert_held_write() or may return -EIO and trigger TDX_BUG_ON()
due to concurrent BLOCK, TRACK, REMOVE.
- When removing leaf mirror SPTEs under exclusive mmu_lock, all errors are
unexpected. If any error occurs in this scenario,
tdx_sept_remove_private_spte() will return -EIO and trigger KVM_BUG_ON().
A redundant KVM_BUG_ON() call will also be triggered in TDP MMU core in
handle_changed_spte(), which is benign (the WARN will fire if and only if
the VM isn't already bugged).
Arrange tdx_sept_remove_private_spte() (and its tdx_track() helper) to be
above tdx_sept_set_private_spte() in anticipation of routing all S-EPT
writes (with the exception of reclaiming non-leaf pages) through the "set"
API.
Rick Edgecombe [Sat, 9 May 2026 07:56:47 +0000 (15:56 +0800)]
KVM: x86/mmu: Drop KVM_BUG_ON() on shared lock to zap child external PTEs
Drop the KVM_BUG_ON() in the KVM MMU core before zapping child external
PTEs, since requiring zapping PTEs to be protected by exclusive mmu_lock is
TDX's specific requirement.
No need to plumb the shared/exclusive info into the remove_external_spte()
op or move the KVM_BUG_ON() to TDX, because
- There's already an assertion of exclusive mmu_lock protection in TDX.
- The KVM_BUG_ON() is a bit redundant given that if there's any bug causing
zapping of leaf PTEs in S-EPT under shared mmu_lock, SEAMCALL failures
due to contention would result in TDX_BUG_ON() in TDX.
KVM: x86/tdp_mmu: Centrally propagate to-present/atomic zap updates to external PTEs
Move propagation of to-present changes and atomic zap changes to external
PTEs from function __tdp_mmu_set_spte_atomic() to function
__handle_changed_spte(), which centrally handles changes of SPTEs.
When setting a PTE to present in the mirror page tables, the update needs
to be propagated to the external page tables (in TDX parlance, the S-EPT).
Today this is handled by special mirror page tables logic/branching in
__tdp_mmu_set_spte_atomic(), which is the only place where present PTEs are
set for TDX.
The current approach obviously works, but is a bit hacked on. The hook for
setting present leaf PTEs is added only where TDX happens to need it. For
example, TDX does not support any of the operations that use the non-atomic
variant, tdp_mmu_set_spte(), to set present PTEs. Since the hook is missing
there, it is very hard to understand the code from a non-TDX lens. If the
reader doesn't know the TDX specifics it could look like the external SPTE
update is missing.
In addition to being confusing, it also litters the TDP MMU with "external"
update callbacks. This is especially unfortunate because there is already a
central place to react to TDP updates, handle_changed_spte().
Begin the process of moving towards a model where all mirror page table
updates are forwarded to TDX code where the TDX-specific logic can live
with a more proper separation of concerns. Do this by adding a helper
__handle_changed_spte() and teaching it how to return error codes, such
that it can propagate the failures that may come from TDX external page
table updates. Make the original handle_changed_spte() a no-fail version of
__handle_changed_spte(), so it handles no-fail changes which are under
exclusive mmu_lock or under the no-fail path handle_removed_pt(),
triggering KVM_BUG_ON() on error returns.
Instead of having __tdp_mmu_set_spte_atomic() do the frozen mirror SPTE
dance and trigger propagation to external PTEs, make
__tdp_mmu_set_spte_atomic() a simple helper of try_cmpxchg64() and hoist
the frozen mirror SPTE dance up a level to tdp_mmu_set_spte_atomic(). Then,
the propagation of changes to present to the external PTEs can be
centralized to __handle_changed_spte(). Aging external SPTEs is not yet
supported for the mirror page table, so just warn on mirror usage in
kvm_tdp_mmu_age_spte() and invoke __tdp_mmu_set_spte_atomic() directly
without the frozen dance. No need to warn on installing FROZEN_SPTE as a
long-term value in kvm_tdp_mmu_age_spte() since removing the accessed bit is
mutually exclusive with installing FROZEN_SPTE (FROZEN_SPTE has the
accessed bit set on all x86 platforms).
Since tdp_mmu_set_spte_atomic() can also be invoked to atomically zap SPTEs
(though there's no path to trigger atomic zap on the mirror page table up
to now), also leverage set_external_spte() op to propagate the atomic zaps
when tdp_mmu_set_spte_atomic() zaps leaf SPTEs directly. (When
tdp_mmu_set_spte_atomic() zaps a non-leaf SPTE, zaps of the child leaf
SPTEs are propagated via the remove_external_spte() op).
Note: tdp_mmu_set_spte_atomic() invokes __handle_changed_spte() to handle
changes to new_spte while the mirror SPTE is frozen, so
(1) the update of the external PTEs and statistics, or
(2) the update of child mirror SPTEs, child external PTEs and corresponding
statistics,
now occur before the mirror SPTE is actually set to new_spte.
(1) is ok since if it fails, the mirror SPTE will be restored to its
original value. (2) is also ok since handle_removed_pt() is no-fail.
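The ordering described in the note can be illustrated with a minimal
userspace sketch using C11 atomics. All names here (FROZEN,
set_spte_atomic_sketch(), the propagate callbacks) are hypothetical
stand-ins and not the KVM implementation; the sketch only shows that
propagation runs while the entry is frozen, and that a failure restores
the original value.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical sentinel standing in for FROZEN_SPTE. */
#define FROZEN UINT64_MAX

/* Hypothetical hook standing in for the external page table update. */
typedef int (*propagate_fn)(uint64_t old_val, uint64_t new_val);

static int set_spte_atomic_sketch(_Atomic uint64_t *sptep, uint64_t old_val,
                                  uint64_t new_val, propagate_fn propagate)
{
    uint64_t expected = old_val;
    int ret;

    /* A long-term FROZEN value would confuse concurrent walkers. */
    assert(new_val != FROZEN);

    /* Step 1: freeze the entry; fail if someone else changed it. */
    if (!atomic_compare_exchange_strong(sptep, &expected, FROZEN))
        return -EBUSY;

    /* Step 2: propagate while frozen, before new_val becomes visible. */
    ret = propagate(old_val, new_val);

    /* Step 3: publish new_val on success, restore old_val on failure. */
    atomic_store(sptep, ret ? old_val : new_val);
    return ret;
}

/* Example callbacks: one that succeeds and one that fails. */
static int propagate_ok(uint64_t old_val, uint64_t new_val)
{
    (void)old_val;
    (void)new_val;
    return 0;
}

static int propagate_fail(uint64_t old_val, uint64_t new_val)
{
    (void)old_val;
    (void)new_val;
    return -EIO;
}
```

With propagate_fail() the entry ends up holding its original value again,
which is the property that makes case (1) safe.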
Sagi Shahar [Thu, 5 Mar 2026 22:26:27 +0000 (22:26 +0000)]
KVM: SEV: Restrict userspace return codes for KVM_HC_MAP_GPA_RANGE
To align with the updated TDX API that allows userspace to request
that guests retry MAP_GPA operations, make sure that userspace is only
returning EINVAL or EAGAIN as possible error codes.
KVM: TDX: Allow userspace to return errors to guest for MAPGPA
MAPGPA request from TDX VMs gets split into chunks by KVM using a loop
of userspace exits until the complete range is handled.
In some cases userspace VMM might decide to break the MAPGPA operation
and continue it later. For example: in the case of intrahost migration
userspace might decide to continue the MAPGPA operation after the
migration is completed.
Allow userspace to signal to TDX guests that the MAPGPA operation should
be retried the next time the guest is scheduled.
This is potentially a breaking change since if userspace sets
hypercall.ret to a value other than EBUSY or EINVAL an EINVAL error code
will be returned to userspace. As of now QEMU never sets hypercall.ret
to a non-zero value after handling KVM_EXIT_HYPERCALL so this change
should be safe.
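As a rough sketch of the rule described above, the check on the value
userspace leaves in hypercall.ret could look like the following. The
helper name check_mapgpa_user_ret() is hypothetical, and the sketch
assumes the codes are passed as positive errno values with 0 meaning
success; neither detail is confirmed by this log.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical helper: validate the value userspace stored in
 * hypercall.ret. Returns 0 if the value is acceptable, or -EINVAL if
 * the exit must be failed back to userspace.
 */
static int check_mapgpa_user_ret(uint64_t ret)
{
    switch (ret) {
    case 0:         /* success: continue with the next chunk */
    case EBUSY:     /* ask the guest to retry the operation later */
    case EINVAL:    /* reject the request */
        return 0;
    default:        /* anything else is rejected */
        return -EINVAL;
    }
}
```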
Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Vishal Annapurve <vannapurve@google.com> Co-developed-by: Sagi Shahar <sagis@google.com> Signed-off-by: Sagi Shahar <sagis@google.com> Link: https://patch.msgid.link/20260305222627.4193305-2-sagis@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Yuyang Huang [Sun, 24 May 2026 02:24:56 +0000 (11:24 +0900)]
ipv6: mcast: annotate data-races around mca_users
/proc/net/igmp6 walks IPv6 multicast memberships under RCU and prints
mca_users without holding idev->mc_lock, while multicast join and leave
paths update the field while holding idev->mc_lock. Annotate this
intentional lockless snapshot with READ_ONCE() and the matching writers
with WRITE_ONCE().
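The annotation pattern can be sketched in userspace C as below. The
macros are simplified stand-ins for the kernel's READ_ONCE() and
WRITE_ONCE() (they rely on the GNU C __typeof__ extension), and struct
mca_sketch and its helpers are hypothetical names used only for
illustration.

```c
#include <assert.h>

/* Simplified userspace stand-ins for the kernel macros. */
#define READ_ONCE(x)     (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

/* Hypothetical stand-in for the multicast membership entry. */
struct mca_sketch {
    int users;  /* stands in for mca_users */
};

/*
 * Writer side: the caller is assumed to hold the lock that serializes
 * writers (idev->mc_lock in the commit), so the plain read of users is
 * fine; only the store needs to be marked for lockless readers.
 */
static void mca_join(struct mca_sketch *mc)
{
    WRITE_ONCE(mc->users, mc->users + 1);
}

/* Reader side: takes a lockless snapshot, as the /proc dump does. */
static int mca_snapshot(struct mca_sketch *mc)
{
    return READ_ONCE(mc->users);
}
```

The reader may observe a value that is stale by the time it is printed,
which is acceptable for a diagnostic snapshot.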
Oliver Hartkopp [Tue, 26 May 2026 19:33:19 +0000 (21:33 +0200)]
bonding: refuse to enslave CAN devices
syzbot reported a kernel paging request crash in
can_rx_unregister() inside net/can/af_can.c. The crash occurs
because a virtual CAN device (vxcan) is being enslaved to a
bonding master.
During the enslavement process, the bonding driver mutates
and modifies the network device states to fit an Ethernet-like
aggregation model. However, CAN devices operate on a completely
different Layer 2 architecture, relying on the CAN mid-layer
private data structure (can_ml_priv) instead of standard
Ethernet structures. Since bonding does not initialize or
maintain these CAN structures, subsequent operations on the
half-enslaved interface (such as closing associated sockets
via isotp_release) lead to a null-pointer dereference when
accessing the CAN receiver lists.
Bonding CAN interfaces is architecturally invalid as CAN lacks
MAC addresses, ARP capabilities, and standard Ethernet
link-layer mechanisms. While generic loopback devices are
blocked globally in net/core/dev.c, virtual CAN devices
bypass this check because they do not carry the IFF_LOOPBACK
flag, despite acting as local software-loopbacks.
Fix this by explicitly blocking network devices of type
ARPHRD_CAN from being enslaved at the very beginning of
bond_enslave(). This prevents illegal state mutations,
eliminates the resulting KASAN crashes, and avoids potential
memory leaks from incomplete socket cleanups.
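A minimal sketch of such an early type check is shown below. The struct
and function names are hypothetical, the ARPHRD_* values are copied from
include/uapi/linux/if_arp.h, and the -EOPNOTSUPP return value is an
assumption; this log does not state which error code the patch uses.

```c
#include <assert.h>
#include <errno.h>

/* Values as defined in include/uapi/linux/if_arp.h. */
#define ARPHRD_ETHER 1
#define ARPHRD_CAN   280

/* Hypothetical, heavily reduced stand-in for struct net_device. */
struct net_device_sketch {
    unsigned short type;
};

/* Reject CAN devices before any state of the device is modified. */
static int bond_enslave_check_sketch(const struct net_device_sketch *slave_dev)
{
    if (slave_dev->type == ARPHRD_CAN)
        return -EOPNOTSUPP;  /* assumed error code */

    return 0;
}
```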
As CAN support was added a long time after bonding, the Fixes tag
points to the introduction of ARPHRD_CAN, which would have needed
specific handling in bonding_main.c.
Fixes: cd05acfe65ed ("[CAN]: Allocate protocol numbers for PF_CAN") Reported-by: syzbot+8ed98cbd0161632bce95@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=8ed98cbd0161632bce95 Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Acked-by: Jay Vosburgh <jv@jvosburgh.net> Link: https://patch.msgid.link/20260526-bonding-candev-v1-1-ba1df400918a@hartkopp.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selinux: hooks: use __getname() to allocate path buffer
selinux_genfs_get_sid() allocates memory for a path with __get_free_page()
although there is a dedicated helper for allocation of file paths:
__getname().
Replace __get_free_page() for allocation of a path buffer with __getname().
Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Signed-off-by: Paul Moore <paul@paul-moore.com>
selinux: use k[mz]alloc() to allocate temporary buffers
Several functions in selinuxfs.c allocate temporary buffers using
__get_free_page() or get_zeroed_page().
These buffers are used either to store a string generated by snprintf() (in
sel_make_bools()) or to copy data from user (sel_read_avc_hash_stats() and
sel_read_sidtab_hash_stats()).
Such usage does not require struct page access and it is better to allocate
these buffers with kzalloc()/kmalloc() that provide better scalability and
more debugging possibilities.
Replace use of get_zeroed_page() with kzalloc() and usage of
__get_free_page() with kmalloc().
Xuanqing Shi [Wed, 27 May 2026 02:26:17 +0000 (19:26 -0700)]
KVM: VMX: Handle bad values on proxied writes to LBR MSRs
Use the "safe" WRMSR API when writing LBRs on behalf of the guest (or host
userspace), and propagate any errors back to the instigator, as the value
being written is untrusted. E.g. if the guest (or host userspace) attempts
to set reserved bits in LBR_SELECT, then KVM needs to return an error, and
not WARN on the bad value.
Continue using the "unsafe" version of RDMSR, as it should be impossible to
reach the helper with a completely bogus MSR, i.e. WARNing on RDMSR failure
is very desirable, e.g. to make KVM bugs more visible.
Fixes: 1b5ac3226a1a ("KVM: vmx/pmu: Pass-through LBR msrs when the guest LBR event is ACTIVE") Cc: stable@vger.kernel.org Signed-off-by: Xuanqing Shi <1356292400@qq.com>
[sean: rework changelog, only modify WRMSR path, tag for stable@] Link: https://patch.msgid.link/20260527022617.3973884-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Ricardo Robaina [Wed, 27 May 2026 23:15:34 +0000 (19:15 -0400)]
audit: fix recursive locking deadlock in audit_dupe_exe()
A deadlock occurs in the audit subsystem when duplicating
executable-related rules.
When a file is moved (e.g., via do_renameat2()), the VFS layer locks
the parent directory (I_MUTEX_PARENT), which synchronously triggers an
fsnotify_move event. If an existing executable audit rule matches the
file being moved, the audit subsystem catches this event and calls
audit_dupe_exe() to duplicate the watch and update the rule. Then,
audit_alloc_mark() would call kern_path_parent() to resolve the path,
leading to a blind attempt to acquire the exact same I_MUTEX_PARENT lock
already held by the task, resulting in the following recursive locking
deadlock:
============================================
WARNING: possible recursive locking detected
6.12.0-55.27.1.el10_0.x86_64+debug #1 Not tainted
--------------------------------------------
mv/5099 is trying to acquire lock: ffff888132845358 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
at: __kern_path_locked+0x10a/0x2f0
but task is already holding lock: ffff888132846b58 (&inode->i_sb->s_type->i_mutex_dir_key/1){+.+.}-{3:3},
at: lock_two_directories+0x13f/0x2b0
other info that might help us debug this:
Possible unsafe locking scenario:
The aforementioned deadlock can be consistently reproduced by running
the script below:
audit-dupe-exe-deadlock.sh
--------------------------
#!/bin/bash
auditctl -D
mkdir -p /tmp/foo
touch /tmp/file
auditctl -a always,exit -F exe=/tmp/file -F path=/tmp/file -S all -k dr
mv /tmp/file /tmp/foo/file
rm -Rf /tmp/foo
This patch fixes the issue by introducing struct audit_watch_ctx to pass
the fsnotify event context down to audit_alloc_mark(). By utilizing the
already-resolved directory inode provided by the event, we bypass the
kern_path_parent() path resolution entirely, safely avoiding the
recursive lock. Furthermore, it explicitly allows duplicate fsnotify
marks (allow_dups = 1) during the rename update, allowing the new rule's
mark to safely coexist with the old rule's mark until the old rule is
freed.
P.S.: This issue was identified and reproduced during a comprehensive
code coverage analysis of the audit subsystem. The full report is
available at the link below:
P.P.S: With the permission of both Ricardo and Nathan, I've squashed a
fixup patch from Nathan that addresses a compile time error when
CONFIG_AUDITSYSCALL=n.
Cc: stable@kernel.org Fixes: 34d99af52ad4 ("audit: implement audit by executable") Acked-by: Waiman Long <longman@redhat.com> Acked-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Ricardo Robaina <rrobaina@redhat.com>
[PM: move link metadata into the msg, apply fix from NC] Signed-off-by: Paul Moore <paul@paul-moore.com>
KVM: x86/mmu: Plumb "sp" _pointer_ into the TDP MMU's handle_changed_spte()
Plumb the "sp" pointer into handle_changed_spte() to allow checking of
is_mirror_sp(sp) in handle_changed_spte(). This will allow consolidating
all S-EPT updates into a single kvm_x86_ops hook.
[Yan: Remove unused "as_id" param in tdp_mmu_set_spte() ]
Rick Edgecombe [Sat, 9 May 2026 07:56:09 +0000 (15:56 +0800)]
KVM: x86/tdp_mmu: Morph !is_frozen_spte() check into a KVM_MMU_WARN_ON()
Remove the conditional logic for handling the setting of mirror page table
to frozen in __tdp_mmu_set_spte_atomic() and add it as a warning for both
mirror and direct cases.
The mirror page table needs to propagate PTE changes to the external page
table. This presents a problem for atomic updates which can't update both
page tables at once. So a special value, FROZEN_SPTE, is used as a
temporary state during these updates to prevent concurrent operations on
the PTE. If the TDP MMU tried to install FROZEN_SPTE as a long-term value,
it would confuse these updates.
On the other hand, it would also confuse other threads if FROZEN_SPTE is
installed as a long-term value for direct page tables (e.g., causing
another thread working on atomic zap to wait for a !FROZEN_SPTE value
endlessly).
Therefore, add the warning for installing FROZEN_SPTE as a long-term value
in __tdp_mmu_set_spte_atomic() without differentiating whether it's a
mirror or direct page table.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://patch.msgid.link/20260509075609.4242-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Rick Edgecombe [Sat, 9 May 2026 07:55:57 +0000 (15:55 +0800)]
KVM: TDX: Move lockdep assert in __tdp_mmu_set_spte_atomic() to TDX code
Move the MMU lockdep assert in __tdp_mmu_set_spte_atomic() into the TDX
specific op because the assert is TDX specific in intention.
The TDP MMU has many lockdep asserts for various scenarios, and in fact
the callchains that are used for TDX already have a lockdep assert which
covers the case in __tdp_mmu_set_spte_atomic(). However, these asserts are
for management of the TDP root owned by KVM. In the
__tdp_mmu_set_spte_atomic() assert case, it is helping with a scheme to
avoid contention in the TDX module during zap operations. That is very
TDX specific.
One option would be to just remove the assert in
__tdp_mmu_set_spte_atomic() and rely on the other ones in the TDP MMU. But
that assert is for a different intention, and too far away from the
SEAMCALL that needs it. So just move it to TDX code.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://patch.msgid.link/20260509075557.4226-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
Rick Edgecombe [Sat, 9 May 2026 07:55:44 +0000 (15:55 +0800)]
KVM: TDX: Move KVM_BUG_ON()s in __tdp_mmu_set_spte_atomic() to TDX code
Drop some KVM_BUG_ON()s that are guarding against TDP MMU attempting to
propagate unsupported changes to the external page table through
__tdp_mmu_set_spte_atomic(). Have TDX code trigger them instead.
Now that TDP MMU logically allows propagating atomic zapping operation to
the external page table through the set_external_spte() op in
__tdp_mmu_set_spte_atomic(), TDX code will trigger the KVM_BUG_ON() on the
atomic zapping request instead. (Note: non-atomic zapping is not propagated
via the set_external_spte() op yet).
Despite the generic naming, external page table ops are designed completely
around TDX. They hook the bare minimum of what is needed, and exclude the
operations that are not supported by TDX. To help wrangle which operations
are handleable by various operations, warnings and KVM_BUG_ON()s exist in
the code. These warnings and KVM_BUG_ON()s put the burden of understanding
which operations should be forwarded to TDX code on TDP MMU developers, who
often read the code without TDX context.
Future changes will transition the encapsulation of this domain knowledge
to TDX code by funneling the external page table updates through a central
update mechanism. In this paradigm, the central update mechanism can
encapsulate the special knowledge, but will not have as much knowledge
about what operation is in progress.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com> Link: https://patch.msgid.link/20260509075544.4210-1-yan.y.zhao@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
KVM: x86/mmu: Plumb param "old_spte" into kvm_x86_ops.set_external_spte()
If tdp_mmu_set_spte_atomic() triggers an atomic zap on a mirror SPTE
(though currently no paths trigger it), the change is propagated via the
set_external_spte() op. Plumb the old SPTE into the set_external_spte() op,
so TDX code rather than TDP MMU code can warn if the atomic zap isn't
allowed, i.e. to let TDX enforce TDX's rules (inasmuch as possible).
Rename mirror_spte to new_spte to follow the TDP MMU's naming, and to make
it more obvious what value the parameter holds.
Opportunistically tweak the ordering of parameters to match the pattern of
most TDP MMU functions, which do "old, new, level".
KVM: x86/mmu: Fold set_external_spte_present() into its sole caller
Fold set_external_spte_present() into __tdp_mmu_set_spte_atomic() in
anticipation of propagating *all* changes (like atomic zap) triggered by
tdp_mmu_set_spte_atomic() to the external PTEs.
KVM: TDX: Wrap mapping of leaf and non-leaf S-EPT entries into helpers
Add a helper, tdx_sept_map_leaf_spte(), to wrap and isolate PAGE.ADD and
PAGE.AUG operations. Rename tdx_sept_link_private_spt() to
tdx_sept_map_nonleaf_spte() to wrap SEPT.ADD for symmetry.
Thus, transition tdx_sept_set_private_spte() into a "dispatch" routine for
setting/writing S-EPT entries.
Drop the dedicated .link_external_spt() for linking S-EPT pages, and
instead funnel everything through .set_external_spte() for mapping S-EPT
entries. Using separate hooks doesn't help prevent TDP MMU details from
bleeding into TDX, and vice versa; to the contrary, dedicated callbacks
will result in _more_ pollution when hugepage support is added, e.g. will
require the TDP MMU to know details about the splitting rules for TDX that
aren't all that relevant to the TDP MMU.
Ideally, KVM would provide a single pair of hooks to set S-EPT entries,
one hook for setting SPTEs under write-lock and another for setting SPTEs
under read-lock (e.g. to ensure the entire operation is "atomic", to allow
for failure, etc.). Sadly, TDX's requirement that all child S-EPT entries
are removed before the parent makes that impractical: the TDP MMU
deliberately prunes non-leaf SPTEs and _then_ processes its children, thus
making it quite important for the TDP MMU to differentiate between zapping
leaf and non-leaf S-EPT entries.
However, that's the _only_ case that's truly special, and even that case
could be shoehorned into a single hook; it just wouldn't be a net positive.
mshv: support 1G hugepages by passing them as 2M-aligned chunks
The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned
chunks into 1G mappings when alignment permits, so the driver can
support 1G hugepages by feeding them in as 2M chunks. Note that this
is the only way to make 1G mappings; there is no way to directly map
a 1G hugepage using the hypercall.
Always emit a 2M (PMD_ORDER) stride for the huge-page case. The
hypercall has no 1G stride, so 1G folios are processed as a
sequence of 2M chunks. Folios whose order is less than PMD_ORDER
(e.g. mTHP) fall back to single-page stride; mapping them as 2M
would fail in the hypervisor anyway.
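The stride selection can be sketched as follows. The helper names are
hypothetical, and PMD_ORDER_SKETCH assumes 4K base pages with 2M PMD
mappings (order 9), as on x86-64.

```c
#include <assert.h>

/* 2M / 4K = 512 pages = 2^9; assumes 4K base pages. */
#define PMD_ORDER_SKETCH 9

/* Number of base pages covered by one mapping step for this folio order. */
static unsigned long stride_pages(unsigned int folio_order)
{
    if (folio_order >= PMD_ORDER_SKETCH)
        return 1UL << PMD_ORDER_SKETCH;  /* always 2M, even for 1G folios */

    return 1;                            /* e.g. mTHP: single-page stride */
}

/* Number of mapping steps needed to cover the whole folio. */
static unsigned long stride_count(unsigned int folio_order)
{
    return (1UL << folio_order) / stride_pages(folio_order);
}
```

Under these assumptions a 1G folio (order 18) is fed to the hypercall as
512 chunks of 2M each, while an order-4 mTHP folio is mapped as 16
single pages.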
Assisted-by: Copilot-CLI:claude-opus-4.7 Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com> Acked-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Wei Liu <wei.liu@kernel.org>
Dexuan Cui [Thu, 7 May 2026 21:28:38 +0000 (14:28 -0700)]
Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
screen.lfb_base being zero [1], there is an MMIO conflict between the
drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
hv_allocate_config_window() calls vmbus_allocate_mmio() to get an
MMIO range, typically it gets a 32-bit MMIO range that overlaps with the
framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
error message "PCI Pass-through VSP failed D0 Entry with status" since
the host thinks that PCI devices must not use MMIO space that the
host has assigned to the framebuffer.
This is especially an issue if pci-hyperv is built-in and hyperv-drm is
built as a module. Consequently, the kdump/kexec kernel fails to detect
PCI devices via pci-hyperv, and may fail to mount the root file system,
which may reside on an NVMe disk. The issue described here has existed
for SR-IOV VF NICs since day one of the pci-hyperv driver, and has been
worked around on x64 when possible. With the recent introduction of
ARM64 VMs that boot from NVMe, there is no workaround, so we need a
formal fix.
On Gen2 VMs, if the screen.lfb_base is 0 in the kdump/kexec kernel [1],
fall back to the low MMIO base, which should be equal to the framebuffer
MMIO base [2] (the statement is true according to my testing on x64
Windows Server 2016, and on x64 and ARM64 Windows Server 2025 and on
Azure. I checked with the Hyper-V team and they said the statement should
continue to be true for Gen2 VMs). In the first kernel, screen.lfb_base
is not 0; if the user specifies a very high resolution, it's not enough
to only reserve 8MB: let's always reserve half of the space below 4GB,
but cap the reservation to 128MB, which is the required framebuffer size
of the highest resolution 7680*4320 supported by Hyper-V.
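The sizing rule can be sketched as below. The helper names are
hypothetical, the sketch reads "half of the space below 4GB" as half of
the low MMIO range size, and the framebuffer size estimate assumes 4
bytes per pixel; all three are assumptions rather than details taken
from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SZ_1M   (1024ULL * 1024ULL)
#define SZ_128M (128ULL * SZ_1M)

/* Assumed rule: reserve half of the low MMIO range, capped at 128MB. */
static uint64_t fb_reserve_size(uint64_t low_mmio_size)
{
    uint64_t half = low_mmio_size / 2;

    return half < SZ_128M ? half : SZ_128M;
}

/* Framebuffer bytes for a given mode, assuming 4 bytes per pixel. */
static uint64_t fb_bytes(uint64_t width, uint64_t height)
{
    return width * height * 4;
}
```

With 4 bytes per pixel, 7680*4320 needs 132710400 bytes, which is just
under the 128MB (134217728 bytes) cap.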
While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
the > to >=. Here the 'end' is an inclusive end (typically, it's
0xFFFF_FFFF for the low MMIO range).
Note: vmbus_reserve_fb() now also reserves an MMIO range at the beginning
of the low MMIO range on CVMs, which have no framebuffers (the
'screen.lfb_base' in vmbus_reserve_fb() is 0 for CVMs), just in case the
host might treat the beginning of the low MMIO range specially [3]. BTW,
the OpenHCL kernel is not affected by the change, because that kernel
boots with DeviceTree rather than ACPI (so vmbus_reserve_fb() won't run
there), and there is no framebuffer device for that kernel.
Note: normally Gen1 VMs don't have the MMIO conflict issue because the
framebuffer MMIO range (which is hardcoded to base=4GB-128MB and
size=64MB for Gen1 VMs by the host) is always reported via the legacy PCI
graphics device's BAR, so the kdump/kexec kernel can reserve the 64MB
MMIO range; however, if the VM is configured to use a very high resolution
and the required framebuffer size exceeds 64MB (AFAIK, in practice, this
isn't a typical configuration by users), the hyperv-drm driver may need to
allocate an MMIO range above 4GB and change the framebuffer MMIO location
to the allocated MMIO range -- in this case, there can still be issues [4]
which can't be easily fixed: any possible affected Gen1 users would have
to use a resolution whose framebuffer size is <= 64MB, or switch to Gen2
VMs.
x86/ftrace: Relocate %rip-relative percpu refs in dynamic trampolines
With CONFIG_CALL_DEPTH_TRACKING enabled on an x86 retbleed-affected platform
(e.g. Skylake), with retbleed=stuff, registering a dynamic ftrace trampoline
crashes on the first call into the traced function:
Monitoring the crash under GDB points to the exact instruction in charge of
incrementing the call depth:
sarq $5, %gs:__x86_call_depth(%rip)
This instruction matches the one inserted by the ftrace_regs_caller from
ftrace_64.S. This emitted code was likely working fine until the introduction
of
59bec00ace28 ("x86/percpu: Introduce %rip-relative addressing to PER_CPU_VAR()"):
it has made the call depth accounting addressing relative to %rip, instead of
being based on an absolute address.
As this code's exact location depends on where the trampoline lives in memory,
the corresponding displacement needs to be adjusted at runtime to actually
correctly find the per-cpu __x86_call_depth value, otherwise the targeted
address is wrong, leading to the page fault seen above.
Fix the %rip-relative displacement of the copied CALL_DEPTH_ACCOUNT
instruction (from ftrace_regs_caller) by calling text_poke_apply_relocation(),
as it is done for example by the x86 BPF JIT compiler through
x86_call_depth_emit_accounting(). This corrects both CALL_DEPTH_ACCOUNT slots,
in ftrace_caller and ftrace_regs_caller.
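The displacement adjustment itself is simple arithmetic: the absolute
target must stay the same after the copy, so the new displacement is the
old one plus (source address - destination address). The sketch below
illustrates only that arithmetic; the function names and the disp_off
parameter are hypothetical, and it does not reproduce
text_poke_apply_relocation().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Absolute address referenced by a %rip-relative instruction. */
static uint64_t rip_target(const uint8_t *insn, size_t disp_off,
                           size_t insn_len, uint64_t insn_addr)
{
    int32_t disp;

    memcpy(&disp, insn + disp_off, sizeof(disp));
    /* %rip points at the next instruction when the operand is resolved. */
    return insn_addr + insn_len + (int64_t)disp;
}

/*
 * Fix up the disp32 field of an instruction that was copied from
 * src_addr to dst_addr so that it still references the same target.
 * The caller must ensure the adjusted value fits in 32 bits.
 */
static void relocate_rip_disp32(uint8_t *insn_copy, size_t disp_off,
                                uint64_t src_addr, uint64_t dst_addr)
{
    int32_t disp;

    memcpy(&disp, insn_copy + disp_off, sizeof(disp));
    disp = (int32_t)(disp + (int64_t)(src_addr - dst_addr));
    memcpy(insn_copy + disp_off, &disp, sizeof(disp));
}
```

After the fixup, rip_target() evaluated at the destination address
yields the same absolute address as it did at the source address.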
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the redefinition of __cleanup with
__maybe_unused added to it is unnecessary because the referenced LLVM
change is present in all supported LLVM versions. Drop it.
kbuild: Remove check for broken scoping with clang < 17 in CC_HAS_ASM_GOTO_OUTPUT
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the check added to CC_HAS_ASM_GOTO_OUTPUT by
commit e2ffa15b9baa ("kbuild: Disable CC_HAS_ASM_GOTO_OUTPUT on clang <
17") can be removed, as the issue it detects is guaranteed to be fixed.
x86/build: Drop unnecessary '-ffreestanding' addition to KBUILD_CFLAGS
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the addition of '-ffreestanding' to
KBUILD_CFLAGS for 32-bit x86 is unnecessary, as the linked LLVM bug is
resolved in all supported LLVM versions.
16cb16e0d285 ("x86/build: Remove -ffreestanding on i386 with GCC")
intended to make the addition of '-ffreestanding' clang only but due to
a bug in the adjusted check from
d70da12453ac ("hardening: Enable i386 FORTIFY_SOURCE on Clang 16+")
it has been applied for all versions of GCC and clang < 16.0.0. There
are no known problems with removing this for GCC but if one surfaces, it
can be restored under a CONFIG_CC_IS_GCC block.
scripts/Makefile.warn: Drop -Wformat handling for clang < 16
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the block dealing with -Wformat with clang
prior to 16 can be removed since the condition for its inclusion is
always false.
riscv: Drop tautological condition from TOOLCHAIN_NEEDS_OLD_ISA_SPEC
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the Clang dependency part of
CONFIG_TOOLCHAIN_NEEDS_OLD_ISA_SPEC is always false, so it can be
removed. Adjust the help text to remove mention of Clang < 17, as it is
irrelevant for the kernel after the minimum supported bump.
riscv: Remove tautological condition from selection of ARCH_SUPPORTS_CFI
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the condition of the selection of
CONFIG_ARCH_SUPPORTS_CFI is always true, so it can be removed.
ARM: Drop tautological ld.lld conditions from ARCH_MULTI_V4{,T}
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!ld.lld || ld.lld >= 16' dependency of
CONFIG_ARCH_MULTI_V4{,T} is always true, so it can be removed from both
symbols.
arch/Kconfig: Remove tautological condition from AUTOFDO_CLANG
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the clang version check in
CONFIG_AUTOFDO_CLANG can be removed because it is always true.
arch/Kconfig: Remove tautological conditions from HAS_LTO_CLANG
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, two dependency lines in CONFIG_HAS_LTO_CLANG
are always true because Clang will always be newer than 17.0.0, so they
can be removed.
security/Kconfig.hardening: Remove tautological condition from CC_HAS_RANDSTRUCT
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!Clang || Clang >= 16' dependency for
CONFIG_CC_HAS_RANDSTRUCT is always true, so it can be removed.
security/Kconfig.hardening: Remove tautological condition from FORTIFY_SOURCE
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!X86_32 || !Clang || Clang > 16'
dependency of CONFIG_FORTIFY_SOURCE is always true, so it can be
removed.
security/Kconfig.hardening: Remove tautological condition from CC_HAS_ZERO_CALL_USED_REGS
Now that the minimum supported version of LLVM for building the kernel
has been raised to 17.0.1, the '!Clang || Clang > 15.0.6' dependency for
CONFIG_CC_HAS_ZERO_CALL_USED_REGS is always true, so it can be removed.
kbuild: Bump minimum version of LLVM for building the kernel to 17.0.1
The current minimum version of LLVM for building the kernel is 15.0.0.
However, there are two deficiencies compared to GCC that were fixed in
LLVM 17 that are starting to become more noticeable.
The first was a bug in LLVM's scope checker [1], where all labels in a
function were validated as potential targets of an asm goto statement,
even if they were not listed in the asm goto statement as targets. This
becomes particularly problematic when the cleanup attribute is used, as
    asm goto(... : label_a);
    ...
  label_a:
    ...
    int var __free(foo);
    asm goto(... : label_b);
    ...
  label_b:
    ...
will trigger an error since the scope checker will complain that the
cleanup variable would be skipped when jumping from the first asm goto
to label_b (which obviously cannot happen). This issue was the catalyst
for commit e2ffa15b9baa ("kbuild: Disable CC_HAS_ASM_GOTO_OUTPUT on
clang < 17"). Unfortunately, this issue is reproducible with regular asm
goto in addition to asm goto with outputs, so that change was not
entirely sufficient to avoid the issue altogether. As asm goto has
effectively been required since commit a0a12c3ed057 ("asm goto:
eradicate CC_HAS_ASM_GOTO") and the usage of the cleanup attribute
continues to grow across the tree, raising the minimum to a version that
avoids this issue altogether is a better long term solution than
attempting to work around it at every spot where it happens.
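A minimal user-space sketch of the pattern described above, assuming a
GNU C compiler. The helper free_int() and the function two_asm_gotos()
are hypothetical names, and __attribute__((cleanup(...))) is spelled out
in place of the kernel's __free() macro. Each asm goto lists exactly one
label, so no jump can skip the cleanup variable. GCC and LLVM 17 or
newer are expected to accept this, while the scope checker bug described
above would make older clang reject it.

```c
#include <stdlib.h>

static int cleanups_run; /* counts calls to the hypothetical cleanup helper */

/* Hypothetical cleanup helper, standing in for whatever __free(foo) names. */
static void free_int(int **p)
{
	free(*p);
	cleanups_run++;
}

/* Returns how many times the cleanup helper ran during this call. */
int two_asm_gotos(void)
{
	int before = cleanups_run;

	{
		/* The first asm goto lists only label_a as a target. */
		__asm__ goto("" : : : : label_a);
label_a:
		;
		/* Scoped variable: free_int() runs when this block ends. */
		int *var __attribute__((cleanup(free_int))) = malloc(sizeof(*var));

		if (var)
			*var = 1;
		/* The second asm goto lists only label_b as a target. */
		__asm__ goto("" : : : : label_b);
label_b:
		;
	}
	return cleanups_run - before;
}
```

The inner block exists only so that the cleanup runs before the return
value is computed; the asm templates are empty, so the sketch does not
depend on any particular architecture.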
The second issue is an incompatibility with GCC 8.1+ around variables
marked with const being valid constant expressions for _Static_assert
and other macros [2]. With GCC 8.1 being the minimum supported version
since commit 118c40b7b503 ("kbuild: require gcc-8 and binutils-2.30"),
this incompatibility becomes more of a maintenance burden since only
clang-15 and clang-16 are affected by it.
Looking at the clang versions shipped by various major distributions
through Docker images, no one should be left behind as a result of this
bump, as the releases with a clang older than 17.0.1 already cannot
clear the current minimum of 15.0.0.
  archlinux:latest              clang version 22.1.3
  debian:oldoldstable-slim      Debian clang version 11.0.1-2
  debian:oldstable-slim         Debian clang version 14.0.6
  debian:stable-slim            Debian clang version 19.1.7 (3+b1)
  debian:testing-slim           Debian clang version 21.1.8 (3+b1)
  debian:unstable-slim          Debian clang version 21.1.8 (7+b1)
  fedora:42                     clang version 20.1.8 (Fedora 20.1.8-4.fc42)
  fedora:latest                 clang version 21.1.8 (Fedora 21.1.8-4.fc43)
  fedora:44                     clang version 22.1.1 (Fedora 22.1.1-2.fc44)
  fedora:rawhide                clang version 22.1.3 (Fedora 22.1.3-1.fc45)
  opensuse/leap:latest          clang version 17.0.6
  opensuse/tumbleweed:latest    clang version 21.1.8
  ubuntu:jammy                  Ubuntu clang version 14.0.0-1ubuntu1.1
  ubuntu:noble                  Ubuntu clang version 18.1.3 (1ubuntu1)
  ubuntu:questing               Ubuntu clang version 20.1.8 (0ubuntu4)
  ubuntu:resolute               Ubuntu clang version 21.1.8 (6ubuntu1)
17.0.1 is chosen as the minimum instead of 17.0.0 to ensure that the
particular version of LLVM 17 has the two aforementioned bugs fixed, as
the second was fixed during the 17.0.0 release candidate phase and it
was not until LLVM 18 that LLVM adopted the scheme of x.0.0 being a
prerelease version and x.1.0 being the release version [3] to help with
scenarios such as this.
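The claim that the bump leaves no distribution behind can be checked
mechanically against the table above. The sketch below assumes versions
are encoded as major * 10000 + minor * 100 + patch; the VERSION() macro,
the struct and count_supported() are hypothetical names used only for
illustration. The version numbers are copied from the table.

```c
#include <stddef.h>

/* Assumed encoding: 17.0.1 becomes 170001. */
#define VERSION(major, minor, patch) ((major) * 10000 + (minor) * 100 + (patch))

struct distro_clang {
	const char *image;
	int version;
};

/* Data copied from the table of Docker images in the commit message. */
static const struct distro_clang distros[] = {
	{ "archlinux:latest",           VERSION(22, 1, 3) },
	{ "debian:oldoldstable-slim",   VERSION(11, 0, 1) },
	{ "debian:oldstable-slim",      VERSION(14, 0, 6) },
	{ "debian:stable-slim",         VERSION(19, 1, 7) },
	{ "debian:testing-slim",        VERSION(21, 1, 8) },
	{ "debian:unstable-slim",       VERSION(21, 1, 8) },
	{ "fedora:42",                  VERSION(20, 1, 8) },
	{ "fedora:latest",              VERSION(21, 1, 8) },
	{ "fedora:44",                  VERSION(22, 1, 1) },
	{ "fedora:rawhide",             VERSION(22, 1, 3) },
	{ "opensuse/leap:latest",       VERSION(17, 0, 6) },
	{ "opensuse/tumbleweed:latest", VERSION(21, 1, 8) },
	{ "ubuntu:jammy",               VERSION(14, 0, 0) },
	{ "ubuntu:noble",               VERSION(18, 1, 3) },
	{ "ubuntu:questing",            VERSION(20, 1, 8) },
	{ "ubuntu:resolute",            VERSION(21, 1, 8) },
};

/* Count the images whose clang is at least the given minimum version. */
int count_supported(int minimum)
{
	int count = 0;
	size_t i;

	for (i = 0; i < sizeof(distros) / sizeof(distros[0]); i++)
		if (distros[i].version >= minimum)
			count++;
	return count;
}
```

Calling count_supported() with VERSION(15, 0, 0) and with
VERSION(17, 0, 1) gives the same count, which is the point of the table:
every image that clears the old minimum also clears the new one.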
Steve French [Fri, 22 May 2026 23:28:49 +0000 (18:28 -0500)]
smb: client: fix uninitialized variable in smb2_writev_callback
Compiling with W=2 pointed out that "written may be used uninitialized".
Fixes: 20d72b00ca81 ("netfs: Fix the request's work item to not require a ref")
Cc: stable@vger.kernel.org
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Jeremy Erazo [Wed, 20 May 2026 18:23:31 +0000 (18:23 +0000)]
smb: client: detect short folioq copy in cifs_copy_folioq_to_iter()
cifs_copy_folioq_to_iter() copies a requested number of bytes from
a folio queue into the destination iterator. Since the encrypted
SMB2 READ path was changed to pass the server-declared payload
length (data_len) instead of the larger folioq buffer length, the
caller can ask for fewer bytes than the folio queue holds.
In that case the helper continues walking the remaining folios after
data_size has reached zero and calls copy_folio_to_iter() with
len = 0, which is unnecessary work.
The helper also returns 0 (success) when the folio queue is
exhausted before data_size bytes have been copied. The caller has
no way to distinguish that from a full copy and the reported
transfer count ends up larger than the amount of data placed in the
iterator.
Add an early exit when data_size reaches zero, and return an error
when the folio queue is exhausted before all requested bytes have
been copied.
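The two behavioural changes can be illustrated with a user-space
sketch. This is not the kernel helper: struct chunk and copy_chunks()
are hypothetical stand-ins for the folio queue and
cifs_copy_folioq_to_iter(), memcpy() replaces copy_folio_to_iter(), and
-EIO is an assumed error code, since the message does not name the one
actually returned.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for one folio in the queue. */
struct chunk {
	const char *data;
	size_t len;
};

/*
 * Copy exactly data_size bytes from the chunks into dst.
 * Returns 0 on a full copy and -EIO (assumed) on a short one.
 * *visited reports how many chunks were actually copied from.
 */
int copy_chunks(const struct chunk *chunks, size_t nr, size_t data_size,
		char *dst, size_t *visited)
{
	size_t i;

	*visited = 0;
	for (i = 0; i < nr; i++) {
		size_t len;

		if (data_size == 0)
			break;	/* early exit: nothing left to copy */
		len = chunks[i].len < data_size ? chunks[i].len : data_size;
		memcpy(dst, chunks[i].data, len);
		dst += len;
		data_size -= len;
		(*visited)++;
	}
	/* Queue exhausted before the request was satisfied: report it. */
	return data_size ? -EIO : 0;
}
```

With three 4-byte chunks, a 6-byte request touches only the first two
chunks and succeeds, while a 13-byte request runs out of data and
returns the error instead of claiming success.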
Signed-off-by: Jeremy Erazo <mendozayt13@gmail.com>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Fix this while meeting these requirements:
* It must be possible to include the MSHV root driver without the
VMBus driver. In such case, the MSHV root driver can be built-in
to the kernel image, or it can be built as a separate module.
* If both the MSHV root driver and the VMBus driver are present, the
MSHV root driver and VMBus driver can both be built-in, or they can
both be separate modules. Alternatively, the MSHV root driver can be a
module while the VMBus driver is built-in, but the reverse is
disallowed. Regardless of the build choices, the VMBus driver must
be loaded before the MSHV driver in order for the SynIC to be
managed properly (see comments in the MSHV SynIC code).
The fix has two parts:
* Add a Kconfig entry for MSHV_ROOT to depend on HYPERV_VMBUS if
HYPERV_VMBUS is present. The entry disallows MSHV_ROOT being
built-in when HYPERV_VMBUS is a module, but without requiring that
HYPERV_VMBUS be built.
* Add a stub implementation of hv_vmbus_exists() for when the
VMBus driver is not present so that the MSHV root driver has
no module dependency on VMBus. When the VMBus driver *is*
present, the module dependency ensures that the VMBus driver
loads first when both are built as modules.
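The stub approach in the second part can be sketched as follows. This
is an illustration under assumptions, not the actual patch: the macro
CONFIG_HYPERV_VMBUS_PRESENT is a hypothetical stand-in for the real
Kconfig test, the bool (void) signature and the false return value of
the stub are assumed, and mshv_root_sees_vmbus() is a made-up caller.

```c
#include <stdbool.h>

/*
 * CONFIG_HYPERV_VMBUS_PRESENT is hypothetical; define it to model a
 * build that includes the VMBus driver.
 */
#ifdef CONFIG_HYPERV_VMBUS_PRESENT
/* Real symbol: referencing it creates the module dependency on VMBus. */
bool hv_vmbus_exists(void);
#else
/* Stub (assumed to report "no VMBus"): no external symbol is referenced. */
static inline bool hv_vmbus_exists(void)
{
	return false;
}
#endif

/* Hypothetical caller standing in for code in the MSHV root driver. */
bool mshv_root_sees_vmbus(void)
{
	return hv_vmbus_exists();
}
```

Because the stub is a static inline function, the caller compiles and
links without any VMBus object being present, which is what removes the
module dependency in the configuration without VMBus.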
Existing code ensures that the VMBus driver loads first if it is
built-in. The VMBus driver uses subsys_initcall(), which is
initcall level 4. The MSHV root driver uses module_init(), which
becomes device_initcall() when built-in, and device_initcall() is
initcall level 6.
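The built-in ordering argument can be modelled in user space. The
sketch below is a simplified simulation and not kernel code: the table
of calls, the names "vmbus" and "mshv_root", and run_initcalls() are
hypothetical, and only the level numbers 4 and 6 come from the text
above. It assumes that lower-numbered levels run to completion before
higher-numbered ones.

```c
#include <stddef.h>

enum {
	LEVEL_SUBSYS = 4,	/* subsys_initcall() */
	LEVEL_DEVICE = 6,	/* module_init() when built-in */
	MAX_LEVEL = 7,
};

struct initcall {
	int level;
	const char *name;
};

/* Registration order deliberately lists the level 6 call first. */
static const struct initcall calls[] = {
	{ LEVEL_DEVICE, "mshv_root" },
	{ LEVEL_SUBSYS, "vmbus" },
};

/* Fill order[] with names in the order they would run; return the count. */
size_t run_initcalls(const char *order[], size_t cap)
{
	size_t n = 0;
	size_t i;
	int level;

	for (level = 0; level <= MAX_LEVEL; level++)
		for (i = 0; i < sizeof(calls) / sizeof(calls[0]); i++)
			if (calls[i].level == level && n < cap)
				order[n++] = calls[i].name;
	return n;
}
```

Even though "mshv_root" is listed first in the table, the level-by-level
walk places "vmbus" ahead of it, mirroring why no extra ordering code is
needed when both drivers are built-in.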