Marc Zyngier [Wed, 1 Apr 2026 17:00:17 +0000 (18:00 +0100)]
KVM: arm64: Advertise ID_AA64PFR2_EL1.GCIE
As we are missing ID_AA64PFR2_EL1.GCIE from the kernel feature set,
userspace cannot write ID_AA64PFR2_EL1 with GCIE set, even if we are
on a GICv5 host.
Paul Walmsley [Sun, 5 Apr 2026 00:40:58 +0000 (18:40 -0600)]
prctl: cfi: change the branch landing pad prctl()s to be more descriptive
Per Linus' comments requesting the replacement of "INDIR_BR_LP" in the
indirect branch tracking prctl()s with something more readable, and
suggesting the use of the speculation control prctl()s as an exemplar,
reimplement the prctl()s and related constants that control per-task
forward-edge control flow integrity.
This primarily involves two changes. First, the prctls are
restructured to resemble the style of the speculative execution
workaround control prctls PR_{GET,SET}_SPECULATION_CTRL, to make them
easier to extend in the future. Second, the "indir_br_lp" abbreviation
is expanded to "branch_landing_pads" to be less telegraphic. The
kselftest and documentation are adjusted accordingly.
Paul Walmsley [Sun, 5 Apr 2026 00:40:58 +0000 (18:40 -0600)]
riscv: ptrace: cfi: expand "SS" references to "shadow stack" in uapi headers
Similar to the recent change to expand "LP" to "branch landing pad",
let's expand "SS" in the ptrace uapi macros to "shadow stack" as well.
This aligns with the existing prctl() arguments, which use the
expanded "shadow stack" names, rather than just the abbreviation.
Paul Walmsley [Sun, 5 Apr 2026 00:40:58 +0000 (18:40 -0600)]
prctl: rename branch landing pad implementation functions to be more explicit
Per Linus' comments about the unreadability of abbreviations such as
"indir_br_lp", rename the three prctl() implementation functions to be more
explicit. This involves renaming "indir_br_lp_status" in the function
names to "branch_landing_pad_state".
While here, add _prctl_ into the function names, following the
speculation control prctl implementation functions.
Paul Walmsley [Sun, 5 Apr 2026 00:40:58 +0000 (18:40 -0600)]
riscv: ptrace: expand "LP" references to "branch landing pads" in uapi headers
Per Linus' comments about the unreadability of abbreviations such as
"LP", rename the RISC-V ptrace landing pad CFI macro names to be more
explicit. This primarily involves expanding "LP" in the names to some
variant of "branch landing pad."
Zong Li [Sun, 5 Apr 2026 00:40:58 +0000 (18:40 -0600)]
riscv: cfi: clear CFI lock status in start_thread()
When libc locks the CFI status through the following prctl:
- PR_LOCK_SHADOW_STACK_STATUS
- PR_LOCK_INDIR_BR_LP_STATUS
A newly execd address space will inherit the lock status
if it does not clear the lock bits. Since the lock bits
remain set, libc will later fail to enable the landing
pad and shadow stack.
Paul Walmsley [Sun, 5 Apr 2026 00:40:57 +0000 (18:40 -0600)]
riscv: ptrace: cfi: fix "PRACE" typo in uapi header
A CFI-related macro defined in arch/riscv/uapi/asm/ptrace.h misspells
"PTRACE" as "PRACE"; fix this.
Fixes: 2af7c9cf021c ("riscv/ptrace: expose riscv CFI status and state via ptrace and in core files")
Cc: Deepak Gupta <debug@rivosinc.com>
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Sunil V L [Tue, 3 Mar 2026 06:16:05 +0000 (11:46 +0530)]
ACPI: RIMT: Add dependency between iommu and devices
EPROBE_DEFER ensures IOMMU devices are probed before the devices that
depend on them. During shutdown, however, the IOMMU may be removed
first, leading to issues. To avoid this, a device link is added
which enforces the correct removal order.
Charlie Jenkins [Tue, 10 Mar 2026 01:52:11 +0000 (18:52 -0700)]
selftests: riscv: Add braces around EXPECT_EQ()
EXPECT_EQ() expands to multiple lines, breaking up one-line if
statements. This issue was not present in the patch on the mailing list
but was instead introduced by the maintainer when attempting to fix up
checkpatch warnings. Add braces around EXPECT_EQ() to avoid the error
even though checkpatch suggests removing them:
validate_v_ptrace.c:626:17: error: ‘else’ without a previous ‘if’
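The failure mode above can be reproduced in miniature. The following is a minimal sketch, assuming a hypothetical CHECK_EQ() macro that, like the EXPECT_EQ() described above, expands to more than one statement; it is not the real kselftest harness macro.

```c
/* Counts mismatches seen by the hypothetical CHECK_EQ() macro. */
static int failures;

/*
 * Hypothetical stand-in for a harness macro that expands to TWO
 * statements. The trailing "(void)0" plus the caller's semicolon forms
 * the second statement.
 */
#define CHECK_EQ(expected, seen) \
	do { if ((expected) != (seen)) failures++; } while (0); (void)0

/* Returns the number of mismatches recorded for the given value. */
int classify(int value)
{
	failures = 0;
	/*
	 * Without braces, "if (value > 0) CHECK_EQ(1, value); else ..."
	 * would leave a second statement between the if and the else, and
	 * the compiler would report an 'else' without a previous 'if'.
	 */
	if (value > 0) {
		CHECK_EQ(1, value);
	} else {
		CHECK_EQ(0, value);
	}
	return failures;
}
```

With the braces in place, both statements of the expansion stay inside the branch they belong to, whatever checkpatch says about single-statement blocks.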
Paul Walmsley [Thu, 2 Apr 2026 23:18:03 +0000 (17:18 -0600)]
riscv: use _BITUL macro rather than BIT() in ptrace uapi and kselftests
Fix the build of non-kernel code that includes the RISC-V ptrace uapi
header, and the RISC-V validate_v_ptrace.c kselftest, by using the
_BITUL() macro rather than BIT(). BIT() is not available outside
the kernel.
Based on patches and comments from Charlie Jenkins, Michael Neuling,
and Andreas Schwab.
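As an illustration of the pattern, here is a minimal sketch of bit definitions that do not depend on the kernel-only BIT() macro. The EXAMPLE_* names are local stand-ins introduced for this sketch; the real _UL() and _BITUL() helpers come from the uapi header <linux/const.h>, and the flag names below are hypothetical, not taken from the RISC-V ptrace header.

```c
/* Local stand-ins that mirror the shape of the uapi _UL()/_BITUL() helpers. */
#define EXAMPLE_UL(x)     (x##UL)
#define EXAMPLE_BITUL(n)  (EXAMPLE_UL(1) << (n))

/* Hypothetical flag names, for illustration only. */
#define EXAMPLE_FLAG_A    EXAMPLE_BITUL(0)
#define EXAMPLE_FLAG_B    EXAMPLE_BITUL(3)

/* Combines both flags; usable from any userspace translation unit. */
unsigned long example_flags(void)
{
	return EXAMPLE_FLAG_A | EXAMPLE_FLAG_B;
}
```

Because nothing here relies on kernel-internal headers, code of this shape builds both inside and outside the kernel tree.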
Zishun Yi [Sun, 22 Mar 2026 16:00:22 +0000 (00:00 +0800)]
riscv: Reset pmm when PR_TAGGED_ADDR_ENABLE is not set
In set_tagged_addr_ctrl(), when PR_TAGGED_ADDR_ENABLE is not set, pmlen
is correctly set to 0, but it forgets to reset pmm. This results in the
CPU pmm state not corresponding to the software pmlen state.
Fix this by resetting pmm along with pmlen.
Fixes: 2e1743085887 ("riscv: Add support for the tagged address ABI")
Signed-off-by: Zishun Yi <vulab@iscas.ac.cn>
Reviewed-by: Samuel Holland <samuel.holland@sifive.com>
Link: https://patch.msgid.link/20260322160022.21908-1-vulab@iscas.ac.cn
Signed-off-by: Paul Walmsley <pjw@kernel.org>
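The pmlen/pmm mismatch described in this commit can be modelled in a few lines. This is a rough sketch only: the struct, the enum encodings, and the function name are hypothetical and do not reflect the actual set_tagged_addr_ctrl() code; only the idea of resetting both values together comes from the commit message.

```c
/* Stand-in for the PR_TAGGED_ADDR_ENABLE bit of the prctl() argument. */
#define EX_TAGGED_ADDR_ENABLE (1UL << 0)

/* Hypothetical encodings of the hardware pointer-masking mode. */
enum ex_pmm { EX_PMM_OFF = 0, EX_PMM_7 = 2, EX_PMM_16 = 3 };

/* Software view (pmlen) and hardware view (pmm) of the masking state. */
struct ex_mask_state {
	unsigned int pmlen;
	enum ex_pmm pmm;
};

struct ex_mask_state ex_compute(unsigned long arg, unsigned int requested_pmlen)
{
	struct ex_mask_state st;

	/* Derive both views from the requested tag width. */
	st.pmlen = requested_pmlen;
	st.pmm = requested_pmlen == 16 ? EX_PMM_16 :
		 requested_pmlen == 7 ? EX_PMM_7 : EX_PMM_OFF;

	if (!(arg & EX_TAGGED_ADDR_ENABLE)) {
		st.pmlen = 0;
		st.pmm = EX_PMM_OFF;	/* the reset that was previously missing */
	}
	return st;
}
```

Keeping the two assignments next to each other makes it harder for the software and hardware state to drift apart again.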
Jisheng Zhang [Sat, 21 Feb 2026 02:37:31 +0000 (10:37 +0800)]
riscv: make runtime const not usable by modules
Similar to what commit 284922f4c563 ("x86: uaccess: don't use runtime-const
rewriting in modules") does, make riscv's runtime const not usable by
modules too, to "make sure this doesn't get forgotten the next time
somebody wants to do runtime constant optimizations". The reason is
well explained in the above commit: "The runtime-const infrastructure
was never designed to handle the modular case, because the constant
fixup is only done at boot time for core kernel code."
Vivian Wang [Mon, 23 Mar 2026 23:43:47 +0000 (17:43 -0600)]
riscv: patch: Avoid early phys_to_page()
Similarly to commit 8d09e2d569f6 ("arm64: patching: avoid early
page_to_phys()"), avoid using phys_to_page() for the kernel address case
in patch_map().
Since this is called from apply_boot_alternatives() in setup_arch(), and
commit 4267739cabb8 ("arch, mm: consolidate initialization of SPARSE
memory model") has moved sparse_init() to after setup_arch(),
phys_to_page() is not available there yet, and it panics on boot with
SPARSEMEM on RV32, which does not use SPARSEMEM_VMEMMAP.
Reported-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Closes: https://lore.kernel.org/r/20260223144108-dcace0b9-02e8-4b67-a7ce-f263bed36f26@linutronix.de/
Fixes: 4267739cabb8 ("arch, mm: consolidate initialization of SPARSE memory model")
Suggested-by: Mike Rapoport <rppt@kernel.org>
Signed-off-by: Vivian Wang <wangruikang@iscas.ac.cn>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Tested-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Link: https://patch.msgid.link/20260310-riscv-sparsemem-alternatives-fix-v1-1-659d5dd257e2@iscas.ac.cn
[pjw@kernel.org: fix the subject line to align with the patch description]
Signed-off-by: Paul Walmsley <pjw@kernel.org>
Paul Walmsley [Mon, 23 Mar 2026 23:43:47 +0000 (17:43 -0600)]
riscv: kgdb: fix several debug register assignment bugs
Fix several bugs in the RISC-V kgdb implementation:
- The element of dbg_reg_def[] that is supposed to pertain to the S1
register embeds instead the struct pt_regs offset of the A1
register. Fix this to use the S1 register offset in struct pt_regs.
- The sleeping_thread_to_gdb_regs() function copies the value of the
S10 register into the gdb_regs[] array element meant for the S9
register, and copies the value of the S11 register into the array
element meant for the S10 register. It also neglects to copy the
value of the S11 register. Fix all of these issues.
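The first bug is a mismatch between a table entry's name and the structure offset stored next to it. A minimal sketch of a table that cannot drift this way is shown below, assuming a hypothetical struct ex_regs and EX_REG() macro that derive both the name and the offset from the same token; this is not the kernel's dbg_reg_def[] definition.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of a saved-register structure. */
struct ex_regs {
	unsigned long a1, s1, s9, s10, s11;
};

struct ex_reg_def {
	const char *name;
	size_t offset;
};

/* Name and offset come from one token, so they cannot disagree. */
#define EX_REG(r) { #r, offsetof(struct ex_regs, r) }

static const struct ex_reg_def ex_reg_defs[] = {
	EX_REG(a1), EX_REG(s1), EX_REG(s9), EX_REG(s10), EX_REG(s11),
};

/* Returns the offset of the named register, or -1 if it is unknown. */
long ex_reg_offset(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(ex_reg_defs) / sizeof(ex_reg_defs[0]); i++) {
		if (strcmp(ex_reg_defs[i].name, name) == 0)
			return (long)ex_reg_defs[i].offset;
	}
	return -1;
}
```

A hand-written table pairing "s1" with the offset of a1 is exactly the kind of slip that a single-token macro rules out.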
Michael Kelley [Thu, 2 Apr 2026 20:24:00 +0000 (13:24 -0700)]
Drivers: hv: Move add_interrupt_randomness() to hypervisor callback sysvec
The Hyper-V ISRs, for normal guests and when running in the hypervisor root
partition, are calling add_interrupt_randomness() as a primary source of
entropy. The call is currently in the ISRs as a common place to handle both
x86/x64 and arm64.
On x86/x64, hypervisor interrupts come through a custom sysvec entry, and
do not go through a generic interrupt handler.
On arm64, hypervisor interrupts come through an emulated GICv3. GICv3 uses
the generic handler handle_percpu_devid_irq(), which does not do
add_interrupt_randomness() -- unlike its counterpart
handle_percpu_irq().
But handle_percpu_devid_irq() has now been updated to call
add_interrupt_randomness(). So add_interrupt_randomness() is now needed
only in Hyper-V's x86/x64 custom sysvec path.
Move add_interrupt_randomness() from the Hyper-V ISRs into the Hyper-V
x86/x64 custom sysvec path, matching the existing STIMER0 sysvec path.
With this change, add_interrupt_randomness() is no longer called from any
device drivers, which is appropriate.
Merge tag 'devfreq-next-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux
Pull devfreq updates for v7.1 from Chanwoo Choi:
"- Remove unneeded casting for HZ_PER_KHZ on devfreq.c
- Use _visible attribute to replace create/remove_sysfs_files() to fix
sysfs attribute race conditions on devfreq.c
- Add support for Tegra114 activity monitor device on tegra30-devfreq.c"
* tag 'devfreq-next-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/chanwoo/linux:
PM / devfreq: tegra30-devfreq: add support for Tegra114
PM / devfreq: use _visible attribute to replace create/remove_sysfs_files()
PM / devfreq: Remove unneeded casting for HZ_PER_KHZ
Merge tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux
Pull amd-pstate new content for 7.1 (2026-04-02) from Mario Limonciello:
"Add support for new features:
* CPPC performance priority
* Dynamic EPP
* Raw EPP
* New unit tests for new features
Fixes for:
* PREEMPT_RT
* sysfs files being present when HW missing
* Broken/outdated documentation"
* tag 'amd-pstate-v7.1-2026-04-02' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/superm1/linux: (22 commits)
MAINTAINERS: amd-pstate: Step down as maintainer, add Prateek as reviewer
cpufreq: Pass the policy to cpufreq_driver->adjust_perf()
cpufreq/amd-pstate: Pass the policy to amd_pstate_update()
cpufreq/amd-pstate-ut: Add a unit test for raw EPP
cpufreq/amd-pstate: Add support for raw EPP writes
cpufreq/amd-pstate: Add support for platform profile class
cpufreq/amd-pstate: add kernel command line to override dynamic epp
cpufreq/amd-pstate: Add dynamic energy performance preference
Documentation: amd-pstate: fix dead links in the reference section
cpufreq/amd-pstate: Cache the max frequency in cpudata
Documentation/amd-pstate: Add documentation for amd_pstate_floor_{freq,count}
Documentation/amd-pstate: List amd_pstate_prefcore_ranking sysfs file
Documentation/amd-pstate: List amd_pstate_hw_prefcore sysfs file
amd-pstate-ut: Add a testcase to validate the visibility of driver attributes
amd-pstate-ut: Add module parameter to select testcases
amd-pstate: Introduce a tracepoint trace_amd_pstate_cppc_req2()
amd-pstate: Add sysfs support for floor_freq and floor_count
amd-pstate: Add support for CPPC_REQ2 and FLOOR_PERF
x86/cpufeatures: Add AMD CPPC Performance Priority feature.
amd-pstate: Make certain freq_attrs conditionally visible
...
Huisong Li [Fri, 3 Apr 2026 09:02:53 +0000 (17:02 +0800)]
ACPI: processor: idle: Fix NULL pointer dereference in hotplug path
A cpuidle_device might fail to register during boot, but the system can
continue to run. In such cases, acpi_processor_hotplug() can trigger
a NULL pointer dereference when accessing the per-cpu acpi_cpuidle_device.
So add a NULL pointer check for the per-cpu acpi_cpuidle_device in
acpi_processor_hotplug().
Danilo Krummrich [Tue, 24 Mar 2026 00:59:06 +0000 (01:59 +0100)]
bus: fsl-mc: use generic driver_override infrastructure
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Huisong Li [Fri, 3 Apr 2026 08:53:43 +0000 (16:53 +0800)]
ACPI: processor: idle: Reset power_setup_done flag on initialization failure
The 'power_setup_done' flag is a key indicator used across the ACPI
processor driver to determine if cpuidle is properly configured and
available for a given CPU.
Currently, this flag is set during the early stages of initialization.
However, if the subsequent registration of the cpuidle driver in
acpi_processor_register_idle_driver() or the per-CPU device registration
in acpi_processor_power_init() fails, this flag remains set. This may
lead to issues where other functions in the ACPI idle driver rely on
this flag.
Fix this by explicitly resetting this flag to 0 in these error paths.
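The error-path reset can be pictured with a small model. This is only a sketch under stated assumptions: struct ex_processor and both functions are hypothetical, and register_result stands in for the return value of whichever registration step fails; only the flag name comes from the commit message.

```c
/* Hypothetical model of a processor object; only the flag matters here. */
struct ex_processor {
	int power_setup_done;
};

/* Sets the flag early, then clears it again if registration failed. */
int ex_power_init(struct ex_processor *pr, int register_result)
{
	pr->power_setup_done = 1;

	if (register_result != 0) {
		pr->power_setup_done = 0;	/* do not leave a stale flag */
		return register_result;
	}
	return 0;
}

/* Convenience wrapper: reports the flag value after an init attempt. */
int ex_flag_after_init(int register_result)
{
	struct ex_processor pr = { 0 };

	ex_power_init(&pr, register_result);
	return pr.power_setup_done;
}
```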
ACPI: TAD: Add alarm support to the RTC class device interface
Add alarm support, based on Section 9.17 of ACPI 6.6 [1], to the RTC
class device interface of the driver.
The ACPI time and alarm device (TAD) can support two separate alarm
timers, one for waking up the system when it is on AC power, and one
for waking it up when it is on DC power. In principle, each of them
can be set to a different value representing the number of seconds
till the given alarm timer expires.
However, the RTC class device can only set one alarm, so it will set
both the alarm timers of the ACPI TAD (if the DC one is supported) to
the same value. That is somewhat cumbersome because there is no way in
the ACPI TAD firmware interface to set both timers in one go, so they
need to be set sequentially, but that's how it goes.
On the alarm read side, the driver assumes that both timers have been
set to the same value, so it is sufficient to access one of them (the
AC one specifically).
Move the code converting a struct acpi_tad_rt into a struct rtc_time
from acpi_tad_rtc_read_time() into a new function, acpi_tad_rt_to_tm(),
to facilitate adding alarm support to the driver's RTC class device
interface going forward.
No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
[ rjw: Subject and changelog edits ]
Link: https://patch.msgid.link/9619488.CDJkKcVGEf@rafael.j.wysocki
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Move two functions introduced previously, __acpi_tad_wake_set() and
__acpi_tad_wake_read(), to the part of the code preceding the sysfs
interface implementation, since subsequently they will be used by
the RTC device interface too.
ACPI: TAD: Split three functions to untangle runtime PM handling
Move the core functionality of acpi_tad_get_real_time(),
acpi_tad_wake_set(), and acpi_tad_wake_read() into separate functions
called __acpi_tad_get_real_time(), __acpi_tad_wake_set(), and
__acpi_tad_wake_read(), respectively, which can be called from
code blocks following a single runtime resume of the device.
This will facilitate adding alarm support to the RTC class device
interface of the driver going forward.
ACPI: processor: Rearrange and clean up acpi_processor_errata_piix4()
In acpi_processor_errata_piix4() it is not necessary to use three
struct pci_dev pointers. One is sufficient, so use it everywhere and
drop the other two.
Additionally, define the auxiliary local variables value1 and value2
in the code block in which they are used.
PCI: cadence: Use cdns_pcie_read_sz() for byte or word read access
The commit 18ac51ae9df9 ("PCI: cadence: Implement capability search
using PCI core APIs") assumed all the platforms using Cadence PCIe
controller support byte and word register accesses. This is not true
for all platforms (e.g., TI J721E SoC, which only supports dword
register accesses).
This causes capability searches via cdns_pcie_find_capability() to fail
on such platforms.
Fix this by using cdns_pcie_read_sz() for config read functions, which
properly handles size-aligned accesses. Remove the now-unused byte and
word read wrapper functions (cdns_pcie_readw and cdns_pcie_readb).
ACPI: TAD: Use DC wakeup only if AC wakeup is supported
According to Section 9.17.2 of ACPI 6.6 [1], setting ACPI_TAD_DC_WAKE in
the capabilities without setting ACPI_TAD_AC_WAKE is invalid, so don't
support wakeup if that's the case.
Moreover, it is sufficient to check ACPI_TAD_AC_WAKE alone to determine
if wakeup is supported at all, so use this observation to simplify one
check.
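The simplified check can be sketched as follows. The EX_TAD_* bit positions are assumptions made for this illustration and the helper names are hypothetical; only the rule that DC wakeup is meaningful solely alongside AC wakeup comes from the commit message.

```c
#include <stdbool.h>

/* Assumed capability bit positions, for illustration only. */
#define EX_TAD_AC_WAKE (1U << 0)
#define EX_TAD_DC_WAKE (1U << 1)

/* Wakeup is supported at all only if the AC wake capability is set. */
bool ex_wake_supported(unsigned int caps)
{
	return (caps & EX_TAD_AC_WAKE) != 0;
}

/* The DC timer is used only when AC wakeup is supported as well. */
bool ex_dc_wake_usable(unsigned int caps)
{
	return ex_wake_supported(caps) && (caps & EX_TAD_DC_WAKE) != 0;
}
```

A capabilities value with only the DC bit set is treated as providing no wakeup support, which matches the invalid combination described above.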
Instead of creating and removing the device sysfs attributes directly
during probe and remove of the driver, respectively, use dev_groups in
struct device_driver to point to the attribute definitions and let the
core take care of creating and removing them.
Move RT data validation checks from acpi_tad_set_real_time() to
a separate function called acpi_tad_rt_is_invalid() and use it
also in acpi_tad_get_real_time() to validate data coming from
the platform firmware.
Also make acpi_tad_set_real_time() return -EINVAL when the RT data
passed to it is invalid (instead of -ERANGE which is somewhat
confusing) and introduce ACPI_TAD_TZ_UNSPEC to represent the
"unspecified timezone" value.
ACPI: TAD: Use __free() for cleanup in time_store()
Use __free() for the automatic freeing of memory pointed to by local
variable str in time_store(), which makes the code somewhat easier to
follow.
Instead of creating three attribute groups, one for each supported
subset of capabilities, create just one and use an .is_visible()
callback in it to decide which attributes to use.
rtc: cmos: Do not require IRQ if ACPI alarm is used
If the ACPI RTC fixed event is used, a dedicated IRQ is not required
for the CMOS RTC alarm to work, so allow the driver to use the alarm
without a valid IRQ in that case.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Link: https://patch.msgid.link/6168746.MhkbZ0Pkbq@rafael.j.wysocki
rtc: cmos: Enable ACPI alarm if advertised in ACPI FADT
If the ACPI_FADT_FIXED_RTC flag is unset, the platform is declaring that
it supports the ACPI RTC fixed event which should be used instead of a
dedicated CMOS RTC IRQ. However, the driver only enables it when
is_hpet_enabled() returns true, which is questionable because there is
no clear connection between enabled HPET and signaling wakeup via the
ACPI RTC fixed event (for instance, the latter can be expected to work
on systems that don't include a functional HPET).
Moreover, since use_hpet_alarm() returns false if use_acpi_alarm is set,
the ACPI RTC fixed event is effectively used instead of the HPET alarm
if the latter is functional, but there is no particular reason why it
could not be used otherwise.
Accordingly, on x86 systems with ACPI, set use_acpi_alarm if
ACPI_FADT_FIXED_RTC is unset without looking at whether or not HPET is
enabled.
Also, do the ACPI FADT check in use_acpi_alarm_quirks() before the DMI
BIOS year checks, which are more expensive and are better skipped
if ACPI_FADT_FIXED_RTC is set.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Mario Limonciello (AMD) <superm1@kernel.org>
Link: https://patch.msgid.link/9618535.CDJkKcVGEf@rafael.j.wysocki
Chen-Yu Tsai [Tue, 24 Mar 2026 09:35:41 +0000 (17:35 +0800)]
PCI: mediatek-gen3: Prevent leaking IRQ domains when IRQ not found
In mtk_pcie_setup_irq(), the IRQ domains are allocated before the
controller's IRQ is fetched. If the latter fails, the function
directly returns an error, without cleaning up the allocated domains.
Hence, reverse the order so that the IRQ domains are allocated after the
controller's IRQ is found.
This was flagged by Sashiko during a review of "[PATCH v6 0/7] PCI:
mediatek-gen3: add power control support".
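The reordering can be illustrated with a small userspace model. This is a sketch under stated assumptions: the ex_* helpers are hypothetical, malloc() stands in for IRQ domain allocation, and a negative irq argument models a failed IRQ lookup; none of this is the driver's actual code.

```c
#include <stdlib.h>

/* Single allocated "domain" plus a counter so leaks are observable. */
static void *ex_domain;
static int ex_live;

int ex_live_domains(void)
{
	return ex_live;
}

/* Releases the domain allocated by a successful ex_setup_irq(). */
void ex_teardown(void)
{
	if (ex_domain) {
		free(ex_domain);
		ex_domain = NULL;
		ex_live--;
	}
}

/* irq < 0 models a failed IRQ lookup. */
int ex_setup_irq(int irq)
{
	/* Step 1: the lookup that can fail runs before any allocation. */
	if (irq < 0)
		return irq;	/* nothing to clean up yet */

	/* Step 2: allocate only once the lookup has succeeded. */
	ex_domain = malloc(1);
	if (!ex_domain)
		return -1;
	ex_live++;
	return 0;
}
```

Doing the fallible lookup first means the early return has no resources to unwind, which is simpler than adding a cleanup path.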
Merge tag 'microchip-soc-7.1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/at91/linux into soc/arm
Microchip ARM64 SoC updates for v7.1
This update includes:
- use a top-level configuration flag for all Microchip platforms
* tag 'microchip-soc-7.1' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/at91/linux:
arm64: Kconfig: provide a top-level switch for Microchip platforms
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Merge tag 'input-for-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- new IDs for BETOP BTP-KP50B/C and Razer Wolverine V3 Pro added to
xpad controller driver
- another quirk for new TUXEDO InfinityBook added to i8042
- a small fixup for Synaptics RMI4 driver to properly unlock mutex when
encountering an error in F54
- an update to bcm5974 touch controller driver to reliably switch into
wellspring mode
* tag 'input-for-v7.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: xpad - add support for BETOP BTP-KP50B/C controller's wireless mode
Input: xpad - add support for Razer Wolverine V3 Pro
Input: synaptics-rmi4 - fix a locking bug in an error path
Input: i8042 - add TUXEDO InfinityBook Max 16 Gen10 AMD to i8042 quirk table
Input: bcm5974 - recover from failed mode switch
Merge tag 'imx-dt64-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/frank.li/linux into soc/dt
Krzysztof notes:
1. This might impact users of i.MX8MM SPDIF as compatible is being
replaced.
Frank Li writes:
i.MX arm64 device tree changes for 7.1:
- New Board Support
S32N79-RDB, Variscite DART-MX95, DART-MX91 with Sonata carrier boards,
Verdin iMX95 with multiple carrier boards (Yavia, Mallow, Ivy, Dahlia)
TQMa93xx/MBa93xxLA-MINI, SolidRun i.MX8MP HummingBoard IIoT,
SolidRun i.MX8MM SOM and EVB, SolidRun SolidSense-N8 board
Ka-Ro Electronics tx8m-1610 COM, GOcontroll Moduline IV and Moduline Mini,
NXP FRDM-IMX91S board, i.MX93 Wireless EVK board with Wireless SiP,
NXP i.MX8MP audio board v2.
- USB & Type-C Support
Type-C and USB nodes for imx943, correct power-role for
imx8qxp-mek/imx8qm-mek.
- Audio Enhancements
PDM microphone, bt-sco, and WM8962 sound card support for i.MX952. AONMIX
MQS for i.MX95. Use audio-graph-card2 for imx8dxl-evk. WM8904 audio codec
for imx8mm-var-som.
- Thermal & Cooling
PF09/53 thermal zone, fan node, active cooling on A55, SCMI
sensor/lmm/cpu for imx943/imx94.
- Display Support
Multiple LVDS and parallel display overlays for TQ boards (imx91/imx93).
Parallel display for i.MX93. ontat,kd50g21-40nt-a1 panel for
imx93-9x9-qsb. pixpaper display overlay for i.MX93 FRDM.
- Networking
Multiple queue configuration on eqos for TQMa8MPxL.
MaxLinear PHY support, MCP251xFD CAN controller for imx8mm-var-som.
SDIO WiFi support (imx91-evk, imx8mp-evk, imx943-evk)
- Bluetooth Support
imx943-evk, imx93-14x14-evk, imx95-19x19-evk, imx8mp-evk, imx8mn-evk,
imx8mm-evk.
- Miscellaneous
xspi and MT35XU01G SPI NOR flash for i.MX952.
V2X/ELE mailbox nodes, SCMI misc ctrl-ids for imx94.
eDMA channel reservation for V2X, Cortex M7 support for imx95.
Ethos-U65 NPU and SRAM nodes for imx93.
Wire up DMA IRQ for PCIe for imx8qm-ss-hsio.
- Bug Fixes & Improvements
Complete pinmux for rcwsr12 to fix I2C bus recovery affecting other
modules' pinmux on the Layerscape platform.
Multiple bug fixes for GPIO polarity, IRQ types, pinmux configurations.
GICv3 PPI interrupt CPU mask cleanup across multiple SoCs.
Fixed Ethernet PHY IRQ types on TQ boards.
Fixed UART RTS/CTS muxing issues.
Fixed SD card issues on Kontron boards.
Fixed touch reset configuration.
Removed fallback ethernet-phy-ieee802.3-c22 where appropriate.
Move funnel outside of soc.
TMU sensor ID cleanup.
Change usdhc tuning step for eMMC and SD.
Hexadecimal format, readability improvements, duplicate removal.
* tag 'imx-dt64-7.1' of https://git.kernel.org/pub/scm/linux/kernel/git/frank.li/linux: (139 commits)
arm64: dts: imx8qxp-mek: switch Type-C connector power-role to dual
arm64: dts: imx8qm-mek: switch Type-C connector power-role to dual
arm64: dts: lx2162a-clearfog: set sfp connector leds function and source
arm64: dts: lx2162a-sr-som: add crypto & rtc aliases, model
arm64: dts: lx2160a-cex7: add rtc alias
arm64: dts: lx2160a: complete pinmux for rcwsr12 configuration word
arm64: dts: lx2160a: change zeros to hexadecimal in pinmux nodes
arm64: dts: lx2160a: add sda gpio references for i2c bus recovery
arm64: dts: lx2160a: rename pinmux nodes for readability
arm64: dts: lx2160a: remove duplicate pinmux nodes
arm64: dts: lx2160a: change i2c0 (iic1) pinmux mask to one bit
arm64: dts: lx2160a-cex7/lx2162a-sr-som: fix usd-cd & gpio pinmux
arm64: dts: freescale: imx8mp-moduline-display-106: add typec-power-opmode property
arm64: dts: imx8mp-tqma8mpql: Add DT overlays to explicit list
arm64: dts: imx8mp-evk: Specify ADV7535 register addresses
arm64: dts: imx8dxl-evk: Use audio-graph-card2 for wm8960-2 and wm8960-3
arm64: dts: imx943-evk: Add pf09/53 thermal zone
arm64: dts: imx943-evk: Add fan node and enable active cooling on A55
arm64: dts: imx943-evk: Add nxp,ctrl-ids for scmi_misc
arm64: dts: imx943: Add thermal support
...
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Merge tag 'tegra-for-7.1-arm64-dt' of https://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux into soc/dt
arm64: tegra: Device tree changes for v7.1-rc1
Various fixes and new additions across a number of devices. GPIO and PCI
are enabled on Tegra264 and the Jetson AGX Thor Developer Kit, allowing
it to boot via network and mass storage.
* tag 'tegra-for-7.1-arm64-dt' of https://git.kernel.org/pub/scm/linux/kernel/git/tegra/linux:
arm64: tegra: Add Tegra264 GPIO controllers
arm64: tegra: smaug: Enable SPI-NOR flash
arm64: tegra: Add Jetson AGX Thor Developer Kit support
arm64: tegra: Add PCI controllers on Tegra264
arm64: tegra: Fix RTC aliases
arm64: tegra: Drop redundant clock and reset names for TSEC
arm64: tegra: Fix snps,blen properties
dt-bindings: pci: Document the NVIDIA Tegra264 PCIe controller
Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Daniel Lezcano [Thu, 2 Apr 2026 08:44:25 +0000 (10:44 +0200)]
thermal/core: Remove pointless variable when registering a cooling device
The 'id' variable is set to store the ida_alloc() value which is
already stored into cdev->id. It is pointless to use it because
cdev->id can be used instead.
Signed-off-by: Daniel Lezcano <daniel.lezcano@oss.qualcomm.com>
Signed-off-by: Daniel Lezcano <daniel.lezcano@kernel.org>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Link: https://patch.msgid.link/20260402084426.1360086-1-daniel.lezcano@kernel.org
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
PCI: tegra194: Expose BAR2 (MSI-X) and BAR4 (DMA) as 64-bit BAR_RESERVED
Tegra Endpoint exposes three 64-bit BARs at indices 0, 2, and 4:
- BAR0+BAR1: EPF test/data (programmable 64-bit BAR)
- BAR2+BAR3: MSI-X table (hardware-backed)
- BAR4+BAR5: DMA registers (hardware-backed)
Update tegra_pcie_epc_features so that BAR2 is BAR_RESERVED with
PCI_EPC_BAR_RSVD_MSIX_TBL_RAM (64 KB) & PCI_EPC_BAR_RSVD_MSIX_PBA_RAM
(64 KB) and BAR4 is BAR_RESERVED with PCI_EPC_BAR_RSVD_DMA_CTRL_MMIO (4KB).
This keeps CONSECUTIVE_BAR_TEST working while allowing the host to use
64-bit BAR2 (MSI-X) and BAR4 (DMA).
PCI: tegra194: Make BAR0 programmable and remove 1MB size limit
The Tegra194/234 Endpoint does not support the Resizable BAR capability,
but BAR0 can be programmed to different sizes via the DBI2 BAR registers
in dw_pcie_ep_set_bar_programmable(). The BAR0 size is set once during
initialization.
Remove the fixed 1MB limit from pci_epc_features so Endpoint function
drivers can configure the BAR0 size they need.
PCI: endpoint: Add reserved region type for MSI-X Table and PBA
Add PCI_EPC_BAR_RSVD_MSIX_TBL_RAM and PCI_EPC_BAR_RSVD_MSIX_PBA_RAM to
enum pci_epc_bar_rsvd_region_type so that Endpoint controllers can
describe hardware-owned MSI-X Table and PBA (Pending Bit Array) regions
behind a BAR_RESERVED BAR.
Richard Zhu [Tue, 24 Mar 2026 02:30:32 +0000 (10:30 +0800)]
dt-bindings: PCI: imx6q-pcie: Fix maxItems of clocks and clock-names
Commit 1352f58d7c8d ("dt-bindings: PCI: pci-imx6: Add external reference
clock input") that added reference clock to the binding was incomplete.
The constraints for "clocks" and "clock-names" still enforce an incorrect
number of items. Update maxItems for both properties to 6 to match the
actual hardware configuration.
Felix Gu [Mon, 23 Mar 2026 17:57:59 +0000 (01:57 +0800)]
PCI: aspeed: Fix IRQ domain leak on platform_get_irq() failure
The aspeed_pcie_probe() function calls aspeed_pcie_init_irq_domain()
which allocates pcie->intx_domain and initializes MSI. However, if
platform_get_irq() fails afterwards, the cleanup action was not yet
registered via devm_add_action_or_reset(), causing the IRQ domain
resources to leak.
Fix this by registering the devm cleanup action immediately after
aspeed_pcie_init_irq_domain() succeeds, before calling
platform_get_irq(). This ensures proper cleanup on any subsequent
failure.
The current custom implementation of offsetof() fails UBSAN:
runtime error: member access within null pointer of type 'struct ...'
This means that all its users, including container_of(), free() and
realloc(), fail.
Use __builtin_offsetof() instead which does not have this issue and
has been available since GCC 4 and clang 3.
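A minimal sketch of the builtin-based approach follows. The ex_offsetof() and ex_container_of() macros and the two structs are hypothetical names chosen for this illustration, not the definitions used in the tree; __builtin_offsetof() itself is the GCC/clang builtin named above.

```c
#include <stddef.h>

/* Offset computed by the compiler, with no null-pointer member access. */
#define ex_offsetof(type, member) __builtin_offsetof(type, member)

/* Recover the enclosing structure from a pointer to one of its members. */
#define ex_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - ex_offsetof(type, member)))

struct ex_link {
	struct ex_link *next;
};

struct ex_node {
	int key;
	struct ex_link link;
};

/* Offset of the embedded link inside its node. */
size_t ex_link_offset(void)
{
	return ex_offsetof(struct ex_node, link);
}

/* Walks back from the embedded link to the node and reads its key. */
int ex_key_from_link(struct ex_link *l)
{
	return ex_container_of(l, struct ex_node, link)->key;
}

/* Builds a node on the stack and recovers its key through the link. */
int ex_roundtrip(int key)
{
	struct ex_node n = { .key = key };

	return ex_key_from_link(&n.link);
}
```

The open-coded form, which takes the address of a member through a null pointer of the structure type, is what UBSAN reports; the builtin avoids that expression entirely.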
Documentation: fix two typos in latest update to the security report howto
In previous patch "Documentation: clarify the mandatory and desirable
info for security reports" I left two typos that I didn't detect in local
checks. One is "get_maintainers.pl" (no 's' in the script name), and the
other one is a missing closing quote after "Reported-by", which didn't
have effect here but I don't know if it can break rendering elsewhere
(e.g. on the public HTML page). Better fix it before it gets merged.
fstatat() contains two open-coded copies of makedev() to handle minor
numbers >= 256. Now that the regular makedev() handles both large minor
and major numbers correctly, use the common function.
statx() returns both 32-bit minor and major numbers. For both of them to
fit into the 'dev_t' in 'struct stat', that needs to be 64 bits wide.
The other uses of 'dev_t' in nolibc are makedev() and friends and
mknod(). makedev() and friends are going to be adapted in an upcoming
commit and mknod() will silently truncate 'dev_t' to 'unsigned int' in
the kernel, similar to other libcs.
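To show why 64 bits are needed, here is a sketch of one way to pack two full 32-bit numbers into a 64-bit device number. The bit layout follows the glibc-style encoding and is an assumption for illustration; it is not claimed to be the layout nolibc uses, and the ex_* names are hypothetical.

```c
#include <stdint.h>

typedef uint64_t ex_dev_t;

/* Pack a 32-bit major and a 32-bit minor into a 64-bit device number. */
ex_dev_t ex_makedev(uint32_t major, uint32_t minor)
{
	ex_dev_t dev;

	dev  = ((ex_dev_t)(major & 0x00000fffu)) << 8;	/* low 12 major bits */
	dev |= ((ex_dev_t)(major & 0xfffff000u)) << 32;	/* high 20 major bits */
	dev |= ((ex_dev_t)(minor & 0x000000ffu));	/* low 8 minor bits */
	dev |= ((ex_dev_t)(minor & 0xffffff00u)) << 12;	/* high 24 minor bits */
	return dev;
}

/* Extract the major number again. */
uint32_t ex_major(ex_dev_t dev)
{
	return (uint32_t)(((dev >> 32) & 0xfffff000u) | ((dev >> 8) & 0x00000fffu));
}

/* Extract the minor number again. */
uint32_t ex_minor(ex_dev_t dev)
{
	return (uint32_t)(((dev >> 12) & 0xffffff00u) | (dev & 0x000000ffu));
}
```

Small numbers keep the traditional compact form (major 8, minor 1 packs to 0x801), while minor numbers of 256 and above no longer collide with the major field.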
RISC-V: KVM: Cache gstage pgd_levels in struct kvm_gstage
Gstage page-table helpers frequently chase gstage->kvm->arch to
fetch pgd_levels. This adds noise and repeats the same dereference
chain in hot paths.
Add pgd_levels to struct kvm_gstage and initialize it from kvm->arch
when setting up a gstage instance. Introduce kvm_riscv_gstage_init()
to centralize initialization and switch gstage code to use
gstage->pgd_levels.
RISC-V: KVM: Support runtime configuration for per-VM's HGATP mode
Introduce one per-VM architecture-specific field to support runtime
configuration of the G-stage page table format:
- kvm->arch.pgd_levels: the corresponding number of page table levels
for the selected mode.
This field replaces the previous global variables
kvm_riscv_gstage_mode and kvm_riscv_gstage_pgd_levels, enabling different
virtual machines to independently select their G-stage page table format
instead of being forced to share the maximum mode detected by the kernel
at boot time.
Signed-off-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Reviewed-by: Andrew Jones <andrew.jones@oss.qualcomm.com>
Reviewed-by: Anup Patel <anup@brainfault.org>
Reviewed-by: Guo Ren <guoren@kernel.org>
Reviewed-by: Nutty Liu <nutty.liu@hotmail.com>
Link: https://lore.kernel.org/r/20260403153019.9916-2-fangyu.yu@linux.alibaba.com
Signed-off-by: Anup Patel <anup@brainfault.org>
Input: xpad - add support for BETOP BTP-KP50B/C controller's wireless mode
The wireless dongles of BETOP's BTP-KP50B and BTP-KP50C controllers both
work as standard Xbox 360 controllers. Add USB device IDs for them to
the xpad driver.
Input: xpad - add support for Razer Wolverine V3 Pro
Add device IDs for the Razer Wolverine V3 Pro controller in both
wired (0x0a57) and wireless 2.4 GHz dongle (0x0a59) modes.
The controller uses the Xbox 360 protocol (vendor-specific class,
subclass 93, protocol 1) on interface 0 with an identical 20-byte
input report layout, so no additional processing is needed.
mshv: Fix infinite fault loop on permission-denied GPA intercepts
Prevent infinite fault loops when guests access memory regions without
proper permissions. Currently, mshv_handle_gpa_intercept() attempts to
remap pages for all faults on movable memory regions, regardless of
whether the access type is permitted. When a guest writes to a read-only
region, the remap succeeds but the region remains read-only, causing
immediate re-fault and spinning the vCPU indefinitely.
Validate intercept access type against region permissions before
attempting remaps. Reject writes to non-writable regions and executes to
non-executable regions early, returning false to let the VMM handle the
intercept appropriately.
This also closes a potential DoS vector where malicious guests could
intentionally trigger these fault loops to consume host resources.
Fixes: b9a66cd5ccbb ("mshv: Add support for movable memory regions")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
Reviewed-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
Signed-off-by: Wei Liu <wei.liu@kernel.org>
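A minimal sketch of the permission check described in the mshv fix above. The enum, the flag names and example_access_permitted() are hypothetical and do not reflect the driver's real types; reads are assumed to be always allowed, since the message only mentions rejecting writes and executes.

```c
#include <stdbool.h>

/* Hypothetical access types reported by a GPA intercept. */
enum example_access {
	EXAMPLE_ACCESS_READ,
	EXAMPLE_ACCESS_WRITE,
	EXAMPLE_ACCESS_EXEC,
};

/* Hypothetical region permission flags. */
#define EXAMPLE_REGION_WRITABLE		(1u << 0)
#define EXAMPLE_REGION_EXECUTABLE	(1u << 1)

/*
 * Return true if a remap is worth attempting. A false result means the
 * fault cannot be resolved by remapping, so it should be handed to the
 * VMM instead of being retried forever.
 */
static bool example_access_permitted(enum example_access access,
				     unsigned int region_flags)
{
	switch (access) {
	case EXAMPLE_ACCESS_WRITE:
		return (region_flags & EXAMPLE_REGION_WRITABLE) != 0;
	case EXAMPLE_ACCESS_EXEC:
		return (region_flags & EXAMPLE_REGION_EXECUTABLE) != 0;
	default:
		return true;	/* assumption: reads are always permitted */
	}
}
```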
PCI: hv: Fix double ida_free in hv_pci_probe error path
If hv_pci_probe() fails after storing the domain number in
hbus->bridge->domain_nr, the error path frees this domain_nr via
pci_bus_release_emul_domain_nr(). However, during cleanup, the bridge
release callback pci_release_host_bridge_dev() also frees the domain_nr,
causing ida_free() to be called on the same ID twice and triggering the
following warning:
ida_free called for id=28971 which is not allocated.
WARNING: lib/idr.c:594 at ida_free+0xdf/0x160, CPU#0: kworker/0:2/198
Call Trace:
pci_bus_release_emul_domain_nr+0x17/0x20
pci_release_host_bridge_dev+0x4b/0x60
device_release+0x3b/0xa0
kobject_put+0x8e/0x220
devm_pci_alloc_host_bridge_release+0xe/0x20
devres_release_all+0x9a/0xd0
device_unbind_cleanup+0x12/0xa0
really_probe+0x1c5/0x3f0
vmbus_add_channel_work+0x135/0x1a0
Fix this by letting the PCI core free the domain_nr and removing
the explicit free from the pci-hyperv driver.
Merge tag 's390-7.0-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix a memory leak in the zcrypt driver where the AP message buffer
for clear key RSA requests was allocated twice, once by the caller
and again locally, causing the first allocation to never be freed
- Fix the cpum_sf perf sampling rate overflow adjustment to clamp the
recalculated rate to the hardware maximum, preventing exceptions on
heavily loaded systems running with HZ=1000
* tag 's390-7.0-7' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/zcrypt: Fix memory leak with CCA cards used as accelerator
s390/cpum_sf: Cap sampling rate to prevent lsctl exception
Merge tag 'hwmon-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging
Pull hwmon fixes from Guenter Roeck:
- Fix temperature sensor for PRIME X670E-PRO WIFI
- occ: Add missing newline, and fix potential division by zero
- pmbus:
- Fix device ID comparison and printing in tps53676_identify()
- Add missing MODULE_IMPORT_NS("PMBUS") for ltc4286
- Check return value of page-select write in pxe1610 probe
- Fix array access with zero-length block tps53679 read
* tag 'hwmon-for-v7.0-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
hwmon: (asus-ec-sensors) Fix T_Sensor for PRIME X670E-PRO WIFI
hwmon: (occ) Fix missing newline in occ_show_extended()
hwmon: (occ) Fix division by zero in occ_show_power_1()
hwmon: (tps53679) Fix device ID comparison and printing in tps53676_identify()
hwmon: (ltc4286) Add missing MODULE_IMPORT_NS("PMBUS")
hwmon: (pxe1610) Check return value of page-select write in probe
hwmon: (tps53679) Fix array access with zero-length block read
Lucas De Marchi [Mon, 30 Mar 2026 13:13:52 +0000 (08:13 -0500)]
module: Simplify warning on positive returns from module_init()
It should now be rare to trigger this warning - it doesn't need to be so
verbose. Make it follow the usual style in the module loading code.
For the same reason, drop the dump_stack().
Suggested-by: Petr Pavlu <petr.pavlu@suse.com>
Signed-off-by: Lucas De Marchi <demarchi@kernel.org>
Reviewed-by: Aaron Tomlin <atomlin@atomlin.com>
Reviewed-by: Petr Pavlu <petr.pavlu@suse.com>
Reviewed-by: Daniel Gomez <da.gomez@samsung.com>
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
Lucas De Marchi [Mon, 30 Mar 2026 13:13:51 +0000 (08:13 -0500)]
module: Override -EEXIST module return
The -EEXIST errno is reserved by the module loading functionality. When
userspace calls [f]init_module(), it expects a -EEXIST to mean that the
module is already loaded in the kernel. If module_init() returns it,
that is not true anymore.
Override the error when returning to userspace: it doesn't make sense to
change potentially long error propagation call chains just because the
error will end up as the return value of module_init().
Closes: https://lore.kernel.org/all/aKLzsAX14ybEjHfJ@orbyte.nwl.cc/
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Aaron Tomlin <atomlin@atomlin.com>
Cc: Petr Pavlu <petr.pavlu@suse.com>
Cc: Daniel Gomez <da.gomez@samsung.com>
Cc: Phil Sutter <phil@nwl.cc>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Lucas De Marchi <demarchi@kernel.org>
[Sami: Fixed a typo.]
Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
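The override can be pictured as a small filter applied to the module_init() return value before it reaches userspace. This is only a sketch: example_filter_init_ret() is a hypothetical name, and the replacement errno used here (-EBUSY) is an assumption, as the message above does not say which value is substituted.

```c
#include <errno.h>

/*
 * Filter the value returned by a module's init function before it is
 * propagated to userspace. -EEXIST is reserved to mean "module already
 * loaded", so it must not leak out of module_init() itself.
 */
static int example_filter_init_ret(int ret)
{
	if (ret == -EEXIST)
		return -EBUSY;	/* assumed replacement errno */

	return ret;
}
```

Doing the translation at this single point avoids touching every call chain that might propagate -EEXIST up to module_init().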
====================
dpll: add frequency monitoring feature
This series adds support for monitoring the measured input frequency
of DPLL input pins via the DPLL netlink interface.
Some DPLL devices can measure the actual frequency being received on
input pins. The approach mirrors the existing phase-offset-monitor
feature: a device-level attribute (DPLL_A_FREQUENCY_MONITOR) enables
or disables monitoring, and a per-pin attribute
(DPLL_A_PIN_MEASURED_FREQUENCY) exposes the measured frequency in
millihertz (mHz) when monitoring is enabled.
Patch 1 adds the new attributes to the DPLL netlink spec (dpll.yaml),
the DPLL_PIN_MEASURED_FREQUENCY_DIVIDER constant, regenerates the
auto-generated UAPI header and netlink policy, and updates
Documentation/driver-api/dpll.rst.
Patch 2 adds the callback operations (freq_monitor_get/set for
devices, measured_freq_get for pins) and the corresponding netlink
GET/SET handlers in the DPLL core. The core only invokes
measured_freq_get when the frequency monitor is enabled on the parent
device. The freq_monitor_get callback is required when measured_freq_get
is provided.
Patch 3 implements the feature in the ZL3073x driver by extracting
a common measurement latch helper from the existing FFO update path,
adding a frequency measurement function, and wiring up the new
callbacks.
====================
Ivan Vecera [Thu, 2 Apr 2026 18:40:57 +0000 (20:40 +0200)]
dpll: zl3073x: implement frequency monitoring
Extract common measurement latch logic from zl3073x_ref_ffo_update()
into a new zl3073x_ref_freq_meas_latch() helper and add
zl3073x_ref_freq_meas_update() that uses it to latch and read absolute
input reference frequencies in Hz.
Add meas_freq field to struct zl3073x_ref and the corresponding
zl3073x_ref_meas_freq_get() accessor. The measured frequencies are
updated periodically alongside the existing FFO measurements.
Add freq_monitor boolean to struct zl3073x_dpll and implement the
freq_monitor_set/get device callbacks to enable/disable frequency
monitoring via the DPLL netlink interface.
Implement measured_freq_get pin callback for input pins that returns the
measured input frequency in mHz.
Ivan Vecera [Thu, 2 Apr 2026 18:40:56 +0000 (20:40 +0200)]
dpll: add frequency monitoring callback ops
Add new callback operations for a dpll device:
- freq_monitor_get(..) - to obtain current state of frequency monitor
feature from dpll device,
- freq_monitor_set(..) - to allow feature configuration.
Add new callback operation for a dpll pin:
- measured_freq_get(..) - to obtain the measured frequency in mHz.
Obtain the feature state value using the get callback and provide it to
the user if the device driver implements callbacks. The measured_freq_get
pin callback is only invoked when the frequency monitor is enabled.
The freq_monitor_get device callback is required when measured_freq_get
is provided by the driver.
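The rules above (measured_freq_get is only called while the monitor is enabled, and freq_monitor_get must exist whenever measured_freq_get does) might be sketched like this. The ops structures, signatures, return-value convention and stub callbacks are all hypothetical simplifications, not the real dpll core API.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily simplified callback tables. */
struct example_dev_ops {
	int (*freq_monitor_get)(void *priv, bool *enabled);
};

struct example_pin_ops {
	int (*measured_freq_get)(void *priv, uint64_t *freq_mhz);
};

/*
 * Returns 1 if a measured frequency was stored in *freq_mhz, 0 if the
 * attribute should be skipped, or a negative errno on error.
 */
static int example_report_measured_freq(const struct example_dev_ops *dev_ops,
					const struct example_pin_ops *pin_ops,
					void *priv, uint64_t *freq_mhz)
{
	bool enabled = false;
	int ret;

	if (!pin_ops->measured_freq_get)
		return 0;		/* driver does not support the feature */
	if (!dev_ops->freq_monitor_get)
		return -EINVAL;		/* required companion callback missing */

	ret = dev_ops->freq_monitor_get(priv, &enabled);
	if (ret)
		return ret;
	if (!enabled)
		return 0;		/* monitor off: do not query the pin */

	ret = pin_ops->measured_freq_get(priv, freq_mhz);
	return ret ? ret : 1;
}

/* Illustrative driver stubs: priv points to the monitor on/off state. */
static int example_stub_monitor_get(void *priv, bool *enabled)
{
	*enabled = *(bool *)priv;
	return 0;
}

static int example_stub_measured_freq_get(void *priv, uint64_t *freq_mhz)
{
	(void)priv;
	*freq_mhz = 10000000123ULL;	/* 10 MHz plus 123 mHz, arbitrary */
	return 0;
}
```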
Ivan Vecera [Thu, 2 Apr 2026 18:40:55 +0000 (20:40 +0200)]
dpll: add frequency monitoring to netlink spec
Add DPLL_A_FREQUENCY_MONITOR device attribute to allow control over
the frequency monitor feature. The attribute uses the existing
dpll_feature_state enum (enable/disable) and is present in both
device-get reply and device-set request.
Add DPLL_A_PIN_MEASURED_FREQUENCY pin attribute to expose the measured
input frequency in millihertz (mHz). The attribute is present in the
pin-get reply. Add DPLL_PIN_MEASURED_FREQUENCY_DIVIDER constant to
allow userspace to extract integer and fractional parts.
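A userspace consumer could split the millihertz value roughly as follows. The divider value of 1000 is an assumption derived from the mHz unit, and the example_ names are hypothetical; real code should use the constant from the generated UAPI header.

```c
#include <stdint.h>

/* Assumed value: 1000 mHz per Hz. Use the UAPI constant in real code. */
#define EXAMPLE_MEASURED_FREQUENCY_DIVIDER 1000ULL

/* Integer part of the measured frequency, in Hz. */
static inline uint64_t example_freq_hz(uint64_t freq_mhz)
{
	return freq_mhz / EXAMPLE_MEASURED_FREQUENCY_DIVIDER;
}

/* Fractional part of the measured frequency, in mHz (0..999). */
static inline uint64_t example_freq_frac_mhz(uint64_t freq_mhz)
{
	return freq_mhz % EXAMPLE_MEASURED_FREQUENCY_DIVIDER;
}
```

For example, a reported value of 10000000123 would split into 10000000 Hz and 123 mHz under this assumption.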
The test currently allegedly makes sure that VMRUN causes a #GP if the
vmcb12 GPA is valid but unmappable. However, it calls run_guest() with
the test vmcb12 GPA, and the #GP is produced from VMLOAD, not VMRUN.
Additionally, the underlying logic just changed to match architectural
behavior, and all of VMRUN/VMLOAD/VMSAVE fail emulation if vmcb12 cannot
be mapped. The CPU still injects a #GP if the vmcb12 GPA exceeds
maxphyaddr.
Rework the test to use the KVM_ONE_VCPU_TEST[_SUITE] harness, and
test all of VMRUN/VMLOAD/VMSAVE with both an invalid GPA (-1ULL) causing
a #GP, and a valid but unmappable GPA causing emulation failure. Execute
the instructions directly from L1 instead of run_guest() to make sure
the #GP or emulation failure is produced by the right instruction.
Leave the #VMEXIT with unmappable GPA test case as-is, but wrap it with
a test harness as well.
Opportunistically drop gp_triggered, as the test already checks that
a #GP was injected through a SYNC. Also, use the first unmapped GPA
instead of the maximum legal GPA, as some CPUs inject a #GP for the
maximum legal GPA (likely in a reserved area).
Yosry Ahmed [Mon, 16 Mar 2026 20:27:30 +0000 (20:27 +0000)]
KVM: nSVM: Fail emulation of VMRUN/VMLOAD/VMSAVE if mapping vmcb12 fails
KVM currently injects a #GP if mapping vmcb12 fails when emulating
VMRUN/VMLOAD/VMSAVE. This is not architectural behavior, as #GP should
only be injected if the physical address is not supported or not
aligned. Instead, handle it as an emulation failure, similar to how nVMX
handles failures to read/write guest memory in several emulation paths.
When virtual VMLOAD/VMSAVE is enabled, if vmcb12's GPA is not mapped in
the NPTs a VMEXIT(#NPF) will be generated, and KVM will install an MMIO
SPTE and emulate the instruction if there is no corresponding memslot.
x86_emulate_insn() will return EMULATION_FAILED as VMLOAD/VMSAVE are not
handled as part of the twobyte_insn cases.
Even though this will also result in an emulation failure, it will only
result in a straight return to userspace if
KVM_CAP_EXIT_ON_EMULATION_FAILURE is set. Otherwise, it would inject #UD
and only exit to userspace if not in guest mode. So the behavior is
slightly different if virtual VMLOAD/VMSAVE is enabled.
Fixes: 3d6368ef580a ("KVM: SVM: Add VMRUN handler")
Reported-by: Jim Mattson <jmattson@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-8-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
Yosry Ahmed [Mon, 16 Mar 2026 20:27:29 +0000 (20:27 +0000)]
KVM: SVM: Treat mapping failures equally in VMLOAD/VMSAVE emulation
Currently, a #GP is only injected if kvm_vcpu_map() fails with -EINVAL.
But it could also fail with -EFAULT if creating a host mapping failed.
Inject a #GP in all cases; there is no reason to treat failure modes differently.
Similar to commit 01ddcdc55e09 ("KVM: nSVM: Always inject a #GP if
mapping VMCB12 fails on nested VMRUN"), treat all failures equally.
Yosry Ahmed [Mon, 16 Mar 2026 20:27:28 +0000 (20:27 +0000)]
KVM: SVM: Check EFER.SVME and CPL on #GP intercept of SVM instructions
When KVM intercepts #GP on an SVM instruction from L2, it checks the
legality of RAX, and injects a #GP if RAX is illegal, or otherwise
synthesizes a #VMEXIT to L1. However, checking EFER.SVME and CPL takes
precedence over both the RAX check and the intercept. Call
nested_svm_check_permissions() first to cover both.
Note that if #GP is intercepted on an SVM instruction in L1, the intercept
handlers of VMRUN/VMLOAD/VMSAVE already perform these checks.
Note #2, if KVM does not intercept #GP, the check for EFER.SVME is not
done in the correct order, because KVM handles it by intercepting the
instructions when EFER.SVME=0 and injecting #UD. However, a #GP
injected by hardware would happen before the instruction intercept,
leading to #GP taking precedence over #UD from the guest's perspective.
Opportunistically add a FIXME for this.
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-6-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
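The ordering described above for a #GP intercepted on an SVM instruction from L2 can be sketched as a small decision function. The names are hypothetical and the logic is a simplification: it assumes a clear EFER.SVME results in #UD and a non-zero CPL in #GP, and it ignores any other conditions the real permission check may cover.

```c
#include <stdbool.h>

/* Hypothetical outcomes of handling the intercepted #GP. */
enum example_outcome {
	EXAMPLE_INJECT_UD,
	EXAMPLE_INJECT_GP,
	EXAMPLE_SYNTH_VMEXIT,
};

static enum example_outcome example_l2_gp_on_svm_insn(bool efer_svme,
						       unsigned int cpl,
						       bool rax_legal)
{
	/* Permission checks come first: EFER.SVME, then CPL. */
	if (!efer_svme)
		return EXAMPLE_INJECT_UD;
	if (cpl != 0)
		return EXAMPLE_INJECT_GP;

	/* Only then is the legality of RAX considered. */
	if (!rax_legal)
		return EXAMPLE_INJECT_GP;

	/* Everything checks out: reflect the intercept to L1. */
	return EXAMPLE_SYNTH_VMEXIT;
}
```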
When #GP is intercepted by KVM, the #GP interception handler checks
whether the GPA in RAX is legal and reinjects the #GP accordingly.
Otherwise, it calls into the appropriate interception handler for
VMRUN/VMLOAD/VMSAVE. The intercept handlers do not check RAX.
However, the intercept handlers need to do the RAX check, because if the
guest has a smaller MAXPHYADDR, RAX could be legal from the hardware
perspective (i.e. CPU does not inject #GP), but not from the vCPU's
perspective. Note that with allow_smaller_maxphyaddr, both NPT and VLS
cannot be used, so VMLOAD/VMSAVE have to be intercepted, and RAX can
always be checked against the vCPU's MAXPHYADDR.
Move the check into the interception handlers for VMRUN/VMLOAD/VMSAVE as
the CPU does not check RAX before the interception. Read RAX using
kvm_register_read() to avoid a false negative on page_address_valid() on
32-bit due to garbage in the higher bits.
Keep the check in the #GP intercept handler in the nested case where
a #VMEXIT is synthesized into L1, as the RAX check is still needed there
and takes precedence over the intercept.
Opportunistically add a FIXME about the #VMEXIT being synthesized into
L1, as it needs to be conditional.
Yosry Ahmed [Mon, 16 Mar 2026 20:27:26 +0000 (20:27 +0000)]
KVM: SVM: Properly check RAX on #GP intercept of SVM instructions
When KVM intercepts #GP on an SVM instruction, it re-injects the #GP if
the instruction was executed with a misaligned RAX. However, a #GP
should also be reinjected if RAX contains an illegal GPA. According to
the APM, one of the #GP conditions is:
rAX referenced a physical address above the maximum
supported physical address.
Replace the PAGE_MASK check with page_address_valid(), which checks both
page-alignment as well as the legality of the GPA based on the vCPU's
MAXPHYADDR. Use kvm_register_read() to read RAX so that bits 63:32 are
dropped when the vCPU is in 32-bit mode, i.e. to avoid a false positive
when checking the validity of the address.
Note that this is currently only a problem if KVM is running an L2 guest
and ends up synthesizing a #VMEXIT to L1, as the RAX check takes
precedence over the intercept. Otherwise, if KVM emulates the
instruction, kvm_vcpu_map() should fail on illegal GPAs and inject a #GP
anyway. However, following patches will change the failure behavior of
kvm_vcpu_map(), so make sure the #GP interception handler does this
appropriately.
Opportunistically drop a teaser FIXME about the SVM instructions
handling on #GP belonging in the emulator.
Fixes: 82a11e9c6fa2 ("KVM: SVM: Add emulation support for #GP triggered by SVM instructions")
Fixes: d1cba6c92237 ("KVM: x86: nSVM: test eax for 4K alignment for GP errata workaround")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-4-yosry@kernel.org
[sean: massage wording with respect to kvm_register_read()]
Signed-off-by: Sean Christopherson <seanjc@google.com>
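The effect of reading RAX with mode awareness can be illustrated with a tiny helper. example_register_read() is a hypothetical stand-in, not the real kvm_register_read(); it only models the "drop bits 63:32 outside 64-bit mode" behavior described above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return the register value as the guest sees it: the full 64 bits in
 * 64-bit mode, otherwise only the low 32 bits, so that stale upper bits
 * cannot make a valid address look invalid.
 */
static uint64_t example_register_read(uint64_t raw, bool is_64bit_mode)
{
	return is_64bit_mode ? raw : (uint32_t)raw;
}
```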
Yosry Ahmed [Mon, 16 Mar 2026 20:27:25 +0000 (20:27 +0000)]
KVM: SVM: Refactor SVM instruction handling on #GP intercept
Instead of returning an opcode from svm_instr_opcode() and then passing
it to emulate_svm_instr(), which uses it to find the corresponding exit
code and intercept handler, return the exit code directly from
svm_instr_opcode(), and rename it to svm_get_decoded_instr_exit_code().
emulate_svm_instr() boils down to synthesizing a #VMEXIT or calling the
intercept handler, so open-code it in gp_interception(), and use
svm_invoke_exit_handler() to call the intercept handler based on
the exit code. This allows for dropping the SVM_INSTR_* enum, and the
const array mapping its values to exit codes and intercept handlers.
In gp_interception(), handle SVM instructions first with an early return,
and invert the is_guest_mode() checks, un-indenting the rest of the code.
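A sketch of the "return the exit code directly" idea. The function name, the -1 sentinel and the EXAMPLE_ constants are hypothetical; the ModRM bytes and exit code values are written from memory of the AMD encoding and should be treated as assumptions to verify against the APM and the kernel headers.

```c
#include <stdint.h>

/* Assumed SVM exit codes for the three instructions. */
#define EXAMPLE_EXIT_VMRUN	0x080
#define EXAMPLE_EXIT_VMLOAD	0x082
#define EXAMPLE_EXIT_VMSAVE	0x083

/*
 * Map the ModRM byte of a decoded 0F 01 /r instruction straight to an
 * exit code, or -1 if it is not one of the handled SVM instructions.
 * No intermediate enum or lookup table is needed.
 */
static int example_decoded_exit_code(uint8_t modrm)
{
	switch (modrm) {
	case 0xd8:	/* VMRUN (assumed encoding) */
		return EXAMPLE_EXIT_VMRUN;
	case 0xda:	/* VMLOAD (assumed encoding) */
		return EXAMPLE_EXIT_VMLOAD;
	case 0xdb:	/* VMSAVE (assumed encoding) */
		return EXAMPLE_EXIT_VMSAVE;
	default:
		return -1;
	}
}
```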
Yosry Ahmed [Mon, 16 Mar 2026 20:27:24 +0000 (20:27 +0000)]
KVM: SVM: Properly check RAX in the emulator for SVM instructions
Architecturally, VMRUN/VMLOAD/VMSAVE should generate a #GP if the
physical address in RAX is not supported. check_svme_pa() hardcodes this
to checking that bits 63-48 are not set. This is incorrect on HW
supporting 52 bits of physical address space. Additionally, the emulator
does not check if the address is not aligned, which should also result
in #GP.
Use page_address_valid() which properly checks alignment and the address
legality based on the guest's MAXPHYADDR. Plumb it through
x86_emulate_ops, similar to is_canonical_addr(), to avoid directly
accessing the vCPU object in emulator code.
Fixes: 01de8b09e606 ("KVM: SVM: Add intercept checks for SVM instructions")
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yosry Ahmed <yosry@kernel.org>
Link: https://patch.msgid.link/20260316202732.3164936-2-yosry@kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
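The difference between the hardcoded check and a MAXPHYADDR-based one can be sketched as below. Both helpers are hypothetical approximations with example_ names: the first models the old "bits 63:48 must be clear" test, the second models an alignment plus MAXPHYADDR check in the spirit of page_address_valid(), assuming 4 KiB pages.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_PAGE_SIZE 4096ULL	/* assumed 4 KiB pages */

/* Old behavior: only reject addresses with any of bits 63:48 set. */
static bool example_old_check(uint64_t gpa)
{
	return (gpa & 0xffff000000000000ULL) == 0;
}

/* New behavior: require page alignment and no bits at or above MAXPHYADDR. */
static bool example_page_address_valid(uint64_t gpa, unsigned int maxphyaddr)
{
	bool aligned = (gpa & (EXAMPLE_PAGE_SIZE - 1)) == 0;
	bool legal = maxphyaddr >= 64 || (gpa >> maxphyaddr) == 0;

	return aligned && legal;
}
```

An address with bit 50 set shows the problem: the old check rejects it even for a guest with a 52-bit MAXPHYADDR, while the new check accepts it there and rejects it for a 48-bit guest.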