When building with CONFIG_CFI_CLANG=y after the recent series to
separate the x86 startup code, there are objtool warnings along the
lines of:
vmlinux.o: warning: objtool: __pi___cfi_startup_64_load_idt() falls through to next function __pi_startup_64_load_idt()
vmlinux.o: warning: objtool: __pi___cfi_startup_64_setup_gdt_idt() falls through to next function __pi_startup_64_setup_gdt_idt()
vmlinux.o: warning: objtool: __pi___cfi___startup_64() falls through to next function __pi___startup_64()
As the comment in validate_branch() states, this is expected, so ignore
these symbols in the same way that __cfi_ and __pfx_ symbols are already
ignored for the rest of the kernel.
Fixes: 7b38dec3c5af ("x86/boot: Create a confined code area for startup code") Signed-off-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Acked-by: Ard Biesheuvel <ardb@kernel.org>
This function is going away so replace the callsites with the equivalent
functionality. Add a new SAVIC-specific termination reason. If more
granularity is needed there, it will be revisited in the future.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:24 +0000 (12:22 +0200)]
x86/boot: Move startup code out of __head section
Move startup code out of the __head section, now that this no longer has
a special significance. Move everything into .text or .init.text as
appropriate, so that startup code is not kept around unnecessarily.
[ bp: Fold in hunk to fix 32-bit CPU hotplug: Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202509022207.56fd97f4-lkp@intel.com ] Signed-off-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250828102202.1849035-45-ardb+git@google.com
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:23 +0000 (12:22 +0200)]
efistub/x86: Remap inittext read-execute when needed
Recent EFI x86 systems are more strict when it comes to mapping boot
images, and require that mappings are either read-write or read-execute.
Now that the boot code is being cleaned up and refactored, most of it is
being moved into .init.text [where it arguably belongs] but that implies
that when booting on such strict EFI firmware, we need to take care to
map .init.text (and the .altinstr_aux section that follows it)
read-execute as well.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:22 +0000 (12:22 +0200)]
x86/boot: Create a confined code area for startup code
In order to be able to have tight control over which code may execute
from the early 1:1 mapping of memory, but still link vmlinux as a single
executable, prefix all symbol references in startup code with __pi_, and
invoke it from outside using the __pi_ prefix.
Use objtool to check that no absolute symbol references are present in
the startup code, as these cannot be used from code running from the 1:1
mapping.
Note that this also requires disabling the latent-entropy GCC plugin, as
the global symbol references that it injects would require explicit
exports, and given that the startup code rarely executes more than once,
it is not a useful source of entropy anyway.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:21 +0000 (12:22 +0200)]
x86/kbuild: Incorporate boot/startup/ via Kbuild makefile
Using core-y is not the correct way to get kbuild to descend into
arch/x86/boot/startup. For instance, building an individual object does
not work as expected when the pattern rule is local to the Makefile
$ make arch/x86/boot/startup/map_kernel.pi.o
GEN Makefile
CALL /home/ardb/linux/scripts/checksyscalls.sh
DESCEND objtool
INSTALL libsubcmd_headers
make[3]: *** No rule to make target 'arch/x86/boot/startup/map_kernel.pi.o'. Stop.
make[2]: *** [/home/ardb/linux/scripts/Makefile.build:461: arch/x86] Error 2
make[1]: *** [/home/ardb/linux/Makefile:2011: .] Error 2
make: *** [/home/ardb/linux/Makefile:248: __sub-make] Error 2
So use obj-y from arch.x86/Kbuild instead, which makes things work as
expected.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:20 +0000 (12:22 +0200)]
x86/boot: Revert "Reject absolute references in .head.text"
This reverts commit
faf0ed487415 ("x86/boot: Reject absolute references in .head.text")
The startup code is checked directly for the absence of absolute symbol
references, so checking the .head.text section in the relocs tool is no
longer needed.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:19 +0000 (12:22 +0200)]
x86/boot: Check startup code for absence of absolute relocations
Invoke objtool on each startup code object individually to check for the
absence of absolute relocations. This is needed because this code will
be invoked from the 1:1 mapping of memory before those absolute virtual
addresses (which are derived from the kernel virtual base address
provided to the linker and possibly shifted at boot) are mapped.
Only objects built under arch/x86/boot/startup/ have this restriction,
and once they have been incorporated into vmlinux.o, this distinction is
difficult to make. So force the invocation of objtool for each object
file individually, even if objtool is deferred to vmlinux.o for the rest
of the build. In the latter case, only pass --noabs and nothing else;
otherwise, append it to the existing objtool command line.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:18 +0000 (12:22 +0200)]
objtool: Add action to check for absence of absolute relocations
The x86 startup code must not use absolute references to code or data,
as it executes before the kernel virtual mapping is up.
Add an action to objtool to check all allocatable sections (with the
exception of __patchable_function_entries, which uses absolute
references for nebulous reasons) and raise an error if any absolute
references are found.
Note that debug sections typically contain lots of absolute references
too, but those are not allocatable so they will be ignored.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:16 +0000 (12:22 +0200)]
x86/sev: Move __sev_[get|put]_ghcb() into separate noinstr object
Rename sev-nmi.c to noinstr.c, and move the get/put GHCB routines into it too,
which are also annotated as 'noinstr' and suffer from the same problem as the
NMI code, i.e., that GCC may ignore the __no_sanitize_address__ function
attribute implied by 'noinstr' and insert KASAN instrumentation anyway.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:15 +0000 (12:22 +0200)]
x86/sev: Provide PIC aliases for SEV related data objects
Provide PIC aliases for data objects that are shared between the SEV startup
code and the SEV code that executes later. This is needed so that the confined
startup code is permitted to access them.
This requires some of these variables to be moved into a source file that is
not part of the startup code, as the PIC alias is already implied, and
exporting variables in the opposite direction is not supported.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:14 +0000 (12:22 +0200)]
x86/boot: Provide PIC aliases for 5-level paging related constants
Provide PIC aliases for the global variables related to 5-level paging, so
that the startup code can access them in order to populate the kernel page
tables.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:13 +0000 (12:22 +0200)]
x86/boot: Drop redundant RMPADJUST in SEV SVSM presence check
snp_vmpl will be assigned a non-zero value when executing at a VMPL other than
0, and this is inferred from a call to RMPADJUST, which only works when
running at VMPL0.
This means that testing snp_vmpl is sufficient, and there is no need to
perform the same check again.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:12 +0000 (12:22 +0200)]
x86/sev: Use boot SVSM CA for all startup and init code
To avoid having to reason about whether or not to use the per-CPU SVSM calling
area when running startup and init code on the boot CPU, reuse the boot SVSM
calling area as the per-CPU area for the BSP.
Thus, remove the need to make the per-CPU variables and associated state in
sev_cfg accessible to the startup code once confined.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:11 +0000 (12:22 +0200)]
x86/sev: Pass SVSM calling area down to early page state change API
The early page state change API is mostly only used very early, when only the
boot time SVSM calling area is in use. However, this API is also called by the
kexec finishing code, which runs very late, and potentially from a different
CPU (which uses a different calling area).
To avoid pulling the per-CPU SVSM calling area pointers and related SEV state
into the startup code, refactor the page state change API so the SVSM calling
area virtual and physical addresses can be provided by the caller.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:10 +0000 (12:22 +0200)]
x86/sev: Share implementation of MSR-based page state change
Both the decompressor and the SEV startup code implement the exact same
sequence for invoking the MSR based communication protocol to effectuate
a page state change.
Before tweaking the internal APIs used in both versions, merge them and
share them so those tweaks are only needed in a single place.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:09 +0000 (12:22 +0200)]
x86/sev: Avoid global variable to store virtual address of SVSM area
The boottime SVSM calling area is used both by the startup code running from
a 1:1 mapping, and potentially later on running from the ordinary kernel
mapping.
This SVSM calling area is statically allocated, and so its physical address
doesn't change. However, its virtual address depends on the calling context
(1:1 mapping or kernel virtual mapping), and even though the variable that
holds the virtual address of this calling area gets updated from 1:1 address
to kernel address during the boot, it is hard to reason about why this is
guaranteed to be safe.
So instead, take the RIP-relative address of the boottime SVSM calling area
whenever its virtual address is required, and only use a global variable for
the physical address.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:08 +0000 (12:22 +0200)]
x86/sev: Move GHCB page based HV communication out of startup code
Both the decompressor and the core kernel implement an early #VC handler,
which only deals with CPUID instructions, and full featured one, which can
handle any #VC exception.
The former communicates with the hypervisor using the MSR based protocol,
whereas the latter uses a shared GHCB page, which is configured a bit later
during the boot, when the kernel runs from its ordinary virtual mapping,
rather than the 1:1 mapping that the startup code uses.
Accessing this shared GHCB page from the core kernel's startup code is
problematic, because it involves converting the GHCB address provided by the
caller to a physical address. In the startup code, virtual to physical address
translations are problematic, given that the virtual address might be a 1:1
mapped address, and such translations should therefore be avoided.
This means that exposing startup code dealing with the GHCB to callers that
execute from the ordinary kernel virtual mapping should be avoided too. So
move all GHCB page based communication out of the startup code, now that all
communication occurring before the kernel virtual mapping is up relies on the
MSR protocol only.
As an exception, add a flag representing the need to apply the coherency
fix in order to avoid exporting CPUID* helpers because of the code
running too early for the *cpu_has* infrastructure.
Neeraj Upadhyay [Thu, 28 Aug 2025 11:31:19 +0000 (17:01 +0530)]
x86/sev: Prevent SECURE_AVIC_CONTROL MSR interception for Secure AVIC guests
The SECURE_AVIC_CONTROL MSR holds the GPA of the guest APIC backing page and
bitfields to control enablement of Secure AVIC and whether the guest allows
NMIs to be injected by the hypervisor.
This MSR is populated by the guest and can be read by the guest to get the GPA
of the APIC backing page. The MSR can only be accessed in Secure AVIC mode.
Any attempt to access it when not in Secure AVIC mode results in #GP. So, the
hypervisor should not intercept it. A #VC exception will be generated
otherwise. If this occurs and Secure AVIC is enabled, terminate the guest
execution.
Neeraj Upadhyay [Thu, 28 Aug 2025 11:21:26 +0000 (16:51 +0530)]
x86/apic: Enable Secure AVIC in the control MSR
With all the pieces in place now, enable Secure AVIC in the Secure AVIC
Control MSR. Any access to x2APIC MSRs are emulated by the hypervisor
before Secure AVIC is enabled in the control MSR. Post Secure AVIC
enablement, all x2APIC MSR accesses (whether accelerated by AVIC
hardware or trapped as a #VC exception) operate on the vCPU's APIC
backing page.
Neeraj Upadhyay [Thu, 28 Aug 2025 11:20:08 +0000 (16:50 +0530)]
x86/apic: Add kexec support for Secure AVIC
Add a apic->teardown() callback to disable Secure AVIC before rebooting into
the new kernel. This ensures that the new kernel does not access the old APIC
backing page which was allocated by the previous kernel.
Such accesses can happen if there are any APIC accesses done during the guest
boot before Secure AVIC driver probe is done by the new kernel (as Secure AVIC
would have remained enabled in the Secure AVIC control MSR).
Neeraj Upadhyay [Thu, 28 Aug 2025 11:16:54 +0000 (16:46 +0530)]
x86/apic: Handle EOI writes for Secure AVIC guests
Secure AVIC accelerates the guest's EOI MSR writes for edge-triggered
interrupts.
For level-triggered interrupts, EOI MSR writes trigger a #VC exception with
an SVM_EXIT_AVIC_UNACCELERATED_ACCESS error code. To complete EOI handling,
the #VC exception handler would need to trigger a GHCB protocol MSR write
event to notify the hypervisor about completion of the level-triggered
interrupt. Hypervisor notification is required for cases like emulated
IO-APIC, to complete and clear interrupt in the IO-APIC's interrupt state.
However, #VC exception handling adds extra performance overhead for APIC
register writes. In addition, for Secure AVIC, some unaccelerated APIC
register MSR writes are trapped, whereas others are faulted.
This results in additional complexity in #VC exception handling for
unaccelerated APIC MSR accesses. So, directly do a GHCB protocol based APIC
EOI MSR write from apic->eoi() callback for level-triggered interrupts.
Use WRMSR for edge-triggered interrupts, so that hardware re-evaluates any
pending interrupt which can be delivered to the guest vCPU. For
level-triggered interrupts, re-evaluation happens on return from VMGEXIT
corresponding to the GHCB event for APIC EOI MSR write.
Neeraj Upadhyay [Thu, 28 Aug 2025 11:13:56 +0000 (16:43 +0530)]
x86/apic: Read and write LVT* APIC registers from HV for SAVIC guests
The Hypervisor needs information about the current state of the LVT registers
for device emulation and NMIs. So, forward reads and write of these registers
to the hypervisor for Secure AVIC enabled guests.
Now that support to send NMI IPI and support to inject NMI from the hypervisor
has been added, set V_NMI_ENABLE in the VINTR_CTRL field of the VMSA to enable
NMI for Secure AVIC guests.
[ bp: Zap useless brackets. ]
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828111315.208959-1-Neeraj.Upadhyay@amd.com
x86/sev: Initialize VGIF for secondary vCPUs for Secure AVIC
Virtual GIF (VGIF) provides masking capability for when virtual interrupts
(virtual maskable interrupts, virtual NMIs) can be taken by the guest vCPU.
The Secure AVIC hardware reads VGIF state from the vCPU's VMSA. So, set VGIF for
secondary CPUs (the configuration for the boot CPU is done by the hypervisor),
to unmask delivery of virtual interrupts to the vCPU.
Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828111141.208920-1-Neeraj.Upadhyay@amd.com
Neeraj Upadhyay [Thu, 28 Aug 2025 11:09:26 +0000 (16:39 +0530)]
x86/apic: Support LAPIC timer for Secure AVIC
Secure AVIC requires the LAPIC timer to be emulated by the hypervisor. KVM
already supports emulating the LAPIC timer using hrtimers. In order to emulate
it, APIC_LVTT, APIC_TMICT and APIC_TDCR register values need to be propagated
to the hypervisor for arming the timer. APIC_TMCCT register value has to be
read from the hypervisor, which is required for calibrating the APIC timer.
So, read/write all APIC timer registers from/to the hypervisor.
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828110926.208866-1-Neeraj.Upadhyay@amd.com
Neeraj Upadhyay [Thu, 28 Aug 2025 11:08:24 +0000 (16:38 +0530)]
x86/apic: Add support to send IPI for Secure AVIC
Secure AVIC hardware accelerates only Self-IPI, i.e. on WRMSR to APIC_SELF_IPI
and APIC_ICR (with destination shorthand equal to Self) registers, hardware
takes care of updating the APIC_IRR in the APIC backing page of the vCPU.
For other IPI types (cross-vCPU, broadcast IPIs), software needs to take care
of updating the APIC_IRR state of the target vCPUs and to ensure that the
target vCPUs notice the new pending interrupt.
Add new callbacks in the Secure AVIC driver for sending IPI requests. These
callbacks update the IRR in the target guest vCPU's APIC backing page. To
ensure that the remote vCPU notices the new pending interrupt, reuse the GHCB
MSR handling code in vc_handle_msr() to issue APIC_ICR MSR-write GHCB protocol
event to the hypervisor.
For Secure AVIC guests, on APIC_ICR write MSR exits, the hypervisor notifies
the target vCPU by either sending an AVIC doorbell (if target vCPU is running)
or by waking up the non-running target vCPU.
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828110824.208851-1-Neeraj.Upadhyay@amd.com
Neeraj Upadhyay [Thu, 28 Aug 2025 11:02:43 +0000 (16:32 +0530)]
x86/apic: Add an update_vector() callback for Secure AVIC
Add an update_vector() callback to set/clear the ALLOWED_IRR field in a vCPU's
APIC backing page for vectors which are emulated by the hypervisor.
The ALLOWED_IRR field indicates the interrupt vectors which the guest allows
the hypervisor to inject (typically for emulated devices). Interrupt vectors
used exclusively by the guest itself and the vectors which are not emulated by
the hypervisor, such as IPI vectors, should not be set by the guest in the
ALLOWED_IRR fields.
As clearing/setting state of a vector will also be used in subsequent commits
for other APIC registers (such as APIC_IRR update for sending IPI), add
a common update_vector() in the Secure AVIC driver.
[ bp: Massage commit message. ]
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828110255.208779-4-Neeraj.Upadhyay@amd.com
Neeraj Upadhyay [Thu, 28 Aug 2025 11:02:42 +0000 (16:32 +0530)]
x86/apic: Add update_vector() callback for APIC drivers
Add an update_vector() callback to allow APIC drivers to perform driver
specific operations on external vector allocation/teardown on a CPU. This
callback will be used by the Secure AVIC APIC driver to configure the vectors
which a guest vCPU allows the hypervisor to send to it.
As system vectors have fixed vector assignments and are not dynamically
allocated, add an apic_update_vector() public API to facilitate
update_vector() callback invocation for them. This will be used for Secure
AVIC enabled guests to allow the hypervisor to inject system vectors which are
emulated by the hypervisor such as APIC timer vector and
HYPERVISOR_CALLBACK_VECTOR.
While at it, cleanup line break in apic_update_irq_cfg().
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/20250828110255.208779-3-Neeraj.Upadhyay@amd.com
Neeraj Upadhyay [Thu, 28 Aug 2025 11:02:41 +0000 (16:32 +0530)]
x86/apic: Initialize APIC ID for Secure AVIC
Initialize the APIC ID in the Secure AVIC APIC backing page with the APIC_ID
MSR value read from the hypervisor. CPU topology evaluation later during boot
would catch and report any duplicate APIC ID for two CPUs.
Neeraj Upadhyay [Thu, 28 Aug 2025 11:02:40 +0000 (16:32 +0530)]
x86/apic: Populate .read()/.write() callbacks of Secure AVIC driver
Add read() and write() APIC callback functions to read and write the x2APIC
registers directly from the guest APIC backing page of a vCPU.
The x2APIC registers are mapped at an offset within the guest APIC backing
page which is the same as their x2APIC MMIO offset. Secure AVIC adds new
registers such as ALLOWED_IRRs (which are at 4-byte offset within the IRR
register offset range) and NMI_REQ to the APIC register space.
When Secure AVIC is enabled, accessing the guest's APIC registers through
RD/WRMSR results in a #VC exception (for non-accelerated register accesses)
with error code VMEXIT_AVIC_NOACCEL.
The #VC exception handler can read/write the x2APIC register in the guest APIC
backing page to complete the RDMSR/WRMSR. Since doing this would increase the
latency of accessing the x2APIC registers, instead of doing RDMSR/WRMSR based
register accesses and handling reads/writes in the #VC exception, directly
read/write the APIC registers from/to the guest APIC backing page of the vCPU
in read() and write() callbacks of the Secure AVIC APIC driver.
[ bp: Massage commit message. ]
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828110255.208779-1-Neeraj.Upadhyay@amd.com
With Secure AVIC, the APIC backing page is owned and managed by the guest.
Allocate and initialize APIC backing page for all guest CPUs.
The NPT entry for a vCPU's APIC backing page must always be present when the
vCPU is running in order for Secure AVIC to function. A VMEXIT_BUSY is
returned on VMRUN and the vCPU cannot be resumed otherwise.
To handle this, notify GPA of the vCPU's APIC backing page to the hypervisor
by using the SVM_VMGEXIT_SECURE_AVIC GHCB protocol event. Before executing
VMRUN, the hypervisor makes use of this information to make sure the APIC
backing page is mapped in the NPT.
[ bp: Massage commit message. ]
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828070334.208401-3-Neeraj.Upadhyay@amd.com
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:07 +0000 (12:22 +0200)]
x86/sev: Run RMPADJUST on SVSM calling area page to test VMPL
Determining the VMPL at which the kernel runs involves performing a RMPADJUST
operation on an arbitrary page of memory, and observing whether it succeeds.
The use of boot_ghcb_page in the core kernel in this case is completely
arbitrary, but results in the need to provide a PIC alias for it. So use
boot_svsm_ca_page instead, which already needs this alias for other reasons.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:06 +0000 (12:22 +0200)]
x86/sev: Use MSR protocol only for early SVSM PVALIDATE call
The early page state change API performs an SVSM call to PVALIDATE each page
when running under a SVSM, and this involves either a GHCB page based call or
a call based on the MSR protocol.
The GHCB page based variant involves VA to PA translation of the GHCB address,
and this is best avoided in the startup code, where virtual addresses are
ambiguous (1:1 or kernel virtual).
As this is the last remaining occurrence of svsm_perform_call_protocol() in
the startup code, switch to the MSR protocol exclusively in this particular
case, so that the GHCB based plumbing can be moved out of the startup code
entirely in a subsequent patch.
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:05 +0000 (12:22 +0200)]
x86/sev: Use MSR protocol for remapping SVSM calling area
As the preceding code comment already indicates, remapping the SVSM
calling area occurs long before the GHCB page is configured, and so
calling svsm_perform_call_protocol() is guaranteed to result in a call
to svsm_perform_msr_protocol().
So just call the latter directly. This allows most of the GHCB based API
infrastructure to be moved out of the startup code in a subsequent
patch.
Neeraj Upadhyay [Thu, 28 Aug 2025 07:03:17 +0000 (12:33 +0530)]
x86/apic: Add new driver for Secure AVIC
The Secure AVIC feature provides SEV-SNP guests hardware acceleration for
performance sensitive APIC accesses while securely managing the guest-owned
APIC state through the use of a private APIC backing page.
This helps prevent the hypervisor from generating unexpected interrupts for
a vCPU or otherwise violate architectural assumptions around the APIC
behavior.
Add a new x2APIC driver that will serve as the base of the Secure AVIC
support. It is initially the same as the x2APIC physical driver (without IPI
callbacks), but will be modified as features are implemented.
As the new driver does not implement Secure AVIC features yet, if the
hypervisor sets the Secure AVIC bit in SEV_STATUS, maintain the existing
behavior to enforce the guest termination.
[ bp: Massage commit message. ]
Co-developed-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Kishon Vijay Abraham I <kvijayab@amd.com> Signed-off-by: Neeraj Upadhyay <Neeraj.Upadhyay@amd.com> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Reviewed-by: Tianyu Lan <tiala@microsoft.com> Link: https://lore.kernel.org/20250828070334.208401-2-Neeraj.Upadhyay@amd.com
Ard Biesheuvel [Thu, 28 Aug 2025 10:22:04 +0000 (12:22 +0200)]
x86/sev: Separate MSR and GHCB based snp_cpuid() via a callback
There are two distinct callers of snp_cpuid(): the MSR protocol and the GHCB
page based interface.
The snp_cpuid() logic does not care about the distinction, which only matters
at a lower level. But the fact that it supports both interfaces means that the
GHCB page based logic is pulled into the early startup code where PA to VA
conversions are problematic, given that it runs from the 1:1 mapping of memory.
So keep snp_cpuid() itself in the startup code, but factor out the hypervisor
calls via a callback, so that the GHCB page handling can be moved out.
Code refactoring only - no functional change intended.
Thomas Gleixner [Thu, 31 Jul 2025 14:12:19 +0000 (16:12 +0200)]
x86/apic: Make the ISR clearing sane
apic_pending_intr_clear() is fundamentally voodoo programming. It's primary
purpose is to clear stale ISR bits in the local APIC, which would otherwise
lock the corresponding interrupt priority level.
The comments and the implementation claim falsely that after clearing the
stale ISR bits, eventually stale IRR bits would be turned into ISR bits and
can be cleared as well. That's just wishful thinking because:
1) If interrupts are disabled, the APIC does not propagate an IRR bit to
the ISR.
2) If interrupts are enabled, then the APIC propagates the IRR bit to the
ISR and raises the interrupt in the CPU, which means that code _cannot_
observe the ISR bit for any of those IRR bits.
Rename the function to reflect the purpose and make exactly _one_ attempt
to EOI the pending ISR bits and add comments why traversing the pending bit
map in low to high priority order is correct.
Instead of trying to "clear" IRR bits, simply print a warning message when
the IRR is not empty.
Linus Torvalds [Sun, 24 Aug 2025 14:13:05 +0000 (10:13 -0400)]
Merge tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a case where the events throttling logic operates on inactive
events
* tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Avoid undefined behavior from stopping/starting inactive events
Linus Torvalds [Sun, 24 Aug 2025 13:52:28 +0000 (09:52 -0400)]
Merge tag 'x86_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix the GDS mitigation detection on some machines after the recent
attack vectors conversion
- Filter out the invalid machine reset reason value -1 when running as
a guest as in such cases the reason why the machine was rebooted does
not make a whole lot of sense
- Init the resource control machinery on Hygon hw in order to avoid a
division by zero and to actually enable the feature on hw which
supports it
* tag 'x86_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Fix GDS mitigation selecting when mitigation is off
x86/CPU/AMD: Ignore invalid reset reason value
x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper
Linus Torvalds [Sun, 24 Aug 2025 13:43:50 +0000 (09:43 -0400)]
Merge tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules fix from Daniel Gomez:
"This includes a fix part of the KSPP (Kernel Self Protection Project)
to replace the deprecated and unsafe strcpy() calls in the kernel
parameter string handler and sysfs parameters for built-in modules.
Single commit, no functional changes"
* tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
params: Replace deprecated strcpy() with strscpy() and memcpy()
Linus Torvalds [Sat, 23 Aug 2025 15:27:31 +0000 (11:27 -0400)]
Merge tag 'char-misc-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc/iio fixes from Greg KH:
"Here are a small number of char/misc/iio and other driver fixes for
6.17-rc3. Included in here are:
- IIO driver bugfixes for reported issues
- bunch of comedi driver fixes
- most core bugfix
- fpga driver bugfix
- cdx driver bugfix
All of these have been in linux-next this week with no reported
issues"
* tag 'char-misc-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
most: core: Drop device reference after usage in get_channel()
comedi: Make insn_rw_emulate_bits() do insn->n samples
comedi: Fix use of uninitialized memory in do_insn_ioctl() and do_insnlist_ioctl()
comedi: pcl726: Prevent invalid irq number
cdx: Fix off-by-one error in cdx_rpmsg_probe()
fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
iio: pressure: bmp280: Use IS_ERR() in bmp280_common_probe()
iio: light: as73211: Ensure buffer holes are zeroed
iio: adc: rzg2l_adc: Set driver data before enabling runtime PM
iio: adc: rzg2l: Cleanup suspend/resume path
iio: adc: ad7380: fix missing max_conversion_rate_hz on adaq4381-4
iio: adc: bd79124: Add GPIOLIB dependency
iio: imu: inv_icm42600: change invalid data error to -EBUSY
iio: adc: ad7124: fix channel lookup in syscalib functions
iio: temperature: maxim_thermocouple: use DMA-safe buffer for spi_read()
iio: adc: ad7173: prevent scan if too many setups requested
iio: proximity: isl29501: fix buffered read on big-endian systems
iio: accel: sca3300: fix uninitialized iio scan data
Linus Torvalds [Sat, 23 Aug 2025 15:21:56 +0000 (11:21 -0400)]
Merge tag 'usb-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 6.17-rc3 to resolve a bunch
of reported issues. Included in here are:
- typec driver fixes
- dwc3 new device id
- dwc3 driver fixes
- new usb-storage driver quirks
- xhci driver fixes
- other tiny USB driver fixes to resolve bugs
All of these have been in linux-next this week with no reported issues"
* tag 'usb-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: xhci: fix host not responding after suspend and resume
usb: xhci: Fix slot_id resource race conflict
usb: typec: fusb302: Revert incorrect threaded irq fix
USB: core: Update kerneldoc for usb_hcd_giveback_urb()
usb: typec: maxim_contaminant: re-enable cc toggle if cc is open and port is clean
usb: typec: maxim_contaminant: disable low power mode when reading comparator values
usb: dwc3: Remove WARN_ON for device endpoint command timeouts
USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
usb: storage: realtek_cr: Use correct byte order for bcs->Residue
usb: chipidea: imx: improve usbmisc_imx7d_pullup()
kcov, usb: Don't disable interrupts in kcov_remote_start_usb_softirq()
usb: dwc3: pci: add support for the Intel Wildcat Lake
usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
usb: core: hcd: fix accessing unmapped memory in SINGLE_STEP_SET_FEATURE test
usb: renesas-xhci: Fix External ROM access timeouts
usb: gadget: tegra-xudc: fix PM use count underflow
usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
Linus Torvalds [Sat, 23 Aug 2025 14:11:34 +0000 (10:11 -0400)]
Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix rtla and latency tooling pkg-config errors
If libtraceevent and libtracefs is installed, but their corresponding
'.pc' files are not installed, it reports that the libraries are
missing and confuses the developer. Instead, report that the
pkg-config files are missing and should be installed.
- Fix overflow bug of the parser in trace_get_user()
trace_get_user() uses the parsing functions to parse the user space
strings. If the parser fails due to incorrect processing, it doesn't
terminate the buffer with a nul byte. Add a "failed" flag to the
parser that gets set when parsing fails and is used to know if the
buffer is fine to use or not.
- Remove a semicolon that was at an end of a comment line
- Fix register_ftrace_graph() to unregister the pm notifier on error
The register_ftrace_graph() registers a pm notifier but there's an
error path that can exit the function without unregistering it. Since
the function returns an error, it will never be unregistered.
- Allocate and copy ftrace hash for reader of ftrace filter files
When the set_ftrace_filter or set_ftrace_notrace files are open for
read, an iterator is created and sets its hash pointer to the
associated hash that represents filtering or notrace filtering to it.
The issue is that the hash it points to can change while the
iteration is happening. All the locking used to access the tracer's
hashes are released which means those hashes can change or even be
freed. Using the hash pointed to by the iterator can cause UAF bugs
or similar.
Have the read of these files allocate and copy the corresponding
hashes and use that as that will keep them the same while the
iterator is open. This also simplifies the code as opening it for
write already does an allocate and copy, and now that the read is
doing the same, there's no need to check which way it was opened on
the release of the file, and the iterator hash can always be freed.
- Fix function graph to copy args into temp storage
The output of the function graph tracer shows both the entry and the
exit of a function. When the exit is right after the entry, it
combines the two events into one with the output of "function();",
instead of showing:
function() {
}
In order to do this, the iterator descriptor that reads the events
includes storage that saves the entry event while it peaks at the
next event in the ring buffer. The peek can free the entry event so
the iterator must store the information to use it after the peek.
With the addition of function graph tracer recording the args, where
the args are a dynamic array in the entry event, the temp storage
does not save them. This causes the args to be corrupted or even
cause a read of unsafe memory.
Add space to save the args in the temp storage of the iterator.
- Fix race between ftrace_dump and reading trace_pipe
ftrace_dump() is used when a crash occurs where the ftrace buffer
will be printed to the console. But it can also be triggered by
sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe
it can cause a race in the ftrace_dump() where it checks if the
buffer has content, then it checks if the next event is available,
and then prints the output (regardless if the next event was
available or not). Reading trace_pipe at the same time can cause it
to not be available, and this triggers a WARN_ON in the print. Move
the printing into the check if the next event exists or not
* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Also allocate and copy hash for reading of filter files
ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
fgraph: Copy args in intermediate storage with entry
trace/fgraph: Fix the warning caused by missing unregister notifier
ring-buffer: Remove redundant semicolons
tracing: Limit access to parser->buffer when trace_get_user failed
rtla: Check pkg-config install
tools/latency-collector: Check pkg-config install
Linus Torvalds [Sat, 23 Aug 2025 13:04:32 +0000 (09:04 -0400)]
Merge tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Fix swapped handling of lru_gen and lru_gen_full debugfs files in
vmscan
- Fix debugfs mount options (uid, gid, mode) being silently ignored
- Fix leak of devres action in the unwind path of Devres::new()
- Documentation:
- Expand and fix documentation of (outdated) Device, DeviceContext
and generic driver infrastructure
- Fix C header link of faux device abstractions
- Clarify expected interaction with the security team
- Smooth text flow in the security bug reporting process
documentation
* tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
Documentation: smooth the text flow in the security bug reporting process
Documentation: clarify the expected collaboration with security bugs reporters
debugfs: fix mount options not being applied
rust: devres: fix leaking call to devm_add_action()
rust: faux: fix C header link
driver: rust: expand documentation for driver infrastructure
device: rust: expand documentation for Device
device: rust: expand documentation for DeviceContext
mm/vmscan: fix inverted polarity in lru_gen_seq_show()
Steven Rostedt [Fri, 22 Aug 2025 22:36:06 +0000 (18:36 -0400)]
ftrace: Also allocate and copy hash for reading of filter files
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds
the pointer to the global tracer hash to its iterator. Unlike the writer
that allocates a copy of the hash, the reader keeps the pointer to the
filter hashes. This is problematic because this pointer is static across
function calls that release the locks that can update the global tracer
hashes. This can cause UAF and similar bugs.
Allocate and copy the hash for reading the filter files like it is done
for the writers. This not only fixes UAF bugs, but also makes the code a
bit simpler as it doesn't have to differentiate when to free the
iterator's hash between writers and readers.
Linus Torvalds [Fri, 22 Aug 2025 22:16:54 +0000 (18:16 -0400)]
Merge tag 'drm-fixes-2025-08-23-1' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes. Looks like things did indeed get busier after rc2,
nothing seems too major, but stuff scattered all over the place,
amdgpu, xe, i915, hibmc, rust support code, and other small fixes.
rust:
- drm device memory layout and safety fixes
tests:
- Endianness fixes
gpuvm:
- docs warning fix
panic:
- fix division on 32-bit arm
i915:
- TypeC DP display Fixes
- Silence rpm wakeref asserts on GEN11_GU_MISC_IIR access
- Relocate compression repacking WA for JSL/EHL
hibmc:
- modesetting black screen fixes
- fix UAF on irq
- fix leak on i2c failure path
nouveau:
- memory leak fixes
- typos
rockchip:
- Kconfig fix
- register caching fix"
* tag 'drm-fixes-2025-08-23-1' of https://gitlab.freedesktop.org/drm/kernel: (49 commits)
drm/xe: Fix vm_bind_ioctl double free bug
drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_create
drm/xe: Assign ioctl xe file handler to vm in xe_vm_create
drm/i915/gt: Relocate compression repacking WA for JSL/EHL
drm/i915: silence rpm wakeref asserts on GEN11_GU_MISC_IIR access
drm/amd/display: Fix DP audio DTO1 clock source on DCE 6.
drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
drm/amd/display: Don't print errors for nonexistent connectors
drm/amd/display: Don't warn when missing DCE encoder caps
drm/amd/display: Fill display clock and vblank time in dce110_fill_display_configs
drm/amd/display: Find first CRTC and its line time in dce110_fill_display_configs
drm/amd/display: Adjust DCE 8-10 clock, don't overclock by 15%
drm/amd/display: Don't overclock DCE 6 by 15%
drm/amd/display: Add null pointer check in mod_hdcp_hdcp1_create_session()
drm/amd/display: Fix Xorg desktop unresponsive on Replay panel
drm/amd/display: Avoid a NULL pointer dereference
drm/amdgpu/swm14: Update power limit logic
drm/amd/display: Revert Add HPO encoder support to Replay
drm/i915/icl+/tc: Convert AUX powered WARN to a debug message
drm/i915/lnl+/tc: Use the cached max lane count value
...
In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.
Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.
Steven Rostedt [Wed, 20 Aug 2025 23:55:22 +0000 (19:55 -0400)]
fgraph: Copy args in intermediate storage with entry
The output of the function graph tracer has two ways to display its
entries. One way for leaf functions with no events recorded within them,
and the other is for functions with events recorded inside it. As function
graph has an entry and exit event, to simplify the output of leaf
functions it combines the two, where as non leaf functions are separate:
2) | invoke_rcu_core() {
2) | raise_softirq() {
2) 0.391 us | __raise_softirq_irqoff();
2) 1.191 us | }
2) 2.086 us | }
The __raise_softirq_irqoff() function above is really two events that were
merged into one. Otherwise it would have looked like:
2) | invoke_rcu_core() {
2) | raise_softirq() {
2) | __raise_softirq_irqoff() {
2) 0.391 us | }
2) 1.191 us | }
2) 2.086 us | }
In order to do this merge, the reading of the trace output file needs to
look at the next event before printing. But since the pointer to the event
is on the ring buffer, it needs to save the entry event before it looks at
the next event as the next event goes out of focus as soon as a new event
is read from the ring buffer. After it reads the next event, it will print
the entry event with either the '{' (non leaf) or ';' and timestamps (leaf).
The iterator used to read the trace file has storage for this event. The
problem happens when the function graph tracer has arguments attached to
the entry event as the entry now has a variable length "args" field. This
field only gets set when funcargs option is used. But the args are not
recorded in this temp data and garbage could be printed. The entry field
is copied via:
data->ent = *curr;
Where "curr" is the entry field. But this method only saves the non
variable length fields from the structure.
Add a helper structure to the iterator data that adds the max args size to
the data storage in the iterator. Then simply copy the entire entry into
this storage (with size protection).
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home Reported-by: Sasha Levin <sashal@kernel.org> Tested-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/ Fixes: ff5c9c576e75 ("ftrace: Add support for function argument to graph tracer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Fri, 22 Aug 2025 21:24:48 +0000 (17:24 -0400)]
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
"Two very minor fixes:
- Fix mismatched kvalloc()/kfree()
- Spelling fixes in documentation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd: Fix spelling errors in iommufd.rst
iommufd: viommu: free memory allocated by kvcalloc() using kvfree()
Bindig requires a node name matching ‘^ethernet@[0-9a-f]+$’. This patch
changes the clock name from “etop” to “ethernet”.
This fixes the following warning:
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): $nodename:0: 'etop@e180000' does not match '^ethernet@[0-9a-f]+$'
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
Fixes: dac0bad93741 ("dt-bindings: net: lantiq,etop-xway: Document Lantiq Xway ETOP bindings") Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Acked-by: Jakub Kicinski <kuba@kernel.org>
The upstream dts lacks the lantiq,{rx/tx}-burst-length property. Other
issues were also fixed:
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'interrupt-names' is a required property
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'lantiq,tx-burst-length' is a required property
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'lantiq,rx-burst-length' is a required property
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
Fixes: 14d4e308e0aa ("net: lantiq: configure the burst length in ethernet drivers") Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Acked-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Fri, 22 Aug 2025 14:16:47 +0000 (10:16 -0400)]
Merge tag 's390-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- When kernel lockdown is active userspace tools that rely on read
operations only are unnecessarily blocked. Fix that by avoiding ioctl
registration during lockdown
- Invalid NULL pointer accesses succeed due to the lowcore is always
mapped the identity mapping pinned to zero. To fix that never map the
first two pages of physical memory with identity mapping
- Fix invalid SCCB present check in the SCLP interrupt handler
- Update defconfigs
* tag 's390-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/hypfs: Enable limited access during lockdown
s390/hypfs: Avoid unnecessary ioctl registration in debugfs
s390/mm: Do not map lowcore with identity mapping
s390/sclp: Fix SCCB present check
s390/configs: Set HZ=1000
s390/configs: Update defconfigs
Linus Torvalds [Fri, 22 Aug 2025 13:50:17 +0000 (09:50 -0400)]
Merge tag 'for-linus-6.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two small cleanups which are both relevant only when running as a Xen
guest"
* tag 'for-linus-6.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
drivers/xen/xenbus: remove quirk for Xen 3.x
compiler: remove __ADDRESSABLE_ASM{_STR,}() again
Linus Torvalds [Fri, 22 Aug 2025 13:35:21 +0000 (09:35 -0400)]
Merge tag 'platform-drivers-x86-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- amd/hsmp:
- Ensure sock->metric_tbl_addr is non-NULL
- Register driver even if hwmon registration fails
- amd/pmc: Drop SMU F/W match for Cezanne
- dell-smbios-wmi: Separate "priority" from WMI device ID
- hp-wmi: mark Victus 16-r1xxx for Victus s fan and thermal profile
support
- intel-uncore-freq: Check write blocked for efficiency latency control
* tag 'platform-drivers-x86-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: hp-wmi: mark Victus 16-r1xxx for victus_s fan and thermal profile support
platform/x86/amd/hsmp: Ensure success even if hwmon registration fails
platform/x86/amd/hsmp: Ensure sock->metric_tbl_addr is non-NULL
platform/x86/intel-uncore-freq: Check write blocked for ELC
platform/x86/amd: pmc: Drop SMU F/W match for Cezanne
platform/x86: dell-smbios-wmi: Stop touching WMI device ID
Linus Torvalds [Fri, 22 Aug 2025 13:29:51 +0000 (09:29 -0400)]
Merge tag 'block-6.17-20250822' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"A set of fixes for block that should go into this tree. A bit larger
than what I usually have at this point in time, a lot of that is the
continued fixing of the lockdep annotation for queue freezing that we
recently added, which has highlighted a number of little issues here
and there. This contains:
- MD pull request via Yu:
- Add a legacy_async_del_gendisk mode, to prevent a user tools
regression. New user tools releases will not use such a mode,
the old release with a new kernel now will have warning about
deprecated behavior, and we prepare to remove this legacy mode
after about a year later
- The rename in kernel causing user tools build failure, revert
the rename in mdp_superblock_s
- Fix a regression that interrupted resync can be shown as
recover from mdstat or sysfs
- Improve file size detection for loop, particularly for networked
file systems, by using getattr to get the size rather than the
cached inode size.
- Hotplug CPU lock vs queue freeze fix
- Lockdep fix while updating the number of hardware queues
- Fix stacking for PI devices
- Silence bio_check_eod() for the known case of device removal where
the size is truncated to 0 sectors"
* tag 'block-6.17-20250822' of git://git.kernel.dk/linux:
block: avoid cpu_hotplug_lock depedency on freeze_lock
block: decrement block_rq_qos static key in rq_qos_del()
block: skip q->rq_qos check in rq_qos_done_bio()
blk-mq: fix lockdep warning in __blk_mq_update_nr_hw_queues
block: tone down bio_check_eod
loop: use vfs_getattr_nosec for accurate file size
loop: Consolidate size calculation logic into lo_calculate_size()
block: remove newlines from the warnings in blk_validate_integrity_limits
block: handle pi_tuple_size in queue_limits_stack_integrity
selftests: ublk: Use ARRAY_SIZE() macro to improve code
md: fix sync_action incorrect display during resync
md: add helper rdev_needs_recovery()
md: keep recovery_cp in mdp_superblock_s
md: add legacy_async_del_gendisk mode
Linus Torvalds [Fri, 22 Aug 2025 13:25:59 +0000 (09:25 -0400)]
Merge tag 'io_uring-6.17-20250822' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Just two small fixes - one that fixes inconsistent ->async_data vs
REQ_F_ASYNC_DATA handling in futex, and a followup that just ensures
that if other opcode handlers mess this up, it won't cause any issues"
* tag 'io_uring-6.17-20250822' of git://git.kernel.dk/linux:
io_uring: clear ->async_data as part of normal init
io_uring/futex: ensure io_futex_wait() cleans up properly on failure
Linus Torvalds [Fri, 22 Aug 2025 13:20:42 +0000 (09:20 -0400)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"All fixes in drivers. The largest diffstat in ufs is caused by the doc
update with the next being the qcom null pointer deref fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: ufs-qcom: Fix ESI null pointer dereference
scsi: ufs: core: Rename ufshcd_wait_for_doorbell_clr()
scsi: ufs: core: Fix the return value documentation
scsi: ufs: core: Remove WARN_ON_ONCE() call from ufshcd_uic_cmd_compl()
scsi: ufs: core: Fix IRQ lock inversion for the SCSI host lock
scsi: qla4xxx: Prevent a potential error pointer dereference
scsi: ufs: ufs-pci: Add support for Intel Wildcat Lake
scsi: fnic: Remove a useless struct mempool forward declaration
Linus Torvalds [Fri, 22 Aug 2025 13:17:49 +0000 (09:17 -0400)]
Merge tag 'mmc-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC host:
- sdhci_am654: Disable HS400 for AM62P SR1.0 and SR1.1
- sdhci-of-arasan: Ensure CD logic stabilization before power-up
- sdhci-pci-gli: Mask the replay timer timeout of AER for GL9763e
MEMSTICK:
- Fix deadlock by moving removing flag earlier"
* tag 'mmc-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci_am654: Disable HS400 for AM62P SR1.0 and SR1.1
memstick: Fix deadlock by moving removing flag earlier
mmc: sdhci-of-arasan: Ensure CD logic stabilization before power-up
mmc: sdhci-pci-gli: GL9763e: Mask the replay timer timeout of AER
mmc: sdhci-pci-gli: GL9763e: Rename the gli_set_gl9763e() for consistency
mmc: sdhci-pci-gli: Add a new function to simplify the code
Linus Torvalds [Fri, 22 Aug 2025 13:13:24 +0000 (09:13 -0400)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
- syzkaller found a WARN_ON in rxe due to poor lifecycle management of
resources linked to skbs
- Missing error path handling in erdma qp creation
- Initialize the qp number for the GSI QP in erdma
- Mismatching of DIP, SCC and QP numbers in hns
- SRQ bug fixes in bnxt_re
- Memory leak and possibly uninited memory in bnxt_re
- Remove retired irdma maintainer
- Fix kfree() for kvalloc() in ODP
- Fix memory leak in hns
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/hns: Fix dip entries leak on devices newer than hip09
RDMA/core: Free pfn_list with appropriate kvfree call
MAINTAINERS: Remove bouncing irdma maintainer
RDMA/bnxt_re: Fix to initialize the PBL array
RDMA/bnxt_re: Fix a possible memory leak in the driver
RDMA/bnxt_re: Fix to remove workload check in SRQ limit path
RDMA/bnxt_re: Fix to do SRQ armena by default
RDMA/hns: Fix querying wrong SCC context for DIP algorithm
RDMA/erdma: Fix unset QPN of GSI QP
RDMA/erdma: Fix ignored return value of init_kernel_qp
RDMA/rxe: Flush delayed SKBs while releasing RXE resources
Linus Torvalds [Fri, 22 Aug 2025 13:05:37 +0000 (09:05 -0400)]
Merge tag 'sound-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Only small fixes.
- ASoC Cirrus codec fixes
- A regression fix for the recent TAS2781 codec refactoring
- A fix for user-timer error handling
- Fixes for USB-audio descriptor validators
- Usual HD-audio and ASoC device-specific quirks"
* tag 'sound-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
ALSA: timer: fix ida_free call while not allocated
ASoC: cs35l56: Remove SoundWire Clock Divider workaround for CS35L63
ASoC: cs35l56: Handle new algorithms IDs for CS35L63
ASoC: cs35l56: Update Firmware Addresses for CS35L63 for production silicon
ALSA: hda: tas2781: Fix wrong reference of tasdevice_priv
ALSA: hda/realtek: Audio disappears on HP 15-fc000 after warm boot again
ALSA: hda/realtek: Fix headset mic on ASUS Zenbook 14
ASoC: codecs: ES9389: Modify the standby configuration
ALSA: usb-audio: Fix size validation in convert_chmap_v3()
ALSA: hda/tas2781: Add name prefix tas2781 for tas2781's dvc_tlv and amp_vol_tlv
ALSA: hda/realtek: Add support for HP EliteBook x360 830 G6 and EliteBook 830 G6
Linus Torvalds [Fri, 22 Aug 2025 12:54:34 +0000 (08:54 -0400)]
Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 10 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 17 of these
fixes are for MM.
As usual, singletons all over the place, apart from a three-patch
series of KHO followup work from Pasha which is actually also a bunch
of singletons"
* tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mremap: fix WARN with uffd that has remap events disabled
mm/damon/sysfs-schemes: put damos dests dir after removing its files
mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m
mm/damon/core: fix damos_commit_filter not changing allow
mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn
MAINTAINERS: mark MGLRU as maintained
mm: rust: add page.rs to MEMORY MANAGEMENT - RUST
iov_iter: iterate_folioq: fix handling of offset >= folio size
selftests/damon: fix selftests by installing drgn related script
.mailmap: add entry for Easwar Hariharan
selftests/mm: add test for invalid multi VMA operations
mm/mremap: catch invalid multi VMA moves earlier
mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area
mm/damon/core: fix commit_ops_filters by using correct nth function
tools/testing: add linux/args.h header and fix radix, VMA tests
mm/debug_vm_pgtable: clear page table entries at destroy_args()
squashfs: fix memory leak in squashfs_fill_super
kho: warn if KHO is disabled due to an error
kho: mm: don't allow deferred struct page with KHO
kho: init new_physxa->phys_bits to fix lockdep
XianLiang Huang [Wed, 20 Aug 2025 07:22:48 +0000 (15:22 +0800)]
iommu/riscv: prevent NULL deref in iova_to_phys
The riscv_iommu_pte_fetch() function returns either NULL for
unmapped/never-mapped iova, or a valid leaf pte pointer that
requires no further validation.
riscv_iommu_iova_to_phys() failed to handle NULL returns.
Prevent null pointer dereference in
riscv_iommu_iova_to_phys(), and remove the pte validation.
Robin Murphy [Thu, 14 Aug 2025 16:47:16 +0000 (17:47 +0100)]
iommu/virtio: Make instance lookup robust
Much like arm-smmu in commit 7d835134d4e1 ("iommu/arm-smmu: Make
instance lookup robust"), virtio-iommu appears to have the same issue
where iommu_device_register() makes the IOMMU instance visible to other
API callers (including itself) straight away, but internally the
instance isn't ready to recognise itself for viommu_probe_device() to
work correctly until after viommu_probe() has returned. This matters a
lot more now that bus_iommu_probe() has the DT/VIOT knowledge to probe
client devices the way that was always intended. Tweak the lookup and
initialisation in much the same way as for arm-smmu, to ensure that what
we register is functional and ready to go.
The arm_smmu_attach_commit() updates master->ats_enabled before calling
arm_smmu_remove_master_domain() that is supposed to clean up everything
in the old domain, including the old domain's nr_ats_masters. So, it is
supposed to use the old ats_enabled state of the device, not an updated
state.
This isn't a problem if switching between two domains where:
- old ats_enabled = false; new ats_enabled = false
- old ats_enabled = true; new ats_enabled = true
but can fail cases where:
- old ats_enabled = false; new ats_enabled = true
(old domain should keep the counter but incorrectly decreased it)
- old ats_enabled = true; new ats_enabled = false
(old domain needed to decrease the counter but incorrectly missed it)
Update master->ats_enabled after arm_smmu_remove_master_domain() to fix
this.
Fixes: 7497f4211f4f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS") Cc: stable@vger.kernel.org Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Link: https://lore.kernel.org/r/20250801030127.2006979-1-nicolinc@nvidia.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Linus Torvalds [Thu, 21 Aug 2025 20:31:27 +0000 (16:31 -0400)]
Merge tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix NULL de-ref in css_rstat_exit() which could happen after
allocation failure
- Fix a cpuset partition handling bug and a couple other misc issues
- Doc spelling fix
* tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fixed spelling mistakes in documentation
cgroup: avoid null de-ref in css_rstat_exit()
cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
cgroup/cpuset: Fix a partition error with CPU hotplug
cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
Linus Torvalds [Thu, 21 Aug 2025 20:28:00 +0000 (16:28 -0400)]
Merge tag 'spi-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in during the past week, a few
driver specifics plus one fix for the spi-mem core where we weren't
taking account of the frequency capabilities of the system when
determining if it can support an operation"
* tag 'spi-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: st: fix PM macros to use CONFIG_PM instead of CONFIG_PM_SLEEP
spi: spi-qpic-snand: fix calculating of ECC OOB regions' properties
spi: spi-fsl-lpspi: Clamp too high speed_hz
spi: spi-mem: add spi_mem_adjust_op_freq() in spi_mem_supports_op()
spi: spi-mem: Add missing kdoc argument
spi: spi-qpic-snand: use correct CW_PER_PAGE value for OOB write
Linus Torvalds [Thu, 21 Aug 2025 20:26:18 +0000 (16:26 -0400)]
Merge tag 'regulator-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of fairly minor device specific fixes that came in over the
past week or so, plus the addition of an actual maintainer for the
IR38060"
* tag 'regulator-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: tps65219: regulator: tps65219: Fix error codes in probe()
regulator: pca9450: Use devm_register_sys_off_handler
regulator: dt-bindings: infineon,ir38060: Add Guenter as maintainer from IBM
Linus Torvalds [Thu, 21 Aug 2025 20:07:15 +0000 (16:07 -0400)]
Merge tag 'acpi-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix three new issues in the ACPI APEI error injection code and
an ACPI platform firmware runtime update interface issue:
- Make ACPI APEI error injection check the version of the request
when mapping the EINJ parameter structure in the BIOS reserved
memory to prevent injecting errors based on an uninitialized
field (Tony Luck)
- Fix potential NULL dereference in __einj_error_inject() that may
occur when memory allocation fails (Charles Han)
- Remove the __exit annotation from einj_remove(), so it can be
called on errors during faux device probe (Uwe Kleine-König)
- Use a security-version-number check instead of a runtime version
check during ACPI platform firmware runtime driver updates to
prevent those updates from failing due to false-positive driver
version check failures (Chen Yu)"
* tag 'acpi-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: pfr_update: Fix the driver update version check
ACPI: APEI: EINJ: Fix resource leak by remove callback in .exit.text
ACPI: APEI: EINJ: fix potential NULL dereference in __einj_error_inject()
ACPI: APEI: EINJ: Check if user asked for EINJV2 injection
Linus Torvalds [Thu, 21 Aug 2025 20:04:58 +0000 (16:04 -0400)]
Merge tag 'pm-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a cpuidle menu governor issue and two issues in the cpupower
utility:
- Prevent the menu cpuidle governor from selecting idle states with
exit latency exceeding the current PM QoS limit after stopping the
scheduler tick (Rafael Wysocki)
- Make the set subcommand's -t option in the cpupower utility work as
documented and allow it to control the CPU boost feature of cpufreq
beyond x86 (Shinji Nomoto)"
* tag 'pm-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: governors: menu: Avoid selecting states with too much latency
cpupower: Allow control of boost feature on non-x86 based systems with boost support.
cpupower: Fix a bug where the -t option of the set subcommand was not working.
Linus Torvalds [Thu, 21 Aug 2025 20:02:35 +0000 (16:02 -0400)]
Merge tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix a subtle bug during SCX enabling where a dead task skips init
but doesn't skip sched class switch leading to invalid task state
transition warning
- Cosmetic fix in selftests
* tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Remove duplicate sched.h header
sched/ext: Fix invalid task state transitions on class switch
Jens Axboe [Thu, 21 Aug 2025 19:24:57 +0000 (13:24 -0600)]
io_uring: clear ->async_data as part of normal init
Opcode handlers like POLL_ADD will use ->async_data as the pointer for
double poll handling, which is a bit different than the usual case
where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more
proactive in handling ->async_data, and clear it to NULL as part of
regular init. Init is touching that cacheline anyway, so might as well
clear it.
Jens Axboe [Thu, 21 Aug 2025 19:23:21 +0000 (13:23 -0600)]
io_uring/futex: ensure io_futex_wait() cleans up properly on failure
The io_futex_data is allocated upfront and assigned to the io_kiocb
async_data field, but the request isn't marked with REQ_F_ASYNC_DATA
at that point. Those two should always go together, as the flag tells
io_uring whether the field is valid or not.
Additionally, on failure cleanup, the futex handler frees the data but
does not clear ->async_data. Clear the data and the flag in the error
path as well.
Thanks to Trend Micro Zero Day Initiative and particularly ReDress for
reporting this.
Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the argument check during an array bind fails, the bind_ops are freed
twice as seen below. Fix this by setting bind_ops to NULL after freeing.
==================================================================
BUG: KASAN: double-free in xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
Free of addr ffff88813bb9b800 by task xe_vm/14198
Piotr Piórkowski [Mon, 11 Aug 2025 10:43:58 +0000 (12:43 +0200)]
drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_create
Currently, ASID assignment for user VMs and page-table BO accounting for
client memory tracking are performed in xe_vm_create_ioctl.
To consolidate VM object initialization, move this logic to
xe_vm_create.
v2:
- removed unnecessary duplicate BO tracking code
- using the local variable xef to verify whether the VM is being created
by userspace
Fixes: 658a1c8e0a66 ("drm/xe: Assign ioctl xe file handler to vm in xe_vm_create") Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250811104358.2064150-3-piotr.piorkowski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 30e0c3f43a414616e0b6ca76cf7f7b2cd387e1d4) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Rodrigo: Added fixes tag]
Linus Torvalds [Thu, 21 Aug 2025 17:51:15 +0000 (13:51 -0400)]
Merge tag 'net-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth.
Current release - fix to a fix:
- usb: asix_devices: fix PHY address mask in MDIO bus initialization
Current release - regressions:
- Bluetooth: fixes for the split between BIS_LINK and PA_LINK
- Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN
flag", breaks compatibility with some existing device tree blobs
- dsa: b53: fix reserved register access in b53_fdb_dump()
Current release - new code bugs:
- sched: dualpi2: run probability update timer in BH to avoid
deadlock
- eth: libwx: fix the size in RSS hash key population
- pse-pd: pd692x0: improve power budget error paths and handling
Previous releases - regressions:
- tls: fix handling of zero-length records on the rx_list
- hsr: reject HSR frame if skb can't hold tag
- bonding: fix negotiation flapping in 802.3ad passive mode
Previous releases - always broken:
- gso: forbid IPv6 TSO with extensions on devices with only IPV6_CSUM
- sched: make cake_enqueue return NET_XMIT_CN when past buffer_limit,
avoid packet drops with low buffer_limit, remove unnecessary WARN()
- sched: fix backlog accounting after modifying config of a qdisc in
the middle of the hierarchy
- mptcp: improve handling of skb extension allocation failures
- eth: mlx5:
- fixes for the "HW Steering" flow management method
- fixes for QoS and device buffer management"
* tag 'net-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
netfilter: nf_reject: don't leak dst refcount for loopback packets
net/mlx5e: Preserve shared buffer capacity during headroom updates
net/mlx5e: Query FW for buffer ownership
net/mlx5: Restore missing scheduling node cleanup on vport enable failure
net/mlx5: Fix QoS reference leak in vport enable error path
net/mlx5: Destroy vport QoS element when no configuration remains
net/mlx5e: Preserve tc-bw during parent changes
net/mlx5: Remove default QoS group and attach vports directly to root TSAR
net/mlx5: Base ECVF devlink port attrs from 0
net: pse-pd: pd692x0: Skip power budget configuration when undefined
net: pse-pd: pd692x0: Fix power budget leak in manager setup error path
Octeontx2-af: Skip overlap check for SPI field
selftests: tls: add tests for zero-length records
tls: fix handling of zero-length records on the rx_list
net: airoha: ppe: Do not invalid PPE entries in case of SW hash collision
selftests: bonding: add test for passive LACP mode
bonding: send LACPDUs periodically in passive mode after receiving partner's LACPDU
bonding: update LACP activity flag after setting lacp_active
Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flag"
ipv6: sr: Fix MAC comparison to be constant-time
...
This is because blamed commit forgot about loopback packets.
Such packets already have a dst_entry attached, even at PRE_ROUTING stage.
Instead of checking hook just check if the skb already has a route
attached to it.
Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When kernel lockdown is active, debugfs_locked_down() blocks access to
hypfs files that register ioctl callbacks, even if the ioctl interface
is not required for a function. This unnecessarily breaks userspace
tools that only rely on read operations.
Resolve this by registering a minimal set of file operations during
lockdown, avoiding ioctl registration and preserving access for affected
tooling.
Note that this change restores hypfs functionality when lockdown is
active from early boot (e.g. via lockdown=integrity kernel parameter),
but does not apply to scenarios where lockdown is enabled dynamically
while Linux is running.
Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390/hypfs: Avoid unnecessary ioctl registration in debugfs
Currently, hypfs registers ioctl callbacks for all debugfs files,
despite only one file requiring them. This leads to unintended exposure
of unused interfaces to user space and can trigger side effects such as
restricted access when kernel lockdown is enabled.
Restrict ioctl registration to only those files that implement ioctl
functionality to avoid interface clutter and unnecessary access
restrictions.
Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Takashi Iwai [Thu, 21 Aug 2025 15:08:34 +0000 (17:08 +0200)]
ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
The entry of the validators table for UAC3 feature unit is defined
with a wrong sub-type UAC_FEATURE (= 0x06) while it should have been
UAC3_FEATURE (= 0x07). This patch corrects the entry value.
Armen Ratner [Wed, 20 Aug 2025 13:32:09 +0000 (16:32 +0300)]
net/mlx5e: Preserve shared buffer capacity during headroom updates
When port buffer headroom changes, port_update_shared_buffer()
recalculates the shared buffer size and splits it in a 3:1 ratio
(lossy:lossless) - Currently, the calculation is:
lossless = shared / 4;
lossy = (shared / 4) * 3;
Meaning, the calculation dropped the remainder of shared % 4 due to
integer division, unintentionally reducing the total shared buffer
by up to three cells on each update. Over time, this could shrink
the buffer below usable size.
Fix it by changing the calculation to:
lossless = shared / 4;
lossy = shared - lossless;
This retains all buffer cells while still approximating the
intended 3:1 split, preventing capacity loss over time.
While at it, perform headroom calculations in units of cells rather than
in bytes for more accurate calculations avoiding extra divisions.