Paolo Bonzini [Wed, 27 Aug 2025 08:37:40 +0000 (04:37 -0400)]
Merge branch 'guest-memfd-mmap' into HEAD
Add support for host userspace mapping of guest_memfd-backed memory for VM
types that do NOT use support KVM_MEMORY_ATTRIBUTE_PRIVATE (which isn't
precisely the same thing as CoCo VMs, since x86's SEV-MEM and SEV-ES have
no way to detect private vs. shared).
mmap() support paves the way for several evolving KVM use cases:
* Allows VMMs like Firecracker to run guests entirely backed by
guest_memfd [1]. This provides a unified memory management model for
both confidential and non-confidential guests, simplifying VMM design.
* Enhanced Security via direct map removal: When combined with Patrick's
series for direct map removal [2], this provides additional hardening
against Spectre-like transient execution attacks by eliminating the
need for host kernel direct maps of guest memory.
* Lays the groundwork for *restricted* mmap() support for guest_memfd-backed
memory on CoCo platforms [3] that permit in-place sharing of guest memory
with the host.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory
Add a guest_memfd testcase to verify that a vCPU can fault-in guest_memfd
memory that supports mmap(), but that is not currently mapped into host
userspace and/or has a userspace address (in the memslot) that points at
something other than the target guest_memfd range. Mapping guest_memfd
memory into the guest is supposed to operate completely independently from
any userspace mappings.
Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-25-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: guest_memfd mmap() test when mmap is supported
Expand the guest_memfd selftests to comprehensively test host userspace
mmap functionality for guest_memfd-backed memory when supported by the
VM type.
Introduce new test cases to verify the following:
* Successful mmap operations: Ensure that MAP_SHARED mappings succeed
when guest_memfd mmap is enabled.
* Data integrity: Validate that data written to the mmap'd region is
correctly persistent and readable.
* fallocate interaction: Test that fallocate(FALLOC_FL_PUNCH_HOLE)
correctly zeros out mapped pages.
* Out-of-bounds access: Verify that accessing memory beyond the
guest_memfd's size correctly triggers a SIGBUS signal.
* Unsupported mmap: Confirm that mmap attempts fail as expected when
guest_memfd mmap support is not enabled for the specific guest_memfd
instance or VM type.
* Flag validity: Introduce test_vm_type_gmem_flag_validity() to
systematically test that only allowed guest_memfd creation flags are
accepted for different VM types (e.g., GUEST_MEMFD_FLAG_MMAP for
default VMs, no flags for CoCo VMs).
The existing tests for guest_memfd creation (multiple instances, invalid
sizes), file read/write, file size, and invalid punch hole operations
are integrated into the new test_with_type() framework to allow testing
across different VM types.
Cc: James Houghton <jthoughton@google.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Shivank Garg <shivankg@amd.com> Co-developed-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-24-seanjc@google.com>
[Fix default vm_types to use BIT() - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: Do not use hardcoded page sizes in guest_memfd test
Update the guest_memfd_test selftest to use getpagesize() instead of
hardcoded 4KB page size values.
Using hardcoded page sizes can cause test failures on architectures or
systems configured with larger page sizes, such as arm64 with 64KB
pages. By dynamically querying the system's page size, the test becomes
more portable and robust across different environments.
Additionally, build the guest_memfd_test selftest for arm64.
Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Suggested-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Allow and advertise support for host mmap() on guest_memfd files
Now that all the x86 and arm64 plumbing for mmap() on guest_memfd is in
place, allow userspace to set GUEST_MEMFD_FLAG_MMAP and advertise support
via a new capability, KVM_CAP_GUEST_MEMFD_MMAP.
The availability of this capability is determined per architecture, and
its enablement for a specific guest_memfd instance is controlled by the
GUEST_MEMFD_FLAG_MMAP flag at creation time.
Update the KVM API documentation to detail the KVM_CAP_GUEST_MEMFD_MMAP
capability, the associated GUEST_MEMFD_FLAG_MMAP, and provide essential
information regarding support for mmap in guest_memfd.
Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-22-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: arm64: Enable support for guest_memfd backed memory
Now that the infrastructure is in place, enable guest_memfd for arm64.
* Select CONFIG_KVM_GUEST_MEMFD in KVM/arm64 Kconfig.
* Enforce KVM_MEMSLOT_GMEM_ONLY for guest_memfd on arm64: Ensure that
guest_memfd-backed memory slots on arm64 are only supported if they
are intended for shared memory use cases (i.e.,
kvm_memslot_is_gmem_only() is true). This design reflects the current
arm64 KVM ecosystem where guest_memfd is primarily being introduced
for VMs that support shared memory.
Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-21-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: arm64: nv: Handle VNCR_EL2-triggered faults backed by guest_memfd
Handle faults for memslots backed by guest_memfd in arm64 nested
virtualization triggered by VNCR_EL2.
* Introduce is_gmem output parameter to kvm_translate_vncr(), indicating
whether the faulted memory slot is backed by guest_memfd.
* Dispatch faults backed by guest_memfd to kvm_gmem_get_pfn().
* Update kvm_handle_vncr_abort() to handle potential guest_memfd errors.
Some of the guest_memfd errors need to be handled by userspace instead
of attempting to (implicitly) retry by returning to the guest.
Suggested-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-20-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add arm64 architecture support for handling guest page faults on memory
slots backed by guest_memfd.
This change introduces a new function, gmem_abort(), which encapsulates
the fault handling logic specific to guest_memfd-backed memory. The
kvm_handle_guest_abort() entry point is updated to dispatch to
gmem_abort() when a fault occurs on a guest_memfd-backed memory slot (as
determined by kvm_slot_has_gmem()).
Until guest_memfd gains support for huge pages, the fault granule for
these memory regions is restricted to PAGE_SIZE.
Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: James Houghton <jthoughton@google.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-19-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Refactor user_mem_abort() to improve code clarity and simplify
assumptions within the function.
Key changes include:
* Immediately set force_pte to true at the beginning of the function if
logging_active is true. This simplifies the flow and makes the
condition for forcing a PTE more explicit.
* Remove the misleading comment stating that logging_active is
guaranteed to never be true for VM_PFNMAP memslots, as this assertion
is not entirely correct.
* Extract reusable code blocks into new helper functions:
* prepare_mmu_memcache(): Encapsulates the logic for preparing and
topping up the MMU page cache.
* adjust_nested_fault_perms(): Isolates the adjustments to shadow S2
permissions and the encoding of nested translation levels.
* Update min(a, (long)b) to min_t(long, a, b) for better type safety and
consistency.
* Perform other minor tidying up of the code.
These changes primarily aim to simplify user_mem_abort() and make its
logic easier to understand and maintain, setting the stage for future
modifications.
Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Marc Zyngier <maz@kernel.org> Reviewed-by: Tao Chan <chentao@kylinos.cn> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-18-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory
Update the KVM MMU fault handler to service guest page faults
for memory slots backed by guest_memfd with mmap support. For such
slots, the MMU must always fault in pages directly from guest_memfd,
bypassing the host's userspace_addr.
This ensures that guest_memfd-backed memory is always handled through
the guest_memfd specific faulting path, regardless of whether it's for
private or non-private (shared) use cases.
Additionally, rename kvm_mmu_faultin_pfn_private() to
kvm_mmu_faultin_pfn_gmem(), as this function is now used to fault in
pages from guest_memfd for both private and non-private memory,
accommodating the new use cases.
Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Ackerley Tng <ackerleytng@google.com> Co-developed-by: Fuad Tabba <tabba@google.com> Signed-off-by: Fuad Tabba <tabba@google.com>
[sean: drop the helper] Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250729225455.670324-17-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings
Rework kvm_mmu_max_mapping_level() to consult guest_memfd for all mappings,
not just private mappings, so that hugepage support plays nice with the
upcoming support for backing non-private memory with guest_memfd.
In addition to getting the max order from guest_memfd for gmem-only
memslots, update TDX's hook to effectively ignore shared mappings, as TDX's
restrictions on page size only apply to Secure EPT mappings. Do nothing
for SNP, as RMP restrictions apply to both private and shared memory.
Suggested-by: Ackerley Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Message-ID: <20250729225455.670324-16-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages
Rework kvm_mmu_max_mapping_level() to provide the plumbing to consult
guest_memfd (and relevant vendor code) when recovering hugepages, e.g.
after disabling live migration. The flaw has existed since guest_memfd was
originally added, but has gone unnoticed due to lack of guest_memfd support
for hugepages or dirty logging.
Don't actually call into guest_memfd at this time, as it's unclear as to
what the API should be. Ideally, KVM would simply use kvm_gmem_get_pfn(),
but invoking kvm_gmem_get_pfn() would lead to sleeping in atomic context
if guest_memfd needed to allocate memory (mmu_lock is held). Luckily,
the path isn't actually reachable, so just add a TODO and WARN to ensure
the functionality is added alongisde guest_memfd hugepage support, and
punt the guest_memfd API design question to the future.
Note, calling kvm_mem_is_private() in the non-fault path is safe, so long
as mmu_lock is held, as hugepage recovery operates on shadow-present SPTEs,
i.e. calling kvm_mmu_max_mapping_level() with @fault=NULL is mutually
exclusive with kvm_vm_set_mem_attributes() changing the PRIVATE attribute
of the gfn.
Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-15-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: guest_memfd: Track guest_memfd mmap support in memslot
Add a new internal flag, KVM_MEMSLOT_GMEM_ONLY, to the top half of
memslot->flags (which makes it strictly for KVM's internal use). This
flag tracks when a guest_memfd-backed memory slot supports host
userspace mmap operations, which implies that all memory, not just
private memory for CoCo VMs, is consumed through guest_memfd: "gmem
only".
This optimization avoids repeatedly checking the underlying guest_memfd
file for mmap support, which would otherwise require taking and
releasing a reference on the file for each check. By caching this
information directly in the memslot, we reduce overhead and simplify the
logic involved in handling guest_memfd-backed pages for host mappings.
Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Suggested-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-12-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: guest_memfd: Add plumbing to host to map guest_memfd pages
Introduce the core infrastructure to enable host userspace to mmap()
guest_memfd-backed memory. This is needed for several evolving KVM use
cases:
* Non-CoCo VM backing: Allows VMMs like Firecracker to run guests
entirely backed by guest_memfd, even for non-CoCo VMs [1]. This
provides a unified memory management model and simplifies guest memory
handling.
* Direct map removal for enhanced security: This is an important step
for direct map removal of guest memory [2]. By allowing host userspace
to fault in guest_memfd pages directly, we can avoid maintaining host
kernel direct maps of guest memory. This provides additional hardening
against Spectre-like transient execution attacks by removing a
potential attack surface within the kernel.
* Future guest_memfd features: This also lays the groundwork for future
enhancements to guest_memfd, such as supporting huge pages and
enabling in-place sharing of guest memory with the host for CoCo
platforms that permit it [3].
Enable the basic mmap and fault handling logic within guest_memfd, but
hold off on allow userspace to actually do mmap() until the architecture
support is also in place.
KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds
Enable KVM_GUEST_MEMFD for all KVM x86 64-bit builds, i.e. for "default"
VM types when running on 64-bit KVM. This will allow using guest_memfd
to back non-private memory for all VM shapes, by supporting mmap() on
guest_memfd.
Opportunistically clean up various conditionals that become tautologies
once x86 selects KVM_GUEST_MEMFD more broadly. Specifically, because
SW protected VMs, SEV, and TDX are all 64-bit only, private memory no
longer needs to take explicit dependencies on KVM_GUEST_MEMFD, because
it is effectively a prerequisite.
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-10-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem()
Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() to improve
clarity and accurately reflect its purpose.
The function kvm_slot_can_be_private() was previously used to check if a
given kvm_memory_slot is backed by guest_memfd. However, its name
implied that the memory in such a slot was exclusively "private".
As guest_memfd support expands to include non-private memory (e.g.,
shared host mappings), it's important to remove this association. The
new name, kvm_slot_has_gmem(), states that the slot is backed by
guest_memfd without making assumptions about the memory's privacy
attributes.
Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE
The original name was vague regarding its functionality. This Kconfig
option specifically enables and gates the kvm_gmem_populate() function,
which is responsible for populating a GPA range with guest data.
The new name, HAVE_KVM_ARCH_GMEM_POPULATE, describes the purpose of the
option: to enable arch-specific guest_memfd population mechanisms. It
also follows the same pattern as the other HAVE_KVM_ARCH_* configuration
options.
This improves clarity for developers and ensures the name accurately
reflects the functionality it controls, especially as guest_memfd
support expands beyond purely "private" memory scenarios.
Temporarily keep KVM_GENERIC_PRIVATE_MEM as an x86-only config so as to
minimize churn, and to hopefully make it easier to see what features
require HAVE_KVM_ARCH_GMEM_POPULATE. On that note, omit GMEM_POPULATE
for KVM_X86_SW_PROTECTED_VM, as regular ol' memset() suffices for
software-protected VMs.
As for KVM_GENERIC_PRIVATE_MEM, a future change will select KVM_GUEST_MEMFD
for all 64-bit KVM builds, at which point the intermediate config will
become obsolete and can/will be dropped.
Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Select KVM_GENERIC_PRIVATE_MEM and KVM_GENERIC_MEMORY_ATTRIBUTES directly
from KVM_INTEL_TDX, i.e. if and only if TDX support is fully enabled in
KVM. There is no need to enable KVM's private memory support just because
the core kernel's INTEL_TDX_HOST is enabled.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Select KVM_GENERIC_PRIVATE_MEM directly from KVM_SW_PROTECTED_VM
Now that KVM_SW_PROTECTED_VM doesn't have a hidden dependency on KVM_X86,
select KVM_GENERIC_PRIVATE_MEM from within KVM_SW_PROTECTED_VM instead of
conditionally selecting it from KVM_X86.
No functional change intended.
Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Have all vendor neutral sub-configs depend on KVM_X86, not just KVM
Make all vendor neutral KVM x86 configs depend on KVM_X86, not just KVM,
i.e. gate them on at least one vendor module being enabled and thus on
kvm.ko actually being built. Depending on just KVM allows the user to
select the configs even though they won't actually take effect, and more
importantly, makes it all too easy to create unmet dependencies. E.g.
KVM_GENERIC_PRIVATE_MEM can't be selected by KVM_SW_PROTECTED_VM, because
the KVM_GENERIC_MMU_NOTIFIER dependency is select by KVM_X86.
Hiding all sub-configs when neither KVM_AMD nor KVM_INTEL is selected also
helps communicate to the user that nothing "interesting" is going on, e.g.
--- Virtualization
<M> Kernel-based Virtual Machine (KVM) support
< > KVM for Intel (and compatible) processors support
< > KVM for AMD processors support
Fixes: ea4290d77bda ("KVM: x86: leave kvm.ko out of the build if no vendor module is requested") Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Fuad Tabba <tabba@google.com>
Message-ID: <20250729225455.670324-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD
Rename the Kconfig option CONFIG_KVM_PRIVATE_MEM to
CONFIG_KVM_GUEST_MEMFD. The original name implied that the feature only
supported "private" memory. However, CONFIG_KVM_PRIVATE_MEM enables
guest_memfd in general, which is not exclusively for private memory.
Subsequent patches in this series will add guest_memfd support for
non-CoCo VMs, whose memory is not private.
Renaming the Kconfig option to CONFIG_KVM_GUEST_MEMFD more accurately
reflects its broader scope as the main Kconfig option for all
guest_memfd-backed memory. This provides clearer semantics for the
option and avoids confusion as new features are introduced.
Reviewed-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Shivank Garg <shivankg@amd.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Xiaoyao Li <xiaoyao.li@intel.com> Co-developed-by: David Hildenbrand <david@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Fuad Tabba <tabba@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-ID: <20250729225455.670324-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Wed, 27 Aug 2025 08:18:01 +0000 (04:18 -0400)]
Merge tag 'kvm-x86-fixes-6.17-rc7' of https://github.com/kvm-x86/linux into HEAD
* Use array_index_nospec() to sanitize the target vCPU ID when handling PV
IPIs and yields as the ID is guest-controlled.
* Drop a superfluous cpumask_empty() check when reclaiming SEV memory, as
the common case, by far, is that at least one CPU will have entered the
VM, and wbnoinvd_on_cpus_mask() will naturally handle the rare case where
the set of have_run_cpus is empty.
* Rename the is_signed_type() macro in kselftest_harness.h to is_signed_var()
to fix a collision with linux/overflow.h. The collision generates compiler
warnings due to the two macros having different implementations.
Linus Torvalds [Sun, 24 Aug 2025 14:13:05 +0000 (10:13 -0400)]
Merge tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fix from Borislav Petkov:
- Fix a case where the events throttling logic operates on inactive
events
* tag 'perf_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf: Avoid undefined behavior from stopping/starting inactive events
Linus Torvalds [Sun, 24 Aug 2025 13:52:28 +0000 (09:52 -0400)]
Merge tag 'x86_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Fix the GDS mitigation detection on some machines after the recent
attack vectors conversion
- Filter out the invalid machine reset reason value -1 when running as
a guest as in such cases the reason why the machine was rebooted does
not make a whole lot of sense
- Init the resource control machinery on Hygon hw in order to avoid a
division by zero and to actually enable the feature on hw which
supports it
* tag 'x86_urgent_for_v6.17_rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/bugs: Fix GDS mitigation selecting when mitigation is off
x86/CPU/AMD: Ignore invalid reset reason value
x86/cpu/hygon: Add missing resctrl_cpu_detect() in bsp_init helper
Linus Torvalds [Sun, 24 Aug 2025 13:43:50 +0000 (09:43 -0400)]
Merge tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux
Pull modules fix from Daniel Gomez:
"This includes a fix part of the KSPP (Kernel Self Protection Project)
to replace the deprecated and unsafe strcpy() calls in the kernel
parameter string handler and sysfs parameters for built-in modules.
Single commit, no functional changes"
* tag 'modules-6.17-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/modules/linux:
params: Replace deprecated strcpy() with strscpy() and memcpy()
Linus Torvalds [Sat, 23 Aug 2025 15:27:31 +0000 (11:27 -0400)]
Merge tag 'char-misc-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc/iio fixes from Greg KH:
"Here are a small number of char/misc/iio and other driver fixes for
6.17-rc3. Included in here are:
- IIO driver bugfixes for reported issues
- bunch of comedi driver fixes
- most core bugfix
- fpga driver bugfix
- cdx driver bugfix
All of these have been in linux-next this week with no reported
issues"
* tag 'char-misc-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
most: core: Drop device reference after usage in get_channel()
comedi: Make insn_rw_emulate_bits() do insn->n samples
comedi: Fix use of uninitialized memory in do_insn_ioctl() and do_insnlist_ioctl()
comedi: pcl726: Prevent invalid irq number
cdx: Fix off-by-one error in cdx_rpmsg_probe()
fpga: zynq_fpga: Fix the wrong usage of dma_map_sgtable()
iio: pressure: bmp280: Use IS_ERR() in bmp280_common_probe()
iio: light: as73211: Ensure buffer holes are zeroed
iio: adc: rzg2l_adc: Set driver data before enabling runtime PM
iio: adc: rzg2l: Cleanup suspend/resume path
iio: adc: ad7380: fix missing max_conversion_rate_hz on adaq4381-4
iio: adc: bd79124: Add GPIOLIB dependency
iio: imu: inv_icm42600: change invalid data error to -EBUSY
iio: adc: ad7124: fix channel lookup in syscalib functions
iio: temperature: maxim_thermocouple: use DMA-safe buffer for spi_read()
iio: adc: ad7173: prevent scan if too many setups requested
iio: proximity: isl29501: fix buffered read on big-endian systems
iio: accel: sca3300: fix uninitialized iio scan data
Linus Torvalds [Sat, 23 Aug 2025 15:21:56 +0000 (11:21 -0400)]
Merge tag 'usb-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 6.17-rc3 to resolve a bunch
of reported issues. Included in here are:
- typec driver fixes
- dwc3 new device id
- dwc3 driver fixes
- new usb-storage driver quirks
- xhci driver fixes
- other tiny USB driver fixes to resolve bugs
All of these have been in linux-next this week with no reported issues"
* tag 'usb-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: xhci: fix host not responding after suspend and resume
usb: xhci: Fix slot_id resource race conflict
usb: typec: fusb302: Revert incorrect threaded irq fix
USB: core: Update kerneldoc for usb_hcd_giveback_urb()
usb: typec: maxim_contaminant: re-enable cc toggle if cc is open and port is clean
usb: typec: maxim_contaminant: disable low power mode when reading comparator values
usb: dwc3: Remove WARN_ON for device endpoint command timeouts
USB: storage: Ignore driver CD mode for Realtek multi-mode Wi-Fi dongles
usb: storage: realtek_cr: Use correct byte order for bcs->Residue
usb: chipidea: imx: improve usbmisc_imx7d_pullup()
kcov, usb: Don't disable interrupts in kcov_remote_start_usb_softirq()
usb: dwc3: pci: add support for the Intel Wildcat Lake
usb: dwc3: Ignore late xferNotReady event to prevent halt timeout
USB: storage: Add unusual-devs entry for Novatek NTK96550-based camera
usb: core: hcd: fix accessing unmapped memory in SINGLE_STEP_SET_FEATURE test
usb: renesas-xhci: Fix External ROM access timeouts
usb: gadget: tegra-xudc: fix PM use count underflow
usb: quirks: Add DELAY_INIT quick for another SanDisk 3.2Gen1 Flash Drive
Linus Torvalds [Sat, 23 Aug 2025 14:11:34 +0000 (10:11 -0400)]
Merge tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
Pull tracing fixes from Steven Rostedt:
- Fix rtla and latency tooling pkg-config errors
If libtraceevent and libtracefs is installed, but their corresponding
'.pc' files are not installed, it reports that the libraries are
missing and confuses the developer. Instead, report that the
pkg-config files are missing and should be installed.
- Fix overflow bug of the parser in trace_get_user()
trace_get_user() uses the parsing functions to parse the user space
strings. If the parser fails due to incorrect processing, it doesn't
terminate the buffer with a nul byte. Add a "failed" flag to the
parser that gets set when parsing fails and is used to know if the
buffer is fine to use or not.
- Remove a semicolon that was at an end of a comment line
- Fix register_ftrace_graph() to unregister the pm notifier on error
The register_ftrace_graph() registers a pm notifier but there's an
error path that can exit the function without unregistering it. Since
the function returns an error, it will never be unregistered.
- Allocate and copy ftrace hash for reader of ftrace filter files
When the set_ftrace_filter or set_ftrace_notrace files are open for
read, an iterator is created and sets its hash pointer to the
associated hash that represents filtering or notrace filtering to it.
The issue is that the hash it points to can change while the
iteration is happening. All the locking used to access the tracer's
hashes are released which means those hashes can change or even be
freed. Using the hash pointed to by the iterator can cause UAF bugs
or similar.
Have the read of these files allocate and copy the corresponding
hashes and use that as that will keep them the same while the
iterator is open. This also simplifies the code as opening it for
write already does an allocate and copy, and now that the read is
doing the same, there's no need to check which way it was opened on
the release of the file, and the iterator hash can always be freed.
- Fix function graph to copy args into temp storage
The output of the function graph tracer shows both the entry and the
exit of a function. When the exit is right after the entry, it
combines the two events into one with the output of "function();",
instead of showing:
function() {
}
In order to do this, the iterator descriptor that reads the events
includes storage that saves the entry event while it peaks at the
next event in the ring buffer. The peek can free the entry event so
the iterator must store the information to use it after the peek.
With the addition of function graph tracer recording the args, where
the args are a dynamic array in the entry event, the temp storage
does not save them. This causes the args to be corrupted or even
cause a read of unsafe memory.
Add space to save the args in the temp storage of the iterator.
- Fix race between ftrace_dump and reading trace_pipe
ftrace_dump() is used when a crash occurs where the ftrace buffer
will be printed to the console. But it can also be triggered by
sysrq-z. If a sysrq-z is triggered while a task is reading trace_pipe
it can cause a race in the ftrace_dump() where it checks if the
buffer has content, then it checks if the next event is available,
and then prints the output (regardless if the next event was
available or not). Reading trace_pipe at the same time can cause it
to not be available, and this triggers a WARN_ON in the print. Move
the printing into the check if the next event exists or not
* tag 'trace-v6.17-rc2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
ftrace: Also allocate and copy hash for reading of filter files
ftrace: Fix potential warning in trace_printk_seq during ftrace_dump
fgraph: Copy args in intermediate storage with entry
trace/fgraph: Fix the warning caused by missing unregister notifier
ring-buffer: Remove redundant semicolons
tracing: Limit access to parser->buffer when trace_get_user failed
rtla: Check pkg-config install
tools/latency-collector: Check pkg-config install
Linus Torvalds [Sat, 23 Aug 2025 13:04:32 +0000 (09:04 -0400)]
Merge tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core
Pull driver core fixes from Danilo Krummrich:
- Fix swapped handling of lru_gen and lru_gen_full debugfs files in
vmscan
- Fix debugfs mount options (uid, gid, mode) being silently ignored
- Fix leak of devres action in the unwind path of Devres::new()
- Documentation:
- Expand and fix documentation of (outdated) Device, DeviceContext
and generic driver infrastructure
- Fix C header link of faux device abstractions
- Clarify expected interaction with the security team
- Smooth text flow in the security bug reporting process
documentation
* tag 'driver-core-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/driver-core/driver-core:
Documentation: smooth the text flow in the security bug reporting process
Documentation: clarify the expected collaboration with security bugs reporters
debugfs: fix mount options not being applied
rust: devres: fix leaking call to devm_add_action()
rust: faux: fix C header link
driver: rust: expand documentation for driver infrastructure
device: rust: expand documentation for Device
device: rust: expand documentation for DeviceContext
mm/vmscan: fix inverted polarity in lru_gen_seq_show()
Steven Rostedt [Fri, 22 Aug 2025 22:36:06 +0000 (18:36 -0400)]
ftrace: Also allocate and copy hash for reading of filter files
Currently the reader of set_ftrace_filter and set_ftrace_notrace just adds
the pointer to the global tracer hash to its iterator. Unlike the writer
that allocates a copy of the hash, the reader keeps the pointer to the
filter hashes. This is problematic because this pointer is static across
function calls that release the locks that can update the global tracer
hashes. This can cause UAF and similar bugs.
Allocate and copy the hash for reading the filter files like it is done
for the writers. This not only fixes UAF bugs, but also makes the code a
bit simpler as it doesn't have to differentiate when to free the
iterator's hash between writers and readers.
Linus Torvalds [Fri, 22 Aug 2025 22:16:54 +0000 (18:16 -0400)]
Merge tag 'drm-fixes-2025-08-23-1' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Dave Airlie:
"Weekly drm fixes. Looks like things did indeed get busier after rc2,
nothing seems too major, but stuff scattered all over the place,
amdgpu, xe, i915, hibmc, rust support code, and other small fixes.
rust:
- drm device memory layout and safety fixes
tests:
- Endianness fixes
gpuvm:
- docs warning fix
panic:
- fix division on 32-bit arm
i915:
- TypeC DP display Fixes
- Silence rpm wakeref asserts on GEN11_GU_MISC_IIR access
- Relocate compression repacking WA for JSL/EHL
hibmc:
- modesetting black screen fixes
- fix UAF on irq
- fix leak on i2c failure path
nouveau:
- memory leak fixes
- typos
rockchip:
- Kconfig fix
- register caching fix"
* tag 'drm-fixes-2025-08-23-1' of https://gitlab.freedesktop.org/drm/kernel: (49 commits)
drm/xe: Fix vm_bind_ioctl double free bug
drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_create
drm/xe: Assign ioctl xe file handler to vm in xe_vm_create
drm/i915/gt: Relocate compression repacking WA for JSL/EHL
drm/i915: silence rpm wakeref asserts on GEN11_GU_MISC_IIR access
drm/amd/display: Fix DP audio DTO1 clock source on DCE 6.
drm/amd/display: Fix fractional fb divider in set_pixel_clock_v3
drm/amd/display: Don't print errors for nonexistent connectors
drm/amd/display: Don't warn when missing DCE encoder caps
drm/amd/display: Fill display clock and vblank time in dce110_fill_display_configs
drm/amd/display: Find first CRTC and its line time in dce110_fill_display_configs
drm/amd/display: Adjust DCE 8-10 clock, don't overclock by 15%
drm/amd/display: Don't overclock DCE 6 by 15%
drm/amd/display: Add null pointer check in mod_hdcp_hdcp1_create_session()
drm/amd/display: Fix Xorg desktop unresponsive on Replay panel
drm/amd/display: Avoid a NULL pointer dereference
drm/amdgpu/swm14: Update power limit logic
drm/amd/display: Revert Add HPO encoder support to Replay
drm/i915/icl+/tc: Convert AUX powered WARN to a debug message
drm/i915/lnl+/tc: Use the cached max lane count value
...
In the context between trace_empty() and trace_find_next_entry_inc()
during ftrace_dump, the ring buffer data was consumed by other readers.
This caused trace_find_next_entry_inc to return NULL, failing to populate
`iter.seq`. At this point, due to the prior trace_iterator_reset, both
`iter.seq.len` and `iter.seq.size` were set to 0. Since they are equal,
the WARN_ON_ONCE condition is triggered.
Move the trace_printk_seq() into the if block that checks to make sure the
return value of trace_find_next_entry_inc() is non-NULL in
ftrace_dump_one(), ensuring the 'iter.seq' is properly populated before
subsequent operations.
Steven Rostedt [Wed, 20 Aug 2025 23:55:22 +0000 (19:55 -0400)]
fgraph: Copy args in intermediate storage with entry
The output of the function graph tracer has two ways to display its
entries. One way for leaf functions with no events recorded within them,
and the other is for functions with events recorded inside it. As function
graph has an entry and exit event, to simplify the output of leaf
functions it combines the two, where as non leaf functions are separate:
2) | invoke_rcu_core() {
2) | raise_softirq() {
2) 0.391 us | __raise_softirq_irqoff();
2) 1.191 us | }
2) 2.086 us | }
The __raise_softirq_irqoff() function above is really two events that were
merged into one. Otherwise it would have looked like:
2) | invoke_rcu_core() {
2) | raise_softirq() {
2) | __raise_softirq_irqoff() {
2) 0.391 us | }
2) 1.191 us | }
2) 2.086 us | }
In order to do this merge, the reading of the trace output file needs to
look at the next event before printing. But since the pointer to the event
is on the ring buffer, it needs to save the entry event before it looks at
the next event as the next event goes out of focus as soon as a new event
is read from the ring buffer. After it reads the next event, it will print
the entry event with either the '{' (non leaf) or ';' and timestamps (leaf).
The iterator used to read the trace file has storage for this event. The
problem happens when the function graph tracer has arguments attached to
the entry event as the entry now has a variable length "args" field. This
field only gets set when funcargs option is used. But the args are not
recorded in this temp data and garbage could be printed. The entry field
is copied via:
data->ent = *curr;
Where "curr" is the entry field. But this method only saves the non
variable length fields from the structure.
Add a helper structure to the iterator data that adds the max args size to
the data storage in the iterator. Then simply copy the entire entry into
this storage (with size protection).
Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Mark Rutland <mark.rutland@arm.com> Link: https://lore.kernel.org/20250820195522.51d4a268@gandalf.local.home Reported-by: Sasha Levin <sashal@kernel.org> Tested-by: Sasha Levin <sashal@kernel.org> Closes: https://lore.kernel.org/all/aJaxRVKverIjF4a6@lappy/ Fixes: ff5c9c576e75 ("ftrace: Add support for function argument to graph tracer") Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
Linus Torvalds [Fri, 22 Aug 2025 21:24:48 +0000 (17:24 -0400)]
Merge tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd
Pull iommufd fixes from Jason Gunthorpe:
"Two very minor fixes:
- Fix mismatched kvalloc()/kfree()
- Spelling fixes in documentation"
* tag 'for-linus-iommufd' of git://git.kernel.org/pub/scm/linux/kernel/git/jgg/iommufd:
iommufd: Fix spelling errors in iommufd.rst
iommufd: viommu: free memory allocated by kvcalloc() using kvfree()
Bindig requires a node name matching ‘^ethernet@[0-9a-f]+$’. This patch
changes the clock name from “etop” to “ethernet”.
This fixes the following warning:
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): $nodename:0: 'etop@e180000' does not match '^ethernet@[0-9a-f]+$'
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
Fixes: dac0bad93741 ("dt-bindings: net: lantiq,etop-xway: Document Lantiq Xway ETOP bindings") Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Acked-by: Jakub Kicinski <kuba@kernel.org>
The upstream dts lacks the lantiq,{rx/tx}-burst-length property. Other
issues were also fixed:
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'interrupt-names' is a required property
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'lantiq,tx-burst-length' is a required property
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
arch/mips/boot/dts/lantiq/danube_easy50712.dtb: etop@e180000 (lantiq,etop-xway): 'lantiq,rx-burst-length' is a required property
from schema $id: http://devicetree.org/schemas/net/lantiq,etop-xway.yaml#
Fixes: 14d4e308e0aa ("net: lantiq: configure the burst length in ethernet drivers") Signed-off-by: Aleksander Jan Bajkowski <olek2@wp.pl> Acked-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Fri, 22 Aug 2025 14:16:47 +0000 (10:16 -0400)]
Merge tag 's390-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:
- When kernel lockdown is active userspace tools that rely on read
operations only are unnecessarily blocked. Fix that by avoiding ioctl
registration during lockdown
- Invalid NULL pointer accesses succeed due to the lowcore is always
mapped the identity mapping pinned to zero. To fix that never map the
first two pages of physical memory with identity mapping
- Fix invalid SCCB present check in the SCLP interrupt handler
- Update defconfigs
* tag 's390-6.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/hypfs: Enable limited access during lockdown
s390/hypfs: Avoid unnecessary ioctl registration in debugfs
s390/mm: Do not map lowcore with identity mapping
s390/sclp: Fix SCCB present check
s390/configs: Set HZ=1000
s390/configs: Update defconfigs
Linus Torvalds [Fri, 22 Aug 2025 13:50:17 +0000 (09:50 -0400)]
Merge tag 'for-linus-6.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Two small cleanups which are both relevant only when running as a Xen
guest"
* tag 'for-linus-6.17-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
drivers/xen/xenbus: remove quirk for Xen 3.x
compiler: remove __ADDRESSABLE_ASM{_STR,}() again
Linus Torvalds [Fri, 22 Aug 2025 13:35:21 +0000 (09:35 -0400)]
Merge tag 'platform-drivers-x86-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Ilpo Järvinen:
- amd/hsmp:
- Ensure sock->metric_tbl_addr is non-NULL
- Register driver even if hwmon registration fails
- amd/pmc: Drop SMU F/W match for Cezanne
- dell-smbios-wmi: Separate "priority" from WMI device ID
- hp-wmi: mark Victus 16-r1xxx for Victus s fan and thermal profile
support
- intel-uncore-freq: Check write blocked for efficiency latency control
* tag 'platform-drivers-x86-v6.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: hp-wmi: mark Victus 16-r1xxx for victus_s fan and thermal profile support
platform/x86/amd/hsmp: Ensure success even if hwmon registration fails
platform/x86/amd/hsmp: Ensure sock->metric_tbl_addr is non-NULL
platform/x86/intel-uncore-freq: Check write blocked for ELC
platform/x86/amd: pmc: Drop SMU F/W match for Cezanne
platform/x86: dell-smbios-wmi: Stop touching WMI device ID
Linus Torvalds [Fri, 22 Aug 2025 13:29:51 +0000 (09:29 -0400)]
Merge tag 'block-6.17-20250822' of git://git.kernel.dk/linux
Pull block fixes from Jens Axboe:
"A set of fixes for block that should go into this tree. A bit larger
than what I usually have at this point in time, a lot of that is the
continued fixing of the lockdep annotation for queue freezing that we
recently added, which has highlighted a number of little issues here
and there. This contains:
- MD pull request via Yu:
- Add a legacy_async_del_gendisk mode, to prevent a user tools
regression. New user tools releases will not use such a mode,
the old release with a new kernel now will have warning about
deprecated behavior, and we prepare to remove this legacy mode
after about a year later
- The rename in kernel causing user tools build failure, revert
the rename in mdp_superblock_s
- Fix a regression that interrupted resync can be shown as
recover from mdstat or sysfs
- Improve file size detection for loop, particularly for networked
file systems, by using getattr to get the size rather than the
cached inode size.
- Hotplug CPU lock vs queue freeze fix
- Lockdep fix while updating the number of hardware queues
- Fix stacking for PI devices
- Silence bio_check_eod() for the known case of device removal where
the size is truncated to 0 sectors"
* tag 'block-6.17-20250822' of git://git.kernel.dk/linux:
block: avoid cpu_hotplug_lock depedency on freeze_lock
block: decrement block_rq_qos static key in rq_qos_del()
block: skip q->rq_qos check in rq_qos_done_bio()
blk-mq: fix lockdep warning in __blk_mq_update_nr_hw_queues
block: tone down bio_check_eod
loop: use vfs_getattr_nosec for accurate file size
loop: Consolidate size calculation logic into lo_calculate_size()
block: remove newlines from the warnings in blk_validate_integrity_limits
block: handle pi_tuple_size in queue_limits_stack_integrity
selftests: ublk: Use ARRAY_SIZE() macro to improve code
md: fix sync_action incorrect display during resync
md: add helper rdev_needs_recovery()
md: keep recovery_cp in mdp_superblock_s
md: add legacy_async_del_gendisk mode
Linus Torvalds [Fri, 22 Aug 2025 13:25:59 +0000 (09:25 -0400)]
Merge tag 'io_uring-6.17-20250822' of git://git.kernel.dk/linux
Pull io_uring fixes from Jens Axboe:
"Just two small fixes - one that fixes inconsistent ->async_data vs
REQ_F_ASYNC_DATA handling in futex, and a followup that just ensures
that if other opcode handlers mess this up, it won't cause any issues"
* tag 'io_uring-6.17-20250822' of git://git.kernel.dk/linux:
io_uring: clear ->async_data as part of normal init
io_uring/futex: ensure io_futex_wait() cleans up properly on failure
Linus Torvalds [Fri, 22 Aug 2025 13:20:42 +0000 (09:20 -0400)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"All fixes in drivers. The largest diffstat in ufs is caused by the doc
update with the next being the qcom null pointer deref fix"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ufs: ufs-qcom: Fix ESI null pointer dereference
scsi: ufs: core: Rename ufshcd_wait_for_doorbell_clr()
scsi: ufs: core: Fix the return value documentation
scsi: ufs: core: Remove WARN_ON_ONCE() call from ufshcd_uic_cmd_compl()
scsi: ufs: core: Fix IRQ lock inversion for the SCSI host lock
scsi: qla4xxx: Prevent a potential error pointer dereference
scsi: ufs: ufs-pci: Add support for Intel Wildcat Lake
scsi: fnic: Remove a useless struct mempool forward declaration
Linus Torvalds [Fri, 22 Aug 2025 13:17:49 +0000 (09:17 -0400)]
Merge tag 'mmc-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC host:
- sdhci_am654: Disable HS400 for AM62P SR1.0 and SR1.1
- sdhci-of-arasan: Ensure CD logic stabilization before power-up
- sdhci-pci-gli: Mask the replay timer timeout of AER for GL9763e
MEMSTICK:
- Fix deadlock by moving removing flag earlier"
* tag 'mmc-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: sdhci_am654: Disable HS400 for AM62P SR1.0 and SR1.1
memstick: Fix deadlock by moving removing flag earlier
mmc: sdhci-of-arasan: Ensure CD logic stabilization before power-up
mmc: sdhci-pci-gli: GL9763e: Mask the replay timer timeout of AER
mmc: sdhci-pci-gli: GL9763e: Rename the gli_set_gl9763e() for consistency
mmc: sdhci-pci-gli: Add a new function to simplify the code
Linus Torvalds [Fri, 22 Aug 2025 13:13:24 +0000 (09:13 -0400)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
- syzkaller found a WARN_ON in rxe due to poor lifecycle management of
resources linked to skbs
- Missing error path handling in erdma qp creation
- Initialize the qp number for the GSI QP in erdma
- Mismatching of DIP, SCC and QP numbers in hns
- SRQ bug fixes in bnxt_re
- Memory leak and possibly uninited memory in bnxt_re
- Remove retired irdma maintainer
- Fix kfree() for kvalloc() in ODP
- Fix memory leak in hns
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/hns: Fix dip entries leak on devices newer than hip09
RDMA/core: Free pfn_list with appropriate kvfree call
MAINTAINERS: Remove bouncing irdma maintainer
RDMA/bnxt_re: Fix to initialize the PBL array
RDMA/bnxt_re: Fix a possible memory leak in the driver
RDMA/bnxt_re: Fix to remove workload check in SRQ limit path
RDMA/bnxt_re: Fix to do SRQ armena by default
RDMA/hns: Fix querying wrong SCC context for DIP algorithm
RDMA/erdma: Fix unset QPN of GSI QP
RDMA/erdma: Fix ignored return value of init_kernel_qp
RDMA/rxe: Flush delayed SKBs while releasing RXE resources
Linus Torvalds [Fri, 22 Aug 2025 13:05:37 +0000 (09:05 -0400)]
Merge tag 'sound-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"Only small fixes.
- ASoC Cirrus codec fixes
- A regression fix for the recent TAS2781 codec refactoring
- A fix for user-timer error handling
- Fixes for USB-audio descriptor validators
- Usual HD-audio and ASoC device-specific quirks"
* tag 'sound-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
ALSA: timer: fix ida_free call while not allocated
ASoC: cs35l56: Remove SoundWire Clock Divider workaround for CS35L63
ASoC: cs35l56: Handle new algorithms IDs for CS35L63
ASoC: cs35l56: Update Firmware Addresses for CS35L63 for production silicon
ALSA: hda: tas2781: Fix wrong reference of tasdevice_priv
ALSA: hda/realtek: Audio disappears on HP 15-fc000 after warm boot again
ALSA: hda/realtek: Fix headset mic on ASUS Zenbook 14
ASoC: codecs: ES9389: Modify the standby configuration
ALSA: usb-audio: Fix size validation in convert_chmap_v3()
ALSA: hda/tas2781: Add name prefix tas2781 for tas2781's dvc_tlv and amp_vol_tlv
ALSA: hda/realtek: Add support for HP EliteBook x360 830 G6 and EliteBook 830 G6
Linus Torvalds [Fri, 22 Aug 2025 12:54:34 +0000 (08:54 -0400)]
Merge tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"20 hotfixes. 10 are cc:stable and the remainder address post-6.16
issues or aren't considered necessary for -stable kernels. 17 of these
fixes are for MM.
As usual, singletons all over the place, apart from a three-patch
series of KHO followup work from Pasha which is actually also a bunch
of singletons"
* tag 'mm-hotfixes-stable-2025-08-21-18-17' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mm/mremap: fix WARN with uffd that has remap events disabled
mm/damon/sysfs-schemes: put damos dests dir after removing its files
mm/migrate: fix NULL movable_ops if CONFIG_ZSMALLOC=m
mm/damon/core: fix damos_commit_filter not changing allow
mm/memory-failure: fix infinite UCE for VM_PFNMAP pfn
MAINTAINERS: mark MGLRU as maintained
mm: rust: add page.rs to MEMORY MANAGEMENT - RUST
iov_iter: iterate_folioq: fix handling of offset >= folio size
selftests/damon: fix selftests by installing drgn related script
.mailmap: add entry for Easwar Hariharan
selftests/mm: add test for invalid multi VMA operations
mm/mremap: catch invalid multi VMA moves earlier
mm/mremap: allow multi-VMA move when filesystem uses thp_get_unmapped_area
mm/damon/core: fix commit_ops_filters by using correct nth function
tools/testing: add linux/args.h header and fix radix, VMA tests
mm/debug_vm_pgtable: clear page table entries at destroy_args()
squashfs: fix memory leak in squashfs_fill_super
kho: warn if KHO is disabled due to an error
kho: mm: don't allow deferred struct page with KHO
kho: init new_physxa->phys_bits to fix lockdep
XianLiang Huang [Wed, 20 Aug 2025 07:22:48 +0000 (15:22 +0800)]
iommu/riscv: prevent NULL deref in iova_to_phys
The riscv_iommu_pte_fetch() function returns either NULL for
unmapped/never-mapped iova, or a valid leaf pte pointer that
requires no further validation.
riscv_iommu_iova_to_phys() failed to handle NULL returns.
Prevent null pointer dereference in
riscv_iommu_iova_to_phys(), and remove the pte validation.
Robin Murphy [Thu, 14 Aug 2025 16:47:16 +0000 (17:47 +0100)]
iommu/virtio: Make instance lookup robust
Much like arm-smmu in commit 7d835134d4e1 ("iommu/arm-smmu: Make
instance lookup robust"), virtio-iommu appears to have the same issue
where iommu_device_register() makes the IOMMU instance visible to other
API callers (including itself) straight away, but internally the
instance isn't ready to recognise itself for viommu_probe_device() to
work correctly until after viommu_probe() has returned. This matters a
lot more now that bus_iommu_probe() has the DT/VIOT knowledge to probe
client devices the way that was always intended. Tweak the lookup and
initialisation in much the same way as for arm-smmu, to ensure that what
we register is functional and ready to go.
The arm_smmu_attach_commit() updates master->ats_enabled before calling
arm_smmu_remove_master_domain() that is supposed to clean up everything
in the old domain, including the old domain's nr_ats_masters. So, it is
supposed to use the old ats_enabled state of the device, not an updated
state.
This isn't a problem if switching between two domains where:
- old ats_enabled = false; new ats_enabled = false
- old ats_enabled = true; new ats_enabled = true
but can fail cases where:
- old ats_enabled = false; new ats_enabled = true
(old domain should keep the counter but incorrectly decreased it)
- old ats_enabled = true; new ats_enabled = false
(old domain needed to decrease the counter but incorrectly missed it)
Update master->ats_enabled after arm_smmu_remove_master_domain() to fix
this.
Fixes: 7497f4211f4f ("iommu/arm-smmu-v3: Make changing domains be hitless for ATS") Cc: stable@vger.kernel.org Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> Acked-by: Will Deacon <will@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Pranjal Shrivastava <praan@google.com> Link: https://lore.kernel.org/r/20250801030127.2006979-1-nicolinc@nvidia.com Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
Linus Torvalds [Thu, 21 Aug 2025 20:31:27 +0000 (16:31 -0400)]
Merge tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fixes from Tejun Heo:
- Fix NULL de-ref in css_rstat_exit() which could happen after
allocation failure
- Fix a cpuset partition handling bug and a couple other misc issues
- Doc spelling fix
* tag 'cgroup-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup: fixed spelling mistakes in documentation
cgroup: avoid null de-ref in css_rstat_exit()
cgroup/cpuset: Remove the unnecessary css_get/put() in cpuset_partition_write()
cgroup/cpuset: Fix a partition error with CPU hotplug
cgroup/cpuset: Use static_branch_enable_cpuslocked() on cpusets_insane_config_key
Linus Torvalds [Thu, 21 Aug 2025 20:28:00 +0000 (16:28 -0400)]
Merge tag 'spi-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi
Pull spi fixes from Mark Brown:
"A small collection of fixes that came in during the past week, a few
driver specifics plus one fix for the spi-mem core where we weren't
taking account of the frequency capabilities of the system when
determining if it can support an operation"
* tag 'spi-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/spi:
spi: st: fix PM macros to use CONFIG_PM instead of CONFIG_PM_SLEEP
spi: spi-qpic-snand: fix calculating of ECC OOB regions' properties
spi: spi-fsl-lpspi: Clamp too high speed_hz
spi: spi-mem: add spi_mem_adjust_op_freq() in spi_mem_supports_op()
spi: spi-mem: Add missing kdoc argument
spi: spi-qpic-snand: use correct CW_PER_PAGE value for OOB write
Linus Torvalds [Thu, 21 Aug 2025 20:26:18 +0000 (16:26 -0400)]
Merge tag 'regulator-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fixes from Mark Brown:
"A couple of fairly minor device specific fixes that came in over the
past week or so, plus the addition of an actual maintainer for the
IR38060"
* tag 'regulator-fix-v6.17-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: tps65219: regulator: tps65219: Fix error codes in probe()
regulator: pca9450: Use devm_register_sys_off_handler
regulator: dt-bindings: infineon,ir38060: Add Guenter as maintainer from IBM
Linus Torvalds [Thu, 21 Aug 2025 20:07:15 +0000 (16:07 -0400)]
Merge tag 'acpi-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix three new issues in the ACPI APEI error injection code and
an ACPI platform firmware runtime update interface issue:
- Make ACPI APEI error injection check the version of the request
when mapping the EINJ parameter structure in the BIOS reserved
memory to prevent injecting errors based on an uninitialized
field (Tony Luck)
- Fix potential NULL dereference in __einj_error_inject() that may
occur when memory allocation fails (Charles Han)
- Remove the __exit annotation from einj_remove(), so it can be
called on errors during faux device probe (Uwe Kleine-König)
- Use a security-version-number check instead of a runtime version
check during ACPI platform firmware runtime driver updates to
prevent those updates from failing due to false-positive driver
version check failures (Chen Yu)"
* tag 'acpi-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: pfr_update: Fix the driver update version check
ACPI: APEI: EINJ: Fix resource leak by remove callback in .exit.text
ACPI: APEI: EINJ: fix potential NULL dereference in __einj_error_inject()
ACPI: APEI: EINJ: Check if user asked for EINJV2 injection
Linus Torvalds [Thu, 21 Aug 2025 20:04:58 +0000 (16:04 -0400)]
Merge tag 'pm-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These fix a cpuidle menu governor issue and two issues in the cpupower
utility:
- Prevent the menu cpuidle governor from selecting idle states with
exit latency exceeding the current PM QoS limit after stopping the
scheduler tick (Rafael Wysocki)
- Make the set subcommand's -t option in the cpupower utility work as
documented and allow it to control the CPU boost feature of cpufreq
beyond x86 (Shinji Nomoto)"
* tag 'pm-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
cpuidle: governors: menu: Avoid selecting states with too much latency
cpupower: Allow control of boost feature on non-x86 based systems with boost support.
cpupower: Fix a bug where the -t option of the set subcommand was not working.
Linus Torvalds [Thu, 21 Aug 2025 20:02:35 +0000 (16:02 -0400)]
Merge tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext
Pull sched_ext fixes from Tejun Heo:
- Fix a subtle bug during SCX enabling where a dead task skips init
but doesn't skip sched class switch leading to invalid task state
transition warning
- Cosmetic fix in selftests
* tag 'sched_ext-for-6.17-rc2-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext:
selftests/sched_ext: Remove duplicate sched.h header
sched/ext: Fix invalid task state transitions on class switch
Jens Axboe [Thu, 21 Aug 2025 19:24:57 +0000 (13:24 -0600)]
io_uring: clear ->async_data as part of normal init
Opcode handlers like POLL_ADD will use ->async_data as the pointer for
double poll handling, which is a bit different than the usual case
where it's strictly gated by the REQ_F_ASYNC_DATA flag. Be a bit more
proactive in handling ->async_data, and clear it to NULL as part of
regular init. Init is touching that cacheline anyway, so might as well
clear it.
Jens Axboe [Thu, 21 Aug 2025 19:23:21 +0000 (13:23 -0600)]
io_uring/futex: ensure io_futex_wait() cleans up properly on failure
The io_futex_data is allocated upfront and assigned to the io_kiocb
async_data field, but the request isn't marked with REQ_F_ASYNC_DATA
at that point. Those two should always go together, as the flag tells
io_uring whether the field is valid or not.
Additionally, on failure cleanup, the futex handler frees the data but
does not clear ->async_data. Clear the data and the flag in the error
path as well.
Thanks to Trend Micro Zero Day Initiative and particularly ReDress for
reporting this.
Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Signed-off-by: Jens Axboe <axboe@kernel.dk>
If the argument check during an array bind fails, the bind_ops are freed
twice as seen below. Fix this by setting bind_ops to NULL after freeing.
==================================================================
BUG: KASAN: double-free in xe_vm_bind_ioctl+0x1b2/0x21f0 [xe]
Free of addr ffff88813bb9b800 by task xe_vm/14198
Piotr Piórkowski [Mon, 11 Aug 2025 10:43:58 +0000 (12:43 +0200)]
drm/xe: Move ASID allocation and user PT BO tracking into xe_vm_create
Currently, ASID assignment for user VMs and page-table BO accounting for
client memory tracking are performed in xe_vm_create_ioctl.
To consolidate VM object initialization, move this logic to
xe_vm_create.
v2:
- removed unnecessary duplicate BO tracking code
- using the local variable xef to verify whether the VM is being created
by userspace
Fixes: 658a1c8e0a66 ("drm/xe: Assign ioctl xe file handler to vm in xe_vm_create") Suggested-by: Matthew Auld <matthew.auld@intel.com> Signed-off-by: Piotr Piórkowski <piotr.piorkowski@intel.com> Reviewed-by: Matthew Auld <matthew.auld@intel.com> Link: https://lore.kernel.org/r/20250811104358.2064150-3-piotr.piorkowski@intel.com Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
(cherry picked from commit 30e0c3f43a414616e0b6ca76cf7f7b2cd387e1d4) Signed-off-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
[Rodrigo: Added fixes tag]
Linus Torvalds [Thu, 21 Aug 2025 17:51:15 +0000 (13:51 -0400)]
Merge tag 'net-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from Bluetooth.
Current release - fix to a fix:
- usb: asix_devices: fix PHY address mask in MDIO bus initialization
Current release - regressions:
- Bluetooth: fixes for the split between BIS_LINK and PA_LINK
- Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN
flag", breaks compatibility with some existing device tree blobs
- dsa: b53: fix reserved register access in b53_fdb_dump()
Current release - new code bugs:
- sched: dualpi2: run probability update timer in BH to avoid
deadlock
- eth: libwx: fix the size in RSS hash key population
- pse-pd: pd692x0: improve power budget error paths and handling
Previous releases - regressions:
- tls: fix handling of zero-length records on the rx_list
- hsr: reject HSR frame if skb can't hold tag
- bonding: fix negotiation flapping in 802.3ad passive mode
Previous releases - always broken:
- gso: forbid IPv6 TSO with extensions on devices with only IPV6_CSUM
- sched: make cake_enqueue return NET_XMIT_CN when past buffer_limit,
avoid packet drops with low buffer_limit, remove unnecessary WARN()
- sched: fix backlog accounting after modifying config of a qdisc in
the middle of the hierarchy
- mptcp: improve handling of skb extension allocation failures
- eth: mlx5:
- fixes for the "HW Steering" flow management method
- fixes for QoS and device buffer management"
* tag 'net-6.17-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
netfilter: nf_reject: don't leak dst refcount for loopback packets
net/mlx5e: Preserve shared buffer capacity during headroom updates
net/mlx5e: Query FW for buffer ownership
net/mlx5: Restore missing scheduling node cleanup on vport enable failure
net/mlx5: Fix QoS reference leak in vport enable error path
net/mlx5: Destroy vport QoS element when no configuration remains
net/mlx5e: Preserve tc-bw during parent changes
net/mlx5: Remove default QoS group and attach vports directly to root TSAR
net/mlx5: Base ECVF devlink port attrs from 0
net: pse-pd: pd692x0: Skip power budget configuration when undefined
net: pse-pd: pd692x0: Fix power budget leak in manager setup error path
Octeontx2-af: Skip overlap check for SPI field
selftests: tls: add tests for zero-length records
tls: fix handling of zero-length records on the rx_list
net: airoha: ppe: Do not invalid PPE entries in case of SW hash collision
selftests: bonding: add test for passive LACP mode
bonding: send LACPDUs periodically in passive mode after receiving partner's LACPDU
bonding: update LACP activity flag after setting lacp_active
Revert "net: cadence: macb: sama7g5_emac: Remove USARIO CLKEN flag"
ipv6: sr: Fix MAC comparison to be constant-time
...
This is because blamed commit forgot about loopback packets.
Such packets already have a dst_entry attached, even at PRE_ROUTING stage.
Instead of checking hook just check if the skb already has a route
attached to it.
Fixes: f53b9b0bdc59 ("netfilter: introduce support for reject at prerouting stage") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://patch.msgid.link/20250820123707.10671-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When kernel lockdown is active, debugfs_locked_down() blocks access to
hypfs files that register ioctl callbacks, even if the ioctl interface
is not required for a function. This unnecessarily breaks userspace
tools that only rely on read operations.
Resolve this by registering a minimal set of file operations during
lockdown, avoiding ioctl registration and preserving access for affected
tooling.
Note that this change restores hypfs functionality when lockdown is
active from early boot (e.g. via lockdown=integrity kernel parameter),
but does not apply to scenarios where lockdown is enabled dynamically
while Linux is running.
Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
s390/hypfs: Avoid unnecessary ioctl registration in debugfs
Currently, hypfs registers ioctl callbacks for all debugfs files,
despite only one file requiring them. This leads to unintended exposure
of unused interfaces to user space and can trigger side effects such as
restricted access when kernel lockdown is enabled.
Restrict ioctl registration to only those files that implement ioctl
functionality to avoid interface clutter and unnecessary access
restrictions.
Tested-by: Mete Durlu <meted@linux.ibm.com> Reviewed-by: Vasily Gorbik <gor@linux.ibm.com> Fixes: 5496197f9b08 ("debugfs: Restrict debugfs when the kernel is locked down") Signed-off-by: Peter Oberparleiter <oberpar@linux.ibm.com> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com>
Takashi Iwai [Thu, 21 Aug 2025 15:08:34 +0000 (17:08 +0200)]
ALSA: usb-audio: Use correct sub-type for UAC3 feature unit validation
The entry of the validators table for UAC3 feature unit is defined
with a wrong sub-type UAC_FEATURE (= 0x06) while it should have been
UAC3_FEATURE (= 0x07). This patch corrects the entry value.
Armen Ratner [Wed, 20 Aug 2025 13:32:09 +0000 (16:32 +0300)]
net/mlx5e: Preserve shared buffer capacity during headroom updates
When port buffer headroom changes, port_update_shared_buffer()
recalculates the shared buffer size and splits it in a 3:1 ratio
(lossy:lossless) - Currently, the calculation is:
lossless = shared / 4;
lossy = (shared / 4) * 3;
Meaning, the calculation dropped the remainder of shared % 4 due to
integer division, unintentionally reducing the total shared buffer
by up to three cells on each update. Over time, this could shrink
the buffer below usable size.
Fix it by changing the calculation to:
lossless = shared / 4;
lossy = shared - lossless;
This retains all buffer cells while still approximating the
intended 3:1 split, preventing capacity loss over time.
While at it, perform headroom calculations in units of cells rather than
in bytes for more accurate calculations avoiding extra divisions.
Alexei Lazar [Wed, 20 Aug 2025 13:32:08 +0000 (16:32 +0300)]
net/mlx5e: Query FW for buffer ownership
The SW currently saves local buffer ownership when setting
the buffer.
This means that the SW assumes it has ownership of the buffer
after the command is set.
If setting the buffer fails and we remain in FW ownership,
the local buffer ownership state incorrectly remains as SW-owned.
This leads to incorrect behavior in subsequent PFC commands,
causing failures.
Instead of saving local buffer ownership in SW,
query the FW for buffer ownership when setting the buffer.
This ensures that the buffer ownership state is accurately
reflected, avoiding the issues caused by incorrect ownership
states.
Fixes: ecdf2dadee8e ("net/mlx5e: Receive buffer support for DCBX") Signed-off-by: Alexei Lazar <alazar@nvidia.com> Reviewed-by: Shahar Shitrit <shshitrit@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-8-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Carolina Jubran [Wed, 20 Aug 2025 13:32:05 +0000 (16:32 +0300)]
net/mlx5: Destroy vport QoS element when no configuration remains
If a VF has been configured and the user later clears all QoS settings,
the vport element remains in the firmware QoS tree. This leads to
inconsistent behavior compared to VFs that were never configured, since
the FW assumes that unconfigured VFs are outside the QoS hierarchy.
As a result, the bandwidth share across VFs may differ, even though
none of them appear to have any configuration.
Align the driver behavior with the FW expectation by destroying the
vport QoS element when all configurations are removed.
Fixes: c9497c98901c ("net/mlx5: Add support for setting VF min rate") Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Link: https://patch.msgid.link/20250820133209.389065-5-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Carolina Jubran [Wed, 20 Aug 2025 13:32:04 +0000 (16:32 +0300)]
net/mlx5e: Preserve tc-bw during parent changes
When changing parent of a node/leaf with tc-bw configured, the code
saves and restores tc-bw values. However, it was reading the converted
hardware bw_share values (where 0 becomes 1) instead of the original
user values, causing incorrect tc-bw calculations after parent change.
Store original tc-bw values in the node structure and use them directly
for save/restore operations.
Fixes: cf7e73770d1b ("net/mlx5: Manage TC arbiter nodes and implement full support for tc-bw") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-4-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Carolina Jubran [Wed, 20 Aug 2025 13:32:03 +0000 (16:32 +0300)]
net/mlx5: Remove default QoS group and attach vports directly to root TSAR
Currently, the driver creates a default group (`node0`) and attaches
all vports to it unless the user explicitly sets a parent group. As a
result, when a user configures tx_share on a group and tx_share on
a VF, the expectation is for the group and the VF to share bandwidth
relatively. However, since the VF is not connected to the same parent
(but to the default node), the proportional share logic is not applied
correctly.
To fix this, remove the default group (`node0`) and instead connect
vports directly to the root TSAR when no parent is specified. This
ensures that vports and groups share the same root scheduler and their
tx_share values are compared directly under the same hierarchy.
Fixes: 0fe132eac38c ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-3-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Daniel Jurgens [Wed, 20 Aug 2025 13:32:02 +0000 (16:32 +0300)]
net/mlx5: Base ECVF devlink port attrs from 0
Adjust the vport number by the base ECVF vport number so the port
attributes start at 0. Previously the port attributes would start 1
after the maximum number of host VFs.
Fixes: dc13180824b7 ("net/mlx5: Enable devlink port for embedded cpu VF vports") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Mark Bloch <mbloch@nvidia.com> Link: https://patch.msgid.link/20250820133209.389065-2-mbloch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kory Maincent [Wed, 20 Aug 2025 13:33:21 +0000 (15:33 +0200)]
net: pse-pd: pd692x0: Skip power budget configuration when undefined
If the power supply's power budget is not defined in the device tree,
the current code still requests power and configures the PSE manager
with a 0W power limit, which is undesirable behavior.
Skip power budget configuration entirely when the budget is zero,
avoiding unnecessary power requests and preventing invalid 0W limits
from being set on the PSE manager.
Fixes: 359754013e6a ("net: pse-pd: pd692x0: Add support for PSE PI priority feature") Signed-off-by: Kory Maincent <kory.maincent@bootlin.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20250820133321.841054-1-kory.maincent@bootlin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Kory Maincent [Wed, 20 Aug 2025 13:27:07 +0000 (15:27 +0200)]
net: pse-pd: pd692x0: Fix power budget leak in manager setup error path
Fix a resource leak where manager power budgets were freed on both
success and error paths during manager setup. Power budgets should
only be freed on error paths after regulator registration or during
driver removal.
Refactor cleanup logic by extracting OF node cleanup and power budget
freeing into separate helper functions for better maintainability.
Hariprasad Kelam [Wed, 20 Aug 2025 06:39:18 +0000 (12:09 +0530)]
Octeontx2-af: Skip overlap check for SPI field
Octeontx2/CN10K silicon supports generating a 256-bit key per packet.
The specific fields to be extracted from a packet for key generation
are configurable via a Key Extraction (MKEX) Profile.
The AF driver scans the configured extraction profile to ensure that
fields from upper layers do not overwrite fields from lower layers in
the key.
Example Packet Field Layout:
LA: DMAC + SMAC
LB: VLAN
LC: IPv4/IPv6
LD: TCP/UDP
Valid MKEX Profile Configuration:
LA -> DMAC -> key_offset[0-5]
LC -> SIP -> key_offset[20-23]
LD -> SPORT -> key_offset[30-31]
Invalid MKEX profile configuration:
LA -> DMAC -> key_offset[0-5]
LC -> SIP -> key_offset[20-23]
LD -> SPORT -> key_offset[2-3] // Overlaps with DMAC field
In another scenario, if the MKEX profile is configured to extract
the SPI field from both AH and ESP headers at the same key offset,
the driver rejecting this configuration. In a regular traffic,
ipsec packet will be having either AH(LD) or ESP (LE). This patch
relaxes the check for the same.
Jakub Kicinski [Wed, 20 Aug 2025 02:19:52 +0000 (19:19 -0700)]
selftests: tls: add tests for zero-length records
Test various combinations of zero-length records.
Unfortunately, kernel cannot be coerced into producing those,
so hardcode the ciphertext messages in the test.
Jakub Kicinski [Wed, 20 Aug 2025 02:19:51 +0000 (19:19 -0700)]
tls: fix handling of zero-length records on the rx_list
Each recvmsg() call must process either
- only contiguous DATA records (any number of them)
- one non-DATA record
If the next record has different type than what has already been
processed we break out of the main processing loop. If the record
has already been decrypted (which may be the case for TLS 1.3 where
we don't know type until decryption) we queue the pending record
to the rx_list. Next recvmsg() will pick it up from there.
Queuing the skb to rx_list after zero-copy decrypt is not possible,
since in that case we decrypted directly to the user space buffer,
and we don't have an skb to queue (darg.skb points to the ciphertext
skb for access to metadata like length).
Only data records are allowed zero-copy, and we break the processing
loop after each non-data record. So we should never zero-copy and
then find out that the record type has changed. The corner case
we missed is when the initial record comes from rx_list, and it's
zero length.
Reported-by: Muhammad Alifa Ramdhan <ramdhan@starlabs.sg> Reported-by: Billy Jheng Bing-Jhong <billy@starlabs.sg> Fixes: 84c61fe1a75b ("tls: rx: do not use the standard strparser") Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20250820021952.143068-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 21 Aug 2025 14:37:33 +0000 (10:37 -0400)]
Merge tag 'loongarch-fixes-6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson
Pull LoongArch fixes from Huacai Chen:
"Fix a lot of build warnings for LTO-enabled objtool check, increase
COMMAND_LINE_SIZE up to 4096, rename a missing GCC_PLUGIN_STACKLEAK to
KSTACK_ERASE, and fix some bugs about arch timer, module loading, LBT
and KVM"
* tag 'loongarch-fixes-6.17-1' of git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson:
LoongArch: KVM: Add address alignment check in pch_pic register access
LoongArch: KVM: Use kvm_get_vcpu_by_id() instead of kvm_get_vcpu()
LoongArch: KVM: Fix stack protector issue in send_ipi_data()
LoongArch: KVM: Make function kvm_own_lbt() robust
LoongArch: Rename GCC_PLUGIN_STACKLEAK to KSTACK_ERASE
LoongArch: Save LBT before FPU in setup_sigcontext()
LoongArch: Optimize module load time by optimizing PLT/GOT counting
LoongArch: Add cpuhotplug hooks to fix high cpu usage of vCPU threads
LoongArch: Increase COMMAND_LINE_SIZE up to 4096
LoongArch: Pass annotate-tablejump option if LTO is enabled
objtool/LoongArch: Get table size correctly if LTO is enabled
Nilay Shroff [Thu, 14 Aug 2025 08:24:59 +0000 (13:54 +0530)]
block: avoid cpu_hotplug_lock depedency on freeze_lock
A recent lockdep[1] splat observed while running blktest block/005
reveals a potential deadlock caused by the cpu_hotplug_lock dependency
on ->freeze_lock. This dependency was introduced by commit 033b667a823e
("block: blk-rq-qos: guard rq-qos helpers by static key").
That change added a static key to avoid fetching q->rq_qos when
neither blk-wbt nor blk-iolatency is configured. The static key
dynamically patches kernel text to a NOP when disabled, eliminating
overhead of fetching q->rq_qos in the I/O hot path. However, enabling
a static key at runtime requires acquiring both cpu_hotplug_lock and
jump_label_mutex. When this happens after the queue has already been
frozen (i.e., while holding ->freeze_lock), it creates a locking
dependency from cpu_hotplug_lock to ->freeze_lock, which leads to a
potential deadlock reported by lockdep [1].
To resolve this, replace the static key mechanism with q->queue_flags:
QUEUE_FLAG_QOS_ENABLED. This flag is evaluated in the fast path before
accessing q->rq_qos. If the flag is set, we proceed to fetch q->rq_qos;
otherwise, the access is skipped.
Since q->queue_flags is commonly accessed in IO hotpath and resides in
the first cacheline of struct request_queue, checking it imposes minimal
overhead while eliminating the deadlock risk.
This change avoids the lockdep splat without introducing performance
regressions.
Nilay Shroff [Thu, 14 Aug 2025 08:24:58 +0000 (13:54 +0530)]
block: decrement block_rq_qos static key in rq_qos_del()
rq_qos_add() increments the block_rq_qos static key when a QoS
policy is attached. When a QoS policy is removed via rq_qos_del(),
we must symmetrically decrement the static key. If this removal drops
the last QoS policy from the queue (q->rq_qos becomes NULL), the
static branch can be disabled and the jump label patched to a NOP,
avoiding overhead on the hot path.
This change ensures rq_qos_add()/rq_qos_del() keep the
block_rq_qos static key balanced and prevents leaving the branch
permanently enabled after the last policy is removed.
Fixes: 033b667a823e ("block: blk-rq-qos: guard rq-qos helpers by static key") Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20250814082612.500845-3-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
Nilay Shroff [Thu, 14 Aug 2025 08:24:57 +0000 (13:54 +0530)]
block: skip q->rq_qos check in rq_qos_done_bio()
If a bio has BIO_QOS_THROTTLED or BIO_QOS_MERGED set,
it implicitly guarantees that q->rq_qos is present.
Avoid re-checking q->rq_qos in this case and call
__rq_qos_done_bio() directly as a minor optimization.
Linus Torvalds [Thu, 21 Aug 2025 11:54:01 +0000 (04:54 -0700)]
Merge tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux
Pull crypto library fixes from Eric Biggers:
"Fix a regression where 'make clean' stopped removing some of the
generated assembly files on arm and arm64"
* tag 'libcrypto-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux:
lib/crypto: ensure generated *.S files are removed on make clean
lib/crypto: sha: Update Kconfig help for SHA1 and SHA256
Linus Torvalds [Thu, 21 Aug 2025 11:48:41 +0000 (04:48 -0700)]
Merge tag '6.17-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd
Pull smb server fixes from Steve French:
- fix refcount issue that can cause memory leak
- rate limit repeated connections from IPv6, not just IPv4 addresses
- fix potential null pointer access of smb direct work queue
* tag '6.17-rc2-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: fix refcount leak causing resource not released
ksmbd: extend the connection limiting mechanism to support IPv6
smb: server: split ksmbd_rdma_stop_listening() out of ksmbd_rdma_destroy()