David Matlack [Wed, 26 Nov 2025 23:17:28 +0000 (23:17 +0000)]
vfio: selftests: Stop passing device for IOMMU operations
Drop the struct vfio_pci_device wrappers for IOMMU map/unmap functions
and require tests to directly call iommu_map(), iommu_unmap(), etc. This
results in more concise code, and also makes it clear that the map
operations happen on a struct iommu, not necessarily on a specific
device, which matters especially once multi-device tests are introduced.
Do the same for iova_allocator_init() as that function only needs the
struct iommu, not struct vfio_pci_device.
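As a hedged sketch of the resulting call shape (the wrapper name and the
iommu field are assumptions based on the description, not taken from the
patch):

    /* before: map through the struct vfio_pci_device wrapper */
    vfio_pci_dma_map(device, &region);
    /* after: call the IOMMU helper directly on the struct iommu */
    iommu_map(device->iommu, &region);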
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-14-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:27 +0000 (23:17 +0000)]
vfio: selftests: Move IOVA allocator into iova_allocator.c
Move the IOVA allocator into its own file, to provide better separation
between the allocator and the struct vfio_pci_device helper code.
The allocator could go into iommu.c, but it is standalone enough that a
separate file seems cleaner. This also continues the trend of having a
.c for every major object in VFIO selftests (vfio_pci_device.c,
vfio_pci_driver.c, iommu.c, and now iova_allocator.c).
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-13-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:26 +0000 (23:17 +0000)]
vfio: selftests: Move IOMMU library code into iommu.c
Move all the IOMMU related library code into their own file iommu.c.
This provides a better separation between the vfio_pci_device helper
code and the iommu code.
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-12-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:25 +0000 (23:17 +0000)]
vfio: selftests: Rename struct vfio_dma_region to dma_region
Rename struct vfio_dma_region to dma_region. This is in preparation for
separating the VFIO PCI device library code from the IOMMU library code.
This name change also better reflects the fact that DMA mappings can be
managed by either VFIO or IOMMUFD. i.e. the "vfio_" prefix is
misleading.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-11-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:24 +0000 (23:17 +0000)]
vfio: selftests: Upgrade driver logging to dev_err()
Upgrade various logging in the VFIO selftests drivers from dev_info() to
dev_err(). All of these logs indicate scenarios that may be unexpected.
For example, the logging during probing indicates devices that match but
aren't supported by the driver. And the memcpy errors can indicate
a problem if the caller was not trying to do something like exercise I/O
fault handling. Exercising I/O fault handling is certainly a valid thing
to do, but the driver can't infer the caller's expectations, so better
to just log with dev_err().
Suggested-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-10-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:22 +0000 (23:17 +0000)]
vfio: selftests: Eliminate overly chatty logging
Eliminate overly chatty logs that are printed during almost every test.
These logs are adding more noise than value. If a test cares about this
information it can log it itself. This is especially true as the VFIO
selftests gain support for multiple devices in a single test (which
multiplies all these logs).
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-8-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:19 +0000 (23:17 +0000)]
vfio: selftests: Rename struct vfio_iommu_mode to iommu_mode
Rename struct vfio_iommu_mode to struct iommu_mode since the mode can
include iommufd. This also prepares for splitting out all the IOMMU code
into its own structs/helpers/files which are independent from the
vfio_pci_device code.
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-5-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:18 +0000 (23:17 +0000)]
vfio: selftests: Allow passing multiple BDFs on the command line
Add support for passing multiple device BDFs to a test via the command
line. This is a prerequisite for multi-device tests.
Single-device tests can continue using vfio_selftests_get_bdf(), which
will continue to return argv[argc - 1] (if it is a BDF string), or the
environment variable $VFIO_SELFTESTS_BDF otherwise.
For multi-device tests, a new helper called vfio_selftests_get_bdfs() is
introduced which will return an array of all BDFs found at the end of
argv[], as well as the number of BDFs found (passed back to the caller
via argument). The array of BDFs returned does not need to be freed by
the caller.
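A hedged usage sketch (the exact prototype is an assumption based on the
description above):

    int nr_bdfs;
    const char **bdfs = vfio_selftests_get_bdfs(argc, argv, &nr_bdfs);

    for (int i = 0; i < nr_bdfs; i++)
        printf("device %d: %s\n", i, bdfs[i]);
    /* bdfs does not need to be freed by the caller */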
The environment variable VFIO_SELFTESTS_BDF continues to support only a
single BDF for the time being.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-4-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:17 +0000 (23:17 +0000)]
vfio: selftests: Split run.sh into separate scripts
Split run.sh into separate scripts (setup.sh, run.sh, cleanup.sh) to
enable multi-device testing, and prepare for VFIO selftests
automatically detecting which devices to use for testing by storing
device metadata on the filesystem.
- setup.sh takes one or more BDFs as arguments and sets up each device.
Metadata about each device is stored on the filesystem in the
directory:
${TMPDIR:-/tmp}/vfio-selftests-devices
Within this directory is a directory for each BDF, containing the files
that cleanup.sh uses to clean up the device.
- run.sh runs a selftest by passing it the BDFs of all set up devices.
- cleanup.sh takes zero or more BDFs as arguments and cleans up each
device. If no BDFs are provided, it cleans up all devices.
This split enables multi-device testing by allowing multiple BDFs to be
set up and passed into tests.
In the future, VFIO selftests can automatically detect already-set-up
devices by inspecting ${TMPDIR:-/tmp}/vfio-selftests-devices. This will
avoid the
need for the run.sh script.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-3-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Wed, 26 Nov 2025 23:17:16 +0000 (23:17 +0000)]
vfio: selftests: Move run.sh into scripts directory
Move run.sh into a new sub-directory, scripts/. This directory will be used
to house various helper scripts to be used by humans and automation for
running VFIO selftests.
Opportunistically also switch run.sh from TEST_PROGS_EXTENDED to
TEST_FILES. The former is for actual test executables that are just not
run by default. TEST_FILES is a better fit for helper scripts.
No functional change intended.
Reviewed-by: Alex Mastro <amastro@fb.com> Tested-by: Alex Mastro <amastro@fb.com> Reviewed-by: Raghavendra Rao Ananta <rananta@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20251126231733.3302983-2-dmatlack@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Ankit Agrawal [Thu, 27 Nov 2025 17:06:32 +0000 (17:06 +0000)]
vfio/nvgrace-gpu: wait for the GPU mem to be ready
Speculative CPU prefetches to GPU memory before the GPU is ready
after reset can cause harmless corrected RAS events to be logged
on Grace systems. It is thus preferred that the mapping not be
re-established until the GPU is ready post reset.
GPU readiness can be checked through BAR0 registers, similar to the
check done at device probe time.
It can take several seconds for the GPU to be ready. So it is
desirable that the time overlaps as much of the VM startup as
possible to reduce impact on the VM bootup time. The GPU
readiness state is thus checked on the first fault/huge_fault
request or read/write access, which amortizes the GPU readiness
time.
The first fault and read/write access check the GPU state when the
reset_done flag - which denotes that the GPU has just been reset -
is set. The memory_lock is taken across map/access to avoid races
with GPU reset.
Also check that memory is enabled before waiting for the GPU to be
ready; otherwise the readiness check would block for 30 seconds.
Lastly, add PM handling around read/write access.
Cc: Shameer Kolothum <skolothumtho@nvidia.com> Cc: Alex Williamson <alex@shazbot.org> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Vikram Sethi <vsethi@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-7-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Ankit Agrawal [Thu, 27 Nov 2025 17:06:31 +0000 (17:06 +0000)]
vfio/nvgrace-gpu: Inform devmem unmapped after reset
Introduce a new flag, reset_done, to indicate that the GPU has just
been reset and the mapping to the GPU memory has been zapped.
Implement the reset_done handler to set this new variable. It
will be used later in the series to wait for the GPU memory
to be ready before doing any mapping or access.
Cc: Jason Gunthorpe <jgg@ziepe.ca> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Suggested-by: Alex Williamson <alex@shazbot.org> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-6-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Ankit Agrawal [Thu, 27 Nov 2025 17:06:30 +0000 (17:06 +0000)]
vfio/nvgrace-gpu: split the code to wait for GPU ready
Split the function that checks for the GPU device being ready at
probe time.
Move the code that waits for the GPU to be ready through BAR0 register
reads into a separate function so that it can be reused.
This also fixes a bug where the return status in case of timeout
gets overridden by the return from pci_enable_device(). With the fix,
a timeout generates an error as initially intended.
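The bug pattern, roughly (the helper name is hypothetical):

    ret = nvgrace_gpu_wait_device_ready(nvdev);  /* may return a timeout error */
    ret = pci_enable_device(pdev);               /* before: clobbered that error */

    ret = nvgrace_gpu_wait_device_ready(nvdev);  /* after: timeout is propagated */
    if (ret)
        return ret;
    ret = pci_enable_device(pdev);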
Fixes: d85f69d520e6 ("vfio/nvgrace-gpu: Check the HBM training and C2C link status") Reviewed-by: Zhi Wang <zhiw@nvidia.com> Reviewed-by: Shameer Kolothum <skolothumtho@nvidia.com> Signed-off-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251127170632.3477-5-ankita@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Ankit Agrawal [Thu, 27 Nov 2025 17:06:28 +0000 (17:06 +0000)]
vfio/nvgrace-gpu: Add support for huge pfnmap
NVIDIA's Grace based systems have large device memory. The device
memory is mapped as VM_PFNMAP in the VMM VMA. The nvgrace-gpu
module could make use of the huge PFNMAP support added in mm [1].
To make use of the huge pfnmap support, a fault/huge_fault ops based
mapping mechanism needs to be implemented. Currently the nvgrace-gpu
module relies on remap_pfn_range() to do the mapping during VM bootup.
Replace it with fault handling that uses vfio_pci_vmf_insert_pfn
to set up the mapping.
Moreover, to enable huge pfnmap, the nvgrace-gpu module gains a
huge_fault ops implementation that establishes the mapping according
to the requested order. Note that if the PFN or the VMA address is
unaligned to the order, the mapping falls back to the PTE level.
Alex Mastro [Wed, 26 Nov 2025 01:11:18 +0000 (17:11 -0800)]
dma-buf: fix integer overflow in fill_sg_entry() for buffers >= 8GiB
fill_sg_entry() splits large DMA buffers into multiple scatter-gather
entries, each holding up to UINT_MAX bytes. When calculating the DMA
address for entries beyond the second one, the expression (i * UINT_MAX)
causes integer overflow due to 32-bit arithmetic.
This manifests when the input length is >= 8 GiB, which results in loop
iterations with i >= 2.
Fix by casting i to dma_addr_t before multiplication.
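In other words (field and variable names assumed from the description):

    /* before: 'i' is a 32-bit unsigned int, so i * UINT_MAX wraps for i >= 2 */
    sgl->dma_address = addr + i * UINT_MAX;
    /* after: promote to 64-bit dma_addr_t before the multiplication */
    sgl->dma_address = addr + (dma_addr_t)i * UINT_MAX;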
Fixes: 3aa31a8bb11e ("dma-buf: provide phys_vec to scatter-gather mapping routine") Signed-off-by: Alex Mastro <amastro@fb.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Leon Romanovsky <leon@kernel.org> Link: https://lore.kernel.org/r/20251125-dma-buf-overflow-v1-1-b70ea1e6c4ba@fb.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Remove the pci_bus_sem->igate leg of the triangle by using RCU
to peek at the eventfd rather than locking it with igate.
Fixes: 3be3a074cf5b ("vfio-pci: Don't use device_lock around AER interrupt setup") Signed-off-by: Alex Williamson <alex.williamson@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251124223623.2770706-1-alex@shazbot.org Signed-off-by: Alex Williamson <alex@shazbot.org>
Jason Gunthorpe [Thu, 20 Nov 2025 09:28:30 +0000 (11:28 +0200)]
vfio/nvgrace: Support get_dmabuf_phys
Call vfio_pci_core_fill_phys_vec() with the proper physical ranges for the
synthetic BAR 2 and BAR 4 regions. Otherwise use the normal flow based on
the PCI BAR.
This demonstrates a DMABUF that follows the region info report to only
allow mapping parts of the region that are mmapable. Since the BAR is
power-of-two sized and the "CXL" region is just page aligned, there can
be a padding region at the end that is not mmapped or passed into the
DMABUF.
The "CXL" ranges that are remapped into BAR 2 and BAR 4 areas are not PCI
MMIO, they actually run over the CXL-like coherent interconnect and for
the purposes of DMA behave identically to DRAM. We don't try to model this
distinction between true PCI BAR memory that takes a real PCI path and the
"CXL" memory that takes a different path in the p2p framework for now.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Reviewed-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-11-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:29 +0000 (11:28 +0200)]
vfio/pci: Add dma-buf export support for MMIO regions
Add support for exporting PCI device MMIO regions through dma-buf,
enabling safe sharing of non-struct page memory with controlled
lifetime management. This allows RDMA and other subsystems to import
dma-buf FDs and build them into memory regions for PCI P2P operations.
The implementation provides a revocable attachment mechanism using
dma-buf move operations. MMIO regions are normally pinned as BARs
don't change physical addresses, but access is revoked when the VFIO
device is closed or a PCI reset is issued. This ensures kernel
self-defense against potentially hostile userspace.
Currently VFIO can take MMIO regions from the device's BAR and map
them into a PFNMAP VMA with special PTEs. This mapping type ensures
the memory cannot be used with things like pin_user_pages(), hmm, and
so on. In practice only the user process CPU and KVM can safely make
use of these VMAs. When VFIO shuts down, these VMAs are cleaned by
unmap_mapping_range() to prevent any UAF of the MMIO beyond driver
unbind.
However, VFIO type 1 has an insecure behavior where it uses
follow_pfnmap_*() to fish a MMIO PFN out of a VMA and program it back
into the IOMMU. This has a long history of enabling P2P DMA inside
VMs, but has serious lifetime problems by allowing a UAF of the MMIO
after the VFIO driver has been unbound.
Introduce DMABUF as a new safe way to export a FD based handle for the
MMIO regions. This can be consumed by existing DMABUF importers like
RDMA or DRM without opening a UAF. A following series will add an
importer to iommufd to obsolete the type 1 code and allow safe
UAF-free MMIO P2P in VM cases.
DMABUF has a built in synchronous invalidation mechanism called
move_notify. VFIO keeps track of all drivers importing its MMIO and
can invoke a synchronous invalidation callback to tell the importing
drivers to DMA unmap and forget about the MMIO pfns. This process is
being called revoke. This synchronous invalidation fully prevents any
lifecycle problems. VFIO will do this before unbinding its driver
ensuring there is no UAF of the MMIO beyond the driver lifecycle.
Further, VFIO has additional behavior to block access to the MMIO
during things like Function Level Reset. This is because some poor
platforms may experience an MCE-type crash when touching MMIO of a PCI
device that is undergoing a reset. Today this is done by using
unmap_mapping_range() on the VMAs. Extend that into the DMABUF world
and temporarily revoke the MMIO from the DMABUF importers during FLR
as well. This will more robustly prevent an errant P2P from possibly
upsetting the platform.
A DMABUF FD is a preferred handle for MMIO compared to using something
like a pgmap because:
- VFIO is supported, including its P2P feature, on archs that don't
support pgmap
- PCI devices have all sorts of BAR sizes, including ones smaller
than a section so a pgmap cannot always be created
- It is undesirable to waste a lot of memory for struct pages,
especially for a case like a GPU with ~100GB of BAR size
- We want a synchronous revoke semantic to support FLR with light
hardware requirements
Use the P2P subsystem to help generate the DMA mapping. This is a
significant upgrade over the abuse of dma_map_resource() that has
historically been used by DMABUF exporters. Experience with an OOT
version of this patch shows that real systems do need this. This
approach deals with all the P2P scenarios:
- Non-zero PCI bus_offset
- ACS flags routing traffic to the IOMMU
- ACS flags that bypass the IOMMU - though vfio noiommu is required
to hit this.
There will be further work to formalize the revoke semantic in
DMABUF. For now this acts like a move_notify dynamic exporter where
importer fault handling will get a failure when they attempt to map.
This means that only fully restartable fault capable importers can
import the VFIO DMABUFs. A future revoke semantic should open this up
to more HW as the HW only needs to invalidate, not handle restartable
faults.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-10-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:28 +0000 (11:28 +0200)]
vfio/pci: Enable peer-to-peer DMA transactions by default
Make sure that all VFIO PCI devices have peer-to-peer capabilities
enabled, so that their MMIO memory can be exported through DMABUF.
VFIO has always supported P2P mappings with itself. VFIO type 1
insecurely reads PFNs directly out of a VMA's PTEs and programs them
into the IOMMU allowing any two VFIO devices to perform P2P to each
other.
All existing VMMs use this capability to export P2P into a VM where
the VM could set up any kind of DMA it likes. Projects like DPDK/SPDK
are also known to make use of this, though less frequently.
As a first step to more properly integrating VFIO with the P2P
subsystem unconditionally enable P2P support for VFIO PCI devices. The
struct p2pdma_provider will act as a handle to the P2P subsystem to
do things like DMA mapping.
While real PCI devices have to support P2P (they can't even tell if an
IOVA is P2P or not) there may be fake PCI devices that may trigger
some kind of catastrophic system failure. To date VFIO has never
tripped up on such a case, but if one is discovered the plan is to add
a PCI quirk and have pcim_p2pdma_init() fail. This will fully block
the broken device throughout any users of the P2P subsystem in the
kernel.
Thus P2P through DMABUF will follow the historical VFIO model and be
unconditionally enabled by vfio-pci.
Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-9-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Vivek Kasireddy [Thu, 20 Nov 2025 09:28:27 +0000 (11:28 +0200)]
vfio/pci: Share the core device pointer while invoking feature functions
There is no need to share the main device pointer (struct vfio_device *)
with all the feature functions as they only need the core device
pointer. Therefore, extract the core device pointer once in the
caller (vfio_pci_core_ioctl_feature) and share it instead.
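A hedged sketch of the extraction, assuming the usual container_of()
pattern used by vfio-pci core:

    struct vfio_pci_core_device *vdev =
        container_of(device, struct vfio_pci_core_device, vdev);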
Signed-off-by: Vivek Kasireddy <vivek.kasireddy@intel.com> Reviewed-by: Kevin Tian <kevin.tian@intel.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-8-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:25 +0000 (11:28 +0200)]
dma-buf: provide phys_vec to scatter-gather mapping routine
Add dma_buf_phys_vec_to_sgt() and dma_buf_free_sgt() helpers to convert
an array of MMIO physical address ranges into scatter-gather tables with
proper DMA mapping.
These common functions are a starting point and support any PCI
drivers creating mappings from their BAR's MMIO addresses. VFIO is one
case, as shortly will be RDMA. We can review existing DRM drivers to
refactor them separately. We hope this will evolve into routines to
help common DRM code that includes mixed CPU and MMIO mappings.
Compared to the dma_map_resource() abuse this implementation handles
the complicated PCI P2P scenarios properly, especially when an IOMMU
is enabled:
- Direct bus address mapping without IOVA allocation for
PCI_P2PDMA_MAP_BUS_ADDR, using pci_p2pdma_bus_addr_map(). This
happens if the IOMMU is enabled but the PCIe switch ACS flags allow
transactions to avoid the host bridge.
Further, this handles the slightly obscure case of MMIO with a
phys_addr_t that is different from the physical BAR programming
(bus offset). The phys_addr_t is converted to a dma_addr_t and
accommodates this effect. This enables certain real systems to
work, especially on ARM platforms.
- Mapping through host bridge with IOVA allocation and DMA_ATTR_MMIO
attribute for MMIO memory regions (PCI_P2PDMA_MAP_THRU_HOST_BRIDGE).
This happens when the IOMMU is enabled and the ACS flags are forcing
all traffic to the IOMMU - ie for virtualization systems.
- Cases where P2P is not supported through the host bridge/CPU. The
P2P subsystem is the proper place to detect this and block it.
Helper functions fill_sg_entry() and calc_sg_nents() handle the
scatter-gather table construction, splitting large regions into
UINT_MAX-sized chunks to fit within sg->length field limits.
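A hedged sketch of that chunking (not the exact helper bodies; variable
names are assumptions):

    u64 remaining = len;
    for (unsigned int i = 0; remaining; i++, sgl = sg_next(sgl)) {
        u32 chunk = min_t(u64, remaining, UINT_MAX);

        /* each chunk's DMA address is offset by whole UINT_MAX strides */
        sgl->dma_address = addr + (dma_addr_t)i * UINT_MAX;
        sg_dma_len(sgl) = chunk;
        remaining -= chunk;
    }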
Since the physical address based DMA API forbids use of the CPU list
of the scatterlist, this produces a mangled scatterlist with a fully
zero-length, NULL'd CPU list: all the struct page pointers are NULL
and every entry is zero sized. This is stronger
and more robust than the existing mangle_sg_table() technique. It is
a future project to migrate DMABUF as a subsystem away from using
scatterlist for this data structure.
Reviewed-by: Kevin Tian <kevin.tian@intel.com> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Christian König <christian.koenig@amd.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-6-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:23 +0000 (11:28 +0200)]
PCI/P2PDMA: Provide an access to pci_p2pdma_map_type() function
Provide an access to pci_p2pdma_map_type() function to allow subsystems
to determine the appropriate mapping type for P2PDMA transfers between
a provider and target device.
The pci_p2pdma_map_type() function is the core P2P layer version of
the existing public, but struct page focused, pci_p2pdma_state()
function. It returns the same result. It is required to use the p2p
subsystem from drivers that don't use the struct page layer.
Like __pci_p2pdma_update_state() it is not an exported function. The
idea is that only subsystem code will implement mapping helpers for
taking in phys_addr_t lists, this is deliberately not made accessible
to every driver to prevent abuse.
Following patches will use this function to implement a shared DMA
mapping helper for DMABUF.
Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-4-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:22 +0000 (11:28 +0200)]
PCI/P2PDMA: Refactor to separate core P2P functionality from memory allocation
Refactor the PCI P2PDMA subsystem to separate the core peer-to-peer DMA
functionality from the optional memory allocation layer. This creates a
two-tier architecture:
The core layer provides P2P mapping functionality for physical addresses
based on PCI device MMIO BARs and integrates with the DMA API for
mapping operations. This layer is required for all P2PDMA users.
The optional upper layer provides memory allocation capabilities
including gen_pool allocator, struct page support, and sysfs interface
for user space access.
This separation allows subsystems like DMABUF to use only the core P2P
mapping functionality without the overhead of memory allocation features
they don't need. The core functionality is now available through the
new pcim_p2pdma_provider() function that returns a p2pdma_provider
structure.
Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-3-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:21 +0000 (11:28 +0200)]
PCI/P2PDMA: Simplify bus address mapping API
Update the pci_p2pdma_bus_addr_map() function to take a direct pointer
to the p2pdma_provider structure instead of the pci_p2pdma_map_state.
This simplifies the API by removing the need for callers to extract
the provider from the state structure.
The change updates all callers across the kernel (block layer, IOMMU,
DMA direct, and HMM) to pass the provider pointer directly, making
the code more explicit and reducing unnecessary indirection. This
also removes the runtime warning check since callers now have direct
control over which provider they use.
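The call-site change is mechanical, roughly:

    /* before: the provider was extracted from the map state */
    dma_addr = pci_p2pdma_bus_addr_map(&state, phys);
    /* after: pass the provider directly */
    dma_addr = pci_p2pdma_bus_addr_map(provider, phys);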
Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-2-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Leon Romanovsky [Thu, 20 Nov 2025 09:28:20 +0000 (11:28 +0200)]
PCI/P2PDMA: Separate the mmap() support from the core logic
Currently the P2PDMA code requires a pgmap and a struct page to
function. These were serving three important purposes:
- DMA API compatibility, where scatterlist required a struct page as
input
- Life cycle management, the percpu_ref is used to prevent UAF during
device hot unplug
- A way to get the P2P provider data through the pci_p2pdma_pagemap
The DMA API now has a new flow, and has gained phys_addr_t support, so
it no longer needs struct pages to perform P2P mapping.
Lifecycle management can be delegated to the user, DMABUF for instance
has a suitable invalidation protocol that does not require struct page.
Finding the P2P provider data can also be managed by the caller
without need to look it up from the phys_addr.
Split the P2PDMA code into two layers. The optional upper layer,
effectively, provides a way to mmap() P2P memory into a VMA by
providing struct page, pgmap, a genalloc and sysfs.
The lower layer provides the actual P2P infrastructure and is wrapped
up in a new struct p2pdma_provider. Rework the mmap layer to use new
p2pdma_provider based APIs.
Drivers that do not want to put P2P memory into VMAs can allocate a
struct p2pdma_provider after probe() starts and free it before
remove() completes. When DMA mapping the driver must convey the struct
p2pdma_provider to the DMA mapping code along with a phys_addr of the
MMIO BAR slice to map. The driver must ensure that no DMA mapping
outlives the lifetime of the struct p2pdma_provider.
The intended target of this new API layer is DMABUF. There is usually
only a single p2pdma_provider for a DMABUF exporter. Most drivers can
establish the p2pdma_provider during probe, access the single instance
during DMABUF attach and use that to drive the DMA mapping.
DMABUF provides an invalidation mechanism that can guarantee all DMA
is halted and the DMA mappings are undone prior to destroying the
struct p2pdma_provider. This ensures there is no UAF through DMABUFs
that are lingering past driver removal.
The new p2pdma_provider layer cannot be used to create P2P memory that
can be mapped into VMAs, be used with pin_user_pages(), O_DIRECT, and
so on. These use cases must still use the mmap() layer. The
p2pdma_provider layer is principally for DMABUF-like use cases where
DMABUF natively manages the life cycle and access instead of
vmas/pin_user_pages()/struct page.
In addition, remove the bus_off field from pci_p2pdma_map_state since
it duplicates information already available in the pgmap structure.
The bus_offset is only used in one location (pci_p2pdma_bus_addr_map)
and is always identical to pgmap->bus_offset.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Alex Mastro <amastro@fb.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: Ankit Agrawal <ankita@nvidia.com> Link: https://lore.kernel.org/r/20251120-dmabuf-vfio-v9-1-d7f71607f371@nvidia.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Jason Gunthorpe [Fri, 7 Nov 2025 17:41:17 +0000 (13:41 -0400)]
vfio: Provide a get_region_info op
Instead of hooking the general ioctl op, have the core code directly
decode VFIO_DEVICE_GET_REGION_INFO and call an op just for it.
This is intended to allow mechanical changes to the drivers to pull their
VFIO_DEVICE_GET_REGION_INFO into a function. Later patches will improve
the function signature to consolidate more code.
Alex Mastro [Tue, 11 Nov 2025 18:48:27 +0000 (10:48 -0800)]
vfio: selftests: replace iova=vaddr with allocated iovas
vfio_dma_mapping_test and vfio_pci_driver_test currently use iova=vaddr
as part of DMA mapping operations. However, not all IOMMUs support the
same virtual address width as the processor. For instance, older Intel
consumer platforms only support 39-bits of IOMMU address space. On such
platforms, using the virtual address as the IOVA fails.
Make the tests more robust by using iova_allocator to vend IOVAs, which
queries legally accessible IOVAs from the underlying IOMMUFD or VFIO
container.
Alex Mastro [Tue, 11 Nov 2025 18:48:26 +0000 (10:48 -0800)]
vfio: selftests: add iova allocator
Add struct iova_allocator, which gives tests a convenient way to generate
legally-accessible IOVAs to map. This allocator traverses the sorted
available IOVA ranges linearly, requires power-of-two size allocations,
and does not support freeing iova allocations. The assumption is that
tests are not IOVA space-bounded, and will not need to recycle IOVAs.
This is based on Alex Williamson's patch series for adding an IOVA
allocator [1].
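A minimal sketch of such an allocator (struct layout and names are
assumptions; the real implementation lives in the selftest library):

    struct iova_allocator {
        struct iommu_iova_range *ranges;  /* sorted available ranges */
        u32 nranges;
        u32 range_idx;
        u64 next;                         /* next candidate IOVA in current range */
    };

    /* power-of-two sizes only; allocations are never freed */
    static u64 iova_allocator_alloc(struct iova_allocator *a, u64 size)
    {
        for (; a->range_idx < a->nranges; a->range_idx++, a->next = 0) {
            struct iommu_iova_range *r = &a->ranges[a->range_idx];
            u64 iova = ALIGN(max(a->next, r->start), size);

            if (iova + size - 1 <= r->last) {
                a->next = iova + size;
                return iova;
            }
        }
        VFIO_FAIL("out of IOVA space");   /* hypothetical assert helper */
        return 0;
    }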
Alex Mastro [Tue, 11 Nov 2025 18:48:25 +0000 (10:48 -0800)]
vfio: selftests: fix map limit tests to use last available iova
Use the newly available vfio_pci_iova_ranges() to determine the last
legal IOVA, and use this as the basis for vfio_dma_map_limit_test tests.
Fixes: de8d1f2fd5a5 ("vfio: selftests: add end of address space DMA map/unmap tests") Reviewed-by: David Matlack <dmatlack@google.com> Tested-by: David Matlack <dmatlack@google.com> Signed-off-by: Alex Mastro <amastro@fb.com> Link: https://lore.kernel.org/r/20251111-iova-ranges-v3-2-7960244642c5@fb.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Alex Mastro [Tue, 11 Nov 2025 18:48:24 +0000 (10:48 -0800)]
vfio: selftests: add iova range query helpers
VFIO selftests need to map IOVAs from legally accessible ranges, which
could vary between hardware. Tests in vfio_dma_mapping_test.c are making
excessively strong assumptions about which IOVAs can be mapped.
Add vfio_iommu_iova_ranges(), which queries IOVA ranges from the
IOMMUFD or VFIO container associated with the device. The queried ranges
are normalized to IOMMUFD's iommu_iova_range representation so that
handling of IOVA ranges up the stack can be implementation-agnostic.
iommu_iova_range and vfio_iova_range are equivalent, so bias to using the
new interface's struct.
Query IOMMUFD's ranges with IOMMU_IOAS_IOVA_RANGES.
Query VFIO container's ranges with VFIO_IOMMU_GET_INFO and
VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE.
The underlying vfio_iommu_type1_info buffer-related functionality has
been kept generic so the same helpers can be used to query other
capability chain information, if needed.
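A hedged usage sketch (the prototype is an assumption based on the
description above):

    struct iommu_iova_range *ranges;
    u32 nranges = vfio_iommu_iova_ranges(device, &ranges);
    u64 last_iova = ranges[nranges - 1].last;  /* e.g. basis for map-limit tests */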
hisi_acc_vfio_pci: Add .match_token_uuid callback in hisi_acc_vfio_pci_migrn_ops
Commit 86624ba3b522 ("vfio/pci: Do vf_token checks for
VFIO_DEVICE_BIND_IOMMUFD") accidentally omitted the .match_token_uuid
callback from the hisi_acc_vfio_pci_migrn_ops struct.
Introduce the missed callback here.
Fixes: 86624ba3b522 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD") Cc: stable@vger.kernel.org Suggested-by: Longfang Liu <liulongfang@huawei.com> Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251031170603.2260022-3-rananta@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
vfio: Fix ksize arg while copying user struct in vfio_df_ioctl_bind_iommufd()
For cases where the user includes a non-zero value in the 'token_uuid_ptr'
field of 'struct vfio_device_bind_iommufd', the copy_struct_from_user()
in vfio_df_ioctl_bind_iommufd() fails with -E2BIG. For the 'minsz' passed,
copy_struct_from_user() expects the newly introduced field to be zeroed,
which would be incorrect in this case.
Fix this by passing the actual size of the kernel struct. If working
with a newer userspace, copy_struct_from_user() would copy the
'token_uuid_ptr' field, and if working with an old userspace, it would
zero out this field, thus still retaining backward compatibility.
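In effect (variable names assumed):

    /* before: bytes past minsz must be zero, so a set token_uuid_ptr -> -E2BIG */
    ret = copy_struct_from_user(&bind, minsz, arg, usize);
    /* after: pass the full kernel struct size; older userspace still works
     * because copy_struct_from_user() zero-fills the tail */
    ret = copy_struct_from_user(&bind, sizeof(bind), arg, usize);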
Fixes: 86624ba3b522 ("vfio/pci: Do vf_token checks for VFIO_DEVICE_BIND_IOMMUFD") Cc: stable@vger.kernel.org Signed-off-by: Raghavendra Rao Ananta <rananta@google.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20251031170603.2260022-2-rananta@google.com Signed-off-by: Alex Williamson <alex@shazbot.org>
Longfang Liu [Thu, 30 Oct 2025 01:57:44 +0000 (09:57 +0800)]
hisi_acc_vfio_pci: adapt to new migration configuration
On new platforms greater than QM_HW_V3, the migration region has been
relocated from the VF to the PF. The VF's own configuration space is
restored to the complete 64KB, and there is no need to divide the
size of the BAR configuration space equally. The driver should be
modified accordingly to adapt to the new hardware device.
On the older hardware platform QM_HW_V3, the live migration configuration
region is placed in the latter 32K portion of the VF's BAR2 configuration
space. On the new hardware platform QM_HW_V4, the live migration
configuration region also exists in the same 32K area immediately following
the VF's BAR2, just like on QM_HW_V3.
However, access to this region is now controlled by hardware. Additionally,
a copy of the live migration configuration region is present in the PF's
BAR2 configuration space. On the new hardware platform QM_HW_V4, when an
older version of the driver is loaded, it behaves like QM_HW_V3 and uses
the configuration region in the VF, ensuring that the live migration
function continues to work normally. When the new version of the driver is
loaded, it directly uses the configuration region in the PF. Meanwhile,
hardware configuration disables the live migration configuration region
in the VF's BAR2: reads return all 0xF values, and writes are silently
ignored.
Longfang Liu [Thu, 30 Oct 2025 01:57:43 +0000 (09:57 +0800)]
crypto: hisilicon - qm updates BAR configuration
On new platforms greater than QM_HW_V3, the configuration region for the
live migration function of the accelerator device is no longer
placed in the VF, but is instead placed in the PF.
Therefore, the configuration region of the live migration function
needs to be opened when the QM driver is loaded. When the QM driver
is uninstalled, the driver needs to clear this configuration.
Signed-off-by: Longfang Liu <liulongfang@huawei.com> Reviewed-by: Shameer Kolothum <shameerkolothum@gmail.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Link: https://lore.kernel.org/r/20251030015744.131771-2-liulongfang@huawei.com Signed-off-by: Alex Williamson <alex@shazbot.org>
David Matlack [Mon, 22 Sep 2025 22:48:57 +0000 (22:48 +0000)]
vfio: selftests: Store libvfio build outputs in $(OUTPUT)/libvfio
Store the tools/testing/selftests/vfio/lib outputs (e.g. object files)
in $(OUTPUT)/libvfio rather than in $(OUTPUT)/lib. This is in
preparation for building the VFIO selftests library into the KVM
selftests (see Link below).
Specifically this will avoid name conflicts between
tools/testing/selftests/{vfio,kvm}/lib and also avoid leaving behind
empty directories under tools/testing/selftests/kvm after a make clean.
Linus Torvalds [Sat, 1 Nov 2025 17:49:12 +0000 (10:49 -0700)]
Merge tag 'regulator-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator
Pull regulator fix from Mark Brown:
"A simple fix for a missed part of an API conversion in the bd718x7
driver"
* tag 'regulator-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regulator:
regulator: bd718x7: Fix voltages scaled by resistor divider
Linus Torvalds [Sat, 1 Nov 2025 17:45:39 +0000 (10:45 -0700)]
Merge tag 'regmap-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap
Pull regmap fixes from Mark Brown:
"One documentation fix and a fix for a problem with the slimbus regmap
which was uncovered by some changes in one of the drivers"
* tag 'regmap-fix-v6.18-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap:
regmap: irq: Correct documentation of wake_invert flag
regmap: slimbus: fix bus_context pointer in regmap init calls
Linus Torvalds [Sat, 1 Nov 2025 17:20:07 +0000 (10:20 -0700)]
Merge tag 'x86-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 fixes from Ingo Molnar:
- Limit AMD microcode Entrysign sha256 signature checking to
known CPU generations
- Disable AMD RDSEED32 on certain Zen5 CPUs whose microcode
predates the microcode-based fix issued for the AMD-SB-7055
erratum
- Fix FPU AMD XFD state synchronization on signal delivery
- Fix (work around) an SSE4a-disassembly related build failure
on X86_NATIVE_CPU=y builds
- Extend the AMD Zen6 model space with a new range of models
- Fix <asm/intel-family.h> CPU model comments
- Fix the CONFIG_CFI=y and CONFIG_LTO_CLANG_FULL=y build, which
was unhappy due to missing kCFI type annotations of clear_page()
variants
* tag 'x86-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm: Ensure clear_page() variants always have __kcfi_typeid_ symbols
x86/cpu: Add/fix core comments for {Panther,Nova} Lake
x86/CPU/AMD: Extend Zen6 model range
x86/build: Disable SSE4a
x86/fpu: Ensure XFD state on signal delivery
x86/CPU/AMD: Add RDSEED fix for Zen5
x86/microcode/AMD: Limit Entrysign signature checking to known generations
Linus Torvalds [Sat, 1 Nov 2025 17:17:40 +0000 (10:17 -0700)]
Merge tag 'perf-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf event fixes from Ingo Molnar:
"Miscellaneous fixes and CPU model updates:
- Fix an out-of-bounds access on non-hybrid platforms in the Intel
PMU DS code, reported by KASAN
- Add WildcatLake PMU and uncore support: it's identical to the
PantherLake version"
* tag 'perf-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/uncore: Add uncore PMU support for Wildcat Lake
perf/x86/intel: Add PMU support for WildcatLake
perf/x86/intel: Fix KASAN global-out-of-bounds warning
Linus Torvalds [Sat, 1 Nov 2025 17:07:35 +0000 (10:07 -0700)]
Merge tag 'objtool-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fix from Ingo Molnar:
"Fix objtool warning when faced with raw STAC/CLAC instructions"
* tag 'objtool-urgent-2025-11-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix skip_alt_group() for non-alternative STAC/CLAC
Linus Torvalds [Sat, 1 Nov 2025 17:04:35 +0000 (10:04 -0700)]
Merge tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Carlos Maiolino:
"Just a single bug fix (and documentation for the issue)"
* tag 'xfs-fixes-6.18-rc4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: document another racy GC case in xfs_zoned_map_extent
xfs: prevent gc from picking the same zone twice
Linus Torvalds [Sat, 1 Nov 2025 17:00:53 +0000 (10:00 -0700)]
Merge tag 'kbuild-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux
Pull Kbuild fixes from Nathan Chancellor:
- Formally adopt Kconfig in MAINTAINERS
- Fix install-extmod-build for more O= paths
- Align end of .modinfo to fix Authenticode calculation in EDK2
- Restore dynamic check for '-fsanitize=kernel-memory' in
CONFIG_HAVE_KMSAN_COMPILER to ensure backend target has support
for it
- Initialize locale in menuconfig and nconfig to fix UTF-8 terminals
that may not support VT100 ACS by default like PuTTY
* tag 'kbuild-fixes-6.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux:
kconfig/nconf: Initialize the default locale at startup
kconfig/mconf: Initialize the default locale at startup
KMSAN: Restore dynamic check for '-fsanitize=kernel-memory'
kbuild: align modinfo section for Secureboot Authenticode EDK2 compat
kbuild: install-extmod-build: Fix when given dir outside the build dir
MAINTAINERS: Update Kconfig section
Josh Poimboeuf [Wed, 29 Oct 2025 19:54:08 +0000 (12:54 -0700)]
objtool: Fix skip_alt_group() for non-alternative STAC/CLAC
If an insn->alt points to a STAC/CLAC instruction, skip_alt_group()
assumes it's part of an alternative ("alt group") as opposed to some
other kind of "alt" such as an exception fixup.
While that assumption may hold true in the current code base, Linus has
an out-of-tree patch which breaks that assumption by replacing the
STAC/CLAC alternatives with raw STAC/CLAC instructions.
Make skip_alt_group() more robust by making sure it's actually an alt
group before continuing.
Jakub Horký [Tue, 14 Oct 2025 14:44:06 +0000 (16:44 +0200)]
kconfig/nconf: Initialize the default locale at startup
Fix bug where make nconfig doesn't initialize the default locale, which
causes ncurses menu borders to be displayed incorrectly (lqqqqk) in
UTF-8 terminals that don't support VT100 ACS by default, such as PuTTY.
Jakub Horký [Tue, 14 Oct 2025 15:49:32 +0000 (17:49 +0200)]
kconfig/mconf: Initialize the default locale at startup
Fix bug where make menuconfig doesn't initialize the default locale, which
causes ncurses menu borders to be displayed incorrectly (lqqqqk) in
UTF-8 terminals that don't support VT100 ACS by default, such as PuTTY.
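The usual fix for this class of ncurses bug is to honor the environment's
locale before initializing the screen, along these lines (a sketch of the
standard setlocale() idiom, not necessarily the exact patch):

    #include <locale.h>

    setlocale(LC_ALL, "");  /* pick up the UTF-8 locale before initscr() */
    initscr();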
- Reject negative head_room in __bpf_skb_change_head (Daniel Borkmann)
- Conditionally include dynptr copy kfuncs (Malin Jonsson)
- Sync pending IRQ work before freeing BPF ring buffer (Noorain Eqbal)
- Do not audit capability check in x86 do_jit() (Ondrej Mosnacek)
- Fix arm64 JIT of BPF_ST insn when it writes into arena memory
(Puranjay Mohan)
* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf/arm64: Fix BPF_ST into arena memory
bpf: Make migrate_disable always inline to avoid partial inlining
bpf: Reject negative head_room in __bpf_skb_change_head
bpf: Conditionally include dynptr copy kfuncs
libbpf: Fix powerpc's stack register definition in bpf_tracing.h
bpf: Do not audit capability check in do_jit()
bpf: Sync pending IRQ work before freeing ring buffer
x86/mm: Ensure clear_page() variants always have __kcfi_typeid_ symbols
When building with CONFIG_CFI=y and CONFIG_LTO_CLANG_FULL=y, there is a series
of errors from the various versions of clear_page() not having __kcfi_typeid_
symbols.
$ cat kernel/configs/repro.config
CONFIG_CFI=y
# CONFIG_LTO_NONE is not set
CONFIG_LTO_CLANG_FULL=y
$ make -skj"$(nproc)" ARCH=x86_64 LLVM=1 clean defconfig repro.config bzImage
ld.lld: error: undefined symbol: __kcfi_typeid_clear_page_rep
>>> referenced by ld-temp.o
>>> vmlinux.o:(__cfi_clear_page_rep)
With full LTO, it is possible for LLVM to realize that these functions never
have their address taken (as they are only used within an alternative, which
will make them a direct call) across the whole kernel and either drop or skip
generating their kCFI type identification symbols.
clear_page_{rep,orig,erms}() are defined in clear_page_64.S with
SYM_TYPED_FUNC_START as a result of
2981557cb040 ("x86,kcfi: Fix EXPORT_SYMBOL vs kCFI"),
as exported functions are free to be called indirectly thus need kCFI type
identifiers.
Use KCFI_REFERENCE with these clear_page() functions to force LLVM to see
them as address-taken, and to generate and keep the kCFI type identifiers.
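i.e. roughly:

    /* force LLVM to treat these as address-taken so the kCFI type IDs survive LTO */
    KCFI_REFERENCE(clear_page_rep);
    KCFI_REFERENCE(clear_page_orig);
    KCFI_REFERENCE(clear_page_erms);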
Linus Torvalds [Fri, 31 Oct 2025 21:47:02 +0000 (14:47 -0700)]
Merge tag 'drm-fixes-2025-10-31' of https://gitlab.freedesktop.org/drm/kernel
Pull drm fixes from Simona Vetter:
"Looks like stochastics conspired to make this one a bit bigger, but
nothing scary at all. Also first examples of the new Link: tags, yay!"
* tag 'drm-fixes-2025-10-31' of https://gitlab.freedesktop.org/drm/kernel: (44 commits)
drm/ast: Clear preserved bits from register output value
drm/imx: parallel-display: add the bridge before attaching it
drm/imx: parallel-display: convert to devm_drm_bridge_alloc() API
drm/panel: kingdisplay-kd097d04: Disable EoTp
drm/panel: sitronix-st7789v: fix sync flags for t28cp45tn89
drm/xe: Do not wake device during a GT reset
drm/xe: Fix uninitialized return value from xe_validation_guard()
drm/msm/dpu: Fix adjusted mode clock check for 3d merge
drm/msm/dpu: Disable broken YUV on QSEED2 hardware
drm/msm/dpu: Require linear modifier for writeback framebuffers
drm/msm/dpu: Fix pixel extension sub-sampling
drm/msm/dpu: Disable scaling for unsupported scaler types
drm/msm/dpu: Propagate error from dpu_assign_plane_resources
drm/msm/dpu: Fix allocation of RGB SSPPs without scaling
drm/msm: dsi: fix PLL init in bonded mode
drm/i915/dmc: Clear HRR EVT_CTL/HTP to zero on ADL-S
drm/amd/display: Fix incorrect return of vblank enable on unconfigured crtc
drm/amd/display: Add HDR workaround for a specific eDP
drm/amdgpu: fix SPDX header on cyan_skillfish_reg_init.c
drm/amdgpu: fix SPDX header on irqsrcs_vcn_5_0.h
...
Linus Torvalds [Fri, 31 Oct 2025 21:24:32 +0000 (14:24 -0700)]
Merge tag 'pci-v6.18-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci
Pull pci fixes from Bjorn Helgaas:
- Restore custom qcom ASPM enablement code so L1 PM Substates are
enabled as they were in v6.17 even though the PCI core now enables
just L0s and L1 by default (Bjorn Helgaas)
- Size prefetchable bridge windows only when they actually exist, to
avoid a WARN_ON() regression (Ilpo Järvinen)
* tag 'pci-v6.18-fixes-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci:
PCI: Do not size non-existing prefetchable window
Revert "PCI: qcom: Remove custom ASPM enablement code"
Linus Torvalds [Fri, 31 Oct 2025 21:20:09 +0000 (14:20 -0700)]
Merge tag 'vfio-v6.18-rc4' of https://github.com/awilliam/linux-vfio
Pull VFIO fixes from Alex Williamson:
- Fix overflows in vfio type1 backend for mappings at the end of the
64-bit address space, resulting in leaked pinned memory.
New selftest support included to avoid such issues in the future
(Alex Mastro)
* tag 'vfio-v6.18-rc4' of https://github.com/awilliam/linux-vfio:
vfio: selftests: add end of address space DMA map/unmap tests
vfio: selftests: update DMA map/unmap helpers to support more test kinds
vfio/type1: handle DMA map/unmap up to the addressable limit
vfio/type1: move iova increment to unmap_unpin_*() caller
vfio/type1: sanitize for overflow using check_*_overflow()
Ilpo Järvinen [Mon, 27 Oct 2025 13:24:23 +0000 (15:24 +0200)]
PCI: Do not size non-existing prefetchable window
pbus_size_mem() should only be called for bridge windows that exist, but
__pci_bus_size_bridges() may point 'pref' to a resource that does not exist
(has zero flags) in the case of non-root buses.
When the prefetchable bridge window does not exist, the same
non-prefetchable bridge window is sized more than once, which may result
in duplicate entries in the realloc_head list. Duplicated entries are shown in this
log and trigger a WARN_ON() because realloc_head had residual entries after
the resource assignment algorithm:
pci 0000:00:03.0: [11ab:6820] type 01 class 0x060400 PCIe Root Port
pci 0000:00:03.0: PCI bridge to [bus 00]
pci 0000:00:03.0: bridge window [io 0x0000-0x0fff]
pci 0000:00:03.0: bridge window [mem 0x00000000-0x000fffff]
pci 0000:00:03.0: bridge window [mem 0x00200000-0x003fffff] to [bus 02] add_size 200000 add_align 200000
pci 0000:00:03.0: bridge window [mem 0x00200000-0x003fffff] to [bus 02] add_size 200000 add_align 200000
pci 0000:00:03.0: bridge window [mem 0xe0000000-0xe03fffff]: assigned
pci 0000:00:03.0: PCI bridge to [bus 02]
pci 0000:00:03.0: bridge window [mem 0xe0000000-0xe03fffff]
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at drivers/pci/setup-bus.c:2373 pci_assign_unassigned_root_bus_resources+0x1bc/0x234
Check resource flags of 'pref' and only size the prefetchable window if the
resource has the IORESOURCE_PREFETCH flag.
Fixes: ae88d0b9c57f ("PCI: Use pbus_select_window_for_type() during mem window sizing") Reported-by: Klaus Kudielka <klaus.kudielka@gmail.com> Closes: https://lore.kernel.org/r/51e8cf1c62b8318882257d6b5a9de7fdaaecc343.camel@gmail.com/ Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Klaus Kudielka <klaus.kudielka@gmail.com> Link: https://patch.msgid.link/20251027132423.8841-1-ilpo.jarvinen@linux.intel.com
Prior to a729c1664619 ("PCI: qcom: Remove custom ASPM enablement code"),
the qcom controller driver enabled ASPM, including L0s, L1, and L1 PM
Substates, for all devices powered on at the time the controller driver
enumerates them.
ASPM was *not* enabled for devices powered on later by pwrctrl (unless the
kernel was built with PCIEASPM_POWERSAVE or PCIEASPM_POWER_SUPERSAVE, or
the user enabled ASPM via module parameter or sysfs).
After f3ac2ff14834 ("PCI/ASPM: Enable all ClockPM and ASPM states for
devicetree platforms"), the PCI core enabled all ASPM states for all
devices whether powered on initially or by pwrctrl, so a729c1664619 was
unnecessary and reverted.
But f3ac2ff14834 was too aggressive and broke platforms that didn't support
CLKREQ# or required device-specific configuration for L1 Substates, so df5192d9bb0e ("PCI/ASPM: Enable only L0s and L1 for devicetree platforms")
enabled only L0s and L1.
On Qualcomm platforms, this left L1 Substates disabled, which was a
regression. Revert a729c1664619 so L1 Substates will be enabled on devices
that are initially powered on. Devices powered on by pwrctrl will be
addressed later.
Fixes: df5192d9bb0e ("PCI/ASPM: Enable only L0s and L1 for devicetree platforms") Reported-by: Johan Hovold <johan@kernel.org> Closes: https://lore.kernel.org/lkml/aPuXZlaawFmmsLmX@hovoldconsulting.com/ Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Tested-by: Johan Hovold <johan@kernel.org> Reviewed-by: Manivannan Sadhasivam <mani@kernel.org> Link: https://patch.msgid.link/20251024210514.1365996-1-helgaas@kernel.org
Linus Torvalds [Fri, 31 Oct 2025 19:57:19 +0000 (12:57 -0700)]
Merge tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux
Pull block fixes from Jens Axboe:
- Fix blk-crypto reporting EIO when EINVAL is the correct error code
- Two bug fixes for the block zone support
- NVME pull request via Keith:
- Target side authentication fixup
- Peer-to-peer metadata fixup
- null_blk DMA alignment fix
* tag 'block-6.18-20251031' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux:
null_blk: set dma alignment to logical block size
blk-crypto: use BLK_STS_INVAL for alignment errors
block: make REQ_OP_ZONE_OPEN a write operation
block: fix op_is_zone_mgmt() to handle REQ_OP_ZONE_RESET_ALL
nvme-pci: use blk_map_iter for p2p metadata
nvmet-auth: update sc_c in host response
Linus Torvalds [Fri, 31 Oct 2025 19:50:35 +0000 (12:50 -0700)]
Merge tag 's390-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Heiko Carstens:
- Use correct locking in zPCI event code to avoid deadlock
- Get rid of irqs_registered flag in zpci_dev structure and restore IRQ
unconditionally for zPCI devices. This fixes situations where the
flag was not correctly updated
- Disable (revert) ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP for s390 again.
The optimized hugetlb vmemmap code modifies kernel page tables in a
way which does not work on s390 and leads to reproducible kernel
crashes due to stale TLB entries. This needs to be addressed with
some larger changes. For now simply disable the feature
- Update defconfigs
* tag 's390-6.18-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: Disable ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
s390/mm: Fix memory leak in add_marker() when kvrealloc() fails
s390/pci: Restore IRQ unconditionally for the zPCI device
s390: Update defconfigs
s390/pci: Avoid deadlock between PCI error recovery and mlx5 crdump
Puranjay Mohan [Thu, 30 Oct 2025 12:17:14 +0000 (12:17 +0000)]
bpf/arm64: Fix BPF_ST into arena memory
The arm64 JIT supports BPF_ST with BPF_PROBE_MEM32 (arena) by using the
tmp2 register to hold the dst + arena_vm_base value and using tmp2 as the
new dst register. But this is broken because, when is_lsi_offset()
returns false, tmp2 will be clobbered by emit_a64_mov_i(1, tmp2, off,
ctx), and hence the emitted store instruction will be of the form:
strb w10, [x11, x11]
Fix this by using the third temporary register to hold the dst +
arena_vm_base.
Two functions with identical names but different addresses are
considered ambiguous and removed by "pahole" from vmlinux BTF.
Later resolve_btfids warns since it cannot find them.
Commit 378b7708194f ("sched: Make migrate_{en,dis}able() inline") made
them inlineable in most places, but in vmlinux built with llvm 21 and 22
there are four symbols for migrate_{enable,disable}:
three static functions and one global function.
Fix the issue by marking migrate_{enable,disable} as always inline.
The alternative is to mark them as notrace/nokprobe which is more
drastic. Only bpf programs are prevented from attaching to these
functions. The rest of the tracing shouldn't be affected.
[note: Peter ok-ed the patch, Alexei rewrote commit log]
Fixes: 378b7708194f ("sched: Make migrate_{en,dis}able() inline") Signed-off-by: Yonghong Song <yonghong.song@linux.dev> Acked-by: Menglong Dong <menglong.dong@linux.dev> Link: https://lore.kernel.org/r/20251029183646.3811774-1-yonghong.song@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Linus Torvalds [Fri, 31 Oct 2025 16:34:21 +0000 (09:34 -0700)]
Merge tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb client fixes from Steve French:
- fix potential UAF in statfs
- DFS fix for expired referrals
- fix minor modinfo typo
- small improvement to reconnect for smbdirect
* tag '6.18-rc3-smb-client-fixes' of git://git.samba.org/sfrench/cifs-2.6:
smb: client: call smbd_destroy() in the same splace as kernel_sock_shutdown()/sock_release()
smb: client: handle lack of IPC in dfs_cache_refresh()
smb: client: fix potential cfid UAF in smb2_query_info_compound
cifs: fix typo in enable_gcm_256 module parameter
Hans Holmberg [Fri, 31 Oct 2025 09:48:26 +0000 (10:48 +0100)]
null_blk: set dma alignment to logical block size
This driver assumes that bio vectors are memory aligned to the logical
block size, so set the queue limit to reflect that.
Unless we set up the limit based on the logical block size, we will go
out of page bounds in copy_to_nullb / copy_from_nullb.
Apparently this wasn't noticed so far because none of the tests generate
such buffers, but since commit 851c4c96db00 ("xfs: implement
XFS_IOC_DIOINFO in terms of vfs_getattr") xfstests generates unaligned
I/O, which now lead to memory corruption when using null_blk devices
with 4k block size.
Fixes: bf8d08532bc1 ("iomap: add support for dma aligned direct-io") Fixes: b1a000d3b8ec ("block: relax direct io memory alignment") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Hans Holmberg <hans.holmberg@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Fri, 31 Oct 2025 14:29:09 +0000 (07:29 -0700)]
Merge tag 'sound-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. It became slightly bigger than usual due
to timing issues (holidays, etc), but all changes are rather
device-specific fixes, so not really worrisome.
- ASoC Cirrus codec fixes for AMD
- Various fixes for ASoC Intel AVS, Qualcomm, SoundWire, FSL,
Mediatek, Renesas
- A few HD-audio quirks, and USB-audio regression fixes for Presonus"
* tag 'sound-6.18-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (24 commits)
ALSA: hda/realtek: Enable mic on Vaio RPL
ASoC: dt-bindings: pm4125-sdw: correct number of soundwire ports
ASoC: renesas: rz-ssi: Use proper dma_buffer_pos after resume
ASoC: soc_sdw_utils: remove cs42l43 component_name
ASoC: fsl_sai: Fix sync error in consumer mode
ASoC: Fix build for sdw_utils
ALSA: hda/realtek: Fix mute led for HP Victus 15-fa1xxx (MB 8C2D)
ASoC: rt721: fix prepare clock stop failed
ALSA: usb-audio: don't log messages meant for 1810c when initializing 1824c
ASoC: mediatek: Fix double pm_runtime_disable in remove functions
ASoC: fsl_micfil: correct the endian format for DSD
ASoC: fsl_sai: fix bit order for DSD format
ASoC: Intel: avs: Use snd_codec format when initializing probe
ASoC: Intel: avs: Disable periods-elapsed work when closing PCM
ASoC: Intel: avs: Unprepare a stream when XRUN occurs
ASoC: sdw_utils: add name_prefix for rt1321 part id
ASoC: qdsp6: q6asm: do not sleep while atomic
ASoC: Intel: soc-acpi-intel-ptl-match: Remove cs42l43 match from sdw link3
ASOC: max98090/91: fix for filter configuration: AHPF removed DMIC2_HPF added
ASoC: amd: acp: Add ACP7.0 match entries for cs35l56 and cs42l43
...
Linus Torvalds [Fri, 31 Oct 2025 14:25:10 +0000 (07:25 -0700)]
Merge tag 'v6.18-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto fixes from Herbert Xu:
- Fix double free in aspeed
- Fix req->nbytes clobbering in s390/phmac
* tag 'v6.18-p4' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
crypto: aspeed - fix double free caused by devm
crypto: s390/phmac - Do not modify the req->nbytes value