]> git.ipfire.org Git - thirdparty/qemu.git/log
thirdparty/qemu.git
8 weeks agointel-iommu: Move dma_translation to x86-iommu
Joao Martins [Fri, 19 Sep 2025 21:35:14 +0000 (21:35 +0000)] 
intel-iommu: Move dma_translation to x86-iommu

To be later reused by AMD, now that it shares similar property.

Signed-off-by: Joao Martins <joao.m.martins@oracle.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-22-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Refactor amdvi_page_walk() to use common code for page walk
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:13 +0000 (21:35 +0000)] 
amd_iommu: Refactor amdvi_page_walk() to use common code for page walk

Simplify amdvi_page_walk() by making it call the fetch_pte() helper that is
already in use by the shadow page synchronization code. Ensures all code
uses the same page table walking algorithm.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-21-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Do not assume passthrough translation when DTE[TV]=0
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:12 +0000 (21:35 +0000)] 
amd_iommu: Do not assume passthrough translation when DTE[TV]=0

The AMD I/O Virtualization Technology (IOMMU) Specification (see Table
8: V, TV, and GV Fields in Device Table Entry), specifies that a DTE
with V=1, TV=0 does not contain a valid address translation information.
If a request requires a table walk, the walk is terminated when this
condition is encountered.

Do not assume that addresses for a device with DTE[TV]=0 are passed
through (i.e. not remapped) and instead terminate the page table walk
early.

Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU")
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-20-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Toggle address translation mode on devtab entry invalidation
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:11 +0000 (21:35 +0000)] 
amd_iommu: Toggle address translation mode on devtab entry invalidation

A guest must issue an INVALIDATE_DEVTAB_ENTRY command after changing a
Device Table entry (DTE) e.g. after attaching a device and setting up its
DTE. When intercepting this event, determine if the DTE has been configured
for paging or not, and toggle the appropriate memory regions to allow DMA
address translation for the address space if needed. Requires dma-remap=on.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-19-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Add dma-remap property to AMD vIOMMU device
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:10 +0000 (21:35 +0000)] 
amd_iommu: Add dma-remap property to AMD vIOMMU device

In order to enable device assignment with IOMMU protection and guest DMA
address translation, IOMMU MAP notifier support is necessary to allow users
like VFIO to synchronize the shadow page tables i.e. to receive
notifications when the guest updates its I/O page tables and replay the
mappings onto host I/O page tables.

Provide a new dma-remap property to govern the ability to register for MAP
notifications, effectively providing global control over the DMA address
translation functionality that was implemented in previous changes.

Note that DMA remapping support also requires the vIOMMU is configured with
the NpCache capability, so a guest driver issues IOMMU invalidations for
both map() and unmap() operations. This capability is already set by default
and written to the configuration in amdvi_pci_realize() as part of
AMDVI_CAPAB_FEATURES.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-18-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Set all address spaces to use passthrough mode on reset
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:09 +0000 (21:35 +0000)] 
amd_iommu: Set all address spaces to use passthrough mode on reset

On reset, restore the default address translation mode (passthrough) for all
the address spaces managed by the vIOMMU.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-17-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Toggle memory regions based on address translation mode
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:08 +0000 (21:35 +0000)] 
amd_iommu: Toggle memory regions based on address translation mode

Enable the appropriate memory region for an address space depending on the
address translation mode selected for it. This is currently based on a
generic x86 IOMMU property, and only done during the address space
initialization. Extract the code into a helper and toggle the regions based
on whether the specific address space is using address translation (via the
newly introduced addr_translation field). Later, region activation will also
be controlled by availability of DMA remapping capability (via dma-remap
property to be introduced in follow up changes).

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-16-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:07 +0000 (21:35 +0000)] 
amd_iommu: Invalidate address translations on INVALIDATE_IOMMU_ALL

When the kernel IOMMU driver issues an INVALIDATE_IOMMU_ALL, the address
translation and interrupt remapping information must be cleared for all
Device IDs and all domains. Introduce a helper to sync the shadow page table
for all the address spaces with registered notifiers, which replays both MAP
and UNMAP events.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-15-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Add replay callback
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:06 +0000 (21:35 +0000)] 
amd_iommu: Add replay callback

A replay() method is necessary to efficiently synchronize the host page
tables after VFIO registers a notifier for IOMMU events. It is called to
ensure that existing mappings from an IOMMU memory region are "replayed" to
a specified notifier, initializing or updating the shadow page tables on the
host.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-14-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Unmap all address spaces under the AMD IOMMU on reset
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:05 +0000 (21:35 +0000)] 
amd_iommu: Unmap all address spaces under the AMD IOMMU on reset

Support dropping all existing mappings on reset. When the guest kernel
reboots it will create new ones, but other components that run before
the kernel (e.g. OVMF) should not be able to use existing mappings from
the previous boot.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-13-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Use iova_tree records to determine large page size on UNMAP
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:04 +0000 (21:35 +0000)] 
amd_iommu: Use iova_tree records to determine large page size on UNMAP

Keep a record of mapped IOVA ranges per address space, using the iova_tree
implementation. Besides enabling optimizations like avoiding unnecessary
notifications, a record of existing <IOVA, size> mappings makes it possible
to determine if a specific IOVA is mapped by the guest using a large page,
and adjust the size when notifying UNMAP events.

When unmapping a large page, the information in the guest PTE encoding the
page size is lost, since the guest clears the PTE before issuing the
invalidation command to the IOMMU. In such case, the size of the original
mapping can be retrieved from the iova_tree and used to issue the UNMAP
notification. Using the correct size is essential since the VFIO IOMMU
Type1v2 driver in the host kernel will reject unmap requests that do not
fully cover previous mappings.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-12-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Sync shadow page tables on page invalidation
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:03 +0000 (21:35 +0000)] 
amd_iommu: Sync shadow page tables on page invalidation

When the guest issues an INVALIDATE_IOMMU_PAGES command, decode the address
and size of the invalidation and sync the guest page table state with the
host. This requires walking the guest page table and calling notifiers
registered for address spaces matching the domain ID encoded in the command.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-11-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Add basic structure to support IOMMU notifier updates
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:02 +0000 (21:35 +0000)] 
amd_iommu: Add basic structure to support IOMMU notifier updates

Add the minimal data structures required to maintain a list of address
spaces (i.e. devices) with registered notifiers, and to update the type of
events that require notifications.
Note that the ability to register for MAP notifications is not available.
It will be unblocked by following changes that enable the synchronization of
guest I/O page tables with host IOMMU state, at which point an amd-iommu
device property will be introduced to control this capability.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-10-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Add a page walker to sync shadow page tables on invalidation
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:01 +0000 (21:35 +0000)] 
amd_iommu: Add a page walker to sync shadow page tables on invalidation

For the specified address range, walk the page table identifying regions
as mapped or unmapped and invoke registered notifiers with the
corresponding event type.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-9-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Add helpers to walk AMD v1 Page Table format
Alejandro Jimenez [Fri, 19 Sep 2025 21:35:00 +0000 (21:35 +0000)] 
amd_iommu: Add helpers to walk AMD v1 Page Table format

The current amdvi_page_walk() is designed to be called by the replay()
method. Rather than drastically altering it, introduce helpers to fetch
guest PTEs that will be used by a page walker implementation.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-8-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Return an error when unable to read PTE from guest memory
Alejandro Jimenez [Fri, 19 Sep 2025 21:34:59 +0000 (21:34 +0000)] 
amd_iommu: Return an error when unable to read PTE from guest memory

Make amdvi_get_pte_entry() return an error value (-1) in cases where the
memory read fails, versus the current return of 0 to indicate failure.
The reason is that 0 is also a valid value to have stored in the PTE in
guest memory i.e. the guest does not have a mapping. Before this change,
amdvi_get_pte_entry() returned 0 for both an error and for empty PTEs, but
the page walker implementation that will be introduced in upcoming changes
needs a method to differentiate between the two scenarios.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-7-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Add helper function to extract the DTE
Alejandro Jimenez [Fri, 19 Sep 2025 21:34:58 +0000 (21:34 +0000)] 
amd_iommu: Add helper function to extract the DTE

Extracting the DTE from a given AMDVIAddressSpace pointer structure is a
common operation required for syncing the shadow page tables. Implement a
helper to do it and check for common error conditions.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-6-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Helper to decode size of page invalidation command
Alejandro Jimenez [Fri, 19 Sep 2025 21:34:57 +0000 (21:34 +0000)] 
amd_iommu: Helper to decode size of page invalidation command

The size of the region to invalidate depends on the S bit and address
encoded in the command. Add a helper to extract this information, which
will be used to sync shadow page tables in upcoming changes.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-5-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Reorder device and page table helpers
Alejandro Jimenez [Fri, 19 Sep 2025 21:34:56 +0000 (21:34 +0000)] 
amd_iommu: Reorder device and page table helpers

Move code related to Device Table and Page Table to an earlier location in
the file, where it does not require forward declarations to be used by the
various invalidation functions that will need to query the DTE and walk the
page table in upcoming changes.

This change consist of code movement only, no functional change intended.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-4-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoamd_iommu: Document '-device amd-iommu' common options
Alejandro Jimenez [Fri, 19 Sep 2025 21:34:55 +0000 (21:34 +0000)] 
amd_iommu: Document '-device amd-iommu' common options

Document the common parameters used when emulating AMD vIOMMU.
Besides the two amd-iommu specific options: 'xtsup' and 'dma-remap', the
the generic x86 IOMMU option 'intremap' is also included, since it is
typically specified in QEMU command line examples and mailing list threads.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-3-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agomemory: Adjust event ranges to fit within notifier boundaries
Alejandro Jimenez [Fri, 19 Sep 2025 21:34:54 +0000 (21:34 +0000)] 
memory: Adjust event ranges to fit within notifier boundaries

Invalidating the entire address space (i.e. range of [0, ~0ULL]) is a
valid and required operation by vIOMMU implementations. However, such
invalidations currently trigger an assertion unless they originate from
device IOTLB invalidations.

Although in recent Linux guests this case is not exercised by the VTD
implementation due to various optimizations, the assertion will be hit
by upcoming AMD vIOMMU changes to support DMA address translation. More
specifically, when running a Linux guest with VFIO passthrough device,
and a kernel that does not contain commmit 3f2571fed2fa ("iommu/amd:
Remove redundant domain flush from attach_device()").

Remove the assertion altogether and adjust the range to ensure it does
not cross notifier boundaries.

Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Xu <peterx@redhat.com>
Message-Id: <20201116165506.31315-6-eperezma@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250919213515.917111-2-alejandro.j.jimenez@oracle.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agopcie_sriov: make pcie_sriov_pf_exit() safe on non-SR-IOV devices
Stefan Hajnoczi [Wed, 24 Sep 2025 15:51:53 +0000 (11:51 -0400)] 
pcie_sriov: make pcie_sriov_pf_exit() safe on non-SR-IOV devices

Commit 3f9cfaa92c96 ("virtio-pci: Implement SR-IOV PF") added an
unconditional call from virtio_pci_exit() to pcie_sriov_pf_exit().

pcie_sriov_pf_exit() reads from the SR-IOV Capability in Configuration
Space:

  uint8_t *cfg = dev->config + dev->exp.sriov_cap;
  ...
  unparent_vfs(dev, pci_get_word(cfg + PCI_SRIOV_TOTAL_VF));
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This results in undefined behavior when dev->exp.sriov_cap is 0 because
this is not an SR-IOV device. For example, unparent_vfs() segfaults when
total_vfs happens to be non-zero.

Fix this by returning early from pcie_sriov_pf_exit() when
dev->exp.sriov_cap is 0 because this is not an SR-IOV device.

Cc: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Cc: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Qing Wang <qinwang@redhat.com>
Buglink: https://issues.redhat.com/browse/RHEL-116443
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Fixes: cab1398a60eb ("pcie_sriov: Reuse SR-IOV VF device instances")
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250924155153.579495-1-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agotests/virtio-scsi: add a virtio_error() IOThread test
Stefan Hajnoczi [Mon, 22 Sep 2025 22:01:49 +0000 (18:01 -0400)] 
tests/virtio-scsi: add a virtio_error() IOThread test

Now that virtio_error() calls should work in an IOThread, add a
virtio-scsi IOThread test cases that triggers virtio_error().

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250922220149.498967-6-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agotests/libqos: extract qvirtqueue_set_avail_idx()
Stefan Hajnoczi [Mon, 22 Sep 2025 22:01:48 +0000 (18:01 -0400)] 
tests/libqos: extract qvirtqueue_set_avail_idx()

Setting the vring's avail.idx can be useful for low-level VIRTIO tests,
especially for testing error scenarios with invalid vrings. Extract it
into a new function so that the next commit can add a test that uses
this new test API.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Acked-by: Fabiano Rosas <farosas@suse.de>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250922220149.498967-5-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio: support irqfd in virtio_notify_config()
Stefan Hajnoczi [Mon, 22 Sep 2025 22:01:47 +0000 (18:01 -0400)] 
virtio: support irqfd in virtio_notify_config()

virtio_error() calls virtio_notify_config() to inject a VIRTIO
Configuration Change Notification. This doesn't work from IOThreads
because the BQL is not held and the interrupt code path requires the
BQL.

Follow the same approach as virtio_notify() and use ->config_notifier
(an irqfd) when called from the IOThread.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250922220149.498967-4-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio: unify virtio_notify_irqfd() and virtio_notify()
Stefan Hajnoczi [Mon, 22 Sep 2025 22:01:46 +0000 (18:01 -0400)] 
virtio: unify virtio_notify_irqfd() and virtio_notify()

The difference between these two functions:
- virtio_notify() uses the interrupt code path (MSI or classic IRQs)
- virtio_notify_irqfd() uses guest notifiers (irqfds)

virtio_notify() can only be called with the BQL held because the
interrupt code path requires the BQL. Device models use
virtio_notify_irqfd() from IOThreads since the BQL is not held.

The two functions can be unified by pushing down the if
(qemu_in_iothread()) check from virtio-blk and virtio-scsi into core
virtio code. This is in preparation for the next commit that will add
irqfd support to virtio_notify_config() and where it's unattractive to
introduce another irqfd-only API for device model callers.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250922220149.498967-3-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovhost: use virtio_config_get_guest_notifier()
Stefan Hajnoczi [Mon, 22 Sep 2025 22:01:45 +0000 (18:01 -0400)] 
vhost: use virtio_config_get_guest_notifier()

There is a getter function so avoid accessing the ->config_notifier
field directly.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250922220149.498967-2-stefanha@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agox86: ich9: fix default value of 'No Reboot' bit in GCS
Igor Mammedov [Mon, 22 Sep 2025 13:26:00 +0000 (15:26 +0200)] 
x86: ich9: fix default value of 'No Reboot' bit in GCS

[2] initialized 'No Reboot' bit to 1 by default. And due to quirk it happened
to work with linux iTCO_wdt driver (which clears it on module load).

However spec [1] states:
"
R/W. This bit is set when the “No Reboot” strap (SPKR pin on
ICH9) is sampled high on PWROK.
"

So it should be set only when  '-global ICH9-LPC.noreboot=true' and cleared
when it's false (which should be default).

Fix it to behave according to spec and set 'No Reboot' bit only when
'-global ICH9-LPC.noreboot=true'.

1)
Intel I/O Controller Hub 9 (ICH9) Family Datasheet (rev: 004)
2)

Fixes: 920557971b6 (ich9: add TCO interface emulation)
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Tested-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250922132600.562193-1-imammedo@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agointel_iommu: Add PRI operations support
CLEMENT MATHIEU--DRIF [Mon, 1 Sep 2025 11:17:24 +0000 (11:17 +0000)] 
intel_iommu: Add PRI operations support

Implement the PRI callbacks in vtd_iommu_ops.

Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901111630.1018573-6-clement.mathieu--drif@eviden.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agointel_iommu: Declare registers for PRI
CLEMENT MATHIEU--DRIF [Mon, 1 Sep 2025 11:17:23 +0000 (11:17 +0000)] 
intel_iommu: Declare registers for PRI

Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901111630.1018573-5-clement.mathieu--drif@eviden.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agointel_iommu: Declare PRI constants and structures
CLEMENT MATHIEU--DRIF [Mon, 1 Sep 2025 11:17:21 +0000 (11:17 +0000)] 
intel_iommu: Declare PRI constants and structures

Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901111630.1018573-4-clement.mathieu--drif@eviden.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agointel_iommu: Bypass barrier wait descriptor
CLEMENT MATHIEU--DRIF [Mon, 1 Sep 2025 11:17:20 +0000 (11:17 +0000)] 
intel_iommu: Bypass barrier wait descriptor

wait_desc with SW=0,IF=0,FN=1 must not be considered as an
invalid descriptor as it is used to implement section 7.10 of
the VT-d spec.

Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901111630.1018573-3-clement.mathieu--drif@eviden.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agopcie: Add a way to get the outstanding page request allocation (pri) from the config...
CLEMENT MATHIEU--DRIF [Mon, 1 Sep 2025 11:17:19 +0000 (11:17 +0000)] 
pcie: Add a way to get the outstanding page request allocation (pri) from the config space.

Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901111630.1018573-2-clement.mathieu--drif@eviden.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agosmbios: cap DIMM size to 2Tb as workaround for broken Windows
Igor Mammedov [Mon, 1 Sep 2025 08:49:15 +0000 (10:49 +0200)] 
smbios: cap DIMM size to 2Tb as workaround for broken Windows

With current limit set to match max spec size (2PTb),
Windows fails to parse type 17 records when DIMM size reaches 4Tb+.
Failure happens in GetPhysicallyInstalledSystemMemory() function,
and fails "Check SMBIOS System Memory Tables" SVVP test.
Though not fatal, it might cause issues for userspace apps,
something like [1].

Lets cap default DIMM size to 2Tb for now, until MS fixes it.

1) https://issues.redhat.com/browse/RHEL-81999?focusedId=27731200&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-27731200

PS: It's obvious 32 int overflow math somewhere in Windows,
    MS admitted that it's Windows bug and in a process of fixing it.
    However it's unclear if W10 and earlier would get the fix.
    So however I dislike changing defaults, we heed to work around
    the issue (it looks like QEMU regression while not being it).
    Hopefully 2Tb/DIMM split will last longer until VM memory size
    will become large enough to cause to many type 17 records issue
    again.
PPS:
    Alternatively, instead of messing with defaults, we can create
    a dedicated knob to ask for desired DIMM size cap explicitly
    on CLI. That will let users to enable workaround when they
    hit this corner case. Downside is that knob has to be propagated
    up all mgmt stack, which might be not desirable.
PPPS:
    Yet alternatively, users can configure initial RAM to be less
    than 4Tb and all additional RAM add as DIMMs on QEMU CLI.
    (however it's the job to be done by mgmt which could know
    Windows version and total amount of RAM)

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
Fixes: 62f182c97b ("smbios: make memory device size configurable per Machine")
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901084915.2607632-1-imammedo@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agohw/virtio: rename vhost-user-device and make user creatable
Alex Bennée [Mon, 1 Sep 2025 10:59:48 +0000 (11:59 +0100)] 
hw/virtio: rename vhost-user-device and make user creatable

We didn't make the device user creatable in the first place because we
were worried users might get confused. Rename the device to make its
nature as a test device even more explicit. While we are at it add a
Kconfig variable so it can be skipped for those that want to thin out
their build configuration even further.

Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Message-ID: <20250820195632.1956795-1-alex.bennee@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901105948.982583-1-alex.bennee@linaro.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agopcie_sriov: Fix broken MMIO accesses from SR-IOV VFs
Damien Bergamini [Mon, 1 Sep 2025 15:14:23 +0000 (15:14 +0000)] 
pcie_sriov: Fix broken MMIO accesses from SR-IOV VFs

Starting with commit cab1398a60eb, SR-IOV VFs are realized as soon as
pcie_sriov_pf_init() is called.  Because pcie_sriov_pf_init() must be
called before pcie_sriov_pf_init_vf_bar(), the VF BARs types won't be
known when the VF realize function calls pcie_sriov_vf_register_bar().

This breaks the memory regions of the VFs (for instance with igbvf):

$ lspci
...
    Region 0: Memory at 281a00000 (64-bit, prefetchable) [virtual] [size=16K]
    Region 3: Memory at 281a20000 (64-bit, prefetchable) [virtual] [size=16K]

$ info mtree
...
address-space: pci_bridge_pci_mem
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci_bridge_pci
    0000000081a00000-0000000081a03fff (prio 1, i/o): igbvf-mmio
    0000000081a20000-0000000081a23fff (prio 1, i/o): igbvf-msix

and causes MMIO accesses to fail:

    Invalid write at addr 0x281A01520, size 4, region '(null)', reason: rejected
    Invalid read at addr 0x281A00C40, size 4, region '(null)', reason: rejected

To fix this, VF BARs are now registered with pci_register_bar() which
has a type parameter and pcie_sriov_vf_register_bar() is removed.

Fixes: cab1398a60eb ("pcie_sriov: Reuse SR-IOV VF device instances")
Signed-off-by: Damien Bergamini <damien.bergamini@eviden.com>
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250901151314.1038020-1-clement.mathieu--drif@eviden.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agohw/i386/pc: Avoid overlap between CXL window and PCI 64bit BARs in QEMU
peng guo [Tue, 5 Aug 2025 14:23:00 +0000 (22:23 +0800)] 
hw/i386/pc: Avoid overlap between CXL window and PCI 64bit BARs in QEMU

When using a CXL Type 3 device together with a virtio 9p device in QEMU on a
physical server, the 9p device fails to initialize properly. The kernel reports
the following error:

    virtio: device uses modern interface but does not have VIRTIO_F_VERSION_1
    9pnet_virtio virtio0: probe with driver 9pnet_virtio failed with error -22

Further investigation revealed that the 64-bit BAR space assigned to the 9pnet
device was overlapped by the memory window allocated for the CXL devices. As a
result, the kernel could not correctly access the BAR region, causing the
virtio device to malfunction.

An excerpt from /proc/iomem shows:

    480010000-cffffffff : CXL Window 0
      480010000-4bfffffff : PCI Bus 0000:00
      4c0000000-4c01fffff : PCI Bus 0000:0c
        4c0000000-4c01fffff : PCI Bus 0000:0d
      4c0200000-cffffffff : PCI Bus 0000:00
        4c0200000-4c0203fff : 0000:00:03.0
          4c0200000-4c0203fff : virtio-pci-modern

To address this issue, this patch adds the reserved memory end calculation
for cxl devices to reserve sufficient address space and ensure that CXL memory
windows are allocated beyond all PCI 64-bit BARs. This prevents overlap with
64-bit BARs regions such as those used by virtio or other pcie devices,
resolving the conflict.

QEMU Build Configuration:

    ./configure --prefix=/home/work/qemu_master/build/ \
                --target-list=x86_64-softmmu \
                --enable-kvm \
                --enable-virtfs

QEMU Boot Command:

    sudo /home/work/qemu_master/qemu/build/qemu-system-x86_64 \
        -nographic -machine q35,cxl=on -enable-kvm -m 16G -smp 8 \
        -hda /home/work/gp_qemu/rootfs.img \
        -virtfs local,path=/home/work/gp_qemu/share,mount_tag=host0,security_model=passthrough,id=host0 \
        -kernel /home/work/linux_output/arch/x86/boot/bzImage \
        --append "console=ttyS0 crashkernel=256M root=/dev/sda rootfstype=ext4 rw loglevel=8" \
        -object memory-backend-ram,id=vmem0,share=on,size=4096M \
        -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
        -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
        -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=cxl-vmem0,sn=0x123456789 \
        -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G

Fixes: 03b39fcf64bc ("hw/cxl: Make the CXL fixed memory window setup a machine parameter")
Signed-off-by: peng guo <engguopeng@buaa.edu.cn>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250805142300.15226-1-engguopeng@buaa.edu.cn>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agohw/smbios: allow clearing the VM bit in SMBIOS table 0
Daniil Tatianin [Thu, 24 Jul 2025 19:54:09 +0000 (22:54 +0300)] 
hw/smbios: allow clearing the VM bit in SMBIOS table 0

This is useful to be able to freeze a specific version of SeaBIOS to
prevent guest visible changes between BIOS updates. This is currently
not possible since the extension byte 2 provided by SeaBIOS does not
set the VM bit, whereas QEMU sets it unconditionally.

Allowing to clear it also seems useful if we want to hide the fact that
the guest system is running inside a virtual machine.

Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <20250724195409.43499-1-d-tatianin@yandex-team.ru>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoscripts/ghes_inject: add a script to generate GHES error inject
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:11 +0000 (09:04 +0200)] 
scripts/ghes_inject: add a script to generate GHES error inject

Using the QMP GHESv2 API requires preparing a raw data array
containing a CPER record.

Add a helper script with subcommands to prepare such data.

Currently, only ARM Processor error CPER record is supported, by
using:
$ ghes_inject.py arm

which produces those warnings on Linux:

[  705.032426] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error
[  774.866308] {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  774.866583] {4}[Hardware Error]: event severity: recoverable
[  774.866738] {4}[Hardware Error]:  Error 0, type: recoverable
[  774.866889] {4}[Hardware Error]:   section_type: ARM processor error
[  774.867048] {4}[Hardware Error]:   MIDR: 0x00000000000f0510
[  774.867189] {4}[Hardware Error]:   running state: 0x0
[  774.867321] {4}[Hardware Error]:   Power State Coordination Interface state: 0
[  774.867511] {4}[Hardware Error]:   Error info structure 0:
[  774.867679] {4}[Hardware Error]:   num errors: 2
[  774.867801] {4}[Hardware Error]:    error_type: 0x02: cache error
[  774.867962] {4}[Hardware Error]:    error_info: 0x000000000091000f
[  774.868124] {4}[Hardware Error]:     transaction type: Data Access
[  774.868280] {4}[Hardware Error]:     cache error, operation type: Data write
[  774.868465] {4}[Hardware Error]:     cache level: 2
[  774.868592] {4}[Hardware Error]:     processor context not corrupted
[  774.868774] [Firmware Warn]: GHES: Unhandled processor error type 0x02: cache error

Such script allows customizing the error data, allowing to change
all fields at the record. Please use:

$ ghes_inject.py arm -h

For more details about its usage.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <5ea174638e33d23635332fa6d4ae9d751355f127.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agodocs: hest: add new "etc/acpi_table_hest_addr" and update workflow
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:10 +0000 (09:04 +0200)] 
docs: hest: add new "etc/acpi_table_hest_addr" and update workflow

While the HEST layout didn't change, there are some internal
changes related to how offsets are calculated and how memory error
events are triggered.

Update specs to reflect such changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <e3e8bd92ce40d997c67ac1d4d973c0041b8f59fc.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agotests/acpi: virt: update HEST and DSDT tables
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:09 +0000 (09:04 +0200)] 
tests/acpi: virt: update HEST and DSDT tables

The following changes for DSDT affecting all files
under tests/data/acpi/aarch64/virt/DSDT* :

    -"tests/data/acpi/aarch64/virt/DSDT",
    -"tests/data/acpi/aarch64/virt/DSDT.acpihmatvirt",
    -"tests/data/acpi/aarch64/virt/DSDT.acpipcihp",
    -"tests/data/acpi/aarch64/virt/DSDT.hpoffacpiindex",
    -"tests/data/acpi/aarch64/virt/DSDT.memhp",
    -"tests/data/acpi/aarch64/virt/DSDT.pxb",
    -"tests/data/acpi/aarch64/virt/DSDT.topology",
    -"tests/data/acpi/aarch64/virt/DSDT.viot",
    -"tests/data/acpi/aarch64/virt/DSDT.smmuv3-dev",
    -"tests/data/acpi/aarch64/virt/DSDT.smmuv3-legacy",

    --- /tmp/DSDT_old.dsl   2025-09-05 15:03:18.964968499 +0200
    +++ /tmp/DSDT.dsl       2025-09-05 15:03:18.966968470 +0200
    @@ -1886,6 +1886,11 @@
                     {
                         Notify (PWRB, 0x80) // Status Change
                     }
    +
    +                If (((Local0 & 0x20) == 0x20))
    +                {
    +                    Notify (GEDD, 0x80) // Status Change
    +                }
                 }
             }

    @@ -1894,6 +1899,12 @@
                 Name (_HID, "PNP0C0C" /* Power Button Device */)  // _HID: Hardware ID
                 Name (_UID, Zero)  // _UID: Unique ID
             }
    +
    +        Device (GEDD)
    +        {
    +            Name (_HID, "PNP0C33" /* Error Device */)  // _HID: Hardware ID
    +            Name (_UID, Zero)  // _UID: Unique ID
    +        }
         }

         Scope (\_SB.PCI0)

Additionally, HEST changes:
    -"tests/data/acpi/aarch64/virt/HEST",

    --- /tmp/HEST_old.dsl   2025-09-05 15:03:19.078653625 +0200
    +++ /tmp/HEST.dsl       2025-09-05 15:03:19.079511472 +0200
    @@ -3,7 +3,7 @@
      * AML/ASL+ Disassembler version 20240322 (64-bit version)
      * Copyright (c) 2000 - 2023 Intel Corporation
      *
    - * Disassembly of /tmp/HEST_old
    + * Disassembly of /tmp/HEST
      *
      * ACPI Data Table [HEST]
      *
    @@ -11,16 +11,16 @@
      */

     [000h 0000 004h]                   Signature : "HEST"    [Hardware Error Source Table]
    -[004h 0004 004h]                Table Length : 00000084
    +[004h 0004 004h]                Table Length : 000000E0
     [008h 0008 001h]                    Revision : 01
    -[009h 0009 001h]                    Checksum : E2
    +[009h 0009 001h]                    Checksum : 6C
     [00Ah 0010 006h]                      Oem ID : "BOCHS "
     [010h 0016 008h]                Oem Table ID : "BXPC    "
     [018h 0024 004h]                Oem Revision : 00000001
     [01Ch 0028 004h]             Asl Compiler ID : "BXPC"
     [020h 0032 004h]       Asl Compiler Revision : 00000001

    -[024h 0036 004h]          Error Source Count : 00000001
    +[024h 0036 004h]          Error Source Count : 00000002

     [028h 0040 002h]               Subtable Type : 000A [Generic Hardware Error Source V2]
     [02Ah 0042 002h]                   Source Id : 0000
    @@ -55,19 +55,62 @@
     [069h 0105 001h]                   Bit Width : 40
     [06Ah 0106 001h]                  Bit Offset : 00
     [06Bh 0107 001h]        Encoded Access Width : 04 [QWord Access:64]
    -[06Ch 0108 008h]                     Address : 0000000043DA0008
    +[06Ch 0108 008h]                     Address : 0000000043DA0010

     [074h 0116 008h]           Read Ack Preserve : FFFFFFFFFFFFFFFE
     [07Ch 0124 008h]              Read Ack Write : 0000000000000001

    -Raw Table Data: Length 132 (0x84)
    +[084h 0132 002h]               Subtable Type : 000A [Generic Hardware Error Source V2]
    +[086h 0134 002h]                   Source Id : 0001
    +[088h 0136 002h]           Related Source Id : FFFF
    +[08Ah 0138 001h]                    Reserved : 00
    +[08Bh 0139 001h]                     Enabled : 01
    +[08Ch 0140 004h]      Records To Preallocate : 00000001
    +[090h 0144 004h]     Max Sections Per Record : 00000001
    +[094h 0148 004h]         Max Raw Data Length : 00000400
    +
    +[098h 0152 00Ch]        Error Status Address : [Generic Address Structure]
    +[098h 0152 001h]                    Space ID : 00 [SystemMemory]
    +[099h 0153 001h]                   Bit Width : 40
    +[09Ah 0154 001h]                  Bit Offset : 00
    +[09Bh 0155 001h]        Encoded Access Width : 04 [QWord Access:64]
    +[09Ch 0156 008h]                     Address : 0000000043DA0008
    +
    +[0A4h 0164 01Ch]                      Notify : [Hardware Error Notification Structure]
    +[0A4h 0164 001h]                 Notify Type : 07 [GPIO]
    +[0A5h 0165 001h]               Notify Length : 1C
    +[0A6h 0166 002h]  Configuration Write Enable : 0000
    +[0A8h 0168 004h]                PollInterval : 00000000
    +[0ACh 0172 004h]                      Vector : 00000000
    +[0B0h 0176 004h]     Polling Threshold Value : 00000000
    +[0B4h 0180 004h]    Polling Threshold Window : 00000000
    +[0B8h 0184 004h]       Error Threshold Value : 00000000
    +[0BCh 0188 004h]      Error Threshold Window : 00000000
    +
    +[0C0h 0192 004h]   Error Status Block Length : 00000400
    +[0C4h 0196 00Ch]           Read Ack Register : [Generic Address Structure]
    +[0C4h 0196 001h]                    Space ID : 00 [SystemMemory]
    +[0C5h 0197 001h]                   Bit Width : 40
    +[0C6h 0198 001h]                  Bit Offset : 00
    +[0C7h 0199 001h]        Encoded Access Width : 04 [QWord Access:64]
    +[0C8h 0200 008h]                     Address : 0000000043DA0018

    -    0000: 48 45 53 54 84 00 00 00 01 E2 42 4F 43 48 53 20  // HEST......BOCHS
    +[0D0h 0208 008h]           Read Ack Preserve : FFFFFFFFFFFFFFFE
    +[0D8h 0216 008h]              Read Ack Write : 0000000000000001
    +
    +Raw Table Data: Length 224 (0xE0)
    +
    +    0000: 48 45 53 54 E0 00 00 00 01 6C 42 4F 43 48 53 20  // HEST.....lBOCHS
         0010: 42 58 50 43 20 20 20 20 01 00 00 00 42 58 50 43  // BXPC    ....BXPC
    -    0020: 01 00 00 00 01 00 00 00 0A 00 00 00 FF FF 00 01  // ................
    +    0020: 01 00 00 00 02 00 00 00 0A 00 00 00 FF FF 00 01  // ................
         0030: 01 00 00 00 01 00 00 00 00 04 00 00 00 40 00 04  // .............@..
         0040: 00 00 DA 43 00 00 00 00 08 1C 00 00 00 00 00 00  // ...C............
         0050: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // ................
    -    0060: 00 00 00 00 00 04 00 00 00 40 00 04 08 00 DA 43  // .........@.....C
    +    0060: 00 00 00 00 00 04 00 00 00 40 00 04 10 00 DA 43  // .........@.....C
         0070: 00 00 00 00 FE FF FF FF FF FF FF FF 01 00 00 00  // ................
    -    0080: 00 00 00 00                                      // ....
    +    0080: 00 00 00 00 0A 00 01 00 FF FF 00 01 01 00 00 00  // ................
    +    0090: 01 00 00 00 00 04 00 00 00 40 00 04 08 00 DA 43  // .........@.....C
    +    00A0: 00 00 00 00 07 1C 00 00 00 00 00 00 00 00 00 00  // ................
    +    00B0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  // ................
    +    00C0: 00 04 00 00 00 40 00 04 18 00 DA 43 00 00 00 00  // .....@.....C....
    +    00D0: FE FF FF FF FF FF FF FF 01 00 00 00 00 00 00 00  // ................

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <2253eb50df797ab320b4ca610bd22a38e5cfd17a.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/generic_event_device.c: enable use_hest_addr for QEMU 10.x
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:08 +0000 (09:04 +0200)] 
acpi/generic_event_device.c: enable use_hest_addr for QEMU 10.x

Now that we have everything in place, enable using HEST GPA
instead of etc/hardware_errors GPA.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <ad77b64aa1f09141efe942539445908631423975.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoqapi/acpi-hest: add an interface to do generic CPER error injection
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:07 +0000 (09:04 +0200)] 
qapi/acpi-hest: add an interface to do generic CPER error injection

Create a QMP command to be used for generic ACPI APEI hardware error
injection (HEST) via GHESv2, and add support for it for ARM guests.

Error injection uses ACPI_HEST_SRC_ID_QMP source ID to be platform
independent. This is mapped at arch virt bindings, depending on the
types supported by QEMU and by the BIOS. So, on ARM, this is supported
via ACPI_GHES_NOTIFY_GPIO notification type.

This patch was co-authored:
    - original ghes logic to inject a simple ARM record by Shiju Jose;
    - generic logic to handle block addresses by Jonathan Cameron;
    - generic GHESv2 error inject by Mauro Carvalho Chehab;

Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Co-authored-by: Shiju Jose <shiju.jose@huawei.com>
Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <81e2118b3c8b7e5da341817f277d61251655e0db.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoarm/virt: Wire up a GED error device for ACPI / GHES
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:06 +0000 (09:04 +0200)] 
arm/virt: Wire up a GED error device for ACPI / GHES

Adds support to ARM virtualization to allow handling
generic error ACPI Event via GED & error source device.

It is aligned with Linux Kernel patch:
https://lore.kernel.org/lkml/1272350481-27951-8-git-send-email-ying.huang@intel.com/

Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <3237a76b1469d669436399495825348bf34122cd.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agotests/acpi: virt: allow acpi table changes at DSDT and HEST tables
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:05 +0000 (09:04 +0200)] 
tests/acpi: virt: allow acpi table changes at DSDT and HEST tables

We'll be adding a new GED device for HEST GPIO notification and
increasing the number of entries at the HEST table.

Blocklist testing HEST and DSDT tables until such changes
are completed.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <7fca7eb9b801f1b196210f66538234b94bd31c23.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/generic_event_device: add an APEI error device
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:04 +0000 (09:04 +0200)] 
acpi/generic_event_device: add an APEI error device

Adds a generic error device to handle generic hardware error
events as specified at ACPI 6.5 specification at 18.3.2.7.2:
https://uefi.org/specs/ACPI/6.5/18_Platform_Error_Interfaces.html#event-notification-for-generic-error-sources
using HID PNP0C33.

The PNP0C33 device is used to report hardware errors to
the guest via ACPI APEI Generic Hardware Error Source (GHES).

Co-authored-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Co-authored-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <2790f664c849d53de0ce3049fa8c7950c1de1f86.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/generic_event_device: add logic to detect if HEST addr is available
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:03 +0000 (09:04 +0200)] 
acpi/generic_event_device: add logic to detect if HEST addr is available

Create a new property (x-has-hest-addr) and use it to detect if
the GHES table offsets can be calculated from the HEST address
(qemu 10.0 and upper) or via the legacy way via an offset obtained
from the hardware_errors firmware file.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <c4eb3cf32a3f158ae62dac29e866ac3f373956c3.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/generic_event_device: Update GHES migration to cover hest addr
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:02 +0000 (09:04 +0200)] 
acpi/generic_event_device: Update GHES migration to cover hest addr

The GHES migration logic should now support HEST table location too.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <ede7ddf4b10f34094a4327dc458d630ad319bd1c.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/ghes: add a notifier to notify when error data is ready
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:01 +0000 (09:04 +0200)] 
acpi/ghes: add a notifier to notify when error data is ready

Some error injection notify methods are async, like GPIO
notify. Add a notifier to be used when the error record is
ready to be sent to the guest OS.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Acked-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <edf9c6e5b80dc57e3443893bf9e1eb25cb9d266b.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/ghes: don't hard-code the number of sources for HEST table
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:04:00 +0000 (09:04 +0200)] 
acpi/ghes: don't hard-code the number of sources for HEST table

The current code is actually dependent on having just one error
structure with a single source, as any change there would cause
migration issues.

As the number of sources should be arch-dependent, as it will depend on
what kind of notifications will exist, and how many errors can be
reported at the same time, change the logic to be more flexible,
allowing the number of sources to be defined when building the
HEST table by the caller.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <1698680848c11d6f26368426f1657e14faaf55c4.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/ghes: Use HEST table offsets when preparing GHES records
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:03:59 +0000 (09:03 +0200)] 
acpi/ghes: Use HEST table offsets when preparing GHES records

There are two pointers that are needed during error injection:

1. The start address of the CPER block to be stored;
2. The address of the read ack.

It is preferable to calculate them from the HEST table.  This allows
checking the source ID, the size of the table and the type of the
HEST error block structures.

Yet, keep the old code, as this is needed for migration purposes
from older QEMU versions.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <d4344e8dbe66372e1e093d968eda2e8b0527ba48.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/ghes: add a firmware file with HEST address
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:03:58 +0000 (09:03 +0200)] 
acpi/ghes: add a firmware file with HEST address

Store HEST table address at GPA, placing its the start of the table at
hest_addr_le variable.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <c29aa5e6ab9b2d93dd5328481630c3b03da86261.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/ghes: prepare to change the way HEST offsets are calculated
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:03:57 +0000 (09:03 +0200)] 
acpi/ghes: prepare to change the way HEST offsets are calculated

Add a new ags flag to change the way HEST offsets are calculated.
Currently, offsets needed to store ACPI HEST offsets and read ack
are calculated based on a previous knowledge from the logic
which creates the HEST table.

Such logic is not generic, not allowing to easily add more HEST
entries nor replicates what OSPM does.

As the next patches will be adding a more generic logic, add a
new use_hest_addr, set to false, in preparation for such changes.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <f5de17bf04b27828e1a439ad396b4f7982eaf156.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoacpi/ghes: Cleanup the code which gets ghes ged state
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:03:56 +0000 (09:03 +0200)] 
acpi/ghes: Cleanup the code which gets ghes ged state

Move the check logic into a common function and simplify the
code which checks if GHES is enabled and was properly setup.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <2bbb1d3eb88b0a668114adef2f1c2a94deebba0e.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoRevert "hw/acpi/ghes: Make ghes_record_cper_errors() static"
Mauro Carvalho Chehab [Tue, 23 Sep 2025 07:03:55 +0000 (09:03 +0200)] 
Revert "hw/acpi/ghes: Make ghes_record_cper_errors() static"

The ghes_record_cper_errors() function was introduced to be used
by other types of errors, as part of the error injection
patch series. That's why it is not static.

Make it non-static again to allow its usage outside ghes.c

This reverts commit 611f3bdb20f7828b0813aa90d47d1275ef18329b.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <14f2a888cfbf922d5f2bf94d7612114f25107d59.1758610789.git.mchehab+huawei@kernel.org>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agonet: implement UDP tunnel features offloading
Paolo Abeni [Mon, 22 Sep 2025 14:18:28 +0000 (16:18 +0200)] 
net: implement UDP tunnel features offloading

When any host or guest GSO over UDP tunnel offload is enabled the
virtio net header includes the additional tunnel-related fields,
update the size accordingly.

Push the GSO over UDP tunnel offloads all the way down to the tap
device extending the newly introduced NetFeatures struct, and
eventually enable the associated features.

As per virtio specification, to convert features bit to offload bit,
map the extended features into the reserved range.

Finally, make the vhost backend aware of the exact header layout, to
copy it correctly. The tunnel-related field are present if either
the guest or the host negotiated any UDP tunnel related feature:
add them to the kernel supported features list, to allow qemu
transfer to the backend the needed information.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <093b4bc68368046bffbcab2202227632d6e4e83b.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agonet: implement tunnel probing
Paolo Abeni [Mon, 22 Sep 2025 14:18:27 +0000 (16:18 +0200)] 
net: implement tunnel probing

Tap devices support GSO over UDP tunnel offload. Probe for such
feature in a similar manner to other offloads.

GSO over UDP tunnel needs to be enabled in addition to a "plain"
offload (TSO or USO).

No need to check separately for the outer header checksum offload:
the kernel is going to support both of them or none.

The new features are disabled by default to avoid compat issues,
and could be enabled, after that hw_compat_10_1 will be added,
together with the related compat entries.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <a987a8a7613cbf33bb2209c7c7f5889b512638a7.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio-net: implement extended features support
Paolo Abeni [Mon, 22 Sep 2025 14:18:26 +0000 (16:18 +0200)] 
virtio-net: implement extended features support

Use the extended types and helpers to manipulate the virtio_net
features.

Note that offloads are still 64bits wide, as per specification,
and extended offloads will be mapped into such range.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <bc5afdc5c1cb1a37238dd2b36004db3d46cbf211.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovhost-net: implement extended features support
Paolo Abeni [Mon, 22 Sep 2025 14:18:25 +0000 (16:18 +0200)] 
vhost-net: implement extended features support

Provide extended version of the features manipulation helpers,
and let the device initialization deal with the full features space,
adjusting the relevant format strings accordingly.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <69c78c432e28e146a8874b2a7d00e9cbd111b1ba.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovhost-backend: implement extended features support
Paolo Abeni [Mon, 22 Sep 2025 14:18:24 +0000 (16:18 +0200)] 
vhost-backend: implement extended features support

Leverage the kernel extended features manipulation ioctls(), if
available, and fallback to old ops otherwise. Error out when setting
extended features but kernel support is not available.

Note that extended support for get/set backend features is not needed,
as the only feature that can be changed belongs to the 64 bit range.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <150daade3d59e77629276920e014ee8e5fc12121.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoqmp: update virtio features map to support extended features
Paolo Abeni [Mon, 22 Sep 2025 14:18:23 +0000 (16:18 +0200)] 
qmp: update virtio features map to support extended features

Extend the VirtioDeviceFeatures struct with an additional u64
to track unknown features in the 64-127 bit range and decode
the full virtio features spaces for vhost and virtio devices.

Also add entries for the soon-to-be-supported virtio net GSO over
UDP features.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <e51969f94d89045b333f1bc5ef5fca9e12fc371a.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovhost: add support for negotiating extended features
Paolo Abeni [Mon, 22 Sep 2025 14:18:22 +0000 (16:18 +0200)] 
vhost: add support for negotiating extended features

Similar to virtio infra, vhost core maintains the features status
in the full extended format and allows the devices to implement
extended version of the getter/setter.

Note that 'protocol_features' are not extended: they are only
used by vhost-user, and the latter device is not going to implement
extended features soon.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <a0062c3b1847fb2baedd6cd8f6ef13b051d6beb2.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio-pci: implement support for extended features
Paolo Abeni [Mon, 22 Sep 2025 14:18:21 +0000 (16:18 +0200)] 
virtio-pci: implement support for extended features

Extend the features configuration space to 128 bits. If the virtio
device supports any extended features, allow the common read/write
operation to access all of it, otherwise keep exposing only the
lower 64 bits.

On migration, save the 128 bit version of the features only if the
upper bits are non zero. Relay on reset to clear all the feature
space before load.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <c0b81601f65b41ca8310eba8f05e2dcf3702de89.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio: add support for negotiating extended features
Paolo Abeni [Mon, 22 Sep 2025 14:18:20 +0000 (16:18 +0200)] 
virtio: add support for negotiating extended features

The virtio specifications allows for a device features space up
to 128 bits and more. Soon we are going to use some of the 'extended'
bits features for the virtio net driver.

Add support to allow extended features negotiation on a per
devices basis. Devices willing to negotiated extended features
need to implemented a new pair of features getter/setter, the
core will conditionally use them instead of the basic one.

Note that 'bad_features' don't need to be extended, as they are
bound to the 64 bits limit.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <9bb29d70adc3f2b8c7756d4e3cd076cffee87826.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio: serialize extended features state
Paolo Abeni [Mon, 22 Sep 2025 14:18:19 +0000 (16:18 +0200)] 
virtio: serialize extended features state

If the driver uses any of the extended features (i.e. 64 or above),
store the extended features range (64-127 bits).

At load time, let legacy features initialize the full features range
and pass it to the set helper; sub-states loading will have filled-up
the extended part as needed.

This is one of the few spots that need explicitly to know and set
in stone the extended features array size; add a build bug to prevent
breaking the migration should such size change again in the future:
more serialization plumbing will be needed.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <d5d9d398675bee6c4c7d7308c5d3d5d3c6d17d87.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agovirtio: introduce extended features type
Paolo Abeni [Mon, 22 Sep 2025 14:18:18 +0000 (16:18 +0200)] 
virtio: introduce extended features type

The virtio specifications allows for up to 128 bits for the
device features. Soon we are going to use some of the 'extended'
bits features (bit 64 and above) for the virtio net driver.

Represent the virtio features bitmask with a fixed size array, and
introduce a few helpers to help manipulate them.

Most drivers will keep using only 64 bits features space: use union
to allow them access the lower part of the extended space without any
per driver change.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <6a9bbb5eb33830f20afbcb7e64d300af4126dd98.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agolinux-headers: Update to Linux v6.17-rc1
Paolo Abeni [Mon, 22 Sep 2025 14:18:17 +0000 (16:18 +0200)] 
linux-headers: Update to Linux v6.17-rc1

Update headers to include the virtio GSO over UDP tunnel features

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <0b1f3c011f90583ab52aa4fef04df6db35cc4a69.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agolinux-headers: deal with counted_by annotation
Paolo Abeni [Mon, 22 Sep 2025 14:18:16 +0000 (16:18 +0200)] 
linux-headers: deal with counted_by annotation

Such annotation is present into the kernel uAPI headers since
v6.7, and will be used soon by the vhost_type.h. Deal with it
just stripping it.

Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Acked-by: Jason Wang <jasowang@redhat.com>
Acked-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <a1430f43cc954d2a931fa60581bda6d6af4bc771.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agonet: bundle all offloads in a single struct
Paolo Abeni [Mon, 22 Sep 2025 14:18:15 +0000 (16:18 +0200)] 
net: bundle all offloads in a single struct

The set_offload() argument list is already pretty long and
we are going to introduce soon a bunch of additional offloads.

Replace the offload arguments with a single struct and update
all the relevant call-sites.

No functional changes intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Tested-by: Lei Yang <leiyang@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
Message-ID: <a9d4dd043b8c71b791e9ff05e17ef06072d9714e.1758549625.git.pabeni@redhat.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
8 weeks agoMerge tag 'pull-vfio-20251003' of https://github.com/legoater/qemu into staging
Richard Henderson [Fri, 3 Oct 2025 11:57:45 +0000 (04:57 -0700)] 
Merge tag 'pull-vfio-20251003' of https://github.com/legoater/qemu into staging

vfio queue:

* Remove workaround for kernel DMA unmap overflow
* Remove invalid uses of ram_addr_t type

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCAAdFiEEoPZlSPBIlev+awtgUaNDx8/77KEFAmjfpl4ACgkQUaNDx8/7
# 7KFAHQ//R0WtsAsEYE8Diczscl9++gqORrrLYN2ffTKrhUBrBskPptWZ+4Rh4R2e
# OSxdcf1cl0sFNkzCqnbWE3sbAG1Yq6mvCXTGTx3Y+2wi0KNwZXSxYGMWApOydp5K
# McQv1Uyd48TKCEwjumu6jmoPUSi89kvA58BLjBtw2bwJQzdlMZpIHX0XlSjlBHTz
# wHPqqW5+WCWq52pTp2vNkRrcqTl/HuoaijHPEJMzd/GIl1x2tBruuXuwzkY33ZKy
# EyDNq/stK12Pa1Va1ey8QOMQUJJ1jb3feVognDDVRMUGbBPljMawi8vtXW6LW28P
# 0micGzDk1A3yi8X+tIHjQE/rcL86mIKyzCmrSB7WM+t3r79/hWZQruUu2e1eUGCE
# Mw5K0UoxBvp4LxeB2wKSIFUL1VgcB0azgsq6nOwRgMyzcqjniBu7M7gctIQQdypZ
# wSdUo8cViagUXS+YDVLsMreq4FShFWx6JLOGlxvN/eTaicUTjiOccriGmu1huhW/
# VzcfkgZWL1lSKoDeOAOafNjUP557hv0YbiAGa8ywglrukFLdFKIFJOvNdnzmmkiG
# 5YJt2RH/rx+etF0hBI4uZLCnumpiKVM27/9MuMRiF7jZSXx0rz8tFVcscxQY10GP
# pSPL3SZAeLD4HMhndrlLSPAJyboQ4TGPA26yn5nahUGmOhoP91o=
# =kCV9
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 03 Oct 2025 03:33:02 AM PDT
# gpg:                using RSA key A0F66548F04895EBFE6B0B6051A343C7CFFBECA1
# gpg: Good signature from "Cédric Le Goater <clg@redhat.com>" [full]
# gpg:                 aka "Cédric Le Goater <clg@kaod.org>" [full]

* tag 'pull-vfio-20251003' of https://github.com/legoater/qemu:
  hw/vfio: Use uint64_t for IOVA mapping size in vfio_container_dma_*map
  hw/vfio: Avoid ram_addr_t in vfio_container_query_dirty_bitmap()
  hw/vfio: Reorder vfio_container_query_dirty_bitmap() trace format
  system/iommufd: Use uint64_t type for IOVA mapping size
  vfio: Remove workaround for kernel DMA unmap overflow bug

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8 weeks agoMerge tag 'pull-riscv-to-apply-20251003-3' of https://github.com/alistair23/qemu...
Richard Henderson [Fri, 3 Oct 2025 11:57:12 +0000 (04:57 -0700)] 
Merge tag 'pull-riscv-to-apply-20251003-3' of https://github.com/alistair23/qemu into staging

First RISC-V PR for 10.2

* Fix MSI table size limit
* Add riscv64 to FirmwareArchitecture
* Sync RISC-V hwprobe with Linux
* Implement MonitorDef HMP API
* Update OpenSBI to v1.7
* Fix SiFive UART character drop issue and minor refactors
* Fix RISC-V timer migration issues
* Use riscv_cpu_is_32bit() when handling SBI_DBCN reg
* Use riscv_csrr in riscv_csr_read
* Align memory allocations to 2M on RISC-V
* Do not use translator_ldl in opcode_at
* Minor fixes of RISC-V CFI
* Modify minimum VLEN rule
* Fix vslide1[up|down].vx unexpected result when XLEN=32 and SEW=64
* Fixup IOMMU PDT Nested Walk
* Fix endianness swap on compressed instructions
* Update status of IOMMU kernel support

# -----BEGIN PGP SIGNATURE-----
#
# iQIzBAABCgAdFiEEaukCtqfKh31tZZKWr3yVEwxTgBMFAmjfQhoACgkQr3yVEwxT
# gBPnTg//eQ9GMFTLcW4kFMsVYeY8TbkmQN9Wnk+XubG92siGkzuNmfy36yo7oeib
# dB6/h5JLjycjttOfgyx73/TKUucyZs+ZYkVVWWQCSU+sqPTA370MmGNM8CSmPms/
# lFuNIixd+sSUDIOod9zQHzxv+f3ZN2bjEAyzJAEhSXgTO+1xnOeJHHjxB5O2Z/a1
# ccd3Po1wR6nm2T4x88LcHDHj8svLsfG0G1RRkU+yeLu7J6Qpp0d/lOZI7if+AQqb
# Nmz65n2uSuUEuNNQIxYaQp/nbkF3DSxi3mg3+hCQjF+hMjXL4hAhSEPril3MQjGi
# 802nEaqG8Qdzec+bZiKt0c3e0f4SrnpDXDnz7NrtfSO6vXAvqqZuC8kTdZy8dsPU
# 1D809ksZoNDIB87z89MQPsQ7k1Bs2Iq9pNpB9huD3mzY4DHqYhkzysAwc8Qhvimv
# pBaeSDV66OrI/al5c0FqSN0LiLHvlRcwqiATiQwIdCV+PUe+cVPwIKq6ABQiYpVu
# mvnzgEJ4r7iO92hOoAGM+eRC7krafF1/gbe3SDI3RLUTDPM6hcTRcluvBlpBdNDj
# lIYXs89f0jBh0I4IRGm8ftqD9xPDP56mZVEIIjSWDRTT6mfZLxWWMmXC/OK63U7/
# bpJKohFOKy8P6SSvTACcLSOQlP3r+FRrmBOXs7S24U+Hr9xUep0=
# =DGkt
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 02 Oct 2025 08:25:14 PM PDT
# gpg:                using RSA key 6AE902B6A7CA877D6D659296AF7C95130C538013
# gpg: Good signature from "Alistair Francis <alistair@alistair23.me>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg:          There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 6AE9 02B6 A7CA 877D 6D65  9296 AF7C 9513 0C53 8013

* tag 'pull-riscv-to-apply-20251003-3' of https://github.com/alistair23/qemu: (26 commits)
  docs: riscv-iommu: Update status of kernel support
  target/riscv: Fix endianness swap on compressed instructions
  hw/riscv/riscv-iommu: Fixup PDT Nested Walk
  target/riscv: rvv: Fix vslide1[up|down].vx unexpected result when XLEN=32 and SEW=64
  target/riscv: rvv: Modify minimum VLEN according to enabled vector extensions
  target/riscv: rvv: Replace checking V by checking Zve32x
  target/riscv: Fix ssamoswap error handling
  target/riscv: Fix SSP CSR error handling in VU/VS mode
  target/riscv: Fix the mepc when sspopchk triggers the exception
  target/riscv: do not use translator_ldl in opcode_at
  qemu/osdep: align memory allocations to 2M on RISC-V
  target/riscv: use riscv_csrr in riscv_csr_read
  target/riscv/kvm: Use riscv_cpu_is_32bit() when handling SBI_DBCN reg
  target/riscv: Save stimer and vstimer in CPU vmstate
  hw/intc: Save timers array in RISC-V mtimer VMState
  migration: Add support for a variable-length array of UINT32 pointers
  hw/intc: Save time_delta in RISC-V mtimer VMState
  hw/char: sifive_uart: Add newline to error message
  hw/char: sifive_uart: Remove outdated comment about Tx FIFO
  hw/char: sifive_uart: Avoid pushing Tx FIFO when size is zero
  ...

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
8 weeks agodocs: riscv-iommu: Update status of kernel support
Joel Stanley [Thu, 14 Aug 2025 00:14:50 +0000 (09:44 +0930)] 
docs: riscv-iommu: Update status of kernel support

The iommu Linux kernel support is now upstream. VFIO is still
downstream at this stage.

Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
Message-ID: <20250814001452.504510-1-joel@jms.id.au>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: Fix endianness swap on compressed instructions
vhaudiquet [Mon, 29 Sep 2025 11:55:43 +0000 (13:55 +0200)] 
target/riscv: Fix endianness swap on compressed instructions

Three instructions were not using the endianness swap flag, which resulted in a bug on big-endian architectures.

Resolves: https://gitlab.com/qemu-project/qemu/-/issues/3131
Buglink: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/2123828
Fixes: e0a3054f18e ("target/riscv: add support for Zcb extension")
Signed-off-by: Valentin Haudiquet <valentin.haudiquet@canonical.com>
Cc: qemu-stable@nongnu.org
Reviewed-by: Anton Johansson <anjo@rev.ng>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Heinrich Schuchardt <heinrich.schuchardt@canonical.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250929115543.1648157-1-valentin.haudiquet@canonical.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/riscv/riscv-iommu: Fixup PDT Nested Walk
Guo Ren (Alibaba DAMO Academy) [Sat, 13 Sep 2025 04:12:33 +0000 (00:12 -0400)] 
hw/riscv/riscv-iommu: Fixup PDT Nested Walk

Current implementation is wrong when iohgatp != bare. The RISC-V
IOMMU specification has defined that the PDT is based on GPA, not
SPA. So this patch fixes the problem, making PDT walk correctly
when the G-stage table walk is enabled.

Fixes: 0c54acb8243d ("hw/riscv: add RISC-V IOMMU base emulation")
Cc: qemu-stable@nongnu.org
Cc: Sebastien Boeuf <seb@rivosinc.com>
Cc: Tomasz Jeznach <tjeznach@rivosinc.com>
Reviewed-by: Weiwei Li <liwei1518@gmail.com>
Reviewed-by: Nutty Liu <liujingqi@lanxincomputing.com>
Signed-off-by: Guo Ren (Alibaba DAMO Academy) <guoren@kernel.org>
Tested-by: Chen Pei <cp0613@linux.alibaba.com>
Tested-by: Fangyu Yu <fangyu.yu@linux.alibaba.com>
Message-ID: <20250913041233.972870-1-guoren@kernel.org>
[ Changes by AF:
 - Add braces to if statements
]
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: rvv: Fix vslide1[up|down].vx unexpected result when XLEN=32 and SEW=64
Max Chou [Fri, 24 Jan 2025 07:33:22 +0000 (15:33 +0800)] 
target/riscv: rvv: Fix vslide1[up|down].vx unexpected result when XLEN=32 and SEW=64

When XLEN is 32 and SEW is 64, the original implementation of
vslide1up.vx and vslide1down.vx helper functions fills the 32-bit value
of rs1 into the first element of the destination vector register (rd),
which is a 64-bit element.

This commit attempted to resolve the issue by extending the rs1 value
to 64 bits during the TCG translation phase to ensure that the helper
functions won't lost the higer 32 bits.

Signed-off-by: Max Chou <max.chou@sifive.com>
Acked-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250124073325.2467664-1-max.chou@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: rvv: Modify minimum VLEN according to enabled vector extensions
Max Chou [Tue, 23 Sep 2025 09:07:29 +0000 (17:07 +0800)] 
target/riscv: rvv: Modify minimum VLEN according to enabled vector extensions

According to the RISC-V unprivileged specification, the VLEN should be greater
or equal to the ELEN. This commit modifies the minimum VLEN based on the vector
extensions and introduces a check rule for VLEN and ELEN.

  Extension     Minimum VLEN
* V                      128
* Zve64[d|f|x]            64
* Zve32[f|x]              32

Signed-off-by: Max Chou <max.chou@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250923090729.1887406-3-max.chou@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: rvv: Replace checking V by checking Zve32x
Max Chou [Tue, 23 Sep 2025 09:07:28 +0000 (17:07 +0800)] 
target/riscv: rvv: Replace checking V by checking Zve32x

The Zve32x extension will be applied by the V and Zve* extensions.
Therefore we can replace the original V checking with Zve32x checking for both
the V and Zve* extensions.

Signed-off-by: Max Chou <max.chou@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250923090729.1887406-2-max.chou@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: Fix ssamoswap error handling
Jim Shu [Wed, 24 Sep 2025 07:48:18 +0000 (15:48 +0800)] 
target/riscv: Fix ssamoswap error handling

Follow the RISC-V CFI v1.0 spec [1] to fix the exception type
when ssamoswap is disabled by xSSE.

[1] RISC-V CFI spec v1.0, ch2.7 Atomic Swap from a Shadow Stack Location

Signed-off-by: Jim Shu <jim.shu@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250924074818.230010-4-jim.shu@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: Fix SSP CSR error handling in VU/VS mode
Jim Shu [Wed, 24 Sep 2025 07:48:17 +0000 (15:48 +0800)] 
target/riscv: Fix SSP CSR error handling in VU/VS mode

In VU/VS mode, accessing $ssp CSR will trigger the virtual instruction
exception instead of illegal instruction exception if SSE is disabled
via xenvcfg CSRs.

This is from RISC-V CFI v1.0 spec ch2.2.4. Shadow Stack Pointer

Signed-off-by: Jim Shu <jim.shu@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250924074818.230010-3-jim.shu@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: Fix the mepc when sspopchk triggers the exception
Jim Shu [Wed, 24 Sep 2025 07:48:16 +0000 (15:48 +0800)] 
target/riscv: Fix the mepc when sspopchk triggers the exception

When sspopchk is in the middle of TB and triggers the SW check
exception, it should update PC from gen_update_pc(). If not, RISC-V mepc
CSR will get wrong PC address which is still at the start of TB.

Signed-off-by: Jim Shu <jim.shu@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250924074818.230010-2-jim.shu@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: do not use translator_ldl in opcode_at
Vladimir Isaev [Fri, 15 Aug 2025 14:06:33 +0000 (17:06 +0300)] 
target/riscv: do not use translator_ldl in opcode_at

opcode_at is used only in semihosting checks to match opcodes with expected
pattern.

This is not a translator and if we got following assert if page is not in TLB:
qemu-system-riscv64: ../accel/tcg/translator.c:363: record_save: Assertion
`offset == db->record_start + db->record_len' failed.

Fixes: 1f9c4462334f ("target/riscv: Use translator_ld* for everything")
Signed-off-by: Vladimir Isaev <vladimir.isaev@syntacore.com>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250815140633.86920-1-vladimir.isaev@syntacore.com>
[ Changes by AF:
 - Fixup header includes after rebase
]
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agoqemu/osdep: align memory allocations to 2M on RISC-V
Xuemei Liu [Wed, 24 Sep 2025 05:18:03 +0000 (13:18 +0800)] 
qemu/osdep: align memory allocations to 2M on RISC-V

Similar to other architectures (e.g., x86_64, aarch64), utilizing THP on RISC-V
KVM requires 2MiB-aligned memory blocks.

Signed-off-by: Xuemei Liu <liu.xuemei1@zte.com.cn>
Reviewed-by: David Hildenbrand <david@redhat.com>
Message-ID: <20250924131803656Yqt9ZJKfevWkInaGppFdE@zte.com.cn>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: use riscv_csrr in riscv_csr_read
stove [Wed, 27 Aug 2025 20:36:17 +0000 (13:36 -0700)] 
target/riscv: use riscv_csrr in riscv_csr_read

Commit 38c83e8d3a33 ("target/riscv: raise an exception when CSRRS/CSRRC
writes a read-only CSR") changed the behavior of riscv_csrrw, which
would formerly be treated as read-only if the write mask were set to 0.

Fixes an exception being raised when accessing read-only vector CSRs
like vtype.

Fixes: 38c83e8d3a33 ("target/riscv: raise an exception when CSRRS/CSRRC writes a read-only CSR")
Signed-off-by: stove <stove@rivosinc.com>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Message-ID: <20250827203617.79947-1-stove@rivosinc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv/kvm: Use riscv_cpu_is_32bit() when handling SBI_DBCN reg
Philippe Mathieu-Daudé [Wed, 24 Sep 2025 16:45:15 +0000 (18:45 +0200)] 
target/riscv/kvm: Use riscv_cpu_is_32bit() when handling SBI_DBCN reg

Use the existing riscv_cpu_is_32bit() helper to check for 32-bit CPU.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Reviewed-by: Andrew Jones <ajones@ventanamicro.com>
Message-ID: <20250924164515.51782-1-philmd@linaro.org>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: Save stimer and vstimer in CPU vmstate
TANG Tiancheng [Thu, 11 Sep 2025 09:56:16 +0000 (17:56 +0800)] 
target/riscv: Save stimer and vstimer in CPU vmstate

vmstate_riscv_cpu was missing env.stimer and env.vstimer.
Without migrating these QEMUTimer fields, active S/VS-mode
timer events are lost after snapshot or migration.

Add VMSTATE_TIMER_PTR() entries to save and restore them.

Reviewed-by: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: TANG Tiancheng <lyndra@linux.alibaba.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911-timers-v3-4-60508f640050@linux.alibaba.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/intc: Save timers array in RISC-V mtimer VMState
TANG Tiancheng [Thu, 11 Sep 2025 09:56:15 +0000 (17:56 +0800)] 
hw/intc: Save timers array in RISC-V mtimer VMState

The current 'timecmp' field in vmstate_riscv_mtimer is insufficient to keep
timers functional after migration.

If an mtimer's entry in 'mtimer->timers' is active at the time the snapshot
is taken, it means riscv_aclint_mtimer_write_timecmp() has written to
'mtimecmp' and scheduled a timer into QEMU's main loop 'timer_list'.

During snapshot save, these active timers must also be migrated; otherwise,
after snapshot load there is no mechanism to restore 'mtimer->timers' back
into the 'timer_list', and any pending timer events would be lost.

QEMU's migration framework commonly uses VMSTATE_TIMER_xxx macros to save
and restore 'QEMUTimer' variables. However, 'timers' is a pointer array
with variable length, and vmstate.h did not previously provide a helper
macro for such type.

This commit adds a new macro, 'VMSTATE_TIMER_PTR_VARRAY', to handle saving
and restoring a variable-length array of 'QEMUTimer *'. We then use this
macro to migrate the 'mtimer->timers' array, ensuring that timer events
remain scheduled correctly after snapshot load.

Reviewed-by: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>
Signed-off-by: TANG Tiancheng <lyndra@linux.alibaba.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911-timers-v3-3-60508f640050@linux.alibaba.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agomigration: Add support for a variable-length array of UINT32 pointers
TANG Tiancheng [Thu, 11 Sep 2025 09:56:14 +0000 (17:56 +0800)] 
migration: Add support for a variable-length array of UINT32 pointers

Add support for defining a vmstate field which is a variable-length array
of pointers, and use this to define a VMSTATE_TIMER_PTR_VARRAY() which allows
a variable-length array of QEMUTimer* to be used by devices.

Message-id: 20250909-timers-v1-0-7ee18a9d8f4b@linux.alibaba.com
Reviewed-by: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: TANG Tiancheng <lyndra@linux.alibaba.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911-timers-v3-2-60508f640050@linux.alibaba.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/vfio: Use uint64_t for IOVA mapping size in vfio_container_dma_*map
Philippe Mathieu-Daudé [Tue, 30 Sep 2025 12:35:28 +0000 (14:35 +0200)] 
hw/vfio: Use uint64_t for IOVA mapping size in vfio_container_dma_*map

The 'ram_addr_t' type is described as:

  a QEMU internal address space that maps guest RAM physical
  addresses into an intermediate address space that can map
  to host virtual address spaces.

This doesn't represent well an IOVA mapping size. Simply use
the uint64_t type.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-5-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
8 weeks agohw/vfio: Avoid ram_addr_t in vfio_container_query_dirty_bitmap()
Philippe Mathieu-Daudé [Tue, 30 Sep 2025 12:35:27 +0000 (14:35 +0200)] 
hw/vfio: Avoid ram_addr_t in vfio_container_query_dirty_bitmap()

The 'ram_addr_t' type is described as:

  a QEMU internal address space that maps guest RAM physical
  addresses into an intermediate address space that can map
  to host virtual address spaces.

vfio_container_query_dirty_bitmap() doesn't expect such QEMU
intermediate address, but a guest physical addresses. Use the
appropriate 'hwaddr' type, rename as @translated_addr for
clarity.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-4-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
8 weeks agohw/vfio: Reorder vfio_container_query_dirty_bitmap() trace format
Philippe Mathieu-Daudé [Tue, 30 Sep 2025 12:35:26 +0000 (14:35 +0200)] 
hw/vfio: Reorder vfio_container_query_dirty_bitmap() trace format

Update the trace-events comments after the changes from
commit dcce51b1938 ("hw/vfio/container-base.c: rename file
to container.c") and commit a3bcae62b6a ("hw/vfio/container.c:
rename file to container-legacy.c").

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-3-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
8 weeks agosystem/iommufd: Use uint64_t type for IOVA mapping size
Philippe Mathieu-Daudé [Tue, 30 Sep 2025 12:35:25 +0000 (14:35 +0200)] 
system/iommufd: Use uint64_t type for IOVA mapping size

The 'ram_addr_t' type is described as:

  a QEMU internal address space that maps guest RAM physical
  addresses into an intermediate address space that can map
  to host virtual address spaces.

This doesn't represent well an IOVA mapping size. Simply use
the uint64_t type.

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20250930123528.42878-2-philmd@linaro.org
Signed-off-by: Cédric Le Goater <clg@redhat.com>
8 weeks agovfio: Remove workaround for kernel DMA unmap overflow bug
Cédric Le Goater [Fri, 26 Sep 2025 08:54:23 +0000 (10:54 +0200)] 
vfio: Remove workaround for kernel DMA unmap overflow bug

A kernel bug was introduced in Linux v4.15 via commit 71a7d3d78e3c
("vfio/type1: Check for address space wrap-around on unmap"), which
added a test for address space wrap-around in the vfio DMA unmap path.
Unfortunately, due to an integer overflow, the kernel would
incorrectly detect an unmap of the last page in the 64-bit address
space as a wrap-around, causing the unmap to fail with -EINVAL.

A QEMU workaround was introduced in commit 567d7d3e6be5 ("vfio/common:
Work around kernel overflow bug in DMA unmap") to retry the unmap,
excluding the final page of the range.

The kernel bug was then fixed in Linux v5.0 via commit 58fec830fc19
("vfio/type1: Fix dma_unmap wrap-around check"). Since the oldest
supported LTS kernel is now v5.4, kernels affected by this bug are
considered deprecated, and the workaround is no longer necessary.

This change reverts 567d7d3e6be5, removing the workaround.

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Link: https://lore.kernel.org/qemu-devel/20250926085423.375547-1-clg@redhat.com
Signed-off-by: Cédric Le Goater <clg@redhat.com>
8 weeks agohw/intc: Save time_delta in RISC-V mtimer VMState
TANG Tiancheng [Thu, 11 Sep 2025 09:56:13 +0000 (17:56 +0800)] 
hw/intc: Save time_delta in RISC-V mtimer VMState

In QEMU's RISC-V ACLINT timer model, 'mtime' is not stored directly as a
state variable. It is computed on demand as:

    mtime = rtc_r + time_delta

where:
- 'rtc_r' is the current VM virtual time (in ticks) obtained via
  cpu_riscv_read_rtc_raw() from QEMU_CLOCK_VIRTUAL.
- 'time_delta' is an offset applied when the guest writes a new 'mtime'
  value via riscv_aclint_mtimer_write():

    time_delta = value - rtc_r

Under this design, 'rtc_r' is assumed to be monotonically increasing
during VM execution. Even if the guest writes an 'mtime' value smaller
than the current one (making 'time_delta' negative in signed arithmetic,
or underflow in unsigned arithmetic), the computed 'mtime' remains
correct because 'rtc_r_new > rtc_r_old':

    mtime_new = rtc_r_new + (value - rtc_r_old)

However, this monotonicity assumption breaks on snapshot load.

Before restoring a snapshot, QEMU resets the guest, which calls
riscv_aclint_mtimer_reset_enter() to set 'mtime' to 0 and recompute
'time_delta' as:

    time_delta = 0 - rtc_r_reset

Here, the time_delta differs from the value that was present when the
snapshot was saved. As a result, subsequent reads produce a fixed offset
from the true mtime.

This can be observed with the 'date' command inside the guest: after loading
a snapshot, the reported time appears "frozen" at the save point, and only
resumes correctly after the guest has run long enough to compensate for the
erroneous offset.

The fix is to treat 'time_delta' as part of the device's migratable
state and save/restore it via vmstate. This preserves the correct
relation between 'rtc_r' and 'mtime' across snapshot save/load, ensuring
'mtime' continues incrementing from the precise saved value after
restore.

Reviewed-by: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>
Reviewed-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Signed-off-by: TANG Tiancheng <lyndra@linux.alibaba.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911-timers-v3-1-60508f640050@linux.alibaba.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/char: sifive_uart: Add newline to error message
Frank Chang [Thu, 11 Sep 2025 16:06:46 +0000 (00:06 +0800)] 
hw/char: sifive_uart: Add newline to error message

Adds a missing newline character to the error message.

Signed-off-by: Frank Chang <frank.chang@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911160647.5710-5-frank.chang@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/char: sifive_uart: Remove outdated comment about Tx FIFO
Frank Chang [Thu, 11 Sep 2025 16:06:45 +0000 (00:06 +0800)] 
hw/char: sifive_uart: Remove outdated comment about Tx FIFO

Since Tx FIFO is now implemented using "qemu/fifo8.h", remove the comment
that no longer reflects the current implementation.

Signed-off-by: Frank Chang <frank.chang@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911160647.5710-4-frank.chang@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/char: sifive_uart: Avoid pushing Tx FIFO when size is zero
Frank Chang [Thu, 11 Sep 2025 16:06:44 +0000 (00:06 +0800)] 
hw/char: sifive_uart: Avoid pushing Tx FIFO when size is zero

There's no need to call fifo8_push_all() when size is zero.

Signed-off-by: Frank Chang <frank.chang@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911160647.5710-3-frank.chang@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agohw/char: sifive_uart: Raise IRQ according to the Tx/Rx watermark thresholds
Frank Chang [Thu, 11 Sep 2025 16:06:43 +0000 (00:06 +0800)] 
hw/char: sifive_uart: Raise IRQ according to the Tx/Rx watermark thresholds

Currently, the SiFive UART raises an IRQ whenever:

  1. ie.txwm is enabled.
  2. ie.rxwm is enabled and the Rx FIFO is not empty.

It does not check the watermark thresholds set by software. However,
since commit [1] changed the SiFive UART character printing from
synchronous to asynchronous, Tx overflows may occur, causing characters
to be dropped when running Linux because:

  1. The Linux SiFive UART driver sets the transmit watermark level to 1
     [2], meaning a transmit watermark interrupt is raised whenever a
     character is enqueued into the Tx FIFO.
  2. Upon receiving a transmit watermark interrupt, the Linux driver
     transfers up to a full Tx FIFO's worth of characters from the Linux
     serial transmit buffer [3], without checking the txdata.full flag
     before transferring multiple characters [4].

To fix this issue, we must honor the Tx/Rx watermark thresholds and
raise interrupts only when the Tx threshold is exceeded or the Rx
threshold is undercut.

[1] 53c1557b230986ab6320a58e1b2c26216ecd86d5
[2] https://github.com/torvalds/linux/blob/master/drivers/tty/serial/sifive.c#L1039
[3] https://github.com/torvalds/linux/blob/master/drivers/tty/serial/sifive.c#L538
[4] https://github.com/torvalds/linux/blob/master/drivers/tty/serial/sifive.c#L291

Signed-off-by: Frank Chang <frank.chang@sifive.com>
Signed-off-by: Emmanuel Blot <emmanuel.blot@sifive.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250911160647.5710-2-frank.chang@sifive.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agoroms/opensbi: Update to v1.7
Daniel Henrique Barboza [Tue, 22 Jul 2025 17:29:06 +0000 (14:29 -0300)] 
roms/opensbi: Update to v1.7

Update OpenSBI and the pre-built opensbi32 and opensbi64 images to
version 1.7.

It has been almost an year since we last updated OpenSBI (at the time,
up to v1.5.1) and we're missing a lot of good stuff from both v1.6 and
v1.7, including SBI 3.0 and RPMI 1.0.

The changelog is too large and tedious to post in the commit msg so I
encourage refering to [1] and [2] to see the new features we're adding
into the QEMU roms.

[1] https://github.com/riscv-software-src/opensbi/releases/tag/v1.6
[2] https://github.com/riscv-software-src/opensbi/releases/tag/v1.7

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agotarget/riscv: implement MonitorDef HMP API
Daniel Henrique Barboza [Thu, 3 Jul 2025 13:08:15 +0000 (10:08 -0300)] 
target/riscv: implement MonitorDef HMP API

The MonitorDef API is related to two HMP monitor commands: 'p' and 'x':

(qemu) help p
print|p /fmt expr -- print expression value (use $reg for CPU register access)
(qemu) help x
x /fmt addr -- virtual memory dump starting at 'addr'

For x86, one of the few targets that implements it, it is possible to
print the PC register value with $pc and use the PC value in the 'x'
command as well.

Those 2 commands are hooked into get_monitor_def(), called by
exp_unary() in hmp.c. The function tries to fetch a reg value in two
ways: by reading them directly via a target_monitor_defs array or using
a target_get_monitor_def() helper. In RISC-V we have *A LOT* of
registers and this number will keep getting bigger, so we're opting out
of an array declaration.

We're able to retrieve all regs but vregs because the API only fits an
uint64_t and vregs have 'vlen' size that are bigger than that.

With this patch we can do things such as:

- print CSRs and use their val in expressions:
(qemu) p $mstatus
0xa000000a0
(qemu) p $mstatus & 0xFF
0xa0

- dump the next 10 insn from virtual memory starting at x1 (ra):

(qemu) x/10i $ra
0xffffffff80958aea:  a9bff0ef          jal                     ra,-1382                # 0xffffffff80958584
0xffffffff80958aee:  10016073          csrrsi                  zero,sstatus,2
0xffffffff80958af2:  60a2              ld                      ra,8(sp)
0xffffffff80958af4:  6402              ld                      s0,0(sp)
0xffffffff80958af6:  0141              addi                    sp,sp,16
0xffffffff80958af8:  8082              ret
0xffffffff80958afa:  10016073          csrrsi                  zero,sstatus,2
0xffffffff80958afe:  8082              ret
0xffffffff80958b00:  1141              addi                    sp,sp,-16
0xffffffff80958b02:  e422              sd                      s0,8(sp)
(qemu)

Suggested-by: Dr. David Alan Gilbert <dave@treblig.org>
Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250703130815.1592493-1-dbarboza@ventanamicro.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>
8 weeks agolinux-user/syscall.c: sync RISC-V hwprobe with Linux
Daniel Henrique Barboza [Wed, 3 Sep 2025 16:40:43 +0000 (13:40 -0300)] 
linux-user/syscall.c: sync RISC-V hwprobe with Linux

It has been awhile since the last sync. Let's bring QEMU hwprobe support
on par with Linux 6.17-rc4.

A lot of new RISCV_HWPROBE_KEY_* entities are added but this patch is
only adding support for ZICBOM_BLOCK_SIZE.

Signed-off-by: Daniel Henrique Barboza <dbarboza@ventanamicro.com>
Reviewed-by: Alistair Francis <alistair.francis@wdc.com>
Message-ID: <20250903164043.2828336-1-dbarboza@ventanamicro.com>
Signed-off-by: Alistair Francis <alistair.francis@wdc.com>