Eric Auger [Mon, 14 Jul 2025 08:05:11 +0000 (10:05 +0200)]
hw/acpi/pcihp: Remove root arg in acpi_pcihp_init
Let pass the root bus to ich9 and piix4 through a property link
instead of through an argument passed to acpi_pcihp_init().
Also make sure the root bus is set at the entry of acpi_pcihp_init().
The rationale of that change is to be consistent with the forecoming ARM
implementation where the machine passes the root bus (steming from GPEX)
to the GED device through a link property.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Suggested-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-28-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:10 +0000 (10:05 +0200)]
hw/acpi/ged: Call pcihp plug callbacks in hotplug handler implementation
Add PCI device related code in the TYPE_HOTPLUG_HANDLER
implementation.
For a PCI device hotplug/hotunplug event, the code routes to
acpi_pcihp_device callbacks (pre_plug_cb, plug_cb, unplug_request_cb,
unplug_cb).
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-27-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:09 +0000 (10:05 +0200)]
hw/arm/virt: Pass the bus on the ged creation
The bus will be needed on ged realize for acpi pci hp setup.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-26-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:08 +0000 (10:05 +0200)]
hw/acpi/ged: Add a bus link property
This property will be set by the machine code on the object
creation. It will be used by acpi pcihp hotplug code.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-25-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:07 +0000 (10:05 +0200)]
hw/arm/virt-acpi-build: Modify the DSDT ACPI table to enable ACPI PCI hotplug
Modify the DSDT ACPI table to enable ACPI PCI hotplug.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-24-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-23-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:05 +0000 (10:05 +0200)]
hw/arm/virt-acpi-build: Let non hotplug ports support static acpi-index
hw/arm/virt-acpi-build: Let non hotplug ports support static acpi-index
Add the requested ACPI bits requested to support static acpi-index
for non hotplug ports.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-22-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
tests/qtest/bios-tables-test: Prepare for changes in the arm virt DSDT table
This commit adds DSDT blobs to the whilelist in the prospect to
allow changes in the arm virt DSDT method.
Signed-off-by: Gustavo Romero <gustavo.romero@linaro.org> Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-21-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:03 +0000 (10:05 +0200)]
qtest/bios-tables-test: Generate DSDT.viot
Use a specific DSDT.viot reference blob instead of relying on
the default DSDT blob. The content is unchanged.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-20-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:02 +0000 (10:05 +0200)]
qtest/bios-tables-test: Add a variant to the aarch64 viot test
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-19-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:01 +0000 (10:05 +0200)]
qtest/bios-tables-test: Prepare for fixing the aarch64 viot test
The test misses a variant and this puts the mess on subsequent
rebuild-expected-aml.sh where a first DSDT reference blob is
overriden by another one.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-18-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:05:00 +0000 (10:05 +0200)]
hw/i386/acpi-build: Move aml_pci_edsm to a generic place
Move aml_pci_edsm to pci-bridge.c since we want to reuse that for
ARM and acpi-index support. Also rename it into build_pci_bridge_edsm.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-17-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:59 +0000 (10:04 +0200)]
hw/i386/acpi-build: Use AcpiPciHpState::root in acpi_set_pci_info
pcihp acpi_set_pci_info() generic code currently uses
acpi_get_i386_pci_host() to retrieve the pci host bridge.
To make it work also on ARM we get rid of that call and
directly use AcpiPciHpState::root.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Suggested-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-16-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:58 +0000 (10:04 +0200)]
hw/i386/acpi-build: Move build_append_pci_bus_devices/pcihp_slots to pcihp
We intend to reuse build_append_pci_bus_devices and build_append_pcihp_slots
on ARM. So let's move them to hw/acpi/pcihp.c as well as all static
helpers they use.
No functional change intended.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-15-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:57 +0000 (10:04 +0200)]
hw/i386/acpi-build: Move build_append_notification_callback to pcihp
We plan to reuse build_append_notification_callback() on ARM
so let's move it to pcihp.c.
No functional change intended.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-14-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:56 +0000 (10:04 +0200)]
hw/acpi/pcihp: Add an AmlRegionSpace arg to build_acpi_pci_hotplug
On ARM we will put the operation regions in AML_SYSTEM_MEMORY.
So let's allow this configuration.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-13-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Extract the code that reserves resources for ACPI PCI hotplug
into a new helper named build_append_pcihp_resources() and
move it to pcihp.c. We will reuse it on ARM.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-12-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-11-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:53 +0000 (10:04 +0200)]
hw/pci-host/gpex-acpi: Use build_pci_host_bridge_osc_method
gpex build_host_bridge_osc() and x86 originated
build_pci_host_bridge_osc_method() are mostly identical.
In GPEX, SUPP is set to CDW2 but is not further used. CTRL
is same as Local0.
So let gpex code reuse build_pci_host_bridge_osc_method()
and remove build_host_bridge_osc().
Also add an imply ACPI_PCI clause along with
PCI_EXPRESS_GENERIC_BRIDGE to compile hw/acpi/pci.c
when its dependency is resolved (ie. CONFIG_ACPI_PCI).
This is requested to link qemu-system-mips64el.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-10-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:52 +0000 (10:04 +0200)]
hw/i386/acpi-build: Turn build_q35_osc_method into a generic method
GPEX acpi_dsdt_add_pci_osc() does basically the same as
build_q35_osc_method().
Rename build_q35_osc_method() into build_pci_host_bridge_osc_method()
and move it into hw/acpi/pci.c. In a subsequent patch we will
use this later in place of acpi_dsdt_add_pci_osc().
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-9-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:51 +0000 (10:04 +0200)]
hw/pci-host/gpex-acpi: Use GED acpi pcihp property
Retrieve the acpi pcihp property value from the ged. In case this latter
is not set, PCI native hotplug is used on pci0. For expander bridges we
keep pci native hotplug, as done on x86 q35.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-8-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:50 +0000 (10:04 +0200)]
hw/acpi/ged: Add a acpi-pci-hotplug-with-bridge-support property
A new boolean property is introduced. This will be used to turn
ACPI PCI hotplug support. By default it is unset.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-7-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:49 +0000 (10:04 +0200)]
hw/pci-host/gpex-acpi: Split host bridge OSC and DSM generation
acpi_dsdt_add_pci_osc() name is confusing as it gives the impression
it appends the _OSC method but in fact it also appends the _DSM method
for the host bridge. Let's split the function into two separate ones
and let them return the method Aml pointer instead. This matches the
way it is done on x86 (build_q35_osc_method). In a subsequent patch
we will replace the gpex method by the q35 implementation that will
become shared between ARM and x86.
acpi_dsdt_add_host_bridge_methods is a new top helper that generates
both the _OSC and _DSM methods.
We take the opportunity to move SUPP and CTRL in the _osc method
that use them.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-6-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
tests/qtest/bios-tables-test: Prepare for changes in the DSDT table
This commit adds DSDT blobs to the whilelist in the prospect to
allow changes in the GPEX _OSC method.
Signed-off-by: Gustavo Romero <gustavo.romero@linaro.org> Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com> Acked-by: Igor Mammedov <imammedo@redhat.com>
Message-Id: <20250714080639.2525563-5-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:47 +0000 (10:04 +0200)]
hw/pci-host/gpex-acpi: Add native_pci_hotplug arg to acpi_dsdt_add_pci_osc
Add a new argument to acpi_dsdt_add_pci_osc to be able to disable
native pci hotplug.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-4-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:46 +0000 (10:04 +0200)]
hw/acpi: Rename and move build_x86_acpi_pci_hotplug to pcihp
We plan to reuse build_x86_acpi_pci_hotplug() implementation
for ARM so let's move the code to generic pcihp.
Associated static aml_pci_pdsm() helper is also moved along.
build_x86_acpi_pci_hotplug is renamed into build_acpi_pci_hotplug().
No code change intended.
Also fix the reference to acpi_pci_hotplug.rst documentation
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-3-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Eric Auger [Mon, 14 Jul 2025 08:04:45 +0000 (10:04 +0200)]
hw/i386/acpi-build: Make aml_pci_device_dsm() static
No need to export aml_pci_device_dsm() as it is only used
in hw/i386/acpi-build.c.
Signed-off-by: Eric Auger <eric.auger@redhat.com> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-Id: <20250714080639.2525563-2-eric.auger@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Now that various VirtIO files don't use target specific
API anymore, we can move them to the system_ss[] source
set to build them once.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250708215320.70426-9-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
qemu: Declare all load/store helper in 'qemu/bswap.h'
Restrict "exec/tswap.h" to the tswap*() methods,
move the load/store helpers with the other ones
declared in "qemu/bswap.h".
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250708215320.70426-8-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Check endianness at runtime to remove the target-specific
TARGET_BIG_ENDIAN definition. Use cpu_to_[be,le]XX() from
"qemu/bswap.h" instead of tswapXX() from "exec/tswap.h".
Suggested-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20250708215320.70426-7-philmd@linaro.org> Reviewed-by: Manos Pitsidianakis <manos.pitsidianakis@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
qemu: Convert target_words_bigendian() to TargetInfo API
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250708215320.70426-6-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
target_endian_mode() returns the default endianness (QAPI type)
of a target.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20250708215320.70426-5-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
qemu/target-info: Add %target_arch field to TargetInfo
Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250708215320.70426-4-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
To keep "qemu/target-info.h" self-contained to native
types, declare target_arch() -- which returns a QAPI
type -- in "qemu/target-info-qapi.h".
No logical change.
Keeping native types in "qemu/target-info.h" is necessary
to keep building tests such tests/tcg/plugins/mem.c, as
per the comment added in commit ecbcc9ead2f ("tests/tcg:
add a system test to check memory instrumentation"):
/*
* plugins should not include anything from QEMU aside from the
* API header. However as this is a test plugin to exercise the
* internals of QEMU and we want to avoid needless code duplication we
* do so here. bswap.h is pretty self-contained although it needs a
* few things provided by compiler.h.
*/
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250708215320.70426-3-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Message-Id: <20250708215320.70426-2-philmd@linaro.org> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-11-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
intel_iommu: Set address mask when a translation fails and adjust W permission
Implements the behavior defined in section 10.2.3.5 of PCIe spec rev 5.
This is needed by devices that support ATS.
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-10-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
intel_iommu: Return page walk level even when the translation fails
We will use this information in vtd_do_iommu_translate to populate the
IOMMUTLBEntry and indicate the correct page mask. This prevents ATS
devices from sending many useless translation requests when a megapage
or gigapage is not present.
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-9-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
intel_iommu: Implement the PCIIOMMUOps callbacks related to invalidations of device-IOTLB
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-8-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
intel_iommu: Implement vtd_get_iotlb_info from PCIIOMMUOps
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-7-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
the PSS field of the extended capabilities stores the supported PASID
size minus 1. This commit adds support for 8bits PASIDs (limited by
MemTxAttrs::pid).
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-6-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
intel_iommu: Fill the PASID field when creating an IOMMUTLBEntry
PASID value must be used by devices as a key (or part of a key)
when populating their ATC with the IOTLB entries returned by the IOMMU.
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-5-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This will be useful for devices that support ATS
and need to store entries in an ATC (device IOTLB).
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-4-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
This will be necessary for devices implementing ATS.
We also define a new macro IOMMU_ACCESS_FLAG_FULL in addition to
IOMMU_ACCESS_FLAG to support more access flags.
IOMMU_ACCESS_FLAG is kept for convenience and backward compatibility.
Here are the flags added (defined by the PCIe 5 specification) :
- Execute Requested
- Privileged Mode Requested
- Global
- Untranslated Only
IOMMU_ACCESS_FLAG sets the additional flags to 0
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-3-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
pci: Add a memory attribute for pre-translated DMA operations
The address_type bit will be set to PCI_AT_TRANSLATED by devices that
use cached addresses obtained via ATS.
Signed-off-by: Clement Mathieu--Drif <clement.mathieu--drif@eviden.com>
Message-Id: <20250628180226.133285-2-clement.mathieu--drif@eviden.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
We are going to be adding more parameters, and this makes
rust unhappy:
Functions with lots of parameters are considered bad style and reduce
readability (“what does the 5th parameter mean?”). Consider grouping
some parameters into a new type.
Specifically:
error: this function has too many arguments (8/7)
--> /builds/mstredhat/qemu/build/rust/qemu-api/rust-qemu-api-tests.p/structured/bindings.inc.rs:3840:5
|
3840 | / pub fn new_bitfield_1(
3841 | | secure: std::os::raw::c_uint,
3842 | | space: std::os::raw::c_uint,
3843 | | user: std::os::raw::c_uint,
... |
3848 | | address_type: std::os::raw::c_uint,
3849 | | ) -> __BindgenBitfieldUnit<[u8; 4usize]> {
| |____________________________________________^
|
= help: for further information visit https://rust-lang.github.io/rust-clippy/master/index.html#too_many_arguments
= note: `-D clippy::too-many-arguments` implied by `-D warnings`
= help: to override `-D warnings` add `#[allow(clippy::too_many_arguments)]`
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <e41344bd22248b0883752ef7a7c459090a3d9cfc.1752560127.git.mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Li Chen [Wed, 28 May 2025 10:53:37 +0000 (18:53 +0800)]
tests/qtest/bios-tables-test: Add test for disabling SPCR on RISC-V
Add ACPI SPCR table test case for RISC-V when SPCR was off.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Sunil V L <sunilvl@ventanamicro.com>
Message-Id: <20250528105404.457729-4-me@linux.beauty> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Li Chen [Wed, 28 May 2025 10:53:36 +0000 (18:53 +0800)]
tests/qtest/bios-tables-test: Add test for disabling SPCR on AArch64
Add ACPI SPCR table test case for ARM when SPCR was off.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn>
Message-Id: <20250528105404.457729-3-me@linux.beauty> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Li Chen [Wed, 28 May 2025 10:53:35 +0000 (18:53 +0800)]
acpi: Add machine option to disable SPCR table
The ACPI SPCR (Serial Port Console Redirection) table allows firmware
to specify a preferred serial console device to the operating system.
On ARM64 systems, Linux by default respects this table: even if the
kernel command line does not include a hardware serial console (e.g.,
"console=ttyAMA0"), the kernel still register the serial device
referenced by SPCR as a printk console.
While this behavior is standard-compliant, it can lead to situations
where guest console behavior is influenced by platform firmware rather
than user-specified configuration. To make guest console behavior more
predictable and under user control, this patch introduces a machine
option to explicitly disable SPCR table exposure:
-machine spcr=off
By default, the option is enabled (spcr=on), preserving existing
behavior. When disabled, QEMU will omit the SPCR table from the guest's
ACPI namespace, ensuring that only consoles explicitly declared in the
kernel command line are registered.
Signed-off-by: Li Chen <chenl311@chinatelecom.cn> Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Reviewed-by: Sunil V L <sunilvl@ventanamicro.com>
Message-Id: <20250528105404.457729-2-me@linux.beauty> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Ethan Milon [Tue, 17 Jun 2025 15:04:27 +0000 (15:04 +0000)]
amd_iommu: Fix truncation of oldval in amdvi_writeq
The variable `oldval` was incorrectly declared as a 32-bit `uint32_t`.
This could lead to truncation and incorrect behavior where the upper
read-only 32 bits are significant.
Fix the type of `oldval` to match the return type of `ldq_le_p()`.
Cc: qemu-stable@nongnu.org Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU") Signed-off-by: Ethan Milon <ethan.milon@eviden.com>
Message-Id: <20250617150427.20585-9-alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Message-Id: <20250617150427.20585-8-alejandro.j.jimenez@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
amd_iommu: Fix mask to retrieve Interrupt Table Root Pointer from DTE
Fix an off-by-one error in the definition of AMDVI_IR_PHYS_ADDR_MASK. The
current definition masks off the most significant bit of the Interrupt Table
Root ptr i.e. it only generates a mask with bits [50:6] set. See the AMD I/O
Virtualization Technology (IOMMU) Specification for the Interrupt Table
Root Pointer[51:6] field in the Device Table Entry format.
Cc: qemu-stable@nongnu.org Fixes: b44159fe0078 ("x86_iommu/amd: Add interrupt remap support when VAPIC is not enabled") Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Message-Id: <20250617150427.20585-6-alejandro.j.jimenez@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
amd_iommu: Fix masks for various IOMMU MMIO Registers
Address various issues with definitions of the MMIO registers e.g. for the
Device Table Address Register, the size mask currently encompasses reserved
bits [11:9], so change it to only extract the bits [8:0] encoding size.
Convert masks to use GENMASK64 for consistency, and make unrelated
definitions independent.
Cc: qemu-stable@nongnu.org Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU") Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Message-Id: <20250617150427.20585-5-alejandro.j.jimenez@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
The DTE validation method verifies that all bits in reserved DTE fields are
unset. Update them according to the latest definition available in AMD I/O
Virtualization Technology (IOMMU) Specification - Section 2.2.2.1 Device
Table Entry Format. Remove the magic numbers and use a macro helper to
generate bitmasks covering the specified ranges for better legibility.
Note that some reserved fields specify that events are generated when they
contain non-zero bits, or checks are skipped under certain configurations.
This change only updates the reserved masks, checks for special conditions
are not yet implemented.
Cc: qemu-stable@nongnu.org Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Message-Id: <20250617150427.20585-4-alejandro.j.jimenez@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
amd_iommu: Fix Device ID decoding for INVALIDATE_IOTLB_PAGES command
The DeviceID bits are extracted using an incorrect offset in the call to
amdvi_iotlb_remove_page(). This field is read (correctly) earlier, so use
the value already retrieved for devid.
Cc: qemu-stable@nongnu.org Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU") Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com>
Message-Id: <20250617150427.20585-3-alejandro.j.jimenez@oracle.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
amd_iommu: Fix Miscellaneous Information Register 0 encoding
The definitions encoding the maximum Virtual, Physical, and Guest Virtual
Address sizes supported by the IOMMU are using incorrect offsets i.e. the
VASize and GVASize offsets are switched. The value in the GVAsize field is
also modified, since it was incorrectly encoded.
Cc: qemu-stable@nongnu.org Fixes: d29a09ca6842 ("hw/i386: Introduce AMD IOMMU") Co-developed-by: Ethan MILON <ethan.milon@eviden.com> Signed-off-by: Ethan MILON <ethan.milon@eviden.com> Signed-off-by: Alejandro Jimenez <alejandro.j.jimenez@oracle.com>
Message-Id: <20250617150427.20585-2-alejandro.j.jimenez@oracle.com> Reviewed-by: Vasant Hegde <vasant.hegde@amd.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Li Zhijian [Fri, 13 Jun 2025 08:51:10 +0000 (16:51 +0800)]
hw/acpi: Fix GPtrArray memory leak in crs_range_merge
This leak was detected by the valgrind.
The crs_range_merge() function unconditionally allocated a GPtrArray
'even when range->len was zero, causing an early return without freeing
the allocated array. This resulted in a memory leak when an empty range
was processed.
Instead of moving the allocation after the check (as previously attempted),
use g_autoptr for automatic cleanup. This ensures the array is freed even
on early returns, and also removes the need for the explicit free at the
end of the function.
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-Id: <20250613085110.111204-1-lizhijian@fujitsu.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Ani Sinha <anisinha@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Bibo Mao [Thu, 12 Jun 2025 09:03:21 +0000 (17:03 +0800)]
tests/acpi: Remove stale allowed tables
Remove stale allowed tables for LoongArch virt machine.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-Id: <20250612090321.3416594-6-maobibo@loongson.cn> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Bibo Mao [Thu, 12 Jun 2025 09:03:20 +0000 (17:03 +0800)]
tests/acpi: Fill acpi table data for LoongArch
The acpi table data is filled for LoongArch virt machine with the
following command:
tests/data/acpi/rebuild-expected-aml.sh
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-Id: <20250612090321.3416594-5-maobibo@loongson.cn> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Bibo Mao [Thu, 12 Jun 2025 09:03:19 +0000 (17:03 +0800)]
rebuild-expected-aml.sh: Add support for LoongArch
Update the list of supported architectures to include LoongArch.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-Id: <20250612090321.3416594-4-maobibo@loongson.cn> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Bibo Mao [Thu, 12 Jun 2025 09:03:18 +0000 (17:03 +0800)]
tests/qtest/bios-tables-test: Add basic testing for LoongArch
Add basic ACPI table test case for LoongArch, including cpu topology,
numa memory, memory hotplug and oem-id test cases.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-Id: <20250612090321.3416594-3-maobibo@loongson.cn> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Bibo Mao [Thu, 12 Jun 2025 09:03:17 +0000 (17:03 +0800)]
tests/acpi: Add empty ACPI data files for LoongArch
Add empty acpi table for LoongArch virt machine, it is only empty
file and there is no data in these files.
Signed-off-by: Bibo Mao <maobibo@loongson.cn>
Message-Id: <20250612090321.3416594-2-maobibo@loongson.cn> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Daniil Tatianin [Mon, 9 Jun 2025 21:25:47 +0000 (00:25 +0300)]
vhost-user-blk: add an option to skip GET_VRING_BASE for force shutdown
If we have a server running disk requests that is for whatever reason
hanging or not able to process any more IO requests but still has some
in-flight requests previously issued by the guest OS, QEMU will still
try to drain the vring before shutting down even if it was explicitly
asked to do a "force shutdown" via SIGTERM or QMP quit. This is not
useful since the guest is no longer running at this point since it was
killed by QEMU earlier in the process. At this point, we don't care
about whatever in-flight IO it might have pending, we just want QEMU
to shut down.
Add an option called "skip-get-vring-base-on-force-shutdown" to allow
SIGTERM/QMP quit() to actually act like a "force shutdown" at least
for vhost-user-blk devices since those require the drain operation
to shut down gracefully unlike, for example, network devices.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Message-Id: <20250609212547.2859224-4-d-tatianin@yandex-team.ru> Acked-by: Raphael Norwitz <raphael@enfabrica.net> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Daniil Tatianin [Mon, 9 Jun 2025 21:25:46 +0000 (00:25 +0300)]
vhost: add a helper for force stopping a device
This adds an ability to skip GET_VRING_BASE during device stop entirely,
and thus the expensive drain operation that this call entails as well,
which may be useful during a non-graceful shutdown in case the guest
operating system hangs or refuses to react to a previously requested
ACPI shutdown for whatever reason.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Message-Id: <20250609212547.2859224-3-d-tatianin@yandex-team.ru> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Daniil Tatianin [Mon, 9 Jun 2025 21:25:45 +0000 (00:25 +0300)]
softmmu/runstate: add a way to detect force shutdowns
This can be useful for devices that might take too long to shut down
gracefully, but may have a way to shutdown quickly otherwise if needed
or explicitly requested by a force shutdown.
For now we only consider SIGTERM or the QMP quit() command a force
shutdown, since those bypass the guest entirely and are equivalent to
pulling the power plug.
Signed-off-by: Daniil Tatianin <d-tatianin@yandex-team.ru>
Message-Id: <20250609212547.2859224-2-d-tatianin@yandex-team.ru> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
vhost: Fix used memslot tracking when destroying a vhost device
When we unplug a vhost device, we end up calling vhost_dev_cleanup()
where we do a memory_listener_unregister().
This memory_listener_unregister() call will end up disconnecting the
listener from the address space through listener_del_address_space().
In that process, we effectively communicate the removal of all memory
regions from that listener, resulting in region_del() + commit()
callbacks getting triggered.
So in case of vhost, we end up calling vhost_commit() with no remaining
memory slots (0).
In vhost_commit() we end up overwriting the global variables
used_memslots / used_shared_memslots, used for detecting the number
of free memslots. With used_memslots / used_shared_memslots set to 0
by vhost_commit() during device removal, we'll later assume that the
other vhost devices still have plenty of memslots left when calling
vhost_get_free_memslots().
Let's fix it by simply removing the global variables and depending
only on the actual per-device count.
Easy to reproduce by adding two vhost-user devices to a VM and then
hot-unplugging one of them.
While at it, detect unexpected underflows in vhost_get_free_memslots()
and issue a warning.
Reported-by: yuanminghao <yuanmh12@chinatelecom.cn> Link: https://lore.kernel.org/qemu-devel/20241121060755.164310-1-yuanmh12@chinatelecom.cn/ Fixes: 2ce68e4cf5be ("vhost: add vhost_has_free_slot() interface") Cc: Igor Mammedov <imammedo@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefano Garzarella <sgarzare@redhat.com> Signed-off-by: David Hildenbrand <david@redhat.com>
Message-Id: <20250603111336.1858888-1-david@redhat.com> Reviewed-by: Igor Mammedov <imammedo@redhat.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Akihiko Odaki [Fri, 30 May 2025 04:33:21 +0000 (13:33 +0900)]
virtio-net: Add hash type options
By default, virtio-net limits the hash types that will be advertised to
the guest so that all hash types are covered by the offloading
capability the client provides. This change allows to override this
behavior and to advertise hash types that require user-space hash
calculation by specifying "on" for the corresponding properties.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20250530-vdpa-v1-6-5af4109b1c19@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Akihiko Odaki [Fri, 30 May 2025 04:33:20 +0000 (13:33 +0900)]
net/vhost-vdpa: Remove dummy SetSteeringEBPF
It is no longer used.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20250530-vdpa-v1-5-5af4109b1c19@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Akihiko Odaki [Fri, 30 May 2025 04:33:19 +0000 (13:33 +0900)]
virtio-net: Retrieve peer hashing capability
Retrieve peer hashing capability instead of hardcoding.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20250530-vdpa-v1-4-5af4109b1c19@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Akihiko Odaki [Fri, 30 May 2025 04:33:18 +0000 (13:33 +0900)]
virtio-net: Move virtio_net_get_features() down
Move virtio_net_get_features() to the later part of the file so that
it can call other functions.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20250530-vdpa-v1-3-5af4109b1c19@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Akihiko Odaki [Fri, 30 May 2025 04:33:17 +0000 (13:33 +0900)]
net/vhost-vdpa: Report hashing capability
Report hashing capability so that virtio-net can deliver the correct
capability information to the guest.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20250530-vdpa-v1-2-5af4109b1c19@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
DEFINE_PROP_ON_OFF_AUTO_BIT64() corresponds to DEFINE_PROP_ON_OFF_AUTO()
as DEFINE_PROP_BIT64() corresponds to DEFINE_PROP_BOOL(). The difference
is that DEFINE_PROP_ON_OFF_AUTO_BIT64() exposes OnOffAuto instead of
bool.
Signed-off-by: Akihiko Odaki <akihiko.odaki@daynix.com>
Message-Id: <20250530-vdpa-v1-1-5af4109b1c19@daynix.com> Reviewed-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Stefan Hajnoczi [Sun, 13 Jul 2025 05:46:04 +0000 (01:46 -0400)]
Merge tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu into staging
fpu: Process float_muladd_negate_result after rounding
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user: Implement fchmodat2 syscall
linux-user: Check for EFAULT failure in nanosleep
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
linux-user/gen-vdso: Handle fseek() failure
linux-user/gen-vdso: Don't read off the end of buf[]
* tag 'pull-tcg-20250711' of https://gitlab.com/rth7680/qemu:
linux-user: Use qemu_set_cloexec() to mark pidfd as FD_CLOEXEC
tcg: Use uintptr_t in tcg_malloc implementation
linux-user: Hold the fd-trans lock across fork
linux-user/mips/o32: Drop sa_restorer functionality
linux-user/gen-vdso: Don't read off the end of buf[]
linux-user/gen-vdso: Handle fseek() failure
linux-user: Check for EFAULT failure in nanosleep
linux-user: Implement fchmodat2 syscall
fpu: Process float_muladd_negate_result after rounding
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi [Sun, 13 Jul 2025 05:45:30 +0000 (01:45 -0400)]
Merge tag 'migration-20250711-pull-request' of https://gitlab.com/farosas/qemu into staging
Migration pull request
- General cleanups around: postcopy, bg-snapshot, migration hooks,
migration completion and formatting of 'info migrate'.
- Overhaul of postcopy blocktime tracking.
# -----BEGIN PGP SIGNATURE-----
#
# iQJEBAABCAAuFiEEqhtIsKIjJqWkw2TPx5jcdBvsMZ0FAmhxGdgQHGZhcm9zYXNA
# c3VzZS5kZQAKCRDHmNx0G+wxnahoD/9uNXirlmRk3tDnhiJsiYx+HnXYPFEORSZq
# zlpUyqvhQ1POp3Fa5pRf+bJ5mmPw8h8PdOR2StMpnW2Xa1OatAZj5m1uityAVWOl
# EkVfZLl0j6j9HCCmE3c4dztOGIBsd9YY0GWizL05XHYZPrdX4zOpolMN4m53RwQY
# HUVD6T2y9eFDnCO6MsoA9EfmkFYCRvqlS0VzTcYzQFN4H+QHlcpDfweqJpTLPa+1
# trahAN9PBuMjoewjDqwkNkf0CLaCXHszAfj6yv62Vi8Cbp9DDPywIYJKFnxspElW
# Fjg1b4MdsbYZNmeKgIawzgTOL1RrojvKkoi7KWp3D7M+/ZZl9kBwQuUcBXKI7N0R
# Y0GNfkkTycn18nM0JU/6QWSuVeiPbLArxQUGP1cLgvcHSSNgD9JxWbNBu5+1fFOG
# Gg3qnyYatJ6xJDiCrdKqV8fwozNlm/G6b9BiCDeVq+4nA2OKQ0shiNA1GZHvVSQL
# X4uAPexETdHfA/LeA2w5sgVBEw7BewBdjLntZDIFsyBnLrvqrDcU5Aav0wiHoI8U
# QBC2aIpJfMLHiIQ93mVX96NltXC7KvJTIZVl3iwfiYEYCvQtTYgdJ09ELXFJYxFX
# XpTTazqpmPSfuZpPRgx9YbDP/kS8Fg/PTOlPeD0T/frFgd1S6Thh6OW455PavMp8
# ht2lE4sxjA==
# =vtRD
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 11 Jul 2025 10:04:08 EDT
# gpg: using RSA key AA1B48B0A22326A5A4C364CFC798DC741BEC319D
# gpg: issuer "farosas@suse.de"
# gpg: Good signature from "Fabiano Rosas <farosas@suse.de>" [unknown]
# gpg: aka "Fabiano Almeida Rosas <fabiano.rosas@suse.com>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: AA1B 48B0 A223 26A5 A4C3 64CF C798 DC74 1BEC 319D
* tag 'migration-20250711-pull-request' of https://gitlab.com/farosas/qemu: (26 commits)
migration: Rename save_live_complete_precopy_thread to save_complete_precopy_thread
migration/postcopy: Add latency distribution report for blocktime
migration/postcopy: blocktime allows track / report non-vCPU faults
migration/postcopy: Optimize blocktime fault tracking with hashtable
migration/postcopy: Cleanup the total blocktime accounting
migration/postcopy: Cache the tid->vcpu mapping for blocktime
migration/postcopy: Initialize blocktime context only until listen
migration/postcopy: Report fault latencies in blocktime
migration/postcopy: Add blocktime fault counts per-vcpu
migration/postcopy: Bring blocktime layer to ns level
migration/postcopy: Drop PostcopyBlocktimeContext.start_time
migration/postcopy: Make all blocktime vars 64bits
migration/postcopy: Drop all atomic ops in blocktime feature
migration/postcopy: Push blocktime start/end into page req mutex
migration: Add option to set postcopy-blocktime
migration/postcopy: Avoid clearing dirty bitmap for postcopy too
migration: Rewrite the migration complete detect logic
migration/ram: Add tracepoints for ram_save_complete()
migration/ram: One less indent for ram_find_and_save_block()
migration: qemu_savevm_complete*() helpers
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi [Sun, 13 Jul 2025 05:45:17 +0000 (01:45 -0400)]
Merge tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu into staging
target-arm queue:
* New board type max78000fthr
* Enable use of CXL on Arm 'virt' board
* Some more tidyup of ID register handling
* Refactor AT insns and PMU regs into separate source files
* Don't enforce NSE,NS check for EL3->EL3 returns
* hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
* Allow nested-virtualization with KVM on the 'virt' board
* system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
* hw/arm/virt-acpi-build: Don't create ITS id mappings by default
* target/arm: Remove unused helper_sme2_luti4_4b
* tag 'pull-target-arm-20250711' of https://gitlab.com/pm215/qemu: (36 commits)
tests/functional: Add a test for the MAX78000 arm machine
docs/system: arm: Add max78000 board description
target/arm: Remove helper_sme2_luti4_4b
hw/arm/virt-acpi-build: Don't create ITS id mappings by default
system/qdev: Remove pointless NULL check in qdev_device_add_from_qdict
hw/arm/virt: Allow virt extensions with KVM
hw/arm/arm_gicv3_kvm: Add a migration blocker with kvm nested virt
target/arm: Enable feature ARM_FEATURE_EL2 if EL2 is supported
target/arm/kvm: Add helper to detect EL2 when using KVM
hw/arm: Allow setting KVM vGIC maintenance IRQ
hw/arm/fsl-imx8mp: Wire VIRQ and VFIQ
target/arm: Don't enforce NSE,NS check for EL3->EL3 returns
target/arm: Split out performance monitor regs to cpregs-pmu.c
target/arm: Split out AT insns to tcg/cpregs-at.c
target/arm: Drop stub for define_tlb_insn_regs
arm/kvm: shorten one overly long line
arm/cpu: store clidr into the idregs array
arm/cpu: fix trailing ',' for SET_IDREG
arm/cpu: store id_aa64afr{0,1} into the idregs array
arm/cpu: store id_afr0 into the idregs array
...
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi [Sun, 13 Jul 2025 05:44:51 +0000 (01:44 -0400)]
Merge tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu into staging
* s390x: Allow to select different entries when booting via pxelinux.cfg
* Link s390-ccw.img statically
* Fix broken bamboo functional test
* s390x code cleanups and refactorings
* tag 'pull-request-2025-07-11' of https://gitlab.com/thuth/qemu:
target/s390x: Have s390_cpu_halt() not return anything
target/s390x: Expose s390_count_running_cpus() method
target/s390x: Remove unused s390_cpu_[un]halt() user stubs
tests/functional/test_ppc_bamboo: Replace broken link with working assets
tests/functional: Add dependency to the keymap_targets
pc-bios: Update the s390 bios images with the pxelinux.cfg loadparm changes
pc-bios/s390-ccw: link statically
tests/functional: Add a test for s390x pxelinux.cfg network booting
pc-bios/s390-ccw: Add a boot menu for booting via pxelinux.cfg
pc-bios/s390-ccw: Make get_boot_index() from menu.c global
pc-bios/s390-ccw: Allow up to 31 entries for pxelinux.cfg
pc-bios/s390-ccw: Allow to select a different pxelinux.cfg entry via loadparm
hw/s390x/s390-pci-bus.c: Use g_assert_not_reached() in functions taking an ett
target/s390x/tcg: Use vaddr in s390_probe_access()
target/s390x/kvm: Use vaddr in find/insert_hw_breakpoint()
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
This has two problems:
(1) it doesn't check errors, which Coverity complains about
(2) we use F_GETFL when we mean F_GETFD
Deal with both of these problems by using qemu_set_cloexec() instead.
That function will assert() if the fcntls fail, which is fine (we are
inside fork_start()/fork_end() so we know nothing can mess around
with our file descriptors here, and we just got this one from
pidfd_open()).
(As we are touching the if() statement here, we correct the
indentation.)
Coverity: CID 1508111 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Richard Henderson <richard.henderson@linaro.org> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Message-ID: <20250711141217.1429412-1-peter.maydell@linaro.org>
Juraj Marcin [Thu, 26 Jun 2025 08:52:32 +0000 (10:52 +0200)]
migration: Rename save_live_complete_precopy_thread to save_complete_precopy_thread
Recent patch [1] renames the save_live_complete_precopy handler to
save_complete, as the machine is not live in most cases when this
handler is executed. The same is true also for
save_live_complete_precopy_thread, therefore this patch removes the
"live" keyword from the handler itself and related types to keep the
naming unified.
In contrast to save_complete, this handler is only executed at the end
of precopy, therefore the "precopy" keyword is retained.
Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Cédric Le Goater <clg@redhat.com> Signed-off-by: Juraj Marcin <jmarcin@redhat.com> Link: https://lore.kernel.org/r/20250626085235.294690-1-jmarcin@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Peter Xu [Fri, 13 Jun 2025 14:12:17 +0000 (10:12 -0400)]
migration/postcopy: Add latency distribution report for blocktime
Add the latency distribution too for blocktime, using order-of-two buckets.
It accounts for all the faults, from either vCPU or non-vCPU threads. With
prior rework, it's very easy to achieve by adding an array to account for
faults in each buckets.
Sample output for HMP (while for QMP it's simply an array):
Postcopy Latency Distribution:
[ 1 us - 2 us ]: 0
[ 2 us - 4 us ]: 0
[ 4 us - 8 us ]: 1
[ 8 us - 16 us ]: 2
[ 16 us - 32 us ]: 2
[ 32 us - 64 us ]: 3
[ 64 us - 128 us ]: 10169
[ 128 us - 256 us ]: 50151
[ 256 us - 512 us ]: 12876
[ 512 us - 1 ms ]: 97
[ 1 ms - 2 ms ]: 42
[ 2 ms - 4 ms ]: 44
[ 4 ms - 8 ms ]: 93
[ 8 ms - 16 ms ]: 138
[ 16 ms - 32 ms ]: 0
[ 32 ms - 65 ms ]: 0
[ 65 ms - 131 ms ]: 0
[ 131 ms - 262 ms ]: 0
[ 262 ms - 524 ms ]: 0
[ 524 ms - 1 sec ]: 0
[ 1 sec - 2 sec ]: 0
[ 2 sec - 4 sec ]: 0
[ 4 sec - 8 sec ]: 0
[ 8 sec - 16 sec ]: 0
Cc: Markus Armbruster <armbru@redhat.com> Acked-by: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Fabiano Rosas <farosas@suse.de> Link: https://lore.kernel.org/r/20250613141217.474825-15-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>
When used to report page fault latencies, the blocktime feature can be
almost useless when KVM async page fault is enabled, because in most cases
such remote fault will kickoff async page faults, then it's not trackable
from blocktime layer.
After all these recent rewrites to blocktime layer, it's finally so easy to
also support tracking non-vCPU faults. It'll be even faster if we could
always index fault records with TIDs, unfortunately we need to maintain the
blocktime API which report things in vCPU indexes.
Of course this can work not only for kworkers, but also any guest accesses
that may reach a missing page, for example, very likely when in the QEMU
main thread too (and all other threads whenever applicable).
In this case, we don't care about "how long the threads are blocked", but
we only care about "how long the fault will be resolved".
Cc: Markus Armbruster <armbru@redhat.com> Cc: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Fabiano Rosas <farosas@suse.de> Tested-by: Mario Casquero <mcasquer@redhat.com> Link: https://lore.kernel.org/r/20250613141217.474825-14-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>
The old algorithm was almost OK and fast on inserts, except that the lookup
is slow and won't scale if there are a lot of vCPUs: when a page is copied
during postcopy, mark_postcopy_blocktime_end() will walk the whole array
trying to find which vCPUs are blocked by the address. So it needs
constant O(N) walk for each page resolution.
Alexey (the author of postcopy blocktime) mentioned the perf issue and how
to optimize it in a piece of comment in the page resolution path. The
comment was (interestingly..) not complete, but it's relatively clear what
he wanted to say about this perf issue.
Issue 2: Wrong Accounting on re-entrancies
==========================================
People might think that each vCPU should only and always get one fault at a
time, so that when the blocktime layer captured one fault on one vCPU, we
should never see another fault message on this vCPU.
It's almost correct, except in some extreme rare cases.
Case 1: it's possible the fault thread processes the userfaultfd messages
too fast so it can see >1 messages on one vCPU before the previous one was
resolved.
Case 2: it's theoretically also possible one vCPU can get even more than
one message on the same fault address if a fault is retried by the
kernel (e.g., handle_userfault() got interrupted before page resolution).
As this info might be important, instead of using commit message, I put
more details into the code as comment, when introducing an array
maintaining concurrent faults on one vCPU. Please refer to the comments
for details on both cases, especially case 1 which can be tricky.
Case 1 sounds rare, but it can be easily reproduced locally for me when we
run blocktime together with the migration-test on the vanilla postcopy.
New Design
==========
This patch should do almost what Alexey mentioned, but slightly
differently: instead of having an array to maintain vCPU fault addresses,
for each of the fault message we push a message into a hash, indexed by the
fault address.
With the hash, it can replace the old two structs: both the vcpu_addr[]
array, and also the array to store the start time of the fault. However
due to above we need one more counter array to account concurrent faults on
the same vCPU - that should even be needed in the old code, it's just that
the old code was buggy and it will blindly overwrite an existing
entry.. now we'll start to really track everything.
The hash structure might be more efficient than tree to maintain such
addr->(cpu, fault_time) information, so that the insert() and lookup()
paths should ideally both be ~O(1). After all, we do not need to sort.
Here we need to do one remove() though after the lookup(). It could be
slow but only if many vCPUs faulted exactly on the same address (so when
the list of cpu entries is long), which should be unlikely. Even with that,
it's still a worst case O(N) (consider 400 vCPUs faulted on the same
address and how likely is it..) rather than a constant O(N) complexity.
When at it, touch up the tracepoints to make them slightly more useful.
One tracepoint is added when walking all the fault entries.
Peter Xu [Fri, 13 Jun 2025 14:12:14 +0000 (10:12 -0400)]
migration/postcopy: Cleanup the total blocktime accounting
The variable vcpu_total_blocktime isn't easy to follow. In reality, it
wants to capture the case where all vCPUs are stopped, and now there will
be some vCPUs starts running.
The name now starts to conflict with vcpu_blocktime_total[], meanwhile it's
actually not necessary to have the variable at all: since nobody is
touching smp_cpus_down except ourselves, we can safely do the calculation
at the end before decrementing smp_cpus_down.
Hopefully this makes the logic easier to read, side benefit is we drop one
temp var.
Peter Xu [Fri, 13 Jun 2025 14:12:13 +0000 (10:12 -0400)]
migration/postcopy: Cache the tid->vcpu mapping for blocktime
Looking up the vCPU index for each fault can be expensive when there're
hundreds of vCPUs. Provide a cache for tid->vcpu instead with a hash
table, then lookup from there.
When at it, add another counter to record how many non-vCPU faults it gets.
For example, the main thread can also access a guest page that was missing.
These kind of faults are not accounted by blocktime so far.
Peter Xu [Fri, 13 Jun 2025 14:12:12 +0000 (10:12 -0400)]
migration/postcopy: Initialize blocktime context only until listen
Before this patch, the blocktime context can be created very early, because
postcopy_ram_supported_by_host() <- migrate_caps_check() can happen during
migration object init.
The trick here is the blocktime context needs system vCPU information,
which seems to be possible to change after that point. I didn't verify it,
but it doesn't sound right.
Now move it out and initialize the context only when postcopy listen
starts. That is already during a migration so it should be guaranteed the
vCPU topology can never change on both sides.
While at it, assert that the ctx isn't created instead this time; the old
"if" trick isn't needed when we're sure it will only happen once now.
Peter Xu [Fri, 13 Jun 2025 14:12:11 +0000 (10:12 -0400)]
migration/postcopy: Report fault latencies in blocktime
Blocktime so far only cares about the time one vcpu (or the whole system)
got blocked. It would be also be helpful if it can also report the latency
of page requests, which could be very sensitive during postcopy.
Blocktime itself is sometimes not very important, especially when one
thinks about KVM async PF support, which means vCPUs are literally almost
not blocked at all because the guest OS is smart enough to switch to
another task when a remote fault is needed.
However, latency is still sensitive and important because even if the guest
vCPU is running on threads that do not need a remote fault, the workload
that accesses some missing page is still affected.
Add two entries to the report, showing how long it takes to resolve a
remote fault. Mention in the QAPI doc that this is not the real average
fault latency, but only the ones that was requested for a remote fault.
Unwrap get_vcpu_blocktime_list() so we don't need to walk the list twice,
meanwhile add the entry checks in qtests for all postcopy tests.
Cc: Markus Armbruster <armbru@redhat.com> Cc: Dr. David Alan Gilbert <dave@treblig.org> Reviewed-by: Fabiano Rosas <farosas@suse.de> Tested-by: Mario Casquero <mcasquer@redhat.com> Link: https://lore.kernel.org/r/20250613141217.474825-9-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Signed-off-by: Fabiano Rosas <farosas@suse.de>
Peter Xu [Fri, 13 Jun 2025 14:12:09 +0000 (10:12 -0400)]
migration/postcopy: Bring blocktime layer to ns level
With 64-bit fields, it is trivial. The caution is when exposing any values
in QMP, it was still declared with milliseconds (ms). Hence it's needed to
do the convertion when exporting the values to existing QMP queries.
Peter Xu [Fri, 13 Jun 2025 14:12:07 +0000 (10:12 -0400)]
migration/postcopy: Make all blocktime vars 64bits
I am guessing it was used to be 32bits because of the atomic ops. Now all
the atomic ops are gone and we're protected by a mutex instead, it's ok we
can switch to 64 bits.
Reasons to move over:
- Allow further patches to change the unit from ms to us: with postcopy
preempt mode, we're really into hundreds of microseconds level on
blocktime. We'd better be able to trap those.
- This also paves way for some other tricks that the original version
used to avoid overflows, e.g., start_time was almost only useful before
to make sure the sampled timestamp won't overflow a 32-bit field.
- This prepares further reports on top of existing data collected,
e.g. average page fault latencies. When average operation is taken into
account, milliseconds are simply too coarse grained.
When at it:
- Rename page_fault_vcpu_time to vcpu_blocktime_start.
- Rename vcpu_blocktime to vcpu_blocktime_total.
- Touch up the trace-events to not dump blocktime ctx pointer
Peter Xu [Fri, 13 Jun 2025 14:12:05 +0000 (10:12 -0400)]
migration/postcopy: Push blocktime start/end into page req mutex
The postcopy blocktime feature was tricky that it used quite some atomic
operations over quite a few arrays and vars, without explaining how that
would be thread safe. The thread safety here is about concurrency between
the fault thread and the fault resolution threads, possible to access the
same chunk of data. All these atomic ops can be expensive too before
knowing clearly how it works.
OTOH, postcopy has one page_request_mutex used to serialize the received
bitmap updates. So far it's ok - we don't yet have a lot of threads
contending the lock. It might change after multifd will be supported, but
that's a separate story. What is important is, with that mutex, it's
pretty lightweight to move all the blocktime maintenance into the mutex
critical section. It's because the blocktime layer is lightweighted:
almost "remember which vcpu faulted on which address", and "ok we get some
fault resolved, calculate how long it takes". It's also an optional
feature for now (but I have thought of changing that, maybe in the future).
Let's push the blocktime layer into the mutex, so that it's always
thread-safe even without any atomic ops.
To achieve that, I'll need to add a tid parameter on fault path so that
it'll start to pass the faulted thread ID into deeper the stack, but not
too deep. When at it, add a comment for the shared fault handler (for
example, vhost-user devices running with postcopy), to mention a TODO. One
reason it might not be trivial is that vhost-user's userfaultfds should be
opened by vhost-user process, so it's pretty hard to control making sure
the TID feature will be around. It wasn't supported before, so keep it
like that for now.
Now we should be as ease when everything is protected by a mutex that we
always take anyway.
One side effect: we can finally remove one ramblock_recv_bitmap_test() in
mark_postcopy_blocktime_begin(), which was pretty weird and which also
includes a weird (but maybe necessary.. but maybe not?) operation to inject
a blocktime entry then quickly erase it.. When we're with the mutex, and
when we make sure it's invoked after checking the receive bitmap, it's not
needed anymore. Instead, we assert.
As another side effect, this paves way for removing all atomic ops in all
the mem accesses in blocktime layer.
Note that we need a stub for mark_postcopy_blocktime_begin() for Windows
builds.
Peter Xu [Fri, 13 Jun 2025 14:08:00 +0000 (10:08 -0400)]
migration: Rewrite the migration complete detect logic
There're a few things off here in that logic, rewrite it. When at it, add
rich comment to explain each of the decisions.
Since this is very sensitive path for migration, below are the list of
things changed with their reasonings.
(1) Exact pending size is only needed for precopy not postcopy
Fundamentally it's because "exact" version only does one more deep
sync to fetch the pending results, while in postcopy's case it's
never going to sync anything more than estimate as the VM on source
is stopped.
(2) Do _not_ rely on threshold_size anymore to decide whether postcopy
should complete
threshold_size was calculated from the expected downtime and
bandwidth only during precopy as an efficient way to decide when to
switchover. It's not sensible to rely on threshold_size in postcopy.
For precopy, if switchover is decided, the migration will complete
soon. It's not true for postcopy. Logically speaking, postcopy
should only complete the migration if all pending data is flushed.
Here it used to work because save_complete() used to implicitly
contain save_live_iterate() when there's pending size.
Even if that looks benign, having RAMs to be migrated in postcopy's
save_complete() has other bad side effects:
(a) Since save_complete() needs to be run once at a time, it means
when moving RAM there's no way moving other things (rather than
round-robin iterating the vmstate handlers like what we do with
ITERABLE phase). Not an immediate concern, but it may stop working
in the future when there're more than one iterables (e.g. vfio
postcopy).
(b) postcopy recovery, unfortunately, only works during ITERABLE
phase. IOW, if src QEMU moves RAM during postcopy's save_complete()
and network failed, then it'll crash both QEMUs... OTOH if it failed
during iteration it'll still be recoverable. IOW, this change should
further reduce the window QEMU split brain and crash in extreme cases.
If we enable the ram_save_complete() tracepoints, we'll see this
before this patch:
This shouldn't be super important, the movement makes sure there's
only one in_postcopy check, then we are clear on what we do with the
two completely differnt use cases (precopy v.s. postcopy).
(4) Trivial touch up on threshold_size comparision