James Hilliard [Fri, 8 May 2026 08:56:16 +0000 (10:56 +0200)]
target/mips: add Octeon V3MULU instruction
V3MULU extends VMULU across the full Octeon3 multiplier state, adding rt
and queued partial products.
Return the low result while shifting the remaining accumulated limbs back
into P[0] through P[5].
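A rough plain-C model of the accumulate-and-shift arithmetic described above may help. It assumes MPL and P are little-endian arrays of n 64-bit limbs and that rt enters at the lowest limb. mac_step() is a hypothetical name; this illustrates the arithmetic and is not QEMU's translator code.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a multi-limb multiply-accumulate step:
 *   acc = mpl * rs + rt + p
 * where mpl and p hold n little-endian 64-bit limbs (n >= 1).
 * The low limb of acc is returned (the rd result) and the remaining
 * n limbs are shifted down into p[0..n-1].
 * Uses unsigned __int128, a GCC/Clang extension.
 */
static uint64_t mac_step(const uint64_t *mpl, uint64_t rs, uint64_t rt,
                         uint64_t *p, size_t n)
{
    unsigned __int128 carry = rt;  /* assumption: rt enters at limb 0 */
    uint64_t low = 0;

    for (size_t i = 0; i < n; i++) {
        /* (2^64-1)^2 + 2*(2^64-1) == 2^128-1, so this cannot overflow. */
        unsigned __int128 t = (unsigned __int128)mpl[i] * rs + p[i] + carry;

        if (i == 0) {
            low = (uint64_t)t;       /* returned result */
        } else {
            p[i - 1] = (uint64_t)t;  /* limb i moves down to p[i-1] */
        }
        carry = t >> 64;
    }
    p[n - 1] = (uint64_t)carry;      /* final carry becomes the top limb */
    return low;
}
```

Under those assumptions, n = 6 gives the shift of the remaining limbs into P[0] through P[5].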
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-23-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:55:47 +0000 (10:55 +0200)]
target/mips: add Octeon VMM0 instruction
VMM0 multiplies MPL[0] by rs, adds rt and the queued P[0] partial
product, returns the low result, and feeds that result back into MPL[0].
It sets MPL[1] to zero and clears partial products.
Include hardware-backed regression coverage for VMM0 MPL1 zeroing.
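The VMM0 state update can be sketched in plain C as follows. OcteonMulModel and vmm0_model() are hypothetical names, and truncating the result to 64 bits is an assumption of this sketch.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the Octeon3 multiplier limbs. */
typedef struct {
    uint64_t mpl[6];
    uint64_t p[6];
} OcteonMulModel;

/*
 * Sketch of VMM0: MPL[0] * rs + rt + P[0] (wrapping mod 2^64, an
 * assumption) is returned and written back to MPL[0]; MPL[1] is
 * zeroed and all partial products are cleared.
 */
static uint64_t vmm0_model(OcteonMulModel *s, uint64_t rs, uint64_t rt)
{
    uint64_t res = s->mpl[0] * rs + rt + s->p[0];

    s->mpl[0] = res;                 /* feed the result back */
    s->mpl[1] = 0;
    memset(s->p, 0, sizeof(s->p));   /* clear partial products */
    return res;
}
```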
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-22-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:55:23 +0000 (10:55 +0200)]
target/mips: add Octeon VMULU instruction
VMULU multiplies the active Octeon multiplier state by rs, adds rt and
queued partial products, returns the low result, and advances P[0]/P[1]
with carry limbs.
Expand the two-limb accumulator operation inline with TCG so the result
and partial-product state stay visible to the optimizer.
Add a mips64/mips64el linux-user TCG smoke test for representative
Octeon multiplier instruction paths.
Include hardware-backed regression coverage for MTP0 P1 zeroing.
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-21-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:53:11 +0000 (10:53 +0200)]
target/mips: add Octeon MTP instructions
Add the MTP0, MTP1, and MTP2 forms. MTP0 loads the low Octeon3
partial-product pair from rs/rt into P[0]/P[3], MTP1 loads the middle
pair into P[1]/P[4], and MTP2 loads the high pair into P[2]/P[5].
For MTP0, also set P[1] to zero for backward compatibility with
Octeon2 VMULU.
Legacy single-source encodings have rt encoded as $zero, so the same
translator path also preserves the older Octeon behavior.
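A minimal sketch of the MTPx register effects, assuming a six-limb P array as in the Octeon3 state. OcteonMulModel and mtp_model() are hypothetical names; legacy single-source encodings correspond to calling it with rt = 0.

```c
#include <stdint.h>

/* Hypothetical container for the Octeon3 multiplier limbs. */
typedef struct {
    uint64_t mpl[6];
    uint64_t p[6];
} OcteonMulModel;

/*
 * Sketch of MTPx (x = 0, 1, 2): rs goes to P[x] and rt to P[x+3].
 * MTP0 additionally zeroes P[1] for Octeon2 VMULU compatibility.
 */
static void mtp_model(OcteonMulModel *s, unsigned x, uint64_t rs, uint64_t rt)
{
    s->p[x] = rs;
    s->p[x + 3] = rt;
    if (x == 0) {
        s->p[1] = 0;
    }
}
```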
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-20-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:52:58 +0000 (10:52 +0200)]
target/mips: add Octeon MTM instructions
Add the MTM0, MTM1, and MTM2 forms that load the Octeon3 multiplier
operand pair from rs/rt into MPL[x] and MPL[x+3], then clear the partial
products. For MTM0, also set MPL[1] to zero for backward compatibility
with Octeon2 VMULU.
Legacy single-source encodings have rt encoded as $zero, so the same
translator path also preserves the older Octeon behavior.
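The MTMx effects can be modelled the same way. OcteonMulModel and mtm_model() are hypothetical names used only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the Octeon3 multiplier limbs. */
typedef struct {
    uint64_t mpl[6];
    uint64_t p[6];
} OcteonMulModel;

/*
 * Sketch of MTMx (x = 0, 1, 2): rs goes to MPL[x] and rt to MPL[x+3],
 * then all partial products are cleared. MTM0 also zeroes MPL[1].
 */
static void mtm_model(OcteonMulModel *s, unsigned x, uint64_t rs, uint64_t rt)
{
    s->mpl[x] = rs;
    s->mpl[x + 3] = rt;
    if (x == 0) {
        s->mpl[1] = 0;               /* Octeon2 VMULU compatibility */
    }
    memset(s->p, 0, sizeof(s->p));   /* clear partial products */
}
```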
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-19-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 15:12:07 +0000 (09:12 -0600)]
target/mips: add Octeon multiplier state
Add per-thread Octeon multiplier state for the MPL and P limb banks used
by the VMULU/VMM0/V3MULU instruction family.
Octeon3 extends the older MPL0-MPL2/P0-P2 state with high lanes
MPL3-MPL5/P3-P5, programmed by the two-source MTM/MTP forms. Represent
both banks as uint64_t arrays so the TC state matches the architected
64-bit limb layout used by Octeon68XX user-mode code.
Expose MPL/P as global TCG variables so the multiplier translators can
expand inline without helper calls.
Migrate the multiplier registers in an Octeon-only subsection so
non-Octeon CPU models do not grow migration state.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-18-philmd@linaro.org>
Add a helper for multi-limb 64-bit addition. The helper emits native
carry-chain TCG ops when they are available and falls back to explicit
carry propagation otherwise.
This lets target translators build wider integer accumulators inline
without open-coding the same add-with-carry sequence at each use site.
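For illustration, the explicit carry-propagation fallback corresponds to the following plain-C loop over little-endian limbs. add_limbs() is a hypothetical function, not the signature of the TCG helper added here.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * d = a + b over n little-endian 64-bit limbs; returns the carry out.
 * Carries are detected with unsigned comparisons, as an explicit
 * fallback must do when no native carry-chain op is available.
 */
static uint64_t add_limbs(uint64_t *d, const uint64_t *a,
                          const uint64_t *b, size_t n)
{
    uint64_t carry = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t s = a[i] + carry;
        uint64_t c1 = s < carry;   /* overflow from adding the carry */
        uint64_t r = s + b[i];
        uint64_t c2 = r < s;       /* overflow from adding b[i] */

        d[i] = r;
        carry = c1 | c2;           /* at most one of them can be set */
    }
    return carry;
}
```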
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-17-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:51:57 +0000 (10:51 +0200)]
target/mips: add Octeon ZCB and ZCBT instructions
ZCB zeros the 128-byte cache block containing the base address. ZCBT has
the same user-mode-visible memory effect for QEMU purposes.
Model both forms with a single decodetree wildcard entry, align the
address down to a 128-byte line, and store eight zero 128-bit chunks to
guest memory.
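A host-side sketch of the same alignment and zeroing, assuming guest memory is a flat host buffer. zcb_model() is a hypothetical name, and memset() stands in for each 128-bit store.

```c
#include <stdint.h>
#include <string.h>

#define ZCB_LINE_SIZE 128u

/*
 * Align the address down to its 128-byte line and zero the line as
 * eight 16-byte (128-bit) chunks.
 */
static void zcb_model(uint8_t *mem, uint64_t addr)
{
    uint64_t line = addr & ~(uint64_t)(ZCB_LINE_SIZE - 1);

    for (unsigned i = 0; i < ZCB_LINE_SIZE / 16; i++) {
        memset(mem + line + i * 16, 0, 16);  /* one 128-bit store */
    }
}
```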
Acked-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-16-philmd@linaro.org>
Inspired-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20260520121644.10835-1-philmd@linaro.org>
[rth: Move the function to tcg-op.c]
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
James Hilliard [Fri, 8 May 2026 08:51:40 +0000 (10:51 +0200)]
target/mips: add Octeon SAAD instruction
SAAD is the doubleword form of SAA: it atomically adds rt to the
naturally aligned 64-bit doubleword at base and discards the old memory
value.
Route it through the common SAA/SAAD translator so the MemOp selects the
aligned doubleword transaction size.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-15-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:51:24 +0000 (10:51 +0200)]
target/mips: add Octeon SAA instruction
SAA atomically adds rt to the naturally aligned 32-bit word at base and
discards the old memory value.
Implement the common SAA/SAAD translator with TCG atomic_fetch_add_i64.
The MemOp selects the word or doubleword transaction size. QEMU only has
one Octeon CPU model today, so keep SAA/SAAD under the existing Octeon
instruction feature bucket instead of adding a finer-grained Octeon+
feature bit.
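As a host-side analogy, C11 atomics express the same store-add semantics: the fetch-add result is simply discarded. saa_model() and saad_model() are hypothetical names; the QEMU implementation emits the TCG atomic op instead, and alignment checking is not modelled here.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Word form: only the low 32 bits of rt are added (assumption). */
static void saa_model(_Atomic uint32_t *word, uint64_t rt)
{
    (void)atomic_fetch_add(word, (uint32_t)rt);  /* old value discarded */
}

/* Doubleword form: the same operation at 64-bit width. */
static void saad_model(_Atomic uint64_t *dword, uint64_t rt)
{
    (void)atomic_fetch_add(dword, rt);           /* old value discarded */
}
```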
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-14-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:50:58 +0000 (10:50 +0200)]
target/mips: add Octeon LWUX instruction
LWUX performs an indexed unsigned word load from base + index and
zero-extends the result into rd.
Add the decode entry and route it through the common indexed-load
translator with MO_UL.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-13-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:50:41 +0000 (10:50 +0200)]
target/mips: add Octeon LHUX instruction
LHUX performs an indexed unsigned halfword load from base + index and
zero-extends the result into rd.
Add the decode entry and reuse the common indexed-load translator with
MO_UW.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-12-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 08:50:28 +0000 (10:50 +0200)]
target/mips: add Octeon LBX instruction
LBX performs an indexed signed byte load from base + index and writes the
sign-extended result to rd.
Wire the existing indexed-load helper to MO_SB so Octeon user-mode
binaries can use the signed byte variant alongside the existing LBUX
path.
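The difference between the signed and unsigned indexed forms comes down to the extension applied after the load. This sketch assumes a flat host buffer and host byte order for multi-byte loads; the *_model() functions are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* The effective address is base + index in every case. */
static int64_t lbx_model(const uint8_t *mem, uint64_t base, uint64_t index)
{
    return (int8_t)mem[base + index];   /* sign-extend (MO_SB) */
}

static uint64_t lbux_model(const uint8_t *mem, uint64_t base, uint64_t index)
{
    return mem[base + index];           /* zero-extend */
}

static uint64_t lhux_model(const uint8_t *mem, uint64_t base, uint64_t index)
{
    uint16_t v;
    memcpy(&v, mem + base + index, sizeof(v));
    return v;                           /* zero-extend (MO_UW) */
}

static uint64_t lwux_model(const uint8_t *mem, uint64_t base, uint64_t index)
{
    uint32_t v;
    memcpy(&v, mem + base + index, sizeof(v));
    return v;                           /* zero-extend (MO_UL) */
}
```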
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-11-philmd@linaro.org>
James Hilliard [Tue, 21 Apr 2026 17:27:30 +0000 (11:27 -0600)]
target/mips: split Octeon SEQI/SNEI decode
Decode the equality and inequality forms as explicit SEQI/SNEI
instructions rather than using shared generated SEQNEI entries.
The explicit decoder names match the architectural mnemonics, which
makes the translator entry points and trace/debug output easier to
correlate with the instruction set.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
[PMD: Split SEQNE vs SEQNEI (this patch)]
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-10-philmd@linaro.org>
James Hilliard [Mon, 20 Apr 2026 19:27:35 +0000 (21:27 +0200)]
target/mips: split Octeon SEQ/SNE decode
Decode the equality and inequality forms as explicit SEQ/SNE
instructions rather than using shared generated SEQNE entries.
The explicit decoder names match the architectural mnemonics, which
makes the translator entry points and trace/debug output easier to
correlate with the instruction set.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
[PMD: Split SEQNE (this patch) vs SEQNEI]
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-9-philmd@linaro.org>
James Hilliard [Tue, 21 Apr 2026 15:10:18 +0000 (17:10 +0200)]
target/mips: drop Octeon zero-register fast paths
EXTS, CINS, and POP route their destination writes through
gen_store_gpr(), which already discards writes to $zero. Remove the
remaining translator fast paths for destination $zero so these Octeon
instructions follow the same shape as BADDU/DMUL and the generic MIPS
translator helpers.
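Why the fast paths are redundant can be shown with a runtime analogue: if the single store routine already drops writes to register 0, callers need no special case. store_gpr_model() is a hypothetical stand-in, not QEMU's gen_store_gpr(), which emits TCG ops at translation time.

```c
#include <stdint.h>

/* Writes to register 0 ($zero) are silently discarded. */
static void store_gpr_model(uint64_t regs[32], unsigned reg, uint64_t val)
{
    if (reg != 0) {
        regs[reg] = val;
    }
}
```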
Add a mips64/mips64el linux-user TCG smoke test for representative
Octeon population count instruction paths.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-8-philmd@linaro.org>
BADDU and DMUL write their results to rd, not rt. Route writes through
gen_store_gpr() so rd == $zero is handled consistently.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-7-philmd@linaro.org>
James Hilliard [Fri, 8 May 2026 15:12:24 +0000 (09:12 -0600)]
tests/tcg/mips: add Octeon instruction smoke test
Add a mips64/mips64el linux-user TCG smoke test for representative
Octeon instruction paths.
Run the test with -cpu Octeon68XX and share the source between the
mips64 and mips64el target directories.
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20260520172313.23777-6-philmd@linaro.org>
James Hilliard [Sat, 11 Apr 2026 07:06:28 +0000 (01:06 -0600)]
target/mips: expose Octeon68XX floating-point support
Octeon68XX cores implement CP1. Advertise that in the CPU definition by
setting Config1.FP, enabling the writable Status bits, and providing the
FCR0/FCR31 defaults used by this CPU model.
This lets guests observe the expected floating-point feature bits and use
CP1 with -cpu Octeon68XX.
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-5-philmd@linaro.org>
James Hilliard [Wed, 8 Apr 2026 18:54:28 +0000 (20:54 +0200)]
linux-user/mips, target/mips: honor MIPS_FIXADE for unaligned accesses
Linux/MIPS enables software fixups for user-mode unaligned scalar
accesses by default through MIPS_FIXADE/TIF_FIXADE. QEMU linux-user did
not model that ABI, so MIPS guests took fatal AdEL/AdES exceptions unless
translation was forced to use unaligned host accesses.
Key MIPS translation blocks on the linux-user unaligned policy, implement
sysmips(MIPS_FIXADE) to toggle that policy, and raise SIGBUS/BUS_ADRALN
when fixups are disabled.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-4-philmd@linaro.org>
Implement the MIPS_ATOMIC_SET sysmips command as an aligned 32-bit atomic
exchange in target memory.
MIPS reports syscall errors through a separate register, so successful old
values can overlap the errno range. Write the return value and error flag
directly and return -QEMU_ESIGRETURN so the common syscall path leaves the
registers unchanged.
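The reason for writing the result and error flag directly can be sketched with a small model of the v0/a3 return convention. MipsSyscallRet and mips_atomic_set_model() are hypothetical, and sign-extending the 32-bit old value into v0 is an assumption of this sketch.

```c
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    uint64_t v0;   /* result, or positive errno on failure */
    uint64_t a3;   /* 0 = success, 1 = error */
} MipsSyscallRet;

/*
 * Exchange *word with newval and always report success, even when the
 * old value happens to look like a negative errno.
 */
static MipsSyscallRet mips_atomic_set_model(_Atomic uint32_t *word,
                                            uint32_t newval)
{
    uint32_t old = atomic_exchange(word, newval);
    MipsSyscallRet r = { .v0 = (uint64_t)(int64_t)(int32_t)old, .a3 = 0 };

    return r;
}
```

A generic "negative return means -errno" check would misreport an old value of -2 as ENOENT; keeping a3 separate avoids that ambiguity.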
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-3-philmd@linaro.org>
Add the target sysmips dispatcher and implement MIPS_FLUSH_CACHE as a
successful no-op for linux-user.
Self-modifying code is handled by QEMU's normal user-mode translation
invalidation machinery, so the target ABI only needs the syscall command
to be accepted.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: James Hilliard <james.hilliard1@gmail.com>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-Id: <20260520172313.23777-2-philmd@linaro.org>
Peter Maydell [Tue, 12 May 2026 11:15:36 +0000 (12:15 +0100)]
hw/intc/mips_gic: Avoid Coverity complaint in VP writes
The MIPS GIC does a check for a guest error in the write path for the
SH_MAP*_VP registers which triggers a Coverity complaint because it
assigns -1 to a uint64_t. The code doesn't misbehave because the -1
case will be caught by the following OFFSET_CHECK(), but the code
could be improved:
* there is no need to special case to avoid passing 0 to ctz64(),
because (unlike the compiler builtins) QEMU defines that this
has a specific behaviour, returning 64
* the OFFSET_CHECK() macro will go to the "bad_offset" label and
print an error implying that the guest wrote to an invalid
register offset. This is misleading about the actual problem,
which is that the guest wrote a bogus value to a valid register
offset
Make the error check print a better log message, and avoid the
special casing on ctz64(); in passing, this should also make
Coverity happier.
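A sketch of the check without the zero special case, assuming a GCC/Clang __builtin_ctzll() for the non-zero path. ctz64_model(), decode_vp_model(), and the num_vps bound are hypothetical, and fprintf() stands in for QEMU's guest-error logging.

```c
#include <stdint.h>
#include <stdio.h>

/* ctz64 semantics as described above: defined for 0, returning 64. */
static int ctz64_model(uint64_t val)
{
    return val ? __builtin_ctzll(val) : 64;  /* builtin is undefined for 0 */
}

/* Returns the VP index, or -1 after logging when the value is bogus. */
static int decode_vp_model(uint64_t val, int num_vps)
{
    int vp = ctz64_model(val);

    if (vp >= num_vps) {
        fprintf(stderr, "bad VP mask value 0x%llx\n",
                (unsigned long long)val);
        return -1;
    }
    return vp;
}
```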
CID: 1547545
Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-ID: <20260512111536.3437645-1-peter.maydell@linaro.org>
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
We removed support for MIPS hosts. Remove the now unreachable
TCG host code.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20260511135312.38705-6-philmd@linaro.org>
We removed support for MIPS hosts, so the KVM MIPS code
is now unreachable. Remove it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20260511135312.38705-5-philmd@linaro.org>
"target/mips/cpu.h" is indirectly pulled in via the "system/kvm.h"
header, which the next commit will remove. Explicitly include the
"cpu.h" header, otherwise we'd get:

    hw/mips/mips_int.c:29:5: error: use of undeclared identifier 'MIPSCPU'
       29 |     MIPSCPU *cpu = opaque;
          |     ^
    hw/mips/mips_int.c:30:5: error: use of undeclared identifier 'CPUMIPSState'
       30 |     CPUMIPSState *env = &cpu->env;
          |     ^
    hw/mips/loongson3_virt.c:156:39: error: unknown type name 'MIPSCPU'
      156 | static uint64_t get_cpu_freq_hz(const MIPSCPU *cpu)
          |                                       ^

Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20260511135312.38705-4-philmd@linaro.org>
MIPS host support has been deprecated since commit 269ffaabc84
("buildsys: Remove support for 32-bit MIPS hosts"). Time
to remove it.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20260511135312.38705-3-philmd@linaro.org>
As mentioned in commit 269ffaabc84 ("buildsys: Remove support
for 32-bit MIPS hosts"), Debian 13 "Trixie" removed support for
MIPS.
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Message-Id: <20260511135312.38705-2-philmd@linaro.org>
docker: Remove LegacyKeyValueFormat warnings in non-generated files
Manually update Dockerfiles to not use legacy 'ENV key value' format:
https://docs.docker.com/reference/build-checks/legacy-key-value-format/
This removes warnings when building / using the containers:
- LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 98)
- LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 64)
- LegacyKeyValueFormat: "ENV key=value" should be used instead of legacy "ENV key value" format (line 97)
Signed-off-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com>
Reviewed-by: Brian Cain <brian.cain@oss.qualcomm.com>
Message-ID: <20260518102222.80735-7-philmd@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Stefan Hajnoczi [Wed, 20 May 2026 20:53:28 +0000 (16:53 -0400)]
Merge tag 'pull-vfio-20260520' of https://github.com/legoater/qemu into staging
vfio queue:
* Fix IRQ notifier return value in vfio/ap and vfio/ccw
* Fix vfio-user: reject malformed migration capabilities and avoid
leaking a duplicate device name
* Report overflow in migration size queries
* Fix s390x cpu_models build regression
* Update libvfio-user subproject to fix compilation on newer compilers
* Update update-linux-headers.sh to support typelimits.h and inject
VIRTIO_RING_NO_LEGACY in virtio_ring.h to fix the Windows build
* Replace abort() with g_assert_not_reached() in the vfio/pci
interrupt handling path
* Drop superfluous inclusion of hw-error.h from vfio device files
hongmianquan [Tue, 19 May 2026 13:43:15 +0000 (21:43 +0800)]
migration/cpr: use hashtable for cpr fds
Use a GHashTable to store cpr fds to reduce the time
consumption of `cpr_find_fd` in scenarios with a large
number of fds. The time complexity for `cpr_find_fd` is
reduced from O(N) to O(1) on average.
Keep cpr fds lookups in a GHashTable during normal runtime
while preserving the existing QLIST migration ABI. Build a
temporary QLIST from the hash table in pre_save and rebuild
the hash table from the loaded QLIST in post_load.
To demonstrate the performance improvement, we tested the total time
consumed by `cpr_find_fd` (called N times for N fds) under our real-world
business scenarios with different numbers of file descriptors. The results
are measured in nanoseconds:
| Number of FDs | Total time with QLIST (ns) | Total time with GHashTable (ns) |
|---------------|----------------------------|---------------------------------|
| 540 | 936,753 | 393,358 |
| 2,870 | 24,102,342 | 2,212,113 |
| 7,530 | 152,715,916 | 5,474,310 |
As shown in the data, the total lookup time grows quadratically with the QLIST
as the number of fds increases. With the GHashTable, the time consumption
remains linear (O(1) per lookup), significantly reducing the downtime during
the CPR process.
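A quick consistency check of the table against the expected growth rates, assuming each of the N lookups in the list case scans on the order of N entries while a hash lookup costs O(1) on average:

```latex
\begin{aligned}
T_{\mathrm{QLIST}}(N) &= \sum_{i=1}^{N} O(N) = O(N^2), &
T_{\mathrm{hash}}(N) &= \sum_{i=1}^{N} O(1) = O(N) \\
\frac{T_{\mathrm{QLIST}}(2870)}{T_{\mathrm{QLIST}}(540)}
  &= \frac{24{,}102{,}342}{936{,}753} \approx 25.7, &
\left(\tfrac{2870}{540}\right)^2 &\approx 28.2 \\
\frac{T_{\mathrm{hash}}(2870)}{T_{\mathrm{hash}}(540)}
  &= \frac{2{,}212{,}113}{393{,}358} \approx 5.6, &
\tfrac{2870}{540} &\approx 5.3
\end{aligned}
```

The QLIST column tracks the squared ratio of fd counts and the GHashTable column tracks the plain ratio, i.e. quadratic versus linear growth in total time.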
Bin Guo [Mon, 18 May 2026 11:01:12 +0000 (19:01 +0800)]
migration/multifd: cache channel count in multifd_send_sync_main
multifd_send_sync_main() is called once per RAM synchronization round
during live migration. It iterates over all multifd channels twice
(signal loop + wait loop), calling migrate_multifd_channels()
independently in each loop header.
Cache migrate_multifd_channels() in a local thread_count variable at
function entry, matching the pattern already used in
multifd_send_setup() and multifd_recv_setup(). Since a loop condition
is re-evaluated on every iteration, this replaces the per-channel
config lookups in both loops with a single call per sync.
Bin Guo [Mon, 18 May 2026 11:01:11 +0000 (19:01 +0800)]
migration/multifd: cache migrate_multifd_channels() in send/recv hot paths
multifd_send() and multifd_recv() are on the per-page-batch hot path
of live migration. Both functions call migrate_multifd_channels()
multiple times (3-4 calls each) for modulo arithmetic in the
round-robin channel selection loop.
Each call goes through migrate_get_current() -> dereference
MigrationState -> read parameters.multifd_channels. While each
individual call is cheap, these functions execute for every page
batch during the entire migration, easily millions of times.
Cache the return value in a local variable at function entry. The
channel count is fixed for the duration of a migration and cannot
change mid-flight.
For multifd_send(): 3 calls reduced to 1.
For multifd_recv(): 4 calls reduced to 1.
Bin Guo [Mon, 18 May 2026 11:01:09 +0000 (19:01 +0800)]
migration/multifd: fix off-by-one in recv channel ID validation
multifd_recv_initial_packet() validates the channel ID received from
the source against the configured number of channels. The current
check uses '>' which allows msg.id == N to pass through. This ID is
then used to index multifd_recv_state->params[msg.id], which was
allocated with g_new0(MultiFDRecvParams, N) -- an out-of-bounds
access.
A malicious or buggy source could send id == N and cause heap
corruption on the destination.
Fix by changing '>' to '>='. Also fix the error message to say
"exceeds channel count" for accuracy.
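The corrected bound in isolation: an array of N elements is indexed by 0 through N - 1, so the rejection test must be id >= N. channel_id_valid() is a hypothetical helper for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Accept only 0 .. n_channels - 1; id == n_channels is out of bounds. */
static bool channel_id_valid(uint32_t id, uint32_t n_channels)
{
    return id < n_channels;   /* i.e. reject when id >= n_channels */
}
```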
Bin Guo [Mon, 18 May 2026 11:01:08 +0000 (19:01 +0800)]
migration/savevm: use stack-allocated bitmap in configuration_validate_capabilities
configuration_validate_capabilities() allocates a bitmap on the heap
to track source capabilities via bitmap_new()/g_free(). Since
MIGRATION_CAPABILITY__MAX is a small compile-time constant (< 64),
a heap allocation for a bitmap this small is wasteful: it adds
malloc/free overhead and a potential cache miss for a transient
8-byte allocation.
Replace with DECLARE_BITMAP() on the stack and bitmap_zero() to
initialize. This eliminates the heap round-trip entirely.
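A sketch of the stack-allocated pattern, with hypothetical stand-ins for QEMU's bitmap macros and an assumed capability count below 64. The array size is a compile-time constant, so no heap round-trip is needed.

```c
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-ins for DECLARE_BITMAP()-style helpers. */
#define BITS_PER_LONG_MODEL (sizeof(unsigned long) * CHAR_BIT)
#define BITS_TO_LONGS_MODEL(n) \
    (((n) + BITS_PER_LONG_MODEL - 1) / BITS_PER_LONG_MODEL)
#define CAP_MAX_MODEL 40   /* assumed capability count, < 64 */

/* Record caps[0..ncaps-1] in a stack bitmap and test for 'wanted'. */
static bool caps_contain(const int *caps, int ncaps, int wanted)
{
    unsigned long bits[BITS_TO_LONGS_MODEL(CAP_MAX_MODEL)];

    memset(bits, 0, sizeof(bits));                      /* like bitmap_zero() */
    for (int i = 0; i < ncaps; i++) {
        bits[caps[i] / BITS_PER_LONG_MODEL] |=
            1UL << (caps[i] % BITS_PER_LONG_MODEL);     /* like set_bit() */
    }
    return (bits[wanted / BITS_PER_LONG_MODEL] >>
            (wanted % BITS_PER_LONG_MODEL)) & 1UL;
}
```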
Bin Guo [Mon, 18 May 2026 11:01:07 +0000 (19:01 +0800)]
migration/vmstate: avoid per-element heap churn in vmsd ptr marker field
For every NULL slot in a VMS_ARRAY_OF_POINTER (or every entry of a
dynamic array), the saver allocates a 1-element fake VMStateField via
g_new0 and frees it again right after the save. For arrays of
thousands of entries this is thousands of malloc/free pairs on the
hot save path.
Replace the heap-allocated marker with a stack-resident field
populated by an init helper. The caller passes a pointer to a local
VMStateField, the helper fills it in (still asserting the
precondition), and no g_free is needed.
Bin Guo [Mon, 18 May 2026 11:01:06 +0000 (19:01 +0800)]
migration/global_state: replace strcpy("") with explicit NUL termination
Drop the unnecessary strcpy() of an empty literal (and its spurious
(char *)& cast) in favor of a direct NUL store, which avoids the
libc call and no longer needs a cast that could hide a type mismatch.
Fabiano Rosas [Tue, 5 May 2026 16:09:14 +0000 (13:09 -0300)]
tests/qtest/migration: Unify URIs
The migration tests have always used localhost migration and therefore
the same URI for both sides of migration. Change the listen_uri and
connect_uri into a single uri variable.
For migrations using sockets, there's the possibility of detecting the
socket address the destination side is using. For those, keep using
different variables for migrate_qmp and migrate_incoming_qmp.
Fabiano Rosas [Tue, 5 May 2026 16:09:13 +0000 (13:09 -0300)]
tests/qtest/migration: Stop passing URI into migrate_start
Don't allow changing the default -incoming URI via migrate_start. The
default is now -incoming defer. If a test really needs to alter this
(such as with CPR), the target_opts variable is still available to
change the command line.
(aside from the larger goal of using defer, this change is a step
towards allowing migrate_start() to be invoked only once for all
tests)
Fabiano Rosas [Tue, 5 May 2026 16:09:07 +0000 (13:09 -0300)]
tests/qtest/migration: Set compression method in compression-tests
Stop calling a common function to set the multifd compression
method. The default method is "none", so tests that don't use
compression don't need that function, which will be removed.
Fabiano Rosas [Tue, 5 May 2026 16:09:06 +0000 (13:09 -0300)]
tests/qtest/migration: Defer by default in precopy_common
As a design direction, we're restricting the usage of the command line
option -incoming <URI>. The alternative -incoming defer should be used
instead.
Make all precopy_common tests defer by default.
Using the defer option means that QEMU will not start the incoming
migration automatically, so issue the migrate-incoming QMP command
explicitly. With that command added, the invocation in the
multifd_common hook becomes redundant, so remove it.
Fabiano Rosas [Tue, 5 May 2026 16:09:04 +0000 (13:09 -0300)]
tests/qtest/migration: Use precopy_unix_common for ignore-shared test
The ignore-shared test has the same code as the precopy_common test
but inverting (probably incorrectly) the order of a few event
waits. Change it to use the common code instead.
Fabiano Rosas [Tue, 5 May 2026 16:09:00 +0000 (13:09 -0300)]
tests/qtest/migration: Move cpr transfer logic into cpr-tests.c
There's some amount of cpr-transfer logic at precopy_common, which in
retrospect was a bad idea. For just two tests, that's too much code to
be in the common function. Move it to the cpr file. We'll need this
cleanup for subsequent improvements.
Avihai Horon [Tue, 5 May 2026 08:14:10 +0000 (11:14 +0300)]
scripts/update-linux-headers: Add typelimits.h
Upstream Linux added include/uapi/linux/typelimits.h and includes it
from ethtool.h [1][2].
Teach update-linux-headers.sh to install that header into
standard-headers to be able to update kernel headers to versions that
include the above changes.
[1] ca9d74eb5f6a ("uapi: add INT_MAX and INT_MIN constants")
[2] a8a11e5237ae ("ethtool: uapi: Use UAPI definition of INT_MAX")
Cédric Le Goater [Wed, 13 May 2026 09:45:22 +0000 (11:45 +0200)]
vfio/migration: Detect and report overflow in migration size queries
VFIO migration ioctls (VFIO_DEVICE_FEATURE_MIG_DATA_SIZE and
VFIO_MIG_GET_PRECOPY_INFO) return device-estimated migration sizes as
uint64_t values. A misbehaving kernel driver could return values that
are unreasonably large, which would corrupt the size accounting used
to decide migration convergence.
This misbehavior occurred a few times when testing migration of a VM
with an assigned NVIDIA vGPU and an MLX5 VF. In some of the save
iterations, the reported precopy and stopcopy sizes were unreasonably
large (close to UINT64_MAX):
Add a helper to detect values that exceed INT64_MAX, which is far
beyond any realistic device state size, and report them with an error
message. Return -ERANGE from the query functions so callers can abort
the migration rather than proceeding with corrupted estimates.
However, the callers don't yet check the return value to actually stop
the migration.
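A minimal sketch of such a check, assuming the helper only needs the reported value and a label for the message. check_mig_size() is a hypothetical name, and fprintf() stands in for QEMU's error reporting.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Reject a device-reported size above INT64_MAX, which no realistic
 * device state reaches, and return -ERANGE so a caller can abort.
 */
static int check_mig_size(const char *what, uint64_t size)
{
    if (size > INT64_MAX) {
        fprintf(stderr, "%s: implausible migration size 0x%llx\n",
                what, (unsigned long long)size);
        return -ERANGE;
    }
    return 0;
}
```

As the commit notes, returning the error only helps once callers act on it.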
Cédric Le Goater [Mon, 11 May 2026 11:19:13 +0000 (13:19 +0200)]
update-linux-headers: Inject VIRTIO_RING_NO_LEGACY in virtio_ring.h
The kernel commit 3c4629b68dbe ("virtio: uapi: avoid usage of libc
types") changed the virtio_ring.h header and this breaks the build on
Windows which requires the uintptr_t type to cast from pointer to
integer.
Have the update script inject '#define VIRTIO_RING_NO_LEGACY' at the
top of the synced header, right after the include guard. This compiles
out the legacy code section that is incompatible with Windows.
GuoHan Zhao [Sun, 10 May 2026 08:43:53 +0000 (16:43 +0800)]
vfio/ccw: Return false when IRQ notifier setup fails
vfio_ccw_register_irq_notifier() cleans up the fd handler and EventNotifier
when vfio_device_irq_set_signaling() fails, but still returns true to its
caller.
Return false after cleanup so the caller can handle the failed
registration path instead of treating it as a successful notifier setup.
Fixes: 8aaeff97acee ("vfio/ccw: Make vfio_ccw_register_irq_notifier() return a bool")
Signed-off-by: GuoHan Zhao <zhaoguohan@kylinos.cn>
Reviewed-by: Eric Farman <farman@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260510084353.58263-3-zhaoguohan@kylinos.cn
Signed-off-by: Cédric Le Goater <clg@redhat.com>
GuoHan Zhao [Sun, 10 May 2026 08:43:52 +0000 (16:43 +0800)]
vfio/ap: Return false when IRQ notifier setup fails
vfio_ap_register_irq_notifier() cleans up the fd handler and EventNotifier
when vfio_device_irq_set_signaling() fails, but still returns true to its
caller.
Return false after cleanup so the caller can handle the failed
registration path instead of treating it as a successful notifier setup.
Fixes: cbd470f0aac5 ("vfio/ap: Make vfio_ap_register_irq_notifier() return a bool")
Signed-off-by: GuoHan Zhao <zhaoguohan@kylinos.cn>
Reviewed-by: Anthony Krowiak <akrowiak@linux.ibm.com>
Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com>
Reviewed-by: Cédric Le Goater <clg@redhat.com>
Link: https://lore.kernel.org/qemu-devel/20260510084353.58263-2-zhaoguohan@kylinos.cn
Signed-off-by: Cédric Le Goater <clg@redhat.com>
vfio/pci: Replace abort() with g_assert_not_reached()
This check was originally introduced in commit b3ebc10c373e
("vfio-pci: Add debug config options to disable MSI/X KVM support") as
part of a debug block to retrieve the MSI/MSIX message, and was later
moved by commit 0de70dc7bab1 ("vfio/pci: Rename MSI/X functions for
easier tracing") into the main interrupt handling path, becoming
production code.
Under normal conditions, this code path cannot be reached because the
BQL serializes all handler registration, vdev->interrupt updates, and
handler removal. Replace abort() with g_assert_not_reached(), which is
preferred nowadays, and add a comment clarifying the purpose.
check_migr() sets an error when the migration capability is not an object,
but still returns true. This lets version negotiation continue with an
Error set and reports the wrong capability name in the diagnostic.
Return false for the malformed capability, and report the migration
capability name.
vfio_user_pci_realize() assigns vbasedev->name before connecting to the
server, then assigns the same name again after installing the request
handler. The second assignment overwrites the first allocation, so only
the second string can be freed later by vfio_device_free_name().
Drop the duplicate assignment and keep the first name allocation, which is
also available on connection failures for error reporting.
Introduce a source set common to system / user. Start it
with the files built in both sets: 'cpu_models_user.c'
and 'gdbstub.c'. No logical change intended.
Except that's not true:
git show 0b83acf2f0 | grep cpu_models
with the files built in both sets: 'cpu_models_user.c'
+ 'cpu_models_user.c',
- 'cpu_models_system.c',
- 'cpu_models_user.c',
Restore the s390x_user_ss section, move "cpu_models_user.c" back
into it, and re-add "cpu_models_system.c" to the common_system
section.
Reported-by: Cédric Le Goater <clg@redhat.com> Fixes: 0b83acf2f05 ("target/s390x: Introduce common system/user meson source set") Signed-off-by: Eric Farman <farman@linux.ibm.com> Reviewed-by: Farhan Ali <alifm@linux.ibm.com> Reviewed-by: Pierrick Bouvier <pierrick.bouvier@oss.qualcomm.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Matthew Rosato <mjrosato@linux.ibm.com> Tested-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Cédric Le Goater <clg@redhat.com> Link: https://lore.kernel.org/qemu-devel/20260511163541.192533-1-farman@linux.ibm.com Signed-off-by: Cédric Le Goater <clg@redhat.com>
Jaemyung Lee [Thu, 14 May 2026 08:10:48 +0000 (17:10 +0900)]
tests/qtest: Add UFS Write Booster QTest
Add 'wb-init' and 'wb-read-write' test cases to tests/qtest/ufs-test.c.
'wb-init' tests that WB support is properly initialized with the UFS
device, and 'wb-read-write' tests that WB can be enabled and that WRITE I/O
can be handled/buffered as a WB command.
Signed-off-by: Jaemyung Lee <jaemyung.lee@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
Jaemyung Lee [Thu, 14 May 2026 08:10:46 +0000 (17:10 +0900)]
hw/ufs: Add idle operation
When no I/O occurs, the UFS device performs various internal operations.
To emulate this, add a timer that periodically checks the current I/O
status of the device and calls the ufs_process_idle() function when idle.
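A minimal sketch of the idle check, assuming the timer callback only needs a count of outstanding requests. UfsIdleModel, process_idle() and idle_timer_tick() are hypothetical names, not the hw/ufs code, and re-arming the timer is omitted:

```c
#include <assert.h>

/* Hypothetical model: names are illustrative, not hw/ufs code. */
typedef struct {
    int outstanding_io;  /* requests currently being processed */
    int idle_runs;       /* how many times idle work was performed */
} UfsIdleModel;

/* Stand-in for ufs_process_idle(): internal housekeeping. */
static void process_idle(UfsIdleModel *u)
{
    u->idle_runs++;
}

/*
 * Body of the periodic timer callback: only do idle work when no I/O is
 * in progress. A real device would also re-arm the timer here.
 */
static void idle_timer_tick(UfsIdleModel *u)
{
    assert(u->outstanding_io >= 0);
    if (u->outstanding_io == 0) {
        process_idle(u);
    }
}
```

A tick while requests are pending does nothing; the idle work simply runs on a later tick once the device has gone quiet.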
Signed-off-by: Jaemyung Lee <jaemyung.lee@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
Jaemyung Lee [Thu, 14 May 2026 08:10:45 +0000 (17:10 +0900)]
hw/ufs: Modify flag handling operation
Change the internal flag handling operation to match attribute handling.
In a UFS device, some flag queries directly trigger specific device
behaviour, as attribute queries do, rather than only changing internal
values. So restructure the flag query processing functions to match
attribute processing, to facilitate linking detailed implementations to
individual flag value changes.
Signed-off-by: Jaemyung Lee <jaemyung.lee@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
Jaemyung Lee [Thu, 14 May 2026 08:10:44 +0000 (17:10 +0900)]
hw/ufs: Apply UFS 4.1 Specification
Apply the current UFS 4.1 Specification to QEMU-UFS.
QEMU-UFS emulates device operation according to the UFS 4.0 Specification,
but the latest Spec. version is now UFS 4.1. So extend the internal
DESCRIPTOR/FLAG/ATTRIBUTE declarations to follow the UFS 4.1 Spec.
This does not implement any actual functionality, but only adds
minimal support for further implementation.
Signed-off-by: Jaemyung Lee <jaemyung.lee@samsung.com> Signed-off-by: Jeuk Kim <jeuk20.kim@samsung.com>
Aadeshveer Singh [Wed, 13 May 2026 06:35:13 +0000 (12:05 +0530)]
migration: Replace current_migration with migrate_get_current()
Replace direct accesses to the global variable `current_migration`
with `migrate_get_current()` to ensure consistency across systems.
Note: after this change, the only direct accesses to `current_migration`
will be in:
* `migrate_get_current()` itself
* `migration_object_init()`, which initializes `current_migration`
* `migration_shutdown()`, to pair up with initialization
* `migration_is_running()`, as there might be a case where this function
is called by a thread before object initialization
Fabiano Rosas [Tue, 12 May 2026 14:13:38 +0000 (11:13 -0300)]
tests/qtest/migration: Fix auto-converge test
We fixed the cpu throttling sync thread affecting the
dirty-sync-count, but the test still relies on it to gauge
progress. Remove that block from the test with no replacement.
While here, remove several incorrect or redundant comments.
Peter Xu [Mon, 11 May 2026 18:24:32 +0000 (14:24 -0400)]
migration: Fix possible division by zero on calc expected downtime
Commit dd4fe8844b changed how the expected downtime is reported, so
that the value is calculated on demand. One side effect of the change
is that QEMU allows the calculation to happen at any time, even if no
transfer has happened for a short while.
PeterM reported a UBSan error from clang when running migration-test with an
aarch64 binary on x86_64 hosts. I can also reproduce it if I run the test
concurrently, so that some of the src QEMUs may not get a chance to push any
data, causing mbps to be 0:
../migration/migration.c:1051:12: runtime error: -nan is outside the range of representable values of type 'long'
Fix it by properly handling both Inf and NaN, returning INT64_MAX.
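A minimal sketch of the guard, assuming a hypothetical helper that divides the remaining bytes by the bandwidth (this is not the actual code in migration/migration.c). A zero bandwidth turns the division into +Inf or NaN, and converting either to an integer type is undefined behaviour, which is what UBSan flagged; both are mapped to INT64_MAX before the cast:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/*
 * Hypothetical helper, not QEMU's actual code: estimate the downtime in ms
 * from the bytes still to send and the bandwidth in bytes per ms.
 */
static int64_t expected_downtime_ms(uint64_t remaining_bytes, double bandwidth)
{
    /* With bandwidth == 0 this is +Inf (remaining > 0) or NaN (0 / 0). */
    double downtime = (double)remaining_bytes / bandwidth;

    /* Converting Inf/NaN to an integer is undefined behaviour: clamp. */
    if (isinf(downtime) || isnan(downtime)) {
        return INT64_MAX;
    }
    /* A finite value too large for int64_t would be undefined as well. */
    if (downtime >= (double)INT64_MAX) {
        return INT64_MAX;
    }
    return (int64_t)downtime;
}
```

For example, 1000 bytes at 10 bytes/ms gives 100, while any remaining amount (including 0) at zero bandwidth yields INT64_MAX instead of an out-of-range conversion.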
migration: Remove VMS_MULTIPLY_ELEMENTS and VMSTATE_VARRAY_MULTIPLY()
Commit c1eb3ac3468 ("target/sparc: Replace
VMSTATE_VARRAY_MULTIPLY -> VMSTATE_UINTTL_ARRAY") removed the
last use of the VMSTATE_VARRAY_MULTIPLY() macro. We can now
remove it as unnecessary, along with the VMS_MULTIPLY_ELEMENTS
flag and the associated tests.
Peter Xu [Tue, 21 Apr 2026 17:58:20 +0000 (13:58 -0400)]
migration: Fix crash on second migration when cancel early
Marc-André reported a QEMU crash when retrying a cancelled migration
during the early setup phase; see "Link:" for more information,
including an easy way to reproduce.
This patch replaces the previously proposed fix: it not only switches
to migration_cleanup(), but also fixes the CPR side, so that we track
hup_source properly to know if the src QEMU is waiting for the HUP signal.
To put it simply: this chunk of special casing in migration_cancel() should
not affect normal migration, only cpr-transfer migration, to cover the
small window when the src QEMU is waiting for a HUP signal on the cpr
channel (so that the src QEMU can continue the migration on the main channel).
To achieve that, we also need to remember to detach the hup_source
whenever it is invoked: after that point, we should always be able to
clean up the migration.
Explicitly detaching a gsource from its context while in its own
dispatch() function is not a common operation. But it should be safe,
because gsource dispatch() only happens with a refcount boosted by the
dispatcher, so the gsource will not be freed until the callback
completes. It's also safe to return G_SOURCE_REMOVE after the gsource is
detached, as glib will simply ignore the G_SOURCE_REMOVE.
One can refer to g_main_dispatch() in the latest glib code (2.86.5) for that:
While at it, add a bunch of assertions to make sure nothing surprises us.
After this patch is applied, the 2nd migration will not crash QEMU; instead
it will stay in CANCELLING until the socket connection times out (which takes
~2min on my Fedora default kernel). During this time no 2nd migration
is allowed, and after the timeout the migration can be restarted.
This is because we don't yet have control over socket_connect_outgoing(),
or anything managed by a task executed in qio_task_run_in_thread().
Speeding up the cancellation is left for the future.
I also tested cpr-transfer by providing only the cpr channel and not the main
channel (with -incoming defer), kicking off migration on the source, then
cancelling it on the source directly without providing the main channel.
It keeps working. I wanted to add a unit test for that, but it would need
a refactor of the current cpr-transfer tests first; let's leave it for later.
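The dispatch-safety argument above can be modelled with a small refcounting sketch. All names here (Source, dispatch(), source_destroy()) are hypothetical and this is not GLib's g_main_dispatch(); it only illustrates why detaching a source from inside its own callback is safe when the dispatcher holds an extra reference, and why a later "remove" request on an already-detached source can simply be ignored:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Simplified, hypothetical model of a GLib-style source; not GLib code. */
typedef struct Source Source;
typedef bool (*SourceFunc)(Source *src);  /* true: keep, false: remove */

struct Source {
    int refcount;
    bool attached;
    SourceFunc callback;
};

static int freed_count;
static int last_refcount_in_cb = -1;

static void source_unref(Source *src)
{
    if (--src->refcount == 0) {
        freed_count++;
        free(src);
    }
}

/* Detaching drops the reference owned by the context. */
static void source_destroy(Source *src)
{
    if (src->attached) {
        src->attached = false;
        source_unref(src);
    }
}

/* Hold an extra reference across the callback so it cannot free src. */
static void dispatch(Source *src)
{
    src->refcount++;
    bool keep = src->callback(src);
    /* A "remove" result for an already-detached source is ignored. */
    if (!keep && src->attached) {
        source_destroy(src);
    }
    source_unref(src);
}

/* Sample callback that detaches its own source, then asks for removal. */
static bool detach_self_cb(Source *src)
{
    source_destroy(src);
    last_refcount_in_cb = src->refcount;  /* still valid: dispatch holds a ref */
    return false;                         /* analogous to G_SOURCE_REMOVE */
}

static Source *source_new(SourceFunc cb)
{
    Source *src = malloc(sizeof(*src));
    assert(src);
    src->refcount = 1;  /* reference owned by the attached context */
    src->attached = true;
    src->callback = cb;
    return src;
}
```

In the sample callback, the source stays valid after source_destroy() because dispatch() still holds the last reference; it is freed only when dispatch() drops that reference after the callback returns.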
* tag 'for-upstream' of https://repo.or.cz/qemu/kevin:
block: Improve readability in HMP 'info blockstats' output
block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
iotests/046: Test that discard/write_zeroes wait for dependencies
qcow2: Fix corruption on discard during write with COW
qemu-io: Add 'aio_discard' command
commit: Drain nodes across all of bdrv_commit()
block: Add more defaults to DEFAULT_BLOCK_CONF
block: Create DEFAULT_BLOCK_CONF macro
MAINTAINERS: Rename Replication -> COLO block replication
MAINTAINERS: Add myself as maintainer for replication
Remove the deprecated glusterfs block driver
ide-test: Test reset during TRIM
ide-test: Factor out wait_dma_completion()
ide: Clean up ide_trim_co_entry() to be idiomatic coroutine code
ide: Minimal fix for deadlock between TRIM and drain
block: Add flags parameter to blk_*_pdiscard()
block: Add blk_co_start/end_request() and BDRV_REQ_NO_QUEUE
blkdebug: Add 'delay-ns' option
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi [Tue, 19 May 2026 19:22:56 +0000 (15:22 -0400)]
Merge tag 'hppa-post-v11-patches-pull-request' of https://github.com/hdeller/qemu-hppa into staging
Two hppa cleanup patches
Two leftover cleanup patches which I did not want to merge shortly before the qemu-v11 release.
Nothing critical, and both suggested by Philippe Mathieu-Daudé.
# -----BEGIN PGP SIGNATURE-----
#
# iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCagx/NQAKCRD3ErUQojoP
# XxeFAQDBvHtWAnZTjp9YAsqGGJbiFNQkRGglXcsz8bKAIBfCjwD/VMG3MLh4zLX2
# 7ShvU9L7eNnqtZJY0dVEA86xQcey+gc=
# =Ye2s
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 19 May 2026 11:18:13 EDT
# gpg: using EDDSA key BCE9123E1AD29F07C049BBDEF712B510A23A0F5F
# gpg: Good signature from "Helge Deller <deller@gmx.de>" [unknown]
# gpg: aka "Helge Deller <deller@kernel.org>" [unknown]
# gpg: aka "Helge Deller <deller@debian.org>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 4544 8228 2CD9 10DB EF3D 25F8 3E5F 3D04 A7A2 4603
# Subkey fingerprint: BCE9 123E 1AD2 9F07 C049 BBDE F712 B510 A23A 0F5F
* tag 'hppa-post-v11-patches-pull-request' of https://github.com/hdeller/qemu-hppa:
hw/hppa: Move static variable lasi_dev into MachineState
hw/pci-host/astro: Encode Astro version numbers
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Stefan Hajnoczi [Tue, 19 May 2026 19:22:46 +0000 (15:22 -0400)]
Merge tag 'linux-user-next-pull-request' of https://github.com/hdeller/qemu-hppa into staging
linux-user patches
pthread_create() failure path cleanups, sh4 libunwind/sigtramp fixes and
an (emulated) dynamic linker fix for AT_EXECFN.
# -----BEGIN PGP SIGNATURE-----
#
# iHUEABYKAB0WIQS86RI+GtKfB8BJu973ErUQojoPXwUCagxuHwAKCRD3ErUQojoP
# X0uTAP40HtUVEGzGewSruS6cdnrivkn/8TWOTvXp2izE2HwYoAEA2S0XWZ8ehE5j
# jWzzyJHFBKgGeCeuubAnhZ8qnv698w0=
# =HO5b
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 19 May 2026 10:05:19 EDT
# gpg: using EDDSA key BCE9123E1AD29F07C049BBDEF712B510A23A0F5F
# gpg: Good signature from "Helge Deller <deller@gmx.de>" [unknown]
# gpg: aka "Helge Deller <deller@kernel.org>" [unknown]
# gpg: aka "Helge Deller <deller@debian.org>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 4544 8228 2CD9 10DB EF3D 25F8 3E5F 3D04 A7A2 4603
# Subkey fingerprint: BCE9 123E 1AD2 9F07 C049 BBDE F712 B510 A23A 0F5F
* tag 'linux-user-next-pull-request' of https://github.com/hdeller/qemu-hppa:
linux-user: Fix a memory leak when pthread_create fails
linux-user/sh4: Fix setup_sigtramp to match Linux kernel trampoline pattern
linux-user/sh4: Fix target_ucontext tuc_link field type
linux-user: Fix AT_EXECFN in AUXV for symlinked programs
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Kevin Wolf [Tue, 12 May 2026 11:27:59 +0000 (13:27 +0200)]
block: Improve readability in HMP 'info blockstats' output
Instead of a long line with key=value pairs for each block device,
switch to a tabular form with aligned values. This makes it much easier
to find the relevant information in the output.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260512112759.66038-1-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
block/graph-lock: fix missed wakeup in bdrv_graph_co_rdunlock()
tests/qemu-iotests/tests/iothreads-create reproduces the hang on
master under `stress-ng --cpu $(nproc) --timeout 0`. The iotest's
vm.run_job() times out and qemu stays permanently stuck in
ppoll(timeout=-1) inside bdrv_graph_wrlock_drained -> blk_remove_bs
during qemu_cleanup(). The timing window is narrow on modern
bare-metal hardware and much wider in a VM guest; downstream trees
that still use plain bdrv_graph_wrlock() in blk_remove_bs() hit it
on the first iteration under the same stress.
bdrv_graph_wrlock() zeroes has_writer around its AIO_WAIT_WHILE loop
so that callbacks dispatched by aio_poll() can still take the read
lock on the fast path. The rdunlock side, however, only kicks a
waiting writer when has_writer is observed set; a reader that drops
its lock inside the polling window silently returns and nothing ever
wakes the writer:
reader_count is now 0 and num_waiters is still 1, but no BH, fd or
timer on the main AioContext will fire -- the only entity that could
kick just decided it did not have to. Main stays in ppoll() holding
BQL, so RCU, VCPUs and any iothread path that needs BQL stall behind
it. The hang is final: no timeout, no forward progress and no recovery,
as there is no other source of wake-up inside qemu_cleanup().
bdrv_drain_all_begin() does not close the race on its own: it
quiesces in-flight I/O, but graph readers also include non-I/O
coroutines (block-job cleanup, virtio-scsi polling) that drain does
not evict. The bdrv_graph_wrlock_drained() wrapper narrows the
window but does not eliminate it; every plain bdrv_graph_wrlock()
site is exposed on the same basis.
Drop the has_writer check in bdrv_graph_co_rdunlock() and call
aio_wait_kick() unconditionally. The helper itself loads num_waiters
atomically and only schedules a dummy BH when a waiter exists, so the
change is a no-op on the no-writer path and closes the missed wakeup
on the writer path.
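A single-threaded model of the missed wakeup, assuming the relevant state can be reduced to three counters. The names mirror the description above, but the functions are hypothetical stand-ins, not the block/graph-lock.c code; wait_kick() plays the role of aio_wait_kick() and only "schedules" work when a waiter exists, which is why calling it unconditionally is cheap:

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical, simplified state; names mirror the description only. */
static atomic_int has_writer;
static atomic_int reader_count;
static atomic_uint num_waiters;
static int kicks_scheduled;  /* stands in for the dummy BH */

/* Like aio_wait_kick(): only does work if somebody is waiting. */
static void wait_kick(void)
{
    if (atomic_load(&num_waiters) > 0) {
        kicks_scheduled++;
    }
}

/* Old behaviour: kick only if a writer is visibly present. */
static void rdunlock_old(void)
{
    int old = atomic_fetch_sub(&reader_count, 1);
    assert(old > 0);
    if (atomic_load(&has_writer)) {
        wait_kick();
    }
}

/* Fixed behaviour: always kick; wait_kick() filters the no-waiter case. */
static void rdunlock_new(void)
{
    int old = atomic_fetch_sub(&reader_count, 1);
    assert(old > 0);
    wait_kick();
}

/* Put the state into the polling window described above. */
static void enter_polling_window(void)
{
    atomic_store(&reader_count, 1);
    atomic_store(&num_waiters, 1);  /* writer sits in AIO_WAIT_WHILE */
    atomic_store(&has_writer, 0);   /* zeroed around the polling loop */
    kicks_scheduled = 0;
}
```

In the polling window, rdunlock_old() leaves kicks_scheduled at 0 (the writer is never woken), rdunlock_new() schedules exactly one kick, and with no waiter rdunlock_new() schedules nothing.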
Signed-off-by: Denis V. Lunev <den@openvz.org> Cc: Kevin Wolf <kwolf@redhat.com> Cc: Hanna Reitz <hreitz@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Fiona Ebner <f.ebner@proxmox.com>
Message-ID: <20260424103917.248668-2-den@openvz.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Mon, 27 Apr 2026 17:05:20 +0000 (19:05 +0200)]
iotests/046: Test that discard/write_zeroes wait for dependencies
This is a regression test for the bug fixed in the previous commit where
discard and write_zeroes operations wouldn't consider their dependencies
in s->cluster_allocs. Without the fix, this results in a corrupted
image.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260427170520.101242-5-kwolf@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Tested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Mon, 27 Apr 2026 17:05:19 +0000 (19:05 +0200)]
qcow2: Fix corruption on discard during write with COW
Most code in qcow2 that accesses (and potentially modifies) L2 tables
does so while holding s->lock.
There is one exception, which is allocating writes. They hold the lock
initially while allocating clusters, but drop it for writing the guest
payload before taking the lock again for updating the L2 tables. This
allows concurrent requests that touch other parts of the image file to
continue in parallel and is an important performance optimisation.
However, this means that other requests that run while the lock is
dropped for writing guest data must synchronise with the list of
allocating requests in s->cluster_allocs and wait if they would overlap.
For writes, this is done in handle_dependencies(), but discard and
write-zeroes operations neglect to synchronise with s->cluster_allocs.
This means that discard can free a cluster whose L2 entry is still going to
be modified in qcow2_alloc_cluster_link_l2() by a previously started
write. In the case of a pre-allocated zero cluster that is in the
process of being overwritten, this means that discard can lead to a
situation where the cluster is still mapped (because the write will
restore the L2 entry just without the zero flag), but its refcount has
been decreased, resulting in a corrupted image.
Add the missing synchronisation to qcow2_cluster_discard() and
qcow2_subcluster_zeroize() to fix the problem.
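The missing synchronisation boils down to an overlap check against the in-flight allocations before freeing or zeroing clusters. A minimal sketch, assuming a plain singly linked list of guest ranges; InFlightAlloc and find_dependency() are hypothetical, and the real code tracks its own metadata in s->cluster_allocs and waits on a conflicting entry rather than just returning it:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical in-flight allocating write, as tracked in s->cluster_allocs. */
typedef struct InFlightAlloc {
    uint64_t offset;  /* guest offset of the allocation */
    uint64_t bytes;   /* length, assumed > 0 */
    struct InFlightAlloc *next;
} InFlightAlloc;

/* Do the half-open ranges [a, a + alen) and [b, b + blen) overlap? */
static int ranges_overlap(uint64_t a, uint64_t alen, uint64_t b, uint64_t blen)
{
    return a < b + blen && b < a + alen;
}

/*
 * Return the first in-flight allocation that a discard/zeroize of
 * [offset, offset + bytes) must wait for, or NULL if it may proceed.
 * A real implementation would wait for it (dropping the lock) and re-check.
 */
static const InFlightAlloc *find_dependency(const InFlightAlloc *list,
                                            uint64_t offset, uint64_t bytes)
{
    assert(bytes > 0);
    for (const InFlightAlloc *a = list; a; a = a->next) {
        if (ranges_overlap(offset, bytes, a->offset, a->bytes)) {
            return a;
        }
    }
    return NULL;
}
```

A discard of a range adjacent to, but not overlapping, an allocating write may proceed immediately; only overlapping ranges have to wait until the write has updated its L2 entries.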
Cc: qemu-stable@nongnu.org Reported-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260427170520.101242-4-kwolf@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Tested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Mon, 27 Apr 2026 17:05:18 +0000 (19:05 +0200)]
qemu-io: Add 'aio_discard' command
Testing interactions between multiple requests that include discard
requests requires that qemu-io can do the discard asynchronously, like it
already does for reads and writes. To this effect, add an 'aio_discard'
command.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260427170520.101242-3-kwolf@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Tested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Mon, 27 Apr 2026 17:05:17 +0000 (19:05 +0200)]
commit: Drain nodes across all of bdrv_commit()
The whole implementation of bdrv_commit() is only correct if no new
writes come in while it's running: It has only a single loop checking
the allocation status for each block and finally calls bdrv_make_empty()
without checking if that throws away any new changes.
We already have to drain while taking the graph write lock. Just extend
the drained section to all of bdrv_commit() to make sure that we don't
get any inconsistencies.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260427170520.101242-2-kwolf@redhat.com> Reviewed-by: Denis V. Lunev <den@openvz.org> Tested-by: Denis V. Lunev <den@openvz.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Fri, 10 Apr 2026 15:23:14 +0000 (17:23 +0200)]
block: Add more defaults to DEFAULT_BLOCK_CONF
discard_granularity was missing from this, which means that SCSI disks
created with -drive if=scsi would default to 0 (i.e. disabling discards)
instead of -1, which makes scsi-hd automatically pick a granularity and
is the default of the corresponding qdev property for -device scsi-hd.
Also set other fields whose default isn't an obvious 0. These are not
actual bug fixes because ON_OFF_AUTO_AUTO in fact happens to be 0, but
it's better not to rely on the order of enums.
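The point about not relying on zero defaults or enum order can be illustrated with a designated-initializer macro. This is a sketch with hypothetical stand-in types (BlockConfSketch, DEFAULT_BLOCK_CONF_SKETCH), not the real definitions from include/hw/block/block.h; the field is assumed unsigned, so -1 is stored as all ones:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; not the real definitions from QEMU headers. */
typedef enum {
    ON_OFF_AUTO_AUTO,  /* happens to be 0 here, but the macro names it */
    ON_OFF_AUTO_ON,
    ON_OFF_AUTO_OFF,
} OnOffAuto;

typedef struct {
    uint32_t logical_block_size;   /* 0 really does mean "unset" */
    uint32_t discard_granularity;  /* assumed unsigned: -1 means "auto" */
    OnOffAuto account_invalid;
    OnOffAuto account_failed;
} BlockConfSketch;

/* One place for defaults, shared by every code path that builds a conf. */
#define DEFAULT_BLOCK_CONF_SKETCH                 \
    {                                             \
        .discard_granularity = (uint32_t)-1,      \
        .account_invalid = ON_OFF_AUTO_AUTO,      \
        .account_failed = ON_OFF_AUTO_AUTO,       \
    }
```

A conf initialized from the macro gets the "auto" granularity, whereas a plain zero-initialized struct silently ends up with discard_granularity 0, the value that disabled discards for -drive if=scsi.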
Cc: qemu-stable@nongnu.org Fixes: 308963746169 ('scsi: Don't ignore most usb-storage properties') Reported-by: Lexi Winter <ivy@FreeBSD.org> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260410152314.86412-3-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Fri, 10 Apr 2026 15:23:13 +0000 (17:23 +0200)]
block: Create DEFAULT_BLOCK_CONF macro
The property default values from include/hw/block/block.h were
duplicated in scsi_bus_legacy_handle_cmdline(), allowing them to go out
of sync easily. There doesn't seem to be a good way to avoid the duplication,
but moving them next to each other in the header file should help to
avoid this problem in the future.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260410152314.86412-2-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
MAINTAINERS: Add myself as maintainer for replication
I recently took up maintainership of the orphaned COLO migration component.
Here I also take over maintainership of replication, which is another
important component of COLO.
Signed-off-by: Lukas Straub <lukasstraub2@web.de>
Message-ID: <20260425-replication_maintainer-v1-1-f6ab019ff0ca@web.de> Reviewed-by: Zhang Chen <zhangckid@gmail.com> Acked-by: Peter Xu <peterx@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Thomas Huth [Mon, 11 May 2026 06:30:13 +0000 (08:30 +0200)]
Remove the deprecated glusterfs block driver
Glusterfs has been marked as deprecated since QEMU v9.2, and as far
as I know, nobody spoke up 'til today that it should be kept.
The listed e-mail address integration@gluster.org in our MAINTAINERS
file seems to be bouncing nowadays, and looking at their website
https://www.gluster.org/ the most recent news items are from 2020 / 2021 ...
so it seems like there is really hardly any interest in Glusterfs
anymore. Thus it's time to remove the code from QEMU now.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org> Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20260511063013.39805-1-thuth@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Tue, 21 Apr 2026 16:11:30 +0000 (18:11 +0200)]
ide: Clean up ide_trim_co_entry() to be idiomatic coroutine code
The previous commit did a minimal conversion of the callback based state
machine for TRIM to a coroutine in order to fix a bug. Refactor it to
actually look like normal coroutine based code, which improves its
readability.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260421161132.99878-6-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Tue, 21 Apr 2026 16:11:29 +0000 (18:11 +0200)]
ide: Minimal fix for deadlock between TRIM and drain
The implementation of TRIM in IDE can chain multiple discard requests
and uses blk_inc/dec_in_flight() to make sure that the whole TRIM
operation has completed when the device needs to be quiescent (e.g. for
the drain when performing an IDE reset, it would be bad if an IDE
request like TRIM were still in flight).
The problem is that each discard request calls blk_wait_while_drained()
and when draining, it waits until the drained section ends. At the same
time, drain_begin can only return if the whole TRIM operation has
completed. This is a classic deadlock.
Use blk_co_start/end_request() and BDRV_REQ_NO_QUEUE to avoid the
problem. This requires moving the TRIM state machine to a coroutine.
This commit does the minimal conversion so that we do have a coroutine
that works for the fix, but it still looks much like a callback-based
implementation. This will be cleaned up in the next patch.
Cc: qemu-stable@nongnu.org Fixes: 7e5cdb345f77 ('ide: Increment BB in-flight counter for TRIM BH') Buglink: https://redhat.atlassian.net/browse/RHEL-121686 Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260421161132.99878-5-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Tue, 21 Apr 2026 16:11:27 +0000 (18:11 +0200)]
block: Add blk_co_start/end_request() and BDRV_REQ_NO_QUEUE
If a device uses blk_inc/dec_in_flight() in order to build macro
operations that involve multiple requests for the block layer and that
need to be completed as a unit before the BlockBackend can be considered
drained, it sets the stage for a deadlock: When a drain is requested,
the inner request at the BlockBackend level will be queued in
blk_wait_while_drained() and wait until the drained section ends, but at
the same time, drain_begin can only return if the whole macro operation
at the device level has completed.
Introduce a new interface to allow implementing the logic correctly:
Instead of queueing individual requests, blk_co_start_request() calls
blk_wait_while_drained() once at the beginning. The individual requests
must then set BDRV_REQ_NO_QUEUE to avoid being queued and running into
the deadlock; being wrapped in blk_co_start/end_request() makes sure
that drain_begin waits for them and they don't sneak in when the
BlockBackend is supposed to already be quiescent.
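A single-threaded model of the interface described above. All names are hypothetical stand-ins (BackendModel, REQ_NO_QUEUE, submit()), not the block-backend API; where the real blk_wait_while_drained() would block the coroutine, the model just records the request as queued and returns false:

```c
#include <assert.h>
#include <stdbool.h>

#define REQ_NO_QUEUE 0x1  /* hypothetical stand-in for BDRV_REQ_NO_QUEUE */

typedef struct {
    bool drained;   /* a drained section is active */
    int in_flight;  /* what drain_begin waits to reach zero */
    int queued;     /* requests parked by wait_while_drained() */
    int submitted;  /* requests that actually reached the block layer */
} BackendModel;

/* Model of blk_wait_while_drained(): false means "would block here". */
static bool wait_while_drained(BackendModel *b)
{
    if (b->drained) {
        b->queued++;
        return false;
    }
    return true;
}

/* Macro operation start: wait once, then count as in flight. */
static bool start_request(BackendModel *b)
{
    if (!wait_while_drained(b)) {
        return false;
    }
    b->in_flight++;
    return true;
}

static void end_request(BackendModel *b)
{
    assert(b->in_flight > 0);
    b->in_flight--;
}

/* An individual request; REQ_NO_QUEUE skips the drained check. */
static bool submit(BackendModel *b, int flags)
{
    if (!(flags & REQ_NO_QUEUE) && !wait_while_drained(b)) {
        return false;
    }
    b->submitted++;
    return true;
}
```

With a macro operation already started and a drain in progress, an inner request without REQ_NO_QUEUE gets parked while in_flight stays at 1 (the deadlock); with the flag it proceeds, end_request() brings in_flight to 0, and a new start_request() is correctly held back until the drained section ends.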
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260421161132.99878-3-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Kevin Wolf [Tue, 21 Apr 2026 16:11:26 +0000 (18:11 +0200)]
blkdebug: Add 'delay-ns' option
Sometimes reproducing a problem for debugging involves slow I/O, so
let's add something to blkdebug to make I/O slow when we need it. This
can be used either together with an error so that the request fails
after the delay, or with errno=0, which allows the request to succeed
after the delay.
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20260421161132.99878-2-kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>