Hanna Czenczek [Mon, 10 Nov 2025 15:48:53 +0000 (16:48 +0100)]
null-aio: Run CB in original AioContext
AIO callbacks must be called in the originally calling AioContext,
regardless of the BDS’s “main” AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-19-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:52 +0000 (16:48 +0100)]
iscsi: Create AIO BH in original AioContext
AIO callbacks must be called in the original request’s AioContext,
regardless of the BDS’s “main” AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-18-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:51 +0000 (16:48 +0100)]
block: Note in which AioContext AIO CBs are called
This doesn’t seem to be specified anywhere, but is something we probably
want to be clear about. I believe it is reasonable to implicitly assume that
callbacks are run in the current thread (unless explicitly noted
otherwise), so codify that assumption.
Some implementations don’t actually fulfill this contract yet. The next
patches should rectify that.
Note: I don’t know of any user-visible bugs produced by not running AIO
callbacks in the original context. AIO functionality is generally
mapped to coroutines through the use of bdrv_co_io_em_complete(), which
can run in any AioContext, and will always wake the yielding coroutine
in its original context. The only benefit here is that running
bdrv_co_io_em_complete() in the original context will make that
aio_co_wake() most likely a simpler qemu_coroutine_enter() instead of
scheduling the wakeup through AioContext.co_schedule_bh.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-17-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:50 +0000 (16:48 +0100)]
blkreplay: Run BH in coroutine’s AioContext
While it does not matter in which AioContext we run aio_co_wake() to
continue an exactly-once-yielding coroutine, making this commit not
strictly necessary, there is also no reason why the BH should run in any
context but the request’s AioContext.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-16-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:49 +0000 (16:48 +0100)]
ssh: Run restart_coroutine in current AioContext
restart_coroutine() is attached as an FD handler just to wake the
current coroutine after yielding. It makes most sense to attach it to
the current (request) AioContext instead of the BDS main context. This
way, the coroutine can be entered directly from the BH instead of having
yet another indirection through AioContext.co_schedule_bh.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-15-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:48 +0000 (16:48 +0100)]
qcow2: Schedule cache-clean-timer in realtime
There is no reason why the cache cleaning timer should run in virtual
time; run it in realtime instead.
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-14-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:47 +0000 (16:48 +0100)]
qcow2: Fix cache_clean_timer
The cache-cleaner runs as a timer CB in the BDS AioContext. With
multiqueue, it can run concurrently to I/O requests, and because it does
not take any lock, this can break concurrent cache accesses, corrupting
the image. While the chances of this happening are low, it can be
reproduced e.g. by modifying the code to schedule the timer CB every
5 ms (instead of at most once per second) and modifying the last (inner)
while loop of qcow2_cache_clean_unused() like so:
i.e. making it wait on purpose for the point in time where the cache is
in use by something else.
The solution chosen for this in this patch is not the best solution, I
hope, but I admittedly can’t come up with anything strictly better.
We can protect from concurrent cache accesses either by taking the
existing s->lock, or we introduce a new (non-coroutine) mutex
specifically for cache accesses. I would prefer to avoid the latter so
as not to introduce additional (very slight) overhead.
Using s->lock, which is a coroutine mutex, however means that we need to
take it in a coroutine, so the timer must run in a coroutine. We can
transform it from the current timer CB style into a coroutine that
sleeps for the set interval. As a result, however, we can no longer
just deschedule the timer to instantly guarantee it won’t run anymore,
but have to await the coroutine’s exit.
(Note even before this patch there were places that may not have been so
guaranteed after all: Anything calling cache_clean_timer_del() from the
QEMU main AioContext could have been running concurrently to an existing
timer CB invocation.)
Polling to await the timer to actually settle seems very complicated for
something that’s rather a minor problem, but I can’t come up with any
better solution that doesn’t again just overlook potential problems.
(Not Cc-ing qemu-stable, as the issue is quite unlikely to be hit, and
I’m not too fond of this solution.)
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-13-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:46 +0000 (16:48 +0100)]
qcow2: Re-initialize lock in invalidate_cache
After clearing our state (memset()-ing it to 0), we should
re-initialize objects that need it. Specifically, that applies to
s->lock, which is originally initialized in qcow2_open().
Given qemu_co_mutex_init() is just a memset() to 0, this is functionally
a no-op, but still seems like the right thing to do.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-12-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:45 +0000 (16:48 +0100)]
block/io: Take reqs_lock for tracked_requests
bdrv_co_get_self_request() does not take a lock around iterating through
bs->tracked_requests. With multiqueue, it may thus iterate over a list
that is in the process of being modified, producing an assertion
failure:
    [0] abort() at /lib64/libc.so.6
    [1] __assert_fail_base.cold() at /lib64/libc.so.6
    [2] raw_do_pwrite_zeroes() at ../block/file-posix.c:3702
    [3] bdrv_co_do_pwrite_zeroes() at ../block/io.c:1910
    [4] bdrv_aligned_pwritev() at ../block/io.c:2109
    [5] bdrv_co_do_zero_pwritev() at ../block/io.c:2192
    [6] bdrv_co_pwritev_part() at ../block/io.c:2292
    [7] bdrv_co_pwritev() at ../block/io.c:2225
    [8] handle_alloc_space() at ../block/qcow2.c:2573
    [9] qcow2_co_pwritev_task() at ../block/qcow2.c:2625
Fix this by taking reqs_lock.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-11-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
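The idea behind the reqs_lock fix above can be sketched in a few lines: hold the list lock for the whole walk, so that a concurrent insert or removal is never observed halfway through. This is only an illustration under stated assumptions. It uses a plain pthread mutex and a hand-rolled singly linked list, and `toy_bs`, `toy_req`, `toy_track()` and `toy_find_self_request()` are hypothetical stand-ins, not QEMU's own types or functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical tracked request; "owner" stands in for a coroutine. */
struct toy_req {
    int owner;
    struct toy_req *next;
};

/* Hypothetical block node with a lock protecting its request list. */
struct toy_bs {
    pthread_mutex_t reqs_lock;
    struct toy_req *tracked_requests;
};

/* Insert a request at the head of the list while holding the lock. */
void toy_track(struct toy_bs *bs, struct toy_req *req)
{
    assert(req != NULL);
    pthread_mutex_lock(&bs->reqs_lock);
    req->next = bs->tracked_requests;
    bs->tracked_requests = req;
    pthread_mutex_unlock(&bs->reqs_lock);
}

/*
 * Look up the request owned by "self". The lock is held for the whole
 * iteration, so writers using toy_track() cannot modify the list
 * while it is being walked.
 */
struct toy_req *toy_find_self_request(struct toy_bs *bs, int self)
{
    struct toy_req *req, *found = NULL;

    pthread_mutex_lock(&bs->reqs_lock);
    for (req = bs->tracked_requests; req != NULL; req = req->next) {
        if (req->owner == self) {
            found = req;
            break;
        }
    }
    pthread_mutex_unlock(&bs->reqs_lock);
    return found;
}
```

The reader takes the same lock as the writers; locking only the modifying side would still let the walk see a half-updated list.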
Hanna Czenczek [Mon, 10 Nov 2025 15:48:44 +0000 (16:48 +0100)]
nvme: Note in which AioContext some functions run
Sprinkle comments throughout block/nvme.c noting for some functions
(where it may not be obvious) that they require a certain AioContext, or
in which AioContext they do happen to run (for callbacks, BHs, event
notifiers).
Suggested-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-10-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:43 +0000 (16:48 +0100)]
nvme: Fix coroutine waking
nvme wakes the request coroutine via qemu_coroutine_enter() from a BH
scheduled in the BDS AioContext. This may not be the same context as
the one in which the request originally ran, which would be wrong:
- It could mean we enter the coroutine before it yields,
- We would move the coroutine into a different context.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (data.ret == -EINPROGRESS)` loop.)
To fix that, use aio_co_wake() to run the coroutine in its home context.
Just like in the preceding iscsi and nfs patches, we can drop the
trivial nvme_rw_cb_bh() and use aio_co_wake() directly.
With this, we can remove NVMeCoData.ctx.
Note the check of data->co == NULL to bypass the BH/yield combination in
case nvme_rw_cb() is called from nvme_submit_command(): We probably want
to keep this fast path for performance reasons, but we have to be quite
careful about it:
- We cannot overload .ret for this, but have to use a dedicated
.skip_yield field. Otherwise, if nvme_rw_cb() runs in a different
thread than the coroutine, the coroutine may see .ret set and skip the yield,
while nvme_rw_cb() will still schedule a BH for waking. Therefore,
the signal to skip the yield can only be set in nvme_rw_cb() if waking
too is skipped, which is independent from communicating the return
value.
- We can only skip the yield if nvme_rw_cb() actually runs in the
request coroutine. Otherwise (specifically if they run in different
AioContexts), the order between this function’s execution and the
coroutine yielding (or not yielding) is not reliable.
- There is no point in yielding in a loop; there are no spurious wakes,
so once we yield, we will only be re-entered once the command is done.
Replace `while` by `if`.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-9-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
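The fast-path rule from the nvme commit above can be sketched as a toy model: the yield is skipped only when the callback runs inside the request coroutine, and the wake is skipped in exactly that case. This is an illustration, not the driver's code. Coroutines are represented by plain integer IDs, counters stand in for real yields and wakes, and `toy_co_data`, `toy_rw_cb()` and `toy_run_request()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for per-request data; not the real NVMeCoData. */
struct toy_co_data {
    int co;          /* ID of the request coroutine (hypothetical) */
    int ret;         /* return value, communicated on its own */
    bool skip_yield; /* dedicated "do not yield" signal */
    int wakes;       /* wakes issued by the callback */
    int yields;      /* yields performed by the request side */
};

/* Completion callback; current_co identifies whoever is calling. */
static void toy_rw_cb(struct toy_co_data *d, int ret, int current_co)
{
    d->ret = ret;
    if (d->co == current_co) {
        /* Inside the request coroutine: skip both yield and wake. */
        d->skip_yield = true;
    } else {
        /* Independent caller: the coroutine will yield, so wake it. */
        d->wakes++;
    }
}

/* Request side: yield once (an "if", not a loop) unless told to skip. */
static void toy_wait(struct toy_co_data *d)
{
    if (!d->skip_yield) {
        d->yields++;
    }
}

/* Run one toy request and return the resulting bookkeeping. */
struct toy_co_data toy_run_request(int request_co, int callback_co)
{
    struct toy_co_data d = { .co = request_co, .ret = -1 };

    toy_rw_cb(&d, 0, callback_co);
    assert(d.ret == 0); /* the return value never doubles as a signal */
    toy_wait(&d);
    return d;
}
```

Because `skip_yield` is set only on the branch that issues no wake, yields and wakes stay paired in both cases, which is the property the commit message argues for.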
Hanna Czenczek [Mon, 10 Nov 2025 15:48:42 +0000 (16:48 +0100)]
nvme: Kick and check completions in BDS context
nvme_process_completion() must run in the main BDS context, so schedule
a BH for requests that aren’t there.
The context in which we kick does not matter, but let’s just keep kick
and process_completion together for simplicity’s sake.
(For what it’s worth, a quick fio bandwidth test indicates that on my
test hardware, if anything, this may be a bit better than kicking
immediately before scheduling a pure nvme_process_completion() BH. But
I wouldn’t take more from those results than that it doesn’t really seem
to matter either way.)
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-8-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:41 +0000 (16:48 +0100)]
gluster: Do not move coroutine into BDS context
The request coroutine may not run in the BDS AioContext. We should wake
it in its own context, not move it.
With that, we can remove GlusterAIOCB.aio_context.
Also add a comment explaining why aio_co_schedule() is safe to use in this way.
**Note:** Due to a lack of a gluster set-up, I have not tested this
commit. It seemed safe enough to send anyway, just maybe not to
qemu-stable. To be clear, I don’t know of any user-visible bugs that
would arise from the state without this patch; the request coroutine is
moved into the main BDS AioContext, so guest device completion code will
run in a different context than where the request started, which can’t
be good, but I haven’t actually confirmed any bugs (due to not being
able to test it).
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-7-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:40 +0000 (16:48 +0100)]
curl: Fix coroutine waking
If we wake a coroutine from a different context, we must ensure that it
will yield exactly once (now or later), awaiting that wake.
curl’s current .ret == -EINPROGRESS loop may lead to the coroutine not
yielding if the request finishes before the loop gets run. To fix it,
we must drop the loop and yield exactly once, if we need to yield.
Finding out that latter part ("if we need to yield") makes it a bit
complicated: Requests may be served from a cache internal to the curl
block driver, or fail before being submitted. In these cases, we must
not yield. However, if we find a matching but still ongoing request in
the cache, we will have to await that, i.e. still yield.
To address this, move the yield inside of the respective functions:
- Inside of curl_find_buf() when awaiting ongoing concurrent requests,
- Inside of curl_setup_preadv() when having created a new request.
Rename curl_setup_preadv() to curl_do_preadv() to reflect this.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (acb.ret == -EINPROGRESS)` loop.)
Also, add a comment explaining why aio_co_wake() is safe regardless of whether the
coroutine and curl_multi_check_completion() run in the same context.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-6-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:39 +0000 (16:48 +0100)]
nfs: Run co BH CB in the coroutine’s AioContext
Like in “rbd: Run co BH CB in the coroutine’s AioContext”, drop the
completion flag, yield exactly once, and run the BH in the coroutine’s
AioContext.
(Can be reproduced with multiqueue by adding a usleep(100000) before the
`while (!task.complete)` loops.)
Like in “iscsi: Run co BH CB in the coroutine’s AioContext”, this makes
nfs_co_generic_bh_cb() trivial, so we can drop it in favor of just
calling aio_co_wake() directly.
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-5-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:38 +0000 (16:48 +0100)]
iscsi: Run co BH CB in the coroutine’s AioContext
For rbd (and others), as described in “rbd: Run co BH CB in the
coroutine’s AioContext”, the pattern of setting a completion flag and
waking a coroutine that yields while the flag is not set can only work
when both run in the same thread.
iscsi has the same pattern, but the details are a bit different:
iscsi_co_generic_cb() can (as far as I understand) only run through
iscsi_service(), not just from a random thread at a random time.
iscsi_service() in turn can only be run after iscsi_set_events() set up
an FD event handler, which is done in iscsi_co_wait_for_task().
As a result, iscsi_co_wait_for_task() will always yield exactly once,
because iscsi_co_generic_cb() can only run after iscsi_set_events(),
after the completion flag has already been checked, and the yielding
coroutine will then be woken only once the completion flag was set to
true. So as far as I can tell, iscsi has no bug and already works fine.
Still, we don’t need the completion flag because we know we have to
yield exactly once, so we can drop it. This simplifies the code and
makes it more obvious that the “rbd bug” isn’t present here.
This makes iscsi_co_generic_bh_cb() and iscsi_retry_timer_expired() a
bit boring, so at least the former we can drop and call aio_co_wake()
directly from iscsi_co_generic_cb() to the same effect. As for the
latter, the timer needs a CB, so we can’t drop it (I suppose we could
technically use aio_co_wake() directly as the CB, but that would be
nasty), but we can put it into the coroutine’s AioContext to make its
aio_co_wake() a simple wrapper around qemu_coroutine_enter() without a
further BH indirection.
Finally, remove the iTask->co != NULL checks: This field is set by
iscsi_co_init_iscsitask(), which all users of IscsiTask run before even
setting up iscsi_co_generic_cb() as the callback, and it is never set or
cleared elsewhere, so it is impossible to not be set in
iscsi_co_generic_cb().
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-4-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Hanna Czenczek [Mon, 10 Nov 2025 15:48:37 +0000 (16:48 +0100)]
rbd: Run co BH CB in the coroutine’s AioContext
qemu_rbd_completion_cb() schedules the request completion code
(qemu_rbd_finish_bh()) to run in the BDS’s AioContext, assuming that
this is the same thread in which qemu_rbd_start_co() runs.
To explain, this is how the latter two functions interact:
In qemu_rbd_start_co():
    while (!task.complete)
        qemu_coroutine_yield();
In qemu_rbd_finish_bh():
    task->complete = true;
    aio_co_wake(task->co); // task->co is qemu_rbd_start_co()
For this interaction to work reliably, both must run in the same thread
so that qemu_rbd_finish_bh() can only run once the coroutine yields.
Otherwise, finish_bh() may run before start_co() checks task.complete,
which will result in the latter seeing .complete as true immediately and
skipping the yield altogether, even though finish_bh() still wakes it.
With multiqueue, the BDS’s AioContext is not necessarily the thread
start_co() runs in, and so finish_bh() may be scheduled to run in a
different thread than start_co(). With the right timing, this will
cause the problems described above; waking a non-yielding coroutine is
not good, as can be reproduced by putting e.g. a usleep(100000) above
the while loop in start_co() (and using multiqueue), giving finish_bh()
a much better chance at exiting before start_co() can yield.
So instead of scheduling finish_bh() in the BDS’s AioContext, schedule
finish_bh() in task->co’s AioContext.
In addition, we can get rid of task.complete altogether because we will
get woken exactly once, when the task is indeed complete, so there is no
need to check.
(We could go further and drop the BH, running aio_co_wake() directly in
qemu_rbd_completion_cb() because we are allowed to do that even if the
coroutine isn’t yet yielding and we’re in a different thread – but the
doc comment on qemu_rbd_completion_cb() says to be careful, so I decided
not to go so far here.)
Buglink: https://issues.redhat.com/browse/RHEL-67115
Reported-by: Junyao Zhao <junzhao@redhat.com>
Cc: qemu-stable@nongnu.org
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-3-hreitz@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
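The ordering problem described in the rbd commit above can be illustrated with a small, single-threaded toy model. It is only a sketch: the `toy_*` names are hypothetical and not QEMU APIs, and plain counters stand in for real coroutine yields and wakes. The model assumes, in line with the aio_co_wake() note in this series, that a wake issued before the coroutine yields is deferred until it does yield, so every wake has to be matched by exactly one yield.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one request; none of these names are QEMU APIs. */
struct toy_task {
    bool complete; /* completion flag used by the old pattern */
    int yields;    /* times the request coroutine yielded */
    int wakes;     /* times the completion side issued a wake */
};

/* Completion side: mark the task done and wake the coroutine. */
static void toy_finish(struct toy_task *t)
{
    t->complete = true;
    t->wakes++;
}

/* Old pattern: yield only if the flag is not set yet. */
static void toy_wait_with_flag(struct toy_task *t)
{
    if (!t->complete) {
        t->yields++;
    }
}

/* New pattern: ignore the flag and yield exactly once. */
static void toy_wait_once(struct toy_task *t)
{
    t->yields++;
}

/*
 * Run one request with the given pattern and ordering, and report
 * whether the single wake was matched by exactly one yield.
 */
bool toy_wake_matches_yield(bool use_flag, bool finish_first)
{
    struct toy_task t = { .complete = false, .yields = 0, .wakes = 0 };

    if (finish_first) {
        toy_finish(&t); /* completion ran before the flag was checked */
    }
    if (use_flag) {
        toy_wait_with_flag(&t);
    } else {
        toy_wait_once(&t);
    }
    if (!finish_first) {
        toy_finish(&t); /* completion ran after the coroutine yielded */
    }
    assert(t.wakes == 1); /* the completion side always wakes once */
    return t.yields == t.wakes;
}
```

With the flag pattern and the completion running first, the model ends up with a wake but no yield, which is the situation the commit message warns about. Yielding exactly once keeps the two paired in either ordering.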
Hanna Czenczek [Mon, 10 Nov 2025 15:48:36 +0000 (16:48 +0100)]
block: Note on aio_co_wake use if not yet yielding
aio_co_wake() is generally safe to call regardless of whether the
coroutine is already yielding or not. If it is not yet yielding, it
will be scheduled to run when it does yield.
Caveats:
- The caller must be independent of the coroutine (to ensure the
coroutine must be yielding if both are in the same AioContext), i.e.
must not be the same coroutine
- The coroutine must yield at some point
Make note of this so callers can reason that their use is safe.
Signed-off-by: Hanna Czenczek <hreitz@redhat.com>
Message-ID: <20251110154854.151484-2-hreitz@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Merge tag 'pull-10.2-maintainer-171125-2' of https://gitlab.com/stsquad/qemu into staging
testing updates for 10.2
- fix emsdk image for podman
- update lcitool and clean-up ENV stanzas
- include coreutils for io tests
- move a number of assets due to linaro changes
- add ppc64le custom runner
- rationalise the gitlab custom runners with templates
- clean-up the custom runner rules
- add a scheduled container build
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEEZoWumedRZ7yvyN81+9DbCVqeKkQFAmkbRI0ACgkQ+9DbCVqe
# KkShRgf+Ma6E/m4ovXO/zrOqLx01XdXExbWPdCm+EqNc7OLvKKODFqFPaRtJvDRs
# s6JAiKWONJfXAHRmXGSlq2gHXMIyUlQds5K96tdyyXywKMOiOSTruOLJcOViWSP0
# i4o7AfxcsqKhIsy2/YaaMDHPcS4IR6AvoJCzgZVsEbSupbMYmLFsiOQa7uaauBtm
# BI2P07EN+q3DWFXnmKsYFtdqI0Kvazv5tMqR5y97TRX84yUAWJ7eVWwd2M7oFfRL
# eWmziUTzKGuwEkzGIxM4m3YD1iEmTKGp0B2se+wTFb0aIqWC5af+HdJvbUznasI/
# IAXZcFZbjSbn7yPLxV9x5CfJVdIYDg==
# =AM+R
# -----END PGP SIGNATURE-----
# gpg: Signature made Mon 17 Nov 2025 04:51:41 PM CET
# gpg: using RSA key 6685AE99E75167BCAFC8DF35FBD0DB095A9E2A44
# gpg: Good signature from "Alex Bennée (Master Work Key) <alex.bennee@linaro.org>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 6685 AE99 E751 67BC AFC8 DF35 FBD0 DB09 5A9E 2A44
* tag 'pull-10.2-maintainer-171125-2' of https://gitlab.com/stsquad/qemu:
gitlab: add a weekly container building job
gitlab: make the schedule rules a bit more general
gitlab: make custom runners need QEMU_CI to run
gitlab: suppress custom runners being triggered by schedule
gitlab: simplify the ubuntu-24.04-aarch64 rules
gitlab: use template for ubuntu-24.04-s390x jobs
gitlab: add initial ppc64le custom-runner test
tests: move test_virt_gpu to share.linaro.org
tests: move test_kvm to share.linaro.org
tests: move test_kvm_xen to share.linaro.org
tests: move test_netdev_ethtool to share.linaro.org
tests: move test_virt assets to share.linaro.org
tests: move test_xen assets to share.linaro.org
docs/about/emulation: update assets for uftrace plugin documentation
tests/docker: add coreutils to the package list
tests/lcitool: update ENV stanzas outputted by refresh
libvirt-ci: bump libvirt-ci to latest version
tests/docker: drop --link from COPYs in emsdk docker
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:23 +0000 (11:55 +0000)]
gitlab: add a weekly container building job
This will hopefully catch containers that break because of upstream
changes as well as keep the container cache fresh.
As we have all the container jobs as dependants we tweak the
container template to allow scheduled runs. Because we added a new
rules stanza we also need to make sure we catch the normal runs as
well.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-19-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:22 +0000 (11:55 +0000)]
gitlab: make the schedule rules a bit more general
By default no jobs should run under the schedule and then we can be
more explicit for the ones that we need to. Otherwise I trigger all my
custom runners every time I do a scheduled run.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-18-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:21 +0000 (11:55 +0000)]
gitlab: make custom runners need QEMU_CI to run
In addition to not being triggered by schedule we should follow the
same rules about QEMU_CI. One day we may figure out how to fold the
custom runner rules into the .base_job_template but today is not that
day.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-17-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:20 +0000 (11:55 +0000)]
gitlab: suppress custom runners being triggered by schedule
Otherwise the mere presence of the RUNNER env vars is enough to
trigger the jobs.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-16-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:19 +0000 (11:55 +0000)]
gitlab: simplify the ubuntu-24.04-aarch64 rules
We don't need to duplicate the if rules to get the allow_failure and
manual behaviour we want. Clean that up to keep all the rules in the
same place.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-15-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:18 +0000 (11:55 +0000)]
gitlab: use template for ubuntu-24.04-s390x jobs
Most of the test is pure boilerplate so to save ourselves from
repetition move all the main bits into a minimal copy of
native_build_job_template but without the caching.
We keep all the current allow_fail, manual and configure setups but do
take the opportunity to replace the inline nproc calls with a
common JOBS variable. We also fix the namespace check to use the
QEMU_CI_UPSTREAM variable.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-14-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:17 +0000 (11:55 +0000)]
gitlab: add initial ppc64le custom-runner test
This is a plain configure build but I only run a subset of the tests
until the kinks have been worked out.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-13-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:12 +0000 (11:55 +0000)]
tests: move test_virt assets to share.linaro.org
Linaro are migrating file-hosting from the old NextCloud instance to
another sharing site. While I'm at it drop the old pauth-impdef flag
which is no longer needed.
Reviewed-by: Thomas Huth <thuth@redhat.com>
Cc: qemu-stable@nongnu.org
Message-ID: <20251117115523.3993105-8-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:09 +0000 (11:55 +0000)]
tests/docker: add coreutils to the package list
We need coreutils to run the IO tests so we need to include it in the
package list. Now we have the latest libvirt we can do that.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-5-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:08 +0000 (11:55 +0000)]
tests/lcitool: update ENV stanzas outputted by refresh
Now lcitool has been updated to use the non-legacy ENVs we should do
the same for what refresh adds.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-4-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Alex Bennée [Mon, 17 Nov 2025 11:55:07 +0000 (11:55 +0000)]
libvirt-ci: bump libvirt-ci to latest version
We will need the latest version to add coreutils in the next commit.
As libvirt has updated the handling of ENV variables this brings a
little bit of churn to the docker images.
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251117115523.3993105-3-alex.bennee@linaro.org>
Signed-off-by: Alex Bennée <alex.bennee@linaro.org>
Merge tag 'pull-target-arm-20251114' of https://gitlab.com/pm215/qemu into staging
target-arm queue:
* MAINTAINERS file update for whpx
* target/arm: Fix accidental write to TCG constant
* target/arm/cpu64: remove duplicate include
* hw/display/xlnx_dp: don't abort() on guest errors
* cxl, vfio, tests: clean up includes
* hw/misc/npcm_clk: Don't divide by zero when calculating frequency
* hw/audio/lm4549: Don't try to open a zero-frequency audio voice
# -----BEGIN PGP SIGNATURE-----
#
# iQJNBAABCAA3FiEE4aXFk81BneKOgxXPPCUl7RQ2DN4FAmkXSF0ZHHBldGVyLm1h
# eWRlbGxAbGluYXJvLm9yZwAKCRA8JSXtFDYM3iLKEACahSPxoRe4+TOgr3F7mJvq
# CDFOOUQSXbBC4WTviyJAh1+MYFhtWrOxUB1EzLb9iw1+sbBcT6/K1CBEFiQ65dpn
# kjtIaJDidz4x52vNc1nz1B9jzRdme4xQ0kg5NeY9PqCGO4nC0iWqzzbBoA1XYHsR
# RXfXr9JNXKqN3cm+x/ZX/o++rz3eG8ba0DxJUIO+OR9rAv3n0No+oTOeAJ4SbDu4
# lcP+MHFA/V//Q4O9QSeZv1tD+brXerpNcMQlsRrffkmT8bvJMPozyvcijtEZQz3+
# 9s8GUeL0b7/GgpdIqWyEAl2sreMtqmWh1GGpCZziFTiEmNWWI9M6fHINyZ2NVnPD
# T5UFOA9JbSG1ybxQHHf4Vj5tUjwWAAnVwRP1wXAb3p35fBYl0Y3JFDX+0HpL9tM/
# vB1BHA+PGRV51vDy7VoUpbbZkpa1/WJCqTm9s1BxzZ2BFu0tpQ2Rqg/V+y004NQY
# Xx1t7ilm18LyQrZpHYqmw3OJ/EVPtATBN2jomK2Z8ZWExLsDQ/Qd8k3cHg6OcN4N
# /ORpbqy29dOL5mQTEuBW8L0tLEN9tBqfadlqvlsbI9S0eDlZdyvPT9utV0aSCfe2
# km/rSjD2IJEmtJA1kcYgq3ipNsPu5eGFfw2OqGe+vowLaU42ki3uteaOqLgN81AX
# sB5cO49w7AtAmaocraAzPA==
# =+I+o
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 14 Nov 2025 04:18:53 PM CET
# gpg: using RSA key E1A5C593CD419DE28E8315CF3C2525ED14360CDE
# gpg: issuer "peter.maydell@linaro.org"
# gpg: Good signature from "Peter Maydell <peter.maydell@linaro.org>" [unknown]
# gpg: aka "Peter Maydell <pmaydell@gmail.com>" [unknown]
# gpg: aka "Peter Maydell <pmaydell@chiark.greenend.org.uk>" [unknown]
# gpg: aka "Peter Maydell <peter@archaic.org.uk>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: E1A5 C593 CD41 9DE2 8E83 15CF 3C25 25ED 1436 0CDE
* tag 'pull-target-arm-20251114' of https://gitlab.com/pm215/qemu:
hw/audio/lm4549: Don't try to open a zero-frequency audio voice
hw/misc/npcm_clk: Don't divide by zero when calculating frequency
tests: Clean up includes
vfio: Clean up includes
cxl: Clean up includes
hw/display/xlnx_dp: Don't abort for unsupported graphics formats
hw/display/xlnx_dp.c: Don't abort on AUX FIFO overrun/underrun
target/arm/cpu64: remove duplicate include
target/arm: Fix accidental write to TCG constant
MAINTAINERS: update maintainers for WHPX
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Merge tag 'net-pull-request' of https://github.com/jasowang/qemu into staging
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCAAdFiEEIV1G9IJGaJ7HfzVi7wSWWzmNYhEFAmkWo9EACgkQ7wSWWzmN
# YhHargf/Uf801PmKskryVENF9sVe6u5NxJZlT3BUJVsSTGitucBIHWZ5J7MMR1lw
# If4tfMho3BX5Wrtl5GuCEzolk9pCz3wmSN6nyOU25C5tKaoJ/uR135K25D0CwVmD
# eTOyg+gKktVfogXxJ/zwZpRHMq4XXrk/C2ZP41r/CdcLyaeuDS9GIbd/q4N7f3vv
# bEsVqECzjEwWr2JBY9SD0xlIRp3nWwEvRsgRZPzBiQzfjSTlImqGLUsxIpF5V2LV
# 1BU0V/FShWyrwckBXSqCWBUh6uBUGgEl6qKnK4vH7+ed4Kd9giyp1vWAFEjHgIg+
# gZtPaT/MJQOtLyCuzfuSdUpAzz5Sfw==
# =Is8a
# -----END PGP SIGNATURE-----
# gpg: Signature made Fri 14 Nov 2025 04:36:49 AM CET
# gpg: using RSA key 215D46F48246689EC77F3562EF04965B398D6211
# gpg: Good signature from "Jason Wang (Jason Wang on RedHat) <jasowang@redhat.com>" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 215D 46F4 8246 689E C77F 3562 EF04 965B 398D 6211
* tag 'net-pull-request' of https://github.com/jasowang/qemu:
net: pad packets to minimum length in qemu_receive_packet()
hw/net/e1000e_core: Adjust e1000e_write_payload_frag_to_rx_buffers() assert
hw/net/e1000e_core: Correct rx oversize packet checks
hw/net/e1000e_core: Don't advance desc_offset for NULL buffer RX descriptors
net/hub: make net_hub_port_cleanup idempotent
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Merge tag 'pull-nbd-2025-11-13' of https://repo.or.cz/qemu/ericb into staging
NBD patches for 2025-11-13
- Fix NBD client deadlock when connecting to same-process server
- Several iotests improvements
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCAAdFiEEccLMIrHEYCkn0vOqp6FrSiUnQ2oFAmkWYUwACgkQp6FrSiUn
# Q2rYDgf/TQZ1UVkLhUvnH7RhF4y94tXpfVcl3/PObtis5mldZKkGlTEnFSZGJG4Y
# +ra/tdMS8ZBbTgXIAdR7tEp+n9YpWMLvYxcWcLpQQ2H3MXghtBGGjYHwkzppIvG+
# U3F8YdImbuOgR0V9NP0JWlk9DztsoRkiO3zaqLqvtwvzDXKPdjsMsGM13pHJVVru
# LdkM828Mrr8eu+DcAVFd7ZofftEgyd/E7IV1/0YCj3MaWR3BJ45gsfMUHvWwtaBP
# Mn8tQvB6yJEbAZwmepZbxrkFAJQhE916qbQyZscbnEJvDiKwK6PagQ5NAVtBaiz5
# xN3ywPOw4kghRaRLMiOsq1q/9M/p9A==
# =hhAb
# -----END PGP SIGNATURE-----
# gpg: Signature made Thu 13 Nov 2025 11:53:00 PM CET
# gpg: using RSA key 71C2CC22B1C4602927D2F3AAA7A16B4A2527436A
# gpg: Good signature from "Eric Blake <eblake@redhat.com>" [unknown]
# gpg: aka "Eric Blake (Free Software Programmer) <ebb9@byu.net>" [unknown]
# gpg: aka "[jpeg image of size 6874]" [unknown]
# gpg: WARNING: This key is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 71C2 CC22 B1C4 6029 27D2 F3AA A7A1 6B4A 2527 436A
* tag 'pull-nbd-2025-11-13' of https://repo.or.cz/qemu/ericb:
tests/qemu-iotest: fix iotest 024 with qed images
tests/qemu-iotests: Fix broken grep command in iotest 207
iotests: Add coverage of recent NBD qio deadlock fix
nbd: Avoid deadlock in client connecting to same-process server
qio: Add QIONetListener API for using AioContext
qio: Prepare NetListener to use AioContext
qio: Provide accessor around QIONetListener->sioc
chardev: Reuse channel's cached local address
qio: Factor out helpers qio_net_listener_[un]watch
qio: Minor optimization when callback function is unchanged
qio: Protect NetListener callback with mutex
qio: Remember context of qio_net_listener_set_client_func_full
qio: Unwatch before notify in QIONetListener
qio: Add trace points to net_listener
iotests: Drop execute permissions on vvfat.out
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Peter Maydell [Fri, 7 Nov 2025 15:41:16 +0000 (15:41 +0000)]
hw/audio/lm4549: Don't try to open a zero-frequency audio voice
If the guest incorrectly programs the lm4549 audio chip with a zero
frequency, we will pass this to AUD_open_out(), which will complain:
A bug was just triggered in AUD_open_out
Save all your work and restart without audio
I am sorry
Context:
audio: frequency=0 nchannels=2 fmt=S16 endianness=little
The datasheet doesn't say what we should do here, only that the valid
range for the frequency is 4000 to 48000 Hz; we choose to log the
guest error and ignore an attempt to change the DAC rate to something
outside the valid range.
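As a rough illustration of the "log and ignore" behaviour described above, here is a minimal standalone sketch. The names DacState and dac_set_rate() are hypothetical and are not taken from hw/audio/lm4549.c; the error counter merely stands in for a guest-error log message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Valid DAC rate range quoted in the commit message, in Hz. */
#define DAC_RATE_MIN 4000u
#define DAC_RATE_MAX 48000u

/* Hypothetical device state; the real device model has more fields. */
typedef struct {
    uint32_t dac_rate;      /* last rate that was accepted */
    unsigned guest_errors;  /* stand-in for logging a guest error */
} DacState;

/*
 * Accept the new rate only if it lies inside the valid range.
 * Out-of-range requests are counted and otherwise ignored, so the
 * previously programmed rate stays in effect.
 */
static bool dac_set_rate(DacState *s, uint32_t rate)
{
    if (rate < DAC_RATE_MIN || rate > DAC_RATE_MAX) {
        s->guest_errors++;
        return false;
    }
    s->dac_rate = rate;
    return true;
}
```

With this shape, a zero frequency never reaches the code that opens the audio voice, because the stored rate is only updated for in-range values.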
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/410 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20251107154116.1396769-1-peter.maydell@linaro.org
Peter Maydell [Fri, 7 Nov 2025 15:01:37 +0000 (15:01 +0000)]
hw/misc/npcm_clk: Don't divide by zero when calculating frequency
If the guest misprograms the PLL registers to request a zero
divisor, we currently fall over with a division by zero:
../../hw/misc/npcm_clk.c:221:14: runtime error: division by zero
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ../../hw/misc/npcm_clk.c:221:14
Thread 1 "qemu-system-aar" received signal SIGFPE, Arithmetic exception.
0x00005555584d8f6d in npcm7xx_clk_update_pll (opaque=0x7fffed159a20) at ../../hw/misc/npcm_clk.c:221
221 freq /= PLLCON_INDV(con) * PLLCON_OTDV1(con) * PLLCON_OTDV2(con);
Avoid this by treating this invalid setting like a stopped clock
(setting freq to 0).
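A minimal sketch of the guard, assuming a simplified PLL model in which the output is the input frequency divided by the product of three divisor fields. The helper name pll_output_freq() is hypothetical; the real code in hw/misc/npcm_clk.c extracts the fields from the PLLCON register.

```c
#include <stdint.h>

/*
 * Hypothetical helper: compute a PLL output frequency from three
 * divisor fields.  A zero product is treated like a stopped clock
 * and yields 0 instead of dividing by zero.
 */
static uint64_t pll_output_freq(uint64_t freq, uint32_t indv,
                                uint32_t otdv1, uint32_t otdv2)
{
    uint64_t divisor = (uint64_t)indv * otdv1 * otdv2;

    if (divisor == 0) {
        return 0;   /* invalid guest setting: report a stopped clock */
    }
    return freq / divisor;
}
```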
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/549 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20251107150137.1353532-1-peter.maydell@linaro.org
Peter Maydell [Tue, 4 Nov 2025 16:09:43 +0000 (16:09 +0000)]
tests: Clean up includes
This commit was created with scripts/clean-includes:
./scripts/clean-includes --git tests tests
with one hand-edit to remove a now-empty #ifndef WIN32...#endif
from tests/qtest/dbus-display-test.c .
All .c should include qemu/osdep.h first. The script performs three
related cleanups:
* Ensure .c files include qemu/osdep.h first.
* Including it in a .h is redundant, since the .c already includes
it. Drop such inclusions.
* Likewise, including headers qemu/osdep.h includes is redundant.
Drop these, too.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Cédric Le Goater <clg@redhat.com>
Message-id: 20251104160943.751997-10-peter.maydell@linaro.org
Peter Maydell [Tue, 4 Nov 2025 16:09:42 +0000 (16:09 +0000)]
vfio: Clean up includes
This commit was created with scripts/clean-includes:
./scripts/clean-includes --git vfio hw/vfio hw/vfio-user
All .c should include qemu/osdep.h first. The script performs three
related cleanups:
* Ensure .c files include qemu/osdep.h first.
* Including it in a .h is redundant, since the .c already includes
it. Drop such inclusions.
* Likewise, including headers qemu/osdep.h includes is redundant.
Drop these, too.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Cédric Le Goater <clg@redhat.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20251104160943.751997-9-peter.maydell@linaro.org
Peter Maydell [Tue, 4 Nov 2025 16:09:41 +0000 (16:09 +0000)]
cxl: Clean up includes
This commit was created with scripts/clean-includes:
./scripts/clean-includes --git cxl hw/cxl hw/mem
All .c should include qemu/osdep.h first. The script performs three
related cleanups:
* Ensure .c files include qemu/osdep.h first.
* Including it in a .h is redundant, since the .c already includes
it. Drop such inclusions.
* Likewise, including headers qemu/osdep.h includes is redundant.
Drop these, too.
Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Acked-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Message-id: 20251104160943.751997-8-peter.maydell@linaro.org
Peter Maydell [Thu, 6 Nov 2025 14:52:09 +0000 (14:52 +0000)]
hw/display/xlnx_dp: Don't abort for unsupported graphics formats
If the guest writes an invalid or unsupported value to the
AV_BUF_FORMAT register, currently we abort(). Instead, log this as
either a guest error or an unimplemented error and continue.
The existing code treats DP_NL_VID_CB_Y0_CR_Y1 as x8b8g8r8
via a "case 0" that does not use the enum constant name for some
reason; we leave that alone beyond adding a comment about the
weird code.
Documentation of this register seems to be at:
https://docs.amd.com/r/en-US/ug1087-zynq-ultrascale-registers/AV_BUF_FORMAT-DISPLAY_PORT-Register
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1415 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Edgar E. Iglesias <edgar.iglesias@amd.com> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Message-id: 20251106145209.1083998-3-peter.maydell@linaro.org
Peter Maydell [Thu, 6 Nov 2025 14:52:08 +0000 (14:52 +0000)]
hw/display/xlnx_dp.c: Don't abort on AUX FIFO overrun/underrun
The documentation of the Xilinx DisplayPort subsystem at
https://www.xilinx.com/support/documents/ip_documentation/v_dp_txss1/v3_1/pg299-v-dp-txss1.pdf
doesn't say what happens if a guest tries to issue an AUX write
command with a length greater than the amount of data in the AUX
write FIFO, or tries to write more data to the write FIFO than it can
hold, or issues multiple commands that put data into the AUX read
FIFO without reading it such that it overflows.
Currently QEMU will abort() in these guest-error situations, either
in xlnx_dp.c itself or in the fifo8 code. Make these cases all be
logged as guest errors instead. We choose to ignore the new data on
overflow, and return 0 on underflow. This is in line with how we handled
the "read from empty RX FIFO" case in commit a09ef5040477.
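The policy chosen here (drop new data on overflow, return 0 on underflow) can be sketched with a small standalone byte FIFO. This is not the fifo8 API; AuxFifo, aux_fifo_push() and aux_fifo_pop() are hypothetical names, the depth is an arbitrary illustrative value, and the error counter stands in for a guest-error log message.

```c
#include <stddef.h>
#include <stdint.h>

#define AUX_FIFO_DEPTH 16   /* illustrative size only */

typedef struct {
    uint8_t data[AUX_FIFO_DEPTH];
    size_t head;            /* index of the oldest byte */
    size_t num;             /* number of bytes currently stored */
    unsigned guest_errors;  /* stand-in for logging a guest error */
} AuxFifo;

/* Overflow: note the guest error and ignore the new byte. */
static void aux_fifo_push(AuxFifo *f, uint8_t byte)
{
    if (f->num == AUX_FIFO_DEPTH) {
        f->guest_errors++;
        return;
    }
    f->data[(f->head + f->num) % AUX_FIFO_DEPTH] = byte;
    f->num++;
}

/* Underflow: note the guest error and return 0. */
static uint8_t aux_fifo_pop(AuxFifo *f)
{
    uint8_t byte;

    if (f->num == 0) {
        f->guest_errors++;
        return 0;
    }
    byte = f->data[f->head];
    f->head = (f->head + 1) % AUX_FIFO_DEPTH;
    f->num--;
    return byte;
}
```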
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1418
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1419
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/1424 Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Edgar E. Iglesias <edgar.iglesias@amd.com>
Message-id: 20251106145209.1083998-2-peter.maydell@linaro.org
which is clearly a bug: writing to a constant is incorrect and
discards the result of the mask. Fix this by always doing an and_i32
and trusting the optimizer to turn this into a simple move when the
mask is zero.
Signed-off-by: Anton Johansson <anjo@rev.ng> Signed-off-by: Richard Henderson <richard.henderson@linaro.org> Tested-by: Gustavo Romero <gustavo.romero@linaro.org> Reviewed-by: Gustavo Romero <gustavo.romero@linaro.org>
Message-id: 20251106144909.533997-1-richard.henderson@linaro.org
[rth: Avoid an extra temp and extra move.] Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
[PMM: commit message tweak] Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Peter Maydell [Tue, 28 Oct 2025 16:00:42 +0000 (16:00 +0000)]
net: pad packets to minimum length in qemu_receive_packet()
In commits like 969e50b61a28 ("net: Pad short frames to minimum size
before sending from SLiRP/TAP") we switched away from requiring
network devices to handle short frames to instead having the net core
code do the padding of short frames out to the ETH_ZLEN minimum size.
We then dropped the code for handling short frames from the network
devices in a series of commits like 140eae9c8f7 ("hw/net: e1000:
Remove the logic of padding short frames in the receive path").
This missed one route where the device's receive code can still see a
short frame: if the device is in loopback mode and it transmits a
short frame via the qemu_receive_packet() function, this will be fed
back into its own receive code without being padded.
Add the padding logic to qemu_receive_packet().
This fixes a buffer overrun which can be triggered in the
e1000_receive_iov() logic via the loopback code path.
Other devices that use qemu_receive_packet() to implement loopback
are cadence_gem, dp8393x, lan9118, msf2-emac, pcnet, rtl8139
and sungem.
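For readers unfamiliar with the padding step, a minimal standalone sketch follows. It assumes the 60-byte minimum frame length (ETH_ZLEN, excluding the FCS); the helper name pad_short_frame() and its buffer-based interface are hypothetical and do not mirror the actual net core code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MIN_FRAME_LEN 60    /* ETH_ZLEN: minimum frame length without FCS */

/*
 * Hypothetical helper: copy a frame into 'out', zero-padding it up to
 * the minimum length if it is shorter.  Returns the length to hand to
 * the receive path, or 0 if 'out' is too small.
 */
static size_t pad_short_frame(const uint8_t *frame, size_t len,
                              uint8_t *out, size_t out_cap)
{
    size_t padded = len < MIN_FRAME_LEN ? MIN_FRAME_LEN : len;

    if (out_cap < padded) {
        return 0;
    }
    memcpy(out, frame, len);
    memset(out + len, 0, padded - len);   /* no-op for long frames */
    return padded;
}
```

Doing this once in the common loopback path means each device's receive code can keep assuming it never sees a frame shorter than the minimum.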
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/3043 Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Jason Wang <jasowang@redhat.com>
An assertion in e1000e_write_payload_frag_to_rx_buffers() attempts to
guard against the calling code accidentally trying to write too much
data to a single RX descriptor, such that the E1000EBAState::cur_idx
indexes off the end of the E1000EBAState::written[] array.
Unfortunately it is overzealous: it asserts that cur_idx is in
range after it has been incremented. This will fire incorrectly
for the case where the guest configures four buffers and exactly
enough bytes are written to fill all four of them.
The only places where we use cur_idx and index into the written[]
array are the functions e1000e_write_hdr_frag_to_rx_buffers() and
e1000e_write_payload_frag_to_rx_buffers(), so we can rewrite this to
assert before doing the array dereference, rather than asserting
after updating cur_idx.
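The difference between asserting after the increment and asserting before the dereference can be sketched as follows. BufState and account_write() are hypothetical simplifications of the real buffer-accounting state; only the placement of the assertion is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_BUFS 4

/* Hypothetical, simplified version of the buffer-accounting state. */
typedef struct {
    uint16_t written[MAX_BUFS];  /* bytes written to each buffer */
    uint8_t cur_idx;             /* buffer currently being filled */
} BufState;

/* Account for 'len' bytes spread over buffers of the given sizes. */
static void account_write(BufState *s, const uint16_t buf_size[MAX_BUFS],
                          size_t len)
{
    while (len) {
        size_t room, n;

        /* Check the index right before it is used to index written[]. */
        assert(s->cur_idx < MAX_BUFS);

        room = (size_t)buf_size[s->cur_idx] - s->written[s->cur_idx];
        n = len < room ? len : room;
        s->written[s->cur_idx] = (uint16_t)(s->written[s->cur_idx] + n);
        len -= n;

        if (s->written[s->cur_idx] == buf_size[s->cur_idx]) {
            /* May legitimately reach MAX_BUFS once the last buffer is full. */
            s->cur_idx++;
        }
    }
}
```

Filling all four buffers exactly leaves cur_idx equal to MAX_BUFS without tripping the assertion, which is the case the old post-increment check rejected.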
Cc: qemu-stable@nongnu.org Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Jason Wang <jasowang@redhat.com>
In e1000e_write_packet_to_guest() we attempt to ensure that we don't
write more of a packet to a descriptor than will fit in the guest
configured receive buffers. However, this code does not allow for
the "packet split" feature. When packet splitting is enabled, the
first of up to 4 buffers in the descriptor is used for the packet
header only, with the payload going into buffers 2, 3 and 4. Our
length check only checks against the total sizes of all 4 buffers,
which meant that if an incoming packet was large enough to fit in (1
+ 2 + 3 + 4) but not into (2 + 3 + 4) and packet splitting was
enabled, we would run into the assertion in
e1000e_write_hdr_frag_to_rx_buffers() that we had enough buffers for
the data:
A malicious guest could provoke this assertion by configuring the
device into loopback mode, and then sending itself a suitably sized
packet into a suitably arranged rx descriptor.
The code also fails to deal with the possibility that the descriptor
buffers are sized such that the trailing checksum word does not fit
into the last descriptor which has actual data, which might also
trigger this assertion.
Rework the length handling to use two variables:
* desc_size is the total amount of data DMA'd to the guest
for the descriptor being processed in this iteration of the loop
* rx_desc_buf_size is the total amount of space left in it
As we copy data to the guest (packet header, payload, checksum),
update these two variables. (Previously we attempted to calculate
desc_size once at the top of the loop, but this is too difficult to
do correctly.) Then we can use the variables to ensure that we clamp
the amount of copied payload data to the remaining space in the
descriptor's buffers, even if we've used one of the buffers up in the
packet-split code, and we can tell whether we have enough space for
the full checksum word in this descriptor or whether we're going to
need to split that to the following descriptor.
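The two-variable bookkeeping can be sketched in isolation like this. It is only a simplified model under stated assumptions: DescSpace, desc_take(), desc_retire() and desc_fits_checksum() are hypothetical names, and the 4-byte checksum length is assumed from the size of an Ethernet FCS rather than taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define CHECKSUM_LEN 4   /* assumed size of the trailing checksum word */

typedef struct {
    size_t desc_size;         /* bytes DMA'd to the guest for this descriptor */
    size_t rx_desc_buf_size;  /* bytes still free in this descriptor's buffers */
} DescSpace;

/* Copy accounting: take at most 'want' bytes, clamped to the space left. */
static size_t desc_take(DescSpace *d, size_t want)
{
    size_t n = want < d->rx_desc_buf_size ? want : d->rx_desc_buf_size;

    d->desc_size += n;
    d->rx_desc_buf_size -= n;
    return n;
}

/*
 * Packet split: the unused tail of the header buffer cannot hold
 * payload, so remove it from the free space without counting it as
 * data written.
 */
static void desc_retire(DescSpace *d, size_t unused)
{
    d->rx_desc_buf_size -=
        unused < d->rx_desc_buf_size ? unused : d->rx_desc_buf_size;
}

/* Does the whole checksum word still fit in this descriptor? */
static bool desc_fits_checksum(const DescSpace *d)
{
    return d->rx_desc_buf_size >= CHECKSUM_LEN;
}
```

Because both counters are updated at every copy, the payload clamp and the checksum check always see the space that is really left, including after a header buffer has been used up.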
I have included comments that hopefully help to make the loop
logic a little clearer.
Cc: qemu-stable@nongnu.org
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/537 Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Signed-off-by: Jason Wang <jasowang@redhat.com>
Peter Maydell [Mon, 3 Nov 2025 17:58:49 +0000 (17:58 +0000)]
hw/net/e1000e_core: Don't advance desc_offset for NULL buffer RX descriptors
In e1000e_write_packet_to_guest() we don't write data for RX descriptors
where the buffer address is NULL (as required by the i82574 datasheet
section 7.1.7.2). However, when we do this we still update desc_offset
by the amount of data we would have written to the RX descriptor if
it had a valid buffer pointer, resulting in our dropping that data
entirely. The data sheet is not 100% clear on the subject, but this
seems unlikely to be the correct behaviour.
Rearrange the null-descriptor logic so that we don't treat these
do-nothing descriptors as if we'd really written the data.
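A minimal sketch of the intended behaviour, using a hypothetical RxDesc type and deliver() function rather than the real e1000e structures: a descriptor with a NULL buffer is consumed, but the packet offset only advances for bytes that were actually copied.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified RX descriptor: one buffer and its size. */
typedef struct {
    uint8_t *buf;
    size_t size;
} RxDesc;

/* Returns the number of packet bytes actually delivered to buffers. */
static size_t deliver(const uint8_t *pkt, size_t pkt_len,
                      RxDesc *descs, size_t ndescs)
{
    size_t desc_offset = 0;

    for (size_t i = 0; i < ndescs && desc_offset < pkt_len; i++) {
        size_t n;

        if (descs[i].buf == NULL) {
            /* Descriptor is used up, but no packet data is skipped. */
            continue;
        }
        n = pkt_len - desc_offset;
        if (n > descs[i].size) {
            n = descs[i].size;
        }
        memcpy(descs[i].buf, pkt + desc_offset, n);
        desc_offset += n;
    }
    return desc_offset;
}
```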
This both fixes a bug and also is a prerequisite to cleaning up
the size calculation logic in the next patch.
(Cc to stable largely because it will be needed for the next patch,
which fixes a more serious bug.)
Cc: qemu-stable@nongnu.org Signed-off-by: Peter Maydell <peter.maydell@linaro.org> Reviewed-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp> Signed-off-by: Jason Wang <jasowang@redhat.com>
the shutdown order starts with net_cleanup, which walks the list and
deletes netdevs (including hubports). Then Xen's xen_device_unrealize is
called, which eventually leads to a second net_hub_port_cleanup call,
resulting in a segfault.
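What "idempotent" means for a cleanup function can be sketched generically as below. This does not reproduce the net/hub.c change; Port, Client and port_cleanup() are hypothetical, and the real fix may detect the "already cleaned up" state differently.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical types standing in for a hub port and its owner. */
typedef struct Port {
    int id;
} Port;

typedef struct Client {
    Port *port;
} Client;

/*
 * Safe to call more than once: the first call releases the port and
 * records that fact, later calls see nothing left to do.
 * Returns true only when work was actually done.
 */
static bool port_cleanup(Client *c)
{
    if (c->port == NULL) {
        return false;   /* already cleaned up */
    }
    free(c->port);
    c->port = NULL;
    return true;
}
```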
Fixes: e7891c57 ("net: move backend cleanup to NIC cleanup") Reported-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Jonah Palmer <jonah.palmer@oracle.com> Signed-off-by: Jason Wang <jasowang@redhat.com>
Thomas Huth [Thu, 13 Nov 2025 08:05:25 +0000 (09:05 +0100)]
tests/qemu-iotests: Fix broken grep command in iotest 207
Running "./check -ssh 207" fails for me with lots of lines like this
in the output:
+base64: invalid input
While looking closer at it, I noticed that the grep -v "\\^#" command
in this test is not working as expected - it is likely meant to filter
out the comment lines that are starting with a "#", but at least my
version of grep (GNU grep 3.11) does not work with the backslashes here.
There does not seem to be a compelling reason for these backslashes,
so let's simply drop them to fix this issue.
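One plausible explanation can be checked with the POSIX regex API, under the assumption that the pattern reaching grep was `\^#` (one backslash left after quoting). In a basic regular expression an escaped caret is a literal character, not an anchor, so comment lines are no longer matched. The helper name bre_matches() is hypothetical.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if 'line' matches the POSIX basic regex 'pattern'. */
static bool bre_matches(const char *pattern, const char *line)
{
    regex_t re;
    bool hit;

    if (regcomp(&re, pattern, REG_NOSUB) != 0) {
        return false;   /* treat an invalid pattern as "no match" */
    }
    hit = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);
    return hit;
}
```

Here bre_matches("^#", "# comment") is true, while bre_matches("\\^#", "# comment") is false because the second pattern looks for the two literal characters "^#".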
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251113080525.444826-1-thuth@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:38 +0000 (19:11 -0600)]
iotests: Add coverage of recent NBD qio deadlock fix
Test that all images in a qcow2 chain using an NBD backing file can be
served by the same process. Prior to the recent QIONetListener fixes,
this test would demonstrate deadlock.
The test borrows heavily from the original formula by "John Doe" in
the gitlab bug, but uses a Unix socket rather than TCP to avoid port
contention, and uses a full-blown QEMU rather than qemu-storage-daemon
since both programs were impacted.
The test starts out with the even simpler task of directly adding an
NBD client without qcow2 chain ('client'), which also provokes the
deadlock; but commenting out the 'Adding explicit NBD client' section
will still show deadlock when reaching the 'Adding wrapper image...'.
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169 Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20251113011625.878876-28-eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:37 +0000 (19:11 -0600)]
nbd: Avoid deadlock in client connecting to same-process server
See the previous patch for a longer description of the deadlock. Now
that QIONetListener supports waiting for clients in the main loop
AioContext, NBD can use that to ensure that the server can make
progress even when a client is intentionally starving the GMainContext
from any activity not tied to an AioContext.
Note that command-line arguments and QMP commands like
nbd-server-start or nbd-server-stop that manipulate whether the NBD
server exists are serviced in the main loop; and therefore, this patch
does not fall foul of the restrictions in the previous patch about the
inherent unsafe race possible if a QIONetListener can have its async
callback modified by a different thread than the one servicing polls.
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169 Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-27-eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:36 +0000 (19:11 -0600)]
qio: Add QIONetListener API for using AioContext
The user calling himself "John Doe" reported a deadlock when
attempting to use qemu-storage-daemon to serve both a base file over
NBD, and a qcow2 file with that NBD export as its backing file, from
the same process, even though it worked just fine when there were two
q-s-d processes. The bulk of the NBD server code properly uses
coroutines to make progress in an event-driven manner, but the code
for spawning a new coroutine at the point when listen(2) detects a new
client was hard-coded to use the global GMainContext; in other words,
the callback that triggers nbd_client_new to let the server start the
negotiation sequence with the client requires the main loop to be
making progress. However, the code for bdrv_open of a qcow2 image
with an NBD backing file uses an AIO_WAIT_WHILE nested event loop to
ensure that the entire qcow2 backing chain is either fully loaded or
rejected, without any side effects from the main loop causing unwanted
changes to the disk being loaded (in short, an AioContext represents
the set of actions that are known to be safe while handling block
layer I/O, while excluding any other pending actions in the global
main loop with potentially larger risk of unwanted side effects).
This creates a classic case of deadlock: the server can't progress to
the point of accept(2)ing the client to write to the NBD socket
because the main loop is being starved until the AIO_WAIT_WHILE
completes the bdrv_open, but the AIO_WAIT_WHILE can't progress because
it is blocked on the client coroutine stuck in a read() of the
expected magic number from the server side of the socket.
This patch adds a new API to allow clients to opt in to listening via
an AioContext rather than a GMainContext. This will allow NBD to fix
the deadlock by performing all actions during bdrv_open in the main
loop AioContext.
Technical debt warning: I would have loved to utilize a notify
function with AioContext to guarantee that we don't finalize listener
due to an object_unref if there is any callback still running (the way
GSource does), but wiring up notify functions into AioContext is a
bigger task that will be deferred to a later QEMU release. But for
solving the NBD deadlock, it is sufficient to note that the QMP
commands for enabling and disabling the NBD server are really the only
points where we want to change the listener's callback. Furthermore,
those commands are serviced in the main loop, which is the same
AioContext that is also listening for connections. Since a thread
cannot interrupt itself, we are assured that at the point where we are
changing the watch, there are no callbacks active. This is NOT as
powerful as the GSource cross-thread safety, but sufficient for the
needs of today.
An upcoming patch will then add a unit test (kept separate to make it
easier to rearrange the series to demonstrate the deadlock without
this patch).
Fixes: https://gitlab.com/qemu-project/qemu/-/issues/3169 Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-26-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:35 +0000 (19:11 -0600)]
qio: Prepare NetListener to use AioContext
For ease of review, this patch adds an AioContext pointer to the
QIONetListener struct, the code to trace it, and refactors
listener->io_source to instead be an array of utility structs; but the
aio_context pointer is always NULL until the next patch adds an API to
set it. There should be no semantic change in this patch.
Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-25-eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:34 +0000 (19:11 -0600)]
qio: Provide accessor around QIONetListener->sioc
An upcoming patch needs to pass more than just sioc as the opaque
pointer to an AioContext; but since our AioContext code in general
(and its QIO Channel wrapper code) lacks a notify callback present
with GSource, we do not have the trivial option of just g_malloc'ing a
small struct to hold all that data coupled with a notify of g_free.
Instead, the data pointer must outlive the registered handler; in
fact, having the data pointer have the same lifetime as QIONetListener
is adequate.
But the cleanest way to stick such a helper struct in QIONetListener
will be to rearrange internal struct members. And that in turn means
that all existing code that currently directly accesses
listener->nsioc and listener->sioc[] should instead go through
accessor functions, to be immune to the upcoming struct layout
changes. So this patch adds accessor methods qio_net_listener_nsioc()
and qio_net_listener_sioc(), and puts them to use.
While at it, notice that the pattern of grabbing an sioc from the
listener only to turn around and call
qio_channel_socket_get_local_address is common enough to also warrant
the helper of qio_net_listener_get_local_address, and fix a copy-paste
error in the corresponding documentation.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-24-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:33 +0000 (19:11 -0600)]
chardev: Reuse channel's cached local address
Directly accessing the fd member of a QIOChannelSocket is an
undesirable leaky abstraction. What's more, grabbing that fd merely
to force an eventual call to getsockname() can be wasteful, since the
channel is often able to return its cached local name.
Reported-by: Daniel P. Berrangé <berrange@redhat.com> Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-23-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:32 +0000 (19:11 -0600)]
qio: Factor out helpers qio_net_listener_[un]watch
The code had three similar repetitions of an iteration over one or all
of nsiocs to set up a GSource, and likewise for teardown. Since an
upcoming patch wants to tweak whether GSource or AioContext is used,
it's better to consolidate that into one helper function for fewer
places to edit later.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-22-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:31 +0000 (19:11 -0600)]
qio: Minor optimization when callback function is unchanged
In qemu-nbd and other NBD server setups where parallel clients are
supported, it is common that the caller will re-register the same
callback function as long as it has not reached its limit on
simultaneous clients. In that case, there is no need to tear down and
reinstall GSource watches in the GMainContext.
In practice, all existing callers currently pass NULL for notify, and
no caller ever changes context across calls (for async uses, either
the caller consistently uses qio_net_listener_set_client_func_full
with the same context, or the caller consistently uses only
qio_net_listener_set_client_func which always uses the global
context); but the time spent checking these two fields in addition to
the more important func and data is still less than the savings of not
churning through extra GSource manipulations when the result will be
unchanged.
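The early-return check can be sketched on a stripped-down listener. The Listener struct, listener_set_client_func() and the rewatch counter are hypothetical stand-ins; the real code compares the same four values (func, data, notify, context) but manages GSource watches instead of a counter.

```c
#include <stdbool.h>
#include <stddef.h>

typedef void (*ClientFunc)(void *opaque);
typedef void (*NotifyFunc)(void *opaque);

/* Hypothetical, simplified listener state. */
typedef struct {
    ClientFunc func;
    void *data;
    NotifyFunc notify;
    void *context;
    unsigned rewatch_count;   /* counts simulated teardown/reinstall cycles */
} Listener;

/* Example callback used only to have a non-NULL function to register. */
static void on_client(void *opaque)
{
    (void)opaque;
}

/* Returns true if the watches had to be torn down and reinstalled. */
static bool listener_set_client_func(Listener *l, ClientFunc func,
                                     void *data, NotifyFunc notify,
                                     void *context)
{
    if (l->func == func && l->data == data &&
        l->notify == notify && l->context == context) {
        return false;   /* nothing changed: keep the existing watches */
    }
    l->func = func;
    l->data = data;
    l->notify = notify;
    l->context = context;
    l->rewatch_count++;
    return true;
}
```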
Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-21-eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:30 +0000 (19:11 -0600)]
qio: Protect NetListener callback with mutex
Without a mutex, NetListener can run into this data race between a
thread changing the async callback function to use when a
client connects, and the thread servicing polling of the listening
sockets:
Thread 2:
=> call lstnr->io_func(lstnr->io_data) // now sees f2
=> return dispatch(sock)
=> unref(GSourceCallback)
=> destroy-notify
=> object_unref
Found by inspection; I did not spend the time trying to add sleeps or
execute under gdb to try and actually trigger the race in practice.
This is a SEGFAULT waiting to happen if f2 can become NULL because
thread 1 deregisters the user's callback while thread 2 is trying to
service the callback. Other messes are also theoretically possible,
such as running callback f1 with an opaque pointer that should only be
passed to f2 (if the client code were to use more than just a binary
choice between a single async function or NULL).
Mitigating factor: if the code that modifies the QIONetListener can
only be reached by the same thread that is executing the polling and
async callbacks, then we are not in a two-thread race documented above
(even though poll can see two clients trying to connect in the same
window of time, any changes made to the listener by the first async
callback will be completed before the thread moves on to the second
client). However, QEMU is complex enough that this is hard to
generically analyze. If QMP commands (like nbd-server-stop) are run
in the main loop and the listener uses the main loop, things should be
okay. But when a client uses an alternative GMainContext, or if
servicing a QMP command hands off to a coroutine to avoid blocking, I
am unable to state with certainty whether a given net listener can be
modified by a thread different from the polling thread running
callbacks.
At any rate, it is worth having the API be robust. To ensure that
modifying a NetListener can be safely done from any thread, add a
mutex that guarantees atomicity to all members of a listener object
related to callbacks. This problem has been present since
QIONetListener was introduced.
Note that this does NOT prevent the case of a second round of the
user's old async callback being invoked with the old opaque data, even
when the user has already tried to change the async callback during
the first async callback; it is only about ensuring that there is no
sharding (the eventual io_func(io_data) call that does get made will
correspond to a particular combination that the user had requested at
some point in time, and not be sharded to a combination that never
existed in practice). In other words, this patch maintains the status
quo that a user's async callback function already needs to be robust
to parallel clients landing in the same window of poll servicing, even
when only one client is desired, if that particular listener can be
amended in a thread other than the one doing the polling.
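The general technique is to update the function and its opaque data together under a lock, and to have the dispatcher take a consistent snapshot of both before calling. The sketch below uses a plain pthread mutex in place of QEMU's own mutex type, and all names (Listener, listener_set_callback(), listener_dispatch(), count_client()) are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

typedef void (*ClientFunc)(void *opaque);

/* Hypothetical, simplified listener: only the callback pair and its lock. */
typedef struct {
    pthread_mutex_t lock;
    ClientFunc io_func;
    void *io_data;
} Listener;

/* Example callback: count how many clients were handed over. */
static void count_client(void *opaque)
{
    (*(int *)opaque)++;
}

static void listener_init(Listener *l)
{
    pthread_mutex_init(&l->lock, NULL);
    l->io_func = NULL;
    l->io_data = NULL;
}

/* Function and data always change together while the lock is held. */
static void listener_set_callback(Listener *l, ClientFunc func, void *data)
{
    pthread_mutex_lock(&l->lock);
    l->io_func = func;
    l->io_data = data;
    pthread_mutex_unlock(&l->lock);
}

/* Returns 1 if a callback was invoked, 0 if none was registered. */
static int listener_dispatch(Listener *l)
{
    ClientFunc func;
    void *data;

    /* Snapshot a pairing that really existed at some point in time. */
    pthread_mutex_lock(&l->lock);
    func = l->io_func;
    data = l->io_data;
    pthread_mutex_unlock(&l->lock);

    if (func == NULL) {
        return 0;
    }
    func(data);   /* called on the snapshot, outside the lock */
    return 1;
}
```

As the paragraph above notes, a snapshot like this only rules out mismatched function/data pairs; a callback that was already snapshotted can still run after another thread has replaced it.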
CC: qemu-stable@nongnu.org Fixes: 53047392 ("io: introduce a network socket listener API", v2.12.0) Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251113011625.878876-20-eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
[eblake: minor commit message wording improvements] Signed-off-by: Eric Blake <eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:29 +0000 (19:11 -0600)]
qio: Remember context of qio_net_listener_set_client_func_full
io/net-listener.c has two modes of use: asynchronous (the user calls
qio_net_listener_set_client_func to wake up the callback via the
global GMainContext, or qio_net_listener_set_client_func_full to wake
up the callback via the caller's own alternative GMainContext), and
synchronous (the user calls qio_net_listener_wait_client which creates
its own GMainContext and waits for the first client connection before
returning, with no need for a user's callback). But commit 938c8b79
has a latent logic flaw: when qio_net_listener_wait_client finishes on
its temporary context, it reverts all of the siocs back to the global
GMainContext rather than the potentially non-NULL context they might
have been originally registered with. Similarly, if the user creates
a net-listener, adds initial addresses, registers an async callback
with a non-default context (which ties to all siocs for the initial
addresses), then adds more addresses with qio_net_listener_add, the
siocs for later addresses are blindly placed in the global context,
rather than sharing the context of the earlier ones.
In practice, I don't think this has caused issues. As pointed out by
the original commit, all async callers prior to that commit were
already okay with the NULL default context; and the typical usage
pattern is to first add ALL the addresses the listener will pay
attention to before ever setting the async callback. Likewise, if a
file uses only qio_net_listener_set_client_func instead of
qio_net_listener_set_client_func_full, then it is never using a custom
context, so later assignments of async callbacks will still be to the
same global context as earlier ones. Meanwhile, any callers that want
to do the sync operation to grab the first client are unlikely to
register an async callback; altogether bypassing the question of
whether later assignments of a GSource are being tied to a different
context over time.
I do note that chardev/char-socket.c is the only file that calls both
qio_net_listener_wait_client (sync for a single client in
tcp_chr_accept_server_sync), and qio_net_listener_set_client_func_full
(several places, all with chr->gcontext, but sometimes with a NULL
callback function during teardown). But as far as I can tell, the two
uses are mutually exclusive, based on the is_waitconnect parameter to
qmp_chardev_open_socket_server.
That said, it is more robust to remember when an async callback
function is tied to a non-default context, and have both the sync wait
and any late address additions honor that same context. That way, the
code will be robust even if a later user performs a sync wait for a
specific client in the middle of servicing a longer-lived
QIONetListener that has an async callback for all other clients.
CC: qemu-stable@nongnu.org Fixes: 938c8b79 ("qio: store gsources for net listeners", v2.12.0) Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-19-eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:28 +0000 (19:11 -0600)]
qio: Unwatch before notify in QIONetListener
When changing the callback registered with QIONetListener, the code
was calling notify on the old opaque data prior to actually removing
the old GSource objects still pointing to that data. Similarly,
during finalize, it called notify before tearing down the various
GSource objects tied to the data.
In practice, a grep of the QEMU code base found that every existing
client of QIONetListener passes in a NULL notifier (the opaque data,
if non-NULL, outlives the NetListener and so does not need cleanup
when the NetListener is torn down), so this patch has no impact. And
even if a caller had passed in a reference-counted object with a
notifier of object_unref but kept its own reference on the data, then
the early notify would merely reduce a refcount from (say) 2 to 1, but
not free the object. However, it is a latent bug waiting to bite any
future caller that passes in data where the notifier actually frees
the object, because the GSource could then trigger a use-after-free if
it loses the race on a last-minute client connection resulting in the
data being passed to one final use of the async callback.
It is better to delay the notify call until after all GSource objects
that have been given a copy of the opaque data are torn down.
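The ordering argued for above can be sketched in a few lines of C. This is a toy model only: ListenerSketch, listener_unwatch() and listener_set_data() are hypothetical names, not the QIONetListener API, and the "watches" are reduced to a counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for a listener; not the real QIONetListener. */
typedef struct {
    void *opaque;                  /* data handed to the async callback */
    void (*notify)(void *opaque);  /* destroy notifier, may free opaque */
    int nsources;                  /* watches still holding 'opaque' */
} ListenerSketch;

/* Drop every watch so nothing can invoke the callback with 'opaque'. */
void listener_unwatch(ListenerSketch *l)
{
    l->nsources = 0;
}

/* Replace the callback data: unwatch first, notify second. */
void listener_set_data(ListenerSketch *l, void *opaque,
                       void (*notify)(void *), int nsources)
{
    void *old_opaque = l->opaque;
    void (*old_notify)(void *) = l->notify;

    listener_unwatch(l);           /* 1. no watch references old data */
    if (old_notify) {
        old_notify(old_opaque);    /* 2. only now release the old data */
    }
    l->opaque = opaque;
    l->notify = notify;
    l->nsources = nsources;
}

/* Demo state: records how many watches existed when notify ran. */
ListenerSketch demo_listener;
int watches_seen_by_notify = -1;

static void demo_notify(void *opaque)
{
    (void)opaque;
    watches_seen_by_notify = demo_listener.nsources;
}

int demo_unwatch_before_notify(void)
{
    int data = 42;

    listener_set_data(&demo_listener, &data, demo_notify, 2);
    listener_set_data(&demo_listener, NULL, NULL, 0);
    return watches_seen_by_notify;
}
```

With this order the notifier observes zero remaining watches, so even a notifier that frees the data cannot race with a late callback.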
CC: qemu-stable@nongnu.org Fixes: 530473924d "io: introduce a network socket listener API", v2.12.0 Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-18-eblake@redhat.com>
Eric Blake [Thu, 13 Nov 2025 01:11:27 +0000 (19:11 -0600)]
qio: Add trace points to net_listener
Upcoming patches will adjust how net_listener watches for new client
connections; adding trace points now makes it easier to debug that the
changes work as intended. For example, adding
--trace='qio_net_listener*' to the qemu-storage-daemon command line
before --nbd-server will track when the server first starts listening
for clients.
Signed-off-by: Eric Blake <eblake@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
Message-ID: <20251113011625.878876-17-eblake@redhat.com>
Merge tag 'for-upstream' of https://repo.or.cz/qemu/kevin into staging
Block layer patches
- stream: Fix potential crash during job completion
- aio: add the aio_add_sqe() io_uring API
- qcow2: put discards in discard queue when discard-no-unref is enabled
- qcow2, vmdk: Restrict creation with secondary file using protocol
- qemu-img rebase: Fix assertion failure due to exceeding IO_BUF_SIZE
- iotests: Run iotests with sanitizers
- iotests: Add more image formats to the thorough testing
- iotests: Improve the dry run list to speed up thorough testing
- Code cleanup
# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCgAvFiEE3D3rFZqa+V09dFb+fwmycsiPL9YFAmkTqWcRHGt3b2xmQHJl
# ZGhhdC5jb20ACgkQfwmycsiPL9awPg//VqEgqYbEr3dVUvBFk8tlcewoo7KGICVk
# 4kddOwMJIdcsVpiLuNzqQARH2kHV93Hiv+mVt25o00PkJx565eCGTh/bBFas3UXL
# JMBjgHyJutGr4cijkNrnQgqWfeTgc32xdVEWh1nZM2K7LslzC9I1PfUzfxRMYqZA
# Em0KE3vwQDC7xtIyk4t451hkfcQY8fwN9bDMpD+zbzaLsYTEyOJ900En88iW7oHE
# TuJhrviin11jdQCA26QVNXRaw7iIVVo8vJP1VEgbn31iY+Qpcr/HcQRs0x2gex67
# OqIdh4onqkdGCFDxTGUoAH+jORXWUmk/JipIhl9pJP0ZDyAjsm97ThJ6SvctURsK
# UMU0dzXEc1C5spD2CWnN0PujqHYQqYaylx7MdiCJMjaCfDB3ZeIRsTGoiLMB24P+
# WBrcn2P+f03nC/sVvxRZWrpyI2kZwEh1RsO/mnLQ3apVBFeKqaFi8Ouo9oi1ZMd6
# ahUw7sZSoTxmGY1FhOSRCGEh2Wjy0ZIOx9tHT1U9vig5Kf9KeE81yO8yaq2T60mq
# 9eaUL8rcUrKRiJw9NUkcEYmIUJrh0nUe/kK2RWmbEGMYIH7ASrGqiyUP5FxpekD+
# i/uen4BeyRwe6rnPOzGolg+HMysMBr8VD/8PwJ8g88FLH1jIdTYvFUdRbrkciUlo
# okC+y4+kqiU=
# =SI8s
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 11 Nov 2025 10:23:51 PM CET
# gpg: using RSA key DC3DEB159A9AF95D3D7456FE7F09B272C88F2FD6
# gpg: issuer "kwolf@redhat.com"
# gpg: Good signature from "Kevin Wolf <kwolf@redhat.com>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: DC3D EB15 9A9A F95D 3D74 56FE 7F09 B272 C88F 2FD6
* tag 'for-upstream' of https://repo.or.cz/qemu/kevin: (28 commits)
qemu-img rebase: don't exceed IO_BUF_SIZE in one operation
qcow2, vmdk: Restrict creation with secondary file using protocol
block: Allow drivers to control protocol prefix at creation
tests/qemu-iotest: Add more image formats to the thorough testing
tests/qemu-iotests: Improve the dry run list to speed up thorough testing
tests/qemu-iotests/184: Fix skip message for qemu-img without throttle
qcow2: put discards in discard queue when discard-no-unref is enabled
qcow2: rename update_refcount_discard to queue_discard
iotests: Run iotests with sanitizers
qemu-img: Fix amend option parse error handling
iotests: Test resizing file node under raw with size/offset
block: Drop detach_subchain for bdrv_replace_node
block: replace TABs with space
block/io_uring: use non-vectored read/write when possible
block/io_uring: use aio_add_sqe()
aio-posix: add aio_add_sqe() API for user-defined io_uring requests
aio-posix: add fdmon_ops->dispatch()
aio-posix: unindent fdmon_io_uring_destroy()
aio-posix: gracefully handle io_uring_queue_init() failure
aio: add errp argument to aio_context_setup()
...
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Alberto Garcia [Fri, 7 Nov 2025 09:18:30 +0000 (10:18 +0100)]
qemu-img rebase: don't exceed IO_BUF_SIZE in one operation
During a rebase operation data is copied from the backing chain into
the target image using a loop, and each iteration looks for a
contiguous region of allocated data of at most IO_BUF_SIZE (2 MB).
Once that region is found, and in order to avoid partial writes, its
boundaries are extended so they are aligned to the (sub)clusters of
the target image (see commit 12df580b).
This operation can however result in a region that exceeds the maximum
allowed IO_BUF_SIZE, crashing qemu-img.
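To see how the overshoot can arise, here is a small arithmetic sketch. The helper names are made up for illustration, and the code only models the alignment step described above; it does not reproduce the fix in this commit.

```c
#include <stdint.h>

#define IO_BUF_SIZE (2 * 1024 * 1024)   /* 2 MB, as stated above */

/* Round 'x' down to a multiple of 'align' (align > 0). */
static uint64_t align_down(uint64_t x, uint64_t align)
{
    return x - x % align;
}

/* Round 'x' up to a multiple of 'align' (align > 0). */
static uint64_t align_up(uint64_t x, uint64_t align)
{
    return align_down(x + align - 1, align);
}

/* Size of [offset, offset + len) after widening to 'cluster' boundaries. */
uint64_t aligned_region_size(uint64_t offset, uint64_t len, uint64_t cluster)
{
    uint64_t start = align_down(offset, cluster);
    uint64_t end = align_up(offset + len, cluster);

    return end - start;
}
```

For example, a 2 MiB region starting at offset 64 KiB, widened to 1 MiB cluster boundaries, covers [0, 3 MiB), which is larger than IO_BUF_SIZE.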
This can be easily reproduced when the source image has a smaller
cluster size than the target image:
Eric Blake [Mon, 15 Sep 2025 21:37:27 +0000 (16:37 -0500)]
qcow2, vmdk: Restrict creation with secondary file using protocol
Ever since CVE-2024-4467 (see commit 7ead9469 in qemu v9.1.0), we have
intentionally treated the opening of secondary files whose name is
specified in the contents of the primary file, such as a qcow2
data_file, as something that must be a local file and not a protocol
prefix (it is still possible to open a qcow2 file that wraps an NBD
data image by using QMP commands, but that is from the explicit action
of the QMP overriding any string encoded in the qcow2 file). At the
time, we did not prevent the use of protocol prefixes on the secondary
image while creating a qcow2 file, but it results in a qcow2 file that
records an empty string for the data_file, rather than the protocol
passed in during creation:
$ qemu-img create -f raw datastore.raw 2G
$ qemu-nbd -e 0 -t -f raw datastore.raw &
$ qemu-img create -f qcow2 -o data_file=nbd://localhost:10809/ \
datastore_nbd.qcow2 2G
Formatting 'datastore_nbd.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 data_file=nbd://localhost:10809/ lazy_refcounts=off refcount_bits=16
$ qemu-img info datastore_nbd.qcow2 | grep data
image: datastore_nbd.qcow2
data file:
data file raw: false
filename: datastore_nbd.qcow2
And since an empty string was recorded in the file, attempting to open
the image without using QMP to supply the NBD data store fails, with a
somewhat confusing error message:
$ qemu-io -f qcow2 datastore_nbd.qcow2
qemu-io: can't open device datastore_nbd.qcow2: The 'file' block driver requires a file name
Although the ability to create an image with a convenience reference
to a protocol data file is not a security hole (unlike the case with
open, the image is not untrusted if we are the ones creating it), the
above demo shows that it is still inconsistent. Thus, it makes more
sense if we also insist that image creation rejects a protocol prefix
when using the same syntax. Now, the above attempt produces:
$ qemu-img create -f qcow2 -o data_file=nbd://localhost:10809/ \
datastore_nbd.qcow2 2G
Formatting 'datastore_nbd.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off compression_type=zlib size=2147483648 data_file=nbd://localhost:10809/ lazy_refcounts=off refcount_bits=16
qemu-img: datastore_nbd.qcow2: Could not create 'nbd://localhost:10809/': No such file or directory
with datastore_nbd.qcow2 no longer created.
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250915213919.3121401-6-eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Eric Blake [Mon, 15 Sep 2025 21:37:26 +0000 (16:37 -0500)]
block: Allow drivers to control protocol prefix at creation
This patch is pure refactoring: instead of hard-coding permission to
use a protocol prefix when creating an image, the drivers can now pass
in a parameter, comparable to what they could already do for opening a
pre-existing image. This patch is purely mechanical (all drivers pass
in true for now), but it will enable the next patch to cater to
drivers that want to differ in behavior for the primary image vs. any
secondary images that are opened at the same time as creating the
primary image.
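A rough sketch of what such a parameter could look like, assuming a protocol prefix is recognised as a ':' that appears before any '/'. Both function names are hypothetical and the check is simplified compared to whatever the block layer really does.

```c
#include <stdbool.h>
#include <string.h>

/* True if 'name' looks like "proto:..." (a ':' before any '/'). */
bool looks_like_protocol(const char *name)
{
    size_t n = strcspn(name, ":/");

    return n > 0 && name[n] == ':';
}

/*
 * Hypothetical creation-time check: the caller decides whether a
 * protocol prefix is acceptable for the file being created.
 */
bool creation_name_ok(const char *name, bool allow_protocol_prefix)
{
    return allow_protocol_prefix || !looks_like_protocol(name);
}
```

A driver that passes true keeps today's behaviour; one that passes false for a secondary file would reject a name such as nbd://localhost:10809/ while still accepting datastore.raw.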
Signed-off-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250915213919.3121401-5-eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Thomas Huth [Tue, 14 Oct 2025 10:41:42 +0000 (12:41 +0200)]
tests/qemu-iotest: Add more image formats to the thorough testing
Now that the "check" script is a little bit smarter with providing
a list of tests that are supported for an image format, we can also
add more image formats that can be used for generic block layer
testing. (Note: qcow1 and luks are not added because some tests
there currently fail, and other formats like bochs, cloop, dmg and
vvfat do not work with the generic tests and thus would only get
skipped if we tried to add them here.)
Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251014104142.1281028-4-thuth@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Thomas Huth [Tue, 14 Oct 2025 10:41:41 +0000 (12:41 +0200)]
tests/qemu-iotests: Improve the dry run list to speed up thorough testing
When running the tests in thorough mode, e.g. with:
make -j$(nproc) check SPEED=thorough
we currently always get a huge number of tests that the test runner
tries to execute (2457 in my case), but a large share of them are
only skipped (1099 in my case, meaning that only 1358 got executed).
This happens because we try to run the whole set of iotests for multiple
image formats while a lot of the tests can run with only one particular
format and thus are marked as SKIP during execution. This is quite a
waste of time during each test run, and also unnecessarily blows up the
displayed list of executed tests in the console output.
Thus let's try to be a little bit smarter: If the "check" script is run
with "-n" and an image format switch (like "-qed") at the same time (which
is what we do for discovering the tests for the meson test runner already),
only report the tests that likely support the given format instead of
providing the whole list of all tests. We can determine whether a test
supports a format or not by looking at the lines in the file that contain
a "supported_fmt" or "unsupported_fmt" statement. This is only a heuristic,
of course, but it is good enough for running the iotests via "make
check-block" - I double-checked that the list of executed tests does not
get changed by this patch, it's only the tests that are skipped anyway that
are now not run anymore.
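The heuristic could look roughly like the sketch below. It is written in C only for illustration (the "check" script itself is not C), likely_supports() is a made-up name, and treating the word "generic" as "any format" is an assumption of this sketch.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

static bool is_word_char(char c)
{
    return isalnum((unsigned char)c) || c == '_' || c == '-';
}

/* True if 'word' occurs in 'line' as a whole token, not as a substring. */
static bool has_token(const char *line, const char *word)
{
    size_t n = strlen(word);
    const char *p = line;

    while ((p = strstr(p, word)) != NULL) {
        bool left_ok = (p == line) || !is_word_char(p[-1]);
        bool right_ok = !is_word_char(p[n]);

        if (left_ok && right_ok) {
            return true;
        }
        p++;
    }
    return false;
}

/* Guess from the test's source lines whether it can run with 'fmt'. */
bool likely_supports(const char *const *lines, size_t nlines, const char *fmt)
{
    for (size_t i = 0; i < nlines; i++) {
        /* Test "unsupported_fmt" first: it contains "supported_fmt". */
        if (strstr(lines[i], "unsupported_fmt")) {
            if (has_token(lines[i], fmt)) {
                return false;
            }
        } else if (strstr(lines[i], "supported_fmt")) {
            if (!has_token(lines[i], fmt) &&
                !has_token(lines[i], "generic")) {
                return false;
            }
        }
    }
    return true;   /* no statement found: assume the test may run */
}
```

Because the answer is only a guess, a test that slips through is still skipped at run time, which matches the 74 remaining skips reported in this commit message.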
This way the total number of tests drops from 2457 to 1432 for me, and
the number of skipped tests drops from 1099 to just 74 (meaning that we
still properly run 1432 - 74 = 1358 tests as we did before).
Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251014104142.1281028-3-thuth@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Thomas Huth [Tue, 14 Oct 2025 10:41:40 +0000 (12:41 +0200)]
tests/qemu-iotests/184: Fix skip message for qemu-img without throttle
If qemu-img does not support throttling, test 184 currently skips
with the message:
not suitable for this image format: raw
But that's wrong, it's not about the image format, it's about the
throttling not being available in qemu-img. Thus fix this by using
_notrun with a proper message instead.
Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251014104142.1281028-2-thuth@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2: put discards in discard queue when discard-no-unref is enabled
When discard-no-unref is enabled, discards are not queued as they
should be.
This has been broken since discard-no-unref was added.
Add a helper function qcow2_discard_cluster which handles some common
checks and calls the queue_discard function if needed to add the
discard request to the queue.
Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
Message-ID: <20250513132628.1055549-3-jean-louis@dupond.be> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
qcow2: rename update_refcount_discard to queue_discard
The function just queues discards, and doesn't do any refcount change.
So let's change the function name to align with its function.
Signed-off-by: Jean-Louis Dupond <jean-louis@dupond.be>
Message-ID: <20250513132628.1055549-2-jean-louis@dupond.be> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Akihiko Odaki [Thu, 23 Oct 2025 08:10:59 +0000 (17:10 +0900)]
iotests: Run iotests with sanitizers
Commit 2cc4d1c5eab1 ("tests/check-block: Skip iotests when sanitizers
are enabled") changed iotests to skip when sanitizers are enabled.
The rationale is that AddressSanitizer emits warnings and reports leaks,
which results in test breakage. Later, sanitizers that are enabled for
production environments (safe-stack and cfi-icall) were exempted.
However, this approach has a few problems.
- It requires a rebuild to disable sanitizers if the existing build has
them enabled.
- It disables other useful non-production sanitizers.
- The exemption of safe-stack and cfi-icall is not correctly
implemented, so qemu-iotests are incorrectly enabled whenever either
safe-stack or cfi-icall is enabled, even if another sanitizer like
AddressSanitizer is enabled as well.
To solve these problems, direct AddressSanitizer warnings to separate
files to avoid changing the test results, and selectively disable
leak detection at runtime instead of requiring all sanitizers to be
disabled at build time.
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Message-ID: <20251023-iotests-v1-2-fab143ca4c2f@rsg.ci.i.u-tokyo.ac.jp> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Akihiko Odaki [Thu, 23 Oct 2025 08:10:58 +0000 (17:10 +0900)]
qemu-img: Fix amend option parse error handling
qemu_opts_del(opts) dereferences opts->list, which is the old amend_opts
pointer. That pointer can be left dangling by
qemu_opts_append(amend_opts, bs->drv->create_opts), which then causes a
use-after-free.
Fix the potential use-after-free by moving the qemu_opts_del() call
before the qemu_opts_append() call.
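The hazard and the reordering can be modelled with a toy example. OptsListSketch, OptsSketch, list_append() and opts_del() are invented stand-ins, not the QemuOpts API; the only property borrowed from the text above is that appending may invalidate the old list pointer while an opts object still refers to it.

```c
#include <stdlib.h>

typedef struct {
    int nopts;   /* number of opts objects attached to this list */
    int ndesc;   /* number of option descriptions */
} OptsListSketch;

typedef struct {
    OptsListSketch *list;   /* back pointer, dereferenced on deletion */
} OptsSketch;

/* Grow the list; the old allocation is released, so 'old' dangles. */
OptsListSketch *list_append(OptsListSketch *old, int extra)
{
    OptsListSketch *grown = malloc(sizeof(*grown));

    if (!grown) {
        return old;
    }
    *grown = *old;
    grown->ndesc += extra;
    free(old);
    return grown;
}

/* Deleting an opts object touches the list it belongs to. */
void opts_del(OptsSketch *o)
{
    o->list->nopts--;
    free(o);
}

/* Safe order: delete the opts first, append to the list afterwards. */
int demo_safe_order(void)
{
    OptsListSketch *list = calloc(1, sizeof(*list));
    OptsSketch *opts = malloc(sizeof(*opts));

    if (!list || !opts) {
        free(list);
        free(opts);
        return -1;
    }
    list->nopts = 1;
    opts->list = list;

    opts_del(opts);                /* opts->list is still valid here */
    list = list_append(list, 3);   /* may invalidate the old pointer */

    int remaining = list->nopts;
    free(list);
    return remaining;
}
```

Swapping the two calls in demo_safe_order() would make opts_del() write through a freed pointer, which is the pattern this commit removes.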
Signed-off-by: Akihiko Odaki <odaki@rsg.ci.i.u-tokyo.ac.jp>
Message-ID: <20251023-iotests-v1-1-fab143ca4c2f@rsg.ci.i.u-tokyo.ac.jp> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Detaching filters using detach_subchain=true can cause segfaults as
described in #3149.
More specifically, this was observed when executing concurrent
block-stream and query-named-block-nodes. block-stream adds a
copy-on-read filter as the main BDS for the blockjob; that filter was
dropped with detach_subchain=true but not unref'd until the blockjob
was free'd. Because query-named-block-nodes assumes that a filter will
always have exactly one child, it caused a segfault when it observed the
detached filter. Stacktrace:
0 bdrv_refresh_filename (bs=0x5efed72f8350)
at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:8082
1 0x00005efea73cf9dc in bdrv_block_device_info
(blk=0x0, bs=0x5efed72f8350, flat=true, errp=0x7ffeb829ebd8)
at block/qapi.c:62
2 0x00005efea7391ed3 in bdrv_named_nodes_list
(flat=<optimized out>, errp=0x7ffeb829ebd8)
at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/block.c:6275
3 0x00005efea7471993 in qmp_query_named_block_nodes
(has_flat=<optimized out>, flat=<optimized out>, errp=0x7ffeb829ebd8)
at /usr/src/qemu-1:10.1.0+ds-5ubuntu2/b/qemu/blockdev.c:2834
4 qmp_marshal_query_named_block_nodes
(args=<optimized out>, ret=0x7f2b753beec0, errp=0x7f2b753beec8)
at qapi/qapi-commands-block-core.c:553
5 0x00005efea74f03a5 in do_qmp_dispatch_bh (opaque=0x7f2b753beed0)
at qapi/qmp-dispatch.c:128
6 0x00005efea75108e6 in aio_bh_poll (ctx=0x5efed6f3f430)
at util/async.c:219
7 0x00005efea74ffdb2 in aio_dispatch (ctx=0x5efed6f3f430)
at util/aio-posix.c:436
8 0x00005efea7512846 in aio_ctx_dispatch (source=<optimized out>,
callback=<optimized out>,user_data=<optimized out>)
at util/async.c:361
9 0x00007f2b77809bfb in ?? ()
from /lib/x86_64-linux-gnu/libglib-2.0.so.0
10 0x00007f2b77809e70 in g_main_context_dispatch ()
from /lib/x86_64-linux-gnu/libglib-2.0.so.0
11 0x00005efea7517228 in glib_pollfds_poll () at util/main-loop.c:287
12 os_host_main_loop_wait (timeout=0) at util/main-loop.c:310
13 main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:589
14 0x00005efea7140482 in qemu_main_loop () at system/runstate.c:905
15 0x00005efea744e4e8 in qemu_default_main (opaque=opaque@entry=0x0)
at system/main.c:50
16 0x00005efea6e76319 in main
(argc=<optimized out>, argv=<optimized out>)
at system/main.c:93
As discussed in 20251024-second-fix-3149-v1-1-d997fa3d5ce2@canonical.com,
a filter should not exist without children in the first place; therefore,
drop the parameter entirely as it is only used for filters.
After this change, a blockdev-backup job's copy-before-write filter will
hold references to its children until the filter is unref'd. This causes
an additional flush during bdrv_close, so also update iotest 257.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/3149 Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Wesley Hershberger <wesley.hershberger@canonical.com> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@yandex-team.ru>
Message-ID: <20251029-third-fix-3149-v2-1-94932bb404f4@canonical.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Yeqi Fu [Tue, 7 Oct 2025 16:35:11 +0000 (18:35 +0200)]
block: replace TABs with space
Bring the block files in line with the QEMU coding style, with spaces
for indentation. This patch partially resolves issue 371.
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/371 Signed-off-by: Yeqi Fu <fufuyqqqqqq@gmail.com>
Message-ID: <20230325085224.23842-1-fufuyqqqqqq@gmail.com>
[thuth: Rebased the patch to the current master branch] Signed-off-by: Thomas Huth <thuth@redhat.com>
Message-ID: <20251007163511.334178-1-thuth@redhat.com>
[kwolf: Fixed up vertical alignment] Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:33 +0000 (21:29 -0500)]
block/io_uring: use non-vectored read/write when possible
The io_uring_prep_readv2/writev2() man pages recommend using the
non-vectored read/write operations when possible for performance
reasons.
I didn't measure a significant difference but it doesn't hurt to have
this optimization in place.
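The selection logic is simple enough to sketch without liburing. The types and pick_read_op() below are hypothetical; in real code the two branches would presumably map to a plain read preparation versus a vectored one.

```c
#include <stddef.h>
#include <sys/uio.h>

typedef enum { OP_READ, OP_READV } ReadOpSketch;

typedef struct {
    ReadOpSketch op;
    void *buf;                 /* used with OP_READ */
    size_t len;                /* used with OP_READ */
    const struct iovec *iov;   /* used with OP_READV */
    int niov;                  /* used with OP_READV */
} ReadRequestSketch;

/* Use the non-vectored operation whenever there is exactly one buffer. */
ReadRequestSketch pick_read_op(const struct iovec *iov, int niov)
{
    ReadRequestSketch r = { 0 };

    if (niov == 1) {
        r.op = OP_READ;
        r.buf = iov[0].iov_base;
        r.len = iov[0].iov_len;
    } else {
        r.op = OP_READV;
        r.iov = iov;
        r.niov = niov;
    }
    return r;
}
```

The same test on the iovec count would apply to writes.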
Suggested-by: Eric Blake <eblake@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-16-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:32 +0000 (21:29 -0500)]
block/io_uring: use aio_add_sqe()
AioContext has its own io_uring instance for file descriptor monitoring.
The disk I/O io_uring code was developed separately. Originally I
thought the characteristics of file descriptor monitoring and disk I/O
were too different, requiring separate io_uring instances.
Now it has become clear to me that it's feasible to share a single
io_uring instance for file descriptor monitoring and disk I/O. We're not
using io_uring's IOPOLL feature or anything else that would require a
separate instance.
Unify block/io_uring.c and util/fdmon-io_uring.c using the new
aio_add_sqe() API that allows user-defined io_uring sqe submission. Now
block/io_uring.c just needs to submit readv/writev/fsync and most of the
io_uring-specific logic is handled by fdmon-io_uring.c.
There are two immediate advantages:
1. Fewer system calls. There is no need to monitor the disk I/O io_uring
ring fd from the file descriptor monitoring io_uring instance. Disk
I/O completions are now picked up directly. Also, sqes are
accumulated in the sq ring until the end of the event loop iteration
and there are fewer io_uring_enter(2) syscalls.
2. Less code duplication.
Note that error_setg() messages are not supposed to end with
punctuation, so I removed a '.' for the non-io_uring build error
message.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-15-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:31 +0000 (21:29 -0500)]
aio-posix: add aio_add_sqe() API for user-defined io_uring requests
Introduce the aio_add_sqe() API for submitting io_uring requests in the
current AioContext. This allows other components in QEMU, like the block
layer, to take advantage of io_uring features without creating their own
io_uring context.
This API supports nested event loops just like file descriptor
monitoring and BHs do. This comes at a complexity cost: CQE callbacks
must be placed on a list so that nested event loops can invoke pending
CQE callbacks from parent event loops. If you're wondering why
CqeHandler exists instead of just a callback function pointer, this is
why.
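A minimal model of that design, under the assumption that completed handlers sit on a singly linked ready list. CqeHandlerSketch and the helpers are illustrative names and do not mirror the real CqeHandler layout.

```c
#include <stddef.h>

typedef struct CqeHandlerSketch CqeHandlerSketch;
struct CqeHandlerSketch {
    void (*cb)(CqeHandlerSketch *h);
    int id;
    CqeHandlerSketch *next;
};

static CqeHandlerSketch *ready_head, *ready_tail;
int order[8];      /* ids in the order their callbacks ran */
int order_len;

static void ready_push(CqeHandlerSketch *h)
{
    h->next = NULL;
    if (ready_tail) {
        ready_tail->next = h;
    } else {
        ready_head = h;
    }
    ready_tail = h;
}

/* Run every pending handler; may be re-entered from inside a callback. */
int dispatch_ready(void)
{
    int n = 0;

    while (ready_head) {
        CqeHandlerSketch *h = ready_head;

        ready_head = h->next;          /* unlink before calling out */
        if (!ready_head) {
            ready_tail = NULL;
        }
        h->cb(h);
        n++;
    }
    return n;
}

static void record_cb(CqeHandlerSketch *h)
{
    order[order_len++] = h->id;
}

/* Simulates a callback that runs a nested event loop. */
static void nested_cb(CqeHandlerSketch *h)
{
    order[order_len++] = h->id;
    dispatch_ready();                  /* picks up the parent's handlers */
    order[order_len++] = -h->id;       /* marks the end of this callback */
}

void demo_nested_dispatch(void)
{
    static CqeHandlerSketch h1 = { nested_cb, 1, NULL };
    static CqeHandlerSketch h2 = { record_cb, 2, NULL };
    static CqeHandlerSketch h3 = { record_cb, 3, NULL };

    order_len = 0;
    ready_push(&h1);
    ready_push(&h2);
    ready_push(&h3);
    dispatch_ready();
}
```

Handlers 2 and 3 run inside handler 1's nested loop because they live on a shared list rather than in the outer loop's local state; a bare function pointer would have nowhere to be queued.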
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-14-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:30 +0000 (21:29 -0500)]
aio-posix: add fdmon_ops->dispatch()
The ppoll and epoll file descriptor monitoring implementations rely on
the event loop's generic file descriptor, timer, and BH dispatch code to
invoke user callbacks.
The io_uring file descriptor monitoring implementation will need
io_uring-specific dispatch logic for CQE handlers for custom SQEs.
Introduce a new FDMonOps ->dispatch() callback that allows file
descriptor monitoring implementations to invoke user callbacks. The next
patch will use this new callback.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-13-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:29 +0000 (21:29 -0500)]
aio-posix: unindent fdmon_io_uring_destroy()
Reduce the level of indentation to make further code changes easier to
read.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-12-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
io_uring may not be available at runtime due to system policies (e.g.
the io_uring_disabled sysctl) or creation could fail due to file
descriptor resource limits.
Handle failure scenarios as follows:
If another AioContext already has io_uring, then fail AioContext
creation so that the aio_add_sqe() API is available uniformly from all
QEMU threads. Otherwise fall back to epoll(7) if io_uring is
unavailable.
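The policy above boils down to a three-way decision, sketched here with hypothetical names (SetupOutcome and choose_fdmon() do not exist in QEMU):

```c
#include <stdbool.h>

typedef enum {
    SETUP_IO_URING,        /* io_uring initialised, use it */
    SETUP_EPOLL_FALLBACK,  /* nobody has io_uring, fall back */
    SETUP_FAIL             /* would be inconsistent, refuse */
} SetupOutcome;

SetupOutcome choose_fdmon(bool io_uring_init_ok, bool other_ctx_has_io_uring)
{
    if (io_uring_init_ok) {
        return SETUP_IO_URING;
    }
    if (other_ctx_has_io_uring) {
        /* aio_add_sqe() must be available from all threads or none */
        return SETUP_FAIL;
    }
    return SETUP_EPOLL_FALLBACK;
}
```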
Notes:
- Update the comment about selecting the fastest fdmon implementation.
At this point it's not about speed anymore, it's about aio_add_sqe()
API availability.
- Uppercase the error message when converting from error_report() to
error_setg_errno() for consistency (but there are instances of
lowercase in the codebase).
- It's easier to move the #ifdefs from aio-posix.h to aio-posix.c.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-11-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:27 +0000 (21:29 -0500)]
aio: add errp argument to aio_context_setup()
When aio_context_new() -> aio_context_setup() fails at startup it
doesn't really matter whether errors are returned to the caller or the
process terminates immediately.
However, it is not acceptable to terminate when hotplugging --object
iothread at runtime. Refactor aio_context_setup() so that errors can be
propagated. The next commit will set errp when fdmon_io_uring_setup()
fails.
Suggested-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-10-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:26 +0000 (21:29 -0500)]
aio: free AioContext when aio_context_new() fails
g_source_destroy() only removes the GSource from the GMainContext it's
attached to, if any. It does not free it.
Use g_source_unref() instead so that the AioContext (which embeds a
GSource) is freed. There is no need to call g_source_destroy() in
aio_context_new() because the GSource isn't attached to a GMainContext
yet.
aio_ctx_finalize() expects everything to be set up already, so introduce
the new ctx->initialized boolean and do nothing when called with
!initialized. This also requires moving aio_context_setup() down after
event_notifier_init() since aio_ctx_finalize() won't release any
resources that aio_context_setup() acquired.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-9-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:25 +0000 (21:29 -0500)]
aio: remove aio_context_use_g_source()
There is no need for aio_context_use_g_source() now that epoll(7) and
io_uring(7) file descriptor monitoring works with the glib event loop.
AioContext doesn't need to be notified that GSource is being used.
On hosts with io_uring support this now enables fdmon-io_uring.c by
default, replacing fdmon-poll.c and fdmon-epoll.c. In other words, the
event loop will use io_uring!
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-8-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:24 +0000 (21:29 -0500)]
aio-posix: integrate fdmon into glib event loop
AioContext's glib integration only supports ppoll(2) file descriptor
monitoring. epoll(7) and io_uring(7) disable themselves and switch back
to ppoll(2) when the glib event loop is used. The main loop thread
cannot use epoll(7) or io_uring(7) because it always uses the glib event
loop.
Future QEMU features may require io_uring(7). One example is uring_cmd
support in FUSE exports. Each feature could create its own io_uring(7)
context and integrate it into the event loop, but this is inefficient
due to extra syscalls. It would be more efficient to reuse the
AioContext's existing fdmon-io_uring.c io_uring(7) context because
fdmon-io_uring.c will already be active on systems where Linux io_uring
is available.
In order to keep fdmon-io_uring.c's AioContext operational even when the
glib event loop is used, extend FDMonOps with an API similar to
GSourceFuncs so that file descriptor monitoring can integrate into the
glib event loop.
A quick summary of the GSourceFuncs API:
- prepare() is called each event loop iteration before waiting for file
descriptors and timers.
- check() is called to determine whether events are ready to be
dispatched after waiting.
- dispatch() is called to process events.
More details here: https://docs.gtk.org/glib/struct.SourceFuncs.html
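As a rough illustration of how the three callbacks fit together in one loop iteration, here is a glib-free sketch. SourceFuncsSketch and CounterSource are invented for this example, and the actual wait on file descriptors is left out.

```c
#include <stdbool.h>

typedef struct {
    bool (*prepare)(void *src, int *timeout_ms); /* ready without waiting? */
    bool (*check)(void *src);                    /* ready after waiting? */
    void (*dispatch)(void *src);                 /* process events */
} SourceFuncsSketch;

/* One event loop iteration; returns true if events were dispatched. */
bool run_one_iteration(const SourceFuncsSketch *f, void *src)
{
    int timeout_ms = -1;
    bool ready = f->prepare(src, &timeout_ms);

    if (!ready) {
        /* A real loop would wait here for up to timeout_ms. */
        ready = f->check(src);
    }
    if (ready) {
        f->dispatch(src);
    }
    return ready;
}

/* Demo source: 'pending' events are moved to 'dispatched'. */
typedef struct {
    int pending;
    int dispatched;
} CounterSource;

static bool counter_prepare(void *src, int *timeout_ms)
{
    (void)src;
    *timeout_ms = 0;   /* do not block in this demo */
    return false;      /* always ask check() after the wait */
}

static bool counter_check(void *src)
{
    return ((CounterSource *)src)->pending > 0;
}

static void counter_dispatch(void *src)
{
    CounterSource *c = src;

    c->dispatched += c->pending;
    c->pending = 0;
}

const SourceFuncsSketch counter_funcs = {
    counter_prepare, counter_check, counter_dispatch
};
```

An fdmon implementation that offers the same three hooks can be driven by the glib loop in the same way.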
Move the ppoll(2)-specific code from aio-posix.c into fdmon-poll.c and
also implement epoll(7)- and io_uring(7)-specific file descriptor
monitoring code for glib event loops.
Note that it's still faster to use aio_poll() rather than the glib event
loop since glib waits for file descriptor activity with ppoll(2) and
does not support adaptive polling. But at least epoll(7) and io_uring(7)
now work in glib event loops.
Splitting this into multiple commits without temporarily breaking
AioContext proved difficult so this commit makes all the changes. The
next commit will remove the aio_context_use_g_source() API because it is
no longer needed.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-7-stefanha@redhat.com>
[kwolf: Build fixes; fix AioContext.list_lock use after destroy] Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:23 +0000 (21:29 -0500)]
tests/unit: skip test-nested-aio-poll with io_uring
test-nested-aio-poll relies on internal details of how fdmon-poll.c
handles AioContext polling. Skip it when other fdmon implementations are
in use.
The reason why fdmon-io_uring.c behaves differently from fdmon-poll.c is
that its fdmon_ops->need_wait() function returns true when
io_uring_enter(2) must be called (e.g. to submit pending SQEs).
AioContext polling is skipped when ->need_wait() returns true, so the
test case will never enter AioContext polling mode with
fdmon-io_uring.c.
Restrict this test to fdmon-poll.c and drop the
aio_context_use_g_source() call since it's no longer necessary.
Note that this test is only built on POSIX systems so it is safe to
include "util/aio-posix.h".
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20251104022933.618123-6-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:22 +0000 (21:29 -0500)]
aio-posix: keep polling enabled with fdmon-io_uring.c
Commit 816a430c517e ("util/aio: Defer disabling poll mode as long as
possible") kept polling enabled when the event loop timeout is 0. Since
there is no timeout the event loop will continue immediately and the
overhead of disabling and re-enabling polling can be avoided.
fdmon-io_uring.c is unable to take advantage of this optimization
because its ->need_wait() function returns true whenever there are new
io_uring SQEs to submit:
  if (timeout || ctx->fdmon_ops->need_wait(ctx)) {
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Polling will be disabled even when timeout == 0.
Extend the optimization to handle the case when need_wait() returns true
and timeout == 0.
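One possible reading of the change, reduced to two predicates. Both function names are made up, and the "after" version is an assumption about the intent stated above, not a copy of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Behaviour described as the problem: need_wait alone disables polling. */
bool disable_polling_before(int64_t timeout, bool need_wait)
{
    return timeout || need_wait;
}

/*
 * Assumed behaviour after the change: with timeout == 0 the loop
 * continues immediately, so polling stays enabled even if need_wait
 * is true.
 */
bool disable_polling_after(int64_t timeout, bool need_wait)
{
    (void)need_wait;
    return timeout != 0;
}
```

The two predicates differ only for timeout == 0 with need_wait true, which is the fdmon-io_uring.c case with pending SQEs.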
Cc: Chao Gao <chao.gao@intel.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-5-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:21 +0000 (21:29 -0500)]
aio-posix: fix spurious return from ->wait() due to signals
io_uring_enter(2) only returns -EINTR in some cases when interrupted by
a signal. Therefore the while loop in fdmon_io_uring_wait() is
incomplete and can lead to a spurious early return.
Handle the case when a signal interrupts io_uring_enter(2) but the
syscall returns the number of SQEs submitted (that takes priority over
-EINTR).
This patch probably makes little difference for QEMU, but the test suite
relies on the exact pattern of aio_poll() return values, so it's best to
hide this io_uring syscall interface quirk.
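The retry condition can be illustrated without liburing by scripting the return values of a fake syscall. Everything below is a hypothetical stand-in (`FakeRing`, `FakeStep`, `fake_submit_and_wait()` and `wait_until_ready()` are not QEMU or liburing names), and the loop condition is an assumption about the shape of the fix rather than the actual patch: keep waiting both on -EINTR and when a non-negative value comes back before enough completions are ready.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* One scripted result of the fake syscall. */
typedef struct {
    int ret;             /* value the fake syscall returns */
    unsigned cq_ready;   /* completions available after the call */
} FakeStep;

/* Hypothetical stand-in for an io_uring instance. */
typedef struct {
    const FakeStep *steps;
    size_t nsteps;
    size_t pos;
    unsigned cq_ready;
    unsigned calls;
} FakeRing;

/* Replays the script; the last step repeats once the script runs out. */
static int fake_submit_and_wait(FakeRing *r, unsigned wait_nr)
{
    const FakeStep *s;

    (void)wait_nr;
    assert(r->nsteps > 0);
    s = &r->steps[r->pos < r->nsteps ? r->pos++ : r->nsteps - 1];
    r->cq_ready = s->cq_ready;
    r->calls++;
    return s->ret;
}

/*
 * Retry on -EINTR and also when the call "succeeded" (for example by
 * returning the number of SQEs submitted) without enough completions
 * being ready, which is how a signal can hide behind a success value.
 */
static int wait_until_ready(FakeRing *r, unsigned wait_nr)
{
    int ret;

    do {
        ret = fake_submit_and_wait(r, wait_nr);
    } while (ret == -EINTR || (ret >= 0 && r->cq_ready < wait_nr));

    return ret;
}
```

A script of 1, -EINTR, -EINTR, 0 (with a completion appearing only on the last call) makes `wait_until_ready()` call the fake syscall four times instead of returning early after the first one.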
Here is the strace of test-aio receiving 3 SIGCONT signals after this
fix has been applied. Notice how the io_uring_enter(2) return value is 1
the first time because an SQE was submitted, but -EINTR the other times:
Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-4-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
io_uring_prep_timeout() stashes a pointer to the timespec struct rather
than copying its fields. That means the struct must live until after the
SQE has been submitted by io_uring_enter(2). add_timeout_sqe() violates
this constraint because the SQE is not submitted within the function.
Inline add_timeout_sqe() into fdmon_io_uring_wait() so that the struct
lives at least as long as io_uring_enter(2).
This fixes random hangs (bogus timeout values) when the kernel loads
undefined timespec struct values from userspace after the original
struct on the stack has been destroyed.
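The lifetime rule can be shown with a small mock that, like the behaviour described above, stores a pointer to the timespec instead of copying it. `FakeSqe`, `fake_prep_timeout()`, `fake_submit()` and `wait_with_timeout()` are hypothetical names used only for this sketch; they are not the liburing API or the QEMU functions.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for an SQE: it keeps a pointer, not a copy. */
typedef struct {
    const struct timespec *ts;
} FakeSqe;

static void fake_prep_timeout(FakeSqe *sqe, const struct timespec *ts)
{
    sqe->ts = ts;
}

/* Stand-in for submission: the timespec is only read at this point. */
static long fake_submit(const FakeSqe *sqe)
{
    assert(sqe->ts != NULL);
    return (long)sqe->ts->tv_sec * 1000 + sqe->ts->tv_nsec / 1000000;
}

/*
 * Safe pattern: the timespec is declared in the same function that
 * submits, so it is still alive when fake_submit() dereferences it.
 *
 * The broken pattern would be a helper that declares `ts` locally,
 * calls fake_prep_timeout(), and returns before submission, leaving
 * the SQE with a dangling pointer into a dead stack frame.
 */
static long wait_with_timeout(long timeout_ms)
{
    struct timespec ts = {
        .tv_sec = timeout_ms / 1000,
        .tv_nsec = (timeout_ms % 1000) * 1000000L,
    };
    FakeSqe sqe;

    fake_prep_timeout(&sqe, &ts);
    return fake_submit(&sqe);   /* returns the timeout seen, in ms */
}
```

Keeping preparation and submission in one stack frame is the simplest way to satisfy the constraint; a heap-allocated or per-context timespec would be an alternative design with a longer lifetime.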
Reported-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Message-ID: <20251104022933.618123-3-stefanha@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Stefan Hajnoczi [Tue, 4 Nov 2025 02:29:19 +0000 (21:29 -0500)]
aio-posix: fix race between io_uring CQE and AioHandler deletion
When an AioHandler is enqueued on ctx->submit_list for removal, the
fill_sq_ring() function will submit an io_uring POLL_REMOVE operation to
cancel the in-flight POLL_ADD operation.
There is a race when another thread enqueues an AioHandler for deletion
on ctx->submit_list when the POLL_ADD CQE has already appeared. In that
case POLL_REMOVE is unnecessary. The code already handled this, but
forgot that the AioHandler itself is still on ctx->submit_list when the
POLL_ADD CQE is being processed. It's unsafe to delete the AioHandler at
that point in time (use-after-free).
Solve this problem by keeping the AioHandler alive but setting a flag so
that it will be deleted by fill_sq_ring() when it runs.
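The deferred-deletion idea can be sketched as follows. This is a single-threaded illustration under stated assumptions: `Handler`, `Ctx`, `enqueue()`, `on_completion()` and `drain_submit_list()` are hypothetical names, the real code has to deal with concurrent threads, and the sketch only shows why a handler that is still linked on a list must be flagged rather than freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical handler with an intrusive link for the submit list. */
typedef struct Handler {
    struct Handler *next_submit;
    bool on_submit_list;
    bool delete_pending;   /* set instead of freeing while still linked */
} Handler;

typedef struct {
    Handler *submit_list;
    unsigned freed;        /* counts deletions, for illustration only */
} Ctx;

static void enqueue(Ctx *ctx, Handler *h)
{
    assert(!h->on_submit_list);
    h->next_submit = ctx->submit_list;
    ctx->submit_list = h;
    h->on_submit_list = true;
}

/* Completion path: freeing a linked handler would leave a dangling link. */
static void on_completion(Ctx *ctx, Handler *h)
{
    if (h->on_submit_list) {
        h->delete_pending = true;   /* defer to drain_submit_list() */
        return;
    }
    free(h);
    ctx->freed++;
}

/* Submit path: unlinks every handler and performs deferred deletions. */
static void drain_submit_list(Ctx *ctx)
{
    Handler *h = ctx->submit_list;

    ctx->submit_list = NULL;
    while (h) {
        Handler *next = h->next_submit;

        h->on_submit_list = false;
        if (h->delete_pending) {
            free(h);
            ctx->freed++;
        }
        h = next;
    }
}
```

If a completion arrives while the handler is still enqueued, `on_completion()` only sets the flag, and the handler is freed later when `drain_submit_list()` unlinks it.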
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Eric Blake <eblake@redhat.com> Reviewed-by: Kevin Wolf <kwolf@redhat.com>
Message-ID: <20251104022933.618123-2-stefanha@redhat.com> Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Merge tag 'pull-request-2025-11-11' of https://gitlab.com/thuth/qemu into staging
* Fix some issues in the functional tests that pylint complains about
# -----BEGIN PGP SIGNATURE-----
#
# iQJFBAABCgAvFiEEJ7iIR+7gJQEY8+q5LtnXdP5wLbUFAmkTDfQRHHRodXRoQHJl
# ZGhhdC5jb20ACgkQLtnXdP5wLbVj8RAAhOSNyBa81eFJXydkqp0qrQYw6WGT/mAP
# Zn5oTm6NhsgLbUKgbqYQIAivE7VNVWfdhj7aOO9wYM1GfhCk/LOHZWBTNXxFF/uH
# m7ICV5dtSF2zE1AdsWn2rB6vPocc/VMDCHhIzfC7AYlEA7AGuu/O2QALE8H/qOS5
# mQ3+Fuq2EYkOKxKsSnUcj+ZPnUA3NlIF2CTeY0jTQFrwO5RKU3jsScm+uOZZJycn
# DTOzJTymIBGNSlFMNEoj4AhoY43SDdcQcZhwvAPzHZZTVhotJxHf5Fvr7XnDW5VA
# zTA7xZgnY0eAtvzZ4ihyT9BfAHdk62WgBrUeohQ1Ggf/Bo11DVCJtkQ4iY5bY4uI
# yalO7QSMi04PudeIRJmKTAhR6zhDZb/XijtrIcFn6ypTnOEMw8V7MJt9qXB76I/X
# HDZ9859a0//8F70I3mAxDKj8ve/Y6ACuY7pOwKR1Ea0iuM47Dgw9jsuUKRRPUZ+p
# rhJiQ10j8B6mxI0HCqEr8S47zMbW7uJViVYLT7yYKL7vokr96mm08/gEOI07cc88
# CKw3FocW2/suOdFCJVsIrjjq/ySVv0GTAkIeGUaefnY13dmq8ZILmT+GOOf695s9
# PDCoPWzdCY5n0OxToMUosJkQKbFp2F2ls5IGcEHUwxkqPT68/gsqb1VeC8W7x6Gs
# nJGM9ZR7XcM=
# =FhJ1
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 11 Nov 2025 11:20:36 AM CET
# gpg: using RSA key 27B88847EEE0250118F3EAB92ED9D774FE702DB5
# gpg: issuer "thuth@redhat.com"
# gpg: Good signature from "Thomas Huth <th.huth@gmx.de>" [unknown]
# gpg: aka "Thomas Huth <thuth@redhat.com>" [unknown]
# gpg: aka "Thomas Huth <th.huth@posteo.de>" [unknown]
# gpg: aka "Thomas Huth <huth@tuxfamily.org>" [unknown]
# gpg: WARNING: The key's User ID is not certified with a trusted signature!
# gpg: There is no indication that the signature belongs to the owner.
# Primary key fingerprint: 27B8 8847 EEE0 2501 18F3 EAB9 2ED9 D774 FE70 2DB5
* tag 'pull-request-2025-11-11' of https://gitlab.com/thuth/qemu:
tests/functional/m68k/test_nextcube: Fix issues reported by pylint
tests/functional/mips64el: Silence issues reported by pylint
tests/functional/aarch64/test_device_passthrough: Fix warnings from pylint
tests/functional: Fix problems in testcase.py reported by pylint
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>