Thomas Huth [Tue, 29 Apr 2025 15:21:39 +0000 (17:21 +0200)]
scripts/vmstate-static-checker.py: Allow new name for ghes_addr_le field
ghes_addr_le has been renamed to hw_error_le in commit 652f6d86cbb
("acpi/ghes: better name the offset of the hardware error firmware").
Adjust the checker script to allow that changed field name.
I hit the following error while testing migration in a pure RoCE env:
"-incoming rdma:[::]:8089: RDMA ERROR: You only have RoCE / iWARP devices in your
systems and your management software has specified '[::]', but IPv6 over RoCE /
iWARP is not supported in Linux.#012'."
In our setup, we bind RDMA to an ipv6 address on the target host while connecting
from the source with ipv4. After removing qemu_rdma_broken_ipv6_kernel, migration
just works fine.
Checking the git history, the function was added when rdma migration was
introduced, more than 10 years ago. linux-rdma has improved its RoCE/iWARP
support for ipv6 over the past years. A few fixes back in 2016 seem related
to the issue, eg:
  aeb76df46d11 ("IB/core: Set routable RoCE gid type for ipv4/ipv6 networks")
and other fixes back in 2018, eg:
  052eac6eeb56 ("RDMA/cma: Update RoCE multicast routines to use net namespace")
  8d20a1f0ecd5 ("RDMA/cma: Fix rdma_cm raw IB path setting for RoCE")
  9327c7afdce3 ("RDMA/cma: Provide a function to set RoCE path record L2 parameters")
  5c181bda77f4 ("RDMA/cma: Set default GID type as RoCE when resolving RoCE route")
  3c7f67d1880d ("IB/cma: Fix default RoCE type setting")
  be1d325a3358 ("IB/core: Set RoCEv2 MGID according to spec")
  63a5f483af0e ("IB/cma: Set default gid type to RoCEv2")
So remove the outdated function and its usage.
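For reference, a minimal sketch of what an ipv6 RDMA CM bind on the destination
looks like with librdmacm, i.e. the path the removed workaround used to reject
on "[::]". The function name and port here are illustrative only:

    #include <netinet/in.h>
    #include <string.h>
    #include <rdma/rdma_cma.h>

    /* Bind an RDMA CM listener to the ipv6 wildcard address ("[::]"). */
    static int rdma_listen_ipv6(uint16_t port, struct rdma_cm_id **listen_id)
    {
        struct rdma_event_channel *ch = rdma_create_event_channel();
        struct sockaddr_in6 sin6;

        if (!ch || rdma_create_id(ch, listen_id, NULL, RDMA_PS_TCP)) {
            return -1;
        }
        memset(&sin6, 0, sizeof(sin6));
        sin6.sin6_family = AF_INET6;
        sin6.sin6_addr = in6addr_any;       /* "[::]" */
        sin6.sin6_port = htons(port);
        if (rdma_bind_addr(*listen_id, (struct sockaddr *)&sin6)) {
            return -1;
        }
        /* On current kernels this works for RoCE/iWARP devices too. */
        return rdma_listen(*listen_id, 1);
    }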
Cc: Peter Xu <peterx@redhat.com>
Cc: Li Zhijian <lizhijian@fujitsu.com>
Cc: Yu Zhang <yu.zhang@ionos.com>
Cc: Fabiano Rosas <farosas@suse.de>
Cc: qemu-devel@nongnu.org
Cc: linux-rdma@vger.kernel.org
Cc: michael@flatgalaxy.com
Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
Tested-by: Li zhijian <lizhijian@fujitsu.com>
Reviewed-by: Michael Galaxy <mrgalaxy@nvidia.com>
Link: https://lore.kernel.org/r/20250402051306.6509-1-jinpu.wang@ionos.com
[peterx: some cosmetic changes]
Signed-off-by: Peter Xu <peterx@redhat.com>
Peter Xu [Thu, 24 Apr 2025 22:07:05 +0000 (18:07 -0400)]
migration/postcopy: Spatial locality page hint for preempt mode
The preempt mode postcopy has been around for a while. From a latency
POV, it should always beat vanilla postcopy.
However, one thing is still missing when preempt mode is enabled: the
spatial locality hint when there are page requests from the destination
side.
In vanilla postcopy, whenever a page request is dequeued, it updates the
PSS of the precopy background stream, so that after a page request the
background thread will move on to the pages right after whatever was
requested. That is pretty much the natural behavior when there's only one
channel anyway, and one scanner to send the pages.
Preempt mode didn't follow that, because preempt mode has its own channel
and its own PSS (which doesn't linearly scan the guest memory, but is
dedicated to resolving pages requested from the destination). So the page
request process and the background migration process are completely
separate.
This patch adds the hint explicitly for preempt mode. With that, whenever
the preempt mode receives a page request on the source, it will service the
remote page fault in the return path, then provide a hint to the
background thread so that we start sending the pages right after the
requested ones in the background, assuming the follow-up pages have a
higher chance of being accessed later.
NOTE: since the background migration thread and the return path thread run
completely concurrently, the hint will not always be applied every single
time. For example, it is possible that the return path thread receives
multiple page requests in a row without the background thread getting the
chance to consume one. In such a case, the preempt thread only provides
the hint if the previous hint has been consumed. After all, there's no
point in queuing hints when we only have one linear scanner.
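A rough sketch of the mechanism, with hypothetical names (the actual QEMU
structures and functions may differ); the single hint slot is rewritten only
once the background scanner has consumed the previous one:

    #include "qemu/atomic.h"

    /* Hypothetical single-slot locality hint, written by the return path
     * thread, consumed by the background migration thread. */
    typedef struct {
        unsigned long page_index;   /* first page after the serviced request */
        bool valid;                 /* set by producer, cleared by consumer */
    } PreemptHint;

    static PreemptHint hint;

    /* Return path thread: after servicing a remote page fault. */
    static void preempt_publish_hint(unsigned long faulted_page)
    {
        if (!qatomic_read(&hint.valid)) {       /* previous hint consumed? */
            hint.page_index = faulted_page + 1;
            qatomic_set(&hint.valid, true);
        }
    }

    /* Background thread: before picking the next page to scan. */
    static void preempt_consume_hint(unsigned long *next_page)
    {
        if (qatomic_read(&hint.valid)) {
            *next_page = hint.page_index;       /* jump the linear scanner */
            qatomic_set(&hint.valid, false);
        }
    }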
This measurably improves the simple sequential memory access pattern
during postcopy (when preempt is on). For random accesses, I can measure a
slight increase of remote page fault latency from ~500us to ~600us; that
could be a trade-off of having such a hint mechanism, and it is still
greatly improved compared to vanilla postcopy on random accesses (~10ms).
The patch was verified by our QE team in a video streaming test case,
reducing the pause of the video from ~1min to a few seconds when switching
over to postcopy with preempt mode.
tests/qtest/migration: consolidate set capabilities
Migration capabilities are set in multiple '.start_hook'
functions for the various tests. Instead, consolidate setting
capabilities in the 'migrate_start_set_capabilities()' function,
which is called from 'migrate_start()'.
Besides simplifying the capabilities setting, this helps
to declutter the qtest sources.
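A hedged sketch of the consolidation; the exact field and helper names are
assumptions based on the existing migration qtest helpers:

    /* Hypothetical: capabilities declared on MigrateStart instead of being
     * toggled in each test's .start_hook. */
    typedef struct {
        /* ... existing MigrateStart fields ... */
        bool caps[MIGRATION_CAPABILITY__MAX];
    } MigrateStart;

    static void migrate_start_set_capabilities(QTestState *from, QTestState *to,
                                               MigrateStart *args)
    {
        for (int i = 0; i < MIGRATION_CAPABILITY__MAX; i++) {
            if (!args->caps[i]) {
                continue;
            }
            /* migrate_set_capability() is the existing qtest helper */
            migrate_set_capability(from, MigrationCapability_str(i), true);
            migrate_set_capability(to, MigrationCapability_str(i), true);
        }
    }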
Add a savevm handler for a module to opt in to sending extra sections right
before postcopy starts, and before the VM is stopped.
RAM will start to use this new savevm handler in the next patch to do flush
and sync for multifd pages.
Note that we choose to do it before the VM is stopped because the only
potential user so far is not sensitive to the VM status, so doing it before
the VM is stopped is preferred over enlarging the postcopy downtime.
It is still a bit unfortunate that we need to introduce such a new savevm
handler just for this one use case; however, it is so far the cleanest
approach.
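A sketch of the shape of the new hook; the hook name here is an assumption,
while SaveVMHandlers itself is the existing struct in
include/migration/register.h:

    /* In SaveVMHandlers, roughly: */
    typedef struct SaveVMHandlers {
        /* ... existing handlers ... */

        /* Hypothetical name: called right before postcopy starts and
         * before the VM is stopped; returns true on success. */
        bool (*save_postcopy_prepare)(QEMUFile *f, void *opaque, Error **errp);
    } SaveVMHandlers;

    /* Caller side, before switching over to postcopy: */
    static bool qemu_savevm_state_postcopy_prepare(QEMUFile *f, Error **errp)
    {
        SaveStateEntry *se;

        QTAILQ_FOREACH(se, &savevm_state.handlers, entry) {
            if (se->ops && se->ops->save_postcopy_prepare &&
                !se->ops->save_postcopy_prepare(f, se->opaque, errp)) {
                return false;
            }
        }
        return true;
    }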
The various logical migration channels don't have a
standardized way of advertising themselves, and their
connections may be seen out of order by the migration
destination. When a new connection arrives, the incoming
migration currently makes use of heuristics to determine
which channel it belongs to.
The next few patches will need to change how the multifd
and postcopy capabilities interact and that affects the
channel discovery heuristic.
Refactor the channel discovery heuristic to make it less
opaque and simplify the subsequent patches.
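As an illustration of the direction only (all names here are hypothetical,
not QEMU's actual identifiers), the refactor can funnel the heuristics
through one function that classifies an incoming connection once:

    /* Hypothetical classification of an incoming migration connection. */
    typedef enum {
        CH_MAIN,        /* the main migration stream */
        CH_MULTIFD,     /* one of the multifd data channels */
        CH_POSTCOPY,    /* the postcopy preempt channel */
    } ChannelType;

    static ChannelType channel_classify(uint32_t magic, bool main_seen)
    {
        /* Channels that advertise themselves carry a magic number. */
        if (magic == MULTIFD_MAGIC) {           /* assumed constant */
            return CH_MULTIFD;
        }
        if (magic == POSTCOPY_MAGIC) {          /* assumed constant */
            return CH_POSTCOPY;
        }
        /* Header-less stream: the first such connection is the main one. */
        return main_seen ? CH_POSTCOPY : CH_MAIN;
    }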
Li Zhijian [Tue, 11 Mar 2025 02:42:21 +0000 (10:42 +0800)]
migration: Add qtest for migration over RDMA
This qtest requires an RDMA (RoCE) link in the host.
In order to make the test work smoothly, introduce
scripts/rdma-migration-helper.sh to detect an existing RoCE link before
running the test.
The test will be skipped if there is no RoCE link available.
# Start of rdma tests
# Running /x86_64/migration/precopy/rdma/plain
ok 1 /x86_64/migration/precopy/rdma/plain # SKIP No rdma link available
# To enable the test:
# Run 'scripts/rdma-migration-helper.sh setup' with root to setup a new rdma/rxe link and rerun the test
# Optional: run 'scripts/rdma-migration-helper.sh clean' to revert the 'setup'
# End of rdma tests
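A rough sketch of how such a skip can look in the qtest; the helper
invocation and the probe subcommand are assumptions:

    #include <stdlib.h>
    #include <glib.h>

    static void test_precopy_rdma_plain(void)
    {
        /* Hypothetical probe: ask the helper script for a RoCE link. */
        if (system("scripts/rdma-migration-helper.sh detect") != 0) {
            g_test_skip("No rdma link available");
            return;
        }
        /* ... run the actual precopy migration over rdma ... */
    }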
Cc: Philippe Mathieu-Daudé <philmd@linaro.org>
Cc: Stefan Hajnoczi <stefanha@gmail.com>
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250311024221.363421-1-lizhijian@fujitsu.com>
[add 'head -1' to script, reformat test message]
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Li Zhijian [Wed, 5 Mar 2025 06:28:24 +0000 (14:28 +0800)]
migration: Unfold control_save_page()
control_save_page() is for RDMA only; unfold it to make the code clearer.
In addition:
- Similar to the style of the other branches in ram_save_target_page(),
  involve RDMA only if the condition 'migrate_rdma()' is true.
- Further simplify the code by removing RAM_SAVE_CONTROL_NOT_SUPP.
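A hedged sketch of the unfolded shape inside ram_save_target_page(); the
RDMA helper name and its exact signature are assumptions:

    static int ram_save_target_page(RAMState *rs, PageSearchStatus *pss)
    {
        ram_addr_t offset = ((ram_addr_t)pss->page) << TARGET_PAGE_BITS;

        /* RDMA first, in the same style as the other branches. */
        if (migrate_rdma()) {
            /* hypothetical helper; handles the page over RDMA */
            int ret = rdma_control_save_page(pss->pss_channel,
                                             pss->block->offset,
                                             offset, TARGET_PAGE_SIZE);
            if (ret != RAM_SAVE_CONTROL_DELAYED) {
                return ret;
            }
        }
        /* ... fall through to zero page / normal page paths ... */
        return ram_save_page(rs, pss);
    }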
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-6-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Li Zhijian [Wed, 5 Mar 2025 06:28:21 +0000 (14:28 +0800)]
migration: check RDMA and capabilities are compatible on both sides
Depending on the order of starting RDMA and setting the capabilities, the
flows can be categorized into the following scenarios:
Source:
S1: [set capabilities] -> [Start RDMA outgoing]
Destination:
D1: [set capabilities] -> [Start RDMA incoming]
D2: [Start RDMA incoming] -> [set capabilities]
Previously, compatibility between RDMA and the capabilities was verified
only in scenario D1, potentially causing migration failures in the other
situations.
For scenarios S1 and D1, we can seamlessly incorporate
migration_transport_compatible() to address the compatibility of channels
and capabilities with the transport.
For scenario D2, ensure compatibility within migrate_caps_check().
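A sketch of the D2-side check; migrate_caps_check() and its new_caps
parameter are existing names, while the RDMA-started helper and the chosen
capability are illustrative assumptions:

    /* In migrate_caps_check(), roughly: reject capabilities known to be
     * incompatible with an already-started RDMA incoming channel. */
    if (migrate_rdma_started() &&           /* hypothetical helper */
        new_caps[MIGRATION_CAPABILITY_MULTIFD]) {
        error_setg(errp, "RDMA and multifd can't be used together");
        return false;
    }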
Reviewed-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Li Zhijian <lizhijian@fujitsu.com>
Message-ID: <20250305062825.772629-3-lizhijian@fujitsu.com>
Signed-off-by: Fabiano Rosas <farosas@suse.de>
Kevin Wolf [Tue, 29 Apr 2025 15:56:54 +0000 (17:56 +0200)]
file-posix: Fix crash on discard_granularity == 0
Block devices that don't support discard have a discard_granularity of
0. Currently, this results in a division by zero when we try to make
sure that it's a multiple of request_alignment. Only try to update
bs->bl.pdiscard_alignment when we got a non-zero discard_granularity
from sysfs.
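The gist of the fix, as a hedged sketch of the sysfs probing path in
file-posix.c; the helper and variable names are approximate:

    unsigned int dg;

    /* dg == 0 means the device does not support discard at all, so the
     * remainder below would divide by zero. Guard on non-zero first. */
    if (get_sysfs_u32_val(s, "discard_granularity", &dg) == 0 && dg > 0) {
        if (dg % bs->bl.request_alignment == 0) {
            bs->bl.pdiscard_alignment = dg;
        }
    }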
Fixes: f605796aae4 ('file-posix: probe discard alignment on Linux block devices')
Signed-off-by: Kevin Wolf <kwolf@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-ID: <20250429155654.102735-1-kwolf@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Merge into INDEX_op_{ld,st,ld2,st2}, where "2" indicates that two
inputs or outputs are required. This simplifies the processing of
i64/i128 depending on host word size.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
We were using S32 | U32 for add2/sub2, but the ALGFI and SLGFI
insns that implement this both take uint32_t immediates.
This change makes the composite range balanced and
enables the use of -0xffffffff ... -0x80000001.
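Illustrative constraint logic under that change (a sketch of the idea, not
the actual matcher):

    /* With uint32_t immediates for both ALGFI (add) and SLGFI (subtract),
     * a constant addend is encodable iff it, or its negation, fits in 32
     * unsigned bits -- a balanced range on both sides of zero, which now
     * covers -0xffffffff ... -0x80000001 as well. */
    static bool addend_ok(int64_t val)
    {
        uint64_t uv = (uint64_t)val;
        return uv <= 0xffffffffull      /* ALGFI:  0 ... 0xffffffff */
            || -uv <= 0xffffffffull;    /* SLGFI: -0xffffffff ... 0 */
    }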
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Do not clobber flags if they're live; this is required in order
to perform register allocation on add/sub carry opcodes.
LA and AGHI are the same size, so use LA unconditionally.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
target/ppc: Use tcg_gen_addcio_tl for ADD and SUBF
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Using addci with two zeros as input in order to capture the value
of the carry-in bit is common. Special-case this with sbb+neg so
that we do not have to load 0 into a register first.
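What the special case boils down to, sketched in C against the x86 idiom
(cf stands for the carry-in flag; the function name is illustrative):

    /* addci dst, 0, 0 computes dst = 0 + 0 + CF, i.e. dst = CF.
     * x86 can read the flag without loading a zero first: */
    static uint64_t capture_carry(uint64_t r, unsigned cf /* 0 or 1 */)
    {
        r = r - r - cf;     /* sbb r, r  ->  0 or ~0, from CF alone */
        r = -r;             /* neg r     ->  0 or 1 */
        return r;
    }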
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
tcg/optimize: With two const operands, prefer 0 in arg1
For most binary operations, two const operands fold.
However, the add/sub carry opcodes have a third input.
Prefer "reg, zero, const", since many RISC hosts have a
zero register that can fit a "reg, reg, const" insn format.
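A sketch of the canonicalization (the names approximate the tcg/optimize.c
style but are not guaranteed to match the actual patch):

    /* For a 3-input carry op "out = a1 + a2 + carry", two const inputs
     * cannot fully fold (the carry is still unknown), but we can combine
     * them and put the zero where a hardware zero register can supply it:
     * "zero-reg, const" instead of "const, const". */
    if (arg_is_const(op->args[1]) && arg_is_const(op->args[2])) {
        uint64_t sum = arg_const_val(op->args[1]) + arg_const_val(op->args[2]);

        op->args[1] = arg_new_constant(ctx, 0);
        op->args[2] = arg_new_constant(ctx, sum);
    }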
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Propagate known carry when possible, and simplify the opcodes
so that they do not require carry-in when it is known. The result
will be cleaned up further by the subsequent liveness analysis pass.
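A sketch of the simplest such simplification (the tracking variable is
illustrative; the opcode names follow the new carry family):

    /* Track the carry across the optimized stream when it is known. */
    typedef enum { CARRY_UNKNOWN, CARRY_ZERO, CARRY_ONE } CarryState;

    /* On an add-with-carry-in opcode, when the carry is known to be
     * zero, downgrade to a plain add; liveness cleans up the rest. */
    if (carry_state == CARRY_ZERO && op->opc == INDEX_op_addci) {
        op->opc = INDEX_op_add;
    }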
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
tcg: Add add/sub with carry opcodes and infrastructure
Liveness needs to track carry-live state in order to
determine if the (hidden) output of the opcode is used.
Code generation needs to track carry-live state in order
to avoid clobbering cpu flags when loading constants.
So far, output routines and backends are unchanged.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
tcg: Sink def, nb_iargs, nb_oargs loads in liveness_pass_1
Sink the setting of the def, nb_iargs and nb_oargs variables down to
the default and do_not_remove labels. They're not really
needed beforehand, and this avoids the preceding code having
to keep them up to date. Note that def had *not* been kept
up to date; thankfully only def->flags had been used, and
those bits were constant between opcode changes.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Drop the cast from TCGv_i64 to TCGv_i32 in tcg_gen_extrl_i64_i32
and emit extrl_i64_i32 unconditionally. Move that special case
to tcg_gen_code, where we find out whether the output is live or dead.
In this way even hosts that canonicalize truncations can make
use of a store directly from the 64-bit host register.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Use TCGOutOpUnary instead of TCGOutOpBswap because the
flags are not used with this opcode; they are merely
present for uniformity with the smaller bswaps.
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Tested-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Pierrick Bouvier <pierrick.bouvier@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>