From: Luca Boccassi
Date: Tue, 19 May 2026 10:29:58 +0000 (+0100)
Subject: test: pin stress-ng --vm-method to a portable scalar method in TEST-55-OOMD
X-Git-Tag: v261-rc1~96
X-Git-Url: http://git.ipfire.org/gitweb/index.cgi?a=commitdiff_plain;h=881e4717c7;p=thirdparty%2Fsystemd.git

test: pin stress-ng --vm-method to a portable scalar method in TEST-55-OOMD

The stress-ng "vm" stressor's default --vm-method=all cycles through
every VM stress method, including newer ones that use AVX-512
instructions. On CPUs without AVX-512 support (e.g. AMD Zen 1 to 3)
those methods crash with SIGILL. In testcase_oom_rulesets_lasting_sec
all 10 stress-ng workers die within ~2.34 seconds, so by the time the
6 second sleep elapses the unit is already in failed/exit-code state
and the assert_eq for ActiveState=active trips.

Pin --vm-method=zero-one, a long-standing scalar method, on all four
stress-ng --vm invocations in this test (the two transient services in
testcase_oom_rulesets and testcase_oom_rulesets_lasting_sec, plus
TEST-55-OOMD-testbloat.service and TEST-55-OOMD-testmunch.service) so
the workers do not crash on AVX-512-less CPUs. testbloat, testmunch and
testcase_oom_rulesets have not been observed failing because they get
OOM-killed by systemd-oomd within ~1 to 2 seconds, before stress-ng
cycles into an AVX-512 method, but they share the same latent flake.

Journal excerpts from the failing run, TEST-55-OOMD-slowrule.service in
testcase_oom_rulesets_lasting_sec (journalctl -o short-monotonic):

  [ 58.018676] stress-ng[1015]: invoked with '/usr/bin/stress-ng --timeout 15s --vm 10 --vm-bytes 50M --vm-keep' by user 0 'root'
  [ 59.866072] stress-ng[1030]: stress-ng: debug: [1030] caught SIGILL, address 0x000055bd8d609140 (ILL_ILLOPN)
  [ 59.921050] stress-ng[1030]: stress-ng: debug: [1030] stress-ng: info: 0x000055bd8d609140:<62>71 fd 48 6f 2d 36 14 1c 00 c5 d1 ef ed 49 29
  [ 59.929310] stress-ng[1015]: stress-ng: error: [1015] vm: [1021] terminated with an error, exit status=2 (stressor failed)
  [ 60.364111] stress-ng[1015]: stress-ng: info: [1015] failed: 10: vm (10)
  [ 60.364493] stress-ng[1015]: stress-ng: info: [1015] unsuccessful run completed in 2.34 secs
  [ 60.371290] systemd[1]: TEST-55-OOMD-slowrule.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
  [ 60.371396] systemd[1]: TEST-55-OOMD-slowrule.service: Failed with result 'exit-code'.
  [ 64.017061] TEST-55-OOMD.sh[1010]: + assert_eq failed active
  [ 64.018167] TEST-55-OOMD.sh[1039]: FAIL: expected: 'active' actual: 'failed'

The faulting bytes marked by stress-ng with <62> (the byte at the
instruction pointer) decode unambiguously to an AVX-512 VMOVDQA64 using
the 512-bit zmm13 register, confirmed independently by two
disassemblers:

  $ printf '\x62\x71\xfd\x48\x6f\x2d\x36\x14\x1c\x00' | ndisasm -b 64 -
  00000000  6271FD486F2D3614  vmovdqa64 zmm13,zword [rel 0x1c1440]
           -1C00

  $ echo '0x62, 0x71, 0xfd, 0x48, 0x6f, 0x2d, 0x36, 0x14, 0x1c, 0x00' | \
      llvm-mc -disassemble -triple=x86_64 -mattr=+avx512f
      .text
      vmovdqa64 1840182(%rip), %zmm13

The leading 0x62 is the EVEX prefix (exclusive to AVX-512 on this
target), zmm13 is a 512-bit register that only exists when AVX-512 is
implemented, and VMOVDQA64 requires the AVX512F (Foundation) CPUID
feature (Intel SDM Vol 2C).
Executing this on a CPU without AVX-512 raises #UD, delivered by the
kernel as SIGILL/ILL_ILLOPN, matching the journal entry above. The same
journal shows the kernel reporting "kvm_amd: TSC scaling supported",
i.e. the guest is on AMD KVM, and AMD did not ship AVX-512 before Zen 4.

Co-developed-by: Claude Opus 4.7
---

diff --git a/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testbloat.service b/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testbloat.service
index 70c87727c8b..22bbd210e96 100644
--- a/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testbloat.service
+++ b/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testbloat.service
@@ -7,4 +7,7 @@ Description=Create a lot of memory pressure
 # to throttle and be put under heavy pressure.
 MemoryHigh=3M
 Slice=TEST-55-OOMD-workload.slice
-ExecStart=stress-ng --timeout 3m --vm 10 --vm-bytes 200M --vm-keep
+# Pin --vm-method to a portable method (zero-one): the default 'all' cycles
+# through methods, including newer ones using AVX-512 instructions that SIGILL
+# on CPUs without AVX-512 (e.g. AMD Zen 1-3), making the test flaky.
+ExecStart=stress-ng --timeout 3m --vm 10 --vm-bytes 200M --vm-keep --vm-method=zero-one
diff --git a/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testmunch.service b/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testmunch.service
index 79bd01838e1..06eea10b79a 100644
--- a/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testmunch.service
+++ b/test/integration-tests/TEST-55-OOMD/TEST-55-OOMD.units/TEST-55-OOMD-testmunch.service
@@ -5,4 +5,7 @@ Description=Create some memory pressure
 [Service]
 MemoryHigh=12M
 Slice=TEST-55-OOMD-workload.slice
-ExecStart=stress-ng --timeout 3m --vm 10 --vm-bytes 200M --vm-keep
+# Pin --vm-method to a portable method (zero-one): the default 'all' cycles
+# through methods, including newer ones using AVX-512 instructions that SIGILL
+# on CPUs without AVX-512 (e.g. AMD Zen 1-3), making the test flaky.
+ExecStart=stress-ng --timeout 3m --vm 10 --vm-bytes 200M --vm-keep --vm-method=zero-one
diff --git a/test/units/TEST-55-OOMD.sh b/test/units/TEST-55-OOMD.sh
index 6689bbdd733..b7311e83dca 100755
--- a/test/units/TEST-55-OOMD.sh
+++ b/test/units/TEST-55-OOMD.sh
@@ -365,11 +365,14 @@ EOF
 
     systemctl reload systemd-oomd.service
 
-    # Run a transient service with OOMRules=testrule that generates memory pressure
+    # Run a transient service with OOMRules=testrule that generates memory pressure.
+    # Pin --vm-method to a portable method (zero-one): the default 'all' cycles
+    # through every method, including newer ones using AVX-512 instructions that
+    # SIGILL on CPUs without AVX-512 (e.g. AMD Zen 1-3), making the test flaky.
     (! systemd-run --wait --unit=TEST-55-OOMD-testrules \
         -p MemoryHigh=3M \
         -p OOMRules=testrule \
-        stress-ng --timeout 3m --vm 10 --vm-bytes 50M --vm-keep)
+        stress-ng --timeout 3m --vm 10 --vm-bytes 50M --vm-keep --vm-method=zero-one)
 
     # Verify in the journal that the rule triggered
     journalctl --sync
@@ -454,10 +457,14 @@ EOF
 
     # Start the unit without --wait so we can check mid-run state. The
     # stress-ng timeout bounds the test if anything goes wrong.
+    # Pin --vm-method to a portable method (zero-one): the default 'all' cycles
+    # through every method, including newer ones using AVX-512 instructions that
+    # SIGILL on CPUs without AVX-512 (e.g. AMD Zen 1-3) and would cause stress-ng
+    # to exit before the 6 s wait below elapses, failing the ActiveState check.
     systemd-run --unit=TEST-55-OOMD-slowrule \
         -p MemoryHigh=3M \
         -p OOMRules=slowrule \
-        stress-ng --timeout 15s --vm 10 --vm-bytes 50M --vm-keep
+        stress-ng --timeout 15s --vm 10 --vm-bytes 50M --vm-keep --vm-method=zero-one
 
     # Wait long enough for oomd's 1s rule-check loop to evaluate the condition
     # many times. With LastingSec=1h the kill must not fire.