From: Vsevolod Stakhov
Date: Mon, 1 Jun 2026 14:30:16 +0000 (+0100)
Subject: [Test] functional: wait for rspamd ports to free in teardown
X-Git-Tag: 4.1.0~11^2~3
X-Git-Url: http://git.ipfire.org/gitweb/index.cgi?a=commitdiff_plain;h=65f4e09cea669fd7ef379e9bdf6f929b0c00bec9;p=thirdparty%2Frspamd.git

[Test] functional: wait for rspamd ports to free in teardown

Under pabot each worker runs many suites sequentially on the SAME port
range (base + worker_index*100). Rspamd Teardown did Terminate Process +
Wait For Process, but that only reaps the MAIN rspamd; the listening
sockets are shared with forked workers and can linger a beat after main
exits. The next suite's rspamd on that worker then races them and dies:

  rspamd_inet_address_listen: bind 127.0.0.1:57090 failed: 98, 'Address already in use'
  spawn_workers: cannot listen on normal socket 127.0.0.1:57090
  Process Is Gone (rc=1, port=57089)

which cascades the whole shared-rspamd suite (e.g. 001_merged -> 250+
failures) or single suites like 440_ssl_server.

rspamd sets SO_REUSEADDR before bind, so this is NOT TIME_WAIT -- it is
a still-LISTENing socket from a not-yet-fully-gone worker.

Add port_is_free() (rspamd.py) and a Wait For Rspamd Ports Released
keyword, called from Rspamd Teardown after Wait For Process: block (up
to ~6s, warn-not-fail) until the normal + controller ports actually
refuse connections before releasing the suite. Closes the handoff race
window.

This is a pre-existing flake (same bind-98 signature on master, e.g.
fedora job for #6067 with :56990), independent of the dummy-port
templating in this branch; both CI runs of this PR hit it in different
suites, the tell-tale of nondeterministic infra flake.

Verified: the keyword runs on every teardown (357 invocations / 714 port
checks in a 4-worker pabot run) and port_is_free correctly passes on a
free port and blocks on a live listener; no regression in serial or
parallel runs. The race itself is timing-dependent and reproduces under
CI container contention rather than locally, so CI is the real check.
---

diff --git a/test/functional/lib/rspamd.py b/test/functional/lib/rspamd.py
index a974bbb34d..8a41634abe 100644
--- a/test/functional/lib/rspamd.py
+++ b/test/functional/lib/rspamd.py
@@ -539,6 +539,33 @@ def TCP_Connect(addr, port):
     s.close()
 
 
+def port_is_free(addr, port):
+    """Assert that nothing is listening on addr:port.
+
+    Used by teardown to confirm a just-terminated rspamd has actually
+    released its listening sockets before the next suite on this pabot
+    worker reuses the same port range. `Wait For Process` only reaps the
+    main rspamd; the listening sockets are shared with forked workers and
+    can linger briefly after main exits. rspamd sets SO_REUSEADDR, so this
+    is NOT about TIME_WAIT -- a still-LISTENing socket from a not-yet-gone
+    worker genuinely fails the next bind() with EADDRINUSE. Connecting and
+    succeeding means someone is still listening -> raise so Wait Until
+    Keyword Succeeds retries; connection refused means the port is free.
+
+    Example:
+    | Wait Until Keyword Succeeds | 10s | 0.2s | Port Is Free | 127.0.0.1 | 56790 |
+    """
+    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+    s.settimeout(0.5)
+    try:
+        s.connect((addr, int(port)))
+    except (ConnectionRefusedError, socket.timeout, OSError):
+        return True
+    finally:
+        s.close()
+    raise AssertionError("port %s:%s is still in use" % (addr, port))
+
+
 def try_reap_zombies():
     try:
         os.waitpid(-1, os.WNOHANG)
diff --git a/test/functional/lib/rspamd.robot b/test/functional/lib/rspamd.robot
index 8fb7286abd..bc3fd60276 100644
--- a/test/functional/lib/rspamd.robot
+++ b/test/functional/lib/rspamd.robot
@@ -314,6 +314,14 @@ Rspamd Teardown
   END
   Terminate Process  ${RSPAMD_PROCESS}
   Wait For Process  ${RSPAMD_PROCESS}
+  # Wait For Process only reaps the main rspamd; its listening sockets are
+  # shared with forked workers and can linger a beat after main exits.
+  # Under pabot each worker runs many suites sequentially on the SAME port
+  # range, so if we release this suite before the kernel has closed the
+  # normal+controller sockets, the next suite's rspamd races them and dies
+  # with EADDRINUSE (rspamd sets SO_REUSEADDR, so this is a live socket,
+  # not TIME_WAIT). Block until the ports are actually free.
+  Wait For Rspamd Ports Released
   Save Run Results  ${RSPAMD_TMPDIR}  configdump.stdout configdump.stderr rspamd.stderr rspamd.stdout rspamd.conf rspamd.log redis.log clickhouse-config.xml
   Log does not contain segfault record
   Collect Lua Coverage
@@ -323,6 +331,22 @@ Rspamd Redis Teardown
   Rspamd Teardown
   Redis Teardown
 
+Wait For Rspamd Ports Released
+  [Documentation]  Block until this suite's rspamd listening ports are
+  ...  free, so the next suite on the same pabot worker can rebind them.
+  ...  Checks the always-present normal + controller ports; each is given
+  ...  up to ~6s (matches a slow worker shutdown under CPU contention) and
+  ...  failure to free is a warning, not a hard error -- we don't want a
+  ...  stuck port to mask the real test result, just to close the common
+  ...  handoff race. See port_is_free in rspamd.py for why SO_REUSEADDR
+  ...  does not cover this.
+  Run Keyword And Warn On Failure
+  ...  Wait Until Keyword Succeeds  30x  0.2s
+  ...  Port Is Free  ${RSPAMD_LOCAL_ADDR}  ${RSPAMD_PORT_NORMAL}
+  Run Keyword And Warn On Failure
+  ...  Wait Until Keyword Succeeds  30x  0.2s
+  ...  Port Is Free  ${RSPAMD_LOCAL_ADDR}  ${RSPAMD_PORT_CONTROLLER}
+
 Run Redis
   ${RSPAMD_TMPDIR} =  Make Temporary Directory
   ${template} =  Get File  ${RSPAMD_TESTDIR}/configs/redis-server.conf
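
Note (illustrative, not part of the patch): the connect-probe idea behind port_is_free can be exercised outside Robot with a rough standalone sketch like the one below. It assumes a Linux-like loopback where connecting to a closed port is refused immediately. The names wait_for_port_free and demo are hypothetical helpers made up for this sketch; the real teardown relies on Robot's Wait Until Keyword Succeeds and Run Keyword And Warn On Failure rather than a Python loop.

```python
import socket
import time


def port_is_free(addr, port, timeout=0.5):
    """Return True if the connect is refused or times out; raise if it succeeds."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((addr, int(port)))
    except OSError:
        # ConnectionRefusedError and socket.timeout are both OSError subclasses.
        return True
    finally:
        s.close()
    raise AssertionError("port %s:%s is still in use" % (addr, port))


def wait_for_port_free(addr, port, attempts=30, interval=0.2):
    """Hypothetical stand-in for the 30x / 0.2s retry; returns False instead of failing."""
    for _ in range(attempts):
        try:
            return port_is_free(addr, port)
        except AssertionError:
            time.sleep(interval)
    return False


def demo():
    """Hypothetical demo: a local listener plays the part of a lingering worker."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(16)  # backlog large enough for the few probes below
    port = listener.getsockname()[1]

    try:
        port_is_free("127.0.0.1", port)
        blocked = False
    except AssertionError:
        blocked = True  # a live listener accepts the probe, so the check raises

    gave_up = not wait_for_port_free("127.0.0.1", port, attempts=3, interval=0.01)

    listener.close()  # the "worker" finally goes away
    freed = wait_for_port_free("127.0.0.1", port, attempts=3, interval=0.01)
    return blocked, gave_up, freed


if __name__ == "__main__":
    print(demo())
```

The sketch also shows why the timeout branch deserves care: a probe that times out (for example against a listener whose accept queue is full) is reported as free, the same as a refused connection.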