git.ipfire.org Git - thirdparty/rspamd.git/log

Vsevolod Stakhov [Tue, 2 Jun 2026 20:07:26 +0000 (21:07 +0100)]

Merge pull request #6074 from rspamd/vstakhov-checkv3-custom-metadata

[Feature] protocol: Expose custom metadata for /checkv3

Vsevolod Stakhov [Tue, 2 Jun 2026 18:04:34 +0000 (19:04 +0100)]

[Feature] protocol: Expose custom metadata for /checkv3

Add two complementary ways to read custom fields sent with a /checkv3
multipart scan request, both free of the 80KB HTTP header limit that v2
hits, since the metadata travels in the multipart body:

  * A "headers" sub-object in the metadata part is injected into the
    task request headers, so task:get_request_header() works for custom
    fields exactly like v2 HTTP request headers. Reserved control-header
    names (shm/file/path/dictionary/Content-Encoding...) are skipped so
    client metadata cannot collide with the message-loading channel, and
    a repeated name (collapsed by UCL into an array) expands to a
    multi-valued request header.

  * The parsed metadata object is kept on task->meta and exposed to Lua
    via task:get_metadata() and task:get_metadata_field(key), mirroring
    get_settings()/lookup_settings(). The task now owns the object and
    frees it once in rspamd_task_free instead of via a pool destructor.

rspamc gains a repeatable --metadata-header KEY=VALUE option that builds
the metadata "headers" sub-object for v3 requests. Also drop a dead
is_msgpack variable in the v3 request handler.

Tests: functional cases in 430_checkv3.robot plus a checkv3_meta.lua
plugin exercising both options via raw multipart and rspamc.
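
A minimal sketch of the "headers" sub-object handling described above,
written in Python rather than the C/Lua of the actual change. The
function names are hypothetical, the reserved set holds only the names
quoted in this message (the real list is longer), and case-insensitive
matching of reserved names is an assumption.

```python
# Names quoted in the commit message; the real reserved set is longer ("...").
RESERVED = {"shm", "file", "path", "dictionary", "content-encoding"}


def build_metadata(pairs):
    """Build a metadata object with a "headers" sub-object from KEY=VALUE pairs."""
    headers = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key in headers:
            # A repeated name collapses into an array (multi-valued header).
            existing = headers[key]
            if isinstance(existing, list):
                existing.append(value)
            else:
                headers[key] = [existing, value]
        else:
            headers[key] = value
    return {"headers": headers}


def injectable_headers(metadata):
    """Yield (name, value) pairs to inject, skipping reserved control names."""
    for name, value in metadata.get("headers", {}).items():
        if name.lower() in RESERVED:  # assumption: case-insensitive match
            continue
        values = value if isinstance(value, list) else [value]
        for item in values:
            yield name, item
```

Here a repeated key such as X-Tag=a, X-Tag=b yields two request headers,
while a client-supplied "file" entry is dropped.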

Vsevolod Stakhov [Mon, 1 Jun 2026 18:47:42 +0000 (19:47 +0100)]

Merge pull request #6068 from moisseev/upstream

[Minor] upstream: improve cooldown log message clarity

Vsevolod Stakhov [Mon, 1 Jun 2026 18:46:28 +0000 (19:46 +0100)]

Merge pull request #6071 from rspamd/vstakhov-functional-dummy-readiness

[Test] functional: fix dummy-helper start/scan race and parallel port collisions

Vsevolod Stakhov [Mon, 1 Jun 2026 18:00:42 +0000 (19:00 +0100)]

[Fix] functional: move test server ports below the ephemeral range

The real root cause of the 440_ssl_server flake (and the family of
intermittent "bind 98 / Address already in use" failures): the test
server ports sat INSIDE Linux's default ephemeral range
(net.ipv4.ip_local_port_range = 32768..60999). Bases were 56379 (redis),
56380 (nginx) and 567xx (rspamd normal/controller/proxy/fuzzy + the two
TLS listeners), all squarely in that window.

So any outbound client socket in the test environment -- a redis client,
monitored URIBL DNS lookups, an upstream connection, a dummy-helper
connection -- could be handed one of those numbers by the kernel as its
EPHEMERAL SOURCE PORT on connect(). When rspamd later tried to bind() a
LISTENER on that exact port it got EADDRINUSE. rspamd sets SO_REUSEADDR,
which does nothing against a live socket already bound by another
process. The controller's SSL socket is the LAST of its five ports to
bind -- by then the controller has already opened many client sockets --
so it lost this race most often and surfaced as "SSL controller never
came up" -> HTTPS connection-refused for the whole retry budget. It was
probabilistic (depends which ephemeral ports were in use at bind time),
hence flaky and distro-dependent.

Move the whole rspamd/redis/nginx block down by 31000 (e.g. normal
56789 -> 25789, controller-SSL 56796 -> 25796, redis 56379 -> 25379,
nginx 56380 -> 25380). This preserves every relative offset, so the
carefully spaced, collision-free per-worker layout (base + slot*100) is
unchanged: across 64 worker slots the dummy_* helpers stay <= 24383,
this block spans 25379..32097, and the ephemeral floor 32768 is never
reached. Verified by importing vars.py for slots 0 and 63 (max port
32097 < 32768, zero cross-family collisions) and a serial 001_merged run
(all six 440_ssl_server tests pass on the relocated ports).
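
The layout check above can be sketched as follows. This is not the real
vars.py: the BASES table holds only the four relocated bases quoted in
this message, and the helper names are hypothetical.

```python
EPHEMERAL_FLOOR = 32768  # lower bound of Linux's default ip_local_port_range
SLOTS = 64               # pabot worker slots
STRIDE = 100             # per-slot offset (base + slot*100)

# Relocated bases quoted in the commit message (not the full vars.py list).
BASES = {"redis": 25379, "nginx": 25380, "normal": 25789, "controller_ssl": 25796}


def port_for(family, slot):
    """Port used by `family` on the given worker slot."""
    return BASES[family] + slot * STRIDE


def check_layout():
    """Assert every port is below the ephemeral floor and unique; return the max."""
    seen = {}
    for family in BASES:
        for slot in range(SLOTS):
            port = port_for(family, slot)
            assert port < EPHEMERAL_FLOOR, (family, slot, port)
            assert port not in seen, (family, slot, seen[port])
            seen[port] = (family, slot)
    return max(seen)
```

With only these four bases the highest port is 32096; the 32097 figure
reported above is for the full list in vars.py.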

Also bump the two cosmetic fallbacks that mirrored the old bases:
test_redis_client.lua's getenv default and a port_is_free docstring.

Vsevolod Stakhov [Mon, 1 Jun 2026 17:10:53 +0000 (18:10 +0100)]

[Minor] ci: dedupe concurrent push + pull_request runs

A commit on a branch with an open PR triggers two full ci runs at once:
one for the push event (ref refs/heads/<branch>) and one for the
pull_request event (ref refs/pull/<n>/merge). Besides wasting runner
time, they share GitHub's hosted runners and double the CPU load, which
is enough to push the controller startup of the heavy 001_merged rspamd
past the functional suites' fixed readiness timeouts -- the residual
440_ssl_server flake reproduced only on whichever of the two same-SHA
runs lost the CPU race (the other passed).

Add a top-level concurrency group keyed on the head commit SHA with
cancel-in-progress. push and pull_request expose the head differently
(github.sha vs github.event.pull_request.head.sha -- the latter is the
real head on PR events, where github.sha is the merge commit), so the
group key uses pull_request.head.sha when present and falls back to
github.sha, collapsing both events for one commit into a single run.

Vsevolod Stakhov [Mon, 1 Jun 2026 16:32:43 +0000 (17:32 +0100)]

[Test] functional: also wait for SSL/proxy ports in teardown

The teardown port-release wait added in the previous commit only covered
the normal + controller plain ports. The 440_ssl_server flake is the
same race on a port it missed: a test rspamd binds up to five sockets
(normal, controller, proxy, controller-SSL, normal-SSL), and a previous
suite's controller-SSL listener could still hold its port when the next
rspamd on that pabot worker started. The CI log shows it exactly:

  rspamd_fork_worker: prepare to fork process controller (0);
    listen on: 127.0.0.1:57190
  rspamd_inet_address_listen: bind 127.0.0.1:57196 failed: 98,
    'Address already in use'
  spawn_workers: cannot listen on normal socket 127.0.0.1:57196

57196 is PORT_CONTROLLER_SSL for that worker slot. main carried on and
forked the controller with only its plain socket, so the SSL listener
never came up and every HTTPS test hit connection-refused for the full
retry budget -- the "slow SSL controller" the two prior band-aids tried
to wait out.

Extend Wait For Rspamd Ports Released to loop over all five ports. All
RSPAMD_PORT_* vars are always defined in vars.py, and a port the current
config never bound refuses connection immediately, so Port Is Free
passes at once -- waiting on an unused port is a cheap no-op. Verified
001_merged (which owns 440_ssl_server) still passes serially with the
SSL ports now checked in teardown; the SSL bind race is timing-dependent
under CI contention, so the fedora/ubuntu runs are the real check.

Vsevolod Stakhov [Mon, 1 Jun 2026 14:30:16 +0000 (15:30 +0100)]

[Test] functional: wait for rspamd ports to free in teardown

Under pabot each worker runs many suites sequentially on the SAME port
range (base + worker_index*100). Rspamd Teardown did Terminate Process +
Wait For Process, but that only reaps the MAIN rspamd; the listening
sockets are shared with forked workers and can linger a beat after main
exits. The next suite's rspamd on that worker then races them and dies:

  rspamd_inet_address_listen: bind 127.0.0.1:57090 failed: 98,
    'Address already in use'
  spawn_workers: cannot listen on normal socket 127.0.0.1:57090
  Process Is Gone (rc=1, port=57089)

which cascades the whole shared-rspamd suite (e.g. 001_merged -> 250+
failures) or single suites like 440_ssl_server. rspamd sets SO_REUSEADDR
before bind, so this is NOT TIME_WAIT -- it is a still-LISTENing socket
from a not-yet-fully-gone worker.

Add port_is_free() (rspamd.py) and a Wait For Rspamd Ports Released
keyword, called from Rspamd Teardown after Wait For Process: block (up to
~6s, warn-not-fail) until the normal + controller ports actually refuse
connections before releasing the suite. Closes the handoff race window.

This is a pre-existing flake (same bind-98 signature on master, e.g.
fedora job for #6067 with :56990), independent of the dummy-port
templating in this branch; both CI runs of this PR hit it in different
suites, the tell-tale of nondeterministic infra flake.

Verified: the keyword runs on every teardown (357 invocations / 714 port
checks in a 4-worker pabot run) and port_is_free correctly passes on a
free port and blocks on a live listener; no regression in serial or
parallel runs. The race itself is timing-dependent and reproduces under
CI container contention rather than locally, so CI is the real check.
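
A minimal sketch of the probe and the teardown wait described above.
This is not the code in rspamd.py: the signatures, defaults and the
treatment of timeouts as "not free yet" are assumptions.

```python
import socket
import time


def port_is_free(port, host="127.0.0.1", timeout=0.5):
    """Return True when nothing accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # something is still listening
    except ConnectionRefusedError:
        return True
    except OSError:
        # Timeouts and other errors count as "not free yet" (assumption).
        return False


def wait_ports_released(ports, budget=6.0, interval=0.2):
    """Poll until every port refuses connections; return the ports still busy."""
    deadline = time.monotonic() + budget
    busy = list(ports)
    while busy and time.monotonic() < deadline:
        busy = [p for p in busy if not port_is_free(p)]
        if busy:
            time.sleep(interval)
    return busy  # the caller warns instead of failing when this is non-empty
```

Returning the busy ports rather than raising matches the warn-not-fail
behaviour: a slow release delays the next suite but does not fail the
teardown itself.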

Vsevolod Stakhov [Mon, 1 Jun 2026 12:46:04 +0000 (13:46 +0100)]

[Test] functional: template dummy ports for parallel safety

The dummy_* helper ports already get a per-pabot-worker offset in
lib/vars.py (base + worker_index*100), but consumers hardcoded the
worker-0 literals (:18080/:18081/:18083), so under parallel pabot a
worker bound its dummy on an offset port while tests/configs still
pointed at :18080. That produced two failure modes: Errno 48 "address
already in use" when two workers raced the same literal port, and
cross-worker URL mismatches (worker 3's redirector fetching worker 0's
dummy, assertions expecting :18080 that never appeared).

Route every consumer through the existing per-worker value:

  * Lua test scripts (http/tcp/http_early_response): read the port from
    rspamd_env.PORT_DUMMY_HTTP/HTTPS/HTTP_EARLY (rspamd strips the
    RSPAMD_ prefix when building the Lua env table), defaulting to the
    historical literal for ad-hoc runs. Mirrors the existing maps_kv.lua
    pattern.
  * neural_llm.conf: Jinja {= env.PORT_DUMMY_HTTP =}, like the other
    templated configs.
  * test_tcp_client.lua (rspamadm): os.getenv fallback chain (it runs
    under `rspamadm lua`, not the config loader).

The url_redirector .eml fixtures embed the dummy URL but are fed raw to
the scanner -- the config-time Jinja engine does not touch them. Add a
Render Message Template keyword (Get File -> Replace Variables ->
Create File in the suite tmpdir) and have suites 162-169 render their
fixtures in setup, with ${RSPAMD_PORT_DUMMY_HTTP} placeholders in the
fixtures and assertions. Normalise redir_chain_tel_url.eml to LF while
touching it.

Verified: serial runs unchanged (worker-0 keeps the historical ports),
and 4x parallel pabot stress over the url_redirector + http/tcp/early +
antivirus/udp/p0f/settings/llm suites is stable at 142/143 with zero
Errno 48 / address-in-use and no cross-worker mismatches. The lone
remaining failure (169 path-less ?u= wrapper) is a pre-existing
redirector behaviour bug -- it fails identically on master.
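
The env-with-fallback lookup and the fixture rendering can be sketched
as below. The real suite uses Robot Framework's Replace Variables; the
string.Template substitution here is a stand-in, and the function names
are hypothetical.

```python
import os
from string import Template

DEFAULT_DUMMY_HTTP = "18080"  # historical worker-0 literal


def dummy_http_port(env=os.environ):
    """Per-worker dummy HTTP port, falling back to the historical literal."""
    return env.get("RSPAMD_PORT_DUMMY_HTTP", DEFAULT_DUMMY_HTTP)


def render_fixture(text, env=os.environ):
    """Substitute ${RSPAMD_PORT_DUMMY_HTTP}; unknown placeholders are left alone."""
    return Template(text).safe_substitute(
        RSPAMD_PORT_DUMMY_HTTP=dummy_http_port(env)
    )
```

For worker 3 (offset 300) a fixture URL renders with port 18380, while
an ad-hoc run with no environment keeps :18080.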

Vsevolod Stakhov [Sun, 31 May 2026 19:48:45 +0000 (20:48 +0100)]

[Test] functional: centralize dummy helper readiness barrier

Fix a start/scan race in the functional suite: dummy_* mock services
were started and then connected to (by rspamd or the test) before they
were listening. Under parallel pabot the short 2s PID waits timed out
under CPU contention, one-shot helpers (clam/fprot/avast/p0f) left stale
PID files so a same-port restart satisfied Wait Until Created instantly
and raced the new bind, and p0f derived its PID path inconsistently
between helper and suite.

Every dummy_* helper already writes its PID only after server_bind/
server_activate, so PID-existence is a valid "listening" signal. This
routes all helper startup through one barrier:

  * Start Dummy Service (lib/rspamd.robot): drop stale PID, start the
    helper, block until the PID file appears (5s). Single source of
    truth for startup ordering.
  * Wait Until Dummy Listening: active TCP-connect probe layered on top
    for loop servers (http/https/ssl) only; not used for one-shot or
    single-threaded smtp helpers, where a probe would consume the one
    session the test needs.

Rewrite Run Dummy Http/Https/Llm/Http Early/Ssl/Udp/Clam/Fprot/Avast/p0f
and the 168/169 SMTP suites to go through it; move SMTP temp files from
/tmp to the per-worker RSPAMD_TMP_PREFIX; teach dummy_p0f.py to accept an
explicit PID path.

Add util/check_no_bare_dummy_start.py, run as a run-parallel.sh
preflight, which fails if a suite reintroduces a bare
Start Process ... dummy_*.py instead of using the barrier.
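
The startup barrier can be sketched in Python as below. The actual
keyword lives in lib/rspamd.robot; these helper names and the
start_helper callable are hypothetical.

```python
import os
import time


def drop_stale_pid(pid_path):
    """Remove a PID file left by a previous one-shot helper, if any."""
    try:
        os.remove(pid_path)
    except FileNotFoundError:
        pass


def wait_for_pid_file(pid_path, timeout=5.0, interval=0.05):
    """Block until pid_path exists; True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(pid_path):
            return True
        time.sleep(interval)
    return os.path.exists(pid_path)


def start_dummy_service(start_helper, pid_path, timeout=5.0):
    """Drop the stale PID, start the helper, then wait for its PID file."""
    drop_stale_pid(pid_path)
    handle = start_helper()  # hypothetical callable that launches the helper
    if not wait_for_pid_file(pid_path, timeout):
        raise TimeoutError(f"{pid_path} did not appear within {timeout}s")
    return handle
```

Dropping the stale file first is what makes PID existence meaningful: a
file left by the previous run can no longer satisfy the wait instantly.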

Vsevolod Stakhov [Sun, 31 May 2026 13:45:56 +0000 (14:45 +0100)]

[Fix] archives: bounds guards for RAR/ZIP/7-zip parsing

Several archive parsers could read slightly past the input buffer on
crafted (attacker-controlled) attachments:

- RAR v4 file header: fname_len was validated against the remaining
  buffer, but p then advanced past the attrs and optional
  HIGH_PACK_SIZE/HIGH_UNP_SIZE fields (4-12 bytes) before the filename
  was read, allowing an over-read of up to 12 bytes. Re-validate
  fname_len at the point of use.

- 7-zip: rspamd_7zip_read_next_section, _read_digest and _read_bits
  dereferenced *p before any bounds check; a section/type byte landing
  on the last byte of the buffer (e.g. a trailing kCRC or kHeader) led
  to a one-byte over-read. Guard p < end before the dereference.
  rspamd_7zip_read_archive_props guarded only p != NULL; also require
  p < end.

- ZIP central-directory extra-field loop advanced p by an
  attacker-controlled hlen without checking it against the remaining
  extra-field length, producing a past-the-end pointer. Clamp the
  advance and stop on a truncated field.

All reads, no writes; impact is a potential crash on malformed input.
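
The ZIP guard can be illustrated with a Python sketch that walks an
extra-field area (2-byte header ID, 2-byte length, then data, both
little-endian) and stops on a truncated field. It is not the C parser
from the patch, and iter_extra_fields is a hypothetical name.

```python
import struct


def iter_extra_fields(extra: bytes):
    """Yield (header_id, data) pairs, stopping at the first truncated field."""
    pos, end = 0, len(extra)
    while end - pos >= 4:  # need a 2-byte id plus a 2-byte length
        header_id, hlen = struct.unpack_from("<HH", extra, pos)
        pos += 4
        if hlen > end - pos:
            break  # declared length exceeds what is left in the buffer
        yield header_id, extra[pos:pos + hlen]
        pos += hlen
```

The length is compared against the remaining bytes before the position
advances, so the cursor never moves past the end of the buffer.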

Vsevolod Stakhov [Sat, 30 May 2026 12:21:37 +0000 (13:21 +0100)]

[Fix] mime_parser: bound S/MIME recursion depth

Nested S/MIME structures re-entered the parser through
rspamd_mime_parse_normal_part -> rspamd_mime_process_multipart_node ->
rspamd_mime_parse_normal_part without passing through the
multipart/message nesting checks, so st->nesting was never incremented
on that path. application/pkcs7-mime only sets the SMIME content-type
flag (not MESSAGE/MULTIPART), so such parts take the normal-part branch.
A crafted message with deeply nested application/pkcs7-mime layers could
therefore recurse to a depth bounded only by message size rather than by
max_nested, exhausting the worker stack (DoS) and accumulating the
CMS/PKCS7/BIO objects of every level simultaneously.

Account for the S/MIME re-entry against max_nested and free the
CMS/PKCS7/BIO objects on the new error path; the nesting cap also bounds
the peak memory held during unwinding.

Two related defensive guards:
- rspamd_mime_preprocess_message now looks back one byte before the body
  only when that stays within the buffer, avoiding a potential 1-byte
  out-of-bounds read when raw_data.begin == st->start.
- guard the boundary-stack pop in rspamd_mime_parse_multipart_part with
  len > 0, mirroring the guarded pop in rspamd_mime_parse_message.
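
The idea of charging every parser re-entry against one nesting cap can
be sketched generically. The dict-based part structure, MAX_NESTED value
and names below are hypothetical and unrelated to the real parser types.

```python
MAX_NESTED = 64  # hypothetical cap standing in for max_nested


class NestingTooDeep(Exception):
    """Raised when a message nests deeper than the cap allows."""


def parse_part(part, depth=0):
    """Count leaf parts; every re-entry, whatever its content type, costs depth."""
    if depth > MAX_NESTED:
        raise NestingTooDeep(depth)
    children = part.get("children", [])
    if not children:
        return 1
    return sum(parse_part(child, depth + 1) for child in children)
```

Because the check sits at the top of the single entry point, no
content-type branch can recurse without being counted.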

Alexander Moisseev [Fri, 29 May 2026 13:43:56 +0000 (16:43 +0300)]

[Minor] upstream: improve cooldown log message clarity

When elapsed time rounded to the same value as the minimum interval,
the log showed "checked 60 seconds ago (60 is minimum)", suggesting
the check was skipped at equality despite the strict < comparison.
Replace with remaining cooldown time using ceil() to avoid ambiguity.
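
A small sketch of the remaining-time computation, assuming the check is
skipped only while elapsed < minimum; the function name is hypothetical.

```python
import math


def cooldown_remaining(elapsed, min_interval):
    """Whole seconds left before the next check is allowed, or 0 if allowed now."""
    if elapsed < min_interval:
        return math.ceil(min_interval - elapsed)
    return 0
```

With elapsed = 59.6 and a 60 second minimum this reports 1 second
remaining, instead of a message where both numbers print as 60.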

Dmytro Alieksieiev [Fri, 29 May 2026 10:30:33 +0000 (12:30 +0200)]

[Feature] mx_check: three-layer cache rewrite (#6055)

* [Feature] mx_check: three-layer cache rewrite

This is the comprehensive implementation behind issue #6032. The
single-layer cache of the previous design is replaced by a three-layer
Redis design (d:<domain> / m:<mxhost> / i:<ip>) under <key_prefix>:.
Short-code wire formats minimise the Redis footprint; per-layer validators
(is_valid_cache_value) treat unrecognised entries as a cache miss, and
the resolve / probe path that follows then issues a fresh cache_set at
the same key, overwriting the bad entry in place.

Probe coordination

- SET NX EX claims the i:<ip> probe lock; a post-claim GET disambiguates
  held lock, already-published verdict, and corrupted-value-needing-heal
  cases. A separate force_claim_probe_lock path overwrites corrupted
  values to break the SET NX loop without leaking refcounts.
- Redis errors during the lock claim surface as MX_REDIS_ERROR; a lock held
  by another worker surfaces as MX_INFLIGHT and skips the duplicate TCP
  connection. Under high load such duplicates would look like DoS
  activity to the target and would most likely damage the IP/ASN/Org
  reputation of the Rspamd user.

DNS / probe model

- Dual-stack via probe_ipv4 / probe_ipv6 / prefer_ipv6 with family-tagged
  cache values (v4: / v6: / v64:) and coverage checks so flipping the
  probe-family set re-resolves only as needed.
- Real DNS path failures (SERVFAIL / REFUSED / timeout) are distinguished
  from authoritative NXDOMAIN / NOREC via is_dns_real_failure; the former
  surface as MX_DNS_FAIL (cached as 'df') so a recovered resolver path
  can be re-tried promptly. NXDOMAIN/NOREC collapse into MX_NONE.
- step3 partitions resolved IPs into PUBLIC / LOCAL (RFC1918 / CGNAT /
  ULA) / BOGON (loopback, TEST-NET, multicast, link-local, etc.). Only
  PUBLIC IPs reach the TCP probe. MX_LOCAL_ONLY / MX_LOCAL_MIX /
  MX_BOGON_ONLY / MX_BOGON_MIX fire with the offending IPs as options.
  test_mode lifts loopback out of the bogon set so the probe path can be
  exercised against 127.0.0.1.
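
The PUBLIC / LOCAL / BOGON partition can be sketched with Python's
ipaddress module. The network lists below cover only the ranges named
in this message (the "etc." is not filled in), and classify() is a
hypothetical name, not the module's Lua function.

```python
import ipaddress


def _nets(*cidrs):
    return [ipaddress.ip_network(c) for c in cidrs]


LOCAL_NETS = _nets(
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",  # RFC 1918
    "100.64.0.0/10",                                   # CGNAT
    "fc00::/7",                                        # ULA
)
LOOPBACK_NETS = _nets("127.0.0.0/8", "::1/128")
BOGON_NETS = _nets(
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24",  # TEST-NET-1/2/3
    "224.0.0.0/4", "ff00::/8",                             # multicast
    "169.254.0.0/16", "fe80::/10",                         # link-local
)


def classify(ip, test_mode=False):
    """Return "PUBLIC", "LOCAL" or "BOGON" for an IP address string."""
    addr = ipaddress.ip_address(ip)

    def within(nets):
        return any(addr.version == n.version and addr in n for n in nets)

    if within(LOCAL_NETS):
        return "LOCAL"
    # test_mode lifts loopback out of the bogon set so it can be probed.
    if within(BOGON_NETS) or (not test_mode and within(LOOPBACK_NETS)):
        return "BOGON"
    return "PUBLIC"
```

Only addresses classified as PUBLIC would go on to the TCP probe.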

Symbol surface

- Multi-source: check_from / check_mime_from / check_reply_to, deduplicated
  with envelope > reply-to > mime-from priority when the same domain
  reaches the MX check from several sources. Per-source prefixes
  (symbol_prefix_from / symbol_prefix_mime_from / symbol_prefix_reply_to)
  fan every MX_* symbol across the three sources at registration time.
- A-fallback path (no MX RR, A used as implicit MX per RFC 5321 §5.1)
  has its own MX_A_* symbol family so operators can score it
  independently of the MX-RR path.
- Per-outcome greylist and reject gates (greylist_invalid /
  greylist_none / greylist_broken / ..., reject_null_mx with
  reject_authorized / reject_local kill switches); null-MX domains can
  now trigger a real set_pre_result. reject_nxdomain_mx is removed as a
  poor option to offer: in practice an NXDOMAIN reject would be sound
  only at eTLD+1.
- Probe-outcome symbols (MX_GOOD / MX_TIMEOUT_* / MX_REFUSED /
  MX_INVALID / MX_ERROR / MX_INFLIGHT) populate the option field with
  the MX hostname; IP-class symbols still carry IPs, since the IP is the
  relevant information there. MX_REDIS_ERROR has no option (it's a
  module-internal signal).
- New punishment maps: bad_mxs (glob on MX hostnames) and bad_ips
  (radix on resolved IPs). Any hit short-circuits with MX_BAD /
  MX_IP_BAD before any TCP probe runs, which makes it possible to
  penalise domains that share the same MX infrastructure.

Scoring

- set_metric_all_sources ships sensible defaults for every symbol.
  Operators can tune any weight through the new "mx" group in
  conf/groups.conf via local.d/mx_group.conf or override.d/
  mx_group.conf without touching the module.

Functional tests

- 167_mx_check.robot refreshed for the new symbol set; MX_NONE replaces
  MX_NXDOMAIN/MX_MISSING, MX_A_REFUSED covers the closed-port
  A-fallback case, and MX_BAD / MX_IP_BAD have dedicated assertions.
- 168_mx_check_greeting.robot covers verify_greeting=true /
  send_quit=false: silent listener -> MX_TIMEOUT_READ; continuation
  220- with no follow-up held past read_timeout -> MX_GOOD (a
  regression that re-queued reads under send_quit=false would surface
  as MX_TIMEOUT_READ); 5xx greeting -> MX_ERROR; non-SMTP line ->
  MX_INVALID.
- 169_mx_check_greeting_quit.robot covers verify_greeting=true /
  send_quit=true: proper multi-line timing -> MX_GOOD plus dummy
  status file QUIT_AFTER_FINAL (catches a regression where QUIT is
  sent before the final 220 line, which rspamd's verdict alone cannot
  detect); slow second line -> MX_TIMEOUT_READ.
- util/dummy_smtp.py mock with silent / error / messy / greeting_single
  / greeting_multi modes and a --status-file argument for out-of-band
  timing verification.

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Feature] mx_check: optional per-entry weight multiplier for bad_mxs / bad_ips

Both bad_mxs (glob) and bad_ips (radix) entries can now carry an optional numeric second token that is read as a weight multiplier on top of the MX_BAD / MX_IP_BAD group score. Examples: `trapmx.example.com 3` triples the weight; `1.2.3.4 0.5` halves it. Default multiplier is 1.0 (no value or non-numeric value). Lets operators tier confidence within a single map without maintaining several.
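
A minimal sketch of the second-token parsing described above, in Python
rather than the module's Lua. parse_entry is a hypothetical name, and
treating non-finite values (nan, inf) as the default is an assumption.

```python
import math


def parse_entry(line):
    """Split a map line into (key, multiplier); multiplier defaults to 1.0."""
    parts = line.split()
    key = parts[0]
    multiplier = 1.0
    if len(parts) > 1:
        try:
            value = float(parts[1])
        except ValueError:
            value = 1.0  # non-numeric second token: keep the default
        if math.isfinite(value):  # assumption: nan/inf fall back to 1.0
            multiplier = value
    return key, multiplier
```

The resulting multiplier would then scale the MX_BAD / MX_IP_BAD group
score for that entry.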

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] Use static parent callback in mx_check module

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] Add missing executable flag on dummy_smtp python script

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Chore] Add group to parent mx_check symbol

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] change rspamd_config:add_map to lua_maps so inline maps work too, adjust autotests so they survive parallelism

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
---------

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
Co-authored-by: Vsevolod Stakhov <vsevolod@rspamd.com>

Dmytro Alieksieiev [Fri, 29 May 2026 10:30:13 +0000 (12:30 +0200)]

Merge pull request #6066 from dragoangel/fix/properly-handle-redirects

[Fix] Handle query-embedded URL targets in wrappers and redirectors

Vsevolod Stakhov [Fri, 29 May 2026 09:06:22 +0000 (10:06 +0100)]

Merge pull request #6067 from rspamd/vstakhov-env-baseline-templating

[Feature] Env-overridable baseline config and fasttext model auto-load

Vsevolod Stakhov [Fri, 29 May 2026 08:05:46 +0000 (09:05 +0100)]

[Feature] Auto-load shipped fasttext model when present

When no fasttext_model is configured, fall back to the model shipped at
$SHAREDIR/languages/fasttext_model.ftz: if the file is readable, load
it via the existing direct-load path; otherwise stay silent (debug
only) so stock installs without the model behave exactly as before.

This lets images that ship the model file drop the explicit
fasttext_model config override. The success path reuses
load_model_direct (the same code used for an explicit fasttext_model),
and the absent-file case produces no error and leaves the detector
reporting 'fasttext model is not loaded' as before.

Vsevolod Stakhov [Fri, 29 May 2026 07:43:53 +0000 (08:43 +0100)]

[Feature] Make pidfile env-overridable, empty disables it

Template the baseline pidfile so deployments can relocate or disable it
without patching conf/rspamd.conf:

pidfile = "{= env.PIDFILE|default('$RUNDIR/rspamd.pid') =}";

With no RSPAMD_PIDFILE set it renders to the previous default
($RUNDIR/rspamd.pid). An empty RSPAMD_PIDFILE renders an empty string,
which now means "do not write a pidfile" -- useful when running as PID 1
in a container. Extend the existing cfg->pid_file == NULL guards in both
rspamd_write_pid() and main() to also treat an empty string as unset, so
the existing "pid file is not specified" path is taken.
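
The three cases (unset, empty, explicit) can be sketched in Python. This
mimics the rendered template and the extended guard; it is not the C
code, and the function names are hypothetical.

```python
import os

DEFAULT_PIDFILE = "$RUNDIR/rspamd.pid"


def render_pidfile(env=os.environ):
    """Unset variable renders the default; an empty value stays empty."""
    return env.get("RSPAMD_PIDFILE", DEFAULT_PIDFILE)


def should_write_pidfile(pid_file):
    """None (NULL) and the empty string both mean "not specified"."""
    return bool(pid_file)
```

A container entrypoint could therefore export RSPAMD_PIDFILE= (empty) to
skip the pidfile without editing conf/rspamd.conf.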

Vsevolod Stakhov [Fri, 29 May 2026 07:36:12 +0000 (08:36 +0100)]

[Conf] Make logging type and filename env-overridable

Template the baseline logging block so deployments can switch logging
without patching conf/rspamd.conf:

type = "{= env.LOG_TYPE|default('file') =}";
filename = "{= env.LOG_FILE|default('$LOGDIR/rspamd.log') =}";

With no RSPAMD_LOG_TYPE/RSPAMD_LOG_FILE set the values render to the
previous hardcoded defaults (file, $LOGDIR/rspamd.log), so stock
installs are unchanged. A container can now set RSPAMD_LOG_TYPE=console
to log to stdout. Mirrors the env-template style introduced for the
worker bind_socket lines.

Vsevolod Stakhov [Thu, 28 May 2026 08:29:47 +0000 (09:29 +0100)]

Merge pull request #6064 from rspamd/vstakhov-dynamic-composites

[Feature] Dynamic composites: hot-reloadable composites map

Alexander Moisseev [Wed, 27 May 2026 08:09:39 +0000 (11:09 +0300)]

[Feature] Add fixed-point formatting to fpconv (#6061)

* [Feature] Add fixed-point formatting to fpconv

- Add FPCONV_PRECISION_ALL sentinel for trim-trailing-zeros mode
  with compile-time guard (static_assert > 17 significant digits)
- Implement %.Nf rounding with carry (round_at, trim_trailing_zeros)
- Fix %.0f carry detection for numbers like 9.9 -> 10
- %f/%F/%g/%G use FPCONV_PRECISION_ALL instead of hardcoded literals
- Add C++ unit tests for fpconv precision and rounding

* [Fix] Fix carry overflow from fractional rounding in fpconv

- Add round_at_ex with carry_overflow flag to detect full carry
  that shifts digits and prepends '1'
- Fix offset<=0 branch (0.xxx): carry now correctly produces
  "1.0" instead of "0.1" (e.g. 0.96 → "1.0")
- Fix offset>0 branch (1.xxx-9.xxx): round_at called before
  copying to dest so integer digits are always fresh; carry
  correctly expands integer part (e.g. 9.96 → "10.0")

* [Fix] Fix wrong digits array index in fpconv offset<=0 rounding

Leading zeros are written by memset to dest, not stored in the
digits array. The rounding path incorrectly used orig_offset as
an index into digits for both round_at_ex position and memcpy
source, causing wrong output (e.g. 0.0123 → "0.02" instead of
"0.01") and potential out-of-bounds reads when ndigits < orig_offset.

* [Rework] Extract fpconv fixed-point formatting into a separate shim layer

* [Fix] Fix rounding in fpconv_format emit_fixed_digits

Defect 1: Change >= to > when comparing leading zeros count with
precision, so that values like 0.005 with %.2f correctly round to
"0.01" instead of "0.00".

Defect 2: When carry occurs within the fractional part (e.g. 0.0999
with %.2f), emit "0.10" instead of incorrectly outputting "1.00".
Carry now distinguishes between crossing the integer boundary and
propagating within the fraction.

Also handle the case where precision equals the leading zeros count:
check the first significant digit directly for rounding instead of
calling round_at_ex with precision=0.
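
The expected outputs quoted in these fixes can be reproduced with a
reference sketch in Python. It deliberately swaps the digit-array carry
logic (round_at_ex) for plain integer arithmetic with half-up rounding,
so it is a cross-check of the results, not the fpconv implementation.
The digits/offset convention (value = 0.<digits> * 10**offset) and the
function name are assumptions made for this example.

```python
def format_fixed(digits, offset, precision):
    """Format 0.<digits> * 10**offset with `precision` fractional digits."""
    n = int(digits)
    # Power of ten that scales the value so the last kept digit is the unit.
    shift = offset - len(digits) + precision
    if shift >= 0:
        q = n * 10 ** shift
    else:
        q, r = divmod(n, 10 ** -shift)
        if 2 * r >= 10 ** -shift:
            q += 1  # half-up; any carry propagates through the integer
    s = str(q).rjust(precision + 1, "0")
    if precision == 0:
        return s
    return s[:-precision] + "." + s[-precision:]
```

For example, 9.96 is digits "996" with offset 1, and 0.0999 is digits
"999" with offset -1.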

* [Refactor] Move fpconv_format shim from contrib/ to src/libutil/

---------

Co-authored-by: Vsevolod Stakhov <vsevolod@rspamd.com>

Vsevolod Stakhov [Tue, 26 May 2026 20:57:47 +0000 (21:57 +0100)]

[Test] composites: functional test for dynamic UCL composites map

Exercises load -> reload-with-update -> reload-with-stub:
1. INITIAL MAP - DYN_ONE FIRES: load composites from map.1, scan a
    message, confirm DYN_ONE and DYN_TWO fire with their declared
    scores. Static composite STATIC_COMP also fires alongside.
2. RELOAD - UPDATED SCORES AND NEW NAME: swap to map.2 (DYN_ONE
    score updated, DYN_TWO removed, DYN_THREE introduced), wait for
    the map watcher, scan, confirm new scores + new composite +
    DYN_TWO gone (stubbed).
3. RELOAD - REMOVED COMPOSITE BECOMES STUB: swap back to map.1.
    DYN_ONE/DYN_TWO are back with original scores, DYN_THREE was in
    the previous generation but is now absent -> verifies the stub
    path keeps the name out of scan results.

Lua plugin registers DYN_BASE_A/B/C as always-firing atomic symbols
so the composite expressions resolve deterministically. Config sets
map_watch_interval = 0.5s for tight reload turnaround.

Vsevolod Stakhov [Tue, 26 May 2026 20:08:45 +0000 (21:08 +0100)]

[Conf] composites: route composites.dynamic to map handler

Add a reserved key in the composites { ... } config block so users can
attach a hot-reloadable map of composites:

    composites {
        STATIC_COMP { expression = "..."; score = 1.0; }
        dynamic = "/etc/rspamd/composites.map";
        # or dynamic = ["http://a/x", "file://y"];
        # or dynamic = { url = "..."; signature = "..."; }
    }

The handler intercepts the 'dynamic' key inside the composites section,
hands the UCL value to rspamd_composites_add_dynamic_map(), and lets
the rest of the section continue with static composite definitions.

Smoke-tested by running rspamd against a config with a file-backed
dynamic map: map_fin fires, the publish pipeline registers the
composites with the symcache, and the dynamic generation bumps to 1.

Vsevolod Stakhov [Tue, 26 May 2026 19:19:24 +0000 (20:19 +0100)]

[Feature] composites: dynamic UCL map handler

Implements hot-reloadable composites maps. The map content is a UCL
object mapping composite name to a body of expression, score, group,
policy, description, groups, enabled — the same vocabulary the static
composites { ... } config block accepts.

Manager additions:
- build_staging() clones base_gen so the map handler can mutate a
   detached generation without disturbing in-flight tasks
- add_composite_to_staging() parses one UCL composite into staging
   and reflects it in cfg->symbols
- disable_in_staging() materialises a disabled stub for a name
- publish_generation() registers any new composite names with the
   symcache, bumps the resort generation, runs the analysis pipeline
   on the staging, and atomically swaps current_gen
- seal_static_load() captures the static-config generation as
   base_gen and seeds ever_seen_names; called once from
   rspamd_composites_mark_whitelist_deps
- symcache_pinned keeps the first composite shared_ptr per name
   alive forever, so the symcache's cbdata never dangles even when
   later generations replace the composite

Per-map state (map_cbdata) tracks last_names so a reload that drops a
name turns it into a stub instead of leaving it ghosted.

rspamd_composites_add_map_handlers — already in tree but unwired —
now parses the buffered bytes as UCL instead of NAME:SCORE EXPRESSION,
and routes through the new staging pipeline.

Public C API:
- rspamd_composites_add_dynamic_map() — registers a dynamic map
- rspamd_composites_current_generation() — diagnostics

cfg_rcl wiring (composites.dynamic = ...) is the next commit; this
commit only adds the runtime + API. Static composites are unchanged;
17/17 functional tests in 109_composites + 109_settings_merge pass.
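
The staging / publish / stub flow can be modelled in a few lines of
Python. This is a toy model, not the C++ manager: the class and function
names are hypothetical, composites are plain dicts, and symcache
registration and the analysis pipeline are left out.

```python
import copy


class Generation:
    def __init__(self, composites=None):
        # name -> {"expression": str, "score": float, "disabled": bool}
        self.composites = dict(composites or {})


class Manager:
    def __init__(self, static):
        self.base_gen = Generation(static)  # sealed static-config generation
        self.current_gen = self.base_gen
        self.generation = 0

    def build_staging(self):
        """Clone the base so edits never touch a generation tasks may hold."""
        return Generation(copy.deepcopy(self.base_gen.composites))

    def publish(self, staging):
        self.current_gen = staging  # single reference swap
        self.generation += 1


def reload_map(manager, new_defs, last_names):
    """Apply one map load; return the set of names seen in this load."""
    staging = manager.build_staging()
    for name, body in new_defs.items():
        staging.composites[name] = {**body, "disabled": False}
    for name in last_names - set(new_defs):
        # A name dropped by the reload becomes a disabled stub.
        staging.composites[name] = {"expression": "", "score": 0.0, "disabled": True}
    manager.publish(staging)
    return set(new_defs)
```

A task that captured manager.current_gen before a reload keeps reading
its own snapshot, because publish only rebinds the manager's reference.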

Vsevolod Stakhov [Tue, 26 May 2026 19:02:01 +0000 (20:02 +0100)]

[Refactor] composites: parameterise build helpers by generation

process_dependencies, build_inverted_index, mark_whitelist_dependencies,
collect_leaf_atoms, the composite-dep cbdata and the inverted-index
cbdata all take an explicit composites_generation reference now and
operate solely on it, with no implicit access to manager state.

The manager keeps a no-arg overload of each that forwards to
*current_gen — config-load wiring is unchanged.

This unblocks building a staging generation (under a dynamic-map
reload) without touching the live one. No behaviour change for static
configurations.

Vsevolod Stakhov [Tue, 26 May 2026 18:49:35 +0000 (19:49 +0100)]

[Refactor] composites: extract per-task generation snapshot

Hoist the per-pass evaluation vectors, inverted index, and ownership
lists into a new composites_generation struct held inside composites_manager
as a shared_ptr<composites_generation> current_gen.

composites_data takes a snapshot of current_gen at task-creation time and
all read paths (first/second-pass walking, inverted-index lookup,
not_only fallback, composite-reference recursion) now go through the
pinned snapshot. This is a no-op today — only one generation ever
exists — but is the foundation for hot-reloadable composite maps where
the manager swaps current_gen while in-flight tasks must keep using
their snapshot.

Composite ids are now allocated through composites_manager::next_id()
which is monotonic across generations so an id is unique for the life
of the worker; composites_data::checked is sized from the maximum id
in the snapshot.
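
A rough Python model of the snapshot semantics, assuming nothing about the real C++ types beyond what is described above. Generation, Manager, TaskData and publish() are hypothetical names; object references stand in for shared_ptr.

```python
import itertools


class Generation:
    """Immutable view of the composites known at one point in time."""

    def __init__(self, composites):
        self.composites = dict(composites)  # name -> id


class Manager:
    def __init__(self):
        self._ids = itertools.count(1)  # monotonic across generations
        self.current_gen = Generation({})

    def next_id(self):
        return next(self._ids)

    def publish(self, names):
        # Build a fresh generation and swap it in; old ones stay alive
        # for as long as some task still references them.
        self.current_gen = Generation({n: self.next_id() for n in names})


class TaskData:
    def __init__(self, manager):
        self.gen = manager.current_gen  # pinned at task-creation time
        max_id = max(self.gen.composites.values(), default=0)
        self.checked = [False] * (max_id + 1)  # sized from the snapshot
```

A task created before publish() keeps reading its pinned generation, while tasks created afterwards see the new one.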

Removed the cached atom->ncomp / comp_type resolution. Caching a
manager pointer on a shared atom would dangle if a referenced
composite is replaced in a later generation; instead each evaluation
resolves the composite name through the task's snapshot via a single
hashtable lookup. Dropped rspamd_composites_resolve_atom_types and the
corresponding enum.

Added rspamd_composite::disabled — wired through the eval path,
process_dependencies, build_inverted_index and mark_whitelist_dependencies
so that stub composites (used in later commits to replace removed
entries on map reload) skip out of every index without being evaluated.

No behaviour change for static composite configurations; functional
tests in test/functional/cases/109_composites.robot pass unchanged.


Vsevolod Stakhov [Tue, 26 May 2026 15:09:21 +0000 (16:09 +0100)]

[Test] 440_ssl_server: wait for SSL controller in suite setup

The previous attempt at killing this flake added per-test retries of
15 x 0.4s = 6s to the two controller-SSL HTTPS tests. Under heavy
parallel pabot load (4 workers + concurrent serial robot on the same
box) we have observed the controller's SSL listener take longer than
6s to start accepting after Run Rspamd's readiness check passes, and
both retry budgets get exhausted in sequence.

Run Rspamd's readiness check pings the plain normal worker and (for
configs with a control socket) waits for the controller to register
its workers with main. Neither covers the SSL listener: OpenSSL ctx
init for that listener happens after the worker is announced and
can lag by hundreds of ms in the worst case.

Move the wait into a single Suite Setup with a generous 30s budget
(60 x 0.5s) so we pay it once and the individual tests can issue a
direct HTTPS request again. The suite setup uses /ping (smallest
controller endpoint, served unauthenticated from 127.0.0.1 which is
in secure_ip). If the listener never comes up the suite fails loudly
in setup rather than every test independently exhausting a 6s retry.
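
The wait itself is a bounded poll. A minimal Python sketch of the same shape (the suite uses Robot keywords, not this code; wait_until and its parameters are hypothetical):

```python
import time


def wait_until(probe, attempts=60, interval=0.5, sleep=time.sleep):
    """Call probe() until it returns True or the budget is exhausted.

    With the defaults the budget is 60 x 0.5s = 30s. No sleep is
    taken after the final attempt.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt + 1 < attempts:
            sleep(interval)
    return False
```

The sleep argument is injectable only so the loop can be exercised without real delays.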

Local: three back-to-back parallel pabot runs (4 processes, full
001 Merged suite) -- 6/6 pass, suite finishes in ~4-5s.


Vsevolod Stakhov [Tue, 26 May 2026 08:09:41 +0000 (09:09 +0100)]

[Minor] DNS: Remove unused SERVFAIL cache

The fails_cache feature (introduced in e3057e5e4, Oct 2019) was undocumented,
disabled by default, never exercised in tests, and never adopted in
practice — including by the single deployment it was originally written for.

Negative DNS caching, if ever needed, belongs in librdns.


Vsevolod Stakhov [Mon, 25 May 2026 20:00:43 +0000 (21:00 +0100)]

[Test] 411_logging: read per-suite rspamd output, not global .last

save_run_results writes each rspamd's logs to two destinations: the
stable per-suite/per-test directory under robot-save/, and a global
robot-save/<file>.last "convenience" copy of the most recent run.

The three 411_logging tests asserted on the .last copies. Under
pabot another worker can tear down -- and overwrite the .last files
-- between this suite's Rspamd Teardown saving them and the
assertion reading them, so the assertion ends up running against a
different suite's rspamd output and matching the wrong format.

Switch to the per-suite paths
(robot-save/${SUITE_NAME}/rspamd.stderr for the console suites,
robot-save/${SUITE_NAME}/${TEST_NAME}/rspamd.log for the JSON file
test). Those paths aren't shared across pabot workers.

Local: three back-to-back parallel runs of the 411_logging
directory pass 3/3 each time.


Vsevolod Stakhov [Mon, 25 May 2026 16:12:43 +0000 (17:12 +0100)]

[Test] 440_ssl_server: tolerate slow controller SSL bind

The controller worker registers with the main process slightly
before its SSL listener finishes initializing OpenSSL and starts
accepting connections. The pre-test readiness check in Run Rspamd
sees "workers" appear in `rspamadm control stat` -- proof that
registration is done -- but the SSL socket on PORT_CONTROLLER_SSL
can still briefly refuse for tens to hundreds of milliseconds
after that, especially under concurrent-phase load on CI.

The first two tests in 440_ssl_server hit the SSL controller port
back-to-back and were the only ones to occasionally fail with
"Connection refused"; the remaining four (plain controller,
SSL/plain normal worker) ran later in the suite and always passed
because the SSL listener was up by the time they reached it.

Wrap just those two HTTPS calls in `Wait Until Keyword Succeeds`
(15 x 0.4s = ~6s) so the test reflects what it actually verifies:
the SSL controller eventually serves /stat and /errors. Refactor
the assertion into a small `Fetch HTTPS And Expect 200` keyword
to keep both retries readable.

Local: three back-to-back parallel pabot runs of the suite -- 6/6
pass each time, no flakes.


Vsevolod Stakhov [Mon, 25 May 2026 12:44:37 +0000 (13:44 +0100)]

[Project] Parallelise functional tests via pabot (#6060)

* [Project] Parallelise functional tests via pabot

Switch the Robot Framework functional test suite from a single serial
robot invocation to a two-phase pabot + robot run, giving CI a ~3-4x
wall-clock win on the parallel-safe portion while keeping the rest
working unchanged.

Worker isolation lives in test/functional/lib/vars.py. Each pabot
worker reads PABOTEXECUTIONPOOLID and applies a port offset of
index*100 across every rspamd / redis / nginx / clam / fprot / avast
/ dummy-http / dummy-https / dummy-http-early / dummy-llm / dummy-udp
/ dummy-ssl port, plus a per-worker /tmp/rspamd-functional-<index>/
prefix for unix sockets and pidfiles. Plain `robot` runs unchanged
(no env var -> index 0 -> the historical port numbers).
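
As a sketch of the offset scheme, not the actual vars.py: worker_layout and the key names in the usage note are hypothetical, while the index*100 offset and the tmp prefix pattern come from the description above.

```python
def worker_layout(index, base_ports):
    """Shift every base port by index*100 and derive the per-worker tmp prefix."""
    offset = index * 100
    ports = {name: port + offset for name, port in base_ports.items()}
    tmp_prefix = f"/tmp/rspamd-functional-{index}/"
    return ports, tmp_prefix
```

For example, worker_layout(1, {"some_port": 56789}) maps that port to 56889, and index 0 leaves every port at its historical value.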

The dummy_* helper utilities now derive their PID paths from
{tmp_prefix}/dummy_<svc>-<port>.pid (or socket basename for p0f) via
a small util/dummy_pidfile module, so two instances on different
ports no longer collide. Existing override-via-argv callsites still
work. Robot keywords in lib/rspamd.robot are updated to use the
vars-driven ports and pidfile paths; suites that read those PIDs
(161_p0f, 230_tcp, 001_merged/{160_antivirus,310_udp}) and the
url-redirector log-grep in 162_url_redirector are templated to
match.

Twelve suites still bake dummy_http/dummy_llm/dummy_http_early/tcp
port numbers into Lua test scripts (test/functional/lua/{http,
http_early_response,tcp}.lua) and three configs (settings.conf,
neural_llm.conf and the assertion literals in url_redirector*),
so they only work at the worker-0 port offset. Tagging them
`notparallel` and running them with plain robot after the pabot
batch sidesteps the collision without templating those Lua scripts
in this change.

CI (.github/workflows/ci_rspamd.yml) installs pabot via pip
(--break-system-packages with a fallback for older pip in the
Fedora image), then runs:
  * Phase 1: pabot --processes 4 --exclude notparallel
            -> outputdir build/parallel/
  * Phase 2: robot --include notparallel
            -> outputdir build/serial/
Both phases run unconditionally and the step exits non-zero if
either failed. Artifact upload now collects both outputdirs plus
the legacy build/*.*ml path.

Local invocation is `test/functional/run-parallel.sh`, a thin
wrapper documented in CLAUDE.md. The script forces suite-level
splitting (no --testlevelsplit) because each Suite Setup starts
its own rspamd.

Follow-ups (not in this change):
  * Template the four Lua scripts and three configs so the twelve
    notparallel suites can drop the tag.
  * Split 001_merged/ (30 sub-suites under one rspamd) into
    independent units; currently pinned to one worker and the long
    pole of phase 1.

* [Fix] functional tests: claim worker slot via /tmp lockfile

Pabot 5.2.2 does not export PABOTEXECUTIONPOOLID to child robot
subprocesses, even though the variable name appears in pabot's own
source for internal accounting. The previous worker-index detection
fell through to 0 in every pabot worker, so all four workers used
identical rspamd / redis / fuzzy port offsets and crashed in
Multi Setup with "Address already in use".

Replace the env-only lookup with an atomic file-claim:

  * RSPAMD_WORKER_INDEX / PABOTEXECUTIONPOOLID still win when set
    (explicit override, future pabot versions).
  * Otherwise each process atomically grabs the first free
    /tmp/rspamd-functional.slot-<N> via O_CREAT|O_EXCL, writing its
    pid. A stale slot (pid no longer alive) is reclaimed by the next
    caller. atexit unlinks the slot when the process exits.
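
A minimal sketch of the file-claim, assuming POSIX semantics; the real logic lives in test/functional/lib/vars.py and may differ. The function names, the env parameter and the max_slots bound are hypothetical, and stale-slot reclaim here is best-effort rather than race-free.

```python
import atexit
import os


def pid_alive(pid):
    """True if a process with this pid exists (POSIX signal-0 probe)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def slot_is_stale(path):
    """A slot is stale only if it names a pid that is no longer alive."""
    try:
        with open(path) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable or half-written: not stale
    return not pid_alive(pid)


def release(path):
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass


def try_create(path):
    """Atomically create the slot file; False if somebody else holds it."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(str(os.getpid()))
    return True


def claim_slot(prefix="/tmp/rspamd-functional.slot-", env=None, max_slots=64):
    env = os.environ if env is None else env
    # Explicit overrides win when set.
    for name in ("RSPAMD_WORKER_INDEX", "PABOTEXECUTIONPOOLID"):
        if env.get(name):
            return int(env[name])
    for index in range(max_slots):
        path = f"{prefix}{index}"
        if slot_is_stale(path):
            release(path)
        if try_create(path):
            atexit.register(release, path)  # free the slot on exit
            return index
    raise RuntimeError("no free worker slot")
```

O_CREAT|O_EXCL is what makes the claim atomic: exactly one caller can create a given slot file, and everyone else moves on to the next index.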

Verified locally:

  * Four concurrent python imports of vars.py get indices 0..3 with
    no collisions; slot files cleaned up on exit.
  * `pabot --processes 2` over two trivial robot suites prints
    distinct port ranges (56789 vs 56889) from each worker.

* [Fix] worker binds: env-templated defaults; diagnostic log tail

The four built-in workers (normal, controller, rspamd_proxy, fuzzy)
in conf/rspamd.conf hardcoded `localhost:1133[2-5]`. Under parallel
pabot every rspamd instance tried to bind those same ports and the
second one onwards hard-terminated with "Address already in use".

Switch the bind_socket lines to jinja templates with the existing
production strings as defaults:

  bind_socket = "{= env.LOCAL_ADDR|default('localhost') =}:\
                 {= env.PORT_NORMAL|default('11333') =}";

Production behaviour is preserved bit-for-bit -- with no env vars,
the templates resolve back to `localhost:11332..11335`. The functional
test harness already exports RSPAMD_LOCAL_ADDR / RSPAMD_PORT_*, which
rspamd's lua_common.c strips of the RSPAMD_ prefix when populating
rspamd_env, so `env.PORT_NORMAL` etc. pick up the per-worker slot
values from test/functional/lib/vars.py automatically.
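
The lookup chain can be modelled in a few lines of Python. This is only an approximation of the behaviour described above (prefix stripping plus a template default), not rspamd's implementation; build_rspamd_env and bind_socket are hypothetical helpers.

```python
def build_rspamd_env(environ, prefix="RSPAMD_"):
    """Keep only RSPAMD_* variables and strip the prefix from their names."""
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def bind_socket(env):
    """Mimic the templated bind line for the normal worker."""
    host = env.get("LOCAL_ADDR", "localhost")
    port = env.get("PORT_NORMAL", "11333")
    return f"{host}:{port}"
```

With an empty environment this yields localhost:11333, which is the stock default; exporting RSPAMD_PORT_NORMAL changes only the port.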

Verified locally:
  - `pabot --processes 4` over the four `001_merged` sub-suites
    (Cases.001 Merged.{099,100,101,102}) passes 122/122 tests where
    it used to fail every test with hard_terminate.
  - Full phase-1 run (`pabot --processes 4 --exclude notparallel`)
    completes in 2m20s with 646/666 passing; the 20 failures are all
    local mac env-specific issues (missing pynacl, missing
    liblua.5.1.dylib for miltertest, etc.) unrelated to this change.
  - `rspamadm configdump` on a stock config (no env override) still
    binds `localhost:11332..11335` byte-for-byte.

Also enrich Rspamd Startup Check to surface the last 80 lines of
rspamd.log plus exit code, port and tmpdir on Process Is Gone --
the previous one-line "loading configuration" stderr made the bind
collision invisible from CI artifacts and forced a local repro to
diagnose.

* [Test] functional: dummy-port env in lua + settle after startup

Three classes of leftover collisions surfaced once worker bind_sockets
were templated and parallel rspamds actually started:

  * lua/udp.lua and lua/maps_kv.lua (loaded by 001_merged) and the
    rspamadm script lua/rspamadm/test_redis_client.lua hardcoded the
    dummy_udp / dummy_http / redis ports. Workers on slot index > 0
    bound their dummies on shifted ports, so the lua scripts kept
    talking to the slot-0 endpoints and tests timed out. Read
    env.PORT_DUMMY_UDP / env.PORT_DUMMY_HTTP / env.REDIS_PORT (set
    via vars.py -> RSPAMD_PORT_* -> rspamd_env stripped of the
    RSPAMD_ prefix in lua_common.c) and fall back to the historical
    literals so the scripts still run outside the harness.

  * configs/merged-override.conf EXTERNAL_MULTIMAP and
    configs/settings.conf external_map baked
    `http://127.0.0.1:18080/...` into rspamd's own config. Switch
    those to `{= env.PORT_DUMMY_HTTP|default('18080') =}` so the
    multimap external backend resolves to the per-worker dummy_http.

  * lib/rspamd.robot Rspamd Setup polled the startup-check loop with
    `IF ${ok} CONTINUE`, which kept iterating after the first
    successful ping but added effectively no grace period for the
    controller / proxy workers to finish registering with the main
    process. Under parallel load the first `rspamadm control stat`
    in 001_merged.099 Control returned an empty workers list.
    Switch to `BREAK` on success and add a 0.5s settle period.

Verified locally: previously-failing
099_control / 100_general / 101_lua / 102_multimap /
310_udp / 151_rspamadm_async now pass 126/126 under
pabot --processes 4 in ~17s.

* [Test] functional: fix two more parallel races

Two leftover collisions surfaced once 001_merged was actually starting
rspamds in parallel across pabot workers:

* test/functional/lua/lua_extras_test.lua writes its staging tree to
  os.getenv('TMPDIR'). On Linux CI TMPDIR is unset, so every worker
  raced on a shared /tmp/lua_extras_test directory -- one worker's
  `rm -rf` would wipe another worker's tree mid-test and rspamd
  config load aborted with `cannot init lua file ... No such file
  or directory`. Prefer RSPAMD_TMPDIR (per-suite tmpdir, propagated
  via env:RSPAMD_TMPDIR in Run Rspamd) so workers don't share state.

* 151_rspamadm_async/Redis client invokes `rspamadm lua -b
  test_redis_client.lua` which connects to redis directly. The
  previous fix used `rspamd_env.REDIS_PORT`, but rspamadm's lua
  context (unlike the daemon's) does not populate the `rspamd_env`
  global -- only rspamadm_session/_ev_base/_dns_resolver are set --
  so the lookup always fell through to the literal 56379. Read
  `os.getenv("RSPAMD_REDIS_PORT")` instead. Also call
  `Export Rspamd Variables To Environment` from the suite's Setup
  so the env vars are actually present in the rspamadm subprocess
  inherited environment (this suite never calls Run Rspamd, which
  is where the export normally happens).

Local: `pabot --processes 2` over 102_multimap / 151_rspamadm_async /
271_lua_extras passes 83/83 in ~8s.

* [Test] CI: run parallel + serial functional phases concurrently

The two-phase split (pabot for parallel-safe suites, plain robot for
notparallel-tagged ones) ran sequentially -- on fedora that meant
2:16 (pabot, 666 tests) + 1:35 (robot, 92 tests) = ~4 minutes total
versus master's ~6 minutes serial. The pabot phase itself is already
at ~91% of theoretical 4-worker speedup (8:14 of work in 2:16
wall-clock), so bumping --processes won't help much -- the cheap
win is overlapping the two phases.

Background both phases with `&`, capture their PIDs, then `wait`
each separately to harvest exit codes. They claim disjoint slots
from the vars.py file-based allocator (pabot grabs 0..3, robot
grabs 4), so their rspamds use different port ranges and tmp
prefixes and don't collide.
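
The CI step does this in shell with `&` and `wait`. The same pattern, sketched in Python for illustration only (run_concurrently is a hypothetical helper, not part of the workflow):

```python
import subprocess


def run_concurrently(commands):
    """Start every command first, then wait on each to collect exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


def overall_status(codes):
    """Non-zero if any phase failed, mirroring the step's exit rule."""
    return 1 if any(code != 0 for code in codes) else 0
```

Starting all processes before the first wait is what gives the overlap; waiting on each one separately keeps the exit codes distinguishable.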

Expected total wall-clock: ~max(2:16, 1:35) ~= 2:20, down from ~4:00.

Verified locally: 4 pabot workers + 1 serial robot running 6
suites in parallel (115 + 33 tests) all pass in 27s on a 4-core
mac with the same vars.py slot allocator. No port collisions
observed.

* [Test] Revert misleading CLAUDE.md additions

The functional-test commands I added were wrong on two counts:

  * RSPAMD_INSTALLROOT=~/rspamd.install -- that path is stale on this
    repo's typical setup; the CMake install prefix is /usr/local.
  * "driven by PABOTEXECUTIONPOOLID" -- pabot 5.2.2 does NOT actually
    export that env var to child robot subprocesses (confirmed via
    dump-env test). The real mechanism is the file-based slot claim
    in test/functional/lib/vars.py (/tmp/rspamd-functional.slot-N).

Removing the lines rather than fixing them in place; the right
home for parallel-test docs is alongside the runner script and the
PR description, not a duplicate copy in CLAUDE.md that risks drifting.

* [Test] Verify controller ready + rebot merge unified report

Two issues from the concurrent-phases run:

* `Cases.001 Merged.099 Control` flaked again ("'' does not contain
  'workers'"). rspamd's controller binds and answers HTTP ping
  almost immediately, but its workers list is populated only after
  each worker has registered back with the main process. Under
  parallel pabot + the concurrent serial phase (5 rspamds competing
  for CPU at startup) the gap stretched out and a fixed 0.5s settle
  was no longer enough.

  Replace the blind settle with a real readiness check: after the
  ping loop, if rspamd.sock is present in TMPDIR, poll
  `rspamadm control stat` (via the new keyword
  Verify Controller Workers Registered) until the response actually
  contains "workers". Cheap when fast, retried up to ~6s when
  rspamd is starting slowly. Local: five back-to-back parallel
  runs over 099/100/102/270 -- 530/530 tests pass, no flakes.

* The CI step left three output.xml files
  (build/parallel/{pabot_results/N/,}output.xml and
  build/serial/output.xml) and no single top-level report, so a
  reviewer skimming the CI log saw only one pabot sub-suite path
  and read it as "we only ran part of the suite". Run
  `rebot --merge` after both phases finish to produce a unified
  build/output.xml + log.html + report.html alongside the two
  phase outputs, matching the artifact shape master used to have.

* [Test] Fix readiness check; replace [Return] with RETURN

Two fixes:

* The previous unconditional `Wait Until Keyword Succeeds` for the
  control socket assumed every suite produces $DBDIR/rspamd.sock.
  That holds for 001_merged (includes options.inc -> control_socket
  = "$DBDIR/rspamd.sock") but NOT for the many suites that build a
  minimal standalone config (231_tcp_down etc.). Those never get a
  control socket, so the 50 x 0.2s poll always exhausted and broke
  every test in those suites.

  Wait up to 2s for the socket file to appear -- if it does, poll
  `rspamadm control stat` until the response contains "workers"
  (the real readiness signal CONTROL STAT depends on); if it
  doesn't, just proceed, since suites that never produce a control
  socket can't be testing it.

* Convert the [Return] setting to the RETURN statement across the
  five files that still used the old syntax. Robot Framework 7
  deprecated [Return], and the resulting deprecation warnings were
  swamping every test step's stdout, making real failures hard to
  spot:
    cases/001_merged/115_dmarc.robot
    cases/001_merged/160_antivirus.robot
    cases/151_rspamadm_async.robot
    cases/320_arc_signing/003_roundtrip.robot
    lib/rspamd.robot

Verified locally: three back-to-back concurrent-phase runs (4-way
pabot + serial robot for notparallel suites) -- (106 + 33) tests
all pass each time, no flakes, no deprecation warnings.

* [Test] CI: redirect each phase to its own log, group in step output

Previously both concurrent phases (pabot and serial robot) wrote to
the step's combined stdout, so pabot's batched end-of-run summary
and robot's streaming output interleaved. Reviewers were seeing
what looked like only one of the two runs.

Redirect each phase's stdout+stderr to its own
build/phase{1-parallel,2-serial}.log, wait on both PIDs, then
`cat` the two logs in fixed order with GH Actions
::group::/::endgroup:: directives so they collapse to two clean
sections in the web UI. Wall-clock unchanged -- the two phases
still run concurrently; only the presentation is sequential.

Also include the two per-phase logs in the robotlog artifact
upload so they're inspectable after the run.


Vsevolod Stakhov [Sat, 23 May 2026 17:34:42 +0000 (18:34 +0100)]

Merge pull request #6056 from dragoangel/feat/url-redirector-swap-redirectors-map-to-glob

[Feature] url_redirector: switch redirector_hosts_map from set to glob


Vsevolod Stakhov [Sat, 23 May 2026 13:15:02 +0000 (14:15 +0100)]

[Fix] fuzzy_check: accept SRV-only rules at config-load

After switching the default rspamd.com rule to service=fuzzy+rspamd.com,
'rspamadm configtest' logged 'no servers defined for fuzzy rule with
name: rspamd.com' and the rule was rejected. The check at
fuzzy_check.c:2183 uses rspamd_upstreams_count(), which deliberately
excludes SRV parent placeholders because callers like the upstream-
weight setter in dns.c and the lua_createtable size hints elsewhere
want the dispatchable cluster size, not the configured-entry count.

At config-load the SRV parent is the only thing in the list (members
are populated asynchronously after DNS resolution), so the existing
count returned 0 and the rule was rejected.

Add rspamd_upstreams_count_total() that includes SRV parents and use
it for the "is anything configured at all" gate. The four other
callers of rspamd_upstreams_count (dns weight, three Lua table size
hints) keep the existing dispatchable-only semantics, which is what
they want.
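
The difference between the two counters, as a Python model rather than the C implementation; the Upstream class and its is_srv_parent field are hypothetical stand-ins for the real upstream list entries.

```python
from dataclasses import dataclass


@dataclass
class Upstream:
    name: str
    is_srv_parent: bool = False  # placeholder awaiting DNS resolution


def upstreams_count(upstreams):
    """Dispatchable entries only: SRV parent placeholders are excluded."""
    return sum(1 for u in upstreams if not u.is_srv_parent)


def upstreams_count_total(upstreams):
    """Every configured entry, SRV parents included."""
    return len(upstreams)
```

At config-load an SRV-only rule holds a single parent entry, so the first counter reports 0 and the second reports 1.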


Vsevolod Stakhov [Sat, 23 May 2026 10:43:21 +0000 (11:43 +0100)]

[Test] neural: drift threshold for pure-symbols mode (50%)

Adds a Robot suite that exercises both sides of the new
is_profile_compatible threshold:

  Train pure-symbols ANN
    Standard 10 spam + 10 ham autotrain pattern (mirrors 001_autotrain).

  Inference fires before drift
    Baseline check: NEURAL_SPAM_SHORT / NEURAL_HAM_SHORT fire after
    training completes.

  40 percent drift keeps the prior profile compatible
    FORCE_DRIFT_NEURAL_40 drops the last 40% of set.symbols and prepends
    40% fresh "DRIFT_NEW_SYM_*" entries; distance_sorted against the
    trained profile reports ~40% of |set.symbols|. With the cap raised
    to 50%, the prior profile is still accepted and inference keeps
    firing. Pre-fix (30% cap) this configuration would have orphaned
    the ANN.

  60 percent drift rejects the prior profile
    FORCE_DRIFT_NEURAL_60 pushes drift to ~60%, above the new 50%
    cap. is_profile_compatible rejects, set.ann stays unset,
    NEURAL_*_SHORT do not fire -- pins the upper bound so a future
    too-permissive change (e.g. raising the cap to 70%) trips here.

Note on the drift formula: distance_sorted is an asymmetric edit-
distance walk, not a symmetric-difference counter. When the fresh
entries sort before every baseline name and the dropped entries are
at the tail, the function reports dist ≈ replace_k rather than 2k.
So to hit dist == drift_pct% of n the helper drops and adds
k = drift_pct * n / 100 (not / 200). The first attempt at this test
hit the / 200 trap and the 60% case stayed under the cap.
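
A sketch of the drift helper's arithmetic, written in Python for illustration (the test plugin itself is Lua). drift_symbols is a hypothetical name and distance_sorted is deliberately not modelled; the sketch only shows how k is derived and which entries are swapped.

```python
def drift_symbols(baseline, drift_pct):
    """Drop the last k names and prepend k fresh ones.

    k = drift_pct * n / 100, so 40% of a 10-entry list swaps 4 names.
    The baseline is assumed to be sorted already.
    """
    n = len(baseline)
    k = drift_pct * n // 100
    fresh = [f"DRIFT_NEW_SYM_{i}" for i in range(k)]
    return fresh + baseline[: n - k]
```

The list length is unchanged, and exactly k entries differ from the baseline.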

Per-(rule, set) baseline is snapshotted on the first drift call so
the 60% test compares against the originally-trained list, not the
already-drifted one from the 40% test.

The disable_symbols_input + providers scenario is already covered by
003_carryover; the hybrid (providers + symbols) carryover-misindexing
scenario is harder to drive deterministically in a Robot harness and
is left as a future addition.

Verified locally: 20/20 of Functional.Cases.330_Neural pass.


Vsevolod Stakhov [Sat, 23 May 2026 10:34:17 +0000 (11:34 +0100)]

[Fix] neural: resilient ANN reuse across symbol-list drift

Two follow-up fixes that complete the "neural keeps working when symbols
change" story started by the disable_symbols_input digest stability
commit. Both motivated by inspecting the actual vbspam Redis state on
sp-collector, which showed multiple coexisting profiles per rule and an
orphaned training set (~100 spam / 15 ham) under a stale digest.

is_profile_compatible (pure-symbols mode)

The 30% Levenshtein-drift cap rejected the prior profile on every modest
config change (new RBL, multimap addition, SA-style rule loaded via
multimap regexp_rules). When rejected, set.training_profile stayed nil,
inference went dark, and training samples had nowhere to accumulate
until a brand-new ANN trained from scratch -- weeks under realistic
class imbalance. Raise the cap to 50%, with a comment pointing at the
result_to_vector path (it builds vectors from profile.symbols, NOT
set.symbols, so loading the older profile keeps the trained weights
correctly indexed against the features that produced them).

maybe_carryover_ann (hybrid providers + symbols)

The carryover copied an ANN blob from an old key (trained against
profile.symbols A) into a fresh key whose profile entry carries
set.symbols (current = B). load_new_ann later writes
set.ann.symbols = profile.symbols, so at inference the copied weights
got applied to indices that no longer correspond to the symbols they
were trained on -- silent garbage output. Guard the carryover with
rule.disable_symbols_input: only then does the symbol portion not
contribute to the input vector, and copied weights remain meaningful.
For hybrid mode without disable_symbols_input the existing
is_profile_compatible path already keeps inference alive via the prior
profile entry (whose own symbol list keeps weights aligned), so
skipping carryover is the correct behaviour, not a regression.

Combined with the earlier digest-stability commit, the failure
modes the user kept hitting in production -- disable_symbols_input
digest rotation, pure-symbols cap too tight, hybrid carryover
misindexing -- are all addressed.


Vsevolod Stakhov [Sat, 23 May 2026 10:14:42 +0000 (11:14 +0100)]

[Fix] neural: digest stability under disable_symbols_input

The profile digest forms part of the Redis key holding the trained
ANN (rn_<rule>_<settings>_<digest>_<v>). process_settings_elt computed
it as lua_util.table_digest(selt.symbols) unconditionally.

With disable_symbols_input=true the symbol catalogue does not feed the
model -- only providers + fusion + max_inputs determine the input-vector
schema (see is_profile_compatible) -- so hashing the unrelated symbol
list rotated the digest whenever any rspamd symbol was added/removed
elsewhere (a new RBL, a multimap rule, an SA-style rule loaded via
multimap's regexp_rules). The trained ANN was orphaned in Redis under
the old key and inference silently dropped to zero hits until a new
sample set retrained from scratch (weeks under realistic class
imbalance). Manual recovery via `redis-cli COPY` of the old key to the
new digest was the only fix.

Now: when has_providers + disable_symbols_input, the digest is
providers_config_digest(rule.providers, rule). Other modes keep the
existing symbol-based digest.
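
The selection rule, sketched in Python under loud assumptions: table_digest below is a JSON-plus-SHA-256 stand-in and does not reproduce lua_util.table_digest, and hashing only the providers list simplifies providers_config_digest, which also takes the rule.

```python
import hashlib
import json


def table_digest(value):
    """Stand-in digest: stable hash of a JSON-serialisable value."""
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def profile_digest(symbols, providers, disable_symbols_input):
    """Pick the digest input according to what actually feeds the model."""
    if providers and disable_symbols_input:
        # The symbol list does not reach the input vector, so it must
        # not influence the Redis key either.
        return table_digest(providers)
    return table_digest(list(symbols))
```

Under this rule, adding an unrelated symbol changes the digest only in the modes where symbols are part of the input.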

Migration: any deployment already running disable_symbols_input=true
with a trained ANN will see its digest rotate once on first start
after this lands. Either let the model retrain, or use the same
`redis-cli COPY rn_<rule>_<settings>_<old>_<v> rn_<rule>_<settings>_<new>_<v>`
recipe one final time -- after this fix the digest is stable across
unrelated rspamd config changes.


Dmytro Alieksieiev [Fri, 22 May 2026 19:58:11 +0000 (21:58 +0200)]

Merge branch 'master' into feat/url-redirector-swap-redirectors-map-to-glob


Dmitriy Alekseev [Fri, 22 May 2026 19:56:20 +0000 (21:56 +0200)]

[Feature] url_redirector: switch redirector_hosts_map from set to glob

Allow operators to use glob patterns (e.g. *.bit.ly, *.t.co) in the
redirector hosts list. Bare hostnames continue to match exactly, so no
operational change for existing maps; only the option to use wildcards
is new.
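
To illustrate the matching behaviour only: the sketch below uses Python's fnmatch, whose glob dialect is an assumption here and may not match rspamd's glob maps in every detail. is_redirector is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_redirector(host, patterns):
    """True if host matches any pattern; bare names match exactly."""
    host = host.lower()
    return any(fnmatchcase(host, pattern.lower()) for pattern in patterns)
```

With patterns ["bit.ly", "*.t.co"], the bare entry matches only bit.ly itself, while the wildcard entry matches subdomains of t.co but not t.co.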

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>


Vsevolod Stakhov [Fri, 22 May 2026 19:21:21 +0000 (20:21 +0100)]

[Conf] fuzzy_check: discover servers via SRV by default

Switch the default "rspamd.com" rule from a hardcoded round-robin host
list to SRV-based discovery. "service=fuzzy+rspamd.com" makes the
upstream parser resolve the _fuzzy._tcp.rspamd.com SRV record, so
backends and ports are managed entirely in DNS with no client-side
config change.
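
How the upstream string maps to the SRV owner name, as a Python sketch; srv_name is hypothetical and the real parsing happens in rspamd's upstream code. The _service._proto.domain layout follows the usual SRV naming convention.

```python
def srv_name(upstream, proto="tcp"):
    """Turn 'service=NAME+DOMAIN' into '_NAME._PROTO.DOMAIN', else None."""
    prefix = "service="
    if not upstream.startswith(prefix):
        return None  # plain host entry, not an SRV reference
    service, sep, domain = upstream[len(prefix):].partition("+")
    if not sep or not service or not domain:
        return None
    return f"_{service}._{proto}.{domain}"
```

For the default rule, srv_name("service=fuzzy+rspamd.com") gives _fuzzy._tcp.rspamd.com.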

The legacy fuzzy1/fuzzy2 hostnames keep resolving to every live
backend, so existing installs that pinned the old round-robin string
are unaffected. See rspamd/dns#8.


Vsevolod Stakhov [Fri, 22 May 2026 18:00:49 +0000 (19:00 +0100)]

[Minor] Update version to 4.1.0


Vsevolod Stakhov [Fri, 22 May 2026 17:59:47 +0000 (18:59 +0100)]

Merge pull request #6054 from dragoangel/fix/tcp-lua-populate-timeout-read

[Fix] Properly populate timeout read in tcp_lua.c


Vsevolod Stakhov [Fri, 22 May 2026 17:43:23 +0000 (18:43 +0100)]

Merge branch 'master' into fix/tcp-lua-populate-timeout-read


Vsevolod Stakhov [Fri, 22 May 2026 17:41:54 +0000 (18:41 +0100)]

[Project] CI: swap Droid review for Claude Code + z.ai

Switch the automated PR review workflow from Factory.ai's Droid CLI to
Claude Code running headless against z.ai's Anthropic-compatible
endpoint.

- Trigger changed from "@droid review" to "@review"
- Optional model argument ("@review glm-4.7"); defaults to glm-5.1
- Provider prefixes (z-ai/) are stripped and the id is lowercased
- All model slots pinned to real GLM ids (glm-5.1 / glm-5-turbo) so no
claude-* alias can reach the endpoint
- Requires the ZAI_API_KEY actions secret; FACTORY_API_KEY now unused


Dmitriy Alekseev [Fri, 22 May 2026 12:52:49 +0000 (14:52 +0200)]

[Feature] lua_tcp: bound the dial under connect_timeout for all queue shapes

Seat a LUA_WANT_CONNECT marker at the head of every non-empty queue, not
only when the head is LUA_WANT_READ. A LUA_WANT_WRITE-headed request was
already routing connect errors correctly (EV_WRITE naturally armed by the
write handler, SO_ERROR check fires before LUA_TCP_FLAG_CONNECTED), but
the timer was armed under write_timeout, not connect_timeout: a
black-holed SYN sat under the write budget and the caller's
connect_timeout was silently ignored.

After this change the prepended marker re-arms EV_WRITE under
connect_timeout for the dial; once CONNECTED is set, plan_handler_event
re-arms EV_WRITE under write_timeout for the actual write. Read-only
shapes continue to work as fixed in the previous commit.

Legacy single-budget callers (only `timeout` set, use_deduction = TRUE)
are unaffected: plan_handler_event gates per-phase timer re-arms on
!use_deduction, so the single budget rides through all phases via the
elapsed-time deduction in lua_tcp_handler. The extra LUA_WANT_CONNECT
phase costs one event-loop trip; total budget is preserved.

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>


Dmitriy Alekseev [Fri, 22 May 2026 11:44:43 +0000 (13:44 +0200)]

[Fix] Properly populate timeout read in tcp_lua.c


Vsevolod Stakhov [Fri, 22 May 2026 10:52:51 +0000 (11:52 +0100)]

Merge pull request #6053 from rspamd/vstakhov-url-redirector-stealth

[Feature] url_redirector: stealth-mode browser fingerprint profiles


dependabot[bot] [Fri, 22 May 2026 09:53:22 +0000 (10:53 +0100)]

Bump transformers in /contrib/neural-embedding-service (#5971)

Bumps [transformers](https://github.com/huggingface/transformers) from 4.53.0 to 5.0.0rc3.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.53.0...v5.0.0rc3)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 5.0.0rc3
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


Vsevolod Stakhov [Fri, 22 May 2026 09:38:01 +0000 (10:38 +0100)]

[Feature] url_redirector: coherent browser fingerprint profiles

Resolving redirector/shortener URLs with a lone randomly-picked
User-Agent is easily spotted by cloaking pages, which key on a missing
or inconsistent header set. Replace the flat default_ua list with
default_profiles: five coherent browser profiles (Chrome, Edge,
Firefox, Safari) that each bundle a User-Agent with the exact header
set, values and order that browser sends. Chromium profiles carry
sec-ch-ua client hints; Firefox and Safari correctly omit them.

One profile is picked per task and reused for every hop of every
chain, so the identity stays consistent the way a real browser would.
Headers are sent as an ordered list so their order is preserved on the
wire (RSPAMD_HTTP_FLAG_ORDERED_HEADERS).

settings.user_agent becomes an optional operator override (legacy
single-header path) and is unset by default; settings.fingerprint_profiles
holds the profile list.

dummy_http.py logs received request headers in order; a new
STEALTH FINGERPRINT HEADERS functional test asserts the redirector
emits a coherent fingerprint with preserved header order.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 09:36:36 +0000 (10:36 +0100)]

[Feature] http: optional insertion-ordered header emission

The HTTP client stores headers in a khash and emits them in bucket
order, so the on-the-wire header order is unpredictable. Add an opt-in
RSPAMD_HTTP_FLAG_ORDERED_HEADERS flag: each header is stamped with a
monotonic `order` at insertion time, and when the flag is set the
client serialises headers sorted by that order instead of hash order.

lua_http now accepts a list form for the headers table
({{'name', 'value'}, ...}) which preserves order and sets the flag;
the existing map form and every other caller are byte-identical.

This lets callers reproduce a real browser's exact header order, used
by the url_redirector stealth fingerprint profiles.
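
The stamping-and-sorting idea described above can be sketched in a few
lines of C. This is only an illustration under stated assumptions: the
`hdr`, `hdr_set`, `hdr_add` and `hdr_sort_for_emit` names are
hypothetical and are not the rspamd HTTP API, and a fixed array stands
in for the khash storage.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical header record; `order` is stamped at insertion time. */
struct hdr {
    const char *name;
    const char *value;
    unsigned int order;
};

/* Hypothetical container; a real client would keep headers in a hash. */
struct hdr_set {
    struct hdr items[16];
    size_t n;
    unsigned int next_order; /* monotonic insertion counter */
};

/* Store a header and record its insertion stamp. Returns 0 on success. */
static int hdr_add(struct hdr_set *s, const char *name, const char *value)
{
    assert(s != NULL);
    if (s->n >= sizeof(s->items) / sizeof(s->items[0])) {
        return -1;
    }
    s->items[s->n].name = name;
    s->items[s->n].value = value;
    s->items[s->n].order = s->next_order++;
    s->n++;
    return 0;
}

/* qsort comparator: ascending insertion stamp. */
static int hdr_cmp_order(const void *a, const void *b)
{
    const struct hdr *ha = a, *hb = b;
    return (ha->order > hb->order) - (ha->order < hb->order);
}

/* Whatever order the storage yields, emit in insertion order. */
static void hdr_sort_for_emit(struct hdr_set *s)
{
    qsort(s->items, s->n, sizeof(s->items[0]), hdr_cmp_order);
}
```

Sorting only at serialisation time keeps insertion cheap, and callers
that never request ordered output pay nothing beyond the stamp itself.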

commit | commitdiff | tree

Dmytro Alieksieiev [Fri, 22 May 2026 08:55:51 +0000 (10:55 +0200)]

[Feature] Allow using GET in url_redirector for user-defined list (#6043)

* [Feature] Allow using GET in url_redirector for a user-defined list of URLs via regexp

* [Fix] Regression in link writing to redis

Properly encode next_str, fix debug log, and limit callback to HTTP-only URLs

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 08:40:20 +0000 (09:40 +0100)]

Merge pull request #6042 from dragoangel/feature/update-default-ua-url-redirector

[Feature] Update default UA in url_redirector module

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 08:04:31 +0000 (09:04 +0100)]

Merge pull request #6052 from rspamd/vstakhov-arc-header-order

[Fix] arc: emit ARC headers in a deterministic order

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 08:02:21 +0000 (09:02 +0100)]

Merge pull request #6039 from rspamd/vstakhov-mx-check-phase-a

[Rework] mx_check: three-layer cache, finer outcomes, IP-class classification (#6032 Phases A & C)

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 21 May 2026 21:36:48 +0000 (22:36 +0100)]

Merge pull request #6050 from moisseev/autolearnstats

[Fix] autolearnstats: fix table formatting crash and add sorting/grouping options

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 21 May 2026 15:56:18 +0000 (16:56 +0100)]

[Fix] arc: emit ARC headers in a deterministic order

lua_mime.modify_headers accepted an `order` list but it had no effect:
the headers passed through a string-keyed Lua table and were serialised
to the milter reply in arbitrary hash order. arc.lua relied on `order`
to lay out an ARC set, so the three ARC headers were emitted in a
non-deterministic order. Some validators (e.g. O365) reject ARC sets
that are not in the conventional ARC-Seal, ARC-Message-Signature,
ARC-Authentication-Results layout.

When `order` is given, emit one milter reply per header in that order
(set_milter_reply merges replies cumulatively, so a single-key reply
has no ambiguous iteration order) and apply the internal modify_header
calls in the same order.

Issue: #6045

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 21 May 2026 09:43:00 +0000 (10:43 +0100)]

[Feature] mx_check: IP-class classification, trust maps, run-scope toggles

Phase C of #6032 (IPv6 probing deferred):

IP-class classification. Resolved MX-target IPs are partitioned into
PUBLIC / LOCAL / BOGON against fixed RFC range sets. LOCAL (RFC1918,
CGNAT, ULA) is unprobeable from our vantage point; BOGON (loopback,
link-local, TEST-NET, multicast, reserved) has no legitimate meaning as
an MX target and is a packet-injection footgun. Only PUBLIC addresses
are probed; the rest emit MX_LOCAL_ONLY/MIX and MX_BOGON_ONLY/MIX. The
range sets are a correctness invariant and are not operator-tunable.
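
A rough IPv4-only sketch of this partitioning follows. It is
illustrative rather than the module's code (mx_check itself is a Lua
plugin using radix maps): the table is a partial, assumed subset of the
ranges named above, and `classify_v4`, `V4` and the enum names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum ip_class { IP_PUBLIC, IP_LOCAL, IP_BOGON };

struct v4_range {
    uint32_t net;      /* network address, host byte order */
    unsigned int bits; /* prefix length, 1..32 */
    enum ip_class cls;
};

#define V4(a, b, c, d) \
    (((uint32_t) (a) << 24) | ((uint32_t) (b) << 16) | \
     ((uint32_t) (c) << 8) | (uint32_t) (d))

/* Illustrative subset only; the real range sets are not listed here. */
static const struct v4_range ranges[] = {
    { V4(10, 0, 0, 0), 8, IP_LOCAL },      /* RFC 1918 */
    { V4(172, 16, 0, 0), 12, IP_LOCAL },   /* RFC 1918 */
    { V4(192, 168, 0, 0), 16, IP_LOCAL },  /* RFC 1918 */
    { V4(100, 64, 0, 0), 10, IP_LOCAL },   /* CGNAT */
    { V4(127, 0, 0, 0), 8, IP_BOGON },     /* loopback */
    { V4(169, 254, 0, 0), 16, IP_BOGON },  /* link-local */
    { V4(192, 0, 2, 0), 24, IP_BOGON },    /* TEST-NET-1 */
    { V4(224, 0, 0, 0), 4, IP_BOGON },     /* multicast */
    { V4(240, 0, 0, 0), 4, IP_BOGON },     /* reserved */
};

/* First matching range wins; anything unmatched is treated as PUBLIC. */
static enum ip_class classify_v4(uint32_t addr)
{
    for (size_t i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++) {
        assert(ranges[i].bits >= 1 && ranges[i].bits <= 32);
        uint32_t mask = 0xFFFFFFFFu << (32 - ranges[i].bits);
        if ((addr & mask) == ranges[i].net) {
            return ranges[i].cls;
        }
    }
    return IP_PUBLIC;
}
```

Only addresses that come back as `IP_PUBLIC` would be handed to the
probe step in this model.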

Per-layer trust/skip maps. exclude_mxs is a glob map of trusted MX
hostnames; a hit short-circuits the whole check with MX_WHITE. exclude_ips
is a radix map of IPs dropped from the probe set; if it empties the set,
MX_SKIP fires.

Run-scope toggles. check_authorized and check_local (both default false)
control whether authenticated and local-network senders are checked,
replacing the previous hardcoded skip.

test_mode (testing only) lifts loopback out of the bogon set so the probe
path stays exercisable against a local listener; functional tests use it.

The IPv4-mapped range ::ffff:0:0/96 is intentionally excluded from the
bogon set: rspamd's radix stores IPv4 as its v4-mapped form, so listing
that prefix would classify all IPv4 traffic as bogon.

Refs #6032.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 21 May 2026 09:09:53 +0000 (10:09 +0100)]

[Fix] mx_check: address Phase A review defects

Fix three defects found in review of the Phase A rework:

- step2/step3: a non-working probe verdict for one MX host ended the
  whole lookup instead of trying the remaining MX records. Domains with
  a refused/timed-out primary MX and a reachable backup MX were scored
  MX_INVALID instead of MX_GOOD. step3 now hands its verdict to a
  continuation; step2 walks the MX list in priority order and only
  emits a failure after every selected host fails. Also stop caching a
  broken-MX domain under d: as 'nxd' (it would later be misreported as
  NXDOMAIN).

- A-fallback: a NODATA/empty A response was cached and reported as
  NXDOMAIN. nxdomain is now returned only for a genuine DNS_ERR_NXDOMAIN;
  domains that exist but publish neither MX nor A emit a missing/invalid
  outcome and write no d: cache entry.

- Legacy aliases: the shipped modules.d/mx_check.conf set connect_timeout
  and verify_greeting, so the merged config always carried them and the
  `timeout`/`wait_for_greeting` aliases were silently ignored. Drop those
  keys from the shipped file (kept as documented comments); warn when a
  legacy key and its replacement are both set.

Add a functional test for the NODATA case.

Refs #6032.

commit | commitdiff | tree

Alexander Moisseev [Thu, 21 May 2026 08:34:30 +0000 (11:34 +0300)]

[Feature] autolearnstats: add --sort-by and --group options

Add --sort-by <col> to sort rows by a chosen column (verdict, score,
ts, tid, ip, from, rcpts) with timestamp as a tiebreaker. Score is
compared numerically; all other columns lexicographically.

Add --group flag to insert a blank separator line between consecutive
rows where the --sort-by key changes.

Add unit tests for sort key extraction functions.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 21 May 2026 07:50:44 +0000 (08:50 +0100)]

Merge branch 'master' into vstakhov-mx-check-phase-a

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 21 May 2026 07:50:31 +0000 (08:50 +0100)]

Merge pull request #6048 from dragoangel/fix/avoid-tcp-leak-on-read-wo-write

[Fix] Avoid TCP leak on read without write

commit | commitdiff | tree

Alexander Moisseev [Thu, 21 May 2026 06:34:27 +0000 (09:34 +0300)]

[Fix] autolearnstats: fix crash and truncate long table columns

LuaJIT string.format only parses 2-digit widths (max 99); 3-digit
column widths like %-176s caused "invalid option" errors. Replace
header string.format with pad() calls.

Cap From/Recipients column display width at 60 chars; introduce
cell() helper that truncates overlong values with a ".." suffix.

Add unit tests for pad() and cell() covering truncation, width
invariant, and the >= 100 width regression.

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 13:19:45 +0000 (14:19 +0100)]

[Minor] css: fix out-of-bounds read in ident escape scanner

consume_ident scanned a backslash escape with a do-while that read
input[++i] at the top of the body but checked i < input.size() only
at the bottom. When i reached input.size() - 1 the loop re-entered
and input[++i] read one element past the string_view.

CSS reaches the tokeniser from style attributes whose value lives in
a tightly sized mempool buffer, so a token ending in backslash plus a
hex digit produced a one-byte heap over-read. Gate the increment with
i + 1 < input.size().
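
The shape of the fix can be shown with a simplified scanner written in
C (the real tokeniser is C++ and works on a string_view). The function
name and its exact contract are assumptions made for this sketch; the
point is that the bound is checked before the next element is read.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Given input[i] == '\\', return the index of the last character that
 * belongs to the escape (the backslash itself if no hex digit follows).
 * `input` is NOT assumed to be NUL-terminated.
 */
static size_t skip_hex_escape(const char *input, size_t len, size_t i)
{
    assert(i < len && input[i] == '\\');

    /* Test i + 1 < len before touching input[i + 1]. */
    while (i + 1 < len && isxdigit((unsigned char) input[i + 1])) {
        i++;
    }

    return i;
}
```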

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 13:00:25 +0000 (14:00 +0100)]

[Minor] str_util: fix lookahead over-read in find_eoh

rspamd_string_find_eoh peeks p[1] in the got_cr state but guarded it
with "p < end", which is already guaranteed by the loop and does not
cover the p+1 access. On input whose header region ends with \r\r the
peek read one byte past the buffer; the MIME parser calls this with a
non-NUL-terminated GString view over the message, so that byte is not
guaranteed to exist.

Check p + 1 < end instead; a truncated \r\r at end of input then
falls through to the existing branch that treats it as end-of-headers.
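
As a hedged illustration of the guarded lookahead, the sketch below
searches a length-delimited buffer for the blank line that ends the
headers. It is a deliberately reduced model, not
rspamd_string_find_eoh: it handles only LF LF and LF CR LF, and the
name `find_eoh_sketch` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the offset of the first body byte, or -1 if no blank line is
 * found. Every p[1] / p[2] peek is guarded by its own bound check.
 */
static ptrdiff_t find_eoh_sketch(const char *buf, size_t len)
{
    assert(buf != NULL);
    const char *p = buf, *end = buf + len;

    while (p < end) {
        if (*p == '\n') {
            /* "p < end" alone does not cover the p[1] access. */
            if (p + 1 < end && p[1] == '\n') {
                return (p + 2) - buf;
            }
            if (p + 2 < end && p[1] == '\r' && p[2] == '\n') {
                return (p + 3) - buf;
            }
        }
        p++;
    }

    return -1;
}
```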

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 12:46:46 +0000 (13:46 +0100)]

[Minor] archives: fix 7zip varint decoding

rspamd_archive_7zip_read_vint had two defects in the multi-byte path:
the destination uint64_t was left uninitialised before a partial
memcpy, and the "shift back" used sizeof(tgt) (bytes) mixed with
NBBY * intlen (bits). For intlen >= 2 that expression underflows the
unsigned size_t and produces a shift of 64 or more, which is
undefined behavior.

Zero-initialise the value and drop the bogus shift: with a zeroed
target the little-endian memcpy already yields the intlen-byte value
directly.
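
A minimal sketch of the "start from zero, no shift back" idea is shown
below. It assembles the bytes explicitly instead of using memcpy, so it
does not depend on host endianness; that substitution, the function
name and the extra `avail` parameter are choices made for this example
and are not taken from the rspamd source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Read an intlen-byte little-endian integer from p.
 * Returns 0 on success, -1 if intlen is invalid or exceeds avail.
 */
static int read_le_uint(const unsigned char *p, size_t avail,
                        size_t intlen, uint64_t *out)
{
    assert(p != NULL && out != NULL);
    uint64_t v = 0; /* zeroed first, so unread high bytes are defined */

    if (intlen == 0 || intlen > sizeof(v) || intlen > avail) {
        return -1;
    }

    for (size_t i = 0; i < intlen; i++) {
        v |= (uint64_t) p[i] << (8 * i); /* shift is at most 56 bits */
    }

    *out = v;
    return 0;
}
```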

commit | commitdiff | tree

Dmytro Alieksieiev [Wed, 20 May 2026 12:39:48 +0000 (14:39 +0200)]

Merge branch 'master' into fix/avoid-tcp-leak-on-read-wo-write

commit | commitdiff | tree

Dmitriy Alekseev [Wed, 20 May 2026 12:39:08 +0000 (14:39 +0200)]

[Fix] Avoid TCP leak on read without write

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 12:08:40 +0000 (13:08 +0100)]

[Minor] spf: fix over-read on a bare "spf2." sender-id record

start_spf_parse validated only the "spf2." prefix (sizeof - 1) but
then advanced begin by the full sizeof, skipping one unvalidated
byte. A TXT record consisting of exactly "spf2." made the following
'/' check read past the logical end of the string, and could chain
into parse_spf_scopes walking past the allocation.

Advance past the validated prefix only, then check the version digit
and '/' with short-circuiting so neither read goes past the
terminator.
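
The corrected prefix handling could look roughly like the following. It
is a sketch on a NUL-terminated string with a hypothetical function
name, not the actual start_spf_parse code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * If s starts with "spf2.<digit>/", return a pointer just past the
 * slash; otherwise return NULL. s must be NUL-terminated.
 */
static const char *skip_sender_id_prefix(const char *s)
{
    static const char prefix[] = "spf2.";

    assert(s != NULL);
    if (strncmp(s, prefix, sizeof(prefix) - 1) != 0) {
        return NULL;
    }

    /* Advance by the validated bytes only (sizeof counts the NUL). */
    s += sizeof(prefix) - 1;

    /*
     * && short-circuits: s[1] is read only when s[0] is a digit, so
     * s[0] is not the terminator and s[1] is still inside the string.
     */
    if (s[0] >= '0' && s[0] <= '9' && s[1] == '/') {
        return s + 2;
    }

    return NULL;
}
```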

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 11:05:08 +0000 (12:05 +0100)]

[Fix] rdns: reject DNS labels that overrun the packet

rdns_parse_labels computes the name length in a first pass that only
reads label length bytes, then a second pass copies the label data.
The first pass never checked that a label's data actually fits within
the packet, so a reply whose final label declared more bytes than
remained made the second-pass memcpy read past the end of the reply
buffer. On the DNS-over-TCP path that buffer is malloc'd to exactly
the advertised message size, so the over-read ran past the allocation.

Validate in the first pass that both plain and compressed label data
stay within the packet, and reject the name otherwise. Also fix an
off-by-one in rdns_decompress_label where an offset equal to the
packet length was accepted and read one byte past the end.
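
A reduced model of the first-pass validation is sketched below. It is
an assumption-laden illustration: the function name is hypothetical,
and compression pointers are simply rejected here, whereas the actual
fix validates them as well.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Walk the labels of a name starting at pkt[off] and return the number
 * of bytes they occupy (length byte plus data for each label), or -1
 * if any label would run past the end of the packet.
 */
static int name_length(const uint8_t *pkt, size_t pkt_len, size_t off)
{
    assert(pkt != NULL);
    int total = 0;

    while (off < pkt_len) {
        uint8_t llen = pkt[off];

        if (llen == 0) {
            return total; /* root label: end of name */
        }
        if (llen & 0xC0) {
            return -1; /* compression is out of scope for this sketch */
        }
        /* off < pkt_len holds, so pkt_len - off - 1 cannot underflow. */
        if (llen > pkt_len - off - 1) {
            return -1; /* label data would overrun the packet */
        }

        total += llen + 1;
        off += (size_t) llen + 1;
    }

    return -1; /* ran off the end without a terminating root label */
}
```

Rejecting the name in the sizing pass means the later copy pass never
sees a length it cannot satisfy.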

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 10:39:51 +0000 (11:39 +0100)]

[Fix] html: prevent buffer overflow in entity decoding

decode_html_entitles_inplace works in place, relying on the
replacement never being longer than the source entity text. That
assumption does not hold for some short entity names that expand to
multi-codepoint replacements (e.g. nGt, nLt, nvap): when such an
entity sits at the very end of the buffer the named-entity memcpy
wrote a few bytes past the end.

Bounds-check the replacement against the remaining buffer before
copying, matching the existing numeric-entity path, and drop the
entity when it does not fit.

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 10:29:49 +0000 (11:29 +0100)]

[Minor] url: fix out-of-bounds read on empty/all-dots host

rspamd_url_maybe_regenerate_from_ip could read host[-1]:

* The trailing-dot strip loop tested *(end - 1) before the end > p
  bound, so an all-dots host (http://.../) walked end down to p and
  then dereferenced one byte before the host buffer.
* rspamd_url_parse only rejected an empty host before URL-decoding;
  a host such as "%" decodes to zero bytes, so hostlen could become 0
  and still reach the regen/telephone code with end == p.

Reorder the loop condition, re-check hostlen after the host is
decoded and shifted, and guard rspamd_url_maybe_regenerate_from_ip
against a zero-length host.
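
The reordered loop condition amounts to the pattern below. The helper
name is hypothetical; the only point is that the bound is tested before
*(end - 1) is read.

```c
#include <assert.h>
#include <stddef.h>

/* Return the host length after stripping any trailing dots. */
static size_t strip_trailing_dots(const char *host, size_t len)
{
    assert(host != NULL);
    const char *p = host, *end = host + len;

    /* Bound first: with end == p there is no byte at end - 1 to read. */
    while (end > p && *(end - 1) == '.') {
        end--;
    }

    return (size_t) (end - p);
}
```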

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 20 May 2026 10:18:57 +0000 (11:18 +0100)]

[Minor] mime_headers: avoid uninitialised bytes in rfc2047 decode

When an encoded-word fails to decode, the failure branch reset the
token length with `token->len -= tok_len`. For the base64 path that is
wrong: rspamd_cryptobox_base64_decode writes its *outlen argument
(tok_len) even on failure, so the subtraction no longer restores the
original offset and leaves token->len above pos. The bytes between the
partial decode and the grown GByteArray capacity are uninitialised and
were flushed into the decoded header value.

Reset token->len to the saved pos offset in both failure branches
instead, discarding the token cleanly.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 18 May 2026 11:01:23 +0000 (12:01 +0100)]

[Fix] fuzzy_storage: harden network input paths

Three defensive fixes for user-controlled input over UDP/TCP:

* accept_fuzzy_socket: reset msg_namelen back to the buffer capacity
  before every recvmsg/recvmmsg call. The kernel overwrites msg_namelen
  with the actual source address size on output; on the non-recvmmsg
  path the for(;;) loop reused the same msghdr across calls, so a
  larger source address (e.g. IPv6 after IPv4) was silently truncated
  by the kernel and the trailing bytes of the parsed sockaddr came
  from stale stack memory.

* rspamd_fuzzy_tcp_io: validate the reconstructed 16-bit frame length
  before folding it into cur_frame_state. The state machine only has
  14 bits for the length (top two bits are flags), so values with bit
  14 or 15 set were silently masked off, letting a client smuggle a
  large advertised size while the server parsed a much smaller frame.
  Now any length above FUZZY_TCP_BUFFER_LENGTH or equal to zero closes
  the connection immediately.

* rspamd_fuzzy_make_reply: clamp mf_result->n_extra_flags to
  RSPAMD_FUZZY_MAX_EXTRA_FLAGS before the memcpy into the fixed-size
  rep_v2->extra_flags[7]. All current backends already bound this
  value, but the frontend was trusting them; clamp defensively so a
  future backend bug cannot become an OOB write on the reply struct.
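
Two of these checks reduce to very small predicates, sketched here with
placeholder constants. `FRAME_BUF_LEN` and `MAX_EXTRA_FLAGS` are
assumed stand-ins for FUZZY_TCP_BUFFER_LENGTH and
RSPAMD_FUZZY_MAX_EXTRA_FLAGS; their values are guesses made for the
example (7 mirrors the extra_flags[7] array mentioned above).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_BUF_LEN 8192u /* assumed value, below the 14-bit limit */
#define MAX_EXTRA_FLAGS 7u  /* assumed to match extra_flags[7] */

/*
 * Validate the full 16-bit length before it is masked into a narrower
 * state field: zero and oversized frames are rejected outright.
 */
static bool frame_len_valid(uint16_t frame_len)
{
    return frame_len != 0 && frame_len <= FRAME_BUF_LEN;
}

/* Clamp a backend-supplied count to the fixed array capacity. */
static size_t clamp_extra_flags(size_t n)
{
    return n > MAX_EXTRA_FLAGS ? MAX_EXTRA_FLAGS : n;
}
```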

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 18 May 2026 09:40:40 +0000 (10:40 +0100)]

[Fix] fuzzy_storage: peer-pipe write resume and shutdown drain

fuzzy_peer_try_send retried short writes from byte 0 of the command
instead of resuming at the offset already sent, so a partial write
followed by a watcher-driven retry shoved garbage into the peer pipe.

Track the bytes sent on the request and resume from there. Convert
the helper to a tri-state (DONE / AGAIN / FATAL) so the watcher can
keep firing on transient short writes and only stop+free on completion
or a hard error.

Also link pending requests into a list on the ctx so worker shutdown
can drain any whose write watcher never fires (e.g. on non-update
workers where the event loop has already broken out), instead of
leaking the up_req allocations.
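
The resume-from-offset behaviour with a tri-state result might be
modelled as below. This is a sketch for a POSIX system with
hypothetical names (`pending_req`, `try_send`, the `SEND_*` values); it
loops until the descriptor would block, which may differ from how the
real watcher-driven helper is structured.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

enum send_state { SEND_DONE, SEND_AGAIN, SEND_FATAL };

struct pending_req {
    const unsigned char *buf;
    size_t len;
    size_t sent; /* bytes already written; survives across retries */
};

static enum send_state try_send(int fd, struct pending_req *req)
{
    assert(req != NULL && req->sent <= req->len);

    while (req->sent < req->len) {
        /* Resume at the saved offset, never from byte 0. */
        ssize_t r = write(fd, req->buf + req->sent, req->len - req->sent);

        if (r > 0) {
            req->sent += (size_t) r;
            continue;
        }
        if (r < 0 && errno == EINTR) {
            continue;
        }
        if (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return SEND_AGAIN; /* keep the watcher armed */
        }
        return SEND_FATAL; /* hard error: stop and free */
    }

    return SEND_DONE;
}
```

A caller would stop and free the request on `SEND_DONE` or
`SEND_FATAL`, and leave the write watcher active on `SEND_AGAIN`.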

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 18 May 2026 09:40:24 +0000 (10:40 +0100)]

[Fix] fuzzy_storage: avoid per-refresh leak in dynamic ban inserts

rspamd_fuzzy_block_addr allocated the ban struct from the radix tree's
long-lived mempool before calling radix_insert_compressed. When the
prefix was already present (the common case: ban_sync re-applies on every
bans_version bump, provisional re-blocks every provisional_ttl), the
btrie rejected the duplicate and the code mutated the existing struct in
place — leaving the freshly allocated one orphaned in the mempool with no
way to reclaim it short of a worker restart.

The pool is created with rspamd_mempool_new_long_lived and freed only at
radix_destroy_compressed, so the orphans accumulate monotonically. With
thousands of bans churning across a fuzzy fleet and the rspamd-mem-watchdog
trimming workers on a 30-minute cadence, this matches the growth pattern
we have been compensating for.

Look up the prefix first; on a hit, mutate in place without allocating.
Allocate and insert only on a true miss.

commit | commitdiff | tree

Dmitriy Alekseev [Sun, 17 May 2026 20:58:24 +0000 (22:58 +0200)]

[Feature] Update default UA in url_redirector module

commit | commitdiff | tree

Dmytro Alieksieiev [Sun, 17 May 2026 20:48:09 +0000 (22:48 +0200)]

Merge branch 'master' into feature/update-default-ua-url-redirector

commit | commitdiff | tree

Dmitriy Alekseev [Sun, 17 May 2026 19:36:50 +0000 (21:36 +0200)]

[Feature] Update default UA in url_redirector module

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 17 May 2026 20:03:32 +0000 (21:03 +0100)]

Add Dmytro Alieksieiev to AUTHORS.md

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 17 May 2026 20:02:51 +0000 (21:02 +0100)]

Merge pull request #6040 from fatalbanana/copyright

[Minor] Update copyright for some plugins

commit | commitdiff | tree

Andrew Lewis [Sun, 17 May 2026 18:30:23 +0000 (20:30 +0200)]

[Minor] Update copyright for some plugins

commit | commitdiff | tree

Andrew Lewis [Fri, 15 May 2026 11:19:29 +0000 (13:19 +0200)]

[Minor] Update AUTHORS

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 17 May 2026 11:32:33 +0000 (12:32 +0100)]

[Test] multimap: cover regexp_rules selector atom brand spoof

Adds a Bank of America display-name spoof scenario to the SA-style
regexp_rules tests: a `selector =~` atom on `from:name`, a `selector !~`
atom on `from:domain`, and a meta combining them. Validates both =~ and
!~ behavior plus meta scoring on a real spoofed-display-name message.

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 17 May 2026 08:59:58 +0000 (09:59 +0100)]

Merge pull request #6041 from rspamd/vstakhov-neural-profile-carryover

[Fix] neural: preserve trained ANN across symcache-driven profile rotation

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 16 May 2026 20:13:29 +0000 (21:13 +0100)]

[Test] neural: cover providers_digest rotation carryover

Regression test for the symcache-driven profile rotation fix.

Drives a live rspamd + Redis through: train ANN with providers-only
input (metatokens, disable_symbols_input=true) -> verify NEURAL_SPAM /
NEURAL_HAM fire -> mutate set.symbols/set.digest in the scanner worker
(simulates a symcache shift) -> verify inference still fires after the
next check_anns poll.

Pre-fix the mutation pushes the symbol-list Levenshtein distance well
past the 30% tolerance, the worker rejects the trained profile, and
NEURAL_SPAM stops firing. Post-fix the providers_digest stays
constant and is recognised as the authoritative schema fingerprint, so
the trained ANN is reloaded.

max_trains=1 because metatokens-only scans produce an identical
vector per message and Redis SADD deduplicates — one spam + one ham
scan are enough to fire training.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 16 May 2026 19:03:12 +0000 (20:03 +0100)]

[Fix] neural: preserve trained ANN across symcache-driven profile rotation

When rspamd's symbol cache shifts (any added/removed symbol, even unrelated
to the neural rule), the per-rule symbol digest changes and the plugin
historically picked a brand-new profile — abandoning the previously-trained
ANN at the old redis_key.  In deployments where the input vector is built
from providers (e.g. fasttext_embed conv1d) and `disable_symbols_input` is
set, the symbol list is irrelevant to the vector schema, so the
rotation needlessly reset inference until enough new training data
accumulated.

Make providers_digest the authoritative schema fingerprint when providers
are configured:

* New helper `is_profile_compatible` in lualib/plugins/neural.lua decides
  load eligibility based on providers_digest first; symbol-list drift is
  ignored entirely when `disable_symbols_input = true`, and tolerated
  without bound for hybrid (providers + symbols) rules where symbols form
  only a minor slice of the fused vector.  Pure-symbols rules keep the
  legacy 30% Levenshtein tolerance and now also reject profiles that were
  trained with providers (vector schemas differ).

* process_existing_ann/maybe_train_existing_ann use the new helper, and
  the reload decision in process_existing_ann picks the fresher version
  when the providers schema matches across a symbol-digest shift.

* new_ann_profile triggers an async carryover after ZADD: ZREVRANGE the
  zset, find the most recent prior profile with a matching
  providers_digest, HMGET its ann/roc_thresholds/pca/providers_meta/
  norm_stats, and HMSET them into the fresh redis_key.  Gated on
  HEXISTS new_key ann == 0 so a freshly-trained model is never
  overwritten.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 16 May 2026 14:45:03 +0000 (15:45 +0100)]

[Fix] mime_headers/encoding: correct lengths after in-place rewrites

- mime_headers (message-id): after g_strstrip shifts content forward
  in-place, the pre-strip length is stale; re-acquire p and len so the
  cleanup loop does not scan past the live content and pull stale bytes
  (which the loop would otherwise turn into '?' or treat as a trailing
  '>') into MESSAGE_FIELD(task, message_id).
- mime_encoding (rspamd_charset_normalize): fix the trim-in-place math;
  the previous version copied one extra byte past `end` and wrote the
  null terminator at the unshifted offset, leaving stale trailing bytes
  in the normalized charset name.
- mime_encoding (rspamd_mime_charset_utf_enforce): use goffset for the
  inner offsets so buffers >= 2 GiB cannot truncate to int32_t and make
  p += cur_offset walk backwards into OOB writes.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 16 May 2026 14:44:50 +0000 (15:44 +0100)]

[Fix] images/archives: harden parsers against malformed inputs

- images.c: guard Content-Id image linking against NULL rh->decoded.
- archives.c (zip): require >= 22 bytes for the EOCD scan to avoid a
  pointer-below-start computation; widen cd_offset + cd_size to uint64_t
  so a 32-bit wrap can no longer bypass the bounds check and let cd land
  outside the buffer.
- archives.c (rar v5): replace pointer-arithmetic bound on the file
  extra-field with a size-based check so an attacker-controlled 64-bit
  extra_sz cannot wrap p + fname_len + extra_sz and trigger an OOB read.
- archives.c (7z): same fix in rspamd_7zip_read_archive_props for proplen.
- archives.c: two `return NULL` statements in a bool-returning function
  changed to `return false` (cosmetic).
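
The size-based style of bounds check used for the zip, rar and 7z fixes
can be expressed as a single predicate. The helper below is a generic
sketch with a hypothetical name, not code from archives.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [off, off + size) lies inside a buffer of buf_len bytes.
 * The sum off + size is never computed, so it cannot wrap around.
 */
static bool range_fits(uint64_t off, uint64_t size, uint64_t buf_len)
{
    return off <= buf_len && size <= buf_len - off;
}
```

Comparing sizes instead of computed end pointers keeps the check valid
even when an attacker controls a full 64-bit length field.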

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 16 May 2026 13:41:51 +0000 (14:41 +0100)]

[Fix] mime_parser: defensive guards against NULL deref and resource leaks

- Fix incorrect offset in begin-base64 UUE prefix detection (was using
  sizeof("begin ") instead of sizeof("begin-base64 ")).
- Guard against NULL header value when iterating Content-Type headers
  in rspamd_mime_process_multipart_node and rspamd_mime_parse_message.
- Add NULL checks for p7->d.sign / contents / type in the SMIME branch
  to avoid crashes on malformed PKCS7 signed-data structures.
- Free the recursive parser context on the early error-return path in
  rspamd_mime_parse_message so it does not leak the per-recursion stack
  and boundaries arrays.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 15 May 2026 10:55:02 +0000 (11:55 +0100)]

[Fix] url_suspect: require TLD >= 3 chars for word_dot naked domain matches

Two-char country TLDs (.so, .to, .me, .in, .us, etc.) overlap with common
English words, causing false positives when normal prose like "pale blue dot
so insignificant" is matched by the word_dot pattern and normalized to a
valid-looking naked domain (blue.so).

Explicit-protocol patterns (hxxp, spaced_protocol) are unaffected and still
match 2-char TLDs.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 14 May 2026 18:53:46 +0000 (19:53 +0100)]

Merge branch 'master' into vstakhov-mx-check-phase-a

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 14 May 2026 18:50:17 +0000 (19:50 +0100)]

[Minor] Defensive guards in JPEG and RFC 2047 QP decoders

process_jpg_image(): bail out early when the input is shorter than the
minimum needed to safely access the SOF fields referenced as p[4..7].
Pointer-arithmetic associativity already makes the existing
`end = p + data->len - 8` benign on standard targets (the loop simply
doesn't execute for tiny buffers), but the explicit precondition makes
the intent obvious and is robust against future refactors.

rspamd_decode_qp2047_buf(): when an encoded-word ends with a bare `=`
that has no following hex digits, emit a literal `=` instead of reading
one byte past the input. Two paths could reach the OOB read - the
direct `*p == '='` block and the else-branch's `goto decode` after
memcspn finds a trailing `=` - both are now guarded. In production the
read landed inside the surrounding header-value buffer (mempool
allocated, null-terminated), so this is cosmetic, but it silences
fuzzer/ASAN noise on direct-call test harnesses.
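
For the Q-encoding case, the guard can be illustrated with a compact
decoder over a length-delimited input. This is an independent sketch
with hypothetical names (`hexval`, `decode_q`), not
rspamd_decode_qp2047_buf; it treats any `=` that is not followed by two
hex digits as a literal character.

```c
#include <assert.h>
#include <stddef.h>

/* Value of a hex digit, or -1 if c is not one. */
static int hexval(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/*
 * Decode in[0..inlen) into out. Returns the number of bytes written,
 * or -1 if out is too small. Input need not be NUL-terminated.
 */
static ptrdiff_t decode_q(const char *in, size_t inlen,
                          char *out, size_t outlen)
{
    assert(in != NULL && out != NULL);
    size_t o = 0;

    for (size_t i = 0; i < inlen; i++) {
        unsigned char c = (unsigned char) in[i];
        char decoded;

        /* Only peek at in[i + 1] and in[i + 2] when both exist. */
        if (c == '=' && inlen - i > 2 &&
            hexval((unsigned char) in[i + 1]) >= 0 &&
            hexval((unsigned char) in[i + 2]) >= 0) {
            decoded = (char) ((hexval((unsigned char) in[i + 1]) << 4) |
                              hexval((unsigned char) in[i + 2]));
            i += 2;
        }
        else if (c == '_') {
            decoded = ' '; /* RFC 2047: underscore encodes a space */
        }
        else {
            decoded = (char) c; /* includes a bare trailing '=' */
        }

        if (o >= outlen) {
            return -1;
        }
        out[o++] = decoded;
    }

    return (ptrdiff_t) o;
}
```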

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 14 May 2026 18:53:09 +0000 (19:53 +0100)]

[Minor] CI: Upgrade model version from gpt-5.4 to gpt-5.5

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 14 May 2026 18:19:27 +0000 (19:19 +0100)]

[Rework] mx_check: three-layer Redis cache and finer outcomes (Phase A)

Replaces the single domain-keyed cache with three namespaces — `<key_prefix>:d:`
for the per-domain MX/A-fallback verdict, `<key_prefix>:m:` for per-MX-host A
records, and `<key_prefix>:i:` for per-IP probe verdicts. Two domains pointing
at a shared MX host (every G-Suite / M365 tenant, every ESP customer) now share
the m-layer and i-layer entries, so the second domain hits cache at every step
and emits its symbol with zero new DNS or TCP work.

Splits the probe into two clean shapes — pure connect-only and full SMTP banner
validation — using the new `lua_tcp` options merged in #6034. `verify_greeting`
+ `send_quit` replace the conflated `wait_for_greeting`; banner parsing
honours multi-line greetings (RFC 5321 §4.2.1), validates the reply code, and
distinguishes 220 success, 4xx/5xx rejection (real SMTP, `MX_ERROR`), and
non-SMTP listeners (`MX_INVALID`).
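
The three-way banner classification can be sketched as follows. The
real module is Lua; this C version, its names and its handling of
incomplete input (reported as non-SMTP rather than waiting for more
data) are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum banner { BANNER_OK, BANNER_REJECTED, BANNER_NOT_SMTP };

/*
 * Classify an SMTP greeting. Continuation lines ("220-...") are
 * skipped until the final line ("220 ...") is reached.
 */
static enum banner classify_banner(const char *buf, size_t len)
{
    assert(buf != NULL);
    size_t pos = 0;

    while (len - pos >= 3) {
        const char *l = buf + pos;
        size_t left = len - pos;

        if (!isdigit((unsigned char) l[0]) ||
            !isdigit((unsigned char) l[1]) ||
            !isdigit((unsigned char) l[2])) {
            return BANNER_NOT_SMTP;
        }

        int code = (l[0] - '0') * 100 + (l[1] - '0') * 10 + (l[2] - '0');
        bool more = left > 3 && l[3] == '-';

        if (!more) {
            if (left > 3 && l[3] != ' ' && l[3] != '\r' && l[3] != '\n') {
                return BANNER_NOT_SMTP;
            }
            if (code == 220) {
                return BANNER_OK;
            }
            if (code >= 400 && code <= 599) {
                return BANNER_REJECTED; /* real SMTP, but refusing us */
            }
            return BANNER_NOT_SMTP;
        }

        /* Continuation line: move to the start of the next line. */
        const char *nl = memchr(l, '\n', left);
        if (nl == NULL) {
            return BANNER_NOT_SMTP; /* incomplete multi-line reply */
        }
        pos = (size_t) (nl - buf) + 1;
    }

    return BANNER_NOT_SMTP;
}
```

In terms of the symbols above, `BANNER_REJECTED` would correspond to
`MX_ERROR` and `BANNER_NOT_SMTP` to `MX_INVALID`.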

Adds informational symbols at score 0: `MX_REFUSED`, `MX_TIMEOUT_CONNECT`,
`MX_TIMEOUT_READ`, `MX_ERROR`, `MX_NXDOMAIN`, `MX_NULL` (RFC 7505 detection),
`MX_BROKEN` (every MX RR points at an unresolvable host). Primary symbols
(`MX_GOOD` / `MX_INVALID` / `MX_MISSING` / `MX_WHITE`) keep today's scores —
operator-visible behaviour is preserved, the new symbols are emitted alongside
for tuning data ahead of Phase B's two-path matrix.

Legacy keys are honoured with deprecation warnings: `timeout` maps to
`connect_timeout`, `wait_for_greeting` maps to `verify_greeting`. Adds a `port`
setting (default 25) so the module is testable on non-privileged ports.

Functional tests in test/functional/cases/167_mx_check.robot cover Null MX,
NXDOMAIN, broken-reference MX, connect-refused, and the A-fallback path.

Refs #6032.

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 13 May 2026 21:30:12 +0000 (22:30 +0100)]

Merge pull request #6035 from moisseev/url-scheme

[Minor] url_redirector: skip non-HTTP(S) URLs in http_walk

commit | commitdiff | tree

Alexander Moisseev [Tue, 12 May 2026 17:17:27 +0000 (20:17 +0300)]

[Minor] url_redirector: skip non-HTTP(S) URLs in http_walk

Non-HTTP(S) schemes (such as tel:, mailto:, etc.) cannot have HTTP
redirects. Attempting to follow them in http_walk is unnecessary and
could potentially lead to errors. This change skips these URLs early
in the redirect chain walk and emits the URL_REDIRECTOR_NON_HTTP
virtual symbol with a single option in the format:

scheme=http_chain->non_http_url

e.g.: telephone=click.example.com->tel:+71234567890

commit | commitdiff | tree

Alexander Moisseev [Tue, 12 May 2026 15:13:44 +0000 (18:13 +0300)]

[Fix] Do not add :// to mailto: URIs (RFC 6068)

mailto: is non-hierarchical — the // authority component never applies.
The bug was in rspamd_mailto_parse setting RSPAMD_URL_FLAG_MISSINGSLASHES
when // was absent, causing rspamd_url_parse_text to
inject :// into the stored string.

Note: bare email addresses detected via the @ pattern (user@example.net
in text, no scheme prefix) still go through a different path where
"mailto://" is injected as a literal prefix — that's a separate issue
and out of scope here.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 12 May 2026 15:57:40 +0000 (16:57 +0100)]

[Feature] memstat: per-callsite mempool counters and structured jemalloc

Track lifetime pools/chunks/bytes counters per mempool callsite and
expose them via rspamd_mempool_entry_stat_t. memory_stat now emits
per-arena jemalloc stats instead of the raw malloc_stats_print dump.
The rspamadm control memstat renderer gains --compact and --only
modes, sortable callsite columns (cur/total bytes and pools), and
prints just the callsite filename.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 12 May 2026 14:43:45 +0000 (15:43 +0100)]

[Feature] lua_task: bulk and regexp symbol lookups

Add table-form overloads to task:has_symbol() and task:get_symbol()
that accept {S1, S2, ..., Sn} and return true / a {name -> info} map
if any of the listed symbols fired. Both keep the legacy single-name
form (with optional shadow_result_name) untouched.

Introduce task:has_symbol_regexp(re [, shadow_result_name]) and
task:get_symbol_regexp(re [, shadow_result_name]) that match fired
symbol names against an rspamd_regexp userdata.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 11 May 2026 17:45:13 +0000 (18:45 +0100)]

Merge pull request #6034 from rspamd/vstakhov-lua-tcp-phased

[Feature] lua_tcp: phase-specific timeouts and on_error callback

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 11 May 2026 17:41:00 +0000 (18:41 +0100)]

Merge pull request #6027 from moisseev/fuzzy-flags

[Minor] Warn on fuzzy flag collisions across writable rules

Mirror of https://github.com/rspamd/rspamd.git
