git.ipfire.org Git - thirdparty/rspamd.git/log

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 19 Jun 2026 08:50:15 +0000 (09:50 +0100)]

Merge pull request #6106 from rspamd/vstakhov-text-stats

[Feature] lua_text: byte-distribution statistics methods

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 18 Jun 2026 19:32:17 +0000 (20:32 +0100)]

Merge branch 'master' into vstakhov-text-stats

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 18 Jun 2026 18:16:54 +0000 (19:16 +0100)]

[Minor] CI: trigger push only on master to dedup PR runs

A commit on a branch with an open PR fired both a push and a
pull_request run; concurrency cancel-in-progress reaped one, leaving a
spurious *cancelled* run that reads like a CI failure. Restrict push to
master so feature branches run a single pull_request workflow; key the
concurrency group on PR number / ref to still cancel superseded runs.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 18 Jun 2026 17:56:29 +0000 (18:56 +0100)]

Merge pull request #6105 from rspamd/vstakhov-multipattern-som

[Feature] multipattern: explicit SOM flag and offset docs

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 18 Jun 2026 14:12:09 +0000 (15:12 +0100)]

[Feature] lua_text: byte-distribution statistics methods

Add byte-distribution statistics as methods on the rspamd_text class,
implemented in C++20 under src/lua (lua_text_stats.{hxx,cxx}); lua_text.c
is left untouched and the rspamd{text} metatable is augmented at load.

Methods (each takes an optional 0-based (off, len) range, defaulting to
the whole buffer):
  - text:entropy([off[, len]])              Shannon entropy, bits/byte
  - text:byte_mean([off[, len]])            mean of unsigned byte values
  - text:byte_deviation(mean[, off[, len]]) mean abs deviation from mean
  - text:serial_correlation([off[, len]])   ENT serial correlation
  - text:monte_carlo_pi([off[, len]])       ENT Monte-Carlo Pi deviation

The core is header-only, allocation-free and O(n) (a single histogram
pass shared by entropy/mean/deviation) and produces deterministic,
bit-reproducible results. Offsets are byte offsets, 0-based; the range is
clamped to the buffer and an out-of-range or empty range yields 0.
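A minimal Python sketch of the range handling and three of the statistics described above. It is not the C++ implementation: the helper names (clamp_range, histogram) and the exact clamping rules are assumptions made for illustration, and the buffer is passed as an argument rather than used as a method receiver.

```python
import math
from collections import Counter


def clamp_range(n, off=0, length=None):
    """Clamp a 0-based (off, len) byte range to a buffer of n bytes."""
    if off < 0 or off >= n:
        return 0, 0  # out-of-range start: treated as an empty range
    if length is None or off + length > n:
        length = n - off
    return off, max(length, 0)


def histogram(buf, off=0, length=None):
    """Single O(n) pass; entropy, mean and deviation all reuse the result."""
    off, length = clamp_range(len(buf), off, length)
    return Counter(buf[off:off + length]), length


def entropy(buf, off=0, length=None):
    """Shannon entropy in bits per byte; 0 for an empty range."""
    hist, n = histogram(buf, off, length)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in hist.values())


def byte_mean(buf, off=0, length=None):
    """Mean of the unsigned byte values; 0 for an empty range."""
    hist, n = histogram(buf, off, length)
    return sum(value * c for value, c in hist.items()) / n if n else 0.0


def byte_deviation(buf, mean, off=0, length=None):
    """Mean absolute deviation from a caller-supplied mean."""
    hist, n = histogram(buf, off, length)
    return sum(abs(value - mean) * c for value, c in hist.items()) / n if n else 0.0
```

A uniform buffer such as bytes(range(256)) reaches the 8 bits/byte maximum, while a single repeated byte yields 0.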

Add C++ doctest golden-vector tests (analytically-derived exact values)
and Lua unit tests covering empty/single-byte/uniform/two-symbol buffers,
overlapping groups, slicing and edge cases.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 18 Jun 2026 13:22:25 +0000 (14:22 +0100)]

[Feature] multipattern: explicit SOM flag and offset docs

SOM (start-of-match) reporting already exists on master as the default
(hyperscan compiles every pattern with HS_FLAG_SOM_LEFTMOST), but there
was no explicit way to request it and the offset convention was
undocumented.

- Add RSPAMD_MULTIPATTERN_SOM (rspamd_trie.flags.som): an explicit
  opt-in for start offsets that also overrides no_start/single_match
  (forces SOM and drops the incompatible SINGLEMATCH).
- Document the offset convention: pattern id is 1-based; match start
  and end are byte offsets, 0-based, start inclusive and end exclusive
  (one past the last matched byte), so end - start is the match length.
- Fix the regex (flags.re) fallback used when hyperscan is unavailable:
  it discarded the real PCRE start and reported end - strlen(pattern),
  which is bogus for variable-length matches. It now reports the true
  start/end from rspamd_regexp_search.
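
The offset convention can be illustrated with a short Python sketch. It uses the standard re module as a stand-in for the real matcher (so matches of one pattern never overlap each other), and find_all is a hypothetical helper, not part of the rspamd_trie API.

```python
import re


def find_all(patterns, text):
    """Return (id, start, end) tuples.

    id is 1-based; start and end are 0-based byte offsets, start inclusive
    and end exclusive, so end - start is the match length.
    """
    matches = []
    for idx, pattern in enumerate(patterns, start=1):
        for m in re.finditer(pattern, text):
            matches.append((idx, m.start(), m.end()))
    return sorted(matches, key=lambda t: (t[1], t[0]))


hits = find_all([rb"foo", rb"ba+r"], b"foo bar foobar")

# Variable-length match: the true start is 1, while the old fallback
# formula end - strlen(pattern) would have reported 6 - 4 = 2.
pattern = rb"ba+r"
(_, start, end), = find_all([pattern], b"xbaaar")
bogus_start = end - len(pattern)
```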

Add C++ (rspamd_cxx_unit_multipattern.hxx) and Lua (trie.lua) unit
tests asserting (id, start, end) against hand-computed positions:
multiple/overlapping occurrences, icase, literal vs regex, no-match,
SOM-overrides-single_match and a large buffer. Existing rspamd_trie
behaviour and its callers (url.c, lang_detection, lua plugins) are
unchanged.

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 17 Jun 2026 17:52:16 +0000 (18:52 +0100)]

[Fix] monitored: alphanumeric-only random DNS prefixes

random_monitored RBL checks built random labels from an alphabet that
included '-' and '_'. That produced names like '_Q8...0-' (leading
underscore, trailing hyphen) or '-7d0...' (leading hyphen) which are
not valid DNS labels (RFC 952/1123: no leading/trailing hyphen, no
underscore in hostnames). Authoritative DNSBL servers such as
spfbl.net reject these with SERVFAIL.

Restrict the alphabet to alphanumerics, which always forms a valid
label regardless of position while keeping ample entropy.
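
A rough Python sketch of the idea, not the C code in the monitored module. The alphabet, the label length and both function names are assumptions chosen for illustration.

```python
import secrets
import string

# Assumed alphabet: alphanumerics only, so any position in the label is valid.
ALPHABET = string.ascii_lowercase + string.digits
HOSTNAME_CHARS = set(string.ascii_letters + string.digits + "-")


def random_label(length=16):
    """Build a random DNS label from alphanumerics only."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


def is_valid_hostname_label(label):
    """Hostname label rules: 1..63 chars, letters/digits/hyphen,
    no leading or trailing hyphen, no underscore."""
    if not 1 <= len(label) <= 63:
        return False
    if label[0] == "-" or label[-1] == "-":
        return False
    return all(c in HOSTNAME_CHARS for c in label)
```

With a 36-character alphabet a 16-character label still has 36^16 possible values, so collisions remain very unlikely.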

Fixes #6103

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 17 Jun 2026 17:49:15 +0000 (18:49 +0100)]

Merge pull request #6104 from rspamd/vstakhov-coroutines

[Fix] lua: state management and reuse safety for coroutine thread pool

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 16 Jun 2026 19:05:26 +0000 (20:05 +0100)]

[Fix] lua: refcount coroutine thread entries

The state/generation guards from the previous commit read the entry on
resume, so they only help while the entry is still allocated. If the
owning task is torn down while an async request is in flight, the entry
could be freed before the late completion fires, turning the guard into
a use-after-free.

Make thread_entry refcounted (ref.h, non-atomic - workers are single
threaded). The pool holds the initial reference; terminate_thread() and
the pool-full path now drop it via REF_RELEASE instead of freeing
directly, so the struct is destroyed only once the last reference goes
away. Every async library that stashes an entry for a later completion
now takes its own reference and drops it when done:

  - dns/util: retain at the request, release in the one-shot callback.
  - http/redis: retain at yield, release in the cbdata/ctx destructor
    (and, for redis, at each point that consumes ctx->thread).
  - tcp: retain at each yield, release at the matching resume, since the
    tcp cbdata has several direct-free error paths that bypass its
    destructor; pairing with yield/resume keeps the balance exact and
    leaves bad-argument paths (which return before yielding) untouched.

Combined with the generation guard, a completion that races task
teardown now finds the entry alive but DEAD/recycled and refuses the
resume, instead of dereferencing freed memory.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 16 Jun 2026 15:35:34 +0000 (16:35 +0100)]

[Fix] lua: state management for coroutine thread pool

Async libraries (dns/redis/tcp/http/util) capture the "currently
running" coroutine at a yield point and resume it later from a C
completion callback. Nothing previously guaranteed the entry resumed
was still the one captured: a double-fired event, a completion racing
task teardown, or an entry recycled into another task would resume the
wrong (or freed) coroutine and corrupt memory. These failures are
interleaving-dependent and invisible in isolation.

Give each pooled thread an explicit lifecycle (FREE/RUNNING/YIELDED/
DEAD) plus a generation counter, both carried in the existing
thread_entry and per-request cbdata structs - no new allocations on any
hot path:

  - get/return/terminate/yield/resume enforce legal state transitions,
    so returning a suspended thread or resuming a non-suspended one now
    aborts at the exact violation instead of corrupting a core later.
  - lua_thread_resume_checked() refuses to resume unless the thread is
    still YIELDED and its generation matches the value snapshotted at
    the yield point; a stale/duplicate completion becomes a logged
    no-op rather than a wrong-coroutine resume. All five async libs are
    migrated to it.
  - lua_tcp keeps cbd->thread pointing at the coroutine actually
    yielded by sync read/write, so the resume always targets it.

generation is bumped on every acquire and release, so an entry that
goes back to the pool and is handed out again no longer matches a
completion that was already in flight.
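
The lifecycle and generation guard can be modelled in a few lines of Python. This is a toy model under stated assumptions, not the C implementation: Pool, yield_thread and resume_checked are illustrative names, and illegal transitions raise an exception where the real code aborts.

```python
from enum import Enum


class State(Enum):
    FREE = 0
    RUNNING = 1
    YIELDED = 2
    DEAD = 3


class ThreadEntry:
    def __init__(self):
        self.state = State.FREE
        self.generation = 0


class Pool:
    def __init__(self):
        self.free = []

    def get(self):
        entry = self.free.pop() if self.free else ThreadEntry()
        if entry.state is not State.FREE:
            raise RuntimeError("acquired a thread that is not FREE")
        entry.state = State.RUNNING
        entry.generation += 1  # bumped on every acquire
        return entry

    def put(self, entry):
        if entry.state is not State.RUNNING:
            raise RuntimeError("returned a thread that is not RUNNING")
        entry.state = State.FREE
        entry.generation += 1  # bumped on every release
        self.free.append(entry)


def yield_thread(entry):
    """Suspend the thread and return the generation snapshot for the callback."""
    if entry.state is not State.RUNNING:
        raise RuntimeError("yield from a thread that is not RUNNING")
    entry.state = State.YIELDED
    return entry.generation


def resume_checked(entry, snapshot):
    """Resume only if still YIELDED and the generation matches the snapshot."""
    if entry.state is not State.YIELDED or entry.generation != snapshot:
        return False  # stale or duplicate completion: no-op
    entry.state = State.RUNNING
    return True
```

A duplicate completion fails the state check, and a completion that outlives a pool round trip fails the generation check even though the entry is YIELDED again.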

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 16 Jun 2026 07:54:45 +0000 (08:54 +0100)]

[Fix] dns: do not defer resolver nameservers (fixes #6096)

A nameserver that failed to resolve at config time was turned into a
zero-address PENDING_RESOLVE upstream by 904fd6218, then dereferenced as
NULL in rspamd_dns_server_init -> SIGSEGV at worker startup (regression
in 4.1.0). DNS resolver nameservers are consumed synchronously by
rdns_resolver_add_server and can never be promoted async, so never defer
them; also NULL-guard rspamd_dns_server_init as defense in depth.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 15 Jun 2026 08:59:47 +0000 (09:59 +0100)]

[Fix] mx_check: loopback-only MX is LOCAL, not bogon

A domain whose MX resolves only to loopback is hosted on the scanning
host itself -- a self-MX, typically the host's own FQDN mapped to
127.0.0.1 in /etc/hosts, which rspamd's resolver honours as a fake reply
that shadows public DNS. That made fully DMARC-aligned self-hosted mail
score MX_BOGON_ONLY (+8.0): the strongest "not spam infrastructure"
signal treated as the strongest spam signal.

Move 127.0.0.0/8 and ::1/128 from BOGON_CIDRS to LOCAL_CIDRS so a
loopback-only MX emits MX_LOCAL_ONLY (3.0) instead. test_mode now lifts
loopback out of the LOCAL set (was: bogon) so the probe path stays
exercisable against local listeners.
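
A small Python sketch of the reclassification, using the standard ipaddress module. Only the two loopback networks come from this commit; the other list entries and the classify helper are placeholders for illustration and do not reflect the plugin's real lists or logic.

```python
import ipaddress

# Loopback now lives in the LOCAL set (moved from BOGON by this commit).
# The remaining entries are illustrative placeholders only.
LOCAL_CIDRS = [ipaddress.ip_network(n) for n in
               ("127.0.0.0/8", "::1/128", "10.0.0.0/8")]
BOGON_CIDRS = [ipaddress.ip_network(n) for n in
               ("0.0.0.0/8", "240.0.0.0/4")]


def in_any(addr, networks):
    # Membership is False when the address family differs from the network's.
    return any(addr in net for net in networks)


def classify(mx_addresses):
    """Return a symbol only when every MX address falls in the same set."""
    addrs = [ipaddress.ip_address(a) for a in mx_addresses]
    if addrs and all(in_any(a, LOCAL_CIDRS) for a in addrs):
        return "MX_LOCAL_ONLY"
    if addrs and all(in_any(a, BOGON_CIDRS) for a in addrs):
        return "MX_BOGON_ONLY"
    return None
```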

Add a regression test (170_mx_check_selfmx.robot, test_mode = false):
the existing suites run test_mode = true and cannot cover the production
loopback-classification path.

Closes #6101

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 13 Jun 2026 20:32:40 +0000 (21:32 +0100)]

Merge pull request #6102 from moisseev/devdeps

[Test] Update dev dependencies

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 13 Jun 2026 12:18:18 +0000 (13:18 +0100)]

[Feature] neural: sequence output mode and SIF word selection in fasttext_embed

Add a generic word-vector sequence output to the fasttext_embed provider
so that custom ANN architectures (e.g. attention pooling) can learn their
own pooling instead of receiving a pre-pooled vector:

* output_mode = "sequence": emits the first max_words word vectors
  flattened word-major and zero-padded to max_words * channels.
* word_selection = "sif": since order-invariant poolers do not need a
  prefix, optionally fill the sequence with the max_words most
  distinctive (highest SIF weight) words from anywhere in the message
  instead of the leading ones. Default stays "prefix".

This is the data half only; how the sequence is consumed is left to the
ANN architecture. Both modes are bounded by max_words.
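
The layout of the sequence output can be sketched in Python as follows. The function name and arguments are hypothetical, and keeping the SIF-selected words in message order is an assumption; an order-invariant pooler would not care either way.

```python
def sequence_output(word_vectors, max_words, channels, sif_weights=None):
    """Flatten up to max_words word vectors word-major and zero-pad the
    result to max_words * channels values.

    sif_weights=None models word_selection = "prefix"; a list of per-word
    weights models word_selection = "sif".
    """
    if sif_weights is None:
        chosen = word_vectors[:max_words]
    else:
        # Pick the max_words highest-weight words from anywhere in the message.
        top = sorted(range(len(word_vectors)),
                     key=lambda i: -sif_weights[i])[:max_words]
        chosen = [word_vectors[i] for i in sorted(top)]
    flat = [x for vec in chosen for x in vec]
    flat.extend([0.0] * (max_words * channels - len(flat)))
    return flat
```

Either way the output length is fixed at max_words * channels, which is what lets a downstream architecture treat it as a fixed-size input.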

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 13 Jun 2026 12:18:01 +0000 (13:18 +0100)]

[Feature] neural: pluggable feature-provider and ANN-architecture registries

Turn the neural plugin into an extension point so third-party (including
closed-source) modules can add feature providers and network topologies
without patching the core.

* register_architecture(name, builder) / get_architecture(name): a
  registry of ANN builders, function(n_inputs, rule) -> kann network.
  The built-in 'symbol', 'embedding' and 'conv1d' architectures are now
  registered through it; create_ann() dispatches on rule.architecture
  and falls back to the historical auto-selection when it is unset, so
  existing configs are unaffected.
* register_provider (already present) and register_architecture are
  exported from the neural module, so a module that does
  require 'plugins/neural' can register a custom provider or
  architecture and select it with provider type / rule.architecture.

An unknown rule.architecture now fails loudly with a hint that the
providing module may not be loaded, instead of silently falling back.
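
The registry pattern described above, as a Python sketch. The plugin itself is Lua and its builders return kann networks; here a builder returns a plain string, and "symbol" stands in for the historical auto-selection, both purely for illustration.

```python
_architectures = {}


def register_architecture(name, builder):
    """builder is a callable (n_inputs, rule) -> network."""
    _architectures[name] = builder


def get_architecture(name):
    return _architectures.get(name)


def create_ann(n_inputs, rule):
    name = rule.get("architecture")
    if name is None:
        name = "symbol"  # stand-in for the historical auto-selection
    builder = get_architecture(name)
    if builder is None:
        # Fail loudly instead of silently falling back.
        raise LookupError(
            f"unknown architecture {name!r}; "
            "is the module that provides it loaded?")
    return builder(n_inputs, rule)


# Built-in and third-party architectures register the same way.
register_architecture("symbol", lambda n, rule: f"symbol({n})")
register_architecture("my_attention", lambda n, rule: f"attention({n})")
```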

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 13 Jun 2026 12:17:42 +0000 (13:17 +0100)]

[Feature] lua_kann: expose slice and concat transforms

kad_slice and kad_concat_array were already implemented and serialized
in kautodiff but not reachable from Lua. Exposed as
rspamd_kann.transform.slice(node, axis, start, end) (0-based, end
exclusive, batch is axis 0) and rspamd_kann.transform.concat(axis,
node1, node2, ...). These make split/bypass/merge graph topologies
buildable from Lua, e.g. routing different parts of a fused input
vector through different sub-networks (needed by custom ANN
architectures such as attention pooling with late fusion).

commit | commitdiff | tree

Alexander Moisseev [Sat, 13 Jun 2026 07:16:53 +0000 (10:16 +0300)]

[Test] Update dev dependencies

Update ESLint 10.3.0 → 10.5.0, stylelint 17.11.0 → 17.13.0, and related packages

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 12 Jun 2026 15:38:19 +0000 (16:38 +0100)]

[Feature] kann: add multi-head attention pooling operator

New kad operator attn_pool (op 38, appended to preserve model
serialization compatibility): multi-head dot-product attention pooling
over a zero-padded sequence of word vectors with learned query vectors.
All-zero positions are treated as padding and masked out of the
softmax; attention weights are stashed in gtmp between the forward and
backward passes. Exposed as kann_layer_attn_pool() and
rspamd_kann.layer.attn_pool(node, n_words[, n_heads]).

Verified: converges on a needle-in-haystack task unsolvable by a flat
dense net (0.985 vs 0.715 accuracy), exact word-order invariance of the
pooled output, padding determinism and save/load roundtrip.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 12 Jun 2026 14:35:59 +0000 (15:35 +0100)]

[Fix] neural: stabilize training on dense embedding inputs

Training an ANN on dense provider features (fasttext_embed, text_hash)
could silently produce a degenerate model: with the historical
learning_rate=0.01 default, RMSprop drives the net into tanh saturation
depending on weight init luck - the loss freezes, yet the constant
all-one-class model is saved and classifies every message as spam (or
ham) until the next retrain. On a real corpus this happened in roughly
one of three weight inits.

Fixes:

* use the embedding (funnel) architecture for any rule with dense
  feature providers, not only LLM ones: the simple symbol architecture
  applies ReLU directly to the input, clipping the negative half of the
  embedding space, and is the least stable option on such vectors
  (it is also less accurate; layernorm in the funnel fixes the
  conditioning)
* resolve the learning_rate default by input type: 0.01 for symbol
  vectors as before, 0.001 for dense embeddings, which converges
  reliably with equal accuracy; an explicit config value still wins
* add a quality gate to the training child: a model with constant or
  single-class output on its own training set is rejected instead of
  saved; the lock is released and training retries on the next cycle
  with a different weight init, which converges in practice
* return an explicit msgpack rejection marker from the training child
  instead of nil on the gate/NaN paths: a nil return used to deadlock
  the controller against the training subprocess (see the lua_worker
  fix) and stalled training forever
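
A possible shape for the quality gate, sketched in Python. The threshold and tolerance are hypothetical values, not the ones used by the plugin, and the function only models the decision, not the msgpack rejection marker.

```python
import math


def passes_quality_gate(outputs, threshold=0.5, eps=1e-6):
    """Reject a model whose output on its own training set is NaN,
    constant, or falls entirely on one side of the decision threshold."""
    if not outputs or any(math.isnan(o) for o in outputs):
        return False
    if max(outputs) - min(outputs) < eps:
        return False  # constant output: the net is saturated
    predicted_classes = {o > threshold for o in outputs}
    return len(predicted_classes) == 2  # both spam and ham are predicted
```

A rejected model would then be discarded and training retried on the next cycle with a fresh weight init.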

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 12 Jun 2026 14:35:41 +0000 (15:35 +0100)]

[Fix] lua_worker: do not deadlock when subprocess returns an invalid value

When a function run via worker:spawn_process returned nil (or any
non-string value), the child logged an error but wrote nothing to the
result pipe. The parent then kept waiting for a reply while the child
blocked forever on the post-reply ack read, deadlocking both processes
and anything serialised behind them (e.g. the neural training lock,
which got extended indefinitely so training never retried until a full
restart).

Report invalid return values to the parent as a regular error reply so
on_complete fires and the caller can recover.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 12 Jun 2026 11:03:34 +0000 (12:03 +0100)]

Merge pull request #6065 from dragoangel/fix/url-suspect-oneshot

[Fix] Do not multiply URL multiple AT signs and backslash in URLs

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 12 Jun 2026 10:46:34 +0000 (11:46 +0100)]

[Fix] protocol: use case preserving boundary for HTTP multipart parsing

RFC 2046 boundaries are case sensitive, but the v3 HTTP multipart
callers used ct->boundary, which is lowercased to work around MIME client quirks,
so requests with uppercase characters in the boundary failed to parse.
Use ct->orig_boundary in protocol.c, rspamd_proxy.c and rspamdclient.c.
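
The failure mode is easy to reproduce in a few lines of Python. The count_parts helper and the sample body are illustrative only and unrelated to the C parser.

```python
def count_parts(body, boundary):
    """Count parts by splitting on the case-sensitive '--boundary' delimiter."""
    chunks = body.split(b"--" + boundary)[1:]
    # The chunk after the closing delimiter starts with '--'.
    return sum(1 for chunk in chunks if not chunk.startswith(b"--"))


body = (b"--AbC123\r\n"
        b"Content-Type: text/plain\r\n\r\n"
        b"hello\r\n"
        b"--AbC123--\r\n")

with_original = count_parts(body, b"AbC123")            # case preserved
with_lowercased = count_parts(body, b"AbC123".lower())  # delimiter never found
```

With the original boundary one part is found; with the lowercased copy the delimiter never matches and no parts are found.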

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 12 Jun 2026 00:15:50 +0000 (01:15 +0100)]

[Feature] css: detect more text hiding tricks

Extend the invisible-text detection with several common CSS hiding
techniques used to dilute visible content with hidden ham text:

- off-screen positioning: position:absolute|fixed with a large negative
  left/top
- image-replacement text-indent: a large negative text-indent
- clip / clip-path collapsing the element to a zero area, e.g.
  rect(0,0,0,0), inset(100%), circle(0)
- visibility:collapse (treated as hidden)
- tiny font sizes (<= 3px), not only font-size:0

These are modelled as a hidden display in compile_to_block so the
hiding correctly propagates to descendants. The previous off-screen
heuristic was a fragile substring match on the raw style that only
incremented a feature counter and never actually hid the text; it is
replaced by structured parsing of the position/left/top/text-indent/
clip properties, with the offscreen feature counter now driven by a
flag set on the compiled block.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 11 Jun 2026 22:02:56 +0000 (23:02 +0100)]

[Fix] css: detect text hidden via overflow clipping, opacity and max-* sizes

Phishing messages dilute the visible content with hidden ham text using
CSS hiding techniques that the parser did not understand:

- 'max-width:0; max-height:0; overflow:hidden' was fully ignored as
  max-width/max-height/overflow were not parsed at all
- 'opacity:0' was parsed but the value was silently discarded in
  compile_to_block
- 'height:0' was applied to the block width due to a copy-paste bug,
  and zero dimensions were never considered by compute_visibility

Fixes:

- parse max-width/max-height (clamping width/height) and overflow
- treat a block with zero height or width and overflow:hidden as
  invisible, propagating it to descendants via the display value
- treat opacity < 0.1 as a hidden display, as descendants cannot reset
  the ancestor opacity
- do not allow a child display value to resurrect content of a hidden
  ancestor in propagate_block (display:none is not resettable in CSS)
- fix the height->width copy-paste bug and a missing break that made
  the font-size case fall through into the opacity case

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 11 Jun 2026 19:50:57 +0000 (20:50 +0100)]

Merge pull request #6100 from a16bitsysop/s390x-test

[Fix] unit test checks upstream rate limit state using custom fake_clock

commit | commitdiff | tree

Duncan Bellamy [Thu, 11 Jun 2026 18:37:18 +0000 (19:37 +0100)]

[Fix] unit test checks upstream rate limit state using a custom fake_clock

The test runs under a mock environment where the event loop (ev_run()) is never actually executed.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 11 Jun 2026 17:20:02 +0000 (18:20 +0100)]

[Fix] mime: fix build with OpenSSL 4.0 opaque ASN1_STRING

OpenSSL 4.0 made ASN1_STRING (and thus ASN1_OCTET_STRING) opaque, so
direct access to its length/data fields no longer compiles. Use
ASN1_STRING_length()/ASN1_STRING_get0_data() which are available since
OpenSSL 1.1.0 and LibreSSL 2.7.

Also move the legacy OpenSSL init calls (ERR_load_crypto_strings,
SSL_load_error_strings, OpenSSL_add_all_*) under the pre-1.1.0 guard:
they are redundant on modern OpenSSL and break no-deprecated builds.

Fixes: #6087

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 11 Jun 2026 14:05:13 +0000 (15:05 +0100)]

[Fix] milter: send QUARANTINE even with a custom reply

The QUARANTINE command was sent inside the `if (!reply)` guard that
synthesises a default quarantine reason, so a caller-supplied SMTP
message (e.g. task:set_pre_result('quarantine', 'reason')) suppressed
the command entirely and the message was accepted instead of
quarantined. Affects both METRIC_ACTION_QUARANTINE and reject converted
via quarantine_on_reject. Regression from fbc6e35db (3.10.0).

The reply now becomes the quarantine reason, matching the reject and
tempfail branches where a caller-supplied message takes precedence over
the configured default.

Reported by @johnmosli, who also attached the fix.

Issue: #6088

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 10 Jun 2026 19:03:33 +0000 (20:03 +0100)]

Merge pull request #6094 from moisseev/symcache

[Fix] symcache: fix timeout inflation in pre_postfilter_iter grouping

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 10 Jun 2026 19:02:34 +0000 (20:02 +0100)]

[Fix] url: scan bare query parameters containing '=' as a whole

A bare embedded URL is its own query parameter and can contain '='
itself (e.g. base64 padding in the path). The query-embedded scan
treated everything before the first '=' as a parameter key, so such
URLs were discarded. Treat the prefix as a key only when it has no
URL structure characters (':' or '/'); otherwise scan the whole
parameter.
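
The key-detection rule amounts to the following, sketched in Python with a hypothetical helper name; the real change is in the C URL scanner.

```python
def scan_target(param):
    """Return the part of a query parameter that should be scanned for a URL.

    The text before the first '=' counts as a key only when it contains no
    URL structure characters (':' or '/'); otherwise the whole parameter
    is scanned.
    """
    key, sep, value = param.partition("=")
    if sep and ":" not in key and "/" not in key:
        return value
    return param
```

For example, "u=http://example.com/" yields the value after the key, while a bare "http://example.com/aGk=" is kept whole.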

commit | commitdiff | tree

Alexander Moisseev [Wed, 10 Jun 2026 07:33:15 +0000 (10:33 +0300)]

[Fix] symcache: fix timeout inflation in pre_postfilter_iter grouping

The `saved_priority` initialization to -1 caused the first item in each
phase vector (prefilters/postfilters/idempotent) to be split off from
its priority group and counted individually, inflating the computed
maximum symbols cache timeout. Initialize to the first item's priority
instead so items at the same priority are correctly grouped.

Issue: #6092

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 9 Jun 2026 13:13:33 +0000 (14:13 +0100)]

[Test] checkv3: drop requests_toolbelt from form-data parse

The /checkv3 content negotiation tests parsed the multipart/form-data
reply with requests_toolbelt, a third-party module not present in every
test pipeline (e.g. the rspamd-docker functional run installs
python3-msgpack but not requests-toolbelt), so they failed with
ModuleNotFoundError.

Replace it with a self-contained stdlib HTTP-multipart splitter: split
on the boundary delimiter (HTTP-multipart style, deliberately not the
email/MIME parser used for the message/rfc822 case) and trim only the
single CRLF framing each part so binary (zstd) payloads stay byte-exact.
Drop the now-unneeded requests-toolbelt from the CI pip install.
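
A splitter of this kind can be quite small. The following Python sketch shows the general approach and is not the code from the test suite; it assumes the boundary string does not occur inside any payload and that every part carries at least one header line.

```python
def split_multipart(body, boundary):
    """Split an HTTP multipart body into (raw_headers, payload) pairs.

    Only the single CRLF that frames each part is trimmed, so binary
    payloads that themselves end in CRLF stay byte-exact.
    """
    parts = []
    for chunk in body.split(b"--" + boundary)[1:]:
        if chunk.startswith(b"--"):
            break  # closing delimiter: '--boundary--'
        if chunk.startswith(b"\r\n"):
            chunk = chunk[2:]   # CRLF that ends the delimiter line
        if chunk.endswith(b"\r\n"):
            chunk = chunk[:-2]  # CRLF that precedes the next delimiter
        headers, _, payload = chunk.partition(b"\r\n\r\n")
        parts.append((headers, payload))
    return parts


# Payload deliberately ends in CRLF to show that it survives the trim.
payload = b"\x28\xb5\x2f\xfd\x00\r\n"
body = (b"--B\r\nContent-Type: application/octet-stream\r\n\r\n"
        + payload + b"\r\n--B--\r\n")
parsed = split_multipart(body, b"B")
```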

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 9 Jun 2026 12:24:03 +0000 (13:24 +0100)]

Merge pull request #6089 from rspamd/vstakhov-neural-stale-fix

[Fix] neural: don't strand trained ANNs behind tombstones

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 9 Jun 2026 11:43:40 +0000 (12:43 +0100)]

[Fix] neural: don't strand trained ANNs behind tombstones

A trained ANN could become unreachable to workers even though
training succeeded: NEURAL_SPAM/NEURAL_HAM stopped firing while the
controller logged "ann ... is changed, our version = N, remote
version = M" forever.

Root cause is a version regression, not a missing zset registration.
The new version was seeded from the in-memory set.ann, and
fill_set_ann resets set.ann.version to 0 whenever a worker never
loaded an ANN (restart, or the selected profile's blob was missing).
A worker that trained from the _4 profile then saved version 1.
process_existing_ann selects the highest version among compatible
profiles, so the live version-1 blob was shadowed by the stale
version-4 zset entry whose key was empty. The profile zset has no
TTL, so the dead high-version tombstone was immortal and the
condition self-perpetuated (the _4 blob was never rewritten).

Three fixes:

1. Version monotonicity (lualib/plugins/neural.lua): seed the new
   version from the profile actually trained from (the trained-from
   key encodes it as the trailing _<n>), max'd with
   training_profile/set.ann, so the new entry always outranks the
   profile it supersedes.

2. Liveness-aware selection (src/plugins/lua/neural.lua,
   neural_maybe_invalidate.lua): when the selected profile's blob is
   missing, fall back to the next compatible profile with a live blob
   instead of going dark, and emit a throttled warning (was a silent
   debug line). The invalidate script also GCs profile entries that
   have no blob and no training data and are older than a grace
   window.

3. Lifetime coupling (neural_save_unlock.lua,
   src/plugins/lua/neural.lua): give the profile zset a TTL refreshed
   each check_anns cycle, and refresh the blob TTL on every reload,
   so an actively used ANN never expires out from under its entry.
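
The first two fixes can be modelled in Python as below. The key name in the example, the tuple layout and all function names are assumptions for illustration; the real logic lives in the Lua plugin and its Redis scripts.

```python
import re


def profile_version(redis_key):
    """Version encoded as the trailing _<n> of a profile key (0 if absent)."""
    m = re.search(r"_(\d+)$", redis_key)
    return int(m.group(1)) if m else 0


def next_version(trained_from_key, training_profile_version, set_ann_version):
    """Seed from the highest known version so the new entry always
    outranks the profile it supersedes."""
    base = max(profile_version(trained_from_key),
               training_profile_version,
               set_ann_version)
    return base + 1


def select_profile(profiles):
    """profiles: list of (version, has_live_blob). Pick the highest version
    that still has a blob; None if no profile is usable."""
    live = [version for version, has_blob in profiles if has_blob]
    return max(live) if live else None
```

In the scenario from this commit, a worker with set.ann.version reset to 0 that trains from the _4 profile now saves version 5 rather than 1, and a version-4 tombstone without a blob no longer shadows a live version-1 entry.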

Adds 330_neural/005_stale_version.robot, which injects a
higher-version tombstone and asserts inference recovers.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 8 Jun 2026 08:51:58 +0000 (09:51 +0100)]

Merge pull request #6072 from xandris/bugfix/mime_string_const_iterator

fix: mime_string const iterator

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 7 Jun 2026 16:27:38 +0000 (17:27 +0100)]

Merge pull request #6083 from rspamd/vstakhov-controller-checkv3

[Feature] checkv3: controller endpoint + Accept/Accept-Encoding negotiation

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 7 Jun 2026 16:27:24 +0000 (17:27 +0100)]

Merge pull request #6084 from moisseev/redirector

[Minor] url_redirector: distinguish direct URL errors from redirect errors

commit | commitdiff | tree

Alexander Moisseev [Sun, 7 Jun 2026 08:53:45 +0000 (11:53 +0300)]

[Minor] url_redirector: distinguish direct URL errors from redirect errors

When http_callback reports an error on the first hop (orig_url == url),
no redirect has occurred yet, but the old message "found redirect error
from X to X" implied one. Split the message: "error checking URL" for
direct failures and "redirect error: A -> B" for mid-chain failures.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 6 Jun 2026 19:21:58 +0000 (20:21 +0100)]

Merge pull request #6082 from moisseev/simdutf

[Fix] Prioritise bundled simdutf headers over system ones

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 6 Jun 2026 17:00:24 +0000 (18:00 +0100)]

[Minor] checkv3: trim verbose comments to house style

Condense the multi-line explanatory blocks added with the negotiation
reply to single-line notes, and drop a dangling 'see contract above'
reference that pointed to nothing in the source.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 6 Jun 2026 14:33:58 +0000 (15:33 +0100)]

[Feature] checkv3: Accept/Accept-Encoding negotiation

The /checkv3 reply ignored Accept beyond a json-vs-msgpack toggle and
always emitted a hard-coded multipart/mixed body (with form-data part
headers). There was no way to ask for a plain v2-style json/msgpack
body, no true multipart/form-data reply for HTTP multipart parsers,
and Accept-Encoding had no defined default.

Negotiate the representation solely from Accept and compression solely
from Accept-Encoding on the single chokepoint all three workers (normal
scan worker, rspamd_proxy, controller) share, the reply_v3 helper:

  application/json | application/msgpack -> single-body v2 reply
  message/rfc822                         -> multipart/mixed envelope
  multipart/form-data                    -> multipart/form-data envelope
  absent / wildcard                      -> multipart/form-data default
  only unsupported types (e.g. xml)      -> 406 Not Acceptable

Inside the multipart envelopes the result-part serialization mirrors
the input metadata serialization (json or msgpack); the two envelopes
differ only in the top-level Content-Type. Compression honours
Accept-Encoding: zstd and defaults to identity. Vary: Accept,
Accept-Encoding is always advertised. Negotiation reuses the existing
http_content_negotiation parser (q-values + wildcards), extended with
two media types; the input metadata serialization is recorded on the
task via a new protocol flag.
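
The mapping above can be approximated with a short Python sketch. It is a simplification of what http_content_negotiation does: only "*/*" is handled as a wildcard, ties are broken by header order rather than specificity, and the return value is a hypothetical (status, reply Content-Type) pair.

```python
REPLY_FOR = {
    "application/json": "application/json",        # single-body v2 reply
    "application/msgpack": "application/msgpack",  # single-body v2 reply
    "message/rfc822": "multipart/mixed",
    "multipart/form-data": "multipart/form-data",
}
DEFAULT_REPLY = "multipart/form-data"


def negotiate(accept):
    """Map an Accept header value to (status, reply Content-Type)."""
    if accept is None or not accept.strip():
        return 200, DEFAULT_REPLY
    ranges = []
    for item in accept.split(","):
        fields = item.split(";")
        media = fields[0].strip().lower()
        q = 1.0
        for param in fields[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if media and q > 0:
            ranges.append((q, media))
    ranges.sort(key=lambda r: -r[0])  # stable: header order breaks ties
    for _, media in ranges:
        if media in REPLY_FOR:
            return 200, REPLY_FOR[media]
        if media == "*/*":
            return 200, DEFAULT_REPLY
    return 406, None  # only unsupported types were offered
```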

rspamc previously sent Accept: application/json|msgpack for v3, which
now selects a single-body reply it does not expect; it now requests
multipart/form-data and accepts any multipart/ subtype, with the result
serialization carried by the metadata Content-Type.

Tested by a new C++ content_negotiation suite, multipart envelope-mode
unit tests, and a functional negotiation suite run against both the
normal worker and the controller (json/msgpack/email-MIME/HTTP-multipart
parsers). Adds msgpack/requests/requests-toolbelt to functional CI deps.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 6 Jun 2026 12:45:18 +0000 (13:45 +0100)]

[Fix] controller: support the /checkv3 endpoint

The scan worker and rspamd_proxy both handle /checkv3 (multipart
metadata + message in, multipart/mixed results out), but the
controller's HTTP path router never registered it. Since rspamc
defaults to the controller port (11334) for localhost, a plain
`rspamc --protocol-v3` returned 404 with
"rspamd_http_router_finish_handler: path: /checkv3 not found".

Register /checkv3 on the controller (routed to the existing scan
handler) and branch on CMD_CHECK_V3 in both directions:
parse the body via rspamd_protocol_handle_v3_request() on input and
emit the multipart reply via rspamd_protocol_http_reply_v3() on
output, mirroring the proxy. Auth posture matches /check and
/checkv2 (read command, no enable password).

commit | commitdiff | tree

Alexander Moisseev [Sat, 6 Jun 2026 06:36:24 +0000 (09:36 +0300)]

[Fix] Prioritise bundled simdutf headers over system ones

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 5 Jun 2026 10:37:50 +0000 (11:37 +0100)]

[Minor] Update version to 4.1.1

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 5 Jun 2026 10:36:33 +0000 (11:36 +0100)]

Release 4.1.0

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 5 Jun 2026 09:51:58 +0000 (10:51 +0100)]

Merge pull request #6080 from moisseev/redirector-log

[Minor] url_redirector: clarify log messages for successful HTTP responses

commit | commitdiff | tree

Alexander Moisseev [Fri, 5 Jun 2026 09:18:26 +0000 (12:18 +0300)]

[Minor] url_redirector: clarify log messages for successful HTTP responses

The phrases "err code 200" and "err code <N>" are misleading since
they refer to HTTP status codes, not errors. Successful resolutions
(HTTP 200) and intermediate redirects (30x) now use unambiguous
wording that clearly separates the action from the status code.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 5 Jun 2026 08:09:20 +0000 (09:09 +0100)]

[Fix] url: canonicalise bare emails to slash-less mailto:

A bare email in text/HTML and the same address inside an explicit
mailto: URL were extracted as two separate emails. The '@' matcher (and
the HTML bare-email path) injected a literal "mailto://" prefix, while a
parsed mailto: URL is non-hierarchical and drops the // (RFC 6068,
a4ae51536). The URL/email khash dedup keys on the full url->string
(hash + a urllen guard ahead of rspamd_emails_cmp), so the two string
forms landed in different buckets and never collapsed -> the address was
duplicated.

Canonicalise both bare-email injection sites to the slash-less "mailto:"
form so every path yields the identical mailto:user@host string and the
existing dedup works:

- src/libserver/url.c: '@' matcher prefix "mailto://" -> "mailto:"
- src/libserver/html/html_url.cxx: same for bare emails in HTML

Adjust the extract_specific_urls scheme-strip helper to tolerate the
slash-less form, and add a regression test in test/lua/unit/url.lua.
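
Why the two string forms never deduplicated, and how a single canonical form fixes it, shown in Python. A plain set stands in for the khash keyed on the full URL string, and canonical_mailto is a hypothetical helper.

```python
def canonical_mailto(text):
    """Normalise a bare address or either mailto form to 'mailto:user@host'."""
    lowered = text.lower()
    if lowered.startswith("mailto://"):
        address = text[len("mailto://"):]
    elif lowered.startswith("mailto:"):
        address = text[len("mailto:"):]
    else:
        address = text
    return "mailto:" + address


# Old behaviour: the bare-email path and the parsed mailto: URL produced
# different strings, so a set keyed on the full string kept both.
old_forms = {"mailto://user@example.com", "mailto:user@example.com"}

# New behaviour: every path yields the same string, so the set collapses.
new_forms = {canonical_mailto(s) for s in
             ("user@example.com",
              "mailto:user@example.com",
              "mailto://user@example.com")}
```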

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 4 Jun 2026 18:06:48 +0000 (19:06 +0100)]

[Fix] url: keep query nesting cap a fixed functional limit

f068a1156 derived RSPAMD_URL_QUERY_MAX_NESTING from the multipattern
scratch budget (MAX_REENTRANCY - 2), which silently bumped the
redirect/wrapper unwrap depth from 5 to 8 and broke the get_html_urls
unit test that pins the cap at 5.

The nesting depth is a functional/product decision, not a function of
the scratch pool size. Restore the fixed cap of 5 and instead assert at
compile time that it stays within the scratch budget (plus the
enclosing scan and the leaf TLD lookup); rspamd_multipattern_lookup()
still degrades gracefully if the bound is ever exceeded.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 4 Jun 2026 15:11:32 +0000 (16:11 +0100)]

[Feature] external_services: per-service _CHECK anchor

Schedule every external service under a stable <RULE>_CHECK callback
symbol so it is a predictable dependency target, regardless of how the
scan result symbols are named. This generalises the pattern vadesecure
and cloudmark already follow (VADE_CHECK, CLOUDMARK_CHECK).

A scanner whose main symbol is already a *_CHECK keeps it as the
callback. Otherwise the callback is named <KEY>_CHECK (from the config
block key -- unique per rule, so instances never collide) and the
scanner's result symbol (e.g. DCC_REJECT, RAZOR) becomes a virtual
child of it: its score and emitted results are unchanged, and an
existing dependency on the old name still resolves (virtual -> parent).

This lets register_dependency('<SERVICE>_CHECK', X) order any external
check after another symbol uniformly, e.g. after a sender-bypass rule.
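
The naming rule above can be modelled roughly as follows. This is a Python sketch, not the plugin's Lua; plan_symbols is a hypothetical helper, and upper-casing the config key is an assumption about how <KEY> is derived:

```python
def plan_symbols(key: str, main_symbol: str):
    """Return (callback_symbol, virtual_children) for one external-service rule."""
    if main_symbol.endswith("_CHECK"):
        # A main symbol that is already a *_CHECK stays the callback.
        return main_symbol, []
    # Otherwise the callback is named after the config block key (assumed to be
    # upper-cased) and the scanner's result symbol becomes its virtual child.
    return f"{key.upper()}_CHECK", [main_symbol]
```

For example, plan_symbols("dcc", "DCC_REJECT") yields DCC_CHECK as the callback with DCC_REJECT as a virtual child, while a rule whose main symbol is VADE_CHECK keeps that name.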

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 4 Jun 2026 12:55:24 +0000 (13:55 +0100)]

[Fix] external_services: honour scanner default symbol name

Commit f654ec35d resolved the rule symbol names up-front as
`opts.symbol or sym:upper()` so the configure()-failed stub could still
register a fail symbol, then assigned them unconditionally - which
discarded the symbol the scanner resolves in configure(). A scanner
such as vadesecure defines `symbol = 'VADE_CHECK'` as its default main
symbol, but a rule keyed `vadesecure { ... }` without an explicit
`symbol =` was then registered as VADESECURE instead of VADE_CHECK.

Besides the surprising rename, this broke dependency registration
against the documented symbol:

  cannot register delayed dependency VADE_CHECK -> X:
    source VADE_CHECK is missing

Restore the pre-regression precedence: prefer the symbol the scanner
resolved (its own default, or the user's `symbol =` applied via
override_defaults) and fall back to the key-derived names only when the
scanner left them unset. The up-front names are still used for the
configure()-failed stub. Applies to every scanner with a default
symbol (vadesecure, dcc, razor, cloudmark, ...).

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 4 Jun 2026 08:55:22 +0000 (09:55 +0100)]

[Fix] multipattern: bound URL query scan reentrancy

A URL whose query embeds a percent-escaped URL is unwrapped
recursively (PR #6066): rspamd_url_find_in_query re-enters
rspamd_multipattern_lookup on the URL trie while the enclosing scan
is still on the stack. Each scan borrows one of MAX_SCRATCH hyperscan
scratch contexts; once the recursion nests deeper than the pool, the
slot loop leaves scr == NULL and g_assert(scr != NULL) aborts the
worker. A crafted message with a few levels of nested query URLs thus
crashes a normal worker (DoS).

The peak number of simultaneously-held scratch contexts on the
deepest chain is RSPAMD_URL_QUERY_MAX_NESTING + 2: one for the
enclosing text/subject scan and one for the per-URL TLD lookup that
rspamd_url_parse runs on each freshly extracted leaf. The old pool of
4 with a nesting cap of 5 needed 7 -> assertion.

- Introduce RSPAMD_MULTIPATTERN_MAX_REENTRANCY (10) and size the
  scratch stack from it; a scratch context is ~2.5-4 KiB, so the
  deeper stack costs only tens of KiB per multipattern.
- Tie RSPAMD_URL_QUERY_MAX_NESTING to that budget (minus the two
  implicit levels) so normal nesting stays on the fast path.
- Make scratch exhaustion non-fatal: allocate a one-off scratch for
  the scan instead of aborting the worker on attacker input.
- Guard the unsigned-int scratch bitmask with a static assert.

Add functional regression test 170_url_query_nesting.
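
A toy model of the pool behaviour, assuming a bitmask of borrowed slots and a one-off fallback as described above. It is written in Python for brevity; ScratchPool and nested_scan are hypothetical names and do not mirror the C implementation:

```python
MAX_REENTRANCY = 10  # pool size quoted in the commit message


class ScratchPool:
    """Toy model of a fixed pool of scratch slots tracked by a bitmask."""

    def __init__(self, size: int = MAX_REENTRANCY) -> None:
        self.size = size
        self.used = 0  # bit i set => slot i is currently borrowed

    def acquire(self):
        """Return ("pooled", slot), or ("one-off", None) when the pool is full."""
        for slot in range(self.size):
            if not self.used & (1 << slot):
                self.used |= 1 << slot
                return ("pooled", slot)
        return ("one-off", None)  # degrade gracefully instead of asserting

    def release(self, handle) -> None:
        kind, slot = handle
        if kind == "pooled":
            self.used &= ~(1 << slot)


def nested_scan(pool: ScratchPool, depth: int) -> list:
    """Hold one scratch per nesting level, as a recursive query unwrap would."""
    handle = pool.acquire()
    kinds = [handle[0]]
    if depth > 1:
        kinds += nested_scan(pool, depth - 1)
    pool.release(handle)
    return kinds
```

With a pool of 4 and a chain of 7 simultaneous scans (nesting cap 5 plus the two implicit levels), three levels fall through to the one-off path; with the default pool of 10 all seven stay pooled.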

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 4 Jun 2026 07:53:23 +0000 (08:53 +0100)]

[Fix] composites: avoid over-eager second-pass deferral

The two-phase composite evaluation classified a composite as second-pass
if any referenced symbol carried SYMBOL_TYPE_NOSTAT. NOSTAT is auto-set
on nearly every virtual/callback symbol (regexp rules, multimap, rbl,
...), so most composites that depend only on ordinary filter rules were
wrongly deferred to the COMPOSITES_POST stage, which runs after
post-filters. As a result task:get_groups()/get_symbols() called from a
postfilter no longer saw those composite symbols and their groups
(regression vs 3.5); they only became visible from idempotent rules.
Classifiers run before composites, so keying off SYMBOL_TYPE_CLASSIFIER
was also unnecessary.

Only postfilter-stage symbols are genuinely unavailable during the first
composites pass. Add rspamd_symcache_get_symbol_stage(), which resolves
virtual symbols to their parent and returns the processing stage as a
SYMBOL_TYPE_* bit, and defer a composite only when a dependency resolves
to the postfilter stage. NEURAL_SPAM (a virtual child of the
NEURAL_CHECK postfilter) still resolves to POSTFILTER, so the original
#5674 fix keeps working.

Add a functional regression test covering a composite that depends only
on a filter-stage NOSTAT symbol and must be visible from a postfilter.

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 3 Jun 2026 19:58:28 +0000 (20:58 +0100)]

Merge pull request #6076 from rspamd/vstakhov-ratelimit-multi-bucket

[Fix] ratelimit: Track all buckets in selector rules

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 3 Jun 2026 09:40:48 +0000 (10:40 +0100)]

[Fix] ratelimit: Track all buckets in selector rules

When a rule defines several buckets (e.g. "200 / 1h" plus "30 / 1m")
and uses a selector, limit_to_prefixes keyed the prefixes table by the
selector value alone. Every bucket therefore mapped to the same Redis
key and only the last bucket in the array was ever tracked, so the
other limits were silently ignored.

Give each bucket a distinct key by prefixing the selector value with a
per-bucket id (burst + rate), mirroring the burst component that
gen_rate_key already prepends for the non-selector path. The
non-selector path is unchanged, so its existing Redis keys are kept.

Adds a functional regression test (two buckets, burst 2 + burst 20)
that fails before the fix because the restrictive bucket is ignored.
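
The collision can be reproduced with a small model. This is a Python sketch rather than the plugin's Lua; prefixes_old, prefixes_new, the bucket dictionaries and the "burst:rate" id format are all hypothetical stand-ins for limit_to_prefixes and its key layout:

```python
def prefixes_old(selector_value, buckets):
    """Pre-fix behaviour as described: keyed by the selector value alone."""
    table = {}
    for bucket in buckets:
        table[selector_value] = bucket  # later buckets overwrite earlier ones
    return table


def prefixes_new(selector_value, buckets):
    """Post-fix idea: prepend a per-bucket id built from burst and rate."""
    table = {}
    for bucket in buckets:
        bucket_id = f"{bucket['burst']}:{bucket['rate']}"  # assumed id format
        table[f"{bucket_id}:{selector_value}"] = bucket
    return table


# Two illustrative buckets on one rule; the values are made up for the sketch.
buckets = [
    {"burst": 20, "rate": "200 / 1h"},
    {"burst": 2, "rate": "30 / 1m"},
]
```

Under the old keying only the last bucket survives in the table, which matches the "only the last bucket in the array was ever tracked" symptom; the per-bucket id keeps both.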

Closes #6059

commit | commitdiff | tree

Alexandra Parker [Sun, 31 May 2026 20:52:35 +0000 (13:52 -0700)]

[Fix] mime_string const iterator

mime_string's iterator is a value iterator and therefore intrinsically
const. Drop the reference to const_iterator, let iterator_base take a
const pointer, and mark begin() and end() as const.

doctest 2.5.0 receives a const reference and cannot use a mutable
iterator, which leads to a compile error.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 2 Jun 2026 20:07:26 +0000 (21:07 +0100)]

Merge pull request #6074 from rspamd/vstakhov-checkv3-custom-metadata

[Feature] protocol: Expose custom metadata for /checkv3

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 2 Jun 2026 18:04:34 +0000 (19:04 +0100)]

[Feature] protocol: Expose custom metadata for /checkv3

Add two complementary ways to read custom fields sent with a /checkv3
multipart scan request, both free of the 80KB HTTP header limit that v2
hits, since the metadata travels in the multipart body:

  * A "headers" sub-object in the metadata part is injected into the
    task request headers, so task:get_request_header() works for custom
    fields exactly like v2 HTTP request headers. Reserved control-header
    names (shm/file/path/dictionary/Content-Encoding...) are skipped so
    client metadata cannot collide with the message-loading channel, and
    a repeated name (collapsed by UCL into an array) expands to a
    multi-valued request header.

  * The parsed metadata object is kept on task->meta and exposed to Lua
    via task:get_metadata() and task:get_metadata_field(key), mirroring
    get_settings()/lookup_settings(). The task now owns the object and
    frees it once in rspamd_task_free instead of via a pool destructor.

rspamc gains a repeatable --metadata-header KEY=VALUE option that builds
the metadata "headers" sub-object for v3 requests. Also drop a dead
is_msgpack variable in the v3 request handler.

Tests: functional cases in 430_checkv3.robot plus a checkv3_meta.lua
plugin exercising both options via raw multipart and rspamc.
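
The "headers" injection rule can be sketched as below. This is a Python illustration, not the C request handler; inject_headers is a hypothetical helper, the reserved set lists only the names quoted above (the real list is longer), and case-insensitive matching is an assumption:

```python
# Partial list: only the reserved names spelled out in the commit message.
RESERVED = {"shm", "file", "path", "dictionary", "content-encoding"}


def inject_headers(metadata: dict) -> list:
    """Flatten metadata["headers"] into (name, value) request-header pairs."""
    pairs = []
    for name, value in metadata.get("headers", {}).items():
        if name.lower() in RESERVED:
            continue  # client metadata must not shadow control headers
        # A repeated name arrives as an array and expands to several headers.
        values = value if isinstance(value, list) else [value]
        pairs.extend((name, str(v)) for v in values)
    return pairs
```

A metadata part such as {"headers": {"X-Tenant": "a", "X-Tag": ["1", "2"]}} would then produce one X-Tenant header and two X-Tag headers.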

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 18:47:42 +0000 (19:47 +0100)]

Merge pull request #6068 from moisseev/upstream

[Minor] upstream: improve cooldown log message clarity

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 18:46:28 +0000 (19:46 +0100)]

Merge pull request #6071 from rspamd/vstakhov-functional-dummy-readiness

[Test] functional: fix dummy-helper start/scan race and parallel port collisions

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 18:00:42 +0000 (19:00 +0100)]

[Fix] functional: move test server ports below the ephemeral range

The real root cause of the 440_ssl_server flake (and the family of
intermittent "bind 98 / Address already in use" failures): the test
server ports sat INSIDE Linux's default ephemeral range
(net.ipv4.ip_local_port_range = 32768..60999). Bases were 56379 (redis),
56380 (nginx) and 567xx (rspamd normal/controller/proxy/fuzzy + the two
TLS listeners), all squarely in that window.

So any outbound client socket in the test environment -- a redis client,
monitored URIBL DNS lookups, an upstream connection, a dummy-helper
connection -- could be handed one of those numbers by the kernel as its
EPHEMERAL SOURCE PORT on connect(). When rspamd later tried to bind() a
LISTENER on that exact port it got EADDRINUSE. rspamd sets SO_REUSEADDR,
which does nothing against a live socket already bound by another
process. The controller's SSL socket is the LAST of its five ports to
bind -- by then the controller has already opened many client sockets --
so it lost this race most often and surfaced as "SSL controller never
came up" -> HTTPS connection-refused for the whole retry budget. It was
probabilistic (depends which ephemeral ports were in use at bind time),
hence flaky and distro-dependent.

Move the whole rspamd/redis/nginx block down by 31000 (e.g. normal
56789 -> 25789, controller-SSL 56796 -> 25796, redis 56379 -> 25379,
nginx 56380 -> 25380). This preserves every relative offset, so the
carefully spaced, collision-free per-worker layout (base + slot*100) is
unchanged: across 64 worker slots the dummy_* helpers stay <= 24383,
this block spans 25379..32097, and the ephemeral floor 32768 is never
reached. Verified by importing vars.py for slots 0 and 63 (max port
32097 < 32768, zero cross-family collisions) and a serial 001_merged run
(all six 440_ssl_server tests pass on the relocated ports).

Also bump the two cosmetic fallbacks that mirrored the old bases:
test_redis_client.lua's getenv default and a port_is_free docstring.
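
The arithmetic can be checked with a short script. This is a sketch, not vars.py itself: the BASES keys are illustrative, and only the four relocated bases quoted above are included, so the computed maximum covers that subset rather than the full port table:

```python
EPHEMERAL_FLOOR = 32768  # lower bound of Linux's default ip_local_port_range
SLOT_STRIDE = 100        # base + slot*100, per the commit message
SLOTS = 64

# Relocated bases quoted in the commit message; the dict keys are illustrative.
BASES = {
    "redis": 25379,
    "nginx": 25380,
    "normal": 25789,
    "controller_ssl": 25796,
}


def port_for(name: str, slot: int) -> int:
    """Port of one service for a given pabot worker slot."""
    return BASES[name] + slot * SLOT_STRIDE


all_ports = [port_for(n, s) for n in BASES for s in range(SLOTS)]
highest = max(all_ports)
collisions = len(all_ports) - len(set(all_ports))
```

For these four bases the highest port is 32096 at slot 63, which stays under the 32768 floor, and no two (service, slot) pairs share a port.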

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 17:10:53 +0000 (18:10 +0100)]

[Minor] ci: dedupe concurrent push + pull_request runs

A commit on a branch with an open PR triggers two full ci runs at once:
one for the push event (ref refs/heads/<branch>) and one for the
pull_request event (ref refs/pull/<n>/merge). Besides wasting runner
time they share GitHub's hosted runners and double the CPU load, which
is enough to push the heavy 001_merged rspamd's controller startup past
the functional suites' fixed readiness timeouts -- the residual
440_ssl_server flake reproduced only on whichever of the two same-SHA
runs lost the CPU race (the other passed).

Add a top-level concurrency group keyed on the head commit SHA with
cancel-in-progress. push and pull_request expose the head differently
(github.sha vs github.event.pull_request.head.sha -- the latter is the
real head on PR events, where github.sha is the merge commit), so the
group key uses pull_request.head.sha when present and falls back to
github.sha, collapsing both events for one commit into a single run.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 16:32:43 +0000 (17:32 +0100)]

[Test] functional: also wait for SSL/proxy ports in teardown

The teardown port-release wait added in the previous commit only covered
the normal + controller plain ports. The 440_ssl_server flake is the
same race on a port it missed: a test rspamd binds up to five sockets
(normal, controller, proxy, controller-SSL, normal-SSL), and a previous
suite's controller-SSL listener could still hold its port when the next
rspamd on that pabot worker started. The CI log shows it exactly:

  rspamd_fork_worker: prepare to fork process controller (0);
    listen on: 127.0.0.1:57190
  rspamd_inet_address_listen: bind 127.0.0.1:57196 failed: 98,
    'Address already in use'
  spawn_workers: cannot listen on normal socket 127.0.0.1:57196

57196 is PORT_CONTROLLER_SSL for that worker slot. main carried on and
forked the controller with only its plain socket, so the SSL listener
never came up and every HTTPS test hit connection-refused for the full
retry budget -- the "slow SSL controller" the two prior band-aids tried
to wait out.

Extend Wait For Rspamd Ports Released to loop over all five ports. All
RSPAMD_PORT_* vars are always defined in vars.py, and a port the current
config never bound refuses connection immediately, so Port Is Free
passes at once -- waiting on an unused port is a cheap no-op. Verified
001_merged (which owns 440_ssl_server) still passes serially with the
SSL ports now checked in teardown; the SSL bind race is timing-dependent
under CI contention, so the fedora/ubuntu runs are the real check.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 14:30:16 +0000 (15:30 +0100)]

[Test] functional: wait for rspamd ports to free in teardown

Under pabot each worker runs many suites sequentially on the SAME port
range (base + worker_index*100). Rspamd Teardown did Terminate Process +
Wait For Process, but that only reaps the MAIN rspamd; the listening
sockets are shared with forked workers and can linger a beat after main
exits. The next suite's rspamd on that worker then races them and dies:

  rspamd_inet_address_listen: bind 127.0.0.1:57090 failed: 98,
    'Address already in use'
  spawn_workers: cannot listen on normal socket 127.0.0.1:57090
  Process Is Gone (rc=1, port=57089)

which cascades the whole shared-rspamd suite (e.g. 001_merged -> 250+
failures) or single suites like 440_ssl_server. rspamd sets SO_REUSEADDR
before bind, so this is NOT TIME_WAIT -- it is a still-LISTENing socket
from a not-yet-fully-gone worker.

Add port_is_free() (rspamd.py) and a Wait For Rspamd Ports Released
keyword, called from Rspamd Teardown after Wait For Process: block (up to
~6s, warn-not-fail) until the normal + controller ports actually refuse
connections before releasing the suite. Closes the handoff race window.
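
One way such a check could look, as a minimal sketch: it assumes "free" means a TCP connect is refused, and the function bodies, defaults and the wait_ports_released name are hypothetical rather than the code added to rspamd.py:

```python
import socket
import time


def port_is_free(port: int, host: str = "127.0.0.1", timeout: float = 0.2) -> bool:
    """Treat a port as free when a TCP connect to it fails."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # something accepted the connection: still listening
    except OSError:
        return True


def wait_ports_released(ports, budget: float = 6.0, step: float = 0.1) -> list:
    """Poll until every port refuses connections; return the ports still busy."""
    deadline = time.monotonic() + budget
    busy = [p for p in ports if not port_is_free(p)]
    while busy and time.monotonic() < deadline:
        time.sleep(step)
        busy = [p for p in busy if not port_is_free(p)]
    return busy  # the caller can warn rather than fail
```

Returning the list of still-busy ports instead of raising keeps the warn-not-fail behaviour in the caller's hands.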

This is a pre-existing flake (same bind-98 signature on master, e.g.
fedora job for #6067 with :56990), independent of the dummy-port
templating in this branch; both CI runs of this PR hit it in different
suites, the tell-tale of nondeterministic infra flake.

Verified: the keyword runs on every teardown (357 invocations / 714 port
checks in a 4-worker pabot run) and port_is_free correctly passes on a
free port and blocks on a live listener; no regression in serial or
parallel runs. The race itself is timing-dependent and reproduces under
CI container contention rather than locally, so CI is the real check.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 1 Jun 2026 12:46:04 +0000 (13:46 +0100)]

[Test] functional: template dummy ports for parallel safety

The dummy_* helper ports already get a per-pabot-worker offset in
lib/vars.py (base + worker_index*100), but consumers hardcoded the
worker-0 literals (:18080/:18081/:18083), so under parallel pabot a
worker bound its dummy on an offset port while tests/configs still
pointed at :18080. That produced two failure modes: Errno 48 "address
already in use" when two workers raced the same literal port, and
cross-worker URL mismatches (worker 3's redirector fetching worker 0's
dummy, assertions expecting :18080 that never appeared).

Route every consumer through the existing per-worker value:

  * Lua test scripts (http/tcp/http_early_response): read the port from
    rspamd_env.PORT_DUMMY_HTTP/HTTPS/HTTP_EARLY (rspamd strips the
    RSPAMD_ prefix when building the Lua env table), defaulting to the
    historical literal for ad-hoc runs. Mirrors the existing maps_kv.lua
    pattern.
  * neural_llm.conf: Jinja {= env.PORT_DUMMY_HTTP =}, like the other
    templated configs.
  * test_tcp_client.lua (rspamadm): os.getenv fallback chain (it runs
    under `rspamadm lua`, not the config loader).

The url_redirector .eml fixtures embed the dummy URL but are fed raw to
the scanner -- the config-time Jinja engine does not touch them. Add a
Render Message Template keyword (Get File -> Replace Variables ->
Create File in the suite tmpdir) and have suites 162-169 render their
fixtures in setup, with ${RSPAMD_PORT_DUMMY_HTTP} placeholders in the
fixtures and assertions. Normalise redir_chain_tel_url.eml to LF while
touching it.

Verified: serial runs unchanged (worker-0 keeps the historical ports),
and 4x parallel pabot stress over the url_redirector + http/tcp/early +
antivirus/udp/p0f/settings/llm suites is stable at 142/143 with zero
Errno 48 / address-in-use and no cross-worker mismatches. The lone
remaining failure (169 path-less ?u= wrapper) is a pre-existing
redirector behaviour bug -- it fails identically on master.

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 31 May 2026 19:48:45 +0000 (20:48 +0100)]

[Test] functional: centralize dummy helper readiness barrier

Fix a start/scan race in the functional suite: dummy_* mock services
were started and then connected to (by rspamd or the test) before they
were listening. Under parallel pabot the short 2s PID waits timed out
under CPU contention, one-shot helpers (clam/fprot/avast/p0f) left stale
PID files so a same-port restart satisfied Wait Until Created instantly
and raced the new bind, and p0f derived its PID path inconsistently
between helper and suite.

Every dummy_* helper already writes its PID only after server_bind/
server_activate, so PID-existence is a valid "listening" signal. This
routes all helper startup through one barrier:

  * Start Dummy Service (lib/rspamd.robot): drop stale PID, start the
    helper, block until the PID file appears (5s). Single source of
    truth for startup ordering.
  * Wait Until Dummy Listening: active TCP-connect probe layered on top
    for loop servers (http/https/ssl) only; not used for one-shot or
    single-threaded smtp helpers, where a probe would consume the one
    session the test needs.

Rewrite Run Dummy Http/Https/Llm/Http Early/Ssl/Udp/Clam/Fprot/Avast/p0f
and the 168/169 SMTP suites to go through it; move SMTP temp files from
/tmp to the per-worker RSPAMD_TMP_PREFIX; teach dummy_p0f.py to accept an
explicit PID path.

Add util/check_no_bare_dummy_start.py, run as a run-parallel.sh
preflight, which fails if a suite reintroduces a bare
Start Process ... dummy_*.py instead of using the barrier.

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 31 May 2026 13:45:56 +0000 (14:45 +0100)]

[Fix] archives: bounds guards for RAR/ZIP/7-zip parsing

Several archive parsers could read slightly past the input buffer on
crafted (attacker-controlled) attachments:

- RAR v4 file header: fname_len was validated against the remaining
  buffer, but p then advanced past the attrs and optional
  HIGH_PACK_SIZE/HIGH_UNP_SIZE fields (4-12 bytes) before the filename
  was read, allowing an over-read of up to 12 bytes. Re-validate
  fname_len at the point of use.

- 7-zip: rspamd_7zip_read_next_section, _read_digest and _read_bits
  dereferenced *p before any bounds check; a section/type byte landing
  on the last byte of the buffer (e.g. a trailing kCRC or kHeader) led
  to a one-byte over-read. Guard p < end before the dereference.
  rspamd_7zip_read_archive_props guarded only p != NULL; also require
  p < end.

- ZIP central-directory extra-field loop advanced p by an
  attacker-controlled hlen without checking it against the remaining
  extra-field length, producing a past-the-end pointer. Clamp the
  advance and stop on a truncated field.

All reads, no writes; impact is a potential crash on malformed input.
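
The ZIP extra-field guard follows a common pattern, sketched here in Python rather than the parser's C. ZIP extra-field records are a 2-byte header ID followed by a 2-byte little-endian data size; walk_extra_fields is a hypothetical name, and stopping at the first truncated record is one reading of "clamp the advance and stop":

```python
import struct


def walk_extra_fields(extra: bytes):
    """Yield (header_id, data) pairs, stopping at the first truncated field."""
    pos, end = 0, len(extra)
    while end - pos >= 4:  # need a full id + length header
        header_id, hlen = struct.unpack_from("<HH", extra, pos)
        pos += 4
        if hlen > end - pos:  # declared length runs past the buffer: stop
            break
        yield header_id, extra[pos:pos + hlen]
        pos += hlen
```

The length is compared against the bytes that remain, not against the whole buffer, so the cursor never moves past the end even when hlen is attacker-controlled.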

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 30 May 2026 12:21:37 +0000 (13:21 +0100)]

[Fix] mime_parser: bound S/MIME recursion depth

Nested S/MIME structures re-entered the parser through
rspamd_mime_parse_normal_part -> rspamd_mime_process_multipart_node ->
rspamd_mime_parse_normal_part without passing through the
multipart/message nesting checks, so st->nesting was never incremented
on that path. application/pkcs7-mime only sets the SMIME content-type
flag (not MESSAGE/MULTIPART), so such parts take the normal-part branch.
A crafted message with deeply nested application/pkcs7-mime layers could
therefore recurse to a depth bounded only by message size rather than by
max_nested, exhausting the worker stack (DoS) and accumulating the
CMS/PKCS7/BIO objects of every level simultaneously.

Account for the S/MIME re-entry against max_nested and free the
CMS/PKCS7/BIO objects on the new error path; the nesting cap also bounds
the peak memory held during unwinding.

Two related defensive guards:
- rspamd_mime_preprocess_message now looks back one byte before the body
  only when that stays within the buffer, avoiding a potential 1-byte
  out-of-bounds read when raw_data.begin == st->start.
- guard the boundary-stack pop in rspamd_mime_parse_multipart_part with
  len > 0, mirroring the guarded pop in rspamd_mime_parse_message.

commit | commitdiff | tree

Alexander Moisseev [Fri, 29 May 2026 13:43:56 +0000 (16:43 +0300)]

[Minor] upstream: improve cooldown log message clarity

When elapsed time rounded to the same value as the minimum interval,
the log showed "checked 60 seconds ago (60 is minimum)", suggesting
the check was skipped at equality despite the strict < comparison.
Replace with remaining cooldown time using ceil() to avoid ambiguity.
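
The idea in miniature, as a Python sketch: cooldown_message and its wording are hypothetical, and only the strict comparison and the use of ceil() come from the commit message:

```python
import math


def cooldown_message(elapsed: float, minimum: float):
    """Return a log line while still cooling down, or None when a check may run."""
    if elapsed < minimum:  # strict comparison, as in the commit message
        remaining = math.ceil(minimum - elapsed)
        return f"cooldown: {remaining} seconds remaining"  # hypothetical wording
    return None
```

Rounding the remainder up means a message is only ever printed with a positive number of seconds, so it can no longer read as if the check was skipped at equality.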

commit | commitdiff | tree

Dmytro Alieksieiev [Fri, 29 May 2026 11:28:12 +0000 (13:28 +0200)]

Merge branch 'master' into fix/url-suspect-oneshot

commit | commitdiff | tree

Dmytro Alieksieiev [Fri, 29 May 2026 10:30:33 +0000 (12:30 +0200)]

[Feature] mx_check: three-layer cache rewrite (#6055)

* [Feature] mx_check: three-layer cache rewrite

This is the comprehensive implementation behind issue #6032. The previous
single-layer cache is replaced by a three-layer Redis design
(d:<domain> / m:<mxhost> / i:<ip>) under <key_prefix>:. Short-code wire
formats minimise the Redis footprint; per-layer validators
(is_valid_cache_value) treat unrecognised entries as a cache miss, and
the resolve / probe path that follows then issues a fresh cache_set at
the same key, overwriting the bad entry in place.

Probe coordination

- SET NX EX claims the i:<ip> probe lock; a post-claim GET disambiguates
  held lock, already-published verdict, and corrupted-value-needing-heal
  cases. A separate force_claim_probe_lock path overwrites corrupted
  values to break the SET NX loop without leaking refcounts.
- Redis errors during the lock claim surface as MX_REDIS_ERROR; a lock held
  by another worker surfaces as MX_INFLIGHT and skips duplicate TCP
  connections, which under high load would look like DoS activity from
  the target's side and would most likely harm the IP/ASN/Org reputation
  of the Rspamd user.

DNS / probe model

- Dual-stack via probe_ipv4 / probe_ipv6 / prefer_ipv6 with family-tagged
  cache values (v4: / v6: / v64:) and coverage checks so flipping the
  probe-family set re-resolves only as needed.
- Real DNS path failures (SERVFAIL / REFUSED / timeout) are distinguished
  from authoritative NXDOMAIN / NOREC via is_dns_real_failure; the former
  surface as MX_DNS_FAIL (cached as 'df') so a recovered resolver path
  can be re-tried promptly. NXDOMAIN/NOREC collapse into MX_NONE.
- step3 partitions resolved IPs into PUBLIC / LOCAL (RFC1918 / CGNAT /
  ULA) / BOGON (loopback, TEST-NET, multicast, link-local, etc.). Only
  PUBLIC IPs reach the TCP probe. MX_LOCAL_ONLY / MX_LOCAL_MIX /
  MX_BOGON_ONLY / MX_BOGON_MIX fire with the offending IPs as options.
  test_mode lifts loopback out of the bogon set so the probe path can be
  exercised against 127.0.0.1.
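
The partitioning step might be modelled like this. It is a Python sketch built on the standard ipaddress module; the network lists cover only the classes named above (the "etc." is not expanded), classify is a hypothetical name, and treating loopback as PUBLIC under test_mode is an assumption about how the probe path is reached:

```python
import ipaddress

LOCAL_NETS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",      # RFC 1918
    "100.64.0.0/10",                                       # CGNAT
    "fc00::/7",                                            # ULA
)]
BOGON_NETS = [ipaddress.ip_network(n) for n in (
    "127.0.0.0/8", "::1/128",                              # loopback
    "192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24",   # TEST-NET-1/2/3
    "224.0.0.0/4", "ff00::/8",                             # multicast
    "169.254.0.0/16", "fe80::/10",                         # link-local
)]


def classify(ip: str, test_mode: bool = False) -> str:
    """Return PUBLIC, LOCAL or BOGON for one resolved address."""
    addr = ipaddress.ip_address(ip)
    if test_mode and addr.is_loopback:
        return "PUBLIC"  # assumed: test_mode lets loopback reach the probe
    if any(addr in net for net in LOCAL_NETS):
        return "LOCAL"
    if any(addr in net for net in BOGON_NETS):
        return "BOGON"
    return "PUBLIC"
```

Only addresses classified PUBLIC would go on to the TCP probe; the other two classes map to the MX_LOCAL_* and MX_BOGON_* symbols.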

Symbol surface

- Multi-source: check_from / check_mime_from / check_reply_to, deduplicated
  with envelope > reply-to > mime-from priority when the same domain
  reaches the MX check from several sources. Per-source prefixes
  (symbol_prefix_from / symbol_prefix_mime_from / symbol_prefix_reply_to)
  fan every MX_* symbol across the three sources at registration time.
- A-fallback path (no MX RR, A used as implicit MX per RFC 5321 §5.1)
  has its own MX_A_* symbol family so operators can score it
  independently of the MX-RR path.
- Per-outcome greylist and reject gates (greylist_invalid /
  greylist_none / greylist_broken / ..., reject_null_mx with
  reject_authorized / reject_local kill switches); null-MX domains can
  now trigger a real set_pre_result. reject_nxdomain_mx is removed as a
  poor option to offer: in practice an NXDOMAIN reject would only be
  sound on eTLD+1.
- Probe-outcome symbols (MX_GOOD / MX_TIMEOUT_* / MX_REFUSED /
  MX_INVALID / MX_ERROR / MX_INFLIGHT) populate the option field with
  the MX hostname; IP-class symbols still carry IPs, since the address is
  the relevant detail there. MX_REDIS_ERROR has no option (it's a
  module-internal signal).
- New punishment maps: bad_mxs (glob on MX hostnames) and bad_ips
  (radix on resolved IPs). Any hit short-circuits with MX_BAD /
  MX_IP_BAD before any TCP probe runs, which makes it possible to punish
  domains that share the same MX infrastructure.

Scoring

- set_metric_all_sources ships sensible defaults for every symbol.
  Operators can tune any weight through the new "mx" group in
  conf/groups.conf via local.d/mx_group.conf or override.d/
  mx_group.conf without touching the module.

Functional tests

- 167_mx_check.robot refreshed for the new symbol set; MX_NONE replaces
  MX_NXDOMAIN/MX_MISSING, MX_A_REFUSED covers the closed-port
  A-fallback case, and MX_BAD / MX_IP_BAD have dedicated assertions.
- 168_mx_check_greeting.robot covers verify_greeting=true /
  send_quit=false: silent listener -> MX_TIMEOUT_READ; continuation
  220- with no follow-up held past read_timeout -> MX_GOOD (a
  regression that re-queued reads under send_quit=false would surface
  as MX_TIMEOUT_READ); 5xx greeting -> MX_ERROR; non-SMTP line ->
  MX_INVALID.
- 169_mx_check_greeting_quit.robot covers verify_greeting=true /
  send_quit=true: proper multi-line timing -> MX_GOOD plus dummy
  status file QUIT_AFTER_FINAL (catches a regression where QUIT is
  sent before the final 220 line, which rspamd's verdict alone cannot
  detect); slow second line -> MX_TIMEOUT_READ.
- util/dummy_smtp.py mock with silent / error / messy / greeting_single
  / greeting_multi modes and a --status-file argument for out-of-band
  timing verification.

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Feature] mx_check: optional per-entry weight multiplier for bad_mxs / bad_ips

Both bad_mxs (glob) and bad_ips (radix) entries can now carry an optional numeric second token that is read as a weight multiplier on top of the MX_BAD / MX_IP_BAD group score. Examples: `trapmx.example.com 3` triples the weight; `1.2.3.4 0.5` halves it. Default multiplier is 1.0 (no value or non-numeric value). Lets operators tier confidence within a single map without maintaining several.
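
How the optional second token might be read, as a Python sketch: parse_entry and parse_map_value are hypothetical helpers, and splitting on the first space is an assumption about the map line format:

```python
def parse_map_value(value: str) -> float:
    """Return the weight multiplier for a map entry's optional second token."""
    tokens = value.split()
    if not tokens:
        return 1.0  # no value: default multiplier
    try:
        return float(tokens[0])
    except ValueError:
        return 1.0  # non-numeric value: default multiplier


def parse_entry(line: str):
    """Split one map line into (key, multiplier)."""
    key, _, rest = line.strip().partition(" ")
    return key, parse_map_value(rest)
```

With this reading, "trapmx.example.com 3" gives a multiplier of 3.0, "1.2.3.4 0.5" gives 0.5, and an entry with no value or a non-numeric value falls back to 1.0.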

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] Use static parent callback in mx_check module

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] Add missing executable flag on dummy_smtp python script

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Chore] Add group to parent mx_check symbol

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] change rspamd_config:add_map to lua_maps so inline maps work too; adjust autotests so they survive parallelism

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
---------

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
Co-authored-by: Vsevolod Stakhov <vsevolod@rspamd.com>

commit | commitdiff | tree

Dmytro Alieksieiev [Fri, 29 May 2026 10:30:13 +0000 (12:30 +0200)]

Merge pull request #6066 from dragoangel/fix/properly-handle-redirects

[Fix] Handle query-embedded URL targets in wrappers and redirectors

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 29 May 2026 09:06:22 +0000 (10:06 +0100)]

Merge pull request #6067 from rspamd/vstakhov-env-baseline-templating

[Feature] Env-overridable baseline config and fasttext model auto-load

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 29 May 2026 08:05:46 +0000 (09:05 +0100)]

[Feature] Auto-load shipped fasttext model when present

When no fasttext_model is configured, fall back to the model shipped at
$SHAREDIR/languages/fasttext_model.ftz: if the file is readable, load
it via the existing direct-load path; otherwise stay silent (debug
only) so stock installs without the model behave exactly as before.

This lets images that ship the model file drop the explicit
fasttext_model config override. The success path reuses
load_model_direct (the same code used for an explicit fasttext_model),
and the absent-file case produces no error and leaves the detector
reporting 'fasttext model is not loaded' as before.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 29 May 2026 07:43:53 +0000 (08:43 +0100)]

[Feature] Make pidfile env-overridable, empty disables it

Template the baseline pidfile so deployments can relocate or disable it
without patching conf/rspamd.conf:

  pidfile = "{= env.PIDFILE|default('$RUNDIR/rspamd.pid') =}";

With no RSPAMD_PIDFILE set it renders to the previous default
($RUNDIR/rspamd.pid). An empty RSPAMD_PIDFILE renders an empty string,
which now means "do not write a pidfile" -- useful when running as PID 1
in a container. Extend the existing cfg->pid_file == NULL guards in both
rspamd_write_pid() and main() to also treat an empty string as unset, so
the existing "pid file is not specified" path is taken.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 29 May 2026 07:36:12 +0000 (08:36 +0100)]

[Conf] Make logging type and filename env-overridable

Template the baseline logging block so deployments can switch logging
without patching conf/rspamd.conf:

  type = "{= env.LOG_TYPE|default('file') =}";
  filename = "{= env.LOG_FILE|default('$LOGDIR/rspamd.log') =}";

With no RSPAMD_LOG_TYPE/RSPAMD_LOG_FILE set the values render to the
previous hardcoded defaults (file, $LOGDIR/rspamd.log), so stock
installs are unchanged. A container can now set RSPAMD_LOG_TYPE=console
to log to stdout. Mirrors the env-template style introduced for the
worker bind_socket lines.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 28 May 2026 08:29:47 +0000 (09:29 +0100)]

Merge pull request #6064 from rspamd/vstakhov-dynamic-composites

[Feature] Dynamic composites: hot-reloadable composites map

commit | commitdiff | tree

Dmitriy Alekseev [Wed, 27 May 2026 12:40:10 +0000 (14:40 +0200)]

[Fix] Do not multiply URL multiple AT signs and backslash in URLs

Long genuine conversations can accumulate many such validly used links, and
these symbols alone would then get the emails dropped and, if autotrain is
enabled, train the system badly.

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>

commit | commitdiff | tree

Alexander Moisseev [Wed, 27 May 2026 08:09:39 +0000 (11:09 +0300)]

[Feature] Add fixed-point formatting to fpconv (#6061)

* [Feature] Add fixed-point formatting to fpconv

- Add FPCONV_PRECISION_ALL sentinel for trim-trailing-zeros mode
  with compile-time guard (static_assert > 17 significant digits)
- Implement %.Nf rounding with carry (round_at, trim_trailing_zeros)
- Fix %.0f carry detection for numbers like 9.9 -> 10
- %f/%F/%g/%G use FPCONV_PRECISION_ALL instead of hardcoded literals
- Add C++ unit tests for fpconv precision and rounding

* [Fix] Fix carry overflow from fractional rounding in fpconv

- Add round_at_ex with carry_overflow flag to detect full carry
  that shifts digits and prepends '1'
- Fix offset<=0 branch (0.xxx): carry now correctly produces
  "1.0" instead of "0.1" (e.g. 0.96 → "1.0")
- Fix offset>0 branch (1.xxx-9.xxx): round_at called before
  copying to dest so integer digits are always fresh; carry
  correctly expands integer part (e.g. 9.96 → "10.0")

* [Fix] Fix wrong digits array index in fpconv offset<=0 rounding

Leading zeros are written by memset to dest, not stored in the
digits array. The rounding path incorrectly used orig_offset as
an index into digits for both round_at_ex position and memcpy
source, causing wrong output (e.g. 0.0123 → "0.02" instead of
"0.01") and potential out-of-bounds reads when ndigits < orig_offset

* [Rework] Extract fpconv fixed-point formatting into a separate shim layer

* [Fix] Fix rounding in fpconv_format emit_fixed_digits

Defect 1: Change >= to > when comparing leading zeros count with
precision, so that values like 0.005 with %.2f correctly round to
"0.01" instead of "0.00".

Defect 2: When carry occurs within the fractional part (e.g. 0.0999
with %.2f), emit "0.10" instead of incorrectly outputting "1.00".
Carry now distinguishes between crossing the integer boundary and
propagating within the fraction.

Also handle the case where precision equals the leading zeros count:
check the first significant digit directly for rounding instead of
calling round_at_ex with precision=0.
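
The expected outputs listed in these fixes can be cross-checked with a small integer-arithmetic reference model. This is only a sketch in Python, not the digit-buffer code in fpconv: `format_fixed` is a hypothetical helper, it rounds half up on the decimal digits, and it assumes the value is given as a digit string plus the number of digits before the decimal point (`offset`, which may be zero or negative for values below 1).

```python
def format_fixed(digits, offset, precision):
    """Format int(digits) * 10**(offset - len(digits)) with `precision`
    fractional digits, rounding half up on the decimal representation."""
    n = int(digits)
    # Power of ten that turns the value into an integer count of
    # 10**-precision units.
    shift = offset - len(digits) + precision
    if shift >= 0:
        scaled = n * 10 ** shift
    else:
        divisor = 10 ** (-shift)
        quotient, remainder = divmod(n, divisor)
        # A carry here may ripple into the integer part (9.96 -> 10.0)
        # or stay inside the fraction (0.0999 -> 0.10).
        scaled = quotient + (1 if 2 * remainder >= divisor else 0)
    text = str(scaled).rjust(precision + 1, "0")
    if precision == 0:
        return text
    return text[:-precision] + "." + text[-precision:]
```

For example, `format_fixed("996", 1, 1)` models 9.96 with %.1f and `format_fixed("5", -2, 2)` models 0.005 with %.2f.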

* [Refactor] Move fpconv_format shim from contrib/ to src/libutil/

---------

Co-authored-by: Vsevolod Stakhov <vsevolod@rspamd.com>

Vsevolod Stakhov [Tue, 26 May 2026 20:57:47 +0000 (21:57 +0100)]

[Test] composites: functional test for dynamic UCL composites map

Exercises load -> reload-with-update -> reload-with-stub:
1. INITIAL MAP - DYN_ONE FIRES: load composites from map.1, scan a
    message, confirm DYN_ONE and DYN_TWO fire with their declared
    scores. Static composite STATIC_COMP also fires alongside.
2. RELOAD - UPDATED SCORES AND NEW NAME: swap to map.2 (DYN_ONE
    score updated, DYN_TWO removed, DYN_THREE introduced), wait for
    the map watcher, scan, confirm new scores + new composite +
    DYN_TWO gone (stubbed).
3. RELOAD - REMOVED COMPOSITE BECOMES STUB: swap back to map.1.
    DYN_ONE/DYN_TWO are back with original scores, DYN_THREE was in
    the previous generation but is now absent -> verifies the stub
    path keeps the name out of scan results.

Lua plugin registers DYN_BASE_A/B/C as always-firing atomic symbols
so the composite expressions resolve deterministically. Config sets
map_watch_interval = 0.5s for tight reload turnaround.

Vsevolod Stakhov [Tue, 26 May 2026 20:08:45 +0000 (21:08 +0100)]

[Conf] composites: route composites.dynamic to map handler

Add a reserved key in the composites { ... } config block so users can
attach a hot-reloadable map of composites:

    composites {
        STATIC_COMP { expression = "..."; score = 1.0; }
        dynamic = "/etc/rspamd/composites.map";
        # or dynamic = ["http://a/x", "file://y"];
        # or dynamic = { url = "..."; signature = "..."; }
    }

The handler intercepts the 'dynamic' key inside the composites section,
hands the UCL value to rspamd_composites_add_dynamic_map(), and lets
the rest of the section continue with static composite definitions.

Smoke-tested by running rspamd against a config with a file-backed
dynamic map: map_fin fires, the publish pipeline registers the
composites with the symcache, and the dynamic generation bumps to 1.

Vsevolod Stakhov [Tue, 26 May 2026 19:19:24 +0000 (20:19 +0100)]

[Feature] composites: dynamic UCL map handler

Implements hot-reloadable composites maps. The map content is a UCL
object mapping composite name to a body of expression, score, group,
policy, description, groups, enabled — the same vocabulary the static
composites { ... } config block accepts.

Manager additions:
- build_staging() clones base_gen so the map handler can mutate a
   detached generation without disturbing in-flight tasks
- add_composite_to_staging() parses one UCL composite into staging
   and reflects it in cfg->symbols
- disable_in_staging() materialises a disabled stub for a name
- publish_generation() registers any new composite names with the
   symcache, bumps the resort generation, runs the analysis pipeline
   on the staging, and atomically swaps current_gen
- seal_static_load() captures the static-config generation as
   base_gen and seeds ever_seen_names; called once from
   rspamd_composites_mark_whitelist_deps
- symcache_pinned keeps the first composite shared_ptr per name
   alive forever, so the symcache's cbdata never dangles even when
   later generations replace the composite

Per-map state (map_cbdata) tracks last_names so a reload that drops a
name turns it into a stub instead of leaving it ghosted.
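
The staging-and-publish flow with per-map stubs can be sketched roughly as follows. This is a simplified Python model, not the C++ manager: the class and function names are hypothetical, composites are plain dicts, and the symcache registration and analysis pipeline are omitted.

```python
import copy


class CompositesManager:
    def __init__(self, static_composites):
        # base_gen models the generation captured from static config.
        self.base_gen = dict(static_composites)
        self.current_gen = dict(static_composites)
        self.generation = 0

    def build_staging(self):
        # Detached copy: mutating it never disturbs current_gen.
        return copy.deepcopy(self.base_gen)

    def publish(self, staging):
        # A single reference swap stands in for the atomic swap.
        self.current_gen = staging
        self.generation += 1


class MapState:
    def __init__(self):
        self.last_names = set()  # names provided by the previous load


def reload_map(manager, state, parsed_map):
    staging = manager.build_staging()
    for name, body in parsed_map.items():
        staging[name] = {**body, "disabled": False}
    # Names present last time but missing now become disabled stubs.
    for name in state.last_names - set(parsed_map):
        staging[name] = {"disabled": True}
    state.last_names = set(parsed_map)
    manager.publish(staging)
```

Any reference to the old `current_gen` taken before `publish()` keeps seeing the previous contents, which is the property in-flight tasks rely on.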

rspamd_composites_add_map_handlers — already in tree but unwired —
now parses the buffered bytes as UCL instead of NAME:SCORE EXPRESSION,
and routes through the new staging pipeline.

Public C API:
- rspamd_composites_add_dynamic_map() — registers a dynamic map
- rspamd_composites_current_generation() — diagnostics

cfg_rcl wiring (composites.dynamic = ...) is the next commit; this
commit only adds the runtime + API. Static composites are unchanged;
17/17 functional tests in 109_composites + 109_settings_merge pass.

Vsevolod Stakhov [Tue, 26 May 2026 19:02:01 +0000 (20:02 +0100)]

[Refactor] composites: parameterise build helpers by generation

process_dependencies, build_inverted_index, mark_whitelist_dependencies,
collect_leaf_atoms, the composite-dep cbdata and the inverted-index
cbdata all take an explicit composites_generation reference now and
operate solely on it, with no implicit access to manager state.

The manager keeps a no-arg overload of each that forwards to
*current_gen — config-load wiring is unchanged.

This unblocks building a staging generation (under a dynamic-map
reload) without touching the live one. No behaviour change for static
configurations.

Vsevolod Stakhov [Tue, 26 May 2026 18:49:35 +0000 (19:49 +0100)]

[Refactor] composites: extract per-task generation snapshot

Hoist the per-pass evaluation vectors, inverted index, and ownership
lists into a new composites_generation struct held inside composites_manager
as a shared_ptr<composites_generation> current_gen.

composites_data takes a snapshot of current_gen at task-creation time and
all read paths (first/second-pass walking, inverted-index lookup,
not_only fallback, composite-reference recursion) now go through the
pinned snapshot. This is a no-op today — only one generation ever
exists — but is the foundation for hot-reloadable composite maps where
the manager swaps current_gen while in-flight tasks must keep using
their snapshot.

Composite ids are now allocated through composites_manager::next_id()
which is monotonic across generations so an id is unique for the life
of the worker; composites_data::checked is sized from the maximum id
in the snapshot.
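
A rough Python model of the snapshot and id-allocation scheme described above; it is a sketch under simplifying assumptions, not the C++ code. `Generation`, `Manager` and `Task` are hypothetical stand-ins, and every published composite simply gets a fresh id here.

```python
import itertools


class Generation:
    def __init__(self, composites):
        # name -> id; treated as immutable once published
        self.composites = composites


class Manager:
    def __init__(self):
        self._ids = itertools.count()  # monotonic across generations
        self.current_gen = Generation({})

    def next_id(self):
        return next(self._ids)

    def publish(self, names):
        self.current_gen = Generation({n: self.next_id() for n in names})


class Task:
    def __init__(self, manager):
        # Pin the generation that is current at task-creation time.
        self.gen = manager.current_gen
        max_id = max(self.gen.composites.values(), default=-1)
        # Sized from the maximum id in the snapshot, not from its length.
        self.checked = [False] * (max_id + 1)
```

A task created before a later `publish()` keeps reading its pinned `gen`, while new tasks pick up the replacement.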

Removed the cached atom->ncomp / comp_type resolution. Caching a
manager pointer on a shared atom would dangle if a referenced
composite is replaced in a later generation; instead each evaluation
resolves the composite name through the task's snapshot via a single
hashtable lookup. Dropped rspamd_composites_resolve_atom_types and the
corresponding enum.

Added rspamd_composite::disabled — wired through the eval path,
process_dependencies, build_inverted_index and mark_whitelist_dependencies
so that stub composites (used in later commits to replace removed
entries on map reload) skip out of every index without being evaluated.

No behaviour change for static composites configurations; functional
tests in test/functional/cases/109_composites.robot pass unchanged.

Vsevolod Stakhov [Tue, 26 May 2026 15:09:21 +0000 (16:09 +0100)]

[Test] 440_ssl_server: wait for SSL controller in suite setup

The previous attempt at killing this flake added per-test retries of
15 x 0.4s = 6s to the two controller-SSL HTTPS tests. Under heavy
parallel pabot load (4 workers + concurrent serial robot on the same
box) we have observed the controller's SSL listener take longer than
6s to start accepting after Run Rspamd's readiness check passes, and
both retry budgets get exhausted in sequence.

Run Rspamd's readiness check pings the plain normal worker and (for
configs with a control socket) waits for the controller to register
its workers with main. Neither covers the SSL listener: OpenSSL ctx
init for that listener happens after the worker is announced and
can lag by hundreds of ms in the worst case.

Move the wait into a single Suite Setup with a generous 30s budget
(60 x 0.5s) so we pay it once and the individual tests can issue a
direct HTTPS request again. The suite setup uses /ping (smallest
controller endpoint, served unauthenticated from 127.0.0.1 which is
in secure_ip). If the listener never comes up the suite fails loudly
in setup rather than every test independently exhausting a 6s retry.
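
The single budgeted wait can be illustrated with a generic polling loop. The real change is a Robot Framework Suite Setup; the Python below is only a sketch, and `wait_until` and its `probe` callback (for example, an HTTPS GET of /ping that returns true on success) are hypothetical.

```python
import time


def wait_until(probe, attempts=60, interval=0.5, sleep=time.sleep):
    """Call probe() until it returns true or the budget is exhausted.

    The defaults mirror the 60 x 0.5s budget from the commit text.
    Returns True on success and False if every attempt failed.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt + 1 < attempts:
            sleep(interval)  # no pointless sleep after the last attempt
    return False
```

Paying this cost once per suite means the individual tests can fail fast instead of each carrying its own retry loop.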

Local: three back-to-back parallel pabot runs (4 processes, full
001 Merged suite) -- 6/6 pass, suite finishes in ~4-5s.

Vsevolod Stakhov [Tue, 26 May 2026 08:09:41 +0000 (09:09 +0100)]

[Minor] DNS: Remove unused SERVFAIL cache

The fails_cache feature (introduced in e3057e5e4, Oct 2019) was undocumented,
disabled by default, never exercised in tests, and never adopted in
practice — including by the single deployment it was originally written for.

Negative DNS caching, if ever needed, belongs in librdns.

Vsevolod Stakhov [Mon, 25 May 2026 20:00:43 +0000 (21:00 +0100)]

[Test] 411_logging: read per-suite rspamd output, not global .last

save_run_results writes each rspamd's logs to two destinations: the
stable per-suite/per-test directory under robot-save/, and a global
robot-save/<file>.last "convenience" copy of the most recent run.

The three 411_logging tests asserted on the .last copies. Under
pabot another worker can tear down -- and overwrite the .last files
-- between this suite's Rspamd Teardown saving them and the
assertion reading them, so the assertion ends up running against a
different suite's rspamd output and matching the wrong format.

Switch to the per-suite paths
(robot-save/${SUITE_NAME}/rspamd.stderr for the console suites,
robot-save/${SUITE_NAME}/${TEST_NAME}/rspamd.log for the JSON file
test). Those paths aren't shared across pabot workers.

Local: three back-to-back parallel runs of the 411_logging
directory pass 3/3 each time.

Vsevolod Stakhov [Mon, 25 May 2026 16:12:43 +0000 (17:12 +0100)]

[Test] 440_ssl_server: tolerate slow controller SSL bind

The controller worker registers with the main process slightly
before its SSL listener finishes initializing OpenSSL and starts
accepting connections. The pre-test readiness check in Run Rspamd
sees "workers" appear in `rspamadm control stat` -- proof that
registration is done -- but the SSL socket on PORT_CONTROLLER_SSL
can still briefly refuse for tens to hundreds of milliseconds
after that, especially under concurrent-phase load on CI.

The first two tests in 440_ssl_server hit the SSL controller port
back-to-back and were the only ones to occasionally fail with
"Connection refused"; the remaining four (plain controller,
SSL/plain normal worker) ran later in the suite and always passed
because the SSL listener was up by the time they reached it.

Wrap just those two HTTPS calls in `Wait Until Keyword Succeeds`
(15 x 0.4s = ~6s) so the test reflects what it actually verifies:
the SSL controller eventually serves /stat and /errors. Refactor
the assertion into a small `Fetch HTTPS And Expect 200` keyword
to keep both retries readable.

Local: three back-to-back parallel pabot runs of the suite -- 6/6
pass each time, no flakes.

Vsevolod Stakhov [Mon, 25 May 2026 12:44:37 +0000 (13:44 +0100)]

[Project] Parallelise functional tests via pabot (#6060)

* [Project] Parallelise functional tests via pabot

Switch the Robot Framework functional test suite from a single serial
robot invocation to a two-phase pabot + robot run, giving CI a ~3-4x
wall-clock win on the parallel-safe portion while keeping the rest
working unchanged.

Worker isolation lives in test/functional/lib/vars.py. Each pabot
worker reads PABOTEXECUTIONPOOLID and applies a port offset of
index*100 across every rspamd / redis / nginx / clam / fprot / avast
/ dummy-http / dummy-https / dummy-http-early / dummy-llm / dummy-udp
/ dummy-ssl port, plus a per-worker /tmp/rspamd-functional-<index>/
prefix for unix sockets and pidfiles. Plain `robot` runs unchanged
(no env var -> index 0 -> the historical port numbers).
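
The per-worker layout described above amounts to the following arithmetic. This is a Python sketch, not the contents of vars.py: the dictionary keys and function names are hypothetical, and only two base ports (values that appear elsewhere in this log) are shown. It models the env-var lookup as first committed; a later fix in this series replaces it with a file-based slot claim.

```python
# Base ports for worker 0; the service names are illustrative only.
BASE_PORTS = {"normal": 56789, "redis": 56379}


def worker_index(environ):
    """No env var means index 0, i.e. the historical port numbers."""
    return int(environ.get("PABOTEXECUTIONPOOLID", "0"))


def worker_layout(index):
    """Shift every port by index*100 and derive a per-worker tmp prefix."""
    ports = {name: base + index * 100 for name, base in BASE_PORTS.items()}
    tmp_prefix = f"/tmp/rspamd-functional-{index}/"
    return ports, tmp_prefix
```

With index 1 the normal-worker port becomes 56889, matching the distinct ranges reported further down in this commit.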

The dummy_* helper utilities now derive their PID paths from
{tmp_prefix}/dummy_<svc>-<port>.pid (or socket basename for p0f) via
a small util/dummy_pidfile module, so two instances on different
ports no longer collide. Existing override-via-argv callsites still
work. Robot keywords in lib/rspamd.robot are updated to use the
vars-driven ports and pidfile paths; suites that read those PIDs
(161_p0f, 230_tcp, 001_merged/{160_antivirus,310_udp}) and the
url-redirector log-grep in 162_url_redirector are templated to
match.

Twelve suites still bake dummy_http/dummy_llm/dummy_http_early/tcp
port numbers into Lua test scripts (test/functional/lua/{http,
http_early_response,tcp}.lua) and three configs (settings.conf,
neural_llm.conf and the assertion literals in url_redirector*),
so they only work at the worker-0 port offset. Tagging them
`notparallel` and running them with plain robot after the pabot
batch sidesteps the collision without templating those Lua scripts
in this change.

CI (.github/workflows/ci_rspamd.yml) installs pabot via pip
(--break-system-packages with a fallback for older pip in the
Fedora image), then runs:
  * Phase 1: pabot --processes 4 --exclude notparallel
            -> outputdir build/parallel/
  * Phase 2: robot --include notparallel
            -> outputdir build/serial/
Both phases run unconditionally and the step exits non-zero if
either failed. Artifact upload now collects both outputdirs plus
the legacy build/*.*ml path.

Local invocation is `test/functional/run-parallel.sh`, a thin
wrapper documented in CLAUDE.md. The script forces suite-level
splitting (no --testlevelsplit) because each Suite Setup starts
its own rspamd.

Follow-ups (not in this change):
  * Template the four Lua scripts and three configs so the twelve
    notparallel suites can drop the tag.
  * Split 001_merged/ (30 sub-suites under one rspamd) into
    independent units; currently pinned to one worker and the long
    pole of phase 1.

* [Fix] functional tests: claim worker slot via /tmp lockfile

Pabot 5.2.2 does not export PABOTEXECUTIONPOOLID to child robot
subprocesses, even though the variable name appears in pabot's own
source for internal accounting. The previous worker-index detection
fell through to 0 in every pabot worker, so all four workers used
identical rspamd / redis / fuzzy port offsets and crashed in
Multi Setup with "Address already in use".

Replace the env-only lookup with an atomic file-claim:

  * RSPAMD_WORKER_INDEX / PABOTEXECUTIONPOOLID still win when set
    (explicit override, future pabot versions).
  * Otherwise each process atomically grabs the first free
    /tmp/rspamd-functional.slot-<N> via O_CREAT|O_EXCL, writing its
    pid. A stale slot (pid no longer alive) is reclaimed by the next
    caller. atexit unlinks the slot when the process exits.
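
A minimal sketch of such a file-claim allocator, assuming a POSIX system. It is not the code in vars.py: the helper names and the `max_slots` bound are hypothetical, the explicit env-var override is left out, and the stale-slot reclaim shown here is simplified.

```python
import atexit
import os

SLOT_TEMPLATE = "/tmp/rspamd-functional.slot-{}"  # pattern from the commit


def _pid_alive(pid):
    """Return True if a process with this pid exists (signal 0 probe)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def _slot_is_stale(path):
    """A slot is stale when it names a pid that is no longer alive."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # unreadable or still being written: assume live
    return not _pid_alive(pid)


def _release(path):
    try:
        os.unlink(path)
    except FileNotFoundError:
        pass


def claim_slot(template=SLOT_TEMPLATE, max_slots=64):
    """Claim the first free slot index; max_slots is an arbitrary bound."""
    for index in range(max_slots):
        path = template.format(index)
        for _attempt in range(2):  # one retry after reclaiming a stale slot
            try:
                fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
            except FileExistsError:
                if not _slot_is_stale(path):
                    break  # held by a live process, try the next index
                _release(path)
                continue
            with os.fdopen(fd, "w") as handle:
                handle.write(str(os.getpid()))
            atexit.register(_release, path)
            return index
    raise RuntimeError("no free worker slot")
```

The exclusive create is what makes the claim atomic: only one process can succeed in creating a given slot file, so concurrent callers end up with distinct indices.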

Verified locally:

  * Four concurrent python imports of vars.py get indices 0..3 with
    no collisions; slot files cleaned up on exit.
  * `pabot --processes 2` over two trivial robot suites prints
    distinct port ranges (56789 vs 56889) from each worker.

* [Fix] worker binds: env-templated defaults; diagnostic log tail

The four built-in workers (normal, controller, rspamd_proxy, fuzzy)
in conf/rspamd.conf hardcoded `localhost:1133[2-5]`. Under parallel
pabot every rspamd instance tried to bind those same ports and the
second one onwards hard-terminated with "Address already in use".

Switch the bind_socket lines to jinja templates with the existing
production strings as defaults:

  bind_socket = "{= env.LOCAL_ADDR|default('localhost') =}:\
                 {= env.PORT_NORMAL|default('11333') =}";

Production behaviour is preserved bit-for-bit -- with no env vars,
the templates resolve back to `localhost:11332..11335`. The functional
test harness already exports RSPAMD_LOCAL_ADDR / RSPAMD_PORT_*, which
rspamd's lua_common.c strips of the RSPAMD_ prefix when populating
rspamd_env, so `env.PORT_NORMAL` etc. pick up the per-worker slot
values from test/functional/lib/vars.py automatically.
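
The prefix stripping and the templated bind address can be modelled in a few lines. This is an illustrative Python sketch, not lua_common.c or the template engine; `build_rspamd_env` and `bind_socket` are hypothetical names.

```python
def build_rspamd_env(environ, prefix="RSPAMD_"):
    """Keep only prefixed variables and drop the prefix from their names."""
    return {
        key[len(prefix):]: value
        for key, value in environ.items()
        if key.startswith(prefix)
    }


def bind_socket(env):
    """Model the normal worker's template with its production defaults."""
    addr = env.get("LOCAL_ADDR", "localhost")
    port = env.get("PORT_NORMAL", "11333")
    return f"{addr}:{port}"
```

With an empty environment this yields `localhost:11333`; exporting RSPAMD_PORT_NORMAL shifts only the port.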

Verified locally:
  - `pabot --processes 4` over the four `001_merged` sub-suites
    (Cases.001 Merged.{099,100,101,102}) passes 122/122 tests where
    it used to fail every test with hard_terminate.
  - Full phase-1 run (`pabot --processes 4 --exclude notparallel`)
    completes in 2m20s with 646/666 passing; the 20 failures are all
    local mac env-specific issues (missing pynacl, missing
    liblua.5.1.dylib for miltertest, etc.) unrelated to this change.
  - `rspamadm configdump` on a stock config (no env override) still
    binds `localhost:11332..11335` byte-for-byte.

Also enrich Rspamd Startup Check to surface the last 80 lines of
rspamd.log plus exit code, port and tmpdir on Process Is Gone --
the previous one-line "loading configuration" stderr made the bind
collision invisible from CI artifacts and forced a local repro to
diagnose.

* [Test] functional: dummy-port env in lua + settle after startup

Three classes of leftover collisions surfaced once worker bind_sockets
were templated and parallel rspamds actually started:

  * lua/udp.lua and lua/maps_kv.lua (loaded by 001_merged) and the
    rspamadm script lua/rspamadm/test_redis_client.lua hardcoded the
    dummy_udp / dummy_http / redis ports. Workers on slot index > 0
    bound their dummies on shifted ports, so the lua scripts kept
    talking to the slot-0 endpoints and tests timed out. Read
    env.PORT_DUMMY_UDP / env.PORT_DUMMY_HTTP / env.REDIS_PORT (set
    via vars.py -> RSPAMD_PORT_* -> rspamd_env stripped of the
    RSPAMD_ prefix in lua_common.c) and fall back to the historical
    literals so the scripts still run outside the harness.

  * configs/merged-override.conf EXTERNAL_MULTIMAP and
    configs/settings.conf external_map baked
    `http://127.0.0.1:18080/...` into rspamd's own config. Switch
    those to `{= env.PORT_DUMMY_HTTP|default('18080') =}` so the
    multimap external backend resolves to the per-worker dummy_http.

  * lib/rspamd.robot Rspamd Setup polled the startup-check loop with
    `IF ${ok} CONTINUE`, which kept iterating after the first
    successful ping but added effectively no grace period for the
    controller / proxy workers to finish registering with the main
    process. Under parallel load the first `rspamadm control stat`
    in 001_merged.099 Control returned an empty workers list.
    Switch to `BREAK` on success and add a 0.5s settle period.

Verified locally: previously-failing
099_control / 100_general / 101_lua / 102_multimap /
310_udp / 151_rspamadm_async now pass 126/126 under
pabot --processes 4 in ~17s.

* [Test] functional: fix two more parallel races

Two leftover collisions surfaced once 001_merged was actually starting
rspamds in parallel across pabot workers:

* test/functional/lua/lua_extras_test.lua writes its staging tree to
  os.getenv('TMPDIR'). On Linux CI TMPDIR is unset, so every worker
  raced on a shared /tmp/lua_extras_test directory -- one worker's
  `rm -rf` would wipe another worker's tree mid-test and rspamd
  config load aborted with `cannot init lua file ... No such file
  or directory`. Prefer RSPAMD_TMPDIR (per-suite tmpdir, propagated
  via env:RSPAMD_TMPDIR in Run Rspamd) so workers don't share state.

* 151_rspamadm_async/Redis client invokes `rspamadm lua -b
  test_redis_client.lua` which connects to redis directly. The
  previous fix used `rspamd_env.REDIS_PORT`, but rspamadm's lua
  context (unlike the daemon's) does not populate the `rspamd_env`
  global -- only rspamadm_session/_ev_base/_dns_resolver are set --
  so the lookup always fell through to the literal 56379. Read
  `os.getenv("RSPAMD_REDIS_PORT")` instead. Also call
  `Export Rspamd Variables To Environment` from the suite's Setup
  so the env vars are actually present in the rspamadm subprocess
  inherited environment (this suite never calls Run Rspamd, which
  is where the export normally happens).

Local: `pabot --processes 2` over 102_multimap / 151_rspamadm_async /
271_lua_extras passes 83/83 in ~8s.

* [Test] CI: run parallel + serial functional phases concurrently

The two-phase split (pabot for parallel-safe suites, plain robot for
notparallel-tagged ones) ran sequentially -- on fedora that meant
2:16 (pabot, 666 tests) + 1:35 (robot, 92 tests) = ~4 minutes total
versus master's ~6 minutes serial. The pabot phase itself is already
at ~91% of theoretical 4-worker speedup (8:14 of work in 2:16
wall-clock), so bumping --processes won't help much -- the cheap
win is overlapping the two phases.

Background both phases with `&`, capture their PIDs, then `wait`
each separately to harvest exit codes. They claim disjoint slots
from the vars.py file-based allocator (pabot grabs 0..3, robot
grabs 4), so their rspamds use different port ranges and tmp
prefixes and don't collide.

Expected total wall-clock: ~max(2:16, 1:35) ~= 2:20, down from ~4:00.
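
The CI step does this in shell with `&` and `wait`; the same start-all-then-harvest pattern looks roughly like this in Python. The sketch is illustrative only, and `run_phases` and `overall_status` are hypothetical names.

```python
import subprocess


def run_phases(commands):
    """Start every command first, then wait on each to collect exit codes.

    Because all processes are started before the first wait, total
    wall-clock time is about the longest phase, not the sum of phases.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]


def overall_status(codes):
    """Exit non-zero if any phase failed."""
    return 0 if all(code == 0 for code in codes) else 1
```

Each command would be an argument list such as the pabot and robot invocations from the two phases.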

Verified locally: 4 pabot workers + 1 serial robot running 6
suites in parallel (115 + 33 tests) all pass in 27s on a 4-core
mac with the same vars.py slot allocator. No port collisions
observed.

* [Test] Revert misleading CLAUDE.md additions

The functional-test commands I added were wrong on two counts:

  * RSPAMD_INSTALLROOT=~/rspamd.install -- that path is stale on this
    repo's typical setup; the CMake install prefix is /usr/local.
  * "driven by PABOTEXECUTIONPOOLID" -- pabot 5.2.2 does NOT actually
    export that env var to child robot subprocesses (confirmed via
    dump-env test). The real mechanism is the file-based slot claim
    in test/functional/lib/vars.py (/tmp/rspamd-functional.slot-N).

Removing the lines rather than fixing them in place; the right
home for parallel-test docs is alongside the runner script and the
PR description, not a duplicate in CLAUDE.md that risks drifting.

* [Test] Verify controller ready + rebot merge unified report

Two issues from the concurrent-phases run:

* `Cases.001 Merged.099 Control` flaked again ("'' does not contain
  'workers'"). rspamd's controller binds and answers HTTP ping
  almost immediately, but its workers list is populated only after
  each worker has registered back with the main process. Under
  parallel pabot + the concurrent serial phase (5 rspamds competing
  for CPU at startup) the gap stretched out and a fixed 0.5s settle
  was no longer enough.

  Replace the blind settle with a real readiness check: after the
  ping loop, if rspamd.sock is present in TMPDIR, poll
  `rspamadm control stat` (via the new keyword
  Verify Controller Workers Registered) until the response actually
  contains "workers". Cheap when fast, retried up to ~6s when
  rspamd is starting slowly. Local: five back-to-back parallel
  runs over 099/100/102/270 -- 530/530 tests pass, no flakes.

* The CI step left three output.xml files
  (build/parallel/{pabot_results/N/,}output.xml and
  build/serial/output.xml) and no single top-level report, so a
  reviewer skimming the CI log saw only one pabot sub-suite path
  and read it as "we only ran part of the suite". Run
  `rebot --merge` after both phases finish to produce a unified
  build/output.xml + log.html + report.html alongside the two
  phase outputs, matching the artifact shape master used to have.

* [Test] Fix readiness check; replace [Return] with RETURN

Two fixes:

* The previous unconditional `Wait Until Keyword Succeeds` for the
  control socket assumed every suite produces $DBDIR/rspamd.sock.
  That holds for 001_merged (includes options.inc -> control_socket
  = "$DBDIR/rspamd.sock") but NOT for the many suites that build a
  minimal standalone config (231_tcp_down etc.). Those never get a
  control socket, so the 50 x 0.2s poll always exhausted and broke
  every test in those suites.

  Wait up to 2s for the socket file to appear -- if it does, poll
  `rspamadm control stat` until the response contains "workers"
  (the real readiness signal CONTROL STAT depends on); if it
  doesn't, just proceed, since suites that never produce a control
  socket can't be testing it.

* Convert the [Return] setting to the RETURN statement across the
  five files that still used the old syntax. Robot Framework 7
  deprecated [Return] and the unrelated noise warnings were
  swamping every test step's stdout, making real failures hard to
  spot:
    cases/001_merged/115_dmarc.robot
    cases/001_merged/160_antivirus.robot
    cases/151_rspamadm_async.robot
    cases/320_arc_signing/003_roundtrip.robot
    lib/rspamd.robot

Verified locally: three back-to-back concurrent-phase runs (4-way
pabot + serial robot for notparallel suites) -- (106 + 33) tests
all pass each time, no flakes, no deprecation warnings.

* [Test] CI: redirect each phase to its own log, group in step output

Previously both concurrent phases (pabot and serial robot) wrote to
the step's combined stdout, so pabot's batched end-of-run summary
and robot's streaming output interleaved. Reviewers were seeing
what looked like only one of the two runs.

Redirect each phase's stdout+stderr to its own
build/phase{1-parallel,2-serial}.log, wait on both PIDs, then
`cat` the two logs in fixed order with GH Actions
::group::/::endgroup:: directives so they collapse to two clean
sections in the web UI. Wall-clock unchanged -- the two phases
still run concurrently; only the presentation is sequential.

Also include the two per-phase logs in the robotlog artifact
upload so they're inspectable after the run.

Vsevolod Stakhov [Sat, 23 May 2026 17:34:42 +0000 (18:34 +0100)]

Merge pull request #6056 from dragoangel/feat/url-redirector-swap-redirectors-map-to-glob

[Feature] url_redirector: switch redirector_hosts_map from set to glob

Vsevolod Stakhov [Sat, 23 May 2026 13:15:02 +0000 (14:15 +0100)]

[Fix] fuzzy_check: accept SRV-only rules at config-load

After switching the default rspamd.com rule to service=fuzzy+rspamd.com,
'rspamadm configtest' logged 'no servers defined for fuzzy rule with
name: rspamd.com' and the rule was rejected. The check at
fuzzy_check.c:2183 uses rspamd_upstreams_count(), which deliberately
excludes SRV parent placeholders because callers like the upstream-
weight setter in dns.c and the lua_createtable size hints elsewhere
want the dispatchable cluster size, not the configured-entry count.

At config-load the SRV parent is the only thing in the list (members
are populated asynchronously after DNS resolution), so the existing
count returned 0 and the rule was rejected.

Add rspamd_upstreams_count_total() that includes SRV parents and use
it for the "is anything configured at all" gate. The four other
callers of rspamd_upstreams_count (dns weight, three Lua table size
hints) keep the existing dispatchable-only semantics, which is what
they want.
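
The difference between the two counters can be shown with a toy model. This is a Python sketch rather than the C upstream API; the `Upstream` type and the function names are hypothetical mirrors of the ones named in the commit.

```python
from dataclasses import dataclass


@dataclass
class Upstream:
    name: str
    is_srv_parent: bool = False


def upstreams_count(upstreams):
    """Dispatchable cluster size: SRV parent placeholders are excluded."""
    return sum(1 for u in upstreams if not u.is_srv_parent)


def upstreams_count_total(upstreams):
    """Configured-entry count: SRV parents are included."""
    return len(upstreams)


def rule_has_servers(upstreams):
    """Config-load gate: is anything configured at all?"""
    return upstreams_count_total(upstreams) > 0
```

At config load a rule with only `service=fuzzy+rspamd.com` has a dispatchable count of 0 but a total of 1, so it passes the gate.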

Vsevolod Stakhov [Sat, 23 May 2026 10:43:21 +0000 (11:43 +0100)]

[Test] neural: drift threshold for pure-symbols mode (50%)

Adds a Robot suite that exercises both sides of the new
is_profile_compatible threshold:

  Train pure-symbols ANN
    Standard 10 spam + 10 ham autotrain pattern (mirrors 001_autotrain).

  Inference fires before drift
    Baseline check: NEURAL_SPAM_SHORT / NEURAL_HAM_SHORT fire after
    training completes.

  40 percent drift keeps the prior profile compatible
    FORCE_DRIFT_NEURAL_40 drops the last 40% of set.symbols and prepends
    40% fresh "DRIFT_NEW_SYM_*" entries; distance_sorted against the
    trained profile reports ~40% of |set.symbols|. With the cap raised
    to 50%, the prior profile is still accepted and inference keeps
    firing. Pre-fix (30% cap) this configuration would have orphaned
    the ANN.

  60 percent drift rejects the prior profile
    FORCE_DRIFT_NEURAL_60 pushes drift to ~60%, above the new 50%
    cap. is_profile_compatible rejects, set.ann stays unset,
    NEURAL_*_SHORT do not fire -- pins the upper bound so a future
    too-permissive change (e.g. raising the cap to 70%) trips here.

Note on the drift formula: distance_sorted is an asymmetric edit-
distance walk, not a symmetric-difference counter. When the fresh
entries sort before every baseline name and the dropped entries are
at the tail, the function reports dist ≈ replace_k rather than 2k.
So to hit dist == drift_pct% of n the helper drops and adds
k = drift_pct * n / 100 (not / 200). The first attempt at this test
hit the / 200 trap and the 60% case stayed under the cap.
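
The arithmetic behind that note can be made concrete with a short Python sketch. It does not reimplement distance_sorted; it takes the commit's statement that the reported distance is roughly k for this construction as given. `drift_symbols` and `is_within_cap` are hypothetical helpers, and the inclusive comparison at exactly 50% is an assumption.

```python
def drift_symbols(baseline, drift_pct):
    """Drop the last k names and prepend k fresh ones, k = drift_pct*n/100."""
    n = len(baseline)
    k = drift_pct * n // 100
    fresh = [f"DRIFT_NEW_SYM_{i}" for i in range(k)]
    return fresh + baseline[: n - k]


def is_within_cap(dist, n, cap_pct=50):
    """Compare a reported distance against the cap as a share of n."""
    return dist * 100 <= cap_pct * n
```

With n = 20, a 60% drift gives k = 12, which is above the cap; dividing by 200 instead would give k = 6 (30% of n), which stays under the cap and explains why the first attempt did not trip it.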

Per-(rule, set) baseline is snapshotted on the first drift call so
the 60% test compares against the originally-trained list, not the
already-drifted one from the 40% test.

The disable_symbols_input + providers scenario is already covered by
003_carryover; the hybrid (providers + symbols) carryover-misindexing
scenario is harder to drive deterministically in a Robot harness and
is left as a future addition.

Verified locally: 20/20 of Functional.Cases.330_Neural pass.

Vsevolod Stakhov [Sat, 23 May 2026 10:34:17 +0000 (11:34 +0100)]

[Fix] neural: resilient ANN reuse across symbol-list drift

Two follow-up fixes that complete the "neural keeps working when symbols
change" story started by the disable_symbols_input digest stability
commit. Both motivated by inspecting the actual vbspam Redis state on
sp-collector, which showed multiple coexisting profiles per rule and an
orphaned training set (~100 spam / 15 ham) under a stale digest.

is_profile_compatible (pure-symbols mode)

The 30% Levenshtein-drift cap rejected the prior profile on every modest
config change (new RBL, multimap addition, SA-style rule loaded via
multimap regexp_rules). When rejected, set.training_profile stayed nil,
inference went dark, and training samples had nowhere to accumulate
until a brand-new ANN trained from scratch -- weeks under realistic
class imbalance. Raise the cap to 50%, with a comment pointing at the
result_to_vector path (it builds vectors from profile.symbols, NOT
set.symbols, so loading the older profile keeps the trained weights
correctly indexed against the features that produced them).

maybe_carryover_ann (hybrid providers + symbols)

The carryover copied an ANN blob from an old key (trained against
profile.symbols A) into a fresh key whose profile entry carries
set.symbols (current = B). load_new_ann later writes
set.ann.symbols = profile.symbols, so at inference the copied weights
got applied to indices that no longer correspond to the symbols they
were trained on -- silent garbage output. Guard the carryover with
rule.disable_symbols_input: only then does the symbol portion not
contribute to the input vector, and copied weights remain meaningful.
For hybrid mode without disable_symbols_input the existing
is_profile_compatible path already keeps inference alive via the prior
profile entry (whose own symbol list keeps weights aligned), so
skipping carryover is the correct behaviour, not a regression.
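
The misindexing can be illustrated with a toy linear model. This is a sketch
under stated assumptions, not the neural plugin's code: `to_vector`, `score`,
the symbol names and the weights are all hypothetical, and a real ANN has a
fixed input width rather than the truncation that `zip` performs here.

```python
def to_vector(hit_symbols, symbol_list):
    """One input per symbol, in the order given by symbol_list."""
    hits = set(hit_symbols)
    return [1.0 if s in hits else 0.0 for s in symbol_list]


def score(weights, vector):
    """Toy stand-in for inference: a plain dot product."""
    return sum(w * x for w, x in zip(weights, vector))


# Weights trained against symbol list A (hypothetical names and values).
symbols_a = ["SYM_BAYES", "SYM_RBL", "SYM_DKIM"]
weights_a = [2.0, 1.5, -1.0]

# Current list B: a new symbol was inserted, shifting later indices by one.
symbols_b = ["SYM_BAYES", "SYM_NEW", "SYM_RBL", "SYM_DKIM"]

hits = ["SYM_RBL"]
aligned = score(weights_a, to_vector(hits, symbols_a))
misaligned = score(weights_a, to_vector(hits, symbols_b))
```

With the vector built from list A the hit picks up its own weight (1.5). Built
from list B, the same hit lands on the index that held SYM_DKIM during
training and scores -1.0, with no error raised.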

Combined with the earlier digest-stability commit, the failure
modes the user kept hitting in production -- disable_symbols_input
digest rotation, pure-symbols cap too tight, hybrid carryover
misindexing -- are all addressed.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 23 May 2026 10:14:42 +0000 (11:14 +0100)]

[Fix] neural: digest stability under disable_symbols_input

The profile digest forms part of the Redis key holding the trained
ANN (rn_<rule>_<settings>_<digest>_<v>). process_settings_elt computed
it as lua_util.table_digest(selt.symbols) unconditionally.

With disable_symbols_input=true the symbol catalogue does not feed the
model -- only providers + fusion + max_inputs determine the input-vector
schema (see is_profile_compatible) -- so hashing the unrelated symbol
list rotated the digest whenever any rspamd symbol was added/removed
elsewhere (a new RBL, a multimap rule, an SA-style rule loaded via
multimap's regexp_rules). The trained ANN was orphaned in Redis under
the old key and inference silently dropped to zero hits until a new
sample set retrained from scratch (weeks under realistic class
imbalance). Manual recovery via `redis-cli COPY` of the old key to the
new digest was the only fix.

Now: when both has_providers and disable_symbols_input hold, the digest is
providers_config_digest(rule.providers, rule). Other modes keep the
existing symbol-based digest.
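
A hedged sketch of the selection rule above. The hashing here (SHA-256 over
sorted JSON, truncated) is an assumption standing in for
lua_util.table_digest and providers_config_digest, whose real inputs and
output format are not shown in this message; `profile_digest` and the example
provider config are hypothetical.

```python
import hashlib
import json


def table_digest(value):
    """Stand-in digest: short SHA-256 of a canonical JSON encoding."""
    data = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:8]


def profile_digest(symbols, providers=None, disable_symbols_input=False):
    """Pick what the digest is computed over, following the rule above."""
    if providers and disable_symbols_input:
        # The symbol list does not feed the model, so it must not
        # influence the Redis key either.
        return table_digest(providers)
    return table_digest(symbols)


providers = [{"type": "llm", "dim": 256}]  # hypothetical provider config
before = ["SYM_A", "SYM_B"]
after = ["SYM_A", "SYM_B", "SYM_NEW_RBL"]  # unrelated symbol added
```

In this sketch, adding `SYM_NEW_RBL` leaves the digest unchanged when
providers are set with `disable_symbols_input=True`, and rotates it in every
other mode.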

Migration: any deployment already running disable_symbols_input=true
with a trained ANN will see its digest rotate once on first start
after this lands. Either let the model retrain, or use the same
`redis-cli COPY rn_<rule>_<settings>_<old>_<v> rn_<rule>_<settings>_<new>_<v>`
recipe one final time -- after this fix the digest is stable across
unrelated rspamd config changes.
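
The recipe can be assembled from the key layout given above
(rn_<rule>_<settings>_<digest>_<v>). This is a small sketch: the helper names
and all example values are hypothetical, and the real old and new digests
have to be read from Redis and the running configuration.

```python
def ann_key(rule, settings, digest, version):
    """Build a key following the rn_<rule>_<settings>_<digest>_<v> layout."""
    return f"rn_{rule}_{settings}_{digest}_{version}"


def copy_command(rule, settings, old_digest, new_digest, version):
    """Argument list for copying the ANN from the old key to the new one."""
    src = ann_key(rule, settings, old_digest, version)
    dst = ann_key(rule, settings, new_digest, version)
    return ["redis-cli", "COPY", src, dst]


# Placeholder values only; substitute the ones from your deployment.
cmd = copy_command("default", "default", "aaaa1111", "bbbb2222", "0")
```

The list form can be passed to `subprocess.run(cmd, check=True)`, or joined
with spaces for a shell.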

commit | commitdiff | tree

Dmytro Alieksieiev [Fri, 22 May 2026 19:58:11 +0000 (21:58 +0200)]

Merge branch 'master' into feat/url-redirector-swap-redirectors-map-to-glob

commit | commitdiff | tree

Dmitriy Alekseev [Fri, 22 May 2026 19:56:20 +0000 (21:56 +0200)]

[Feature] url_redirector: switch redirector_hosts_map from set to glob

Allow operators to use glob patterns (e.g. *.bit.ly, *.t.co) in the
redirector hosts list. Bare hostnames continue to match exactly, so there is no
operational change for existing maps; only the option to use wildcards
is new.
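
To illustrate the matching behaviour described above, here is a sketch using
Python's fnmatch as a stand-in. rspamd's glob map semantics may differ in
detail (for example in how `*` treats dots); `is_redirector` and the pattern
list are hypothetical.

```python
from fnmatch import fnmatchcase


def is_redirector(host, patterns):
    """True if host matches any pattern; bare names match only themselves."""
    host = host.lower()
    return any(fnmatchcase(host, p.lower()) for p in patterns)


# One bare hostname and one wildcard entry.
patterns = ["t.co", "*.bit.ly"]
```

With these patterns `t.co` and `a.bit.ly` match, while `x.t.co` does not
(the bare entry is exact) and neither does `bit.ly` (the wildcard entry
requires a label before `.bit.ly`).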

Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 19:21:21 +0000 (20:21 +0100)]

[Conf] fuzzy_check: discover servers via SRV by default

Switch the default "rspamd.com" rule from a hardcoded round-robin host
list to SRV-based discovery. "service=fuzzy+rspamd.com" makes the
upstream parser resolve the _fuzzy._tcp.rspamd.com SRV record, so
backends and ports are managed entirely in DNS with no client-side
config change.
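
The mapping from the upstream string to the SRV record name can be sketched
as follows. This mirrors only the example given above
("service=fuzzy+rspamd.com" resolving _fuzzy._tcp.rspamd.com); `srv_name_for`
is a hypothetical helper, not the upstream parser, and it performs no DNS
lookup.

```python
def srv_name_for(upstream, proto="tcp"):
    """Return the SRV owner name for a "service=<name>+<domain>" string.

    Returns None for strings that do not use the service= form.
    """
    prefix = "service="
    if not upstream.startswith(prefix):
        return None
    service, sep, domain = upstream[len(prefix):].partition("+")
    if not sep or not service or not domain:
        raise ValueError(f"malformed service upstream: {upstream!r}")
    return f"_{service}._{proto}.{domain}"


name = srv_name_for("service=fuzzy+rspamd.com")
```

Here `name` is `_fuzzy._tcp.rspamd.com`. Strings without the `service=`
prefix return None, so a caller could fall back to treating them as plain
hosts.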

The legacy fuzzy1/fuzzy2 hostnames keep resolving to every live
backend, so existing installs that pinned the old round-robin string
are unaffected. See rspamd/dns#8.

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 18:00:49 +0000 (19:00 +0100)]

[Minor] Update version to 4.1.0

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 17:59:47 +0000 (18:59 +0100)]

Merge pull request #6054 from dragoangel/fix/tcp-lua-populate-timeout-read

[Fix] Properly populate timeout read in tcp_lua.c

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 22 May 2026 17:43:23 +0000 (18:43 +0100)]

Merge branch 'master' into fix/tcp-lua-populate-timeout-read

Mirror of https://github.com/rspamd/rspamd.git
