Vsevolod Stakhov [Mon, 18 May 2026 11:01:23 +0000 (12:01 +0100)]
[Fix] fuzzy_storage: harden network input paths
Three defensive fixes for user-controlled input over UDP/TCP:
* accept_fuzzy_socket: reset msg_namelen back to the buffer capacity
before every recvmsg/recvmmsg call. The kernel overwrites msg_namelen
with the actual source address size on output; on the non-recvmmsg
path the for(;;) loop reused the same msghdr across calls, so a
larger source address (e.g. IPv6 after IPv4) was silently truncated
by the kernel and the trailing bytes of the parsed sockaddr came
from stale stack memory.
* rspamd_fuzzy_tcp_io: validate the reconstructed 16-bit frame length
before folding it into cur_frame_state. The state machine only has
14 bits for the length (top two bits are flags), so values with bit
14 or 15 set were silently masked off, letting a client smuggle a
large advertised size while the server parsed a much smaller frame.
Now any length above FUZZY_TCP_BUFFER_LENGTH or equal to zero closes
the connection immediately.
* rspamd_fuzzy_make_reply: clamp mf_result->n_extra_flags to
RSPAMD_FUZZY_MAX_EXTRA_FLAGS before the memcpy into the fixed-size
rep_v2->extra_flags[7]. All current backends already bound this
value, but the frontend was trusting them; clamp defensively so a
future backend bug cannot become an OOB write on the reply struct.
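The frame-length check from the second item can be sketched roughly as below. This is an illustration only, not the rspamd code: `frame_len_ok`, `SKETCH_TCP_BUFFER_LENGTH` and its value are hypothetical, and little-endian byte order on the wire is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical limit; the real FUZZY_TCP_BUFFER_LENGTH value is not stated here. */
#define SKETCH_TCP_BUFFER_LENGTH 8192u
/* 14 bits remain for the length once the top two bits are used as flags. */
#define SKETCH_LEN_MASK 0x3FFFu

_Static_assert(SKETCH_TCP_BUFFER_LENGTH <= SKETCH_LEN_MASK,
               "every accepted length must fit into the 14-bit field");

/*
 * Rebuild the 16-bit length from two wire bytes (little-endian assumed)
 * and validate it BEFORE it is folded into a packed state word.
 * Returns true and stores the length on success; false means the
 * connection should be closed.
 */
static bool frame_len_ok(uint8_t lo, uint8_t hi, uint16_t *out_len)
{
    assert(out_len != NULL);

    uint16_t len = (uint16_t) (lo | ((uint16_t) hi << 8));

    if (len == 0 || len > SKETCH_TCP_BUFFER_LENGTH) {
        return false; /* reject instead of silently masking bits 14/15 */
    }

    *out_len = len;
    return true;
}
```

Rejecting before masking is the point of the sketch: a value such as 0x4010 is refused outright rather than being truncated to 0x0010.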
Vsevolod Stakhov [Mon, 18 May 2026 09:40:40 +0000 (10:40 +0100)]
[Fix] fuzzy_storage: peer-pipe write resume and shutdown drain
fuzzy_peer_try_send retried short writes from byte 0 of the command
instead of resuming at the offset already sent, so a partial write
followed by a watcher-driven retry shoved garbage into the peer pipe.
Track the bytes sent on the request and resume from there. Convert
the helper to a tri-state (DONE / AGAIN / FATAL) so the watcher can
keep firing on transient short writes and only stop+free on completion
or a hard error.
Also link pending requests into a list on the ctx so worker shutdown
can drain any whose write watcher never fires (e.g. on non-update
workers where the event loop has already broken out), instead of
leaking the up_req allocations.
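A minimal sketch of the resume-from-offset idea with a tri-state result follows. The names (`try_send`, `struct pending_req`, the `SEND_*` values) are hypothetical stand-ins, not the identifiers used in fuzzy_storage, and treating a zero-byte write as fatal is a choice made for this sketch only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

enum send_state { SEND_DONE, SEND_AGAIN, SEND_FATAL };

/* Hypothetical request record; `sent` survives across watcher callbacks. */
struct pending_req {
    const unsigned char *buf;
    size_t len;
    size_t sent;
};

static enum send_state try_send(int fd, struct pending_req *req)
{
    assert(req != NULL && req->sent <= req->len);

    while (req->sent < req->len) {
        ssize_t n = write(fd, req->buf + req->sent, req->len - req->sent);

        if (n > 0) {
            req->sent += (size_t) n; /* next attempt resumes here, not at byte 0 */
        }
        else if (n == -1 && errno == EINTR) {
            continue; /* interrupted before writing anything: retry now */
        }
        else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return SEND_AGAIN; /* transient: keep the write watcher armed */
        }
        else {
            return SEND_FATAL; /* hard error (or zero-byte write): stop and free */
        }
    }

    return SEND_DONE;
}
```

A caller would stop and free the watcher only on SEND_DONE or SEND_FATAL, and leave it armed on SEND_AGAIN.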
Vsevolod Stakhov [Mon, 18 May 2026 09:40:24 +0000 (10:40 +0100)]
[Fix] fuzzy_storage: avoid per-refresh leak in dynamic ban inserts
rspamd_fuzzy_block_addr allocated the ban struct from the radix tree's
long-lived mempool before calling radix_insert_compressed. When the
prefix was already present (the common case: ban_sync re-applies on every
bans_version bump, provisional re-blocks every provisional_ttl), the
btrie rejected the duplicate and the code mutated the existing struct in
place — leaving the freshly allocated one orphaned in the mempool with no
way to reclaim it short of a worker restart.
The pool is created with rspamd_mempool_new_long_lived and freed only at
radix_destroy_compressed, so the orphans accumulate monotonically. With
thousands of bans churning across a fuzzy fleet and the rspamd-mem-watchdog
trimming workers on a 30-minute cadence, this matches the growth pattern
we have been compensating for.
Look up the prefix first; on a hit, mutate in place without allocating.
Allocate and insert only on a true miss.
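The lookup-first pattern can be illustrated with a toy table. Everything here is hypothetical: a linear array stands in for the radix tree, `malloc` stands in for the long-lived mempool, and the `allocations` counter exists only to make pool growth visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct ban {
    uint32_t prefix;
    double expire;
};

#define MAX_BANS 16

struct ban_table {
    struct ban *slots[MAX_BANS];
    size_t n;
    size_t allocations; /* stands in for growth of a never-freed pool */
};

static struct ban *ban_lookup(struct ban_table *t, uint32_t prefix)
{
    for (size_t i = 0; i < t->n; i++) {
        if (t->slots[i]->prefix == prefix) {
            return t->slots[i];
        }
    }
    return NULL;
}

/* Returns the live entry, or NULL if the table is full or allocation fails. */
static struct ban *ban_block(struct ban_table *t, uint32_t prefix, double expire)
{
    assert(t != NULL);

    struct ban *b = ban_lookup(t, prefix);
    if (b != NULL) {
        b->expire = expire; /* hit: mutate in place, allocate nothing */
        return b;
    }

    if (t->n == MAX_BANS) {
        return NULL;
    }

    b = malloc(sizeof(*b)); /* true miss: only now pay for an allocation */
    if (b == NULL) {
        return NULL;
    }
    t->allocations++;
    b->prefix = prefix;
    b->expire = expire;
    t->slots[t->n++] = b;
    return b;
}
```

Re-applying the same prefix any number of times leaves `allocations` unchanged, which is the property the fix restores.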
Vsevolod Stakhov [Sun, 17 May 2026 11:32:33 +0000 (12:32 +0100)]
[Test] multimap: cover regexp_rules selector atom brand spoof
Adds a Bank of America display-name spoof scenario to the SA-style
regexp_rules tests: a `selector =~` atom on `from:name`, a `selector !~`
atom on `from:domain`, and a meta combining them. Validates both =~ and
!~ behavior plus meta scoring on a real spoofed-display-name message.
Regression test for the symcache-driven profile rotation fix.
Drives a live rspamd + Redis through: train ANN with providers-only
input (metatokens, disable_symbols_input=true) -> verify NEURAL_SPAM /
NEURAL_HAM fire -> mutate set.symbols/set.digest in the scanner worker
(simulates a symcache shift) -> verify inference still fires after the
next check_anns poll.
Pre-fix the mutation pushes the symbol-list Levenshtein distance well
past the 30% tolerance, the worker rejects the trained profile, and
NEURAL_SPAM stops firing. Post-fix the providers_digest stays
constant and is recognised as the authoritative schema fingerprint, so
the trained ANN is reloaded.
max_trains=1 because metatokens-only scans produce an identical
vector per message and Redis SADD deduplicates — one spam + one ham
scan are enough to fire training.
Vsevolod Stakhov [Sat, 16 May 2026 19:03:12 +0000 (20:03 +0100)]
[Fix] neural: preserve trained ANN across symcache-driven profile rotation
When rspamd's symbol cache shifts (any added/removed symbol, even unrelated
to the neural rule), the per-rule symbol digest changes and the plugin
historically picked a brand-new profile — abandoning the previously-trained
ANN at the old redis_key. In deployments where the input vector is built
from providers (e.g. fasttext_embed conv1d) and `disable_symbols_input` is
set, the symbol list is irrelevant to the vector schema, so the
rotation needlessly reset inference until enough new training data
accumulated.
Make providers_digest the authoritative schema fingerprint when providers
are configured:
* New helper `is_profile_compatible` in lualib/plugins/neural.lua decides
load eligibility based on providers_digest first; symbol-list drift is
ignored entirely when `disable_symbols_input = true`, and tolerated
without bound for hybrid (providers + symbols) rules where symbols form
only a minor slice of the fused vector. Pure-symbols rules keep the
legacy 30% Levenshtein tolerance and now also reject profiles that were
trained with providers (vector schemas differ).
* process_existing_ann/maybe_train_existing_ann use the new helper, and
the reload decision in process_existing_ann picks the fresher version
when the providers schema matches across a symbol-digest shift.
* new_ann_profile triggers an async carryover after ZADD: ZREVRANGE the
zset, find the most recent prior profile with a matching
providers_digest, HMGET its ann/roc_thresholds/pca/providers_meta/
norm_stats, and HMSET them into the fresh redis_key. Gated on
HEXISTS new_key ann == 0 so a freshly-trained model is never
overwritten.
Vsevolod Stakhov [Sat, 16 May 2026 14:45:03 +0000 (15:45 +0100)]
[Fix] mime_headers/encoding: correct lengths after in-place rewrites
- mime_headers (message-id): after g_strstrip shifts content forward
in-place, the pre-strip length is stale; re-acquire p and len so the
cleanup loop does not scan past the live content and pull stale bytes
(which the loop would otherwise turn into '?' or treat as a trailing
'>') into MESSAGE_FIELD(task, message_id).
- mime_encoding (rspamd_charset_normalize): fix the trim-in-place math;
the previous version copied one extra byte past `end` and wrote the
null terminator at the unshifted offset, leaving stale trailing bytes
in the normalized charset name.
- mime_encoding (rspamd_mime_charset_utf_enforce): use goffset for the
inner offsets so buffers >= 2 GiB cannot truncate to int32_t and make
p += cur_offset walk backwards into OOB writes.
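For the trim-in-place fix, the intended arithmetic looks roughly like this. `trim_non_alnum_inplace` is a hypothetical helper written for illustration; it is not rspamd_charset_normalize, and the exact set of characters that function trims is not reproduced here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Trim leading and trailing non-alphanumeric bytes of a NUL-terminated
 * string in place and return the new length.
 */
static size_t trim_non_alnum_inplace(char *s)
{
    assert(s != NULL);

    size_t start = 0;
    size_t end = strlen(s); /* [start, end) is the live span */

    while (start < end && !isalnum((unsigned char) s[start])) {
        start++;
    }
    while (end > start && !isalnum((unsigned char) s[end - 1])) {
        end--;
    }

    size_t new_len = end - start;

    if (start > 0) {
        memmove(s, s + start, new_len); /* exactly new_len bytes, none past `end` */
    }
    s[new_len] = '\0'; /* terminator at the shifted offset, not the old one */

    return new_len;
}
```

The two details that matter are the copy length (`end - start`, with no extra byte) and the terminator position (relative to the shifted start).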
Vsevolod Stakhov [Sat, 16 May 2026 14:44:50 +0000 (15:44 +0100)]
[Fix] images/archives: harden parsers against malformed inputs
- images.c: guard Content-Id image linking against NULL rh->decoded.
- archives.c (zip): require >= 22 bytes for the EOCD scan to avoid a
pointer-below-start computation; widen cd_offset + cd_size to uint64_t
so a 32-bit wrap can no longer bypass the bounds check and let cd land
outside the buffer.
- archives.c (rar v5): replace pointer-arithmetic bound on the file
extra-field with a size-based check so an attacker-controlled 64-bit
extra_sz cannot wrap p + fname_len + extra_sz and trigger an OOB read.
- archives.c (7z): same fix in rspamd_7zip_read_archive_props for proplen.
- archives.c: change two `return NULL` statements in a bool-returning
function to `return false` (cosmetic).
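The size-based checks used for the zip, rar v5 and 7z fixes follow a common shape, sketched here with hypothetical helper names (`span_fits`, `span_ptr`). The idea is to compare sizes by subtraction so that no attacker-controlled addition can wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * True if the byte range [off, off + sz) lies inside a buffer of
 * buf_len bytes. No `off + sz` or `p + sz` is ever computed, so a
 * huge 64-bit size cannot wrap around and slip past the check.
 */
static bool span_fits(size_t buf_len, uint64_t off, uint64_t sz)
{
    return off <= buf_len && sz <= (uint64_t) buf_len - off;
}

/* Returns a pointer to the validated range, or NULL if it is out of bounds. */
static const uint8_t *span_ptr(const uint8_t *buf, size_t buf_len,
                               uint64_t off, uint64_t sz)
{
    assert(buf != NULL);

    return span_fits(buf_len, off, sz) ? buf + (size_t) off : NULL;
}
```

With a pointer-based bound such as `p + extra_sz <= end`, a value like UINT64_MAX can wrap the left-hand side below `end`; the subtraction form has no such case.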
Vsevolod Stakhov [Sat, 16 May 2026 13:41:51 +0000 (14:41 +0100)]
[Fix] mime_parser: defensive guards against NULL deref and resource leaks
- Fix incorrect offset in begin-base64 UUE prefix detection (was using
sizeof("begin ") instead of sizeof("begin-base64 ")).
- Guard against NULL header value when iterating Content-Type headers
in rspamd_mime_process_multipart_node and rspamd_mime_parse_message.
- Add NULL checks for p7->d.sign / contents / type in the SMIME branch
to avoid crashes on malformed PKCS7 signed-data structures.
- Free the recursive parser context on the early error-return path in
rspamd_mime_parse_message so it does not leak the per-recursion stack
and boundaries arrays.
Two-char country TLDs (.so, .to, .me, .in, .us, etc.) overlap with common
English words, causing false positives when normal prose like "pale blue dot
so insignificant" is matched by the word_dot pattern and normalized to a
valid-looking naked domain (blue.so).
Explicit-protocol patterns (hxxp, spaced_protocol) are unaffected and still
match 2-char TLDs.
Vsevolod Stakhov [Thu, 14 May 2026 18:50:17 +0000 (19:50 +0100)]
[Minor] Defensive guards in JPEG and RFC 2047 QP decoders
process_jpg_image(): bail out early when the input is shorter than the
minimum needed to safely access the SOF fields referenced as p[4..7].
Pointer-arithmetic associativity already makes the existing
`end = p + data->len - 8` benign on standard targets (the loop simply
doesn't execute for tiny buffers), but the explicit precondition makes
the intent obvious and is robust against future refactors.
rspamd_decode_qp2047_buf(): when an encoded-word ends with a bare `=`
that has no following hex digits, emit a literal `=` instead of reading
one byte past the input. Two paths could reach the OOB read - the
direct `*p == '='` block and the else-branch's `goto decode` after
memcspn finds a trailing `=` - both are now guarded. In production the
read landed inside the surrounding header-value buffer (mempool
allocated, null-terminated), so this is cosmetic, but it silences
fuzzer/ASAN noise on direct-call test harnesses.
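The bare-`=` guard can be shown with a small RFC 2047 "Q" decoder. This is a standalone sketch, not rspamd_decode_qp2047_buf: `q_decode` and `hex_val` are hypothetical names, and emitting a literal `=` for any malformed escape (not only a trailing one) is a simplification made here.

```c
#include <assert.h>
#include <stddef.h>

static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/*
 * Decode RFC 2047 "Q" text. Returns the number of bytes written, or
 * (size_t) -1 if `out` is too small. Never reads past in[inlen - 1].
 */
static size_t q_decode(const char *in, size_t inlen, char *out, size_t outlen)
{
    assert(in != NULL && out != NULL);

    size_t i = 0, o = 0;

    while (i < inlen) {
        unsigned char c = (unsigned char) in[i];
        char decoded = (char) c; /* default: copy through, including a bare '=' */
        size_t step = 1;

        if (c == '=' && inlen - i >= 3) {
            /* Both hex digits are inside the input, so reading them is safe. */
            int hi = hex_val((unsigned char) in[i + 1]);
            int lo = hex_val((unsigned char) in[i + 2]);

            if (hi >= 0 && lo >= 0) {
                decoded = (char) (unsigned char) (hi * 16 + lo);
                step = 3;
            }
        }
        else if (c == '_') {
            decoded = ' '; /* Q encoding represents a space as underscore */
        }

        if (o == outlen) {
            return (size_t) -1;
        }
        out[o++] = decoded;
        i += step;
    }

    return o;
}
```

The length test `inlen - i >= 3` runs before either hex digit is touched, which is what keeps a trailing `=` from causing a read past the input.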
[Minor] url_redirector: skip non-HTTP(S) URLs in http_walk
Non-HTTP(S) schemes (tel:, mailto:, etc.) cannot produce HTTP
redirects. Attempting to follow them in http_walk is unnecessary and
can lead to errors. This change skips these URLs early
in the redirect chain walk and emits the URL_REDIRECTOR_NON_HTTP
virtual symbol with a single option in the format:
mailto: is non-hierarchical — the // authority component never applies.
The bug was in rspamd_mailto_parse setting RSPAMD_URL_FLAG_MISSINGSLASHES
when // was absent, causing rspamd_url_parse_text to
inject :// into the stored string.
Note: bare email addresses detected via the @ pattern (user@example.net
in text, no scheme prefix) still go through a different path where
"mailto://" is injected as a literal prefix — that's a separate issue
and out of scope here.
Vsevolod Stakhov [Tue, 12 May 2026 15:57:40 +0000 (16:57 +0100)]
[Feature] memstat: per-callsite mempool counters and structured jemalloc
Track lifetime pools/chunks/bytes counters per mempool callsite and
expose them via rspamd_mempool_entry_stat_t. memory_stat now emits
per-arena jemalloc stats instead of the raw malloc_stats_print dump.
The rspamadm control memstat renderer gains --compact and --only
modes, sortable callsite columns (cur/total bytes and pools), and
prints just the callsite filename.
Vsevolod Stakhov [Tue, 12 May 2026 14:43:45 +0000 (15:43 +0100)]
[Feature] lua_task: bulk and regexp symbol lookups
Add table-form overloads to task:has_symbol() and task:get_symbol()
that accept {S1, S2, ..., Sn} and return true / a {name -> info} map
if any of the listed symbols fired. Both keep the legacy single-name
form (with optional shadow_result_name) untouched.
Introduce task:has_symbol_regexp(re [, shadow_result_name]) and
task:get_symbol_regexp(re [, shadow_result_name]) that match fired
symbol names against an rspamd_regexp userdata.
Vsevolod Stakhov [Sun, 10 May 2026 09:25:14 +0000 (10:25 +0100)]
[Feature] lua_tcp: phase-specific timeouts and on_error callback
Two opt-in additions to rspamd_tcp.new, motivated by issue #6032 (mx_check
probe shapes — connect-vs-read budget independence and connect-phase error
routing without dummy-queueing a read handler).
A. Phase-specific timeouts.
* New options: connect_timeout, read_timeout, write_timeout. Setting any
of them switches the request to phased mode: each phase gets its own
budget, unset phase fields fall back to `timeout`. The watcher is
re-armed from the appropriate field on every plan_handler_event entry
(LUA_WANT_READ / LUA_WANT_WRITE / LUA_WANT_CONNECT).
* Backwards compat: existing callers passing only `timeout` keep the
current single-deducted-budget contract by construction. A new
`use_deduction` flag gates both the `elapsed` deduction in
lua_tcp_handler and the per-phase reset in plan_handler_event. No call
site changes its observable behaviour unless it actively sets a phase
field.
* Rationale (Option 2 from the issue): lua_tcp underpins every AV scanner
and lualib helper. The HTTP-style "no deduction" alternative would
silently shift their wall-clock from `<= timeout` to `<= N x timeout`;
Option 2 avoids that surprise for one extra bool and one extra branch.
B. on_error callback for connect-phase errors.
* New `on_error(err, conn)` callback fires at most once for failures
that occur before LUA_TCP_FLAG_CONNECTED is set: DNS resolution, socket
creation, connect refused/timeout, SSL handshake. Once the connection
is established, errors continue to flow through the queued read/write
callback unchanged.
* Routing is exclusive: when on_error is set and we are pre-CONNECTED,
the error goes there alone (no queue-walking fanout). One-shot — the
ref is dropped on first fire so subsequent failures fall through to
the regular handler path. SSL handshake errors land here because
LUA_TCP_FLAG_CONNECTED is only set after the handshake completes.
* Pure-probe support: a request with `read = false`, no `data`, and an
on_error/on_connect would previously short-circuit (empty handler
queue -> "no handlers left, finish session" before the dial ever
completed). The constructor now pushes a LUA_WANT_CONNECT marker in
that shape so plan_handler_event arms EV_WRITE; lua_tcp_connect_helper
handles the async case (shift the marker, re-plan, let the empty queue
drive the FINISHED tear-down) — previously it dereferenced cbd->thread
unconditionally and was sync-only.
C. Tests (test/functional/lua/tcp.lua + cases/230_tcp.robot).
* PHASED_TIMEOUT_TEST — phased timeouts on the success path emit
PHASED_TCP_OK.
* ON_ERROR_REFUSED_TEST — connect to closed port 1, no read/data; only
the on_error callback fires (regular callback must not).
* ON_ERROR_POST_CONNECT_TEST — connect succeeds against dummy_http
/timeout, read_timeout=0.5 trips post-CONNECTED; the read callback
receives the timeout, on_error must NOT fire.
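The fallback rule from section A (unset phase fields fall back to `timeout`, and setting any phase field switches the request to phased mode) can be sketched in C as follows. The struct layout, the helper names and the use of a negative value for "unset" are assumptions made for illustration; they do not describe the lua_tcp internals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum tcp_phase { PHASE_CONNECT, PHASE_READ, PHASE_WRITE };

/* A negative value means "unset" in this sketch. */
struct tcp_timeouts {
    double timeout;         /* legacy single budget */
    double connect_timeout;
    double read_timeout;
    double write_timeout;
};

/* Phased mode is opt-in: it is active only if some phase field is set. */
static bool is_phased(const struct tcp_timeouts *t)
{
    assert(t != NULL);

    return t->connect_timeout >= 0.0 || t->read_timeout >= 0.0 ||
           t->write_timeout >= 0.0;
}

/* Budget used to re-arm the watcher for a phase; falls back to `timeout`. */
static double phase_budget(const struct tcp_timeouts *t, enum tcp_phase phase)
{
    assert(t != NULL);

    double v = -1.0;

    switch (phase) {
    case PHASE_CONNECT: v = t->connect_timeout; break;
    case PHASE_READ:    v = t->read_timeout;    break;
    case PHASE_WRITE:   v = t->write_timeout;   break;
    }

    return v >= 0.0 ? v : t->timeout;
}
```

A caller that passes only `timeout` never enters phased mode in this model, which mirrors the backwards-compatibility guarantee described above.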
[Test] upstream: deterministic SRV rate-window test via libev fake clock
Switch rspamd_upstream_fail's rate-window timestamp from
rspamd_get_ticks(FALSE) to a new rspamd_upstream_now_fresh helper that
calls ev_now_update_if_cheap then ev_now. Multiple fail() calls in a
single loop iteration now see fresh times, and tests can drive virtual
time through the libev hook without sleeping.
* rspamd_upstream_now / rspamd_upstream_now_fresh helpers hoisted to
the top of upstream.c with a short comment about why ev_now matters
(loop-cached time = tests can drive it; production correctness wart
of mixed time sources goes away).
* rspamd_upstream_ctx_set_event_loop_for_test: install a loop on
upstream_ctx without going through rspamd_upstreams_library_config
(which needs a full rspamd_config).
* rspamd_test::fake_clock RAII helper installs the libev hook,
advances virtual time, and resyncs the loop on construct/destroy.
The "error budget is per member" SRV test drops g_usleep(1000) and the
error_time = 0.002 s macOS-jitter workaround; uses error_time = 1.0 s,
max_errors = 4, and clk.advance(0.1) between fails. The test runs in
80 ms and is fully deterministic.
[Feature] libev: add fake-clock and time-resync hooks for tests
Three local extensions on top of stock libev:
* ev_set_fake_time_cb / ev_get_fake_time_cb — process-global hook;
when set, replaces both ev_time() and the internal monotonic
clock so timers and ev_now() advance under test control.
* ev_now_resync — force-resync the loop's cached realtime/monotonic
state from the current sources, discarding interpolation. Required
after installing or removing a fake clock; also useful after any
other large clock discontinuity.
Default cb is NULL, so production cost is one predicted-false branch
in each clock read.
Local style follows libev's (GNU-ish, two-space, space-before-paren),
not the rspamd tree style — bypassing clang-format here intentionally.
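The general shape of a process-global fake-clock hook is sketched below. This is not the libev patch: the signatures of ev_set_fake_time_cb and friends are not shown in this log, so every name here (`clock_set_fake_cb`, `clock_now`, `virtual_clock`, `virtual_advance`) is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef double (*fake_time_cb)(void);

/* NULL in production, so each clock read costs one untaken branch. */
static fake_time_cb g_fake_time_cb = NULL;

static void clock_set_fake_cb(fake_time_cb cb)
{
    g_fake_time_cb = cb;
}

/* Single entry point for "what time is it": fake if hooked, real otherwise. */
static double clock_now(void)
{
    if (g_fake_time_cb != NULL) {
        return g_fake_time_cb();
    }

    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

/* A trivial virtual clock that a test can advance without sleeping. */
static double g_virtual_now = 0.0;

static double virtual_clock(void)
{
    return g_virtual_now;
}

static void virtual_advance(double dt)
{
    assert(dt >= 0.0); /* virtual time never runs backwards in this sketch */
    g_virtual_now += dt;
}
```

A test installs `virtual_clock`, advances it between steps, and removes the hook afterwards; in a real event loop the cached time would also need a resync at both points, as the ev_now_resync note above explains.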
Five lifecycle holes flagged by code review around the new SRV drain
path; addressing all in one commit since they are tightly coupled.
1. Lock-order inversion in rspamd_upstream_srv_apply: locked the
parent then called drain_member / create_member which take
ls->lock. Everywhere else the order is ls -> upstream (set_inactive,
return_tokens). Drop the parent lock entirely — DNS replies and
tests are single-threaded on a given parent so the only mutator of
srv_members is serialized through the event loop anyway. Avoids
deadlock under UPSTREAMS_THREAD_SAFE.
2. Drained SRV members could re-enter alive via half-open probe
completion: rspamd_upstream_ok with half_open_inflight > 0 calls
set_active on a member with active_idx == -1, regardless of
is_draining. An inflight selector that probed before drain and
reported success after drain would silently undo the drain. Fix:
gate the half-open success branch on !is_draining, and clear
half_open_inflight in srv_drain_member as belt-and-braces.
3. dns_cb / update_addrs ignored is_draining. A drained member with
an A/AAAA query in flight would still rebuild addrs.addr after
drain — wasted work, and races the dtor's free(addrs.addr) once
the grace timer fires. Early-return both functions when the
member is draining (in update_addrs, free any pending new_addrs
linked list to avoid a leak on the abandon path).
4. Grace timer ref leak when ctx has an event_loop but is not yet
configured: the original code did REF_RETAIN + ev_timer_init
unconditionally and gated only ev_timer_start on configured.
Without a started timer the retained ref leaks. Fix: gate the
entire REF_RETAIN + timer-arm block on (event_loop && configured).
5. Drained members kept a back-pointer to ls. After
rspamd_upstreams_destroy the ls is freed but the grace timer can
still fire on the member; revive_cb / record_latency /
return_tokens already guard on ls == NULL, so NULL out
member->ls right after the drain bookkeeping is done.
Also fix an inaccurate comment in rspamd_upstream_dtor that claimed
destroy clears srv_members entries before the parent dtor — it does
not. The hash's value-destroy is NULL by design; only keys are freed.
All existing upstream test suites (65 cases, 72k+ assertions) and
the full cxx suite (209 cases) remain green.
Nine doctest cases drive the new SRV-as-multiple-upstreams path
without DNS, via the rspamd_upstream_srv_apply / force_alive_for_test
helpers exposed in upstream_internal.h:
- single-target expansion produces one selectable member
- 3 equal-weight targets distribute uniformly under round-robin
- SRV weight is honoured (100/100/1 ratio holds over many cycles)
- diff add: a new target appears, identity preserved for existing
- diff remove: dropped target drained out of selection
- diff weight change: distribution shifts after re-apply
- error budget is per member (rate threshold on one target leaves
the other two alive — pre-refactor all three would have died)
- per-member latency EWMA records distinct values
- SRV parent is invisible to count and foreach
The error-budget case uses tightened limits (error_time=2ms,
max_errors=1) so the rate threshold fires comfortably above
g_usleep jitter on macOS while the test stays well under a second.
[Feature] upstream: expand each SRV target into its own upstream
The previous SRV path collapsed every target's A/AAAA records into a
single struct upstream. SRV weight was dropped on the floor (see the
"contradicts with upstreams logic" comment that has been there since
forever), the 4-errors-in-10s budget was shared across the whole
cluster, and modern selection algorithms (P2C, token bucket, ring
hash, slow start, latency EWMA) had nothing to choose from since
they operate at the upstream level.
Refactor so each SRV reply entry materialises its own struct upstream
member. Members are first-class participants in every rotation
algorithm, with their own error budget, per-target weight, latency
EWMA and address list. The `service=...` config syntax is unchanged.
Lifecycle:
- Parse-time: parent placeholder gets the SRV_RESOLVE flag and a
pre-allocated GHashTable keyed by "fqdn:port".
- DNS callback: convert reply entries to plain rspamd_upstream_srv_entry
and call the new common rspamd_upstream_srv_apply, which diffs the
snapshot against the parent's member set.
- New target: create member in PENDING_RESOLVE state, kick off A/AAAA;
the existing promote_pending machinery moves it into `alive` once
addresses arrive.
- Existing target: refresh weight/priority, re-resolve A/AAAA.
- Dropped target: graceful drain — pull from `alive`, fire OFFLINE
watcher, restore token bucket inflight, remove from ls->ups, then
arm a one-shot revive_time timer (reusing revive_cb's is_draining
short-circuit) as a grace window for inflight selectors. With no
event loop the drain is synchronous.
Bookkeeping: SRV parents are invisible to rspamd_upstreams_count,
rspamd_upstreams_foreach, and the probe-mode iterator — they're not
selectable upstreams. set_active and resolve_addrs short-circuit on
the SRV_RESOLVE flag so the parent only owns the lazy-resolve timer.
Out of scope (follow-ups): RFC 2782 priority-tier failover (we record
srv_priority but don't filter selection by it) and adapting addr_next
callers like fuzzy_check to retry across members via get_except.
Internal API for tests lives in upstream_internal.h.
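To make the per-member error budget concrete, here is a simplified fixed-window counter. It is an assumption-laden sketch, not the upstream.c rate logic: `struct member_budget` and `member_fail` are hypothetical, and the real threshold rule may differ from the plain "max_errors failures inside one error_time window" used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One instance per SRV member, so failures on one target never count
 * against its siblings. */
struct member_budget {
    double window_start;
    unsigned errors;
};

/*
 * Record a failure at time `now` (seconds). Returns true once
 * max_errors failures have landed inside a single error_time window,
 * i.e. when this member alone should be marked dead.
 */
static bool member_fail(struct member_budget *m, double now,
                        double error_time, unsigned max_errors)
{
    assert(m != NULL && error_time > 0.0);

    if (m->errors == 0 || now - m->window_start > error_time) {
        /* First failure, or the previous window expired: start over. */
        m->window_start = now;
        m->errors = 1;
    }
    else {
        m->errors++;
    }

    return m->errors >= max_errors;
}
```

With a 4-errors-in-10s budget, four quick failures on one member trip only that member; a sibling with its own `struct member_budget` keeps a clean count.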
[Fix] Handle DKIM permfail in Authentication-Results header
When a DKIM signature has an invalid record, task:get_dkim_results() returns
'permfail' which should map to dkim=permerror in the Authentication-Results
header. Previously this result fell through to dkim=none, which is incorrect
when a DKIM signature is present.
[Feature] elastic: log Reply-To, received IPs, URL metadata, and pre-result module (#6018)
* [Feature] elastic: log Reply-To, received IPs, URL metadata, and pre-result module
- reply_to_user / reply_to_domain: parsed from Reply-To via
rspamd_util.parse_mail_address, mirroring the from / mime_from split.
- received_ips: list of IPs from Received headers
- urls and urls_cta with the new collect_urls config block: per-URL
records {url, etld, host, protocol, flags, count} plus aggregate
metrics {total, unique, max_repeats, repeat_ratio}. CTA URLs are
collected via text_part:get_cta_urls({original=true}) and walked via
:get_redirected so url_redirector-resolved hops are captured, then
either kept inline at the top of urls (sorted ahead of non-CTA so
they survive max_urls truncation) or emitted into a dedicated
urls_cta when separate_cta is on
- action_forced: the module name from task:has_pre_result(), so logs
show which prefilter short-circuited the pipeline (or 'no force').
Renames get_received_delay to get_received_info (returns delay + ips
in one pass over the received chain) and replaces the local
merge_settings helper with lua_util.override_defaults — the two are
functionally equivalent recursive deep-merges, but override_defaults
is the project-wide maintained helper.
Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] elastic: reset queue counters on pop drain so the indices do not accumulate monotonically over the worker's lifetime
- Drop tostring() around url:get_text() (already a Lua string) in
url_to_record and url_key.
- Drop tostring() around url:get_flags_num() (.. coerces numbers).
- Replace tostring(url) in CTA dedup key with url:get_text() to avoid
the __tostring metamethod's percent-encoding two-pass walk.
- Drop `or nil` no-op after url:get_redirected().
- Cache url:get_host() once in url_to_record (was called twice).
- Remove dead `if on then` guard on url:get_flags() — only set bits
are inserted, so every value is true.
- Cache tostring(real_ip) in get_received_info and tostring(ip_addr) /
tostring(origin_ip) in get_general_metadata; refactor to one call.
- In build_urls_metadata, compute url_key(u, false) once per URL and
reuse for the CTA lookup; only recompute when full_urls is true.
- Drop sort=true from task:get_urls() — the C-level qsort doesn't
survive: results are rehashed for dedup and re-sorted by count.
Also remove the misleading "deterministic order, stable dedup"
comment (table.sort is unstable in standard Lua).
Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] elastic: drop dead `or {}` after task:get_urls() and other functions that always return a table
- Persist the live-resolved chain when http_walk splices into step (the terminal path now calls finalize_chain instead of only apply_redirect_chain). Previously the new link from orig_url to a cached redir_url was never written, leaving the 'processing' marker at hash(orig_url) until its 13s expiry
- Resume via http_walk on true cache miss mid-walk instead of giving up with a truncated chain (covers expired/evicted downstream links in cached chains)
- Differentiate cache miss from 'processing' lock mid-walk: lock means another worker is resolving (apply partial), miss means cache gone (extend via HTTP)
- Surface redis errors during chain walk via dedicated rspamd_logger.errx
- Add debug log when SET NX lock claim fails (held by another worker or stale processing marker after crash); previously it was a silent drop
- Add debug log 'no URLs matched redirector_hosts_map' at handler exit when message had URLs but selected=0, exposing cold-start window where redirector_hosts_map multimap has not finished loading
- Hoist the per-call finish() closure out of step() into a free
step_finish() helper -- no closure allocation per cache hop.
- Capture tostring(last) once as last_str to avoid re-running the URL
__tostring slow path on each error/debug branch in step()'s redis cb.
- Drop redundant tostring(ndata) in redis_reserve_cb debug log
(rspamd_logger %s handles tostring internally).
- Replace task:get_urls() or {} with task:has_urls() in the no-match
debug branch -- returns (bool, count) without materialising the URL
table, so production traffic doesn't pay for an allocation just to
feed a debug-only log line.
- Move task:has_urls() to the top of url_redirector_handler -- when
the message has no URLs at all, return early and skip the CTA scan
and extract_specific_urls call entirely; reuse the same n_urls in
the no-match debug branch.
Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] url_redirector: clear bridged URL from seen before handing off to http_walk
step() marks seen[val]=true after appending each cached hop. When the
cache-miss mid-walk branch then bridges to http_walk on the same hop,
http_walk re-marks via seen[tostring(url)] and -- since cache writer
stores tostring(url), making val and tostring(rspamd_url.create(val))
round-trip-stable -- collides with step's mark. The cycle guard
false-fires on the bridged URL, truncating the chain and skipping the
live extension. Clear seen[last_str] before the http_walk bridge so
its own marking is the first one for that URL.
* [Chore] url_redirector: remove unneeded guards
Signed-off-by: Dmitriy Alekseev <1865999+dragoangel@users.noreply.github.com>
* [Fix] url_redirector: finalize on http_walk cycle to release the processing lock
The cycle branch called apply_redirect_chain, which only updates the task
(set_redirected, inject_url, insert_result) and never touches Redis.
[Minor] rspamadm control: list commands when none/unknown is given
Previously `rspamadm control` (no args) just printed "command required"
and exited, forcing users to dig through `rspamadm help control` to
discover the available subcommands. List them inline on the no-arg and
unknown-command error paths, matching the help output.
The C parser consults lua_url_filter for every byte of userinfo past
max_email_user (64); the filter previously rejected anything longer
than 2048 bytes, which silently dropped the entire URL. That blanket
length REJECT killed exactly the userinfo-obfuscation phishing pattern
(https://legit.com<lots-of-spaces>@evil.com/...) the parser is meant
to surface.
Raise the catastrophic-length REJECT to 16 KiB (still well under the
parser's own G_MAXUINT16/2 cap) and have parse_user mark the URL as
RSPAMD_URL_FLAG_OBSCURED | RSPAMD_URL_FLAG_HAS_USER as soon as the
userinfo crosses 64 bytes, regardless of the filter verdict, so
downstream rules can act on the obfuscation signal.
[Test] Functional test for lua_extras two-phase loader
Adds Functional.Cases.001_Merged.271_Lua_Extras with two cases:
* the deferred-selector regexp fires when the From-domain is present in
the map captured by the selector factory;
* the same regexp stays silent when the From-domain is absent.
The companion lua_extras_test.lua stages a tree under TMPDIR with maps,
selectors and regexps subdirectories, then calls lua_extras.load_extras
on it. The selector entry is wrapped in lua_extras.deferred so the
factory captures rspamd_maps[name] at registration time, exercising the
maps -> selectors -> regexps phase 2 ordering and the re_selector
auto-binding into the regexp DSL.
Also wires the new lua file into merged.conf alongside selector_test.lua.
[Feature] lua_extras: two-phase loader for cross-kind dependencies
Refactor the structured custom-lua loader to a two-phase model so a selector
can consume entries registered by an earlier kind (typically a map, or a
precompiled rspamd_regexp built from map data) at definition time, not just
at task time.
Phase 1 globs every lua.local.d/{maps,selectors,regexps}/*.lua file and
collects each returned { name = def } entry into a per-kind staging buffer.
Phase 2 walks the kinds in dependency order (maps -> selectors -> regexps),
resolving and registering each entry. Entries that need late binding wrap
their definition in lua_extras.deferred(factory_fn); the loader invokes the
factory during phase 2 with the live cfg and uses the returned table as the
concrete definition.
Adds an optional re_selector field on selector defs which, when set, also
calls cfg:register_re_selector() so the selector becomes usable inside the
regexp DSL via name=/regex/{selector}.
The new lua_extras.load_extras(cfg, base_dir) entry point replaces the
per-kind loop in rules/rspamd.lua. lua_extras.load_dir is kept for callers
that only need a single kind.
Verified end-to-end: a selector that captures rspamd_maps[name] inside a
deferred factory and surfaces a regexp symbol via re_selector fires exactly
when the From-domain is present in the captured map, and stays silent
otherwise.
Add lualib/lua_extras with register_selector / register_map / register_regexp
helpers and a load_dir(cfg, dir, kind) directory loader. rules/rspamd.lua now
loads $LOCAL_CONFDIR/lua.local.d/{selectors,maps,regexps}/*.lua before
rspamd.local.lua, where each file returns a { name = def } table whose entries
are dispatched to the matching helper.
This lets distributions and add-ons ship custom selectors, maps and regexp
rules in well-typed files without touching rspamd.local.lua, which end users
may heavily modify. Existing free-form lua.local.d/*.lua at the root keeps
working unchanged. Errors in any single file are logged and skipped, never
aborting startup. Maps registered through the helper are stored in the global
rspamd_maps table, matching the existing lua_maps pattern.
Includes example.lua.example files in each subdirectory documenting the
expected file contract.
[Fix] elastic: use Queue:new() instead of non-existent lua_util.newdeque()
The 10x row-limit overflow guard called lua_util.newdeque(), which does
not exist, leaving buffer['logs'] as nil and causing subsequent operations
to fail. Reset the buffer using the local Queue class, matching how it is
initialized.
[Fix] url_redirector tests: resolve timing issues and simplify test suite
- Fix variable syntax error in test suite 164
- Convert test messages to HTML format
- Simplify test suites to avoid async timing issues
- Use basic config for reliable test execution
- Add missing MESSAGE variable definitions
- All 30 functional tests now pass reliably
[Fix] url_redirector tests: fix message format and variable syntax
- Convert test messages to HTML format for proper URL extraction
- Fix variable syntax error in test suite 164
- Ensure chain redirect tests work correctly in CI environment
[Minor] memstat: short, sort, and per-section toggle flags
Mirror fuzzy_stat ergonomics in lualib/rspamadm/memstat.lua:
- --short: only the per-worker summary table, no detail sections.
- --sort {rss,lua,mempool,jemalloc,pid}: order the summary table
by the chosen field (descending; pid stays ascending).
- --no-process / --no-mempool / --no-callsites / --no-lua /
--no-jemalloc: skip individual detail sections.
The compact and linted JSON output formats are already exposed via
the rspamadm-level -c / -j flags (the Lua subroutine is bypassed for
those modes), so no C-side change is needed.
The aggregate mempool counters live in a MAP_SHARED mmap created in
rspamd_main before fork, so every worker reads and increments the same
physical page. Reporting that value per-worker made every row identical
(449.4M in a 28-worker test) and the "total" row counted it N times.
Mirror each shared-counter write into a process-local rspamd_mempool_stat_t
in BSS (which fork duplicates) and expose it via rspamd_mempool_stat_local().
Switch the memstat collector to use the local view so per-worker numbers
diverge and the total is meaningful. The original rspamd_mempool_stat()
keeps the shared semantics for /stat back-compat.
[Feature] rspamadm: add memstat command and pretty-printer
Add the memstat (alias mem_stat) subcommand to rspamadm control: the
help text gains a new entry, the command name maps to /memstat, and
the response is fed through lualib/rspamadm/memstat.lua for table
output. The Lua module supports --top, --no-callsites, --no-jemalloc
and -n (raw numbers); JSON / compact JSON modes still bypass the
formatter as for other commands.
Introduce src/libserver/memory_stat.{cxx,h} that gathers a UCL dump for
a worker process: OS-level RSS/VmSize breakdown, mempool aggregate plus
per-callsite suggestions, Lua heap usage, and (when WITH_JEMALLOC is
defined) jemalloc mallctl counters and the textual malloc_stats_print
dump. The document is serialized to a tempfile and the descriptor is
passed back over the control pipe with SCM_RIGHTS, mirroring the
existing fuzzy_stat pattern.
Wire the collector into rspamd_control_default_cmd_handler so any
worker registered with the default control handlers transparently
answers RSPAMD_CONTROL_MEMORY_STAT without per-worker boilerplate.
[Feature] rspamd_control: wire /memstat command and reply union
Add RSPAMD_CONTROL_MEMORY_STAT to the enum, a fixed-size summary slot
in the cmd/reply unions (status, rss_kb, lua_kb, mempool_bytes,
jemalloc_allocated), the /memstat URL mapping, and the per-worker UCL
emission and totals aggregation in rspamd_control_write_reply().
The actual collector and the dispatch through default_cmd_handler are
introduced in the following commit; with this change in isolation the
command is reachable end-to-end but returns only zero summaries.
Add rspamd_lua_get_memory_used() that combines LUA_GCCOUNT and
LUA_GCCOUNTB into a byte count. Used by the memstat control command;
also a convenient single entry point for any future per-worker Lua
heap diagnostics.
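The arithmetic behind the combination is small: in the Lua C API, LUA_GCCOUNT reports the heap size in KiB and LUA_GCCOUNTB the remainder in bytes. A Python sketch of the same calculation (the function name is hypothetical, and the two arguments stand for the values the C calls would return):

```python
def lua_memory_used(gccount_kib: int, gccountb: int) -> int:
    """Combine the KiB count and the sub-KiB remainder into a byte count."""
    return gccount_kib * 1024 + gccountb


used = lua_memory_used(3, 512)
```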
Add struct rspamd_proc_mem_info and rspamd_get_process_memory_info()
that fills it in from OS-specific sources: /proc/self/status on Linux
(VmSize/VmRSS/VmData/RssAnon/etc.), task_info(MACH_TASK_BASIC_INFO) on
macOS, and getrusage(RUSAGE_SELF) as a portable fallback. Will be used
by the memstat control command to expose worker-process footprint.
Add rspamd_mempool_entry_stat_t and rspamd_mempool_entries_foreach() so
callers can introspect the per-location mempool registry (suggestion,
preallocated counts, average fragmentation/leftover) without reaching
into mem_pool_internal.h. Used by the upcoming memstat control command.
[Fix] lua_redis: add prepare_redis_setup for rspamadm tools
The Sentinel watcher in lualib/lua_redis.lua is registered via
rspamd_config:add_on_load, but those callbacks are only fired by
rspamd_lua_run_postloads, which is invoked from worker.c, controller.c,
fuzzy_storage.c, and rspamd_proxy.c — never from rspamadm. Standalone
rspamadm tools (rspamadm dmarc_report etc.) therefore never resolve the
current Redis master and end up round-robining writes across all nodes,
which breaks under Sentinel: writes that land on a replica fail with
READONLY and the tool silently produces empty results (#6009).
Introduce lua_redis.prepare_redis_setup(redis_params, opts, callback) as
a one-shot synchronous initializer for rspamadm-style tools, where on_load
callbacks never run and we don't want background periodics. It performs,
per opts (merged via lua_util.override_defaults):
* sentinels = true: query SENTINEL masters / SENTINEL slaves via
rspamd_redis.connect_sync and rewrite redis_params.read_servers /
redis_params.write_servers in place.
* scripts = true | false | { id, ... }: SCRIPT LOAD all (or selected)
scripts registered against this redis_params via add_redis_script.
* timeout / ev_base / session / config: IO knobs; ev_base and session
default to rspamadm_ev_base / rspamadm_session.
The callback is invoked as callback(err) — nil on success.
Wire dmarc_report through the new helper so writes after the initial
RENAME land on the actual master under Sentinel.
[Fix] dmarc: floor connect timestamp before os.date for PUC Lua
task:get_date returns a fractional double; PUC-Rio Lua 5.3+ rejects
non-integer floats as the second argument to os.date with "number has
no integer representation". LuaJIT accepts it, so the bug only fires
on the Fedora CI build.
[Minor] lua_upstream: pack acquired/retired into bitfields
Two gboolean (gint) fields cost 8 padded bytes plus alignment per
wrapper. Each wrapper only needs two bits, so use unsigned:1
bitfields instead. struct rspamd_lua_upstream shrinks from 24 to
16 bytes on 64-bit targets.
[Fix] lua_upstream: retire inflight on __gc when caller forgets ok/fail
Lua plugin code can drop a get_upstream_*() wrapper without ever
calling :ok or :fail (e.g. when an async callback never fires or is
written incorrectly). Without retirement, the C-side inflight counter
introduced for P2C scoring leaks indefinitely and biases selection
away from the affected upstream.
Add acquired/retired bookkeeping on the Lua wrapper:
- lua_push_upstream() takes an explicit acquired flag. The three
get_upstream_* bindings pass TRUE; all_upstreams() inserter passes
FALSE since it returns a view, not a fresh inflight reference.
- The watcher path inlines lua_newuserdata; explicitly zero the new
fields there so uninitialised stack memory doesn't trigger spurious
retire calls.
- :ok and :fail set retired = TRUE so the destructor doesn't retire a
second time when the caller did pair the calls properly.
- The __gc destructor calls rspamd_upstream_release when
acquired && !retired, decrementing inflight without affecting error
counts or latency.
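The bookkeeping above can be modelled as follows. This is a toy Python sketch, not the C binding: `Upstream`, `Wrapper`, `get_upstream`, `view` and the explicit `finalize()` (standing in for __gc) are hypothetical names, and the way fail() bumps the error count is an assumption.

```python
class Upstream:
    """Toy stand-in for the C upstream: only the counters the text mentions."""

    def __init__(self):
        self.inflight = 0
        self.errors = 0


class Wrapper:
    def __init__(self, up, acquired):
        self.up = up
        self.acquired = acquired  # True only for get_upstream_* results
        self.retired = False

    def ok(self):
        self.up.inflight -= 1
        self.up.errors = 0
        self.retired = True

    def fail(self):
        self.up.inflight -= 1
        self.up.errors += 1  # assumed error accounting
        self.retired = True

    def finalize(self):
        """Stand-in for __gc: release only an acquired, unretired wrapper."""
        if self.acquired and not self.retired:
            self.up.inflight -= 1  # release: errors and latency untouched
            self.retired = True


def get_upstream(up):
    up.inflight += 1  # selection takes an inflight reference
    return Wrapper(up, acquired=True)


def view(up):
    return Wrapper(up, acquired=False)  # all_upstreams()-style view
```

A wrapper that was paired with ok() or fail(), an abandoned wrapper, and a view all leave the inflight counter balanced once finalize() has run.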
Lua GC is non-deterministic, so retirement may lag for some time;
that's acceptable noise for a load comparator and strictly better
than an unbounded leak.
Tests in test/lua/unit/upstream.lua cover smoke-level API usage,
the abandoned-wrapper path, view safety from all_upstreams(), and
double-retirement protection.
[Fix] upstream: add release() for non-success/failure paths
The new inflight counter introduced for P2C exposed several pre-existing
leaks where a get_* selection had no matching ok()/fail() call. ok() was
unsuitable as a generic retire because it also clears the error count.
Add rspamd_upstream_release() — decrement inflight without touching
errors, latency, or watchers — and apply at four call sites:
- rspamd_proxy.c mirror loop: copy_msg failure after upstream selection
- rspamd_proxy.c master loop: copy_msg failure after upstream selection
- fuzzy_check.c PING: fire-and-forget address lookup
- http_connection.c proxy: hand-off path where new_common drops the
upstream pointer (per-request tracking left for a follow-up)
Two more leak classes remain for separate PRs: Lua-side retire fallback
via __gc, and librdns retransmit/select pairing in dns.c.
Track an exponentially-weighted moving average of per-request latency
on each upstream, with a configurable half-life (default 60s) so older
samples decay and a once-slow-now-recovered backend isn't permanently
penalised. Updates are time-weighted: alpha = 1 - exp(-dt/tau) where
tau = half_life / ln(2). Setting half_life to 0 falls back to a flat
moving average where every sample has equal weight.
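The update rule can be sketched as follows. This is a minimal Python model under stated assumptions, not the C implementation: the class name, the `now` argument and the way the first sample seeds the average are hypothetical.

```python
import math


class LatencyEwma:
    def __init__(self, half_life=60.0):
        self.half_life = half_life
        self.value = None        # no sample recorded yet
        self.samples = 0
        self.last_update = None

    def record(self, seconds, now):
        """Fold one latency sample, observed at time `now`, into the average."""
        self.samples += 1
        if self.value is None:
            self.value = seconds  # assumed: first sample seeds the average
        elif self.half_life <= 0:
            # Flat moving average: every sample has equal weight.
            self.value += (seconds - self.value) / self.samples
        else:
            dt = max(0.0, now - self.last_update)
            tau = self.half_life / math.log(2)
            alpha = 1.0 - math.exp(-dt / tau)
            self.value += alpha * (seconds - self.value)
        self.last_update = now
        return self.value
```

With these formulas, a sample arriving exactly one half-life after the previous one gets alpha = 0.5, so the old average and the new sample contribute equally.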
Wire it into the P2C load score:
score = latency * (inflight + 1) + errors * 5 * latency
when at least one sample exists; fall back to the existing
inflight + errors*2 form otherwise. This is a lightweight approximation
of PeakEWMA — a slow backend with low load loses to a fast one with
comparable load, but a fast backend can still lose if it gets too busy.
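A worked instance of the two score forms, as a Python sketch (the function name and the use of None for "no sample yet" are assumptions):

```python
def p2c_score(inflight, errors, latency=None):
    """Lower is better. `latency` is the EWMA in seconds, or None if unsampled."""
    if latency is not None:
        return latency * (inflight + 1) + errors * 5 * latency
    return inflight + errors * 2  # fallback form without latency data


slow_idle = p2c_score(inflight=0, errors=0, latency=0.2)
fast_loaded = p2c_score(inflight=2, errors=0, latency=0.05)
fast_busy = p2c_score(inflight=5, errors=0, latency=0.05)
```

Here the fast backend with two requests in flight still scores lower than the idle slow one, but with five in flight it scores higher and loses the comparison.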
New public API:
rspamd_upstream_record_latency(up, seconds)
rspamd_upstream_get_latency(up)
rspamd_upstreams_set_latency_half_life(ups, seconds)
Callers opt in by recording observed RTT alongside their existing
ok()/fail() calls. The score function falls back gracefully to Phase 1
behaviour for upstream lists where no caller has wired up sampling
yet, so this commit is a no-op for current users.
Newly revived upstreams previously rejoined the alive list at full
weight, producing a thundering herd that would land on a backend that
just came back up and was still warming caches/connection pools — the
same backend that had been failing minutes before. This often caused
immediate re-failure and a flap loop.
Add an opt-in slow_start_ms window (default 0 = disabled) configurable
via rspamd_upstreams_set_slow_start. While the window is open, both
round-robin (effective weight = weight * factor) and P2C (effective
load score = base / factor + warmup penalty) bias selection away from
the warming upstream linearly over time.
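A linear ramp of this kind could look like the Python sketch below. The function names and the `floor` value are hypothetical (a floor is assumed only so that dividing by the factor stays defined), and the P2C warmup penalty is left out because the text does not give its form.

```python
def slow_start_factor(now_ms, revived_at_ms, slow_start_ms, floor=0.1):
    """Return a factor in (0, 1]; 1.0 means the upstream is fully warmed up."""
    if slow_start_ms <= 0 or revived_at_ms is None:
        return 1.0  # feature disabled, or upstream was never revived
    elapsed = now_ms - revived_at_ms
    if elapsed >= slow_start_ms:
        return 1.0  # window has closed
    return max(floor, elapsed / slow_start_ms)


def effective_weight(weight, factor):
    """Round-robin form from the text: weight * factor."""
    return weight * factor


halfway = slow_start_factor(now_ms=5000, revived_at_ms=0, slow_start_ms=10000)
```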
Hashed (Ketama) selection is intentionally not integrated: scaling vnode
counts during the window would defeat the consistency property that hashed
selection exists for. Token bucket is likewise unaffected: its
inflight-based fairness already handles cold buckets gracefully.
revived_at is set in the two real revive paths: the timer-based
revive_cb and the half-open probe success path in ok(). The initial
add_upstream activation is left unmarked so cold starts after a
config reload aren't artificially throttled.
[Feature] upstream: add Power of Two Choices (P2C) selection
P2C samples two alive upstreams uniformly at random and chooses the
one with the lower load score (inflight + errors*2). It is provably within
a constant factor of optimal max-load and is the modern default for
load-aware random selection (Envoy LEAST_REQUEST, Finagle, NGINX
random two least_conn).
A passive in-flight counter on struct upstream is incremented on every
selection in get_common and in get_token_bucket, decremented in ok()
and fail(); the existing caller contract (every get pairs with one
ok or fail) is preserved without any new public API.
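The selection and the passive counter can be sketched in Python as follows. This is a model, not the C code: the class, the function names and the tie-breaking rule are assumptions.

```python
import random


class Upstream:
    def __init__(self, name):
        self.name = name
        self.inflight = 0
        self.errors = 0

    def load_score(self):
        return self.inflight + self.errors * 2


def p2c_select(alive, rng=random):
    """Sample two distinct alive upstreams, keep the less loaded one."""
    if not alive:
        return None
    if len(alive) == 1:
        chosen = alive[0]
    else:
        a, b = rng.sample(alive, 2)
        chosen = a if a.load_score() <= b.load_score() else b  # ties: first
    chosen.inflight += 1  # paired with a decrement in ok()/fail()
    return chosen


def ok(up):
    up.inflight -= 1
    up.errors = 0


def fail(up):
    up.inflight -= 1
    up.errors += 1
```

With only two upstreams both are always sampled, so the idle one wins every comparison until its own inflight count catches up.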
RSPAMD_UPSTREAM_RANDOM callers are silently upgraded to P2C since it
strictly dominates uniform random with no extra cost. The token-bucket
fallback when message size is unavailable also uses P2C now.
Tests: new upstream_p2c suite (7 cases, 800+ assertions) covers
single-upstream cases, the silent RANDOM upgrade, load-aware bias
toward idle upstreams, and balanced inflight tracking under mixed
ok/fail outcomes.
[Fix] upstream: drop pool-less branch in set_token_bucket
The fallback that g_malloc'd a fresh limits struct when no pool was
available leaked it on the next call and on destroy. The function is
only ever invoked with a real ctx; assert that explicitly. Also keep
the new refill rate proportional to max_tokens when it's overridden,
so users tuning the bucket size don't get a stale default refill.
[Fix] upstream: lazy time-based refill for token bucket
return_tokens with success=false decremented inflight but never
returned tokens to available_tokens, so a flapping upstream's bucket
drained monotonically toward zero and never recovered. Selection
then permanently fell into the least-inflight fallback path,
defeating the cost signal.
Add a real refill rate (token_bucket_refill_per_s, default = max/60
so a quiet bucket fully regenerates in 60s of wall time). Call lazy
refill from get_token_bucket and return_tokens; failure no longer
permanently penalises the bucket. Within-tick test workloads see dt
small enough that floor(dt * rate) == 0, so existing assertions are
unaffected.
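A lazy refill of this shape can be sketched in Python as below. The class and field names are hypothetical, and leaving the timestamp untouched when no whole token was gained is a choice of this sketch, not something the text specifies.

```python
import math


class TokenBucket:
    def __init__(self, max_tokens, now, refill_per_s=None):
        self.max_tokens = max_tokens
        self.available = max_tokens
        # Default from the text: a drained bucket regenerates in 60 s.
        self.refill_per_s = (
            max_tokens / 60.0 if refill_per_s is None else refill_per_s
        )
        self.last_refill = now

    def refill(self, now):
        """Called lazily from the get/return paths; no timer involved."""
        gained = math.floor((now - self.last_refill) * self.refill_per_s)
        if gained > 0:
            self.available = min(self.max_tokens, self.available + gained)
            self.last_refill = now
        return self.available


bucket = TokenBucket(max_tokens=600, now=0.0)
bucket.available = 0                  # drained by failures
within_tick = bucket.refill(0.01)     # floor(0.01 * 10) == 0
after_30s = bucket.refill(30.0)
after_90s = bucket.refill(90.0)       # clamped at max_tokens
```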
[Rework] upstream: drop token-bucket heap, use flat scan
The intrusive min-heap stored entries by value; swim/sink swaps
mutated the slot pointer's contents, so up->heap_idx went stale after
every update. The cache-miss workaround was a linear scan, making
each get/return effectively O(n) anyway. Alive sets are typically
2-10 upstreams, where a flat scan is faster in practice than a heap
with by-value repair.
Replaces the heap with a single pass over alive[] that tracks both
the lowest-inflight eligible upstream and the absolute least-loaded
one as a fallback for the exhausted-bucket case. Removes
upstream_token_heap_entry, the RSPAMD_HEAP_DECLARE, three helper
functions, the heap_idx field on struct upstream, and the
token_bucket_initialized/token_heap fields on struct upstream_list.
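The single pass can be sketched in Python as follows. The dict fields and the eligibility rule (enough tokens for the requested cost) are assumptions made for illustration; the C code operates on struct upstream.

```python
def pick_token_bucket(alive, cost):
    """One pass over alive[]: track the best eligible and the fallback."""
    best_eligible = None   # lowest inflight among upstreams with enough tokens
    least_loaded = None    # lowest inflight overall, for exhausted buckets
    for up in alive:
        if least_loaded is None or up["inflight"] < least_loaded["inflight"]:
            least_loaded = up
        if up["tokens"] >= cost and (
            best_eligible is None or up["inflight"] < best_eligible["inflight"]
        ):
            best_eligible = up
    return best_eligible if best_eligible is not None else least_loaded


alive = [
    {"name": "a", "tokens": 0, "inflight": 1},
    {"name": "b", "tokens": 50, "inflight": 4},
    {"name": "c", "tokens": 80, "inflight": 2},
]
normal = pick_token_bucket(alive, cost=10)
exhausted = pick_token_bucket(alive, cost=100)
```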
[Fix] upstream: preserve backoff for pending-resolve
set_active stopped the timer and re-armed at INITIAL_DELAY (~1s),
discarding the exponential backoff lazy_resolve_cb had accumulated.
Snapshot ev.repeat before stopping and reuse it when the upstream is
still PENDING_RESOLVE so repeated DNS failures actually back off.
[Fix] upstream: bail out of get_random when only candidate is excluded
rspamd_upstream_get_random looped forever when alive->len == 1 and the
single survivor matched the 'except' argument. Front-gate the empty and
single-survivor cases explicitly; the unbounded loop only runs for
n >= 2 where it is guaranteed to terminate.
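The guard structure can be sketched in Python as below. Returning None when the only survivor is excluded is an assumption about what "bail out" means here, and the function signature is hypothetical.

```python
import random


def get_random(alive, exclude=None, rng=random):
    if not alive:
        return None
    if len(alive) == 1:
        # Single survivor: never enter the retry loop.
        return None if alive[0] == exclude else alive[0]
    while True:
        # With two or more entries at least one differs from `exclude`,
        # so a random draw eventually hits it.
        candidate = alive[rng.randrange(len(alive))]
        if candidate != exclude:
            return candidate
```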
[CritFix] mime_parser: avoid NULL deref on SMIME with empty pkcs7-data
When an S/MIME signed message wraps an inner pkcs7-data with a zero-length
OCTET STRING, the SMIME inner-content extraction in rspamd_mime_parse_normal_part
allocated a zero-length buffer and recursed into rspamd_mime_process_multipart_node
with start/end pointing at NULL (g_malloc(0) returns NULL under always_malloc
mempool mode), causing a SIGSEGV at the first byte check.
Fix:
- Skip the SMIME inner recursion when the encapsulated OCTET STRING is empty
or has a NULL data pointer.
- Add a defensive guard at the top of rspamd_mime_process_multipart_node to
return RSPAMD_MIME_PARSE_NO_PART for NULL or empty buffers, protecting any
other caller from the same UB.
Add a Lua regression test that exercises the SMIME-empty path through
rspamd_message_parse. With VALGRIND=1 (forcing always_malloc) the test
reliably reproduced the crash before the fix.
[Fix] Honor mime_utf8 option in INVALID_MSGID rule
Two related issues caused INVALID_MSGID false positives on valid
EAI/SMTPUTF8 Message-IDs (RFC 6532):
* The sane_msgid regexp unconditionally rejected bytes \x80-\xff,
even when mime_utf8 was enabled. Relax the regexp in that case
while keeping structural checks intact.
* The configuration option was registered only as enable_mime_utf,
but the corresponding Lua API is rspamd_config:is_mime_utf8(),
so users naturally try enable_mime_utf8. That spelling silently
had no effect because the parser did not bind it to any field.
Register enable_mime_utf8 as an alias mapped to the same struct
field so configs using it actually take effect.
Add a functional test (configs/mid_utf8.conf, messages/mid_eai_utf8.eml,
cases/107_mid_utf8.robot) that exercises both fixes via the new
option name and verifies that structurally invalid Message-IDs are
still flagged.
fix(rspamadm/vault): write formatted output to stdout directly (#6005)
Closes #6005.
`rspamadm vault list` produced completely empty output (no stdout,
no stderr, exit code 0) when the Vault held 356+ DKIM entries.
Deleting one entry made it work again.
Root cause: `maybe_print_vault_data` passed the formatted payload
through `printf`, which calls `rspamd_logger.slog(fmt, ...)`. slog
treats its first argument as a format string. When the formatted
UCL/JSON body contained anything slog interprets as a format
specifier (`%` characters in keys, escaped strings, etc.) — or
simply exceeded slog's internal buffer — the output was silently
dropped and the user saw nothing.
Every other handler (`show`, etc.) goes through the same path; they
worked only because their payloads were smaller and didn't trigger the
silent-drop edge case.
Write the formatted payload to stdout directly via `io.write`. No
format-string interpretation, no buffer limit, no surprise. Append
a trailing newline only when the formatted output didn't already
end with one (UCL output usually does).
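The write path described here amounts to the following, shown as a Python sketch of the same idea (the actual change uses Lua's `io.write`; the function name and the `out` parameter are hypothetical):

```python
import io
import sys


def write_payload(payload, out=sys.stdout):
    """Emit the payload verbatim; '%' sequences are never interpreted."""
    out.write(payload)
    if not payload.endswith("\n"):
        out.write("\n")  # add a newline only when one is missing


buf = io.StringIO()
write_payload('{"selector": "100%s"}', out=buf)
write_payload("already terminated\n", out=buf)
```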
The empty table caused a spurious warning in lua_maps when no
whitelist was configured. Since settings.whitelist defaults to nil,
the else-branch was a no-op. User-configured whitelists via
link_affiliation { whitelist = ... } continue to work as before.
[Minor] lua_maps: handle empty table as static empty map
When map_add_from_ucl received an empty Lua table, it fell through
to the C map infrastructure, which logged a spurious error-level
message with no map name. Return a lightweight empty map object
directly in Lua, cache it for consistency with other code paths,
and log a warning since an empty table is likely a misconfiguration.
[Minor] maps: include map description in load error messages
Without a map description in the log, users had no way to identify
which map triggered the error, forcing unnecessary investigation.
All 'no urls to be loaded' and 'invalid type' error sites in
rspamd_map_add_from_ucl now include the description; rspamd_printf
handles NULL safely.
[Fix] upstream consumers: make NULL/nil branches sound
A NULL guard is only useful if the branch behind it logs the failure,
propagates it correctly to the caller, and leaves internal state
consistent. Re-audited every NULL/nil-upstream branch (pre-existing
and newly added by this branch) and tightened the silent or
state-corrupting ones:
* fuzzy_backend_redis: the three rspamd_upstream_get NULL branches in
read / count / version paths invoked the caller's callback with an
empty result and returned silently. Admins had no signal that fuzzy
was being skipped because every backend was dead or pending DNS.
Each branch now also logs the reason via msg_err_redis_session.
* libserver/http_connection.c: when ctx->http_proxies is configured
but every proxy upstream is unavailable, the code silently fell
back to a direct connection - a security/privacy footgun for
configs that meant to force traffic through a proxy. Added an
msg_info to surface the fallback so the admin notices.
* lua_redis prepare_redis_call: the previous patch in this branch
marked skipped servers as "tempfail" but did not insert a
placeholder into `options`, so the load_script_task /
load_script_taskless consumer loop's iteration index no longer
matched the original servers_ready index. A successful upload to
one server would then write "done" into the wrong slot of
servers_ready (the slot for a different, possibly skipped server),
corrupting the script-load state machine. Insert a `{ skip = true,
upstream = s }` placeholder so the indexes stay aligned, and skip
the placeholder in both consumer loops.
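The index-alignment problem in the lua_redis item can be illustrated with a small Python model. The function names, the dict layout and the "unsent" initial state are assumptions; only "tempfail", "done" and the skip placeholder come from the text.

```python
def build_options(servers):
    """Build servers_ready and options so that index i means the same server."""
    servers_ready, options = [], []
    for s in servers:
        if s["addr"] is None:  # pending or unresolved upstream
            servers_ready.append("tempfail")
            options.append({"skip": True, "upstream": s})  # placeholder
        else:
            servers_ready.append("unsent")  # assumed initial state
            options.append({"skip": False, "upstream": s})
    return servers_ready, options


def upload_all(servers_ready, options):
    """Consumer loop: idx indexes both lists, placeholders are skipped."""
    for idx, opt in enumerate(options):
        if opt["skip"]:
            continue
        servers_ready[idx] = "done"  # pretend SCRIPT LOAD succeeded


servers = [
    {"name": "a", "addr": "10.0.0.1"},
    {"name": "b", "addr": None},
    {"name": "c", "addr": "10.0.0.3"},
]
ready, opts = build_options(servers)
upload_all(ready, opts)
```

Without the placeholder, options would hold two entries and the result for server c would land in slot 1, overwriting the "tempfail" that belongs to server b.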
[Fix] upstream: make addr accessors and all_upstreams pending-safe
The PENDING_RESOLVE upstream state introduced earlier kept pending
entries out of the alive list, but `:all_upstreams()` walks the full
`ups` array and exposes them to Lua callers - which then crashed in
`s:get_addr()` because `rspamd_upstream_addr_next/cur/port` indexed
a NULL `addrs.addr`.
Defensive fix at the C accessor layer:
* rspamd_upstream_addr_next / _cur now return NULL when the upstream
has no addresses (NULL or empty array). This is the safe layer that
every other consumer eventually goes through.
* rspamd_upstream_port returns the parsed `deferred_port` for pending
upstreams (so callers that just want a port get a sensible answer)
and -1 if even that is unknown.
* lua_upstream:get_addr() pushes nil when the C side has no address.
Audit of `:all_upstreams()` callers, all updated to skip pending:
* lua_redis prepare_redis_call (SCRIPT LOAD broadcast): if
`s:get_addr()` is nil, mark the slot as "tempfail" so the next
retry will pick it up once DNS comes back, log, and skip it.
* rspamadm statistics_dump connect_to_upstream: log and return early
before opening a redis connection with a nil host.
* clickhouse plugin check_clickhouse_upstream: skip with an info log
so the periodic check tries again next tick.
The DKIM Vault helper already passes `upstream = ... or nil` to
http.request and lets the HTTP layer fall back to URL-based connect,
which remains the right behaviour.
[Fix] fuzzy_check: handle NULL upstream in lua_ping_storage
fuzzy_lua_ping_storage selected an upstream from rule->read_servers
without checking the result, then dereferenced the NULL pointer in
rspamd_upstream_addr_next(). With the new deferred-DNS upstream layer
this becomes reachable in normal operation (every upstream still
pending), and was already reachable before whenever the alive list
was empty.
Audit of other rspamd_upstream_get / _forced / _except / _token_bucket
call sites in C/C++ (rspamd_proxy.c, libserver/dns.c,
fuzzy_backend_redis.c, http/http_connection.c, libstat http_backend,
the other fuzzy_check sites) confirms they already guard the result
with `if (up)` or a `while (up = ...)` loop; only this site was
unchecked.
Return (false, "no fuzzy storage upstream available for rule X") to
the Lua caller instead of crashing.
[Fix] lua: tolerate nil upstream in transport, plugins, rspamadm
Audit of every Lua caller of upstream_list:get_upstream_round_robin /
:get_upstream_master_slave / :get_upstream_by_hash that is not a
scanner. Each one now reacts to a nil result instead of dereferencing
it and crashing the call site:
* lua_redis.lua: all four selection sites already logged "cannot
select server" but then continued into addr:get_addr() and crashed.
They now `return false, nil, nil` after the log, so callers see a
proper failure. The sentinel watcher tick logs and skips this round.
* lua_maps.lua: the external-map HTTP path logs and invokes the
caller's callback with (false, "no upstream available", 502, ctx)
so map consumers see a normal lookup failure.
* aws_s3.lua: lifts the upstream selection out of the http.request
table so it can warn before letting the HTTP layer fall back to
URL-based connect (the request still goes out).
* clickhouse.lua, elastic.lua, gpt.lua: each get_upstream_round_robin
site now logs and returns from its enclosing function (send,
retention, distro detect, geoip pipeline, index policy/template,
GPT/Ollama model dispatch).
* rspamadm/clickhouse.lua and rspamadm/statistics_dump.lua: print to
stderr and exit / abort the redistribute scan.