Dmitriy Alekseev [Tue, 17 Mar 2026 10:43:51 +0000 (11:43 +0100)]
[Fix] Register redis and http as known hs_helper worker options
The redis and http configuration blocks in the hs_helper worker section
were not registered via rspamd_rcl_register_worker_option, causing
rspamadm configdump to emit "unknown worker attribute: redis" warnings.
The Lua backend reads these blocks at runtime through the full UCL
options object, so they worked correctly despite not being registered.
Add proper RCL registration for both redis and http as ucl_object_t
fields so the config schema recognizes them as valid worker attributes.
Vsevolod Stakhov [Sun, 15 Mar 2026 12:26:17 +0000 (12:26 +0000)]
[Fix] Add PCRE2 complexity checks before JIT compilation
Check compiled pattern size, frame size, and capture count
before calling pcre2_jit_compile to avoid crashes on
pathological patterns. Also set map->map pointer consistently
in lua_config_add_map for all map types.
Vsevolod Stakhov [Sat, 14 Mar 2026 22:22:28 +0000 (22:22 +0000)]
[Fix] Fix Clickhouse column name mismatch: UUID -> TaskUUID
The insert field list used 'UUID' but the actual column is named
'TaskUUID' (as defined in schema and migration 9), causing
NO_SUCH_COLUMN_IN_TABLE errors on insert.
Vsevolod Stakhov [Sat, 14 Mar 2026 22:13:42 +0000 (22:13 +0000)]
[Feature] Add context_augment hook to GPT module
Add a new `context_augment` configuration option that accepts Lua code
returning an async function(task, content, callback). The callback
receives a string that gets injected as additional context into the
LLM prompt alongside existing user/domain and search contexts.
This enables external Lua code to enrich LLM requests with arbitrary
context — e.g., Telegram channel topic and recent messages for
community spam detection.
The augment function runs in parallel with other context fetchers
and supports async operations (Redis, HTTP).
Vsevolod Stakhov [Sat, 14 Mar 2026 20:31:15 +0000 (20:31 +0000)]
[Fix] Fix rspamc neural_learn config fetch and output
- Fix config fetch port: when user specifies the default scan port
(11333), redirect the /plugins/neural/config preflight request to
the controller port (11334) where the endpoint actually lives.
Previously, the config fetch used the user-specified port verbatim,
causing "invalid command" errors on the normal worker.
- Fix misleading output: rspamc_neural_learn_output no longer defaults
to "success = true" when the response lacks an explicit success field.
For scan-based learning (checkv2 path), it now detects the scan
response and shows method = "scan". For missing/error responses,
it correctly reports success = false.
- Add user-visible note when the config fetch fails, explaining the
fallback to scan-based learning.
Vsevolod Stakhov [Fri, 13 Mar 2026 13:58:33 +0000 (13:58 +0000)]
[Test] Add unit tests for caseless table
25 tests covering creation, case-insensitive lookup, key case
preservation, assignment, deletion, has_key, iteration via each(),
to_table conversion, multi-value get_all, and edge cases including
long keys, empty keys, and metamethod isolation.
Vsevolod Stakhov [Fri, 13 Mar 2026 13:54:35 +0000 (13:54 +0000)]
[Feature] Add case-insensitive table type for HTTP headers
Introduce rspamd{caseless_table} userdata type that provides
case-insensitive key lookup while preserving original key case.
HTTP response headers now use this type instead of plain Lua tables,
fixing two issues: headers were forcibly lowercased (mutating original
data) and duplicate headers were silently lost.
Multi-value headers are stored as arrays and can be retrieved via
get_all(key). The __index metamethod returns the first value for
convenience. The type is generic and reusable beyond HTTP.
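
The behaviour described above can be sketched in Python. This is only an illustration of the semantics (case-insensitive lookup, preserved key case, multi-value storage); the class and method layout here is hypothetical and is not the rspamd{caseless_table} implementation.

```python
class CaselessTable:
    """Case-insensitive multi-value mapping that preserves original key case."""

    def __init__(self):
        # lowercased key -> (key as first seen, list of values)
        self._items = {}

    def add(self, key, value):
        entry = self._items.setdefault(key.lower(), (key, []))
        entry[1].append(value)

    def __getitem__(self, key):
        # Mirrors the described __index behaviour: return the first value only.
        return self._items[key.lower()][1][0]

    def get_all(self, key):
        entry = self._items.get(key.lower())
        return list(entry[1]) if entry else []

    def keys(self):
        # Keys are reported in the case they were first stored with.
        return [original for original, _ in self._items.values()]


headers = CaselessTable()
headers.add("Set-Cookie", "a=1")
headers.add("set-cookie", "b=2")
headers.add("Content-Type", "text/plain")
```

Duplicate headers are kept rather than overwritten, and no key is rewritten to lowercase on storage.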
Vsevolod Stakhov [Tue, 10 Mar 2026 17:35:51 +0000 (17:35 +0000)]
[Fix] Fix external neural model merge defects
- merge_weights returns boolean, not ANN object: use ext_ann directly
- Add missing digest/symbols/distance fields for external-only set.ann
- Fix inverted alpha in merge call (alpha meant external weight, not local)
- Add missing newline at EOF in lua_kann.c
The rspamd_fuzzy_tcp_frame payload was sized for v1 encrypted replies
(136 bytes) but v2 encrypted replies are 184 bytes, causing a buffer
overflow when sending multi-flag responses over TCP with encryption.
Use a union to accommodate both v1 and v2 reply sizes.
Add multi-flag delete tests for all TCP transport modes.
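
A minimal sketch of why a union fixes the sizing, using Python's ctypes. The type and field names are hypothetical; only the two sizes (136 and 184 bytes) come from the text above.

```python
import ctypes

V1_ENCRYPTED_REPLY_SIZE = 136  # size stated above for v1 encrypted replies
V2_ENCRYPTED_REPLY_SIZE = 184  # size stated above for v2 encrypted replies


class FramePayload(ctypes.Union):
    # Hypothetical field names; a union is as large as its largest member.
    _fields_ = [
        ("v1", ctypes.c_ubyte * V1_ENCRYPTED_REPLY_SIZE),
        ("v2", ctypes.c_ubyte * V2_ENCRYPTED_REPLY_SIZE),
    ]


payload_size = ctypes.sizeof(FramePayload)
```

Because both members share the same storage, a buffer of this type can hold either reply version without overflowing.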
Rob4226 [Tue, 10 Mar 2026 09:04:36 +0000 (05:04 -0400)]
[Minor] Fix default date description in rspamadm dmarc_report help
Change the help text for the `date` argument of `rspamadm dmarc_report`
from "today" to "yesterday". When the command is run without specifying
a date, it actually processes reports for yesterday, so this update
makes the help message match the command's behavior.
Limit scheduled integration-test runs to rspamd/rspamd while keeping
manual start available in forks. This avoids unnecessary fork cron
runs and reduces noisy CI failures unrelated to upstream.
[Fix] Skip recipient check when no hash found in Redis
When a key is not found in Redis, lua_redis returns a redis.null
userdata (not nil), which is truthy and caused check_recipient()
to be called unconditionally, logging a misleading "no recipients
are matching hash" message despite no hash being stored.
[Feature] Add external pretrained neural model support
This commit adds the ability to load pretrained neural network models
from external sources (HTTP/HTTPS) and merge them with locally trained
weights. Users can receive a pretrained model and fine-tune it with
their own data.
Model format (msgpack with magic "RNM1"):
- magic: format identifier
- version: format version (currently 1)
- model_version: model training version
- providers_digest: must match local providers config
- ann_data: serialized KANN (zstd compressed)
- pca_data: optional PCA matrix
- norm_stats, roc_thresholds: optional metadata
Key changes:
- lualib/lua_neural_external.lua: new module for external model handling
- Model parsing, KANN loading, weight merging via interpolation
- Map-based loading with signature verification support
- Base model storage in Redis for future re-merge
- src/lua/lua_kann.c: Lua bindings for merge_weights and is_compatible
- Neural plugin integration:
- Register external model as callback map at config time
- Apply loaded model to all settings elements
- Automatic update checking via map infrastructure
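
Weight merging via interpolation can be sketched as follows. This is an assumption-laden illustration in Python, not the KANN binding: the function name is hypothetical, the weights are flat lists, and alpha is assumed to be the share given to the external model.

```python
def interpolate_weights(local, external, alpha):
    """Blend two equally shaped weight vectors.

    alpha is assumed to be the external model's share:
    alpha=0 keeps the local weights, alpha=1 takes the external ones.
    """
    if len(local) != len(external):
        raise ValueError("models are not compatible: different weight counts")
    return [(1.0 - alpha) * l + alpha * e for l, e in zip(local, external)]


merged = interpolate_weights([0.0, 1.0, 2.0], [1.0, 1.0, 0.0], alpha=0.25)
```

The length check stands in for a compatibility test between the two networks.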
[Fix] Preserve content flags for injected query URLs
Propagate the parent URL flags when task:inject_url() extracts nested
query URLs. This keeps the content flag on URLs injected from computed
parts such as PDF text, so follow-up query URLs are classified the same
way as the outer injected URL.

Do not suppress URLs from mime_part:get_urls() when the same URL was
already seen in another MIME part. This restores per-part URL visibility
for multipart/alternative messages and keeps text/plain URLs available
even when text/html contains the same links.
[Minor] Skip empty In-Reply-To header in replies check
An empty `In-Reply-To` header value ("") is truthy in Lua,
bypassing the `nil` check. In `replies_check` this caused a
misleading log entry "ignoring reply to as no recipients
are matching hash ". In `replies_check_cookie` it triggered
an unnecessary `decrypt_cookie` call.
[Feature] Add follow_master option for proxy mirror connections
When a mirror has a short timeout to avoid delays from misconfigured
mirrors, the mirror connection gets prematurely terminated if the
upstream takes longer than the mirror timeout. The new follow_master
option ties the mirror's lifetime to the master upstream: the mirror
stays alive while the upstream is processing and is terminated once the
upstream completes or permanently errors out.
Vsevolod Stakhov [Fri, 27 Feb 2026 11:15:36 +0000 (11:15 +0000)]
[Fix] Force recompilation of stale hyperscan classes instead of skipping
When a cached hyperscan blob fails validation during load (stale IDs
pointing to wrong re_class), mark the class with needs_recompile flag.
On subsequent exists_async check in hs_helper, ignore the "exists"
result and proceed with recompilation instead of skipping.
Vsevolod Stakhov [Fri, 27 Feb 2026 11:01:13 +0000 (11:01 +0000)]
[Fix] Do not enable HS cleanup when disable_hyperscan is set
When disable_hyperscan is true, workers skip loading hyperscan databases
and never notify main about known cache files. This caused main to delete
all cached .hs.zst files on exit since none were marked as "known".
Also promote worker hyperscan notification to info log level.
Vsevolod Stakhov [Fri, 27 Feb 2026 10:22:18 +0000 (10:22 +0000)]
[Feature] Per-class deterministic regexp IDs in re_cache
Group regexp IDs by class instead of assigning them globally.
Each class gets a deterministic base_offset in the global array,
and hyperscan stores intra-class IDs (0..M-1). This prevents
adding/removing a regexp in one class from shifting IDs in all
other classes, eliminating stale hyperscan databases and
unnecessary recompilations.
Key changes:
- Sort classes by class_id, regexps within each class by content hash
- Assign contiguous global IDs per class (base_offset + local_index)
- Use class-local regexp count in per-class hash (not global count)
- Hyperscan compile stores intra-class IDs, callback translates back
- Bump blob magic version to reject old format databases
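
The ID assignment scheme above can be sketched in Python. This is an illustration under stated assumptions: SHA-256 of the pattern text stands in for the content hash, and class IDs are plain strings.

```python
import hashlib


def assign_ids(classes):
    """classes maps class_id -> list of regexp patterns.

    Returns {(class_id, pattern): (global_id, local_id)}.
    """
    ids = {}
    base_offset = 0
    for class_id in sorted(classes):
        # Order inside a class depends only on the regexp content.
        ordered = sorted(
            classes[class_id],
            key=lambda p: hashlib.sha256(p.encode()).hexdigest(),
        )
        for local_id, pattern in enumerate(ordered):
            ids[(class_id, pattern)] = (base_offset + local_id, local_id)
        base_offset += len(ordered)
    return ids


def local_ids(ids, class_id):
    return {p: v[1] for (c, p), v in ids.items() if c == class_id}


before = assign_ids({"body": ["foo", "bar"], "header": ["x+", "y?"]})
after = assign_ids({"body": ["foo", "bar", "baz"], "header": ["x+", "y?"]})
```

Adding a regexp to the "body" class moves the base offset of "header" but leaves its intra-class IDs, the ones a per-class compiled database would store, unchanged.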
Vsevolod Stakhov [Thu, 26 Feb 2026 12:54:05 +0000 (12:54 +0000)]
[Feature] Wire Lua rspamd_fasttext through maps infrastructure
Add load_map(cfg, path) to rspamd_fasttext module that loads FastText
models via the maps infrastructure (HTTP URLs + file with shared mmap).
The fasttext_embed neural provider now registers models as maps at
config time via a new init callback, enabling shared memory across
workers and automatic reload on map updates.
Vsevolod Stakhov [Wed, 25 Feb 2026 15:26:58 +0000 (15:26 +0000)]
[Fix] Fix SIGSEGV on termination in fasttext map dtor callback
Two bugs in the map callback lifecycle caused a crash during
rspamd_map_remove_all at shutdown:
1. Type mismatch: fin_callback published *target = model pointer
(fasttext_model*), but the dtor cast it to fasttext_map_data* -
the standard map pattern requires *target = data->cur_data.
2. Use-after-free: map->user_data pointed into the fasttext_langdet
object which was destroyed before rspamd_map_remove_all ran.
Fix by allocating the user_data target on cfg->cfg_pool (outlives
the lang detector), following the standard map consumer pattern,
and accessing the model through a get_model() indirection.
Vsevolod Stakhov [Wed, 25 Feb 2026 14:59:53 +0000 (14:59 +0000)]
[Fix] Use 16K map cache header for mmap alignment on ARM64
Apple Silicon requires mmap offsets to be 16K-aligned (page size is
16384, not 4096). Bump RSPAMD_MAP_CACHE_HEADER_SIZE to 16384 to work
on all common architectures.
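
The alignment constraint is simple arithmetic: an mmap offset must be a multiple of the page size. A small Python check, with hypothetical helper names, shows why 16384 satisfies both page sizes mentioned above while 4096 does not.

```python
HEADER_SIZE = 16384  # value stated above for RSPAMD_MAP_CACHE_HEADER_SIZE


def is_valid_mmap_offset(offset, page_size):
    # mmap offsets must be a whole number of pages.
    return offset % page_size == 0


def align_up(size, alignment):
    # Round size up to the next multiple of alignment.
    return (size + alignment - 1) // alignment * alignment


page_sizes = (4096, 16384)
works_everywhere = all(is_valid_mmap_offset(HEADER_SIZE, p) for p in page_sizes)
old_header_ok_on_16k = is_valid_mmap_offset(4096, 16384)
```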
Vsevolod Stakhov [Wed, 25 Feb 2026 14:51:07 +0000 (14:51 +0000)]
[Feature] Wire fasttext lang detector through maps infrastructure
The fasttext language detector now supports HTTP/HTTPS URLs for model
loading via the maps system, enabling automatic download, disk caching,
periodic reload, and cross-worker mmap sharing.
Changes:
- fasttext_model::load() accepts an offset parameter for mmap at a
non-zero position (used with page-aligned map cache files)
- fasttext_langdet uses rspamd_map_is_map() to detect URLs vs local
paths; URLs go through rspamd_map_add() with RSPAMD_MAP_FILE_NO_READ
- Map callbacks (read/fin/dtor) handle atomic model swap on reload
- Local file paths continue to work as before with direct loading
Vsevolod Stakhov [Wed, 25 Feb 2026 12:57:35 +0000 (12:57 +0000)]
[Feature] Page-aligned map cache header for no_file_read mmap support
Upgrade HTTP map cache file format to use a page-aligned (4096-byte)
header so that no_file_read consumers (CDB, fasttext models) can mmap
the cached file directly at a fixed offset without needing a separate
sidecar file.
Changes:
- Bump cache magic to rmcd2001; old rmcd2000 files are read gracefully
and rewritten on next update
- Header page (4096 bytes) contains struct + etag + zero padding; data
payload always starts at RSPAMD_MAP_CACHE_HEADER_SIZE offset
- For no_file_read maps with HTTP backends, pass the cache file path
to read_callback (instead of payload bytes) with no_file_read_offset
set to 4096; for file backends offset remains 0
- Add rspamd_map_get_no_file_read_offset() public API for consumers
- Refactor cache path computation into rspamd_map_cache_file_path()
helper, removing 4 duplicate hash+snprintf blocks
- Handle all 3 HTTP data delivery paths: live GET (controller),
SHM cache read (scanner workers), disk cache preload (startup)
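
A rough Python sketch of the header-page layout described above. The fixed part of the header here (magic plus etag length) is a hypothetical stand-in for the real struct; only the magic value, the 4096-byte page, and the "struct + etag + zero padding" order come from the text.

```python
import struct

HEADER_SIZE = 4096   # header page size stated above
MAGIC = b"rmcd2001"  # cache magic stated above


def build_cache_file(etag, payload):
    # Hypothetical fixed part: 8-byte magic + 64-bit etag length.
    fixed = struct.pack("<8sQ", MAGIC, len(etag))
    header = fixed + etag
    if len(header) > HEADER_SIZE:
        raise ValueError("etag does not fit into the header page")
    # Zero-pad so the payload always starts at HEADER_SIZE.
    return header.ljust(HEADER_SIZE, b"\0") + payload


def read_payload(blob):
    if blob[:8] != MAGIC:
        raise ValueError("unknown cache format")
    return blob[HEADER_SIZE:]


blob = build_cache_file(b"W/etag-1", b"model-bytes")
```

A consumer that maps the file directly would use HEADER_SIZE as its fixed offset instead of slicing.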
Vsevolod Stakhov [Tue, 24 Feb 2026 21:42:03 +0000 (21:42 +0000)]
[Fix] Propagate source/classification URL flags to query-extracted URLs
When a URL is found inside the query string of another URL (e.g.
http://redir.com/?q=http://target.com), the inner URL now inherits
source/classification flags (FROM_TEXT, CONTENT, SUBJECT, INVISIBLE)
from the outer URL via RSPAMD_URL_FLAG_PROPAGATE_MASK.
Previously, inner URLs only received the QUERY flag, losing all context
about where the parent URL was found. This caused inconsistencies in
plugins that filter URLs by source flags (e.g. RBL content URL filtering).
Also fixes two bugs in the subject path (rspamd_url_task_subject_callback):
- hostlen check used outer URL instead of inner query URL
- QUERY flag was not set on URLs extracted from subject URL queries
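
The mask-based propagation can be illustrated in Python. The bit values below are invented for the example, and OBSCURED is included only as a sample flag outside the mask; just the flag names and the idea of a propagate mask come from the text.

```python
from enum import IntFlag


class UrlFlag(IntFlag):
    # Bit positions are illustrative, not the real rspamd values.
    FROM_TEXT = 1 << 0
    CONTENT = 1 << 1
    SUBJECT = 1 << 2
    INVISIBLE = 1 << 3
    QUERY = 1 << 4
    OBSCURED = 1 << 5  # example of a flag that is not propagated


PROPAGATE_MASK = (
    UrlFlag.FROM_TEXT | UrlFlag.CONTENT | UrlFlag.SUBJECT | UrlFlag.INVISIBLE
)


def inner_url_flags(outer_flags):
    # The inner URL is always QUERY and inherits only the masked outer flags.
    return UrlFlag.QUERY | (outer_flags & PROPAGATE_MASK)


inner = inner_url_flags(UrlFlag.CONTENT | UrlFlag.OBSCURED)
```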
Dmitriy Alekseev [Tue, 24 Feb 2026 17:19:40 +0000 (18:19 +0100)]
fix: set bayes class labels to S and H for the spam and ham class names; make bayes expiry write occurrences as spam and ham instead of S and H
[Conf] Disable Validity SenderScore RBLs by default
Both bl.score.senderscore.com and score.senderscore.com require
a registered MyValidity account to function. Unregistered IPs
receive 127.255.255.255 (blocked) for all queries, making the
RBLs non-functional without prior account setup regardless of
query volume.
Disable senderscore_reputation (score.senderscore.com) by default
and update the senderscore (bl.score.senderscore.com) comment to
reflect the actual registration requirement. Users must register
their querying IPs at https://my.validity.com before enabling
either RBL.
Vsevolod Stakhov [Tue, 24 Feb 2026 13:38:31 +0000 (13:38 +0000)]
[Fix] Move binary msgpack data from KEYS to ARGV in Bayes Redis scripts
When expand_keys is enabled, lutil.template() is applied to all KEYS
arguments of EVALSHA commands. This corrupts binary msgpack blobs by
stripping 0x24 ('$') bytes, breaking str8 headers where the length
byte equals 36. Move non-key arguments (msgpack tokens, config, labels)
to ARGV which is not subject to key expansion.
Also fix msgpack_str_len off-by-one for str32 (4+len -> 5+len).
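
The corruption is easy to reproduce in a Python sketch. The header sizes follow the msgpack format; the `replace` call is only a stand-in for the key expansion step, which the text above says strips 0x24 bytes.

```python
import struct


def pack_str(data):
    """Encode bytes with a msgpack string header."""
    n = len(data)
    if n < 32:
        return bytes([0xA0 | n]) + data             # fixstr: 1-byte header
    if n < 256:
        return b"\xd9" + bytes([n]) + data          # str8: 2-byte header
    if n < 65536:
        return b"\xda" + struct.pack(">H", n) + data  # str16: 3-byte header
    return b"\xdb" + struct.pack(">I", n) + data    # str32: 5-byte header


def packed_str_len(n):
    """Total encoded size for a string of n bytes."""
    if n < 32:
        return 1 + n
    if n < 256:
        return 2 + n
    if n < 65536:
        return 3 + n
    return 5 + n  # 1 type byte + 4 length bytes


packed = pack_str(b"x" * 36)            # length byte is 0x24, i.e. '$'
corrupted = packed.replace(b"$", b"")   # simulated stripping of '$'
```

After the strip, the byte following 0xd9 is the first payload byte, so a decoder reads a wrong length.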
Vsevolod Stakhov [Tue, 24 Feb 2026 11:37:27 +0000 (11:37 +0000)]
[Fix] Fix subprocess cleanup race in spawn_process SIGCHLD handler
When the I/O handler finishes reading the subprocess reply before
SIGCHLD arrives, neither handler would call rspamd_lua_cbdata_free:
the I/O handler skips cleanup because dead=FALSE, and the SIGCHLD
handler skips cleanup because replied=TRUE. This leaves the subprocess
in the main process workers table, causing shutdown to wait for a
child that has already exited.
Fix by always calling rspamd_lua_cbdata_free in the SIGCHLD handler
when replied=TRUE, since the I/O handler has already finished and
deferred cleanup to us.
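
The race can be modelled with two flags in Python. This is a simplified model with hypothetical names: it assumes the I/O handler frees the state itself when the child is already dead, and that a SIGCHLD arriving before the reply leaves cleanup to the I/O handler.

```python
class SubprocessState:
    def __init__(self):
        self.replied = False  # I/O handler finished reading the reply
        self.dead = False     # SIGCHLD has been handled
        self.freed = False    # cleanup (the cbdata free) has run


def on_io_done(state):
    state.replied = True
    if state.dead:
        state.freed = True  # child already reaped: clean up now
    # otherwise cleanup is deferred to the SIGCHLD handler


def on_sigchld(state, fixed):
    state.dead = True
    if state.replied and fixed:
        state.freed = True  # the fix: I/O is done, so free here
    # the old code skipped cleanup whenever replied was already set


def reply_then_exit(fixed):
    state = SubprocessState()
    on_io_done(state)
    on_sigchld(state, fixed)
    return state.freed
```

With the reply arriving first, `reply_then_exit(False)` never frees the state, which matches the stuck-shutdown symptom described above.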
Vsevolod Stakhov [Tue, 24 Feb 2026 11:29:54 +0000 (11:29 +0000)]
[Fix] Use main Lua state in config object destructors
The periodic, cached config, and symbol callback destructors stored a
raw lua_State pointer captured at registration time. If this was a
thread/coroutine state from the thread pool, it could be garbage
collected before the destructor runs during config cleanup, causing a
use-after-free crash in luaL_unref.
Use RSPAMD_LUA_CFG_STATE(cfg) instead, which is the main Lua state
that remains valid throughout config_free until rspamd_lua_close.
Vsevolod Stakhov [Tue, 24 Feb 2026 11:07:50 +0000 (11:07 +0000)]
[Fix] Prevent LuaJIT GC stalls after neural training
The LuaJIT GC atomic phase is non-incremental and processes the entire
gray/grayagain object graph in one uninterruptible pass. After neural
training completes, the controller's Lua heap is bloated with training
temporaries, causing the next GC cycle's atomic phase to stall the
event loop at 100% CPU for an extended period.
Two fixes:
- Force collectgarbage('collect') in ann_trained callback to clean up
training temporaries before they accumulate
- Stop/restart GC around fork() in spawn_process to prevent the child
from inheriting a mid-cycle GC state that triggers thrashing
Vsevolod Stakhov [Mon, 23 Feb 2026 09:14:35 +0000 (09:14 +0000)]
[Feature] Fasttext embed: multi-scale conv1d pooling for text features
Add conv1d output mode to the fasttext_embed provider that applies
multi-scale max-over-time pooling over sliding word windows in Lua,
producing compact feature vectors for the neural plugin's dense ANN.
For each kernel size (default {1, 3, 5}), word vectors are averaged
within sliding windows, then max-pooled across positions per channel.
Each scale's features are L2-normalized independently for balanced
contribution. This replaces the previous approach of feeding raw NCW
matrices into KANN conv1d layers.
Also adds max1d and input3d layer bindings to the KANN Lua API, and
includes conv1d settings (kernel_sizes, conv_pooling, max_words) in
the providers_config_digest for automatic retraining on config change.
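
The pooling steps above (window average, max over positions, per-scale L2 normalization) can be sketched in plain Python. The function name is hypothetical, and clamping the kernel size for short texts is an assumption of this sketch.

```python
import math


def multiscale_pool(words, kernel_sizes=(1, 3, 5)):
    """words: non-empty list of equal-length word vectors."""
    dim = len(words[0])
    features = []
    for k in kernel_sizes:
        k = min(k, len(words))  # assumption: short texts use one full window
        windows = []
        for start in range(len(words) - k + 1):
            window = words[start:start + k]
            # Average the word vectors inside the sliding window.
            windows.append([sum(w[c] for w in window) / k for c in range(dim)])
        # Max over window positions, separately for each channel.
        pooled = [max(win[c] for win in windows) for c in range(dim)]
        # L2-normalize this scale on its own.
        norm = math.sqrt(sum(x * x for x in pooled))
        features.extend(x / norm if norm > 0 else 0.0 for x in pooled)
    return features


words = [[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]
feats = multiscale_pool(words, kernel_sizes=(1, 3))
```

The output length is the number of kernel sizes times the embedding dimension, independent of the message length.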
Vsevolod Stakhov [Sun, 22 Feb 2026 18:11:44 +0000 (18:11 +0000)]
[Feature] Fasttext embed: SIF word weighting for sentence vectors
Add Smooth Inverse Frequency (SIF) weighting to the fasttext embedding
provider. Common words (the, is, a) get near-zero weight while
distinctive words (viagra, invoice) get high weight, significantly
improving embedding quality without changing dimensionality.
Expose get_word_frequency() from the fasttext shim C++ API and Lua
bindings, returning p(word) = count/ntokens from the model vocabulary.
SIF is enabled by default (sif_weight=true, sif_a=1e-3). Combined with
multi-model mean+max pooling, improves F1 from 0.87 to 0.90 in testing.
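
A short Python sketch of SIF weighting, assuming the commonly used form weight = a / (a + p(word)). The helper names are hypothetical, and normalizing by the weight sum is a choice made for this example only.

```python
def sif_weight(p_word, a=1e-3):
    # Frequent words (large p) get a small weight, rare words approach 1.
    return a / (a + p_word)


def sentence_vector(words, vectors, freq, a=1e-3):
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue  # out-of-vocabulary words are skipped in this sketch
        w = sif_weight(freq.get(word, 0.0), a)
        weight_sum += w
        for c in range(dim):
            total[c] += w * vec[c]
    if weight_sum == 0.0:
        return total
    return [x / weight_sum for x in total]


vectors = {"the": [1.0, 0.0], "invoice": [0.0, 1.0]}
freq = {"the": 0.05, "invoice": 1e-5}
sent = sentence_vector(["the", "invoice"], vectors, freq)
```

With these made-up frequencies the sentence vector is dominated by the rare word.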
[Minor] Replace hash table with linear search for class deduplication
The number of classes per classifier is always small (N < 10 in practice),
so hash table overhead outweighs its O(1) lookup benefit. Linear search
over the already-built UCL array is simpler and faster here.
Vsevolod Stakhov [Sun, 22 Feb 2026 11:07:49 +0000 (11:07 +0000)]
[Feature] Fasttext embed: multi-model and mean+max pooling
Use all configured language_models for every message by default
(multi_model=true), concatenating vectors from each model for
richer cross-lingual representations.
Add mean+max pooling (pooling="mean_max" default) which concatenates
the average word vector with element-wise max pooling, capturing both
typical and prominent semantic features.
With 2 quantized 50-dim models this produces 200-dim vectors instead
of 50, significantly improving classification (F1 0.51 -> 0.87 in
testing).
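
The dimension arithmetic above (2 models x 50 dims x 2 poolings = 200) can be checked with a small Python sketch. Function names are hypothetical and the word vectors are dummy data.

```python
def mean_max_pool(word_vectors):
    """Concatenate the mean vector with the element-wise max vector."""
    dim = len(word_vectors[0])
    count = len(word_vectors)
    mean = [sum(v[c] for v in word_vectors) / count for c in range(dim)]
    peak = [max(v[c] for v in word_vectors) for c in range(dim)]
    return mean + peak


def embed(per_model_word_vectors):
    """Concatenate the pooled vectors of every model."""
    out = []
    for word_vectors in per_model_word_vectors:
        out.extend(mean_max_pool(word_vectors))
    return out


# Two models, 50 dimensions each, three words per message (dummy values).
model_a = [[float(i)] * 50 for i in range(3)]
model_b = [[float(-i)] * 50 for i in range(3)]
embedding = embed([model_a, model_b])
```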
Vsevolod Stakhov [Sat, 21 Feb 2026 17:27:42 +0000 (17:27 +0000)]
[Fix] Fasttext shim: fix binary format parsing and harden against corrupt models
- Fix QMatrix load order: read codesize+codes before PQ (not after)
- Fix PQ centroid count: use dim*ksub (not nsubq*ksub*dsub)
- Fix PQ centroid addressing: match FastText's get_centroids() for last sub-quantizer
- Fix dictionary load: read size_ field before nwords/nlabels
- Fix output matrix: always read qout bool between input and output matrices
- Fix subword n-gram skip: only skip single-char BOW/EOW, not full wrapped word
Add comprehensive sanity checks for all untrusted values from model files:
- Validate dimensions, entry counts, matrix sizes against sane upper bounds
- Overflow-safe multiplication for matrix element counts
- Bounds checks on centroid, codes, and dense matrix data access
- Null pointer guards on all matrix operations
- Replace throwing .at() with bounds-checked pointer return
- Limit string reads to 1024 bytes to prevent runaway allocation
- Return nullptr/false from loaders on validation failure
- Guard Lua bindings against empty/short vectors from get_word_vector
[Feature] WebUI: Add multi-class classifier support to learning UI
- Support /learnclass endpoint with Classifier and Class headers
for multi-class learning.
- Handle both old (array) and new (metadata object) /bayes/classifiers
response formats for backward compatibility.
- Add dynamic UI switching based on classifier type:
* Binary classifiers: show HAM/SPAM upload buttons
* Multi-class classifiers: show class dropdown + Learn button
- Display classifier metadata as text badges in dropdown:
[multi-class] [per-user]
- Hide "All classifiers" option when multi-class classifiers present
(different classifiers may have different class sets).
Vsevolod Stakhov [Sat, 21 Feb 2026 16:46:19 +0000 (16:46 +0000)]
[Minor] Fasttext shim: addressing review comments
- Use ICU U8_NEXT for UTF-8 iteration instead of handcrafted code
- Replace exception-based error handling with fail-bit pattern in
binary_reader, propagating errors via tl::expected
- Replace std::sort with std::reverse after min-heap extraction
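
The last point relies on a property of min-heaps: popping every element yields ascending order, so a single reverse gives the descending top-k list without a full sort. A Python sketch with a hypothetical function name:

```python
import heapq


def top_k(scored, k):
    """Return the k largest items in descending order."""
    heap = []
    for item in scored:
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            # Replace the current smallest of the kept items.
            heapq.heapreplace(heap, item)
    # Popping a min-heap yields ascending order...
    ascending = [heapq.heappop(heap) for _ in range(len(heap))]
    # ...so one reverse is enough to get descending order.
    ascending.reverse()
    return ascending


best = top_k([(0.1, "a"), (0.7, "b"), (0.4, "c"), (0.9, "d")], 2)
```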
Vsevolod Stakhov [Fri, 20 Feb 2026 11:21:38 +0000 (11:21 +0000)]
[Feature] Replace libfasttext with mmap-based built-in shim
Replace the external libfasttext shared library dependency with a
zero-dependency C++20 shim that reads .bin/.ftz models directly.
The large input matrix is mmap'd with MAP_SHARED/PROT_READ so all
worker processes share the same physical pages after fork.
This eliminates:
- C++ exception ABI issues across shared library boundaries
(no more fork-probe hack in lua_fasttext)
- The ENABLE_FASTTEXT cmake option (always compiled in now)
- Per-worker heap copies of the model (~500MB-7GB savings)