[Fix] Refactor multipattern to use per-multipattern TLD flag
This commit fixes the multipattern implementation to properly support
a per-multipattern TLD flag instead of per-pattern flags.
Key changes:
- Remove acism_id_offset field - no longer needed since TLD is now
per-multipattern, not per-pattern
- Fix hyperscan TLD pattern suffix: use (?:[^a-zA-Z0-9]|$) instead
of (:?\b|$) because \b requires HS_FLAG_UCP which causes issues
- Initialize pats array in create functions when TLD flag is set
- Add TLD patterns to pats array at start of add_pattern_len for
ACISM fallback during hyperscan compilation
- Simplify ACISM callback - strnum IS the pattern ID for TLD patterns
For TLD multipatterns, the system now builds BOTH:
- ACISM patterns (for fallback during HS compilation or when unavailable)
- Hyperscan patterns (when available)
At lookup time: use Hyperscan if ready, fall back to ACISM otherwise.
[Fix] Fix multipattern cache file cleanup and ACISM fallback
- Register multipattern cache files with rspamd_hyperscan_notice_known()
to prevent hs_helper from cleaning them up during cache cleanup
- Fix ACISM pattern ID offset for mixed multipatterns (static + TLD):
when ACISM callback returns strnum, add acism_id_offset to get the
actual pattern ID that the URL scanner expects
[Feature] Add multipattern state machine for async compilation support
Add a state machine (INIT/COMPILING/COMPILED/FALLBACK) to multipattern
for future async hyperscan compilation. Build an ACISM fallback for
TLD-only patterns to allow matching while HS compiles. Mixed TLD/non-TLD
patterns use synchronous compilation. Also update the cache format to
the unified .hs extension.
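A minimal sketch of how such a state machine could gate the lookup path.
Only the four state labels come from the commit text; the enum, the
backend type, and the function name are hypothetical and are not the
actual rspamd definitions:

```c
#include <stdbool.h>

/* Hypothetical names: only the four state labels come from the commit. */
enum mp_state {
    MP_STATE_INIT,      /* nothing compiled yet */
    MP_STATE_COMPILING, /* Hyperscan compilation in progress */
    MP_STATE_COMPILED,  /* Hyperscan database is ready */
    MP_STATE_FALLBACK   /* Hyperscan unavailable, ACISM only */
};

enum mp_backend {
    MP_BACKEND_NONE,
    MP_BACKEND_ACISM,
    MP_BACKEND_HYPERSCAN
};

/*
 * Choose the matcher for one lookup: Hyperscan once it is compiled,
 * otherwise the ACISM fallback if one was built.
 */
enum mp_backend mp_choose_backend(enum mp_state state, bool has_acism)
{
    if (state == MP_STATE_COMPILED) {
        return MP_BACKEND_HYPERSCAN;
    }

    return has_acism ? MP_BACKEND_ACISM : MP_BACKEND_NONE;
}
```

With this shape, a TLD-only multipattern in the COMPILING state can keep
matching through ACISM until the state flips to COMPILED.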
[Feature] Unified hyperscan cache format for multipattern
Add C helper functions for serializing/deserializing hyperscan databases
with the unified format (magic, platform, CRC). Migrate multipattern from
raw .hsmp files to the unified .hs format compatible with re_cache.
- Add rspamd_hyperscan_serialize_with_header() and load_from_header()
- Update multipattern to use unified format with platform validation
- Fix CRC calculation in Lua bindings to match re_cache format
[Feature] Add Lua hyperscan compilation bindings and orchestration module
- Add rspamd_hyperscan Lua module with compile/serialize/deserialize/validate
- Create lua_hs_compile.lua orchestration module for unified compilation
- Support pluggable cache backends via lua_hs_cache integration
- Use unified file format with magic, platform info, CRC validation
This commit adds infrastructure for pluggable hyperscan cache storage
backends and FD-based shared memory distribution:
- Add platform ID function (rspamd_hyperscan_get_platform_id) for
platform-aware cache keys
- Create lua_hs_cache.lua with file, Redis, and HTTP backends
- Add FD-based loading APIs (rspamd_hyperscan_from_fd,
rspamd_hyperscan_create_shared_unser)
- Add fd_size field to control messages for FD passing
- Update worker to handle attached FDs in hyperscan notifications
- Add cache_backend configuration option to hs_helper
Vsevolod Stakhov [Wed, 31 Dec 2025 10:54:55 +0000 (10:54 +0000)]
[Feature] Add extra tables API for clickhouse plugin
Allow other plugins to dynamically register custom Clickhouse tables
via rspamd_plugins['clickhouse'].register_extra_table(). Supports
per-table schemas, row callbacks (single or multiple rows), and
independent retention settings.
Vsevolod Stakhov [Mon, 29 Dec 2025 22:28:40 +0000 (22:28 +0000)]
[Fix] Fix replxx build with LLVM 21+
- Simplify CMakeLists.txt to use CMAKE_CXX_STANDARD 20
- Replace std::unordered_map with std::map to avoid libc++ ABI issues
- Add operator< to UnicodeString for std::map compatibility
Vsevolod Stakhov [Sun, 28 Dec 2025 21:20:12 +0000 (21:20 +0000)]
[Fix] Avoid SDK headers in include path when package ROOT is specified
- Add NO_DEFAULT_PATH to FIND_PATH when PKG_ROOT is set to prevent
macOS SDK C headers from polluting include paths before libc++
- Fix typo: {RSPAMD_DEFAULT_INCLUDE_PATHS} -> ${...}
- Remove obsolete paths (/opt/csw, /sw), add /opt/homebrew for macOS
Vsevolod Stakhov [Sun, 28 Dec 2025 18:45:05 +0000 (18:45 +0000)]
[Feature] Rename fuzzy_check max_score to hits_limit for clarity
The option name max_score was confusing: it refers not to the symbol
score but to the number of fuzzy hash hits at which the normalized
score reaches ~1.0 (formula: tanh(e * hits / hits_limit), so
hits == hits_limit yields tanh(e) ≈ 0.99).
- Rename max_score -> hits_limit in fuzzy_check.c and default config
- Add backward compatibility: max_score is still accepted as an alias
- Add lua_cfg_transform to handle legacy configs (max_score overrides
hits_limit to ensure local.d overrides work correctly)
- Add explanatory comments in config and documentation
Vsevolod Stakhov [Sat, 27 Dec 2025 10:59:05 +0000 (10:59 +0000)]
[Fix] Add resilience to lua_cfg_transform
- Check :type() before indexing UCL objects to handle null values
- Wrap transform sections in pcall to prevent one bad config section
from breaking the entire configuration load
- Log errors with section name for easier debugging
Vsevolod Stakhov [Tue, 23 Dec 2025 10:13:43 +0000 (10:13 +0000)]
[Fix] Restore Lua stack properly in second-pass MIME detection
Fix lua_settop(L, 0), which cleared the entire Lua stack instead
of restoring it to its previous state, causing segfaults when
process_message() was called from Lua unit tests.
Vsevolod Stakhov [Mon, 22 Dec 2025 11:53:37 +0000 (11:53 +0000)]
[Fix] Use Fibonacci hashing for task pointer hash
Use golden ratio multiplication for 64-bit to 32-bit pointer hashing.
This provides good distribution with minimal operations (1 multiply +
1 shift) and works well with kh_int_hash_func, which is the identity
function.
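For illustration, the multiply-and-shift step can be sketched as below.
The function and macro names are hypothetical, and the constant is the
commonly used 64-bit golden ratio multiplier; the exact code in rspamd
may differ:

```c
#include <stdint.h>

/* Approximately 2^64 divided by the golden ratio (an odd constant). */
#define DEMO_FIB_MULT 0x9E3779B97F4A7C15ULL

/*
 * Illustrative helper, not the rspamd function: one multiply and one
 * shift. The product wraps modulo 2^64 and its upper 32 bits are kept,
 * so neighbouring pointers end up with very different hashes.
 */
uint32_t demo_ptr_fib_hash(const void *p)
{
    uint64_t x = (uint64_t) (uintptr_t) p;

    return (uint32_t) ((x * DEMO_FIB_MULT) >> 32);
}
```

Because the mixing is done here, an identity hash function in the
table itself does not degrade the distribution.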
Vsevolod Stakhov [Mon, 22 Dec 2025 11:25:49 +0000 (11:25 +0000)]
[Fix] Add logging, preallocation and hash mixing to task registry
- Log error when detecting use-after-free attempt on task pointer
- Preallocate task set to 16 elements to reduce early rehashing
- Mix pointer bits using multiplicative hash for better distribution
Vsevolod Stakhov [Mon, 22 Dec 2025 10:06:19 +0000 (10:06 +0000)]
[Fix] Use pointer set instead of key map for task validation
Store task pointers in a khash set and validate them on lookup
from Lua. This works with all code paths that create task userdata
directly without going through rspamd_lua_task_push.
Vsevolod Stakhov [Sun, 21 Dec 2025 20:05:27 +0000 (20:05 +0000)]
[Feature] Add task registry for safe Lua task reference validation
Implement a global task registry that maps unique uint64_t keys to task
pointers. This prevents use-after-free bugs when Lua code holds references
to tasks that may have been freed (e.g., in async Redis callbacks).
Key changes:
- Add lua_key field to rspamd_task struct
- Implement task registry using khash (O(1) lookup)
- Store lua_key in Lua userdata instead of raw pointer
- Lookup via registry when extracting task from Lua
- Remove task from registry FIRST in rspamd_task_free()
The counter-based key approach avoids issues with:
- Pointer reuse after free (memory allocator may reuse addresses)
- Lua number precision (52-bit mantissa is sufficient for counter)
- NaN/subnormal float values that could cause issues
This fixes potential use-after-free in Redis script waitq callbacks
when Redis is unavailable longer than task lifetime.
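A simplified sketch of the counter-based key idea. It uses a small
fixed array with a linear scan instead of khash, and every name in it
is hypothetical; it only shows why a stale key cannot resolve to a new
task that happens to reuse the same address:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_REGISTRY_CAP 64

struct demo_slot {
    uint64_t key; /* 0 marks a free slot */
    void *task;
};

static struct demo_slot demo_slots[DEMO_REGISTRY_CAP];
static uint64_t demo_next_key = 1; /* only grows, so keys are never reused */

/* Register a task and return its key, or 0 if the table is full. */
uint64_t demo_registry_add(void *task)
{
    assert(task != NULL);

    for (size_t i = 0; i < DEMO_REGISTRY_CAP; i++) {
        if (demo_slots[i].key == 0) {
            demo_slots[i].key = demo_next_key++;
            demo_slots[i].task = task;
            return demo_slots[i].key;
        }
    }

    return 0;
}

/* Resolve a key to a live task, or NULL if it was removed or never existed. */
void *demo_registry_lookup(uint64_t key)
{
    if (key == 0) {
        return NULL;
    }

    for (size_t i = 0; i < DEMO_REGISTRY_CAP; i++) {
        if (demo_slots[i].key == key) {
            return demo_slots[i].task;
        }
    }

    return NULL;
}

/* Drop a key; this must happen before the task memory is released. */
void demo_registry_remove(uint64_t key)
{
    if (key == 0) {
        return;
    }

    for (size_t i = 0; i < DEMO_REGISTRY_CAP; i++) {
        if (demo_slots[i].key == key) {
            demo_slots[i].key = 0;
            demo_slots[i].task = NULL;
            return;
        }
    }
}
```

A callback that still holds an old key gets NULL from the lookup after
the task is removed, even if a new task is later registered at the
same address.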
Vsevolod Stakhov [Tue, 16 Dec 2025 11:25:29 +0000 (11:25 +0000)]
[Fix] Fix Lua 5.4 compatibility in clickhouse and elastic plugins
- Merge nested settings tables (limits, retention) to preserve defaults
when user provides partial configuration
- Use %d with math.floor() instead of %.0f for integer formatting
Vsevolod Stakhov [Fri, 12 Dec 2025 20:24:26 +0000 (20:24 +0000)]
[Fix] Handle connection errors with io_uring backend in HTTP client
When using io_uring, POLLERR is reported as both EV_READ and EV_WRITE.
This caused connection failures (e.g., ECONNREFUSED) to be misinterpreted
as early server responses. Check SO_ERROR before attempting to read when
the connection hasn't been established yet.
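The SO_ERROR check can be sketched with the standard getsockopt() call.
The helper name is hypothetical and the surrounding event-loop logic is
omitted; this is not the rspamd implementation:

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Return 0 if the socket has no pending error, otherwise an errno-style
 * code such as ECONNREFUSED. If getsockopt() itself fails, return errno.
 * Note that reading SO_ERROR also clears the pending error.
 */
int demo_pending_socket_error(int fd)
{
    int err = 0;
    socklen_t len = sizeof(err);

    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1) {
        return errno;
    }

    /* The kernel writes back exactly one int for SO_ERROR. */
    assert(len == sizeof(err));

    return err;
}
```

Calling such a helper before the first read on a not-yet-connected
socket separates a failed connect from a genuine early response.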
[Test] Skip unnecessary waiting in initial scan counter read
Since the test already starts on the Status tab, gotoTab("status")
doesn't trigger a new request, so the test merely waits for the
autorefresh to happen, causing an unnecessary delay.
Vsevolod Stakhov [Thu, 11 Dec 2025 18:11:36 +0000 (18:11 +0000)]
[Feature] Add text quality analysis for PDF garbage filtering
- Add rspamd_util.get_text_quality() function with comprehensive UTF-8
text analysis using ICU for proper Unicode classification
- Returns 18 metrics: letters, digits, punctuation, spaces, printable,
words, word_chars, total, emojis, uppercase, lowercase, ascii_chars,
non_ascii_chars, latin_vowels, latin_consonants, script_transitions,
double_spaces, non_printable
- Add confidence scoring to PDF text extraction to filter garbage tokens
(single characters, encoded data, random sequences)
- Configurable via text_quality_threshold, text_quality_min_length,
text_quality_enabled options in pdf module config
- Add unit tests for get_text_quality function
[Fix] Correct symbols column index in history and scan tables
Fixes a regression introduced in 62b136a where sorting fails with
"can't access property 'sortValue', val.options is undefined" on
the History tab, and symbol reordering doesn't work on the Scan tab.
The "file" column addition shifted the symbols column index, but
history.js and upload.js were not updated, causing symbol reordering
to target the wrong columns.
[Feature] Add multipart and msgpack formatters to metadata_exporter
- Add multipart formatter for HTTP export using form-data with separate
metadata (JSON) and message (rfc822) parts
- Add msgpack formatter for efficient binary serialization
- Add json_with_message formatter for JSON with base64-encoded message
- Deprecate meta_headers option (broken by design for complex data)
- HTTP pusher now auto-detects multipart boundary from formatter
[Fix] Only apply early response handling for HTTP clients
The early response detection logic should not run for server-side
connections, as it incorrectly modifies wr_pos state when the server
reads incoming requests. This was breaking spamc protocol handling.
[Fix] Handle HTTP early server responses during request write
Fix the HTTP client to properly handle early server responses (e.g., 413
Payload Too Large) that arrive before the client has finished sending
the request body. This is allowed by HTTP/1.1 (RFC 7230 Section 6.5).
- Use bitwise AND for event flag checks to handle combined EV_READ|EV_WRITE
- Watch for both READ and WRITE events during write phase
- Check for early response on write errors (EPIPE, ECONNRESET)
- Add RSPAMD_HTTP_CONN_FLAG_EARLY_RESPONSE flag to track state
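The difference between an equality test and a bitwise AND on event
flags can be sketched as follows. The DEMO_ constants mirror the bit
layout of libev's EV_READ and EV_WRITE, and the function names are
hypothetical:

```c
#include <stdbool.h>

/* Illustrative flag values with the same bit layout as libev's EV_READ/EV_WRITE. */
enum {
    DEMO_EV_READ = 0x01,
    DEMO_EV_WRITE = 0x02
};

/* Fragile: true only when READ is the sole flag that is set. */
bool demo_is_readable_eq(int what)
{
    return what == DEMO_EV_READ;
}

/* Robust: true whenever the READ bit is set, alone or combined. */
bool demo_is_readable_and(int what)
{
    return (what & DEMO_EV_READ) != 0;
}
```

For a combined DEMO_EV_READ | DEMO_EV_WRITE event, the equality check
misses the read readiness while the AND check reports it.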
FreeBSD 15.0 introduced native inotify support, which causes
libev to enable EV_USE_INOTIFY. On FreeBSD, struct statfs is
defined in <sys/mount.h> rather than <sys/statfs.h>.
Note: This fix obsoletes the corresponding patch file in the
FreeBSD `mail/rspamd` and `mail/rspamd-devel` ports.
[Fix] Allow default_headers_order to be configured in milter_headers
Fixes #5781: The default_headers_order setting was defined in the plugin
but never read from the configuration file. Add schema validation and
config loading for this option.
[Fix] Fix Lua 5.4 compatibility issues in neural plugin
This commit addresses several Lua 5.4 compatibility issues that caused
the neural LLM tests to fail:
1. Redis TTL must be integer (lua_cache.lua):
- Lua 5.4's tostring() produces "4.0" for floats instead of "4"
- Redis SETEX/EXPIRE commands require integer TTL values
- Fixed by using math.floor() before tostring()
2. Version number format in ANN keys (lualib/plugins/neural.lua):
- Changed string format from %s to %d for version numbers
- Ensures integer format "1" instead of potential "1.0"
3. Iterator vs table handling (src/plugins/lua/neural.lua):
- fun.map() returns an iterator, not a table
- In Lua 5.4, # operator on iterators returns 0
- Fixed by wrapping with fun.totable() to get a proper table
4. Nil values in table arguments (lualib/plugins/neural.lua):
- Lua 5.4 handles nil values in tables differently
- Tables like {a, b, nil, nil} have undefined length behavior
- Fixed by using empty string defaults for optional parameters
5. Redis script nil checks (neural_save_unlock.lua):
- Added empty string checks alongside nil checks
- Ensures optional fields are only set when truly provided
6. Test infrastructure improvements:
- Added logging to dummy_llm.py for debugging
- Added proper error handling and diagnostics
- Updated rspamd.robot with better dummy_llm startup logging
- Add required Host header to all HTTP/1.1 requests in tcp.lua
- Bind dummy servers to 127.0.0.1 instead of localhost to avoid
IPv6/IPv4 mismatch on systems where localhost resolves to ::1