Vsevolod Stakhov [Sun, 11 Jan 2026 11:38:22 +0000 (11:38 +0000)]
[Fix] Enable FALLBACK mode for RE multipatterns (stop words)
- Create pats array for all multipatterns, not just TLD
- Use rspamd_multipattern_build_acism() for proper RE fallback
- Add regex fallback path in lookup while HS is compiling
- Clean up mp->res in destructor for hyperscan path
This fixes stop words multipatterns which use RSPAMD_MULTIPATTERN_RE
to properly use FALLBACK mode instead of falling through to SYNC mode
and creating .hs files during config loading.
Vsevolod Stakhov [Sat, 10 Jan 2026 21:18:21 +0000 (21:18 +0000)]
[Feature] Compile small hyperscan databases in memory without file caching
For small pattern sets (< 100 patterns), compile hyperscan databases
synchronously in memory without saving to file or Redis cache.
These databases are shared with workers via fork() COW semantics.
Large pattern sets (like TLD with 10000+ patterns) continue to use
async compilation via hs_helper with Redis caching.
This eliminates unnecessary .hs files in /var/lib/rspamd for small
databases while maintaining the async path for expensive compilations.
Vsevolod Stakhov [Sat, 10 Jan 2026 17:19:32 +0000 (17:19 +0000)]
[Fix] Ensure stable re_cache class hashes independent of other classes
Previously, the global regexp index `i` was included in per-class hashes,
which caused class B's hash to change when class A got new regexps
(because indices shift). This made Redis caching ineffective as databases
were constantly being recompiled.
Now the global index is only included in the global hash, not in per-class
hashes, ensuring each class hash depends only on its own regexps.
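The effect of dropping the global index from per-class hashes can be illustrated with a small sketch. FNV-1a is used here only as a stand-in for the real hash state, and both function names are hypothetical; the first function is a guess at the shape of the old scheme, not a copy of the re_cache code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FNV_OFFSET 14695981039346656037ULL
#define FNV_PRIME 1099511628211ULL

/* FNV-1a update step; a stand-in for whatever hash re_cache really uses. */
static uint64_t fnv1a_update(uint64_t h, const void *data, size_t len)
{
    const unsigned char *p = data;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

/* Old scheme (assumed shape): each regexp is hashed together with its
 * global index, so the result shifts when earlier classes grow. */
static uint64_t class_hash_with_global_index(const char **res, size_t n,
                                             uint64_t first_global_idx)
{
    uint64_t h = FNV_OFFSET;

    for (size_t i = 0; i < n; i++) {
        uint64_t gi = first_global_idx + i;
        h = fnv1a_update(h, &gi, sizeof(gi));
        h = fnv1a_update(h, res[i], strlen(res[i]) + 1); /* NUL as separator */
    }
    return h;
}

/* New scheme: only the class's own regexps contribute to its hash. */
static uint64_t class_hash_stable(const char **res, size_t n)
{
    uint64_t h = FNV_OFFSET;

    for (size_t i = 0; i < n; i++) {
        h = fnv1a_update(h, res[i], strlen(res[i]) + 1);
    }
    return h;
}
```

With the first function, adding one regexp to class A moves class B's first global index from, say, 0 to 1 and changes B's hash even though B itself is unchanged. The second function takes no index at all, so B's cache key stays valid.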
Vsevolod Stakhov [Sat, 10 Jan 2026 15:58:40 +0000 (15:58 +0000)]
[Feature] Enhance hyperscan cache debug logging and correlation
- Add entity_name parameter to async cache API for better traceability
- Correlate cache requests with callbacks (show entity/key in both)
- Use rspamd_zhs prefix by default for compressed Redis data
- Switch to idiomatic lua_util.debugm for Lua debug logging
- Log Redis backend config (prefix, ttl, compression) on creation
[Fix] Properly terminate hs_helper during shutdown
Add RSPAMD_SRV_BUSY command to allow hs_helper to notify the main process
when it is busy with a long-running hyperscan compilation. The main process
skips heartbeat checks while the worker is busy and logs the busy reason
during shutdown.
Key fixes:
- Prevent notifications being sent after worker receives termination signal
- Propagate ev_break through rspamd_worker_set_busy to properly exit event loop
- Add shutdown monitor timer to log pending workers during termination
- Pass worker pointer to re_cache compile functions for termination checks
[Conf] Add configuration support for hs_helper worker
Add worker-hs_helper.conf and worker-hs_helper.inc config files that are
only installed when hyperscan support is enabled. The main rspamd.conf
uses try=true to gracefully handle missing config on non-hyperscan builds.
* [Feature] Add task registry for safe Lua task reference validation
* [Feature] Add text quality analysis for PDF garbage filtering
* [Feature] Implement basic PDF text extraction with UTF-16 detection
* [Feature] Add extra tables API for clickhouse plugin
* [Feature] Add confighelp documentation for RBL module
* [Feature] WebUI: add backend API interaction error log
* [Fix] Neural: by default include symbols with no flags
* [Fix] Symcache: make FINE propagation deterministic
* [Fix] URL: Prevent false positives from numeric IP regeneration in mailto URLs
* [Fix] Settings: Allow spaces in selector regexps
* [Fix] Prevent use-after-free in Redis callbacks after session cleanup
* [Fix] Lua 5.4 compatibility in clickhouse and elastic plugins
* [Fix] Use exact map lookup for DKIM key_table instead of glob
* [Fix] Handle connection errors with io_uring backend in HTTP client
* [Minor] Update public suffix list
[Fix] Use free() for hyperscan-allocated buffers in lua_hyperscan
hs_serialize_database() uses the standard C allocator, so the returned
buffer must be freed with free(), not g_free(). Mixing allocators
causes memory corruption when hiredis is configured to use glib.
[Fix] URL: Prevent false positives from numeric IP regeneration in mailto URLs
Fixes #5823 - Google Fonts URLs containing wght@0 parameter were incorrectly triggering URL_NUMERIC_IP and URL_BACKSLASH_PATH due to the @ symbol being interpreted as an email pattern and "0" being expanded to "0.0.0.0".
Also fix URL_BACKSLASH_PATH to actually check for backslashes instead of relying on the ambiguous obscured flag.
[Minor] Skip ACISM fallback build when file cache hit
In FALLBACK mode, try loading from file cache first. If successful,
skip building the ACISM trie to save memory. ACISM is only built on
cache miss (when async compilation is needed).
[Fix] Prevent hs_helper from deleting multipattern cache files
Add rspamd_hyperscan_is_file_known() API to check if a file is in the
known hyperscan files cache. Modify hs_helper cleanup to skip files
that are known (e.g., multipattern TLD cache files) even if they
aren't part of the re_cache.
[Fix] Fix ACISM fallback for multipattern async compilation
- Add per-pattern is_tld flag instead of checking multipattern-level flag
- Store pattern ID in ACISM wrapper struct for correct callback reporting
- Use ACISM-specific escaping for all patterns in fallback array
- Fix callback to use per-pattern TLD boundary check
- Set FALLBACK mode for URL scanner TLD trie
Add deferred hyperscan compilation for multipatterns (TLD patterns):
- Build ACISM fallback immediately during pre-fork (fast)
- Queue multipatterns for async HS compilation by hs_helper
- Workers hot-swap from ACISM to hyperscan when compilation completes
IPC additions:
- RSPAMD_SRV_MULTIPATTERN_LOADED: hs_helper → main
- RSPAMD_CONTROL_MULTIPATTERN_LOADED: main → workers
Bug fixes:
- Use per-pattern TLD flags instead of multipattern-level flags
- Add word boundary check in ACISM callback for TLD matching
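A boundary check of this kind could look roughly like the sketch below. It assumes that a TLD match is valid only when followed by a non-alphanumeric ASCII character or the end of input; the function names and the offset convention are hypothetical and are not taken from the rspamd sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Locale-independent ASCII alphanumeric test. */
static bool is_ascii_alnum(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9');
}

/* match_end is the offset just past the last byte of the matched TLD.
 * Plain substring matchers have no notion of word boundaries, so the
 * callback has to reject matches such as ".com" inside ".community". */
static bool tld_match_at_boundary(const char *text, size_t len, size_t match_end)
{
    assert(text != NULL && match_end <= len);

    if (match_end == len) {
        return true; /* end of input */
    }
    return !is_ascii_alnum((unsigned char) text[match_end]);
}
```

For example, a ".com" match ending at offset 11 is accepted in "example.com/path" and in "example.com", but rejected in "example.community".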
[Feature] WebUI: add backend API interaction error log
Add an error log modal with a responsive table providing:
- tracking of the last 50 errors using a circular buffer
- an "unseen since last view" counter on the badge in bottom-right corner
- copy-to-clipboard support with execCommand fallback for HTTP connections
- color-coded error types
- automatic column hiding on smaller screens
[Minor] Add clear logging for multipattern compilation states
- Log when ACISM fallback trie is built
- Log when hyperscan cache hit/miss occurs
- Log when hot-swap to hyperscan completes
- Remove misleading "start compiling" message from url.c
[Fix] Add RSPAMD_MULTIPATTERN_TLD flag to search_trie_full creation
The TLD flag must be present at multipattern creation time for the
ACISM fallback to work. Without this flag, mp->pats array is not
created and ACISM patterns are not stored, causing fallback to fail
when Hyperscan cache is not available.
[Fix] Refactor multipattern to use per-multipattern TLD flag
This commit fixes the multipattern implementation to properly support
per-multipattern TLD flag instead of per-pattern flags.
Key changes:
- Remove acism_id_offset field - no longer needed since TLD is now
per-multipattern, not per-pattern
- Fix hyperscan TLD pattern suffix: use (?:[^a-zA-Z0-9]|$) instead
of (:?\b|$) because \b requires HS_FLAG_UCP which causes issues
- Initialize pats array in create functions when TLD flag is set
- Add TLD patterns to pats array at start of add_pattern_len for
ACISM fallback during hyperscan compilation
- Simplify ACISM callback - strnum IS the pattern ID for TLD patterns
For TLD multipatterns, the system now builds BOTH:
- ACISM patterns (for fallback during HS compilation or when unavailable)
- Hyperscan patterns (when available)
At lookup time: use Hyperscan if ready, fall back to ACISM otherwise.
[Fix] Fix multipattern cache file cleanup and ACISM fallback
- Register multipattern cache files with rspamd_hyperscan_notice_known()
to prevent hs_helper from cleaning them up during cache cleanup
- Fix ACISM pattern ID offset for mixed multipatterns (static + TLD):
when ACISM callback returns strnum, add acism_id_offset to get the
actual pattern ID that the URL scanner expects
[Feature] Add multipattern state machine for async compilation support
Add state machine (INIT/COMPILING/COMPILED/FALLBACK) to multipattern
for future async hyperscan compilation. Build ACISM fallback for TLD-only
patterns to allow matching while HS compiles. Mixed TLD/non-TLD patterns
use sync compile. Also update cache format to unified .hs extension.
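One way to picture the state machine is the sketch below. The four state names come from the commit message; the struct, the backend enum, the transition helper and the exact meaning attached to each state are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State names from the commit message; semantics here are assumed. */
enum mp_state {
    MP_STATE_INIT,      /* patterns are still being added */
    MP_STATE_COMPILING, /* hyperscan compilation in progress */
    MP_STATE_COMPILED,  /* hyperscan database is ready */
    MP_STATE_FALLBACK   /* ACISM is used while hyperscan is not ready */
};

enum mp_backend { MP_BACKEND_NONE, MP_BACKEND_ACISM, MP_BACKEND_HYPERSCAN };

/* Hypothetical, heavily reduced multipattern. */
struct mp_sketch {
    enum mp_state state;
    bool has_acism; /* fallback trie was built */
    bool has_hs_db; /* hyperscan database is loaded */
};

/* Lookup-time choice: hyperscan if ready, otherwise the ACISM fallback. */
static enum mp_backend mp_pick_backend(const struct mp_sketch *mp)
{
    assert(mp != NULL);

    if (mp->state == MP_STATE_COMPILED && mp->has_hs_db) {
        return MP_BACKEND_HYPERSCAN;
    }
    if (mp->has_acism) {
        return MP_BACKEND_ACISM;
    }
    return MP_BACKEND_NONE;
}

/* Called when a compiled database becomes available. */
static void mp_hs_ready(struct mp_sketch *mp)
{
    assert(mp != NULL);

    mp->has_hs_db = true;
    mp->state = MP_STATE_COMPILED;
}
```

A multipattern that starts in FALLBACK with an ACISM trie keeps matching through ACISM, and the same lookup call switches to hyperscan once mp_hs_ready() has run.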
[Feature] Unified hyperscan cache format for multipattern
Add C helper functions for serializing/deserializing hyperscan databases
with the unified format (magic, platform, CRC). Migrate multipattern from
raw .hsmp files to the unified .hs format compatible with re_cache.
- Add rspamd_hyperscan_serialize_with_header() and load_from_header()
- Update multipattern to use unified format with platform validation
- Fix CRC calculation in Lua bindings to match re_cache format
[Feature] Add Lua hyperscan compilation bindings and orchestration module
- Add rspamd_hyperscan Lua module with compile/serialize/deserialize/validate
- Create lua_hs_compile.lua orchestration module for unified compilation
- Support pluggable cache backends via lua_hs_cache integration
- Use unified file format with magic, platform info, CRC validation
This commit adds infrastructure for pluggable hyperscan cache storage
backends and FD-based shared memory distribution:
- Add platform ID function (rspamd_hyperscan_get_platform_id) for
platform-aware cache keys
- Create lua_hs_cache.lua with file, Redis, and HTTP backends
- Add FD-based loading APIs (rspamd_hyperscan_from_fd,
rspamd_hyperscan_create_shared_unser)
- Add fd_size field to control messages for FD passing
- Update worker to handle attached FDs in hyperscan notifications
- Add cache_backend configuration option to hs_helper
Vsevolod Stakhov [Wed, 31 Dec 2025 10:54:55 +0000 (10:54 +0000)]
[Feature] Add extra tables API for clickhouse plugin
Allow other plugins to dynamically register custom Clickhouse tables
via rspamd_plugins['clickhouse'].register_extra_table(). Supports
per-table schemas, row callbacks (single or multiple rows), and
independent retention settings.
Vsevolod Stakhov [Mon, 29 Dec 2025 22:28:40 +0000 (22:28 +0000)]
[Fix] Fix replxx build with LLVM 21+
- Simplify CMakeLists.txt to use CMAKE_CXX_STANDARD 20
- Replace std::unordered_map with std::map to avoid libc++ ABI issues
- Add operator< to UnicodeString for std::map compatibility
Vsevolod Stakhov [Sun, 28 Dec 2025 21:20:12 +0000 (21:20 +0000)]
[Fix] Avoid SDK headers in include path when package ROOT is specified
- Add NO_DEFAULT_PATH to FIND_PATH when PKG_ROOT is set to prevent
macOS SDK C headers from polluting include paths before libc++
- Fix typo: {RSPAMD_DEFAULT_INCLUDE_PATHS} -> ${...}
- Remove obsolete paths (/opt/csw, /sw), add /opt/homebrew for macOS
Vsevolod Stakhov [Sun, 28 Dec 2025 18:45:05 +0000 (18:45 +0000)]
[Feature] Rename fuzzy_check max_score to hits_limit for clarity
The option name max_score was confusing as it doesn't refer to the
symbol score but rather the number of fuzzy hash hits at which the
normalized score reaches ~1.0 (formula: tanh(e * hits / hits_limit)).
- Rename max_score -> hits_limit in fuzzy_check.c and default config
- Add backward compatibility: max_score is still accepted as an alias
- Add lua_cfg_transform to handle legacy configs (max_score overrides
hits_limit to ensure local.d overrides work correctly)
- Add explanatory comments in config and documentation
Vsevolod Stakhov [Sat, 27 Dec 2025 10:59:05 +0000 (10:59 +0000)]
[Fix] Add resilience to lua_cfg_transform
- Check :type() before indexing UCL objects to handle null values
- Wrap transform sections in pcall to prevent one bad config section
from breaking the entire configuration load
- Log errors with section name for easier debugging
Vsevolod Stakhov [Tue, 23 Dec 2025 10:13:43 +0000 (10:13 +0000)]
[Fix] Restore Lua stack properly in second-pass MIME detection
Fix a lua_settop(L, 0) call that cleared the entire Lua stack instead
of restoring it to its previous state, causing segfaults when
process_message() was called from Lua unit tests.
Vsevolod Stakhov [Mon, 22 Dec 2025 11:53:37 +0000 (11:53 +0000)]
[Fix] Use Fibonacci hashing for task pointer hash
Use golden ratio multiplication for 64-bit to 32-bit pointer hashing.
This provides good distribution with minimal operations (1 multiply +
1 shift) and works well with kh_int_hash_func which is identity.
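A minimal sketch of Fibonacci hashing for pointers follows. The constant is the usual 64-bit golden ratio multiplier (2^64 divided by the golden ratio, rounded to an odd integer); the function names are hypothetical, and the exact constant and shift used in rspamd are assumed rather than quoted.

```c
#include <assert.h>
#include <stdint.h>

/* 2^64 / phi, rounded to the nearest odd integer. */
#define GOLDEN_RATIO_64 0x9E3779B97F4A7C15ULL

/* One multiply and one shift: the high 32 bits of the product are the
 * best mixed ones, so they are kept as the 32-bit hash. */
static inline uint32_t ptr_hash_fib(const void *p)
{
    uint64_t v = (uint64_t) (uintptr_t) p;

    return (uint32_t) ((v * GOLDEN_RATIO_64) >> 32);
}

/* Small self-check on values that are easy to verify by hand. */
static void ptr_hash_selfcheck(void)
{
    assert(ptr_hash_fib((const void *) (uintptr_t) 0) == 0);
    assert(ptr_hash_fib((const void *) (uintptr_t) 1) == 0x9E3779B9u);
    /* Neighbouring 16-byte aligned addresses land far apart. */
    assert(ptr_hash_fib((const void *) (uintptr_t) 16) !=
           ptr_hash_fib((const void *) (uintptr_t) 32));
}
```

The mixing matters because an identity hash function would otherwise use the raw pointer value, whose low bits are mostly zero due to allocator alignment.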
Vsevolod Stakhov [Mon, 22 Dec 2025 11:25:49 +0000 (11:25 +0000)]
[Fix] Add logging, preallocation and hash mixing to task registry
- Log error when detecting use-after-free attempt on task pointer
- Preallocate task set to 16 elements to reduce early rehashing
- Mix pointer bits using multiplicative hash for better distribution
Vsevolod Stakhov [Mon, 22 Dec 2025 10:06:19 +0000 (10:06 +0000)]
[Fix] Use pointer set instead of key map for task validation
Store task pointers in a khash set and validate them on lookup
from Lua. This works with all code paths that create task userdata
directly without going through rspamd_lua_task_push.
Vsevolod Stakhov [Sun, 21 Dec 2025 20:05:27 +0000 (20:05 +0000)]
[Feature] Add task registry for safe Lua task reference validation
Implement a global task registry that maps unique uint64_t keys to task
pointers. This prevents use-after-free bugs when Lua code holds references
to tasks that may have been freed (e.g., in async Redis callbacks).
Key changes:
- Add lua_key field to rspamd_task struct
- Implement task registry using khash (O(1) lookup)
- Store lua_key in Lua userdata instead of raw pointer
- Lookup via registry when extracting task from Lua
- Remove task from registry FIRST in rspamd_task_free()
The counter-based key approach avoids issues with:
- Pointer reuse after free (memory allocator may reuse addresses)
- Lua number precision (52-bit mantissa is sufficient for counter)
- NaN/subnormal float values that could cause issues
This fixes potential use-after-free in Redis script waitq callbacks
when Redis is unavailable longer than task lifetime.
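The counter-based key idea can be sketched as follows. This is a deliberately small, fixed-size stand-in for the khash-based registry: the slot table, its size, and every name except lua_key are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REG_SLOTS 64 /* fixed capacity, for the sketch only */

struct task {
    uint64_t lua_key; /* 0 means "not registered" */
    int payload;
};

struct registry {
    uint64_t keys[REG_SLOTS];
    struct task *tasks[REG_SLOTS];
    uint64_t next_key; /* monotonically increasing, never reused */
};

/* Assign a fresh key; skip keys whose slot is held by a live task. */
static uint64_t registry_add(struct registry *r, struct task *t)
{
    assert(r != NULL && t != NULL);

    for (size_t tries = 0; tries < REG_SLOTS; tries++) {
        uint64_t key = ++r->next_key;
        size_t slot = (size_t) (key % REG_SLOTS);

        if (r->tasks[slot] == NULL) {
            r->keys[slot] = key;
            r->tasks[slot] = t;
            t->lua_key = key;
            return key;
        }
    }
    return 0; /* registry full */
}

/* What Lua-facing code would call instead of trusting a raw pointer. */
static struct task *registry_find(const struct registry *r, uint64_t key)
{
    size_t slot = (size_t) (key % REG_SLOTS);

    if (key != 0 && r->tasks[slot] != NULL && r->keys[slot] == key) {
        return r->tasks[slot];
    }
    return NULL; /* stale or unknown key: the task is gone */
}

/* Must run first in the task destructor, before any memory is freed. */
static void registry_remove(struct registry *r, struct task *t)
{
    size_t slot = (size_t) (t->lua_key % REG_SLOTS);

    if (r->tasks[slot] == t) {
        r->tasks[slot] = NULL;
        r->keys[slot] = 0;
    }
    t->lua_key = 0;
}
```

Because keys come from a counter rather than from addresses, a late callback holding the key of a freed task gets NULL back even if the allocator has already reused that address for a new task.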