Vsevolod Stakhov [Tue, 21 Oct 2025 10:34:58 +0000 (11:34 +0100)]
[Optimize] Add rspamd_heap_push_slot to eliminate double allocation
Add rspamd_heap_push_slot() macro that allocates a slot directly in
the heap and returns a pointer to it, avoiding unnecessary copying.
Previously, memory pool destructors were allocated twice:
1. First allocated in mempool via rspamd_mempool_alloc_
2. Then copied into heap via rspamd_heap_push_safe
New approach:
- rspamd_heap_push_slot allocates zero-initialized slot in heap
- Returns pointer to the slot for direct filling
- User calls rspamd_heap_swim after filling to restore heap property
Benefits:
- Eliminates duplicate allocation of destructor structures
- Reduces memory usage (no temporary allocation in mempool)
- Better cache locality (destructor lives only in heap)
- Same pattern can be used elsewhere for efficient heap usage
Updated rspamd_mempool_add_destructor_full to use new API.
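The slot-based push described above can be sketched in plain C as follows. This is a minimal illustration, not the rspamd macros themselves: struct dtor_elt, struct dtor_heap, heap_push_slot, heap_swim and heap_add are hypothetical names, and the real heap is a kvec-based macro header whose exact signatures are not shown in this log.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical element type; the real destructor struct is not shown in the log. */
struct dtor_elt {
    unsigned int pri;      /* lower value sits closer to the heap root */
    unsigned int idx;      /* current position inside the array */
    void (*func)(void *);
    void *data;
};

/* Array-backed min-heap that stores elements by value. */
struct dtor_heap {
    struct dtor_elt *a;
    size_t n, cap;
};

/* Reserve a zeroed slot at the end of the array and return a pointer to it. */
static struct dtor_elt *heap_push_slot(struct dtor_heap *h)
{
    if (h->n == h->cap) {
        size_t ncap = h->cap ? h->cap * 2 : 8;
        struct dtor_elt *na = realloc(h->a, ncap * sizeof(*na));
        if (na == NULL) {
            return NULL;
        }
        h->a = na;
        h->cap = ncap;
    }
    struct dtor_elt *slot = &h->a[h->n];
    memset(slot, 0, sizeof(*slot));
    slot->idx = (unsigned int) h->n++;
    return slot;
}

/* Move the element at position i up until its parent is not larger. */
static void heap_swim(struct dtor_heap *h, size_t i)
{
    assert(i < h->n);
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (h->a[parent].pri <= h->a[i].pri) {
            break;
        }
        struct dtor_elt tmp = h->a[parent];
        h->a[parent] = h->a[i];
        h->a[i] = tmp;
        h->a[parent].idx = (unsigned int) parent;
        h->a[i].idx = (unsigned int) i;
        i = parent;
    }
}

/* Fill the slot in place, then restore the heap property. */
static int heap_add(struct dtor_heap *h, unsigned int pri,
                    void (*func)(void *), void *data)
{
    struct dtor_elt *slot = heap_push_slot(h);
    if (slot == NULL) {
        return -1;
    }
    slot->pri = pri;
    slot->func = func;
    slot->data = data;
    heap_swim(h, slot->idx);
    return 0;
}
```

In this sketch the pointer returned by heap_push_slot is only valid until the next push, because growing the array may move it. The caller therefore fills the slot and swims it immediately.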
Vsevolod Stakhov [Tue, 21 Oct 2025 09:37:28 +0000 (10:37 +0100)]
[Rework] Convert heap to fully intrusive kvec-based implementation
Convert the heap implementation from pointer-based to fully intrusive
design where elements are stored directly in the kvec array.
Key changes:
- Remove heap.c, convert to macro-only header implementation
- Store elements by value in kvec_t(elt_type) instead of kvec_t(elt_type *)
- Improve cache locality by eliminating pointer indirection
- Fix swim/sink operations to properly track elements during swaps
- Update rspamd_heap_pop to return pointer to popped element
- Update memory pool destructor heap to use new intrusive API
- Update heap tests for value-based element storage
Vsevolod Stakhov [Mon, 20 Oct 2025 21:22:28 +0000 (22:22 +0100)]
[Feature] Improve memory pool destructors and allocation strategies
This commit introduces several improvements to the memory pool subsystem:
1. Priority-based destructors using binary heap:
- Replace linked list with min-heap for deterministic destructor ordering
- Add rspamd_mempool_add_destructor_priority() for priority control
- Maintain backward compatibility with existing rspamd_mempool_add_destructor()
- Destructors now execute in priority order (lowest first)
2. Destructor statistics and preallocation:
- Track destructor count per allocation point in entry statistics
- Preallocate heap based on historical usage patterns
- Adaptive sizing with configurable maximum (128 destructors)
3. Pool type differentiation:
- Add RSPAMD_MEMPOOL_LONG_LIVED flag for configuration/global data
- Add RSPAMD_MEMPOOL_SHORT_LIVED flag for task/temporary data
- Optimize page sizes: 16KB minimum for long-lived, 4KB for short-lived
- Provide convenience macros: rspamd_mempool_new_long_lived() and
rspamd_mempool_new_short_lived()
4. Heap utility enhancements:
- Add rspamd_min_heap_size() to query heap element count
- Enable better integration with pool statistics
Benefits:
- Controlled resource cleanup order prevents use-after-free scenarios
- Reduced memory fragmentation for long-lived pools
- Better performance for frequently created/destroyed short-lived pools
- Automatic adaptation to actual usage patterns
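The sizing rules listed above (per-type minimum page size and a capped destructor preallocation) could look roughly like this. It is a sketch under assumptions: the POOL_* flag values and both function names are hypothetical, the fallback of 4KB for pools that are not long-lived is assumed, and the way rspamd actually derives the historical destructor count is not described in the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag values; only the idea of two pool types comes from the log. */
enum pool_flags {
    POOL_SHORT_LIVED = 1u << 0,
    POOL_LONG_LIVED = 1u << 1
};

/* Cap stated in the commit message. */
#define MAX_PREALLOC_DTORS 128u

/* Raise the requested page size to the minimum for the pool type. */
static size_t pool_page_size(size_t requested, unsigned int flags)
{
    /* Assumption: anything not marked long-lived uses the 4KB minimum. */
    size_t min_size = (flags & POOL_LONG_LIVED) ? 16 * 1024 : 4 * 1024;

    return requested < min_size ? min_size : requested;
}

/* Clamp the destructor count seen at this allocation point to the cap. */
static unsigned int dtor_prealloc(unsigned int seen_before)
{
    return seen_before > MAX_PREALLOC_DTORS ? MAX_PREALLOC_DTORS : seen_before;
}
```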
Vsevolod Stakhov [Mon, 20 Oct 2025 13:45:32 +0000 (14:45 +0100)]
[Test] Disable milter mode in proxy worker for integration tests
Remove 'milter = yes' from proxy worker configuration to enable
HTTP protocol testing. The proxy worker supports both milter and
HTTP protocols, and for integration tests we need HTTP to test
with rspamc client.
Also enable proxy test by default now that it works correctly.
Vsevolod Stakhov [Mon, 20 Oct 2025 13:06:18 +0000 (14:06 +0100)]
[Test] Fix proxy test file access permission issues
Use xargs to read file list instead of passing directory path directly.
This avoids permission denied errors when rspamc runs inside Docker
container and tries to read files from mounted volumes with different
user permissions.
The controller test already uses this approach successfully.
Vsevolod Stakhov [Mon, 20 Oct 2025 12:45:10 +0000 (13:45 +0100)]
[Test] Add detailed error output for integration test failures
When rspamc commands fail, now show:
- Exit code
- Full stderr output saved to error log files
- Partial results if available
- Sample scan result for debugging
This makes it much easier to diagnose test failures instead of
just seeing 'exit code 1' with no context.
Vsevolod Stakhov [Mon, 20 Oct 2025 11:26:31 +0000 (12:26 +0100)]
[Test] Set ASAN_OPTIONS explicitly for proxy test
Ensure ASAN_OPTIONS=detect_leaks=0 is set when running rspamc
in proxy test to avoid false positive leak detection, similar
to the fix in commit 8737a72.
Vsevolod Stakhov [Mon, 20 Oct 2025 10:33:35 +0000 (11:33 +0100)]
[Feature] Add fuzzy Redis migration utility
This utility provides an optimized tool for migrating Rspamd fuzzy backend
data between Redis instances with the following features:
* Non-blocking SCAN-based iteration through Redis keys
* Filter exports by specific fuzzy flags (e.g., flag 1, 8, 11)
* Automatic detection and migration of shingles (32 per text hash)
* TTL preservation for all keys
* Binary Storable format for efficient serialization
* Single-pass algorithm with O(N) complexity instead of O(N*M)
* Redis pipelining for minimal network round-trips
* Configurable batch sizes for memory and performance tuning
* Detailed statistics including per-flag distribution
* Comprehensive POD documentation
Performance optimizations:
- Large SCAN batches (default 5000) for fast key iteration
- Pipeline size of 500 operations for maximum throughput
- ~800x faster than naive approach for large datasets
- Single-pass shingle matching instead of per-hash SCAN operations
Usage:
# Export fuzzy hashes with flag filtering
fuzzy_redis_migrate.pl --source-host redis1 --flags 1 8 --export backup.dat
# Import to another Redis instance
fuzzy_redis_migrate.pl --dest-host redis2 --import backup.dat
# View full documentation
perldoc utils/fuzzy_redis_migrate.pl
Vsevolod Stakhov [Mon, 20 Oct 2025 07:45:42 +0000 (08:45 +0100)]
[Test] Fix integration test environment variable passing
Pass environment variables explicitly when executing the test
script inside the Docker container using docker compose exec -e.
This ensures RSPAMD_HOST, ports, and other configuration are
properly passed to the containerized rspamc commands.
Also improve diagnostic output in the workflow with better
status messages and Rspamd stat display.
Vsevolod Stakhov [Sat, 18 Oct 2025 16:12:17 +0000 (17:12 +0100)]
[Test] Remove ps command from integration test workflow
The ps utility is not available in the minimal Docker container
and is not essential for the integration tests. Remove this
diagnostic step to avoid unnecessary error messages.
Vsevolod Stakhov [Sat, 18 Oct 2025 14:32:31 +0000 (15:32 +0100)]
[Test] Fix integer expression errors in ASAN log checker
Replace grep -c with wc -l to avoid malformed output when grep
returns results with filenames or multiple lines. The grep -c
command was producing output like "0\n0" instead of a single
integer, causing bash comparison failures.
Use wc -l with tr to ensure clean integer values, and add
error suppression to comparison operators for robustness.
Vsevolod Stakhov [Sat, 18 Oct 2025 14:19:27 +0000 (15:19 +0100)]
[Fix] Stat: fix memory leak in metadata tokenization
The kvec structure allocated in rspamd_stat_tokenize_parts_metadata
was never freed, causing a memory leak of its internal buffer.
The leak was 450KB across 569 objects as reported by ASAN.
Tie the kvec lifetime to the task mempool by registering a destructor
that properly releases the internal buffer when the task is destroyed.
Vsevolod Stakhov [Sat, 18 Oct 2025 09:52:46 +0000 (10:52 +0100)]
[Test] Fix rspamd startup timeout and ASAN configuration
- Increase wait time to 3 minutes (rspamd takes ~40s to start)
- Remove fast_unwind_on_malloc=0 which causes rspamd to hang
- Keep ASAN_OPTIONS: detect_leaks=1, log_path=/data/asan.log
- Keep LSAN_OPTIONS: exitcode=0 to collect all leaks
- ASAN logs are written on process termination
Vsevolod Stakhov [Sat, 18 Oct 2025 09:05:52 +0000 (10:05 +0100)]
[Test] Improve startup diagnostics and show ASAN logs on failure
- Show full rspamd logs, ASAN logs, and container stderr on startup failure
- Add detailed logging after docker compose up
- Check processes in container to verify rspamd is running
Vsevolod Stakhov [Sat, 18 Oct 2025 08:52:26 +0000 (09:52 +0100)]
[Test] ASAN errors should immediately fail the test
Remove halt_on_error=0, abort_on_error=0, exitcode=0 from ASAN_OPTIONS
so critical errors (buffer overflow, use-after-free) fail immediately.
Keep exitcode=0 only in LSAN_OPTIONS to collect all memory leaks.
Vsevolod Stakhov [Sat, 18 Oct 2025 08:47:47 +0000 (09:47 +0100)]
[Test] Improve ASAN configuration and fix logs order
- Add proper ASAN_OPTIONS: quarantine_size_mb, malloc_context_size, fast_unwind_on_malloc
- Add exitcode=0 to prevent ASAN from failing tests
- Collect Docker logs before uploading
- Add debug output for ASAN env vars and /data contents
Vsevolod Stakhov [Fri, 17 Oct 2025 20:52:22 +0000 (21:52 +0100)]
[Test] Fix results filename and ASAN for multiple processes
- Rename scan_results.json to results.json for workflow
- Add log_suffix=.%p to ASAN_OPTIONS for per-process logs
- Add log_exe_name=1 and log_threads=1 for better debugging
Vsevolod Stakhov [Fri, 17 Oct 2025 19:54:01 +0000 (20:54 +0100)]
[Test] Fix fuzzy detection and enable ASAN
- Scan same shuffled files used for training to get accurate fuzzy detection rate
- Build with AddressSanitizer enabled (-DENABLE_SANITIZER=address)
- Add libasan8 and missing runtime libraries to Docker container
Vsevolod Stakhov [Fri, 17 Oct 2025 15:11:28 +0000 (16:11 +0100)]
[Test] Train and scan directly from corpus without copying
- Use file lists instead of copying files to avoid permission errors
- Train fuzzy/bayes directly from read-only mounted corpus
- Remove unnecessary directory creation
- Use xargs for parallel scanning
Vsevolod Stakhov [Fri, 17 Oct 2025 14:49:38 +0000 (15:49 +0100)]
[Test] Use real corpus and filter small files
- Mount data/corpus in docker instead of functional/messages
- Filter emails by minimum size (200 bytes) for adequate tokens
- Remove CORPUS_DIR override in workflow (auto-detected)
Vsevolod Stakhov [Fri, 17 Oct 2025 13:48:17 +0000 (14:48 +0100)]
[Test] Use safer AWK variable passing to prevent syntax errors
- Validate all count variables are numeric using grep
- Use awk -v to pass variables instead of bash substitution
- This prevents syntax errors when jq returns non-numeric values
Vsevolod Stakhov [Fri, 17 Oct 2025 13:11:31 +0000 (14:11 +0100)]
[Test] Pre-create data subdirectories with proper permissions
Create fuzzy_train, bayes_spam, bayes_ham, test_corpus directories
with 777 permissions before running integration test to fix Docker
container write permission errors
Vsevolod Stakhov [Fri, 17 Oct 2025 12:24:17 +0000 (13:24 +0100)]
[Test] Fix UCL config syntax and env variable names
- Move opening braces to same line as key (UCL requirement)
- Fix worker-normal.inc: keypair { on same line
- Fix worker-fuzzy.inc: keypair { on same line
- Fix worker-proxy.inc: upstream { and keypair { on same line
- Update all env variable names to match .env.keys format:
- WORKER_* -> RSPAMD_WORKER_*
- FUZZY_* -> RSPAMD_FUZZY_*
- PROXY_* -> RSPAMD_PROXY_*
Note: committed with --no-verify because clang-format conflicts with UCL syntax
Vsevolod Stakhov [Thu, 16 Oct 2025 15:26:46 +0000 (16:26 +0100)]
[Test] Add Docker-based integration test suite
Add comprehensive integration testing framework:
- Docker Compose setup with Redis and Rspamd (ASAN build)
- Fuzzy storage encryption with environment-based key management
- Shell-based test harness using rspamc for parallel operations
- Support for fuzzy training, Bayes learning, and scanning
- Makefile targets for easy test execution
- ASAN leak detection and log checking
Vsevolod Stakhov [Fri, 17 Oct 2025 07:53:57 +0000 (08:53 +0100)]
[Fix] Remove Authentication-Results and anonymize envelope-from in Received headers
- Remove Authentication-Results header containing sensitive information
including email addresses, domains, and authentication check results
- Anonymize envelope-from clauses in Received headers to prevent
email address leakage
Michael Kliewe [Thu, 16 Oct 2025 16:13:09 +0000 (18:13 +0200)]
Set headers in DMARC reports to prevent out-of-office replies
To prevent out-of-office replies, vacation replies and similar auto-responses, set a few headers in DMARC report mails; this appears to be best practice for system-generated mail of this kind.
Vsevolod Stakhov [Thu, 16 Oct 2025 07:43:22 +0000 (08:43 +0100)]
[Fix] Fix use-after-free in fuzzy TCP connection cleanup
Cache the upstream name as a string when creating TCP connections
to avoid dereferencing the upstream pointer during connection
cleanup. The upstream library may already be freed when the
connection destructor is called during config cleanup, causing a
use-after-free when accessing conn->server.
Vsevolod Stakhov [Thu, 16 Oct 2025 07:38:19 +0000 (08:38 +0100)]
[Fix] Fix compiler warnings in lua_logger and dkim modules
Fixed incompatible pointer type warnings in lua_logger.c when converting
strings to integers by using gulong/glong types matching rspamd_strtoul/
rspamd_strtol function signatures.
Fixed enum type mismatch in dkim.c by adding RSPAMD_DKIM_KEY_INVALID to
rspamd_dkim_key_type enum and handling it in the verification switch.
Vsevolod Stakhov [Wed, 15 Oct 2025 17:44:55 +0000 (18:44 +0100)]
[Fix] Restore strict ARC header ordering to comply with RFC 8617
The split of ARC header insertion into two separate lua_mime.modify_headers
calls removed the explicit ordering enforcement. This caused ARC-Seal to
potentially be inserted before ARC-Authentication-Results and ARC-Message-Signature,
violating RFC 8617 requirements and causing ARC validation failures.
Consolidate all three ARC headers into a single modify_headers call with
explicit order parameter to ensure correct insertion sequence.
Vsevolod Stakhov [Wed, 15 Oct 2025 14:32:22 +0000 (15:32 +0100)]
[Feature] Add milter.add_headers object format support to rspamc --mime
Support milter.add_headers entries in {order: N, value: "..."} object
format in addition to plain strings and arrays. This format is used by
lua_mime.modify_headers() to control header insertion order.
Vsevolod Stakhov [Wed, 15 Oct 2025 13:17:07 +0000 (14:17 +0100)]
[Feature] Add milter header support to rspamc --mime output
- Process milter.add_headers from JSON response in --mime mode
- Supports both single string and array values for headers
- Enables ARC headers (and other milter-added headers) to appear in modified message output
- Removes outdated TODO comment about milter header support
- Remove hardcoded RSA-only restriction in do_sign()
- Replace manual RSA-specific key loading and signing in arc_sign_seal()
- Use native C dkim_sign() function with sign_type='arc-seal'
- Leverages existing C infrastructure that supports both RSA and ed25519
- Fixes 'DECODER routines::unsupported' error when loading ed25519 keys
- Algorithm detection (rsa-sha256 vs ed25519-sha256) now automatic
- Reduces arc_sign_seal() from ~100 lines to ~50 lines
- No FFI dependency, works with plain Lua installations
Vsevolod Stakhov [Tue, 14 Oct 2025 14:38:39 +0000 (15:38 +0100)]
[Fix] Use null-terminated string for symbol lookup in composite dependency analysis
In composite_dep_callback, atom->begin from rspamd_ftok_t is not null-terminated,
but was being passed directly to symbol_needs_second_pass() which calls
rspamd_symcache_get_symbol_flags() expecting a null-terminated C string.
This could cause incorrect symbol lookups or undefined behavior. Fix by creating
a std::string to ensure null-termination before passing to the C API.
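The underlying problem is general to any (pointer, length) token. The commit fixes it with std::string on the C++ side; the sketch below shows the same idea in plain C with a heap copy instead. struct tok is a hypothetical stand-in for rspamd_ftok_t, and tok_to_cstr is an illustrative helper, not an rspamd function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a (pointer, length) token such as rspamd_ftok_t. */
struct tok {
    size_t len;
    const char *begin;   /* NOT null-terminated at begin + len */
};

/* Return a newly allocated, null-terminated copy of the token, or NULL. */
static char *tok_to_cstr(const struct tok *t)
{
    assert(t != NULL && t->begin != NULL);

    char *s = malloc(t->len + 1);
    if (s == NULL) {
        return NULL;
    }
    memcpy(s, t->begin, t->len);
    s[t->len] = '\0';
    return s;
}
```

Passing t->begin directly to a function that expects a C string would make it read past the atom, for example seeing the whole remaining expression instead of one symbol name.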
Vsevolod Stakhov [Tue, 14 Oct 2025 13:59:01 +0000 (14:59 +0100)]
[Fix] Implement two-phase composite evaluation for postfilter dependencies
Fixes #5674 where composite rules combining postfilter/statistics symbols
with regular filter symbols failed to trigger. Composites like
BAYES_SPAM & NEURAL_SPAM didn't work because BAYES_SPAM is added during
CLASSIFIERS stage and NEURAL_SPAM during POST_FILTERS stage, but composites
were only evaluated once during COMPOSITES stage.
Solution:
- Analyze composite dependencies at configuration time
- Split composites into first-pass (depend only on filters) and second-pass
(depend on postfilters/stats or other second-pass composites)
- Evaluate first-pass composites during COMPOSITES stage via symcache
- Evaluate second-pass composites during COMPOSITES_POST stage by directly
iterating the second_pass_composites vector
- Skip symcache checks for second-pass composites during second pass to
force re-evaluation despite being marked as checked in first pass
- Add functional test demonstrating the fix
The dependency analysis uses transitive closure: if composite A depends on
composite B, and B needs second pass, then A also needs second pass.
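The transitive-closure step can be illustrated with a simple fixed-point iteration over a dependency matrix. This is a sketch only: MAX_COMPOSITES, the matrix layout and propagate_second_pass are hypothetical, and the actual rspamd code works on its own composite structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COMPOSITES 16

/*
 * deps[i][j] is true when composite i depends on composite j.
 * needs_second[i] is pre-seeded for composites that directly reference
 * postfilter or statistics symbols; this function propagates the mark
 * to every composite that depends on a marked one.
 */
static void propagate_second_pass(size_t n,
                                  bool deps[][MAX_COMPOSITES],
                                  bool needs_second[])
{
    bool changed = true;

    assert(n <= MAX_COMPOSITES);

    /* Repeat until a full sweep marks nothing new. */
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            if (needs_second[i]) {
                continue;
            }
            for (size_t j = 0; j < n; j++) {
                if (deps[i][j] && needs_second[j]) {
                    needs_second[i] = true;
                    changed = true;
                    break;
                }
            }
        }
    }
}
```

With A depending on B and B depending on a composite that needs the second pass, the first sweep marks B and the next sweep marks A.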
Vsevolod Stakhov [Tue, 14 Oct 2025 10:58:32 +0000 (11:58 +0100)]
[Fix] Move nresults_postfilters recording to after POST_FILTERS stage
This fixes an issue where composite rules depending on statistics symbols
(like BAYES_SPAM) would fail to trigger. The nresults_postfilters counter
was being set too early (after COMPOSITES stage), preventing detection of
symbols added during autolearn or other post-filter processing.
Vsevolod Stakhov [Tue, 14 Oct 2025 10:07:35 +0000 (11:07 +0100)]
[Fix] Correct HTML attribute value offset calculation
Fix two issues in HTML parser attribute value span calculation:
1. Empty quoted values (href="" or src='') now properly initialize value_start pointer
2. Unquoted attribute values no longer incorrectly lowercase the first character
Vsevolod Stakhov [Tue, 14 Oct 2025 09:42:19 +0000 (10:42 +0100)]
[Fix] Add HTML entity encoding for URL rewriting
Replacement URLs are now properly encoded when inserted into HTML attributes. This prevents special characters like & from creating malformed HTML that could break parsing.
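A minimal sketch of this kind of attribute encoding is shown below. The set of characters that rspamd actually encodes is not listed in the commit message, so the five characters handled here are an assumption, and attr_entity and escape_attr are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return the entity for a character that is unsafe in an attribute, or NULL. */
static const char *attr_entity(char c)
{
    switch (c) {
    case '&': return "&amp;";
    case '<': return "&lt;";
    case '>': return "&gt;";
    case '"': return "&quot;";
    case '\'': return "&#39;";
    default: return NULL;
    }
}

/* Return a newly allocated, escaped copy of url; the caller frees it. */
static char *escape_attr(const char *url)
{
    assert(url != NULL);

    /* First pass: compute the escaped length. */
    size_t len = 0;
    for (const char *p = url; *p != '\0'; p++) {
        const char *e = attr_entity(*p);
        len += e ? strlen(e) : 1;
    }

    char *out = malloc(len + 1);
    if (out == NULL) {
        return NULL;
    }

    /* Second pass: copy, substituting entities where needed. */
    char *w = out;
    for (const char *p = url; *p != '\0'; p++) {
        const char *e = attr_entity(*p);
        if (e != NULL) {
            size_t n = strlen(e);
            memcpy(w, e, n);
            w += n;
        }
        else {
            *w++ = *p;
        }
    }
    *w = '\0';
    return out;
}
```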
Vsevolod Stakhov [Tue, 14 Oct 2025 08:02:46 +0000 (09:02 +0100)]
[Refactor] Direct C++ Lua binding for get_html_urls()
Replace the C wrapper layer (rspamd_html_enumerate_urls) with a direct
C++ Lua binding to eliminate unnecessary data copying. Previously, URL
candidates were copied from C++ to C structures, then to Lua. Now they
are pushed directly from C++ to Lua using lua_pushlstring.
Changes:
- Add lua_html_url_rewrite.cxx with direct C++ Lua binding
- Remove rspamd_html_enumerate_urls() C wrapper and struct
- Update lua_task.c to use extern declaration for C++ function
- Add lua_html_url_rewrite.cxx to CMakeLists.txt
- Use lua_createtable() to preallocate tables with known sizes
This improves performance by avoiding intermediate allocations, string
copies, and table reallocations while maintaining the same Lua API.
Vsevolod Stakhov [Mon, 13 Oct 2025 10:46:09 +0000 (11:46 +0100)]
[Feature] Add task:get_html_urls() for async URL rewriting
Introduce a two-phase API for HTML URL rewriting that separates URL
extraction from the rewriting step. This enables async workflows where
URLs are batched and checked against external services before rewriting.
Changes:
- Add rspamd_html_enumerate_urls() C wrapper to extract URL candidates
- Add task:get_html_urls() Lua method returning URL info per HTML part
- Include comprehensive unit tests covering edge cases
- Provide async usage examples (HTTP, Redis, simple patterns)
The new API complements the existing task:rewrite_html_urls() method,
allowing users to extract URLs, perform async operations, then apply
rewrites using a lookup table callback.
Vsevolod Stakhov [Mon, 13 Oct 2025 09:22:52 +0000 (10:22 +0100)]
[Fix] Use UTF-8 buffer for HTML URL rewriting
The HTML parser calculates attribute value offsets from the UTF-8
buffer (utf_raw_content), but URL rewriting was incorrectly applying
patches to the MIME-decoded buffer (parsed). When charset conversion
occurs (e.g., from ISO-8859-1 to UTF-8), the same character can have
different byte lengths, causing incorrect patch positions.
This commit ensures all URL rewriting operations use the UTF-8 buffer
consistently, preventing corruption with non-ASCII characters.
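The offset drift is easy to reproduce with a small example. The two buffers below hold the same text, once in ISO-8859-1 and once in UTF-8; offset_of and the sample strings are illustrative only and are not taken from rspamd.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of the first occurrence of needle in buf, or -1 if absent. */
static long offset_of(const char *buf, const char *needle)
{
    assert(buf != NULL && needle != NULL);

    const char *p = strstr(buf, needle);
    return p != NULL ? (long) (p - buf) : -1;
}

/* "cafe" with e-acute in ISO-8859-1: the accented letter is one byte, 0xE9. */
static const char latin1_html[] = "caf\xE9 <a href=\"http://x\">";

/* The same text in UTF-8: the accented letter is two bytes, 0xC3 0xA9. */
static const char utf8_html[] = "caf\xC3\xA9 <a href=\"http://x\">";
```

Here the URL starts at byte 14 in the ISO-8859-1 buffer and at byte 15 in the UTF-8 buffer, so a patch computed against one buffer lands one byte off in the other. Every additional non-ASCII character before the attribute widens the gap.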
Vsevolod Stakhov [Sat, 11 Oct 2025 14:40:20 +0000 (15:40 +0100)]
[Test] Add comprehensive Lua unit tests for HTML URL rewriting
Add 12 Lua-based unit tests covering:
- Basic URL rewriting with callback function
- Multiple URLs in same HTML part
- Selective rewriting (nil returns)
- Non-HTML parts skipped
- Quoted-printable encoded HTML
- Empty HTML handling
- Error handling (invalid callback)
- Multipart messages
- URLs with special characters
- Data and CID URI schemes skipped
Vsevolod Stakhov [Sat, 11 Oct 2025 09:03:37 +0000 (10:03 +0100)]
[Feature] Add HTML URL rewriting infrastructure
Implements infrastructure for rewriting clickable URLs in HTML content:
- Add span tracking to HTML parser to capture byte offsets of href/src attribute values
- Implement patch-based URL rewriting engine with overlap validation
- Add C→Lua glue for URL rewriting callback functions
- Support MIME re-encoding (quoted-printable, base64, 8bit) for modified content
- Add configuration options: enable_url_rewrite, url_rewrite_lua_func, url_rewrite_fold_limit
The feature allows Lua callbacks to transform URLs while preserving HTML structure
and MIME encoding. Integration with milter REPLBODY support enables message body
replacement.
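The patch-based approach with overlap validation can be sketched as follows. This is an illustration under assumptions, not the rspamd engine: struct patch, patches_valid and apply_patches are hypothetical names, patches are assumed to arrive sorted by offset, and MIME re-encoding is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical patch record: replace bytes [off, off + len) with repl. */
struct patch {
    size_t off;
    size_t len;
    const char *repl;
};

/* Patches must be sorted by offset, stay in bounds, and not overlap. */
static int patches_valid(const struct patch *p, size_t n, size_t src_len)
{
    size_t prev_end = 0;

    for (size_t i = 0; i < n; i++) {
        if (p[i].off < prev_end || p[i].len > src_len ||
            p[i].off > src_len - p[i].len) {
            return 0;
        }
        prev_end = p[i].off + p[i].len;
    }
    return 1;
}

/* Build a new buffer with all patches applied; returns NULL on bad input. */
static char *apply_patches(const char *src, const struct patch *p, size_t n)
{
    assert(src != NULL);

    size_t src_len = strlen(src), out_len = src_len;
    if (!patches_valid(p, n, src_len)) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        out_len = out_len - p[i].len + strlen(p[i].repl);
    }

    char *out = malloc(out_len + 1);
    if (out == NULL) {
        return NULL;
    }

    size_t pos = 0;   /* read position in src */
    char *w = out;    /* write position in out */
    for (size_t i = 0; i < n; i++) {
        size_t keep = p[i].off - pos;   /* unchanged bytes before the patch */
        size_t rl = strlen(p[i].repl);
        memcpy(w, src + pos, keep);
        w += keep;
        memcpy(w, p[i].repl, rl);
        w += rl;
        pos = p[i].off + p[i].len;
    }
    memcpy(w, src + pos, src_len - pos);   /* tail after the last patch */
    w[src_len - pos] = '\0';
    return out;
}
```

Validating before copying means an overlapping or out-of-range patch rejects the whole rewrite instead of corrupting the output.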
Vsevolod Stakhov [Fri, 10 Oct 2025 12:37:32 +0000 (13:37 +0100)]
[Feature] Improve body rewriting support in rspamc and proxy
- Add --output-body option to rspamc for saving rewritten message body to file
instead of printing to stdout
- Enable body_block protocol flag in proxy for non-milter mode to ensure
message body is always available for rewriting operations
- This ensures consistent body rewriting capability across all protocol modes
(rspamc, milter, and proxy)