[Fix] Fix memory leak in rspamd_shingles_from_html
The struct_sgl object from generate_shingles_from_string_tokens() was only
deleted when pool == nullptr, causing memory leaks when a memory pool was
active. Now struct_sgl is always deleted after copying to res, regardless
of pool allocation method.
[Fix] Update HTML fuzzy encryption to use helper functions
The fuzzy_cmd_from_html_part() function was using legacy encryption logic
that only checked rule->peer_key. Updated to use fuzzy_rule_has_encryption()
and fuzzy_select_encryption_keys() helpers for consistency with other fuzzy
command functions and to support separate read/write encryption keys.
[Fix] Add fallback when only one specific encryption key is set
When only read_encryption_key or write_encryption_key is configured without
a general encryption_key, the unspecified operation type was left with NULL
keys. Now if only one specific key is set, it's used for both read and write
operations as a fallback, ensuring encryption works in all configurations.
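The fallback order can be illustrated with a minimal Python sketch. This is not the C implementation: select_key and the dict-based rule are hypothetical, and the precedence (operation-specific key, then common key, then the other specific key) is an assumption drawn from the commit messages.

```python
from typing import Optional

def select_key(rule: dict, is_write: bool) -> Optional[str]:
    """Pick the encryption key for one operation type (hypothetical helper)."""
    own = "write_encryption_key" if is_write else "read_encryption_key"
    other = "read_encryption_key" if is_write else "write_encryption_key"
    # Assumed precedence:
    #   1. the key configured for this operation type
    #   2. the common encryption_key
    #   3. the other specific key (the fallback added by this fix)
    return rule.get(own) or rule.get("encryption_key") or rule.get(other)
```

With only read_encryption_key configured, select_key(rule, is_write=True) returns the read key instead of None, which is the case this fix addresses.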
[Fix] Fix duplicate key filtering in reply decryption
When the read/write encryption keys fall back to the common encryption_key,
rspamd_pubkey_ref() returns a pointer to the same object. The previous duplicate
checks used pointer comparison and incorrectly filtered out these keys,
causing decryption failures. The code now checks whether a key has already been
added to the decryption attempt list before adding it.
[Minor] Refactor encryption key selection into helper functions
Extract repeated key selection logic into fuzzy_select_encryption_keys()
and fuzzy_rule_has_encryption() helper functions. This reduces code
duplication and improves readability across fuzzy_cmd_stat(),
fuzzy_cmd_ping(), fuzzy_cmd_hash(), fuzzy_cmd_from_text_part(),
fuzzy_cmd_from_data_part(), and fuzzy_process_reply() functions.
[Fix] Fix reply decryption when using only separate read/write keys
In fuzzy_process_reply(), the tag was accessed from encrypted data before
decryption, leading to incorrect key selection. When only separate
read_encryption_key and write_encryption_key were configured (without common
encryption_key), the fallback to NULL keys caused crashes.
Now the function tries decryption with all available key pairs (read, write,
and common) until MAC verification succeeds, properly handling all key
configuration scenarios.
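The "try every key until the MAC verifies" approach can be sketched in Python. This is an illustration only: HMAC-SHA256 stands in for the cryptobox MAC that rspamd actually uses, and candidate_keys and verify_reply are hypothetical names.

```python
import hashlib
import hmac
from typing import List, Optional

def candidate_keys(read_key, write_key, common_key) -> List[bytes]:
    """Collect the configured keys in order, skipping unset and repeated ones."""
    keys: List[bytes] = []
    for key in (read_key, write_key, common_key):
        if key is not None and key not in keys:
            keys.append(key)
    return keys

def verify_reply(payload: bytes, tag: bytes, keys: List[bytes]) -> Optional[bytes]:
    """Return the first key whose MAC matches the reply, or None if none does."""
    for key in keys:
        expected = hmac.new(key, payload, hashlib.sha256).digest()
        # Constant-time comparison avoids leaking how many bytes matched.
        if hmac.compare_digest(expected, tag):
            return key
    return None
```

Because the key is chosen by whichever MAC verifies, nothing has to be read from the still-encrypted reply to decide which key to use.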
[Fix] Ensure encryption works with separate read/write keys in fuzzy_check
Fix condition checks that determine whether to use encryption. Previously,
functions checked only rule->peer_key, causing encryption to be disabled
when using only read_encryption_key and write_encryption_key without a
common encryption_key. Now checks for any encryption keys (peer_key,
read_peer_key, or write_peer_key) to properly enable encryption.
[Feature] Add separate encryption keys for read and write operations in fuzzy_check
Allow using different encryption keys for read (CHECK, STAT, PING) and write
(WRITE, DEL) operations by introducing read_encryption_key and write_encryption_key
configuration parameters. Falls back to encryption_key if separate keys are not
specified for backward compatibility.
[Minor] Add safety checks for short HTML to prevent false positives
Require minimum complexity for HTML fuzzy matching:
- At least 2 links (single-link emails too generic)
- At least DOM depth 3 (flat structures too common)
This prevents false positives on trivial HTML like:
<html><body><p>text <a href="...">link</a></p></body></html>
Such simple structures are not unique enough for reliable fuzzy matching.
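A rough Python sketch of such a complexity gate, using the standard html.parser module. How Rspamd itself counts links and DOM depth is not stated in this commit, so the counting rules here (only <a href> counts as a link, void elements do not add depth) are assumptions, as are the names ComplexityCounter and is_complex_enough.

```python
from html.parser import HTMLParser

# Elements that never have children, so they should not add nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class ComplexityCounter(HTMLParser):
    """Counts links and tracks the maximum nesting depth of the document."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, _ in attrs):
            self.links += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

def is_complex_enough(html: str, min_links: int = 2, min_depth: int = 3) -> bool:
    """True if the HTML has at least min_links links and min_depth nesting."""
    counter = ComplexityCounter()
    counter.feed(html)
    return counter.links >= min_links and counter.max_depth >= min_depth
```

Under these assumptions the trivial example above is rejected because it contains only one link, even though its nesting is deep enough.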
[Minor] Use FUZZY_INCLUDE for HTML fuzzy test configuration
Create fuzzy-html.conf with HTML-specific settings and use
RSPAMD_FUZZY_INCLUDE variable to include it in the fuzzy rule.
This is the correct way to add per-test rule settings.
[Minor] Add debug logging to HTML fuzzy hash generation
Add detailed debug messages to track HTML fuzzy hash generation flow:
- Log when fuzzy_cmd_from_html_part is called
- Log HTML shingles enabled/disabled status
- Log HTML part detection
- Log tag count checks
- Log successful/failed hash generation
This helps diagnose issues with HTML fuzzy matching in tests.
[Minor] Fix HTML fuzzy test to use standard flags and keywords
Use RSPAMD_FLAG1_NUMBER (50) instead of custom flag 100 to match
existing fuzzy.conf configuration. Add proper test flow with setup
checks and standard Robot Framework keywords.
[Test] Add functional tests for HTML fuzzy hashing
Add Robot Framework tests for HTML fuzzy matching:
- html_template_1.eml: legitimate newsletter template
- html_template_1_variation.eml: same structure, different text
- html_phishing.eml: same structure, phishing CTA domains
- html-fuzzy.robot: test suite with add/check/phishing scenarios
Tests verify:
- HTML fuzzy hash generation and matching
- Template variation detection (same structure, different content)
- Phishing detection (same structure, different CTA domains)
- Integration with fuzzy storage backend
[Feature] Integrate HTML fuzzy hashing into fuzzy_check module
Add support for HTML structure fuzzy hashing in fuzzy_check plugin:
Core integration:
- Add FUZZY_CMD_FLAG_HTML flag and FUZZY_RESULT_HTML result type
- Add html_shingles, min_html_tags, html_weight options to fuzzy_rule
- Implement fuzzy_cmd_from_html_part() to generate HTML fuzzy commands
- Integrate into fuzzy_generate_commands() for automatic hash generation
- Handle HTML results with configurable weight multiplier
Configuration:
- html_shingles: enable/disable HTML fuzzy hashing per rule
- min_html_tags: minimum HTML tags threshold (default 10)
- html_weight: score multiplier for HTML matches (default 1.0)
Use cases:
1. Brand protection: detect phishing with copied HTML but fake CTA
2. Spam campaigns: group messages by HTML structure
3. Template detection: identify newsletters/notifications
4. Phishing: text match + HTML CTA mismatch = suspicious
HTML fuzzy works alongside text fuzzy:
- Both hashes generated and sent to storage
- Separate result types allow different handling
- CTA domain verification prevents false positives
Next steps:
- Performance testing on real email corpus
- Fine-tune weights and thresholds
- Collect legitimate brand templates for whitelisting
[Fix] Fix union handling in ED25519 key loading to prevent memory corruption
When loading ED25519 keys from PEM, the code was writing to key_eddsa in the
union and then attempting to free key_ssl pointers, which corrupted the
key_eddsa pointer and caused use-after-free/double-free during cleanup.
The fix saves the EVP_PKEY and BIO pointers to temporary variables, extracts
the raw key, frees the OpenSSL objects, and only then assigns to the union.
This prevents memory corruption and resource leaks.
[Feature] Add ED25519 support for DKIM signing with OpenSSL version checks
This commit adds support for ED25519 DKIM signatures when OpenSSL 1.1.1+ is available.
Key changes:
- Added HAVE_ED25519 detection in CMake to check for EVP_PKEY_ED25519 support
- All ED25519-specific code is conditionally compiled based on HAVE_ED25519
- When ED25519 is not supported, informative error messages are returned
- ED25519 keys loaded from PEM files are extracted and converted to libsodium format
- Fixed union handling to prevent double-free issues
- Updated tests to dynamically select key type based on request header
- Removed unused dkim-ed25519-pem.conf (cannot be passed via rspamc)
The implementation gracefully degrades on older OpenSSL versions while maintaining
full functionality when ED25519 support is available.
[Feature] Add ED25519 support for DKIM signing and verification
This commit introduces support for ED25519 keys in DKIM signing and
verification. It includes changes to the DKIM library to handle ED25519 keys,
along with new test cases and configuration files that demonstrate and test
this functionality.
[Fix] Improve HTTP map interval logic for cache validation
Properly differentiate between maps with and without cache validation:
- With ETag/Last-Modified: use 4x multiplier (cheap conditional requests)
- Without cache validation: enforce strict 10 minute minimum
- Add overflow protection for interval multiplication
- Actually use has_etag/has_last_modified parameters
This avoids overly aggressive slowdown (120x -> 4x) for maps with cache
validation while still preventing abuse of maps without validation.
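One possible reading of this logic, as a Python sketch. It assumes the 4x multiplier is applied to the configured map_interval and that the rule covers responses without a usable Expires header; neither detail is spelled out in the commit. MAX_INTERVAL is a hypothetical ceiling standing in for the overflow protection, and fallback_interval is a hypothetical name.

```python
MIN_UNVALIDATED_INTERVAL = 600.0  # 10 minutes, from the commit message
VALIDATED_MULTIPLIER = 4.0        # 4x, from the commit message
MAX_INTERVAL = 8 * 3600.0         # hypothetical ceiling (overflow guard)

def fallback_interval(map_interval: float, has_etag: bool,
                      has_last_modified: bool) -> float:
    """Poll interval in seconds under the assumptions stated above."""
    if has_etag or has_last_modified:
        # Conditional requests are cheap, so slow down only by a small factor.
        interval = map_interval * VALIDATED_MULTIPLIER
    else:
        # Every poll re-downloads the map, so enforce the strict minimum.
        interval = max(map_interval, MIN_UNVALIDATED_INTERVAL)
    # Clamp so a very large configured interval cannot overflow later on.
    return min(interval, MAX_INTERVAL)
```

For a 5 second map_interval this yields 20 seconds with cache validation and 600 seconds without, which matches the 4x versus 120x slowdown mentioned above.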
[CritFix] Prevent time_t overflow in HTTP map expires header processing
Add validation to detect and reject invalid or overflow-inducing Expires
headers (more than 1 year in the future). When the Expires header is invalid or
causes overflow, call rspamd_http_map_process_next_check with
expires=0 instead of setting map->next_check=0, which left stale overflow
values in place.
This prevents crashes and invalid scheduling like 'next check at Thu,
09 Nov 438498967' when servers send malformed Expires headers.
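The validation idea in a minimal Python sketch. The function name sanitize_expires is hypothetical, and treating a rejected value as 0 ("no usable Expires") mirrors the expires=0 call described above rather than the exact C code.

```python
ONE_YEAR = 365 * 24 * 3600  # seconds

def sanitize_expires(expires: int, now: int) -> int:
    """Return expires if it is plausible, else 0 meaning 'ignore the header'."""
    if expires - now > ONE_YEAR:
        # More than a year ahead: treat as malformed instead of scheduling it.
        return 0
    return expires
```

A caller would then fall back to its normal interval calculation whenever 0 comes back, so a bogus far-future date never reaches the scheduler.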
[Minor] Fix compilation errors and simplify HTML shingles
- Export rspamd_shingles_get_keys_cached() for use in HTML shingles
- Simplify extract_etld1_from_url(): use existing url->tld field
(in Rspamd, tld already contains eTLD+1/eSLD, no need to parse)
- Add proper reinterpret_cast for const char* to unsigned char*
- Fix variable name conflict (html_content parameter vs local var)
- Use rspamd_url_tld_unsafe() and rspamd_url_host_unsafe() macros
[Minor] Move HTML shingles implementation to separate C++ file
The HTML shingles code requires C++ (html_content, std::variant, etc.)
but was placed in #ifdef __cplusplus block in shingles.c (a C file),
causing linker errors.
Solution: Move all HTML-specific code to shingles_html.cxx which is
compiled as C++ and properly exports symbols with extern "C" linkage.
Files:
- shingles.c: Keep only C code (text/image shingles)
- shingles_html.cxx: New file with HTML shingles implementation
- CMakeLists.txt: Add shingles_html.cxx to build
[Feature] Add HTML fuzzy hashing for structural similarity matching
Implement fuzzy hashing algorithm for HTML content to enable efficient
matching of messages by HTML structure, independent of text content.
This feature allows:
- Detecting similar HTML emails (newsletters, notifications, spam campaigns)
- Phishing protection: similar structure but different CTA domains
- Brand protection: identify legitimate vs fake branded emails
- Template detection: group emails from the same template
Implementation details:
1. Multi-layer hash approach:
- Direct hash: blake2b of all HTML tokens (for exact matching)
- Structure shingles: sliding window over DOM tags (for fuzzy matching)
- CTA domains hash: critical for phishing detection (30% weight)
- All domains hash: top-10 most frequent domains (15% weight)
- Features hash: bucketed HTML statistics (5% weight)
6. Memory efficient:
- Uses mempool for temporary allocations
- Final structure: ~304 bytes (32 shingles + metadata + hashes)
- Performance: <1ms for typical HTML (100-200 tags)
7. Compatible with existing fuzzy storage infrastructure:
- Structure shingles use same format as text shingles
- Can be sent to fuzzy storage via standard protocol
- Additional hashes (CTA, domains, features) can be stored as extensions
Key design decisions:
- Direct hash prevents false positives from MinHash collisions
(like text parts: crypto_hash(all_tokens) for exact match)
- Sliding window (size 3) provides tolerance to small structural changes
- Bucketing of numeric features ensures stability
- CTA domain verification critical for phishing prevention
Use cases:
- Whitelisting legitimate branded emails by HTML structure
- Blacklisting spam campaigns with varying personalized text
- Detecting phishing: legitimate structure + different CTA = suspicious
- Fuzzy storage integration for distributed matching
Files changed:
- src/libutil/shingles.h: Add rspamd_html_shingle structure and API
- src/libutil/shingles.c: Implement HTML fuzzy hashing (~540 lines)
- src/lua/lua_mimepart.c: Add text_part:get_html_fuzzy_hashes() method
Future work:
- Integration with fuzzy_check module
- Configuration options (min_html_tags, similarity_threshold)
- Rules for phishing detection based on HTML similarity
- Separate fuzzy storage type for HTML hashes
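The sliding-window and MinHash ideas from this commit can be sketched in Python. This is an illustration under assumptions, not the code in shingles.c: keyed blake2b stands in for whatever keyed hash Rspamd uses for shingles, tags are plain strings, and all function names are hypothetical. Only the window size (3), the shingle count (32), and the blake2b direct hash come from the commit text.

```python
import hashlib
from typing import List, Tuple

def tag_windows(tags: List[str], size: int = 3) -> List[Tuple[str, ...]]:
    """All contiguous windows of `size` tags, in document order."""
    return [tuple(tags[i:i + size]) for i in range(len(tags) - size + 1)]

def direct_hash(tags: List[str]) -> str:
    """blake2b over the whole tag sequence, used for exact matching."""
    return hashlib.blake2b("\x00".join(tags).encode()).hexdigest()

def min_hash_shingles(tags: List[str], n_shingles: int = 32,
                      size: int = 3) -> List[int]:
    """For each of n keys, keep the smallest keyed hash over all windows."""
    windows = [" ".join(w).encode() for w in tag_windows(tags, size)]
    if not windows:
        return []
    result = []
    for k in range(n_shingles):
        key = k.to_bytes(4, "little")
        result.append(min(
            int.from_bytes(
                hashlib.blake2b(w, key=key, digest_size=8).digest(), "little")
            for w in windows))
    return result

def similarity(a: List[int], b: List[int]) -> float:
    """Fraction of shingle positions on which two signatures agree."""
    if not a or len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two documents that share most of their tag windows agree on most shingle positions, while the direct hash only matches when the full tag sequence is identical.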
Prevent aggressive HTTP map polling by implementing proper interval bounds:
- Cap absurdly high Expires headers (>8h) to min(map_interval * 10, 8h)
- Enforce configured map_interval as minimum when server requests faster refresh
- Apply 10 minute minimum interval when no Expires header and low map_interval
- Simplify logic by consolidating interval calculation in single function
This change ensures servers can control refresh rates and prevents clients
from causing issues with overly aggressive polling behavior.
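The three bounds listed above can be combined in a short Python sketch. The function name next_check_delay and the expires_in parameter (seconds until the Expires time, None when the header is absent) are hypothetical; the constants are taken from the bullet list.

```python
from typing import Optional

EIGHT_HOURS = 8 * 3600
TEN_MINUTES = 600

def next_check_delay(map_interval: float,
                     expires_in: Optional[float] = None) -> float:
    """Seconds until the next poll, following the bounds in this commit."""
    if expires_in is None:
        # No Expires header: never poll more often than every 10 minutes.
        return max(map_interval, TEN_MINUTES)
    if expires_in > EIGHT_HOURS:
        # Absurdly far in the future: cap it.
        return min(map_interval * 10, EIGHT_HOURS)
    # Server asked for a faster refresh than configured: keep map_interval.
    return max(expires_in, map_interval)
```

For example, with map_interval set to 60 seconds, a server sending Expires one day ahead results in a 600 second delay rather than 24 hours.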
[Feature] Add symbol categories for MetaDefender and VirusTotal
Implemented a category-based symbol system for hash lookup antivirus
scanners (MetaDefender and VirusTotal) to replace dynamic scoring:
- Added 4 symbol categories: CLEAN (-0.5), LOW (2.0), MEDIUM (5.0), HIGH (8.0)
- Replaced full_score_engines with threshold-based categorization (low_category, medium_category)
- Fixed symbol registration in antivirus.lua to use rule instead of config
- Updated cache format to preserve symbol category across requests
- Added backward compatibility for old cache format
- Added symbols registration and metric score assignment
- Updated configuration documentation with examples
The new system provides:
- Clear threat categorization instead of linear interpolation
- Proper symbol weights applied automatically
- Consistent behavior between MetaDefender and VirusTotal
- Cache that preserves symbol categories
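A hedged Python sketch of threshold-based categorization. The four scores come from this commit; everything else is an assumption: that the input is the number of engines flagging the file, that the thresholds are exclusive upper bounds, and the default values 5 and 10. The name categorize is hypothetical.

```python
CATEGORY_SCORES = {"CLEAN": -0.5, "LOW": 2.0, "MEDIUM": 5.0, "HIGH": 8.0}

def categorize(detections: int, low_category: int = 5,
               medium_category: int = 10) -> str:
    """Map a detection count to a symbol category (assumed semantics)."""
    if detections <= 0:
        return "CLEAN"
    if detections < low_category:
        return "LOW"
    if detections < medium_category:
        return "MEDIUM"
    return "HIGH"
```

The score for a result is then a plain lookup, for instance CATEGORY_SCORES[categorize(7)], instead of a value interpolated from the detection count.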
[Fix] Add nil check for vault_data in show_handler
Prevent runtime errors when parsing Vault KV v2 responses if obj.data.data is nil.
This adds a safety check before accessing vault_data.selectors, consistent with
other handlers in the file (newkey_handler and roll_handler).
[Feature] Improve LLM prompt and add sender frequency tracking
* Update default prompt to reduce false positives on legitimate emails
- Explicitly recognize verification emails as legitimate
- Require MULTIPLE red flags for phishing classification
- Add guidance on known/frequent senders
* Add sender frequency detection in context
- Classify senders as: new, occasional, known, frequent
- Based on sender_counts from user context
- Passed to LLM via context snippet
* Prompt instructs LLM to reduce phishing score for known senders
* Helps avoid false positives on transactional/verification emails
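The sender classification can be sketched in Python as follows. The four labels and the sender_counts input come from this commit; the thresholds known_min and frequent_min, their default values, and the function name classify_sender are hypothetical.

```python
from typing import Dict

def classify_sender(sender: str, sender_counts: Dict[str, int],
                    known_min: int = 3, frequent_min: int = 10) -> str:
    """Label a sender by how often it appears in the user's context."""
    count = sender_counts.get(sender, 0)
    if count == 0:
        return "new"
    if count < known_min:
        return "occasional"
    if count < frequent_min:
        return "known"
    return "frequent"
```

The resulting label would be included in the context snippet so that the prompt can ask for a lower phishing score when the sender is known or frequent.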
[Feature] Improve GPT module with uncertain caching and server timeout
* Add GPT_UNCERTAIN symbol for caching uncertain classifications
- Cache results even when no consensus is reached
- Avoid repeated expensive LLM queries for borderline cases
- Set X-GPT-Reason header with detailed vote statistics
* Add server-side timeout support for OpenAI API requests
- New request_timeout parameter (optional, multiplied by 0.95)
- Only sent if explicitly configured (not all APIs support this)
- Accounts for connection setup and data transfer overhead
* Fix max_ham_prob initialization (was 0, now correctly 1.0)
* Add pcall protection for fold_header_with_encoding with raw fallback
* Improve error messages for token limit exceeded
* Add detailed logging for context snippets and consensus decisions
* Pass debug_module parameter to llm_context functions
[Feature] Add cache expiration timestamps to debug logs
* Show when cached data will expire in human-readable format
* Log expiration time both when caching and after successful write
* Helps with debugging cache TTL issues
[Feature] Add bidirectional context support for LLM
* Unify context for incoming and outgoing mail
* Same identity used for authenticated/local sender and recipient
* Follows replies module pattern for direction detection
* Make llm_context.lua module-agnostic with debug_module parameter
* Improve userdata handling (use :sub instead of string.sub)
* Add nil-safety to all debug logging calls
* Add cache expiration timestamps to context logs
[Fix] Add full Lua traceback to HTTP callback errors
Improved error diagnostics in lua_http_finish_handler by adding
rspamd_lua_traceback handler. Now shows complete call stack with
file names and line numbers when Lua HTTP callbacks fail, making
debugging much easier.
[Feature] Add user/domain context support for LLM-based classification
* Add llm_context.lua module for Redis-based conversation context
* Context features: sliding window, top senders, keywords, flagged phrases
* Use low-level word API (get_words('full')) with stop_word flags
* Flexible gating via maps/selectors (enable_map/enable_expression)
* Update context even when GPT condition not met (BAYES_SPAM/HAM)
* Add min_messages warm-up threshold to prevent weak context injection
* Configurable scope: user/domain/esld with TTL and sliding window
* [Feature] Archive module: Full support for encrypted ZIP archives with ZipCrypto and AES encryption
* [Feature] Archive module: Both reading and writing of AES-encrypted ZIP archives is supported
* [Feature] Archive module: Updated Lua bindings for libarchive
* [Feature] Encrypted maps: Support for encrypted maps to enable new distribution scenarios
* [Feature] Redis TLS: Configurable TLS connections in Redis backend
* [Feature] Map helpers alignment: Enforce 64-byte alignment to prevent unaligned memory access
* [Feature] Enhanced CLI for secretbox with additional security test coverage
* [Fix] MIME encoding: Major overhauls and multiple fixes for MIME encoding logic
* [Fix] MIME encoding: Improved handling and decoding of UTF-8 in MIME headers
* [Fix] Learning system: Numerous fixes to learn checks and autolearn flag handling
* [Fix] Learning system: Prevention of duplicate message learning
* [Fix] Learning system: Extended multiclass learning test coverage
* [Fix] Critical: Fixed bug when converting zero-length strings to numbers
* [Fix] Critical: Fixed XML prolog detection in lua_magic module
* [Fix] Build: Fixed build issues on 32-bit platforms
* [Fix] Compatibility: Improved compatibility with Lua versions above 5.1
* [Fix] Empty input: Addressed issues with empty input handling in lua_magic
* [Fix] Testing: Improved stability of automated testing with multiple test fixes
* [Fix] Minor compatibility improvements (buffer allocation, missing cmath include)
* Refactored rspamd_control_fill_msghdr to accept
a caller-provided control buffer, fixing the
lifetime bug where a pointer to a local array
was stored in msg_control.
* Replaced static buffers with automatic (stack)
buffers at the exact call sites of sendmsg/recvmsg,
so PowerPC and similar platforms won’t choke on
non-constant expressions.
- Removed g_strdup/g_free of TLS paths in src/lua/lua_redis.c.
- Now we:
- Keep TLS values (booleans + strings) on the Lua stack temporarily.
- Use an absolute table index (so gettable calls aren’t confused by
the growing stack).
- Call rspamd_redis_pool_connect_ext while those values are on the
stack.
- Pop all postponed values and then the table in one go immediately
after the connect call.
- The C++ pool still copies into std::string on element creation; we
only ensure Lua strings live through the call without extra
allocations.
- Remove the redundant `ensure_ssl_inited` function and its calls. Core SSL
init should suffice.
- Refactor TLS initiation into `redis_pool_elt::initiate_tls(...)` to
eliminate duplication
- Switch TLS flags to `bool` in the public struct
- Fix ephemeral string usage in Lua by duplicating the values into
locals and freeing them after connect. Flags are boolean. (It is unlikely
that Lua will GC the strings before we connect to Redis, but
this ensures that it cannot become a problem.)
- Remove the redis TLS options propagation unit test