- Removed g_strdup/g_free of TLS paths in src/lua/lua_redis.c.
- Now we:
- Keep TLS values (booleans + strings) on the Lua stack temporarily.
- Use an absolute table index (so gettable calls aren’t confused by
the growing stack).
- Call rspamd_redis_pool_connect_ext while those values are on the
stack.
- Pop all postponed values and then the table in one go immediately
after the connect call.
- The C++ pool still copies into std::string on element creation; we
only ensure Lua strings live through the call without extra
allocations.
- Remove the redundant `ensure_ssl_inited` function and its calls; core SSL
  initialisation should suffice.
- Refactor TLS initiation into `redis_pool_elt::initiate_tls(...)` to
eliminate duplication
- Switch TLS flags to `bool` in the public struct
- Fix ephemeral string usage in Lua by duplicating the values into locals and
  freeing them after connect; flags are boolean. (It is unlikely that Lua would
  collect the strings before we connect to Redis, but this guarantees it cannot
  become a problem.)
- Remove the redis TLS options propagation unit test
* [Conf] Add defaults
* [Conf] Fix JB IDE damage
* [Feature] Add a signal from main to workers for workers ready state
* [Feature] Add lua_util.fold_header_with_encoding
* [Feature] Add some convenience options to rspamc
* [Feature] Add some more OS utility functions
* [Feature] Add symbols proxy for piecewise changes
* [Feature] Allow lua callback maps to be filled line by line
* [Feature] Allow selectors in regexp maps expressions
* [Feature] Allow passing expression flags in the regexp plugin
* [Feature] Detect part types in mime parser
* [Feature] Resolve DNS nameservers names using getaddrinfo
* [Fix] Bayes: Try to be bug-for-bug compatible
* [Fix] Check skip_hashes for the returned hashes
* [Fix] Fix DL lists initialisations
* [Fix] Fix double free in the client...
* [Fix] Fix end-to-end proxy compression
* [Fix] Fix l= calculations again
* [Fix] Fix lua state setting ambiguity
* [Fix] Fix order of descriptor closing
* [Fix] Fix probabilities overflow
* [Fix] Fix rules setup
* [Fix] Fix statfiles ordering
* [Fix] Fix various corner cases and tests
* [Fix] Fix whitelist options in the arc module
* [Fix] GPT: Fix occasional damage
* [Fix] GPT: Fix processing of messages with no subject
* [Fix] Prevent WebUI crash with empty RRD
* [Fix] Store html attributes that are empty
* [Fix] Try to fix learned order
* [Fix] Use C++20 standard consistently to resolve ODR violations
* [Fix] Use a more straightforward approach for learn cache
* [Fix] Fix error check in lua_dkim_tools.lua
* [Project] Add CTA analytics engine
* [Project] Add ability to create custom tokenizers for languages
* [Project] Add controller learn endpoints
* [Project] Add support of granular timeouts to plugins and maps
* [Project] Add tests and fix stuff
* [Project] Add tests for LLM provider, fix various issues with metatokens
* [Project] Apply changes to bayes_expiry plugin
* [Project] Create an isolated API for external tokenizers
* [Project] Extract more features from HTML messages
* [Project] Fix Lua API and some constexpr compatibility
* [Project] Fix binary classification and lua scripts
* [Project] Fix more calculation issues
* [Project] Fix other classification and learning issues
* [Project] Fix scoped compilation again
* [Project] Fix symbols finalisation
* [Project] Fix unlearn stuff
* [Project] Fix various issues
* [Project] Fix various other issues
* [Project] Further updates
* [Project] Implement backoff for upstreams revival
* [Project] Implement more flexible http timeouts
* [Project] Implement scoped compilation
* [Project] Implement scoped regexp cache system
* [Project] Multi-class classification project baseline
* [Project] Rework rspamc to allow training of different neural types
* [Project] Rework system of html tags to allow more tag types
* [Project] Rework tokenizers initialisation
* [Project] Some rework of the CTA defaults
* [Project] Start implementation of the rules maps
* [Project] Start to implement better revive strategy for upstreams
* [Project] Store regexp rules state to avoid incomplete/orphaned rules
* [Project] Support more common html attributes
* [Project] Take button weight into consideration
* [Project] Use re_cache scopes for maps
* [Rework] Fix logger format string mismatch
* [Rework] MIME detection via Lua Magic; enforce cfg in Lua task API
* [Rework] Bring back N-ary optimizations for arithmetic-like expressions
* [Rework] Use a GLib-agnostic type for words
* [Rework] Refactor MIME detection via Lua Magic; enforce cfg in Lua task API
* [Rules] Make bitcoin expression to use explicit flags
[Rework] MIME detection via Lua Magic; enforce cfg in Lua task API
- Add rspamd_mime_parser_config on cfg; remove global state and lazy init
- Initialize parser config once per cfg; preload lua_magic.detect_mime_part
- Always run detection after normal part parse; promote .eml/message parts
- Preserve detected_ext/detected_ct/detected_type and NO_TEXT flag
- Remove duplicate detection from message.c; add debug logs
- Restore CTE parsing API and fix call sites
- Enforce cfg requirement in rspamd_task.load_from_string/load_from_file/create (example below)
- Fix unit tests to pass rspamd_config to load_from_string
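For instance, Lua callers (including the unit tests) now have to pass the configuration explicitly. A minimal sketch, assuming the usual boolean-plus-result return convention of `rspamd_task.load_from_string`:

```lua
-- Minimal sketch: cfg is now mandatory when creating a task from Lua
-- (exact call sites in the tests may differ).
local rspamd_task = require "rspamd_task"

local ok, task = rspamd_task.load_from_string(msg, rspamd_config)
if not ok then
  error('cannot load task: ' .. tostring(task))
end
-- ... inspect the task ...
task:destroy() -- free tasks created from Lua when done
```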
Vsevolod Stakhov [Fri, 29 Aug 2025 12:38:01 +0000 (13:38 +0100)]
Fix DKIM relaxed body canonicalization and optimize performance
This PR addresses critical issues in DKIM relaxed body canonicalization and modernizes the codebase by replacing GLib types with standard C types.
- **RFC Compliance**: Fixed incorrect canonicalization of lines containing only whitespace. Previously, such lines were not properly handled according to RFC 6376, which could lead to DKIM signature verification failures.
- **Memory Safety**: Fixed incorrect pointer dereference in `rspamd_dkim_skip_empty_lines` that could cause undefined behavior.
- **Zero-copy Optimization**: Reimplemented `rspamd_dkim_relaxed_body_step` to avoid unnecessary memory copies. The new implementation:
- Processes input data directly without intermediate buffers
- Reduces the number of `EVP_DigestUpdate` calls by processing larger chunks
- Improves CPU cache efficiency
- Results in significantly better performance for large email bodies
- Replaced all GLib types with standard C equivalents:
- `gsize` → `size_t`
- `gssize` → `ssize_t`
- `gboolean` → `bool`
- `TRUE/FALSE` → `true/false`
- And other GLib-specific types
- Added necessary standard headers (`stdbool.h`, `stdint.h`, `limits.h`)
- Added comprehensive debug logging for:
- Chunk processing with size information
- Empty line detection and skipping
- Space collapsing operations
Petr Vaněk [Fri, 29 Aug 2025 08:31:24 +0000 (10:31 +0200)]
[Fix] Use C++20 standard consistently to resolve ODR violations
This commit resolves ODR violations when compiling with -flto and
-Werror=odr [1]. The main project used a newer C++20 standard, while the
backward-cpp and simdutf libraries used an older C++11 standard. This
difference caused the linker to fail.
Setting the C++20 standard in both libraries resolves the ODR issue.
This PR evolves the neural module from a symbols-only scorer into a general feature-fusion classifier with pluggable providers. It adds an LLM embedding provider, introduces trained normalization and metadata persistence, and isolates new models via a schema/prefix bump.
- The existing neural module is limited to metatokens and symbols.
- We want to combine multiple feature sources (LLM embeddings now; Bayes/FastText later).
- Ensure consistent train/infer behavior with stored normalization and provider metadata.
- Improve operability with caching, digest checks, and safer rollouts.
- Provider architecture
- Provider registry and fusion: `collect_features(task, rule)` concatenates provider vectors with optional weights (see the sketch after the file list below).
- New LLM provider: `lualib/plugins/neural/providers/llm.lua` using `rspamd_http` and `lua_cache` for Redis-backed embedding caching.
- Symbols provider extracted: `lualib/plugins/neural/providers/symbols.lua`.
- Normalization and PCA
- Configurable fusion normalization: none/unit/zscore.
- Trained normalization stats computed during training and applied at inference (sketched after the configuration example below).
- Existing global PCA preserved; loaded/saved alongside ANN.
- Schema and compatibility
- `plugin_ver` bumped to '3' to isolate from earlier profiles.
- Redis save/load extended:
- Profiles include `providers_digest`.
- ANN hash can include `providers_meta`, `norm_stats`, `pca`, `roc_thresholds`, `ann`.
- ANN load validates the provider digest and skips applying the ANN on mismatch.
- Performance and reliability
- LLM embeddings cached in Redis (content+model keyed).
- Graceful fallback to symbols if providers are not configured or fail.
- Basic provider configuration validation.
- `lualib/plugins/neural.lua`: provider registry, fusion, normalization helpers, profile digests, training pipeline updates.
- `src/plugins/lua/neural.lua`: integrates fusion into inference/learning, loads new metadata, applies normalization, validates digest.
- `lualib/plugins/neural/providers/llm.lua`: LLM embeddings with Redis cache.
- `lualib/plugins/neural/providers/symbols.lua`: legacy symbols provider wrapper.
- `lualib/redis_scripts/neural_save_unlock.lua`: stores `providers_meta` and `norm_stats` in ANN hash.
- `NEURAL_REWORK_PLAN.md`: design and phased TODO.
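As a rough illustration of the registry/fusion flow above (a minimal sketch; function and field names here are assumptions, not the actual helpers in `lualib/plugins/neural.lua`):

```lua
-- Illustrative provider registry and feature fusion (not the real implementation).
local providers = {}

local function register_provider(ptype, collect_fn)
  providers[ptype] = collect_fn
end

-- A symbols-style provider: one feature per configured symbol
-- (rule.symbols is a hypothetical field for this sketch).
register_provider('symbols', function(task, rule, pcfg)
  local vec = {}
  for i, sym in ipairs(rule.symbols or {}) do
    vec[i] = task:has_symbol(sym) and 1.0 or 0.0
  end
  return vec
end)

-- collect_features(task, rule): concatenate provider vectors, applying optional weights;
-- a missing or failing provider contributes nothing while the others are still used.
local function collect_features(task, rule)
  local fused = {}
  for _, pcfg in ipairs(rule.providers or { { type = 'symbols' } }) do
    local collect = providers[pcfg.type]
    if collect then
      local ok, vec = pcall(collect, task, rule, pcfg)
      if ok and vec then
        for _, v in ipairs(vec) do
          fused[#fused + 1] = v * (pcfg.weight or 1.0)
        end
      end
    end
  end
  return fused
end
```

The `pcall` around each provider mirrors the graceful-fallback behaviour described above: training and inference can proceed with whatever features are available.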
- Enable LLM alongside symbols:
```ucl
neural {
  rules {
    default {
      providers = [
        { type = "symbols"; weight = 0.5; },
        { type = "llm"; model = "text-embed-1"; url = "https://api.openai.com/v1/embeddings";
          cache_ttl = 86400; weight = 1.0; }
      ];
      fusion { normalization = "zscore"; }
      roc_enabled = true;
      max_inputs = 256; # optional PCA
    }
  }
}
```
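The `fusion { normalization = "zscore"; }` setting relies on statistics computed during training and stored alongside the ANN (`norm_stats`). A minimal sketch of the idea (illustrative only; the real helpers in `lualib/plugins/neural.lua` may differ):

```lua
-- Training side: per-dimension mean and standard deviation over the fused vectors
-- (assumes at least one non-empty vector).
local function train_norm_stats(vectors)
  local n, dim = #vectors, #vectors[1]
  local mean, sd = {}, {}
  for d = 1, dim do
    local sum = 0.0
    for i = 1, n do sum = sum + vectors[i][d] end
    mean[d] = sum / n
    local var = 0.0
    for i = 1, n do
      local diff = vectors[i][d] - mean[d]
      var = var + diff * diff
    end
    sd[d] = math.sqrt(var / n)
  end
  return { mean = mean, sd = sd }
end

-- Inference side: apply the stored stats so train/infer behaviour stays consistent.
local function apply_zscore(vec, stats)
  local out = {}
  for d, v in ipairs(vec) do
    local sd = stats.sd[d]
    out[d] = sd > 0 and (v - stats.mean[d]) / sd or 0.0
  end
  return out
end
```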
- LLM provider uses `gpt` block for defaults if present (e.g., API key). You can override `model`, `url`, `timeout`, and cache parameters per provider entry.
- Existing (v2) neural profiles remain unaffected (new `plugin_ver = '3'` prefixes).
- New profiles embed `providers_digest`; incompatible provider sets won’t be applied.
- No immediate cleanup required; TTL-based cleanup keeps old keys around until expiry.
- Validated: provider digest checks, ANN load/save roundtrip, normalization application at inference, LLM caching paths, symbols fallback.
- Please test with/without LLM provider and with `fusion.normalization = none|unit|zscore`.
- LLM latency/cost is mitigated by Redis caching; timeouts are configurable per provider.
- Privacy: use trusted endpoints; no message content leaves the host unless an LLM provider is configured.
- Failure behavior: missing/failed providers degrade to others; training/inference can proceed with partial features.
- Rules without `providers` continue to use symbols-only behavior.
- Existing command surface unchanged; future PR will introduce `rspamc learn_neural:*` and controller endpoints.
- [x] Provider registry and fusion
- [x] LLM provider with Redis caching
- [x] Symbols provider split
- [x] Normalization (unit/zscore) with trained stats
- [x] Redis schema v3 additions and profile digest
- [x] Inference uses trained normalization
- [x] Basic provider validation and fallbacks
- [x] Plan document
- [ ] Per-provider budgets/metrics and circuit breaker for LLM
- [ ] Expand providers: Bayes and FastText/subword vectors
- [ ] Per-provider PCA and learned fusion
- [ ] New CLI (`rspamc learn_neural`) and status/invalidate endpoints
- [ ] Documentation expansion under `docs/modules/neural.md`