Vsevolod Stakhov [Fri, 29 Aug 2025 12:38:01 +0000 (13:38 +0100)]
Fix DKIM relaxed body canonicalization and optimize performance
This PR addresses critical issues in DKIM relaxed body canonicalization and modernizes the codebase by replacing GLib types with standard C types.
- **RFC Compliance**: Fixed incorrect canonicalization of lines containing only whitespace. Previously, such lines were not properly handled according to RFC 6376, which could lead to DKIM signature verification failures.
- **Memory Safety**: Fixed incorrect pointer dereference in `rspamd_dkim_skip_empty_lines` that could cause undefined behavior.
- **Zero-copy Optimization**: Reimplemented `rspamd_dkim_relaxed_body_step` to avoid unnecessary memory copies. The new implementation:
  - Processes input data directly without intermediate buffers
  - Reduces the number of `EVP_DigestUpdate` calls by processing larger chunks
  - Improves CPU cache efficiency
  - Results in significantly better performance for large email bodies
- Replaced all GLib types with standard C equivalents:
  - `gsize` → `size_t`
  - `gssize` → `ssize_t`
  - `gboolean` → `bool`
  - `TRUE`/`FALSE` → `true`/`false`
  - And other GLib-specific types
- Added the necessary standard headers (`stdbool.h`, `stdint.h`, `limits.h`)
- Added comprehensive debug logging for:
  - Chunk processing with size information
  - Empty line detection and skipping
  - Space collapsing operations
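The relaxed body rules the fix targets can be illustrated outside of Rspamd. The sketch below is not Rspamd's implementation (the real code is a streaming, zero-copy state machine feeding `EVP_DigestUpdate`); it is a minimal, buffer-based illustration of the three RFC 6376 relaxed-body rules: collapse inner WSP runs to a single space, drop trailing WSP on each line, and ignore empty lines at the end of the body. Input is assumed to use CRLF line endings.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Minimal illustration of RFC 6376 relaxed body canonicalization.
 * Not Rspamd's streaming implementation; `out` must be at least as
 * large as `in` plus two bytes for a possibly added final CRLF. */
static size_t relax_body(const char *in, size_t len, char *out)
{
    size_t o = 0, i = 0;
    bool any = false; /* did the body contain any non-WSP content? */

    while (i < len) {
        /* find end of line (byte before CRLF) */
        size_t eol = i;
        while (eol < len && in[eol] != '\r') eol++;
        /* ignore WSP at end of line */
        size_t end = eol;
        while (end > i && (in[end - 1] == ' ' || in[end - 1] == '\t')) end--;
        if (end > i) any = true;
        /* copy, reducing inner WSP runs to a single SP */
        bool in_wsp = false;
        for (size_t j = i; j < end; j++) {
            if (in[j] == ' ' || in[j] == '\t') { in_wsp = true; continue; }
            if (in_wsp) { out[o++] = ' '; in_wsp = false; }
            out[o++] = in[j];
        }
        out[o++] = '\r'; out[o++] = '\n';
        i = (eol < len) ? eol + 2 : len; /* skip CRLF (CRLF input assumed) */
    }
    /* ignore empty lines at the end of the body */
    while (o >= 4 && memcmp(out + o - 4, "\r\n\r\n", 4) == 0) o -= 2;
    return any ? o : 0; /* a body of only empty lines canonicalizes to "" */
}
```

For example, `"Hi \t there  \r\n\r\n"` canonicalizes to `"Hi there\r\n"`, and a body consisting only of empty lines canonicalizes to the empty string, which is exactly the whitespace-only-line case the fix addresses.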
Petr Vaněk [Fri, 29 Aug 2025 08:31:24 +0000 (10:31 +0200)]
[Fix] Use C++20 standard consistently to resolve ODR violations
This commit resolves ODR violations when compiling with -flto and
-Werror=odr [1]. The main project used the newer C++20 standard, while the
backward-cpp and simdutf libraries used the older C++11 standard. This
mismatch caused the linker to fail.
Setting C++20 standard in both libraries resolves the ODR issue.
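The change amounts to pinning the same language standard on the bundled targets as on the main project. A sketch of the kind of CMake change involved (the target names here are illustrative, not necessarily the ones used in the build):

```cmake
# Illustrative only: force the same C++ standard on bundled libraries so
# that LTO does not merge two ODR-incompatible definitions of the same
# entity compiled under C++11 and C++20.
set_target_properties(backward simdutf PROPERTIES
    CXX_STANDARD 20
    CXX_STANDARD_REQUIRED ON)
```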
This PR evolves the neural module from a symbols-only scorer into a general feature-fusion classifier with pluggable providers. It adds an LLM embedding provider, introduces trained normalization and metadata persistence, and isolates new models via a schema/prefix bump.
- The existing neural module is limited to metatokens and symbols.
- We want to combine multiple feature sources (LLM embeddings now; Bayes/FastText later).
- Ensure consistent train/infer behavior with stored normalization and provider metadata.
- Improve operability with caching, digest checks, and safer rollouts.
- Provider architecture
  - Provider registry and fusion: `collect_features(task, rule)` concatenates provider vectors with optional weights.
  - New LLM provider: `lualib/plugins/neural/providers/llm.lua`, using `rspamd_http` and `lua_cache` for Redis-backed embedding caching.
  - Symbols provider extracted to `lualib/plugins/neural/providers/symbols.lua`.
- Normalization and PCA
  - Configurable fusion normalization: none/unit/zscore.
  - Normalization stats are computed during training and applied at inference.
  - Existing global PCA preserved; loaded/saved alongside the ANN.
- Schema and compatibility
  - `plugin_ver` bumped to '3' to isolate new models from earlier profiles.
  - Redis save/load extended:
    - Profiles include `providers_digest`.
    - The ANN hash can include `providers_meta`, `norm_stats`, `pca`, `roc_thresholds`, `ann`.
  - ANN load validates the provider digest and skips applying the model on mismatch.
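The digest check exists so that a network trained on one provider set is never fed features from a different one (where vector layout and dimensionality would no longer match). A language-neutral sketch of the idea in C, hashing the ordered provider descriptions; the hash function and field layout here are illustrative, not what Rspamd actually stores:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative digest over an ordered provider list (e.g. type+model
 * strings). Rspamd's real digest may differ; the point is only that any
 * change in the provider set changes the digest, so a stored ANN trained
 * with a different set is skipped at load time. Uses 64-bit FNV-1a. */
static uint64_t providers_digest(const char *const *parts, size_t n)
{
    uint64_t h = 1469598103934665603ULL; /* FNV-1a offset basis */
    for (size_t i = 0; i < n; i++) {
        for (const char *p = parts[i]; *p; p++) {
            h ^= (uint8_t) *p;
            h *= 1099511628211ULL; /* FNV prime */
        }
        h ^= 0xff;                 /* separator between parts, so that  */
        h *= 1099511628211ULL;     /* {"ab","c"} != {"a","bc"}          */
    }
    return h;
}
```

At load time the stored digest is compared with the digest of the currently configured providers; on mismatch the model is simply not applied, which is the "skips apply" behavior described above.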
- Performance and reliability
  - LLM embeddings cached in Redis (keyed by content + model).
  - Graceful fallback to symbols if providers are not configured or fail.
  - Basic provider configuration validation.
- `lualib/plugins/neural.lua`: provider registry, fusion, normalization helpers, profile digests, training pipeline updates.
- `src/plugins/lua/neural.lua`: integrates fusion into inference/learning, loads new metadata, applies normalization, validates digest.
- `lualib/plugins/neural/providers/llm.lua`: LLM embeddings with Redis cache.
- `lualib/plugins/neural/providers/symbols.lua`: legacy symbols provider wrapper.
- `lualib/redis_scripts/neural_save_unlock.lua`: stores `providers_meta` and `norm_stats` in ANN hash.
- `NEURAL_REWORK_PLAN.md`: design and phased TODO.
- Enable LLM alongside symbols:
```ucl
neural {
  rules {
    default {
      providers = [
        { type = "symbols"; weight = 0.5; },
        { type = "llm"; model = "text-embed-1"; url = "https://api.openai.com/v1/embeddings";
          cache_ttl = 86400; weight = 1.0; }
      ];
      fusion { normalization = "zscore"; }
      roc_enabled = true;
      max_inputs = 256; # optional PCA
    }
  }
}
```
- LLM provider uses `gpt` block for defaults if present (e.g., API key). You can override `model`, `url`, `timeout`, and cache parameters per provider entry.
- Existing (v2) neural profiles remain unaffected (new `plugin_ver = '3'` prefixes).
- New profiles embed `providers_digest`; incompatible provider sets won’t be applied.
- No immediate cleanup required; TTL-based cleanup keeps old keys around until expiry.
- Validated: provider digest checks, ANN load/save roundtrip, normalization application at inference, LLM caching paths, symbols fallback.
- Please test with/without LLM provider and with `fusion.normalization = none|unit|zscore`.
- LLM latency/cost is mitigated by Redis caching; timeouts are configurable per provider.
- Privacy: use trusted endpoints; no message content leaves the host unless an external LLM provider is configured.
- Failure behavior: missing/failed providers degrade to others; training/inference can proceed with partial features.
- Rules without `providers` continue to use symbols-only behavior.
- Existing command surface unchanged; future PR will introduce `rspamc learn_neural:*` and controller endpoints.
- [x] Provider registry and fusion
- [x] LLM provider with Redis caching
- [x] Symbols provider split
- [x] Normalization (unit/zscore) with trained stats
- [x] Redis schema v3 additions and profile digest
- [x] Inference uses trained normalization
- [x] Basic provider validation and fallbacks
- [x] Plan document
- [ ] Per-provider budgets/metrics and circuit breaker for LLM
- [ ] Expand providers: Bayes and FastText/subword vectors
- [ ] Per-provider PCA and learned fusion
- [ ] New CLI (`rspamc learn_neural`) and status/invalidate endpoints
- [ ] Documentation expansion under `docs/modules/neural.md`
René Draaisma [Sat, 16 Aug 2025 08:55:40 +0000 (10:55 +0200)]
Updated gpt.lua: set `gpt-5-mini` as the default model; fixed an issue where exceeding GPT `max_completion_tokens` returned an empty reason field; set the default symbol group to GPT, with the group now also configurable in settings via `extra_symbols`; fixed an issue when no score is defined in `extra_symbols` settings (the default score is now 0).
Add a GitHub Actions workflow to run WebUI E2E tests
with Playwright on legacy and latest browser versions
against rspamd binaries built in the pipeline.