git.ipfire.org Git - thirdparty/rspamd.git/log

commit | commitdiff | tree

dependabot[bot] [Thu, 22 Jan 2026 21:01:55 +0000 (21:01 +0000)]

Bump transformers in /contrib/neural-embedding-service

Bumps [transformers](https://github.com/huggingface/transformers) from 4.40.0 to 4.53.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.40.0...v4.53.0)

---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.53.0
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 21:00:55 +0000 (21:00 +0000)]

Merge pull request #5835 from rspamd/vstakhov-llm-embedding-improvements

Add expression-based autolearn for neural LLM providers

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 19:48:30 +0000 (19:48 +0000)]

[Fix] Use versioned key for hybrid LLM+symbols manual training

Pending key is now only used for LLM-only mode where embedding
dimensions may vary. Hybrid (LLM+symbols) and symbols-only modes
use versioned key directly since dimension includes stable symbols.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 18:36:26 +0000 (18:36 +0000)]

[Fix] Use versioned key for manual training in symbols-only mode

Manual training via ANN-Train header now writes to versioned key when
no LLM provider is configured. The pending key is only used with LLM
providers where embedding dimensions may vary between versions.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 18:22:06 +0000 (18:22 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 15:35:09 +0000 (15:35 +0000)]

[Fix] Match fuzzy_check.c hash generation in text_part:get_fuzzy_hashes

Fix text_part:get_fuzzy_hashes() to produce the same hashes as the
fuzzy_check plugin's fuzzy_cmd_from_text_part():

- For short text (<32 words): hash utf_stripped_content directly instead
  of individual words, and optionally include the subject
- For normal text: skip words with the RSPAMD_WORD_FLAG_SKIPPED flag or
  empty stems

Add optional subject parameter to include in short text hash calculation
(matches fuzzy_check.c behavior with no_subject=false).

Update rspamadm mime stat to pass subject to get_fuzzy_hashes().

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 13:40:20 +0000 (13:40 +0000)]

[Fix] Stop HTTP watchers before error handlers

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 11:51:53 +0000 (11:51 +0000)]

[Feature] Put subject first in LLM embedding input

Subject is highly valuable for spam detection and placing it first
ensures it's always included even if text content gets truncated
by model token limits.

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 22 Jan 2026 11:21:52 +0000 (11:21 +0000)]

[Feature] Rename neural autolearn options to match RBL module naming

Rename check_local/check_authed to exclude_local/exclude_users for
consistency with RBL module conventions. Change exclude_users default
to true (authenticated users excluded by default).

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 21 Jan 2026 13:39:51 +0000 (13:39 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 21 Jan 2026 13:39:35 +0000 (13:39 +0000)]

Merge pull request #5853 from rspamd/vstakhov-content-urls-rework

[Feature] Include content URLs by default in URL API calls

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 21 Jan 2026 13:16:52 +0000 (13:16 +0000)]

[Test] Set include_content_urls = false for functional tests

Preserve backward compatibility in tests by using the old default
behavior (exclude content URLs).

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 21 Jan 2026 09:57:54 +0000 (09:57 +0000)]

[Feature] Include content URLs by default in URL API calls

- Add `include_content_urls` global option (default: true) to control
  whether URLs extracted from content (PDF, etc.) are included in API calls
- Update task:get_urls(), task:get_emails() to include content URLs by default
- Update lua_util.extract_specific_urls() to use config default when
  need_content is not explicitly specified
- Mark URLs extracted from computed/virtual parts (PDF text) with CONTENT
  flag instead of FROM_TEXT flag, since they may be clickable links
- Add commented documentation in conf/options.inc

Users who want the old behavior can set `include_content_urls = false`
in their options configuration.

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 21 Jan 2026 08:57:13 +0000 (08:57 +0000)]

[Feature] Add order-independent table digest using XXH3 XOR accumulation

Add rspamd_cryptobox.fast_hash64() C function that returns XXH3 hash as
two 32-bit integers, enabling XOR accumulation for order-independent
hashing in Lua.

Add lua_util.unordered_table_digest() that produces consistent digests
regardless of table iteration order. This fixes issues where different
Rspamd instances produced different ANN digests for identical configs
due to non-deterministic key ordering in pairs().

The original table_digest had two bugs:
- Used pairs() which iterates in undefined order across Lua VMs
- Ignored numeric and boolean values in the hash

Update neural plugin's providers_config_digest to use the new function,
fixing the "providers config changed" warnings on identical configs.

Also update lua_maps and lua_urls_compose cache key generation to use
unordered_table_digest for more reliable cache hits.
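
The XOR accumulation idea can be sketched in Python as follows. This is a minimal illustration, not the Rspamd implementation: BLAKE2b truncated to 64 bits stands in for XXH3, and the names `hash64`, `encode` and `unordered_digest` are hypothetical. Each key/value pair is hashed on its own and the results are XORed together, so iteration order cannot change the digest. Numbers and booleans get a type tag so they contribute to the hash.

```python
import hashlib


def hash64(data: bytes) -> int:
    """64-bit hash of data. BLAKE2b is an assumed stand-in for XXH3."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


def encode(value) -> bytes:
    """Type-tagged encoding so numbers and booleans are not ignored."""
    if isinstance(value, bool):  # checked before int: bool is an int subclass
        return b"b:" + (b"1" if value else b"0")
    if isinstance(value, (int, float)):
        return b"n:" + repr(value).encode()
    if isinstance(value, str):
        return b"s:" + value.encode()
    if isinstance(value, dict):
        return b"t:" + unordered_digest(value).to_bytes(8, "little")
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def unordered_digest(table: dict) -> int:
    """XOR the per-pair hashes; XOR is commutative, so order is irrelevant."""
    acc = 0
    for key, value in table.items():
        k = encode(key)
        v = encode(value)
        # Length-prefix the key so ("ab", "c") and ("a", "bc") hash differently.
        acc ^= hash64(len(k).to_bytes(4, "little") + k + v)
    return acc
```

XOR cancels identical pairs, which is harmless here because table keys are unique.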

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 21 Jan 2026 08:19:21 +0000 (08:19 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 21:41:15 +0000 (21:41 +0000)]

[Fix] Clear pending regexp maps on config reload to prevent use-after-free

During HUP-triggered config reload, the pending_regexp_maps array retained
pointers to re_map objects from the old config after they were freed. When
workers received "regexp map loaded" notifications, they accessed freed memory
(visible as 0x5A poison pattern in re_digest), causing SIGSEGV.

Fix by calling rspamd_regexp_map_clear_pending() before releasing the old
config in reread_config().

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 16:54:43 +0000 (16:54 +0000)]

[Fix] Fix race condition between I/O handler and SIGCHLD in subprocess

The subprocess callback could crash when SIGCHLD handler ran concurrently
with the I/O handler processing large training results. The race:

1. I/O handler receives full data, calls callback
2. SIGCHLD fires during callback execution
3. SIGCHLD handler frees cbdata while callback still uses it
4. Callback returns, I/O handler accesses freed memory -> crash

Fix:
- Add 'dead' flag to track when child has exited
- Set 'replied' BEFORE calling callback (not after)
- SIGCHLD handler skips cleanup if replied=TRUE (I/O handler owns it)
- I/O handler does cleanup after callback if dead=TRUE
- Extract cleanup into rspamd_lua_cbdata_free() helper

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 16:13:21 +0000 (16:13 +0000)]

[Fix] Use rspamd_text for subprocess callback data to avoid large allocations

Replace lua_pushlstring with lua_new_text(FALSE) when passing subprocess
result data to Lua callbacks. This avoids copying potentially large buffers
(e.g., 2.7MB neural network training results) into Lua's heap, which could
cause crashes under memory pressure.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 16:12:41 +0000 (16:12 +0000)]

[Fix] Fix ROC threshold calculation for ham/spam labels

The ROC calculation was checking outputs[i][1] == 0 for ham samples,
but the ceb_neg cost function uses -1.0 for ham and 1.0 for spam.
Changed to check outputs[i][1] < 0 to correctly identify ham samples.
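
A small Python sketch of the label check, assuming targets are stored as one-element lists holding -1.0 for ham and 1.0 for spam as described above. The function name and data layout are hypothetical; Lua's `outputs[i][1]` corresponds to index 0 here.

```python
def split_by_label(outputs, scores):
    """Partition scores into ham and spam using the sign of the target."""
    ham, spam = [], []
    for target, score in zip(outputs, scores):
        # Ham targets are -1.0, so test the sign rather than equality with 0.
        if target[0] < 0:
            ham.append(score)
        else:
            spam.append(score)
    return ham, spam


outputs = [[-1.0], [1.0], [-1.0]]
scores = [0.1, 0.9, 0.2]
ham, spam = split_by_label(outputs, scores)

# The old equality test never matches a -1.0 target, so no ham is found.
old_ham = [s for t, s in zip(outputs, scores) if t[0] == 0]
```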

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 14:25:29 +0000 (14:25 +0000)]

[Feature] Multi-layer funnel architecture for LLM embeddings

Add improved neural network architecture specifically for LLM embedding
inputs, while preserving backward compatibility for symbol-based rules.

Key changes:
- New create_embedding_ann() with multi-layer funnel architecture
- Auto-detection of LLM providers via uses_llm_embeddings()
- Support for configurable layers, dropout, layer normalization
- GELU activation by default when available (falls back to ReLU)
- Layer size auto-scaling based on input dimension:
  - >512 dims: 3 layers (0.5, 0.25, 0.125)
  - 256-512 dims: 2 layers (0.5, 0.25)
  - <256 dims: 1 layer (0.5)

Bug fixes:
- Wrap create_ann in pcall to handle errors gracefully
- Reset learning_spawned flag on ANN creation failure
- Replace assert(false) with proper error logging that resets state
- Prevents training from getting stuck after errors

New configuration options:
- layers: explicit layer size multipliers
- dropout: dropout rate (default 0.2 for embeddings)
- use_layernorm: enable layer normalization (default true)
- activation: 'gelu' or 'relu' (default 'gelu' if available)
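
The auto-scaling rule can be sketched in Python as follows. This assumes each multiplier is applied to the input dimension and that the 256-512 range is inclusive at both ends; `funnel_layer_sizes` is a hypothetical name, not the plugin's function.

```python
def funnel_layer_sizes(input_dim, multipliers=None):
    """Return hidden layer widths for a funnel network.

    Explicit multipliers (the `layers` option) take precedence; otherwise
    the depth is chosen from the input dimension.
    """
    if multipliers is None:
        if input_dim > 512:
            multipliers = (0.5, 0.25, 0.125)
        elif input_dim >= 256:
            multipliers = (0.5, 0.25)
        else:
            multipliers = (0.5,)
    # Assumption: widths are input_dim * multiplier, at least 1 unit wide.
    return [max(1, int(input_dim * m)) for m in multipliers]
```

For example, a 1024-dimensional embedding would get hidden layers of 512, 256 and 128 units under these assumptions.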

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 14:20:51 +0000 (14:20 +0000)]

[Feature] Add GELU activation and expose dropout in KANN bindings

- Implement GELU (Gaussian Error Linear Unit) activation function
  using erf: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
- Add proper forward and backward passes for GELU
- Register GELU as operation #37 in kad_op_list
- Expose dropout layer to Lua (function existed but wasn't registered)
- Add Lua bindings for rspamd_kann.transform.gelu

GELU is often better than ReLU for transformer-like architectures
and high-dimensional embedding inputs.
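
For reference, the formula above and its derivative can be written in a few lines of Python. This is a scalar sketch of the math only, not the KANN operator; the function names are hypothetical. The backward pass follows from the product rule: with Phi the standard normal CDF and phi its density, d/dx [x * Phi(x)] = Phi(x) + x * phi(x).

```python
import math


def gelu(x: float) -> float:
    """Forward pass: GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_grad(x: float) -> float:
    """Backward pass: d/dx GELU(x) = Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))          # Phi(x)
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)   # phi(x)
    return cdf + x * pdf
```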

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 12:16:36 +0000 (12:16 +0000)]

Add GPU and vast.ai support for neural embedding service

- Add Dockerfile.gpu for GPU-accelerated inference with PyTorch CUDA
- Add requirements-gpu.txt with pinned versions for CUDA compatibility
- Add vastai-launch.sh script for deploying on vast.ai cloud GPUs
- Update README with GPU deployment instructions and model recommendations

Default GPU model: intfloat/multilingual-e5-large (100+ languages including Russian)
Tested on RTX 4090 with ~20-50ms latency per embedding.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 20 Jan 2026 11:09:47 +0000 (11:09 +0000)]

[Fix] Prefer higher version ANN profiles when symbol distances are equal

When multiple ANN profiles have the same symbol distance, the profile
selection would pick the first one encountered rather than the newest.
This caused issues when a newly trained ANN (version 1) existed alongside
the initial profile (version 0) - the scanner would select version 0
which had no actual ANN data.

Fix by adding a secondary selection criterion: when distances are equal,
prefer the profile with the higher version number.
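
The tie-break can be expressed as a compound sort key. The sketch below is illustrative Python, assuming each profile carries a `distance` and a `version` field (hypothetical names): the smallest distance wins, and among equal distances the highest version wins.

```python
def select_profile(profiles):
    """Pick the closest profile; on equal distance prefer the newest version."""
    return min(profiles, key=lambda p: (p["distance"], -p["version"]))


profiles = [
    {"name": "initial", "distance": 0, "version": 0},  # no ANN data yet
    {"name": "trained", "distance": 0, "version": 1},
    {"name": "other", "distance": 3, "version": 5},
]
chosen = select_profile(profiles)
```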

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 20:22:35 +0000 (20:22 +0000)]

[Fix] Remove unused variables in neural controller

Remove unused ev_base and ann_key variables to fix luacheck warnings.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 19:14:11 +0000 (19:14 +0000)]

[Fix] Fix messagepack cache decoding format string

The UCL parser's parse_text() only accepts 'msgpack' as the format
string for messagepack parsing, while to_format() accepts both
'msgpack' and 'messagepack'. This mismatch caused cached data to
fail decoding and appear as cache misses.

Fixes LLM embedding cache never being read back despite being stored.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 18:41:02 +0000 (18:41 +0000)]

[Fix] Prevent concurrent neural network training races

- Add learning_spawned check at start of do_train_ann to prevent
  concurrent async Redis operations
- Move learning_spawned flag to start of spawn_train for earlier
  lock acquisition
- Remove redundant flag assignments later in the training flow

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 18:08:58 +0000 (18:08 +0000)]

[Feature] Add pending training keys and fix neural network training issues

- Add pending_train_key() for version-independent training vector storage
- Fix variable shadowing bug where ann_trained callback was overwritten
- Add concurrent training prevention via learning_spawned check
- Replace assert with proper error handling for msgpack parsing
- Clean up pending keys after successful training
- Update controller endpoint to use pending keys for manual training
- Fix ev_base:sleep() to register with session events properly
- Update classifier_test.lua to support llm_embeddings classifier testing

Co-Authored-By: Claude <noreply@anthropic.com>

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 15:32:43 +0000 (15:32 +0000)]

[Feature] Add ev_base:sleep() method for Lua

Add sleep method to ev_base that supports both sync and async modes:
- ev_base:sleep(time) - sync mode using coroutines
- ev_base:sleep(time, callback) - async mode with callback

Sync mode yields the current coroutine and resumes after timeout.
Async mode schedules the callback to run after the timeout.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 14:41:30 +0000 (14:41 +0000)]

[Fix] Skip external map queries when Settings header is provided

When settings are specified manually via the Settings HTTP header,
external map queries should not be executed as they may override
the manually provided settings asynchronously.

This prevents connection errors to external maps from affecting
requests that explicitly provide their own settings.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 14:11:05 +0000 (14:11 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 14:09:01 +0000 (14:09 +0000)]

[Fix] Guard fuzzy TCP session cleanup

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 19 Jan 2026 09:29:44 +0000 (09:29 +0000)]

[Feature] Add language-based model/URL selection for LLM embeddings

Support language-specific embedding models via language_models config:
- Shorthand: language_models = { ru = "model-name" }
- Full config: language_models = { ru = { model, url, api_key } }

Uses get_displayed_text_part() for language detection.
Include language in cache key for proper separation.
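
One way to picture the two config forms is a resolver that normalizes the shorthand into the full form and falls back to the default provider settings. The Python below is a sketch under stated assumptions: the top-level `model`/`url`/`api_key` keys, the fallback behaviour, and the cache key layout are all hypothetical and not taken from the plugin.

```python
def resolve_provider(cfg, language):
    """Return model/url/api_key for a language, falling back to defaults."""
    base = {k: cfg.get(k) for k in ("model", "url", "api_key")}
    override = cfg.get("language_models", {}).get(language)
    if override is None:
        return base
    if isinstance(override, str):
        # Shorthand form: only the model name changes.
        override = {"model": override}
    return {**base, **override}


def cache_key(content_digest, provider, language):
    """Include the language so per-language embeddings never collide."""
    return f"{provider['model']}:{language or 'default'}:{content_digest}"
```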

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 18 Jan 2026 17:31:40 +0000 (17:31 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 18 Jan 2026 17:30:04 +0000 (17:30 +0000)]

Merge pull request #5845 from rspamd/feature/extract-text-limited

[Feature] Add extract_text_limited for email text extraction with limits

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 18 Jan 2026 17:29:51 +0000 (17:29 +0000)]

Merge pull request #5846 from moisseev/webui

[Minor] Fix WebUI symbols frequency column sorting

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 18 Jan 2026 13:19:50 +0000 (13:19 +0000)]

[Feature] Add reply_trim_mode for LLM input

commit | commitdiff | tree

Vsevolod Stakhov [Sun, 18 Jan 2026 12:04:16 +0000 (12:04 +0000)]

[Fix] Improve reply header trimming

commit | commitdiff | tree

Alexander Moisseev [Sun, 18 Jan 2026 08:20:58 +0000 (11:20 +0300)]

[Minor] WebUI: Add frequency stddev column and units to symbols table

- Add frequency standard deviation column with the same exponential
  scaling as frequency for consistent notation. Hidden on smaller
  screens (lg breakpoint)
- Display units (hits/s for frequencies, s for time) in table headers
- Remove "s" suffix from time cells (unit now in header)

commit | commitdiff | tree

Alexander Moisseev [Sun, 18 Jan 2026 06:04:08 +0000 (09:04 +0300)]

[Fix] Calculate frequency exponent from non-zero values only

Fix suboptimal exponential notation selection in WebUI symbols
frequency display. Previously, the exponent was calculated from
the average of all frequency values including zeros, resulting
in unnecessarily small exponents (e.g., 2300.00e-8 instead of
2.30e-5). Now only non-zero values are used for calculation,
producing more readable notation.

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 17 Jan 2026 15:58:14 +0000 (15:58 +0000)]

[Feature] Add extract_text_limited for email text extraction with limits

Add lua_mime.extract_text_limited() function to extract meaningful text from
emails with long reply chains while respecting size limits.

Features:
- max_bytes: Hard limit on output size (default: 32KB)
- max_words: Alternative limit by word count
- strip_quotes: Remove quoted replies (lines starting with >)
- strip_reply_headers: Remove reply headers (On X wrote:, From: Sent:)
- strip_signatures: Remove signature blocks (-- separator, mobile signatures)
- smart_trim: Enable all heuristics

Implementation:
- Uses rspamd_text:lines() iterator for memory-efficient line processing
- No full string interning of email content (better for large emails)
- rspamd_trie for multi-pattern matching (67 signature, 44 reply patterns)
- rspamd_regexp for regex patterns (wrote:, schrieb:, etc.)
- Single-pass O(n) algorithm with early termination

Multilingual support for 10+ languages:
- English, German, French, Spanish, Russian, Portuguese, Italian
- Chinese, Japanese, Polish

Configuration API:
- lua_mime.configure_text_extraction(cfg) for custom patterns
- Supports extend_defaults to add patterns without replacing defaults

CLI integration in rspamadm mime ex:
- -L/--limit, -Q/--strip-quotes, -S/--strip-signatures
- -R/--strip-reply-headers, -T/--smart-trim

Also updates llm_common.build_llm_input() to use the new function.

commit | commitdiff | tree

Alexander Moisseev [Sat, 17 Jan 2026 16:45:44 +0000 (19:45 +0300)]

[Minor] Unify sortValue functions to arrow functions

Convert all sortValue functions in FooTable column definitions to
arrow functions with uniform parameter naming, for consistency
across the codebase.

commit | commitdiff | tree

Alexander Moisseev [Sat, 17 Jan 2026 16:27:44 +0000 (19:27 +0300)]

[Minor] Fix WebUI symbols frequency column sorting

Previously, frequency values with exponential notation (e.g., "0.00e-5",
"389.40e-5") were compared as strings, causing incorrect sort order.

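The string-versus-number sorting problem described above is easy to reproduce. The WebUI code is JavaScript; the Python sketch below only illustrates the effect, using made-up sample values in the same notation.

```python
values = ["389.40e-5", "0.00e-5", "12.10e-5", "2.30e-5"]

# Lexicographic comparison puts "12.10e-5" before "2.30e-5".
as_strings = sorted(values)

# Parsing each cell to a number first restores the expected order.
as_numbers = sorted(values, key=float)
```
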
commit | commitdiff | tree

Vsevolod Stakhov [Sat, 17 Jan 2026 12:19:17 +0000 (12:19 +0000)]

[Fix] Normalize request header values

commit | commitdiff | tree

Vsevolod Stakhov [Sat, 17 Jan 2026 10:46:13 +0000 (10:46 +0000)]

[Fix] Stabilize neural LLM embedding training and cache keys

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 16 Jan 2026 17:45:00 +0000 (17:45 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 16 Jan 2026 13:25:04 +0000 (13:25 +0000)]

[Fix] Fix fuzzystat control replies

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 16 Jan 2026 12:17:51 +0000 (12:17 +0000)]

[Fix] Avoid case-only alias rewrites

Refs #5843

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 16 Jan 2026 10:59:14 +0000 (10:59 +0000)]

[Fix] Respect headers_modify_mode for fuzzy hash headers

commit | commitdiff | tree

Vsevolod Stakhov [Fri, 16 Jan 2026 08:59:17 +0000 (08:59 +0000)]

Merge pull request #5842 from fatalbanana/rl_compat

[Fix] ratelimit: fix compatibility with old records

commit | commitdiff | tree

Andrew Lewis [Thu, 15 Jan 2026 15:33:46 +0000 (17:33 +0200)]

[Fix] ratelimit: fix compatibility with old records

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 15 Jan 2026 15:02:48 +0000 (15:02 +0000)]

Merge pull request #5839 from bneumeier/master

Allow for use of Lua 5.5

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 15 Jan 2026 14:54:51 +0000 (14:54 +0000)]

Merge pull request #5840 from moisseev/frequency

[Fix] Use proper rounding for symbol frequency statistics

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 15 Jan 2026 14:54:37 +0000 (14:54 +0000)]

Merge pull request #5841 from fatalbanana/log_keys

Lua: populate missing log keys

commit | commitdiff | tree

Andrew Lewis [Thu, 15 Jan 2026 12:29:07 +0000 (14:29 +0200)]

[Minor] Satisfy luacheck

commit | commitdiff | tree

Andrew Lewis [Thu, 15 Jan 2026 12:13:22 +0000 (14:13 +0200)]

[Minor] Return errors from lua_redis.load_redis_script_from_file

commit | commitdiff | tree

Andrew Lewis [Thu, 15 Jan 2026 12:06:37 +0000 (14:06 +0200)]

[Minor] populate missing log keys in plugins, lualib

commit | commitdiff | tree

Vsevolod Stakhov [Thu, 15 Jan 2026 10:14:55 +0000 (10:14 +0000)]

[Fix] Silence zlib preset dictionary inflate errors

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 14 Jan 2026 22:48:06 +0000 (22:48 +0000)]

[Fix] Propagate control request ids in replies

Ensure workers include cmd->id in control replies to avoid 'unknown request id 0' warnings. Update functional control tests and make RSPAMD_TMPDIR visible to child suites.

commit | commitdiff | tree

Alexander Moisseev [Wed, 14 Jan 2026 16:11:50 +0000 (19:11 +0300)]

[Fix] Use proper rounding for symbol frequency statistics

- Replace incorrect floor() with round() in rounding functions to avoid
  losing small values
- Increase counters API frequency precision from 3 to 6 decimal places.
  Five places are needed so that rspamc does not display values as
  multiples of 0.06, and six for the /counters endpoint itself. There is
  no additional overhead, as JSON stores a double anyway
- Add frequency_stddev field to counters API output (fixes zero stddev in
  `rspamc counters` output)
- Clarify `rspamc counters` table header with "avg (stddev)" subheading
- Fix WebUI to preserve frequency precision before scaling

Example for symbol with frequency 0.004772 hits/sec:
- Before: /symbols returns 0.004772, /counters returns 0.004000,
  `rspamc counters` shows 0.240
- After:  /symbols returns 0.004772, /counters returns 0.004772,
  `rspamc counters` shows 0.286
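
The numbers in this example can be reproduced with a short Python sketch. It assumes that `rspamc counters` shows the per-second frequency multiplied by 60, which is inferred from the 0.004 -> 0.240 figures above rather than stated in the commit; the helper names are hypothetical.

```python
import math


def floor_to(value, places):
    """Old behaviour: truncate toward zero at the given precision."""
    scale = 10 ** places
    return math.floor(value * scale) / scale


def round_to(value, places):
    """New behaviour: round half up at the given precision."""
    scale = 10 ** places
    return math.floor(value * scale + 0.5) / scale


freq = 0.004772                      # hits per second
before = floor_to(freq, 3)           # 3 decimal places, floor
after = round_to(freq, 6)            # 6 decimal places, round

shown_before = f"{before * 60:.3f}"  # assumed per-minute display
shown_after = f"{after * 60:.3f}"
```

With three decimal places every reported value is a multiple of 0.001, which is why the per-minute display could only show multiples of 0.06.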

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 14 Jan 2026 14:29:32 +0000 (14:29 +0000)]

[Feature] Route all hyperscan cache operations through Lua backend

- Route file backend through Lua for consistency with redis/http
- Add zstd compression support with magic byte detection for backward
  compatibility (reads both .hs and .hs.zst files)
- Fix rspamd_util.stat() return value handling (returns err, stat tuple)
- Fix timer management for synchronous Lua callbacks to prevent early
  termination of re_cache compilation
- Fix use-after-free in load path by pre-counting pending items
- Add priority queue for re_cache compilation (short lists first)
- Add ev_run() flush before blocking hyperscan compilations to ensure
  busy notifications are sent
- Add hyperscan_notice_known() and hyperscan_get_platform_id() Lua APIs
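
Magic byte detection for zstd can be sketched as below. The four-byte Zstandard frame magic (0xFD2FB528, little-endian) is part of the format; the function names and the idea of returning a reader label are hypothetical and do not mirror the Lua backend.

```python
# Zstandard frame magic number 0xFD2FB528, stored little-endian.
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"


def is_zstd(blob: bytes) -> bool:
    """True if the buffer starts with a Zstandard frame header."""
    return blob[:4] == ZSTD_MAGIC


def pick_reader(blob: bytes) -> str:
    """Choose how to read a cache file regardless of its extension."""
    return "zstd" if is_zstd(blob) else "raw"
```

Checking the content instead of the file name lets old uncompressed files and new compressed ones coexist in the same cache directory.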

commit | commitdiff | tree

Vsevolod Stakhov [Wed, 14 Jan 2026 10:30:51 +0000 (10:30 +0000)]

[Feature] Add ASCII85 decode support for PDF text extraction

PDFs may use ASCII85Decode filter for content streams. This was causing
text extraction to fail for such PDFs, resulting in missed URLs and emails.

- Add rspamd_decode_ascii85_buf() in str_util.c
- Add rspamd_util.decode_ascii85() Lua binding
- Add ASCII85Decode filter support in pdf.lua
- Add --raw flag to rspamadm mime urls command
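
For illustration, an ASCII85-encoded PDF stream can be decoded with Python's standard library. This is a sketch of the filter's behaviour, not a port of rspamd_decode_ascii85_buf(); `decode_pdf_ascii85` is a hypothetical helper. PDF marks the end of the data with `~>`, and whitespace inside the stream is ignored.

```python
import base64


def decode_pdf_ascii85(stream: bytes) -> bytes:
    """Decode an ASCII85Decode stream, stopping at the ~> end marker."""
    body = stream.strip()
    end = body.find(b"~>")
    if end != -1:
        body = body[:end]
    # a85decode skips spaces, tabs and newlines by default.
    return base64.a85decode(body)


content = b"BT /F1 12 Tf (Hello) Tj ET"
encoded = base64.a85encode(content) + b"~>"
decoded = decode_pdf_ascii85(encoded)
```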

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 21:52:13 +0000 (21:52 +0000)]

[Fix] Refactor control socket to use ID-based request/reply matching

Replace the serialization-based control command handling with an ID-based
approach using khash, mirroring the existing rspamd_srv_requests pattern.

Key changes:
- Add uint64_t id field to control command/reply structs
- Use khash for O(1) request lookup by ID instead of GHashTable
- Add rspamd_control_reply_handler() for centralized reply processing
- Add rspamd_control_pending_new/destroy/remove_all() API functions
- Add control_ev watcher to worker struct for reply monitoring
- Call rspamd_srv_pipe_cleanup() on worker shutdown to prevent leaks
- Handle ID collisions gracefully (warn and free old entry)

This fixes hash table iterator corruption crashes that occurred when
modifying the hash during iteration, and provides more robust concurrent
command handling.
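
The ID-based matching pattern can be sketched with a dictionary keyed by request ID. The Python below is a conceptual model only: a dict stands in for khash, and the class and method names are hypothetical. Replies may arrive in any order, and an unknown ID is reported instead of being dispatched.

```python
import itertools


class PendingRequests:
    """Track in-flight commands and dispatch replies by ID."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._pending = {}

    def send(self, command, on_reply):
        """Register a callback and return the message to put on the wire."""
        req_id = next(self._ids)
        self._pending[req_id] = on_reply
        return {"id": req_id, "command": command}

    def handle_reply(self, reply):
        """Look up and remove the entry before calling back."""
        on_reply = self._pending.pop(reply["id"], None)
        if on_reply is None:
            return False  # unknown request id
        on_reply(reply)
        return True
```

Removing the entry before invoking the callback means the callback can safely register new requests without disturbing a lookup in progress.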

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 16:07:24 +0000 (16:07 +0000)]

[Fix] Fix neural controller API paths

commit | commitdiff | tree

Brett Neumeier [Tue, 13 Jan 2026 16:04:28 +0000 (10:04 -0600)]

convert spaces to tabs

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 15:36:31 +0000 (15:36 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Brett Neumeier [Tue, 13 Jan 2026 15:03:47 +0000 (09:03 -0600)]

Allow for use of Lua 5.5

In Lua 5.5, the signature for lua_newstate changes
(there is now an additional "seed" argument),
and the mechanism for adjusting garbage collection also changes.

I don't know whether the seed of 0 is ideal when using lua_newstate;
probably, using a random seed would be better. This is a minimal patch
that gets back to a working build.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 14:19:53 +0000 (14:19 +0000)]

[Feature] Disable periodic recompile timer for file cache backend

The periodic recompile timer (default 60s) is only useful for shared
backends (Redis, HTTP, Lua) where another rspamd instance might have
compiled new hyperscan databases.

For file backend, recompilation is already triggered by:
- Config reload (forks new hs_helper process)
- Explicit RECOMPILE command (sent on map updates)

This eliminates unnecessary periodic checks for file-based deployments.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 13:56:27 +0000 (13:56 +0000)]

[Fix] Fix re_cache hyperscan file tracking and buffer size

Two fixes for hyperscan cache file handling:

1. Increase hyperscan_cache_file.filename buffer from 64 to 80 bytes
   to accommodate full filenames (64 hex hash + ".hs.unser" = 73 chars)

2. Add rspamd_hyperscan_notice_known() call in re_cache.c after loading
   hyperscan databases. Without this, re_cache files weren't registered
   as "known" and would be deleted by cleanup_maybe() on restart,
   causing unnecessary recompilation.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 11:46:29 +0000 (11:46 +0000)]

Merge branch 'master' into vstakhov-llm-embedding-improvements

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 11:46:03 +0000 (11:46 +0000)]

Merge pull request #5837 from rspamd/vstakhov-control-async

[Fix] Refactor control pipe to prevent deadlocks and crashes

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 11:25:28 +0000 (11:25 +0000)]

[Cleanup] Remove unused CONTROL_PATHLEN macro

No longer used after reducing hyperscan_cache_file to fixed 64-byte
filename field.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 11:02:09 +0000 (11:02 +0000)]

[Fix] Fix fd leaks and double-free in srv_pipe error handling

- Close attached_fd before freeing request data when sendmsg fails
- Fix double-free in rspamd_srv_pipe_ctx_destroy: items in send_queue
  are also in the hash table, so only iterate hash to free
- Close attached_fd for unsent requests during shutdown

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 10:44:42 +0000 (10:44 +0000)]

[Fix] Reduce hyperscan_cache_file command from CONTROL_PATHLEN to 64 bytes

Send only the filename (hash.hs) instead of the full path in the
hyperscan cache notification. Main process reconstructs the full
path using cfg->hs_cache_dir.

This is the last CONTROL_PATHLEN field in rspamd_srv_command.

commit | commitdiff | tree

Vsevolod Stakhov [Tue, 13 Jan 2026 10:20:58 +0000 (10:20 +0000)]

[Fix] Refactor srv_pipe to use queue-based architecture with ID dispatch

Replace per-request ev_io watchers with a single watcher using khash
for ID-based reply matching. This fixes potential deadlocks when multiple
commands are queued rapidly (e.g., during hyperscan compilation).

Changes:
- Add rspamd_srv_pipe_ctx with single watcher, send queue, and ID hash
- Make srv_pipe non-blocking on both ends with proper EAGAIN handling
- Add EAGAIN handling to main process write path
- Remove cache_dir from hs_loaded commands (available from config)

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 12 Jan 2026 16:25:30 +0000 (16:25 +0000)]

[Fix] Reduce control message size to prevent sendmsg crash

The rspamd_srv_command and rspamd_control_command structures grew too
large (~8KB) due to multiple CONTROL_PATHLEN fields in mp_loaded and
re_map_loaded, exceeding socket buffer limits and causing crashes in
sendmsg during worker startup.

Fix by:
- Removing redundant cache_dir fields (all processes know it from config)
- Using consistent name[64] for both mp_loaded and re_map_loaded
- Getting cache_dir from cfg->hs_cache_dir at receive time instead

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 12 Jan 2026 12:56:57 +0000 (12:56 +0000)]

[Fix] Correct CSS duplicate property handling to use last declaration

Fix two bugs in CSS property handling that caused text to be incorrectly
marked as invisible:

1. Fixed isset() macro misuse in override_values() - was passing a bitmask
   instead of a bit index, causing the override to never find matching values

2. Changed add_rule() to call override_values() instead of merge_values()
   when duplicate properties with normal priority are encountered, ensuring
   later CSS declarations properly override earlier ones per CSS spec

This fixes an issue where HTML emails with duplicate color declarations
(e.g., "color:#FFFFFF;color:#232333") would have text incorrectly filtered
as invisible, since only the first color was being used.

Added test case for duplicate color property handling.
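
The "last declaration wins" rule for same-priority duplicates can be illustrated with a tiny parser. This Python sketch is not Rspamd's CSS code: it ignores `!important` and everything else beyond simple `name:value` pairs, and `parse_declarations` is a hypothetical helper.

```python
def parse_declarations(style: str) -> dict:
    """Parse an inline style; later duplicates override earlier ones."""
    props = {}
    for decl in style.split(";"):
        name, sep, value = decl.partition(":")
        if not sep:
            continue  # skip empty or malformed fragments
        props[name.strip().lower()] = value.strip()
    return props


style = parse_declarations("color:#FFFFFF;color:#232333")
```

Here the effective colour is the dark `#232333`, so the text should not be treated as invisible white-on-white.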

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 12 Jan 2026 12:18:33 +0000 (12:18 +0000)]

[Fix] Include content URLs in rspamadm mime urls output

Change get_urls(true) to get_urls_filtered() to include URLs
extracted from content (e.g., PDF attachments) in the output.

The get_urls() function excludes RSPAMD_URL_FLAG_CONTENT URLs
by default for backward compatibility, but get_urls_filtered()
with no arguments returns all URLs including content URLs.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 12 Jan 2026 12:17:32 +0000 (12:17 +0000)]

[Fix] Add defensive checks to PDF parser for malformed input

Add pcall wrappers and type checks throughout pdf.lua to handle
malformed PDFs from untrusted sources without crashing:

- Add nil checks for stream objects before accessing fields
- Wrap grammar matches in pcall to catch parsing errors
- Add type validation before ipairs calls on trie match results
- Wrap span extractions in pcall to handle invalid offsets
- Add defensive checks in processor functions (trailer, suspicious)
- Wrap URL creation in pcall for malformed URI strings

Errors are logged via debugm for diagnosis while allowing
processing to continue gracefully.

commit | commitdiff | tree

Vsevolod Stakhov [Mon, 12 Jan 2026 10:58:03 +0000 (10:58 +0000)]

[Feature] Add expression-based autolearn for neural LLM providers

Add integrated autolearn system for neural networks with LLM providers:

- New lua_neural_learn library with guards system and rspamd_expression
  support for complex conditions
- Expression-based conditions: spam_condition, ham_condition using
  rspamd_expression syntax (e.g., "BAYES_SPAM & DMARC_POLICY_REJECT")
- Score, action, and symbol-based thresholds
- Pluggable guards via rspamd_plugins['neural'].autolearn hooks
- Mempool-based flag passing (no double scanning)
- Probabilistic sampling for training volume control

Also includes contrib/neural-embedding-service with a FastEmbed-based
Python service for CPU-optimized embedding inference, compatible with
both Ollama and OpenAI API formats.

Configuration example:
  autolearn {
    enabled = true;
    spam_score = 15.0;
    spam_condition = "BAYES_SPAM & (DMARC_POLICY_REJECT | RBL_SPAMHAUS)";
    ham_condition = "BAYES_HAM & DKIM_VALID_AU & SPF_PASS";
  }
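
As an illustration of how a condition such as spam_condition could be
checked against the symbols of a scanned message, here is a minimal Python
sketch. It is not the lua_neural_learn code: eval_condition and its small
grammar (symbol names, &, |, ! and parentheses) are assumptions made for
this example, and rspamd_expression supports more than this.

```python
import re

# A token is either a symbol name or one of the operators & | ! ( ).
TOKEN_RE = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|[&|!()])")
OPERATORS = {"&", "|", "!", ")"}


def tokenize(expr):
    tokens, pos = [], 0
    expr = expr.rstrip()
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected input at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def eval_condition(expr, symbols):
    """Evaluate expr; a symbol name is true when it is present in `symbols`."""
    tokens = tokenize(expr)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = peek()
        pos += 1
        return tok

    def parse_or():  # lowest precedence
        value = parse_and()
        while peek() == "|":
            take()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "&":
            take()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "!":
            take()
            return not parse_not()
        return parse_atom()

    def parse_atom():
        tok = take()
        if tok == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing ')'")
            return value
        if tok is None or tok in OPERATORS:
            raise ValueError("expected a symbol name")
        return tok in symbols

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("trailing tokens")
    return result
```

With the spam_condition from the example above, a message carrying
BAYES_SPAM and RBL_SPAMHAUS satisfies the condition, while BAYES_SPAM alone
does not.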

Vsevolod Stakhov [Mon, 12 Jan 2026 08:39:46 +0000 (08:39 +0000)]

[Fix] Prevent infinite loop in fuzzy_check config transform

When transforming max_score -> hits_limit for backward compatibility,
directly assigning UCL object references between fields can corrupt
the internal linked list pointers (next/prev become self-referential).

This caused an infinite loop in ucl_object_lua_push_array() when the
C code tried to push the config object to Lua via LL_FOREACH macro.

Fix by using tonumber() to extract the numeric value instead of
copying the UCL object reference.

Reported-by: User via GDB backtrace showing hang at lua_ucl.c:240
Fixes: 7fd47dad2f9 ("[Feature] Rename fuzzy_check max_score to hits_limit for clarity")

Vsevolod Stakhov [Mon, 12 Jan 2026 08:32:06 +0000 (08:32 +0000)]

Merge pull request #5832 from rspamd/vstakhov-ct-management

[Feature] Add HTTP content negotiation framework

Vsevolod Stakhov [Sun, 11 Jan 2026 20:23:55 +0000 (20:23 +0000)]

[Fix] Use base name for OpenMetrics counter TYPE declarations

OpenMetrics specification requires counter metrics to have _total suffix
on the metric value, but HELP and TYPE declarations must use the base
name without the suffix.

Before: # TYPE rspamd_scanned_total counter
After: # TYPE rspamd_scanned counter

This fixes parser rejections due to name clashes when metrics scrapers
see _total in the TYPE line and append another _total.
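
The naming rule can be sketched in Python as follows. This is only an
illustration, not the rspamd C code: render_counter and the help text used
in the test are hypothetical.

```python
def render_counter(base_name, help_text, value):
    """Render one counter in OpenMetrics text form.

    HELP and TYPE use the base name; only the sample line gets _total.
    """
    # Tolerate callers that pass the suffixed name by stripping it first.
    if base_name.endswith("_total"):
        base_name = base_name[: -len("_total")]
    lines = [
        f"# HELP {base_name} {help_text}",
        f"# TYPE {base_name} counter",
        f"{base_name}_total {value}",
    ]
    return "\n".join(lines) + "\n"
```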

Vsevolod Stakhov [Sun, 11 Jan 2026 20:08:49 +0000 (20:08 +0000)]

[Feature] Add content negotiation for /stat endpoint and zstd compression

- Update /stat handler to use rspamd_controller_send_ucl_negotiated
  for Accept header content-type negotiation (JSON/msgpack)
- Add zstd compression support to rspamd_controller_maybe_compress,
  preferred over gzip when client supports it
- Add functional robot tests for content negotiation covering:
  - OpenMetrics/text/plain Accept headers for /metrics
  - JSON/msgpack Accept headers for /stat
  - gzip/zstd Accept-Encoding compression
  - Quality factor parsing
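
The "zstd preferred over gzip when the client supports it" rule can be
sketched in Python. This is an assumption-laden illustration rather than
the logic of rspamd_controller_maybe_compress: choose_encoding is a
hypothetical helper that applies a fixed server-side preference order and
treats q=0 as "not acceptable".

```python
def choose_encoding(accept_encoding, preference=("zstd", "gzip")):
    """Return the first preferred coding the client accepts, or None."""
    accepted = {}
    for item in accept_encoding.split(","):
        name, _, params = item.strip().partition(";")
        name = name.strip().lower()
        if not name:
            continue
        q = 1.0  # a coding without an explicit q is fully acceptable
        for param in params.split(";"):
            key, _, val = param.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat an unparsable q as a refusal
        accepted[name] = q
    for encoding in preference:
        if accepted.get(encoding, 0.0) > 0:
            return encoding
    return None
```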

Vsevolod Stakhov [Sun, 11 Jan 2026 18:08:02 +0000 (18:08 +0000)]

[Feature] Add HTTP content negotiation framework

Add content type negotiation based on Accept header for HTTP responses.
This allows clients like DataDog's OpenMetrics scraper to receive
responses with Content-Type matching their Accept header preferences.

- Add http_content_negotiation.c/h with Accept header parsing
- Support quality factors (q=) in Accept header
- Parse Accept-Encoding for gzip/zstd/deflate support
- Add rspamd_controller_send_openmetrics_negotiated()
- Update /metrics endpoint to negotiate Content-Type
- Fallback to text/plain for Prometheus 0.0.4 compatibility
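
A minimal Python sketch of Accept-based negotiation with quality factors is
shown here for illustration. It does not mirror http_content_negotiation.c:
parse_accept and negotiate are hypothetical names, media type parameters
other than q are ignored, and the text/plain default follows the fallback
described above.

```python
def parse_accept(header):
    """Return [(media_type, q)] ordered by descending q (header order on ties)."""
    entries = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in parts[1:]:
            name, _, val = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0
        entries.append((media_type, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(entries, key=lambda entry: -entry[1])


def negotiate(header, offered, default="text/plain"):
    """Pick the offered type the client prefers, or `default` if none match."""
    for media_type, q in parse_accept(header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        for candidate in offered:
            main_type = candidate.split("/")[0]
            if media_type in (candidate, "*/*", main_type + "/*"):
                return candidate
    return default
```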

Vsevolod Stakhov [Sun, 11 Jan 2026 17:18:36 +0000 (17:18 +0000)]

Merge pull request #5813 from rspamd/vstakhov-pluggable-hs-cache

Add pluggable hyperscan cache storage infrastructure

Vsevolod Stakhov [Sun, 11 Jan 2026 16:39:46 +0000 (16:39 +0000)]

[Fix] Include TLD patterns in ACISM fallback

Vsevolod Stakhov [Sun, 11 Jan 2026 16:37:24 +0000 (16:37 +0000)]

[Fix] Fix pattern duplication in multipattern without hyperscan

Vsevolod Stakhov [Sun, 11 Jan 2026 15:47:29 +0000 (15:47 +0000)]

[Fix] Support building hs_cache_backend without Hyperscan

Vsevolod Stakhov [Sun, 11 Jan 2026 15:04:28 +0000 (15:04 +0000)]

[Conf] Add Redis backend example to hs_helper worker config

Vsevolod Stakhov [Sun, 11 Jan 2026 11:51:41 +0000 (11:51 +0000)]

[Minor] Add state machine diagram to hs_helper.c

Vsevolod Stakhov [Sun, 11 Jan 2026 11:40:58 +0000 (11:40 +0000)]

[Minor] Remove .factory from version control

Vsevolod Stakhov [Sun, 11 Jan 2026 11:38:22 +0000 (11:38 +0000)]

[Fix] Enable FALLBACK mode for RE multipatterns (stop words)

- Create pats array for all multipatterns, not just TLD
- Use rspamd_multipattern_build_acism() for proper RE fallback
- Add regex fallback path in lookup while HS is compiling
- Clean up mp->res in destructor for hyperscan path

This makes stop-word multipatterns, which use RSPAMD_MULTIPATTERN_RE,
properly use FALLBACK mode instead of falling through to SYNC mode
and creating .hs files during config loading.

Vsevolod Stakhov [Sun, 11 Jan 2026 09:54:53 +0000 (09:54 +0000)]

[Feature] Use async hyperscan compilation for language detection stop words

Use FALLBACK mode for stop words - build ACISM trie first for immediate use,
then queue for async hyperscan compilation via hs_helper.

This is the same approach used for TLD/publicsuffix patterns.
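
The general shape of this fallback pattern can be sketched in Python. The
sketch is an analogy only: FallbackMatcher, install_compiled and
slow_compile are hypothetical names, a plain substring scan stands in for
the ACISM trie, and a compiled regular expression stands in for the
hyperscan database delivered later by hs_helper.

```python
import re


class FallbackMatcher:
    """Serve lookups from a cheap matcher until a compiled one is installed."""

    def __init__(self, words):
        self.words = list(words)
        self.compiled = None  # filled in once the slow compilation finishes

    def install_compiled(self, compiled):
        # Called when the asynchronous compilation result becomes available.
        self.compiled = compiled

    def lookup(self, text):
        if self.compiled is not None:
            return {m.group(0) for m in self.compiled.finditer(text)}
        # Fallback path: usable immediately, without waiting for compilation.
        return {w for w in self.words if w in text}


def slow_compile(words):
    """Stand-in for the expensive compilation step."""
    return re.compile("|".join(re.escape(w) for w in words))
```

Lookups work from the moment the matcher is created; installing the
compiled form later changes only which path answers them.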

Vsevolod Stakhov [Sat, 10 Jan 2026 21:18:21 +0000 (21:18 +0000)]

[Feature] Compile small hyperscan databases in memory without file caching

For small pattern sets (< 100 patterns), compile hyperscan databases
synchronously in memory without saving to file or Redis cache.
These databases are shared with workers via fork() COW semantics.

Large pattern sets (like TLD with 10000+ patterns) continue to use
async compilation via hs_helper with Redis caching.

This eliminates unnecessary .hs files in /var/lib/rspamd for small
databases while maintaining the async path for expensive compilations.

Vsevolod Stakhov [Sat, 10 Jan 2026 17:19:32 +0000 (17:19 +0000)]

[Fix] Ensure stable re_cache class hashes independent of other classes

Previously, the global regexp index `i` was included in per-class hashes,
which caused class B's hash to change when class A got new regexps
(because indices shift). This made Redis caching ineffective as databases
were constantly being recompiled.

Now the global index is only included in the global hash, not in per-class
hashes, ensuring each class hash depends only on its own regexps.
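
The effect can be reproduced with a small Python sketch. It is a
simplification under stated assumptions: class_hashes is a hypothetical
helper, SHA-256 stands in for whatever hash re_cache uses, and only the
pattern text and indices are hashed.

```python
import hashlib


def class_hashes(classes, include_global_index):
    """Hash each class; `classes` maps class name -> ordered list of regexps."""
    hashes = {}
    global_index = 0  # position of the regexp across all classes
    for name, regexps in classes.items():
        h = hashlib.sha256()
        for local_index, pattern in enumerate(regexps):
            h.update(pattern.encode() + b"\0")
            h.update(str(local_index).encode() + b"\0")
            if include_global_index:
                # Old behaviour: depends on how many regexps precede this one.
                h.update(str(global_index).encode() + b"\0")
            global_index += 1
        hashes[name] = h.hexdigest()
    return hashes
```

Adding a regexp to class A shifts the global indices of class B. With
include_global_index=True the hash of B changes even though its regexps did
not; with False it stays the same, so a cached database for B remains valid.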

Vsevolod Stakhov [Sat, 10 Jan 2026 15:58:40 +0000 (15:58 +0000)]

[Feature] Enhance hyperscan cache debug logging and correlation

- Add entity_name parameter to async cache API for better traceability
- Correlate cache requests with callbacks (show entity/key in both)
- Use rspamd_zhs prefix by default for compressed Redis data
- Switch to idiomatic lua_util.debugm for Lua debug logging
- Log Redis backend config (prefix, ttl, compression) on creation

Vsevolod Stakhov [Fri, 9 Jan 2026 19:38:37 +0000 (19:38 +0000)]

[Feature] Pluggable async hyperscan cache backend

Vsevolod Stakhov [Thu, 8 Jan 2026 22:40:38 +0000 (22:40 +0000)]

[Fix] Properly terminate hs_helper during shutdown

Add the RSPAMD_SRV_BUSY command so that hs_helper can notify the main
process when it is busy with a long-running hyperscan compilation. The main
process skips heartbeat checks while the worker is busy and logs the busy
reason during shutdown.

Key fixes:
- Prevent notifications from being sent after the worker receives a termination signal
- Propagate ev_break through rspamd_worker_set_busy to properly exit event loop
- Add shutdown monitor timer to log pending workers during termination
- Pass worker pointer to re_cache compile functions for termination checks

Vsevolod Stakhov [Thu, 8 Jan 2026 15:04:17 +0000 (15:04 +0000)]

[Conf] Add configuration support for hs_helper worker

Add worker-hs_helper.conf and worker-hs_helper.inc config files that are
only installed when hyperscan support is enabled. The main rspamd.conf
uses try=true to gracefully handle missing config on non-hyperscan builds.

Vsevolod Stakhov [Thu, 8 Jan 2026 14:02:47 +0000 (14:02 +0000)]

Merge branch 'master' into vstakhov-pluggable-hs-cache

Mirror of https://github.com/rspamd/rspamd.git