git.ipfire.org Git - thirdparty/rspamd.git/commit
[Feature] Add extract_text_limited for email text extraction with limits
author     Vsevolod Stakhov <vsevolod@rspamd.com>
           Sat, 17 Jan 2026 15:58:14 +0000 (15:58 +0000)
committer  Vsevolod Stakhov <vsevolod@rspamd.com>
           Sat, 17 Jan 2026 17:05:08 +0000 (17:05 +0000)
commit     f18ffb983cbe817986784ae225c8ea86d37c6d40
tree       8213a907e9912fa5f7042cfa7b0a83d4c076d10f
parent     6130f92078df729b65eb3051df5e522d60c21379

Add lua_mime.extract_text_limited() function to extract meaningful text from
emails with long reply chains while respecting size limits.

Features:
- max_bytes: Hard limit on output size (default: 32KB)
- max_words: Alternative limit by word count
- strip_quotes: Remove quoted replies (lines starting with >)
- strip_reply_headers: Remove reply headers (On X wrote:, From: Sent:)
- strip_signatures: Remove signature blocks (-- separator, mobile signatures)
- smart_trim: Enable all of the stripping heuristics above
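
To illustrate how the options above could interact, here is a minimal
Python sketch. It is not the Lua code from this commit: the function
signature, the return value, the single reply-header regex and the reading
of smart_trim as "turn on all three strip options" are assumptions made
for the example. It truncates at line boundaries, so the output never
exceeds max_bytes.

```python
import re

# Hypothetical reply-header pattern; the commit's real patterns are not shown here.
REPLY_HEADER = re.compile(r"^On .+ wrote:\s*$")
SIGNATURE_SEPARATORS = ("--", "-- ")


def extract_text_limited(text, max_bytes=32 * 1024, max_words=None,
                         strip_quotes=False, strip_reply_headers=False,
                         strip_signatures=False, smart_trim=False):
    """Return (extracted_text, truncated) for a plain-text body."""
    if smart_trim:
        # Assumption: smart_trim switches on all three stripping options.
        strip_quotes = strip_reply_headers = strip_signatures = True

    kept, size, words, truncated = [], 0, 0, False
    for line in text.splitlines():
        if strip_quotes and line.lstrip().startswith(">"):
            continue  # quoted reply line
        if strip_reply_headers and REPLY_HEADER.match(line):
            continue  # "On X wrote:" style header
        if strip_signatures and line in SIGNATURE_SEPARATORS:
            break  # treat the rest of the message as signature

        line_bytes = len(line.encode("utf-8")) + 1  # +1 for the newline
        line_words = len(line.split())
        over_bytes = size + line_bytes > max_bytes
        over_words = max_words is not None and words + line_words > max_words
        if over_bytes or over_words:
            truncated = True
            break  # stop at a line boundary once a limit would be exceeded
        kept.append(line)
        size += line_bytes
        words += line_words

    return "\n".join(kept).strip(), truncated
```

Calling extract_text_limited(body, smart_trim=True) on a reply that ends
with a quoted thread and a "-- " signature would keep only the newly
written text; passing max_words caps the result by word count in addition
to the byte limit.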

Implementation:
- Uses rspamd_text:lines() iterator for memory-efficient line processing
- No full string interning of email content (better for large emails)
- rspamd_trie for multi-pattern matching (67 signature, 44 reply patterns)
- rspamd_regexp for regex patterns (wrote:, schrieb:, etc.)
- Single-pass O(n) algorithm with early termination
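
The approach described above (lazy line iteration, multi-pattern matching,
one pass with early termination) can be sketched in Python as follows.
This is a simplified stand-in, not the rspamd implementation: a small
prefix trie replaces rspamd_trie, the two signature patterns are made up
for the example, and all function names are hypothetical.

```python
def iter_lines(text):
    """Yield lines one at a time instead of building a list of all lines."""
    start = 0
    while start < len(text):
        end = text.find("\n", start)
        if end == -1:
            end = len(text)
        yield text[start:end]
        start = end + 1


def build_trie(patterns):
    """Build a case-insensitive prefix trie; the None key marks a pattern end."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern.lower():
            node = node.setdefault(ch, {})
        node[None] = True
    return root


def starts_with_any(trie, line):
    """Return True if the line starts with any pattern stored in the trie."""
    node = trie
    for ch in line.lower():
        if None in node:
            return True  # a complete pattern has been consumed
        node = node.get(ch)
        if node is None:
            return False  # no pattern continues with this character
    return None in node


# Hypothetical patterns for illustration only.
SIGNATURE_TRIE = build_trie(["sent from my", "get outlook for"])


def body_before_signature(text):
    """Single pass over the lines; stop at the first signature marker."""
    kept = []
    for line in iter_lines(text):
        if starts_with_any(SIGNATURE_TRIE, line.strip()):
            break  # early termination: later lines are never scanned
        kept.append(line)
    return kept
```

Each line is checked against all patterns in one walk of the trie, so the
cost per line does not grow with the number of patterns, and the loop
stops as soon as a signature marker is seen.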

Multilingual support for 10+ languages:
- English, German, French, Spanish, Russian, Portuguese, Italian
- Chinese, Japanese, Polish

Configuration API:
- lua_mime.configure_text_extraction(cfg) for custom patterns
- Supports extend_defaults to add patterns without replacing defaults
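
One plausible reading of extend_defaults is sketched below in Python. The
key names (signature_patterns, reply_patterns), the default values and the
merge behaviour are assumptions for illustration; the actual config schema
accepted by lua_mime.configure_text_extraction() is not shown in this
commit message.

```python
# Hypothetical defaults; the real lists hold many more patterns.
DEFAULTS = {
    "signature_patterns": ["sent from my"],
    "reply_patterns": ["wrote:"],
}


def configure_text_extraction(cfg, defaults=DEFAULTS):
    """Merge user patterns with the defaults, or replace them outright."""
    extend = cfg.get("extend_defaults", False)
    result = {}
    for key, default_list in defaults.items():
        custom = cfg.get(key)
        if custom is None:
            result[key] = list(default_list)  # nothing supplied: keep defaults
        elif extend:
            # Append custom patterns that are not already in the defaults.
            extra = [p for p in custom if p not in default_list]
            result[key] = list(default_list) + extra
        else:
            result[key] = list(custom)  # replace the defaults entirely
    return result
```

With extend_defaults set, a site can add a local signature marker while
keeping the built-in ones; without it, the supplied list is used as is.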

CLI integration in rspamadm mime ex:
- -L/--limit, -Q/--strip-quotes, -S/--strip-signatures
- -R/--strip-reply-headers, -T/--smart-trim

Also updates llm_common.build_llm_input() to use the new function.
lualib/llm_common.lua
lualib/lua_mime.lua
lualib/rspamadm/mime.lua
test/lua/unit/lua_mime.extract_text_limited.lua [new file with mode: 0644]