git.ipfire.org Git - thirdparty/rspamd.git/commit
[Feature] Add text quality analysis for PDF garbage filtering
author Vsevolod Stakhov <vsevolod@rspamd.com>
Thu, 11 Dec 2025 18:11:36 +0000 (18:11 +0000)
committer Vsevolod Stakhov <vsevolod@rspamd.com>
Thu, 11 Dec 2025 18:11:36 +0000 (18:11 +0000)
commit b62818f14f5f6977b5bad6c79b52ab8315c4fea4
tree b1b58d156db49a9610e98b219a7ef1488ba63a66
parent 061e5cbb8fd66c594ad60e8d647dcf969a7bfa68

- Add rspamd_util.get_text_quality() function with comprehensive UTF-8
  text analysis using ICU for proper Unicode classification
- Returns 18 metrics: letters, digits, punctuation, spaces, printable,
  words, word_chars, total, emojis, uppercase, lowercase, ascii_chars,
  non_ascii_chars, latin_vowels, latin_consonants, script_transitions,
  double_spaces, non_printable
- Add confidence scoring to PDF text extraction to filter garbage tokens
  (single characters, encoded data, random sequences)
- Configurable via text_quality_threshold, text_quality_min_length,
  text_quality_enabled options in pdf module config
- Add unit tests for get_text_quality function
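
The idea behind the metrics and the confidence filter can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from this commit: the real function is implemented in C with ICU, while the sketch uses Python's `unicodedata` and covers only a subset of the 18 metrics. The confidence formula (share of non-space characters that are letters), the helper names `text_quality`, `confidence` and `keep_token`, the default values, and the reading of the minimum length as "shorter tokens are dropped" are all hypothetical.

```python
import unicodedata


def text_quality(text: str) -> dict:
    """Count a subset of the metrics named in the commit message.

    Hypothetical approximation: uses Unicode general categories from
    Python's unicodedata module rather than ICU.
    """
    names = ("letters", "digits", "punctuation", "spaces", "uppercase",
             "lowercase", "ascii_chars", "non_ascii_chars", "double_spaces",
             "non_printable", "words", "total")
    m = dict.fromkeys(names, 0)
    prev_space = False
    in_word = False
    for ch in text:
        cat = unicodedata.category(ch)  # e.g. "Lu", "Nd", "Po", "Zs", "Cc"
        m["total"] += 1
        m["ascii_chars" if ord(ch) < 128 else "non_ascii_chars"] += 1
        is_letter = cat.startswith("L")
        is_space = ch.isspace()
        if is_letter:
            m["letters"] += 1
            if cat == "Lu":
                m["uppercase"] += 1
            elif cat == "Ll":
                m["lowercase"] += 1
        elif cat == "Nd":
            m["digits"] += 1
        elif cat.startswith("P"):
            m["punctuation"] += 1
        elif is_space:
            m["spaces"] += 1
            if prev_space:
                # A space that immediately follows another space.
                m["double_spaces"] += 1
        elif cat.startswith("C"):
            m["non_printable"] += 1
        # A word is counted when a run of letters starts.
        if is_letter and not in_word:
            m["words"] += 1
        in_word = is_letter
        prev_space = is_space
    return m


def confidence(text: str) -> float:
    """Assumed score: fraction of non-space characters that are letters."""
    m = text_quality(text)
    visible = m["total"] - m["spaces"]
    return m["letters"] / visible if visible else 0.0


def keep_token(text: str, threshold: float = 0.5, min_length: int = 3,
               enabled: bool = True) -> bool:
    """Filter modelled on the three text_quality_* options.

    The defaults and the exact semantics are assumptions for illustration.
    """
    if not enabled:
        return True
    if len(text) < min_length:
        return False
    return confidence(text) >= threshold
```

With these assumptions, `keep_token("Hello world")` keeps the text, while a single character or a run of symbols such as `"%$#@!"` is rejected; passing `enabled=False` disables the filter entirely.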
lualib/lua_content/pdf.lua
src/lua/lua_util.c
test/lua/unit/rspamd_util.lua