[Feature] Add text quality analysis for PDF garbage filtering

- Add rspamd_util.get_text_quality() function with comprehensive UTF-8
text analysis using ICU for proper Unicode classification
- Returns 18 metrics: letters, digits, punctuation, spaces, printable,
words, word_chars, total, emojis, uppercase, lowercase, ascii_chars,
non_ascii_chars, latin_vowels, latin_consonants, script_transitions,
double_spaces, non_printable
- Add confidence scoring to PDF text extraction to filter garbage tokens
(single characters, encoded data, random sequences)
- Make the filtering configurable via the text_quality_threshold,
text_quality_min_length, and text_quality_enabled options in the pdf
module config
- Add unit tests for get_text_quality function
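
Illustration only, not the code in this commit: a minimal Python sketch of
how a few of the metrics above could feed a confidence score. It uses
Python's unicodedata module in place of ICU, counts only a subset of the 18
metrics, and the scoring formula, the default values, and the treatment of
short text are all assumptions. The names text_quality, confidence and
is_garbage are hypothetical; threshold and min_length only loosely mirror
the config options and may not match their actual semantics.

```python
import unicodedata


def text_quality(text: str) -> dict:
    """Count a subset of the metrics (hypothetical re-implementation)."""
    m = dict(letters=0, digits=0, punctuation=0, spaces=0, printable=0,
             words=0, total=0, double_spaces=0, non_printable=0)
    in_word = False
    prev_space = False
    for ch in text:
        m["total"] += 1
        cat = unicodedata.category(ch)
        is_space = ch.isspace()
        if is_space:
            m["spaces"] += 1
            if prev_space:
                # Each adjacent pair of whitespace characters counts once.
                m["double_spaces"] += 1
        if cat.startswith("C"):
            # Control, format, surrogate, private-use, unassigned.
            m["non_printable"] += 1
        else:
            m["printable"] += 1
        if cat.startswith("L"):
            m["letters"] += 1
            if not in_word:
                # A word is a maximal run of letters in this sketch.
                m["words"] += 1
                in_word = True
        else:
            in_word = False
            if cat == "Nd":
                m["digits"] += 1
            elif cat.startswith("P"):
                m["punctuation"] += 1
        prev_space = is_space
    return m


def confidence(text: str, min_length: int = 10) -> float:
    """Assumed formula: mean of three simple signals in [0, 1]."""
    m = text_quality(text)
    if m["total"] < min_length:
        # Assumption: text below min_length is treated as garbage.
        return 0.0
    printable_ratio = m["printable"] / m["total"]
    letter_ratio = m["letters"] / m["total"]
    avg_word_len = m["letters"] / m["words"] if m["words"] else 0.0
    word_score = 1.0 if 2 <= avg_word_len <= 15 else 0.0
    return (printable_ratio + letter_ratio + word_score) / 3


def is_garbage(text: str, threshold: float = 0.5, min_length: int = 10) -> bool:
    """Flag text whose confidence falls below the (assumed) threshold."""
    return confidence(text, min_length) < threshold
```

With these assumed defaults, ordinary prose such as "Hello  world 42!" passes,
while a single character or a run of control bytes is flagged as garbage.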