git.ipfire.org Git - thirdparty/rspamd.git/commit

author	Vsevolod Stakhov <vsevolod@rspamd.com>
	Sun, 23 Nov 2025 14:20:48 +0000 (14:20 +0000)
committer	Vsevolod Stakhov <vsevolod@rspamd.com>
	Sun, 23 Nov 2025 14:20:48 +0000 (14:20 +0000)
commit	9d575b165a026a39f8aecc8a361c1dfdb82107ef
tree	5442b9f109ed1ca48c8a26b980ebe4b36594cab0
parent	9e0b9d91d7d3408f24e2c5d031d5b67c17bbdced

[Feature] Implement basic PDF text extraction

- Enable text_extraction in default config
- Implement extract_text_data to collect text from Page objects
- Improve PDF grammar to handle text operators and spacing (TJ, Tj, ', ")
- Add logic for newline insertion based on Td/TD/Tm operators
- Add heuristic for space insertion based on negative kerning in TJ arrays
- Support common ligatures for StandardEncoding and MacRomanEncoding
- Support FlateDecode and ASCIIHexDecode filters
- Update rspamadm mime to support raw PDF extraction (-r flag) and better content type detection
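
The newline and space heuristics listed above can be illustrated with a small sketch. It is written in Python for brevity and is not the Lua code from this commit. The threshold value, the `(operator, operands)` tuple layout, and the names `join_tj` and `extract_text` are assumptions made for the example. Numbers inside a TJ array are expressed in thousandths of a text-space unit and are subtracted from the horizontal position, so a large negative number widens the gap between glyphs and probably marks a word boundary.

```python
# Hypothetical threshold in thousandths of a text-space unit (not taken from rspamd).
SPACE_THRESHOLD = -200


def join_tj(items, threshold=SPACE_THRESHOLD):
    """Join the strings of a TJ array, inserting a space at large negative kerning."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.append(item)
        elif item <= threshold and out and not out[-1].endswith(" "):
            # A strongly negative adjustment moves the next glyph far to the right.
            out.append(" ")
    return "".join(out)


def extract_text(ops):
    """Collect text from a list of (operator, operands) tuples (assumed layout)."""
    chunks = []
    last_tm_y = None

    def newline():
        # Avoid leading and doubled newlines.
        if chunks and not chunks[-1].endswith("\n"):
            chunks.append("\n")

    for op, args in ops:
        if op in ("Td", "TD"):
            # Operands are tx, ty; only a vertical move starts a new line.
            if args[1] != 0:
                newline()
        elif op == "Tm":
            # Operands are a b c d e f; f is the vertical position.
            if args[5] != last_tm_y:
                newline()
            last_tm_y = args[5]
        elif op == "Tj":
            chunks.append(args[0])
        elif op == "TJ":
            chunks.append(join_tj(args[0]))
        elif op == "'":
            # Move to the next line, then show the string.
            newline()
            chunks.append(args[0])
        elif op == '"':
            # Operands are word spacing, char spacing, string.
            newline()
            chunks.append(args[2])
    return "".join(chunks)
```

For instance, the TJ array `["Hel", -20, "lo", -300, "wor", 15, "ld"]` is joined as `Hello world`: only the `-300` adjustment crosses the assumed threshold. A real extractor would likely scale the threshold by font size and glyph widths.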

lualib/lua_content/pdf.lua
lualib/rspamadm/mime.lua
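
The two stream filters named in the commit message, FlateDecode and ASCIIHexDecode, can be sketched as follows. This is a Python illustration of what the filters do according to the PDF specification, not the Lua implementation in `pdf.lua`. The function names and the `FILTERS` table are hypothetical. Filters are applied in the order they appear in the stream's `/Filter` entry.

```python
import zlib


def ascii_hex_decode(data: bytes) -> bytes:
    """Decode ASCIIHexDecode data: hex digit pairs, whitespace ignored, '>' ends the data."""
    end = data.find(b">")
    if end != -1:
        data = data[:end]
    digits = b"".join(data.split())  # drop ASCII whitespace
    if len(digits) % 2:
        # An odd number of digits is treated as if a trailing 0 were present.
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


def flate_decode(data: bytes) -> bytes:
    """Decode FlateDecode data (zlib/deflate), without predictor support."""
    return zlib.decompress(data)


# Hypothetical dispatch table for this sketch.
FILTERS = {
    "FlateDecode": flate_decode,
    "ASCIIHexDecode": ascii_hex_decode,
}


def decode_stream(data: bytes, filters) -> bytes:
    """Apply each named filter in order; unknown filters raise ValueError."""
    for name in filters:
        if name not in FILTERS:
            raise ValueError("unsupported filter: " + name)
        data = FILTERS[name](data)
    return data
```

A stream declared with `/Filter [/ASCIIHexDecode /FlateDecode]` would be decoded as `decode_stream(raw, ["ASCIIHexDecode", "FlateDecode"])`. The sketch ignores `/DecodeParms`, so streams that use a predictor would need further handling.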