git.ipfire.org Git - thirdparty/dovecot/core.git/commit
lib-fts: Fix address tokenizer to handle large input properly
author Timo Sirainen <timo.sirainen@open-xchange.com>
Tue, 26 Oct 2021 13:59:29 +0000 (16:59 +0300)
committer aki.tuomi <aki.tuomi@open-xchange.com>
Mon, 8 Nov 2021 10:31:23 +0000 (10:31 +0000)
commit e18843502604d9f4317000923a7493e8f6c8b132
tree aafe5fc395181c5c08a67d9cbc55d628f9d354c7
parent 2af8437d1d19f1fba76a835c05878f19d64e9b72

Previously it could have used excessive amounts of memory if the input
didn't contain separator characters.

The fix slightly changes how the address tokenizer works: previously, long
email addresses were saved as truncated tokens. Now the address tokenizer
skips them entirely. Similarly, when searching for long email addresses,
they're no longer searched as truncated tokens, but are instead simply fed
to the parent tokenizer, which (likely) searches them in smaller pieces.
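
As a rough illustration of the "skip instead of truncate" behavior described
above, here is a minimal standalone sketch. It is not the code from
fts-tokenizer-address.c: the names extract_addresses(), is_separator() and
MAX_ADDR_LEN, the limit of 32 bytes and the separator set are all assumptions
made for this example. The point is only that the candidate buffer never grows
past a fixed size, and that a run exceeding the limit is dropped as a whole
rather than emitted as a truncated token.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical limit for this sketch; not Dovecot's actual value. */
#define MAX_ADDR_LEN 32

/* Assumed separator set, chosen only for illustration. */
static bool is_separator(char c)
{
	return c == ' ' || c == '\t' || c == '\n' ||
	       c == ',' || c == '<' || c == '>';
}

/* Scan input and copy every address-like run (contains '@' and is at most
   MAX_ADDR_LEN bytes) into out[]. Runs longer than the limit are skipped
   entirely. Memory use is bounded by the fixed-size buf no matter how long
   a separator-free input is. Returns the number of addresses stored. */
size_t extract_addresses(const char *input,
			 char out[][MAX_ADDR_LEN + 1], size_t max_out)
{
	char buf[MAX_ADDR_LEN];
	size_t len = 0, count = 0;
	bool overflow = false;

	assert(input != NULL);
	for (const char *p = input; ; p++) {
		if (*p != '\0' && !is_separator(*p)) {
			if (overflow)
				continue; /* still inside an oversized run */
			if (len == MAX_ADDR_LEN) {
				/* too long: drop the run, don't truncate it */
				overflow = true;
				len = 0;
				continue;
			}
			buf[len++] = *p;
			continue;
		}
		/* separator or end of input finishes the current run */
		if (!overflow && len > 0 && count < max_out &&
		    memchr(buf, '@', len) != NULL) {
			memcpy(out[count], buf, len);
			out[count][len] = '\0';
			count++;
		}
		len = 0;
		overflow = false;
		if (*p == '\0')
			break;
	}
	return count;
}
```

In this sketch a caller would pass the raw text plus an output array, e.g.
extract_addresses("foo@example.com bar", out, 4), and would hand any skipped
oversized run to a separate (parent) tokenizer; that hand-off is left out here.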

Note that this also sometimes changes the order in which tokens are
returned, e.g. "foo", "example", "foo@example.com", "com" instead of
returning "com" before the email address. This isn't ideal, but fixing it
seems annoyingly complicated and practically it doesn't matter right now.
src/lib-fts/fts-tokenizer-address.c
src/lib-fts/test-fts-tokenizer.c