From: Douglas Bagnall
Date: Wed, 5 Jul 2023 01:26:12 +0000 (+1200)
Subject: libutil/iconv: don't allow wtf-8 surrogate pairs
X-Git-Tag: talloc-2.4.2~1015
X-Git-Url: http://git.ipfire.org/gitweb.cgi?a=commitdiff_plain;h=949fe5707774fdc655b8430b0de805aa21004622;p=thirdparty%2Fsamba.git

libutil/iconv: don't allow wtf-8 surrogate pairs

At present, if we meet a string like "hello \xed\xa7\x96 world", the
bytes in the middle will be converted into half of a surrogate pair,
and the UTF-16 will be invalid. It is better to error out immediately,
because the UTF-8 string is already invalid.

https://learn.microsoft.com/en-us/windows/win32/api/Stringapiset/nf-stringapiset-widechartomultibyte#remarks
is a citation for the statement about this being a pre-Vista problem.

Signed-off-by: Douglas Bagnall
Reviewed-by: Andrew Bartlett
---

diff --git a/lib/util/charset/iconv.c b/lib/util/charset/iconv.c
index 30e705ee119..952b9e7911b 100644
--- a/lib/util/charset/iconv.c
+++ b/lib/util/charset/iconv.c
@@ -861,6 +861,39 @@ static size_t utf8_pull(void *cd, const char **inbuf, size_t *inbytesleft,
 				errno = EILSEQ;
 				goto error;
 			}
+			if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
+				/*
+				 * This is an invalid codepoint, per
+				 * RFC3629, as it encodes part of a
+				 * UTF-16 surrogate pair for a
+				 * character over U+10000, which ought
+				 * to have been encoded as a four byte
+				 * utf-8 sequence.
+				 *
+				 * Prior to Vista, Windows might
+				 * sometimes produce invalid strings
+				 * where a utf-16 sequence containing
+				 * surrogate pairs was converted
+				 * "verbatim" into utf-8, instead of
+				 * encoding the actual codepoint. This
+				 * format is sometimes called "WTF-8".
+				 *
+				 * If we were to support that, we'd
+				 * have a branch here for the case
+				 * where the codepoint is between
+				 * 0xd800 and 0xdbff (a "high
+				 * surrogate"), and read a *six*
+				 * character sequence from there which
+				 * would include a low surrogate. But
+				 * that would undermine the
+				 * hard-learnt principle that each
+				 * character should only have one
+				 * encoding.
+				 */
+				errno = EILSEQ;
+				goto error;
+			}
+
 			uc[0] = codepoint & 0xff;
 			uc[1] = codepoint >> 8;
 			c += 3;
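
For illustration only, the check added in this hunk can be reproduced in isolation. The sketch below is not Samba code: decode_utf8_3byte() is a hypothetical helper, its return convention (0 or -1 with errno set) is an assumption, and the overlong test (codepoint < 0x800) is assumed to be what precedes the new branch in utf8_pull(). It decodes a single three-byte sequence and rejects the surrogate range the same way the patch does.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Samba): decode one three-byte UTF-8
 * sequence from `in` into `*out`.
 *
 * Returns 0 on success. Returns -1 with errno set to EINVAL if fewer
 * than three bytes are available, or EILSEQ if the bytes are not a
 * valid three-byte sequence.
 */
int decode_utf8_3byte(const uint8_t *in, size_t in_left, uint32_t *out)
{
	uint32_t codepoint;

	assert(in != NULL && out != NULL);

	if (in_left < 3) {
		errno = EINVAL;
		return -1;
	}

	/* Lead byte 1110xxxx, then two continuation bytes 10xxxxxx. */
	if ((in[0] & 0xf0) != 0xe0 ||
	    (in[1] & 0xc0) != 0x80 ||
	    (in[2] & 0xc0) != 0x80) {
		errno = EILSEQ;
		return -1;
	}

	codepoint = ((uint32_t)(in[0] & 0x0f) << 12) |
		    ((uint32_t)(in[1] & 0x3f) << 6) |
		    (uint32_t)(in[2] & 0x3f);

	/* Assumed pre-existing check: reject overlong encodings. */
	if (codepoint < 0x800) {
		errno = EILSEQ;
		return -1;
	}

	/* The check from this patch: surrogates are not valid in UTF-8. */
	if (codepoint >= 0xd800 && codepoint <= 0xdfff) {
		errno = EILSEQ;
		return -1;
	}

	*out = codepoint;
	return 0;
}
```

With the bytes from the commit message, \xed\xa7\x96, the three payload fields are 0xd, 0x27 and 0x16, which combine to 0xd9d6. That lies between 0xd800 and 0xdbff, so it is a high surrogate and the helper returns -1 with errno set to EILSEQ. The neighbouring values U+D7FF (\xed\x9f\xbf) and U+E000 (\xee\x80\x80) are still accepted.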