commit: handle large commit messages in utf8 verification

Running t4205 under UBSan with the EXPENSIVE prereq enabled triggers an
error when we try to create a commit message that is over 2GB:

  commit.c:1574:6: runtime error: signed integer overflow:
  -2147483648 - 1 cannot be represented in type 'int'

The problem is that find_invalid_utf8() is not prepared to handle
large buffers, as it uses an "int" to represent buffer sizes and
offsets.
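
To make the size problem concrete: on platforms with a 32-bit "int",
INT_MAX is 2147483647, so a buffer of 2147483648 bytes (2GB) already
has a length that an "int" cannot hold. A minimal sketch of that
check, where fits_in_int() is a hypothetical helper used only for
illustration and is not part of git:

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: report whether a
 * buffer length could be passed through an "int" parameter without
 * losing its value.
 */
bool fits_in_int(size_t len)
{
	return len <= (size_t)INT_MAX;
}
```

Any length for which this returns false cannot be represented once it
is squeezed into an "int" size or offset.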

We can fix this with a few changes:

  1. We'll take in "len" as a size_t (which is what the caller has
     anyway, since it's working with a strbuf).

  2. We need to return a size_t to give the offset to the invalid utf8,
     but we also need a sentinel value for "no invalid value"
     (previously "-1"). Let's split these to return a bool for "found
     invalid utf8" and then pass back the offset as an out-parameter.
     We'll switch the function name to match the new semantics.

  3. The caller in verify_utf8() uses a "long" to store buffer
     positions, which is a bit funny. This goes back to 08a94a145c
     (commit/commit-tree: correct latin1 to utf-8, 2012-06-28) and is
     perhaps trying to match our use of "unsigned long" for object sizes
     (though we don't care about it ever becoming negative here). This
     should be a size_t, too, as some platforms (like Windows) still use
     a 32-bit long on machines with 64-bit pointers.

  4. The "bytes" variable within find_invalid_utf8() does not have range
     problems. It is the number of bytes the utf8 sequence claims to
     have, so is limited by how many bits can be set in a single 8-bit
     byte. However, if we leave it as an "int" then the compiler will
     complain about the sign mismatch when comparing it to "len". So
     let's make it unsigned, too.
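
The interface shape described in the list above can be sketched
roughly as follows. This is a simplified illustration and not the
code in commit.c: the names sketch_sequence_len() and
sketch_has_invalid_utf8() are hypothetical, and the validity rules
are reduced to the basics (well-formed sequences, no overlong forms,
no surrogates, nothing above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length in bytes of the valid UTF-8
 * sequence at the start of s, or 0 if it is invalid or truncated.
 * The caller guarantees remaining >= 1.
 */
size_t sketch_sequence_len(const unsigned char *s, size_t remaining)
{
	unsigned int bytes; /* continuation bytes the lead byte claims */
	unsigned int codepoint, min_val, i;

	if (s[0] < 0x80)
		return 1; /* plain ASCII */

	if ((s[0] & 0xe0) == 0xc0) {
		bytes = 1; codepoint = s[0] & 0x1f; min_val = 0x80;
	} else if ((s[0] & 0xf0) == 0xe0) {
		bytes = 2; codepoint = s[0] & 0x0f; min_val = 0x800;
	} else if ((s[0] & 0xf8) == 0xf0) {
		bytes = 3; codepoint = s[0] & 0x07; min_val = 0x10000;
	} else {
		return 0; /* stray continuation byte or invalid lead byte */
	}

	/* Both sides are unsigned, so there is no sign mismatch here. */
	if (bytes >= remaining)
		return 0;

	for (i = 1; i <= bytes; i++) {
		if ((s[i] & 0xc0) != 0x80)
			return 0;
		codepoint = (codepoint << 6) | (s[i] & 0x3f);
	}

	/* Reject overlong forms, surrogates, and out-of-range values. */
	if (codepoint < min_val || codepoint > 0x10ffff ||
	    (codepoint >= 0xd800 && codepoint <= 0xdfff))
		return 0;

	return (size_t)bytes + 1;
}

/*
 * Return true if buf contains invalid UTF-8 and store the offset of
 * the offending sequence in *bad_offset. When everything is valid,
 * return false and leave *bad_offset untouched, so no sentinel value
 * is needed.
 */
bool sketch_has_invalid_utf8(const char *buf, size_t len,
			     size_t *bad_offset)
{
	const unsigned char *s = (const unsigned char *)buf;
	size_t offset = 0;

	while (offset < len) {
		size_t n = sketch_sequence_len(s + offset, len - offset);
		if (!n) {
			*bad_offset = offset;
			return true;
		}
		offset += n;
	}
	return false;
}
```

Splitting "found something" from "where" means the whole size_t range
stays available for offsets, and a caller would only read the
out-parameter after a true return.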
All of this is a little silly, of course, because 2GB text commit
messages are clearly nonsense. So we might consider rejecting them
outright, but it is easy enough to make these helper functions more
robust in the meantime.

Signed-off-by: Jeff King <peff@peff.net>
Signed-off-by: Junio C Hamano <gitster@pobox.com>