Now you can optionally get back an enum value indicating whether the
character was successfully decoded, or whether U+FFFD was substituted
due to some kind of problem, and if the latter, what problem.
For a start, this allows distinguishing 'real' U+FFFD (encoded
legitimately in the input) from one invented by the decoder. Also, it
allows the recipient of the decode to treat failures differently,
either by passing on a useful error report to the user (as
utf8_unknown_char now does) or by doing something special.
In particular, there are two distinct error codes for a truncated
UTF-8 encoding, depending on whether it was truncated by the end of
the input or by encountering a non-continuation byte. The former code
means that the string is not legal UTF-8 _as it is_, but doesn't rule
out it being a (bytewise) prefix of a legal UTF-8 string - so if a
client is receiving UTF-8 data a byte at a time, they can treat that
error code specially and not make it a fatal error.
This enables it to handle data that isn't presented as a
NUL-terminated string.
In particular, the NUL byte can appear _within_ the string and be
correctly translated to the NUL wide character. So I've been able to
remove the awkwardness in the test rig of having to include the
terminating NUL in every test to ensure NUL has been tested, and
instead, insert a single explicit test for it.
Similarly to the previous commit, the simplification at the (one) call
site gives me a strong feeling of 'this is what the API should have
been all along'!
The test in question was supposed to contain the spurious UTF-8
encoding that 0xD800 would have if it were not a surrogate. But the
final continuation character 0x80 was instead 0x00.
The test passed anyway, because ED A0 was regarded as a truncated
sequence, instead of ED A0 80 being regarded as an illegal encoding of
a surrogate, and both return the same output!