decode_utf8: add an enumeration of failure reasons.

Now you can optionally get back an enum value indicating whether the character was successfully decoded, or whether U+FFFD was substituted due to some kind of problem, and if the latter, what problem.

For a start, this allows distinguishing 'real' U+FFFD (encoded legitimately in the input) from one invented by the decoder. Also, it allows the recipient of the decode to treat failures differently, either by passing on a useful error report to the user (as utf8_unknown_char now does) or by doing something special.

In particular, there are two distinct error codes for a truncated UTF-8 encoding, depending on whether it was truncated by the end of the input or by encountering a non-continuation byte. The former code means that the string is not legal UTF-8 _as it is_, but doesn't rule out it being a (bytewise) prefix of a legal UTF-8 string - so if a client is receiving UTF-8 data a byte at a time, they can treat that error code specially and not make it a fatal error.
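
As a rough illustration of that last point, here is a minimal standalone sketch in C. It is not the code from this commit: the names SketchFailure, SK_*, sketch_decode, sketch_scan and ChunkStatus are all hypothetical, the decoder works on a plain buffer and position rather than a BinarySource, and it deliberately skips the overlong, surrogate and range checks a full decoder would need. It only shows how a caller might tell "ran out of input" apart from "interrupted by a non-continuation byte", and how a legitimately encoded U+FFFD differs from a substituted one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; not the identifiers from the commit. */
typedef enum {
    SK_SUCCESS,      /* a character was decoded normally */
    SK_OUT_OF_DATA,  /* input ended in the middle of a multibyte sequence */
    SK_TRUNCATED,    /* sequence interrupted by a non-continuation byte */
    SK_OTHER_ERROR   /* stray continuation byte or illegal lead byte */
} SketchFailure;

/*
 * Decode one character starting at buf[*pos], advancing *pos.
 * On failure, returns U+FFFD and reports the reason through *err.
 * Simplified: no checks for overlong forms, surrogates or code point range.
 */
static unsigned sketch_decode(const unsigned char *buf, size_t len,
                              size_t *pos, SketchFailure *err)
{
    assert(*pos < len);
    unsigned char lead = buf[(*pos)++];
    unsigned c;
    int ncont;

    *err = SK_SUCCESS;
    if (lead < 0x80) return lead;
    else if (lead >= 0xC2 && lead <= 0xDF) { c = lead & 0x1F; ncont = 1; }
    else if (lead >= 0xE0 && lead <= 0xEF) { c = lead & 0x0F; ncont = 2; }
    else if (lead >= 0xF0 && lead <= 0xF4) { c = lead & 0x07; ncont = 3; }
    else { *err = SK_OTHER_ERROR; return 0xFFFD; }

    while (ncont-- > 0) {
        /* Truncated by the end of the input: may still be a legal prefix. */
        if (*pos >= len) { *err = SK_OUT_OF_DATA; return 0xFFFD; }
        /* Truncated by a non-continuation byte: leave it for the next call. */
        if ((buf[*pos] & 0xC0) != 0x80) { *err = SK_TRUNCATED; return 0xFFFD; }
        c = (c << 6) | (buf[(*pos)++] & 0x3F);
    }
    return c;
}

typedef enum { CHUNK_OK, CHUNK_NEED_MORE, CHUNK_BAD } ChunkStatus;

/*
 * Scan the bytes received so far. *consumed is set to the number of bytes
 * that formed complete characters; *nchars counts those characters.
 */
static ChunkStatus sketch_scan(const unsigned char *buf, size_t len,
                               size_t *consumed, size_t *nchars)
{
    size_t pos = 0;
    *consumed = 0;
    *nchars = 0;
    while (pos < len) {
        SketchFailure err;
        (void)sketch_decode(buf, len, &pos, &err);
        if (err == SK_OUT_OF_DATA)
            return CHUNK_NEED_MORE;  /* not fatal: wait for more bytes */
        if (err != SK_SUCCESS)
            return CHUNK_BAD;        /* cannot be fixed by more input */
        *consumed = pos;
        (*nchars)++;
    }
    return CHUNK_OK;
}
```

A caller reading from a stream could keep the unconsumed tail of the buffer when sketch_scan returns CHUNK_NEED_MORE and retry once more bytes arrive, while treating CHUNK_BAD as a real error. Note also that the bytes EF BF BD decode to U+FFFD with SK_SUCCESS, which is how a 'real' U+FFFD is told apart from a substituted one.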
2025-07-18 19:41:01 -05:00 · 2023-02-17 16:39:09 +00:00
parent 9d308b39da
commit 9e01de7c2b
6 changed files with 147 additions and 45 deletions
--- a/utils/unicode-known.c
+++ b/utils/unicode-known.c
@@ -40,7 +40,13 @@ char *utf8_unknown_char(ptrlen input)
    BinarySource_BARE_INIT_PL(src, input);

    for (size_t nchars = 0; get_avail(src); nchars++) {
-        unsigned c = decode_utf8(src);
+        DecodeUTF8Failure err;
+        unsigned c = decode_utf8(src, &err);
+        if (err != DUTF8_SUCCESS)
+            return dupprintf(
+                "cannot normalise this string: UTF-8 decoding error "
+                "at character position %"SIZEu", byte position %"SIZEu": %s",
+                nchars, src->pos, decode_utf8_error_strings[err]);
        if (!known(c))
            return dupprintf(
                "cannot stably normalise this string: code point %04X "