decode_utf8: add an enumeration of failure reasons.

mirror of https://git.tartarus.org/simon/putty.git synced 2025-07-18 19:41:01 -05:00

Now you can optionally get back an enum value indicating whether the
character was successfully decoded, or whether U+FFFD was substituted
due to some kind of problem, and if the latter, what problem.

For a start, this allows distinguishing 'real' U+FFFD (encoded
legitimately in the input) from one invented by the decoder. Also, it
allows the recipient of the decode to treat failures differently,
either by passing on a useful error report to the user (as
utf8_unknown_char now does) or by doing something special.

In particular, there are two distinct error codes for a truncated
UTF-8 encoding, depending on whether it was truncated by the end of
the input or by encountering a non-continuation byte. The former code
means that the string is not legal UTF-8 _as it is_, but doesn't rule
out it being a (bytewise) prefix of a legal UTF-8 string - so if a
client is receiving UTF-8 data a byte at a time, they can treat that
error code specially and not make it a fatal error.
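
As a rough illustration of the streaming case described above, here is a minimal, self-contained sketch. It is not the code from this commit: the enum `utf8_status`, its member names, and the helpers `decode_one` and `decode_available` are hypothetical stand-ins, and the decoder is a simplified one that lumps every other failure into a single code. It shows a caller stopping at an "out of data" result to wait for more bytes, while counting a sequence interrupted by a non-continuation byte as a real error.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes for illustration; not the commit's identifiers. */
typedef enum {
    UTF8_OK,              /* character decoded successfully */
    UTF8_ERR_OUT_OF_DATA, /* sequence cut short by the end of the input */
    UTF8_ERR_TRUNCATED,   /* sequence cut short by a non-continuation byte */
    UTF8_ERR_OTHER        /* anything else: bad lead byte, overlong, etc. */
} utf8_status;

/*
 * Decode one character starting at buf[*pos], advancing *pos.
 * Requires *pos < len. Returns the code point, or U+FFFD on failure;
 * *st reports whether the U+FFFD (if any) was invented by the decoder.
 */
uint32_t decode_one(const unsigned char *buf, size_t len,
                    size_t *pos, utf8_status *st)
{
    unsigned char lead = buf[(*pos)++];
    uint32_t c, min;
    int ncont;

    if (lead < 0x80) {
        *st = UTF8_OK;
        return lead;
    } else if (lead >= 0xC2 && lead < 0xE0) {
        c = lead & 0x1F; ncont = 1; min = 0x80;
    } else if (lead >= 0xE0 && lead < 0xF0) {
        c = lead & 0x0F; ncont = 2; min = 0x800;
    } else if (lead >= 0xF0 && lead < 0xF5) {
        c = lead & 0x07; ncont = 3; min = 0x10000;
    } else {
        /* stray continuation byte or a lead byte that is never legal */
        *st = UTF8_ERR_OTHER;
        return 0xFFFD;
    }

    while (ncont-- > 0) {
        if (*pos >= len) {
            *st = UTF8_ERR_OUT_OF_DATA;
            return 0xFFFD;
        }
        if ((buf[*pos] & 0xC0) != 0x80) {
            /* leave the offending byte to be decoded on the next call */
            *st = UTF8_ERR_TRUNCATED;
            return 0xFFFD;
        }
        c = (c << 6) | (buf[(*pos)++] & 0x3F);
    }

    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c < 0xE000)) {
        *st = UTF8_ERR_OTHER;
        return 0xFFFD;
    }
    *st = UTF8_OK;
    return c;
}

/*
 * Decode as many whole characters as possible from buf. Returns the
 * number of bytes consumed. An incomplete sequence at the very end is
 * left unconsumed, so the caller can retry once more bytes arrive.
 */
size_t decode_available(const unsigned char *buf, size_t len,
                        size_t *nchars, size_t *nerrors)
{
    size_t pos = 0;
    *nchars = *nerrors = 0;
    while (pos < len) {
        size_t start = pos;
        utf8_status st;
        (void)decode_one(buf, len, &pos, &st);
        if (st == UTF8_ERR_OUT_OF_DATA)
            return start; /* may be a prefix of legal UTF-8: not fatal */
        if (st == UTF8_OK)
            (*nchars)++;
        else
            (*nerrors)++;
    }
    return pos;
}
```

In this sketch, a caller receiving data incrementally would keep the unconsumed tail of its buffer, append newly arrived bytes, and call `decode_available` again. A legitimately encoded U+FFFD comes back with `UTF8_OK`, so it can be told apart from a substituted one.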


Simon Tatham

2023-02-17 16:39:09 +00:00

parent 9d308b39da

commit 9e01de7c2b

6 changed files with 147 additions and 45 deletions

									
utils/unicode-norm.c (2 changes)
@@ -295,7 +295,7 @@ strbuf *utf8_to_nfc(ptrlen input)
     ucharbuf *inbuf = ucharbuf_new();
     while (get_avail(src))
-        ucharbuf_append(inbuf, decode_utf8(src));
+        ucharbuf_append(inbuf, decode_utf8(src, NULL));
     ucharbuf *outbuf = nfc(inbuf);
