putty-source

mirror of https://git.tartarus.org/simon/putty.git synced 2025-01-09 09:27:59 +00:00

Author	SHA1	Message	Date
Simon Tatham	4f756d2a4d	Rework Unicode conversion APIs to use a BinarySink. The previous mb_to_wc and wc_to_mb had horrible and also buggy APIs. This commit introduces a fresh pair of functions to replace them, which generate output by writing to a BinarySink. So it's now up to the caller to decide whether it wants the output written to a fixed-size buffer with overflow checking (via buffer_sink), or dynamically allocated, or even written directly to some other output channel. Nothing uses the new functions yet. I plan to migrate things over in upcoming commits. What was wrong with the old APIs: they had that awkward undocumented Windows-specific 'flags' parameter that I described in the previous commit and took out of the dup_X_to_Y wrappers. But much worse, the semantics for buffer overflow were not just undocumented but actually inconsistent. dup_wc_to_mb() in utils assumed that the underlying wc_to_mb would fill the buffer nearly full and return the size of data it wrote. In fact, this was untrue in the case where wc_to_mb called WideCharToMultiByte: that returns straight-up failure, setting the Windows error code to ERROR_INSUFFICIENT_BUFFER. It _does_ partially fill the output buffer, but doesn't tell you how much it wrote! What's wrong with the new API: it's a bit awkward to write a sequence of wchar_t in native byte order to a byte-oriented BinarySink, so people using put_mb_to_wc directly have to do some annoying pointer casting. But I think that's less horrible than the previous APIs. Another change: in the new API for wc_to_mb, defchr can be "", but not NULL.	2024-09-26 11:30:07 +01:00
Simon Tatham	a76109c586	Add some missing casts in ctype functions. I thought I'd found all of these before, but perhaps a few managed to slip in since I last looked. The character argument to the <ctype.h> functions must have the value of an unsigned char or EOF; passing an ordinary char (unless you know char is unsigned on every platform the code will ever go near) risks mistaking '\xFF' for EOF, and causing outright undefined behaviour on byte values in the range 80-FE. Never do it.	2023-03-05 13:15:57 +00:00
Simon Tatham	edce3fb9da	Add platform-independent Unicode setup function. Similarly to the one I just added for FontSpec: in a cross-platform main source file, you don't really want to mess about with per-platform ifdefs just to initialise a 'struct unicode_data' from a Conf. But until now, you had to, because init_ucs had a different prototype on Windows and Unix. I plan to use this in future test programs. But an immediate positive effect is that it removes the only platform-dependent call from fuzzterm.c. So now that could be built on Windows too, given only an appropriate cmake stanza. (Not that I have much idea if it's useful to fuzz the terminal separately on multiple platforms, but it's nice to know that it's possible if anyone does need to.)	2023-02-18 14:10:27 +00:00
Simon Tatham	9e01de7c2b	decode_utf8: add an enumeration of failure reasons. Now you can optionally get back an enum value indicating whether the character was successfully decoded, or whether U+FFFD was substituted due to some kind of problem, and if the latter, what problem. For a start, this allows distinguishing 'real' U+FFFD (encoded legitimately in the input) from one invented by the decoder. Also, it allows the recipient of the decode to treat failures differently, either by passing on a useful error report to the user (as utf8_unknown_char now does) or by doing something special. In particular, there are two distinct error codes for a truncated UTF-8 encoding, depending on whether it was truncated by the end of the input or by encountering a non-continuation byte. The former code means that the string is not legal UTF-8 _as it is_, but doesn't rule out it being a (bytewise) prefix of a legal UTF-8 string - so if a client is receiving UTF-8 data a byte at a time, they can treat that error code specially and not make it a fatal error.	2023-02-17 17:16:54 +00:00
Simon Tatham	d509a2dc1e	Formatting: normalise to put a space after condition keywords. 'if (thing)' is the local style here, not 'if(thing)'. Similarly with 'for' and 'while'.	2022-12-28 15:32:24 +00:00
Simon Tatham	69e217d23a	Make decode_utf8() read from a BinarySource. This enables it to handle data that isn't presented as a NUL-terminated string. In particular, the NUL byte can appear _within_ the string and be correctly translated to the NUL wide character. So I've been able to remove the awkwardness in the test rig of having to include the terminating NUL in every test to ensure NUL has been tested, and instead, insert a single explicit test for it. Similarly to the previous commit, the simplification at the (one) call site gives me a strong feeling of 'this is what the API should have been all along'!	2022-11-09 19:21:02 +00:00
Simon Tatham	834b58e39b	Make encode_utf8() output to a BinarySink. Previously it output to an ordinary char buffer, and returned the number of bytes it had written. But three out of the four call sites immediately chucked the resulting bytes into a BinarySink anyway. The fourth, in windows/unicode.c, really is writing into successive locations of a fixed-size buffer - but we can make that into a BinarySink too, using the buffer_sink added in the previous commit. So now encode_utf8() is renamed put_utf8_char, and the call sites all look simpler than they started out.	2022-11-09 19:02:32 +00:00
Simon Tatham	3442fb1aeb	windows/unicode.c: tighten up a bounds check. Coverity points out that if we refer to cp_list[codepage - 65536], we ought to have ensured that codepage - 65536 was _less_ than lenof(cp_list), not just less or equal.	2022-09-07 14:47:54 +01:00
Simon Tatham	9a84a89c32	Add a batch of missing 'static's.	2022-09-03 12:02:48 +01:00
Simon Tatham	9cac27946a	Formatting: miscellaneous. This patch fixes a few other whitespace and formatting issues which were pointed out by the bulk-reindent or which I spotted in passing, some involving manual editing to break lines more nicely. I think the weirdest hunk in here is the one in windows/window.c TranslateKey() where _half_ of an assignment statement inside an 'if' was on the same line as the trailing paren of the if condition. No idea at all how that one managed to happen!	2022-08-03 20:48:46 +01:00
Simon Tatham	4b8dc56284	Formatting: remove spurious spaces in 'type * var'. I think a lot of these were inserted by a prior run through GNU indent many years ago. I noticed in a more recent experiment that that tool doesn't always correctly distinguish which instances of 'id * id' are pointer variable declarations and which are multiplications, so it spaces some of the former as if they were the latter.	2022-08-03 20:48:46 +01:00
Simon Tatham	5a28658a6d	Remove uni_tbl from struct unicode_data. Instead of maintaining a single sparse table mapping Unicode to the currently selected code page, we now maintain a collection of such tables mapping Unicode to any code page we've so far found a need to work with, and we add code pages to that list as necessary, and never throw them away (since there are a limited number of them). This means that the wc_to_mb family of functions are effectively stateless: they no longer depend on a 'struct unicode_data' corresponding to the current terminal settings. So I've removed that parameter from all of them. This fills in the missing piece of yesterday's commit `a216d86106`: now wc_to_mb too should be able to handle internally-implemented character sets, by hastily making their reverse mapping table if it doesn't already have it. (That was only a _latent_ bug, because the only use of wc_to_mb in the cross-platform or Windows code _did_ want to convert to the currently selected code page, so the old strategy worked in that case. But there was no protection against an unworkable use of it being added later.)	2022-06-01 09:28:25 +01:00
Simon Tatham	8a907510dd	decode_codepage(): add missing const in prototype.	2022-06-01 08:29:29 +01:00
Simon Tatham	a216d86106	Windows mb_to_wc: support internal SBCSes. A user points out that the new charset-aware window title setting doesn't work if the configured character set is one of the entries in cp_list[] based on a hard-coded Unicode translation table, such as the ISO 8859 family. That's because the Windows mb_to_wc() function assumes that the code page it's given will always be OK to pass to the Windows API function MultiByteToWideChar, forgetting that those internally implemented single-byte character sets are not. This commit adds a manual implementation of SBCS -> Unicode based on those tables, which restores the ability to set a window title specified in ISO 8859. However, it's not a full fix to windows/unicode.c in general, because wc_to_mb has a similar blind spot: it's only prepared to convert Unicode to an internally implemented SBCS if that SBCS happens to be the one currently set in ucsdata->line_codepage, because that's when we've already prepared the reverse lookup table. Probably we ought to sort that out, and arrange that it can make the reverse lookup table if suddenly called on to do a different conversion. But that needs more refactoring, so I haven't done it in this commit.	2022-05-31 13:13:57 +01:00
Simon Tatham	f23a84cf7c	windows/unicode.c: manually speak UTF-8. This is another fallback needed on Win95, where the Win32 API functions to convert between multibyte and wide strings exist, but they haven't heard of the UTF-8 code page. PuTTY can't really do without that these days. (In particular, if a server sends a remote window-title escape sequence while the terminal is in UTF-8 mode, then _something_ needs to translate the UTF-8 data into Unicode for Windows to reconvert into the character set used in window titles.) This is a weird enough thing to be doing that I've put it under the new #ifdef LEGACY_WINDOWS, so behaviour in the standard builds should be unchanged.	2022-03-12 21:05:07 +00:00
Simon Tatham	f39c51f9a7	Rename most of the platform source files. This gets rid of all those annoying 'win', 'ux' and 'gtk' prefixes which made filenames annoying to type and to tab-complete. Also, as with my other recent renaming sprees, I've taken the opportunity to expand and clarify some of the names so that they're not such cryptic abbreviations.	2021-04-26 18:00:01 +01:00
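
Commit a76109c586 above explains why the argument to a <ctype.h> function must be cast to unsigned char. A minimal sketch of that pattern follows. The helper name count_leading_space is hypothetical and does not appear in the PuTTY source; it only illustrates the cast.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the PuTTY source): count the
 * whitespace bytes at the start of a NUL-terminated string.
 */
size_t count_leading_space(const char *s)
{
    size_t n = 0;

    /*
     * Cast to unsigned char before calling isspace(). On platforms
     * where plain char is signed, bytes 0x80-0xFF would otherwise
     * reach isspace() as negative values: 0xFF could be mistaken
     * for EOF, and the rest are undefined behaviour.
     */
    while (s[n] != '\0' && isspace((unsigned char)s[n]))
        n++;

    return n;
}
```

The loop reads s[n] as a plain char only to test for the terminating NUL; every value handed to isspace() goes through the cast first.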

16 Commits
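
Commit 9e01de7c2b above describes a decoder that reports why it substituted U+FFFD, with separate codes for a sequence cut short by the end of the input and one cut short by a non-continuation byte. The sketch below illustrates that idea only. It is not PuTTY's decode_utf8: it reads from a plain pointer and length rather than a BinarySource, and every name in it (sk_utf8_status, sk_decode_utf8, the enum values) is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical status codes; PuTTY's real enumeration may differ. */
typedef enum {
    SK_UTF8_OK,                /* decoded successfully */
    SK_UTF8_INVALID_LEAD,      /* stray continuation byte or 0xF8-0xFF */
    SK_UTF8_TRUNCATED_AT_END,  /* input ended mid-sequence */
    SK_UTF8_TRUNCATED_BY_BYTE, /* a non-continuation byte interrupted it */
    SK_UTF8_OVERLONG,          /* longer encoding than necessary */
    SK_UTF8_OUT_OF_RANGE       /* surrogate, or above U+10FFFF */
} sk_utf8_status;

/*
 * Decode one character from p[0..len-1]. Returns the code point, or
 * U+FFFD on failure. *used receives the number of bytes consumed and
 * *st (if non-NULL) the reason for any failure.
 */
unsigned long sk_decode_utf8(const unsigned char *p, size_t len,
                             size_t *used, sk_utf8_status *st)
{
    sk_utf8_status dummy;
    unsigned long c, min;
    size_t need, i;

    if (!st)
        st = &dummy;
    *used = 0;
    if (len == 0) {
        *st = SK_UTF8_TRUNCATED_AT_END;
        return 0xFFFD;
    }

    c = p[0];
    *used = 1;
    if (c < 0x80) {
        *st = SK_UTF8_OK;      /* plain ASCII, including NUL */
        return c;
    } else if (c < 0xC0 || c >= 0xF8) {
        *st = SK_UTF8_INVALID_LEAD;
        return 0xFFFD;
    } else if (c < 0xE0) {
        need = 1; min = 0x80; c &= 0x1F;
    } else if (c < 0xF0) {
        need = 2; min = 0x800; c &= 0x0F;
    } else {
        need = 3; min = 0x10000; c &= 0x07;
    }

    for (i = 0; i < need; i++) {
        if (*used == len) {
            /* Could still be a prefix of valid UTF-8. */
            *st = SK_UTF8_TRUNCATED_AT_END;
            return 0xFFFD;
        }
        if ((p[*used] & 0xC0) != 0x80) {
            /* Leave the offending byte unconsumed for the next call. */
            *st = SK_UTF8_TRUNCATED_BY_BYTE;
            return 0xFFFD;
        }
        c = (c << 6) | (p[*used] & 0x3F);
        (*used)++;
    }

    if (c < min) {
        *st = SK_UTF8_OVERLONG;
        return 0xFFFD;
    }
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) {
        *st = SK_UTF8_OUT_OF_RANGE;
        return 0xFFFD;
    }
    *st = SK_UTF8_OK;
    return c;
}
```

A caller receiving data a byte at a time could treat SK_UTF8_TRUNCATED_AT_END as "wait for more input" and every other failure as a genuine error. A status of SK_UTF8_OK alongside a return value of 0xFFFD indicates that U+FFFD was legitimately encoded in the input.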