putty-source/utils/dup_mb_to_wc.c

/*
 * dup_mb_to_wc: memory-allocating wrapper on mb_to_wc.
 *
 * Also dup_mb_to_wc_c: same but you already know the length of the
 * string, and you get told the length of the returned wide string.
 * (But it's still NUL-terminated, for convenience.)
 */

#include "putty.h"
#include "misc.h"

wchar_t *dup_mb_to_wc_c(int codepage, const char *string,
                        size_t inlen, size_t *outlen_p)
{
    strbuf *sb = strbuf_new();
    put_mb_to_wc(sb, codepage, string, inlen);
    if (outlen_p)
        *outlen_p = sb->len / sizeof(wchar_t);

    /* Append a trailing L'\0'. For this we only need to write one
     * byte _fewer_ than sizeof(wchar_t), because strbuf will append a
     * byte '\0' for us. */
    put_padding(sb, sizeof(wchar_t) - 1, 0);
    return (wchar_t *)strbuf_to_str(sb);
}

wchar_t *dup_mb_to_wc(int codepage, const char *string)
{
    return dup_mb_to_wc_c(codepage, string, strlen(string), NULL);
}
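
The comment in dup_mb_to_wc_c about writing one byte fewer than sizeof(wchar_t) can be illustrated with a standalone sketch. The `bytebuf` type and the functions `bytebuf_init`, `bytebuf_append`, `bytebuf_pad` and `dup_wide_sketch` below are hypothetical stand-ins invented for this example; they are not PuTTY's strbuf API. The only property assumed is the one the comment describes: the buffer always keeps a single '\0' byte just past its logical length.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for a strbuf: a growable byte buffer that always
 * keeps one '\0' byte just past its logical length. */
typedef struct {
    char *data;
    size_t len;   /* logical length, excluding the implicit '\0' */
} bytebuf;

static void bytebuf_init(bytebuf *b)
{
    b->data = malloc(1);
    assert(b->data != NULL);
    b->data[0] = '\0';
    b->len = 0;
}

static void bytebuf_append(bytebuf *b, const void *src, size_t n)
{
    char *p = realloc(b->data, b->len + n + 1);
    assert(p != NULL);
    b->data = p;
    if (n)
        memcpy(b->data + b->len, src, n);
    b->len += n;
    b->data[b->len] = '\0';   /* maintain the implicit terminator byte */
}

static void bytebuf_pad(bytebuf *b, size_t n, char c)
{
    while (n-- > 0)
        bytebuf_append(b, &c, 1);
}

/* Copy n wide characters into a fresh, NUL-terminated allocation. */
static wchar_t *dup_wide_sketch(const wchar_t *src, size_t n,
                                size_t *outlen_p)
{
    bytebuf b;
    bytebuf_init(&b);
    bytebuf_append(&b, src, n * sizeof(wchar_t));
    if (outlen_p)
        *outlen_p = b.len / sizeof(wchar_t);

    /* sizeof(wchar_t) - 1 explicit zero bytes, plus the one implicit
     * zero byte the buffer maintains, make up a complete L'\0'. */
    bytebuf_pad(&b, sizeof(wchar_t) - 1, 0);
    return (wchar_t *)b.data;   /* caller frees with free() */
}
```

The length is captured before the padding is added, so the reported length counts only the converted characters and not the terminator.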

Commit history from the blame view (each commit listed once, with the lines attributed to it):

2021-10-16 12:20:44 +00:00
win_set_[icon_]title: send a codepage along with the string.
Blamed lines: the `/*` and `*/` of the header comment, both `#include` lines, and the opening and closing braces of both functions.

While fixing the previous commit I noticed that window titles don't actually _work_ properly if you change the terminal character set, because the text accumulated in the OSC string buffer is sent to the TermWin as raw bytes, with no indication of what character set it should interpret them as. You might get lucky if you happened to choose the right charset (in particular, UTF-8 is a common default), but if you change the charset half way through a run, then there's certainly no way the frontend will know to interpret two window titles sent before and after the change in two different charsets.

So, now win_set_title() and win_set_icon_title() both include a codepage parameter along with the byte string, and it's up to them to translate the provided window title from that encoding to whatever the local window system expects to receive.

On Windows, that's wide-string Unicode, so we can just use the existing dup_mb_to_wc utility function. But in GTK, it's UTF-8, so I had to write an extra utility function to encode a wide string as UTF-8.

2021-11-30 18:48:06 +00:00
Remove a redundant file in utils.
Blamed lines: the first three lines of the header comment text, from `* dup_mb_to_wc: memory-allocating wrapper on mb_to_wc.` to `* Also dup_mb_to_wc_c: same but you already know the length of the`.

At some point while setting up the utils subdirectory, I apparently only got half way through renaming miscucs.c to dup_mb_to_wc.c: I created the new copy of the file, but I didn't delete the old one, I didn't mention it in utils/CMakeLists.txt, and I didn't change the comment at the top.

Now done all three, so we now have just one copy of this utility module.

2022-11-25 12:57:43 +00:00
Add UTF-8 support to the new Windows ConsoleIO system.
Blamed lines: the last two lines of the header comment text, and the `size_t inlen, size_t *outlen_p)` line of the dup_mb_to_wc_c signature.

This allows you to set a flag in conio_setup() which causes the returned ConsoleIO object to interpret all its output as UTF-8, by translating it to UTF-16 and using WriteConsoleW to write it in Unicode. Similarly, input is read using ReadConsoleW and decoded from UTF-16 to UTF-8.

This flag is set to false in most places, to avoid making sudden breaking changes. But when we're about to present a prompts_t to the user, it's set from the new 'utf8' flag in that prompt, which in turn is set by the userauth layer in any case where the prompts are going to the server.

The idea is that this should be the start of a fix for the long-standing character-set handling bug that strings transmitted during SSH userauth (usernames, passwords, k-i prompts and responses) are all supposed to be in UTF-8, but we've always encoded them in whatever our input system happens to be using, and not done any tidying up on them. We get occasional complaints about this from users whose passwords contain characters that are encoded differently between UTF-8 and their local encoding, but I've never got round to fixing it because it's a large piece of engineering.

Indeed, this isn't nearly the end of it. The next step is to add UTF-8 support to all the _other_ ways of presenting a prompts_t, as best we can.

Like the previous change to console handling, it seems very likely that this will break someone's workflow. So there's a fallback command-line option '-legacy-charset-handling' to revert to PuTTY's previous behaviour.

2024-09-24 07:46:39 +00:00
dup_mb_to_wc, dup_wc_to_mb: remove the 'flags' parameter.
Blamed lines: the first line of the dup_mb_to_wc_c signature, the dup_mb_to_wc signature, and the `return` statement in dup_mb_to_wc.

This parameter was undocumented, and Windows-specific: its semantics date from before PuTTY was cross-platform, and are "Pass this flags parameter straight through to the Win32 API's conversion functions". So in Windows platform code you can pass flags like MB_USEGLYPHCHARS, but in cross-platform code, you dare not pass anything nonzero at all because the Unix frontend won't recognise it (or, likely, even compile).

I've kept the flag for now in the underlying mb_to_wc / wc_to_mb functions. Partly that's because there's one place in the Windows code where the parameter _is_ used; mostly, it's because I'm about to replace those functions anyway, so there's no point in editing all the call sites twice.

2024-09-24 07:18:48 +00:00
Rework Unicode conversion APIs to use a BinarySink.
Blamed lines: the whole body of dup_mb_to_wc_c, from `strbuf *sb = strbuf_new();` to the `return` statement.

The previous mb_to_wc and wc_to_mb had horrible and also buggy APIs. This commit introduces a fresh pair of functions to replace them, which generate output by writing to a BinarySink. So it's now up to the caller to decide whether it wants the output written to a fixed-size buffer with overflow checking (via buffer_sink), or dynamically allocated, or even written directly to some other output channel.

Nothing uses the new functions yet. I plan to migrate things over in upcoming commits.

What was wrong with the old APIs: they had that awkward undocumented Windows-specific 'flags' parameter that I described in the previous commit and took out of the dup_X_to_Y wrappers. But much worse, the semantics for buffer overflow were not just undocumented but actually inconsistent. dup_wc_to_mb() in utils assumed that the underlying wc_to_mb would fill the buffer nearly full and return the size of data it wrote. In fact, this was untrue in the case where wc_to_mb called WideCharToMultiByte: that returns straight-up failure, setting the Windows error code to ERROR_INSUFFICIENT_BUFFER. It _does_ partially fill the output buffer, but doesn't tell you how much it wrote!

What's wrong with the new API: it's a bit awkward to write a sequence of wchar_t in native byte order to a byte-oriented BinarySink, so people using put_mb_to_wc directly have to do some annoying pointer casting. But I think that's less horrible than the previous APIs.

Another change: in the new API for wc_to_mb, defchr can be "", but not NULL.
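
The "Rework Unicode conversion APIs to use a BinarySink" message describes letting the caller choose where converted output goes: a fixed-size buffer with overflow checking, or a dynamically allocated one. The following is a minimal sketch of that general idea only. The `sink`, `fixed_sink` and `grow_sink` types and the `emit_upper` producer are hypothetical names made up for illustration; this is not PuTTY's BinarySink or buffer_sink implementation, and the real interfaces may differ considerably.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical byte-sink interface (NOT PuTTY's BinarySink). */
typedef struct sink sink;
struct sink {
    void (*write)(sink *s, const void *data, size_t len);
};

/* Fixed-size destination: records overflow instead of overrunning. */
typedef struct {
    sink base;            /* first member, so a sink* can be cast back */
    char *buf;
    size_t size, used;
    bool overflowed;
} fixed_sink;

static void fixed_write(sink *s, const void *data, size_t len)
{
    fixed_sink *fs = (fixed_sink *)s;
    size_t room = fs->size - fs->used;
    if (len > room) {
        fs->overflowed = true;   /* the caller can check this afterwards */
        len = room;
    }
    if (len)
        memcpy(fs->buf + fs->used, data, len);
    fs->used += len;
}

static void fixed_sink_init(fixed_sink *fs, char *buf, size_t size)
{
    fs->base.write = fixed_write;
    fs->buf = buf;
    fs->size = size;
    fs->used = 0;
    fs->overflowed = false;
}

/* Dynamically growing destination. */
typedef struct {
    sink base;
    char *buf;
    size_t used;
} grow_sink;

static void grow_write(sink *s, const void *data, size_t len)
{
    grow_sink *gs = (grow_sink *)s;
    if (len == 0)
        return;
    char *p = realloc(gs->buf, gs->used + len);
    assert(p != NULL);
    gs->buf = p;
    memcpy(gs->buf + gs->used, data, len);
    gs->used += len;
}

static void grow_sink_init(grow_sink *gs)
{
    gs->base.write = grow_write;
    gs->buf = NULL;
    gs->used = 0;
}

/* A producer that knows nothing about where its output ends up.
 * Here it just upper-cases ASCII, standing in for a real converter. */
static void emit_upper(sink *s, const char *text)
{
    for (; *text; text++) {
        char c = (*text >= 'a' && *text <= 'z')
            ? (char)(*text - 'a' + 'A') : *text;
        s->write(s, &c, 1);
    }
}
```

With this shape, the producer never has to report how much it wrote or whether it ran out of room. Each sink tracks that itself, which avoids the inconsistent overflow semantics the commit message criticises in the old API.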