Rework Unicode conversion APIs to use a BinarySink.

The previous mb_to_wc and wc_to_mb had horrible and also buggy APIs. This commit introduces a fresh pair of functions to replace them, which generate output by writing to a BinarySink. So it's now up to the caller to decide whether it wants the output written to a fixed-size buffer with overflow checking (via buffer_sink), or dynamically allocated, or even written directly to some other output channel.

Nothing uses the new functions yet. I plan to migrate things over in upcoming commits.

What was wrong with the old APIs: they had that awkward undocumented Windows-specific 'flags' parameter that I described in the previous commit and took out of the dup_X_to_Y wrappers. But much worse, the semantics for buffer overflow were not just undocumented but actually inconsistent. dup_wc_to_mb() in utils assumed that the underlying wc_to_mb would fill the buffer nearly full and return the size of data it wrote. In fact, this was untrue in the case where wc_to_mb called WideCharToMultiByte: that returns straight-up failure, setting the Windows error code to ERROR_INSUFFICIENT_BUFFER. It _does_ partially fill the output buffer, but doesn't tell you how much it wrote!

What's wrong with the new API: it's a bit awkward to write a sequence of wchar_t in native byte order to a byte-oriented BinarySink, so people using put_mb_to_wc directly have to do some annoying pointer casting. But I think that's less horrible than the previous APIs.

Another change: in the new API for wc_to_mb, defchr can be "", but not NULL.
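
To illustrate the sink-based design described above, here is a minimal, self-contained sketch. It is not the code from this commit: demo_sink, demo_fixed_sink, demo_dyn_sink and demo_put_wc_to_mb are hypothetical stand-ins for BinarySink, buffer_sink, strbuf and put_wc_to_mb, and the conversion uses the C library's wcrtomb in the current locale rather than a code page. The point it tries to show is that the converter never needs to know how big the output is allowed to be.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical stand-in for a byte sink. NOT the real BinarySink. */
typedef struct demo_sink demo_sink;
struct demo_sink {
    void (*write)(demo_sink *sink, const void *data, size_t len);
};

/* Fixed-size sink: truncates and records overflow instead of failing. */
typedef struct {
    demo_sink sink;          /* first member, so demo_sink * casts back */
    char *buf;
    size_t size, len;
    bool overflowed;
} demo_fixed_sink;

void demo_fixed_write(demo_sink *s, const void *data, size_t len)
{
    demo_fixed_sink *fs = (demo_fixed_sink *)s;
    size_t room = fs->size - fs->len;
    if (len > room) {
        fs->overflowed = true;
        len = room;
    }
    if (len == 0)
        return;
    memcpy(fs->buf + fs->len, data, len);
    fs->len += len;
}

/* Dynamically allocated sink: grows as needed. */
typedef struct {
    demo_sink sink;
    char *buf;
    size_t size, len;
} demo_dyn_sink;

void demo_dyn_write(demo_sink *s, const void *data, size_t len)
{
    demo_dyn_sink *ds = (demo_dyn_sink *)s;
    if (len == 0)
        return;
    if (len > ds->size - ds->len) {
        size_t newsize = (ds->len + len) * 2;
        char *newbuf = realloc(ds->buf, newsize);
        if (!newbuf)
            abort();
        ds->buf = newbuf;
        ds->size = newsize;
    }
    memcpy(ds->buf + ds->len, data, len);
    ds->len += len;
}

/* Convert n wide characters to multibyte, one at a time, writing each
 * result to the sink. Unconvertible characters are replaced by defchr,
 * which may be "" but must not be NULL (mirroring the rule stated in
 * the commit message). */
void demo_put_wc_to_mb(demo_sink *sink, const wchar_t *ws, size_t n,
                       const char *defchr)
{
    mbstate_t state;
    char tmp[MB_LEN_MAX];

    assert(defchr != NULL);
    memset(&state, 0, sizeof(state));
    for (size_t i = 0; i < n; i++) {
        size_t ret = wcrtomb(tmp, ws[i], &state);
        if (ret == (size_t)-1) {
            sink->write(sink, defchr, strlen(defchr));
            memset(&state, 0, sizeof(state)); /* state is undefined after error */
        } else {
            sink->write(sink, tmp, ret);
        }
    }
}
```

A caller would initialise either sink type, for example `demo_fixed_sink fs = { { demo_fixed_write }, buf, sizeof(buf), 0, false };`, pass `&fs.sink` to demo_put_wc_to_mb, and afterwards inspect `fs.len` and `fs.overflowed`. With the growable sink there is nothing to inspect except the final length, which is roughly the shape the rewritten dup_wc_to_mb_c takes in the diff below.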
2025-07-01 19:42:48 -05:00 · 2024-09-24 08:18:48 +01:00
parent 32b8da1177
commit 4f756d2a4d
10 changed files with 267 additions and 239 deletions
--- a/utils/dup_mb_to_wc.c
+++ b/utils/dup_mb_to_wc.c
@@ -12,20 +12,16 @@
 wchar_t *dup_mb_to_wc_c(int codepage, const char *string,
                        size_t inlen, size_t *outlen_p)
 {
-    assert(inlen <= INT_MAX);
-    size_t mult;
-    for (mult = 1 ;; mult++) {
-        wchar_t *ret = snewn(mult*inlen + 2, wchar_t);
-        size_t outlen = mb_to_wc(codepage, 0, string, inlen, ret,
-                                 mult*inlen + 1);
-        if (outlen < mult*inlen+1) {
-            if (outlen_p)
-                *outlen_p = outlen;
-            ret[outlen] = L'\0';
-            return ret;
-        }
-        sfree(ret);
-    }
+    strbuf *sb = strbuf_new();
+    put_mb_to_wc(sb, codepage, string, inlen);
+    if (outlen_p)
+        *outlen_p = sb->len / sizeof(wchar_t);
+
+    /* Append a trailing L'\0'. For this we only need to write one
+     * byte _fewer_ than sizeof(wchar_t), because strbuf will append a
+     * byte '\0' for us. */
+    put_padding(sb, sizeof(wchar_t) - 1, 0);
+    return (wchar_t *)strbuf_to_str(sb);
 }

 wchar_t *dup_mb_to_wc(int codepage, const char *string)
--- a/utils/dup_wc_to_mb.c
+++ b/utils/dup_wc_to_mb.c
@@ -14,26 +14,11 @@
 char *dup_wc_to_mb_c(int codepage, const wchar_t *string,
                     size_t inlen, const char *defchr, size_t *outlen_p)
 {
-    assert(inlen <= INT_MAX);
-
-    size_t outsize = inlen+1;
-    char *out = snewn(outsize, char);
-
-    while (true) {
-        size_t outlen = wc_to_mb(codepage, 0, string, inlen, out, outsize,
-                                 defchr);
-        /* We can only be sure we've consumed the whole input if the
-         * output is not within a multibyte-character-length of the
-         * end of the buffer! */
-        if (outlen < outsize && outsize - outlen > MB_LEN_MAX) {
-            if (outlen_p)
-                *outlen_p = outlen;
-            out[outlen] = '\0';
-            return out;
-        }
-
-        sgrowarray(out, outsize, outsize);
-    }
+    strbuf *sb = strbuf_new();
+    put_wc_to_mb(sb, codepage, string, inlen, defchr);
+    if (outlen_p)
+        *outlen_p = sb->len;
+    return strbuf_to_str(sb);
 }

 char *dup_wc_to_mb(int codepage, const wchar_t *string,