The Windows version of the Filename structure now contains three
versions of the pathname, in UTF-16, UTF-8 and the system code page.
Callers can use whichever is most convenient.
All uses of filenames for actually opening files now use the UTF-16
version, which means they can tolerate 'exotic' filenames, by which I
mean those including Unicode characters outside the host system's
CP_ACP default code page.
Other uses of Filename structures inside the 'windows' subdirectory do
something appropriate, e.g. when printing a filename inside a message
box or a console message, we use the UTF-8 version of the filename
with the UTF-8 version of the appropriate API.
There are three remaining pieces to full Unicode filename support:
One is that the cross-platform code has many calls to
filename_to_str(), embodying the assumption that a file name can be
reliably converted into the unspecified current character set; those
will all need changing in some way.
Another is that write_setting_filename(), in windows/storage.c, still
saves filenames to the Registry as an ordinary REG_SZ in the system
code page. So even if an exotic filename were stored in a Conf, that
Conf couldn't round-trip via the Registry and back without corrupting
that filename by coercing it back to a string that fits in CP_ACP and
therefore doesn't represent the same file. This can't be fixed without
a compatibility break in the storage format, and I don't want to make
a minimal change in that area: if we're going to break compatibility,
then we should break it good and hard (the Nanny Ogg principle), and
devise a completely fresh storage representation that fixes as many
other legacy problems as possible at the same time. So that's my plan,
not yet started.
The final point, much more obviously, is that we're still short of
methods to _construct_ any Filename structures using a Unicode input
string! It should now work to enter one in the GUI configurer (either
by manual text input or via the file selector), but it won't
round-trip through a save and load (as discussed above), and there's
still no way to specify one on the command line (the groundwork is
laid by commit 10e1ac7752 but not yet linked up).
But this is a start.
I mentioned recently (in commit 9e7d4c53d8) message that I'm no
longer fond of the variable name 'ret', because it's used in two quite
different contexts: it's the return value from a subroutine you just
called (e.g. 'int ret = read(fd, buf, len);' and then check for error
or EOF), or it's the value you're preparing to return from the
_containing_ routine (maybe by assigning it a default value and then
conditionally modifying it, or by starting at NULL and reallocating,
or setting it just before using the 'goto out' cleanup idiom). In the
past I've occasionally made mistakes by forgetting which meaning the
variable had, or accidentally conflating both uses.
If all else fails, I now prefer 'retd' (short for 'returned') in the
former situation, and 'toret' (obviously, the value 'to return') in
the latter case. But even better is to pick a name that actually says
something more specific about what the thing actually is.
One particular bad habit throughout this codebase is to have a set of
functions that deal with some object type (say 'Foo'), all *but one*
of which take a 'Foo *foo' parameter, but the foo_new() function
starts with 'Foo *ret = snew(Foo)'. If all the rest of them think the
canonical name for the ambient Foo is 'foo', so should foo_new()!
So here's a no-brainer start on cutting down on the uses of 'ret': I
looked for all the cases where it was being assigned the result of an
allocation, and renamed the variable to be a description of the thing
being allocated. In the case of a new() function belonging to a
family, I picked the same name as the rest of the functions in its own
family, for consistency. In other cases I picked something sensible.
One case where it _does_ make sense not to use your usual name for the
variable type is when you're cloning an existing object. In that case,
_neither_ of the Foo objects involved should be called 'foo', because
it's ambiguous! They should be named so you can see which is which. In
the two cases I found here, I've called them 'orig' and 'copy'.
As in the previous refactoring, many thanks to clang-rename for the
help.
Now that the new CMake build system is encouraging us to lay out the
code like a set of libraries, it seems like a good idea to make them
look more _like_ libraries, by putting things into separate modules as
far as possible.
This fixes several previous annoyances in which you had to link
against some object in order to get a function you needed, but that
object also contained other functions you didn't need which included
link-time symbol references you didn't want to have to deal with. The
usual offender was subsidiary supporting programs including misc.c for
some innocuous function and then finding they had to deal with the
requirements of buildinfo().
This big reorganisation introduces three new subdirectories called
'utils', one at the top level and one in each platform subdir. In each
case, the directory contains basically the same files that were
previously placed in the 'utils' build-time library, except that the
ones that were extremely miscellaneous (misc.c, utils.c, uxmisc.c,
winmisc.c, winmiscs.c, winutils.c) have been split up into much
smaller pieces.