Unicode 16.0.0 has changed the formatting of that file in a way that
I'm sure _they_ thought was unproblematic :-) by putting spaces around
the character class field, which the reading code wasn't prepared to
cope with.
A new module in 'utils' computes NFC and NFD, via a new set of data
tables generated by read_ucd.py.
The new module comes with a new test program, which can read the
NormalizationTest.txt that appears in the Unicode Character Database.
All the tests pass, as of Unicode 15.
The initial outputs were all deliberately inconsistent with each
other, so that each one exactly matched the existing table I was
trying to replace.
Now I've done that check, I can clean them up. Normalised spacing and
case to be consistent; removed pointless indentation (these are now
include files, so they don't have to be indented to the same level as
the array declaration surrounding each one's #include); added a header
comment in each autogenerated file, saying that it's autogenerated,
what it's for, and who it's used by.
The currently supported version number of Unicode is also exposed in a
header file, so that I can put it in diagnostics.
This will replace the various pieces of Perl scattered throughout the
code base in comments above long boring data tables. The idea is that
those long boring tables will move into header files in the new
'unicode' directory, and will be #included from the source files that
use the tables.
One benefit is that I won't have to page tediously past the tables to
get to the actual code I want to edit. But more importantly, it should
now become easy to update to a new version of Unicode, by re-running
just one script and committing the changed versions of all the headers
in the 'unicode' subdir.
This version of the script regenerates six Unicode-derived tables in
the existing source code in a byte-for-byte identical form. In the
next commits I'll clean it up, commit the output, and delete the
tables from their previous locations.
(One table I _haven't_ incorporated into this system is the Arabic
shaping table in bidi.c, because my attempt to regenerate it came out
not matching the original at all. That _might_ be because the table is
based on an old Unicode standard and desperately needs updating, but
it might also be because I misunderstood how it works. So I'll leave
sorting that out for another time.)