These Unicode tag characters were deliberately thrown away in our
UTF-8 decoder, with a comment apparently introduced by RDB in the
2001 big Unicode patch.
The purpose of this character range has changed completely since then,
and now they act as modifier characters on top of U+1F3F4 to construct
a space of flags (the standard examples being those of England,
Scotland and Wales). We were failing to display those flags, and even
pasting out of the terminal didn't give back the right Unicode.
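
For concreteness, here is a tiny standalone C program (purely an
illustration, not part of this change) that prints the England flag as
one of those sequences: U+1F3F4 followed by the tag characters
spelling 'gbeng' and then the cancel-tag character. It assumes a
UTF-8 locale and a terminal and font capable of rendering the result.

    #include <stdio.h>

    int main(void)
    {
        /*
         * Flag of England: U+1F3F4 WAVING BLACK FLAG, then the tag
         * characters for 'g','b','e','n','g' (U+E0067 U+E0062 U+E0065
         * U+E006E U+E0067), then U+E007F CANCEL TAG.
         */
        printf("\U0001F3F4\U000E0067\U000E0062\U000E0065"
               "\U000E006E\U000E0067\U000E007F\n");
        return 0;
    }
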
This is the only one of the newly added cases in test/utf8.txt which I
can (try to) fix unilaterally just by changing PuTTY's display code,
because it doesn't change the number of character cells occupied by
the text, only the appearance of those cells.
In this commit I make the necessary changes in terminal.c, which makes
flags start working in GTK PuTTY and pterm, but not on Windows.
The system of encoding flags in Unicode is that there's a space of 26
regional-indicator letter code points (U+1F1E6 to U+1F1FF inclusive)
corresponding to the unaccented Latin alphabet, and an adjacent pair
of those letters represents the flag associated with that two-letter
code (usually a nation, although at least one non-nation pair exists,
namely EU).
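
That mapping is simple arithmetic; here is a hedged illustration of
it (the helper names are made up for this sketch, not anything in
PuTTY's source):

    /*
     * U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A, and the other
     * 25 letters follow it contiguously.
     */
    static unsigned long ri_for_letter(char c)   /* c in 'A'..'Z' */
    {
        return 0x1F1E6 + (unsigned long)(c - 'A');
    }

    static int is_regional_indicator(unsigned long cp)
    {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    /*
     * So ri_for_letter('G') == 0x1F1EC and ri_for_letter('B') ==
     * 0x1F1E7, and the adjacent pair U+1F1EC U+1F1E7 displays as the
     * GB flag.
     */
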
There are two plausible ways we could handle this in terminal.c:
(a) leave the regional indicators as they are in the internal data
model, so that each RI letter occupies its own character cell,
and at display time have do_paint() spot adjacent pairs of them
and send each pair to the frontend as a combined glyph.
(b) combine the pairs _in_ the internal data model, by
special-casing them in term_display_graphic_char().
This choice makes a semantic difference. What if a flag is displayed
in the terminal and something overprints one of its two character
cells? With option (a), overprinting one cell of an RI pair with a
different RI letter would change it into a different flag; with
option (b), flags behave like any other wide character, in that
overprinting one of the two cells blanks the other as a side effect.
I think we need (a), because not all terminal redraw systems
(curses-style libraries) will understand the Unicode flag glyph system
at all. So if a full-screen application genuinely wants to do a screen
redraw in which a flag changes to a different flag while keeping one
of its constituent letters the same (say, swapping between BA and CA,
or between AC and AD), then the redraw library might very well
implement that screen update by redrawing only the changed letter,
and we need that not to corrupt the flag.
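
To make option (a) concrete, here's a minimal sketch of the kind of
pairing pass the display code needs (with made-up names, not the real
do_paint()): walk the cells of a line, and whenever two adjacent
cells both hold regional-indicator letters, treat them as one
double-width glyph.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    static bool is_ri(uint32_t cp)
    {
        return cp >= 0x1F1E6 && cp <= 0x1F1FF;
    }

    /* Illustrative only: report how a line of per-cell code points
     * would be grouped into glyphs at display time. */
    static void paint_line(const uint32_t *cells, size_t ncells)
    {
        for (size_t i = 0; i < ncells; i++) {
            if (i + 1 < ncells && is_ri(cells[i]) && is_ri(cells[i+1])) {
                printf("cols %zu-%zu: flag glyph U+%04X U+%04X\n",
                       i, i + 1, (unsigned)cells[i], (unsigned)cells[i+1]);
                i++;            /* the pair has consumed two cells */
            } else {
                printf("col %zu: single cell U+%04X\n",
                       i, (unsigned)cells[i]);
            }
        }
    }
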
All of this is now implemented in terminal.c. The effect is that pairs
of RI characters are passed to the TermWin draw_text() method as if
they were a wide character with a combining mark: that is, you get a
two-character (or four-surrogate) string, with TATTR_COMBINING
indicating that it represents a single glyph, and ATTR_WIDE indicating
that that glyph occupies two character cells rather than one.
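
For example (again just an illustrative sketch, assuming UTF-16
output to the frontend, not the actual terminal.c code), the GB
flag's two regional indicators expand to four UTF-16 code units like
this:

    #include <stddef.h>
    #include <stdint.h>

    /* Append one code point to a UTF-16 buffer, returning the number
     * of 16-bit units written (1 in the BMP, 2 for a surrogate pair). */
    static size_t utf16_put(uint16_t *out, uint32_t cp)
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
        out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
        return 2;
    }

    /*
     * An RI pair such as U+1F1EC U+1F1E7 (the GB flag) therefore
     * becomes the four units 0xD83C 0xDDEC 0xD83C 0xDDE7, sent as one
     * glyph (TATTR_COMBINING) occupying two cells (ATTR_WIDE).
     */
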
In GTK, that's enough to make flag display Just Work. But on
Windows (at least on the Win10 machine I have to test on), it doesn't
make flags start working all by itself. Then again, the rest of the
new emoji tests look a bit confused on Windows too. Help would be
welcome from someone who knows how Windows emoji display is supposed
to work!
These samples all come from the 'emoji' parts of Unicode, although I
use the word a bit loosely because I'm not sure that flags count (they
have their own special system). But they're all things that ought to
display via a separate font, likely in colour.
The second line of this extra test already looks correct in PuTTY:
three code points each representing an emoji, for which wcwidth()
correctly reports that they occupy 2 cells each. On GTK, the emoji
even appear in colour; on Windows they come out in black and
white. (And I don't know what I can do to fix that; the problem is not
that I don't have any emoji font installed. I do.)
The first line consists of 'simpler' emoji in the sense of being more
common, but technically more complicated, because they're ordinary
Unicode characters such as U+2764 HEAVY BLACK HEART, modified into
emoji by U+FE0F VARIATION SELECTOR-16. This goes badly because
wcwidth() measures the primary character as having width 1 (which it
would do, by itself), and the variation selector as width 0 (also not
unreasonable), but the total is 1, where you'd like it to be 2. This
is also difficult to fix, because if we unilaterally changed it then
every curses-type library would mispredict the cursor position and
produce display corruption during partial screen redraws!
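
The arithmetic is easy to check against the system's own wcwidth()
(POSIX rather than ISO C, so this sketch assumes a Unix-like system
with a UTF-8 locale; the exact results depend on the C library's
Unicode tables):

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");          /* assume a UTF-8 locale */

        wchar_t heart = 0x2764;           /* HEAVY BLACK HEART */
        wchar_t vs16  = 0xFE0F;           /* VARIATION SELECTOR-16 */

        /* Typically prints 1, 0 and 1: the emoji-presentation heart
         * ends up measured as one cell, where 2 would be ideal. */
        printf("wcwidth(U+2764) = %d\n", wcwidth(heart));
        printf("wcwidth(U+FE0F) = %d\n", wcwidth(vs16));
        printf("total = %d\n", wcwidth(heart) + wcwidth(vs16));
        return 0;
    }
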
The third line uses a mechanism I've only found out about recently:
U+200D ZERO WIDTH JOINER glues together two code points that would
each be a valid emoji on its own, to make a single combined one. In
this case, WOMAN + PERSONAL COMPUTER ought to combine into a woman
using a computer. Again this doesn't work in PuTTY, which knows
nothing about ZWJ. But it comes out as expected in other tools viewing
this file, such as 'gedit' or Firefox.
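
For reference, that sequence is U+1F469 WOMAN, U+200D ZERO WIDTH
JOINER, U+1F4BB PERSONAL COMPUTER; a throwaway C one-liner to emit it
(assuming a UTF-8 terminal) looks like this:

    #include <stdio.h>

    int main(void)
    {
        /* A ZWJ-aware renderer shows one combined glyph; one that
         * ignores ZWJ shows the woman and the computer separately. */
        printf("\U0001F469\u200D\U0001F4BB\n");
        return 0;
    }
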
The fourth line shows another complex emoji case: the WOMAN code point
is followed by U+1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2, and
another one is followed by U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6,
in each case selecting the woman's skin tone. PuTTY mishandles that
too, because it doesn't know that those should act as modifiers (again
because wcwidth() gives them width 2 rather than 0), and so each one
occupies an extra two character cells.
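
Concretely, those sequences are U+1F469 U+1F3FB and U+1F469 U+1F3FF,
and the over-counting can be seen with POSIX wcswidth() (a sketch
assuming a Unix-like system with 32-bit wchar_t and a UTF-8 locale):

    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");   /* assume a UTF-8 locale */

        /* WOMAN followed by EMOJI MODIFIER FITZPATRICK TYPE-1-2 */
        wchar_t woman_light[] = { 0x1F469, 0x1F3FB, 0 };

        /* With the modifier measured as width 2, this reports 4 cells
         * where a single 2-cell glyph is what you'd really want. */
        printf("wcswidth() = %d\n", wcswidth(woman_light, 2));
        return 0;
    }
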
And the last line contains some sample flags, each of which is
obtained by writing a 2-letter code for a country or region (here GB,
UA, EU) with each Latin letter replaced by the appropriate 'regional
indicator symbol letter' from the 26-code-point range U+1F1E6 to
U+1F1FF inclusive. PuTTY doesn't know anything about those either, but
they at least occupy the right number of cells if handled naïvely, so
_that_ one might be possible to fix!
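
Spelled out, those are GB = U+1F1EC U+1F1E7, UA = U+1F1FA U+1F1E6 and
EU = U+1F1EA U+1F1FA; for instance, this little program (assuming a
UTF-8 terminal with a suitable font) prints the three sample flags:

    #include <stdio.h>

    int main(void)
    {
        printf("\U0001F1EC\U0001F1E7 "     /* GB */
               "\U0001F1FA\U0001F1E6 "     /* UA */
               "\U0001F1EA\U0001F1FA\n");  /* EU */
        return 0;
    }
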