Now that it's possible for a single invocation of PuTTY to connect to
multiple SSH servers (jump host followed by ultimate destination
host), it's rather unhelpful for host key prompts to just say "the
server". To check an unknown host key, users will need to know _which_
host it's purporting to be the key for.
Another possibility is to put a message in the terminal window
indicating which server we're currently in the SSH setup phase for.
That will certainly be what we have to end up doing for userpass
prompts that appear _in_ the terminal window. But that by itself is
still unhelpful for host key prompts in a separate dialog, because the
user would have to check both windows to get all the information they
need. Easier if the host key dialog itself tells you everything you
need to know to answer the question: is _this_ key the one you expect
for _that_ host?
The format _strings_ were previously centralised into the platform-
independent console.c, as const char arrays. Now the actual formatting
operation is centralised as well, by means of console.c providing a
function that takes all the necessary parameters and returns a
formatted piece of text for the console.
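As a rough sketch of the shape of that change (the function and
parameter names here are invented for illustration, not the actual
console.c API):

    #include <stdio.h>
    #include <stdlib.h>

    /* One shared routine builds the whole prompt text, so adding a new
     * parameter only means touching this one function rather than a
     * format string in one file and fprintf calls in others. */
    char *format_hostkey_prompt(const char *host, int port,
                                const char *keytype,
                                const char *fingerprint)
    {
        const char *fmt = "The host key for %s (port %d) is not cached.\n"
                          "Key type: %s\nFingerprint: %s\n";
        int n = snprintf(NULL, 0, fmt, host, port, keytype, fingerprint);
        char *msg = malloc(n + 1);
        if (msg)
            snprintf(msg, n + 1, fmt, host, port, keytype, fingerprint);
        return msg;                 /* caller frees */
    }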
Mostly this is so that I can add extra parameters to the message with
some confidence: changing a format string in one file and two fprintf
statements in other files to match seems like the kind of situation
you wish you hadn't got into in the first place :-)
In a GUI app, when interactive userpass input begins, the Ldisc
acquires a reference to a prompts_t. If something bad happens to the
SSH connection (e.g. unexpected server-side closure), then all the SSH
layers will be destroyed, including freeing that prompts_t. So the
Ldisc will have a stale reference to it, which it might potentially
use.
To fix that, I've arranged a back-pointer so that prompts_t itself can
find the Ldisc's reference to it, and NULL it out on free. So now,
whichever of a prompts_t and an Ldisc is freed first, the link between
them should be cleanly broken.
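In outline, the pattern looks like this (a minimal standalone sketch;
the structure and field names are invented, not PuTTY's real ones):

    #include <stdlib.h>

    struct prompts;                      /* stands in for prompts_t */
    struct ldisc {
        struct prompts *prompts;         /* acquired when prompting starts */
    };
    struct prompts {
        struct prompts **ldisc_backref;  /* points at the ldisc's reference */
    };

    void ldisc_acquire_prompts(struct ldisc *ld, struct prompts *p)
    {
        ld->prompts = p;
        p->ldisc_backref = &ld->prompts;
    }

    void free_prompts(struct prompts *p)
    {
        if (p->ldisc_backref)
            *p->ldisc_backref = NULL;    /* break the link from the Ldisc side */
        free(p);
    }

    void ldisc_free(struct ldisc *ld)
    {
        if (ld->prompts)
            ld->prompts->ldisc_backref = NULL;   /* ...and vice versa */
        free(ld);
    }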
(I'm not 100% sure this is absolutely necessary, in the sense of
whether a sequence of events can _actually_ happen that causes a stale
pointer dereference. But I don't want to take the chance!)
If a userauth layer is destroyed while userpass input is still
ongoing, ssh2_userauth_free forgot to free the active prompts_t,
leaking memory.
But adding the missing free_prompts call to ssh2_userauth_free results
in a double-free, because another thing I forgot was to null out that
pointer field everywhere _else_ it's freed. Fixed that too.
Commit 8f5e9a4f8d introduced a segfault into both Windows Pageant
and Windows connection sharing upstreams when they receive an incoming
named-pipe connection. This occurs because the PlugVtables for those
incoming connections had a null pointer in the 'log' field, because
hitherto, only sockets involved with an outgoing connection expected
to receive plug_log() notifications. But I added such a notification
in make_handle_socket, forgetting that that function is used for both
outgoing and incoming named-pipe connections (among other things). So
now a Plug implementation that expects to be set up by the
plug_accepting() method on a listener may still receive
PLUGLOG_CONNECT_SUCCESS.
I could fix that by adding a parameter to make_handle_socket telling
it whether to send a notification, but that seems like more faff than
is really needed. Simpler to make a rule that says _all_ Socket types
must implement the log() method, even if only with a no-op function.
We already have a no-op implementation of log(), in nullplug.c. So
I've exposed that outside its module (in the same style as all the
nullseat functions and so on), to make it really easy to add log()
methods to PlugVtables that don't need one.
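In sketch form (schematic declarations, not the real PuTTY ones), the
point is simply that no PlugVtable is left with log as a null pointer:

    typedef struct Plug Plug;
    typedef struct PlugVtable PlugVtable;
    struct Plug { const PlugVtable *vt; };
    struct PlugVtable {
        void (*log)(Plug *p, int type, const char *msg, int err);
        void (*accepting)(Plug *p);
        /* ... more methods in the real vtable ... */
    };

    /* shared no-op, exposed so any vtable can borrow it */
    void nullplug_log(Plug *p, int type, const char *msg, int err)
    {
        /* deliberately do nothing */
    }

    static void server_accepting(Plug *p) { /* set up incoming connection */ }

    static const PlugVtable ServerConnection_plugvt = {
        .log = nullplug_log,             /* no-op instead of NULL */
        .accepting = server_accepting,
    };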
Apparently in all my test runs on Linux it happened to start off
negative. But as soon as I tested on Windows, that initialised the
memory to something unhelpful.
Now that the SSH backend's user_input bufchain is no longer needed for
handling userpass input, it doesn't have to be awkwardly shared
between all the packet protocol layers any more. So we can turn the
want_user_input and got_user_input methods of PacketProtocolLayer into
methods of ConnectionLayer, and then only the two connection layers
have to bother implementing them, or store a pointer to the bufchain
they read from.
This is the big payoff from the huge refactoring in the previous
commit: now it's possible for proxy implementations to present their
own interactive prompts via the seat_get_userpass_input system,
because the input data that those prompts will need to consume is now
always somewhere sensible (and hasn't, for example, already been put
on to the main backend's input queue where the proxy can't get at it).
Like the GUI dialog prompts, this isn't yet fully polished, because
the login and password prompts are very unclear about which SSH server
they're talking about. But at least you now _can_ log in manually with
a username and password to two SSH servers in succession (if you know
which server(s) you're expecting to see prompts from), and that was
the really hard part.
The system for handling seat_get_userpass_input has always been
structured differently between GUI PuTTY and CLI tools like Plink.
In the CLI tools, password input is read directly from the OS
terminal/console device by console_get_userpass_input; this means that
you need to ensure the same terminal input data _hasn't_ already been
consumed by the main event loop and sent on to the backend. This is
achieved by the backend_sendok() method, which tells the event loop
when the backend has finished issuing password prompts, and hence,
when it's safe to start passing standard input to backend_send().
But in the GUI tools, input generated by the terminal window has
always been sent straight to backend_send(), regardless of whether
backend_sendok() says it wants it. So the terminal-based
implementation of username and password prompts has to work by
consuming input data that had _already_ been passed to the backend -
hence, any backend that needs to do that must keep its input on a
bufchain, and pass that bufchain to seat_get_userpass_input.
It's awkward that these two totally different systems coexist in the
first place. And now that SSH proxying needs to present interactive
prompts of its own, it's clear which one should win: the CLI style is
the Right Thing. So this change reworks the GUI side of the mechanism
to be more similar: terminal data now goes into a queue in the Ldisc,
and is not sent on to the backend until the backend says it's ready
for it via backend_sendok(). So terminal-based userpass prompts can
now consume data directly from that queue during the connection setup
stage.
As a result, the 'bufchain *' parameter has vanished from all the
userpass_input functions (both the official implementations of the
Seat trait method, and term_get_userpass_input() to which some of
those implementations delegate). The only function that actually used
that bufchain, namely term_get_userpass_input(), now instead reads
from the ldisc's input queue via a couple of new Ldisc functions.
(Not _trivial_ functions, since input buffered by Ldisc can be a
mixture of raw bytes and session specials like SS_EOL! The input queue
inside Ldisc is a bufchain containing a fiddly binary encoding that
can represent an arbitrary interleaving of those things.)
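One simple way to represent such an interleaving in a flat byte stream
(purely illustrative; not necessarily the exact encoding Ldisc uses) is
to tag and length-prefix each chunk:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    /* Each record is a 1-byte tag, then either a 2-byte big-endian
     * length followed by that many raw bytes, or a 1-byte code for a
     * session special (a stand-in for things like SS_EOL). */
    enum { REC_RAW = 0, REC_SPECIAL = 1 };

    size_t encode_raw(uint8_t *out, const uint8_t *data, uint16_t len)
    {
        out[0] = REC_RAW;
        out[1] = (uint8_t)(len >> 8);
        out[2] = (uint8_t)(len & 0xFF);
        memcpy(out + 3, data, len);
        return 3 + (size_t)len;
    }

    size_t encode_special(uint8_t *out, uint8_t special_code)
    {
        out[0] = REC_SPECIAL;
        out[1] = special_code;
        return 2;
    }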
This greatly simplifies the calls to seat_get_userpass_input in
backends, which now don't have to mess about with passing their own
user_input bufchain around, or toggling their want_user_input flag
back and forth to request data to put on to that bufchain.
But the flip side is that now there has to be some _other_ method for
notifying the terminal when there's more input to be consumed during
an interactive prompt, and for notifying the backend when prompt input
has finished so that it can proceed to the next stage of the protocol.
This is done by a pair of extra callbacks: when more data is put on to
Ldisc's input queue, it triggers a call to term_get_userpass_input,
and when term_get_userpass_input finishes, it calls a callback
function provided in the prompts_t.
Therefore, any use of a prompts_t which *might* be asynchronous must
fill in the latter callback when setting up the prompts_t. In SSH, the
callback is centralised into a common PPL helper function, which
reinvokes the same PPL's process_queue coroutine; in rlogin we have to
set it up ourselves.
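Schematically, the wiring looks something like this (field and function
names invented for the sketch):

    /* A prompts-like structure carries a completion callback, invoked
     * once the user has answered (or aborted) all the prompts. */
    typedef struct Prompts {
        void (*callback)(void *ctx);
        void *callback_ctx;
    } Prompts;

    /* In an SSH packet protocol layer, the callback can simply requeue
     * the layer's main coroutine: */
    typedef struct Layer { Prompts *prompts; /* ... */ } Layer;

    static void layer_process_queue(Layer *layer) { /* resume protocol */ }

    static void layer_prompts_done(void *ctx)
    {
        layer_process_queue((Layer *)ctx);
    }

    static void layer_setup_prompts(Layer *layer, Prompts *p)
    {
        p->callback = layer_prompts_done;
        p->callback_ctx = layer;
        layer->prompts = p;
    }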
I'm sorry for this large and sprawling patch: I tried fairly hard to
break it up into individually comprehensible sub-patches, but I just
couldn't tease out any part of it that would stand sensibly alone.
I've introduced a function ldisc_notify_sendok(), which backends
should call on their ldisc (if they have one) when anything changes
that might cause backend_sendok() to start returning true.
At the moment, the function does nothing. But in future, I'm going to
make ldisc start buffering typed-ahead input data not yet sent to the
backend, and then the effect of this function will be to trigger
flushing all that data into the backend.
Backends only have to call this function if sendok was previously
false: backends requiring no network connection stage (like pty and
serial) can safely return true from sendok, and in that case, they
don't also have to immediately call this function.
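So the expected calling pattern in a backend is roughly this (sketch;
the backend structure is invented, and ldisc_notify_sendok is the new
function described above):

    struct Ldisc;
    void ldisc_notify_sendok(struct Ldisc *ldisc);   /* the new function */

    struct MyBackend {
        struct Ldisc *ldisc;         /* may be NULL in the CLI tools */
        int session_ready;
    };

    int mybackend_sendok(struct MyBackend *be)
    {
        return be->session_ready;
    }

    void mybackend_connection_ready(struct MyBackend *be)
    {
        be->session_ready = 1;
        /* sendok() has just gone from false to true, so poke the ldisc
         * so that (in future) it can flush buffered typed-ahead input */
        if (be->ldisc)
            ldisc_notify_sendok(be->ldisc);
    }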
I did a horrible thing with a list macro which builds up a 256-bit
bitmap of known SSH-2 message types at compile time, by means of
evaluating a conditional expression per known message type and per
bitmap word which boils down to (in pseudocode)
(shift count in range ? 1 << shift count : 0)
I think this is perfectly valid C. If the shift count is out of range,
then the use of the << operator in the true branch of the ?: would
have undefined behaviour if it were executed - but that's OK, because
in that situation, the safe false branch is executed instead.
But when the whole thing is a compile-time evaluated constant
expression, the compiler can prove statically that the << in the true
branch is an out-of-range shift, and at least some compilers will warn
about it verbosely. The same compiler *could* also prove statically
that that branch isn't taken, and use that to suppress the warning -
but at least clang does not.
The solution is the same one I used in shift_right_by_one_word and
shift_left_by_one_word in mpint.c: inside the true branch, nest a
second conditional expression which coerces the shift count to always
be in range, by setting it to 0 if it's not. This doesn't affect the
output, because the only cases in which the output of the true branch
is altered by this transformation are the ones in which the true
branch wasn't taken anyway.
So this change should make no difference to the output of this macro
construction, but it suppresses about 350 pointless warnings from
clang.
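Reduced to a standalone example, the pattern and the fix look like this
(the macro names are invented; only the nested-conditional trick is the
point):

    #include <stdint.h>

    /* Warning-prone form: the '<<' in the true branch is statically an
     * out-of-range shift whenever BIT doesn't fall in WORD, even though
     * that branch is never evaluated in those cases. */
    #define WORD_BIT_NAIVE(BIT, WORD) \
        ((BIT) >= 32*(WORD) && (BIT) < 32*((WORD)+1) ? \
         1u << ((BIT) - 32*(WORD)) : 0u)

    /* Fixed form: a second, nested conditional coerces the shift count
     * into range. It only changes the shift count in cases where the
     * outer conditional discards the result anyway, so the output is
     * identical - but no branch now contains a provably bad shift. */
    #define WORD_BIT(BIT, WORD) \
        ((BIT) >= 32*(WORD) && (BIT) < 32*((WORD)+1) ? \
         1u << ((BIT) >= 32*(WORD) && (BIT) < 32*((WORD)+1) ? \
                (BIT) - 32*(WORD) : 0) : 0u)

    /* e.g. message type 37 sets bit 5 of word 1, and contributes 0 to
     * word 0 without provoking an out-of-range-shift warning */
    static const uint32_t bitmap_word0 = WORD_BIT(37, 0);   /* 0 */
    static const uint32_t bitmap_word1 = WORD_BIT(37, 1);   /* 0x20 */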
While re-testing this backend I realised that we were completely
ignoring the actual return status from seat_get_userpass_input, once
it stops being -1 ("prompt still in progress"). So if the user hits ^C
or ^D at the prompt, e.g. after realising they've started PuTTY in the
wrong mode by mistake, then we press ahead anyway and send a nonsense
username.
Now we interpret that as a request to close the connection.
This gives more sensible handling of failed network connections: if
the connection _doesn't_ get made, it's pointless to ask the user for
a login username anyway, and worse still, you'd probably end up giving
the connection-failure error message while the user was in the middle
of answering the username prompt.
But more importantly, when proxy systems start being able to present
their own interactive prompts, it's vital that no backend should
already be presenting a prompt while in the process of setting up a
network connection.
This puts the previous commit's framework to practical use. Now the
main new_connection() passes its Seat ** through to the SshProxy setup
function, which (if the stars align) will actually use it: stash it,
return a TempSeat wrapper on it for the main backend to use in the
interim, and pass through the GUI dialog prompts for host key
confirmation and weak-crypto warnings.
This is unfinished at the UI end: those dialog prompts will now need
to be much clearer about which SSH server they're talking to (since
now there could be two involved), and I haven't made that change yet.
I haven't attempted to deal with get_userpass_input yet, though.
That's much harder, and I'm still working on it.
This is working towards allowing the subsidiary SSH connection in an
SshProxy to share the main user-facing Seat, so as to be able to pass
through interactive prompts.
This is more difficult than the similar change with LogPolicy, because
Seats are stateful. In particular, the trust-sigil status will need to
be controlled by the SshProxy until it's ready to pass over control to
the main SSH (or whatever) connection.
To make this work, I've introduced a thing called a TempSeat, which is
(yet) another Seat implementation. When a backend hands its Seat to
new_connection(), it does it in a way that allows new_connection() to
borrow it completely, and replace it in the main backend structure
with a TempSeat, which acts as a temporary placeholder. If the main
backend tries to do things like changing trust status or sending
output, the TempSeat will buffer them; later on, when the connection
is established, TempSeat will replay the changes into the real Seat.
So, in each backend, I've made the following changes:
 - pass &foo->seat to new_connection, which may overwrite it with a
   TempSeat.
 - if it has done so (which we can tell via the is_tempseat() query
   function), then we have to free the TempSeat and reinstate our main
   Seat. The signal that we can do so is the PLUGLOG_CONNECT_SUCCESS
   notification, which indicates that SshProxy has finished all its
   connection setup work.
 - we also have to remember to free the TempSeat if our backend is
   disposed of without that having happened (e.g. because the
   connection _doesn't_ succeed).
 - in backends which have no local auth phase to worry about, ensure
   we don't call seat_set_trust_status on the main Seat _before_ it
   gets potentially replaced with a TempSeat. Moved some calls of
   seat_set_trust_status to just after new_connection(), so that now
   the initial trust status setup will go into the TempSeat (if
   appropriate) and be buffered until that seat is relinquished.
In all other uses of new_connection, where we don't have a Seat
available at all, we just pass NULL.
This is NFC, because neither new_connection() nor any of its delegates
will _actually_ do this replacement yet. We're just setting up the
framework to enable it to do so in the next commit.
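Putting the above together, the per-backend pattern looks roughly like
this (schematic sketch; the stand-in declarations simplify the real
signatures, and tempseat_get_real/tempseat_free are hypothetical names
for 'recover the real Seat' and 'free the TempSeat'):

    typedef struct Seat Seat;
    typedef struct Socket Socket;

    Socket *new_connection(Seat **seat_ptr /* , addr, conf, ... */);
    int is_tempseat(Seat *seat);
    Seat *tempseat_get_real(Seat *seat);
    void tempseat_free(Seat *seat);
    void seat_set_trust_status(Seat *seat, int trusted);

    struct Backend { Seat *seat; Socket *socket; };

    void backend_connect(struct Backend *be)
    {
        /* new_connection may replace be->seat with a TempSeat */
        be->socket = new_connection(&be->seat);
        /* initial trust-status setup happens afterwards, so it lands in
         * the TempSeat (if any) and is buffered until it's relinquished */
        seat_set_trust_status(be->seat, 1);
    }

    void backend_connect_success(struct Backend *be)
    {
        /* on PLUGLOG_CONNECT_SUCCESS: SshProxy has finished its setup */
        if (is_tempseat(be->seat)) {
            Seat *real = tempseat_get_real(be->seat);
            tempseat_free(be->seat);
            be->seat = real;
        }
    }

    void backend_free(struct Backend *be)
    {
        if (is_tempseat(be->seat))
            tempseat_free(be->seat);    /* connection never succeeded */
    }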
Now new_connection() takes an optional LogPolicy * argument, and
passes it on to the SshProxy setup. This means that SshProxy's
implementation of the LogPolicy trait can answer queries like
askappend() and logging_error() by passing them on to the same
LogPolicy used by the main backend.
Not all callers of new_connection have a LogPolicy, so we still have
to fall back to the previous conservative default behaviour if
SshProxy doesn't have a LogPolicy it can ask.
The main backend implementations didn't _quite_ have access to a
LogPolicy already, but they do have a LogContext, which has a
LogPolicy vtable pointer inside it; so I've added a query function
log_get_policy() which allows them to extract that pointer to pass to
new_connection.
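In caller terms the change is small; something like this sketch (stand-in
declarations, with most of new_connection's arguments elided):

    typedef struct LogContext LogContext;
    typedef struct LogPolicy LogPolicy;
    typedef struct Socket Socket;

    LogPolicy *log_get_policy(LogContext *ctx);             /* new query */
    Socket *new_connection(/* addr, conf, ..., */ LogPolicy *lp);

    Socket *backend_open_connection(LogContext *logctx)
    {
        return new_connection(/* ..., */ log_get_policy(logctx));
    }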
This is the first step of fixing the non-interactivity limitations of
SshProxy. But it's also the easiest step: the next ones will be more
involved.
In the case where these socket types are constructed because of a
local proxy command, we do actually have a SockAddr representing the
logical host we were trying to make a connection to. So we might as
well store it in the socket implementation, and then we can include it
in the PLUGLOG_CONNECT_SUCCESS call to make the log message more
informative.
Now that the non-SSH backends critically depend on it, it's important
not to forget to send it, for any socket type that's going to be used for
any of those backends. But ProxySocket, and the Unix and Windows
'socket' types wrapping pipes to local subprocesses, were not doing
so.
Some of these socket types don't have a SockAddr available to
represent the destination host. (Sometimes the concept isn't even
meaningful). Therefore, I've also expanded the semantics of
PLUGLOG_CONNECT_SUCCESS so that the addr parameter is allowed to be
NULL, and invented a noncommittal fallback version of the log message
in that situation.
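So the receiving end of that notification can now look like this sketch
(the message text and the address-formatting helper are illustrative
stand-ins):

    #include <stdio.h>

    typedef struct SockAddr SockAddr;
    const char *addr_to_string(SockAddr *addr);   /* stand-in helper */

    void log_connect_success(SockAddr *addr)
    {
        if (addr)
            printf("Connected to %s\n", addr_to_string(addr));
        else
            printf("Connected\n");    /* no meaningful address available */
    }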
The call to plug_closing very likely destroys the FdSocket entirely,
so we shouldn't wait until after that call to clean up its input fd:
the cleanup involves lots of dereferences of a structure that may just
have been freed.
While re-testing the other backends, I noticed that their log messages
display really badly in the GTK Event Log, which apparently interprets
tabs as meaning 'next table column', and assumes that any line not
containing a tab must be entirely in the leftmost column. So all the
Telnet negotiations end up miles off to the right, beyond the longest
other line in the entire log.
Replaced the tabs with a bit more verbiage.
All four of the other network-protocol backends (Raw, Telnet, rlogin
and SUPDUP) now have a 'socket_connected' flag, which starts off false
and is set to true when (if) their Socket signals that the connection
attempt has succeeded.
This field is used to tell backend_socket_log whether the session has
started yet (hence, whether it should still be logging messages from
the proxy). This replaces various ad-hoc answers to that question in
each backend, which were the best I could do when sockets didn't
notify connection success. Now they do, we can do it properly.
Also, the new flag controls the answer to each backend's sendok()
method, which makes them all satisfy a new policy rule: no backend
shall return true from sendok() while its network connection attempt
is still ongoing.
(Rationale: the network connection attempt may in future involve a
proxy implementation interacting with the user via the terminal, and
it can't do that if the backend is already consuming all the terminal
input.)
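The shape of that, per backend, is roughly (sketch with invented names):

    typedef struct Raw {
        int socket_connected;         /* starts false */
        /* ... */
    } Raw;

    /* called from the backend's Plug log method when the Socket reports
     * PLUGLOG_CONNECT_SUCCESS */
    void raw_connection_succeeded(Raw *raw)
    {
        raw->socket_connected = 1;
    }

    /* backend_socket_log can consult the same flag to decide whether
     * proxy messages should still be routed to the Event Log */

    /* the new policy rule: never claim to be ready for input while the
     * connection attempt is still in flight */
    int raw_sendok(Raw *raw)
    {
        return raw->socket_connected;
    }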
It only had to be in a header file because ldisc.c and ldiscucs.c had to
share the structure definition. But ldiscucs.c vanished a couple of
years ago in commit 71e42b04a5, so it's now fine to make the
Ldisc structure definition local to ldisc.c itself, the way it should
be.
I just happened to notice that just below my huge comment explaining
the two command-line splitting policies, there's a smaller one that
refers to it as '(see large comment below)'. It's not below - it's
above!
That was because the older parts of that comment had previously been
inside split_into_argv(), until I moved the explanation further up the
file to the top level. Another consequence of that was that the older
section of the comment was wrapped to a strangely narrow line width,
because it had previously been indented further right.
Folded the two comments together, and rewrapped the narrow paragraphs.
It looks as if Ben put that in when he originally wrote the file,
since it refers to a C file style called 'simon', which isn't what I
call my preferred style in _my_ Emacs preferences, but might plausibly
be what he called it in his.
It causes an annoying error every time I load the source file into
Emacs, because it refers to a nonexistent style name. And surely it
will do the same to almost all other Emacs users. Get rid of it.
This is called by the backend to notify the Seat that the connection
has progressed to the point where the main session channel (i.e. the
thing that would typically correspond to the client's stdin/stdout)
has been successfully set up.
The only Seat that implements this method nontrivially is the one in
SshProxy, which uses it as an indication that the proxied connection
to the remote host has succeeded, and sends the
PLUGLOG_CONNECT_SUCCESS notification to its own Plug.
Hence, the only backends that need to implement it at the moment are
the two SSH-shaped backends (SSH proper and bare-connection / psusan).
For other backends, it's not always obvious what 'main session
channel' would even mean, or whether it means anything very useful; so
I've also introduced a backend flag indicating whether the backend is
expecting to call that method at all, so as not to have to spend
pointless effort on defining an arbitrary meaning for it in other
contexts.
So a lot of this patch is just introducing the new method and putting
its trivial do-nothing implementation into all the existing Seat
implementations. The interesting parts happen in ssh/mainchan.c (which
actually calls it), and sshproxy.c (which does something useful in
response).
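Schematically (the method name below is a placeholder, since the text
above doesn't pin down what the new Seat method is actually called):

    typedef struct Seat Seat;
    typedef struct Plug Plug;

    void seat_session_started(Seat *seat);        /* placeholder name */
    void plug_log_connect_success(Plug *plug);    /* stand-in for the
                                                     PLUGLOG notification */

    /* in ssh/mainchan.c, once the main session channel is open: */
    void mainchan_ready(Seat *seat)
    {
        seat_session_started(seat);
    }

    /* most Seats implement it as a no-op; the SshProxy Seat forwards
     * the news to its own Plug */
    void sshproxy_session_started(Seat *seat, Plug *plug)
    {
        plug_log_connect_success(plug);
    }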
On a similar theme of separating the query operation from the
attempted change, backend_send() now no longer has the side effect of
returning the current size of the send buffer. Instead, you have to
call backend_sendbuffer() every time you want to know that.
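So caller code changes along these lines (sketch; stand-in declarations):

    #include <stddef.h>

    typedef struct Backend Backend;
    void backend_send(Backend *be, const char *data, size_t len);
    size_t backend_sendbuffer(Backend *be);

    size_t send_and_check(Backend *be, const char *data, size_t len)
    {
        /* previously: bufsize = backend_send(be, data, len); */
        backend_send(be, data, len);
        return backend_sendbuffer(be);    /* now a separate query */
    }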
This complicates the API in one sense (more separate functions), but
in another sense, simplifies it (each function does something
simpler). When I start putting one Seat in front of another during SSH
proxying, the latter will be more important - in particular, it means
you can find out _whether_ a seat can support changing trust status
without having to actually attempt a destructive modification.
Going through all the backends' send() and sendbuffer() routines, I
noticed that the Unix pty backend is the only one where the return
value from send() doesn't match what sendbuffer() would tell you,
apparently because sendbuffer() was a stub implementation that I never
got round to filling in properly.
But pty masters _can_ back up, and if they do, we should return the
appropriate data.
The caller of new_connection has relinquished ownership of the
SockAddr it passes in. So the receiver of that SockAddr must remember
to free it, or else we leak memory.
(Additionally, this means SshProxy will be able to remember the
address during its run, e.g. to use in calls to its Plug. But that's
not implemented yet.)
Analogous to the bug I just fixed in xtruss: in the loop that tries to
find a reasonable port number for an X display, the sense of the
(horrible) strcmp distinguishing EADDRINUSE from other socket errors
was backwards.
It was introduced in error by commit d8fda3b6da: I had originally
intended to do test runs of prime generation by means of running every
attempt with logging enabled, and after each failed attempt, deleting
the log file and restarting it. But that was _far_ too slow, so I
abandoned that approach, and switched to the alternative method of
searching for a prime with logging turned off, and then repeating just
the final successful attempt under logging conditions.
log_discard() was the 'delete the log file' function intended for use
in the first of those strategies. It wasn't actually used in the end,
so I need not have committed it - and worse, the comment inside it
claiming it _is_ used in prime generation is needlessly confusing!
In the section about our ad-hoc trait idioms, I described a code
sample as containing a set of 'static inline' wrapper functions, which
indeed it should have done - but I forgot to put the 'inline' keyword
in the code sample itself.
I ran across their defining RFCs recently and noticed that each one
provides an explicit mathematical expression for the prime (since each
one is derived from the expansion of pi, with framing FFs and a
correction term to make it actually prime).
Those expressions can be re-evaluated trivially by spigot, so it seems
reasonable to add those spigot commands in comments. This also means
the comments contain citations for these primes in actual standards,
including both the hex digits and the mathematical expressions.
Not sure how I hadn't needed that before! Obviously, if I have a test
program that can exercise all the prime generation systems, it should
include _all_ of them.
Now that I've removed side-channel leakage from both prime candidate
generation (via mp_unsafe_mod_integer) and Miller-Rabin, the
probabilistic prime generation system in this code base is now able to
get through testsc without it detecting any source of cache or timing
side channels. So you should be able to generate an RSA key (in which
the primes themselves must be secret) in a more hostile environment
than you could previously be confident of.
This is a bit counterintuitive, because _obviously_ random prime
generation takes a variable amount of time, because it has to keep
retrying until an attempt succeeds! But that's OK as long as the
attempts are completely independent, because then any timing or cache
information leaked by a _failed_ attempt will only tell an attacker
about the numbers used in the failed attempt, and those numbers have
been thrown away, so it doesn't matter who knows them. It's only
important that the _successful_ attempt, from generating the random
candidate through to completing its verification as (probably) prime,
should be side-channel clean, because that's the attempt whose data is
actually going to be turned into a private key that needs to be kept
secret.
(In particular, this means you have to avoid the old-fashioned
strategy of generating successive prime candidates by incrementing a
starting value until you find something not divisible by any small
prime, because the number of iterations of that method would be a
timing leak. Happily, we stopped doing that last year, in commit
08a3547bc5: now every candidate integer is generated
independently, and if one fails the initial checks, we throw it away
and start completely from scratch with a fresh random value.)
So the test harness works by repeatedly running the prime generator in
one-shot mode until an attempt succeeds, and then resetting the
random-number stream to where it was just before the successful
attempt. Then we generate the same prime number again, this time with
the sclog mechanism turned on - and then, we compare it against the
version we previously generated with the same random numbers, to make
sure they're the same. This checks that the attempts really _are_
independent, in the sense that the prime generator is a pure function
of its random input stream, and doesn't depend on state left over from
previous attempts.
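In outline, the harness loop is something like this (pseudo-C with
placeholder names, not the real testsc code):

    #include <stdlib.h>

    typedef struct RngState RngState;
    typedef struct Prime Prime;
    RngState *rng_save_state(void);
    void rng_restore_state(RngState *state);
    Prime *generate_prime_oneshot(void);   /* one attempt; NULL on failure */
    int prime_equal(Prime *a, Prime *b);
    void sclog_enable(void);
    void sclog_disable(void);

    void test_prime_generation(void)
    {
        for (;;) {
            RngState *saved = rng_save_state();
            Prime *p = generate_prime_oneshot();    /* logging off */
            if (!p)
                continue;           /* failed attempt: discard and retry */

            rng_restore_state(saved);    /* rewind to just before the
                                            successful attempt */
            sclog_enable();              /* repeat it, logging this time */
            Prime *q = generate_prime_oneshot();
            sclog_disable();

            /* the generator must be a pure function of its random input
             * stream, so the two runs must produce the same prime */
            if (!q || !prime_equal(p, q))
                abort();
            break;
        }
    }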
I had a testsc run fail because of alignment-dependent control flow
divergence in a glibc function with 'memmove' in the name, which
appears to have been an accident of different memory allocation
between two runs of the test in question.
sclog was already giving special handling to memset for the same
reason, so it's no trouble to add memmove to the same list of
functions that are treated as an opaque primitive for logging
purposes.
Previously, it would generate a prime candidate, test it, and abort if
that candidate failed to be prime. Now it's willing to fail even
_before_ generating a prime candidate, if the very first attempt to
construct one is unsuccessful.
This doesn't affect the existing use case of pcs_set_oneshot, which is
during generation of a safe prime (as implemented by test/primegen.py
--safe), where you want to make a PrimeCandidateSource that can only
return 2p+1 for your existing prime p, and then abort if that fails
the next step of testing. In that situation, the PrimeCandidateSource
will never fail to generate its first output anyway.
But these changed semantics will become useful in another use I'm
about to find for one-shot mode.
Thanks to Mark Wooding for explaining the method of doing this. At
first glance it seemed _obviously_ impossible to run an algorithm that
needs an iteration per factor of 2 in p-1, without a timing leak
giving away the number of factors of 2 in p-1. But it's not, because
you can do the M-R checks interleaved with each step of your whole
modular exponentiation, and they're cheap enough that you can do them
in _every_ step, even the ones where the exponent is too small for M-R
to be interested in yet, and then do bitwise masking to exclude the
spurious results from the final output.
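Here's a toy standalone illustration of the masking idea using 64-bit
integers (the real code of course works on mp_ints in Montgomery form
and computes its masks without branching; this just shows why doing the
checks at every squaring step, and masking out the irrelevant ones,
gives the same answer as ordinary Miller-Rabin):

    #include <stdint.h>
    #include <stdio.h>

    /* toy modular multiply: assumes p < 2^32 so the product fits */
    static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t p)
    {
        return (a * b) % p;
    }

    /* One M-R round against witness w, for odd p > 2. */
    static int mr_round(uint64_t p, uint64_t w)
    {
        uint64_t e = (p - 1) >> 1;    /* (p-1)/2 = d * 2^(s-1), d odd */
        unsigned s = 0;
        for (uint64_t t = p - 1; !(t & 1); t >>= 1)
            s++;                      /* p-1 = d * 2^s */

        int topbit = 63;
        while (!((e >> topbit) & 1))
            topbit--;

        uint64_t x = 1;
        unsigned passed = 0;
        for (int j = topbit; j >= 0; j--) {
            x = mulmod(x, x, p);
            if ((e >> j) & 1)
                x = mulmod(x, w, p);
            /* Now x == w^(e >> j). This step matters to M-R only when
             * e >> j == d * 2^k for some k in [0, s-1], i.e. j < s; and
             * x == 1 only counts at the step where the exponent is
             * exactly d, i.e. j == s-1. Everything else is masked out. */
            unsigned relevant = ((unsigned)j < s);
            unsigned exponent_is_d = ((unsigned)j == s - 1);
            passed |= relevant & (x == p - 1);
            passed |= exponent_is_d & (x == 1);
        }
        return passed != 0;    /* nonzero: p survives this round */
    }

    int main(void)
    {
        /* expect 1 1 0: 13 is prime, 25 is a strong pseudoprime to
         * base 7, and 15 is caught */
        printf("%d %d %d\n", mr_round(13, 2), mr_round(25, 7),
               mr_round(15, 2));
        return 0;
    }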
I'm about to rewrite the Miller-Rabin testing code, so let's start by
introducing a test suite that the old version passes, and then I can
make sure the new one does too.
I've moved it from mpunsafe.c into the main mpint.c, and renamed it
mp_mod_known_integer, because now it manages to avoid leaking
information about the mp_int you give it.
It can still potentially leak information about the small _modulus_
integer - hence the word 'known' in the new function name. This won't
be a problem in any existing use of the function, because it's used
during prime generation to check divisibility by all the small primes,
and optionally also check for residue 1 mod the RSA public exponent.
But all those values are well known and not secret.
This removes one source of side-channel leakage from prime generation.
In most Halibut man pages I write, I have a standard convention of
referring to another man page by wrapping the page name in \cw and the
section number in \e, leaving the parentheses un-marked-up. Apparently
I forgot in this particular collection.
When UML terminates, it kills its entire process group. The way PuTTY
invokes proxy processes, they are part of its process group. So if UML
is used directly as the proxy process, it will commit patricide on
termination.
Wrapping it in 'setsid' is overkill (it doesn't need to be part of a
separate _session_, only a separate pgrp), but it's good enough to
work around this problem, and give PuTTY the opportunity to shut down
cleanly when the UML it's talking to vanishes.
Ian Jackson recently tried to use the recipe in the psusan manpage for
talking to UML, and found that the connection was not successfully set
up, because at some point during startup, UML read the SSH greeting
(ok, the bare-ssh-connection greeting) from its input fd and threw it
away. So by the time psusan was run by the guest init process, the
greeting wasn't there to be read.
Ian's report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=991958
I was also able to reproduce this locally, which makes me wonder why I
_didn't_ notice it when I originally wrote that part of the psusan man
page. It worked for me before, honest! But now it doesn't.
Anyway. The ssh verstring module already has a mode switch to decide
whether we ought to send our greeting before or after waiting for the
other side's greeting (because that decision varies between client and
server, and between SSH-1 and SSH-2). So it's easy to implement an
override that forces it to 'wait for the server greeting first'.
I've added this as yet another bug workaround flag. But unlike all the
others, it can't be autodetected from the server's version string,
because, of course, we have to act on it _before_ seeing the server's
greeting and version string! So it's a manual-only flag.
However, I've mentioned it in the UML section of the psusan man page,
since that's the place where I _know_ people are likely to need to use
this flag.
Following the same pattern as the previous one (commit 6c924ba862),
except that this time, I don't have to _set up_ the pattern in the
front-end code of presenting the current and previous key details -
just change over the actual string literals in putty.h.
But the rest is the same: new keys at the top of pgpkeys.but, old ones
relegated to the historical appendix, key ids in sign.sh switched over.
When do_paint breaks up a line of terminal text into contiguous runs
of characters to treat the same, one of the criteria it uses is, 'Does
this character even need redrawing? (or is it already displayed
correctly from the previous redraw?)' When we encounter a character
that matches its previous value, we end the previous run of
characters, so that we can skip the one we've just encountered.
That check was not taking account of the 'truecolour' field of the
termchar it was checking. So it would sometimes falsely believe the
character to be equivalent to its previously drawn value, even when in
fact it was not, and hence insert a run break, anticipating that the
previous character needed drawing and the current one did not.
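In sketch form (a stand-in structure rather than the real termchar),
the fix amounts to making the 'is this cell unchanged?' comparison take
the truecolour field into account as well:

    struct cell {
        unsigned long chr, attr;
        struct { int enabled; unsigned char r, g, b; } truecolour;
    };

    static int cell_unchanged(const struct cell *a, const struct cell *b)
    {
        return a->chr == b->chr && a->attr == b->attr &&
            a->truecolour.enabled == b->truecolour.enabled &&
            a->truecolour.r == b->truecolour.r &&
            a->truecolour.g == b->truecolour.g &&
            a->truecolour.b == b->truecolour.b;   /* previously ignored */
    }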
This didn't cause a _wrong_ redraw, because there's a separate loop
further on which re-checks whether to actually draw things, which
didn't make the same error. So the character that loop #1 thought
didn't need a redraw, loop #2 knew _did_ need a redraw, and hence,
everything did get redrawn.
But by the time loop #2 is running, it's too late to change the run
boundaries. So everything does get redrawn, but in much smaller chunks
than it could have been. The net effect was that if the screen was
filled with text displayed in true colour, and you changed it to the
_same_ text in a different colour, then the whole terminal would be
redrawn in one-character increments instead of the usual behaviour of
folding together runs that can be drawn in one go.
Thanks to Bradley Smith for debugging this very confusing issue!