DIT, for 'Data-Independent Timing', is a bit you can set in the
processor state on sufficiently new Arm CPUs, which promises that a
long list of instructions will deliberately avoid varying their timing
based on the input register values. Just what you want for keeping
your constant-time crypto primitives constant-time.
As far as I'm aware, no CPU has _yet_ implemented any data-dependent
optimisations, so DIT is a safety precaution against them doing so in
future. It would be embarrassing to be caught without it if a future
CPU does do that, so we now turn on DIT in the PuTTY process state.
I've put a call to the new enable_dit() function at the start of every
main() and WinMain() belonging to a program that might do
cryptography (even testcrypt, in case someone uses it for something!),
and in case I missed one there, also added a second call at the first
moment that any cryptography-using part of the code looks as if it
might become active: when an instance of the SSH protocol object is
configured, when the system PRNG is initialised, and when selecting
any cryptographic authentication protocol in an HTTP or SOCKS proxy
connection. With any luck those precautions between them should ensure
it's on whenever we need it.
Arm's own recommendation is that you should carefully choose the
granularity at which you enable and disable DIT: there's a potential
time cost to turning it on and off (I'm not sure what, but plausibly
something of the order of a pipeline flush), so it's a performance hit
to do it _inside_ each individual crypto function, but if CPUs start
supporting significant data-dependent optimisation in future, then it
will also become a noticeable performance hit to just leave it on
across the whole process. So you'd like to do it somewhere in the
middle: for example, you might turn on DIT once around the whole
process of verifying and decrypting an SSH packet, instead of once for
decryption and once for MAC.
With all respect to that recommendation as a strategy for maximum
performance, I'm not following it here. I turn on DIT at the start of
the PuTTY process, and then leave it on. Rationale:
1. PuTTY is not otherwise a performance-critical application: it's
not likely to max out your CPU for any purpose _other_ than
cryptography. The most CPU-intensive non-cryptographic thing I can
imagine a PuTTY process doing is the complicated computation of
font rendering in the terminal, and that will normally be cached
(you don't recompute each glyph from its outline and hints for
every time you display it).
2. I think a bigger risk lies in accidental side channels from having
DIT turned off when it should have been on. I can imagine lots of
causes for that. Missing a crypto operation in some unswept corner
of the code; confusing control flow (like my coroutine macros)
jumping with DIT clear into the middle of a region of code that
expected DIT to have been set at the beginning; having a reference
counter of DIT requests and getting it out of sync.
In a more sophisticated programming language, it might be possible to
avoid the risk in #2 by cleverness with the type system. For example,
in Rust, you could have a zero-sized type that acts as a proof token
for DIT being enabled (it would be constructed by a function that also
sets DIT, have a Drop implementation that clears DIT, and be !Send so
you couldn't use it in a thread other than the one where DIT was set),
and then you could require all the actual crypto functions to take a
DitToken as an extra parameter, at zero runtime cost. Then "oops I
forgot to set DIT around this piece of crypto" would become a compile
error. Even so, you'd have to take some care with coroutine-structured
code (what happens if a Rust async function yields while holding a DIT
token?) and with nesting (if you have two DIT tokens, you don't want
dropping the inner one to clear DIT while the outer one is still there
to wrongly convince callees that it's set). Maybe in Rust you could
get this all to work reliably. But not in C!
DIT is an optional feature of the Arm architecture, so we must first
test to see if it's supported. This is done the same way as we already
do for the various Arm crypto accelerators: on ELF-based systems,
check the appropriate bit in the 'hwcap' words in the ELF aux vector;
on Mac, look for an appropriate sysctl flag.
On Windows I don't know of a way to query the DIT feature, _or_ of a
way to write the necessary enabling instruction in an MSVC-compatible
way. I've _heard_ that it might not be necessary, because Windows
might just turn on DIT unconditionally and leave it on, in an even
more extreme version of my own strategy. I don't have a source for
that - I heard it by word of mouth - but I _hope_ it's true, because
that would suit me very well! Certainly I can't write code to enable
DIT without knowing (a) how to do it, (b) how to know if it's safe.
Nonetheless, I've put the enable_dit() call in all the right places in
the Windows main programs as well as the Unix and cross-platform code,
so that if I later find out that I _can_ put in an explicit enable of
DIT in some way, I'll only have to arrange to set HAVE_ARM_DIT and
compile the enable_dit() function appropriately.
I was dubious about it to begin with, when I found that RFC 7616's
example seemed to be treating it as a 256-bit truncation of SHA-512,
and not the thing FIPS 180-4 section 6.7 specifies as "SHA-512/256"
(which also changes the initial hash state). Having failed to get a
clarifying response from the RFC authors, I had the idea this morning
of testing other HTTP clients to see what _they_ thought that hash
function meant, and then at least I could go with an existing
in-practice consensus.
There is no in-practice consensus. Firefox doesn't support that
algorithm at all (but they do support SHA-256); wget doesn't support
anything that RFC 7616 added to the original RFC 2617. But the prize
for weirdness goes to curl, which does accept the name "SHA-512-256"
and ... treats it as an alias for SHA-256!
So I think the situation among real clients is too confusing to even
try to work with, and I'm going to stop adding to it. PuTTY will
follow Firefox's policy: if a proxy server asks for SHA-256 digests
we'll happily provide them, but if they ask for SHA-512-256 we'll
refuse on the grounds that it's not clear enough what it means.
In http.c, this drops in reasonably neatly alongside the existing
support for Basic, now that we're waiting for an initial 407 response
from the proxy to tell us which auth mechanism it would prefer to use.
The rest of this patch is mostly contriving to add testcrypt support
for the function in cproxy.c that generates the complicated output
header to go in the HTTP request: you need about a dozen assorted
parameters, the actual response hash has two more hashes in its
preimage, and there's even an option to hash the username as well if
necessary. Much more complicated than CHAP (which is just plain
HMAC-MD5), so it needs testing!
Happily, RFC 7616 comes with some reasonably useful test cases, and
I've managed to transcribe them directly into cryptsuite.py and
demonstrate that my response-generator agrees with them.
End-to-end testing of the whole system was done against Squid 4.13
(specifically, the squid package in Debian bullseye, version 4.13-10).
Previously, the proxy negotiation functions were written as explicit
state machines, with ps->state being manually set to a sequence of
positive integer values which would be tested by if statements in the
next call to the same negotiation function.
That's not how this code base likes to do things! We have a coroutine
system to allow those state machines to be implicit rather than
explicit, so that we can use ordinary control flow statements like
while loops. Reorganised each proxy negotiation function into a
coroutine-based system like that.
While I'm at it, I've also moved each proxy negotiator out into its
own source file, to make proxy.c less overcrowded and monolithic. And
_that_ gave me the opportunity to define each negotiator as an
implementation of a trait rather than as a single function - which
means now each one can define its own local variables and have its own
cleanup function, instead of all of them having to share the variables
inside the main ProxySocket struct.
In the new coroutine system, negotiators don't have to worry about the
mechanics of actually sending data down the underlying Socket any
more. The negotiator coroutine just appends to a bufchain (via a
provided bufchain_sink), and after every call to the coroutine,
central code in proxy.c transfers the data to the Socket itself. This
avoids a lot of intermediate allocations within the negotiators, which
previously kept having to make temporary strbufs or arrays in order to
have something to point an sk_write() at; now they can just put
formatted data directly into the output bufchain via the marshal.h
interface.
In this version of the code, I've also moved most of the SOCKS5 CHAP
implementation from cproxy.c into socks5.c, so that it can sit in the
same coroutine as the rest of the proxy negotiation control flow.
That's because calling a sub-coroutine (co-subroutine?) is awkward to
set up (though it is _possible_ - we do SSH-2 kex that way), and
there's no real need to bother in this case, since the only thing that
really needs to go in cproxy.c is the actual cryptography plus a flag
to tell socks5.c whether to offer CHAP authentication in the first
place.
This seemed like a worthwhile cleanup to do while I was working on
this code anyway. Now all the magic numbers are defined in a header
file by macro names indicating their meaning, and used by both the
SOCKS client code in the proxy subdirectory and the SOCKS server code
in portfwd.c.
It's variously 'ps' and 'p' in functions that already receive one, and
it's 'ret' in the main function that initially constructs one. Let's
call it 'ps' consistently, so that the code idioms are the same
everywhere.
Having a single plug_closing() function covering various kinds of
closure is reasonably convenient from the point of view of Plug
implementations, but it's annoying for callers, who all have to fill
in pointless NULL and 0 parameters in the cases where they're not
used.
Added some inline helper functions in network.h alongside the main
plug_closing() dispatch wrappers, so that each kind of connection
closure can present a separate API for the Socket side of the
interface, without complicating the vtable for the Plug side.
Also, added OS-specific extra helpers in the Unix and Windows
directories, which centralise the job of taking an OS error code (of
whatever kind) and translating it into its error message.
In passing, this removes the horrible ad-hoc made-up error codes in
proxy.h, which is OK, because nothing checked for them anyway, and
also I'm about to do an API change to plug_closing proper that removes
the need for them.
There are quite a few of them already, and I'm about to make another
one, so let's start with a bit of tidying up.
The CMake build organisation is unchanged: I haven't put the proxy
object files into a separate library, just moved the locations of the
source files. (Organising proxying as a library would be tricky
anyway, because of the various overrides for tools that want to avoid
cryptography.)