In SSH-1, the CRC is used on sensitive data, because it takes the
place of what ought to be a MAC. This is of course hopelessly bad
security and one of the major reasons SSH-1 was replaced, but even so,
there's no need to add timing and cache side channels _as well_ as all
the other problems with it!
So I've removed the 256-entry lookup table that's the usual way to
implement CRC (in particular, the implementation given in the RFC 1662
appendix shows the same table in full). The new strategy folds in four
bits at a time, using a multiply+XOR technique to replicate the
outgoing four bits in all the right places.
In a crude timing test this gave about a factor of 2 slowdown, which
seemed surprisingly good to me - six multiplies replacing a single
table lookup? But the multiplications in each 4-bit fold are
independent of each other, so I suspect the CPU is managing to
parallelise them.
Finding even semi-official test vectors for this CRC implementation
was hard, because it turns out not to _quite_ match any of the well
known ones catalogued on the web. Its _polynomial_ is well known, but
the combination of details that go alongside it (starting state,
post-hashing transformation) is not quite the same as any other hash
I know of.
After trawling catalogue websites for a while I finally worked out
that SSH-1's CRC and RFC 1662's CRC are basically the same except for
different choices of starting value and final adjustment. And RFC
1662's CRC is common enough that there _are_ test vectors.
So I've renamed the previous crc32_compute function to crc32_ssh1,
reflecting that it seems to be its own thing unlike any other CRC;
implemented the RFC 1662 CRC as well, as an alternative tiny wrapper
on the inner crc32_update function; and exposed all three functions to
testcrypt. That lets me run standard test vectors _and_ directed tests
of the internal update routine, plus one check that crc32_ssh1 itself
does what I expect.
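To illustrate how the two variants relate, here is a minimal Python sketch. It uses a plain bit-at-a-time update loop, not the four-bit multiply+XOR fold described above (both avoid a lookup table). It also assumes that the SSH-1 variant starts from zero and applies no final adjustment; the name crc32_rfc1662 and all function bodies are illustrative, not copied from the real source.

```python
import zlib

CRC32_POLY_REFLECTED = 0xEDB88320  # bit-reversed form of 0x04C11DB7


def crc32_update(crc, data):
    """Fold each byte of data into a 32-bit CRC state, one bit at a time."""
    for byte in bytearray(data):
        crc ^= byte
        for _ in range(8):
            # -(crc & 1) is 0 or -1 (all ones), so this selects the
            # polynomial without branching on the outgoing bit.
            crc = (crc >> 1) ^ (CRC32_POLY_REFLECTED & -(crc & 1))
    return crc


def crc32_rfc1662(data):
    # RFC 1662 style: all-ones starting value, result inverted at the end.
    return crc32_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF


def crc32_ssh1(data):
    # ASSUMPTION: zero starting value and no final adjustment.
    return crc32_update(0, data)


check = crc32_rfc1662(b"123456789")
same_as_zlib = crc32_rfc1662(b"hello") == (zlib.crc32(b"hello") & 0xFFFFFFFF)
```

The RFC 1662 wrapper can be checked against any standard CRC-32 implementation (here zlib), which is what makes published test vectors usable for the shared update routine.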
While I'm here, I've also modernised the code to use uint32_t in place
of unsigned long, and ptrlen instead of separate pointer,length
arguments. And I've removed the general primer on CRC theory from the
header comment, in favour of the more specifically useful information
about _which_ CRC this is and how it matches up to anything else out
there.
(I've bowed to inevitability and put the directed CRC tests in the
'crypt' class in cryptsuite.py. Of course this is a misnomer, since
CRC isn't cryptography, but it falls into the same category in terms
of the role it plays in SSH-1, and I didn't feel like making a new
pointedly-named 'notreallycrypt' container class just for this :-)
I found some that look pretty good - in particular exercising every
entry in every S-box. These will come in useful when I finish writing
a replacement for the venerable current DES implementation.
The 128-bit example from Appendix A/B is a more useful first test case
for a new implementation than the Appendix C tests, because the
standard shows even more of the working (in particular the full set of
intermediate results from key setup).
Ahem. I went to all the effort of setting up a wrapper function that
would store the result of the first call to aes_hw_available(), and
managed to forget to make it set the flag that said it _had_ stored
the result. So the underlying query function was being called every
time.
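The shape of that bug can be sketched in Python. The class and the stand-in query function below are hypothetical; only the pattern (a cached result guarded by a "have I stored it yet" flag) comes from the description above.

```python
class CachedQuery(object):
    """Call an expensive query once and remember its answer."""

    def __init__(self, query):
        self._query = query
        self._stored = False
        self._result = None

    def __call__(self):
        if not self._stored:
            self._result = self._query()
            # Without this assignment the cache never takes effect and
            # the underlying query runs on every call.
            self._stored = True
        return self._result


calls = []


def expensive_query():
    # Hypothetical stand-in for a hardware capability check.
    calls.append(1)
    return True


hw_available = CachedQuery(expensive_query)
results = [hw_available() for _ in range(3)]
```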
I intended cryptsuite to be Python 2/3 agnostic when I first wrote it,
but of course since then I've been testing on whichever Python was
handy and not continuing to check that both actually worked.
I've decided not to trust register-controlled shift operations to be
time-constant after all. They're surely fine on nice fast machines
where everything simple takes one cycle, but stranger machines,
perhaps not. In which case, I should avoid using them in the mpint
shift operation that's supposed not to reveal the shift count.
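One way to avoid a register-controlled shift is to perform only shifts by fixed powers of two and select each result with a mask derived from one bit of the shift count. The Python sketch below shows the structure only: the function name and parameters are made up for illustration, and Python itself makes no timing guarantees.

```python
def shift_right_safe(x, shift, shift_bits, width):
    """Shift x right by `shift`, where 0 <= shift < 2**shift_bits.

    Every shift applied to x uses a fixed distance (1, 2, 4, ...), and the
    shift count only ever influences a mask, never a shift distance or a
    branch.
    """
    word_mask = (1 << width) - 1
    x &= word_mask
    for i in range(shift_bits):
        shifted = x >> (1 << i)        # fixed distance for this round
        bit = (shift >> i) & 1         # i is a public loop counter
        select = -bit & word_mask      # all ones if bit is set, else 0
        x ^= (x ^ shifted) & select    # keep x, or replace it by shifted
    return x


value = 0xDEADBEEF
all_match = all(
    shift_right_safe(value, s, 5, 32) == value >> s for s in range(32)
)
```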
The 32-bit x86 Windows build can crash with an alignment fault the
first time it tries to write into the key schedule array, because it
turns out that the x86 VS C library's malloc doesn't guarantee 16-byte
alignment on the returned block even though there is a machine type
that needs it.
To avoid having to faff with non-portable library APIs, I solve the
problem locally in aes_hw_new, by over-allocating enough to guarantee
that an aligned block of the right size must exist somewhere in the
region.
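The arithmetic behind the over-allocation is simple enough to sketch in Python. The helper names and the 240-byte size are hypothetical; the point is only that alignment - 1 spare bytes always suffice.

```python
ALIGNMENT = 16


def padding_to_align(address, alignment=ALIGNMENT):
    """Bytes to skip from `address` to reach the next multiple of `alignment`."""
    return (-address) % alignment


def overallocated_size(size, alignment=ALIGNMENT):
    """Worst-case padding is alignment - 1 bytes, so allocate that much extra."""
    return size + alignment - 1


size = 240  # hypothetical size of a key schedule in bytes
fits = all(
    (base + padding_to_align(base)) % ALIGNMENT == 0
    and padding_to_align(base) + size <= overallocated_size(size)
    for base in range(1024)
)
```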
When the user clicks 'yes' to a 'weak crypto primitive' warning, and
another such warning is pending next in line, we were failing an
assertion when ssh2transport called register_dialog() for the second
warning box, because the result callback in gtkdlg.c had not called
unregister_dialog() for the previous one yet. Now that's done before
rather than after delivering the result to the dialog's client.
This was the most complicated one of the cipher modes to get right, so
I thought I'd add a test to make sure the IV is being written out
correctly after a decryption of any number of cipher blocks.
The new explicit vtables for the hardware and software implementations
are now exposed by name in the testcrypt protocol, and cryptsuite.py
runs all the AES tests separately on both.
(When hardware AES is compiled out, ssh2_cipher_new("aes128_hw") and
similar calls will return None, and cryptsuite.py will respond by
skipping those tests.)
sshaes.c is more or less completely changed by this commit.
Firstly, I've changed the top-level structure. In the old structure,
there were three levels of indirection controlling what an encryption
function would actually do: first the ssh2_cipher vtable, then a
subsidiary set of function pointers within that to select the software
or hardware implementation, and then inside the main encryption
function, a switch on the key length to jump into the right place in
the unrolled loop of cipher rounds.
That was all a bit untidy. So now _all_ of that is done by means of
just one selection system, namely the ssh2_cipher vtable. The software
and hardware implementations of a given SSH cipher each have their own
separate vtable, e.g. ssh2_aes256_sdctr_sw and ssh2_aes256_sdctr_hw;
this allows them to have their own completely different state
structures too, and not have to try to coexist awkwardly in the same
universal AESContext with workaround code to align things correctly.
The old implementation-agnostic vtables like ssh2_aes256_sdctr still
exist, but now they're mostly empty, containing only the constructor
function, which will decide whether AES-NI is currently available and
then choose one of the other _real_ vtables to instantiate.
As well as the cleaner data representation, this also means the
vtables can have different description strings, which means the Event
Log will indicate which AES implementation is actually in use; it
means the SW and HW vtables are available for testcrypt to use
(although actually using them is left for the next commit); and in
principle it would also make it easy to support a user override for
the automatic SW/HW selection (in case anyone turns out to want one).
The AES-NI implementation has been reorganised to fit into the new
framework. One thing I've done is to de-optimise the key expansion:
instead of having a separate blazingly fast loop-unrolled key setup
function for each key length, there's now just one, which uses AES
intrinsics for the actual transformations of individual key words, but
wraps them in a common loop structure for all the key lengths which
has a clear correspondence to the cipher spec. (Sorry to throw away
your work there, Pavel, but this isn't an application where key setup
really _needs_ to be hugely fast, and I decided I prefer a version I
can understand and debug.)
The software AES implementation is also completely replaced with one
that uses a bit-sliced representation, i.e. the cipher state is split
across eight integers in such a way that each logical byte of the
state occupies a single bit in each of those integers. The S-box
lookup is done by a long string of AND and XOR operations on the eight
bits (removing the potential cache side channel from a lookup table),
and this representation allows 64 S-box lookups to be done in parallel
simply by extending those AND/XOR operations to be bitwise ones on a
whole word. So now we can perform four AES encryptions or decryptions
in parallel, at least when the cipher mode permits it (which SDCTR and
CBC decryption both do).
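The bit-sliced layout can be illustrated with a short Python sketch. The toy boolean function below is NOT the AES S-box; it is an arbitrary stand-in that shows how one AND/XOR on whole words acts on every byte at once. All names are illustrative.

```python
def bitslice(state_bytes):
    """Return 8 words; bit j of word k is bit k of byte j."""
    words = [0] * 8
    for j, byte in enumerate(bytearray(state_bytes)):
        for k in range(8):
            words[k] |= ((byte >> k) & 1) << j
    return words


def unbitslice(words, nbytes):
    """Inverse of bitslice for nbytes bytes."""
    out = bytearray(nbytes)
    for j in range(nbytes):
        for k in range(8):
            out[j] |= ((words[k] >> j) & 1) << k
    return bytes(out)


def toy_bytewise(b):
    # Toy function on one byte: bit0 ^= bit1 & bit2.
    return b ^ ((b >> 1) & (b >> 2) & 1)


def toy_sliced(words):
    # The same function applied to all bytes at once.
    w = list(words)
    w[0] ^= w[1] & w[2]
    return w


data = bytearray(range(64))  # 64 bytes, i.e. four 16-byte blocks
expected = bytes(bytearray(toy_bytewise(b) for b in data))
actual = unbitslice(toy_sliced(bitslice(data)), len(data))
```

With 64 bytes in flight, each bitwise operation in toy_sliced does the work of 64 byte-level operations, which is the effect the real S-box circuit relies on.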
The result is slower than the old implementation, but (a) not by as
much as you might think - those parallel S-boxes are surprisingly
competitive with 64 separate table lookups; (b) the compensation is
that now it should run in constant time with no data-dependent control
flow or memory addressing; and (c) in any case the really fast
hardware implementation will supersede it for most users.
The old names like ssh_aes128 and ssh_aes128_ctr reflect the SSH
protocol IDs, which is all very well, but I think a more important
principle is that it should be easy for me to remember which cipher
mode each one refers to. So I've renamed them so that they all end in
_cbc and _sdctr.
(I've left alone the string identifiers used by testcrypt, for the
moment. Perhaps I'll go back and change those later.)
No cipher construction function _currently_ returns NULL, but one's
about to start, so the testcrypt system will have to be able to cope.
This is the first time a function in the testcrypt API has had an
'opt' type as its return value rather than an argument. But it works
just the same in reverse: the wire protocol emits the special
identifier "NULL" when the optional return value is absent, and the
Python module catches that and rewrites it as Python 'None'.
All access to AES throughout the code is now done via the ssh2_cipher
vtable interface. All code that previously made direct calls to the
underlying functions (for encrypting and decrypting private key files)
now does it by instantiating an ssh2_cipher.
This removes constraints on the AES module's internal structure, and
allows me to reorganise it as much as I like.
I had forgotten that my VS implementation of BignumADC expected the
carry parameter to be a literal carry _flag_, i.e. a boolean, rather
than a full word of extra data to be added to the sum of the main
input BignumInts a,b. So in one place where I didn't need a separate
carry I had passed one of the data words in the carry slot, which
worked fine on gcc and clang, but VS normalised that argument to 1.
That looks like the only VS bug, though: now I get a clean run of
cryptsuite.py even if it's talking to a VS-built testcrypt.exe.
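The difference between the two interpretations of the carry parameter can be modelled in a few lines of Python. The function names and the 32-bit word size are assumptions for illustration; the real macro operates on BignumInt.

```python
WORD_BITS = 32
WORD_MASK = (1 << WORD_BITS) - 1


def adc_word(a, b, carry_in):
    """Add with carry, treating carry_in as a full word of extra data."""
    total = a + b + carry_in
    return total & WORD_MASK, total >> WORD_BITS


def adc_flag(a, b, carry_in):
    """Add with carry, normalising carry_in to a one-bit flag first."""
    total = a + b + (1 if carry_in else 0)
    return total & WORD_MASK, total >> WORD_BITS


# Passing a data word in the carry slot gives different sums.
as_word = adc_word(5, 7, 0x100)
as_flag = adc_flag(5, 7, 0x100)
```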
The bulk of this commit is the changes necessary to make testcrypt
compile under Visual Studio. Unfortunately, I've had to remove my
fiddly clever uses of C99 variadic macros, because Visual Studio does
something unexpected when a variadic macro's expansion puts
__VA_ARGS__ in the argument list of a further macro invocation: the
commas don't separate further arguments. In other words, if you write
    #define INNER(x,y,z) some expansion involving x, y and z
    #define OUTER(...) INNER(__VA_ARGS__)
    OUTER(1,2,3)
then gcc and clang will translate OUTER(1,2,3) into INNER(1,2,3) in
the obvious way, and the inner macro will be expanded with x=1, y=2
and z=3. But try this in Visual Studio, and you'll get the macro
parameter x expanding to the entire string 1,2,3 and the other two
empty (with warnings complaining that INNER didn't get the number of
arguments it expected).
It's hard to cite chapter and verse of the standard to say which of
those is _definitely_ right, though my reading leans towards the
gcc/clang behaviour. But I do know I can't depend on it in code that
has to compile under both!
So I've removed the system that allowed me to declare everything in
testcrypt.h as FUNC(ret,fn,arg,arg,arg), and now I have to use a
different macro for each arity (FUNC0, FUNC1, FUNC2 etc). Also, the
WRAPPED_NAME system is gone (because that too depended on the use of a
comma to shift macro arguments along by one), and now I put a custom C
wrapper around a function by simply re-#defining that function's own
name (and therefore the subsequent code has to be a little more
careful to _not_ pass functions' names between several macros before
stringifying them).
That's all a bit tedious, and commits me to a small amount of ongoing
annoyance because now I'll have to add an explicit argument count
every time I add something to testcrypt.h. But then again, perhaps it
will make the code less incomprehensible to someone trying to
understand it!
Every compiler that's so far seen testcrypt.c has tolerated me writing
'typedef enum ValueType ValueType' before actually saying what 'enum
ValueType' is, but one just pointed out that it's not actually legal
standard C to do that. Moved the typedef to after the enum.
That's a terrible name, but winutils.c was already taken. The new
source file is intended to be to winmisc.c as the new utils.c is to
misc.c: it contains all the parts that are basically safe to link into
_any_ Windows program (even standalone test things), without tying in
to the runtime infrastructure of the main tools, referring to any
other PuTTY source module, or introducing an extra Win32 API library
dependency.
When testcrypt.h lists a function argument as 'opt_val_foo', it means
that the argument is optional in the sense that the C function can
take a null pointer in place of a valid foo, and so the Python wrapper
module should accept None in the corresponding argument slot from the
client code and translate it into the special string "NULL" in the
wire protocol.
This works fine at argument translation time, but the code that reads
testcrypt.h wasn't looking at it, so if you said 'opt_val_foo_suffix'
in place of 'opt_val_foo' (indicating that that argument is optional
_and_ the C function expects it in a translated form), then the
initial pass over testcrypt.h wouldn't strip the _suffix, and would
set up data structures with mismatched type names.
The type names 'val_foo' used in testcrypt.h work on a system where a
further underscore suffix is treated as a qualifier indicating
something about how the C version of the API represents that type.
(For example, plain 'val_string' means a strbuf, but if I write
'val_string_asciz' or 'val_string_ptrlen' in testcrypt.h it will cause
the wrapper for that function to pass a char * or a ptrlen derived
from that strbuf.)
But I forgot about this when I named the type val_ssh2_cipher (and
ditto ssh1), with the effect that the testcrypt system has considered
them all along to really be called 'ssh2' and 'ssh1', and the 'cipher'
to be some irrelevant API-adaptor suffix.
This hasn't caused a bug because I didn't have any other type called
val_ssh2_something. But it was a latent one. Now renamed sensibly!
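The latent problem is easy to reproduce with a naive splitter. The Python function below is a hypothetical illustration of the naming rule, not the parser actually used, and val_ssh2cipher is only an example of a name with no inner underscore.

```python
def split_type_name(name):
    """Split 'val_foo_suffix' into ('foo', 'suffix'); 'val_foo' gives ('foo', None)."""
    prefix = "val_"
    if not name.startswith(prefix):
        raise ValueError("not a val_ type name: %r" % name)
    parts = name[len(prefix):].split("_", 1)
    base = parts[0]
    qualifier = parts[1] if len(parts) > 1 else None
    return base, qualifier


intended = split_type_name("val_string_asciz")
misparsed = split_type_name("val_ssh2_cipher")
unambiguous = split_type_name("val_ssh2cipher")
```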
This is just like assertEqual, except that I use it when I'm comparing
random-looking binary data, and if the check fails it will encode the
two differing values in hex, which is easier to read than trying to
express them as really strange-looking string literals.
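A helper of this kind could look like the following sketch. The method name assertEqualBin and the class name are assumptions; only the idea of comparing hex encodings comes from the description above.

```python
import binascii
import unittest


class HexCompareTestCase(unittest.TestCase):
    def assertEqualBin(self, x, y):
        # Compare hex encodings so that a failure message shows readable
        # hex digits instead of escaped binary string literals.
        self.assertEqual(binascii.hexlify(x), binascii.hexlify(y))


case = HexCompareTestCase()
case.assertEqualBin(b"\x00\xff", b"\x00\xff")  # passes silently

try:
    case.assertEqualBin(b"\x00\xff", b"\x00\xfe")
    message = None
except AssertionError as e:
    message = str(e)
```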
The IV-incrementing code seems to have had several bugs. It was not
propagating a carry from bit 31 to 32 of the 128-bit integer,
apparently because of having arranged the 32-bit words wrongly within
the vector register; also, when it tried to implement carry
propagation from bit 63 to 64 by checking the low 64 bits for zero, it
checked the _high_ bits instead, leading to a spurious extra addition
in the low half if the high half happened to be zero.
This must surely have been able to cause mysterious decryption
failures about once every 2^32 cipher blocks = 64GB of data
transferred. I suppose that must be a large enough number that either
no users of the snapshot builds have encountered the problem, or the
ones who did dismissed it as computers being randomly flaky.
The revised version now passes all the same tests as the software
implementation.
This tests the CBC and SDCTR modes, in all key lengths, and in
particular includes a set of SDCTR tests designed to test the
procedure for incrementing the IV as a single 128-bit integer, by
checking propagation of the carry between every pair of words.
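A reference model for the increment, and for the kind of carry-propagation inputs described above, might look like this in Python 3. The function name is illustrative, and the sketch assumes the counter is read as a big-endian 128-bit integer.

```python
def sdctr_increment(iv):
    """Treat a 16-byte IV as one big-endian 128-bit integer and add 1."""
    if len(iv) != 16:
        raise ValueError("IV must be exactly 16 bytes")
    value = (int.from_bytes(iv, "big") + 1) % (1 << 128)
    return value.to_bytes(16, "big")


# One input per boundary between adjacent 32-bit words, plus full wraparound.
carry_cases = [
    ("000000000000000000000000ffffffff", "00000000000000000000000100000000"),
    ("0000000000000000ffffffffffffffff", "00000000000000010000000000000000"),
    ("00000000ffffffffffffffffffffffff", "00000001000000000000000000000000"),
    ("ffffffffffffffffffffffffffffffff", "00000000000000000000000000000000"),
]
carries_ok = all(
    sdctr_increment(bytes.fromhex(before)) == bytes.fromhex(after)
    for before, after in carry_cases
)

# 2^32 blocks of 16 bytes each is 64 * 2^30 bytes.
bytes_between_failures = (1 << 32) * 16
```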
This makes the test suite pass if I compile with -DBIGNUM_OVERRIDE=4
to fall back to 16-bit BignumInt. In that mode, BignumInt is smaller
than 'int', which means default promotion keeps causing things to get
promoted to 'int' unexpectedly, so I had to add some casts back down.
This makes it easy to re-test the mpint functions using different word
sizes and smoke out any more confusions between integer types in
mpint.c, by recompiling with -DBIGNUM_OVERRIDE=4 or =5 or =6 (for 16-,
32- or 64-bit respectively).
BIGNUM_OVERRIDE only lets you force the size downwards, not upwards:
of course the default behaviour is to use the largest BignumInt the
ifdefs can find a way to, and they can't magic up a bigger one just
because you tell them you'd like one.
At some point in mpint.c development I switched the main macro defined
by the ifdefs from BIGNUM_INT_BITS to the new BIGNUM_INT_BITS_BITS, so
I could loop from 0 to the latter in safe bit-shift loops that test
each bit of a shift count. But I forgot to change the comment
accordingly.
This should have been done months ago in commit bf0cf984c, but I've
been indecisive about whether to keep my local dev builds in the
windows subdirectory itself or one level further down...
This enables me to control where testcrypt both reads its input and
writes its output. That in turn makes it convenient to run testcrypt
itself in a separate Unix terminal window from its client Python, by
making two named pipe files (say, 'i' and 'o'), running the client
with PUTTY_TESTCRYPT="cat o & cat > i" in its environment, and in
another window, running 'testcrypt -o o i'.
And that in turn makes it easy to attach gdb to testcrypt, so I can
easily debug its handling of whatever request the client sent.
I got the maximum shift count _completely_ wrong when trying to work
out whether each word should be compared against part of the input
uintmax_t: I measured it in bytes rather than bits _and_ applied it to
the wrong type. Ahem.
A major advantage of the new testcrypt system _not_ being written as a
native-code Python module in the usual way is that it makes it very
easy to recompile testcrypt in a non-default way, such as with -m32,
and still run the same tests via the same Python module.
But I hadn't actually _done_ that until now, and now that I do, the
test suite has picked up a couple of bugs. When computing the initial
reciprocal approximation in mp_divmod_into, I did a lot of work on
explicit uint64_t, but did it in a way that used BIGNUM_INT_BITS as
the number's bit size instead of the constant 64, and cast several
things absentmindedly to BignumInt. And because I'd only tested on a
platform where those are the same type anyway, I didn't spot it.
When I was originally designing my knockoff of Stein's algorithm, I
simplified it for my own understanding by replacing the step that
turns a into (a-b)/2 with a step that simply turned it into a-b, on
the basis that the next step would do the division by 2 in any case.
This made it easier to get my head round in the first place, and in
the initial Python prototype of the algorithm, it looked more sensible
to have two different kinds of simple step rather than one simple and
one complicated.
But actually, when it's rewritten under the constraints of time
invariance, the standard way is better, because we had to do the
computation for both kinds of step _anyway_, and this way we sometimes
make both of them useful at once instead of only ever using one.
So I've put it back to the more standard version of Stein, which is a
big improvement, because now we can run in at most 2n iterations
instead of 3n _and_ the code implementing each step is simpler. A
quick timing test suggests that modular inversion is now faster by a
factor of about 1.75.
Also, since I went to the effort of thinking up and commenting a pair
of worst-case inputs for the iteration count of Stein's algorithm, it
seems like an omission not to have made sure they were in the test
suite! Added extra tests that include 2^128-1 as a modulus and 2^127
as a value to invert.
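For reference, the combined subtract-and-halve step looks like this in a plain binary GCD. This Python sketch is deliberately simple: it computes only a GCD (not a modular inverse), it branches freely on its inputs, and it is therefore not the time-invariant version discussed above.

```python
import math


def stein_gcd(a, b):
    """Binary GCD of non-negative integers using the (b - a) / 2 step."""
    if a == 0:
        return b
    if b == 0:
        return a
    # Pull out the power of two common to both inputs.
    shift = 0
    while (a | b) & 1 == 0:
        a >>= 1
        b >>= 1
        shift += 1
    while a & 1 == 0:
        a >>= 1
    # Invariant from here on: a is odd.
    while b != 0:
        while b & 1 == 0:
            b >>= 1
        if a > b:
            a, b = b, a
        # Both are odd, so b - a is even and the halving can be done in
        # the same step as the subtraction.
        b = (b - a) >> 1
    return a << shift


agrees = all(
    stein_gcd(x, y) == math.gcd(x, y) for x in range(60) for y in range(60)
)
```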
I broke it last year in commit 4988fd410, when I made hash contexts
expose a BinarySink interface. I went round finding no end of long-
winded ways of pushing things into hash contexts, often reimplementing
some standard thing like the wire formatting of an mpint, and rewrote
them more concisely using one or two put_foo calls.
But I failed to notice that the hash preimage used in SSH-1 key
fingerprints is _not_ implementable by put_ssh1_mpint! It consists of
the two public-key integers encoded in multi-byte binary big-endian
form, but without any preceding length field at all. I must have
looked too hastily, 'recognised' it as just implementing an mpint
formatter yet again, and replaced it with put_ssh1_mpint. So SSH-1 key
fingerprints have been completely wrong in the snapshots for months.
Fixed now, and this time, added a comment to warn me in case I get the
urge to simplify the code again, and a regression test in cryptsuite.
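The two encodings differ only in the length prefix, which is easy to miss at a glance. The Python 3 sketch below uses illustrative function names, and it assumes the modulus precedes the exponent in the preimage (the description above does not give the order).

```python
import struct


def be_bytes(x):
    """Big-endian bytes of a non-negative integer, with no length field."""
    return x.to_bytes((x.bit_length() + 7) // 8, "big")


def ssh1_mpint(x):
    """SSH-1 wire format: 16-bit bit count, then the big-endian bytes."""
    return struct.pack(">H", x.bit_length()) + be_bytes(x)


def fingerprint_preimage(modulus, exponent):
    # ASSUMPTION: modulus first, then exponent. Neither integer gets a
    # length prefix, which is why an mpint formatter gives the wrong bytes.
    return be_bytes(modulus) + be_bytes(exponent)


toy_modulus, toy_exponent = 0xC5, 0x10001  # toy values, not a real key
right = fingerprint_preimage(toy_modulus, toy_exponent)
wrong = ssh1_mpint(toy_modulus) + ssh1_mpint(toy_exponent)
```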
I've moved the static method nbits up into a top-level function, so I
can use it to implement Python marshalling functions for SSH mpints.
I'm about to need one of these, and the other will surely come in
useful as well sooner or later.
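As an illustration, an SSH-2 mpint marshaller built on nbits could be sketched as follows in Python 3. It handles non-negative values only, and the bodies are written from the RFC 4251 description of the mpint type, not taken from the test suite.

```python
import struct


def nbits(n):
    """Number of bits needed to represent the non-negative integer n."""
    if n < 0:
        raise ValueError("nbits expects a non-negative integer")
    return n.bit_length()


def ssh2_mpint(n):
    """RFC 4251 mpint: uint32 byte count, then big-endian two's complement.

    Zero is encoded with no data bytes. For positive values, one more byte
    than nbits // 8 guarantees that the top bit of the first byte is clear.
    """
    nbytes = 0 if n == 0 else nbits(n) // 8 + 1
    return struct.pack(">L", nbytes) + n.to_bytes(nbytes, "big")


encoded_zero = ssh2_mpint(0)
encoded_80 = ssh2_mpint(0x80)
encoded_big = ssh2_mpint(0x9A378F9B2E332A7)
```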
This allows me to remove another diagnostic main() that I just found
lurking at the bottom of sshdes.c, which was there to allow manual
untangling of XDM-AUTHORIZATION-1 strings when debugging X forwarding.
Now you can ask the same kind of question at the interactive Python
prompt, without having to manually compile anything. For example, the
query you might previously have asked by building the sshdes test
program and running
    $ ./sshdes 090a0b0c0d0e0f10 0123456789abcd
    decrypt(090a0b0c0d0e0f10,0123456789abcd) = ab53fd65ae7f4ec3
    encrypt(090a0b0c0d0e0f10,0123456789abcd) = 7065d20441f5abe3
you can now run using the standard testcrypt (bearing in mind that the
actual library function takes the key argument first):
    $ python -i test/testcrypt.py
    >>> from binascii import hexlify as H, unhexlify as U
    >>> H(des_decrypt_xdmauth(U('0123456789abcd'),U('090a0b0c0d0e0f10')))
    'ab53fd65ae7f4ec3'
    >>> H(des_encrypt_xdmauth(U('0123456789abcd'),U('090a0b0c0d0e0f10')))
    '7065d20441f5abe3'
This supersedes the '#ifdef TEST' main programs in sshsh256.c and
sshsh512.c. Now there's no need to build those test programs manually
on the rare occasion of modifying the hash implementations; instead
testcrypt is built every night and will run these test vectors.
RFC 6234 has some test vectors for HMAC-SHA-* as well, so I've
included the ones applicable to this implementation.
I noticed a few of these in the course of preparing the previous
commit. I must have been writing that idiom out by hand for _ages_
before it became totally habitual to #define it as 'lenof' in every
codebase I touch. Now I've gone through and replaced all the old
verbosity with nice terse lenofs.
This is the commit that f3295e0fb _should_ have been. Yesterday I just
added some typedefs so that I didn't have to wear out my fingers
typing 'struct' in new code, but what I ought to have done is to move
all the typedefs into defs.h with the rest, and then go through
cleaning up the legacy 'struct's all through the existing code.
But I was mostly trying to concentrate on getting the test suite
finished, so I just did the minimum. Now it's time to come back and do
it better.
PLATFORM_HAS_SMEMCLR from winstuff.h was available to misc.c via
putty.h, but the point of utils.c is not to pull in that stuff.
This is a quick bodge to unbreak the Windows build. It needs a better
answer for optionally overriding the platform-independent smemclr()
with a platform-specific implementation.
I just found a file lying around in a different source directory that
contained a test case I'd had trouble with last week, so now I've
recovered it, it ought to go in the test suite as a regression test.
After I moved parts of misc.c into utils.c, we started getting two
versions of smemclr in the Windows builds, because utils.c didn't know
to omit its one, having not included the main putty.h.
But it was deliberate that utils.c didn't include putty.h, because I
wanted it (along with the rest of testcrypt in particular) to be
portable to unusual platforms without having to port the whole of the
code base.
So I've moved into the ubiquitous defs.h just the one decision about
whether we're on a platform that will supersede utils.c's definition
of smemclr.
(Also, in the process of moving it, I've removed the clause that
disabled the Windows smemclr in winelib mode, because it looks as if
the claim that winelib doesn't have SecureZeroMemory is now out of
date.)
This is a reasonably comprehensive test that exercises basically all
the functions I rewrote at the end of last year, and it's how I found
a lot of the bugs in them that I fixed earlier today.
It's written in Python, using the unittest framework, which is
convenient because that way I can cross-check Python's own large
integers against PuTTY's.
While I'm here, I've also added a few tests of higher-level crypto
primitives such as Ed25519, AES and HMAC, when I could find official
test vectors for them. I hope to add to that collection at some point,
and also add unit tests of some of the other primitives like ECDH and
RSA KEX.
The test suite is run automatically by my top-level build script, so
that I won't be able to accidentally ship anything which regresses it.
When it's run at build time, the testcrypt binary is built using both
Address and Leak Sanitiser, so anything they don't like will also
cause a test failure.
The new testcrypt system made it easy to write a tiny Python program
that does a lot of multiplications of various large sizes, run it
against versions of the testcrypt binary built with lots of different
threshold settings, and time the output by running the Python program
with PUTTY_TESTCRYPT="command time -f %U ./testcrypt".
When I tried that I found that lots of values in the 20-30 range
looked about as good as each other. 24 was an unusually low dip which
could well have just been a random outlier, but it's a nice round
number so I picked it anyway.
I've written a new standalone test program which incorporates all of
PuTTY's crypto code, including the mp_int and low-level elliptic curve
layers but also going all the way up to the implementations of the
MAC, hash, cipher, public key and kex abstractions.
The test program itself, 'testcrypt', speaks a simple line-oriented
protocol on standard I/O in which you write the name of a function
call followed by some inputs, and it gives you back a list of outputs
preceded by a line telling you how many there are. Dynamically
allocated objects are assigned string ids in the protocol, and there's
a 'free' function that tells testcrypt when it can dispose of one.
It's possible to speak that protocol by hand, but cumbersome. I've
also provided a Python module that wraps it, by running testcrypt as a
persistent subprocess and gatewaying all the function calls into
things that look reasonably natural to call from Python. The Python
module and testcrypt.c both read a carefully formatted header file
testcrypt.h which contains the name and signature of every exported
function, so it costs minimal effort to expose a given function
through this test API. In a few cases it's necessary to write a
wrapper in testcrypt.c that makes the function look more friendly, but
mostly you don't even need that. (Though that is one of the
motivations behind a lot of API cleanups I've done recently!)
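The request/response shape described above can be sketched with a small Python client function. The exact wire details here (space-separated words, one output per line) are assumptions for illustration, and the in-memory streams stand in for the pipes to a real subprocess.

```python
import io


def call(to_child, from_child, fn_name, *args):
    """Send one function call and read back its list of outputs."""
    # ASSUMPTION: the function name and inputs share one line, separated
    # by spaces.
    to_child.write(b" ".join((fn_name,) + args) + b"\n")
    to_child.flush()
    # The reply begins with a line saying how many output lines follow.
    count = int(from_child.readline())
    return [from_child.readline().rstrip(b"\n") for _ in range(count)]


# In-memory streams with a canned reply, instead of a live subprocess.
to_child = io.BytesIO()
from_child = io.BytesIO(b"2\nfirst\nsecond\n")
outputs = call(to_child, from_child, b"some_function", b"arg1", b"arg2")
sent = to_child.getvalue()
```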
I considered doing Python integration in the more obvious way, by
linking parts of the PuTTY code directly into a native-code .so Python
module. I decided against it because this way is more flexible: I can
run the testcrypt program on its own, or compile it in a way that
Python wouldn't play nicely with (I bet compiling just that .so with
Leak Sanitiser wouldn't do what you wanted when Python loaded it!), or
attach a debugger to it. I can even recompile testcrypt for a
different CPU architecture (32- vs 64-bit, or even running it on a
different machine over ssh or under emulation) and still layer the
nice API on top of that via the local Python interpreter. All I need
is a bidirectional data channel.
The __truediv__ pair makes the whole program work in Python 3 as well
as 2 (it was _so_ nearly there already!), and __int__ lets you easily
turn a ModP back into an ordinary Python integer representing its
least positive residue.
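A stripped-down class with the same pair of methods might look like this sketch. It is not the real ModP: it supports only multiplication and division, and it inverts via Fermat's little theorem, which assumes the modulus is prime.

```python
class ModP(object):
    """Minimal integer-mod-p wrapper (illustrative only)."""

    def __init__(self, p, n=0):
        self.p = p
        self.n = int(n) % p  # always in the range 0 .. p-1

    def _coerce(self, other):
        return other if isinstance(other, ModP) else ModP(self.p, other)

    def __mul__(self, other):
        return ModP(self.p, self.n * self._coerce(other).n)

    def invert(self):
        # ASSUMPTION: p is prime, so n^(p-2) is the inverse of n.
        return ModP(self.p, pow(self.n, self.p - 2, self.p))

    def __truediv__(self, other):
        return self * self._coerce(other).invert()

    def __rtruediv__(self, other):
        return self._coerce(other) * self.invert()

    # Python 2 looks up these names for the / operator instead.
    __div__ = __truediv__
    __rdiv__ = __rtruediv__

    def __int__(self):
        return self.n


quotient = int(ModP(7, 3) / ModP(7, 5))
reciprocal = int(1 / ModP(7, 5))
```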
I wrote it for the sake of a test-system design I had in mind at the
time, but that design changed after I committed, and now I think
_even_ my upcoming test application won't need to copy MontyContexts.
So I'll remove the function now, so as not to have to pointlessly
write tests for it :-)