The standard says we should be checking that both r,s are in the range
[1,q-1]. Previously we were effectively reducing s mod q in the course
of inversion, and modinv() was guaranteeing never to return zero; the
remaining missing checks were benign. But the change from Bignum to
mp_int altered the error behaviour, and combined with the missing
upper bound check on s, made it possible to continue verification with
w == 0 mod q, which is a bad case.
Added a small DSA test case, including a check that none of these
types of signatures validates.
All the work I've put in in the last few months to eliminate timing
and cache side channels from PuTTY's mp_int and cipher implementations
has been on a seat-of-the-pants basis: just thinking very hard about
what kinds of language construction I think would be safe to use, and
trying not to absentmindedly leave a conditional branch or a cast to
bool somewhere vital.
Now I've got a test suite! The basic idea is that you run the same
crypto primitive multiple times, with inputs differing only in ways
that are supposed to avoid being leaked by timing or leaving evidence
in the cache; then you instrument the code so that it logs all the
control flow, memory access and a couple of other relevant things in
each of those runs, and finally, compare the logs and expect them to
be identical.
The instrumentation is done using DynamoRIO, which I found to be well
suited to this kind of work: it lets you define custom modifications
of the code in a reasonably low-effort way, and it lets you work at
both the low level of examining single instructions _and_ the higher
level of the function call ABI (so you can give things like malloc
special treatment, not to mention intercepting communications from the
program being instrumented). Build instructions are all in the comment
at the top of testsc.c.
At present, I've found this test to give a 100% pass rate using gcc
-O0 and -O3 (Ubuntu 18.10). With clang, there are a couple of
failures, which I'll fix in the next commit.
Previously, I checked by assertion that the base was less than the
modulus. There were two things wrong with this policy. Firstly, it's
perfectly _meaningful_ to want to raise a large number to a power mod
a smaller number, even if it doesn't come up often in cryptography;
secondly, I didn't do it right, because the check was based on the
formal sizes (nw fields) of the mp_ints, which meant that it was
possible to have a failure of the assertion even in the case where the
numerical value of the base _was_ less than the modulus.
In particular, this could come up in Diffie-Hellman with a fixed
group, because the fixed group modulus was decoded from an MP_LITERAL
in sshdh.c which gave a minimal value of nw, but the base was the
public value sent by the other end of the connection, which would
sometimes be sent with the leading zero byte required by the SSH-2
mpint encoding, and would cause a value of nw one larger, failing the
assertion.
Fixed by simply using mp_modmul in monty_import, replacing the
previous clever-but-restricted strategy that I wrote when I thought I
could get away without having to write a general division-based
modular reduction at all.
This completes the conversion begun in commit be5c0e635: now every
CBC-mode cipher has "cbc" in its name, and doesn't leave it implicit.
Hopefully this will never confuse me again!
Some functions got confused if given one as input (particularly
mp_get_decimal, which assumed it could safely write at least one word
into the inv5 value it makes internally), and I've decided it's easier
to stop them ever being created than to teach everything to handle
them correctly. So now mp_make_sized enforces nw != 0 by assertion,
and I've added a max at any call site that looked as if it might
violate that precondition.
mp_from_hex("") could generate one of these, in particular, so now
I've fixed it, I've added a test to make sure it continues doing
something sensible.
In the new testSSHCiphers function, I forgot to put in the check for
None that I put in all the other functions that try to explicitly
instantiate hardware-accelerated AES.
Since I apparently can't reliably keep this script working on both
flavours of Python, I think these days I'd rather it broke on 2 than
on 3 due to my inattention. So let's default to 3.
Like the AES code before it, I've now exposed the explicit _sw and _hw
vtables for SHA-256 and SHA-1 through the testcrypt system, and now
cryptsuite will run the standard test vectors for those hashes over
both implementations, on a platform where more than one is available.
The new structure of those modules is along similar lines to the
recent rewrite of AES, with selection of HW vs SW implementation being
done by the main vtable instead of a subsidiary function pointer
within it, freedom for each implementation to define its state
structure however is most convenient, and space to drop in other
hardware-accelerated implementations.
I've removed the centralised test for compiler SHA-NI support in
ssh.h, and instead duplicated it between the two SHA modules, on the
grounds that once you start considering an open-ended set of hardware
accelerators, the two hashes _need_ not go together.
I've also added an extra test in cryptsuite that checks the point at
which the end-of-hash padding switches to adding an extra cipher
block. That was just because I was rewriting that padding code, was
briefly worried that I might have got an off-by-one error in that part
of it, and couldn't see any existing test that gave me confidence I
hadn't.
This tears out the entire previous random-pool system in sshrand.c. In
its place is a system pretty close to Ferguson and Schneier's
'Fortuna' generator, with the main difference being that I use SHA-256
instead of AES for the generation side of the system (rationale given
in comment).
The PRNG implementation lives in sshprng.c, and defines a self-
contained data type with no state stored outside the object, so you
can instantiate however many of them you like. The old sshrand.c still
exists, but in place of the previous random pool system, it's just
become a client of sshprng.c, whose job is to hold a single global
instance of the PRNG type, and manage its reference count, save file,
noise-collection timers and similar administrative business.
Advantages of this change include:
- Fortuna is designed with a more varied threat model in mind than my
old home-grown random pool. For example, after any request for
random numbers, it automatically re-seeds itself, so that if the
state of the PRNG should be leaked, it won't give enough
information to find out what past outputs _were_.
- The PRNG type can be instantiated with any hash function; the
instance used by the main tools is based on SHA-256, an improvement
on the old pool's use of SHA-1.
- The new PRNG only uses the completely standard interface to the
hash function API, instead of having to have privileged access to
the internal SHA-1 block transform function. This will make it
easier to revamp the hash code in general, and also it means that
hardware-accelerated versions of SHA-256 will automatically be used
for the PRNG as well as for everything else.
- The new PRNG can be _tested_! Because it has an actual (if not
quite explicit) specification for exactly what the output numbers
_ought_ to be derived from the hashes of, I can (and have) put
tests in cryptsuite that ensure the output really is being derived
in the way I think it is. The old pool could have been returning
any old nonsense and it would have been very hard to tell for sure.
This replaces all the separate HMAC-implementing wrappers in the
various source files implementing the underlying hashes.
The new HMAC code also correctly handles the case of a key longer than
the underlying hash's block length, by replacing it with its own hash.
This means I can reinstate the test vectors in RFC 6234 which exercise
that case, which I didn't add to cryptsuite before because they'd have
failed.
It also allows me to remove the ad-hoc code at the call site in
cproxy.c which turns out to have been doing the same thing - I think
that must have been the only call site where the question came up
(since MAC keys invented by the main SSH-2 BPP are always shorter than
that).
DES was the next target in my ongoing programme of trying to make all
our crypto code constant-time. Unfortunately, DES is very hard to make
constant-time and still have any kind of performance: my early timing
tests suggest that the implementation I have here is about 4.5 times
slower than the implementation it's replacing. That's about the same
factor as the new AES code when it's not in parallel mode and not
superseded by hardware acceleration - but of course the difference is
that AES usually _is_ superseded by HW acceleration or (failing that)
in parallel mode. This DES implementation doesn't parallelise, and
there's no hardware alternative, so DES is going to be this slow all
the time, unless someone sends me code that does it better.
But hopefully that isn't too big a problem. The main use for DES these
days is legacy devices whose SSH servers haven't been updated to speak
anything more modern, so with any luck those devices will also be old
and slow enough that _their_ end will be the bottleneck in connection
speed!
Like the recently added tests for the auxiliary encryption functions,
this new set of tests is not derived from any external source: the
expected results are simply whatever the current PuTTY code delivers
_now_ for the given operation. The aim is to protect me against
breakage during refactoring or rewrites.
The aim of this reorganisation is to make it easier to test all the
ciphers in PuTTY in a uniform way. It was inconvenient that there were
two separate vtable systems for the ciphers used in SSH-1 and SSH-2
with different functionality.
Now there's only one type, called ssh_cipher. But really it's the old
ssh2_cipher, just renamed: I haven't made any changes to the API on
the SSH-2 side. Instead, I've removed ssh1_cipher completely, and
adapted the SSH-1 BPP to use the SSH-2 style API.
(The relevant differences are that ssh1_cipher encapsulated both the
sending and receiving directions in one object - so now ssh1bpp has to
make a separate cipher instance per direction - and that ssh1_cipher
automatically initialised the IV to all zeroes, which ssh1bpp now has
to do by hand.)
The previous ssh1_cipher vtable for single-DES has been removed
completely, because when converted into the new API it became
identical to the SSH-2 single-DES vtable; so now there's just one
vtable for DES-CBC which works in both protocols. The other two SSH-1
ciphers each had to stay separate, because 3DES is completely
different between SSH-1 and SSH-2 (three layers of CBC structure
versus one), and Blowfish varies in endianness and key length between
the two.
(Actually, while I'm here, I've only just noticed that the SSH-1
Blowfish cipher mis-describes itself in log messages as Blowfish-128.
In fact it passes the whole of the input key buffer, which has length
SSH1_SESSION_KEY_LENGTH == 32 bytes == 256 bits. So it's actually
Blowfish-256, and has been all along!)
All the things like des_encrypt_xdmauth and aes256_encrypt_pubkey are
at risk of changing their behaviour if I rewrite the underlying code,
so even if I don't have any externally verified test cases, I should
still have _something_ to keep me confident that they work the same
way today that they worked yesterday.
I remembered the existence of that module while I was changing the API
of the CRC functions. It's still quite possibly the only code in PuTTY
not written specifically _for_ PuTTY, so it definitely deserves a bit
of a test suite.
In order to expose it through the ptrlen-centric testcrypt system,
I've added some missing 'const' in the detector module itself, but
otherwise I've left the detector code as it was.
Finding even semi-official test vectors for this CRC implementation
was hard, because it turns out not to _quite_ match any of the well
known ones catalogued on the web. Its _polynomial_ is well known, but
the combination of details that go alongside it (starting state,
post-hashing transformation) are not quite the same as any other hash
I know of.
After trawling catalogue websites for a while I finally worked out
that SSH-1's CRC and RFC 1662's CRC are basically the same except for
different choices of starting value and final adjustment. And RFC
1662's CRC is common enough that there _are_ test vectors.
So I've renamed the previous crc32_compute function to crc32_ssh1,
reflecting that it seems to be its own thing unlike any other CRC;
implemented the RFC 1662 CRC as well, as an alternative tiny wrapper
on the inner crc32_update function; and exposed all three functions to
testcrypt. That lets me run standard test vectors _and_ directed tests
of the internal update routine, plus one check that crc32_ssh1 itself
does what I expect.
While I'm here, I've also modernised the code to use uint32_t in place
of unsigned long, and ptrlen instead of separate pointer,length
arguments. And I've removed the general primer on CRC theory from the
header comment, in favour of the more specifically useful information
about _which_ CRC this is and how it matches up to anything else out
there.
(I've bowed to inevitability and put the directed CRC tests in the
'crypt' class in cryptsuite.py. Of course this is a misnomer, since
CRC isn't cryptography, but it falls into the same category in terms
of the role it plays in SSH-1, and I didn't feel like making a new
pointedly-named 'notreallycrypt' container class just for this :-)
I found some that look pretty good - in particular exercising every
entry in every S-box. These will come in useful when I finish writing
a replacement for the venerable current DES implementation.
The 128-bit example from Appendix A/B is a more useful first test case
for a new implementation than the Appendix C tests, because the
standard shows even more of the working (in particular the full set of
intermediate results from key setup).
I intended cryptsuite to be Python 2/3 agnostic when I first wrote it,
but of course since then I've been testing on whichever Python was
handy and not continuing to check that both actually worked.
This was the most complicated one of the cipher modes to get right, so
I thought I'd add a test to make sure the IV is being written out
correctly after a decryption of any number of cipher blocks.
The new explicit vtables for the hardware and software implementations
are now exposed by name in the testcrypt protocol, and cryptsuite.py
runs all the AES tests separately on both.
(When hardware AES is compiled out, ssh2_cipher_new("aes128_hw") and
similar calls will return None, and cryptsuite.py will respond by
skipping those tests.)
No cipher construction function _currently_ returns NULL, but one's
about to start, so the testcrypt system will have to be able to cope.
This is the first time a function in the testcrypt API has had an
'opt' type as its return value rather than an argument. But it works
just the same in reverse: the wire protocol emits the special
identifer "NULL" when the optional return value is absent, and the
Python module catches that and rewrites it as Python 'None'.
The bulk of this commit is the changes necessary to make testcrypt
compile under Visual Studio. Unfortunately, I've had to remove my
fiddly clever uses of C99 variadic macros, because Visual Studio does
something unexpected when a variadic macro's expansion puts
__VA_ARGS__ in the argument list of a further macro invocation: the
commas don't separate further arguments. In other words, if you write
#define INNER(x,y,z) some expansion involving x, y and z
#define OUTER(...) INNER(__VA_ARGS__)
OUTER(1,2,3)
then gcc and clang will translate OUTER(1,2,3) into INNER(1,2,3) in
the obvious way, and the inner macro will be expanded with x=1, y=2
and z=3. But try this in Visual Studio, and you'll get the macro
parameter x expanding to the entire string 1,2,3 and the other two
empty (with warnings complaining that INNER didn't get the number of
arguments it expected).
It's hard to cite chapter and verse of the standard to say which of
those is _definitely_ right, though my reading leans towards the
gcc/clang behaviour. But I do know I can't depend on it in code that
has to compile under both!
So I've removed the system that allowed me to declare everything in
testcrypt.h as FUNC(ret,fn,arg,arg,arg), and now I have to use a
different macro for each arity (FUNC0, FUNC1, FUNC2 etc). Also, the
WRAPPED_NAME system is gone (because that too depended on the use of a
comma to shift macro arguments along by one), and now I put a custom C
wrapper around a function by simply re-#defining that function's own
name (and therefore the subsequent code has to be a little more
careful to _not_ pass functions' names between several macros before
stringifying them).
That's all a bit tedious, and commits me to a small amount of ongoing
annoyance because now I'll have to add an explicit argument count
every time I add something to testcrypt.h. But then again, perhaps it
will make the code less incomprehensible to someone trying to
understand it!
When testcrypt.h lists a function argument as 'opt_val_foo', it means
that the argument is optional in the sense that the C function can
take a null pointer in place of a valid foo, and so the Python wrapper
module should accept None in the corresponding argument slot from the
client code and translate it into the special string "NULL" in the
wire protocol.
This works fine at argument translation time, but the code that reads
testcrypt.h wasn't looking at it, so if you said 'opt_val_foo_suffix'
in place of 'opt_val_foo' (indicating that that argument is optional
_and_ the C function expects it in a translated form), then the
initial pass over testcrypt.h wouldn't strip the _suffix, and would
set up data structures with mismatched type names.
This is just like assertEqual, except that I use it when I'm comparing
random-looking binary data, and if the check fails it will encode the
two differing values in hex, which is easier to read than trying to
express them as really strange-looking string literals.
This tests the CBC and SDCTR modes, in all key lengths, and in
particular includes a set of SDCTR tests designed to test the
procedure for incrementing the IV as a single 128-bit integer, by
checking propagation of the carry between every pair of words.
When I was originally designing my knockoff of Stein's algorithm, I
simplified it for my own understanding by replacing the step that
turns a into (a-b)/2 with a step that simply turned it into a-b, on
the basis that the next step would do the division by 2 in any case.
This made it easier to get my head round in the first place, and in
the initial Python prototype of the algorithm, it looked more sensible
to have two different kinds of simple step rather than one simple and
one complicated.
But actually, when it's rewritten under the constraints of time
invariance, the standard way is better, because we had to do the
computation for both kinds of step _anyway_, and this way we sometimes
make both of them useful at once instead of only ever using one.
So I've put it back to the more standard version of Stein, which is a
big improvement, because now we can run in at most 2n iterations
instead of 3n _and_ the code implementing each step is simpler. A
quick timing test suggests that modular inversion is now faster by a
factor of about 1.75.
Also, since I went to the effort of thinking up and commenting a pair
of worst-case inputs for the iteration count of Stein's algorithm, it
seems like an omission not to have made sure they were in the test
suite! Added extra tests that include 2^128-1 as a modulus and 2^127
as a value to invert.
I broke it last year in commit 4988fd410, when I made hash contexts
expose a BinarySink interface. I went round finding no end of long-
winded ways of pushing things into hash contexts, often reimplementing
some standard thing like the wire formatting of an mpint, and rewrote
them more concisely using one or two put_foo calls.
But I failed to notice that the hash preimage used in SSH-1 key
fingerprints is _not_ implementable by put_ssh1_mpint! It consists of
the two public-key integers encoded in multi-byte binary big-endian
form, but without any preceding length field at all. I must have
looked too hastily, 'recognised' it as just implementing an mpint
formatter yet again, and replaced it with put_ssh1_mpint. So SSH-1 key
fingerprints have been completely wrong in the snapshots for months.
Fixed now, and this time, added a comment to warn me in case I get the
urge to simplify the code again, and a regression test in cryptsuite.
I've moved the static method nbits up into a top-level function, so I
can use it to implement Python marshalling functions for SSH mpints.
I'm about to need one of these, and the other will surely come in
useful as well sooner or later.
This supersedes the '#ifdef TEST' main programs in sshsh256.c and
sshsh512.c. Now there's no need to build those test programs manually
on the rare occasion of modifying the hash implementations; instead
testcrypt is built every night and will run these test vectors.
RFC 6234 has some test vectors for HMAC-SHA-* as well, so I've
included the ones applicable to this implementation.
I just found a file lying around in a different source directory that
contained a test case I'd had trouble with last week, so now I've
recovered it, it ought to go in the test suite as a regression test.
This is a reasonably comprehensive test that exercises basically all
the functions I rewrote at the end of last year, and it's how I found
a lot of the bugs in them that I fixed earlier today.
It's written in Python, using the unittest framework, which is
convenient because that way I can cross-check Python's own large
integers against PuTTY's.
While I'm here, I've also added a few tests of higher-level crypto
primitives such as Ed25519, AES and HMAC, when I could find official
test vectors for them. I hope to add to that collection at some point,
and also add unit tests of some of the other primitives like ECDH and
RSA KEX.
The test suite is run automatically by my top-level build script, so
that I won't be able to accidentally ship anything which regresses it.
When it's run at build time, the testcrypt binary is built using both
Address and Leak Sanitiser, so anything they don't like will also
cause a test failure.
I've written a new standalone test program which incorporates all of
PuTTY's crypto code, including the mp_int and low-level elliptic curve
layers but also going all the way up to the implementations of the
MAC, hash, cipher, public key and kex abstractions.
The test program itself, 'testcrypt', speaks a simple line-oriented
protocol on standard I/O in which you write the name of a function
call followed by some inputs, and it gives you back a list of outputs
preceded by a line telling you how many there are. Dynamically
allocated objects are assigned string ids in the protocol, and there's
a 'free' function that tells testcrypt when it can dispose of one.
It's possible to speak that protocol by hand, but cumbersome. I've
also provided a Python module that wraps it, by running testcrypt as a
persistent subprocess and gatewaying all the function calls into
things that look reasonably natural to call from Python. The Python
module and testcrypt.c both read a carefully formatted header file
testcrypt.h which contains the name and signature of every exported
function, so it costs minimal effort to expose a given function
through this test API. In a few cases it's necessary to write a
wrapper in testcrypt.c that makes the function look more friendly, but
mostly you don't even need that. (Though that is one of the
motivations between a lot of API cleanups I've done recently!)
I considered doing Python integration in the more obvious way, by
linking parts of the PuTTY code directly into a native-code .so Python
module. I decided against it because this way is more flexible: I can
run the testcrypt program on its own, or compile it in a way that
Python wouldn't play nicely with (I bet compiling just that .so with
Leak Sanitiser wouldn't do what you wanted when Python loaded it!), or
attach a debugger to it. I can even recompile testcrypt for a
different CPU architecture (32- vs 64-bit, or even running it on a
different machine over ssh or under emulation) and still layer the
nice API on top of that via the local Python interpreter. All I need
is a bidirectional data channel.
The __truediv__ pair makes the whole program work in Python 3 as well
as 2 (it was _so_ nearly there already!), and __int__ lets you easily
turn a ModP back into an ordinary Python integer representing its
least positive residue.