When we invent a movzx instruction as part of shift-count logging on
x86, we apparently need to set its 'translation' field to point at a
pre-existing instruction that it's logically related to. Later
versions of DynamoRIO than the one I was running will complain if
this isn't done.
This occurred to me recently as a (very small) hole in the logging
strategy: if the size of an allocated memory block depended on some
secret data, it certainly would change the control flow and memory
access pattern inside malloc, but since we disable logging inside
malloc, the log file from this test suite would never see the
difference.
Easily fixed by printing the size of each block in the code that
intercepts malloc and realloc. As expected, no test actually fails as
a result of filling in this gap.
On AArch64, unexpectedly, there are malloc and free functions in
ld.so, so the module-load function finds them there, wraps them, and
then misses the real versions in libc.
This allows my side-channel test system to at least _compile_ on other
architectures without failing for the lack of OP_xxx enum constants,
although it now won't log all the things it needs to be a proper test.
We keep an internal 128-bit counter that's used as part of the hash
preimages. There's no real need to import all the mp_int machinery in
order to implement that: we can do it by hand using a small fixed-size
array and a trivial use of BignumADC. This is another inter-module
dependency that's easy to remove, and removing it is useful to
spin-off programs.
This changes the hash preimage calculation in the PRNG, because we're
now formatting our 128-bit integer in the fixed-length representation
of 16 little-endian bytes instead of as an SSH-2 mpint. This is
harmless (perhaps even mildly beneficial, due to the length now not
depending on how long the PRNG has been running), but means I have to
update the PRNG tests as well.
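
As an illustration (a Python sketch, not the C code itself), the
counter's contribution to the preimage is now simply

    def counter_encoding(counter):
        # fixed-length encoding: always exactly 16 little-endian bytes
        return counter.to_bytes(16, 'little')

whereas the old SSH-2 mpint encoding's length grew with the counter's
bit count, which is why the preimage length used to depend on how long
the PRNG had been running.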
The class for general rth-root finding started off as a cube-root
finder before I generalised it, and in one part of the top-level
explanatory comment, I still referred to a subgroup having index 3
rather than index r.
Also, in a later paragraph, I seem to have said 'index' several times
where I meant the concept of 'rank' I defined in the previous
paragraph.
The testcrypt protocol expects a string literal to be a concatenation
of literal bytes other than '%' and '\n', and %-escaped hex digit
pairs. But testcrypt.py was only ever using the latter format, so even
a legible ASCII string like "123" was being sent to testcrypt as the
unreadable and needlessly long "%31%32%33".
When debugging, I often arrange to save the testcrypt input stream to
a file, and sometimes I use that file as the starting point for
editing. So it is actually useful to have the protocol exchange be
legible to humans. Hence, here's a change to testcrypt.py which makes
it only use the %-escape encoding for byte values that aren't
printable ASCII.
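
A Python sketch of the friendlier encoding (the function name is made
up, and exactly which edge-case bytes count as 'printable' is an
assumption here):

    def encode_string_literal(data):
        out = []
        for b in data:
            if 0x20 <= b < 0x7f and b != ord('%'):
                out.append(chr(b))          # printable ASCII, sent literally
            else:
                out.append('%%%02x' % b)    # everything else %-escaped
        return ''.join(out)

    assert encode_string_literal(b'123') == '123'
    assert encode_string_literal(b'1%3\n') == '1%253%0a'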
A 'strong' prime, as defined by the Handbook of Applied Cryptography,
is a prime p such that each of p-1 and p+1 has a large prime factor,
and the large factor q of p-1 is such that q-1 in turn _also_ has a
large prime factor.
HoAC says that making your RSA key using primes of this form defeats
some factoring algorithms - but there are other faster algorithms to
which it makes no difference. So this is probably not a useful
precaution in practice. However, it has been recommended in the past
by some official standards, and it's easy to implement given the new
general facility in PrimeCandidateSource that lets you ask for your
prime to satisfy an arbitrary modular congruence. (And HoAC also says
there's no particular reason _not_ to use strong primes.) So I provide
it as an option, just in case anyone wants to select it.
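
A Python sketch of just the structural property being asked for
(primality testing and what counts as 'large' are deliberately out of
scope, and the toy assertion below uses comically small factors purely
to exercise the divisibility structure):

    def has_strong_prime_shape(p, q, r_plus, r_q):
        # q: large prime factor of p-1; r_plus: large prime factor of p+1;
        # r_q: large prime factor of q-1
        return ((p - 1) % q == 0 and
                (p + 1) % r_plus == 0 and
                (q - 1) % r_q == 0)

    assert has_strong_prime_shape(23, 11, 3, 5)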
The change to the key generation algorithm is entirely in sshrsag.c,
and is neatly independent of the prime-generation system in use. If
you're using Maurer provable prime generation, then the known factor q
of p-1 can be used to help certify p, and the one for q-1 to help with
q in turn; if you switch to probabilistic prime generation then you
still get an RSA key with the right structure, except that every time
the definition says 'prime factor' you just append '(probably)'.
(The probabilistic version of this procedure is described as 'Gordon's
algorithm' in HoAC section 4.4.2.)
Since our prime-generation code contains facilities not used by the
main key generators - Sophie Germain primes, user-specified modular
congruences, and MPU certificate output - it's probably going to be
useful sooner or later to have a command-line tool to access those
facilities. So here's a simple script that glues a Python argparse
interface on to the front of it all.
It would be nice to put this in 'contrib' rather than 'test', on the
grounds that it's at least potentially useful for purposes other than
testing PuTTY during development. But it's a client of the testcrypt
system, so it can't live anywhere other than the same directory as
testcrypt.py without me first having to do a lot of faffing about with
Python module organisation. So it can live here for the moment.
Conveniently checkable certificates of primality aren't a new concept.
I didn't invent them, and I wasn't the first to implement them. Given
that, I thought it might be useful to be able to independently verify
a prime generated by PuTTY's provable prime system. Then, even if you
don't trust _this_ code, you might still trust someone else's
verifier, or at least be less willing to believe that both were
colluding.
The Perl module Math::Prime::Util is the only free software I've found
that defines a specific text-file format for certificates of
primality. The MPU format (as it calls it) supports various different
methods of certifying the primality of a number (most of which, like
Pockle's, depend on having previously proved some smaller number(s) to
be prime). The system implemented by Pockle is on its list: MPU calls
it by the name "BLS5".
So this commit introduces extra stored data inside Pockle so that it
remembers not just _that_ it believes certain numbers to be prime, but
also _why_ it believed each one to be prime. Then there's an extra
method in the Pockle API to translate its internal data structures
into the text of an MPU certificate for any number it knows about.
Math::Prime::Util doesn't come with a command-line verification tool,
unfortunately; only a Perl function which you feed a string argument.
So also in this commit I add test/mpu-check.pl, which is a trivial
command-line client of that function.
At the moment, this new piece of API is only exposed via testcrypt. I
could easily put some user interface into the key generation tools
that would save a few primality certificates alongside the private
key, but I have yet to think of any good reason to do it. Mostly this
facility is intended for debugging and cross-checking of the
_algorithm_, not of any particular prime.
Most of them are now _mandatory_ P3 scripts, because I'm tired of
maintaining everything to be compatible with both versions.
The current exceptions are gdb.py (which has to live with whatever gdb
gives it), and kh2reg.py (which is actually designed for other people
to use, and some of them might still be stuck on P2 for the moment).
The comparison functions between an mp_int and an integer worked by
walking along the mp_int, comparing each of its words to the
corresponding word of the integer. When they ran out of mp_int, they'd
stop.
But this overlooks the possibility that they might not have run out of
_integer_ yet! If BIGNUM_INT_BITS is defined to be less than the size
of a uintmax_t, then comparing (say) the uintmax_t 0x8000000000000001
against a one-word mp_int containing 0x0001 would return equality,
because it would never get as far as spotting the high bit of the
integer.
Fixed by iterating up to the max of the number of BignumInts in the
mp_int and the number needed to cover a uintmax_t. That means we have to
use mp_word() instead of a direct array lookup to get the mp_int words
to compare against, since now the word indices might be out of range.
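
A toy Python model of the fixed loop bound (purely illustrative: the
real comparison functions are constant-time C, and these names are
made up):

    BIGNUM_INT_BITS = 16    # pretend BignumInt is 16 bits wide
    UINTMAX_BITS = 64

    def mp_word(words, i):
        # out-of-range word indices read as zero, like mp_word()
        return words[i] if i < len(words) else 0

    def cmp_mp_vs_integer(words, n):
        # words: little-endian list of 16-bit words; n: a uintmax_t-sized value
        nwords = max(len(words), UINTMAX_BITS // BIGNUM_INT_BITS)
        for i in reversed(range(nwords)):
            w, d = mp_word(words, i), (n >> (i * BIGNUM_INT_BITS)) & 0xFFFF
            if w != d:
                return -1 if w < d else +1
        return 0

    # the old loop stopped at len(words), so this used to compare equal:
    assert cmp_mp_vs_integer([0x0001], 0x8000000000000001) < 0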
This is standardised by RFC 8709 at SHOULD level, and for us it's not
too difficult (because we use general-purpose elliptic-curve code). So
let's be up to date for a change, and add it.
This implementation uses all the formats defined in the RFC. But we
also have to choose a wire format for the public+private key blob sent
to an agent, and since the OpenSSH agent protocol is the de facto
standard but not (yet?) handled by the IETF, OpenSSH themselves get to
say what the format for a key should or shouldn't be. So if they don't
support a particular key method, what do you do?
I checked with them, and they agreed that there's an obviously right
format for Ed448 keys, which is to do them exactly like Ed25519 except
that you have a 57-byte string everywhere Ed25519 had a 32-byte
string. So I've done that.
With all the preparation now in place, this is more or less trivial.
We add a new curve setup function in sshecc.c, and an ssh_kex linking
to it; we add the curve parameters to the reference / test code
eccref.py, and use them to generate the list of low-order input values
that should be rejected by the sanity check on the kex output; we add
the standard test vectors from RFC 7748 in cryptsuite.py, and the
low-order values we just generated.
This implements an extended form of primality verification using
certificates based on Pocklington's theorem. You make a Pockle object,
and then try to convince it that one number after another is prime, by
means of providing it with a list of prime factors of p-1 and a
primitive root. (Or just by saying 'this prime is small enough for you
to check yourself'.)
Pocklington's theorem requires you to have factors of p-1 whose
product is at least the square root of p. I've extended that to
support factorisations only as big as the cube root, via an extension
of the theorem given in Maurer's paper on generating provable primes.
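
For reference, a Python sketch of the basic square-root form of the
check (not the cube-root extension, and nothing like Pockle's real
API): given a claimed prime p, a list of primes whose product F divides
p-1 and exceeds sqrt(p), and a witness g,

    from math import gcd

    def pocklington_ok(p, factors, g):
        F = 1
        for q in factors:
            F *= q
        if F * F <= p or (p - 1) % F != 0:
            return False                    # factored part must exceed sqrt(p)
        if pow(g, p - 1, p) != 1:
            return False
        return all(gcd(pow(g, (p - 1) // q, p) - 1, p) == 1 for q in factors)

    assert pocklington_ok(23, [11], 5)      # p-1 = 2*11, and 11^2 > 23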
The Pockle object is more or less write-only: it has no methods for
reading out its contents. Its only output channel is the return value
when you try to insert a prime into it: if it isn't sufficiently
convinced that your prime is prime, it will return an error code. So
anything for which it returns POCKLE_OK you can be confident of.
I'm going to use this for provable prime generation. But exposing this
part of the system as an object in its own right means I can write a
set of unit tests for this specifically. My negative tests exercise
all the different ways a certification can be erroneous or inadequate;
the positive tests include proofs of primality of various primes used
in elliptic-curve crypto. The Poly1305 proof in particular is taken
from a proof in DJB's paper, which has exactly the form of a
Pocklington certificate only written in English.
The functions primegen() and primegen_add_progress_phase() are gone.
In their place is a small vtable system with two methods corresponding
to them, plus the usual admin of allocating and freeing contexts.
This API change is the starting point for being able to drop in
different prime generation algorithms at run time in response to user
configuration.
The more features and options I add to PrimeCandidateSource, the more
cumbersome it will be to replicate each one in a command-line option
to the ultimate primegen() function. So I'm moving to an API in which
the client of primegen() constructs a PrimeCandidateSource themself,
and passes it in to primegen().
Also, changed the API for pcs_new() so that you don't have to pass
'firstbits' unless you really want to. The net effect is that even
though we've added flexibility, we've also simplified the call sites
of primegen() in the simple case: if you want a 1234-bit prime, you
just need to pass pcs_new(1234) as the argument to primegen, and
you're done.
The new declaration of primegen() lives in ssh_keygen.h, along with
all the types it depends on. So I've had to #include that header in a
few new files.
I just ran into a bug in which the testcrypt child process was cleanly
terminated, but at least one Python object was left lying around
containing the identifier of a testcrypt object that had never been
freed. On program exit, the Python reference count on that object went
to zero, the __del__ method was invoked, and childprocess.funcall
started a _new_ instance of testcrypt just so it could tell it to free
the object identifier - which, of course, the new testcrypt had never
heard of!
We can already tell the difference between a ChildProcess object which
has no subprocess because it hasn't yet been started, and one which
has no subprocess because it's terminated: the latter has exitstatus
set to something other than None. So now we enforce by assertion that
we don't ever restart the child process, and the __del__ method avoids
doing anything if the child has already finished.
When I'm writing Python using the testcrypt API, I keep finding that I
instinctively try to call vtable methods as if they were actual
methods of the object. For example, calling key.sign(msg, 0) instead
of ssh_key_sign(key, msg, 0).
So this change to the Python side of the testcrypt mechanism panders
to my inappropriate finger-macros by making them work! The idea is
that I define a set of pairs (type, prefix), such that any function
whose name begins with the prefix and whose first argument is of that
type will be automatically translated into a method on the Python
object wrapping a testcrypt value of that type. For example, any
function of the form ssh_key_foo(val_ssh_key, other args) will
automatically be exposed as a method key.foo(other args), simply
because (val_ssh_key, "ssh_key_") appears in the translation table.
This is particularly nice for the Python 3 REPL, which will let me
tab-complete the right set of method names by knowing the type I'm
trying to invoke one on. I haven't decided yet whether I want to
switch to using it throughout cryptsuite.py.
For namespace-cleanness, I've also renamed all the existing attributes
of the Python Value class wrapper so that they start with '_', to
leave the space of sensible names clear for the new OOish methods.
This expands our previous check for the public value being zero, to
take in all the values that will _become_ zero after not many steps.
The actual check at run time is done using the new is_infinite query
method for Montgomery curve points. Test cases in cryptsuite.py cover
all the dangerous values I generated via all that fiddly quartic-
solving code.
(DJB's page http://cr.yp.to/ecdh.html#validate also lists these same
constants. But working them out again for myself makes me confident I
can do it again for other similar curves, such as Curve448.)
In particular, this makes us fully compliant with RFC 7748's demand to
check we didn't generate a trivial output key, which can happen if the
other end sends any of those low-order values.
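
The form in which RFC 7748 states the check is on the output: reject
the exchange if the shared value K comes out as all zeros, which is
exactly what the low-order inputs produce. A one-line sketch of that
form (PuTTY's actual check, as described above, is done on the input
point via is_infinite instead):

    def kex_output_ok(shared_k):
        return any(shared_k)    # False means all-zero K: abort the exchange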
I don't actually see why this is a vital check to perform for security
purposes, for the same reason that we didn't classify the bug
'diffie-hellman-range-check' as a vulnerability: I can't really see
what the other end's incentive might be to deliberately send one of
these nonsense values (and you can't do it by accident - none of these
values is a power of the canonical base point). It's not that a DH
participant couldn't possibly want to secretly expose the session
traffic - but there are plenty of more subtle (and less subtle!) ways
to do it, so you don't really gain anything by forcing them to use one
of those instead. But the RFC says to check, so we check.
This uses the new quartic-solver mod p to generate all the values in
Curve25519 that can end up at the curve identity by repeated
application of the doubling formula.
I'm going to want to use this for finding special values in elliptic
curves' ground fields.
In order to solve cubics and quartics in F_p, you have to work in
F_{p^2}, for much the same reasons that you have to be willing to use
complex numbers if you want to solve general cubics over the reals
(even if all the eventual roots turn out to be real after all). So
I've also introduced another arithmetic class to work in that kind of
field, and a shim that glues that on to the cyclic-group root finder
from the previous commit.
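
A minimal sketch of the kind of arithmetic involved (not the class in
this commit): represent elements as a + b*w, where w^2 equals some
fixed non-square d mod p, so multiplication works like complex numbers
except that w^2 = d rather than -1:

    class Fp2:
        def __init__(self, p, d, a, b):
            self.p, self.d, self.a, self.b = p, d, a % p, b % p
        def __mul__(self, other):
            p, d = self.p, self.d
            return Fp2(p, d,
                       (self.a * other.a + d * self.b * other.b) % p,
                       (self.a * other.b + self.b * other.a) % p)

    w = Fp2(3, 2, 0, 1)                        # 2 is a non-square mod 3
    assert ((w * w).a, (w * w).b) == (2, 0)    # w^2 == d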
I'm about to want to solve quartics mod a prime, which means I'll need
to be able to take cube roots mod p as well as square roots.
This commit introduces a more general class which can take rth roots
for any prime r, and moreover, it can do it in a general cyclic group.
(You have to tell it the group's order and give it some primitives for
doing arithmetic, plus a way of iterating over the group elements that
it can use to look for a non-rth-power and roots of unity.)
That system makes it nicely easy to test, because you can give it a
cyclic group represented as the integers under _addition_, and then
you obviously know what all the right answers are. So I've also added
a unit test system checking that.
I'm about to want to expand the underlying number-theory code, so I'll
start by moving it into a file where it has room to grow without
swamping the main purpose of eccref.py.
I've replaced the random number generation and small delta-finding
loop in primegen() with a much more elaborate system in its own source
file, with unit tests and everything.
Immediate benefits:
- fixes a theoretical possibility of overflowing the target number of
bits, if the random number was so close to the top of the range
that the addition of delta * factor pushed it over. However, this
only happened with negligible probability.
- fixes a directional bias in delta-finding. The previous code
incremented the number repeatedly until it found a value coprime to
all the right things, which meant that a prime preceded by a
particularly long sequence of numbers with tiny factors was more
likely to be chosen. Now that we select candidate delta values at
random, that bias should be eliminated.
- changes the semantics of the outermost primegen() function to make
them easier to use, because now the caller specifies the 'bits' and
'firstbits' values for the actual returned prime, rather than
having to account for the factor you're multiplying it by in DSA.
DSA client code is correspondingly adjusted.
Future benefits:
- having the candidate generation in a separate function makes it
easy to reuse in alternative prime generation strategies
- the available constraints support applications such as Maurer's
algorithm for generating provable primes, or strong primes for RSA
in which both p-1 and p+1 have a large factor. So those become
things we could experiment with in future.
This still isn't the true random generator used in the live tools:
it's deterministic, for repeatable testing. The Python side of
testcrypt can now call random_make_prng(), which will instantiate a
PRNG with the given seed. random_clear() still gets rid of it.
So I can still have some tests control the precise random numbers
received by the function under test, but for others (especially key
generation, with its uncertainty about how much randomness it will
actually use) I can just say 'here, have a seed, generate as much
stuff from that seed as you need'.
This is another application of the existing mp_bezout_into, which
needed a tweak or two to cope with the numbers not necessarily being
coprime, plus a wrapper function to deal with shared factors of 2.
It reindents the entire second half of mp_bezout_into, so the patch is
best viewed with whitespace differences ignored.
There was previously no safe left shift at all, which is an omission.
And rshift_safe_into was an odd thing to be missing, so while I'm
here, I've added it on the basis that it will probably be useful
sooner or later.
While looking over the code for other reasons, I happened to notice
that the internal function mp_add_masked_integer_into was using a
totally wrong condition to check whether it was about to do an
out-of-range right shift: it was comparing a shift count measured in
bits against BIGNUM_INT_BYTES.
The resulting bug hasn't shown up in the code so far, which I assume
is just because no caller is passing any RHS to mp_add_integer_into
bigger than about 1 or 2. And it doesn't show up in the test suite
because I hadn't tested those functions. Now I am testing them, and
the newly added test fails when built for 16-bit BignumInt if you back
out the actual fix in this commit.
Also spelled '-O text', this takes a public or private key as input,
and produces on standard output a dump of all the actual numbers
involved in the key: the exponent and modulus for RSA, the p,q,g,y
parameters for DSA, the affine x and y coordinates of the public
elliptic curve point for ECC keys, and all the extra bits and pieces
in the private keys too.
Partly I expect this to be useful to me for debugging: I've had to
paste key files a few too many times through base64 decoders and hex
dump tools, then manually decode SSH marshalling and paste the result
into the Python REPL to get an integer object. Now I should be able to
get _straight_ to text I can paste into Python.
But also, it's a way that other applications can use the key
generator: if you need to generate, say, an RSA key in some format I
don't support (I've recently heard of an XML-based one, for example),
then you can run 'puttygen -t rsa --dump' and have it print the
elements of a freshly generated keypair on standard output, and then
all you have to do is understand the output format.
Sometimes, within a switch statement, you want to declare local
variables specific to the handler for one particular case. Until now
I've mostly been writing this in the form
    switch (discriminant) {
      case SIMPLE:
        do stuff;
        break;
      case COMPLICATED:
        {
            declare variables;
            do stuff;
        }
        break;
    }
which is ugly because the two pieces of essentially similar code
appear at different indent levels, and also inconvenient because you
have less horizontal space available to write the complicated case
handler in - particularly undesirable because _complicated_ case
handlers are the ones most likely to need all the space they can get!
After encountering a rather nicer idiom in the LLVM source code, and
after a bit of hackery this morning figuring out how to persuade
Emacs's auto-indent to do what I wanted with it, I've decided to move
to an idiom in which the open brace comes right after the case
statement, and the code within it is indented the same as it would
have been without the brace. Then the whole case handler (including
the break) lives inside those braces, and you get something that looks
more like this:
    switch (discriminant) {
      case SIMPLE:
        do stuff;
        break;
      case COMPLICATED: {
        declare variables;
        do stuff;
        break;
      }
    }
This commit is a big-bang change that reformats all the complicated
case handlers I could find into the new layout. This is particularly
nice in the Pageant main function, in which almost _every_ case
handler had a bundle of variables and was long and complicated. (In
fact that's what motivated me to get round to this.) Some of the
innermost parts of the terminal escape-sequence handling are also
breathing a bit easier now the horizontal pressure on them is
relieved.
(Also, in a few cases, I was able to remove the extra braces
completely, because the only variable local to the case handler was a
loop variable which our new C99 policy allows me to move into the
initialiser clause of its for statement.)
Viewed with whitespace ignored, this is not too disruptive a change.
Downstream patches that conflict with it may need to be reapplied
using --ignore-whitespace or similar.
In an early draft of the new PRNG, before I decided to get rid of
random_byte() and replace it with random_read(), it was important
after generating a hash-worth of PRNG output to buffer it so as to
return it a byte at a time. So the PRNG data structure itself had to
keep a hash-sized buffer of pending output, and be able to return the
next byte from it on every random_byte() call.
But when random_read() came in, there was no need to do that any more,
because at the end of a read, the generator is re-seeded and the
remains of any generated data are deliberately thrown away. So the
pending_output buffer has no need to live in the persistent prng
object; it can be relegated to a local variable inside random_read
(and a couple of other functions that used the same buffer since it
was conveniently there).
A side effect of this is that we're no longer yielding the bytes of
each hash in reverse order, because only the previous silly code
structure made it convenient. Fortunately, of course, nothing is
depending on that - except the cryptsuite tests, which I've updated.
Well, actually, two new test programs. agenttest.py is the actual
test; it depends on agenttestgen.py which generates a collection of
test private keys, using the newly exposed testcrypt interface to our
key generation code.
In this commit I've also factored out some Python SSH marshalling code
from cryptsuite, and moved it into a module ssh.py which the agent
tests can reuse.
This adds stability tests (of the form 'make sure this behaves
tomorrow the same way it behaved today, taking on faith that the
latter was right') for all the new in-memory APIs for public and
private key load/save.
I accepted both 'int' and 'uint' as function argument types, but
hadn't previously noticed that only 'uint' is handled properly as a
return type. Now both are.
It demonstrates a successful round trip from a source integer to
ciphertext and back, and also I've hardcoded the ciphertext I got from
the first attempt so that future changes to the code won't be able to
change it without me noticing.
[SGT's note: Pavel Kryukov says he came across these while setting up
CI runs of testsc. Two of the three guarded functions are obvious
variants on ones we're already guarding; I hadn't heard of cfree
before, but apparently it's an alternate form of ordinary free() from
an obsolete standard, whose glibc man page says 'never use it'.]
The number of people who read our source code with an editor that
thinks tab stops are 4 spaces apart, as opposed to the traditional
tty-derived 8 that the PuTTY code expects, has been steadily increasing.
So I've been wondering for ages about just fixing it, and switching to
a spaces-only policy throughout the code. And I recently found out
about 'git blame -w', which should make this change not too disruptive
for the purposes of source-control archaeology; so perhaps now is the
time.
While I'm at it, I've also taken the opportunity to remove all the
trailing spaces from source lines (on the basis that git dislikes
them, and is the only thing that seems to have a strong opinion one
way or the other).
Apologies to anyone downstream of this code who has complicated patch
sets to rebase past this change. I don't intend it to be needed again.
In crypt.testCRC32(), I had intended to test every input byte with
each of several previous states, but I mis-indented what should have
been the inner loop (over bytes), with the effect that instead I
silently tested the input bytes with only the last of those states.
Previously, if the testcrypt subprocess suffered any kind of crash or
assertion failure during a run of the Python-based test system, the
effect would be that ChildProcess.read_line() would get EOF, ignore
it, and silently return the empty string. Then it would carry on doing
that for the rest of the program, leading to a long string of error
reports in tests that were nowhere near the code that actually caused
the crash.
Now ChildProcess.read_line() detects EOF and raises an exception, so
that the test suite won't heedlessly carry on trying to do things once
it's noticed that its subprocess has gone away.
This is more fiddly than it sounds, however, because of the wrinkle
that sometimes that function can be called while a Python __del__
method is asking testcrypt to free something. If that happens, the
exception can't be propagated out of the __del__ (analogously to the
rule that it's a really terrible idea for C++ destructors to throw).
So you get an annoying warning message on standard error, and then the
next command sent to testcrypt will be back in the same position.
Worse still, this can also happen if testcrypt has _already_ crashed,
because the __del__ methods will still run.
To protect against _that_, ChildProcess caches the exception after
throwing it, and then each subsequent write_line() will rethrow it.
And __del__ catches and explicitly ignores the exception (to avoid the
annoying warning if Python has to do the same).
The combined result should be that if testcrypt crashes in normal
(non-__del__) context, we should get a single exception that
terminates the run cleanly without cascade failures, and whose
backtrace localises the problem to the actual operation that caused
the crash. If testcrypt crashes in __del__, we can't quite do that
well, but we can still terminate with an exception at the next
opportunity, avoiding multiple cascade failures.
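
A sketch of the shape of the mechanism (class and attribute names here
are illustrative, not necessarily the real ones in testcrypt.py):

    class ChildProcess:
        def __init__(self, pipe_in, pipe_out):
            self.pipe_in, self.pipe_out = pipe_in, pipe_out
            self.exception = None
        def read_line(self):
            line = self.pipe_out.readline()
            if line == b"":                    # EOF: the subprocess has gone away
                self.exception = Exception("testcrypt unexpectedly terminated")
                raise self.exception
            return line
        def write_line(self, line):
            if self.exception is not None:
                raise self.exception           # rethrow the cached failure
            self.pipe_in.write(line + b"\n")

with the __del__-time callers wrapping their calls in try/except and
explicitly ignoring the exception, as described above.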
Also in this commit, I've got rid of the try-finally in
cryptsuite.py's (trivial) main program.
Now we only run the final memory-leak check if we didn't already have
some other error to report, or some other exception that terminated
the process.
Also, we wait for the subprocess to terminate before returning control
to the shell, so that any last-minute complaints from Leak Sanitiser
appear before rather than after the shell prompt comes back.
While I'm here, I've also made check_return_status tolerate the case
in which the child process never got started at all. That way, if a
failure manages to occur before even getting _that_ far, there won't
be a cascade failure from check_return_status getting confused
afterwards.
My API for ECDH KEX doesn't provide a function to input the random
bytes from which the private key is derived, but conveniently, the
existing call to random_read() in ssh_ecdhkex_m_setup treats the
provided bytes in exactly the way that these test vectors expect.
One of these tests also exercises the 'reduction mod 2^255' case that
I just added.
The standard says we should be checking that both r,s are in the range
[1,q-1]. Previously we were effectively reducing s mod q in the course
of inversion, and modinv() was guaranteeing never to return zero; the
remaining missing checks were benign. But the change from Bignum to
mp_int altered the error behaviour, and combined with the missing
upper bound check on s, made it possible to continue verification with
w == 0 mod q, which is a bad case.
Added a small DSA test case, including a check that none of these
types of signatures validates.
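
The missing check itself is tiny; a Python sketch of the precondition
now enforced before inverting s:

    def dsa_sig_ranges_ok(r, s, q):
        # both halves of the signature must be in [1, q-1]; without the
        # upper bound on s, a multiple of q slips through and w = s^-1 mod q
        # degenerates into the w == 0 mod q case described above
        return 0 < r < q and 0 < s < q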
All the work I've put in in the last few months to eliminate timing
and cache side channels from PuTTY's mp_int and cipher implementations
has been on a seat-of-the-pants basis: just thinking very hard about
what kinds of language construction I think would be safe to use, and
trying not to absentmindedly leave a conditional branch or a cast to
bool somewhere vital.
Now I've got a test suite! The basic idea is that you run the same
crypto primitive multiple times, with inputs differing only in ways
that are supposed to avoid being leaked by timing or leaving evidence
in the cache; then you instrument the code so that it logs all the
control flow, memory access and a couple of other relevant things in
each of those runs, and finally, compare the logs and expect them to
be identical.
The instrumentation is done using DynamoRIO, which I found to be well
suited to this kind of work: it lets you define custom modifications
of the code in a reasonably low-effort way, and it lets you work at
both the low level of examining single instructions _and_ the higher
level of the function call ABI (so you can give things like malloc
special treatment, not to mention intercepting communications from the
program being instrumented). Build instructions are all in the comment
at the top of testsc.c.
At present, I've found this test to give a 100% pass rate using gcc
-O0 and -O3 (Ubuntu 18.10). With clang, there are a couple of
failures, which I'll fix in the next commit.
Previously, I checked by assertion that the base was less than the
modulus. There were two things wrong with this policy. Firstly, it's
perfectly _meaningful_ to want to raise a large number to a power mod
a smaller number, even if it doesn't come up often in cryptography;
secondly, I didn't do it right, because the check was based on the
formal sizes (nw fields) of the mp_ints, which meant that it was
possible to have a failure of the assertion even in the case where the
numerical value of the base _was_ less than the modulus.
In particular, this could come up in Diffie-Hellman with a fixed
group, because the fixed group modulus was decoded from an MP_LITERAL
in sshdh.c which gave a minimal value of nw, but the base was the
public value sent by the other end of the connection, which would
sometimes be sent with the leading zero byte required by the SSH-2
mpint encoding, and would cause a value of nw one larger, failing the
assertion.
Fixed by simply using mp_modmul in monty_import, replacing the
previous clever-but-restricted strategy that I wrote when I thought I
could get away without having to write a general division-based
modular reduction at all.
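
For reference, a Python sketch of what monty_import has to compute (R
here is 2 to the power of the modulus's word count times
BIGNUM_INT_BITS; doing it with a general modular multiplication is
what removes the requirement that x be smaller than m):

    def monty_import(x, m, R):
        return ((x % m) * (R % m)) % m    # x * R mod m, valid for any size of x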
This completes the conversion begun in commit be5c0e635: now every
CBC-mode cipher has "cbc" in its name, and doesn't leave it implicit.
Hopefully this will never confuse me again!
Some functions got confused if given a zero-word mp_int as input
(particularly
mp_get_decimal, which assumed it could safely write at least one word
into the inv5 value it makes internally), and I've decided it's easier
to stop them ever being created than to teach everything to handle
them correctly. So now mp_make_sized enforces nw != 0 by assertion,
and I've added a max at any call site that looked as if it might
violate that precondition.
mp_from_hex("") could generate one of these, in particular, so now
that I've fixed it, I've added a test to make sure it continues doing
something sensible.
In the new testSSHCiphers function, I forgot to put in the check for
None that I put in all the other functions that try to explicitly
instantiate hardware-accelerated AES.
Since I apparently can't reliably keep this script working on both
flavours of Python, I think these days I'd rather it broke on 2 than
on 3 due to my inattention. So let's default to 3.
Like the AES code before it, I've now exposed the explicit _sw and _hw
vtables for SHA-256 and SHA-1 through the testcrypt system, and now
cryptsuite will run the standard test vectors for those hashes over
both implementations, on a platform where more than one is available.
The new structure of those modules is along similar lines to the
recent rewrite of AES, with selection of HW vs SW implementation being
done by the main vtable instead of a subsidiary function pointer
within it, freedom for each implementation to define its state
structure however is most convenient, and space to drop in other
hardware-accelerated implementations.
I've removed the centralised test for compiler SHA-NI support in
ssh.h, and instead duplicated it between the two SHA modules, on the
grounds that once you start considering an open-ended set of hardware
accelerators, the two hashes _need_ not go together.
I've also added an extra test in cryptsuite that checks the point at
which the end-of-hash padding switches to adding an extra block. That
was just because I was rewriting that padding code, was
briefly worried that I might have got an off-by-one error in that part
of it, and couldn't see any existing test that gave me confidence I
hadn't.
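
For anyone else worried about the same off-by-one: SHA-256 works in
64-byte blocks and pads with an 0x80 byte, zero bytes and an 8-byte
length field, so 55 bytes of message is the longest that still pads
within a single block, and 56 bytes forces an extra one. A quick way
to generate reference values either side of that boundary:

    import hashlib
    for n in (55, 56):
        print(n, hashlib.sha256(bytes(n)).hexdigest())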
This tears out the entire previous random-pool system in sshrand.c. In
its place is a system pretty close to Ferguson and Schneier's
'Fortuna' generator, with the main difference being that I use SHA-256
instead of AES for the generation side of the system (rationale given
in comment).
The PRNG implementation lives in sshprng.c, and defines a self-
contained data type with no state stored outside the object, so you
can instantiate however many of them you like. The old sshrand.c still
exists, but in place of the previous random pool system, it's just
become a client of sshprng.c, whose job is to hold a single global
instance of the PRNG type, and manage its reference count, save file,
noise-collection timers and similar administrative business.
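
A toy sketch of the general shape of such a generator, using SHA-256
in place of Fortuna's block cipher (this only illustrates the
generate-then-rekey behaviour; it is not PuTTY's exact construction or
preimage format):

    import hashlib

    class ToyGenerator:
        def __init__(self, seed):
            self.key = hashlib.sha256(seed).digest()
            self.counter = 0
        def _block(self):
            out = hashlib.sha256(self.key +
                                 self.counter.to_bytes(16, 'little')).digest()
            self.counter += 1
            return out
        def read(self, nbytes):
            out = b''
            while len(out) < nbytes:
                out += self._block()
            self.key = self._block()   # re-key after every request, so a
                                       # leaked state can't reveal past output
            return out[:nbytes]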
Advantages of this change include:
- Fortuna is designed with a more varied threat model in mind than my
old home-grown random pool. For example, after any request for
random numbers, it automatically re-seeds itself, so that if the
state of the PRNG should be leaked, it won't give enough
information to find out what past outputs _were_.
- The PRNG type can be instantiated with any hash function; the
instance used by the main tools is based on SHA-256, an improvement
on the old pool's use of SHA-1.
- The new PRNG only uses the completely standard interface to the
hash function API, instead of having to have privileged access to
the internal SHA-1 block transform function. This will make it
easier to revamp the hash code in general, and also it means that
hardware-accelerated versions of SHA-256 will automatically be used
for the PRNG as well as for everything else.
- The new PRNG can be _tested_! Because it has an actual (if not
quite explicit) specification for exactly what the output numbers
_ought_ to be derived from the hashes of, I can (and have) put
tests in cryptsuite that ensure the output really is being derived
in the way I think it is. The old pool could have been returning
any old nonsense and it would have been very hard to tell for sure.
This replaces all the separate HMAC-implementing wrappers in the
various source files implementing the underlying hashes.
The new HMAC code also correctly handles the case of a key longer than
the underlying hash's block length, by replacing it with its own hash.
This means I can reinstate the test vectors in RFC 6234 which exercise
that case, which I didn't add to cryptsuite before because they'd have
failed.
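
That key-handling rule is part of the standard HMAC definition (RFC
2104); a Python sketch of it for SHA-256, cross-checked against the
standard hmac module:

    import hashlib, hmac

    def hmac_sha256(key, msg):
        B = 64                                   # SHA-256 block length
        if len(key) > B:
            key = hashlib.sha256(key).digest()   # long keys replaced by their hash
        key = key.ljust(B, b'\0')
        ipad = bytes(k ^ 0x36 for k in key)
        opad = bytes(k ^ 0x5c for k in key)
        return hashlib.sha256(opad + hashlib.sha256(ipad + msg).digest()).digest()

    assert hmac_sha256(b'k' * 100, b'msg') == \
        hmac.new(b'k' * 100, b'msg', hashlib.sha256).digest()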
It also allows me to remove the ad-hoc code at the call site in
cproxy.c which turns out to have been doing the same thing - I think
that must have been the only call site where the question came up
(since MAC keys invented by the main SSH-2 BPP are always shorter than
that).
DES was the next target in my ongoing programme of trying to make all
our crypto code constant-time. Unfortunately, DES is very hard to make
constant-time and still have any kind of performance: my early timing
tests suggest that the implementation I have here is about 4.5 times
slower than the implementation it's replacing. That's about the same
factor as the new AES code when it's not in parallel mode and not
superseded by hardware acceleration - but of course the difference is
that AES usually _is_ superseded by HW acceleration or (failing that)
in parallel mode. This DES implementation doesn't parallelise, and
there's no hardware alternative, so DES is going to be this slow all
the time, unless someone sends me code that does it better.
But hopefully that isn't too big a problem. The main use for DES these
days is legacy devices whose SSH servers haven't been updated to speak
anything more modern, so with any luck those devices will also be old
and slow enough that _their_ end will be the bottleneck in connection
speed!
Like the recently added tests for the auxiliary encryption functions,
this new set of tests is not derived from any external source: the
expected results are simply whatever the current PuTTY code delivers
_now_ for the given operation. The aim is to protect me against
breakage during refactoring or rewrites.
The aim of this reorganisation is to make it easier to test all the
ciphers in PuTTY in a uniform way. It was inconvenient that there were
two separate vtable systems for the ciphers used in SSH-1 and SSH-2
with different functionality.
Now there's only one type, called ssh_cipher. But really it's the old
ssh2_cipher, just renamed: I haven't made any changes to the API on
the SSH-2 side. Instead, I've removed ssh1_cipher completely, and
adapted the SSH-1 BPP to use the SSH-2 style API.
(The relevant differences are that ssh1_cipher encapsulated both the
sending and receiving directions in one object - so now ssh1bpp has to
make a separate cipher instance per direction - and that ssh1_cipher
automatically initialised the IV to all zeroes, which ssh1bpp now has
to do by hand.)
The previous ssh1_cipher vtable for single-DES has been removed
completely, because when converted into the new API it became
identical to the SSH-2 single-DES vtable; so now there's just one
vtable for DES-CBC which works in both protocols. The other two SSH-1
ciphers each had to stay separate, because 3DES is completely
different between SSH-1 and SSH-2 (three layers of CBC structure
versus one), and Blowfish varies in endianness and key length between
the two.
(Actually, while I'm here, I've only just noticed that the SSH-1
Blowfish cipher mis-describes itself in log messages as Blowfish-128.
In fact it passes the whole of the input key buffer, which has length
SSH1_SESSION_KEY_LENGTH == 32 bytes == 256 bits. So it's actually
Blowfish-256, and has been all along!)
All the things like des_encrypt_xdmauth and aes256_encrypt_pubkey are
at risk of changing their behaviour if I rewrite the underlying code,
so even if I don't have any externally verified test cases, I should
still have _something_ to keep me confident that they work the same
way today that they worked yesterday.
I remembered the existence of that module while I was changing the API
of the CRC functions. It's still quite possibly the only code in PuTTY
not written specifically _for_ PuTTY, so it definitely deserves a bit
of a test suite.
In order to expose it through the ptrlen-centric testcrypt system,
I've added some missing 'const' in the detector module itself, but
otherwise I've left the detector code as it was.
Finding even semi-official test vectors for this CRC implementation
was hard, because it turns out not to _quite_ match any of the well
known ones catalogued on the web. Its _polynomial_ is well known, but
the combination of details that go alongside it (starting state,
post-hashing transformation) are not quite the same as any other hash
I know of.
After trawling catalogue websites for a while I finally worked out
that SSH-1's CRC and RFC 1662's CRC are basically the same except for
different choices of starting value and final adjustment. And RFC
1662's CRC is common enough that there _are_ test vectors.
So I've renamed the previous crc32_compute function to crc32_ssh1,
reflecting that it seems to be its own thing unlike any other CRC;
implemented the RFC 1662 CRC as well, as an alternative tiny wrapper
on the inner crc32_update function; and exposed all three functions to
testcrypt. That lets me run standard test vectors _and_ directed tests
of the internal update routine, plus one check that crc32_ssh1 itself
does what I expect.
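
To make the relationship concrete, a bitwise Python sketch (the exact
initial value and final adjustment for the SSH-1 variant are my
reading of the description above, so treat them as assumptions; the
RFC 1662 one can be cross-checked against zlib):

    import zlib

    def crc32_update(state, data):              # reflected, polynomial 0xEDB88320
        for byte in data:
            state ^= byte
            for _ in range(8):
                state = (state >> 1) ^ (0xEDB88320 if state & 1 else 0)
        return state

    def crc32_rfc1662(data):
        return crc32_update(0xFFFFFFFF, data) ^ 0xFFFFFFFF

    def crc32_ssh1(data):
        return crc32_update(0, data)             # no initial or final XOR

    assert crc32_rfc1662(b'123456789') == zlib.crc32(b'123456789')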
While I'm here, I've also modernised the code to use uint32_t in place
of unsigned long, and ptrlen instead of separate pointer,length
arguments. And I've removed the general primer on CRC theory from the
header comment, in favour of the more specifically useful information
about _which_ CRC this is and how it matches up to anything else out
there.
(I've bowed to inevitability and put the directed CRC tests in the
'crypt' class in cryptsuite.py. Of course this is a misnomer, since
CRC isn't cryptography, but it falls into the same category in terms
of the role it plays in SSH-1, and I didn't feel like making a new
pointedly-named 'notreallycrypt' container class just for this :-)
I found some that look pretty good - in particular exercising every
entry in every S-box. These will come in useful when I finish writing
a replacement for the venerable current DES implementation.
The 128-bit example from Appendix A/B is a more useful first test case
for a new implementation than the Appendix C tests, because the
standard shows even more of the working (in particular the full set of
intermediate results from key setup).
I intended cryptsuite to be Python 2/3 agnostic when I first wrote it,
but of course since then I've been testing on whichever Python was
handy and not continuing to check that both actually worked.
This was the most complicated one of the cipher modes to get right, so
I thought I'd add a test to make sure the IV is being written out
correctly after a decryption of any number of cipher blocks.
The new explicit vtables for the hardware and software implementations
are now exposed by name in the testcrypt protocol, and cryptsuite.py
runs all the AES tests separately on both.
(When hardware AES is compiled out, ssh2_cipher_new("aes128_hw") and
similar calls will return None, and cryptsuite.py will respond by
skipping those tests.)
No cipher construction function _currently_ returns NULL, but one's
about to start, so the testcrypt system will have to be able to cope.
This is the first time a function in the testcrypt API has had an
'opt' type as its return value rather than an argument. But it works
just the same in reverse: the wire protocol emits the special
identifier "NULL" when the optional return value is absent, and the
Python module catches that and rewrites it as Python 'None'.
The bulk of this commit is the changes necessary to make testcrypt
compile under Visual Studio. Unfortunately, I've had to remove my
fiddly clever uses of C99 variadic macros, because Visual Studio does
something unexpected when a variadic macro's expansion puts
__VA_ARGS__ in the argument list of a further macro invocation: the
commas don't separate further arguments. In other words, if you write
    #define INNER(x,y,z) some expansion involving x, y and z
    #define OUTER(...) INNER(__VA_ARGS__)
    OUTER(1,2,3)
then gcc and clang will translate OUTER(1,2,3) into INNER(1,2,3) in
the obvious way, and the inner macro will be expanded with x=1, y=2
and z=3. But try this in Visual Studio, and you'll get the macro
parameter x expanding to the entire string 1,2,3 and the other two
empty (with warnings complaining that INNER didn't get the number of
arguments it expected).
It's hard to cite chapter and verse of the standard to say which of
those is _definitely_ right, though my reading leans towards the
gcc/clang behaviour. But I do know I can't depend on it in code that
has to compile under both!
So I've removed the system that allowed me to declare everything in
testcrypt.h as FUNC(ret,fn,arg,arg,arg), and now I have to use a
different macro for each arity (FUNC0, FUNC1, FUNC2 etc). Also, the
WRAPPED_NAME system is gone (because that too depended on the use of a
comma to shift macro arguments along by one), and now I put a custom C
wrapper around a function by simply re-#defining that function's own
name (and therefore the subsequent code has to be a little more
careful to _not_ pass functions' names between several macros before
stringifying them).
That's all a bit tedious, and commits me to a small amount of ongoing
annoyance because now I'll have to add an explicit argument count
every time I add something to testcrypt.h. But then again, perhaps it
will make the code less incomprehensible to someone trying to
understand it!
When testcrypt.h lists a function argument as 'opt_val_foo', it means
that the argument is optional in the sense that the C function can
take a null pointer in place of a valid foo, and so the Python wrapper
module should accept None in the corresponding argument slot from the
client code and translate it into the special string "NULL" in the
wire protocol.
This works fine at argument translation time, but the code that reads
testcrypt.h wasn't looking at it, so if you said 'opt_val_foo_suffix'
in place of 'opt_val_foo' (indicating that that argument is optional
_and_ the C function expects it in a translated form), then the
initial pass over testcrypt.h wouldn't strip the _suffix, and would
set up data structures with mismatched type names.
This is just like assertEqual, except that I use it when I'm comparing
random-looking binary data, and if the check fails it will encode the
two differing values in hex, which is easier to read than trying to
express them as really strange-looking string literals.
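
Something along these lines (illustrative only, not necessarily the
method's real name or body):

    def assert_equal_bin(got, expected):
        if got != expected:
            raise AssertionError("%s != %s" % (got.hex(), expected.hex()))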
This tests the CBC and SDCTR modes, in all key lengths, and in
particular includes a set of SDCTR tests designed to test the
procedure for incrementing the IV as a single 128-bit integer, by
checking propagation of the carry between every pair of words.
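
The incrementing behaviour being tested, sketched in Python (the
big-endian interpretation of the counter block is an assumption of
this sketch):

    def sdctr_next_iv(iv):
        n = (int.from_bytes(iv, 'big') + 1) % (1 << 128)
        return n.to_bytes(16, 'big')

    assert sdctr_next_iv(bytes(15) + b'\xff') == bytes(14) + b'\x01\x00'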
When I was originally designing my knockoff of Stein's algorithm, I
simplified it for my own understanding by replacing the step that
turns a into (a-b)/2 with a step that simply turned it into a-b, on
the basis that the next step would do the division by 2 in any case.
This made it easier to get my head round in the first place, and in
the initial Python prototype of the algorithm, it looked more sensible
to have two different kinds of simple step rather than one simple and
one complicated.
But actually, when it's rewritten under the constraints of time
invariance, the standard way is better, because we had to do the
computation for both kinds of step _anyway_, and this way we sometimes
make both of them useful at once instead of only ever using one.
So I've put it back to the more standard version of Stein, which is a
big improvement, because now we can run in at most 2n iterations
instead of 3n _and_ the code implementing each step is simpler. A
quick timing test suggests that modular inversion is now faster by a
factor of about 1.75.
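
For reference, the plain variable-time form of that standard step
looks like this (this is just the arithmetic idea, nothing like the
constant-time mp_int code):

    def stein_gcd(a, b):
        if a == 0 or b == 0:
            return a | b
        shift = 0
        while (a | b) & 1 == 0:           # strip common factors of 2
            a >>= 1; b >>= 1; shift += 1
        while a & 1 == 0:
            a >>= 1
        while b:
            while b & 1 == 0:
                b >>= 1
            if a > b:
                a, b = b, a
            b = (b - a) >> 1              # the combined subtract-and-halve step
        return a << shift

    assert stein_gcd(12, 18) == 6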
Also, since I went to the effort of thinking up and commenting a pair
of worst-case inputs for the iteration count of Stein's algorithm, it
seems like an omission not to have made sure they were in the test
suite! Added extra tests that include 2^128-1 as a modulus and 2^127
as a value to invert.
I broke it last year in commit 4988fd410, when I made hash contexts
expose a BinarySink interface. I went round finding no end of long-
winded ways of pushing things into hash contexts, often reimplementing
some standard thing like the wire formatting of an mpint, and rewrote
them more concisely using one or two put_foo calls.
But I failed to notice that the hash preimage used in SSH-1 key
fingerprints is _not_ implementable by put_ssh1_mpint! It consists of
the two public-key integers encoded in multi-byte binary big-endian
form, but without any preceding length field at all. I must have
looked too hastily, 'recognised' it as just implementing an mpint
formatter yet again, and replaced it with put_ssh1_mpint. So SSH-1 key
fingerprints have been completely wrong in the snapshots for months.
Fixed now, and this time, added a comment to warn me in case I get the
urge to simplify the code again, and a regression test in cryptsuite.
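
In Python terms, the preimage is simply the bare big-endian bytes of
the two public integers concatenated with no length fields at all (the
order of the two integers in this sketch is illustrative, not
authoritative):

    def bare_bytes(n):
        return n.to_bytes((n.bit_length() + 7) // 8, 'big')

    def ssh1_fingerprint_preimage(modulus, exponent):
        return bare_bytes(modulus) + bare_bytes(exponent)   # no length prefixes

which is quite different from put_ssh1_mpint, whose output begins with
a bit-count field.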
I've moved the static method nbits up into a top-level function, so I
can use it to implement Python marshalling functions for SSH mpints.
I'm about to need one of these, and the other will surely come in
useful as well sooner or later.
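
A sketch of the kind of marshalling that makes possible (restricted to
non-negative values, which is all these tests need; nbits here is just
the bit length):

    def nbits(n):
        return n.bit_length()

    def ssh2_mpint(n):
        # SSH-2 mpint: uint32 length, then big-endian bytes with a leading
        # zero byte whenever the top bit would otherwise be set
        data = b'' if n == 0 else n.to_bytes(nbits(n) // 8 + 1, 'big')
        return len(data).to_bytes(4, 'big') + data

    assert ssh2_mpint(0x80) == b'\x00\x00\x00\x02\x00\x80'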
This supersedes the '#ifdef TEST' main programs in sshsh256.c and
sshsh512.c. Now there's no need to build those test programs manually
on the rare occasion of modifying the hash implementations; instead
testcrypt is built every night and will run these test vectors.
RFC 6234 has some test vectors for HMAC-SHA-* as well, so I've
included the ones applicable to this implementation.
I just found a file lying around in a different source directory that
contained a test case I'd had trouble with last week, so now I've
recovered it, it ought to go in the test suite as a regression test.
This is a reasonably comprehensive test that exercises basically all
the functions I rewrote at the end of last year, and it's how I found
a lot of the bugs in them that I fixed earlier today.
It's written in Python, using the unittest framework, which is
convenient because that way I can cross-check Python's own large
integers against PuTTY's.
While I'm here, I've also added a few tests of higher-level crypto
primitives such as Ed25519, AES and HMAC, when I could find official
test vectors for them. I hope to add to that collection at some point,
and also add unit tests of some of the other primitives like ECDH and
RSA KEX.
The test suite is run automatically by my top-level build script, so
that I won't be able to accidentally ship anything which regresses it.
When it's run at build time, the testcrypt binary is built using both
Address and Leak Sanitiser, so anything they don't like will also
cause a test failure.
I've written a new standalone test program which incorporates all of
PuTTY's crypto code, including the mp_int and low-level elliptic curve
layers but also going all the way up to the implementations of the
MAC, hash, cipher, public key and kex abstractions.
The test program itself, 'testcrypt', speaks a simple line-oriented
protocol on standard I/O in which you write the name of a function
call followed by some inputs, and it gives you back a list of outputs
preceded by a line telling you how many there are. Dynamically
allocated objects are assigned string ids in the protocol, and there's
a 'free' function that tells testcrypt when it can dispose of one.
It's possible to speak that protocol by hand, but cumbersome. I've
also provided a Python module that wraps it, by running testcrypt as a
persistent subprocess and gatewaying all the function calls into
things that look reasonably natural to call from Python. The Python
module and testcrypt.c both read a carefully formatted header file
testcrypt.h which contains the name and signature of every exported
function, so it costs minimal effort to expose a given function
through this test API. In a few cases it's necessary to write a
wrapper in testcrypt.c that makes the function look more friendly, but
mostly you don't even need that. (Though that is one of the
motivations behind a lot of API cleanups I've done recently!)
I considered doing Python integration in the more obvious way, by
linking parts of the PuTTY code directly into a native-code .so Python
module. I decided against it because this way is more flexible: I can
run the testcrypt program on its own, or compile it in a way that
Python wouldn't play nicely with (I bet compiling just that .so with
Leak Sanitiser wouldn't do what you wanted when Python loaded it!), or
attach a debugger to it. I can even recompile testcrypt for a
different CPU architecture (32- vs 64-bit, or even running it on a
different machine over ssh or under emulation) and still layer the
nice API on top of that via the local Python interpreter. All I need
is a bidirectional data channel.
The __truediv__ pair makes the whole program work in Python 3 as well
as 2 (it was _so_ nearly there already!), and __int__ lets you easily
turn a ModP back into an ordinary Python integer representing its
least positive residue.
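
Roughly (an illustrative sketch, not the real class):

    class ModP:
        def __init__(self, p, x):
            self.p, self.x = p, x % p
        def __truediv__(self, other):              # Python 3 division
            return ModP(self.p, self.x * pow(other.x, self.p - 2, self.p))
        __div__ = __truediv__                      # Python 2 spelling
        def __int__(self):
            return self.x                          # least positive residue

    assert int(ModP(7, 3) / ModP(7, 5)) == 2       # 3 * 5^-1 == 3*3 == 2 (mod 7)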