Since the rewrite of hardware SHA support in cbbd464fd7, we've been
attempting to build with SHA-NI support on x86 with some GCC 4.x,
including Ubuntu 14.04's 4.8.x, whereas before we only tried it with
GCC 5.x and above. Revert to that.
(I think that GCC has had some support for this extension since 4.9.0 --
the "sha" attribute went in in upstream commit fc975a4090 -- and it
at least compiles with 4.9.2, but I'm assuming Pavel had a good reason
for sticking to 5+ in 5a38b293bd.)
When I reworked the support for rekeying after a certain amount of
data had been sent, I forgot the part where configuring the max data
limit to zero means 'never rekey due to data transfer volume'. So I
was incautiously checking the 'running' flag in the new
DataTransferStats to find out whether we needed to rekey, forgetting
that sometimes running=false means the transfer limit has expired, and
sometimes it means there never was one in the first place.
To fix this, I've got rid of the boolean return value from DTS_CONSUME
and turned it into an 'expired' flag in DataTransferStats, separate
from the 'running' flag. Now everything consistently checks 'expired'
to find out whether to rekey, and there's a new reset function that
reliably clears 'expired' but sets 'running' depending on whether the
size is nonzero.
(Also, while I'm at it, I've turned the DTS_CONSUME macro into an
inline function, because that's becoming my general preference now
that C99 is allowed in this code base.)
I think Pavel is right to have turned off -DDEBUG in the MinGW build
on general principles - it should never be the default option for any
build platform - but also, it was not intentional that sshdes.c
produces its hugely detailed diagnostics merely because you compile
with the very generic -DDEBUG. So now you have to say
-DDES_DIAGNOSTICS too if you really want sshdes.c's gory detail.
In the new testSSHCiphers function, I forgot to put in the check for
None that I put in all the other functions that try to explicitly
instantiate hardware-accelerated AES.
In the bitsliced implementation, the addition of 0x63 as the last
operation inside the S-box actually costs cycles during encryption -
four bitslice inversions - which can be easily eliminated by detaching
the constant, moving it forward past the ShiftRows and MixColumns
(with both of which it commutes) until it's adjacent to the next round
key addition, and then folding it into the round key during key setup.
I had this idea while I was originally writing this implementation,
but deferred actually doing it because it made all the intermediate
results harder to check against the standard test vectors. Now the
code is working and stable, this is the moment to come back and do it.
Since I apparently can't reliably keep this script working on both
flavours of Python, I think these days I'd rather it broke on 2 than
on 3 due to my inattention. So let's default to 3.
I've only just found out that it has the effect of treating the argv
words not as plain filenames, but as arguments to Perl default 'open',
i.e. if they end in | then the text before that is treated as a
command. That's not what was intended in any of these contexts!
Fortunately, in this project it only comes up in non-critical
'contrib' scripts.
Like the AES code before it, I've now exposed the explicit _sw and _hw
vtables for SHA-256 and SHA-1 through the testcrypt system, and now
cryptsuite will run the standard test vectors for those hashes over
both implementations, on a platform where more than one is available.
Similarly to the 'AES (unaccelerated)' naming scheme I added in the
AES rewrite, the hash functions that have multiple implementations now
each come with an annotation saying which one they are.
This was more tricky for hashes than for ciphers, because the
annotation for a hash has to be a separate string literal from the
base text name, so that it can propagate into the name field for each
HMAC wrapper without looking silly.
Similarly to my recent addition of NEON-accelerated AES, these new
implementations drop in alongside the SHA-NI ones, under a different
set of ifdefs. All the details of selection and detection are
essentially the same as they were for the AES code.
The new structure of those modules is along similar lines to the
recent rewrite of AES, with selection of HW vs SW implementation being
done by the main vtable instead of a subsidiary function pointer
within it, freedom for each implementation to define its state
structure however is most convenient, and space to drop in other
hardware-accelerated implementations.
I've removed the centralised test for compiler SHA-NI support in
ssh.h, and instead duplicated it between the two SHA modules, on the
grounds that once you start considering an open-ended set of hardware
accelerators, the two hashes _need_ not go together.
I've also added an extra test in cryptsuite that checks the point at
which the end-of-hash padding switches to adding an extra cipher
block. That was just because I was rewriting that padding code, was
briefly worried that I might have got an off-by-one error in that part
of it, and couldn't see any existing test that gave me confidence I
hadn't.
This tears out the entire previous random-pool system in sshrand.c. In
its place is a system pretty close to Ferguson and Schneier's
'Fortuna' generator, with the main difference being that I use SHA-256
instead of AES for the generation side of the system (rationale given
in comment).
The PRNG implementation lives in sshprng.c, and defines a self-
contained data type with no state stored outside the object, so you
can instantiate however many of them you like. The old sshrand.c still
exists, but in place of the previous random pool system, it's just
become a client of sshprng.c, whose job is to hold a single global
instance of the PRNG type, and manage its reference count, save file,
noise-collection timers and similar administrative business.
Advantages of this change include:
- Fortuna is designed with a more varied threat model in mind than my
old home-grown random pool. For example, after any request for
random numbers, it automatically re-seeds itself, so that if the
state of the PRNG should be leaked, it won't give enough
information to find out what past outputs _were_.
- The PRNG type can be instantiated with any hash function; the
instance used by the main tools is based on SHA-256, an improvement
on the old pool's use of SHA-1.
- The new PRNG only uses the completely standard interface to the
hash function API, instead of having to have privileged access to
the internal SHA-1 block transform function. This will make it
easier to revamp the hash code in general, and also it means that
hardware-accelerated versions of SHA-256 will automatically be used
for the PRNG as well as for everything else.
- The new PRNG can be _tested_! Because it has an actual (if not
quite explicit) specification for exactly what the output numbers
_ought_ to be derived from the hashes of, I can (and have) put
tests in cryptsuite that ensure the output really is being derived
in the way I think it is. The old pool could have been returning
any old nonsense and it would have been very hard to tell for sure.
The upcoming PRNG revamp will want to tell noise sources apart, so
that it can treat them all fairly. So I've added an extra parameter to
noise_ultralight and random_add_noise, which takes values in an
enumeration covering all the vague classes of entropy source I'm
collecting. In this commit, though, it's simply ignored.
This is in preparation for a PRNG revamp which will want to have a
well defined boundary for any given request-for-randomness, so that it
can destroy the evidence afterwards. So no more looping round calling
random_byte() and then stopping when we feel like it: now you say up
front how many random bytes you want, and call random_read() which
gives you that many in one go.
Most of the call sites that had to be fixed are fairly mechanical, and
quite a few ended up more concise afterwards. A few became more
cumbersome, such as mp_random_bits, in which the new API doesn't let
me load the random bytes directly into the target integer without
triggering undefined behaviour, so instead I have to allocate a
separate temporary buffer.
The _most_ interesting call site was in the PKCS#1 v1.5 padding code
in sshrsa.c (used in SSH-1), in which you need a stream of _nonzero_
random bytes. The previous code just looped on random_byte, retrying
if it got a zero. Now I'm doing a much more interesting thing with an
mpint, essentially scaling a binary fraction repeatedly to extract a
number in the range [0,255) and then adding 1 to it.
Mostly on the Unix side: there are lots of places the Windows code was
collecting noise that the corresponding Unix/GTK code wasn't bothering
to, such as mouse movements, keystrokes and various network events.
Also, both platforms had forgotten to collect noise when reading data
from a pipe to a local proxy process, even though in that
configuration that's morally equivalent to the network packet timings
that we'd normally be collecting from.
I'd forgotten that the SSH-2 BPP uses a defensive measure of
generating the MAC for successive prefixes of an incoming packet,
which means that ssh_mac_genresult needs to be nondestructive.
While I'm at it, I've also made all of hmac's hash objects exist all
the time - they're created up front, destroyed unconditionally on
free, and in between, whenever one is destroyed at all it's
immediately recreated. I think this simplifies things in general, and
in particular, creating at least one hash object immediately will come
in useful when I add selector vtables in a few commits' time.
Keeping that information alongside the hashes themselves seems more
sensible than having the HMAC code know that fact about everything it
can work with.
That flag was missing from all the CBC vtables' flags fields, because
my recent rewrite forgot to put it in. As a result the SSH_MSG_IGNORE
defence against CBC length oracle attacks was not being enabled.
Another thing pointed out by ASan: new_unix_listener takes ownership
of the SockAddr you give it, so I shouldn't have been freeing it at
the end of platform_make_x11_server().
The single simplest request in the entire protocol - the command
'hello' which is supposed to respond 'hello, world\n' to demonstrate
to an interactive user that testcrypt has started up successfully -
was missing the trailing newline in the response. :-)
This allows me to compile testcrypt with -DDEBUG, even though it's not
linked against the usual collection of platform-specific modules that
normally provide dputs. I think the simplest possible dputs ('just
output to stderr') is actually better for testcrypt, because that
keeps it easy to compile for strange experimental platforms.
This replaces all the separate HMAC-implementing wrappers in the
various source files implementing the underlying hashes.
The new HMAC code also correctly handles the case of a key longer than
the underlying hash's block length, by replacing it with its own hash.
This means I can reinstate the test vectors in RFC 6234 which exercise
that case, which I didn't add to cryptsuite before because they'd have
failed.
It also allows me to remove the ad-hoc code at the call site in
cproxy.c which turns out to have been doing the same thing - I think
that must have been the only call site where the question came up
(since MAC keys invented by the main SSH-2 BPP are always shorter than
that).
Similar to the versions in ssh_cipheralg and ssh_keyalg, this allows a
set of vtables to share function pointers while providing varying
constant data that the shared function can use to vary its behaviour.
As an initial demonstration, I've used this to recombine the four
trivial text_name methods for the HMAC-SHA1 variants. I'm about to use
it for something more sensible, though.
We were retaining a ptrlen 's->comment' into a past agent response
message, but that had been freed by the time it was actually printed
in a diagnostic. Also, agent_response_to_free was being freed twice,
because the variable 'ret' in the response-formatting code aliased it.
All the hash-specific state structures, and the functions that
directly accessed them, are now local to the source files implementing
the hashes themselves. Everywhere we previously used those types or
functions, we're now using the standard ssh_hash or ssh2_mac API.
The 'simple' functions (hmacmd5_simple, SHA_Simple etc) are now a pair
of wrappers in sshauxcrypt.c, each of which takes an algorithm
structure and can do the same conceptual thing regardless of what it
is.
DES was the next target in my ongoing programme of trying to make all
our crypto code constant-time. Unfortunately, DES is very hard to make
constant-time and still have any kind of performance: my early timing
tests suggest that the implementation I have here is about 4.5 times
slower than the implementation it's replacing. That's about the same
factor as the new AES code when it's not in parallel mode and not
superseded by hardware acceleration - but of course the difference is
that AES usually _is_ superseded by HW acceleration or (failing that)
in parallel mode. This DES implementation doesn't parallelise, and
there's no hardware alternative, so DES is going to be this slow all
the time, unless someone sends me code that does it better.
But hopefully that isn't too big a problem. The main use for DES these
days is legacy devices whose SSH servers haven't been updated to speak
anything more modern, so with any luck those devices will also be old
and slow enough that _their_ end will be the bottleneck in connection
speed!
Like the recently added tests for the auxiliary encryption functions,
this new set of tests is not derived from any external source: the
expected results are simply whatever the current PuTTY code delivers
_now_ for the given operation. The aim is to protect me against
breakage during refactoring or rewrites.
All those functions like aes256_encrypt_pubkey and des_decrypt_xdmauth
previously lived in the same source files as the ciphers they were
based on, and provided an alternative API to the internals of that
cipher's implementation. But there was no _need_ for them to have that
privileged access to the internals, because they didn't do anything
you couldn't access through the standard API. It was just a historical
oddity with no real benefit, whose main effect is to make refactoring
painful.
So now all of those functions live in a new file sshauxcrypt.c, and
they all work through the same vtable system as all other cipher
clients, by making an instance of the right cipher and configuring it
in the appropriate way. This should make them completely independent
of any internal changes to the cipher implementations they're based
on.
The aim of this reorganisation is to make it easier to test all the
ciphers in PuTTY in a uniform way. It was inconvenient that there were
two separate vtable systems for the ciphers used in SSH-1 and SSH-2
with different functionality.
Now there's only one type, called ssh_cipher. But really it's the old
ssh2_cipher, just renamed: I haven't made any changes to the API on
the SSH-2 side. Instead, I've removed ssh1_cipher completely, and
adapted the SSH-1 BPP to use the SSH-2 style API.
(The relevant differences are that ssh1_cipher encapsulated both the
sending and receiving directions in one object - so now ssh1bpp has to
make a separate cipher instance per direction - and that ssh1_cipher
automatically initialised the IV to all zeroes, which ssh1bpp now has
to do by hand.)
The previous ssh1_cipher vtable for single-DES has been removed
completely, because when converted into the new API it became
identical to the SSH-2 single-DES vtable; so now there's just one
vtable for DES-CBC which works in both protocols. The other two SSH-1
ciphers each had to stay separate, because 3DES is completely
different between SSH-1 and SSH-2 (three layers of CBC structure
versus one), and Blowfish varies in endianness and key length between
the two.
(Actually, while I'm here, I've only just noticed that the SSH-1
Blowfish cipher mis-describes itself in log messages as Blowfish-128.
In fact it passes the whole of the input key buffer, which has length
SSH1_SESSION_KEY_LENGTH == 32 bytes == 256 bits. So it's actually
Blowfish-256, and has been all along!)
All the things like des_encrypt_xdmauth and aes256_encrypt_pubkey are
at risk of changing their behaviour if I rewrite the underlying code,
so even if I don't have any externally verified test cases, I should
still have _something_ to keep me confident that they work the same
way today that they worked yesterday.
Freeing ssh1key->comment before calling freersakey() on the whole of
ssh1key is redundant, and worse, because we also didn't null out the
freed pointer, causes a double-free.
I've only just noticed that the server-side SSH-1 connection layer had
no handler for that message, so when an SSH-1 connection terminates in
the normal way, Uppity's event log reports 'Unexpected packet
received, type 33 (SSH1_CMSG_EXIT_CONFIRMATION)', which is wrong in
that it _should_ be expected!
The SSH wire protocol for tty modes corresponding to control
characters (e.g. configuring what Ctrl-Foo you can press to generate
SIGINT or SIGQUIT) specifies (RFC 4254 section 8, under VINTR, saying
'similarly for the other characters') that you should send the value
255 on the wire if you want _no_ character code to map to the action
in question.
But in the <termios.h> API, that's indicated by setting the
appropriate field of 'struct termios' to _POSIX_VDISABLE, which is a
platform-dependent value and varies between (at least) Linux and *BSD.
On the client side, Unix Plink has always known this: when it copies
the local termios settings into a struct ssh_ttymodes to be sent on
the wire, it checks for _POSIX_VDISABLE and replaces it with 255. But
uxpty.c, mapping ssh_ttymodes back to termios for Uppity's pty
sessions, wasn't making the reverse transformation.
The refactored sshaes.c gives me a convenient slot to drop in a second
hardware-accelerated AES implementation, similar to the existing one
but using Arm NEON intrinsics in place of the x86 AES-NI ones.
This needed a minor structural change, because Arm systems are often
heterogeneous, containing more than one type of CPU which won't
necessarily all support the same set of architecture features. So you
can't test at run time for the presence of AES acceleration by
querying the CPU you're running on - even if you found a way to do it,
the answer wouldn't be reliable once the OS started migrating your
process between CPUs. Instead, you have to ask the OS itself, because
only that knows about _all_ the CPUs on the system. So that means the
aes_hw_available() mechanism has to extend a tentacle into each
platform subdirectory.
The trickiest part was the nest of ifdefs that tries to detect whether
the compiler can support the necessary parts. I had successful
test-compiles on several compilers, and was able to run the code
directly on an AArch64 tablet (so I know it passes cryptsuite), but
it's likely that at least some Arm platforms won't be able to build it
because of some path through the ifdefs that I haven't been able to
test yet.
In the course of writing the tests for detect_attack, I noticed that
it had a parameter where you can pass in the last cipher block of the
previous packet (or the CBC IV, of course, if there was no previous
packet), so that it can detect a pattern of repeated cipher blocks
even if one of them is just outside the current packet.
But the actual use of the attack detector in ssh1bpp wasn't using that
parameter. Now it is!
I remembered the existence of that module while I was changing the API
of the CRC functions. It's still quite possibly the only code in PuTTY
not written specifically _for_ PuTTY, so it definitely deserves a bit
of a test suite.
In order to expose it through the ptrlen-centric testcrypt system,
I've added some missing 'const' in the detector module itself, but
otherwise I've left the detector code as it was.
In SSH-1, the CRC is used on sensitive data, because it takes the
place of what ought to be a MAC. This is of course hopelessly bad
security and one of the major reasons SSH-1 was replaced, but even so,
there's no need to add timing and cache side channels _as well_ as all
the other problems with it!
So I've removed the 256-entry lookup table that's the usual way to
implement CRC (in particular, the implementation given in the RFC 1662
appendix shows the same table in full). The new strategy folds in four
bits at a time, using a multiply+XOR technique to replicate the
outgoing four bits in all the right places.
In a crude timing test this gave about a factor of 2 slowdown, which
seemed surprisingly good to me - six multiplies replacing a single
table lookup? But the multiplications in each 4-bit fold are
independent of each other, so I suspect the CPU is managing to
parallelise them.
Finding even semi-official test vectors for this CRC implementation
was hard, because it turns out not to _quite_ match any of the well
known ones catalogued on the web. Its _polynomial_ is well known, but
the combination of details that go alongside it (starting state,
post-hashing transformation) are not quite the same as any other hash
I know of.
After trawling catalogue websites for a while I finally worked out
that SSH-1's CRC and RFC 1662's CRC are basically the same except for
different choices of starting value and final adjustment. And RFC
1662's CRC is common enough that there _are_ test vectors.
So I've renamed the previous crc32_compute function to crc32_ssh1,
reflecting that it seems to be its own thing unlike any other CRC;
implemented the RFC 1662 CRC as well, as an alternative tiny wrapper
on the inner crc32_update function; and exposed all three functions to
testcrypt. That lets me run standard test vectors _and_ directed tests
of the internal update routine, plus one check that crc32_ssh1 itself
does what I expect.
While I'm here, I've also modernised the code to use uint32_t in place
of unsigned long, and ptrlen instead of separate pointer,length
arguments. And I've removed the general primer on CRC theory from the
header comment, in favour of the more specifically useful information
about _which_ CRC this is and how it matches up to anything else out
there.
(I've bowed to inevitability and put the directed CRC tests in the
'crypt' class in cryptsuite.py. Of course this is a misnomer, since
CRC isn't cryptography, but it falls into the same category in terms
of the role it plays in SSH-1, and I didn't feel like making a new
pointedly-named 'notreallycrypt' container class just for this :-)
I found some that look pretty good - in particular exercising every
entry in every S-box. These will come in useful when I finish writing
a replacement for the venerable current DES implementation.
The 128-bit example from Appendix A/B is a more useful first test case
for a new implementation than the Appendix C tests, because the
standard shows even more of the working (in particular the full set of
intermediate results from key setup).
Ahem. I went to all the effort of setting up a wrapper function that
would store the result of the first call to aes_hw_available(), and
managed to forget to make it set the flag that said it _had_ stored
the result. So the underlying query function was being called every
time.
I intended cryptsuite to be Python 2/3 agnostic when I first wrote it,
but of course since then I've been testing on whichever Python was
handy and not continuing to check that both actually worked.