/*
* testsc: run PuTTY's crypto primitives under instrumentation that
* checks for cache and timing side channels.
*
* The idea is: cryptographic code should avoid leaking secret data
* through timing information, or through traces of its activity left
* in the caches.
*
* (This property is sometimes called 'constant-time', although really
* that's a misnomer. It would be impossible to avoid the execution
* time varying for any number of reasons outside the code's control,
* such as the prior contents of caches and branch predictors,
* temperature-based CPU throttling, system load, etc. And in any case
* you don't _need_ the execution time to be literally constant: you
* just need it to be independent of your secrets. It can vary as much
* as it likes based on anything else.)
*
* To avoid this, you need to ensure that various aspects of the
* code's behaviour do not depend on the secret data. The control
* flow, for a start - no conditional branches based on secrets - and
* also the memory access pattern (no using secret data as an index
* into a lookup table). A couple of other kinds of CPU instruction
* also can't be trusted to run in constant time: we check for
* register-controlled shifts and hardware divisions. (But, again,
* it's perfectly fine to _use_ those instructions in the course of
* crypto code. You just can't use a secret as any time-affecting
* operand.)
*
* This test program works by running the same crypto primitive
* multiple times, with different secret input data. The relevant
 * details of each run are logged to a file via the DynamoRIO-based
* instrumentation system living in the subdirectory test/sclog. Then
* we check over all the files and ensure they're identical.
*
* This program itself (testsc) is built by the ordinary PuTTY
* makefiles. But run by itself, it will do nothing useful: it needs
* to be run under DynamoRIO, with the sclog instrumentation library.
*
* Here's an example of how I built it:
*
* Download the DynamoRIO source. I did this by cloning
* https://github.com/DynamoRIO/dynamorio.git, and at the time of
* writing this, 259c182a75ce80112bcad329c97ada8d56ba854d was the head
* commit.
*
* In the DynamoRIO checkout:
*
* mkdir build
* cd build
* cmake -G Ninja ..
* ninja
*
* Now set the shell variable DRBUILD to be the location of the build
* directory you did that in. (Or not, if you prefer, but the example
* build commands below will assume that that's where the DynamoRIO
* libraries, headers and runtime can be found.)
*
* Then, in test/sclog:
*
* cmake -G Ninja -DCMAKE_PREFIX_PATH=$DRBUILD/cmake .
* ninja
*
* Finally, to run the actual test, set SCTMP to some temp directory
* you don't mind filling with large temp files (several GB at a
* time), and in the main PuTTY source directory (assuming that's
* where testsc has been built):
*
* $DRBUILD/bin64/drrun -c test/sclog/libsclog.so -- ./testsc -O $SCTMP
*/
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include "defs.h"
#include "putty.h"
#include "ssh.h"
#include "sshkeygen.h"
#include "misc.h"
#include "mpint.h"
#include "crypto/ecc.h"
#include "crypto/ntru.h"
#include "crypto/mlkem.h"
static NORETURN PRINTF_LIKE(1, 2) void fatal_error(const char *p, ...)
{
va_list ap;
fprintf(stderr, "testsc: ");
va_start(ap, p);
vfprintf(stderr, p, ap);
va_end(ap);
fputc('\n', stderr);
exit(1);
}
void out_of_memory(void) { fatal_error("out of memory"); }
/*
* A simple deterministic PRNG, without any of the Fortuna
* complexities, for generating test inputs in a way that's repeatable
* between runs of the program, even if only a subset of test cases is
* run.
*/
static uint64_t random_counter = 0;
static const char *random_seedstr = NULL;
static uint8_t random_buf[MAX_HASH_LEN];
static size_t random_buf_limit = 0;
static ssh_hash *random_hash;
static void random_seed(const char *seedstr)
{
random_seedstr = seedstr;
random_counter = 0;
random_buf_limit = 0;
}
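/*
 * Advance the PRNG by one block: hash the seed string together with a
 * monotonically increasing counter, and refill random_buf with the
 * resulting digest. In effect, this runs the hash in counter mode.
 */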
static void random_advance_counter(void)
{
ssh_hash_reset(random_hash);
put_asciz(random_hash, random_seedstr);
put_uint64(random_hash, random_counter);
random_counter++;
random_buf_limit = ssh_hash_alg(random_hash)->hlen;
ssh_hash_digest(random_hash, random_buf);
}
void random_read(void *vbuf, size_t size)
{
assert(random_seedstr);
uint8_t *buf = (uint8_t *)vbuf;
while (size-- > 0) {
if (random_buf_limit == 0)
random_advance_counter();
        *buf++ = random_buf[--random_buf_limit];
}
}
struct random_state {
const char *seedstr;
uint64_t counter;
size_t limit;
uint8_t buf[MAX_HASH_LEN];
};
static struct random_state random_get_state(void)
{
struct random_state st;
st.seedstr = random_seedstr;
st.counter = random_counter;
st.limit = random_buf_limit;
memcpy(st.buf, random_buf, sizeof(st.buf));
return st;
}
static void random_set_state(struct random_state st)
{
random_seedstr = st.seedstr;
random_counter = st.counter;
random_buf_limit = st.limit;
memcpy(random_buf, st.buf, sizeof(random_buf));
}
/*
* Macro that defines a function, and also a volatile function pointer
* pointing to it. Callers indirect through the function pointer
* instead of directly calling the function, to ensure that the
* compiler doesn't try to get clever by eliminating the call
* completely, or inlining it.
*
* This is used to mark functions that DynamoRIO will look for to
* intercept, and also to inhibit inlining and unrolling where they'd
* cause a failure of experimental control in the main test.
*/
#define VOLATILE_WRAPPED_DEFN(qualifier, rettype, fn, params) \
qualifier rettype fn##_real params; \
qualifier rettype (*volatile fn) params = fn##_real; \
qualifier rettype fn##_real params
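/*
 * For illustration only (add_one is a made-up example, not used in this
 * file): a declaration such as
 *
 *     VOLATILE_WRAPPED_DEFN(static, int, add_one, (int x))
 *     { return x + 1; }
 *
 * expands to approximately
 *
 *     static int add_one_real(int x);
 *     static int (*volatile add_one)(int x) = add_one_real;
 *     static int add_one_real(int x) { return x + 1; }
 *
 * so that every call to add_one() goes through the volatile function
 * pointer, which the compiler can't assume still points at add_one_real.
 */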
VOLATILE_WRAPPED_DEFN(, void, log_to_file, (const char *filename))
{
/*
* This function is intercepted by the DynamoRIO side of the
* mechanism. We use it to send instructions to the DR wrapper,
* namely, 'please start logging to this file' or 'please stop
* logging' (if filename == NULL). But we don't have to actually
* do anything in _this_ program - all the functionality is in the
* DR wrapper.
*/
}
static const char *outdir = NULL;
char *log_filename(const char *basename, size_t index)
{
return dupprintf("%s/%s.%04"SIZEu, outdir, basename, index);
}
static char *last_filename;
static const char *test_basename;
static size_t test_index = 0;
void log_start(void)
{
last_filename = log_filename(test_basename, test_index++);
log_to_file(last_filename);
}
void log_end(void)
{
log_to_file(NULL);
sfree(last_filename);
}
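/*
 * Sketch of how these are used (the real test functions appear later in
 * this file): each run of a primitive is bracketed as
 *
 *     log_start();
 *     ... invoke the crypto primitive under test ...
 *     log_end();
 *
 * so that only the primitive itself, and not the harness's own setup
 * work, contributes to the logged trace.
 */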
static bool test_skipped = false;
VOLATILE_WRAPPED_DEFN(, intptr_t, dry_run, (void))
{
/*
* This is another function intercepted by DynamoRIO. In this
* case, DR overrides this function to return 0 rather than 1, so
* we can use it as a check for whether we're running under
* instrumentation, or whether this is just a dry run which goes
* through the motions but doesn't expect to find any log files
* created.
*/
return 1;
}
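/*
 * Hypothetical sketch of how dry_run() might be consulted (the real use
 * is elsewhere in this file):
 *
 *     if (!dry_run()) {
 *         // running under instrumentation, so the log files should
 *         // exist and can be compared against each other
 *     }
 */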
static void mp_random_bits_into(mp_int *r, size_t bits)
{
mp_int *x = mp_random_bits(bits);
mp_copy_into(r, x);
mp_free(x);
}
static void mp_random_fill(mp_int *r)
{
mp_random_bits_into(r, mp_max_bits(r));
}
VOLATILE_WRAPPED_DEFN(static, size_t, looplimit, (size_t x))
{
/*
* looplimit() is the identity function on size_t, but the
* compiler isn't allowed to rely on it being that. I use it to
* make loops in the test functions look less attractive to
* compilers' unrolling heuristics.
*/
return x;
}
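/*
 * Hypothetical example of the intended use of looplimit(): writing a
 * test loop as
 *
 *     for (size_t i = 0; i < looplimit(16); i++)
 *         ... one iteration of the primitive under test ...
 *
 * means the compiler can't see the trip count as a compile-time
 * constant, so it won't fully unroll the loop and upset the
 * experimental control.
 */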
#if HAVE_AES_NI
#define IF_AES_NI(x) x
#else
#define IF_AES_NI(x)
#endif
#if HAVE_SHA_NI
#define IF_SHA_NI(x) x
#else
#define IF_SHA_NI(x)
#endif
#if HAVE_CLMUL
#define IF_CLMUL(x) x
#else
#define IF_CLMUL(x)
#endif
#if HAVE_NEON_CRYPTO
#define IF_NEON_CRYPTO(x) x
#else
#define IF_NEON_CRYPTO(x)
#endif
#if HAVE_NEON_SHA512
#define IF_NEON_SHA512(x) x
#else
#define IF_NEON_SHA512(x)
#endif
#if HAVE_NEON_PMULL
#define IF_NEON_PMULL(x) x
#else
#define IF_NEON_PMULL(x)
#endif
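/*
 * These IF_XXX macros are used in the test list macros below: an entry
 * for a hardware-accelerated implementation is wrapped in the matching
 * IF_XXX(...), so it is only included in the test list when the
 * corresponding instruction-set support was detected at build time,
 * and expands to nothing otherwise.
 */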
/* Ciphers that we expect to pass this test. Blowfish and Arcfour are
* intentionally omitted, because we already know they don't. */
#define CIPHERS(X, Y) \
X(Y, ssh_3des_ssh1) \
X(Y, ssh_3des_ssh2_ctr) \
X(Y, ssh_3des_ssh2) \
X(Y, ssh_des) \
X(Y, ssh_des_sshcom_ssh2) \
X(Y, ssh_aes256_sdctr) \
X(Y, ssh_aes256_gcm) \
X(Y, ssh_aes256_cbc) \
X(Y, ssh_aes192_sdctr) \
X(Y, ssh_aes192_gcm) \
X(Y, ssh_aes192_cbc) \
X(Y, ssh_aes128_sdctr) \
X(Y, ssh_aes128_gcm) \
X(Y, ssh_aes128_cbc) \
X(Y, ssh_aes256_sdctr_sw) \
X(Y, ssh_aes256_gcm_sw) \
X(Y, ssh_aes256_cbc_sw) \
X(Y, ssh_aes192_sdctr_sw) \
X(Y, ssh_aes192_gcm_sw) \
X(Y, ssh_aes192_cbc_sw) \
X(Y, ssh_aes128_sdctr_sw) \
X(Y, ssh_aes128_gcm_sw) \
X(Y, ssh_aes128_cbc_sw) \
IF_AES_NI(X(Y, ssh_aes256_sdctr_ni)) \
IF_AES_NI(X(Y, ssh_aes256_gcm_ni)) \
IF_AES_NI(X(Y, ssh_aes256_cbc_ni)) \
IF_AES_NI(X(Y, ssh_aes192_sdctr_ni)) \
IF_AES_NI(X(Y, ssh_aes192_gcm_ni)) \
IF_AES_NI(X(Y, ssh_aes192_cbc_ni)) \
IF_AES_NI(X(Y, ssh_aes128_sdctr_ni)) \
IF_AES_NI(X(Y, ssh_aes128_gcm_ni)) \
IF_AES_NI(X(Y, ssh_aes128_cbc_ni)) \
IF_NEON_CRYPTO(X(Y, ssh_aes256_sdctr_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes256_gcm_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes256_cbc_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes192_sdctr_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes192_gcm_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes192_cbc_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes128_sdctr_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes128_gcm_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_aes128_cbc_neon)) \
X(Y, ssh2_chacha20_poly1305) \
/* end of list */
#define CIPHER_TESTLIST(X, name) X(cipher_ ## name)
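
/*
 * These lists are X-macros. For illustration (not part of the build):
 * CIPHERS(CIPHER_TESTLIST, X), as used in TESTLIST below, expands to
 *     X(cipher_ssh_aes192_gcm) X(cipher_ssh_aes192_cbc) ...
 * with one X(...) per listed cipher. The IF_AES_NI / IF_NEON_CRYPTO
 * wrappers are presumably defined earlier in this file to expand
 * either to their argument or to nothing, so hardware-accelerated
 * entries only appear in builds that can actually run them.
 */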
#define SIMPLE_MACS(X, Y) \
X(Y, ssh_hmac_md5) \
X(Y, ssh_hmac_sha1) \
X(Y, ssh_hmac_sha1_buggy) \
X(Y, ssh_hmac_sha1_96) \
X(Y, ssh_hmac_sha1_96_buggy) \
X(Y, ssh_hmac_sha256) \
X(Y, ssh_hmac_sha512) \
/* end of list */
#define ALL_MACS(X, Y) \
SIMPLE_MACS(X, Y) \
X(Y, poly1305) \
X(Y, aesgcm_sw_sw) \
X(Y, aesgcm_sw_refpoly) \
IF_AES_NI(X(Y, aesgcm_ni_sw)) \
IF_NEON_CRYPTO(X(Y, aesgcm_neon_sw)) \
IF_CLMUL(X(Y, aesgcm_sw_clmul)) \
IF_NEON_PMULL(X(Y, aesgcm_sw_neon)) \
IF_AES_NI(IF_CLMUL(X(Y, aesgcm_ni_clmul))) \
IF_NEON_CRYPTO(IF_NEON_PMULL(X(Y, aesgcm_neon_neon))) \
/* end of list */
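
/*
 * A rough guide to the aesgcm_* naming above, inferred from the
 * entries: the first suffix names the AES implementation used to
 * produce the MAC's cipher blocks (sw, ni, neon), and the second
 * names the GF(2^128) multiplier used for the GHASH part (sw, the
 * slow 'refpoly' reference, x86 clmul, or Arm NEON pmull), so each
 * supported pairing of cipher and MAC acceleration is exercised
 * separately.
 */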
#define MAC_TESTLIST(X, name) X(mac_ ## name)
#define HASHES(X, Y) \
X(Y, ssh_md5) \
X(Y, ssh_sha1) \
X(Y, ssh_sha1_sw) \
X(Y, ssh_sha256) \
X(Y, ssh_sha256_sw) \
X(Y, ssh_sha384) \
X(Y, ssh_sha512) \
X(Y, ssh_sha384_sw) \
X(Y, ssh_sha512_sw) \
IF_SHA_NI(X(Y, ssh_sha256_ni)) \
IF_SHA_NI(X(Y, ssh_sha1_ni)) \
IF_NEON_CRYPTO(X(Y, ssh_sha256_neon)) \
IF_NEON_CRYPTO(X(Y, ssh_sha1_neon)) \
IF_NEON_SHA512(X(Y, ssh_sha384_neon)) \
IF_NEON_SHA512(X(Y, ssh_sha512_neon)) \
X(Y, ssh_sha3_224) \
X(Y, ssh_sha3_256) \
X(Y, ssh_sha3_384) \
X(Y, ssh_sha3_512) \
X(Y, ssh_shake256_114bytes) \
X(Y, ssh_blake2b) \
/* end of list */
#define HASH_TESTLIST(X, name) X(hash_ ## name)
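
/*
 * Each hash appears in up to three forms: the top-level name, a _sw
 * software-only implementation, and (behind the IF_SHA_NI /
 * IF_NEON_CRYPTO / IF_NEON_SHA512 guards) a hardware-accelerated
 * variant that is only listed when the build supports it.
 */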
#define TESTLIST(X) \
X(mp_get_nbits) \
X(mp_from_decimal) \
X(mp_from_hex) \
X(mp_get_decimal) \
X(mp_get_hex) \
X(mp_cmp_hs) \
X(mp_cmp_eq) \
X(mp_min) \
X(mp_max) \
X(mp_select_into) \
X(mp_cond_swap) \
X(mp_cond_clear) \
X(mp_add) \
X(mp_sub) \
X(mp_mul) \
X(mp_rshift_safe) \
X(mp_divmod) \
X(mp_nthroot) \
X(mp_modadd) \
X(mp_modsub) \
X(mp_modmul) \
X(mp_modpow) \
X(mp_invert_mod_2to) \
X(mp_invert) \
X(mp_modsqrt) \
X(ecc_weierstrass_add) \
X(ecc_weierstrass_double) \
X(ecc_weierstrass_add_general) \
X(ecc_weierstrass_multiply) \
X(ecc_weierstrass_is_identity) \
X(ecc_weierstrass_get_affine) \
X(ecc_weierstrass_decompress) \
X(ecc_montgomery_diff_add) \
X(ecc_montgomery_double) \
X(ecc_montgomery_multiply) \
X(ecc_montgomery_get_affine) \
X(ecc_edwards_add) \
X(ecc_edwards_multiply) \
X(ecc_edwards_eq) \
X(ecc_edwards_get_affine) \
X(ecc_edwards_decompress) \
CIPHERS(CIPHER_TESTLIST, X) \
ALL_MACS(MAC_TESTLIST, X) \
HASHES(HASH_TESTLIST, X) \
X(argon2) \
X(primegen_probabilistic) \
X(ntru) \
X(mlkem512) \
X(mlkem768) \
X(mlkem1024) \
X(rfc6979_setup) \
X(rfc6979_attempt) \
/* end of list */
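As a small illustration of the P-521 nonce bias described in the RFC 6979 commit message above (again just a sketch, not part of the test list that has just ended): a candidate nonce made from 512 random bits is always strictly below 2^512, while the order of the P-521 base point is about 2^521 and in particular greater than 2^512, so reducing the candidate modulo that order changes nothing and every nonce lands in the bottom 2^512/2^521 = 2^-9, i.e. roughly 1/512, of the range. The mp_* calls used here are the real ones exercised elsewhere in this file.

#if 0 /* illustrative sketch only, not built */
static void demo_p521_nonce_bias(void)
{
    /* What the old dsa_gen_k() effectively produced: 512 random bits. */
    mp_int *candidate = mp_random_bits(512);

    /* 2^512 is itself below the ~2^521 order of the P-521 base point,
     * so a candidate below 2^512 is certainly below the order too,
     * and reduction modulo the order is a no-op. */
    mp_int *bound = mp_power_2(512);

    /* mp_cmp_hs(a, b) computes a >= b; here it is always 0, i.e. the
     * 'reduced' nonce never escapes the first 2^512 values of the
     * order's ~2^521-element range. */
    unsigned escapes = mp_cmp_hs(candidate, bound);
    (void)escapes;

    mp_free(candidate);
    mp_free(bound);
}
#endif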
static void test_mp_get_nbits(void)
{
mp_int *z = mp_new(512);
static const size_t bitposns[] = {
0, 1, 5, 16, 23, 32, 67, 123, 234, 511
};
mp_int *prev = mp_from_integer(0);
for (size_t i = 0; i < looplimit(lenof(bitposns)); i++) {
mp_int *x = mp_power_2(bitposns[i]);
mp_add_into(z, x, prev);
mp_free(prev);
prev = x;
log_start();
mp_get_nbits(z);
log_end();
}
mp_free(prev);
mp_free(z);
}
static void test_mp_from_decimal(void)
{
char dec[64];
static const size_t starts[] = { 0, 1, 5, 16, 23, 32, 63, 64 };
for (size_t i = 0; i < looplimit(lenof(starts)); i++) {
memset(dec, '0', lenof(dec));
for (size_t j = starts[i]; j < lenof(dec); j++) {
uint8_t r[4];
random_read(r, 4);
dec[j] = '0' + GET_32BIT_MSB_FIRST(r) % 10;
}
log_start();
mp_int *x = mp_from_decimal_pl(make_ptrlen(dec, lenof(dec)));
log_end();
mp_free(x);
}
}
static void test_mp_from_hex(void)
{
char hex[64];
static const size_t starts[] = { 0, 1, 5, 16, 23, 32, 63, 64 };
static const char digits[] = "0123456789abcdefABCDEF";
for (size_t i = 0; i < looplimit(lenof(starts)); i++) {
memset(hex, '0', lenof(hex));
for (size_t j = starts[i]; j < lenof(hex); j++) {
uint8_t r[4];
random_read(r, 4);
hex[j] = digits[GET_32BIT_MSB_FIRST(r) % lenof(digits)];
}
log_start();
mp_int *x = mp_from_hex_pl(make_ptrlen(hex, lenof(hex)));
log_end();
mp_free(x);
}
}
static void test_mp_string_format(char *(*mp_format)(mp_int *x))
{
mp_int *z = mp_new(512);
static const size_t bitposns[] = {
0, 1, 5, 16, 23, 32, 67, 123, 234, 511
};
for (size_t i = 0; i < looplimit(lenof(bitposns)); i++) {
mp_random_bits_into(z, bitposns[i]);
log_start();
char *formatted = mp_format(z);
log_end();
sfree(formatted);
}
mp_free(z);
}
static void test_mp_get_decimal(void)
{
test_mp_string_format(mp_get_decimal);
}
static void test_mp_get_hex(void)
{
test_mp_string_format(mp_get_hex);
}
static void test_mp_cmp(unsigned (*mp_cmp)(mp_int *a, mp_int *b))
{
mp_int *a = mp_new(512), *b = mp_new(512);
static const size_t bitposns[] = {
0, 1, 5, 16, 23, 32, 67, 123, 234, 511
};
for (size_t i = 0; i < looplimit(lenof(bitposns)); i++) {
mp_random_fill(b);
mp_int *x = mp_random_bits(bitposns[i]);
mp_xor_into(a, b, x);
mp_free(x);
log_start();
mp_cmp(a, b);
log_end();
}
mp_free(a);
mp_free(b);
}
static void test_mp_cmp_hs(void)
{
test_mp_cmp(mp_cmp_hs);
}
static void test_mp_cmp_eq(void)
{
test_mp_cmp(mp_cmp_eq);
}
static void test_mp_minmax(
void (*mp_minmax_into)(mp_int *r, mp_int *x, mp_int *y))
{
mp_int *a = mp_new(256), *b = mp_new(256);
for (size_t i = 0; i < looplimit(10); i++) {
uint8_t lens[2];
random_read(lens, 2);
mp_int *x = mp_random_bits(lens[0]);
mp_copy_into(a, x);
mp_free(x);
mp_int *y = mp_random_bits(lens[1]);
mp_copy_into(b, y);
mp_free(y);
log_start();
mp_minmax_into(a, a, b);
log_end();
}
mp_free(a);
mp_free(b);
}
static void test_mp_max(void)
{
test_mp_minmax(mp_max_into);
}
static void test_mp_min(void)
{
test_mp_minmax(mp_min_into);
}
static void test_mp_select_into(void)
{
mp_int *a = mp_random_bits(256);
mp_int *b = mp_random_bits(512);
mp_int *r = mp_new(384);
for (size_t i = 0; i < looplimit(16); i++) {
log_start();
mp_select_into(r, a, b, i & 1);
log_end();
}
mp_free(a);
mp_free(b);
mp_free(r);
}
static void test_mp_cond_swap(void)
{
mp_int *a = mp_random_bits(512);
mp_int *b = mp_random_bits(512);
for (size_t i = 0; i < looplimit(16); i++) {
log_start();
mp_cond_swap(a, b, i & 1);
log_end();
}
mp_free(a);
mp_free(b);
}
static void test_mp_cond_clear(void)
{
mp_int *a = mp_random_bits(512);
mp_int *x = mp_copy(a);
for (size_t i = 0; i < looplimit(16); i++) {
mp_copy_into(x, a);
log_start();
mp_cond_clear(a, i & 1);
log_end();
}
mp_free(a);
mp_free(x);
}
static void test_mp_arithmetic(mp_int *(*mp_arith)(mp_int *x, mp_int *y))
{
mp_int *a = mp_new(256), *b = mp_new(512);
for (size_t i = 0; i < looplimit(16); i++) {
mp_random_fill(a);
mp_random_fill(b);
log_start();
mp_int *r = mp_arith(a, b);
log_end();
mp_free(r);
}
mp_free(a);
mp_free(b);
}
static void test_mp_add(void)
{
test_mp_arithmetic(mp_add);
}
static void test_mp_sub(void)
{
test_mp_arithmetic(mp_sub);
}
static void test_mp_mul(void)
{
test_mp_arithmetic(mp_mul);
}
static void test_mp_invert(void)
{
test_mp_arithmetic(mp_invert);
}
static void test_mp_rshift_safe(void)
{
mp_int *x = mp_random_bits(256);
for (size_t i = 0; i < looplimit(mp_max_bits(x)+1); i++) {
log_start();
mp_int *r = mp_rshift_safe(x, i);
log_end();
mp_free(r);
}
mp_free(x);
}
static void test_mp_divmod(void)
{
mp_int *n = mp_new(256), *d = mp_new(256);
mp_int *q = mp_new(256), *r = mp_new(256);
for (size_t i = 0; i < looplimit(32); i++) {
uint8_t sizes[2];
random_read(sizes, 2);
mp_random_bits_into(n, sizes[0]);
mp_random_bits_into(d, sizes[1]);
log_start();
mp_divmod_into(n, d, q, r);
log_end();
}
mp_free(n);
mp_free(d);
mp_free(q);
mp_free(r);
}
static void test_mp_nthroot(void)
{
mp_int *x = mp_new(256), *remainder = mp_new(256);
for (size_t i = 0; i < looplimit(32); i++) {
uint8_t sizes[1];
random_read(sizes, 1);
mp_random_bits_into(x, sizes[0]);
log_start();
mp_free(mp_nthroot(x, 3, remainder));
log_end();
}
mp_free(x);
mp_free(remainder);
}
static void test_mp_modarith(
mp_int *(*mp_modarith)(mp_int *x, mp_int *y, mp_int *modulus))
{
mp_int *base = mp_new(256);
mp_int *exponent = mp_new(256);
mp_int *modulus = mp_new(256);
for (size_t i = 0; i < looplimit(8); i++) {
mp_random_fill(base);
mp_random_fill(exponent);
mp_random_fill(modulus);
mp_set_bit(modulus, 0, 1); /* we only support odd moduli */
log_start();
mp_int *out = mp_modarith(base, exponent, modulus);
log_end();
mp_free(out);
}
mp_free(base);
mp_free(exponent);
mp_free(modulus);
}
static void test_mp_modadd(void)
{
test_mp_modarith(mp_modadd);
}
static void test_mp_modsub(void)
{
test_mp_modarith(mp_modsub);
}
static void test_mp_modmul(void)
{
test_mp_modarith(mp_modmul);
}
static void test_mp_modpow(void)
{
test_mp_modarith(mp_modpow);
}
static void test_mp_invert_mod_2to(void)
{
mp_int *x = mp_new(512);
for (size_t i = 0; i < looplimit(32); i++) {
mp_random_fill(x);
mp_set_bit(x, 0, 1); /* input should be odd */
log_start();
mp_int *out = mp_invert_mod_2to(x, 511);
log_end();
mp_free(out);
}
mp_free(x);
}
static void test_mp_modsqrt(void)
{
/* The prime isn't secret in this function (and in any case
* finding a non-square on the fly would be prohibitively
* annoying), so I hardcode a fixed one, selected to have a lot of
* factors of two in p-1 so as to exercise lots of choices in the
* algorithm. */
mp_int *p =
MP_LITERAL(0xb56a517b206a88c73cfa9ec6f704c7030d18212cace82401);
mp_int *nonsquare = MP_LITERAL(0x5);
size_t bits = mp_max_bits(p);
ModsqrtContext *sc = modsqrt_new(p, nonsquare);
mp_free(p);
mp_free(nonsquare);
mp_int *x = mp_new(bits);
unsigned success;
/* Do one initial call to cause the lazily initialised sub-context
* to be set up. This will take a while, but it can't be helped. */
mp_int *unwanted = mp_modsqrt(sc, x, &success);
mp_free(unwanted);
for (size_t i = 0; i < looplimit(8); i++) {
mp_random_bits_into(x, bits - 1);
log_start();
mp_int *out = mp_modsqrt(sc, x, &success);
log_end();
mp_free(out);
}
mp_free(x);
modsqrt_free(sc);
}
static WeierstrassCurve *wcurve(void)
{
mp_int *p = MP_LITERAL(0xc19337603dc856acf31e01375a696fdf5451);
mp_int *a = MP_LITERAL(0x864946f50eecca4cde7abad4865e34be8f67);
mp_int *b = MP_LITERAL(0x6a5bf56db3a03ba91cfbf3241916c90feeca);
mp_int *nonsquare = mp_from_integer(3);
WeierstrassCurve *wc = ecc_weierstrass_curve(p, a, b, nonsquare);
mp_free(p);
mp_free(a);
mp_free(b);
mp_free(nonsquare);
return wc;
}
static WeierstrassPoint *wpoint(WeierstrassCurve *wc, size_t index)
{
mp_int *x = NULL, *y = NULL;
WeierstrassPoint *wp;
switch (index) {
case 0:
break;
case 1:
x = MP_LITERAL(0x12345);
y = MP_LITERAL(0x3c2c799a365b53d003ef37dab65860bf80ae);
break;
case 2:
x = MP_LITERAL(0x4e1c77e3c00f7c3b15869e6a4e5f86b3ee53);
y = MP_LITERAL(0x5bde01693130591400b5c9d257d8325a44a5);
break;
case 3:
x = MP_LITERAL(0xb5f0e722b2f0f7e729f55ba9f15511e3b399);
y = MP_LITERAL(0x033d636b855c931cfe679f0b18db164a0d64);
break;
case 4:
x = MP_LITERAL(0xb5f0e722b2f0f7e729f55ba9f15511e3b399);
y = MP_LITERAL(0xbe55d3f4b86bc38ff4b6622c418e599546ed);
break;
default:
unreachable("only 5 example Weierstrass points defined");
}
if (x && y) {
wp = ecc_weierstrass_point_new(wc, x, y);
} else {
wp = ecc_weierstrass_point_new_identity(wc);
}
if (x)
mp_free(x);
if (y)
mp_free(y);
return wp;
}
static void test_ecc_weierstrass_add(void)
{
WeierstrassCurve *wc = wcurve();
WeierstrassPoint *a = ecc_weierstrass_point_new_identity(wc);
WeierstrassPoint *b = ecc_weierstrass_point_new_identity(wc);
for (size_t i = 0; i < looplimit(5); i++) {
for (size_t j = 0; j < looplimit(5); j++) {
if (i == 0 || j == 0 || i == j ||
(i==3 && j==4) || (i==4 && j==3))
continue; /* difficult cases */
WeierstrassPoint *A = wpoint(wc, i), *B = wpoint(wc, j);
ecc_weierstrass_point_copy_into(a, A);
ecc_weierstrass_point_copy_into(b, B);
ecc_weierstrass_point_free(A);
ecc_weierstrass_point_free(B);
log_start();
WeierstrassPoint *r = ecc_weierstrass_add(a, b);
log_end();
ecc_weierstrass_point_free(r);
}
}
ecc_weierstrass_point_free(a);
ecc_weierstrass_point_free(b);
ecc_weierstrass_curve_free(wc);
}
static void test_ecc_weierstrass_double(void)
{
WeierstrassCurve *wc = wcurve();
WeierstrassPoint *a = ecc_weierstrass_point_new_identity(wc);
for (size_t i = 0; i < looplimit(5); i++) {
WeierstrassPoint *A = wpoint(wc, i);
ecc_weierstrass_point_copy_into(a, A);
ecc_weierstrass_point_free(A);
log_start();
WeierstrassPoint *r = ecc_weierstrass_double(a);
log_end();
ecc_weierstrass_point_free(r);
}
ecc_weierstrass_point_free(a);
ecc_weierstrass_curve_free(wc);
}
static void test_ecc_weierstrass_add_general(void)
{
WeierstrassCurve *wc = wcurve();
WeierstrassPoint *a = ecc_weierstrass_point_new_identity(wc);
WeierstrassPoint *b = ecc_weierstrass_point_new_identity(wc);
for (size_t i = 0; i < looplimit(5); i++) {
for (size_t j = 0; j < looplimit(5); j++) {
WeierstrassPoint *A = wpoint(wc, i), *B = wpoint(wc, j);
ecc_weierstrass_point_copy_into(a, A);
ecc_weierstrass_point_copy_into(b, B);
ecc_weierstrass_point_free(A);
ecc_weierstrass_point_free(B);
log_start();
WeierstrassPoint *r = ecc_weierstrass_add_general(a, b);
log_end();
ecc_weierstrass_point_free(r);
}
}
ecc_weierstrass_point_free(a);
ecc_weierstrass_point_free(b);
ecc_weierstrass_curve_free(wc);
}
static void test_ecc_weierstrass_multiply(void)
{
WeierstrassCurve *wc = wcurve();
WeierstrassPoint *a = ecc_weierstrass_point_new_identity(wc);
mp_int *exponent = mp_new(56);
for (size_t i = 1; i < looplimit(5); i++) {
WeierstrassPoint *A = wpoint(wc, i);
ecc_weierstrass_point_copy_into(a, A);
ecc_weierstrass_point_free(A);
mp_random_fill(exponent);
log_start();
WeierstrassPoint *r = ecc_weierstrass_multiply(a, exponent);
log_end();
ecc_weierstrass_point_free(r);
}
ecc_weierstrass_point_free(a);
ecc_weierstrass_curve_free(wc);
mp_free(exponent);
}
static void test_ecc_weierstrass_is_identity(void)
{
WeierstrassCurve *wc = wcurve();
WeierstrassPoint *a = ecc_weierstrass_point_new_identity(wc);
for (size_t i = 1; i < looplimit(5); i++) {
WeierstrassPoint *A = wpoint(wc, i);
ecc_weierstrass_point_copy_into(a, A);
ecc_weierstrass_point_free(A);
log_start();
ecc_weierstrass_is_identity(a);
log_end();
}
ecc_weierstrass_point_free(a);
ecc_weierstrass_curve_free(wc);
}
static void test_ecc_weierstrass_get_affine(void)
{
WeierstrassCurve *wc = wcurve();
WeierstrassPoint *r = ecc_weierstrass_point_new_identity(wc);
for (size_t i = 0; i < looplimit(4); i++) {
WeierstrassPoint *A = wpoint(wc, i), *B = wpoint(wc, i+1);
WeierstrassPoint *R = ecc_weierstrass_add_general(A, B);
ecc_weierstrass_point_copy_into(r, R);
ecc_weierstrass_point_free(A);
ecc_weierstrass_point_free(B);
ecc_weierstrass_point_free(R);
log_start();
mp_int *x, *y;
ecc_weierstrass_get_affine(r, &x, &y);
log_end();
mp_free(x);
mp_free(y);
}
ecc_weierstrass_point_free(r);
ecc_weierstrass_curve_free(wc);
}
static void test_ecc_weierstrass_decompress(void)
{
WeierstrassCurve *wc = wcurve();
/* As in the mp_modsqrt test, prime the lazy initialisation of the
* ModsqrtContext */
mp_int *x = mp_new(144);
WeierstrassPoint *a = ecc_weierstrass_point_new_from_x(wc, x, 0);
if (a) /* don't care whether this one succeeded */
ecc_weierstrass_point_free(a);
for (size_t p = 0; p < looplimit(2); p++) {
for (size_t i = 1; i < looplimit(5); i++) {
WeierstrassPoint *A = wpoint(wc, i);
mp_int *X;
ecc_weierstrass_get_affine(A, &X, NULL);
mp_copy_into(x, X);
mp_free(X);
ecc_weierstrass_point_free(A);
log_start();
WeierstrassPoint *a = ecc_weierstrass_point_new_from_x(wc, x, p);
log_end();
ecc_weierstrass_point_free(a);
}
}
mp_free(x);
ecc_weierstrass_curve_free(wc);
}
static MontgomeryCurve *mcurve(void)
{
mp_int *p = MP_LITERAL(0xde978eb1db35236a5792e9f0c04d86000659);
mp_int *a = MP_LITERAL(0x799b62a612b1b30e1c23cea6d67b2e33c51a);
mp_int *b = MP_LITERAL(0x944bf9042b56821a8c9e0b49b636c2502b2b);
MontgomeryCurve *mc = ecc_montgomery_curve(p, a, b);
mp_free(p);
mp_free(a);
mp_free(b);
return mc;
}
static MontgomeryPoint *mpoint(MontgomeryCurve *wc, size_t index)
{
mp_int *x = NULL;
MontgomeryPoint *mp;
switch (index) {
case 0:
x = MP_LITERAL(31415);
break;
case 1:
x = MP_LITERAL(0x4d352c654c06eecfe19104118857b38398e8);
break;
case 2:
x = MP_LITERAL(0x03fca2a73983bc3434caae3134599cd69cce);
break;
case 3:
x = MP_LITERAL(0xa0fd735ce9b3406498b5f035ee655bda4e15);
break;
case 4:
x = MP_LITERAL(0x7c7f46a00cc286dbe47db39b6d8f5efd920e);
break;
case 5:
x = MP_LITERAL(0x07a6dc30d3b320448e6f8999be417e6b7c6b);
break;
case 6:
x = MP_LITERAL(0x7832da5fc16dfbd358170b2b96896cd3cd06);
break;
default:
unreachable("only 7 example Weierstrass points defined");
}
mp = ecc_montgomery_point_new(wc, x);
mp_free(x);
return mp;
}
static void test_ecc_montgomery_diff_add(void)
{
MontgomeryCurve *wc = mcurve();
MontgomeryPoint *a = NULL, *b = NULL, *c = NULL;
for (size_t i = 0; i < looplimit(5); i++) {
MontgomeryPoint *A = mpoint(wc, i);
MontgomeryPoint *B = mpoint(wc, i);
MontgomeryPoint *C = mpoint(wc, i);
if (!a) {
a = A;
b = B;
c = C;
} else {
ecc_montgomery_point_copy_into(a, A);
ecc_montgomery_point_copy_into(b, B);
ecc_montgomery_point_copy_into(c, C);
ecc_montgomery_point_free(A);
ecc_montgomery_point_free(B);
ecc_montgomery_point_free(C);
}
log_start();
MontgomeryPoint *r = ecc_montgomery_diff_add(b, c, a);
log_end();
ecc_montgomery_point_free(r);
}
ecc_montgomery_point_free(a);
ecc_montgomery_point_free(b);
ecc_montgomery_point_free(c);
ecc_montgomery_curve_free(wc);
}
static void test_ecc_montgomery_double(void)
{
MontgomeryCurve *wc = mcurve();
MontgomeryPoint *a = NULL;
for (size_t i = 0; i < looplimit(7); i++) {
MontgomeryPoint *A = mpoint(wc, i);
if (!a) {
a = A;
} else {
ecc_montgomery_point_copy_into(a, A);
ecc_montgomery_point_free(A);
}
log_start();
MontgomeryPoint *r = ecc_montgomery_double(a);
log_end();
ecc_montgomery_point_free(r);
}
ecc_montgomery_point_free(a);
ecc_montgomery_curve_free(wc);
}
static void test_ecc_montgomery_multiply(void)
{
MontgomeryCurve *wc = mcurve();
MontgomeryPoint *a = NULL;
mp_int *exponent = mp_new(56);
for (size_t i = 0; i < looplimit(7); i++) {
MontgomeryPoint *A = mpoint(wc, i);
if (!a) {
a = A;
} else {
ecc_montgomery_point_copy_into(a, A);
ecc_montgomery_point_free(A);
}
mp_random_fill(exponent);
log_start();
MontgomeryPoint *r = ecc_montgomery_multiply(a, exponent);
log_end();
ecc_montgomery_point_free(r);
}
ecc_montgomery_point_free(a);
ecc_montgomery_curve_free(wc);
mp_free(exponent);
}
static void test_ecc_montgomery_get_affine(void)
{
MontgomeryCurve *wc = mcurve();
MontgomeryPoint *r = NULL;
for (size_t i = 0; i < looplimit(5); i++) {
MontgomeryPoint *A = mpoint(wc, i);
MontgomeryPoint *B = mpoint(wc, i);
MontgomeryPoint *C = mpoint(wc, i);
MontgomeryPoint *R = ecc_montgomery_diff_add(B, C, A);
ecc_montgomery_point_free(A);
ecc_montgomery_point_free(B);
ecc_montgomery_point_free(C);
if (!r) {
r = R;
} else {
ecc_montgomery_point_copy_into(r, R);
ecc_montgomery_point_free(R);
}
log_start();
mp_int *x;
ecc_montgomery_get_affine(r, &x);
log_end();
mp_free(x);
}
ecc_montgomery_point_free(r);
ecc_montgomery_curve_free(wc);
}
static EdwardsCurve *ecurve(void)
{
mp_int *p = MP_LITERAL(0xfce2dac1704095de0b5c48876c45063cd475);
mp_int *d = MP_LITERAL(0xbd4f77401c3b14ae1742a7d1d367adac8f3e);
mp_int *a = MP_LITERAL(0x51d0845da3fa871aaac4341adea53b861919);
mp_int *nonsquare = mp_from_integer(2);
EdwardsCurve *ec = ecc_edwards_curve(p, d, a, nonsquare);
mp_free(p);
mp_free(d);
mp_free(a);
mp_free(nonsquare);
return ec;
}
static EdwardsPoint *epoint(EdwardsCurve *wc, size_t index)
{
mp_int *x, *y;
EdwardsPoint *ep;
switch (index) {
case 0:
x = MP_LITERAL(0x0);
y = MP_LITERAL(0x1);
break;
case 1:
x = MP_LITERAL(0x3d8aef0294a67c1c7e8e185d987716250d7c);
y = MP_LITERAL(0x27184);
break;
case 2:
x = MP_LITERAL(0xf44ed5b8a6debfd3ab24b7874cd2589fd672);
y = MP_LITERAL(0xd635d8d15d367881c8a3af472c8fe487bf40);
break;
case 3:
x = MP_LITERAL(0xde114ecc8b944684415ef81126a07269cd30);
y = MP_LITERAL(0xbe0fd45ff67ebba047ed0ec5a85d22e688a1);
break;
case 4:
x = MP_LITERAL(0x76bd2f90898d271b492c9c20dd7bbfe39fe5);
y = MP_LITERAL(0xbf1c82698b4a5a12c1057631c1ebdc216ae2);
break;
default:
unreachable("only 5 example Edwards points defined");
}
ep = ecc_edwards_point_new(wc, x, y);
mp_free(x);
mp_free(y);
return ep;
}
static void test_ecc_edwards_add(void)
{
EdwardsCurve *ec = ecurve();
EdwardsPoint *a = NULL, *b = NULL;
for (size_t i = 0; i < looplimit(5); i++) {
for (size_t j = 0; j < looplimit(5); j++) {
EdwardsPoint *A = epoint(ec, i), *B = epoint(ec, j);
if (!a) {
a = A;
b = B;
} else {
ecc_edwards_point_copy_into(a, A);
ecc_edwards_point_copy_into(b, B);
ecc_edwards_point_free(A);
ecc_edwards_point_free(B);
}
log_start();
EdwardsPoint *r = ecc_edwards_add(a, b);
log_end();
ecc_edwards_point_free(r);
}
}
ecc_edwards_point_free(a);
ecc_edwards_point_free(b);
ecc_edwards_curve_free(ec);
}
static void test_ecc_edwards_multiply(void)
{
EdwardsCurve *ec = ecurve();
EdwardsPoint *a = NULL;
mp_int *exponent = mp_new(56);
for (size_t i = 1; i < looplimit(5); i++) {
EdwardsPoint *A = epoint(ec, i);
if (!a) {
a = A;
} else {
ecc_edwards_point_copy_into(a, A);
ecc_edwards_point_free(A);
}
mp_random_fill(exponent);
log_start();
EdwardsPoint *r = ecc_edwards_multiply(a, exponent);
log_end();
ecc_edwards_point_free(r);
}
ecc_edwards_point_free(a);
ecc_edwards_curve_free(ec);
mp_free(exponent);
}
static void test_ecc_edwards_eq(void)
{
EdwardsCurve *ec = ecurve();
EdwardsPoint *a = NULL, *b = NULL;
for (size_t i = 0; i < looplimit(5); i++) {
for (size_t j = 0; j < looplimit(5); j++) {
EdwardsPoint *A = epoint(ec, i), *B = epoint(ec, j);
if (!a) {
a = A;
b = B;
} else {
ecc_edwards_point_copy_into(a, A);
ecc_edwards_point_copy_into(b, B);
ecc_edwards_point_free(A);
ecc_edwards_point_free(B);
}
log_start();
ecc_edwards_eq(a, b);
log_end();
}
}
ecc_edwards_point_free(a);
ecc_edwards_point_free(b);
ecc_edwards_curve_free(ec);
}
static void test_ecc_edwards_get_affine(void)
{
EdwardsCurve *ec = ecurve();
EdwardsPoint *r = NULL;
for (size_t i = 0; i < looplimit(4); i++) {
EdwardsPoint *A = epoint(ec, i), *B = epoint(ec, i+1);
EdwardsPoint *R = ecc_edwards_add(A, B);
ecc_edwards_point_free(A);
ecc_edwards_point_free(B);
if (!r) {
r = R;
} else {
ecc_edwards_point_copy_into(r, R);
ecc_edwards_point_free(R);
}
log_start();
mp_int *x, *y;
ecc_edwards_get_affine(r, &x, &y);
log_end();
mp_free(x);
mp_free(y);
}
ecc_edwards_point_free(r);
ecc_edwards_curve_free(ec);
}
static void test_ecc_edwards_decompress(void)
{
EdwardsCurve *ec = ecurve();
/* As in the mp_modsqrt test, prime the lazy initialisation of the
* ModsqrtContext */
mp_int *y = mp_new(144);
EdwardsPoint *a = ecc_edwards_point_new_from_y(ec, y, 0);
if (a) /* don't care whether this one succeeded */
ecc_edwards_point_free(a);
for (size_t p = 0; p < looplimit(2); p++) {
for (size_t i = 0; i < looplimit(5); i++) {
EdwardsPoint *A = epoint(ec, i);
mp_int *Y;
ecc_edwards_get_affine(A, NULL, &Y);
mp_copy_into(y, Y);
mp_free(Y);
ecc_edwards_point_free(A);
log_start();
EdwardsPoint *a = ecc_edwards_point_new_from_y(ec, y, p);
log_end();
ecc_edwards_point_free(a);
}
}
mp_free(y);
ecc_edwards_curve_free(ec);
}
static void test_cipher(const ssh_cipheralg *calg)
{
ssh_cipher *c = ssh_cipher_new(calg);
if (!c) {
test_skipped = true;
return;
}
const ssh2_macalg *malg = calg->required_mac;
ssh2_mac *m = NULL;
if (malg) {
m = ssh2_mac_new(malg, c);
if (!m) {
ssh_cipher_free(c);
test_skipped = true;
return;
}
}
uint8_t *ckey = snewn(calg->padded_keybytes, uint8_t);
uint8_t *civ = snewn(calg->blksize, uint8_t);
uint8_t *mkey = malg ? snewn(malg->keylen, uint8_t) : NULL;
size_t datalen = calg->blksize * 8;
size_t maclen = malg ? malg->len : 0;
uint8_t *data = snewn(datalen + maclen, uint8_t);
size_t lenlen = 4;
uint8_t *lendata = snewn(lenlen, uint8_t);
for (size_t i = 0; i < looplimit(16); i++) {
random_read(ckey, calg->padded_keybytes);
if (malg)
random_read(mkey, malg->keylen);
random_read(data, datalen);
random_read(lendata, lenlen);
if (i == 0) {
/* Ensure one of our test IVs will cause SDCTR wraparound */
memset(civ, 0xFF, calg->blksize);
} else {
random_read(civ, calg->blksize);
}
uint8_t seqbuf[4];
random_read(seqbuf, 4);
uint32_t seq = GET_32BIT_MSB_FIRST(seqbuf);
log_start();
ssh_cipher_setkey(c, ckey);
ssh_cipher_setiv(c, civ);
if (m)
ssh2_mac_setkey(m, make_ptrlen(mkey, malg->keylen));
if (calg->flags & SSH_CIPHER_SEPARATE_LENGTH)
ssh_cipher_encrypt_length(c, lendata, lenlen, seq);
ssh_cipher_encrypt(c, data, datalen);
if (m) {
ssh2_mac_generate(m, data, datalen, seq);
ssh2_mac_verify(m, data, datalen, seq);
}
if (calg->flags & SSH_CIPHER_SEPARATE_LENGTH)
ssh_cipher_decrypt_length(c, lendata, lenlen, seq);
ssh_cipher_decrypt(c, data, datalen);
log_end();
}
sfree(ckey);
sfree(civ);
sfree(mkey);
sfree(data);
sfree(lendata);
if (m)
ssh2_mac_free(m);
ssh_cipher_free(c);
}
#define CIPHER_TESTFN(Y_unused, cipher) \
static void test_cipher_##cipher(void) { test_cipher(&cipher); }
CIPHERS(CIPHER_TESTFN, Y_unused)
static void test_mac(const ssh2_macalg *malg, const ssh_cipheralg *calg)
{
ssh_cipher *c = NULL;
if (calg) {
c = ssh_cipher_new(calg);
if (!c) {
test_skipped = true;
return;
}
}
ssh2_mac *m = ssh2_mac_new(malg, c);
if (!m) {
test_skipped = true;
if (c)
ssh_cipher_free(c);
return;
}
size_t ckeylen = calg ? calg->padded_keybytes : 0;
size_t civlen = calg ? calg->blksize : 0;
uint8_t *ckey = snewn(ckeylen, uint8_t);
uint8_t *civ = snewn(civlen, uint8_t);
uint8_t *mkey = snewn(malg->keylen, uint8_t);
size_t datalen = 256;
size_t maclen = malg->len;
uint8_t *data = snewn(datalen + maclen, uint8_t);
for (size_t i = 0; i < looplimit(16); i++) {
random_read(ckey, ckeylen);
random_read(civ, civlen);
random_read(mkey, malg->keylen);
random_read(data, datalen);
uint8_t seqbuf[4];
random_read(seqbuf, 4);
uint32_t seq = GET_32BIT_MSB_FIRST(seqbuf);
log_start();
if (c) {
ssh_cipher_setkey(c, ckey);
ssh_cipher_setiv(c, civ);
}
ssh2_mac_setkey(m, make_ptrlen(mkey, malg->keylen));
ssh2_mac_generate(m, data, datalen, seq);
ssh2_mac_verify(m, data, datalen, seq);
log_end();
}
sfree(ckey);
sfree(civ);
sfree(mkey);
sfree(data);
ssh2_mac_free(m);
if (c)
ssh_cipher_free(c);
}
#define MAC_TESTFN(Y_unused, mac) \
static void test_mac_##mac(void) { test_mac(&mac, NULL); }
SIMPLE_MACS(MAC_TESTFN, Y_unused)
static void test_mac_poly1305(void)
{
test_mac(&ssh2_poly1305, &ssh2_chacha20_poly1305);
}
Implement AES-GCM using the @openssh.com protocol IDs.

I only recently found out that OpenSSH defined their own protocol IDs for AES-GCM, defined to work the same as the standard ones except that they fixed the semantics for how you select the linked cipher+MAC pair during key exchange. (RFC 5647 defines protocol ids for AES-GCM in both the cipher and MAC namespaces, and requires that you MUST select both or neither - but this contradicts the selection policy set out in the base SSH RFCs, and there's no discussion of how you resolve a conflict between them! OpenSSH's answer is to do it the same way ChaCha20-Poly1305 works, because that will ensure the two suites don't fight.)

People do occasionally ask us for this linked cipher/MAC pair, and now I know it's actually feasible, I've implemented it, including a pair of vector implementations for x86 and Arm using their respective architecture extensions for multiplying polynomials over GF(2).

Unlike ChaCha20-Poly1305, I've kept the cipher and MAC implementations in separate objects, with an arm's-length link between them that the MAC uses when it needs to encrypt single cipher blocks to use as the inputs to the MAC algorithm. That enables the cipher and the MAC to be independently selected from their hardware-accelerated versions, just in case someone runs on a system that has polynomial multiplication instructions but not AES acceleration, or vice versa.

There's a fourth implementation of the GCM MAC, which is a pure software implementation of the same algorithm used in the vectorised versions. It's too slow to use live, but I've kept it in the code for future testing needs, and because it's a convenient place to dump my design comments.

The vectorised implementations are fairly crude as far as optimisation goes. I'm sure serious x86 _or_ Arm optimisation engineers would look at them and laugh. But GCM is a fast MAC compared to HMAC-SHA-256 (indeed compared to HMAC-anything-at-all), so it should at least be good enough to use. And we've got a working version with some tests now, so if someone else wants to improve them, they can.
2022-08-16 17:36:58 +00:00
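/*
 * Illustrative sketch, not part of testsc or of PuTTY's GCM code: the
 * 'multiplying polynomials over GF(2)' operation that the commit
 * message above refers to, computed 64x64 -> 128 bits by shift and
 * XOR. The CLMUL and PMULL instructions used by the vectorised GHASH
 * implementations compute this product in hardware; the mask below
 * shows how even a software version can accumulate each partial
 * product without branching on a secret operand.
 */
static inline void clmul64_sketch(uint64_t a, uint64_t b,
                                  uint64_t *hi, uint64_t *lo)
{
    uint64_t rhi = 0, rlo = 0;
    for (unsigned i = 0; i < 64; i++) {
        /* all-ones if bit i of a is set, all-zeroes otherwise */
        uint64_t mask = -(uint64_t)((a >> i) & 1);
        rlo ^= (b << i) & mask;
        rhi ^= (i ? b >> (64 - i) : 0) & mask;
    }
    *hi = rhi;
    *lo = rlo;
}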
static void test_mac_aesgcm_sw_sw(void)
{
test_mac(&ssh2_aesgcm_mac_sw, &ssh_aes128_gcm_sw);
}
static void test_mac_aesgcm_sw_refpoly(void)
{
test_mac(&ssh2_aesgcm_mac_ref_poly, &ssh_aes128_gcm_sw);
}
#if HAVE_AES_NI
static void test_mac_aesgcm_ni_sw(void)
{
test_mac(&ssh2_aesgcm_mac_sw, &ssh_aes128_gcm_ni);
}
#endif
#if HAVE_NEON_CRYPTO
static void test_mac_aesgcm_neon_sw(void)
{
test_mac(&ssh2_aesgcm_mac_sw, &ssh_aes128_gcm_neon);
}
#endif
#if HAVE_CLMUL
static void test_mac_aesgcm_sw_clmul(void)
{
test_mac(&ssh2_aesgcm_mac_clmul, &ssh_aes128_gcm_sw);
}
#endif
#if HAVE_NEON_PMULL
static void test_mac_aesgcm_sw_neon(void)
{
test_mac(&ssh2_aesgcm_mac_neon, &ssh_aes128_gcm_sw);
}
#endif
#if HAVE_AES_NI && HAVE_CLMUL
static void test_mac_aesgcm_ni_clmul(void)
{
test_mac(&ssh2_aesgcm_mac_clmul, &ssh_aes128_gcm_ni);
}
#endif
#if HAVE_NEON_CRYPTO && HAVE_NEON_PMULL
static void test_mac_aesgcm_neon_neon(void)
{
test_mac(&ssh2_aesgcm_mac_neon, &ssh_aes128_gcm_neon);
}
#endif
static void test_hash(const ssh_hashalg *halg)
{
ssh_hash *h = ssh_hash_new(halg);
if (!h) {
test_skipped = true;
return;
}
ssh_hash_free(h);
size_t datalen = 256;
uint8_t *data = snewn(datalen, uint8_t);
uint8_t *hash = snewn(halg->hlen, uint8_t);
for (size_t i = 0; i < looplimit(16); i++) {
random_read(data, datalen);
log_start();
h = ssh_hash_new(halg);
put_data(h, data, datalen);
ssh_hash_final(h, hash);
log_end();
}
sfree(data);
sfree(hash);
}
#define HASH_TESTFN(Y_unused, hash) \
static void test_hash_##hash(void) { test_hash(&hash); }
HASHES(HASH_TESTFN, Y_unused)
struct test {
const char *testname;
void (*testfn)(void);
};
static void test_argon2(void)
{
    /*
     * We can only expect the Argon2i variant to pass this stringent
     * test for no data-dependency, because the other two variants of
     * Argon2 have _deliberate_ data-dependency. (An illustrative
     * sketch of that distinction follows after this function.)
     */
size_t inlen = 48+16+24+8;
uint8_t *indata = snewn(inlen, uint8_t);
ptrlen password = make_ptrlen(indata, 48);
ptrlen salt = make_ptrlen(indata+48, 16);
ptrlen secret = make_ptrlen(indata+48+16, 24);
ptrlen assoc = make_ptrlen(indata+48+16+24, 8);
strbuf *outdata = strbuf_new();
strbuf_append(outdata, 256);
for (size_t i = 0; i < looplimit(16); i++) {
strbuf_clear(outdata);
random_read(indata, inlen);
log_start();
argon2(Argon2i, 32, 2, 2, 144, password, salt, secret, assoc, outdata);
log_end();
}
sfree(indata);
strbuf_free(outdata);
}
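/*
 * Illustrative sketch, not Argon2 itself: the distinction the comment
 * in test_argon2 above refers to. An Argon2d-style reference index is
 * derived from secret-dependent block contents, so _which_ memory
 * block gets read is itself secret; an Argon2i-style index depends
 * only on public loop counters. (The real Argon2i feeds those
 * counters through its compression function; the arithmetic below is
 * a placeholder.)
 */
static size_t ref_index_argon2d_style_sketch(const uint64_t *block,
                                             size_t nblocks)
{
    return block[0] % nblocks;           /* secret-dependent access */
}
static size_t ref_index_argon2i_style_sketch(uint64_t pass, uint64_t slice,
                                             uint64_t counter, size_t nblocks)
{
    return (pass * 0x9E3779B97F4A7C15u + slice * 0xBF58476D1CE4E5B9u +
            counter) % nblocks;          /* counters only, all public */
}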
testsc: add side-channel test of probabilistic prime gen. Now that I've removed side-channel leakage from both prime candidate generation (via mp_unsafe_mod_integer) and Miller-Rabin, the probabilistic prime generation system in this code base is now able to get through testsc without it detecting any source of cache or timing side channels. So you should be able to generate an RSA key (in which the primes themselves must be secret) in a more hostile environment than you could previously be confident of. This is a bit counterintuitive, because _obviously_ random prime generation takes a variable amount of time, because it has to keep retrying until an attempt succeeds! But that's OK as long as the attempts are completely independent, because then any timing or cache information leaked by a _failed_ attempt will only tell an attacker about the numbers used in the failed attempt, and those numbers have been thrown away, so it doesn't matter who knows them. It's only important that the _successful_ attempt, from generating the random candidate through to completing its verification as (probably) prime, should be side-channel clean, because that's the attempt whose data is actually going to be turned into a private key that needs to be kept secret. (In particular, this means you have to avoid the old-fashioned strategy of generating successive prime candidates by incrementing a starting value until you find something not divisible by any small prime, because the number of iterations of that method would be a timing leak. Happily, we stopped doing that last year, in commit 08a3547bc54051e: now every candidate integer is generated independently, and if one fails the initial checks, we throw it away and start completely from scratch with a fresh random value.) So the test harness works by repeatedly running the prime generator in one-shot mode until an attempt succeeds, and then resetting the random-number stream to where it was just before the successful attempt. Then we generate the same prime number again, this time with the sclog mechanism turned on - and then, we compare it against the version we previously generated with the same random numbers, to make sure they're the same. This checks that the attempts really _are_ independent, in the sense that the prime generator is a pure function of its random input stream, and doesn't depend on state left over from previous attempts.
2021-08-27 16:46:25 +00:00
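/*
 * Minimal sketch of the rewind-and-replay pattern described in the
 * commit message above, which test_primegen below instantiates for
 * real. The function-pointer names here are hypothetical
 * placeholders, not PuTTY APIs: the point is only the shape of the
 * harness - retry until an attempt succeeds, rewind the deterministic
 * RNG, replay just the successful attempt under logging, and check
 * that it reproduces the same answer.
 */
static int rewind_and_replay_sketch(
    struct random_state (*save)(void),
    void (*restore)(struct random_state),
    int (*attempt)(void))          /* returns a result, or -1 on failure */
{
    while (true) {
        struct random_state st = save();
        int first = attempt();
        if (first < 0)
            continue;              /* failed attempt: discard it and retry */
        restore(st);               /* rewind the RNG... */
        log_start();
        int second = attempt();    /* ...and replay it, logged this time */
        log_end();
        assert(second == first);   /* pure function of the random stream */
        return first;
    }
}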
static void test_primegen(const PrimeGenerationPolicy *policy)
{
static ProgressReceiver null_progress = { .vt = &null_progress_vt };
PrimeGenerationContext *pgc = primegen_new_context(policy);
init_smallprimes();
mp_int *pcopy = mp_new(128);
for (size_t i = 0; i < looplimit(2); i++) {
while (true) {
random_advance_counter();
testsc: add side-channel test of probabilistic prime gen. Now that I've removed side-channel leakage from both prime candidate generation (via mp_unsafe_mod_integer) and Miller-Rabin, the probabilistic prime generation system in this code base is now able to get through testsc without it detecting any source of cache or timing side channels. So you should be able to generate an RSA key (in which the primes themselves must be secret) in a more hostile environment than you could previously be confident of. This is a bit counterintuitive, because _obviously_ random prime generation takes a variable amount of time, because it has to keep retrying until an attempt succeeds! But that's OK as long as the attempts are completely independent, because then any timing or cache information leaked by a _failed_ attempt will only tell an attacker about the numbers used in the failed attempt, and those numbers have been thrown away, so it doesn't matter who knows them. It's only important that the _successful_ attempt, from generating the random candidate through to completing its verification as (probably) prime, should be side-channel clean, because that's the attempt whose data is actually going to be turned into a private key that needs to be kept secret. (In particular, this means you have to avoid the old-fashioned strategy of generating successive prime candidates by incrementing a starting value until you find something not divisible by any small prime, because the number of iterations of that method would be a timing leak. Happily, we stopped doing that last year, in commit 08a3547bc54051e: now every candidate integer is generated independently, and if one fails the initial checks, we throw it away and start completely from scratch with a fresh random value.) So the test harness works by repeatedly running the prime generator in one-shot mode until an attempt succeeds, and then resetting the random-number stream to where it was just before the successful attempt. Then we generate the same prime number again, this time with the sclog mechanism turned on - and then, we compare it against the version we previously generated with the same random numbers, to make sure they're the same. This checks that the attempts really _are_ independent, in the sense that the prime generator is a pure function of its random input stream, and doesn't depend on state left over from previous attempts.
2021-08-27 16:46:25 +00:00
struct random_state st = random_get_state();
PrimeCandidateSource *pcs = pcs_new(128);
pcs_set_oneshot(pcs);
pcs_ready(pcs);
mp_int *p = primegen_generate(pgc, pcs, &null_progress);
if (p) {
mp_copy_into(pcopy, p);
                mp_free(p);
random_set_state(st);
log_start();
PrimeCandidateSource *pcs = pcs_new(128);
pcs_set_oneshot(pcs);
pcs_ready(pcs);
mp_int *q = primegen_generate(pgc, pcs, &null_progress);
log_end();
assert(q);
assert(mp_cmp_eq(pcopy, q));
mp_free(q);
break;
}
}
}
mp_free(pcopy);
primegen_free_context(pgc);
}
static void test_primegen_probabilistic(void)
{
test_primegen(&primegen_probabilistic);
}
static void test_ntru(void)
{
unsigned p = 11, q = 59, w = 3;
uint16_t *pubkey_orig = snewn(p, uint16_t);
uint16_t *pubkey_check = snewn(p, uint16_t);
uint16_t *pubkey = snewn(p, uint16_t);
uint16_t *plaintext = snewn(p, uint16_t);
uint16_t *ciphertext = snewn(p, uint16_t);
strbuf *buffer = strbuf_new();
strbuf_append(buffer, 16384);
BinarySource src[1];
for (size_t i = 0; i < looplimit(32); i++) {
while (true) {
random_advance_counter();
struct random_state st = random_get_state();
NTRUKeyPair *keypair = ntru_keygen_attempt(p, q, w);
if (keypair) {
memcpy(pubkey_orig, ntru_pubkey(keypair),
p*sizeof(*pubkey_orig));
ntru_keypair_free(keypair);
random_set_state(st);
log_start();
NTRUKeyPair *keypair = ntru_keygen_attempt(p, q, w);
memcpy(pubkey_check, ntru_pubkey(keypair),
p*sizeof(*pubkey_check));
ntru_gen_short(plaintext, p, w);
                ntru_encrypt(ciphertext, plaintext, pubkey, p, q);
ntru_decrypt(plaintext, ciphertext, keypair);
strbuf_clear(buffer);
ntru_encode_pubkey(ntru_pubkey(keypair), p, q,
BinarySink_UPCAST(buffer));
BinarySource_BARE_INIT_PL(src, ptrlen_from_strbuf(buffer));
ntru_decode_pubkey(pubkey, p, q, src);
strbuf_clear(buffer);
ntru_encode_ciphertext(ciphertext, p, q,
BinarySink_UPCAST(buffer));
BinarySource_BARE_INIT_PL(src, ptrlen_from_strbuf(buffer));
ntru_decode_ciphertext(ciphertext, keypair, src);
strbuf_clear(buffer);
ntru_encode_plaintext(plaintext, p, BinarySink_UPCAST(buffer));
log_end();
ntru_keypair_free(keypair);
break;
            }
        }
        assert(!memcmp(pubkey_orig, pubkey_check,
                       p*sizeof(*pubkey_check)));
    }
sfree(pubkey_orig);
sfree(pubkey_check);
sfree(pubkey);
sfree(plaintext);
sfree(ciphertext);
strbuf_free(buffer);
}
New post-quantum kex: ML-KEM, and three hybrids of it. As standardised by NIST in FIPS 203, this is a lattice-based post-quantum KEM. Very vaguely, the idea of it is that your public key is a matrix A and vector t, and the private key is the knowledge of how to decompose t into two vectors with all their coefficients small, one transformed by A relative to the other. Encryption of a binary secret starts by turning each bit into one of two maximally separated residues mod a prime q, and then adding 'noise' based on the public key in the form of small increments and decrements mod q, again with some of the noise transformed by A relative to the rest. Decryption uses the knowledge of t's decomposition to align the two sets of noise so that the _large_ changes (which masked the secret from an eavesdropper) cancel out, leaving only a collection of small changes to the original secret vector. Then the vector of input bits can be recovered by assuming that those accumulated small pieces of noise haven't concentrated in any particular residue enough to push it more than half way to the other of its possible starting values. A weird feature of it is that decryption is not a true mathematical inverse of encryption. The assumption that the noise doesn't get large enough to flip any bit of the secret is only probabilistically valid, not a hard guarantee. In other words, key agreement can fail, simply by getting particularly unlucky with the distribution of your random noise! However, the probability of a failure is very low - less than 2^-138 even for ML-KEM-512, and gets even smaller with the larger variants. An awkward feature for our purposes is that the matrix A, containing a large number of residues mod the prime q=3329, is required to be constructed by a process of rejection sampling, i.e. generating random 12-bit values and throwing away the out-of-range ones. That would be a real pain for our side-channel testing system, which generally handles rejection sampling badly (since it necessarily involves data-dependent control flow and timing variation). Fortunately, the matrix and the random seed it was made from are both public: the matrix seed is transmitted as part of the public key, so it's not necessary to try to hide it. Accordingly, I was able to get the implementation to pass testsc by means of not varying the matrix seed between runs, which is justified by the principle of testsc that you vary the _secrets_ to ensure timing is independent of them - and the matrix seed isn't a secret, so you're allowed to keep it the same. The three hybrid algorithms, defined by the current Internet-Draft draft-kampanakis-curdle-ssh-pq-ke, include one hybrid of ML-KEM-768 with Curve25519 in exactly the same way we were already hybridising NTRU Prime with Curve25519, and two more hybrids of ML-KEM with ECDH over a NIST curve. The former hybrid interoperates with the implementation in OpenSSH 9.9; all three interoperate with the fork 'openssh-oqs' at github.com/open-quantum-safe/openssh, and also with the Python library AsyncSSH.
2024-12-07 19:33:39 +00:00
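/*
 * Illustrative sketch, not PuTTY's ML-KEM code: the rejection
 * sampling of matrix coefficients discussed in the commit message
 * above, generating uniform residues mod q = 3329 by unpacking 12-bit
 * candidates from a byte stream and discarding out-of-range ones (the
 * SampleNTT shape from FIPS 203). The timing of this loop depends on
 * the bytes it consumes, which is exactly why testsc can only
 * tolerate it for the _public_ matrix seed rho, held constant between
 * runs in test_mlkem below.
 */
static size_t sample_ntt_sketch(const uint8_t *bytes, size_t nbytes,
                                uint16_t *out, size_t nwanted)
{
    size_t got = 0;
    for (size_t i = 0; i + 3 <= nbytes && got < nwanted; i += 3) {
        /* three bytes yield two 12-bit candidates */
        uint16_t d1 = bytes[i] | ((bytes[i+1] & 0x0F) << 8);
        uint16_t d2 = (bytes[i+1] >> 4) | (bytes[i+2] << 4);
        if (d1 < 3329)
            out[got++] = d1;
        if (d2 < 3329 && got < nwanted)
            out[got++] = d2;
    }
    return got;                    /* number of coefficients produced */
}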
static void test_mlkem(const mlkem_params *params)
{
char rho[32], sigma[32], z[32], m[32], ek[1568], dk[3168], c[1568];
char k[32], k2[32];
/* rho is a random but public value, so side channels are allowed
* to reveal it (and undoubtedly will). So we don't vary it
* between runs. */
random_read(rho, 32);
for (size_t i = 0; i < looplimit(32); i++) {
random_advance_counter();
random_read(sigma, 32);
random_read(z, 32);
random_read(m, 32);
log_start();
        /* Every other iteration, tamper with the ciphertext so that
         * implicit rejection occurs, because we need to test that
         * that too is done in constant time. (A masked-selection
         * sketch of that idea follows after this function.) */
unsigned tampering = i & 1;
buffer_sink ek_sink[1]; buffer_sink_init(ek_sink, ek, sizeof(ek));
buffer_sink dk_sink[1]; buffer_sink_init(dk_sink, dk, sizeof(dk));
buffer_sink c_sink[1]; buffer_sink_init(c_sink, c, sizeof(c));
buffer_sink k_sink[1]; buffer_sink_init(k_sink, k, sizeof(k));
mlkem_keygen_rho_sigma(
BinarySink_UPCAST(ek_sink), BinarySink_UPCAST(dk_sink),
params, rho, sigma, z);
ptrlen ek_pl = make_ptrlen(ek, ek_sink->out - ek);
ptrlen dk_pl = make_ptrlen(dk, dk_sink->out - dk);
mlkem_encaps_internal(
BinarySink_UPCAST(c_sink), BinarySink_UPCAST(k_sink),
params, ek_pl, m);
        c[0] ^= tampering;
ptrlen c_pl = make_ptrlen(c, c_sink->out - c);
buffer_sink_init(k_sink, k2, sizeof(k2));
bool success = mlkem_decaps(
BinarySink_UPCAST(k_sink), params, dk_pl, c_pl);
log_end();
assert(success);
unsigned eq_expected = tampering ^ 1;
unsigned eq = smemeq(k, k2, 32);
assert(eq == eq_expected);
}
}
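/*
 * Illustrative sketch, not PuTTY's ML-KEM internals: the sort of
 * masked selection that makes implicit rejection constant-time, as
 * exercised by the tampered iterations of test_mlkem above. When the
 * re-encryption check fails, the output key is swapped for a
 * pseudorandom rejection value using a mask rather than a branch, so
 * accepting and rejecting runs execute the same instructions and
 * touch the same memory.
 */
static void ct_select_sketch(uint8_t *out, const uint8_t *accept,
                             const uint8_t *reject, size_t len,
                             unsigned ok)      /* 1 = accept, 0 = reject */
{
    uint8_t mask = (uint8_t)-(uint8_t)(ok & 1);         /* 0xFF or 0x00 */
    for (size_t i = 0; i < len; i++)
        out[i] = (accept[i] & mask) | (reject[i] & (uint8_t)~mask);
}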
static void test_mlkem512(void) { test_mlkem(&mlkem_params_512); }
static void test_mlkem768(void) { test_mlkem(&mlkem_params_768); }
static void test_mlkem1024(void) { test_mlkem(&mlkem_params_1024); }
Switch to RFC 6979 for DSA nonce generation. This fixes a vulnerability that compromises NIST P521 ECDSA keys when they are used with PuTTY's existing DSA nonce generation code. The vulnerability has been assigned the identifier CVE-2024-31497. PuTTY has been doing its DSA signing deterministically for literally as long as it's been doing it at all, because I didn't trust Windows's entropy generation. Deterministic nonce generation was introduced in commit d345ebc2a5a0b59, as part of the initial version of our DSA signing routine. At the time, there was no standard for how to do it, so we had to think up the details of our system ourselves, with some help from the Cambridge University computer security group. More than ten years later, RFC 6979 was published, recommending a similar system for general use, naturally with all the details different. We didn't switch over to doing it that way, because we had a scheme in place already, and as far as I could see, the differences were not security-critical - just the normal sort of variation you expect when any two people design a protocol component of this kind independently. As far as I know, the _structure_ of our scheme is still perfectly fine, in terms of what data gets hashed, how many times, and how the hash output is converted into a nonce. But the weak spot is the choice of hash function: inside our dsa_gen_k() function, we generate 512 bits of random data using SHA-512, and then reduce that to the output range by modular reduction, regardless of what signature algorithm we're generating a nonce for. In the original use case, this introduced a theoretical bias (the output size is an odd prime, which doesn't evenly divide the space of 2^512 possible inputs to the reduction), but the theory was that since integer DSA uses a modulus prime only 160 bits long (being based on SHA-1, at least in the form that SSH uses it), the bias would be too small to be detectable, let alone exploitable. Then we reused the same function for NIST-style ECDSA, when it arrived. This is fine for the P256 curve, and even P384. But in P521, the order of the base point is _greater_ than 2^512, so when we generate a 512-bit number and reduce it, the reduction never makes any difference, and our output nonces are all in the first 2^512 elements of the range of about 2^521. So this _does_ introduce a significant bias in the nonces, compared to the ideal of uniformly random distribution over the whole range. And it's been recently discovered that a bias of this kind is sufficient to expose private keys, given a manageably small number of signatures to work from. (Incidentally, none of this affects Ed25519. The spec for that system includes its own idea of how you should do deterministic nonce generation - completely different again, naturally - and we did it that way rather than our way, so that we could use the existing test vectors.) The simplest fix would be to patch our existing nonce generator to use a longer hash, or concatenate a couple of SHA-512 hashes, or something similar. But I think a more robust approach is to switch it out completely for what is now the standard system. The main reason why I prefer that is that the standard system comes with test vectors, which adds a lot of confidence that I haven't made some other mistake in following my own design. So here's a commit that adds an implementation of RFC 6979, and removes the old dsa_gen_k() function. 
Tests are added based on the RFC's appendix of test vectors (as many as are compatible with the more limited API of PuTTY's crypto code, e.g. we lack support for the NIST P192 curve, or for doing integer DSA with many different hash functions). One existing test changes its expected outputs, namely the one that has a sample key pair and signature for every key algorithm we support.
2024-04-01 08:18:34 +00:00
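/*
 * Illustrative sketch with toy 16-bit numbers, not PuTTY's RFC 6979
 * code: the difference between deriving a nonce by modular reduction
 * and by the candidate-and-retry approach described above. With
 * reduction, if q does not divide the candidate space, small residues
 * come out slightly more often - and if q exceeds the whole candidate
 * space (the P-521 situation), reduction changes nothing and the top
 * of [1, q-1] is never produced at all. Rejection simply redraws
 * until the candidate is already in range.
 */
static unsigned nonce_by_reduction_sketch(unsigned candidate16, unsigned q)
{
    return candidate16 % q;        /* biased unless q divides 2^16 */
}
static unsigned nonce_by_rejection_sketch(unsigned (*next16)(void),
                                          unsigned q)
{
    unsigned k;
    do
        k = next16() & 0xFFFF;     /* fresh 16-bit candidate each time */
    while (k == 0 || k >= q);      /* retry until 1 <= k <= q-1 */
    return k;
}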
static void test_rfc6979_setup(void)
{
mp_int *q = mp_new(512);
mp_int *x = mp_new(512);
strbuf *message = strbuf_new();
strbuf_append(message, 123);
RFC6979 *s = rfc6979_new(&ssh_sha256, q, x);
for (size_t i = 0; i < looplimit(20); i++) {
random_read(message->u, message->len);
mp_random_fill(q);
mp_random_fill(x);
log_start();
rfc6979_setup(s, ptrlen_from_strbuf(message));
log_end();
}
rfc6979_free(s);
mp_free(q);
mp_free(x);
strbuf_free(message);
}
static void test_rfc6979_attempt(void)
{
mp_int *q = mp_new(512);
mp_int *x = mp_new(512);
strbuf *message = strbuf_new();
strbuf_append(message, 123);
RFC6979 *s = rfc6979_new(&ssh_sha256, q, x);
for (size_t i = 0; i < looplimit(5); i++) {
random_read(message->u, message->len);
mp_random_fill(q);
mp_random_fill(x);
rfc6979_setup(s, ptrlen_from_strbuf(message));
for (size_t j = 0; j < looplimit(10); j++) {
log_start();
RFC6979Result result = rfc6979_attempt(s);
mp_free(result.k);
log_end();
}
}
rfc6979_free(s);
mp_free(q);
mp_free(x);
strbuf_free(message);
}
static const struct test tests[] = {
#define STRUCT_TEST(X) { #X, test_##X },
TESTLIST(STRUCT_TEST)
#undef STRUCT_TEST
};
void dputs(const char *buf)
{
fputs(buf, stderr);
}
int main(int argc, char **argv)
{
bool doing_opts = true;
const char *pname = argv[0];
uint8_t tests_to_run[lenof(tests)];
bool keep_outfiles = false;
bool test_names_given = false;
Arm: turn on PSTATE.DIT if available and needed. DIT, for 'Data-Independent Timing', is a bit you can set in the processor state on sufficiently new Arm CPUs, which promises that a long list of instructions will deliberately avoid varying their timing based on the input register values. Just what you want for keeping your constant-time crypto primitives constant-time. As far as I'm aware, no CPU has _yet_ implemented any data-dependent optimisations, so DIT is a safety precaution against them doing so in future. It would be embarrassing to be caught without it if a future CPU does do that, so we now turn on DIT in the PuTTY process state. I've put a call to the new enable_dit() function at the start of every main() and WinMain() belonging to a program that might do cryptography (even testcrypt, in case someone uses it for something!), and in case I missed one there, also added a second call at the first moment that any cryptography-using part of the code looks as if it might become active: when an instance of the SSH protocol object is configured, when the system PRNG is initialised, and when selecting any cryptographic authentication protocol in an HTTP or SOCKS proxy connection. With any luck those precautions between them should ensure it's on whenever we need it. Arm's own recommendation is that you should carefully choose the granularity at which you enable and disable DIT: there's a potential time cost to turning it on and off (I'm not sure what, but plausibly something of the order of a pipeline flush), so it's a performance hit to do it _inside_ each individual crypto function, but if CPUs start supporting significant data-dependent optimisation in future, then it will also become a noticeable performance hit to just leave it on across the whole process. So you'd like to do it somewhere in the middle: for example, you might turn on DIT once around the whole process of verifying and decrypting an SSH packet, instead of once for decryption and once for MAC. With all respect to that recommendation as a strategy for maximum performance, I'm not following it here. I turn on DIT at the start of the PuTTY process, and then leave it on. Rationale: 1. PuTTY is not otherwise a performance-critical application: it's not likely to max out your CPU for any purpose _other_ than cryptography. The most CPU-intensive non-cryptographic thing I can imagine a PuTTY process doing is the complicated computation of font rendering in the terminal, and that will normally be cached (you don't recompute each glyph from its outline and hints for every time you display it). 2. I think a bigger risk lies in accidental side channels from having DIT turned off when it should have been on. I can imagine lots of causes for that. Missing a crypto operation in some unswept corner of the code; confusing control flow (like my coroutine macros) jumping with DIT clear into the middle of a region of code that expected DIT to have been set at the beginning; having a reference counter of DIT requests and getting it out of sync. In a more sophisticated programming language, it might be possible to avoid the risk in #2 by cleverness with the type system. 
For example, in Rust, you could have a zero-sized type that acts as a proof token for DIT being enabled (it would be constructed by a function that also sets DIT, have a Drop implementation that clears DIT, and be !Send so you couldn't use it in a thread other than the one where DIT was set), and then you could require all the actual crypto functions to take a DitToken as an extra parameter, at zero runtime cost. Then "oops I forgot to set DIT around this piece of crypto" would become a compile error. Even so, you'd have to take some care with coroutine-structured code (what happens if a Rust async function yields while holding a DIT token?) and with nesting (if you have two DIT tokens, you don't want dropping the inner one to clear DIT while the outer one is still there to wrongly convince callees that it's set). Maybe in Rust you could get this all to work reliably. But not in C! DIT is an optional feature of the Arm architecture, so we must first test to see if it's supported. This is done the same way as we already do for the various Arm crypto accelerators: on ELF-based systems, check the appropriate bit in the 'hwcap' words in the ELF aux vector; on Mac, look for an appropriate sysctl flag. On Windows I don't know of a way to query the DIT feature, _or_ of a way to write the necessary enabling instruction in an MSVC-compatible way. I've _heard_ that it might not be necessary, because Windows might just turn on DIT unconditionally and leave it on, in an even more extreme version of my own strategy. I don't have a source for that - I heard it by word of mouth - but I _hope_ it's true, because that would suit me very well! Certainly I can't write code to enable DIT without knowing (a) how to do it, (b) how to know if it's safe. Nonetheless, I've put the enable_dit() call in all the right places in the Windows main programs as well as the Unix and cross-platform code, so that if I later find out that I _can_ put in an explicit enable of DIT in some way, I'll only have to arrange to set HAVE_ARM_DIT and compile the enable_dit() function appropriately.
2024-12-19 08:47:08 +00:00
    /* One day, perhaps, if I ever get this test to work on Arm, we
     * might actually _check_ DIT is enabled, and check we're sticking
     * to the precise list of DIT-affected instructions. (An
     * illustrative sketch of enabling DIT appears after main().) */
enable_dit();
memset(tests_to_run, 1, sizeof(tests_to_run));
random_hash = ssh_hash_new(&ssh_sha256);
while (--argc > 0) {
char *p = *++argv;
if (p[0] == '-' && doing_opts) {
if (!strcmp(p, "-O")) {
if (--argc <= 0) {
fprintf(stderr, "'-O' expects a directory name\n");
return 1;
}
outdir = *++argv;
} else if (!strcmp(p, "-k") || !strcmp(p, "--keep")) {
keep_outfiles = true;
} else if (!strcmp(p, "--")) {
doing_opts = false;
} else if (!strcmp(p, "--help")) {
printf(" usage: drrun -c test/sclog/libsclog.so -- "
"%s -O <outdir>\n", pname);
printf("options: -O <outdir> "
"put log files in the specified directory\n");
printf(" -k, --keep "
"do not delete log files for tests that passed\n");
printf(" also: --help "
"display this text\n");
return 0;
} else {
fprintf(stderr, "unknown command line option '%s'\n", p);
return 1;
}
} else {
if (!test_names_given) {
test_names_given = true;
memset(tests_to_run, 0, sizeof(tests_to_run));
}
bool found_one = false;
for (size_t i = 0; i < lenof(tests); i++) {
if (wc_match(p, tests[i].testname)) {
tests_to_run[i] = 1;
found_one = true;
}
}
if (!found_one) {
fprintf(stderr, "no test name matched '%s'\n", p);
return 1;
}
}
}
bool is_dry_run = dry_run();
if (is_dry_run) {
printf("Dry run (DynamoRIO instrumentation not detected)\n");
} else {
/* Print the address of main() in this run. The idea is that
* if this image is compiled to be position-independent, then
* PC values in the logs won't match the ones you get if you
* disassemble the binary, so it'll be harder to match up the
* log messages to the code. But if you know the address of a
* fixed (and not inlined) function in both worlds, you can
* find out the offset between them. */
printf("Live run, main = %p\n", (void *)main);
if (!outdir) {
fprintf(stderr, "expected -O <outdir> option\n");
return 1;
}
printf("Will write log files to %s\n", outdir);
}
size_t nrun = 0, npass = 0;
for (size_t i = 0; i < lenof(tests); i++) {
bool keep_these_outfiles = true;
if (!tests_to_run[i])
continue;
const struct test *test = &tests[i];
printf("Running test %s ... ", test->testname);
fflush(stdout);
test_skipped = false;
random_seed(test->testname);
test_basename = test->testname;
test_index = 0;
test->testfn();
if (test_skipped) {
/* Used for e.g. tests of hardware-accelerated crypto when
* the hardware acceleration isn't available */
printf("skipped\n");
continue;
}
nrun++;
if (is_dry_run) {
printf("dry run done\n");
continue; /* test files won't exist anyway */
}
if (test_index < 2) {
printf("FAIL: test did not generate multiple output files\n");
goto test_done;
}
char *firstfile = log_filename(test_basename, 0);
FILE *firstfp = fopen(firstfile, "rb");
if (!firstfp) {
printf("ERR: %s: open: %s\n", firstfile, strerror(errno));
goto test_done;
}
for (size_t i = 1; i < test_index; i++) {
char *nextfile = log_filename(test_basename, i);
FILE *nextfp = fopen(nextfile, "rb");
if (!nextfp) {
printf("ERR: %s: open: %s\n", nextfile, strerror(errno));
goto test_done;
}
rewind(firstfp);
char buf1[4096], bufn[4096];
bool compare_ok = false;
while (true) {
size_t r1 = fread(buf1, 1, sizeof(buf1), firstfp);
size_t rn = fread(bufn, 1, sizeof(bufn), nextfp);
if (r1 != rn) {
printf("FAIL: %s %s: different lengths\n",
firstfile, nextfile);
break;
}
if (r1 == 0) {
if (feof(firstfp) && feof(nextfp)) {
compare_ok = true;
} else {
printf("FAIL: %s %s: error at end of file\n",
firstfile, nextfile);
}
break;
}
if (memcmp(buf1, bufn, r1) != 0) {
printf("FAIL: %s %s: different content\n",
firstfile, nextfile);
break;
}
}
fclose(nextfp);
sfree(nextfile);
if (!compare_ok) {
goto test_done;
}
}
fclose(firstfp);
sfree(firstfile);
printf("pass\n");
npass++;
keep_these_outfiles = keep_outfiles;
test_done:
if (!keep_these_outfiles) {
for (size_t i = 0; i < test_index; i++) {
char *file = log_filename(test_basename, i);
remove(file);
sfree(file);
}
}
}
ssh_hash_free(random_hash);
if (npass == nrun) {
printf("All tests passed\n");
return 0;
} else {
printf("%"SIZEu" tests failed\n", nrun - npass);
return 1;
}
}
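/*
 * Illustrative sketch, not PuTTY's enable_dit(): roughly how a
 * Linux/AArch64 process can detect and switch on PSTATE.DIT, as
 * discussed in the commit message accompanying the enable_dit() call
 * in main() above. The real function's platform probing is more
 * involved (ELF aux vector on Linux, sysctl on macOS), and the
 * MSR-immediate form used here needs an assembler that knows the
 * Armv8.4 DIT extension.
 */
#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>
#include <asm/hwcap.h>
static void enable_dit_sketch(void)
{
    if (getauxval(AT_HWCAP) & HWCAP_DIT)
        __asm__ volatile("msr dit, #1");
}
#endif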