From aab089267118feb50c7bb1a2a68f96d9b097daa2 Mon Sep 17 00:00:00 2001
From: Simon Tatham
Date: Mon, 1 Apr 2024 08:48:36 +0100
Subject: [PATCH] Side-channel tester: align memory allocations.

While trying to get an upcoming piece of code through testsc, I had
trouble - _yet again_ - with the way that control flow diverges inside
the glibc implementations of functions like memcpy and memset,
depending on the alignment of the input blocks _above_ the alignment
guaranteed by malloc, so that doing the same sequence of malloc +
memset can lead to different control flow. (I believe this is done
either for cache performance reasons or SIMD alignment requirements,
or both: on x86, some SIMD instructions require memory alignment
beyond what malloc guarantees, which is also awkward for our x86
hardware crypto implementations.)

My previous effort to normalise this problem out of sclog's log files
worked by wrapping memset and all its synonyms that I could find. But
this weekend, that failed for me, and the reason appears to be ifuncs.

I'm aware of the great irony of committing code to a security project
with a log message saying something vague about ifuncs, on the same
weekend that it came to light that commits matching that description
were one of the methods used to smuggle a backdoor into the XZ Utils
project (CVE-2024-3094). So I'll bend over backwards to explain both
what I think is going on, and why this _isn't_ a weird ifunc-related
backdooring attempt:

When I say I 'wrap' memset, I mean I use DynamoRIO's 'drwrap' API to
arrange that the side-channel test rig calls a function of mine before
and after each call to memset. The way drwrap works is to look up the
symbol address in either the main program or a shared library; in this
case, it's a shared library, namely libc.so. Then it intercepts call
instructions with exactly that address as the target.

Unfortunately, what _actually_ happens when the main program calls
memset is more complicated.
First, control goes to the PLT entry for memset (still in the main
program). In principle, that loads a GOT entry containing the address
of memset (filled in by ld.so), and jumps to it. But in fact the GOT
entry varies its value through the program; on the first call, it
points to a resolver function, whose job is to _find out_ the address
of memset. And in the version of libc.so I'm currently running, that
resolver is an STT_GNU_IFUNC indirection function, which tests the
host CPU's capabilities, and chooses an actual implementation of
memset depending on what it finds. (In my case, it looks as if it's
picking one that makes extensive use of x86 SIMD.)

To avoid the overhead of doing this on every call, the returned
function pointer is then written into the main program's GOT entry for
memset, overwriting the address of the resolver function, so that the
_next_ call the main program makes through the same PLT entry will go
directly to the memset variant that was chosen.

And the problem is that, after this has happened, none of the new
control flow ever goes near the _official_ address of memset, as read
out of libc.so's dynamic symbol table by DynamoRIO. The PLT entry
isn't at that address, and neither is the particular SIMD variant that
the resolver ended up choosing. So now my wrapper on memset is never
being invoked, and memset cheerfully generates different control flow
in runs of my crypto code that testsc expects to be doing exactly the
same thing as each other, and all my tests fail spuriously.

My solution, at least for the moment, is to completely abandon the
strategy of wrapping memset. Instead, let's just make it behave the
same way every time, by forcing all the affected memory allocations to
have extra-strict alignment. I found that 64-byte alignment is not
good enough to eliminate memset-related test failures, but 128-byte
alignment is.
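
The lazy-binding sequence described above can be modelled in plain,
portable C. This is only a rough sketch, not code from this patch and
not how ld.so or DynamoRIO are implemented: got_slot, first_call_stub,
resolve_memset, official_memset, simd_variant_memset, call_via_plt and
wrapper_would_fire are all hypothetical names, and a function pointer
stands in for the real GOT entry. It shows why a hook keyed on the
"official" symbol address never sees the calls.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*memset_fn)(void *, int, size_t);

/* Counts calls that reach the variant chosen by the resolver. */
static int variant_hits = 0;

/* Stands in for the address a symbol-table lookup of "memset" yields,
 * i.e. the address a wrapper would be attached to (assumption). */
static void *official_memset(void *p, int c, size_t n)
{
    return memset(p, c, n);
}

/* Stands in for a CPU-specific implementation picked at run time. */
static void *simd_variant_memset(void *p, int c, size_t n)
{
    variant_hits++;
    return memset(p, c, n);
}

static void *first_call_stub(void *p, int c, size_t n);

/* Stands in for the main program's GOT entry for memset. Before the
 * first call it points at the stub that triggers resolution. */
static memset_fn got_slot = first_call_stub;

/* Stands in for the ifunc resolver. A real one would test CPU
 * features; this one always returns the same variant. */
static memset_fn resolve_memset(void)
{
    return simd_variant_memset;
}

/* First call: resolve, overwrite the slot, then forward the call. */
static void *first_call_stub(void *p, int c, size_t n)
{
    got_slot = resolve_memset();
    assert(got_slot != first_call_stub);
    return got_slot(p, c, n);
}

/* Stands in for the PLT entry: always jump through the slot. */
static void *call_via_plt(void *p, int c, size_t n)
{
    return got_slot(p, c, n);
}

/* Would a hook keyed on the official address see the next call? */
static int wrapper_would_fire(void)
{
    return got_slot == official_memset;
}
```

In this model, every call made through call_via_plt() ends up in
simd_variant_memset(), and wrapper_would_fire() stays false both
before and after resolution, which mirrors the failure described
above.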

This would be tricky in itself, if it weren't for the fact that PuTTY
already has its own wrapper function on malloc (for various reasons),
which everything in our code already uses. So I can divert to C11's
aligned_alloc() there.

That in turn is done by adding a new #ifdef to utils/memory.c, and
compiling it with that #ifdef into a new object library that is
included in testsc, superseding the standard memory.o that would
otherwise be pulled in from our 'utils' static library.

With the previous memset-compensator removed, this means testsc is now
dependent on having aligned_alloc() available. So we test for it at
cmake time, and don't build testsc at all if it can't be found. This
shouldn't bother anyone very much; aligned_alloc() is available on
_my_ testsc platform, and if anyone else is trying to run this test
suite at all, I expect it will be on something at least as new as
that.

(One awkward thing here is that we can only replace _new_ allocations
with calls to aligned_alloc(): C11 provides no aligned version of
realloc. Happily, this doesn't currently introduce any new problems in
testsc. If it does, I might have to do something even more painful in
future.)

So, why isn't this an ifunc-related backdoor attempt? Because (and you
can check all of this from the patch):

 1. The memset-wrapping code exists entirely within the DynamoRIO
    plugin module that lives in test/sclog. That is not used in
    production, only for running the 'testsc' side-channel tester.

 2. The memset-wrapping code is _removed_ by this patch, not added.

 3. None of this code is dealing directly with ifuncs - only working
    around the unwanted effects on my test suite from the fact that
    they exist somewhere else and introduce awkward behaviour.
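
For readers who want to try the aligned_alloc() diversion in
isolation, here is a minimal standalone sketch. It is not PuTTY's
safemalloc: overaligned_malloc and round_up_to_alignment are
hypothetical helpers, and only the ALLOCATION_ALIGNMENT name and its
value of 128 come from the patch. One deliberate difference is that
the sketch rounds the size up to a multiple of the alignment, because
the original C11 wording asks for that; the patch passes the size
through unchanged and relies on the platform accepting it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same value the patch defines when building testsc. */
#define ALLOCATION_ALIGNMENT 128

/* Hypothetical helper: round size up to a multiple of the alignment.
 * Returns 0 if the rounded size would overflow size_t. A request for
 * zero bytes is treated as a request for one alignment unit. */
static size_t round_up_to_alignment(size_t size)
{
    const size_t a = ALLOCATION_ALIGNMENT;
    if (size == 0)
        return a;
    if (size > SIZE_MAX - (a - 1))
        return 0;
    return (size + a - 1) / a * a;
}

/* Hypothetical stand-in for the diverted malloc call. The result is
 * released with plain free(), as with any aligned_alloc() block. */
static void *overaligned_malloc(size_t size)
{
    size_t rounded = round_up_to_alignment(size);
    if (rounded == 0)
        return NULL;
    void *p = aligned_alloc(ALLOCATION_ALIGNMENT, rounded);
    /* Either the allocation failed, or it has the promised alignment. */
    assert(p == NULL || (uintptr_t)p % ALLOCATION_ALIGNMENT == 0);
    return p;
}
```

With every block starting on a 128-byte boundary, the low seven bits
of each returned pointer are zero on every run, which is the property
that makes memset take the same path each time.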
---
 cmake/setup.cmake   |  8 +++++
 test/sclog/sclog.c  | 81 ++++++---------------------------------------
 unix/CMakeLists.txt | 12 +++++--
 utils/memory.c      | 10 ++++++
 4 files changed, 37 insertions(+), 74 deletions(-)

diff --git a/cmake/setup.cmake b/cmake/setup.cmake
index d81d5a5c..1be448df 100644
--- a/cmake/setup.cmake
+++ b/cmake/setup.cmake
@@ -108,6 +108,14 @@ include_directories(
   ${platform}
   ${extra_dirs})
 
+check_c_source_compiles("
+#define _ISOC11_SOURCE
+#include <stdlib.h>
+int main(int argc, char **argv) {
+  void *p = aligned_alloc(128, 12345);
+  free(p);
+}" HAVE_ALIGNED_ALLOC)
+
 if(PUTTY_DEBUG)
   add_compile_definitions(DEBUG)
 endif()
diff --git a/test/sclog/sclog.c b/test/sclog/sclog.c
index d5304a01..74f4f86a 100644
--- a/test/sclog/sclog.c
+++ b/test/sclog/sclog.c
@@ -214,6 +214,14 @@ static void wrap_malloc_pre(void *wrapctx, void **user_data)
     *user_data = drwrap_get_arg(wrapctx, 0);
     dr_fprintf(outfile, "malloc %"PRIuMAX"\n", (uintmax_t)*user_data);
 }
+static void wrap_aligned_alloc_pre(void *wrapctx, void **user_data)
+{
+    logging_paused++;
+    size_t align = (size_t) drwrap_get_arg(wrapctx, 0);
+    *user_data = drwrap_get_arg(wrapctx, 1);
+    dr_fprintf(outfile, "aligned_alloc align=%zu size=%"PRIuMAX"\n",
+               align, (uintmax_t)*user_data);
+}
 static void wrap_free_pre(void *wrapctx, void **user_data)
 {
     logging_paused++;
@@ -239,71 +247,7 @@ static void wrap_alloc_post(void *wrapctx, void *user_data)
 }
 
 /*
- * We wrap the C library function memset, because I've noticed that at
- * least one optimised implementation of it diverges control flow
- * internally based on what appears to be the _alignment_ of the input
- * pointer - and that alignment check can vary depending on the
- * addresses of allocated blocks. So I can't guarantee no divergence
- * of control flow inside memset if malloc doesn't return the same
- * values, and instead I just have to trust that memset isn't reading
- * the contents of the block and basing control flow decisions on that.
- */
-static void wrap_memset_pre(void *wrapctx, void **user_data)
-{
-    uint was_already_paused = logging_paused++;
-
-    if (outfile == INVALID_FILE || was_already_paused)
-        return;
-
-    const void *addr = drwrap_get_arg(wrapctx, 0);
-    size_t size = (size_t)drwrap_get_arg(wrapctx, 2);
-
-    struct allocation *alloc = find_allocation(addr);
-    if (!alloc) {
-        dr_fprintf(outfile, "memset %"PRIuMAX" @ %"PRIxMAX"\n",
-                   (uintmax_t)size, (uintmax_t)addr);
-    } else {
-        dr_fprintf(outfile, "memset %"PRIuMAX" @ allocations[%"PRIuPTR"]"
-                   " + %"PRIxMAX"\n", (uintmax_t)size, alloc->index,
-                   (uintmax_t)(addr - alloc->start));
-    }
-}
-
-/*
- * Similarly to the above, wrap some versions of memmove.
- */
-static void wrap_memmove_pre(void *wrapctx, void **user_data)
-{
-    uint was_already_paused = logging_paused++;
-
-    if (outfile == INVALID_FILE || was_already_paused)
-        return;
-
-    const void *daddr = drwrap_get_arg(wrapctx, 0);
-    const void *saddr = drwrap_get_arg(wrapctx, 1);
-    size_t size = (size_t)drwrap_get_arg(wrapctx, 2);
-
-
-    struct allocation *alloc;
-
-    dr_fprintf(outfile, "memmove %"PRIuMAX" ", (uintmax_t)size);
-    if (!(alloc = find_allocation(daddr))) {
-        dr_fprintf(outfile, "to %"PRIxMAX" ", (uintmax_t)daddr);
-    } else {
-        dr_fprintf(outfile, "to allocations[%"PRIuPTR"] + %"PRIxMAX" ",
-                   alloc->index, (uintmax_t)(daddr - alloc->start));
-    }
-    if (!(alloc = find_allocation(saddr))) {
-        dr_fprintf(outfile, "from %"PRIxMAX"\n", (uintmax_t)saddr);
-    } else {
-        dr_fprintf(outfile, "from allocations[%"PRIuPTR"] + %"PRIxMAX"\n",
-                   alloc->index, (uintmax_t)(saddr - alloc->start));
-    }
-}
-
-/*
- * Common post-wrapper function for memset and free, whose entire
- * function is to unpause the logging.
+ * Common post-wrapper function to unpause the logging.
  */
 static void unpause_post(void *wrapctx, void *user_data)
 {
@@ -594,10 +538,9 @@ static void load_module(
     TRY_WRAP("dry_run_real", NULL, wrap_dryrun);
     if (libc) {
         TRY_WRAP("malloc", wrap_malloc_pre, wrap_alloc_post);
+        TRY_WRAP("aligned_alloc", wrap_aligned_alloc_pre, wrap_alloc_post);
         TRY_WRAP("realloc", wrap_realloc_pre, wrap_alloc_post);
         TRY_WRAP("free", wrap_free_pre, unpause_post);
-        TRY_WRAP("memset", wrap_memset_pre, unpause_post);
-        TRY_WRAP("memmove", wrap_memmove_pre, unpause_post);
 
         /*
          * More strangely named versions of standard C library
@@ -616,10 +559,6 @@ static void load_module(
         TRY_WRAP("__libc_malloc", wrap_malloc_pre, wrap_alloc_post);
         TRY_WRAP("__GI___libc_realloc", wrap_realloc_pre, wrap_alloc_post);
         TRY_WRAP("__GI___libc_free", wrap_free_pre, unpause_post);
-        TRY_WRAP("__memset_sse2_unaligned", wrap_memset_pre, unpause_post);
-        TRY_WRAP("__memset_sse2", wrap_memset_pre, unpause_post);
-        TRY_WRAP("__memmove_avx_unaligned_erms", wrap_memmove_pre,
-                 unpause_post);
         TRY_WRAP("cfree", wrap_free_pre, unpause_post);
     }
 }
diff --git a/unix/CMakeLists.txt b/unix/CMakeLists.txt
index d4de28df..ce02098c 100644
--- a/unix/CMakeLists.txt
+++ b/unix/CMakeLists.txt
@@ -100,9 +100,15 @@ add_executable(cgtest
   $)
 target_link_libraries(cgtest keygen console crypto utils)
 
-add_executable(testsc
-  ${CMAKE_SOURCE_DIR}/test/testsc.c)
-target_link_libraries(testsc keygen crypto utils)
+if(HAVE_ALIGNED_ALLOC)
+  add_library(overaligned_alloc OBJECT
+    ${CMAKE_SOURCE_DIR}/utils/memory.c)
+  target_compile_definitions(overaligned_alloc PRIVATE ALLOCATION_ALIGNMENT=128)
+  add_executable(testsc
+    ${CMAKE_SOURCE_DIR}/test/testsc.c
+    $<TARGET_OBJECTS:overaligned_alloc>)
+  target_link_libraries(testsc keygen crypto utils)
+endif()
 
 add_executable(testzlib
   ${CMAKE_SOURCE_DIR}/test/testzlib.c
diff --git a/utils/memory.c b/utils/memory.c
index 97ae9401..0ba791ad 100644
--- a/utils/memory.c
+++ b/utils/memory.c
@@ -2,6 +2,12 @@
  * PuTTY's memory allocation wrappers.
  */
 
+#ifdef ALLOCATION_ALIGNMENT
+/* Before we include standard headers, define _ISOC11_SOURCE so that
+ * we get the declaration of aligned_alloc(). */
+#define _ISOC11_SOURCE
+#endif
+
 #include
 #include
 #include
@@ -28,6 +34,8 @@ void *safemalloc(size_t factor1, size_t factor2, size_t addend)
     void *p;
 #ifdef MINEFIELD
     p = minefield_c_malloc(size);
+#elif defined ALLOCATION_ALIGNMENT
+    p = aligned_alloc(ALLOCATION_ALIGNMENT, size);
 #else
     p = malloc(size);
 #endif
@@ -52,6 +60,8 @@ void *saferealloc(void *ptr, size_t n, size_t size)
         if (!ptr) {
 #ifdef MINEFIELD
             p = minefield_c_malloc(size);
+#elif defined ALLOCATION_ALIGNMENT
+            p = aligned_alloc(ALLOCATION_ALIGNMENT, size);
 #else
             p = malloc(size);
 #endif