The problem
C23 (N3088, https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf) simply states that
"The memset_explicit function copies the value of c (converted to an unsigned char) into each of
the first n characters of the object pointed to by s. The purpose of this function is to make sensitive
information stored in the object inaccessible."
OK, inaccessible, but to what extent?
Currently, Glibc provides explicit_bzero, which is just a normal memset followed by a full compiler fence (see glibc/string/explicit_bzero.c in the bminor/glibc mirror on GitHub).
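A minimal sketch of that glibc-style technique (the function name here is illustrative, not glibc's actual internals): an ordinary memset followed by an empty asm statement with a "memory" clobber, which forces the compiler to treat the buffer's contents as observable so the stores cannot be eliminated as dead.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of the glibc approach: plain memset plus a full compiler fence. */
static void explicit_bzero_sketch(void *s, size_t n) {
  memset(s, 0, n);
  /* Empty asm with a "memory" clobber: the compiler must assume this
   * statement may read any memory, so the memset above cannot be
   * optimized out as a dead store. */
  __asm__ __volatile__("" : : "r"(s) : "memory");
}
```

Note this is purely a compiler-level barrier; it says nothing about caches or other hardware state.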
However, Jens Gustedt suggests in his blog post that:
- A call to this function should never be optimized out. This can often be achieved by having it in a separate TU from all other functions and by disabling link-time optimization for this TU or at least for this function.
- No store to any byte of the function should be optimized out.
- The return of the function should synchronize with all read and write operations. This could for example be achieved by issuing a call
atomic_signal_fence(memory_order_seq_cst)
or equivalent, such that even a signal that kicks in right after the call could not read the previous contents of the byte array.
- All caches for the byte array should have been invalidated on return.
- To avoid side-channel attacks, the implementation of the function should make no explicit or implicit reference to the contents of the byte array nor the value that has been chosen for the overwrite. Each write operation should use the same time (and other resources) per byte.
- Good performance is not expected, security first.
"Each write operation should use the same time (and other resources) per byte." Does this mean that we cannot unroll loops or use SIMD at all? I don’t really see how such optimizations could be exploited in a side-channel attack. Consider the real-world scenario: if a secret is stored in memory, you’d better store it in a fixed-size region no matter what the exact length of the password is. In that case, we don’t need identical per-byte timing.
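The fixed-size-region scenario above can be sketched as follows (the struct and names are hypothetical, for illustration only): the secret occupies a slot of constant capacity, and the wipe always clears the whole slot, so its cost depends only on the slot size, never on the secret's actual length or contents.

```c
#include <stddef.h>
#include <string.h>

enum { SECRET_SLOT_SIZE = 64 }; /* fixed capacity, independent of the secret */

struct secret_slot {
  unsigned char data[SECRET_SLOT_SIZE];
  size_t len; /* actual secret length, <= SECRET_SLOT_SIZE */
};

static void wipe_slot(struct secret_slot *slot) {
  /* Always clear the full slot, never just slot->len bytes, so the
   * wipe's timing reveals nothing about the secret's length. */
  memset(slot->data, 0, SECRET_SLOT_SIZE);
  slot->len = 0;
  __asm__ __volatile__("" : : "r"(slot) : "memory"); /* compiler fence */
}
```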
Current Proposal
Since there is no easy way to guarantee the original array is absolutely “inaccessible”, we may simply go with:
[[gnu::noinline]] void *memset_explicit(void *__restrict dst, int value, size_t len) {
  memset_inline(dst, value, len); // vectorized routine, same as normal memset
  cpp::atomic_signal_fence(cpp::MemoryOrder::SeqCst);
  return dst;
}
Or, if we agree that atomic_thread_fence can do a better job at “invalidating” cache lines:
[[gnu::noinline]] void *memset_explicit(void *__restrict dst, int value, size_t len) {
  memset_inline(dst, value, len); // vectorized routine, same as normal memset
  cpp::atomic_thread_fence(cpp::MemoryOrder::SeqCst);
  return dst;
}
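A standard-C sketch of the same shape, using <stdatomic.h> instead of the LLVM-libc-internal cpp:: wrappers (the noinline attribute spelling is a GNU extension, and memset_inline is replaced here by a plain memset):

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

__attribute__((noinline)) void *memset_explicit_sketch(void *dst, int value,
                                                       size_t len) {
  memset(dst, value, len); /* ordinary (possibly vectorized) memset */
  /* Orders the stores above with respect to any signal handler that
   * runs on this thread after the call returns. */
  atomic_signal_fence(memory_order_seq_cst);
  return dst;
}
```

The noinline attribute plus the out-of-line call boundary is what keeps the store from being folded into the caller and eliminated there.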
Some (External) Suggestions
- Although we had bzero before, some of the history of the technical specification (https://www.open-std.org/JTC1/SC22/WG14/www/docs/n2897.htm) suggests that none of the pre-existing solutions were standard and fully portable.
- (On the per-byte cost issue) The cost of the function should be independent of the values being overwritten, so unrolling or SIMD-izing by a constant (non-data-dependent) factor should be fine.
- The cppreference page for std::atomic_thread_fence suggests that such fences can be used to establish ordering for non-atomic operations.
- Truly “inaccessible” is very hard once swapping, caches, etc. are considered; one may need extra help from the hardware.
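To illustrate the constant-factor suggestion above, here is a hypothetical wipe unrolled by four 8-byte words: its cost is a function of len only, never of the bytes being overwritten, so the unrolling introduces no data-dependent timing. (Word stores go through memcpy to stay within strict-aliasing rules; a real implementation would use its tuned memset machinery instead.)

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void wipe_unrolled(unsigned char *p, size_t len) {
  static const uint64_t zero = 0;
  size_t i = 0;
  /* Main loop: four 8-byte stores per iteration, a constant
   * (non-data-dependent) unrolling factor. */
  for (; i + 4 * sizeof(uint64_t) <= len; i += 4 * sizeof(uint64_t)) {
    memcpy(p + i + 0 * sizeof(uint64_t), &zero, sizeof zero);
    memcpy(p + i + 1 * sizeof(uint64_t), &zero, sizeof zero);
    memcpy(p + i + 2 * sizeof(uint64_t), &zero, sizeof zero);
    memcpy(p + i + 3 * sizeof(uint64_t), &zero, sizeof zero);
  }
  /* Byte-wise tail for the remaining len % 32 bytes. */
  for (; i < len; i++)
    p[i] = 0;
  __asm__ __volatile__("" : : "r"(p) : "memory"); /* compiler fence */
}
```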