Implement `memset_explicit`

The problem

C23 (N3088, https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3088.pdf) simply states that

"The memset_explicit function copies the value of c (converted to an unsigned char) into each of the first n characters of the object pointed to by s. The purpose of this function is to make sensitive information stored in the object inaccessible".

Ok, inaccessible, but to what extent?

Currently, we have explicit_bzero in glibc, which is just a normal memset followed by a full compiler fence (see glibc/string/explicit_bzero.c in the glibc sources).

However, Jens Gustedt suggests in his blog post that

  • A call to this function should never be optimized out. This can often be achieved by having it in a separate TU from all other functions and by disabling link-time optimization for this TU or at least for this function.
  • No store to any byte of the byte array should be optimized out.
  • The return of the function should synchronize with all read and write operations. This could for example be achieved by issuing a call to atomic_signal_fence(memory_order_seq_cst) or equivalent, such that even a signal that kicks in right after the call could not read the previous contents of the byte array.
  • All caches for the byte array should have been invalidated on return.
  • To avoid side-channel attacks, the implementation of the function should make no explicit or implicit reference to the contents of the byte array nor the value that has been chosen for the overwrite. Each write operation should use the same time (and other resources) per byte.
  • Good performance is not expected, security first.

" Each write operation should use the same time (and other resources) per byte." Does it mean that we cannot unroll loop or use SIMD at all? I don’t really see how all such optimization can be used in side-channel attacks. Consider the real-world scenario: if a secret is stored in memory, you’d better store it in a fixed-sized region no matter what the exact length of the password is. If that is the case, we don’t need the same timing per-byte.

Current Proposal

Since there is no easy way to guarantee the original array is absolutely "not accessible", we may simply go with:

[[gnu::noinline]] void *memset_explicit(void * __restrict dst, int value, size_t len) {
    memset_inline(dst, value, len); // vectorized routine, same as normal memset
    cpp::atomic_signal_fence(cpp::MemoryOrder::SeqCst);
    return dst;
}

Or, if we agree that atomic_thread_fence can do a better job of "invalidating" cache lines:

[[gnu::noinline]] void *memset_explicit(void * __restrict dst, int value, size_t len) {
    memset_inline(dst, value, len); // vectorized routine, same as normal memset
    cpp::atomic_thread_fence(cpp::MemoryOrder::SeqCst);
    return dst;
}
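
For context, a caller would typically use this to scrub a secret before its buffer is reused or freed; a hypothetical usage sketch (handle_key and the buffer size are made up for illustration):

#include <string.h> // C23 declares memset_explicit here

void handle_key(void) {
    char key[64];
    // ... fill and use the key ...
    memset_explicit(key, 0, sizeof(key)); // the scrub must not be elided
}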

Some (External) Suggestions

  • Although we have had bzero before, some of the history of the technical specification (https://www.open-std.org/JTC1/SC22/WG14/www/docs/n2897.htm) suggests that none of the pre-existing solutions were standard and fully portable.
  • (for per-byte cost issue) The cost of the function should be independent of the values being overwritten. So unrolling or SIMD-izing by a constant (non-data-dependent) factor should be ok.
  • The cppreference page for std::atomic_thread_fence seems to suggest that such fences can be used to establish ordering for non-atomic operations.
  • “Inaccessible” is super hard considering swap/cache/… One may need extra help from hardware.

I haven’t read the blog post (do you have a link?) but I’d expect the intent here is, don’t get clever about optimizing stores. For example special-casing c == 0 to use special set-to-zero instructions would violate that principle.

It’s not feasible for such a library function to operate in constant time, therefore the number of bytes n cannot be hidden. So, the time taken by the library function should be based on nothing but n. IMO the time does not need to be linear in n (e.g., writing 8 bytes can be faster than writing 7 bytes) so long as the time is the same for each n-byte buffer. So, SIMD or unrolling or whatever ought to be fine.

It might be necessary to flush the cache; the fence operation might not do that for you.

FWIW this makes a lot of sense to me.
I think "Each write operation should use the same time (and other resources) per byte." means that you should make sure to use instructions that always take the same amount of time (mod caches etc., of course; you can't influence that). E.g., a masked store can be significantly more expensive if it's masking across a page boundary. A tail that only accesses the bytes actually in the specified range won't have the same performance difference depending on where exactly you are writing the data.

Hi, the post is at C23 implications for C libraries

Thanks! I see how this would be difficult for the C committee to specify with proper security, given the abstract machine.

“All caches for the byte array should have been invalidated on return.” is pretty important here, but I don’t know enough about how the atomics work to have an opinion on whether that’s enough to invalidate caches on all CPUs/cores in the machine.

All caches for the byte array should have been invalidated on return

I believe this is about ensuring that external data extraction from the RAM chips (or e.g. a forced reboot without RAM clearing) won't be able to gain access to the secret data – NOT about cross-thread data visibility. So, fences are not the appropriate operation.

On x86, you would want to use a CLFLUSH instruction; on AArch64, DC CVAC. Other architectures will likely have a similar instruction available too, though sometimes it might be privileged. Note that the existing __builtin___clear_cache intrinsic is not what’s needed – that’s only specified to make the instruction cache consistent with data cache, for self-modifying code. Here, we need to flush the data cache all the way out to main memory, which is likely an entirely different set of operations.
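
To make that concrete, here is a minimal per-architecture sketch (my own illustration, not llvm-libc code; flush_range is a hypothetical name and the 64-byte line size is an assumption that real code would have to query):

#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: write back every data-cache line covering
// [dst, dst + len) to memory.
static void flush_range(const void *dst, size_t len) {
    const uintptr_t line = 64; // assumed cache-line size
    uintptr_t p = (uintptr_t)dst & ~(line - 1);
    uintptr_t end = (uintptr_t)dst + len;
    for (; p < end; p += line) {
#if defined(__x86_64__)
        asm volatile("clflush (%0)" : : "r"(p) : "memory");
#elif defined(__aarch64__)
        asm volatile("dc cvac, %0" : : "r"(p) : "memory");
#endif
    }
#if defined(__x86_64__)
    asm volatile("mfence" : : : "memory"); // order CLFLUSH against later accesses
#elif defined(__aarch64__)
    asm volatile("dsb ish" : : : "memory"); // wait for the cleans to complete
#endif
}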

That sort of platform-specific cache invalidation sounds like it would need to go in alongside the platform-specific inline_memset implementation (e.g. llvm-project/libc/src/string/memory_utils/x86_64/inline_memset.h at main · llvm/llvm-project · GitHub)

That would also allow internal functions to access it if necessary. Possibly we could have the generic implementation use the proposed atomic fences, or it may be better to make using this on an unsupported platform an error.

I found an old link on that topic:

I bet that in the end you will need a Clang builtin, an LLVM intrinsic, and teach passes about the intrinsic.

Thank you all for the valuable suggestions. A PoC implementation is at [libc][c23] add memset_explicit by SchrodingerZhu · Pull Request #83577 · llvm/llvm-project · GitHub. Notice that GitHub is having some issues right now, so my latest commits are not yet synced to the PR.

This also matches what the Linux kernel does for its memzero_explicit.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/string.h#n260

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/include/linux/compiler.h#n88
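
The pattern from those links boils down to roughly the following (paraphrased, not a verbatim copy of the GPL code): a plain memset followed by an empty asm statement that pretends to read the buffer, so the compiler cannot treat the stores as dead.

#include <stddef.h>
#include <string.h>

static inline void memzero_explicit(void *s, size_t count) {
    memset(s, 0, count);
    // barrier_data-style fence: the "r"(s) input plus the "memory" clobber
    // make the compiler assume the asm may read *s, so the preceding
    // stores cannot be eliminated as dead.
    asm volatile("" : : "r"(s) : "memory");
}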

However, Jens Gustedt suggests in his blog post that

  • All caches for the byte array should have been invalidated on return.

Wait a second, 7.26.6.2 of n3220 says nothing about invalidating caches. Let’s see what the spec says:

The memset_explicit function copies the value of c (converted to an unsigned char) into each of the first n characters of the object pointed to by s. The purpose of this function is to make sensitive information stored in the object inaccessible.367)
367) The intention is that the memory store is always performed (i.e. never elided), regardless of optimizations. This is in contrast to calls to the memset function (7.26.6.1).

You have a whole lot of architecture-specific complexity in [libc][c23] add memset_explicit by SchrodingerZhu · Pull Request #83577 · llvm/llvm-project · GitHub for flushing cache lines, and I'm not even sure it is correct (at least, it will take some time to verify). Why should we add such semantics when they aren't part of the spec? Perhaps users of memset_explicit can use arch-specific methods of cache invalidation if they require that.

Notice that inaccessible is something very strong. Imagine that your RAM sits in a persistence domain: if the cache is not flushed, then one can attempt a forcible reset which recovers the data from PMEM.

Ignoring PMEM, cache flushing has no visible semantic effects. After a clflush I can read the same data as before.
As I hinted in the article, your stores can be seen as dead stores, because nobody reads them. You will need some other mechanism.

So, one fundamental problem is that the people authoring/discussing the spec for this function wanted it to do something useful security-wise, but the C abstract machine constrains the spec so that various implied threat models can’t be discussed explicitly, and therefore the spec inherently cannot specify all the effects that it really should. Insisting that the implementation do nothing more than the spec literally requires will make the function not especially useful for its intended purpose.

I looked at the Yang paper. It shows the Linux implementation (which I expect is covered by the GPL, and may therefore be unusable by LLVM), and also claims there is a public-domain implementation at https://compsec.sysnet.ucsd.edu/secure_memzero.h, but that URL gives me name-not-resolved. However, Section 6 of the paper describes what their implementation did in general terms, well enough that it could be re-implemented fairly easily.

They also hacked the DSE pass in Clang to avoid eliminating writes meeting certain criteria, meaning that a vanilla memset ought to Just Work. Performance cost seemed minimal. They used a command-line option but I’d guess that a function attribute would be a better idea.

Would DSE still kick in if we marked the function as an "opaque" one that shall never be inlined? I think memset_explicit should not be treated as a libcall during optimization anyway.


There may be some problem with region-based IPA (is DSE an IPA?). Ideally, after calling this function, clang should "evict" (invalidate) values related to the memory area. This seems to go beyond the library consideration and become more a matter of language semantics. I am not sure if opacity is enough.

Current DSE is at function-scope. There is a discussion about IPA DSE.

This is outside my area of expertise. Maybe making the stores of memset_explicit volatile or atomic seq_cst could get DSE to ignore them.
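
For concreteness, a minimal sketch of the volatile-store variant (an illustration, not llvm-libc's actual implementation): DSE is not allowed to remove stores performed through a volatile lvalue.

#include <stddef.h>

void *memset_explicit(void *dst, int value, size_t len) {
    // Writing through a volatile pointer forbids the compiler from eliding
    // the stores, even though nobody reads the buffer afterwards.
    volatile unsigned char *p = (volatile unsigned char *)dst;
    for (size_t i = 0; i < len; ++i)
        p[i] = (unsigned char)value;
    return dst;
}

The tradeoff is that volatile forces byte-at-a-time stores, which conflicts with reusing the vectorized memset_inline routine from the proposal above.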

On the other hand, if you look at the current implementations of this functionality in various crypto-adjacent libraries (as referred to by the paper by Yang, linked above), I haven’t yet seen anyone doing a cache flush operation.

So…I think we do need to ask: are we sure a cache-flush is what users want here, vs simply “No, compiler, don’t optimize away the zeroing just because the memory is going to be freed, darn it!”

The not-optimizing-away-writes part is the crucial bit.

The security argument in favor of cache-flushing is so that caches on other CPUs will be notified promptly and invalidate those lines, causing a reload from memory if those bytes are read again. In a system with a write-back cache this could possibly take a little while if you don’t flush explicitly. It’s primarily an argument about timeliness.

I think it was mentioned in the standard discussion that most previous implementations were considered non-portable or not fulfilling certain security demands.

However, I think this is more about decision making. And, for llvm-libc, we have plenty of room to correct our decision, at least at the current stage. If more people think we can do well with just noinline and a compiler barrier, we may just do that for now.

And if PMEM is still under consideration, I think flushing is definitely needed. There is still active research going on with PMEM, although there is no longer any commercial product line for now.

Then the standards body should not be attempting to standardize something about which they cannot make useful real-world semantic requirements. Otherwise they're just adding more garbage to the standard.

Deriving requirements on the libc from side-channel blog posts by former spec editors seems problematic.

If users care, then they should do:

memset_explicit(dest, ch, count);
arch_specific_cache_flush_not_provided_by_libc(dest);

Here’s an archived link:
https://web.archive.org/web/20190507224913/https://compsec.sysnet.ucsd.edu/secure_memzero.h

Notice: only contains compiler barriers, no cache flushes.

The implementation uses SecureZeroMemory when available, which is not documented to flush any caches.

On the other hand, if you look at the current implementations of this functionality in various crypto-adjacent libraries (as referred to by the paper by Yang, linked above), I haven’t yet seen anyone doing a cache flush operation.

Yep. Bionic for example is also just doing the compiler barrier.

although there is no longer any commercial product line for now.

Ancientware is not something we should bend over backward to support.

And, for llvm-libc, we have plenty of room to correct our decision, at least at the current stage. If more people think we can do well with just noinline and a compiler barrier, we may just do that for now.

I agree and think that’s the best path forward. I’m happy to change my position when the C standards body makes more concrete requirements in the standard, or significant research is published into this topic with agreeable action items that implementations can take. Maybe we make this configurable some day.