RFC: volatile/const atomics should not expand to LLSC/cmpxchg loops

In the AArch64 backend, if a given atomic operation is not available on the target, then AtomicExpandPass will expand it to an LL/SC or compare-exchange loop. These involve both reading and writing to the address, potentially multiple times. Currently we do the same for volatile operations and operations on const pointers, which is a violation of both const and volatile semantics. For example:

#include <stdatomic.h>

long add(_Atomic volatile long* ptr) {
  return atomic_fetch_add(ptr, 1);
}

Armv8.0 has no such instruction; however, compiling with clang -target aarch64 -march=armv8-a -O1 we get this:

add(long _Atomic volatile*):                     // @add(long _Atomic volatile*)
        mov     x8, x0
.LBB0_1:                                // %atomicrmw.start
        ldaxr   x0, [x8]
        add     x9, x0, #1
        stlxr   w10, x9, [x8]
        cbnz    w10, .LBB0_1
        ret

The additional memory operations (the store exclusive stlxr, and any further loads/stores from going around the loop multiple times) violate the volatile semantics. Similarly, the stores violate const semantics when const atomic loads are expanded (e.g. 128 bit loads without LSE2).

It seems like the most reasonable thing we can do in this scenario (where no instruction is available) is to emit a libcall. There are at least three options for doing so: __sync (Itanium ABI), __atomic (libatomic) and __aarch64 (outline-atomics). The simplest way to implement this would be by extending AtomicExpandPass, since it already has code for emitting __atomic libcalls.

IIUC, __sync calls would be more appropriate than __atomic because they place no restrictions on how they synchronise, and hence on whether they can be mixed with other methods of synchronisation for the same size. In contrast, __atomic calls require all atomic operations above a given size to use the calls. Therefore, by emitting __sync calls the user is free to provide their own implementations to be used for the volatile operations, if they are able to (e.g. with more direct control over the hardware). The same might be true for __aarch64 calls, but I can’t find any good documentation on their semantics.

I can see some potential problems with this approach, specifically:

  • Volatile/const and non-volatile/const operations on the same location must use the same synchronisation method, so the implementation of __sync must e.g. interact correctly with the hardware exclusive monitor. This is possible but not necessarily easy or common; embedded users, however, may wish to do so. I am not suggesting that LLVM provide implementations of these functions, because they are not implementable without knowing the details of the underlying core.
  • Introducing new function calls (e.g. in code which was compiled but not used) without providing definitions could cause linker errors in existing projects.

Here are some alternative approaches for consideration:

  • Emit a warning or error in the frontend when you use an unsupported volatile/const atomic operation, and keep the invalid codegen. Direct the user to use -moutline-atomics.
  • Never emit LLSC/cmpxchg loops, because volatile/non-volatile accesses to the same pointer must use the same synchronisation method, and we can’t guarantee that with libcalls.
  • Use __atomic calls for all operations above the maximum size supported by the target, and error for volatile/const operations smaller than that limit.

In general the atomics ABI situation is complex and not always brilliantly documented, so let me know if I have misunderstood anything above. Does emitting libcalls for unsupported volatile operations look like the correct approach? If so is __sync the right choice?


This was discussed offline, and apparently this is the expected behaviour. Specifically, this claim of mine:

The additional memory operations (the store exclusive stlxr, and any further loads/stores from going around the loop multiple times) violate the volatile semantics.

is not correct, at least for volatile atomic operations: volatile does not imply “don’t introduce additional side effects”, it only implies “don’t optimise this out”. Whether this is specified in the relevant standards or is a matter of convention is not clear; I have to admit some aspects of volatile semantics are still quite unclear to me, especially in the context of atomics. Can anyone point to documentation that supports this interpretation?

I still believe the const atomic loads should not emit stores. If I get time I will open a patch to make this an error if the backend can’t do them without stores.

Yes, volatile is not well specified; most behaviors around it are a matter of convention. The only thing it really must do is ensure that “volatile sig_atomic_t g;” is atomic with regard to signal-handler interruptions.

I still believe the const atomic loads should not emit stores. If I get time I will open a patch to make this an error if the backend can’t do them without stores.

You are correct: implementations should not have used atomic store operations to implement an atomic load. This implies that when a CPU only provides a 128-bit cmpxchg or ll/sc, but not a 128-bit load, you cannot use the 128-bit atomic CPU instructions to implement C/C++ atomic operations; you must use a lock. (The compiler should not have used inline atomics – it should have called the atomic library __atomic_* routines, and those should have used locks.)

Sadly, this mistake can’t be fully fixed, because doing so would require an ABI break. We can make the compiler emit atomic library calls, provided we ensure that the library implementation continues to do the “wrong thing” on older hardware (using a cmpxchg/ll-sc loop instead of a lock). We cannot switch it to use locks without breaking the existing binaries which use CPU instructions inline.

It’s also not possible to change the behavior based on whether a pointer is “const” or not, since you can mix accesses via a const pointer and a non-const pointer.

FWIW, from a Rust perspective we actually prefer the current situation over using a lock. We generally promise that the compiler will never implicitly use locks, and our const/immutable semantics are slightly different meaning that having these stores is, in general, fine. We did have to add a rule saying that one is not allowed to use atomics on truly read-only memory (e.g. things that would end up in the readonly section of the binary), but as of now that seems like an acceptable solution for us.

Cc @nikic
Also see here for some of the discussion on the Rust side.

These stores are actually “fine” in the same manner for C/C++ as for Rust, because the stores are guaranteed to be no-ops (by virtue of being a compare-exchange instruction or sequence). Thus, there is no issue if the backing memory is mapped read/write.

The C and C++ specifications don’t allow for atomic load to be non-functional on readonly memory, but in practice that’s something users rarely run into, which is why the current situation has been in place for years without much complaint.

That’s the difference between C/C++ and Rust I was getting at in my previous post – in Rust we have the freedom to say that atomic loads may indeed not be used on readonly memory.

One of my most liked diffs. 128-bit loads on ARM CPUs.
https://reviews.llvm.org/D109827