Adding sanity to the Atomics implementation

Right now, the atomics implementation in clang is a bit of a mess. It has three basically completely distinct code-paths (a short C example of the three surface forms follows the list):

  1. There's the legacy __sync_* builtins, which clang lowers directly to atomic IR instructions. Then, the llvm atomic IR instructions themselves can sometimes emit libcalls to __sync_* library functions (which are basically undocumented, and which users are often responsible for implementing themselves if they need them).
  2. There's the new __atomic_* builtins, which clang will, depending on size, alignment, and target, lower either to a libcall to a standardized-by-GCC __atomic_* library function (https://gcc.gnu.org/wiki/Atomic/GCCMM/LIbrary, implemented by libatomic), or to the atomic IR instructions. (Also covered by the same code are the __c11_atomic_* builtins, which are almost-identical clang-specific aliases of the __atomic_* builtins.)
  3. There's the C11 "_Atomic" type modifier and some OpenMP stuff, which clang lowers into atomic IR or libcalls to __atomic_* library functions, via almost completely separate code from #2. (Note: this doesn't apply to C++ std::atomic<>, which is implemented with the builtins instead.) Beyond just a lot of duplicated logic in the code, the _Atomic implementation is kinda broken: sometimes it emits llvm IR instructions directly (letting things fall back to __sync_* calls), and sometimes it checks first whether it should emit a libcall (falling back to __atomic_* calls). That's pretty bad behavior.
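
For concreteness, here's a small C example (hypothetical, just to show the three surface forms; each function is handled by a different piece of clang):

    #include <stdatomic.h>

    int legacy(int *p)              /* path 1: legacy __sync_* builtin */
    { return __sync_fetch_and_add(p, 1); }

    int modern(int *p)              /* path 2: __atomic_* builtin */
    { return __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST); }

    int c11(_Atomic int *p)         /* path 3: C11 _Atomic type */
    { return atomic_fetch_add(p, 1); }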

I'd like to make a proposal for cleaning this mess up.

BTW, as a sidenote...
One thing that's important to remember: at least on the C/C++ level, if you support lock-free atomic ops for a given size/alignment, ALL the atomic ops for that size/alignment must be lock-free. This property is usually easy to satisfy, because you'll have either LL/SC or CAS instructions, which can be used to implement every other operation. However, many older CPU models support atomic load, store, and exchange operations, but lack any way to do an atomic compare-and-swap. This is true for at least ARMv5, SPARCv8, and Intel 80386. When your minimum CPU is set to such a CPU model, all atomic operations must be done via libcall -- it's not acceptable for atomic_store to emit an atomic "store" instruction while atomic_fetch_add requires a libcall that gets implemented with a lock. If that were to happen, atomic_fetch_add could not actually be made atomic with respect to a simultaneous atomic_store.
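
A sketch of the failure mode, assuming a hypothetical target whose atomic stores are native but whose fetch_add falls back to a lock-based library call:

    #include <stdatomic.h>

    atomic_int counter;

    /* Suppose this lowers to a plain atomic store instruction... */
    void writer(void) { atomic_store(&counter, 42); }

    /* ...while this, lacking a CAS instruction, becomes a library call
       that takes a lock around a load/add/store sequence.  writer()
       never takes that lock, so its store can land between the
       lock-protected load and store and be lost: the two operations
       are no longer atomic with respect to each other. */
    void adder(void) { atomic_fetch_add(&counter, 1); }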

So anyhow, there are basically two paths I think we could take for cleanup. I like "Alternative A" below better, but would be interested to hear if others have reasons to think the other would be preferable.

Both alternatives I've suggested will have the effect that the __sync_* builtins in clang will now lower to __atomic_* function calls on platforms without inline atomics (or for unsupported sizes), and C11 atomics will stop being inconsistent about which set of library calls they fall back on. Clang will cease to ever emit a call to a __sync_* function from any of its builtins or language support. This could theoretically cause an incompatibility on some target.

However, I think it ought to be pretty safe: I can't imagine anyone who will use an upgraded compiler has __sync_* functions implemented for their platform, but doesn't have the __atomic_* library functions implemented. The __atomic_* builtins (which already use those library calls) are in widespread use, notably, both in libstdc++ and libc++, as well as in a lot of third-party code. IMO it's worth having that potential incompatibility, in order to simplify the situation for users (only one set of atomic library calls to worry about), and to have a single code-path for atomics in the compiler.

Alternative A: Move all atomic libcall emission into LLVM

In this alternative, LLVM will be responsible for lowering all atomic operations (to inline code or libcalls as appropriate) for all three code-paths listed at the beginning. Clang will emit no libcalls for atomic operations itself.

A1) In LLVM: when lowering "store atomic", "load atomic", "atomicrmw", and "cmpxchg" instructions that aren't supported by the target, emit libcalls to the new __atomic_* functions (rather than the current behavior of calling the legacy __sync_* functions).

Additionally, I'd like to move the decision to emit a libcall into AtomicExpandPass, and remove the ability to use Expand in ISel to create a libcall for ISD::ATOMIC_*. Putting the decision there allows querying the target, up front, for its supported sizes, before any other lowering decisions are made (whether elsewhere in AtomicExpandPass or in ISel). When choosing whether to inline or libcall, only the size and alignment of the operation should be considered, and not the operation itself, or the type. This will ensure that the "all or none" property is adhered to.
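
In rough pseudo-C terms (the names here are made up for illustration, not actual LLVM API), the proposed decision is just:

    #include <stdbool.h>

    /* The inline-vs-libcall choice is a function of size and alignment
       ONLY -- never of the particular operation -- so either every
       atomic op on a given size/alignment is inlined, or none is. */
    static bool should_use_atomic_libcall(unsigned size, unsigned align,
                                          unsigned max_supported_size) {
      return size > max_supported_size || align < size;
    }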

(Q: what about the implementation of __atomic_is_lock_free/__atomic_always_lock_free in clang? The clang frontend can't query target information from the llvm backend, can it? Is there some other way to expose the information of what sizes are supported by a backend so that the clang builtins -- the latter of which must be a C++ constant expression -- can also use it? Or... just continue duplicating the info in both places, as is done today...)
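
For reference, the two builtins in question; this sketch assumes a target where 32-bit atomics are in fact always lock-free:

    #include <stdint.h>
    #include <stdio.h>

    uint32_t x;

    int main(void) {
      /* Compile-time form: must fold to a constant expression. */
      _Static_assert(__atomic_always_lock_free(sizeof(uint32_t), 0),
                     "expected lock-free 32-bit atomics");
      /* Runtime form: may become a call into libatomic when the
         compiler cannot prove the answer statically. */
      printf("%d\n", (int)__atomic_is_lock_free(sizeof x, &x));
      return 0;
    }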

A2) In clang, start removing the code that knows how to do optimized library call lowering for atomics -- always emit llvm atomic ops. Except for one case: clang will still need to emit library calls itself for data not aligned naturally for its size. The LLVM atomic instructions currently will not handle unaligned data, but unaligned data is allowed for the four "slab of memory" builtins (__atomic_load, __atomic_store, __atomic_compare_exchange, and __atomic_exchange).
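
A hypothetical example of the case in question -- a packed struct member that the generic builtin must nonetheless handle:

    struct __attribute__((packed)) wire {
      char tag;
      int payload;   /* at offset 1: not naturally aligned for a 4-byte int */
    };

    void read_payload(struct wire *w, int *out) {
      /* The "slab of memory" form accepts arbitrary alignment, so clang
         must be prepared to emit an __atomic_load library call here. */
      __atomic_load(&w->payload, out, __ATOMIC_SEQ_CST);
    }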

A3) In LLVM, add "align" attributes to cmpxchg and atomicrmw, and allow specifying "align" values for "load atomic" and "store atomic" (where the attribute currently exists but cannot be used). LLVM will then be able to lower misaligned accesses to library calls. In clang, get rid of the special case for directly emitting libcalls for misaligned atomics, and lower to llvm instructions there, as well.

A4) Hopefully, now, the three codepaths in clang will be sufficiently similar (as they're all now just creating llvm atomic instructions) that they can be trivially merged, or else so trivial that the remaining duplication isn't worth merging.

Alternative B: Move all libcall emission for atomics into Clang

In this alternative, LLVM will never emit atomic libcalls from the atomic IR instructions. If the operation requested is not possible on a given target, that is an error. So, the cleanups here are to get clang to stop emitting LLVM IR atomics that cannot be lowered without libcalls.

B1) In Clang: have the legacy __sync_* builtins become essentially aliases for the __atomic_* builtins (thus they will emit calls to __atomic_* library functions when applicable).
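
In user-level terms, the intended equivalence would be roughly:

    /* The legacy builtin is documented as a full barrier, so the natural
       lowering is its seq_cst __atomic_* counterpart: */
    int sync_style(int *p)   { return __sync_fetch_and_add(p, 1); }
    int atomic_style(int *p) { return __atomic_fetch_add(p, 1, __ATOMIC_SEQ_CST); }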

B2) In Clang: have the C11 _Atomic type support and OpenMP stuff call into the __atomic_* builtins' code to do their dirty work, instead of having their own separate implementations.

B3) After those two changes, I believe clang will only ever emit atomic IR instructions when they can be lowered. So then, in LLVM: get rid of the fallback to libcalls to __sync_* from the atomic IR instructions. If an atomic IR operation is requested and not implementable lock-free, it will be an error, with no fallback.

> A1) In LLVM: when lowering "store atomic", "load atomic", "atomicrmw", and
> "cmpxchg" instructions that aren't supported by the target, emit libcalls
> to the new __atomic_* functions (rather than the current behavior of
> calling the legacy __sync_* functions).
>
> Additionally, I'd like to move the decision to emit a libcall into
> AtomicExpandPass, and remove the ability to use Expand in ISel to create a
> libcall for ISD::ATOMIC_*. Putting the decision there allows querying the
> target, up front, for its supported sizes, before any other lowering
> decisions (whether elsewhere in AtomicExpandPass or in ISel). When
> choosing whether to inline or libcall, only the size and alignment of the
> operation should be considered, and not the operation itself, or the type.
> This will ensure that the "all or none" property is adhered to.

This sounds reasonable to me.

> (Q: what about the implementation of
> __atomic_is_lock_free/__atomic_always_lock_free in clang? The clang
> frontend can't query target information from the llvm backend, can it? Is
> there some other way to expose the information of what sizes are supported
> by a backend so that the clang builtins -- the latter of which must be a
> C++ constant expression -- can also use it? Or... just continue duplicating
> the info in both places, as is done today...)

> A2) In clang, start removing the code that knows how to do optimized
> library call lowering for atomics -- always emit llvm atomic ops. Except
> for one case: clang will still need to emit library calls itself for data
> not aligned naturally for its size. The LLVM atomic instructions currently
> will not handle unaligned data, but unaligned data is allowed for the four
> "slab of memory" builtins (__atomic_load, __atomic_store,
> __atomic_compare_exchange, and __atomic_exchange).

Hrm, why couldn't we get LLVM to do the heavy lifting for us?

> A3) In LLVM, add "align" attributes to cmpxchg and atomicrmw, and allow
> specifying "align" values for "load atomic" and "store atomic" (where the
> attribute currently exists but cannot be used). LLVM will then be able to
> lower misaligned accesses to library calls. In clang, get rid of the
> special case for directly emitting libcalls for misaligned atomics, and
> lower to llvm instructions there, as well.

Ah, that answers my question. Great!

> A4) Hopefully, now, the three codepaths in clang will be sufficiently
> similar (as they're all now just creating llvm atomic instructions) that
> they can be trivially merged, or else so trivial that the remaining
> duplication isn't worth merging.

SGTM.

> Right now, the atomics implementation in clang is a bit of a mess. It has three basically completely distinct code-paths:

I’m partly to blame for some of this. Sorry!

> Alternative A: Move all atomic libcall emission into LLVM
>
> In this alternative, LLVM will be responsible for lowering all atomic operations (to inline code or libcalls as appropriate) for all three code-paths listed at the beginning. Clang will emit no libcalls for atomic operations itself.

We had a discussion about this around 2011/2012. The general consensus was that everyone thought it was a good idea, but no one had time to implement it. If you have time, then that is wonderful news!

Note that part of this will involve generalising the atomic operations (at least cmpxchg) to work on *any* LLVM IR type, including vectors and pointers.

> A1) In LLVM: when lowering "store atomic", "load atomic", "atomicrmw", and "cmpxchg" instructions that aren't supported by the target, emit libcalls to the new __atomic_* functions (rather than the current behavior of calling the legacy __sync_* functions).
>
> Additionally, I'd like to move the decision to emit a libcall into AtomicExpandPass, and remove the ability to use Expand in ISel to create a libcall for ISD::ATOMIC_*. Putting the decision there allows querying the target, up front, for its supported sizes, before any other lowering decisions (whether elsewhere in AtomicExpandPass or in ISel). When choosing whether to inline or libcall, only the size and alignment of the operation should be considered, and not the operation itself, or the type. This will ensure that the "all or none" property is adhered to.

I'm not 100% sure about this. On some architectures, it may be advantageous to lower some operations to libcalls purely because of the size of the implementation, or because of ABI stability: e.g. all architecture versions can do a store safely, but the other operations need locks / restart blocks on some versions and can be done with native instructions on others, and you want the ability to select between them at load time so that the same binary can work on all versions of the architecture.

> (Q: what about the implementation of __atomic_is_lock_free/__atomic_always_lock_free in clang? The clang frontend can't query target information from the llvm backend, can it? Is there some other way to expose the information of what sizes are supported by a backend so that the clang builtins -- the latter of which must be a C++ constant expression -- can also use it? Or... just continue duplicating the info in both places, as is done today...)

I think that this will probably need to end up duplicating the info, but it’s a bit horrible. Ideally, I’d propose extending the DataLayout to include it, which at least means that you will get a hard failure if clang and the back end disagree.

> A2) In clang, start removing the code that knows how to do optimized library call lowering for atomics -- always emit llvm atomic ops. Except for one case: clang will still need to emit library calls itself for data not aligned naturally for its size. The LLVM atomic instructions currently will not handle unaligned data, but unaligned data is allowed for the four "slab of memory" builtins (__atomic_load, __atomic_store, __atomic_compare_exchange, and __atomic_exchange).
>
> A3) In LLVM, add "align" attributes to cmpxchg and atomicrmw, and allow specifying "align" values for "load atomic" and "store atomic" (where the attribute currently exists but cannot be used). LLVM will then be able to lower misaligned accesses to library calls. In clang, get rid of the special case for directly emitting libcalls for misaligned atomics, and lower to llvm instructions there, as well.
>
> A4) Hopefully, now, the three codepaths in clang will be sufficiently similar (as they're all now just creating llvm atomic instructions) that they can be trivially merged, or else so trivial that the remaining duplication isn't worth merging.

One other consideration is the lowering of _Atomic(T). C11 is very careful to not require that sizeof(T) == sizeof(_Atomic(T)) (though there are some bugs in the spec currently), to allow for architectures where sub-word atomics are inefficient (for example, _Atomic(char) may be implemented as a 16- or 32-bit quantity, with explicit truncation). Clang currently handles this (possibly incorrectly).
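
A quick way to observe this on any given target:

    #include <stdio.h>

    int main(void) {
      /* The two sizes happen to agree on today's common targets, but
         C11 does not promise it. */
      printf("sizeof(char) = %zu, sizeof(_Atomic(char)) = %zu\n",
             sizeof(char), sizeof(_Atomic(char)));
      return 0;
    }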

David

I don't have much of substance to add, but I will say that I like the proposal (I too prefer Alternative A). When adding C11 atomic support for Hexagon, I found it surprising that support for the __sync_* builtins was implemented completely differently than the C11 atomics.

> (Q: what about the implementation of
> __atomic_is_lock_free/__atomic_always_lock_free in clang? The clang
> frontend can't query target information from the llvm backend, can it?)

I think this is the main challenge. The frontend (and std::atomic
implementations) need to know whether the target can perform an atomic
operation without locks, and they can't rely on the LLVM target being built
in for this information.

I think Clang should continue to duplicate this information, the same way
we duplicate target datalayout strings. Other than that, sure, we can let
LLVM expand IR operations to libcalls. I don't immediately see a problem
with that.

Note that a libcall doesn't necessarily mean using locks. With one
exception, NetBSD provides lock-free CAS on all platforms for natural
argument sizes. The intersection between platforms with MP support and
platforms without hardware CAS is SPARCv8. Everything else works fine,
even the libcall version.

Joerg

> Note that part of this will involve generalising the atomic operations (at least cmpxchg) to work on *any* LLVM IR type, including vectors and pointers.

FYI, this part is already in progress. Right now we have power-of-two integers, floats, and pointers. Vectors look to be substantially more complicated at the moment. Not all of the atomic instructions have been updated for the same types, so consistency would be a good first step.

> I think Clang should continue to duplicate this information, the same way
> we duplicate target datalayout strings. Other than that, sure, we can let
> LLVM expand IR operations to libcalls. I don't immediately see a problem
> with that.

> Note that a libcall doesn't necessarily mean using locks.

Yes, and having possibly-lock-free atomics is fine -- __atomic_is_lock_free
has a fallback to a libcall if the compiler doesn't know the answer is
"true", for that exact reason.

> With one exception, NetBSD provides lock-free CAS on all platforms for
> natural argument sizes. The intersection between platforms with MP support
> and platforms without hardware CAS is SPARCv8. Everything else works fine,
> even the libcall version.

Indeed, with kernel support (magic restartable sequences, or a syscall),
lock-free atomics can be provided for any uniprocessor architecture. So,
Linux on ARM provides a memory page you can jump to, which contains a
normal non-atomic instruction sequence on uniprocessor ARMv5s, or actual
ll/sc instructions on ARMv6+. And for the ARMv5 implementation, the kernel
arranges to magically restart the non-atomic sequence from the beginning
if it's interrupted in the middle, thus making it act as if it were atomic
(the LWN.net article "User space atomic ops on ARMv5 and earlier" is a
nice explanation). I see that NetBSD has something similar for restartable
instruction sequences.
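
A minimal sketch of how user code (or a libcall implementation) uses that
page on ARM Linux; the address and signature follow the kernel's
Documentation/arm/kernel_user_helpers.txt, but treat the details here as
illustrative:

    /* The kernel maps a cmpxchg routine at a fixed address.  On ARMv6+
       it is ldrex/strex; on ARMv5 the kernel restarts the plain
       load/store sequence if it gets interrupted.  Returns 0 on
       success. */
    typedef int (kuser_cmpxchg_t)(int oldval, int newval, volatile int *ptr);
    #define kuser_cmpxchg (*(kuser_cmpxchg_t *)0xffff0fc0)

    /* A lock-free fetch_add built on the kernel-provided cmpxchg. */
    static int fetch_add_via_kernel(volatile int *p, int amount) {
      int old;
      do {
        old = *p;
      } while (kuser_cmpxchg(old, old + amount, p) != 0);
      return old;
    }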

After some further investigation on how this is all supposed to hang
together, I do need to modify my plan slightly. The detail I neglected to
take account of earlier is that some platforms don't just have
*potentially* lock-free library functions (e.g., depending on the CPU
model you detect at runtime), but rather can have *guaranteed* lock-free
library functions via this kernel support.

For any target that has kernel-supported lock-free atomics, the LLVM target
should claim those atomic sizes as supported and clang should return true
from __atomic_always_lock_free. By claiming them supported as if they were
supported by native instructions, AtomicExpandPass will let them through
unscathed to ISel.

Then, the change in plan: I actually will preserve the ability to expand to
libcalls from within ISel, for this use-case only. The libcall names for
these routines will only be defined on those CPU+OS combinations which
implement this functionality.

In GCC, the supported-through-kernel atomic ops expand to a __sync_*
libcall, which is provided by the compiler runtime support library libgcc.a
(on the targets in question). It's not *necessary* for GCC to emit a
libcall; an inline syscall would sometimes be fine too, but currently it
always emits a call.

Like my plan for LLVM, GCC only enables these on some architectures (grep
for "init_sync_libfuncs"): Linux, on ARM, 68000, Nios2, PA-RISC, and
SuperH chips.

Note that unlike with the __atomic_* library calls, these target-specific
libcalls *are* okay to mix and match with CPU instructions since they must
be guaranteed lock-free. It is also okay to include potentially multiple
copies of the functions into an executable image, since they're stateless,
which is why it's okay for them to live in libgcc.a (not libgcc_s.so).

BTW, regarding NetBSD in particular: it appears as though the
implementation of the __sync_* functions in its libc sometimes uses locks,
when the target is built with the generic "atomic_init_testset.c"
implementation, and then runs on a multiprocessor system. I think that
might happen if you compiled your binary for an armv5, but then actually
ran it on a multiprocessor armv6+ system? If I've read that right (which
I'm not at all sure I have), then llvm can't properly depend on NetBSD's
__sync_* functions to be always lock-free, at least at the moment.

From: "Ben via llvm-dev Craig" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Thursday, January 28, 2016 9:41:06 AM
Subject: Re: [llvm-dev] Adding sanity to the Atomics implementation

> I don't have much of substance to add, but I will say that I like the
> proposal (I too prefer Alternative A).

I also prefer Alternative A.

-Hal

Same here: +1 to A.

It also makes it easier to support virtual ISAs. For targets such as
WebAssembly we don't know ahead of time what's atomic, but we make certain
assumptions (e.g. atomic_flag has to be sufficiently sized to be
lock-free). As David pointed out, this means that _Atomic and std::atomic
have to be "sufficiently sized", e.g. sizeof(T) == sizeof(std::atomic<T>)
isn't guaranteed (the guarantee is also omitted for other academic
reasons, but I don't think they matter in this case).

Another consideration: we want to get the atomic optimizations, which
Alternative B would forbid because the middle-end would see libcalls
instead of atomics.

Also, I'd like LLVM to eventually be able to patch code at load-time and
transform maybe-lock-free operations into the proper instruction or libcall.

Related to always lock-free: wg21.link/n4509
Note that C has macros that return never / sometimes / always lock-free.
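
For reference, those C11 macros live in stdatomic.h; a small sketch of how
they report the three states:

    #include <stdatomic.h>
    #include <stdio.h>

    int main(void) {
      /* 0 = never lock-free, 1 = sometimes (runtime-dependent),
         2 = always lock-free. */
      printf("int: %d, long long: %d\n",
             ATOMIC_INT_LOCK_FREE, ATOMIC_LLONG_LOCK_FREE);
      return 0;
    }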

Also, +1. Strongly agreed with this part.