Why can't atomic loads and stores handle floats?

Looking through the documentation, I discovered that atomic loads and stores are only supported for integer types. Can anyone provide some background on this? Why is this true?

Currently, given code:
std::atomic<float> aFloat;
void foo() {
  float f = atomic_load(&aFloat);
}

Clang generates code like:
%"struct.std::atomic.2" = type { float }
@aFloat = global %"struct.std::atomic.2" zeroinitializer, align 4

define void @foo() {
  %1 = load atomic i32* bitcast (%"struct.std::atomic.2"* @aFloat to i32*) seq_cst, align 4
  %2 = bitcast i32 %1 to float
  ret void
}


This seems less than ideal. I would expect that we might have to desugar floats into integer & cast operations in the backend, but why is this imposed on the frontend?
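To make the lowering above concrete, here is a minimal C++ sketch of what the integer-load-plus-bitcast IR amounts to; the function name `load_atomic_float` is hypothetical, and `std::memcpy` stands in for the `bitcast`:

```cpp
// Sketch (assumption): the integer atomic load followed by a bitcast,
// as the IR above performs, written out in portable C++.
#include <atomic>
#include <cstdint>
#include <cstring>

float load_atomic_float(const std::atomic<uint32_t> &a) {
    uint32_t bits = a.load(std::memory_order_seq_cst); // atomic i32 load
    float f;
    std::memcpy(&f, &bits, sizeof f); // "bitcast i32 %1 to float"
    return f;
}
```

The memcpy compiles away, but the integer-to-floating-point register move it implies generally does not.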

More generally, is there anyone who is knowledgeable and/or working on atomics and synchronization in LLVM? I’ve got a number of questions w.r.t. semantics and have found a number of what I believe to be missed optimizations. I’m happy to file the latter, but I’d like to talk them over with a knowledgeable party first.


What is the downside of the currently generated IR? There ain’t nothin’ wrong with bitcasts, IMO.


It's problematic because it means that you'll end up generating an integer store even if your hardware supports load-linked / store-conditional (or cmpxchg) on floating-point registers. That means an extra floating-point-to-integer register copy (and the reverse, and back again if the operation fails in an atomicrmw loop); such cross-register-file moves typically involve complex pipeline interlocking on modern processors, so they are very expensive. It also means you need to allocate an extra integer register, even though the underlying hardware may support the original form.
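The atomicrmw-loop case mentioned above can be sketched in C++: without an atomic float type, a frontend has to emit an integer CAS loop, and every iteration performs both of the register moves described. The name `atomic_fadd` is hypothetical, and `std::memcpy` again plays the role of the bitcast:

```cpp
// Sketch (assumption): emulating "atomicrmw fadd" on a float stored in an
// atomic integer, the pattern a frontend must emit today. Each retry does
// an int->fp move, the add, then an fp->int move before the CAS.
#include <atomic>
#include <cstdint>
#include <cstring>

float atomic_fadd(std::atomic<uint32_t> &a, float delta) {
    uint32_t oldBits = a.load(std::memory_order_relaxed);
    for (;;) {
        float oldVal;
        std::memcpy(&oldVal, &oldBits, sizeof oldVal); // int -> fp move
        float newVal = oldVal + delta;
        uint32_t newBits;
        std::memcpy(&newBits, &newVal, sizeof newBits); // fp -> int move
        // On failure, compare_exchange_weak reloads oldBits and we retry.
        if (a.compare_exchange_weak(oldBits, newBits))
            return oldVal; // atomicrmw returns the old value
    }
}
```

On hardware with LL/SC on floating-point registers, the whole loop could in principle stay in the FP register file; the integer detour exists only because of how the IR is expressed.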

You can hack around this a bit in the back end, but then the back end is trying to figure out what the front end meant (and, on some architectures, store-ordering semantics differ between integer and floating-point registers, so this isn't necessarily even possible: you can end up turning two constructs with different semantics into the same thing in the IR).

We're currently unable to express atomic pointer loads and stores on our architecture for a similar reason: the hardware has separate fat-pointer registers, which are wider than the widest integer registers (and interact with tag bits), and it implements load-linked and store-conditional for them. C11 supports atomic operations on pointers, but LLVM IR doesn't (or didn't; I'm not sure whether this is fixed now).
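At the source level the pointer case is entirely ordinary; a minimal sketch (function names `publish`/`consume` are made up for illustration):

```cpp
// Sketch: C++11, like C11, supports atomic loads and stores of pointers
// directly. The complaint above is that LLVM IR forces these through
// integer operations as well, losing the pointer-ness on hardware with
// distinct (wider, tagged) pointer registers.
#include <atomic>

static int slot = 42;
static std::atomic<int *> shared{nullptr};

void publish() {
    shared.store(&slot, std::memory_order_release); // atomic pointer store
}

int *consume() {
    return shared.load(std::memory_order_acquire);  // atomic pointer load
}
```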

In short, it's only a problem if you care about architectures outside of ARM and x86.


P.S. To correctly implement C11 semantics for atomic operations on floats, we also need to be able to model floating point environment state, which we can't.

David provided one good answer. I'll give another.

The current design pushes complexity into the language frontend for - as far as I know - no good reason. I can say from recent experience that the corner cases around atomics are both surprising and result in odd-looking hacks in the frontend. To put it differently: why should marking loads and stores atomic require me to rewrite largish chunks of code around the load or store? There's nothing "wrong" per se with that design, but why complicate a bunch of frontends when a single IR-level desugaring pass could perform the same logic?

Another answer would be that bitcasts make the IR less readable. They consume memory. Unless handled carefully, they inhibit optimizations (e.g., if you forget to strip casts in a peephole optimization). When dealing with large IR files from a language where *every* field access is atomic "unordered", the first two are particularly important.

p.s. I'm currently operating under the assumption that there is no *technical* reason LLVM couldn't represent atomic loads and stores on floating point types. If this is not true, please correct me.


There's nothing "wrong" per se with that design, but why
complicate a bunch of frontends when a single IR-level desugaring pass
could perform the same logic?

I quite like this idea. It could give David his atomic ops where an
integer really can't do the right thing, and isn't just shunting the
burden onto all of the backends. Some restrictions would still be
needed. A "load atomic [1000 x i64]* %addr" is just being cheeky.

The biggest issue I see is modelling the legal loads for a target. For
example AArch64 probably has "legal" monotonic loads for most sane
types, in the sense that they can be implemented in the same way as
non-atomic ones. But there's no "ldar s0, [addr]", and you can't
simply replace an atomic load with a normal load even in the weaker
cases because you have no say in what passes run after your shiny
expansion pass.

With appropriate target hooks, I think it could be made to work.



Currently, the frontend will have to lower these to calls to the __atomic functions, but there's no technical reason for this on all architectures. Haswell and newer Intel chips *can* implement atomic loads of 1000 x i64: with the transactional extensions, the limit for loads is very large (the limit for atomic writes is around 30KB).

As transactional memory becomes more common, large atomicrmw operations become possible, but LLVM IR can't meaningfully express them. Currently, two architectures in LLVM support hardware transactional memory in some form: x86 and BlueGene/Q.

One of the biggest issues I face implementing the back end for our architecture is LLVM's willingness, both in mapping to IR and then to SelectionDAG, to throw away information that is not yet meaningful to existing back ends. Let's try not to make that any worse for future back-end authors. Lots of people are trying to use LLVM for custom ASICs, and at EuroLLVM a number of people reported encountering exactly this problem.


I'm willing to take this on. It's not going to be immediate, but cleaning up the code in my frontend is worth the work.

It seems like the logical place for this would be either in CodeGenPrepare or SelectionDAGBuilder. Does anyone know of any reason why it would need to be done earlier?

I'm going to ignore the generalized transactional-memory use cases for the moment. I want to stick to the subset of features which have fairly wide support across platforms. Honestly, the transactional memory bits feel like they should be solved differently anyways.