[CodeGen][AtomicExpandPass] Initial load not atomic?

Hey all,

While reading through AtomicExpand::insertRMWCmpXchgLoop I noticed
that the initial load is documented to be atomic but the code below
generates a non-atomic load. I'm not sure whether it's an actual
problem or not - could that be mis-compiled or mis-optimized?

Inline documentation comment here

Example:

define dso_local zeroext i16 @foo(i16* %0, i16 zeroext %1, i16 zeroext
%2) local_unnamed_addr #0 {
%4 = atomicrmw volatile nand i16* %0, i16 %1 seq_cst
ret i16 %4
}

Running opt -atomic-expand gives:

define dso_local zeroext i16 @foo(i16* %0, i16 zeroext %1, i16 zeroext
%2) local_unnamed_addr #0 {
%4 = load i16, i16* %0, align 2
br label %atomicrmw.start

atomicrmw.start: ; preds = %atomicrmw.start, %3
%loaded = phi i16 [ %4, %3 ], [ %newloaded, %atomicrmw.start ]
%5 = and i16 %loaded, %1
%new = xor i16 %5, -1
%6 = cmpxchg i16* %0, i16 %loaded, i16 %new seq_cst seq_cst
%success = extractvalue { i16, i1 } %6, 1
%newloaded = extractvalue { i16, i1 } %6, 0
br i1 %success, label %atomicrmw.end, label %atomicrmw.start

atomicrmw.end: ; preds = %atomicrmw.start
ret i16 %newloaded
}

Hi,

I believe that this is still correct for your example. The compare and exchange is sequentially consistent and so establishes ordering and and happens-before relationships. The load is unordered but if it contains any value other than the load that happens in the compare-and-exchange then the compare fails and the loop retries. On the second iteration, we have the value from the sequentially consistent atomic. I believe this holds for other memory orderings as well.

This would be an invalid operation if it were done as a source-to-source transform in C, because C states that mixing atomic and non-atomic accesses to the same memory location is UB and so the transformed version would suffer from UB but the source version would not. I can't find any equivalent text in the LLVM LangRef (we should probably explicitly clarify this) and my understanding is that LLVM semantics permit tearing and do not guarantee any ordering, but give, at worse, an unspecified value. Even if we statically propagate that, I would hope that the worst thing that could happen is that the load is elided entirely and we start the loop with a cmpxchg on an arbitrary expected value which may succeed (if you are very lucky) but not in a way that the guess is observable outside or, more likely, the first cmpxchg around the loop fails and we are now in a completely valid and slightly more strongly ordered than necessary code path.

That said, I can't find anything in a quick re-skim of the atomics part of the spec to tell me if this is guaranteed. This code probably originated on x86, where all variants of load (atomic or otherwise) are lowered to the same instruction. I'm not sure what other architectures use cmpxchg lowering, rather than an LL/SC primitive, and aren't TSO.

David