Why does __atomic generate IR calls when __sync generates atomicrmw?

I have noticed that, on my target, __sync_fetch_and_add causes clang to generate the following...
     %0 = atomicrmw add i32* %val, i32 1 seq_cst

... but __atomic_fetch_add with __ATOMIC_SEQ_CST causes the following to get emitted...
     %1 = bitcast i32* %val to i8*
     %call = call i32 @__atomic_fetch_add_4(i8* %1, i32 1, i32 5) #1

Now, I am aware that I need to tweak my version of TargetInfo::MaxAtomicInlineWidth, MaxAtomicPromoteWidth, and hasBuiltinAtomic() in order to get the atomicrmw IR instruction I want. What I'm not sure of is why.

Why is a function call generated in clang? Why don't we let LLVM choose whether to emit a function call or inline assembly in these cases?

Another reason I care about this is that some atomic operations aren't directly supported by my hardware (4-byte and 8-byte atomics are directly supported; 1-byte and 2-byte atomics are not). I can emulate them, but I would like to know where the function call should be emitted. Suppose I lie to clang and tell it to inline my 1-byte and 2-byte atomics, and let LLVM generate the library call. Are there significant downsides to this approach (such as losing the memory-model information), or do I need to implement separate library calls for the __sync builtins and the __atomic builtins?

> Why is a function call generated in clang? Why don't we let LLVM choose whether to emit a function call or inline assembly in these cases?

The short answer is: that was the easiest thing to implement, and various bits of the atomic support were done in a bit of a rush. It’s not the right design decision, because it prevents the optimisers from working on these operations.

If someone has time to work on this, then it would be much better to:

- Extend atomic compare and exchange to work natively on floating-point and pointer types (remove the bitcasts to integers - they’re problematic for various things in both the optimisers and code generation)

- Modify clang to always emit the atomicrmw or cmpxchg instructions.

- Add a parameterisable pre-codegen lowering pass that will transform them into calls to library functions if not supported by the target.
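Conceptually, such a lowering pass would rewrite the inline instruction into the libcall form that clang currently emits up front (a sketch, reusing the two IR shapes shown earlier; the constant 5 in the last argument encodes seq_cst):

```llvm
; Before lowering - what clang would always emit:
%0 = atomicrmw add i32* %val, i32 1 seq_cst

; After the pre-codegen lowering pass, on a target that cannot
; inline a 4-byte atomic add:
%1 = bitcast i32* %val to i8*
%2 = call i32 @__atomic_fetch_add_4(i8* %1, i32 1, i32 5)
```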

> Another reason I care about this is because some atomic operations aren't directly supported by my hardware (4-byte and 8-byte atomics are directly supported; 1-byte and 2-byte atomics are not).

If you look in the MIPS back end, you’ll see exactly the same thing: 32-bit and 64-bit ll/sc are natively supported, and the rest must be turned into an ll/sc loop with a mask.
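The same masking idea can be sketched at the source level. The function below is a hypothetical emulation (the name emulated_fetch_add_2 is invented) of a 2-byte atomic fetch-and-add built on a natively supported 4-byte compare-and-swap, assuming a little-endian target:

```c
#include <stdint.h>

/* Hypothetical emulation of a 2-byte atomic fetch-and-add using a
   4-byte compare-and-swap on the containing aligned word, in the same
   spirit as the MIPS ll/sc-with-mask expansion. Assumes little-endian. */
static uint16_t emulated_fetch_add_2(uint16_t *addr, uint16_t inc) {
    /* Locate the aligned 4-byte word containing the 2-byte value. */
    uintptr_t word_addr = (uintptr_t)addr & ~(uintptr_t)3;
    uint32_t *word = (uint32_t *)word_addr;
    unsigned shift = (unsigned)((uintptr_t)addr - word_addr) * 8;
    uint32_t mask = (uint32_t)0xFFFF << shift;

    uint32_t old = __atomic_load_n(word, __ATOMIC_RELAXED);
    for (;;) {
        uint16_t cur = (uint16_t)((old & mask) >> shift);
        /* Replace only the 2-byte lane; leave the neighbour intact. */
        uint32_t desired = (old & ~mask) |
                           ((uint32_t)(uint16_t)(cur + inc) << shift);
        /* On failure, __atomic_compare_exchange_n reloads 'old'. */
        if (__atomic_compare_exchange_n(word, &old, desired, /*weak=*/0,
                                        __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
            return cur;  /* previous 2-byte value */
    }
}
```

Whether this loop lives inline, in a compiler-rt-style library function, or in a target expansion pass is exactly the design question raised above.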

David