I have noticed that, on my target, __sync_fetch_and_add causes clang to generate the following...
%0 = atomicrmw add i32* %val, i32 1 seq_cst
... but __atomic_fetch_add with __ATOMIC_SEQ_CST causes the following to be emitted...
%1 = bitcast i32* %val to i8*
%call = call i32 @__atomic_fetch_add_4(i8* %1, i32 1, i32 5) #1
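For anyone who wants to reproduce this, a minimal C sketch of the two cases is below. The function names bump_sync and bump_atomic are hypothetical, chosen only for illustration. Compiling with something like clang -S -emit-llvm for the target in question should show the difference, though the exact IR will depend on the target's TargetInfo settings and the clang version.

```c
#include <stdint.h>

/* Legacy builtin: takes no memory-order argument and is documented
 * as a full barrier. */
int32_t bump_sync(int32_t *val) {
    return __sync_fetch_and_add(val, 1);
}

/* Newer builtin: the memory order is an explicit argument.
 * __ATOMIC_SEQ_CST has the value 5, which matches the trailing
 * "i32 5" in the library call shown above. */
int32_t bump_atomic(int32_t *val) {
    return __atomic_fetch_add(val, 1, __ATOMIC_SEQ_CST);
}
```

Both functions return the value held before the addition, so they are interchangeable from the caller's point of view; only the generated IR differs.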
I am aware that I need to adjust my target's TargetInfo::MaxAtomicInlineWidth, MaxAtomicPromoteWidth, and hasBuiltinAtomic() in order to get the atomicrmw IR instruction I want. What I'm not sure of is why.
Why does clang itself generate the function call? Why don't we let LLVM choose whether to emit a function call or inline assembly in these cases?
Another reason I care about this is that some atomic operations aren't directly supported by my hardware (4-byte and 8-byte atomics are directly supported; 1-byte and 2-byte atomics are not). I can emulate them, but I would like to know where the function call should be emitted. Suppose I lie to clang and tell it to inline my 1-byte and 2-byte atomics, and let LLVM generate the library call. Are there significant downsides to this approach (like losing the memory model information), or do I need to implement separate library calls for the __sync builtins and the __atomic builtins?
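To make the emulation concrete, here is a rough sketch of what I have in mind for the 1-byte case: a compare-and-swap loop on the aligned 4-byte word that contains the byte. The name emulated_fetch_add_u8 is hypothetical, not an existing library entry point. The sketch assumes the byte lives inside naturally aligned 4-byte storage that may legally be accessed as a uint32_t, and it hard-codes sequential consistency instead of taking a memory-order argument.

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit fetch-add built on a 32-bit CAS loop.
 * Returns the byte value held before the addition. */
uint8_t emulated_fetch_add_u8(uint8_t *ptr, uint8_t val) {
    uintptr_t addr = (uintptr_t)ptr;
    /* Round down to the containing aligned 32-bit word. */
    uint32_t *word = (uint32_t *)(addr & ~(uintptr_t)3);
    unsigned byte_index = (unsigned)(addr & 3);
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    unsigned shift = (3u - byte_index) * 8u;
#else
    unsigned shift = byte_index * 8u;
#endif
    uint32_t mask = (uint32_t)0xFF << shift;

    uint32_t old_word = __atomic_load_n(word, __ATOMIC_RELAXED);
    uint32_t new_word;
    do {
        /* Add within the byte only, so a carry cannot leak into
         * the neighbouring bytes of the word. */
        uint8_t old_byte = (uint8_t)((old_word & mask) >> shift);
        uint8_t new_byte = (uint8_t)(old_byte + val);
        new_word = (old_word & ~mask) | ((uint32_t)new_byte << shift);
        /* On failure, old_word is refreshed with the current contents. */
    } while (!__atomic_compare_exchange_n(word, &old_word, new_word,
                                          1 /* weak */,
                                          __ATOMIC_SEQ_CST,
                                          __ATOMIC_RELAXED));
    return (uint8_t)((old_word & mask) >> shift);
}
```

Because this version is always sequentially consistent, it would be conservative for weaker orderings; whether that is an acceptable trade-off is part of what I am asking.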