[RFC] Refactoring CGAtomic into LLVMFrontend

tl;dr: Move the language-independent parts of CGAtomic.cpp to a new LLVM library LLVMFrontendAtomic, so Flang can use it as well. Do so by first NFC refactor (Draft PR), then gradually use the same functions that LLVM’s AtomicExpandPass already uses.

Motivation

We need to implement lowering to IR for atomic access in Flang. Unfortunately it is not sufficient to just emit an IR instruction such as cmpxchg. In particular, this fails on aggregate types such as complex.

Clang’s IR-gen of atomic construct is located in CGAtomic.cpp. Instead of cmpxchg, it may fallback to __atomic_compare_exchange. It is implemented in either GNU libatomic or compiler-rt.

Looking into how CGAtomic handles this, the conditions are a lot more complicated. Non-integer types may get coerced to integer values. Non-power-of-two-sized types are not allowed with cmpxchg. Types with padding. Types larger then 16 always use libcall to avoid excessively large value types. Atomic operations that are composits of multiple other operations, e.g. #pragma omp atomic update.

Whatever CGAtomic does not lower to a libcall may still be lowered by the AtomicExpandPass or AtomicLower.cpp if the target does not have a machine instruction it can be lowered. One way to look at what CGAtomic does is pre-lowering anything that processors don’t have instructions for anyway.

Proposal

Instead of rediscovering all the performance and correctness pitfalls for Flang, the idea is to move the language-independent parts of CGAtomic into LLVMFrontend, a module explicitly created for language frontends. This means other front-ends may be able to use it. The LLVM Front-end hints page already suggests to emit IR as similar to Clang as possible.

CGAtomic itself might benefit from a refactoring. A class called AtomicInfo was already refactored-out in a8ec7e, but it does not handle all the atomic accesses. Specifically, AtomicExpr have their own lowering logic separate and different from AtomicInfo.

The libcall function calls and declarations emitted by CGAtomic use Clang’s generic ABI lowering mechanisms whereas AtomicExpandPass uses BuildLibCalls.cpp. While the libcall is defined by the C-ABI, this leads to differences in the function attributes, e.g. whether a call is marked as noundef, convergent, signextend, zeroextend, etc. IMHO such implementation-detail differences whether it was lowered by CGAtomic or AtomicExpandPass should be avoided. If they diverge, only one can be correct, lets make that into LLVMFrontend so non-C frontends can use it as well. BuildLibCalls.cpp also knows about about each individual libcall function and applies nocapture, noalias etc. attributes whereas CGAtomic only applies generic attributes for runtime functions.[1]

Approach

I sketched a refactor in Draft extracting CGAtomic to LLVMFrontend by Meinersbur · Pull Request #2 · Meinersbur/llvm-project · GitHub, but only for AtomicCompareExchange so far. It creates a new module LLVMFrontendAtomic with the main class AtomicInfo.
To not introduce correctness problems with the refactor, it is meant to be NFC.
Admittedly, this makes it ugly, but in further steps it can be improved:

  1. Clang needs to forward individual information about the platform’s ABI (e.g. Calling convention) since Clang uses its own classes such as clang::TargetInfo that cannot be moved into LLVM. In contrast, BuildLibCalls.cpp uses llvm::TargetLibraryInfo. Clang creates it only after its CodeGen phase when running the LLVM pass pipeline, but that could be changed. Creating the TargetLibraryInfo object has already been refactored in LLVMFrontendDriver, also with Flang in mind.

  2. CGAtomic uses different codes to generate the llvm::Value for use with cmpxchg compared to when needing a pointer to that value for __compare_atomic_exchange. I currently implemented that as a lambda that materializes the one actually needed in EmitAtomicCompareExchange, and a return value that can also be either the value or the pointer. I think for all cases, a pointer should be sufficient and therefore canonical. The raising to a register is a job for Mem2Reg/SROA.

  3. NFC is by best-effort only, some untested options/targets might still introduce different function/argument attributes. Since the goal is to replace it with BuildLibCalls.cpp anyway, I think this is sufficient.

[EDIT: llvm::AtomicInfo is not a template anymore. The only customization Clang did was customizing the inserter, which persists in IRBuilderBase through the IRBuilderDefaultInserter virtual base class]

Would the community support the goal? If yes, is this the right approach?


  1. LLVM has yet another codebase for handling libcall in RuntimeLibcallUtil.h, but for MachineIR. ↩︎

1 Like

I wonder how this fits into a bigger picture now that there are efforts to upstream ClangIR.
CC @AaronBallman @bcardosolopes

I think all redundant atomic lowering in clang should be removed. Any code required to handle weird cases like aggregates should go into LLVMFrontend, but anything it does should be implemented in terms of atomicrmw/cmpxchg and never a libcall. The true lowering decisions should be left for the backend.

atomicrmw/cmpxchg require integer/pointer/float arguments that are power-of-two-sized, whereas libcall functions allow more general arguments. Apparently, LLVM also does not perform well with very large types, e.g. i4096. Clang also lowers detected builtins such as __atomic_compare_exchange itself using CGAtomic (AtomicExpr::AO__atomic_compare_exchange). That is, I am afraid that a fallback for atomicrmw/cmpxchg is needed even if we signifiacantly relax the atomicrmw/cmpxchg requirements.

Thanks for bringing this to our attention. Looks like a good change, my quick reading of the draft PR is that a lot of the helper logic gets moved into LLVM, which makes the overall CodeGen code actually a bit more simple and easier to read.

From the CIR perspective, once this land we could do similar in CIR: simplify CIRGen with similar skeleton and move helpers to the CIR dialect lib - we’re already doing some of that to a certain extent.

A lot of what has been done in clang for atomics historically is just the legacy of lack of full support in IR. Much of that has been addressed over time. So at this point, I guess I’m unconvinced that there’s enough complexity that needs to remain in a frontend, which can actually be done in a frontend-independent manner, to really be worthwhile abstracting.

The decision of when Clang emits a direct libcall is now simple: not power-of-2-sized, or more than 16-bytes in size, then emit a libcall directly. The non-power-of-2 is just an IR validity limit, which we could potentially relax if we wished.

In other cases, we emit direct IR, just potentially with a bitcast to an integer type. Bitcasting to an integer in the frontend doesn’t seem onerous – but we could also potentially remove that IR requirement.

Some of the complexity in Clang is to pad out an (e.g.) 3-byte struct to 4 bytes, so that it can use lock-free atomics. I think this needs to be frontend-specific – I don’t know how that could be reasonably abstracted to frontend-agnostic.

The question you raise about clang’s libcall generation vs llvm’s libcall generation does seem like an interesting problem, but to me, that seems like a generic issue – not specific to atomic libcalls – which therefore needs a generic answer.

That seems sufficient complexity to at least introduce convenience functions. LLVM does with much simpler things, e.g. all the IRBuilder utility methods, BasicBlockUtils.h, and BuildLibCalls.h itself which could be described as just emitting a call with predefined function name, a subset of what LLVMFrontendAtomic would do.

If you think a new module[1] and classes on its own is too much, standalone functions such as those in BuildLibCalls.h might be sufficient, but it will not come with an NFC version. If the bitcasting and power-of-2 requirement is removed one day for some types, those standalone functions can be updated to affect all front-ends.

Clang also takes alignment into account[2]

class TargetInfo {
  [...]
  /// Returns true if the given target supports lock-free atomic
  /// operations at the specified width and alignment.
  virtual bool hasBuiltinAtomic(uint64_t AtomicSizeInBits,
                                uint64_t AlignmentInBits) const {
    return AtomicSizeInBits <= AlignmentInBits &&
           AtomicSizeInBits <= getMaxAtomicInlineWidth() &&
           (AtomicSizeInBits <= getCharWidth() ||
            llvm::isPowerOf2_64(AtomicSizeInBits / getCharWidth()));
  }

The method is virtual, implying that it could be overridden with more complex behavior. There are no alternative definitions, at least not in-tree.

That answer is BuildLibCall.h, no? If we expect all frontends to be consistent with what LLVM passes expect, front-ends other than Clang cannot use Clang.


  1. LLVMFrontendDriver already exists and has just a single function. ↩︎

  2. EmitAtomicExpr uses a different code path that does as you described) ↩︎

@jyknight I created an alternative implementation trying to mimic LLVM’s BuildLibCall.h. That is, a single standalone function that decides how to emit a generic compare-exchange instruction taking target information into account.

Add emitAtomicCompareExchangeBuiltin helper function · Pull Request #101966 · llvm/llvm-project (github.com)

As you may see there is a lot more complexity than bitcasting and type-size. Please let me know what you think.

My feeling is that this is over-generalized, and therefore, over-complicated.

E.g. the frontend doesn’t need to emit sized libcalls. AtomicExpandPass should be the only place where that functionality lives. And e.g. the backend doesn’t need to generate a runtime switch over memory orderings. So putting all of this functionality into a single entry-point is not useful.

Also, I am not familiar with Fortran, but AFAICT, it looks like Fortran APIs also only supports constant memory-order/weakness values, too. (1 or 2).

What I was trying to say is that the minimal thing necessary just for frontend use should be simple. E.g. what is the code you actually need to duplicate from clang to flang?

Sized libcalls: Once a __atomic_compare_exchange is emitted, it will never be optimized to a sized libcall. There are cases when a sized libcall can be emitted, but not an atomicrmw/cmpxchg that AtomicExpandPass could later emit as a sized libcall: non-constant memorder, target’s MaxAtomicSizeSupported too small.

Runtime switch: This is what the Clang frontend does. Constant arguments is a limitation of the IR that IMHO is the front-end’s task to hide. Where else should it go? Why is this not useful when it is actually required by Clang?

Fortran: I am trying avoid duplication of the same functionality across front-ends (and possibly AtomicExpandPass itself), not just Fortran. If it is limited to just support Fortran use cases then it will not be useful for other front-ends and could just add just another replication of it into Flang, MLIR, and OpenMPIRBuilder. They tend to disagree with each other and will further diverge over time. It would be great to have a canonical source of how these are supposed to be lowered. The LLVM front-end guide recommends emitting IR as Clang does.

Part of the complexity comes from that atomicrmw/cmpxchg is target-dependent:

[…] the type of ‘’ must be an integer type whose bit width is a power of two greater than or equal to eight and less than or equal to a target-specific size limit.

I assume that “target-specific size limit” is TargetLoweringInfo::getMaxAtomicSizeInBitsSupported(). We cannot assume it is 16 bytes since future hardware may support larger atomics. We can also not allow arbitrary large values since IR does not scale well will huge register sizes. Clang currently does not even use MaxAtomicSizeInBitsSupported but clang::TargetInfo::MaxAtomicInlineWidth instead.

The simplest for the front-end would be if it could just always emit __atomic_compare_exchange and AtomicExpandPass is optimizing it to cmpxchg or sized libcalls if possible. However, that is late in the pass pipeline and the libcall does not support everything that cmpxchg does, e.g. weak atomic and sync-scopes.

What would you say the optimal solution would look like?

@jyknight Could elaborate on which details you think emitAtomicCompareExchangeBuiltin should handle and how frontends like Clang would use it?