[RFC] Add clang atomic control options and pragmas

This RFC proposes the addition of an atomic pragma to Clang, designed to provide a more flexible mechanism for users to specify how atomic operations should be handled during the lowering process in LLVM IR. Currently, the atomicrmw instruction in LLVM IR can be lowered to either atomic instructions or CAS loops, depending on whether the target supports atomic instructions for a specific operation type or alignment. However, there are cases where the decision-making process for lowering an atomicrmw instruction cannot be fully expressed by the existing IR.
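For readers less familiar with the two lowering strategies, here is a rough source-level analogue of the CAS-loop expansion of a floating-point atomic add. It is only a sketch: the function name `fetch_add_via_cas` is made up for illustration, and the real expansion happens on IR in the backend, not in C++.

```cpp
#include <atomic>

// Illustrative only: a source-level analogue of what a CAS-loop expansion of
// a floating-point atomic add does. `fetch_add_via_cas` is a hypothetical name.
// Returns the value observed before the addition, as atomicrmw does.
float fetch_add_via_cas(std::atomic<float> &target, float operand) {
  float observed = target.load(std::memory_order_relaxed);
  // On failure, compare_exchange_weak stores the current value into
  // `observed`, so the sum is recomputed from fresh data on every retry.
  while (!target.compare_exchange_weak(observed, observed + operand,
                                       std::memory_order_seq_cst,
                                       std::memory_order_relaxed)) {
  }
  return observed;
}
```

A native atomic add instruction replaces the whole loop with a single operation, which is why the choice between the two matters for performance.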

For instance, consider a scenario where a floating-point atomic add instruction does not conform to IEEE denormal mode requirements on a particular subtarget. Even though this non-conformance exists, users might still prefer the corresponding IR to be lowered to atomic instructions if they are unconcerned about denormal mode. This means that the backend needs to be informed through IR whether to ignore the floating-point denormal mode during the lowering process. Another example involves an atomic instruction that may not function correctly for specific memory types, such as memory accessed through PCIe, which only supports atomic integer add, exchange, or compare-and-swap operations. To ensure correct and efficient lowering of atomicrmw instructions, the backend must be aware of the memory type involved.

To convey this necessary information to the backend, we propose adding target-specific metadata to atomicrmw instructions in IR. Since this information is provided by users, a flexible mechanism is needed to allow them to specify these details in the source code. To achieve this, we introduce a pragma in the format of
#pragma clang atomic no_remote_memory(on|off) no_fine_grained_memory(on|off) ignore_denormal_mode(on|off)
This pragma allows users to specify one, two, or all three options and must be placed at the beginning of a compound statement. The pragma can also be nested, with inner pragmas overriding the options specified in outer compound statements or the target’s default options. These options will then determine the target-specific metadata added to atomic instructions in the IR.
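A small sketch of how nesting might look in practice, using the syntax proposed above. The function and variable names are hypothetical, and the assumption that an inner block inherits any option it does not mention is my reading of the description, not something the text spells out. A compiler without the patch simply ignores the pragma, usually with an unknown-pragma warning.

```cpp
// Hypothetical example of the proposed pragma; `update`, `flags`, and
// `counter` are illustrative names only.
unsigned update(unsigned *flags, unsigned *counter, unsigned bit) {
  #pragma clang atomic no_remote_memory(on) no_fine_grained_memory(on)
  // Outer block: the user promises that the memory touched here is neither
  // remote nor fine-grained.
  unsigned old_flags = __atomic_fetch_or(flags, bit, __ATOMIC_RELAXED);
  {
    #pragma clang atomic no_remote_memory(off)
    // Inner block: overrides only no_remote_memory. The outer
    // no_fine_grained_memory(on) is assumed to be inherited.
    __atomic_fetch_add(counter, 1u, __ATOMIC_RELAXED);
  }
  return old_flags;
}
```

The program computes the same result with or without the pragmas; only the metadata attached to the two atomic instructions would differ.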

In addition to the pragma, a new compiler option is introduced: -fatomic=no_remote_memory:{on|off},no_fine_grained_memory:{on|off},ignore_denormal_mode:{on|off}. This compiler option allows users to override the target’s default options through the Clang driver and front end.

The design of the atomic pragma and the associated compiler option is intended to be target-neutral, enabling potential reuse across different targets. A target might choose not to emit metadata for some or all of these options, or might add new options to the pragma. The overall design is inspired by Clang’s floating-point pragma, which conveys extra information to the backend about how floating-point instructions should be lowered. Importantly, the metadata introduced by this pragma can be dropped from the IR without affecting the correctness of the program, as it is primarily intended to improve performance.

In terms of implementation, the atomic pragma is represented in the AST by trailing data in CompoundStmt. The parser in Clang maintains an atomic options stack in Sema, which is updated whenever the atomic pragma is encountered. When a CompoundStmt is created, it includes the current atomic options. RAII is employed to save and restore atomic options when transitioning between outer and inner CompoundStmts.

During code generation in Clang, the CodeGenModule maintains the current atomic options, which are used to emit the relevant metadata for atomic instructions. As with the parsing phase, RAII is used to manage the saving and restoring of atomic options when entering and exiting nested CompoundStmts. This ensures that the correct metadata is generated in the IR, reflecting the user’s specified options accurately.
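The save-and-restore pattern described for both Sema and CodeGenModule can be sketched roughly as follows. All type and function names here (`AtomicOptions`, `CompilerState`, `AtomicOptionsScope`, `inner_sees_override`) are hypothetical stand-ins, not the names used in the actual patch.

```cpp
// Hypothetical names throughout; this is not the code from the patch.
struct AtomicOptions {
  bool no_remote_memory = false;
  bool no_fine_grained_memory = false;
  bool ignore_denormal_mode = false;
};

// Stand-in for the state that Sema or CodeGenModule would hold.
struct CompilerState {
  AtomicOptions current;
};

// Saves the current options on entry to a compound statement and restores
// them when the scope object is destroyed on exit.
class AtomicOptionsScope {
public:
  explicit AtomicOptionsScope(CompilerState &state)
      : state_(state), saved_(state.current) {}
  ~AtomicOptionsScope() { state_.current = saved_; }
  AtomicOptionsScope(const AtomicOptionsScope &) = delete;
  AtomicOptionsScope &operator=(const AtomicOptionsScope &) = delete;

private:
  CompilerState &state_;
  AtomicOptions saved_;
};

// Simulates visiting an outer block whose pragma turns an option on and an
// inner block whose pragma turns it off again.
bool inner_sees_override(CompilerState &state) {
  AtomicOptionsScope outer(state);
  state.current.no_remote_memory = true;    // outer pragma
  bool seen_in_inner;
  {
    AtomicOptionsScope inner(state);
    state.current.no_remote_memory = false; // inner pragma overrides
    seen_in_inner = state.current.no_remote_memory;
  }
  // Leaving the inner scope restored the outer block's setting.
  return !seen_in_inner && state.current.no_remote_memory;
}
```

Once the outer scope object is destroyed as well, the state is back to whatever was in effect before the outer block was entered.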

An initial implementation of this RFC can be found at

Your feedback is welcome. Thanks.

How would this interact with wrappers like std::atomic?

std::atomic produces AtomicExpr nodes in the AST through Clang atomic builtin calls. When these are emitted as atomic instructions in IR, metadata is added according to the effective atomic options of the enclosing compound statements. The same applies to other AtomicExpr nodes in the AST, including those produced by OpenMP atomic pragmas.

Using std::atomic results in function calls foremost. Does that mean that multiple functions would get generated depending on what mode we’re currently in?

The proposed pragma and options do not affect the number of atomic instructions generated in IR. They only add metadata that affects how the backend lowers those instructions. Previously, for correctness, a backend might emit a CAS loop in the ISA for certain std::atomic operations. With this change, given the memory-type promise carried by the metadata, a backend may emit atomic instructions in the ISA for better performance.

I think you’re misunderstanding Nikolas’s point. The surrounding statements of the primitive atomic expressions for std::atomic are inline functions in the C++ standard library headers, which will not typically be in the scope of a pragma in user code. To make local pragmas affect std::atomic, you would need standard libraries to adopt some kind of language feature that changed the resolution of function calls. I don’t think that’s completely intractable, but it’s not discussed in your RFC.
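To make the scoping problem concrete, here is a minimal sketch. `fake_std::atomic_box` is a made-up stand-in for a standard library wrapper, and the pragma uses the syntax proposed in the RFC (a compiler without the patch ignores it, usually with a warning).

```cpp
// Hypothetical stand-in for a library header. The atomic builtin sits in the
// body of fetch_add, which is its own compound statement in the header.
namespace fake_std {
template <typename T> struct atomic_box {
  T value;
  T fetch_add(T operand) {
    return __atomic_fetch_add(&value, operand, __ATOMIC_SEQ_CST);
  }
};
} // namespace fake_std

int user_code(fake_std::atomic_box<int> &counter) {
  #pragma clang atomic no_remote_memory(on)
  // The call is lexically inside the pragma's compound statement, but the
  // builtin it reaches is in fetch_add's body, which is not. A purely
  // lexical pragma therefore would not apply to that operation.
  return counter.fetch_add(1);
}
```

This is the same limitation that lexically scoped floating-point pragmas have with inline functions.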

More generally, it feels to me like it might be better to express these through other language mechanisms.

Atomic expressions that perform floating-point operations should already be sensitive to floating-point pragmas. Is there no current pragma that’s sufficient to express that precise behavior for denormals is not required? Because if there is, then writing an atomic expression within that pragma should be good enough to allow this specialized atomic operation to be selected.

The PCIe use case strikes me as highly specialized; my immediate reaction is to question whether this is common enough to merit a compiler feature instead of simply asking the driver programmers in question to use inline asm. If you do need a compiler feature, perhaps it’d be better to model the PCIe memory as a special address space.

There is not (and I don’t really think there should be one). We don’t have a way to express this in the IR (we would need some kind of ftz/daz-permitted control on atomicrmw, which doesn’t support fast-math flags). The denormal mode function attribute is not permission to ignore the mode.

Right, the intended user of these is the builtin header implementations (primarily for the atomicAdd style of CUDA defined functions).

Emphatically no. There are about 100 reasons this is intractable and unmaintainable. A significant fraction of the reason for doing this is to ensure the backend is the only place that needs to deal with atomic legalization. The matrix of possible conditions that need to be considered is gigantic and cannot be unloaded on library or frontend writers. There’s little to no ISA compatibility between targets, and the primary issue with remote atomics depends on the system configuration, which we do not know at compile time. Right now different frontends have their own hacked-together, imprecise, manually written cmpxchg expansions.

This doesn’t really behave like an address space, and is not expressed in the languages as an address space. The support area for treating it as an address space is large. We just need a marker on a single atomicrmw instruction.

Well, anything that’s actually semantically required for FP operations really does need to be representable for FP atomicrmw, up to and including having a constrained FP intrinsic. So I don’t think it’s generally unreasonable to have this kind of thing in the representation. Also, adding arbitrary target-specific metadata is basically doing that, just without any pretense at a unified representation.

What’s the actual hardware behavior you’re trying to compensate for here? A lot of hardware has support for flushing denormals on arbitrary FP operations, so if this can be described that way, it’s a relatively portable feature. If it’s less predictable than that — it only flushes in certain situations, or something even weirder — then okay, that’s more difficult.

Okay. So we’re agreed that the answer to Nikolas’s question is that uses of std::atomic would not be sensitive to this pragma.

I feel like you’re arguing two or three distinct points here. Let’s try to keep them separated as much as we can.

  1. You’d like LLVM’s atomicrmw representation to be more flexible in the hopes of getting frontends out of the business of emitting cmpxchg loops. I don’t think this is 100% feasible (there are a lot of weird situations that can happen in source that I don’t think it will ever be reasonable to extend atomicrmw to cover, like adding a double to an atomic float), but I agree that it’s a good goal for normal situations to be lowered in LLVM rather than in frontends. I certainly don’t think frontends should be emitting inline asm if it wasn’t written in the source, and I am not arguing in favor of that.

  2. You’d like to add this pragma to Clang. This is what my concern is about. I am not convinced that this specific issue with atomics on PCIe-mapped memory is anything other than highly target-specific, or that plumbing through a generic feature and having the frontend give it special treatment for that target is more sensible than adding inline asm to 1-2 places in a specific target’s CUDA headers.

Thank you for the feedback, John. I understand the concern about introducing a new pragma for what seems like a highly specific issue with PCIe-mapped memory. However, I believe the need for this pragma extends beyond just PCIe and addresses a broader set of challenges in lowering atomic instructions.

The complexity of lowering atomic instructions in IR depends on several factors like subtarget capabilities, value types, and whether the memory is remote or fine-grained. These factors can lead to different lowering strategies, such as using atomic instructions or a CAS loop. Managing this complexity through inline assembly or library headers would add significant maintenance burden and potential for inconsistency. Using metadata attached to atomic instructions in the IR allows the backend to make informed decisions while keeping the IR subtarget-neutral until the final lowering phase, which is particularly important when using IR as a generic binary representation across multiple subtargets.

We opted for a pragma because it’s consistent with existing Clang conventions, like #pragma unroll or #pragma fp_contract, which users are already familiar with. While a compound statement attribute could technically work, it would diverge from the established pattern. Moreover, using a target-neutral pragma instead of a target-specific one, like #pragma amdgpu atomic, makes sense because the limiting factors often stem from interconnect specs or operating systems rather than the processor itself, affecting multiple targets.

Pragmas are not known for their good language design; attributes are much easier for users to reason about in general. We should not be adding new pragmas to Clang simply because other pragmas exist. In my opinion, new pragmas should only be introduced when there is a standard mandating that design choice or when attributes simply cannot express the same semantics.

Adding a new pragma with a grab-bag of options on it that are all effectively target-specific is also a significant maintenance burden. Right now, it sounds like you’re trying to replace a handful of #if chains in a single AMDGPU CUDA header with a few thousand lines of compiler code.

Even if this is a good trade-off, I don’t understand why it’s not a builtin. Scoped pragmas are a good language design for controlling floating point because you want the pragma to uniformly affect a large and heterogeneous set of FP operations. I’ve written quite a bit of lock-free code, and the atomic operations are always isolated and very deliberately considered. And it sounds like your pragma is going to go in an inline function in a header where it applies to exactly one operation.

We need to support atomic expressions in OpenMP, which is in the form of

#pragma omp atomic
x++;

Okay, so you do expect that this will be significantly used in arbitrary user code and not just in selected places in system headers? Is that true for all of the options on this pragma?

There are no concerns about OpenMP pragmas because those follow a standard which mandates that we support a pragma.

Yes.

Pragmas are not known for their good language design; attributes are
/much/ easier for users to reason about in general. We should not be
adding new pragmas to Clang simply because other pragmas exist. In my
opinion, new pragmas should only be introduced when there is a
standard mandating that design choice or when attributes simply cannot
express the same semantics.

To expand on this more, there are some cases like floating-point pragmas
where their use is tolerated because there hasn’t yet been offered a
better way to achieve their goals, and even here, it’s acknowledged that
the pragmas have severe limitations (such as the inline function issue
that John and Nikolas brought up). But there are more cases where paths
other than pragmas have proven more fruitful–virtually every
OpenMP-like construct that has come along after OpenMP has eschewed the
use of pragmas, for example.

For this proposal in particular, I’m having a hard time seeing why a
pragma is the best option. The utility for the floating-point pragmas is
because there is a need to change the semantics of infix operators like
a + b, where creating and using builtin alternatives for different
semantic variants is clearly too much hassle [1]. But the proposal here
is on atomicrmw operations, which are already based on custom builtin
functions, so using somewhat different builtins seems to be less of an
issue. Furthermore, atomics in C++ idiomatically go through inline
functions, so it seems you immediately run into the biggest issue with
existing pragmas. I am less than persuaded that this is a tolerable use
of pragmas.

[1] Even then, I’d argue that a better language design is using an
fma_fast function that is either an FMA or a FMUL/FADD pair, depending
on which is faster, instead of #pragma STDC FP_CONTRACT.

The reason we cannot use a builtin is that OpenMP atomic expressions do not go through builtins, as I explained in previous comments. Another reason is that users may want to tag all atomic expressions in a block of code instead of tagging them one by one.

Could you elaborate on the fp pragma limitation regarding inline functions? And are there better solutions? Thanks.

But you’re not proposing OpenMP’s pragma, you’re proposing something with significantly different syntax and surface area to it, right?

This does not require a pragma. You can put an attribute on a compound statement.

If we choose to use a compound statement attribute, what would be a proper format?

How about

[[clang::atomic(no_remote_memory:on, no_fine_grained_memory:on, ignore_denormal_mode:on)]]