[RFC]: Strengthen Relaxed Atomics Implementation behind `-mstrict-rlx-atomics` Flag

Hi folks,
After discussing this with Arm/Qualcomm engineers and maintainers of concurrency libraries, I’d like to get your thoughts on my proposal to strengthen the implementation of relaxed atomic loads behind a command-line flag. The full proposal is here:

In short, I propose we implement a suggestion Hans Boehm made some time ago: putting a dummy branch after load instructions on Armv7, Armv8, IBM PowerPC, and RISC-V. This would restore the ability to reason about concurrent code by forbidding load-to-store reordering on these architectures. No work is needed for the x86 or MIPS backends.
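For concreteness, here is a rough user-level sketch of the shape of the change. The real implementation would live in each backend’s lowering of atomic loads; the instruction sequence below is my illustration of the idea, not the proposed codegen:

#include <stdatomic.h>

// Relaxed load followed by a dummy conditional branch to the very next
// instruction: taken or not, execution continues at the same place, but
// the branch creates a control dependency on the loaded value. AArch64
// does not reorder a store before a branch whose condition depends on an
// earlier load, so this forbids load-to-store reordering in hardware.
// The "memory" clobber also stops the compiler from hoisting later
// stores above the load.
static inline int relaxed_load_strict(const _Atomic int *p) {
  int v = atomic_load_explicit(p, memory_order_relaxed);
#if defined(__aarch64__)
  __asm__ volatile("cbz %w0, 1f\n1:" : : "r"(v) : "memory");
#endif
  return v;
}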

Of course I don’t want to outlaw existing uses of relaxed atomics, so I propose to guard the implementation behind a command-line flag, such as -mstrict-rlx-atomics. This gives users the choice between performance (at the cost of reasoning) and easier reasoning (at some cost in performance). The result is a subset of the behaviours allowed by C23, so few, if any, changes would be needed to the standard’s model.

This proposal builds on a decade of work on understanding the cost of relaxed atomics, including work that proves this implementation is sound under existing compilation schemes, shows promising performance characteristics relative to other proposals, and restores the ability to reason about atomics should the user need it.

I propose to test this using Telechat, a compiler testing tool I have developed. Telechat can test the compilation of concurrent programs that use C/C++ atomics for all of the architectures mentioned. Our paper describing it is here:
https://arxiv.org/abs/2310.12337

Please let me know what you think.
Thanks
Luke

Forget CPU reordering - isn’t it permissible for the compiler to reorder relaxed atomics? (Isn’t that the point of the relaxed memory order?) That is, if we assume a known no-alias relationship between x and y, can’t the compiler transform your example into the following, which clearly may result in both threads observing r0 == 1:

// Concurrent program with threads P0 and P1 (assumes <stdatomic.h>)
void P0(atomic_int *y, atomic_int *x) {
  atomic_store_explicit(y, 1, memory_order_relaxed);
  int r0 = atomic_load_explicit(x, memory_order_relaxed);
}

void P1(atomic_int *y, atomic_int *x) {
  atomic_store_explicit(x, 1, memory_order_relaxed);
  int r0 = atomic_load_explicit(y, memory_order_relaxed);
}

Am I missing something, which would make that be an invalid transform?

Yes, this is a valid transform, as the standard states that relaxed operations have no ordering requirements. I propose a stronger implementation, hidden behind a flag.

The problem comes when we want to reason about non-trivial programs with relaxed atomics. The above is data-race free, and in theory any data-race-free program should have sequentially consistent semantics (the SC-DRF guarantee). Unfortunately the original program does not have SC behaviour, despite also being data-race free.
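For reference, the original program is the standard load-buffering (LB) shape - reconstructed here from the discussion, since the proposal isn’t inlined above:

// Load buffering (LB): each thread loads first, then stores.
void P0(atomic_int *y, atomic_int *x) {
  int r0 = atomic_load_explicit(x, memory_order_relaxed);
  atomic_store_explicit(y, 1, memory_order_relaxed);
}

void P1(atomic_int *y, atomic_int *x) {
  int r1 = atomic_load_explicit(y, memory_order_relaxed);
  atomic_store_explicit(x, 1, memory_order_relaxed);
}

// Under any sequentially consistent interleaving, r0 == 1 && r1 == 1 is
// impossible here; C23’s relaxed atomics (and the hardware) allow it.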

This has caused headaches for people trying to model the C/C++ behaviour as we essentially cannot reason about anything beyond trivially small programs.

Some quibbles: I think that the SC-DRF discussion might be a bit of a red herring (this implementation strategy won’t result in relaxed accesses providing that guarantee anyways; code will still be difficult to analyze). Likewise, for better or worse, the C++ memory model isn’t (yet, at least) RC11: we effectively decided not to require SB union RF to be acyclic. So I don’t know that it’s necessarily right to call transformations that break those rules incorrect.

That being said, I’m interested in this as an experiment; I’ve come around to the belief that OOTA is not solvable without some runtime costs or heavyweight syntactic constructs, so we might as well get some data on the runtime side of things.
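For readers following along, the canonical OOTA example is a load-buffering variant in which each store is conditional on the loaded value (the constant 42 is arbitrary):

// Out-of-thin-air (OOTA): each store is justified only by the other.
void T0(atomic_int *x, atomic_int *y) {
  int r0 = atomic_load_explicit(x, memory_order_relaxed);
  if (r0 == 42)
    atomic_store_explicit(y, 42, memory_order_relaxed);
}

void T1(atomic_int *x, atomic_int *y) {
  int r1 = atomic_load_explicit(y, memory_order_relaxed);
  if (r1 == 42)
    atomic_store_explicit(x, 42, memory_order_relaxed);
}

// No real hardware produces r0 == r1 == 42, yet the formal C/C++ model
// has no accepted way to forbid it without restricting load;store
// reordering.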

Two other thoughts:

  • Historically, hardware vendors were opposed to requiring load-to-store ordering. I see you’ve talked to Arm people, but I wonder what the Power people think, as well as GPU vendors.

  • I think the risk here is the language-fork one. There are ways in which this is analogous to default zero-initialization.

Hi David,
Under this proposal I wouldn’t say those optimisations are incorrect; rather, under this compiler flag the allowed behaviours are restricted to a subset of what C23 permits.

Yes, we are looking to see what the performance of this is in practice. We have already spoken to concurrency library maintainers and found basic perf results in the literature, but we would like to see the wider impact.

The PPC folks have actually done something similar to this before - see the questions section of the proposal. That said, I will reach out to the people I know for both PPC and GPU.

What do you mean by a language fork? As far as I know, my proposal merely restricts behaviours to a subset of what the standard already permits - in other words, RC11 is the model of the proposed behaviour; otherwise C23 still holds.

The general “language fork” concern is that if we have a flag that defines behavior that’s otherwise undefined, people start writing code that only works with the flag enabled. So we have a body of code that doesn’t interoperate with other C++ code, and it’s hard to tell which category a given snippet of code falls under. We don’t want more situations like -fwrapv.

I don’t think that’s much of a concern here: the guarantee here is weak enough that most people wouldn’t have any idea how to take advantage of it anyway. Maybe worth considering providing a builtin function to access this functionality without the flag, though.

Under this proposal I wouldn’t say those optimisations are incorrect; rather, under this compiler flag the allowed behaviours are restricted to a subset of what C23 permits.

Right; I mean e.g. this github issue you link. It’s not really a bug; the C++MM allows the outcome that happens (unless I’m missing something?). You propose disallowing them as an extension, but allowing them when the extension isn’t on seems OK.

What do you mean by a language fork? As far as I know, my proposal merely restricts behaviours to a subset of what the standard already permits - in other words, RC11 is the model of the proposed behaviour; otherwise C23 still holds.

All I mean is that it’s possible to write code that’s correct when compiled against clang -O2 -mstrict-rlx-atomics, which then compiles with no errors or warnings when compiled against clang -O2, but is now incorrect.

In this case there is another issue - notably that the control dependency is removed and masked at higher optimisation levels. More generally we have the question of relaxed atomics, but IMO ordering issues include control flow as well; we should fix this regardless.

All I mean is that it’s possible to write code that’s correct when compiled against

That’s a fair point; I think it’s less of an issue here, as Eli says. It’s all technically still allowed.

I think it’s definitely worth an experiment, if nothing else to see if the perf is as good as the papers say. I propose we try it, under a flag at least, and if accepted we can implement a builtin.

Relatedly, I’ve been running the flag name by engineers internally at Arm, and some think -morder-rlx-atomics is a better name. Any thoughts?

Without commenting at all on the technicalities, having “relax(ed)” abbreviated to “rlx” but “order” and “atomics” left in full looks extremely ugly/jarring to me. I would encourage you to expand it in full.

While control dependencies are treated differently than data dependencies by the hardware (on ARM and some others), I’m extremely skeptical about trying to specify anything at the C++ level which attempts to reason in terms of impact of a “removed control dependency”. Attempting to directly map that hardware concept of control-flow dependency (which is completely obvious at the machine level!) to the high-level C++ language doomed the “consume” memory order - for good reason.

So, this bug which asserts that reordering of relaxed atomics is impermissible when there’s a source-level “control dependency” between them is an idea which must be rejected, IMO. Either relaxed atomics may be reordered, or not - without it depending on whether there’s a control dependency between them. Converting a control flow dependency to a predicated instruction dependency is, under the current model, entirely permissible, and the resulting litmus test that is asserted to be wrong is actually OK.
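To make that concrete, here is a minimal sketch (my own example, not the code from the linked issue): when both arms of a branch perform the same store, the compiler may hoist the store out of the branch, erasing the source-level control dependency entirely:

void reader(atomic_int *x, atomic_int *y) {
  if (atomic_load_explicit(x, memory_order_relaxed))
    atomic_store_explicit(y, 1, memory_order_relaxed);
  else
    atomic_store_explicit(y, 1, memory_order_relaxed);
}

// Both arms are identical, so the store can legally be hoisted above the
// branch (or the branch lowered to a predicated sequence); the hardware
// then sees no dependency from the load to the store.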

Fortunately, the proposal in this thread doesn’t have that problem - only the originally-motivating bug report does. Strengthening the relaxed memory ordering is an interesting idea - and possibly even a good idea.

Introducing it via a compiler option in Clang seems odd, though; ISTM that either it should be specified in the standard as required behavior for relaxed atomics (in which case we should implement it unconditionally), or it should be specified in the standard as a distinct atomic memory order, slightly stronger than relaxed (which code can then use if it wants).

I’d also like to hear more about the history of N3710 – why was that rejected when proposed back in 2013, and why do those objections, whatever they were, no longer apply today?

I wasn’t in the room for N3710 specifically, but I’ve been around long enough to see the winds shift on the underlying issue.

I think for a long time people thought that there might be some clever group of academics who would come up with a formalism for “control dependencies” that matches human intuition on real programs. No one wanted to force relaxed loads to have extra overhead relative to non-atomic loads until we were convinced it was impossible to solve OOTA otherwise. I think that there’s a growing belief that it is indeed impossible, so maybe taking some runtime hit is acceptable to get reasonable semantics.

Disallowing load;store reordering outright is an attractively simple way to prevent those OOTA issues, if it’s cheap enough. But how do we know whether it is? There’s sort of a chicken-and-egg problem; the committee doesn’t want to standardize it if there’s no data that it’s low-cost in practice; but compilers don’t always want to implement the modes that would get that data without language-level requirements.

I’m happy to go with the expanded form: -morder-relaxed-atomics.

I’m extremely skeptical about trying to specify anything at the C++ level which attempts to reason in terms of impact of a “removed control dependency”

To be clear, the bug starts with an outcome of the compiled program that is not an outcome of the C/C++ model (forbidden under either C23 or RC11). So the origin of the report is a difference between what you expect and what you observe - a plain old bug!

I explain it in terms of broken control dependencies, but I am not specifying, nor do I propose, any mapping between dependencies. I am just comparing outcomes apples to apples and suggesting, IMHO, how it should be fixed in the generated code, in the spirit of dependencies.

Verifying any mapping between dependencies is undecidable and has been tried in the past - I’m not going near that.

this bug which asserts that reordering of relaxed atomics is impermissible when there’s a source-level “control dependency” between them is an idea which must be rejected […] converting a control flow dependency to a predicated instruction dependency is, under the current model, entirely permissible, and the resulting litmus test that is asserted to be wrong is actually OK.

I’m afraid this isn’t right - under both C23 and RC11 a buggy outcome is observed. This is only explained in terms of dependencies so that a fix may be suggested.

and possibly even a good idea.

:slight_smile:

that either it should be specified in the standard as required behavior for relaxed atomics

My view is that the standard ratifies practice and I propose an alternative practice here. If standards changes are needed, they come second after LLVM and GCC accept it. We would get nowhere if we argued solely from what is currently allowed.

I’m with David on this one:

for a long time people thought that there might be some clever group of academics who would come up with a formalism for “control dependencies”

It hasn’t really worked, AFAIK, and so:

There’s sort of a chicken-and-egg problem; the committee doesn’t want to standardize it if there’s no data that it’s low-cost in practice; but compilers don’t always want to implement the modes that would get that data without language-level requirements.

Let’s break this cyclic dependency and try it out - even in a branch that we can add and remove within one release cycle.

Hi, I thought I’d provide some GPU perspective, as much of my/my advisor’s research has been on testing weak memory implementations of GPUs.

Generally, atomic operations in GPU languages have followed C++ style semantics, with some differences. For example in Vulkan/SPIR-V (a newer cross-platform GPU framework), sequentially consistent atomics are not supported. Languages like SPIR-V are built to target quite a few different architectures from many vendors, many of which are not publicly documented.

Our research has shown that on many GPU vendors, a variety of relaxed-atomic weak behaviors are observable. Load buffering is one example, but so are message passing, store buffering, and a few others, meaning that every simple combination of loads and stores can end up re-ordered. Our first paper (https://dl.acm.org/doi/10.1145/3575693.3575750) tested only four GPUs from Intel/AMD/NVIDIA/Apple, but our larger-scale study (https://dl.acm.org/doi/10.1145/3597926.3598095) ran on over 100 GPUs and showed that many devices (see Sections 4.1/4.2) exhibited all of these relaxed-memory re-orderings.
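For anyone unfamiliar with the litmus-test names, message passing is the flag/data idiom below (sketched in C syntax; our GPU tests run the analogous shader code):

// Message passing (MP): the producer publishes data, then sets a flag.
void producer(atomic_int *data, atomic_int *flag) {
  atomic_store_explicit(data, 1, memory_order_relaxed);
  atomic_store_explicit(flag, 1, memory_order_relaxed);
}

void consumer(atomic_int *data, atomic_int *flag) {
  int f = atomic_load_explicit(flag, memory_order_relaxed);
  int d = atomic_load_explicit(data, memory_order_relaxed);
  // With relaxed ordering, f == 1 && d == 0 is allowed - the consumer
  // sees the flag but stale data.
}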

You can actually test weak behaviors out on your own GPU using WebGPU (which only provides relaxed atomics currently), using the website we built: (WebGPU Memory Model Testing, currently works on Chrome only).

Based on all that, here are a few thoughts:

  • I agree disallowing load/store re-ordering won’t give you SC-DRF guarantees, since other types of re-orderings will still be possible.
  • I’m not sure if the branch after load change would be enough to disallow load buffering on all GPUs, due to their different architectures. It might, but it would also be a lot of work to get all the GPU vendors to agree to it/test it.
  • If CPU/GPU specifications diverge in terms of their relaxed-atomic implementations, it will become more difficult to write interoperable code or to transition between them when learning. Perhaps not a huge deal, since these low-level atomics aren’t used by most developers, but I think it’s worth mentioning.

Hi folks,
Just to clarify. The views expressed are the product of my own research. My sponsors, colleagues, and other companies mentioned may not endorse them.
Thanks
Luke

Hi Reese,
That’s fascinating research and I’m interested to see more. Do you have thoughts or opinions on the code-gen aspect of GPU modelling - specifically, what changes to code generation might be needed? I suspect they are the same as the CPU code-gen changes, but it would be good to confirm.

  • I agree disallowing load/store re-ordering won’t give you SC-DRF guarantees, since other types of re-orderings will still be possible.

I am happy to lean away from this perspective.

If CPU/GPU specifications diverge in terms of their relaxed-atomic implementations, it will become more difficult to write interoperable code or to transition between them when learning.

The proposal should strengthen atomics code gen; I wouldn’t expect any outcomes different from what we already observe, just fewer of them - that is, unless…

  • I’m not sure if the branch after load change would be enough to disallow load buffering on all GPUs, due to their different architectures. It might, but it would also be a lot of work to get all the GPU vendors to agree to it/test it.

What do you mean? This is very interesting!

Thanks
Luke

I’m afraid this isn’t right - under both C23 and RC11 a buggy outcome is observed. This is only explained in terms of dependencies so that a fix may be suggested.

I’m afraid I must correct myself here - the standard permits re-ordering over conditional branches, though it discourages it in 7.17.3 of C23.

Nevertheless, the general proposal stands.

Yeah, I’m not totally sure what changes to code generation might be needed. I haven’t really worked in that space; perhaps there are GPU compiler engineers who could weigh in with their thoughts.

Mainly the concern with branch-after-load would be performance, due to the SIMD nature of GPUs. For example, on many GPU architectures sets of threads execute instructions in lockstep. So if there is branch divergence, threads that do not take the branch are disabled until there is a convergence of control flow. If there are branches on atomic loads, there might be surprising performance results due to control flow issues that are not obvious to the programmer.

Hi, Arm’s official position on the topic can be found in this recent blog:

Please do reach out to memory-model@arm.com if there are any questions.

Thanks,
Jade

I do not think there is any benefit to be had by strengthening relaxed atomic operations. They are intended to provide maximum performance by making no ordering guarantees between threads. This can lead to different threads disagreeing on the order of operations on distinct variables. Users should be aware of this, and use appropriate synchronization operations if they do not desire this outcome.

If your compiler testing tool says that such behaviour is not permitted, this is a bug in your tool, rather than in the compiler, processor or C++ standard.

Memory model topics are always tricky, and I wouldn’t call myself an expert (at anything, really), but I think I speak accurately when I say that, from the perspective of the Java language memory model, reordering of these kinds of accesses is generally considered a feature, not a bug. I would be cautious of characterizing this kind of (intentionally allowed) reordering behavior as “buggy”, and particularly of subsequently indicting implementations on that basis. I believe it is commonly expected that concurrent code generated by Java runtimes such as HotSpot generally benefits from such weak orderings, and I would imagine that it is desirable for LLVM-backed programs to benefit similarly.