Plan to optimize atomics in LLVM

Hello everyone,

I have recently started working on optimizing C11/C++11 atomics in LLVM, and plan to focus on that for the next two months as an intern on the PNaCl team. I’ve sent two patches on this topic to Phabricator that fix http://llvm.org/bugs/show_bug.cgi?id=17281:

http://reviews.llvm.org/D4796

http://reviews.llvm.org/D4797

The first patch is X86-specific: it tries to apply operations with immediate operands to atomics directly, without going through a register. The main difficulty here is that the X86 backend appears to respect the LLVM memory model rather than the x86-TSO memory model, and may reorder instructions accordingly. To prevent illegal reordering of atomic accesses, the backend converts atomic accesses into pseudo-instructions in X86InstrCompiler.td (RELEASE_MOV* and ACQUIRE_MOV*) that are opaque to most of the rest of the backend, and only lowers them at the very end of the pipeline. I have decided to follow the same approach, adding some more RELEASE_* pseudo-instructions rather than trying to find every possibly misbehaving part of the backend in order to do early lowering. This lowers the risk and complexity of the patch, at the cost of possibly missing some optimization opportunities.
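
To illustrate the kind of codegen the patch aims for, here is a sketch built around a release store of a constant (the exact pseudo-instruction names and selection details are in the review, so treat the assembly below as illustrative only):

@flag = global i32 0

define void @set_flag() {
  store atomic i32 1, i32* @flag release, align 4
  ret void
}

; Without the patch, the immediate first goes through a register:
;   movl $1, %eax
;   movl %eax, flag(%rip)
; With the patch, the store can use the immediate directly, which is still
; a valid release store under x86-TSO:
;   movl $1, flag(%rip)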

Another problem I had with this patch is a failure of TableGen type inference when adding negate/not to the scheme. As a result, I have left these two instructions commented out in the patch. Does anyone have an idea of how to proceed with this?

The second patch is more straightforward: in the C11/C++11 memory model (which LLVM essentially borrows), optimizations like DSE can safely fire across atomic accesses in many cases, basically as long as they do not operate across a release-acquire pair (see the paper referenced in the comments). So I tweaked MemoryDependenceAnalysis to track such pairs and only return a clobber result when one is present.
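
To make this concrete, here is a hedged sketch (not taken from the patch’s test suite) of a case where DSE can now fire:

@a = global i32 0

define void @f(i32* %p) {
  store i32 1, i32* %p, align 4
  %flag = load atomic i32* @a monotonic, align 4
  store i32 2, i32* %p, align 4
  ret void
}

; The monotonic load is not an acquire, so the two stores to %p are not
; separated by a release-acquire pair: no other thread can legitimately
; observe the value 1 without a data race, and DSE may delete the first
; store.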

My next steps will probably be to improve the compilation of acquire atomics in the ARM backend. In particular, they are currently compiled to a load + dmb, while a load + useless dependent branch + isb is also valid (see http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html for example) and may be faster. Even better: if there is already a dependent branch (such as the loop in the lowering of CAS), it only costs a cheap isb. The main step will be switching off the InsertFencesForAtomic flag and doing the lowering of atomics in the backend, because once an acquire load has been transformed into an acquire fence, too much information has been lost to apply this mapping.
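
Concretely, the two mappings look roughly like this (a sketch based on the table at the link above; the exact instruction sequences may differ slightly):

%val = load atomic i32* @x acquire, align 4

; Current ARM lowering:
;   ldr r0, [r1]
;   dmb ish
; Alternative mapping: a branch that depends on the loaded value, followed
; by an isb at the branch target, also orders the load before later
; accesses:
;   ldr r0, [r1]
;   teq r0, r0
;   beq .Lnext
; .Lnext:
;   isb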

Longer term, I hope to improve the fence elimination of the ARM backend with a kind of PRE algorithm. Both of these improvements to the ARM backend should be fairly straightforward to port to the POWER architecture later, and I hope to also do that.

Does this approach seem worthwhile to you? Can I do anything to help the review process?

Thank you,

Robin Morisset

I’ve already made comments on these in the respective review threads. Other readers: please, we need extra reviewers on these! If you have experience reasoning about memory model issues, please chime in.

On the ARM backend plans: I’m unfamiliar with the details of the ARM architecture, so I won’t be able to help you much here. Any reason these couldn’t be done at the IR level?

On helping the review process: small patches, lots of justification, many test cases, and ping folks periodically.

Philip

Longer term, I hope to improve the fence elimination of the ARM backend with
a kind of PRE algorithm. Both of these improvements to the ARM backend
should be fairly straightforward to port to the POWER architecture later,
and I hope to also do that.

Any reason these couldn't be done at the IR level?

I definitely agree here. At the time, it was a plausible idea (the
barriers didn't even exist in IR most of the time). But the
implementation was always going to be much more complicated and less
portable than in IR, and what we actually have is very flawed in its
own right (only applies to ARM mode, unmaintained, ...).

Actually, I think we decided to remove it a while back, but I haven't
gotten around to it yet.

Cheers.

Tim.

I am planning on doing it in IR, but with target-specific passes (such as X86ExpandAtomicPass) that just share some of the code (possibly by having each of the target-specific passes inherit from and override a target-independent pass).

The reasons for doing it in IR are the following:

  • easier sharing of target-independent code
  • easier dealing with control flow (especially useful for advanced fence elimination)

But it must be target-dependent: for example, on Power a seq_cst store has a fence before it, while on ARM it has a fence both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), as sketched below.
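
Here is a sketch of those two mappings, transcribed from the cited table (modulo transcription errors on my part):

store atomic i32 1, i32* @x seq_cst, align 4

; Power mapping (fence before only):
;   hwsync
;   st
; ARM mapping (fence before and after):
;   dmb ish
;   str
;   dmb ish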

For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass into a target-independent and a target-dependent file: the current pass claims to be target-independent but is actually designed for ARM. For example, it puts a release fence before a seq_cst CAS, which would be unsound on Power if the backend were more aggressive and used lwsync for release fences. Since these fences are not chosen based on the LLVM fence semantics but on the hardware memory model, I was thinking of inserting target-specific intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it clearer that these passes are target-specific and unsound outside of their target.

Another thing I would have to move into this IR pass is the insertion of fences around atomic loads/stores when insertFencesForAtomic==true. It currently happens in SelectionDAGBuilder, which makes it impossible to do fence elimination at the IR level.
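
Transposed to IR, what SelectionDAGBuilder does today would look approximately like this (a sketch of the insertFencesForAtomic==true scheme):

; store atomic i32 1, i32* @x seq_cst, align 4   becomes approximately:
fence release
store atomic i32 1, i32* @x monotonic, align 4
fence seq_cst

; %v = load atomic i32* @x acquire, align 4   becomes approximately:
%v = load atomic i32* @x monotonic, align 4
fence acquire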

Is it reasonable, or is there some rule against using hardware-specific intrinsics at the IR level (or some other problem with this approach)?

Cheers,
Robin Morisset

I am planning on doing it in IR, but with target-specific passes (such as X86ExpandAtomicPass) that just share some of the code

This would more normally be done via target hooks in LLVM, though the
principle is sound.

But it must be target-dependent: for example, on Power a seq_cst store has a fence before it, while on ARM it has a fence both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)

That certainly seems to suggest some kind of parametrisation.

For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass into a target-independent and a target-dependent file: the current pass claims to be target-independent but is actually designed for ARM. For example, it puts a release fence before a seq_cst CAS, which would be unsound on Power if the backend were more aggressive and used lwsync for release fences.

I don't know the Power architecture, but this failure ought to be
describable in terms of LLVM's own memory model (if a valid Power
implementation of LLVM IR can trigger it, then the transformation
itself is wrong). Do you have an example execution in mind that shows
it?

Since
these fences are not chosen based on the LLVM fence semantics but on the
hardware memory model, I was thinking of inserting target-specific
intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
clearer that these passes are target-specific and unsound outside of their
target.

If they *are* unsound, that should be changed immediately (and I'll
almost certainly make time to do so, hence my questions here). It's
completely unacceptable to give LLVM's "fence whatever" a
target-specific meaning in my opinion, even briefly.

Another thing I would have to move to this IR pass is the insertion of
fences around atomic stores/loads when insertFencesForAtomic==true. It is
currently happening in SelectionDAGBuilder, which makes it impossible to do
fence elimination at the IR level.

I'm a little worried about this being just one monster "do stuff with
atomics" pass, especially if it ends up one-per-target, but even if
the bulk can remain generic. I'd prefer a more compositional approach
if it can be managed.

Is it reasonable, or is there some rule against using hardware-specific intrinsics at the IR level (or some other problem with this approach)?

Lots of the changes sound like they're going in the right direction.
I'd be particularly pleased to see other architectures using (via
whatever adaptations are necessary) the atomic expansion pass; I think
that could significantly simplify other backends.

I'm a little concerned about changing the "fence XYZ" conversion into
target intrinsics, but it looks likely it'd be necessary for
performance even if the current scheme does turn out to be correct, so
I say go for it!

Cheers.

Tim.

Here is an example of code on which AtomicExpandLoadLinkedPass looks unsound to me (it is a variant of the SB pattern from the literature):

x, y: two i32 globals (accessed atomically)

Thread 0:
store atomic i32 1, i32* @x seq_cst, align 4
%res = cmpxchg i32* @y, i32 0, i32 0, seq_cst, seq_cst
%old_y = extractvalue {i32, i1} %res, 0
Thread 1:
store atomic i32 1, i32* @y seq_cst, align 4
%res = cmpxchg i32* @x, i32 0, i32 0, seq_cst, seq_cst
%old_x = extractvalue {i32, i1} %res, 0

The pass turns the cmpxchg into (approximately; there is a bunch of control flow in addition):

fence release
load-linked monotonic
store-conditional monotonic
fence seq_cst

From my reading of Atomics.rst, it would be sound to reorder the beginning of Thread 0 (it does not say much about load-linked, so I am treating it as a normal load here):

store seq_cst
fence release
load-linked monotonic

into

load-linked monotonic
store seq_cst
fence release

Which would make an execution ending in %old_x = %old_y = 0 possible, whereas it is impossible in the original program.
And since atomic stores may be turned into an AtomicCmpXchg when they are too big, this can even occur in code that originally contains only atomic stores and loads.

Fixing it is a two-line change in insertLeadingFence, but it triggers some test failures, both because some tests look specifically for a fence release here, and because some look for a dmb ishst, which is asserted to be correct for fence release on Swift processors (I have no idea whether it is correct in this specific scenario; I could not find documentation on the exact quirks of the Swift memory model, assuming it is public).
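
For reference, the fix I have in mind amounts to strengthening the leading fence when the CAS is seq_cst (a sketch of the before/after expansion; insertLeadingFence is the hook that picks the fence):

; current expansion of a seq_cst cmpxchg (leading fence too weak):
fence release
; ... load-linked/store-conditional loop ...
fence seq_cst

; proposed expansion:
fence seq_cst
; ... load-linked/store-conditional loop ...
fence seq_cst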

About your other comments:
I will look at target hooks; I had not thought about using them.

I agree that some modularity is required. But currently there is lots of code duplication: for example, RMWs are lowered into monotonic operations + fences in ExpandLoadLinked (as we just saw), and loads/stores are lowered into monotonic operations + fences in SelectionDAGBuilder. I hope to share some code between those by pushing things into a common ExpandAtomicsPass.
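
For concreteness, here is roughly what the ll/sc expansion of an RMW looks like today (a sketch using the ARM ldrex/strex intrinsics; the real pass handles all the orderings and operation kinds):

; atomicrmw add i32* @x, i32 1 seq_cst   expands to approximately:

  fence release
  br label %loop
loop:
  %old = call i32 @llvm.arm.ldrex.p0i32(i32* @x)              ; load-linked
  %new = add i32 %old, 1
  %status = call i32 @llvm.arm.strex.p0i32(i32 %new, i32* @x) ; store-conditional, returns 0 on success
  %failed = icmp ne i32 %status, 0
  br i1 %failed, label %loop, label %done
done:
  fence seq_cst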

Thanks,
Robin

I forgot to say that x and y are to be initially 0 for the example to work.

Hi Robin,

But currently there is lots of code duplication: for example, RMWs are lowered into monotonic operations + fences in ExpandLoadLinked (as we just saw), and loads/stores are lowered into monotonic operations + fences in SelectionDAGBuilder. I hope to share some code between those by pushing things into a common ExpandAtomicsPass.

I've not had time to think about your examples yet (I will!), but
thought I should comment on the history behind this one anyway.

Until very recently SelectionDAG was the only place that did this kind
of transformation. I added ExpandLoadLinked to give a neater route
(primarily for ARM, as you've noticed). But we still had to leave the
legacy path in place for the other architectures.

I'd love to see them do things before SDAG (if nothing else, we have a
long-term goal to nuke the DAG; but I also think it's a much neater
solution anyway). But I didn't have time to port all of the existing
backends, so we ended up with the duplication you're seeing.

I do like the direction you're suggesting here though. The more we can
move out of SelectionDAG, the less we have to port to GlobalISel when
it finally gets here.

Cheers.

Tim.

I am planning on doing it in IR, but with target-specific passes (such as X86ExpandAtomicPass) that just share some of the code

This would more normally be done via target hooks in LLVM, though the
principle is sound.

But it must be target-dependent: for example, on Power a seq_cst store has a fence before it, while on ARM it has a fence both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)

That certainly seems to suggest some kind of parametrisation.

An alternate way of saying this might be that both ARM and Power require the store to be fenced before and after. On Power the fence after is implicit, whereas on ARM it is not. (Is this actually correct? I don't know either of these models well.)

Could you use that framing to factor out the arch-specific and generic parts? I'd really love to have a generic barrier-combining pass which can work on the IR semantics independently of the architecture's barrier semantics.
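
For what it's worth, one kind of rule such a pass could apply purely at the IR level is merging adjacent fences into a single fence at least as strong as each, which is always conservative (a sketch):

fence acquire
fence release
; can be merged, independently of the target, into:
fence acq_rel

; and an adjacent duplicate is simply redundant:
fence seq_cst
fence seq_cst
; -> fence seq_cst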

For this exact reason, I am planning on splitting AtomicExpandLoadLinkedPass into a target-independent and a target-dependent file: the current pass claims to be target-independent but is actually designed for ARM. For example, it puts a release fence before a seq_cst CAS, which would be unsound on Power if the backend were more aggressive and used lwsync for release fences.

I don't know the Power architecture, but this failure ought to be
describable in terms of LLVM's own memory model (if a valid Power
implementation of LLVM IR can trigger it, then the transformation
itself is wrong). Do you have an example execution in mind that shows
it?

Since
these fences are not chosen based on the LLVM fence semantics but on the
hardware memory model, I was thinking of inserting target-specific
intrinsics (dmb/isb on ARM, hwsync/lwsync/isync on Power), to make it
clearer that these passes are target-specific and unsound outside of their
target.

If they *are* unsound, that should be changed immediately (and I'll
almost certainly make time to do so, hence my questions here). It's
completely unacceptable to give LLVM's "fence whatever" a
target-specific meaning in my opinion, even briefly.

I strongly agree with Tim's point here. A latent bug is still a bug.

Another thing I would have to move to this IR pass is the insertion of
fences around atomic stores/loads when insertFencesForAtomic==true. It is
currently happening in SelectionDAGBuilder, which makes it impossible to do
fence elimination at the IR level.

I'm a little worried about this being just one monster "do stuff with
atomics" pass, especially if it ends up one-per-target, but even if
the bulk can remain generic. I'd prefer a more compositional approach
if it can be managed.

Is it reasonable, or is there some rule against using hardware-specific intrinsics at the IR level (or some other problem with this approach)?

Lots of the changes sound like they're going in the right direction.
I'd be particularly pleased to see other architectures using (via
whatever adaptations are necessary) the atomic expansion pass; I think
that could significantly simplify other backends.

I'm a little concerned about changing the "fence XYZ" conversion into
target intrinsics, but it looks likely it'd be necessary for
performance even if the current scheme does turn out to be correct, so
I say go for it!

I would say there's a burden of justification that the target-intrinsic approach is substantially better performance-wise. This doesn't have to be extensive, but something should be presented (if the generic approach is actually possible).

Philip

I am planning on doing it in IR, but with target-specific passes (such as X86ExpandAtomicPass) that just share some of the code

This would more normally be done via target hooks in LLVM, though the
principle is sound.

But it must be target-dependent: for example, on Power a seq_cst store has a fence before it, while on ARM it has a fence both before and after it (per http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html)

That certainly seems to suggest some kind of parametrisation.

An alternate way of saying this might be that both ARM and Power require
the store to be fenced before and after. On Power the fence after is
implicit, whereas on ARM it is not. (Is this actually correct? I don't know
either of these models well.)

Could you use that framing to factor out the arch-specific and generic parts? I'd really love to have a generic barrier-combining pass which can work on the IR semantics independently of the architecture's barrier semantics.

More precisely, both ARM and Power require a barrier between every store seq_cst and every later load seq_cst (among lots of other requirements). On Power the mapping achieves this with a full fence before every load seq_cst, whereas ARM uses a full fence after every store seq_cst.

I would also love to have a generic barrier-combining pass, but I strongly doubt it is at all possible.

Is it reasonable, or is there some rule against using hardware-specific intrinsics at the IR level (or some other problem with this approach)?

Lots of the changes sound like they're going in the right direction.
I'd be particularly pleased to see other architectures using (via
whatever adaptations are necessary) the atomic expansion pass; I think
that could significantly simplify other backends.

I'm a little concerned about changing the "fence XYZ" conversion into
target intrinsics, but it looks likely it'd be necessary for
performance even if the current scheme does turn out to be correct, so
I say go for it!

I would say there's a burden of justification that the target-intrinsic approach is substantially better performance-wise. This doesn't have to be extensive, but something should be presented (if the generic approach is actually possible).

For one simple example: acquire loads on ARM that are followed by a dependent branch can be implemented by putting an isb fence at each target of the branch (I can look up the reference for this if you want), which is supposedly cheaper (I am hoping to get some benchmarks on this and similar things soon). But all the C11 fences, including the acquire fence, require a full dmb barrier. So it is impossible to express this optimized mapping of acquire loads in a target-independent way.

ping.

I am currently working on such a truly target-independent atomics expansion pass, and hope to have some patches ready for review by the end of the week. But I would like your opinion on my example above of the unsoundness of using a leading release fence for a seq_cst CAS, in order to know whether I should fix it in the process.

Thanks,
Robin

From my reading of Atomics.rst, it would be sound to reorder (it does not say much about load-linked, so I am treating it as a normal load here)

store seq_cst
fence release
load-linked monotonic

into

load-linked monotonic
store seq_cst
fence release

Which would make an execution ending in %old_x = %old_y = 0 possible, whereas it is impossible in the original program.

Hmm, evil. Well, I'm convinced. Thanks very much for taking the time
to come up with an example and telling us about it.

Fixing it is a two-line change in insertLeadingFence, but it triggers some test failures, both because some tests look specifically for a fence release here,

That's fine, we can change those easily enough. The same goes for the "dmb ishst" tests (as I understand it, it *is* a release fence, but almost certainly not suitable for preventing this reordering).

Cheers.

Tim.

Hi,

I have done a refactoring that merges AtomicExpandLoadLinkedPass and X86AtomicExpandPass, fixing the bug above in the process. I am wondering what the best way is to cut such a patch into pieces for review. I have attached the corresponding patches; they are not completely ready for review (mostly missing tests), I would just like to make sure that the general approach seems reasonable.

If so, I will finish them and send them on Phabricator for review.

Thanks,
Robin

0001-Add-hooks-to-TargetLowering-for-a-future-AtomicExpan.patch (5.38 KB)

0002-Copy-AtomicExpandLoadLinkedPass-X86AtomicExpandPass-.patch (21 KB)

0003-X86-Use-AtomicExpandPass-instead-of-X86AtomicExpandP.patch (5.79 KB)

0004-Erase-X86AtomicExpandPass.patch (12.3 KB)

0005-ARM-Use-AtomicExpandPass-instead-of-AtomicExpandLoad.patch (64.6 KB)

0006-AArch64-Use-AtomicExpandPass-instead-of-AtomicExpand.patch (4.69 KB)

0007-Erase-AtomicExpandLoadLinkedPass.patch (20 KB)

Hi Robin,

I have
attached the corresponding patches; they are not completely ready for review
(mostly missing tests), I would just like to make sure that the general
approach seems reasonable.

I've had a quick glance at the patches, and the code seems fairly sane.

But I'm not so sure about starting with a new pass and then deleting the other two: I think it's likely to muddy the revision-control history. I'd prefer to see a gradual evolution of AtomicExpandLoadLinked.cpp (perhaps starting with a "git mv" and some internal renaming to stake out the intent, followed by adding and using the extra hooks).

Cheers.

Tim.

Thanks for taking the time to look at the patches.

I will redo them in the way you suggest and put them on Phabricator for review.

Cheers,
Robin

First batch of patches for review:
http://reviews.llvm.org/D4958

http://reviews.llvm.org/D4959

http://reviews.llvm.org/D4960

http://reviews.llvm.org/D4961
These fix the problem exposed above (and another I found since), and affect only the ARM backend.

I have another patch coming soon with more refactoring and the merge with X86AtomicExpandPass. Do you want to be added as subscribers/reviewers? (I only put Tim Northover as a subscriber on the first batch because he looked interested in that bug.)

Cheers,
Robin