Design question about a fence elimination algorithm

Hello everyone,

I am hitting a bit of a design problem and I would like your advice before jumping to the implementation.

I am planning to implement a fence elimination/insertion algorithm that would be more general than the current one (inter-basic block, at the IR level, could be adapted to other architectures such as Power fairly easily). I will explain the details of that algorithm later if you want but they are not relevant to this issue. My problem is that when AtomicExpand calls emitLeading/TrailingFence (hooks from TargetLowering), the target greedily inserts appropriate barriers, losing information which would be required for good fence elimination.
For example, on ARMv7 an acquire load trailing fence can either be a dmb (what is currently emitted), or an isb fence after a control-dependent branch. Ideally, the code would first emit and optimize the dmb barriers, then for each acquire load see whether there is already a dmb (nothing to do), a control-dependent branch (emit an isb) or something else (emit a dmb). But once a load acquire has been lowered to a dmb it is too late, that information is lost. There are other similar cases where useful information can be lost (for example the leading fence of a release store is a dmb that can commute with other accesses to the same location… but not with accesses to other locations. Once the dmb is emitted, we forget which access spawned it).

I see mostly two different approaches for fixing this, and don’t know which to prefer (and maybe you can see a third way of doing it ?).

  1. The first idea is to have an IR-level pass per backend doing the elimination for each architecture (they can share code through some target-independent helper functions), and to preserve the information through some special pseudo-instruction (a new target-dependent intrinsic ? a normal dmb/sync but with special metadata ?). So for example there would be DMBAfterAcquireLoad, which would take the load as an operand and could be turned into a control dependency + isb later. In case where fence elimination is disabled, there would be a basic pass to lower those in the trivial way.
    Drawback: will require adding a bunch of pseudo-instructions to the IR. Even though they probably won’t be seen by any other pass, it could be seen as a source of complexity.

  2. Another possibility is to have the ExpandAtomicPass keep a vector of the places where it currently calls emitLeading/TrailingFence, and give it at the end of runOnFunction to another target hook. The target could then do the fence elimination/insertion, having all the information present. Code could be shared between the targets in the same way as above through helper functions.
    Drawback: I suspect it might be a bit involved to keep a vector of insertion points in a function while mutating the control-flow of the function. Is there any simple way of doing that without a risk of invalidation ?

Thank you,
Robin Morisset

2) Another possibility is to have the ExpandAtomicPass keep a vector of
the places where it currently calls emitLeading/TrailingFence, and give it
at the end of runOnFunction to another target hook. The target could then
do the fence elimination/insertion, having all the information present.
Code could be shared between the targets in the same way as above through
helper functions.
Drawback: I suspect it might be a bit involved to keep a vector of
insertion points in a function while mutating the control-flow of the
function. Is there any simple way of doing that without a risk of
invalidation ?

One thing we didn't discuss in person (kind of related to 2 here) is that
you could use metadata on each instruction. The metadata would record the
information that the passes currently discard when lowering, and the
algorithm would have to be able to handle partial loss or duplication of
this data (say, if an instruction gets copied with or without its metadata,
and potentially deleted).