I have been thinking about implementing a fence elimination algorithm based on PRE over the last few weeks. It will operate on target-dependent fences (as there are very few target-independent fence elimination opportunities), but should be general enough to be used by both the ARM and the Power backends just by tweaking some simple hooks. In this email I will try to summarize my ideas about it, to get some feedback before I start implementing. If you have any question/suggestion/comment/criticism, please say so !
TL;DR: I have a handful of questions:
- Should I try to make the code general enough to be used for full PRE later, or only care about fences ?
- Is it ok to use metadata to preserve information that would otherwise be lost by a transformation pass (and would help a later pass) ?
- If you have read the text below, does the approach make sense ?
There is already a local fence elimination algorithm for ARM (lib/Target/ARM/ARMOptimizeBarriersPass.cpp), that tries to find pairs of dmb with no memory access in between and eliminate one of the two (report describing the idea/results). The goal here is to do better, in particular to hoist fences out of loops and do optimization across basic blocks.
This is closely related to PRE: we want a dmb on every path between some accesses and some other accesses, PRE places a computation on every path between the definition of its operand and the uses of its results. Sadly, only an extremely limited form of PRE is done in LLVM, inside a sub-phase of the GVN pass, so we cannot reuse existing infrastructure.
Most modern papers on PRE see it as a flow problem: if all the operands are connected to a source and all the uses of the computation are connected to a sink, then the goal is to find a cut of the graph between the source and the sink that verifies an additional property.
For classical PRE (i.e. conservative, static), the extra property is that no path through the CFG goes through more instances of the computation than before. For speculative PRE (SPRE), it is that the cut is a min-cut, when all edges in the CFG are annotated with their frequency (which requires profiling-based information in general). The same distinction can be made in our case: using profiling-based information allows more aggressive optimization but with a risk of mis-optimization if the information is wrong.
Quite often, there will be different kinds of fences to deal with, such as dmb ish/dmb ishst on swift ARM processors, or hwsync/lwsync on Power. I suggest a rather greedy approach: first insert the strongest fences without looking at the weaker ones; then insert the weaker ones while taking into account the stronger ones. If two fences are uncomparable, then insert each one independently, completely ignoring the other. The cost of this approach is having to do a cut for each kind of fences. However, I don’t expect the graphs under consideration to often be big.
The graph on which we will compute the cut is based on the CFG, but there are several possible ways to tweak it, mostly to accelerate the algorithm:
By default, there would be two nodes for each instruction (beginning and end), with an infinite weight on them (impossible to insert a Fence in the middle of an instruction), and an edge between the end of an instruction and the beginning of the next. We cannot use just one node per instruction, because the beginning of an instruction may be a sink, while its end is a source.
We can merge together all the instructions in a basic block which are not memory accesses (since fences commute with those anyway) and do not affect control-flow.
We can put an infinite weight on edges between basic blocks (fences must be inside BBs in this case) or not (we must then do edge-splitting to insert fences there)
The cut algorithm cost is only incurred on code with fences, and any part of the graph which is unreachable from the source or co-unreachable from the sink can be pruned (or not even built).
The basic algorithm I propose is thus the following:
Gather all Fences
For each Fence:
On every path going down from the Fence, find the first memory access/return/call, and connect the beginning of this instruction to the sink (with infinite weight). If instead we first reach a stronger Fence, then do nothing for that path (do not connect it to the sink).
On every path going up from the Fence, find the first memory access/call/beginning of function, and connect the source to its end (with infinite weight). If instead we first reach a stronger Fence, then do nothing for that path.
Remove the Fence
Execute the cut algorithm on this graph (only keeping the nodes/edges traversed above, merging all adjacent nodes that are not branches or memory accesses/call/return), and insert a Fence on every edge that is part of the cut.
It is fairly obvious that this algorithm preserves the property that every pair of memory accesses that was separated by a fence still is (and of the same kind since we are operating only on fences of the same kind at a time).
In terms of implementation, the steps 1 & 2.1 & 2.2 could be one analysis pass, and 2.3 & 3 could be a transform pass depending on it.
A nice property of this framework is that it can easily be expanded to allow more optimizations. Here are some examples that may be added incrementally later.
Accesses to the same location are automatically ordered on any reasonable architecture, so we do not need fences to order them. So when looking upwards from the fence introduced by a release, or downwards from the fence introduced by an acquire, we can skip any access to the same location (with the same size obviously). This may sound rare, but in practice it is crucial to hoisting fences out of loops. Consider for example the following code (which can be found any time a thread spins waiting on another one):
if (x.load(acquire)) break;
Without this optimization, there must be a Fence after every load, whereas after it there can be a single Fence after the loop. The only difficulty here is finding from what access (if any) a fence comes from… see next part of this doc for ides on how to do that.
It is fairly simple to make the pass partially (greedily) interprocedural for free: operate on the functions bottom-up, whenever a Fence is inserted just after the start of a function (resp. just before every return), tag this function with some special piece of metadata. Then whenever hitting a call to such a function when searching downwards (resp upwards) of a Fence, stop the search without connecting to the sink (resp source).
This is not necessarily optimal (as we cannot push a Fence through a function for example), but it is free and may only improve things (get rid of some Fences).
Neither Power nor ARM can reorder two memory accesses, if the first is a read, and the second has a data or address dependency on the first. This means that when searching memory accesses downward of a Fence fence inserted by an acquire read, we can skip any access that depends on that acquire read.
This would for example allow optimizing the following:
tmp = x.load(acquire);
tmp2 = (*tmp).load(acquire);
Which by default would require two Fences (after each load). Because the second load depends on the first, only the second Fence is required (in the algorithm, the source would be connected to the ends of both loads, but only the return would be connected to the sink).
In practice, this optimization might be extremely difficult to implement because of phase ordering issues: no later pass can be allowed to eliminate dependencies, which requires running extremely late (in the backend) at which point iterating over the control-flow is painful. So I think this optimization probably won’t be implemented (soon) despite its potential benefits.
I have seen at least 4 broad categories of algorithms for computing such a cut.
This has two main benefits: extremely adaptable (for example can enforce the conservativeness requirement of classical PRE, or various other constraints in the literature), and can do up to 64 cuts in parallel (highly useful for PRE as it must do a cut for each expression, useless for us). This is what is used in most papers on PRE (most of which are about clever/unreadable ways of doing that encoding) but I think it would only make sense for us if we want to reuse the infrastructure to provide general-purpose PRE.
Specially tuned for computer vision problems, with perfectly regular planar graphs with some special structure: probably useless for us (although easy to test as it is implemented in boost).
Probably too slow, but easy to test (boost).
Probably the best for our purpose (and easy to test: boost). However, it is not obvious to me how to enforce the extra constraint required for conservative PRE, so if we use it we will need profiling information (or any reasonable static approximation, maybe from BlockFrequency ?).
I am currently planning on running this pass just after AtomicExpandPass, i.e. just before lowering to SelectionDAG. It could in theory be executed later in the backend, but I would rather avoid doing a global, control-flow related optimization on SelectionDAGs if at all possible. It cannot be run earlier, because the target-specific fences do not exist yet.
As we saw previously in the previous section, several possible improvements to the graph building require this pass to find where a given Fence comes from. It is thus not mandatory and won’t be in the first patch, but we should have plan on how to do it later (incrementally). I see mostly three ways of doing this (I tried getting some discussion on this, but without much success). For clarity, I am using DMB here for the target-dependent fence.
This would be the easiest, but would utterly break encapsulation, force the two passes to run together all the time, and generally not respect the current PassManager-based infrastructure. I don’t think it is acceptable.
In this proposal, AtomicExpandPass does not insert DMBs, but various pseudo-instructions such as FenceAfterAcquire(LoadItWasCreatedBy). This would work, but require adding some pseudo-instructions, that the backend/another pass would have to learn how to lower for -O0.
This is a variant on the previous proposal: AtomicExpandPass keeps introducing DMBs, but tags them with special metadata describing their origin. FenceEliminationPass could then make use of this information as long as it is available, whereas the backend and any other pass would just automatically ignore it and correctly emit DMBs. Furthermore, if any other pass introduces/copies DMBs, FenceElimination will automatically be correctly conservative. This approach seems like the best, and is currently what I am planning on doing. For the optimizations currently envisioned, a single piece of metadata llvm.fence.from with one argument being the access that caused the fence to be emitted would be enough.
There is not much point in doing fence elimination on x86, because only fence seq_cst is lowered to a mfence. In particular atomic accesses do not ever introduce mfences, instead seq_cst stores are compiled to lock xchg, and everything else is just a mov.
Arm will be my first target, everything in this document should apply to it straightforwardly.
The main difference between Power and ARM is that Power has both hwsync and lwsync where ARM only has DMB ISH (it has a few others, but none matching lwsync exactly). As described previously this is not a big problem, we can just insert the hwsyncs first, and the lwsyncs second, taking into account the hwsyncs in the step 2 of the algorithm.
I do not know the other architectures well enough to propose a design for them. However, the approach looks general enough that it may be used for Sparc/Mips if someone with the required expertise is motivated to do it.