# dominance frontiers

Reading the comments in Analysis/DominanceFrontier.h, I see a note that the structure is deprecated and we’re not to use it for anything new.

Has it been replaced with something equally useful, or shall I redo the calculation for myself, or …?

Thanks,
Preston

We're hoping to remove them, because they're inherently an N^2 data structure and there are usually better ways to do things. What are you trying to do? SSA construction?

-Chris

It depends on what you need.

- PromoteMemoryToRegister.cpp is used for initial SSA construction.
- SSAUpdater.h can be used to update SSA form after duplicating code. It works both for LLVM IR and the CodeGen MI representation.
- LiveRangeCalc.cpp is used by the register allocator to recompute liveness and SSA form after splitting live ranges.

These functions all compute what they need from the dominator tree.

/jakob

To do dependence analysis efficiently, we want to avoid testing all
memory references against each other (another O(N^2) proposition). So
we build FUD chains for memory references, guided by the Alias Sets
analysis. It's very like SSA construction, but must make provision for
testing anti-dependences. I had planned to use dominance frontiers to
guide placement of phi nodes, as usual.

I'm surprised they cause you a space problem. In my experience they
were cheap to store, even though we had much less memory available at
the time. Perhaps we could profitably revisit how they are represented?

Preston

Here’s how I did things, back when I got to write my own infrastructure. It’s still O(n^2) in the worst case, but much quicker in the expected case.

Number all the basic blocks, 0 to n-1, and build a vector mapping from integer to block. This requires O(n) time and space.

For each block, compute the set containing its dominance frontier, based on Figure 10 of

Efficiently Computing Static Single Assignment Form and the Control Dependence Graph
Cytron, Ferrante, Rosen, Wegman, Zadeck

During the calculation, represent the current frontier, DF(X), as suggested in

An Efficient Representation for Sparse Sets
Briggs, Torczon

This lets us add members and clear the set in constant time and allows us to enumerate the members in time proportional to the number of members. O(n) space.
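To make those constant-time claims concrete, here is a minimal sketch of the Briggs-Torczon sparse set in Python; the class and method names are mine, purely illustrative.

```python
# Briggs-Torczon sparse set over a universe 0..universe-1 (e.g. numbered
# basic blocks). add() and clear() are O(1); members() touches only the
# actual members. Uninitialized entries are never read, so neither array
# needs clearing up front.
class SparseSet:
    def __init__(self, universe):
        self.sparse = [0] * universe  # sparse[v] = index of v in dense
        self.dense = [0] * universe   # members, in insertion order
        self.n = 0                    # current number of members

    def contains(self, v):
        i = self.sparse[v]
        return i < self.n and self.dense[i] == v

    def add(self, v):
        if not self.contains(v):
            self.sparse[v] = self.n
            self.dense[self.n] = v
            self.n += 1

    def clear(self):
        self.n = 0  # constant time: stale dense/sparse entries are harmless

    def members(self):
        return self.dense[:self.n]
```

Used as the DF workspace, one such set is allocated once, cleared between blocks, and its members copied into a right-sized dense vector after each block's frontier is computed.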

Then, having computed the DF for a single block, allocate a vector of the appropriate length and copy the members of the set into the vector. This requires time and space proportional to the actual size of the DF, typically much smaller than O(n) per block.

When you’re done computing DFs, throw away the big sparse set (you only need one as a workspace) and retain the dense vectors.

When placing phi nodes, Figure 11 of Cytron, et al., the only use of the DF is to walk over the members in a for loop, and the dense vector is perfect for the occasion. More general representations waste space.
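That Figure 11 walk can be sketched as a small worklist loop (Python sketch; `df` is assumed to map each block number to its dense DF vector, and `defs` is the set of blocks that assign the variable):

```python
# Iterated-dominance-frontier phi placement in the style of Cytron et
# al., Figure 11. Returns the set of blocks that need a phi node for
# one variable.
def place_phis(defs, df):
    has_phi = set()
    worklist = list(defs)
    while worklist:
        b = worklist.pop()
        for y in df[b]:
            if y not in has_phi:
                has_phi.add(y)        # phi for the variable at entry of y
                if y not in defs:     # the phi is itself a new definition
                    worklist.append(y)
    return has_phi
```

For a diamond CFG (block 0 branches to 1 and 2, which join at 3) with definitions in blocks 1 and 2, `place_phis({1, 2}, {0: [], 1: [3], 2: [3], 3: []})` yields `{3}`.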

So, the only real trick is to change the set representation between the time the DFs are computed and the time they are used. Some people object to numbering the basic blocks, but I’ve never understood this objection. I did it often, whenever I needed it; it’s not some expensive property that you need to maintain across many phases.

Preston

I executed

cmake -G "Visual Studio 10" .

on windows, and got the following error message. I have python installed and in the path.

CMake Error at CMakeLists.txt:256 (message):
  Unexpected failure executing llvm-build: Traceback (most recent call last):
    File "C:/Users/xzx/llvm/utils/llvm-build/llvm-build", line 3, in <module>
      import llvmbuild
    File "C:\Users\xzx\llvm\utils\llvm-build\llvmbuild\__init__.py", line 1, in <module>
      from main import main
  ImportError: No module named main

-- Configuring incomplete, errors occurred!

Python 2.x works. Thanks Joey.

To do dependence analysis efficiently, we want to avoid testing all
memory references against each other (another O(N^2) proposition). So
we build FUD chains for memory references, guided by the Alias Sets
analysis.

Ok. Having worked in similar spaces (and caring about compile time :), I've quickly given up on approaches like that. Instead, I've found that lazy approaches (e.g. what is done in the "memory dependence analysis" pass) work better, because you burn compile time only when a client is making queries (instead of eagerly analyzing every possible alias relation). YMMV of course.

It's very like SSA construction, but must make provision for testing
anti-dependences. I had planned to use dominance frontiers to guide
placement of phi nodes, as usual.

Ok, in that case, please check out include/llvm/Transforms/Utils/SSAUpdater.h, which is the preferred API (that doesn't use dom frontiers) to build SSA.

I'm surprised they cause you a space problem. In my experience they
were cheap to store, even though we had much less memory available at
the time. Perhaps we could profitably revisit how they are represented?

Admittedly, our implementation is brutally inefficient :). That said, dom frontiers are another great example of a non-lazy data structure that you have to fully compute even if you end up not using it - and I'd much rather rip it out of LLVM than fix it.

In my experience, there are better ways to solve problems that are traditionally formulated in terms of it. That said, my experience is limited to the domains that are already implemented in LLVM, so if you're trying to do something new... then that experience may not be very useful.

-Chris

The other major problem that I forgot to mention is the “updating” problem. If you have many clients that are using DF, then you want anything that changes the CFG to update DF. This can be non-trivial, and (again) is a waste of time in most cases IMO.

-Chris

Interesting. I never updated my DF representation;
instead, I computed them from scratch whenever necessary.
But I only used them for placing phi nodes, and I only did that in a
batch fashion.
That is, when I converted a routine to SSA form, I built DF's for the
entire routine
and then used them to insert phi nodes for the entire routine.
If I was in the middle of register allocation (when the CFG didn't change),
I carried the DFs along; otherwise, I threw them away.

I expect most of our difference in perspective goes back to my paying
attention to the performance of individual optimizations, versus the
performance of the compiler as a whole.

Preston

Hi Chris,

weeks, but email will have to do…

That would be nice

In my experience, they were trivial to compute, requiring very little
time, space, or code. Rereading Cytron et al., I would expect they'd
require O(N + E) time and O(N) space in most cases, and when we
consider that N and E are the numbers of nodes and edges in the CFG
(i.e., pretty small compared to the number of instructions), then I'd
think we can compute DF from scratch in perhaps less time than
required for a single walk over the instructions in a routine.
Probably the O(N) memory allocations to store the results will
dominate.

Everything that you say is correct, but it is missing the point IMO. Regardless of the constant factor of computing DF for a function, you're doing work proportional to the size of the function speculatively, assuming it will be used.

Let's be more concrete: you're using DF for some purpose, say rewriting a variable into SSA form. That variable may only be used in a small subset of the function, say within a loop body or some other scope. Does it make sense to compute DF for the entire function, rewrite the variable, then throw it away?

You mentioned updating... I wouldn't update, I'd just recompute. It
seems like LLVM is set up perfectly for this sort of thing. If the
CFG changes, you notice automagically that the dominator tree is
invalid and recompute it when necessary. Why not the same for DF?

In my example above, of course it would be ridiculous to compute DF each time you want to rewrite a single variable. You need to reuse it, and some "cfg listener"-based approach might make sense. However, now you're in a position where you have little control over how and when it gets updated.

Let's take a bigger example: say you're doing a loop transformation (like unswitching) which can occasionally require unstructured updates of the CFG. The natural way to write this transformation would be:

for each loop
  if it is a candidate for unswitching
    hack up the CFG (e.g. duplicating a loop body)
    rewrite the variables that are now duplicated

However, if you do that, your listener based approach will recompute DF for the entire function after each loop is processed. This is particularly bad given that we're working on natural loops, so even if there is unstructured control within the loop, the loop itself provides structure that recomputing DF from scratch ignores.

Another way to write this is rewrite the transformation to process all of the loops in a function before rewriting the variables back into SSA form. However, this has a number of problems, including that you have to keep track of the variable equivalencies somehow, even across other loop updates.

Alternatively, you can use the LLVM trick of rewriting things temporarily in terms of load/store to allocas. This then pessimizes size estimates for other loop unswitch operations.

Both of these also have the small problem that they completely destroy the concept of the LoopPassManager.

You mentioned computing more than was absolutely necessary... I use
DF to place phi nodes. Seems like computing complete DF for a CFG
doesn't waste effort; we'll need the entire set of DF to handle all
the phi nodes.

You certainly don't.

void foo() {
  ...
  for (...) {
    int x;
    if (C)
      x = 1;
    else
      x = 2;
    use(x);
  }
  ...
}

rewriting "x" only requires a tiny amount of the CFG to be analyzed. If you're doing the initial rewrite from imperative load/stores into SSA then you have a lot of variables all across the function to rewrite. In this case, the tradeoff is more likely in favor of using DF to do it.

We're aiming to do a number of loop xforms, including parallelization,
loop fusion, distribution, rewriting of reductions and recurrences,
etc. The usual approach is to build a dependence graph connecting all
the memory references in a routine, then examine the graph looking for
connected components and such. Building the graph incrementally, on
demand, seems wasteful since we'll want the entire graph. Similarly,
since we're going to build the entire dependence graph, building a
complete set of FUD chains is also useful since we'll use the entire
thing.

I don't have extensive experience with those loop transformations, so my perspective is limited. However, each of them seems to want to know a very different kind of relation - e.g. some want cross-loop-iteration dependences, some want within-iteration dependences, some care about forward and others care about anti-dependences, etc. You'd really want to compute all of this?

Further, many loop transformations have to immediately give up: for example, it might not be possible to vectorize a loop that contains a call to an unknown function, or a loop that does pointer chasing. Why burn compile time speculatively computing a bunch of information for these loops?

It's very like SSA construction, but must make provision for testing
anti-dependences. I had planned to use dominance frontiers to guide
placement of phi nodes, as usual.

Ok, in that case, please check out include/llvm/Transforms/Utils/SSAUpdater.h,
which is the preferred API (that doesn't use dom frontiers) to build SSA.

What approach does it use? I looked at the code, but don't really see
an explanation.

Here is some background info:

-Chris

Sorry, my reading comprehension skills were low, I thought you were asking about MemDep for some reason. SSAUpdater uses simple local scans across the CFG, excelling at local updates of values.

-Chris

Note: GCC takes exactly the same approach as LLVM here, for exactly
the reason Chris specifies.
In fact, until we started local SSA updating (which is now many years
ago, but ...), dominance frontier calculation for SSA updating was in
the top 10 profile functions for GCC compiles of large source files.
I had tried a number of the linear time phi placement algorithms and
none of them was better in practice (someone later published a paper
comparing them all and finding the same thing).

As a completely random aside, the algorithm given in the original
paper you cited is actually not quite the best way in practice. It
performs more comparisons than strictly necessary. When I moved us to
the algorithm found in http://www.cs.rice.edu/~keith/Embed/dom.pdf,
Figure 5, it was ~30% faster (AFAIK, Jeanne Ferrante is somewhat
responsible for both algorithms :P). It is also a lot simpler to
explain, IMHO.

Both of these things happened around 2005 (local SSA updating in GCC
and replacement of the dominance frontier algorithm), so things may be
different now.

Note: GCC takes exactly the same approach as LLVM here, for exactly
the reason Chris specifies.
In fact, until we started local SSA updating (which is now many years
ago, but ...), dominance frontier calculation for SSA updating was in
the top 10 profile functions for GCC compiles of large source files.
I had tried a number of the linear time phi placement algorithms and
none of them was better in practice (someone later published a paper
comparing them all and finding the same thing).

I wrote the current implementation of the up-front SSA construction in LLVM, found in Utils/PromoteMemoryToRegister.cpp. It uses a modified version of the Sreedhar-Gao algorithm from “A linear time algorithm for placing phi-nodes”. The most interesting modification is that it uses per-variable liveness information to compute pruned SSA by pruning the entire IDF computation, rather than merely pruning the DF → IDF step like is usually done. I was a bit surprised that it worked so well, given the reputation of the “linear time” algorithms, but it performed the best out of everything I attempted.

As a completely random aside, the algorithm given in the original
paper you cited is actually not quite the best way in practice. It
performs more comparisons than strictly necessary. When I moved us to
the algorithm found in http://www.cs.rice.edu/~keith/Embed/dom.pdf,
Figure 5, it was ~30% faster (AFAIK, Jeanne Ferrante is somewhat
responsible for both algorithms :P). It is also a lot simpler to
explain, IMHO.

There is a followup paper from other authors (including Tarjan himself) that discusses some implementation tricks for the Lengauer-Tarjan algorithm:

http://www.cs.princeton.edu/~rwerneck/docs/GWTTA04.pdf

They cite personal communication from Keith Cooper that a later more careful implementation of Lengauer-Tarjan led to different benchmark results than what they published. I added those tricks and some others to LLVM’s implementation of Lengauer-Tarjan (the simple O(n log n) version) for a significant performance improvement. I should do a comparison with a careful implementation of the Cooper-Harvey-Kennedy algorithm some day.

Cameron

Note: GCC takes exactly the same approach as LLVM here, for exactly
the reason Chris specifies.
In fact, until we started local SSA updating (which is now many years
ago, but ...), dominance frontier calculation for SSA updating was in
the top 10 profile functions for GCC compiles of large source files.
I had tried a number of the linear time phi placement algorithms and
none of them was better in practice (someone later published a paper
comparing them all and finding the same thing).

I wrote the current implementation of the up-front SSA construction in LLVM,
found in Utils/PromoteMemoryToRegister.cpp. It uses a modified version of
the Sreedhar-Gao algorithm from "A linear time algorithm for placing
phi-nodes". The most interesting modification is that it uses per-variable
liveness information to compute pruned SSA by pruning the entire IDF
computation, rather than merely pruning the DF -> IDF step like is usually
done. I was a bit surprised that it worked so well, given the reputation of
the "linear time" algorithms, but it performed the best out of everything I
attempted.

Interesting. Maybe some day I'll give it a try again.
I swear I had tried this one (I was always a fan of Sreedhar's work),
but it's entirely possible I messed up the implementation.

As a completely random aside, the algorithm given in the original
paper you cited is actually not quite the best way in practice. It
performs more comparisons than strictly necessary. When I moved us to
the algorithm found in http://www.cs.rice.edu/~keith/Embed/dom.pdf,
Figure 5, it was ~30% faster (AFAIK, Jeanne Ferrante is somewhat
responsible for both algorithms :P). It is also a lot simpler to
explain, IMHO.

There is a followup paper from other authors (including Tarjan himself) that
discusses some implementation tricks for the Lengauer-Tarjan algorithm:

Ah, I was referring to the *dominance frontier* computation, not the
dominator computation.
If you look, you'll see there is an algorithm in the paper that
computes dominance frontiers touching only the nodes that are
*actually in* the dominance frontier.

The algorithm in the Cytron et al. paper looks like this:

for each X in a bottom-up traversal of the dominator tree do
  DF(X) = empty
  for each Y in Successors(X)
    if (idom(Y) != X) then DF(X) += Y
  for each Z in DomChildren(X)
    for each Y in DF(Z)
      if (idom(Y) != X) then DF(X) += Y

You can see that this does more comparisons than strictly necessary.
OTOH, the algorithm by Ferrante that Harvey gives is:

for each B in all basic blocks
  if the length of Predecessors(B) >= 2
    for each P in Predecessors(B)
      runner = P
      while runner != idom[B]
        DF(runner) += B
        runner = idom[runner]

You can see this does a lot less work by skipping a lot of useless nodes, etc.

It's also a lot simpler to explain
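A runnable rendition of that runner loop (Python sketch; `idom` is assumed to map each block to its immediate dominator, with the entry mapped to itself, and `preds` gives predecessor lists):

```python
# Dominance frontiers via the "runner" walk: for each join point B, walk
# each predecessor's idom chain up to idom(B), adding B to the frontier
# of every block passed along the way.
def dominance_frontiers(blocks, preds, idom):
    df = {b: set() for b in blocks}
    for b in blocks:
        if len(preds[b]) >= 2:        # only join points contribute
            for p in preds[b]:
                runner = p
                while runner != idom[b]:
                    df[runner].add(b)
                    runner = idom[runner]
    return df
```

For the diamond CFG (0 branches to 1 and 2, joining at 3), only blocks 1 and 2 end up with a nonempty frontier, namely {3}.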

http://www.cs.princeton.edu/~rwerneck/docs/GWTTA04.pdf

They cite personal communication from Keith Cooper that a later more careful
implementation of Lengauer-Tarjan led to different benchmark results than
what they published. I added those tricks and some others to LLVM's
implementation of Lengauer-Tarjan (the simple O(n log n) version) for a
significant performance improvement.

My only work on LLVM's dominators was when I successfully made a mess
of the dominator tree implementation in LLVM way back when by using a
splay-tree structure that stored the sequences, rather than BB's. It
is technically faster to update, but in practice wildly complex to
maintain.
(See rev 25144)

I should do a comparison with a careful
implementation of the Cooper-Harvey-Kennedy algorithm some day.

We have always found the simple version of Lengauer-Tarjan + the
various tricks you are thinking of to beat out the
Cooper-Harvey-Kennedy algorithm.

We used to use the complex version, but as you also found, it was not
faster in practice than the O(n log n) + good engineering, and it was
a heck of a lot harder to maintain.
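For reference, the Cooper-Harvey-Kennedy iterative scheme mentioned above can be sketched like this (Python sketch; assumes `rpo` lists the blocks in reverse postorder with the entry first, and `preds` gives predecessor lists):

```python
# Iterative dominator computation in the style of Cooper-Harvey-Kennedy:
# propagate immediate dominators over blocks in reverse postorder until
# a fixed point is reached.
def compute_idoms(rpo, preds):
    order = {b: i for i, b in enumerate(rpo)}
    idom = {rpo[0]: rpo[0]}           # entry dominates itself

    def intersect(a, b):              # walk both idom chains to the meet
        while a != b:
            while order[a] > order[b]:
                a = idom[a]
            while order[b] > order[a]:
                b = idom[b]
        return a

    changed = True
    while changed:
        changed = False
        for b in rpo[1:]:
            done = [p for p in preds[b] if p in idom]
            new = done[0]
            for p in done[1:]:
                new = intersect(new, p)
            if idom.get(b) != new:
                idom[b] = new
                changed = True
    return idom
```

On a diamond followed by a tail block (0 to 1 and 2, joining at 3, then 3 to 4), every block's idom is the entry except block 4, whose idom is 3.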

Interesting. Maybe some day I'll give it a try again.
I swear I had tried this one (I was always a fan of Sreedhar's work),
but it's entirely possible I messed up the implementation.

One thing I found is that the pruning actually made it faster, even when including the cost of computing per-variable liveness information.

A fun paper with a lot of algorithms (some of them pretty impractical) is "Algorithms for computing the static single assignment form" by Bilardi and Pingali. They have a technique in their paper for caching DFs for an important subset of the nodes in a way that the total DF storage is linear. They claim that this is faster in practice than the fully lazy version of the Sreedhar-Gao algorithm that I modified, but I never implemented this for a comparison.

Ah, I was referring to the *dominance frontier* computation, not the
dominator computation.
If you look, you'll see there is an algorithm in the paper that
computes dominance frontiers touching only the nodes that are
*actually in* the dominance frontier.

The algorithm in the Cytron et al. paper looks like this:

for each X in a bottom-up traversal of the dominator tree do
  DF(X) = empty
  for each Y in Successors(X)
    if (idom(Y) != X) then DF(X) += Y
  for each Z in DomChildren(X)
    for each Y in DF(Z)
      if (idom(Y) != X) then DF(X) += Y

You can see that this does more comparisons than strictly necessary.
OTOH, the algorithm by Ferrante that Harvey gives is:

for each B in all basic blocks
  if the length of Predecessors(B) >= 2
    for each P in Predecessors(B)
      runner = P
      while runner != idom[B]
        DF(runner) += B
        runner = idom[runner]

You can see this does a lot less work by skipping a lot of useless nodes, etc.

It's also a lot simpler to explain

Ah sorry for the mistake, now your original email makes more sense. This was one of the variants I tested and it was a bit of a speedup over the existing code.

My only work on LLVM's dominators was when I successfully made a mess
of the dominator tree implementation in LLVM way back when by using a
splay-tree structure that stored the sequences, rather than BB's. It
is technically faster to update, but in practice wildly complex to
maintain.
(See rev 25144)

How well did your scheme deal with dominator updates after edge deletion? That's the painful part of incremental dominator computation.

Cameron

[many interesting things, together with contributions from Daniel Berlin and Cameron Zwarich]

Thanks.

We're aiming to do a number of loop xforms, including parallelization,
loop fusion, distribution, rewriting of reductions and recurrences,
etc. The usual approach is to build a dependence graph connecting all
the memory references in a routine, then examine the graph looking for
connected components and such. Building the graph incrementally, on
demand, seems wasteful since we'll want the entire graph. Similarly,
since we're going to build the entire dependence graph, building a
complete set of FUD chains is also useful since we'll use the entire
thing.

I don't have extensive experience with those loop transformations,
so my perspective is limited. However, each of them seems to want to
know a very different kind of relation - e.g. some want cross-loop-iteration
dependences, some want within-iteration dependences, etc.
You'd really want to compute all of this?

Today, yes. Later, when I understand more, perhaps I'll be able to
get by with less.

Further, many loop transformations have to immediately give up:
for example, it might not be possible to vectorize a loop that contains
a call to an unknown function, or a loop that does pointer chasing.
Why burn compile time speculatively computing a bunch of information for these loops?

Because xforms of larger scope may be able to take advantage.
Drawing on your examples, a loop containing a call can sometimes be distributed
into several loops, some of which can be vectorized. A loop that does pointer
chasing might be contained in another loop that can be parallelized.

I don't imagine such things will be useful to everybody,
but they're useful to us and we're happy to use an extra flag and pay
the extra compile time.
After all, it's so much better than having to restructure the code by hand.

Preston

Hello Chris,

We caught this failure:
llc: LegalizeTypes.cpp:831: void llvm::DAGTypeLegalizer::SetSplitVector(llvm::SDValue, llvm::SDValue, llvm::SDValue): Assertion `Lo.getValueType().getVectorElementType() == Op.getValueType().getVectorElementType() && 2*Lo.getValueType().getVectorNumElements() == Op.getValueType().getVectorNumElements() && Hi.getValueType() == Lo.getValueType() && "Invalid type for split vector"' failed.

The illegal CONCAT_VECTOR node is created from
LegalizeVectorTypes.cpp

I saw that this code comes from your commit 112171

this is the reduced test
define void @add18i16(<18 x i16>* nocapture sret %ret, <18 x i16>* %bp) nounwind {
%b = load <18 x i16>* %bp, align 16
%x = add <18 x i16> zeroinitializer, %b
store <18 x i16> %x, <18 x i16>* %ret, align 16
ret void
}

Could you take a look, please.

Thanks.

- Elena