DCE in the presence of control flow.

I have been looking at some internal codes looking for differences between Clang (specifically 3.7.1) and gcc (typically 4.8.1 but sometimes later).

One area where I bumped into was dead code elimination in the presence of complex control flow. I note that the “aggressive dead code elimination” (ADCE.cpp) treats all branch operations as live (isa(I)). Doing more requires some approximation to control dependence.

I note SimplifyCFG indirectly handles some simple cases. It will speculate the contents of a basic block into a predecessor but this is insufficient for more complex structures. This probably cherry-picks the most common cases by frequency.

Have their been prior attempts strengthen dead code elimination w.r.t. control flow? If so, any guidance on what went wrong?

Thanks

David

From: "David Callahan via llvm-dev" <llvm-dev@lists.llvm.org>
To: "LLVM Dev Mailing list" <llvm-dev@lists.llvm.org>
Sent: Wednesday, January 27, 2016 3:56:33 PM
Subject: [llvm-dev] DCE in the presence of control flow.

I have been looking at some internal codes looking for differences
between Clang (specifically 3.7.1) and gcc (typically 4.8.1 but
sometimes later).

One area where I bumped into was dead code elimination in the
presence of complex control flow. I note that the “aggressive dead
code elimination” (ADCE.cpp) treats all branch operations as live (
isa<TerminatorInst>(I)). Doing more requires some approximation to
control dependence.

Can you provide a small example where this matters?

-Hal

The post dominators computation costs more on llvm than GCC because of how the respective cfgs work under the covers. Even for GCC, when we implemented cd-dce, it only catches 1-2% more cases. It’s not really worth the cost in llvm unless postdom comes free

Thanks
Also I found that some cases are also caught by a specialized routine to remove dead loops which is missing the case I noticed.
odavd

From: "David Callahan via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Daniel Berlin" <dberlin@dberlin.org>, "LLVM Dev Mailing list"
<llvm-dev@lists.llvm.org>
Sent: Thursday, January 28, 2016 11:35:49 PM
Subject: Re: [llvm-dev] DCE in the presence of control flow.

Thanks
Also I found that some cases are also caught by a specialized routine
to remove dead loops which is missing the case I noticed.
odavd

From: Daniel Berlin < dberlin@dberlin.org >
Date: Thursday, January 28, 2016 at 8:45 PM
To: David Callahan < dcallahan@fb.com >, LLVM Dev Mailing list <
llvm-dev@lists.llvm.org >
Subject: Re: [llvm-dev] DCE in the presence of control flow.

The post dominators computation costs more on llvm than GCC because
of how the respective cfgs work under the covers. Even for GCC, when
we implemented cd-dce, it only catches 1-2% more cases. It's not
really worth the cost in llvm unless postdom comes free

A 1-2% reduction in code size seems like it might well be worth a post-dom calculation. Also, what exactly is the algorithm using post-dom info?

I suppose that we're looking for cases where we have a CFG diamond with one having only dead instructions, and loops with all dead instructions, etc.

Thanks again,
Hal

------------------------------

*From: *"David Callahan via llvm-dev" <llvm-dev@lists.llvm.org>
*To: *"Daniel Berlin" <dberlin@dberlin.org>, "LLVM Dev Mailing list" <
llvm-dev@lists.llvm.org>
*Sent: *Thursday, January 28, 2016 11:35:49 PM
*Subject: *Re: [llvm-dev] DCE in the presence of control flow.

Thanks
Also I found that some cases are also caught by a specialized routine to
remove dead loops which is missing the case I noticed.
odavd

From: Daniel Berlin <dberlin@dberlin.org>
Date: Thursday, January 28, 2016 at 8:45 PM
To: David Callahan <dcallahan@fb.com>, LLVM Dev Mailing list <
llvm-dev@lists.llvm.org>
Subject: Re: [llvm-dev] DCE in the presence of control flow.

The post dominators computation costs more on llvm than GCC because of how
the respective cfgs work under the covers. Even for GCC, when we
implemented cd-dce, it only catches 1-2% more cases. It's not really worth
the cost in llvm unless postdom comes free

A 1-2% reduction in code size seems like it might well be worth a post-dom
calculation.

1-2% more cases != 1-2% reduction in code size. In particular, it assumes
nothing else will catch those cases :slight_smile:

The cases are mostly caught by SimplifyCFG/etc anyway

In any case, here are the numbers from when it was turned on by default:

https://gcc.gnu.org/ml/gcc-patches/2004-01/msg00675.html

Note: GCC is at least 3x faster at computing post-dom than LLVM

Also, what exactly is the algorithm using post-dom info?

So, to review for a second, right now, the algorithm answers the question
when is a branch necessary with "always" :slight_smile:

The real answer is "when an already-necessary operation depends on its
existence". This is of course, requires control-dependence to answer.
So if you take our current DCE algorithm, and instead of marking
terminators always-live, it simply marks control dependent edges of those
operands as necessary, and branches that generate those edges.

(IE

for each block in RDF(useful block):
  mark terminator of block as useful

I suppose that we're looking for cases where we have a CFG diamond with one

having only dead instructions, and loops with all dead instructions, etc.

Yes. Loops with all dead instructions includes "loops with no side-effects
outside of the loop"

Thanks again,

From: "Daniel Berlin" <dberlin@dberlin.org>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Dev Mailing list" <llvm-dev@lists.llvm.org>, "David
Callahan" <dcallahan@fb.com>
Sent: Friday, January 29, 2016 12:48:37 AM
Subject: Re: [llvm-dev] DCE in the presence of control flow.

> > From: "David Callahan via llvm-dev" < llvm-dev@lists.llvm.org >
>

> > To: "Daniel Berlin" < dberlin@dberlin.org >, "LLVM Dev Mailing
> > list"
> > < llvm-dev@lists.llvm.org >
>

> > Sent: Thursday, January 28, 2016 11:35:49 PM
>

> > Subject: Re: [llvm-dev] DCE in the presence of control flow.
>

> > Thanks
>

> > Also I found that some cases are also caught by a specialized
> > routine
> > to remove dead loops which is missing the case I noticed.
>

> > odavd
>

> > From: Daniel Berlin < dberlin@dberlin.org >
>

> > Date: Thursday, January 28, 2016 at 8:45 PM
>

> > To: David Callahan < dcallahan@fb.com >, LLVM Dev Mailing list <
> > llvm-dev@lists.llvm.org >
>

> > Subject: Re: [llvm-dev] DCE in the presence of control flow.
>

> > The post dominators computation costs more on llvm than GCC
> > because
> > of how the respective cfgs work under the covers. Even for GCC,
> > when
> > we implemented cd-dce, it only catches 1-2% more cases. It's not
> > really worth the cost in llvm unless postdom comes free
>

> A 1-2% reduction in code size seems like it might well be worth a
> post-dom calculation.

1-2% more cases != 1-2% reduction in code size. In particular, it
assumes nothing else will catch those cases :slight_smile:

Fair enough.

The cases are mostly caught by SimplifyCFG/etc anyway

In any case, here are the numbers from when it was turned on by
default:

Steven Bosscher - [tree-ssa] DCE with control dependence again (with numbers, for a change

Note: GCC is at least 3x faster at computing post-dom than LLVM

Why?

> Also, what exactly is the algorithm using post-dom info?

So, to review for a second, right now, the algorithm answers the
question when is a branch necessary with "always" :slight_smile:

The real answer is "when an already-necessary operation depends on
its existence". This is of course, requires control-dependence to
answer.
So if you take our current DCE algorithm, and instead of marking
terminators always-live, it simply marks control dependent edges of
those operands as necessary, and branches that generate those edges.

(IE

for each block in RDF(useful block):
mark terminator of block as useful

FWIW, Keith Cooper's slide deck on this has a nice explanation: https://www.cs.rice.edu/~keith/512/2011/Lectures/L04Dead-1up.pdf

We might be able to do this without precomputing an RDF, however. For example, you could solve a data-flow problem on the reverse CFG, where for each block you solve for the "next" live instruction. Then a branch is alive only if the next live instruction for each successor is different. You'd need to have "next-live-instruction phi nodes" for cases where there's not a unique answer, etc.

Thanks again,
Hal

Note: GCC is at least 3x faster at computing post-dom than LLVM

Why?

It has a real edge structure, and so doing things like walking successors
and predecessors in a row is *really* fast and easily predictable cache
behavior.
LLVM has to walk use structures and look at stuff to walk predecessors,
which is both slow and often cache unfriendly :frowning:

When you time it on larger cases, it comes out to about 600ms vs 200ms.

I think you can also avoid the RDF computation using a more directed form of control dependence testing such as described in

Keshav Pingali and Gianfranco Bilardi. 1997. Optimal control dependence computation and the Roman chariots problem. ACM Trans. Program. Lang. Syst. 19, 3 (May 1997), 462-491. DOI=http://dx.doi.org/10.1145/256167.256217

However one challenge seems to be fixing the SSA graph after deleting essentially arbitrary connected regions of the control flow graph. It may be that the common case where deleted control flow has a single entry and a common post-dominator which seems straight forward but in the general case it seems much harder. Any prior experience on that problem?

david

In practice, APT is not faster to build than rdf.
The df calculator we use is linear time and quite fast.

Updating is also pretty trivial since it’s only deletes of dead and unreachable code. So anything it reached can be replaced with undef in most cases.

Cd-dce is not slower in GCC than dce

I had assumed you would treat phi nodes differently from other operations in that they don’t need to keep the block alive just to retain the data flow facts but it would be simplest to do that.
Thanks Daniel

Maybe I was too quick here. Does gcc record the incoming edge to a phi? If so, won’t those change when you delete blocks in a non-trivial manner? How are those updated?

(BTW hal, there is a nice web version of the code from keith’s presentation here:
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=3&ved=0ahUKEwibwpvI8dHKAhVC0WMKHbdkDGYQFggtMAI&url=https%3A%2F%2Fwww.cs.princeton.edu%2F~ras%2Fdead.ps&usg=AFQjCNGxo6X07t5-9gFtzCbWpjz25afMBg&sig2=E6qUdZS8-x-88uJaIRZ8-A

The rest of code to the massively scalar compiler from rice, which had all these nice SSA passes written in noweb, doesn’t appear to be available generally anymore, sadly. It was a nice reference.)

“I had assumed you would treat phi nodes differently from other operations in that they don’t need to keep the block alive just to retain the data flow facts but it would be simplest to do that.”

Actually, they do.
At least in dead code elimination. This is true even if the incoming edge is an empty block.

Phi nodes have data and control dependencies. If the output of a phi is necessary, both the data and control ependencies of the phi are necessary.

Now, in practice, yes, a phi node argument is only necessary if removing it will not cause the phi node output to produce a different value in the necessary uses.

But this is not DCE, it’s value numbering :slight_smile:

Indeed, all the good value numbering/etc algorithms will determine if they can replace/remove phi node arguments and detect unreachable blocks and edges.
But that is a significantly more complicated ballgame.

In DCE, phi nodes will end up either entirely dead, or entirely alive.

Maybe I was too quick here. Does gcc record the incoming edge to a phi?

Yes, it must, as mentioned, phi nodes have both data and control
dependences.

imagine the following pseudocode:

if (a)
;; {block 1}
else
;; {block 2}
c = phi(7 ( block 1), 9 (block 2))

Without the control dependence, you have no idea which value to pick.

If so, won’t those change when you delete blocks in a non-trivial manner?

They will not. You are guaranteed that the edge will stay live if the phi
node is live, because you will mark the control dependent edges on the
predecessor block as necessary, and then go and mark the branch
instructions that generate those edges as necessary.

As such, you are guaranteed the incoming edge will not change, because you
will have marked them live.

Even if we lived in a compiler where we could not easily delete dead
branches, it's easier to leave unreachable control flow/blocks around.

That is trivial:

For each dead branch, connect it to the nearest necessary post dominator of
it's old target.

How are those updated?

I think you can also avoid the RDF computation using a more directed form of control dependence testing such as described in

Keshav Pingali and Gianfranco Bilardi. 1997. Optimal control dependence computation and the Roman chariots problem. ACM Trans. Program. Lang. Syst. 19, 3 (May 1997), 462-491. DOI=http://dx.doi.org/10.1145/256167.256217

However one challenge seems to be fixing the SSA graph after deleting essentially arbitrary connected regions of the control flow graph. It may be that the common case where deleted control flow has a single entry and a common post-dominator which seems straight forward but in the general case it seems much harder. Any prior experience on that problem?

david

Hi David,

You may want to consider looking at the implementation of DCE in the Swift optimizer. It should be fairly straightforward to follow despite not being LLVM IR.

I implemented something inspired by the above paper, but it does not implement the full construction of the sets mentioned in the paper. Instead it precomputes some information to attempt to do a limited walk of the post-dominator tree. I think it has worked well in practice.

The code for the precomputation portion starts here: https://github.com/apple/swift/blob/master/lib/SILOptimizer/Transforms/DeadCodeElimination.cpp#L502

Feel free to ping me off list if you have any questions about it.

Mark

Thanks for the pointer Mark.