RFC: New aggressive dead code elimination pass

Hi,

I have a new variant of Aggressive Dead Code Elimination that also removes dead branching. It is designed to minimize the cost of control-dependence analysis in the common case where almost the entire program is live. It can also optionally remove dead but may-be-infinite loops.
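For illustration only (a hypothetical example, not taken from the patch): the kind of pattern in question is a branch that guards nothing live, so a control-dependence-aware DCE can delete the branch itself, not just the dead assignments it selects between, whereas a DCE that unconditionally keeps terminators retains the branch.

```cpp
// Hypothetical illustration: the conditional below controls only dead
// computation, so a control-dependence-aware DCE may remove the whole
// if/else, branch included.
int f(int x, int y) {
  int dead;
  if (x > y)       // branch controls only dead code
    dead = x * 37; // value never used
  else
    dead = y * 41; // value never used
  (void)dead;      // silence unused-variable warnings; the value is still dead
  return x + y;    // the only live computation
}
```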

When enabled at -O3 (replacing the current ADCE pass) and removing loops, the impact on SPEC2006 is in the noise, but it shows a 1-2% impact on internal benchmark suites, with a comparable increase in compile time. My expectation would be to enable it at -O3 only until we have some experience with the cost, but I suspect it would be fine at -O2.

What information would the community like to see about such a change before I put up a diff (including tweaks to unit tests)?

Thanks

david

David Callahan via llvm-dev <llvm-dev@lists.llvm.org> writes:

Hi,

I have a new variant of Aggressive Dead Code Elimination that also
removes dead branching. It is designed to minimize the cost of
control-dependence analysis in the common case where almost the entire
program is live. It also can optionally remove dead but
may-be-infinite loops.

When enabled at -O3 (replacing the current ADCE pass) and removing loops,
the impact on SPEC2006 is in the noise, but it shows a 1-2% impact on
internal benchmark suites, with a comparable increase in compile time. My
expectation would be to enable it at -O3 only until we have some experience
with the cost, but I suspect it would be fine at -O2.

Just to clarify, you're saying that both runtime and compile time impact
were in the noise on SPEC, right?

What information would the community like to see about such a change
before I put up a diff (including tweaks to unit tests)?

I'm not sure that there's much to discuss in the abstract - it's much
easier to evaluate this kind of thing when there's a patch to refer to.
Presumably people will want to try the patch out on their own internal
benchmarks as well.

Can you give an example of a case that is missed today but caught by your pass?

[+Danny]

From: "Justin Bogner via llvm-dev" <llvm-dev@lists.llvm.org>
To: "David Callahan via llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Wednesday, March 23, 2016 12:36:50 PM
Subject: Re: [llvm-dev] RFC: New aggressive dead code elimination pass

David Callahan via llvm-dev <llvm-dev@lists.llvm.org> writes:
> Hi,
>
> I have a new variant of Aggressive Dead Code Elimination that also
> removes dead branching. It is designed to minimize the cost of
> control-dependence analysis in the common case where almost the entire
> program is live. It can also optionally remove dead but may-be-infinite
> loops.
>
> When enabled at -O3 (replacing the current ADCE pass) and removing loops,
> the impact on SPEC2006 is in the noise, but it shows a 1-2% impact on
> internal benchmark suites, with a comparable increase in compile time. My
> expectation would be to enable it at -O3 only until we have some
> experience with the cost, but I suspect it would be fine at -O2.

Just to clarify, you're saying that both runtime and compile time impact
were in the noise on SPEC, right?

> What information would the community like to see about such a change
> before I put up a diff (including tweaks to unit tests)?

I'm not sure that there's much to discuss in the abstract - it's much
easier to evaluate this kind of thing when there's a patch to refer to.
Presumably people will want to try the patch out on their own internal
benchmarks as well.

+1

Does it use post-dominance information?

-Hal

Yeah, that was gonna be my question.
If so, my view is we should just bite the bullet and start threading post dominance through the compiler.
(assuming anyone wants to help. I’m tackling the memoryssa updating stuff with george ATM).

What do you have in mind here?

Make most things update post-dominance info and preserve it.

Our other alternative to not take up too much time seems to be invasive surgery on how BB successors/predecessors work so they are a constant time array. I say this because GCC recomputes post-dominators roughly 15-19 times per compilation, and can do it about 3-5x faster than we can. All profiling i’ve done basically says all our time is spent trying to get at successors and predecessors in dominance/post-dominance respectively, which takes basically no time in gcc, because the edge lists are an array.

(Note that if you look at generic dominance code, like http://www.cs.princeton.edu/~rwerneck/dominators/, it’s much faster than what we do on the same graphs. This is true even though we use the same algorithms …)

Given it seems unlikely we are going to change the internal representation anytime soon (or at least i’ve not seen a proposal), updating post-dom seems the easier answer.

Are we talking about the basic-block edges only? I'm not sure changing the IR for the BBs would be a lot more work than preserving dominance analysis everywhere, or am I totally underestimating one / overestimating the other?

From: "Mehdi Amini via llvm-dev" <llvm-dev@lists.llvm.org>
To: "Daniel Berlin" <dberlin@dberlin.org>
Cc: "David Callahan via llvm-dev" <llvm-dev@lists.llvm.org>
Sent: Friday, March 25, 2016 5:43:12 PM
Subject: Re: [llvm-dev] RFC: New aggressive dead code elimination
pass

> Make most things update post-dominance info and preserve it.

> Our other alternative to not take up too much time seems to be
> invasive surgery on how BB successors/predecessors work so they are
> a constant time array. I say this because GCC recomputes
> post-dominators roughly 15-19 times per compilation, and can do it
> about 3-5x faster than we can. All profiling i've done basically
> says all our time is spent trying to get at successors and
> predecessors in dominance/post-dominance respectively, which takes
> basically no time in gcc, because the edge lists are an array.

> (Note that if you look at generic dominance code, like
> http://www.cs.princeton.edu/~rwerneck/dominators/ , it's much
> faster
> than what we do on the same graphs. This is true even though we use
> the same algorithms .....)

> Given it seems unlikely we are going to change the internal
> representation anytime soon (or at least i've not seen a proposal),
> updating post-dom seems the easier answer.

Are we talking about the basic-blocks edges only? I'm not sure
changing the IR for the BBs would be a lot more work than preserving
dominance analysis everywhere, or I am totally underestimating one /
overestimating the other?

I'm also curious about this, especially because I'd naively think that:

representation change == a little thinking and (potentially) a lot of typing
preserving post dom == a lot of thinking and a little typing

and, thus, while updating the analysis might be the *right* thing to do, it is probably not easier (especially once you factor in the time taken to fix bugs where we subtly get it wrong). Maybe in the long run, we should do both?

-Hal

Make most things update post-dominance info and preserve it.

Our other alternative to not take up too much time seems to be invasive
surgery on how BB successors/predecessors work so they are a constant time
array. I say this because GCC recomputes post-dominators roughly 15-19
times per compilation, and can do it about 3-5x faster than we can. All
profiling i've done basically says all our time is spent trying to get at
successors and predecessors in dominance/post-dominance respectively, which
takes basically no time in gcc, because the edge lists are an array.

(Note that if you look at generic dominance code, like
http://www.cs.princeton.edu/~rwerneck/dominators/, it's much faster than
what we do on the same graphs. This is true even though we use the same
algorithms .....)

Given it seems unlikely we are going to change the internal representation
anytime soon (or at least i've not seen a proposal), updating post-dom
seems the easier answer.

Are we talking about the basic-blocks edges only?

Yes. Only the successor and predecessors.

I'm not sure changing the IR for the BBs would be a lot more work than
preserving dominance analysis everywhere, or I am totally underestimating
one / overestimating the other?

The way "edges" work right now is that it walks the use/users lists. In
the case of predecessors, it walks the user list of the bb, and sees which
ones are terminator instructions, and then hands you the parent bb's of
those.

In the case of successor, the operand array stores it, but it requires some
indirect loads per successor.
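As a rough sketch of what that means for a client (assuming the usual range helpers in llvm/IR/CFG.h; illustration only, not code from any patch):

```cpp
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/CFG.h"
using namespace llvm;

// Count the CFG edges touching BB. Predecessor iteration walks BB's use/user
// list and filters for terminators; successor iteration reads the terminator's
// operand array, one indirect load per successor. Neither is a flat array scan.
unsigned countEdges(BasicBlock &BB) {
  unsigned N = 0;
  for (BasicBlock *Pred : predecessors(&BB)) {
    (void)Pred; // found by walking BB's use list and taking the terminator's parent
    ++N;
  }
  for (BasicBlock *Succ : successors(&BB)) {
    (void)Succ; // read out of the terminator's successor/operand list
    ++N;
  }
  return N;
}
```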

The advantage to this scheme is that if you set the operand of the
terminator, you've updated the edge. It requires no other work.
Chandler was strongly against edge cuts not being super-fast constant
time [1].
If you move to edge arrays, it now requires finding and updating the
terminators, unless you keep them both as use lists somehow (which are
still about 1.5-2x as slow as straight arrays for dominators purposes), or
always have the appropriate terminator around.

In any case, it's really hard to have a case where you don't have to
update some stuff on edge redirection, even if you can store stuff to do it
in constant time.

For example, you would have to keep an array of (bb, index in succ/pred
array) and (terminators, operandno) or something similar to give you
constant time cutting (because you have to update both ends, so you need to
know where everything is both directions from whatever list you are looking
at)
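A purely hypothetical sketch of what such an edge record might look like (no such structure exists in LLVM today; the names are made up for illustration):

```cpp
#include "llvm/IR/BasicBlock.h"
using namespace llvm;

// Hypothetical explicit edge record enabling constant-time deletion: each
// endpoint knows where the edge sits in the other side's array, so removal is
// a swap-and-pop on both arrays plus a terminator-operand update.
struct CFGEdge {
  BasicBlock *From;       // source block
  BasicBlock *To;         // destination block
  unsigned SuccIndex;     // position of this edge in From's successor array
  unsigned PredIndex;     // position of this edge in To's predecessor array
  unsigned TermOperandNo; // operand of From's terminator that names To
};
```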

If you can deal with linear time, this is much easier.

[1] I don't think it matters as much as he does. They don't happen that
often, and even in the case of bugpoint, most blocks do not have a ton of
edges, so it won't slow things down much when bugpointing real programs. The
significantly better cache behavior may even make up for it in practice.

------------------------------

*From: *"Mehdi Amini via llvm-dev" <llvm-dev@lists.llvm.org>
*To: *"Daniel Berlin" <dberlin@dberlin.org>
*Cc: *"David Callahan via llvm-dev" <llvm-dev@lists.llvm.org>
*Sent: *Friday, March 25, 2016 5:43:12 PM
*Subject: *Re: [llvm-dev] RFC: New aggressive dead code elimination pass

Make most things update post-dominance info and preserve it.

Our other alternative to not take up too much time seems to be invasive
surgery on how BB successors/predecessors work so they are a constant time
array. I say this because GCC recomputes post-dominators roughly 15-19
times per compilation, and can do it about 3-5x faster than we can. All
profiling i've done basically says all our time is spent trying to get at
successors and predecessors in dominance/post-dominance respectively, which
takes basically no time in gcc, because the edge lists are an array.

(Note that if you look at generic dominance code, like
http://www.cs.princeton.edu/~rwerneck/dominators/, it's much faster than
what we do on the same graphs. This is true even though we use the same
algorithms .....)

Given it seems unlikely we are going to change the internal representation
anytime soon (or at least i've not seen a proposal), updating post-dom
seems the easier answer.

Are we talking about the basic-blocks edges only? I'm not sure changing
the IR for the BBs would be a lot more work than preserving dominance
analysis everywhere, or I am totally underestimating one / overestimating
the other?

I'm also curious about this, especially because I'd naively think that:

  representation change == a little thinking and (potentially) a lot of
typing

It also may change space characteristics for programs with lots of edges.

  preserving post dom == a lot of thinking and a little typing

and, thus, while updating the analysis might be the *right* thing to do,
it is probably not easier (especially once you factor in the time taken to
fix bugs where we subtly get it wrong). Maybe in the long run, we should do
both?

If we try to keep constant time edge redirection, both are fairly
complicated in terms of thinking :)

I posted the patch http://reviews.llvm.org/D18762

In the case where loops are retained, the current dead code elimination plus some transformations in SimplifyCFG cover most common cases. This version however does provide a unified approach which extends to removal of may-be-infinite loops.
—david

Sorry to have disappeared.

No, I do not use post-dominators. I tried building post-dominators and changing the iterated dominance frontier to allow a reverse graph, but I found it was significantly more expensive than solving a custom data flow problem over the “may be dead” subset of the CFG. I did separately rewrite the post-dominator code to fit the new pass manager, but now there are no clients for it: http://reviews.llvm.org/D17114

—david

Some questions:

  1. IDFCalculator already allows reverse graphs, and gets used for that, so what did you have to change? (this change was made in the past year, so i wonder if you were working on a branch or something).

  2. What are the actual numbers here in terms of calculation of IDF vs your method?

IDF calculator is linear time (Well, depends on priority queue impl, but we could fix that too), so it should not be that bad.
We can make it calculate subgraphs like you do as well.

While i think the way you compute CD for the subset is cool, it is a bunch of code, and probably hard for the vast majority of people to understand and debug :)

So if we can make IDF (assuming postdom) within a few percent of what you are doing, we should just do it, IMHO.

If not, well, it’s the old “compile time cost vs actual runtime performance improvement vs any reduced maintenance burden/stuff we can make architecturally more sound” game.

I may have not correctly used the IDFCalculator. I passed in the PostDominator tree and then changed the loop over successor blocks to also be able to iterate over predecessors. I did not see anything in the interface that would let that happen but perhaps I don’t understand the API so I modified it to explicitly iterate over predecessors under a flag.

I don’t have comparisons, but the problem I had was that it would iterate over too much of the graph when we discovered a block was live. We would need to limit the processing. Perhaps the LiveInBlocks vector could be used to filter; I will take a look and see if that allows an appropriate subgraph to be analyzed. The cost of building the post-dominator tree, however, was also noticeable, but as you observe, it would remove a chunk of code.

—david

I may have not correctly used the IDFCalculator. I passed in the
PostDominator tree and then changed the loop over successor blocks to also
be able to iterate over predecessors. I did not see anything in the
interface that would let that happen but perhaps I don’t understand the API
so I modified it to explicitly iterate over predecessors under a flag.

Sigh.
Yes, we fixed it to take domtreebase, so it can take the postdomtree, but
nobody modified it to change the calculation.
I'll fix this in a second.

I don’t have comparisons, but the problem I had was that it would iterate
over too much of the graph when we discovered a block was live.

So, the calculation should only happen once, and IDFCalculator should only
touch each block in the CFG a maximum of once, ever.

If it's doing more than that, it's broken for post-dom.
In the forwards problem, i've never seen the *IDF* calculation be
noticeable, only the post-dom calculation.
This would not be completely surprising if it's broken, but it would
definitely be broken :)

We would need to limit the processing. Perhaps the LiveInBlocks vector
could be used to filter; I will take a look and see if that allows an
appropriate subgraph to be analyzed. The cost of building the
post-dominator tree, however, was also noticeable, but as you observe, it
would remove a chunk of code.

Yeah, the post-dom cost problem is a problem we have elsewhere, and we
should just fix that in one of a myriad of ways.

I updated the diff to use IDFCalculator; it is indistinguishable in compile time from the ad hoc approach (once I used it correctly).
You mentioned changing it to handle the reverse graph correctly; if that is in trunk I will update shortly.
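For reference, a rough sketch of what an IDFCalculator-based control-dependence query looks like, assuming a reverse-graph IDF calculator over the post-dominator tree as in llvm/Analysis/IteratedDominanceFrontier.h (names and signatures are assumptions; this is not code from D18762):

```cpp
#include "llvm/Analysis/IteratedDominanceFrontier.h"
#include "llvm/Analysis/PostDominators.h"
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallVector.h"
using namespace llvm;

// Sketch: compute the blocks on which a set of live blocks is control
// dependent, as the iterated dominance frontier over the reverse CFG.
static void reverseIDFOfLiveBlocks(PostDominatorTree &PDT,
                                   const SmallPtrSetImpl<BasicBlock *> &Live,
                                   SmallVectorImpl<BasicBlock *> &Result) {
  ReverseIDFCalculator IDF(PDT); // IDF over the reverse CFG via the post-dom tree
  IDF.setDefiningBlocks(Live);   // seed with the blocks already known to be live
  IDF.calculate(Result);         // Result: blocks whose terminators must stay live
}
```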

Quentin approved it but I still have to commit (and add two typedefs). I’ll do it today

Thanks for the IDF change.

Do you have time for a deeper review of the change? Or could you suggest someone?
http://reviews.llvm.org/D18762

Thanks

—david