Optimization passes organization and tradeoffs

Hi all,

I’m getting more impressed by LLVM day by day, but what’s still a bit unclear to me is the ordering of the optimization passes and their cost. I think I have a pretty solid understanding of what each pass does at a high level, but I couldn’t find any documentation about how they interact at a lower level.

I’d like to use LLVM for generating high-performance stream processing code at run-time. Obviously the resulting code should be as optimized as possible, but performing the optimizations themselves should also be very fast. The code I’m compiling is comparable to C (without any exception handling or garbage collection, so none of the related passes are needed). My first attempt at collecting useful optimizations looks like this:

passManager->add(new TargetData(*executionEngine->getTargetData()));
passManager->add(createScalarReplAggregatesPass()); // Convert to SSA form
passManager->add(createSCCPPass());                 // Propagate constants
passManager->add(createInstructionCombiningPass()); // Peephole optimization
passManager->add(createDeadStoreEliminationPass()); // Dead store elimination
passManager->add(createAggressiveDCEPass());        // Aggressive dead code elimination
passManager->add(createCFGSimplificationPass());    // Control-flow optimization
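
For completeness, the passes then get run per function before requesting machine code, roughly like this (a minimal sketch assuming passManager is a FunctionPassManager for the JIT'ed module and function is the freshly generated llvm::Function):

passManager->doInitialization();                      // once, after adding the passes
passManager->run(*function);                          // per generated function
void *entryPoint = executionEngine->getPointerToFunction(function);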

I have several questions about this:

  1. Does ScalarReplAggregates totally supersede PromoteMemoryToRegister? I think I need it to optimize small arrays, but what is the expected added complexity?

  2. Does SCCP also eliminate multiplying/dividing by 1 and adding/subtracting 0?

  3. Is it arbitrary where to place InstructionCombining? Is there a better order?

  4. Is DeadStoreElimination still necessary when we have AggressiveDCE?

  5. What are the tradeoffs between the different dead code elimination variants (why not always use the aggressive one)?

  6. Is there a better place for CFGSimplification? Should I perform it at multiple points?

Also, my code will frequently have vectors that are initialized to either all 0.0 or all 1.0. This offers a lot of opportunity for eliminating many multiplications and additions, but I was wondering which passes I need for this (probably a Reassociate pass, what else)? And I also wonder whether these passes actually work with vectors?

Is there any other highly recommended pass for this kind of application that I’m forgetting? Any passes that I’d better avoid due to poor gains and long optimization time?

Sorry for the many question marks. :) I don’t expect there is an absolute definite answer to each of them, but some guidelines and insights would be very welcome and much appreciated.

Thanks!

Nicolas Capens

1) Does ScalarReplAggregates totally supersede PromoteMemoryToRegister?

Nope, they are different. Mem2Reg is really important if you want register
allocation.

I think I need it to optimize small arrays, but what is the expected added complexity?

I shouldn't think it would be very expensive at all.

2) Does SCCP also eliminate multiplying/dividing by 1 and
adding/subtracting 0?

That's probably more the purview of instcombine.

3) Is it arbitrary where to place InstructionCombining? Is there a better
order?

Typically you'll want to place it after various kinds of propagation are done
(for example, SCCP). You can run it multiple times.

4) Is DeadStoreElimination still necessary when we have AggressiveDCE?

Probably, but I'll let others give the definitive answer.

5) What are the tradeoffs between the different dead code elimination
variants (why not always use the aggressive one)?

Others can speak to this.

6) Is there a better place for CFGSimplification? Should I perform it at
multiple points?

I think once is probably enough. Earlier would probably be better as it will
simplify later passes and potentially help them run faster.

Also, my code will frequently have vectors that are initialized to either
all 0.0 or all 1.0. This offers a lot of opportunity for eliminating many
multiplications and additions, but I was wondering which passes I need for
this (probably a Reassociate pass, what else)? And I also wonder whether
these passes actually work with vectors?

I would assume they work with vectors. Anything expression-related is good
to capture these opportunities (reassociation, folding, instcombine, etc.).
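
For reference, reassociation would be one extra line on top of the setup in the original post (a sketch assuming the same passManager; createReassociatePass is declared in llvm/Transforms/Scalar.h):

passManager->add(createReassociatePass()); // Reassociate expressions to expose constant folding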

Is there any other highly recommended pass for this kind of application
that I'm forgetting? Any passes that I'd better avoid due to poor gains and
long optimization time?

The most expensive optimizations are typically scheduling and register
allocation. Regalloc is pretty important but it depends a lot on the machine
architecture. If you have a very fast cache, it becomes less important.

Scheduling is hit-or-miss. If your architecture is strictly in-order, it's
pretty important. You can always play games like having the scheduler
bail out if a basic block is too large, or not scheduling blocks outside of
loops. This can save significant compile time.

Sorry for the many question marks. :) I don't expect there is an absolute
definite answer to each of them, but some guidelines and insights would be
very welcome and much appreciated.

Phase ordering is one of the trickiest parts to tune. It's often highly
code-dependent. If your target sources are limited in number, you might be
able to get away with a funky ordering that would be disastrous on
general-purpose integer codes, for example. It's often a matter of trial and
error. Several people have done research on using genetic algorithms and
other tricks to find an optimal phase ordering. A Google search should turn
up interesting stuff.

Have a look at the opt tool sources to get an idea of where to start.

                                                    -Dave

1) Does ScalarReplAggregates totally supersede PromoteMemoryToRegister?

Nope, they are different. Mem2Reg is really important if you want register
allocation.

Actually SROA does fully subsume Mem2Reg. It iterates between breaking up aggregates and promoting them to registers.

I think I need it to optimize small arrays, but what is the expected added complexity?

I shouldn't think it would be very expensive at all.

Agreed.

2) Does SCCP also eliminate multiplying/dividing by 1 and
adding/subtracting 0?

That's probably more the purview of instcombine.

Right.

4) Is DeadStoreElimination still necessary when we have AggressiveDCE?

Probably, but I'll let others give the definitive answer.

They are orthogonal: DSE touches stores, while AggressiveDCE mostly handles scalars. ADCE basically assumes all stores are live.

5) What are the tradeoffs between the different dead code elimination
variants (why not always use the aggressive one)?

Others can speak to this.

ADCE is actually quite expensive from the compile-time perspective, because it requires building post-dominance information. I'd stick with a combination of "-dce -simplifycfg" instead of -adce.
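
In terms of the pass setup from the original post, that swap would look roughly like this (a sketch; the create* functions are declared in llvm/Transforms/Scalar.h):

passManager->add(createDeadCodeEliminationPass()); // -dce: no post-dominator computation needed
passManager->add(createCFGSimplificationPass());   // -simplifycfg: clean up the dead control flow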

6) Is there a better place for CFGSimplification? Should I perform it at
multiple points?

I think once is probably enough. Earlier would probably be better as it will
simplify later passes and potentially help them run faster.

I'd suggest running it earlier as you suggest, but then also late to clean up. It's a very fast pass.

Also, my code will frequently have vectors that are initialized to either
all 0.0 or all 1.0. This offers a lot of opportunity for eliminating many
multiplications and additions, but I was wondering which passes I need for
this (probably a Reassociate pass, what else)? And I also wonder whether
these passes actually work with vectors?

I would assume they work with vectors. Anything expression-related is good
to capture these opportunities (reassociation, folding, instcombine, etc.).

Note that a lot of optimizations don't apply to floating point values. E.g. "x+0.0" can't be eliminated (in general), because "-0.0+0.0" is 0.0, not -0.0. All the optimizations should tolerate vectors, but may miss certain optimizations on them. If you notice something (e.g. reassociation) that isn't being done aggressively with vectors, please file a bug. We do a lot of vector optimizations though, so we are probably only missing details, not whole classes of important optimizations.

Is there any other highly recommended pass for this kind of application
that I'm forgetting? Any passes that I'd better avoid due to poor gains and
long optimization time?

Another important optimization is GVN, which does common subexpression elimination (for example).
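
In the C++ API that is a single extra pass (assuming the same passManager as in the original post):

passManager->add(createGVNPass()); // Global value numbering; eliminates redundant computations and loads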

Sorry for the many question marks. :) I don't expect there is an absolute
definite answer to each of them, but some guidelines and insights would be
very welcome and much appreciated.

Phase ordering is one of the trickiest parts to tune. It's often highly
code-dependent. If your target sources are limited in number, you might be
able to get away with a funky ordering that would be disastrous on
general-purpose integer codes, for example. It's often a matter of trial and
error. Several people have done research on using genetic algorithms and
other tricks to find an optimal phase ordering. A google search should turn
up interesting stuff.

"What he said" :). Also, if you see specific problems in the generated code, let us know.

-Chris

SCCP with a good constant folder should definitely be able to take
care of multiplying by 0 and multiplying/dividing by 1, though.

Not that I would go much further than this.

Hi Chris,

Thanks for the detailed explanations. I have a few remaining questions:

Am I correct that ScalarReplAggregates is hardly more expensive than Mem2Reg
and therefore generally preferable?

What would be the code quality implications of using "-dce -simplifycfg"
instead of -adce? As far as I understand the algorithms involved, -dce would
hardly ever miss a dead instruction if dead stores have already been
eliminated, right?

For my application it is safe to reduce x+0.0 to x. I checked and my current
set of passes keeps the add. Is there a way to control the floating-point
model in some way? Lots of compilers can select 'strict', 'precise', 'fast',
etc. Even adding and then subtracting the same value, which almost never
results in the original value, is almost always safe to eliminate for my
app...

What's the difference between GVN and GCSE, if they both perform common
subexpression elimination?

So far LLVM is working great for me. I'm impressed by the overall
architecture and the fair learning curve. I'm generating complex but
well-optimized code in a matter of days. I haven't found any problems in the
generated code yet, but I'll let you know as soon as I notice a glitch.

Kind regards,

Nicolas

Hi Daniel,

What's a constant folder? I only found a reference to it in the SCCP
implementation. Is there a bit of documentation for it or could you explain
the basics?

Thanks,

Nicolas

Thanks for the detailed explanations. I have a few remaining questions:

Am I correct that ScalarReplAggregates is hardly more expensive than Mem2Reg
and therefore generally preferable?

Right.

What would be the code quality implications of using "-dce -simplifycfg"
instead of -adce? As far as I understand the algorithms involved, -dce would
hardly ever miss a dead instruction if dead stores have already been
eliminated, right?

-dce -simplifycfg should be basically the same as adce. ADCE was recently changed to not delete output-free loops. That was the major advantage it had over simplifycfg. We now have an explicit loop deletion pass as part of the optimizer.

For my application it is safe to reduce x+0.0 to x. I checked and my current
set of passes keeps the add. Is there a way to control the floating-point
model in some way? Lots of compilers can select 'strict', 'precise', 'fast',
etc. Even adding and then subtracting the same value, which almost never
results in the original value, is almost always safe to eliminate for my
app...

You can tell the code generator that it is safe to do this (see llvm/Target/TargetOptions.h), but there isn't currently a way to tell the optimizers about it. My tentative plan is to split "add" into "fadd"/"iadd" and then make the "fp" operations have some sort of flags to parameterize the optimizations allowed for them. This would allow us to propagate down the equivalent of GCC's -ffast-math into the IR.
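
For illustration, in the LLVM version discussed here that code-generator switch is a global flag, so something along these lines should work (a hedged sketch; check TargetOptions.h in your tree for the exact names):

#include "llvm/Target/TargetOptions.h"

llvm::UnsafeFPMath = true; // allow the code generator to simplify x+0.0, x*1.0, etc.
// Set this before creating the ExecutionEngine / target machine.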

What's the difference between GVN and GCSE, if they both perform common
subexpression elimination?

GVN does more, and is a better algorithm. GCSE is basically deprecated and should be removed at some point.

So far LLVM is working great for me. I'm impressed by the overall
architecture and the fair learning curve. I'm generating complex but
well-optimized code in a matter of days. I haven't found any problems in the
generated code yet, but I'll let you know as soon as I notice a glitch.

Awesome!

-Chris

Hi David,

Thanks for the info, but I'm not using any of the command line options. I'm
dynamically generating code at run-time. So unless I'm missing some way to
set these command line options at run-time, I think I need another API for
controlling register allocation and scheduling.

I was also under the impression that linear scan register allocation is
quite cheap, faster than graph coloring. Or does it use an advanced
variation (with aggressive coalescing), while other register allocation
algorithms are more straightforward?

Cheers,

Nicolas

Hi David,

Thanks for the info, but I'm not using any of the command line options. I'm
dynamically generating code at run-time. So unless I'm missing some way to
set these command line options at run-time, I think I need another API for
controlling register allocation and scheduling.

You'll just have to pick one and instantiate that Pass to run. I was just
pointing out the command-line options so you could try them out.

I was also under the impression that linear scan register allocation is
quite cheap, faster than graph coloring. Or does it use an advanced
variation (with aggressive coalescing), while other register allocation
algorithms are more straightforward?

Yes, Linear Scan is cheaper than graph coloring, but it is still more
expensive than the local spiller or simple spiller.
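
In case it helps, one way to pick the allocator programmatically (a hedged sketch; it assumes the RegAllocRegistry interface and createLocalRegisterAllocator from the LLVM version discussed here, so double-check the headers in your tree):

#include "llvm/CodeGen/Passes.h"
#include "llvm/CodeGen/RegAllocRegistry.h"

// Trade some register-allocation quality for compile time: default the JIT
// to the cheaper local allocator instead of linear scan.
llvm::RegisterRegAlloc::setDefault(llvm::createLocalRegisterAllocator);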

                                           -Dave

Er... wait a minute. Maybe there's something I don't fully grok about GVN,
but in general, value numbering and CSE are different and often complementary.
Neither is "more powerful" than the other.

                                       -Dave

I'm talking about the LLVM passes here, not the abstract algorithms.

-Chris