[polly] pass ordering

Hi,

polly is run very early and schedules the following passes before it runs:

/// @brief Schedule a set of canonicalization passes to prepare for Polly
///
/// The set of optimization passes was partially taken/copied from the
/// set of default optimization passes in LLVM. It is used to bring the code
/// into a canonical form that simplifies the analysis and optimization passes
/// of Polly. The set of optimization passes scheduled here is probably not yet
/// optimal. TODO: Optimize the set of canonicalization passes.
static void registerCanonicalicationPasses(llvm::PassManagerBase &PM) {
  PM.add(llvm::createPromoteMemoryToRegisterPass());
  PM.add(llvm::createInstructionCombiningPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createTailCallEliminationPass());
  PM.add(llvm::createCFGSimplificationPass());
  PM.add(llvm::createReassociatePass());
  PM.add(llvm::createLoopRotatePass());
  PM.add(llvm::createInstructionCombiningPass());

  if (!SCEVCodegen)
    PM.add(polly::createIndVarSimplifyPass());

  PM.add(polly::createCodePreparationPass());
  PM.add(polly::createRegionSimplifyPass());

Sergei was saying that on some benchmarks PromoteMemoryToRegister was causing
performance regressions when it is scheduled that early (comparing runs with
and without Polly). Another remark is that these passes apply to all
functions, transforming them without considering whether they contain loops or
whether Polly could improve anything.

That brings the question: why do we run Polly that early? Could we move Polly
down after all these passes have been scheduled by LLVM's scalar optimizer?

Thanks,
Sebastian

> Hi,
>
> polly is run very early and schedules the following passes before it runs:
>
> [the registerCanonicalicationPasses() listing quoted above]

Right.

> Sergei was saying that on some benchmarks PromoteMemoryToRegister was causing
> performance regressions when it is run with and without Polly and scheduled that
> early.

Are you saying these passes add compile time overhead or rather that they cause problems with the performance of the compiled binary?

I assume when talking about regressions, you compare against a compilation with Polly disabled.

> Another remark is that these passes apply to all the functions,
> transforming them without considering whether they contain loops or whether
> Polly could improve anything.

True.

> That brings the question: why do we run Polly that early? Could we move Polly
> down after all these passes have been scheduled by LLVM's scalar optimizer?

For Polly we have basically two constraints:

1) We want to detect scops in the IR on which we run Polly

This means the IR needs to be canonicalized enough to allow scalar evolution & Co to work.

2) The IR generated by Polly should be well optimized by LLVM

This means we not only need to perform the optimizations that would have been necessary for the input code, but we also want to take advantage of optimization opportunities that show up after Polly has regenerated the code.

When I generated the pass ordering, I did not spend a large amount of time minimizing it. I rather assumed that, to be sure the LLVM-IR is well optimized after Polly, it would be good to just run all LLVM passes over the output of Polly. So I just placed Polly at the very beginning. Now, to enable Polly to detect reasonably sized scops, I scheduled a set of canonicalization passes before Polly (taken from the beginning of the -O3 sequence).

In terms of scop coverage and quality of the generated code this seems to be a good choice, but it obviously will increase the compile time compared to a run without Polly.

What we could aim for is to run Polly at the beginning of the loop transformations, e.g. by adding an extension point 'EP_LoopOptimizerStart', meaning before vectorization, before loop invariant code motion and before loop idiom recognition. However, we would then need to evaluate what cleanup passes we need to run after Polly. For the classical code generation strategy we probably need a couple of scalar cleanups; with the SCEV-based code generation there is normally a lot less to do.

If you can find a pass ordering that does not regress too much on the performance and scop coverage of the current one, but that has Polly
integrated in the normal pass chain just before the loop passes, that would be a great improvement.

Thanks,
Tobias

From: "Tobias Grosser" <tobias@grosser.es>
To: "Sebastian Pop" <spop@codeaurora.org>
Cc: llvmdev@cs.uiuc.edu
Sent: Wednesday, April 17, 2013 12:45:26 PM
Subject: Re: [LLVMdev] [polly] pass ordering

> [Sebastian's question and Tobias's reply, quoted in full above, elided]
>
> If you can find a pass ordering that does not regress too much on the
> performance and scop coverage of the current one, but that has Polly
> integrated in the normal pass chain just before the loop passes, that
> would be a great improvement.
I thought that, when we discussed this in November, the goal was to have Polly scheduled to run just prior to the loop vectorizer, etc. That way we could split the analysis off and it could be (optionally) reused by the vectorization passes without being invalidated by other transforms.

-Hal

This is another good point. We want to run the loop vectorizer as close as possible after Polly.

This is the current list of passes.

MPM.add(createLoopRotatePass());           // Rotate Loop
MPM.add(createLICMPass());                 // Hoist loop invariants
MPM.add(createLoopUnswitchPass(SizeLevel || OptLevel < 3));
MPM.add(createInstructionCombiningPass());
MPM.add(createIndVarSimplifyPass());       // Canonicalize indvars
MPM.add(createLoopIdiomPass());            // Recognize idioms like memset.
MPM.add(createLoopDeletionPass());         // Delete dead loops

if (LoopVectorize && OptLevel > 2)
  MPM.add(createLoopVectorizePass());

As I said, I would like to run Polly before the LICM pass and possibly also before the loop rotate pass. The question is whether any of those loop passes between Polly and the loop vectorizer would be problematic or possibly even beneficial. I believe that after Polly there will be opportunities for LICM, loop rotation, instruction combining, and possibly loop idiom recognition, so doing this after Polly and before vectorization is a good thing. The loop.parallel markers are hopefully robust enough to survive up to the loop vectorizer; if one pass really does invalidate them, we can probably teach the pass to keep them.

Cheers,
Tobias

Tobias Grosser wrote:

>> Hi,
>>
>> polly is run very early and schedules the following passes before it runs:
>>
>> [the registerCanonicalicationPasses() listing quoted above]
>
> Right.

>> Sergei was saying that on some benchmarks PromoteMemoryToRegister was causing
>> performance regressions when it is run with and without Polly and scheduled that
>> early.

> Are you saying these passes add compile time overhead or rather that
> they cause problems with the performance of the compiled binary?

Sergei was looking at the performance of the generated code (not compile time),
and yes he looked at the impact of -O3 with the pre-passes of Polly as scheduled
now vs. plain -O3.

> This means the IR needs to be canonicalized enough to allow scalar
> evolution & Co to work.

Right, Sergei has also pointed out that PromoteMemoryToRegister is needed that
early because otherwise SCEV would not be able to recognize induction variables
allocated on the stack.

If we schedule Polly in the LNO, this constraint would be satisfied.

> 2) The IR generated by Polly, should be well optimized through LLVM
>
> [...]
>
> However, we would then need to evaluate what cleanup passes we need
> to run after Polly. For the classical code generation strategy we
> probably need a couple of scalar cleanups, with the scev based code
> generation, there is normally a lot less to do.

Right: let's try to see whether with SCEVcodegen we can have better performance
when scheduling Polly in the LNO.

Sebastian

Tobias Grosser wrote:

>>> Hi,
>>>
>>> polly is run very early and schedules the following passes before it runs:
>>>
>>> [the registerCanonicalicationPasses() listing quoted above]
>>
>> Right.
>>
>>> Sergei was saying that on some benchmarks PromoteMemoryToRegister was causing
>>> performance regressions when it is run with and without Polly and scheduled that
>>> early.
>>
>> Are you saying these passes add compile time overhead or rather that
>> they cause problems with the performance of the compiled binary?
>
> Sergei was looking at the performance of the generated code (not compile time),
> and yes he looked at the impact of -O3 with the pre-passes of Polly as scheduled
> now vs. plain -O3.

I am a little confused here. You are saying that running -mem2reg decreases the quality of the generated code when run twice (once in the normal LLVM sequence and once before)? I cannot really see how -mem2reg could have any negative impact. Do you have any idea where this may come from?

>> 2) The IR generated by Polly, should be well optimized through LLVM
>>
>> [...]
>>
>> However, we would then need to evaluate what cleanup passes we need
>> to run after Polly. For the classical code generation strategy we
>> probably need a couple of scalar cleanups, with the scev based code
>> generation, there is normally a lot less to do.
>
> Right: let's try to see whether with SCEVcodegen we can have better performance
> when scheduling Polly in the LNO.

We can probably just schedule it there and then guess a reasonable set of cleanup passes around Polly in the case when SCEVcodegen is disabled. It seems we agree the right location is somewhere in the LNO.

As said before, we could probably add it in between those two passes:

   MPM.add(createReassociatePass()); // Reassociate expressions
+ addExtensionsToPM(EP_LoopOptimizerStart, MPM);
   MPM.add(createLoopRotatePass()); // Rotate Loop

Given that Hal feels it is safe to have a couple of passes between Polly and the loop vectorizer.

Tobias


Tobias Grosser wrote:

> As said before, we could probably add it in between those two passes:
>
>   MPM.add(createReassociatePass()); // Reassociate expressions
> + addExtensionsToPM(EP_LoopOptimizerStart, MPM);
>   MPM.add(createLoopRotatePass()); // Rotate Loop

As this is in the middle of other LNO passes, can you please rename
s/EP_LoopOptimizerStart/EP_Polly_LNO/ or anything other than LoopOptimizerStart?

Thanks,
Sebastian

It is in the middle? The passes executed before are:

MPM.add(createJumpThreadingPass());        // Thread jumps
MPM.add(createCorrelatedValuePropagationPass()); // Propagate conditionals
MPM.add(createCFGSimplificationPass());    // Merge & remove BBs
MPM.add(createInstructionCombiningPass()); // Combine silly seq's

MPM.add(createTailCallEliminationPass());  // Eliminate tail calls
MPM.add(createCFGSimplificationPass());    // Merge & remove BBs
MPM.add(createReassociatePass());

None of them seems to be a loop pass? So it seems to be at the start of the LNO.

Also, I would like to avoid calling an extension point 'Polly...'. We should use a name that describes its possible use or its location rather than a specific user.

Tobias

Sorry for being late to the party... but here are my 2c.

  I am definitely for moving Polly to the point where it would need minimal
pre-processing with maximal coverage. Where this point is, I am not sure yet. I
do not care much right now about compile time, so duplicating computations
might be OK.

  On a different angle, my reason for getting into this was an attempt to
introduce a cost function to _not_ transform loops that are unlikely or
_not_profitable_ to be vectorized. The latter is platform specific, so if we
are talking about moving Polly's position, maybe we can talk about the cost
function as well? The tradeoff is obvious: the earlier you know what you are
_not_ vectorizing, the less time is spent and the less code change might be
done to such a region. The precision of such a check increases as the
Polly-specific passes progress, so such a cost function needs to be callable
from multiple locations. Ideally, Polly needs to use the backend hooks used by
the LLVM vectorizer...

Sergei