Where in the pass pipeline should we run the loop optimizer?

I think we could benefit from some community feedback here.

Some context:

For the past couple of weeks we have been focusing on tuning Polly to minimize its compile time overhead. One important case for us is minimizing the compile time impact of Polly when no loop optimizations are performed. One item that we would like to reconsider here is the position in the LLVM pass order at which we execute Polly.

Currently, when '-O3 -polly' is run, we first schedule nine canonicalization passes (taken from the -O3 pass pipeline) to prepare the code for Polly, then run Polly itself, and afterwards run the full unmodified -O3 sequence.

We chose this setup because we wanted full control over how the code is canonicalized for Polly, and because we wanted to run the _entire_ LLVM optimization chain afterwards to be sure the resulting code is well optimized. This mostly allowed us to experiment with loop optimizations without being constrained by compile time or pass order concerns.

This pass order is obviously not optimal: it inherently increases compile time, and we do not currently run the inliner in our canonicalizer.

Hence, we would like to place the Polly optimizer inside the standard -O3 optimization sequence. The question now is what the optimal location would be.

Here are some considerations:

1) After the IR has been canonicalized

This should remove the need for most of our -polly-canonicalize passes
and eliminate the corresponding compile time cost.

2) After inlining has been performed.

Inlining is essential to optimize C++-heavy code such as boost::ublas.

3) Before the Loop/SLP vectorizer

We want to ensure that Polly is run at a position where the vectorizer can continue to optimize the code generated by Polly (and benefit from the parallelism, alias information and loop optimizations we performed).

4) Minimize interactions with the inliner

Ideally, Polly is run after all inlining has happened, so that Polly always sees the largest possible loop nest. If we optimize too early, we may optimize code that is later inlined into a larger loop nest, where the earlier optimization hinders the optimization of the larger loop nest.
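
To illustrate the point about loop nest size, here is a small hand-written sketch. The function names (scale_row, scale_matrix_call, scale_matrix_inlined) are made up for this example and do not come from Polly or any benchmark. The second version shows roughly what a loop optimizer gets to see once the call has been inlined.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: scales one row of a row-major matrix.
// Seen in isolation, this is only a one-dimensional loop.
void scale_row(std::vector<double> &m, std::size_t cols, std::size_t row,
               double f) {
    for (std::size_t j = 0; j < cols; ++j)
        m[row * cols + j] *= f;
}

// Before inlining: the outer loop body is an opaque call, so a loop
// optimizer running at this point only sees the loops separately.
void scale_matrix_call(std::vector<double> &m, std::size_t rows,
                       std::size_t cols, double f) {
    assert(m.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        scale_row(m, cols, i, f);
}

// After inlining (written out by hand): the complete two-dimensional
// loop nest is visible at once.
void scale_matrix_inlined(std::vector<double> &m, std::size_t rows,
                          std::size_t cols, double f) {
    assert(m.size() == rows * cols);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            m[i * cols + j] *= f;
}
```

Both versions compute the same result. The difference is only in how much of the loop nest is visible when the loop optimizer runs.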

Some open questions:

5) Run it before / after LICM?

LICM may make more loop bounds analyzable by making the parameters in the loop bounds loop invariant.
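
A minimal sketch of what I mean, written at the source level rather than in LLVM-IR. The struct Params and both function names are hypothetical. The hoisting shown is only legal if the stores in the loop cannot alias the bound, which is assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical parameter block holding the loop bound.
struct Params {
    std::size_t n;
};

// Before hoisting: the bound is read through the pointer in every
// iteration, so it is not obviously a loop-invariant parameter.
void fill_reload(const Params *p, std::vector<int> &out) {
    assert(p->n <= out.size());
    for (std::size_t i = 0; i < p->n; ++i)
        out[i] = static_cast<int>(i);
}

// After hoisting (written by hand, assuming the stores to out cannot
// alias p->n): the bound is a single loop-invariant value.
void fill_hoisted(const Params *p, std::vector<int> &out) {
    assert(p->n <= out.size());
    const std::size_t n = p->n;
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int>(i);
}
```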

On the other hand, running LICM first may introduce unnecessary scalar data dependences, which need to be eliminated again (Polly does not do this yet).
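
As a source-level sketch of such a scalar dependence (function names are made up, and the second version is hand-written to mimic what promoting a memory location to a scalar could produce, assuming a and b do not alias):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original form: the accumulation goes through memory (a[i]), so all
// dependences are memory dependences on array elements.
void row_sums_memory(const std::vector<double> &b, std::size_t rows,
                     std::size_t cols, std::vector<double> &a) {
    assert(b.size() == rows * cols && a.size() == rows);
    for (std::size_t i = 0; i < rows; ++i)
        for (std::size_t j = 0; j < cols; ++j)
            a[i] += b[i * cols + j];
}

// After promotion (written by hand): a[i] lives in the scalar tmp,
// which is defined in the outer loop and carried across the iterations
// of the inner loop. This is the extra scalar dependence.
void row_sums_scalar(const std::vector<double> &b, std::size_t rows,
                     std::size_t cols, std::vector<double> &a) {
    assert(b.size() == rows * cols && a.size() == rows);
    for (std::size_t i = 0; i < rows; ++i) {
        double tmp = a[i];
        for (std::size_t j = 0; j < cols; ++j)
            tmp += b[i * cols + j];
        a[i] = tmp;
    }
}
```

The results are identical, but in the second version the inner loop communicates with the outer loop through a scalar instead of through an array element.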

6) Run it before loop rotate / loop unswitch?

Both of these passes introduce copies of code which Polly does not really benefit from. (Polly may, e.g., prove that a loop is executed at least once, such that the copy introduced by loop-rotate is no longer necessary.) Hence, running them before Polly may unnecessarily complicate the loop structure. On the other hand, those passes may be helpful for LICM.
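
For reference, a rough source-level picture of the copy that loop rotation introduces. The function names are made up, and the real transformation happens on LLVM-IR, so this is only an approximation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Original form: the loop condition is tested once, at the top.
long sum_while(const std::vector<long> &v, std::size_t n) {
    assert(n <= v.size());
    long s = 0;
    std::size_t i = 0;
    while (i < n) {
        s += v[i];
        ++i;
    }
    return s;
}

// Rotated form (written by hand): the condition now appears twice,
// once as a guard in front of the loop and once at the bottom.
long sum_rotated(const std::vector<long> &v, std::size_t n) {
    assert(n <= v.size());
    long s = 0;
    std::size_t i = 0;
    if (i < n) { // guard: copy of the loop condition
        do {
            s += v[i];
            ++i;
        } while (i < n);
    }
    return s;
}
```

If n is known to be greater than zero, the guard is redundant, which is the kind of copy mentioned above.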

I have currently been considering two locations:

Option 1: Before the loop optimization passes

Maybe a stupid question, but what does LNT say when you run it at both positions (if possible)?