IR Passes and TargetTransformInfo: Straw Man

Since introducing the new TargetTransformInfo analysis, there has been some confusion over the role of target heuristics in IR passes. A few patches have led to interesting discussions.

To centralize the discussion, until we get some documentation and better APIs in place, let me throw out an oversimplified Straw Man for a new pass pipeline. It serves two purposes: (1) an overdue reorganization of the pass pipeline, and (2) a formalization of the role of TargetTransformInfo.

There seems to be a lot of interest recently in LTO. How do you see the IR passes being split between per-TU processing and multi-TU (“link time”) processing?

– Sean Silva

Andy and I briefly discussed this the other day; we have not yet had a chance to lay out a detailed pass order
for the pre- and post-IPO scalar optimizations.

This is the wish list we have in mind:

pre-IPO: based on the ordering he proposes, drop inlining (or only inline tiny functions) and drop
                all the loop xforms...

post-IPO: drop inlining, or maybe we still need it, but only inline callees that have now become tiny;
                enable the loop xforms.

                 The SCC pass manager seems to be important for inlining. No matter what inlining looks like in the future,
                 I think that pass manager is still useful for scalar opts. It lets us get cheap interprocedural
                 optimization almost for free, in the sense that we can optimize a callee, analyze it, and feed detailed
                 information back to the caller (say, "the callee always returns the constant 5" or "the callee's return value
                 is in [5, 10]"). Such information is difficult to obtain at the IPO stage, which cannot afford such a close look.
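
As a purely illustrative sketch (the functions below are hypothetical, not from any benchmark), this is the kind of interprocedural fact a bottom-up SCC walk makes cheap: the callee is simplified first, its return value becomes known, and the caller can then be folded using that knowledge.

   // Hypothetical example: bottom-up SCC order lets us simplify the callee first,
   // then feed "callee always returns 5" back into the caller.
   static int callee(int x) {
     if (x > 0)
       return 5;
     return 5;            // after SimplifyCFG/InstCombine the callee provably returns 5
   }

   int caller(int y) {
     // With the callee's summary known, this call folds to the constant 6
     // without a separate, expensive whole-program analysis.
     return callee(y) + 1;
   }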

I think it is too early to discuss the pre-IPO and post-IPO details; let us focus on what Andy is proposing.

I should explain: SROA-1 and SROA-2 are not necessarily different versions of SROA (though they could be in the future), I just wanted to be clear that it is run twice.

-Andy

Right. We don’t need to debate the specifics of the LTO pipeline yet, particularly in this thread. The important thing is that the function passes will run in the same order for LTO, so we should probably factor PassManagerBuilder to make that easy. The difference with LTO is that some of the canonical IR passes will end up running at least twice, in both pre and post IPO. The lowering IR passes will only run in post IPO.

-Andy

Hi, Sean:

   I'm sorry I lied. I didn't mean to lie. I did try to avoid making a *BIG* change
to the IPO pass ordering for now. However, when I made a minor change to
populateLTOPassManager() by separating module passes from non-module passes, I
saw quite a few performance differences, most of them degradations. Attacking
these degradations one by one in a piecemeal manner is a waste of time. We might as
well define the pass ordering for the Pre-IPO, IPO, and Post-IPO phases now,
and hopefully once and for all.

  In order to repair my image as a liar, I'm posting some preliminary results on this cozy
Saturday afternoon, which I normally devote to daydreaming :slight_smile:

  So far I have only measured the MultiSource benchmarks on my iMac (late
2012 model), and the command to run the benchmarks is
  "make TEST=simple report OPTFLAGS='-O3 -flto'".

  In terms of execution time, some benchmarks degrade, but more improve, and a few of the improvements
are quite substantial. User time is used for the comparison. I measured the
results twice and they are basically stable. As far as I can tell from the results,
the proposed pass ordering is basically a change for the better.

  Interestingly enough, if I combine populatePreIPOPassMgr() as the pre-IPO phase
(see the patch) with the original populateLTOPassManager() for both IPO and post-IPO,
I see a significant improvement on "Benchmarks/Trimaran/netbench-crc/netbench-crc"
(about 94%, 0.5665s before vs. 0.0295s after). As of this writing, I have not yet had a chance
to figure out why this combination improves this benchmark so much.

  In terms of compile time, the report says my change improves compile
time by about 2x, which is nonsense. I guess the test script doesn't count
link time.

   The new Pre-IPO, IPO, and Post-IPO pass orderings are defined by
populate{PreIPO|IPO|PostIPO}PassMgr().

   I will discuss this with Andy next Monday, to stay consistent with the
pass-ordering design he is envisioning, measure more benchmarks, and then
post the patch and results to the community for discussion and approval.

Thanks
Shuxin

cmp.exec.txt (15.3 KB)

llvm.patch (14.9 KB)

clang.patch (3.21 KB)

I'm a bit late to this, but the examples of the "generic loop opts" above are really better left until later. They have the potential to obscure the code and make other loop optimizations harder. Specifically, there has to be a place where loop nest optimizations can be done (such as loop interchange or unroll-and-jam). There is also array expansion and loop distribution, which can be highly target-dependent in terms of their applicability. I don't know if TTI could provide enough details to account for all circumstances that would motivate such transformations, but assuming that it could, there still needs to be room left for it in the design.

On a different, but related note---one thing I've asked about recently was the "proper" solution for recognizing target-specific loop idioms. On Hexagon, we have builtin functions that handle certain specific loop patterns. In order to separate the target-dependent code from the target-independent code, we would basically have to replicate the loop idiom recognition in our own target-specific pass. Not only that, but it would have to run before the loops are subjected to other optimizations that could obfuscate the opportunity. To solve this, I was thinking about having target-specific hooks in the idiom recognition code that could transform a given loop in the target's own way. Still, that would imply target-specific transformations running before the "official" lowering code.
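
To make the idea concrete, here is a purely hypothetical sketch of what such a hook might look like; none of these names exist in LLVM today, and the real interface (if any) would presumably be routed through TTI or the idiom-recognition pass itself.

   // Hypothetical target hook for loop idiom recognition (not an existing LLVM API).
   #include "llvm/Analysis/LoopInfo.h"

   class TargetLoopIdiomHooks {
   public:
     virtual ~TargetLoopIdiomHooks() {}
     // Called by the generic idiom-recognition pass before its own memcpy/memset
     // matching, while the loop is still in canonical form. Returns true if the
     // target rewrote the loop (e.g. into a Hexagon builtin), in which case the
     // generic matching is skipped for this loop.
     virtual bool recognizeAndRewrite(llvm::Loop *L) { return false; }
   };

A target like Hexagon could register its subclass of such a hook, keeping the target-dependent rewriting out of the generic pass while still running before later passes obscure the pattern.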

-K

I don't have any objection to this as long as your compile times are comparable.

The major differences that I could spot are:

You've moved the second iteration of some scalar opts into post-IPO:
- JumpThreading
- CorrelatedValueProp

You no longer run InstCombine after the first round of scalar opts (in preIPO) and before the second round (in PostIPO).

You now have an extra (3rd) SROA in PostIPO.

I don't see a problem, but I'd like to understand the rationale. I think it would be valuable to capture some of the motivation behind the standard pass ordering and any changes we make to it. Sometimes part of the design becomes obsolete but no one can be sure. Shall we start a new doc under LLVM subsystems?

-Andy

Starting a new doc sounds like a good idea to me. If you aren't familiar
with adding to the Sphinx docs, the Sphinx quickstart template will get you
up and running <http://llvm.org/docs/SphinxQuickstartTemplate.html>.

-- Sean Silva


Out of curiosity, has anyone tried to optimize the pass ordering in some (quasi-)automated way? Naively, a genetic algorithm seems like a perfect fit for this.

-Hal

Since introducing the new TargetTransformInfo analysis, there has been some confusion over the role of target heuristics in IR passes. A few patches have led to interesting discussions.

To centralize the discussion, until we get some documentation and better APIs in place, let me throw out an oversimplified Straw Man for a new pass pipeline. It serves two purposes: (1) an overdue reorganization of the pass pipeline, and (2) a formalization of the role of TargetTransformInfo.

---
Canonicalization passes are designed to normalize the IR in order to expose opportunities to subsequent machine-independent passes. This simplifies writing machine-independent optimizations and improves the quality of the compiler.

An important property of these passes is that they are repeatable. They may be invoked multiple times after inlining and should converge to a canonical form. They should not destructively transform the IR in a way that defeats subsequent analysis.

Canonicalization passes can make use of data layout and are affected by ABI, but are otherwise target independent. Adding target specific hooks to these passes can defeat the purpose of canonical IR.

IR Canonicalization Pipeline:

Function Passes {
  SimplifyCFG
  SROA-1
  EarlyCSE
}
Call-Graph SCC Passes {
  Inline
  Function Passes {
    EarlyCSE
    SimplifyCFG
    InstCombine
    Early Loop Opts {
      LoopSimplify
      Rotate (when obvious)
      Full-Unroll (when obvious)
    }
    SROA-2
    InstCombine
    GVN
    Reassociate
    Generic Loop Opts {
      LICM (Rotate on-demand)
      Unswitch
    }
    SCCP
    InstCombine
    JumpThreading
    CorrelatedValuePropagation
    AggressiveDCE
  }
}
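
For concreteness, here is a rough sketch of how this canonicalization pipeline might be populated through the legacy pass manager of that era. The factory functions are the usual ones from llvm/Transforms/Scalar.h and llvm/Transforms/IPO.h, but the factoring into a helper (so the same function-pass sequence can be reused pre- and post-IPO, as discussed above) is my own illustration rather than the actual PassManagerBuilder code, and exact names and headers vary between LLVM versions.

   #include "llvm/PassManager.h"        // legacy pass manager
   #include "llvm/Transforms/IPO.h"     // createFunctionInliningPass
   #include "llvm/Transforms/Scalar.h"  // scalar pass factory functions

   using namespace llvm;

   // The per-function canonicalization sequence run under the inliner's CGSCC walk.
   static void addFunctionCanonicalizationPasses(PassManagerBase &PM) {
     PM.add(createEarlyCSEPass());
     PM.add(createCFGSimplificationPass());          // SimplifyCFG
     PM.add(createInstructionCombiningPass());       // InstCombine
     PM.add(createLoopRotatePass());                 // early loop opts ("when obvious");
     PM.add(createLoopUnrollPass());                 //   LoopSimplify is scheduled implicitly
     PM.add(createSROAPass());                       // SROA-2
     PM.add(createInstructionCombiningPass());
     PM.add(createGVNPass());
     PM.add(createReassociatePass());
     PM.add(createLICMPass());                       // generic loop opts
     PM.add(createLoopUnswitchPass());
     PM.add(createSCCPPass());
     PM.add(createInstructionCombiningPass());
     PM.add(createJumpThreadingPass());
     PM.add(createCorrelatedValuePropagationPass());
     PM.add(createAggressiveDCEPass());
   }

   static void buildCanonicalizationPipeline(PassManagerBase &PM) {
     // Function passes run before inlining.
     PM.add(createCFGSimplificationPass());
     PM.add(createSROAPass());                       // SROA-1
     PM.add(createEarlyCSEPass());
     // Call-graph SCC passes: the inliner plus the function passes above.
     PM.add(createFunctionInliningPass());
     addFunctionCanonicalizationPasses(PM);
   }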

I'm a bit late to this, but the examples of the "generic loop opts" above are really better left until later. They have the potential to obscure the code and make other loop optimizations harder. Specifically, there has to be a place where loop nest optimizations can be done (such as loop interchange or unroll-and-jam). There is also array expansion and loop distribution, which can be highly target-dependent in terms of their applicability. I don't know if TTI could provide enough details to account for all circumstances that would motivate such transformations, but assuming that it could, there still needs to be room left for it in the design.

You mean that LICM and Unswitching should be left for later? For the purpose of exposing scalar optimizations, I'm not sure I agree with that but I'd be interested in examples.

I think you're only worried about the impact on loop nest optimizations. Admittedly I'm not making much of a concession for that, because I think of loop nest optimization as a different tool that will probably want fairly substantial changes to the pass pipeline anyway. Here are a few ways it might work:

(1) Loop nest optimizer extends the standard PMB by plugging in its own passes prior to Generic Loop Opts, in addition to loading TTI. The loop nest optimizer's passes are free to query TTI.

(2) Loop nest optimizer suppresses generic loop opts through a PMB flag (assuming they are too disruptive). It registers its own loop passes with the Target Loop Opts. It registers instances of generic loop opts to now run after loop nest optimization, and registers new instances of scalar opts to rerun after Target Loop Opts if needed.

(3) If the loop nest optimizer were part of llvm core libs, then we could have a completely separate passmanager builder for it.
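
Option (1) is close to what the existing PassManagerBuilder extension mechanism already supports. A minimal sketch is below; EP_LoopOptimizerEnd is a real extension point of that era, while the "prior to Generic Loop Opts" hook this proposal implies would presumably be a new extension point, and the loop-nest passes themselves are hypothetical placeholders.

   #include "llvm/PassManager.h"
   #include "llvm/Transforms/IPO/PassManagerBuilder.h"

   using namespace llvm;

   // Callback the loop nest optimizer registers with the builder. The passes it
   // adds here are free to query TTI like any other pass.
   static void addLoopNestOptPasses(const PassManagerBuilder &Builder,
                                    PassManagerBase &PM) {
     // PM.add(createMyLoopInterchangePass());   // hypothetical LNO passes
     // PM.add(createMyUnrollAndJamPass());
   }

   void buildPipelineWithLNO(PassManagerBase &PM) {
     PassManagerBuilder PMB;
     PMB.OptLevel = 3;
     // Today's nearest hook; the proposal would add one before the generic loop opts.
     PMB.addExtension(PassManagerBuilder::EP_LoopOptimizerEnd, addLoopNestOptPasses);
     PMB.populateModulePassManager(PM);
   }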

On a different, but related note---one thing I've asked about recently was the "proper" solution for recognizing target-specific loop idioms. On Hexagon, we have builtin functions that handle certain specific loop patterns. In order to separate the target-dependent code from the target-independent code, we would basically have to replicate the loop idiom recognition in our own target-specific pass. Not only that, but it would have to run before the loops are subjected to other optimizations that could obfuscate the opportunity. To solve this, I was thinking about having target-specific hooks in the idiom recognition code that could transform a given loop in the target's own way. Still, that would imply target-specific transformations running before the "official" lowering code.

We may be able to run loop idiom recognition as part of Target Loop Opts. If that misses too many optimizations, then targets can add a second instance of loop-idiom in the target loop opts. Targets can also add extra instances of scalar opt passes in the lowering pipeline, if needed, for cleanup. The lowering pass order should be completely configurable.

Are you afraid that LICM and unswitching will obfuscate the loops to the point that you can’t recognize the idiom? The current pass pipeline would have the same problem.

-Andy

This is the closest I've seen:
http://donsbot.wordpress.com/2010/03/01/evolving-faster-haskell-programs-now-with-llvm/

However, it deals with a "toy" example. Doing something similar over an
entire benchmark suite would be interesting (and it may find non-obvious,
highly-profitable interactions between passes that we aren't currently
exploiting).

-- Sean Silva


I don't have any objection to this as long as your compile times are comparable.

The major differences that I could spot are:

You've moved the second iteration of some scalar opts into post-IPO:
- JumpThreading
- CorrelatedValueProp

I don't see why we need so many iterations, so I got rid of them.

You no longer run InstCombine after the first round of scalar opts (in preIPO) and before the second round (in PostIPO).

You now have an extra (3rd) SROA in PostIPO.

I call SROA for dead code elimination, seriously!

The dead-whatever-elimination passes (even the ones called aggressive) do not eliminate the last store to a
local variable. Shame! Shame! Shame!

It seems we don't have a better way, since we don't like mem-ssa. We have to call SROA, an all-in-one algorithm,
to perform such cleanup.
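
A tiny example of the kind of case being described (my own illustration, not from the benchmarks): the last store to a local never feeds a load, yet per the complaint above the DCE-style passes leave it alone, while SROA deletes the whole alloca and the store along with it.

   int f() {
     int t;
     t = 42;     // dead: never read again; per the description above, the DCE-style
                 // passes keep it, but SROA removes the alloca (and the store with it)
     return 0;
   }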

One more I saw recently:
http://gcc.gnu.org/wiki/cauldron2013?action=AttachFile&do=get&target=machine_guided_energy_energy_efficient_compilation.pdf

(it deals with LLVM and GCC, but focuses on "energy efficiency" and not raw
performance).

I personally strongly abhor this kind of thing :-) I guess I should be more open-minded.

For the pre-IPO phase, some passes should not be invoked: say, any loop nest opt, loop versioning, aggressive loop unrolling,
vectorization, or aggressive inlining.

The reason is that they will hinder the downstream optimizers if they kick in early.

I personally strongly abhor this kind of thing :-) I guess I should be more open-minded.

In my opinion, there are really two different kinds of uses for automated optimization:

1. Fundamental design questions (such as the overall structure of the pass schedule). For these, using an optimization algorithm can be interesting, but mostly just to make sure that things are working as we expect.

2. Purely empirical questions (such as exactly where to run instcombine). For these questions the overall design provides a range of theoretically-equally-good configurations, and picking an optimum based on compile-time-vs-code-performance for different optimization levels (for some particular set of target applications) is a perfectly justifiable use of autotuning.

-Hal

You mean that LICM and Unswitching should be left for later? For the purpose of exposing scalar optimizations, I'm not sure I agree with that but I'd be interested in examples.

Optimizations like LICM and unswitching can potentially damage perfect nesting of loops. For example, consider this nest:

   for (i) {
     for (j) {
       ... = A[i];
     }
   }

The load of A[i] is invariant in the j-loop (assuming no aliased stores), and even if it was not invariant, the address computation code is. A pass of LICM could then generate something like this:

   for (i) {
     t = A[i];
     for (j) {
       ... = t;
     }
   }

This is no longer a perfect nest, and a variety of loop nest optimizations will no longer be applicable. In general, nest optimizations have a greater potential for improving code performance than scalar optimizations, and should be given priority over them. In most cases (excluding loop interchange, for example), the LICM opportunity will remain, and can be taken care of later.

An example of where target-dependent factors come into the picture before the target-specific stage (target-specific optimizations, lowering, etc.) is loop distribution. The example below is not a loop nest optimization, but the generic target-independent scalar optimizations will still apply after this transformation.
Say that we have a loop that performs several computations, some of which may be amenable to more aggressive optimizations:

   for (i) {
     v = 1.0/(1.0 + A[i]*A[i]);
     B[i] = (1.0 - B[i])*v;
   }

Suppose that we have hardware that can perform very fast FMA operations (possibly vectorized). We could transform the loop into

   for (i) {
     t = 1.0 + A[i]*A[i];
     v = 1.0/t;
     B[i] = (1.0 - B[i])*v;
   }

And then expand t into an array, and distribute the loop:

   for (i) {
     t[i] = 1.0 + A[i]*A[i];
   }
   for (i) {
     v = 1.0/t[i];
     B[i] = (1.0 - B[i])*v;
   }

The first loop may then be unrolled and vectorized, while the second one may simply be left alone.

This may be difficult to address in a target-independent way (via a generic TTI interface), since we are trying to identify a specific pattern that, for the specific hardware, may be worth extracting out of a more complex loop. In addition to that, for this example to make sense, the vectorization passes would have to run afterwards, which would then put a bound on how late this transformation could be done.

I guess the approach of being able to extend/modify what the PMB creates, that you mention below, would address this problem.

I think you're only worried about the impact on loop nest optimizations. Admittedly I'm not making much of a concession for that, because I think of loop nest optimization as a different tool that will probably want fairly substantial changes to the pass pipeline anyway.

Yes, I agree. My major concern is that we need to be careful that we don't accidentally make things harder for ourselves. I very much agree with the canonicalization approach, and loop nests can take a bit of preparation (even including loop distribution). At the same time, there are other optimizations that will destroy the canonical structures (of loops, loop nests, etc.), that also need to take place at some point. There should be a place in the optimization sequence, where the "destructive" optimizations have not yet taken place, and where the "constructive" (canonical) passes can be executed. The farther down the optimization stream we push the "destructive" ones, the more flexibility we will have as to what types of transformations can be done that require the code to be in a canonical form.

Here are a few ways it might work:
(1) Loop nest optimizer extends the standard PMB by plugging in its own passes prior to Generic Loop Opts, in addition to loading TTI. The loop nest optimizer's passes are free to query TTI.

(2) Loop nest optimizer suppresses generic loop opts through a PMB flag (assuming they are too disruptive). It registers its own loop passes with the Target Loop Opts. It registers instances of generic loop opts to now run after loop nest optimization, and registers new instances of scalar opts to rerun after Target Loop Opts if needed.

(3) If the loop nest optimizer were part of llvm core libs, then we could have a completely separate passmanager builder for it.

All of these approaches would work (even concurrently). I think (3) could potentially be the future goal.

Are you afraid that LICM and unswitching will obfuscate the loops to the point that you can’t recognize the idiom? The current pass pipeline would have the same problem.

Actually, in this case my concern was the interleaving of the target-dependent code with target-independent code (if we were to do all idiom recognition in the same pass), or code duplication (if the target-independent and target-dependent passes were to be separate). The more I think about it, the more I favor the "separate" approach, since the target-specific idioms may be very different for different targets, and there doesn't seem to be much that can be handled in a common code (hence not a lot of actual duplication would happen).

-Krzysztof

Yes, LICM will make a perfect nest become imperfect. When I defined the pre-IPO passes, I took this into account as well.
I thought about it for a while and was not able to figure out any strong reason for running, or not running, LICM in the pre-IPO
phase or in any other compilation phase before LNO.

   The pro of running LICM early is that it may move big redundant stuff out of the loop nest. You never know
how big it is. If you are lucky, you can move a lot of code out of the
loop; the loop may become much smaller and hence enable lots of downstream optimizations. This sounds
like a big win for control-intensive programs, where loop nest optimization is normally a big, expensive no-op.

   The con is that, as you said, the nest is no longer perfect. However, I would argue that LNO optimizations
should be able to tackle the cases where the imperfect part is simple enough (say, no calls, no control flow, etc.).
(FYI, Open64's LNO is able to tackle imperfect nesting so long as the imperfect part is simple.) Or you can just reverse
the LICM; that doesn't sound hard.

   A similar argument applies to unswitching: you can run fusion to reverse the unswitching, and you certainly need
that optimization anyway to transform the input nest into an appropriate form.

   While this may sound a bit lame to the HPC-minded folks, the majority care about the performance of control-intensive
programs. The design has to make the latter happy :-)


Thanks Krzysztof. I appreciate your input and examples. It’s difficult to balance things like Unswitching and LICM that clearly need to run early to expose scalar opts, but are in some sense destructive. I agree with Shuxin’s analysis that these transformations can be reversed if the effort is made. Those developing experimental LNO tools can use any of the mechanisms described below to work around it.

Here are a few ways it might work:
(1) Loop nest optimizer extends the standard PMB by plugging in its own passes prior to Generic Loop Opts, in addition to loading TTI. The loop nest optimizer's passes are free to query TTI.

(2) Loop nest optimizer suppresses generic loop opts through a PMB flag (assuming they are too disruptive). It registers its own loop passes with the Target Loop Opts. It registers instances of generic loop opts to now run after loop nest optimization, and registers new instances of scalar opts to rerun after Target Loop Opts if needed.

(3) If the loop nest optimizer were part of llvm core libs, then we could have a completely separate passmanager builder for it.

All of these approaches would work (even concurrently). I think (3) could potentially be the future goal.

Are you afraid that LICM and unswitching will obfuscate the loops to the point that you can’t recognize the idiom? The current pass pipeline would have the same problem.

Actually, in this case my concern was the interleaving of the target-dependent code with target-independent code (if we were to do all idiom recognition in the same pass), or code duplication (if the target-independent and target-dependent passes were to be separate). The more I think about it, the more I favor the "separate" approach, since the target-specific idioms may be very different for different targets, and there doesn't seem to be much that can be handled in a common code (hence not a lot of actual duplication would happen).

I think we may want to run some generic loop idiom recognition early (MemCpy/MemSet) because it may aid subsequent analysis. Shuxin described this as a kind of “IR promotion” which is a good way to look at it.
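
As a simple illustration of the kind of “IR promotion” meant here (my own example): generic loop idiom recognition can turn an explicit zeroing loop into a single memset-style operation, which later analyses and passes understand far better than the original loop.

   // Loop idiom recognition rewrites this zeroing loop into a single
   // llvm.memset call, effectively promoting the IR to a higher-level form.
   void clear(int *a, unsigned n) {
     for (unsigned i = 0; i < n; ++i)
       a[i] = 0;
   }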

However, targets do want to plug in custom idioms, and I think those can run later, separately.

To deal with pass ordering problems introduced by target-specific opts, targets should be able to rerun generic passes if it’s worth the compile time. There are a couple of things we should improve to make this more practical:

- The generic, canonicalization passes should be able to run in a mode that limits transformations to those that are locally profitable. We don’t want to canonicalize late and defeat earlier optimizations.
- We should expose more generic optimizations as utilities that can run on-demand for a given block or loop (GVN, LICM).

-Andy

FWIW, I completely agree with this. The canonical form should be that loop invariants are hoisted. Optimizations should not depend on perfect loops. This concept really only makes sense for Source/AST level transformations anyway, which don't apply at the LLVM IR level.

-Chris