[llvm] r184698 - Add a flag to defer vectorization into a phase after the inliner and its CGSCC pass manager

Hi,

I wanted to start a discussion about the following issue since I am not sure what to do about it:

The loop-vectorizer has the potential to make code quite a bit bigger (esp. in cases where we don’t know the loop count or whether pointers alias).
Chandler has observed this in snappy, where we have a simple memory-copying loop (whose source and destination may overlap).

We vectorize this loop, and then this loop gets inlined into a function and prevents that function from getting inlined in turn, causing a significant(?) degradation.

https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
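
For reference, the loop in question is essentially the following (paraphrased from the link above, not verbatim):

  #include <cassert>

  // Byte-by-byte copy used because [src, src+len) and [op, op+len) may
  // overlap with a small offset; memcpy/memmove would not give the
  // byte-replication behavior the format relies on.
  static inline void IncrementalCopy(const char* src, char* op, int len) {
    assert(len > 0);
    do {
      *op++ = *src++;
    } while (--len > 0);
  }

Because the vectorizer cannot prove the two regions disjoint (or bound the trip count), it has to emit runtime overlap and trip-count checks plus a scalar fallback next to the vector body, and that is where the size increase comes from.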

We have seen some good performance benefits from vectorizing such loops. So not vectorizing them is not really a good option, I think.

In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698> Chandler introduced a flag so that we can run the vectorizer after all CG passes. This would prevent the inliner from seeing the vectorized code.

I see some potential issues:

* We would run a loop pass again later, with the associated compile-time cost (?)

* The vectorizer relies on cleanup passes to run afterwards: dce, instsimplify, simplifycfg, maybe some form of redundancy elimination.
  If we run later, we have to run those passes again, increasing compile time, OR
  we have to duplicate them in the loop vectorizer, increasing its complexity.

* The vectorizer would like SCEV analysis to be as precise as possible: one reason is the dependency checks, which want to know that expressions cannot wrap (AddRec expressions, to be more specific).
  At the moment, indvars will remove those flags in some cases, which currently is not a problem because SCEV analysis still has the old values cached (except in the case that Chandler mentioned to me on IRC where we inline a function - in which case that info is lost). My understanding is that this is not really something we can fix easily because of the way that SCEV works (unique-ifying/commoning expressions and thereby dropping the flags).
  A potential solution would be to move indvars to later. The question is: do other loop passes that simplify IR depend on indvars? Andy, what is your take on this?

The benefit of vectorizing later is that we would have more context at the inlined call site. And it would solve the problem of the inliner seeing vectorized code.

What do you all think?

> Hi,
>
> I wanted to start a discussion about the following issue since I am
> not sure what to do about it:
>
> The loop-vectorizer has the potential to make code quite a bit
> bigger (esp. in cases where we don’t know the loop count or whether
> pointers alias).
> Chandler has observed this in snappy, where we have a simple
> memory-copying loop (whose source and destination may overlap).
>
> We vectorize this loop, and then this loop gets inlined into a
> function and prevents that function from getting inlined in turn,
> causing a significant(?) degradation.
>
> https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
>
> We have seen some good performance benefits from vectorizing such
> loops. So not vectorizing them is not really a good option, I think.
>
> In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inliner from seeing the
> vectorized code.

There are obviously several issues here, and they seem only loosely related. Regarding this first one, why is the right answer not to adjust (or improve) the inlining heuristic? I understand that this is not easy, but the fact remains that, in the end, having the loop inlined, even with the extra vectorization checks, is what should be happening (or is the performance still worse than the non-vectorized code?). If we really feel that we can't adjust the current heuristic without breaking other things, then we could add some metadata to make the cost estimator ignore the vector loop preheader, but I'd prefer adjusting the inlining thresholds, etc. The commit message for r184698 said that the flag was for experimentation purposes, and I think that's fine, but this should not be the solution unless it really produces better non-inlined code as well.

> I see some potential issues:
>
> * We would run a loop pass again later, with the associated compile-time
> cost (?)
>
> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination.
>   If we run later, we have to run those passes again, increasing
>   compile time, OR
>   we have to duplicate them in the loop vectorizer, increasing its
>   complexity.
>
> * The vectorizer would like SCEV analysis to be as precise as
> possible: one reason is the dependency checks, which want to know that
> expressions cannot wrap (AddRec expressions, to be more specific).
>   At the moment, indvars will remove those flags in some cases, which
>   currently is not a problem because SCEV analysis still has the old
>   values cached (except in the case that Chandler mentioned to me on
>   IRC where we inline a function - in which case that info is lost).
>   My understanding is that this is not really something we
>   can fix easily because of the way that SCEV works
>   (unique-ifying/commoning expressions and thereby dropping the
>   flags).

I assume that we're talking about nsw, etc. The fact that SCEV assumes nsw in some cases has led to problems (PR16130 has some history on this), and I don't understand why SCEV's unique-ifying/commoning of expressions implies that it needs to drop the flags. Maybe this is just a matter of someone needing to do the work? Is it clear whether (i + 5 == i +(nsw) 5) should always be true, always false, or whether it depends on how the caller wants to use the answer?
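
To make the uniquing point concrete, here is a sketch against the ScalarEvolution API (the function and variable names are hypothetical): two adds that differ only in nsw produce the very same uniqued SCEV node, so there is no place on the node to record that only one of them had the flag.

  #include "llvm/Analysis/ScalarEvolution.h"
  #include <cassert>
  using namespace llvm;

  // Illustrative only: SCEV uniques expressions structurally, so
  // 'add i32 %i, 5' and 'add nsw i32 %i, 5' map to one node.
  void demonstrateUniquing(ScalarEvolution &SE, Value *IVal) {
    const SCEV *I = SE.getSCEV(IVal);                // SCEV for %i
    const SCEV *Five = SE.getConstant(I->getType(), 5);
    const SCEV *Plain = SE.getAddExpr(I, Five);      // from the plain add
    const SCEV *WithNSW = SE.getAddExpr(I, Five);    // from the nsw add
    assert(Plain == WithNSW && "uniqued: pointer-identical nodes");
  }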

>   A potential solution would be to move indvars to later. The
>   question is: do other loop passes that simplify IR depend on
>   indvars? Andy, what is your take on this?
>
> The benefit of vectorizing later is that we would have more context
> at the inlined call site.

Is it clear that this additional context is a win? Some simplifications make loops easier to understand, and some make them harder to understand. The best thing (not for compile time) may be to run the vectorizer in both places. Nevertheless, now that we can experiment with this, it will be interesting to see some statistics.

Thanks again,
Hal

Yes, I am talking about the ability to maintain nsw/nuw on general expressions between LLVM IR and the SCEV IR and back.

If I recall Andy’s explanation correctly, it is at least an issue if you have the same expression once with a wrapping flag (because it is guarded) and once without. There may also be issues when we simplify expressions, in that it is hard to maintain the flags.

What holds true is that if you go

  LLVM IR -> SCEV -> SCEV expander -> LLVM IR

you will lose the wrapping flags. And IndVars does just that :(.
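
Sketched against the SCEVExpander API (identifiers illustrative), the round trip looks like this:

  #include "llvm/Analysis/ScalarEvolution.h"
  #include "llvm/Analysis/ScalarEvolutionExpander.h"
  #include "llvm/IR/Instruction.h"
  using namespace llvm;

  // Illustrative only: IR -> SCEV -> expander -> IR drops the wrap flags.
  Value *roundTrip(ScalarEvolution &SE, Instruction *Inst,
                   Instruction *InsertPt) {
    // Suppose Inst is '%t = add nsw i32 %x, 5'.
    const SCEV *S = SE.getSCEV(Inst);     // nsw is not recorded on the node
    SCEVExpander Expander(SE, "indvars"); // the same mechanism IndVars uses
    // The expanded value typically comes back as a plain 'add i32 %x, 5'.
    return Expander.expandCodeFor(S, Inst->getType(), InsertPt);
  }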

> I assume that we're talking about nsw, etc. The fact that SCEV
> assumes nsw in some cases has led to problems (PR16130 has some
> history on this), and I don't understand why SCEV's
> unique-ifying/commoning of expressions implies that it needs to drop
> the flags.

> Yes, I am talking about the ability to maintain nsw/nuw on general
> expressions between LLVM IR and the SCEV IR and back.
>
> If I recall Andy’s explanation correctly, it is at least an issue if
> you have the same expression once with a wrapping flag (because it
> is guarded) and once without.

That makes sense, especially combined with simplification. This is why I asked how equality should be defined if we kept the flags.

> There may also be issues when we simplify
> expressions, in that it is hard to maintain the flags.
>
> What holds true is that if you go
>
>   LLVM IR -> SCEV -> SCEV expander -> LLVM IR
>
> you will lose the wrapping flags. And IndVars does just that :(.

Thanks for following up.

-Hal

> In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after all CG
> passes. This would prevent the inliner from seeing the vectorized code.

Just for the record, I have no real expectation that this is a good idea
yet... But it's hard to collect numbers without a flag of some kind, and
it's also really annoying to craft this flag given the current pass
manager, so I figured I would get a skeleton in place that folks could
experiment with, and we could keep or delete based on this discussion and
any numbers.

> I see some potential issues:
>
> * We would run a loop pass again later, with the associated compile-time cost (?)

This actually worries me the least -- the tradeoff is between locality and
repeatedly analyzing the same loops during the highly iterative CGSCC pass
manager (which looks at each function up to 4 times, and may look at the
body of a function which gets inlined many more times).

> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination.
>   If we run later, we have to run those passes again, increasing compile
> time, OR
>   we have to duplicate them in the loop vectorizer, increasing its
> complexity.

Hopefully, these won't be too expensive. instcombine and simplifycfg
shouldn't cost much this late in the pipeline, but only numbers will really tell.
If we need redundancy elimination in the form of GVN, then we're in a lot
of trouble, but that doesn't seem likely to be necessary (I hope).

> * The vectorizer would like SCEV analysis to be as precise as possible:
> one reason is the dependency checks, which want to know that expressions cannot
> wrap (AddRec expressions, to be more specific).
>   At the moment, indvars will remove those flags in some cases, which
> currently is not a problem because SCEV analysis still has the old values
> cached (except in the case that Chandler mentioned to me on IRC where we
> inline a function - in which case that info is lost). My understanding is
> that this is not really something we can fix easily because of the
> way that SCEV works (unique-ifying/commoning expressions and thereby
> dropping the flags).
>   A potential solution would be to move indvars to later. The question is:
> do other loop passes that simplify IR depend on indvars? Andy, what is your
> take on this?

I think this is far and away the most important question to answer. =] I
think there are lots of things that would be improved by preserving SCEV's
precision throughout the CGSCC pass manager, but I have no idea how
feasible that is. I would also appreciate Dan's insights here.

> The benefit of vectorizing later is that we would have more context at the
> inlined call site. And it would solve the problem of the inliner seeing
> vectorized code.

There's more though in theory (although maybe not in practice today, and I
may just be wrong on some of these):
- It would improve the ability of the vectorizer to reason about icache
impact, because it wouldn't think the loop was the only loop in the
function only to have it be inlined in 10 places.
- It may form a vectorized loop before inlining which would be handled
better by loop idiom recognition after inlining.
- It would prevent turning loops that SCEV-based passes could simply compute
and/or delete *after* inlining adds some context into vectorized loops that
may be significantly harder to analyze at this level.

(The last one is the most speculative to me -- it could be that I'm wrong
and SCEV and other loop analyses will work just as well on the vectorized
loops...)

All of these share a common thread: the vectorizer somewhat inherently
loses information, and thus doing it during the iterative pass manager is
very risky as later iterations may be hampered by it.

If all we are doing is inlining first in order to tweak the cost model for
inlining, then I agree with everything you say... but I'm not at all sure
that's the end result.

After inlining, the control flow and even loop structure may look
substantially different than they do before inlining, so the vectorizer may
make a substantially different decision about whether or not to vectorize a
particular loop. As one (somewhat contrived) example, if the loop body is
inlined into a cold region of the enclosing function, the vectorizer might
be able to prioritize code size over performance and skip vectorization.

I think the fundamental problem is that we can't un-vectorize, and the
decision (and strategy) to vectorize is based on a cost model, so the later
we do it the more information we can use in the cost model to make the
correct decision.

> In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inliner from seeing the
> vectorized code.

> Just for the record, I have no real expectation that this is a good
> idea yet... But it's hard to collect numbers without a flag of some
> kind, and it's also really annoying to craft this flag given the
> current pass manager, so I figured I would get a skeleton in place
> that folks could experiment with, and we could keep or delete based
> on this discussion and any numbers.

> I see some potential issues:
>
> * We would run a loop pass again later, with the associated compile-time
> cost (?)

> This actually worries me the least -- the tradeoff is between
> locality and repeatedly analyzing the same loops during the highly
> iterative CGSCC pass manager (which looks at each function up to 4
> times, and may look at the body of a function which gets inlined
> many more times).

> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination.
> If we run later, we have to run those passes again, increasing compile
> time, OR
> we have to duplicate them in the loop vectorizer, increasing its
> complexity.

> Hopefully, these won't be too expensive. instcombine and simplifycfg
> shouldn't cost much this late in the pipeline, but only numbers will
> really tell. If we need redundancy elimination in the form of GVN, then
> we're in a lot of trouble, but that doesn't seem likely to be
> necessary (I hope).

Obviously there may be relevant differences, but running EarlyCSE after the BB vectorizer turned out to work almost as well as running GVN, and the improved compile time seemed worth the trade-off.

> * The vectorizer would like SCEV analysis to be as precise as
> possible: one reason is the dependency checks, which want to know that
> expressions cannot wrap (AddRec expressions, to be more specific).
> At the moment, indvars will remove those flags in some cases, which
> currently is not a problem because SCEV analysis still has the old
> values cached (except in the case that Chandler mentioned to me on IRC
> where we inline a function - in which case that info is lost). My
> understanding is that this is not really something we can
> fix easily because of the way that SCEV works
> (unique-ifying/commoning expressions and thereby dropping the
> flags).
> A potential solution would be to move indvars to later. The question
> is: do other loop passes that simplify IR depend on indvars? Andy,
> what is your take on this?

> I think this is far and away the most important question to answer.
> =] I think there are lots of things that would be improved by
> preserving SCEV's precision throughout the CGSCC pass manager, but I
> have no idea how feasible that is. I would also appreciate Dan's
> insights here.

> The benefit of vectorizing later is that we would have more context
> at the inlined call site. And it would solve the problem of the
> inliner seeing vectorized code.

> There's more though in theory (although maybe not in practice today,
> and I may just be wrong on some of these):
> - It would improve the ability of the vectorizer to reason about
> icache impact, because it wouldn't think the loop was the only loop
> in the function only to have it be inlined in 10 places.

Good point (although it might only apply in practice to cases where we know that the trip count is small, and that requires profiling data in general).

> - It may form a vectorized loop before inlining which would be
> handled better by loop idiom recognition after inlining.

I imagine that we could improve idiom recognition to mitigate this particular issue.

> - It would prevent turning loops that SCEV-based passes could simply
> compute and/or delete *after* inlining adds some context into
> vectorized loops that may be significantly harder to analyze at this
> level.

In my experience (from working with the BB vectorizer), this can be a significant problem. Even worse, if you vectorize any of the calculations feeding addressing, then BasicAA also becomes less precise.

> (The last one is the most speculative to me -- it could be that I'm
> wrong and SCEV and other loop analyses will work just as well on the
> vectorized loops...)
>
> All of these share a common thread: the vectorizer somewhat
> inherently loses information, and thus doing it during the iterative
> pass manager is very risky as later iterations may be hampered by
> it.

This is an infrastructure problem, but I suspect it will remain true without a significant effort to teach SCEV, BasicAA, etc. to look through vectorized computations.

-Hal

> In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inliner from seeing the
> vectorized code.

> There are obviously several issues here, and they seem only loosely
> related. Regarding this first one, why is the right answer not to
> adjust (or improve) the inlining heuristic? I understand that this
> is not easy, but the fact remains that, in the end, having the loop
> inlined, even with the extra vectorization checks, is what should be
> happening (or is the performance still worse than the non-vectorized
> code?). If we really feel that we can't adjust the current heuristic
> without breaking other things, then we could add some metadata to
> make the cost estimator ignore the vector loop preheader, but I'd
> prefer adjusting the inlining thresholds, etc. The commit message
> for r184698 said that the flag was for experimentation purposes, and
> I think that's fine, but this should not be the solution unless it
> really produces better non-inlined code as well.
>
> If all we are doing is inlining first in order to tweak the cost
> model for inlining, then I agree with everything you say... but I'm
> not at all sure that's the end result.
>
> After inlining, the control flow and even loop structure may look
> substantially different than they do before inlining, so the
> vectorizer may make a substantially different decision about whether
> or not to vectorize a particular loop. As one (somewhat contrived)
> example, if the loop body is inlined into a cold region of the
> enclosing function, the vectorizer might be able to prioritize code
> size over performance and skip vectorization.

Unfortunately, this can also be a problem. Loops that the vectorizer could understand and vectorize prior to inlining can become loops that the vectorizer cannot understand after inlining (and, as things currently stand, it does not take much: IIRC, a preheader is still required). In the future, I imagine that things will get better (but they could also get worse, if we do loop fusion for example).

> I think the fundamental problem is that we can't un-vectorize, and
> the decision (and strategy) to vectorize is based on a cost model,
> so the later we do it the more information we can use in the cost
> model to make the correct decision.

I agree.

Thanks again,
Hal

I agree. Adding a temporary flag is a good way to allow people to test this change with minimal effort. This is what we did when Jeffrey Yasskin wanted to check the vectorizer a few weeks ago.

I agree. The vectorizer is a lowering pass, and much like LSR, it loses information. A few months ago some of us talked about this and came up with a general draft for the ideal pass ordering.
If I remember correctly, the plan was that the second half of the pipeline should start with GVN (which currently runs after the loop passes). After that come the loop passes, the vectorizers (loop vectorization first), and finally LSR, lower-switch, CGP, etc. I think that when we discussed this, most people argued that the inliner should be before GVN and the loop passes. It would be interesting to see the performance numbers for the new pass order.

Nadav

> I agree. The vectorizer is a *lowering* pass, and much like LSR, it
> loses information. A few months ago some of us talked about this and came
> up with a general draft for the ideal pass ordering.

Where? On the mailing list?

> If I remember correctly, the plan was that the second half of the pipeline
> should start with GVN (which currently runs after the loop passes). After
> that come the loop passes, the vectorizers (loop vectorization first), and
> finally LSR, lower-switch, CGP, etc. I think that when we discussed this,
> most people argued that the inliner should be before GVN and the loop
> passes. It would be interesting to see the performance numbers for the new
> pass order.

This doesn't make a lot of sense to me yet.

The inliner, GVN, and the loop passes run together, *iteratively*. They are
neither before nor after one another. And this is important as it allows
iterative simplification in the inliner. It is one of the most critical
optimizations for C++ code that LLVM does.

We can't sink all of the loop passes out of the iterative pass model
either, because deleting loops, simplifying them, etc. all directly feed
the iterative simplification needed by GVN and the inliner.

We need a *second* loop pass pipeline that happens after the iterative CGSCC
walk and does the further optimizations such as (potentially) indvars, the
vectorizers, LSR, lower-switch, CGP, CG. I think we actually want most of
the post-CGSCC module passes to run after the vectorizers and before LSR to
fold away constants and globals that look different after vectorization
compared to before, but aren't significantly shifted by LSR and CGP.

In terms of mental model, is it best to think of vectorization as being a loop pass or as a late lowering pass?

What about when we get more aggressive loop transformations like blocking, strip mining, fusion, etc?

-Chris

> Hi,
>
> I wanted to start a discussion about the following issue since I am not sure what to do about it:
>
> The loop-vectorizer has the potential to make code quite a bit bigger (esp. in cases where we don’t know the loop count or whether pointers alias).
> Chandler has observed this in snappy, where we have a simple memory-copying loop (whose source and destination may overlap).
>
> We vectorize this loop, and then this loop gets inlined into a function and prevents that function from getting inlined in turn, causing a significant(?) degradation.
>
> https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
>
> We have seen some good performance benefits from vectorizing such loops. So not vectorizing them is not really a good option, I think.
>
> In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698> Chandler introduced a flag so that we can run the vectorizer after all CG passes. This would prevent the inliner from seeing the vectorized code.
>
> I see some potential issues:
>
>   • We would run a loop pass again later, with the associated compile-time cost (?)

I want to move loop opts that depend on target info later, outside of CGSCC: definitely indvars, vectorize/partial unroll. That way we only inline canonical code and have a clean break between canonical and lowering passes. Hopefully inlining heuristics will be adequate without first running these passes. For the most part, I think it’s as simple as inlining first with high-level heuristics, then lowering later.

>   • The vectorizer relies on cleanup passes to run afterwards: dce, instsimplify, simplifycfg, maybe some form of redundancy elimination.
>     If we run later, we have to run those passes again, increasing compile time, OR
>     we have to duplicate them in the loop vectorizer, increasing its complexity.

We’ll have to handle this case-by-case as we gradually move passes around. But the general idea is that lowering passes like the vectorizer should clean up after themselves as much as feasible (whereas canonicalization passes should not need to). We should be developing utilities to clean up redundancies incrementally. A value-numbering utility would make sense. Of course, if a very lightweight pass can simply be rescheduled to run again to do the cleanup, then we don’t need a cleanup utility.

>   • The vectorizer would like SCEV analysis to be as precise as possible: one reason is the dependency checks, which want to know that expressions cannot wrap (AddRec expressions, to be more specific).
>     At the moment, indvars will remove those flags in some cases, which currently is not a problem because SCEV analysis still has the old values cached (except in the case that Chandler mentioned to me on IRC where we inline a function - in which case that info is lost). My understanding is that this is not really something we can fix easily because of the way that SCEV works (unique-ifying/commoning expressions and thereby dropping the flags).
>     A potential solution would be to move indvars to later. The question is: do other loop passes that simplify IR depend on indvars? Andy, what is your take on this?

Indvars should ideally preserve NSW flags whenever possible. However, we don’t want to rely on SCEV to preserve them. SCEV expressions are implicitly reassociated and uniqued in a flow-insensitive universe independent of the def-use chain of values. SCEV simply can’t represent the flags in most cases. I think the only flag that makes sense in SCEV is the no-wrap flag on a recurrence (that’s independent of signed/unsigned overflow).

As long as indvars does not rely on SCEVExpander it should be able to preserve the flags. Unfortunately, it still uses SCEVExpander in a few places. LinearFunctionTestReplace is one that should simply be moved into LSR instead. For the couple other cases, we’ll just have to work on alternative implementations that don’t drop flags, but I think it’s worth doing.

That said, we should try not to rely on NSW at all unless clearly necessary. It introduces nasty complexity that needs to be well justified. E.g., in the vectorized loop preheader we should explicitly check for wrapping and only try to optimize those checks using NSW if we have data that indicates it’s really necessary.
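
For illustration, an explicit guard of that sort might look like this at the C++ level (the names and the 32-bit assumption are illustrative, not from the thread):

  #include <climits>

  // Run the vector loop only when the induction arithmetic provably
  // cannot wrap; otherwise fall back to the scalar loop. Doing the
  // check in a wider type keeps the check itself from wrapping
  // (assumes the product fits in 64 bits).
  bool inductionCannotWrap(int Start, long long TripCount, int Step) {
    long long Last = static_cast<long long>(Start) +
                     TripCount * static_cast<long long>(Step);
    return Last <= static_cast<long long>(INT_MAX) &&
           Last >= static_cast<long long>(INT_MIN);
  }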

-Andy

> The inliner, GVN, and the loop passes run together, *iteratively*.
> They are neither before nor after one another. And this is
> important as it allows iterative simplification in the inliner. It
> is one of the most critical optimizations for C++ code that LLVM
> does.
>
> We can't sink all of the loop passes out of the iterative pass
> model either, because deleting loops, simplifying them, etc. all
> directly feed the iterative simplification needed by GVN and the
> inliner.
>
> We need a *second* loop pass pipeline that happens after the
> iterative CGSCC walk and does the further optimizations such as
> (potentially) indvars, the vectorizers, LSR, lower-switch, CGP, CG.
> I think we actually want most of the post-CGSCC module passes to run
> after the vectorizers and before LSR to fold away constants and
> globals that look different after vectorization compared to before,
> but aren't significantly shifted by LSR and CGP.

> In terms of mental model, is it best to think of vectorization as
> being a loop pass or as a late lowering pass?
>
> What about when we get more aggressive loop transformations like
> blocking, strip mining, fusion, etc?

I view vectorization as a lowering pass, but it changes the cost of the code (both in terms of size and runtime), and so the inliner should be aware of its effects. Of course, the inliner is already trying to guess properties of the code that will be generated for some given IR, and vectorization only makes this process potentially more difficult. Currently, we deal with this by running the vectorizer prior to inlining, but that is obviously not the only way.

Fusion could be considered a canonicalization pass, but blocking, strip mining, etc. are also lowering operations (and, furthermore, lowering operations that need to be vectorization-aware).

In terms of categorization, there seem to be (at least) two relevant factors: canonicalization vs. target-awareness (lowering), and also information creating/destroying/preserving. The inliner is information-creating (except, for example, when we lose no-alias function argument information, but that is an infrastructure limitation); the vectorizer is information-destroying (mainly an infrastructure limitation in SCEV, BasicAA, etc., but currently true regardless).

-Hal

> Hi,
>
> I wanted to start a discussion about the following issue since I am
> not sure what to do about it:
>
> The loop-vectorizer has the potential to make code quite a bit
> bigger (esp. in cases where we don’t know the loop count or whether
> pointers alias).
> Chandler has observed this in snappy, where we have a simple
> memory-copying loop (whose source and destination may overlap).
>
> We vectorize this loop, and then this loop gets inlined into a
> function and prevents that function from getting inlined in turn,
> causing a significant(?) degradation.
>
> https://code.google.com/p/snappy/source/browse/trunk/snappy.cc#99
>
> We have seen some good performance benefits from vectorizing such
> loops. So not vectorizing them is not really a good option, I think.
>
> In <http://llvm.org/viewvc/llvm-project?view=revision&revision=184698>
> Chandler introduced a flag so that we can run the vectorizer after
> all CG passes. This would prevent the inliner from seeing the
> vectorized code.

> I see some potential issues:
>
> * We would run a loop pass again later, with the associated compile-time
> cost (?)

> I want to move loop opts that depend on target info later, outside of
> CGSCC: definitely indvars, vectorize/partial unroll. That way we
> only inline canonical code and have a clean break between canonical
> and lowering passes. Hopefully inlining heuristics will be adequate
> without first running these passes. For the most part, I think it's
> as simple as inlining first with high-level heuristics, then
> lowering later.

> * The vectorizer relies on cleanup passes to run afterwards: dce,
> instsimplify, simplifycfg, maybe some form of redundancy elimination.
> If we run later, we have to run those passes again, increasing compile
> time, OR
> we have to duplicate them in the loop vectorizer, increasing its
> complexity.

> We'll have to handle this case-by-case as we gradually move passes
> around. But the general idea is that lowering passes like the
> vectorizer should clean up after themselves as much as feasible
> (whereas canonicalization passes should not need to). We should be
> developing utilities to clean up redundancies incrementally. A
> value-numbering utility would make sense. Of course, if a very
> lightweight pass can simply be rescheduled to run again to do the
> cleanup, then we don't need a cleanup utility.

> * The vectorizer would like SCEV analysis to be as precise as
> possible: one reason is the dependency checks, which want to know that
> expressions cannot wrap (AddRec expressions, to be more specific).
> At the moment, indvars will remove those flags in some cases, which
> currently is not a problem because SCEV analysis still has the old
> values cached (except in the case that Chandler mentioned to me on IRC
> where we inline a function - in which case that info is lost). My
> understanding is that this is not really something we can
> fix easily because of the way that SCEV works
> (unique-ifying/commoning expressions and thereby dropping the
> flags).
> A potential solution would be to move indvars to later. The question
> is: do other loop passes that simplify IR depend on indvars? Andy,
> what is your take on this?

> Indvars should ideally preserve NSW flags whenever possible. However,
> we don't want to rely on SCEV to preserve them. SCEV expressions are
> implicitly reassociated and uniqued in a flow-insensitive universe
> independent of the def-use chain of values. SCEV simply can't
> represent the flags in most cases. I think the only flag that makes
> sense in SCEV is the no-wrap flag on a recurrence (that's
> independent of signed/unsigned overflow).

Why can't SCEV keep a flow-sensitive (effectively per-BB) map of expressions and their original flags (and perhaps cached deduced flags)? It seems like this problem is solvable within SCEV.
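
A minimal sketch of what such a side table might look like (hypothetical; nothing like this exists in ScalarEvolution today):

  #include "llvm/ADT/DenseMap.h"
  #include "llvm/Analysis/ScalarEvolution.h"
  #include <utility>
  using namespace llvm;

  // Hypothetical: (uniqued SCEV, block of the original flagged
  // instruction) -> flags observed there. Uniquing can then keep the
  // nodes flag-free while queries stay flow-sensitive.
  typedef std::pair<const SCEV *, const BasicBlock *> FlagKey;
  typedef DenseMap<FlagKey, SCEV::NoWrapFlags> FlowSensitiveFlagMap;

  void recordFlags(FlowSensitiveFlagMap &Map, const SCEV *S,
                   const BasicBlock *BB, SCEV::NoWrapFlags Flags) {
    Map[std::make_pair(S, BB)] = Flags; // e.g. SCEV::FlagNSW
  }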

-Hal

I think you would have to track all the uses of each expression…

You can turn SCEV into an IR and inherit all the representational problems inherent in LLVM IR. Or you can use it as an analysis that efficiently represents only the mathematical properties of an expression and is simple to work with. I think it’s fantastic as an analysis but really stinks at being an IR.

-Andy

These discussions had more to do with formalizing the use of target information within IR passes (legality, instruction-level cost). There were some list threads and offline discussion. I set this aside because I wasn’t sure how it was going to fit with some of the other work in progress, particularly LTO.

I don’t think there’s any controversy over the high-level goals. But there will be controversy when we start proposing concrete pass ordering changes.

When I return to work mid-July, I’d be happy to send out some proposed changes for discussion. The first step will be an improved interface for IR-level cost metrics, which we already agreed to some time ago.

I don’t want to start a centi-thread yet, but here’s a very rough idea (leaving many things out):

Canonicalize {
  Func {
    SimpCFG
    SROA-1
    EarlyCSE
  }
  CGSCC {
    Inline
    EarlyCSE
    SimpCFG
    InstCombine
    Early Loop Opts {
      LoopSimplify
      Rotate
      Obvious-Full-Unroll
    }
    SROA-2
    InstCombine
    GVN
    Reassociate
    Late Loop Opts {
      LICM
      Unswitch
    }
    SCCP
    InstCombine
    JT
    CVP
    DCE
  }
}
Lower {
  Target Loop Opts {
    IndvarSimplify
    Vectorize/Unroll
    LSR
  }
  SLP Vectorize
}

We might need to pull some things like exit value replacement out of IndvarSimplify into target-independent loop opts.

-Andy
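
For concreteness, the "Lower" half of the outline above maps roughly onto the existing pass-creation APIs like this (a sketch only; the exact selection, options, and placement are illustrative, not a tested pipeline):

  #include "llvm/PassManager.h"
  #include "llvm/Transforms/Scalar.h"
  #include "llvm/Transforms/Vectorize.h"
  using namespace llvm;

  // Sketch: a separate late "Lower" pipeline, run once the iterative
  // CGSCC/canonicalization phase has finished.
  void populateLowerPasses(PassManager &PM) {
    // Target Loop Opts
    PM.add(createIndVarSimplifyPass());     // moved out of CGSCC
    PM.add(createLoopVectorizePass());      // Vectorize/Unroll
    PM.add(createLoopUnrollPass());
    PM.add(createLoopStrengthReducePass()); // LSR
    // SLP Vectorize
    PM.add(createSLPVectorizerPass());
  }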

I actually like the look of this. I'll hack things around so we have a flag
(or set) we can use to experiment. It would be great to get the bits
factored out of IndvarSimplify so that we can run the experiments more
easily.

I'm particularly interested in this if it addresses the regressions I hit
with loop vectorize on by default.

Another thing to look at is whether we need a GVN utility to run after LICM/Unswitch, or just run a fast non-iterative version of the pass a second time. Likewise, the vectorizer may rely on GVN now, but it should probably call a utility instead.

-Andy

> Indvars should ideally preserve NSW flags whenever possible. However,
> we don't want to rely on SCEV to preserve them. SCEV expressions are
> implicitly reassociated and uniqued in a flow-insensitive universe
> independent of the def-use chain of values. SCEV simply can't
> represent the flags in most cases. I think the only flag that makes
> sense in SCEV is the no-wrap flag on a recurrence (that's
> independent of signed/unsigned overflow).
>
> Why can't SCEV keep a flow-sensitive (effectively per-BB) map of
> expressions and their original flags (and perhaps cached deduced
> flags)? It seems like this problem is solvable within SCEV.
>
> -Hal
>
> I think you would have to track all the uses of each expression...

Yes, but SCEVs are flow-insensitive, normalized, and uniqued, and we only need to keep track of the original flagged instructions (specifically their parent basic blocks). Combined with some caching of subsequent flags for derived expressions, I think this may even be not too expensive.

> You can turn SCEV into an IR and inherit all the representational
> problems inherent in LLVM IR. Or you can use it as an analysis that
> efficiently represents only the mathematical properties of an
> expression and is simple to work with. I think it's fantastic as an
> analysis but really stinks at being an IR.

Agreed. Unfortunately, we do need to keep track of wrapping properties in order to generate correct code in all cases (assuming we don't want to be unduly pessimistic). I don't think that enhancing SE to deal with this will automatically drag in all of the messiness of a full IR (especially if we can just keep track of the necessary flag information 'on the side'). I also don't like the idea of writing mini-SEs all over the place to work around problems in SE.

Regardless, I take it that you feel that I'm off-base in wanting SE to deal with nsw in some general sense, but I don't see why. As you've pointed out to me, at least some SE computations assume nsw even when perhaps they shouldn't (like the backedge-taken counts). That combined with the fact that SE is dropping the flags, causing problems for downstream passes, seems like a good motivation for tracking them within SE.

A more general question: Currently, when we have (i +(nsw) 5), SE, etc. implicitly assume that (i + 25) also does not wrap, correct? If that's right, then while this seems to follow the C model, I don't think that it is specified in the language reference.
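
To make the question concrete: the implication certainly does not hold as ordinary integer math; assuming a 32-bit int, for example:

  #include <climits>

  // i + 5 not wrapping does not, by itself, imply i + 25 cannot wrap.
  void counterexample() {
    int i = INT_MAX - 10;
    int a = i + 5;  // fine: INT_MAX - 5
    int b = i + 25; // wraps: signed overflow, undefined behavior in C/C++
    (void)a; (void)b;
  }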

Thanks again,
Hal

> Indvars should ideally preserve NSW flags whenever possible. However,
> we don't want to rely on SCEV to preserve them. SCEV expressions are
> implicitly reassociated and uniqued in a flow-insensitive universe
> independent of the def-use chain of values. SCEV simply can't
> represent the flags in most cases. I think the only flag that makes
> sense in SCEV is the no-wrap flag on a recurrence (that's
> independent of signed/unsigned overflow).

> Why can't SCEV keep a flow-sensitive (effectively per-BB) map of
> expressions and their original flags (and perhaps cached deduced
> flags)? It seems like this problem is solvable within SCEV.
>
> -Hal
>
> I think you would have to track all the uses of each expression...
>
> Yes, but SCEVs are flow-insensitive, normalized, and uniqued, and we only need to keep track of the original flagged instructions (specifically their parent basic blocks). Combined with some caching of subsequent flags for derived expressions, I think this may even be not too expensive.

Clients of SCEV need SCEV-to-Value mappings. But do we really need to preserve them across passes? Invalidation is a nasty problem. We already have issues with SCEVUnknown invalidation.

> You can turn SCEV into an IR and inherit all the representational
> problems inherent in LLVM IR. Or you can use it as an analysis that
> efficiently represents only the mathematical properties of an
> expression and is simple to work with. I think it's fantastic as an
> analysis but really stinks at being an IR.
>
> Agreed. Unfortunately, we do need to keep track of wrapping properties in order to generate correct code in all cases (assuming we don't want to be unduly pessimistic). I don't think that enhancing SE to deal with this will automatically drag in all of the messiness of a full IR (especially if we can just keep track of the necessary flag information 'on the side'). I also don't like the idea of writing mini-SEs all over the place to work around problems in SE.
>
> Regardless, I take it that you feel that I'm off-base in wanting SE to deal with nsw in some general sense, but I don't see why. As you've pointed out to me, at least some SE computations assume nsw even when perhaps they shouldn't (like the backedge-taken counts). That combined with the fact that SE is dropping the flags, causing problems for downstream passes, seems like a good motivation for tracking them within SE.

SE "drops" flags internally because we don't have a safe way to express them. But why is Indvars dropping the IR flags? We should:
1) determine if its safe for Indvars to preserve nsw in the cases we miss now
2) if not we may want a loop level analysis to preserve certain facts like trip count (hopefully not needed)
3) defer parts of Indvars that are destructive or even reorganize loop passes. E.g.linear function test replace could as late as LSR.

> A more general question: Currently, when we have (i +(nsw) 5), SE, etc. implicitly assume that (i + 25) also does not wrap, correct? If that's right, then while this seems to follow the C model, I don't think that it is specified in the language reference.

Good point. I'm sure we make that assumption in places.
Andy