LICM and SCEV AA?

Hello,

We currently generate suboptimal code for this loop:

for (int i = 1; i < LEN; i++) {
  a[i] = a[0] + b[i];
}

The largest problem is that we won't vectorize the loop because the loop vectorizer does not grok the independence of a[i] and a[0] (note that i starts at 1 in this loop). While we can probably fix the vectorizer to handle this, I don't think that is the right place for the fix, because this is really a LICM problem. LICM could move a[0] out of the loop, but it currently doesn't (which is pretty suboptimial in itself). The reason is that LICM does not use SE (either directly or indirectly), and BasicAA does not recognize that a[0] and a[i] will never overlap in this loop.

I see several possible solutions:

1. Add yet-another hack to BasicAA to catch this case.

2. Let LICM use SE directly. Instead of bailing out on hoisting a loop-invariant value in an alias set with non-invariant values, use SE to check independence with the other values in the alias set.

3. Use ScalarEvolutionAliasAnalysis; scheduling SCEV AA prior to running LICM fixes this problem.

Using #3 seems like the simplest, and most natural, way to fix this issue. I recall someone tell me (maybe at the developers' meeting) that SCEV AA was 'fundamentally flawed', and now I'm wondering what was meant by that. The pass manager does not know how to reschedule it unless it is specifically requested, but if that's the entirety of the problem, then I think that we could manually schedule it in a few key places and get a lot of benefit (there is a particular use case for prior to LICM, at least).

FWIW, I don't think that adding an SCEV dependency before LICM would be particularly problematic: Currently, LICM lives in the same loop pass manager instance as LoopSimplify, LoopRotate and LoopUnswitch, all of which preserve SE.

Suggestions?

Thanks again,
Hal

Hello,

We currently generate suboptimal code for this loop:

for (int i = 1; i < LEN; i++) {
a[i] = a[0] + b[i];
}

The largest problem is that we won't vectorize the loop because the loop vectorizer does not grok the independence of a[i] and a[0] (note that i starts at 1 in this loop). While we can probably fix the vectorizer to handle this, I don't think that is the right place for the fix, because this is really a LICM problem. LICM could move a[0] out of the loop, but it currently doesn't (which is pretty suboptimial in itself). The reason is that LICM does not use SE (either directly or indirectly), and BasicAA does not recognize that a[0] and a[i] will never overlap in this loop.

I see several possible solutions:

1. Add yet-another hack to BasicAA to catch this case.

2. Let LICM use SE directly. Instead of bailing out on hoisting a loop-invariant value in an alias set with non-invariant values, use SE to check independence with the other values in the alias set.

3. Use ScalarEvolutionAliasAnalysis; scheduling SCEV AA prior to running LICM fixes this problem.

Using #3 seems like the simplest, and most natural, way to fix this issue. I recall someone tell me (maybe at the developers' meeting) that SCEV AA was 'fundamentally flawed', and now I'm wondering what was meant by that. The pass manager does not know how to reschedule it unless it is specifically requested, but if that's the entirety of the problem, then I think that we could manually schedule it in a few key places and get a lot of benefit (there is a particular use case for prior to LICM, at least).

FWIW, I don't think that adding an SCEV dependency before LICM would be particularly problematic: Currently, LICM lives in the same loop pass manager instance as LoopSimplify, LoopRotate and LoopUnswitch, all of which preserve SE.

Suggestions?

I would be scared to enable SCEV AA because it's not well tested and updating SCEV across transforms can be problematic. LICM would require another run of SCEV (bad for compile time). As for other reasons it's fundamentally flawed, Dan G. may remember.

I would be tempted to go with #1 just to fix LICM, since LICM has no need for generic dependence analysis.

Of course, the vectorizer already uses SCEV to determine array element dependence. It should have a strong method for dependence analysis. Why does it miss this case?

-Andy

Hello,

We currently generate suboptimal code for this loop:

for (int i = 1; i < LEN; i++) {
a[i] = a[0] + b[i];
}

The largest problem is that we won't vectorize the loop because the loop vectorizer does not grok the independence of a[i] and a[0] (note that i starts at 1 in this loop). While we can probably fix the vectorizer to handle this, I don't think that is the right place for the fix, because this is really a LICM problem. LICM could move a[0] out of the loop, but it currently doesn't (which is pretty suboptimial in itself). The reason is that LICM does not use SE (either directly or indirectly), and BasicAA does not recognize that a[0] and a[i] will never overlap in this loop.

I see several possible solutions:

1. Add yet-another hack to BasicAA to catch this case.

2. Let LICM use SE directly. Instead of bailing out on hoisting a loop-invariant value in an alias set with non-invariant values, use SE to check independence with the other values in the alias set.

3. Use ScalarEvolutionAliasAnalysis; scheduling SCEV AA prior to running LICM fixes this problem.

Using #3 seems like the simplest, and most natural, way to fix this issue. I recall someone tell me (maybe at the developers' meeting) that SCEV AA was 'fundamentally flawed', and now I'm wondering what was meant by that. The pass manager does not know how to reschedule it unless it is specifically requested, but if that's the entirety of the problem, then I think that we could manually schedule it in a few key places and get a lot of benefit (there is a particular use case for prior to LICM, at least).

FWIW, I don't think that adding an SCEV dependency before LICM would be particularly problematic: Currently, LICM lives in the same loop pass manager instance as LoopSimplify, LoopRotate and LoopUnswitch, all of which preserve SE.

Suggestions?

I would be scared to enable SCEV AA because it's not well tested and updating SCEV across transforms can be problematic. LICM would require another run of SCEV (bad for compile time). As for other reasons it's fundamentally flawed, Dan G. may remember.

I would be tempted to go with #1 just to fix LICM, since LICM has no need for generic dependence analysis.

Of course, the vectorizer already uses SCEV to determine array element dependence. It should have a strong method for dependence analysis. Why does it miss this case?

Like you said it *should*. The simple dependence analysis we currently have works by pair wise comparing (subtracting) the access SCEVS.

It currently only handles constant dependences (SCEVConstant)

The resulting distance from Hal's example is {4,+,4} ReadWrite and so should be safe.

We just don't handle this case yet.

Best,
Arnold

Sent from my iPhone

>
>
>>
>> Hello,
>>
>> We currently generate suboptimal code for this loop:
>>
>> for (int i = 1; i < LEN; i++) {
>> a[i] = a[0] + b[i];
>> }
>>
>> The largest problem is that we won't vectorize the loop because
>> the loop vectorizer does not grok the independence of a[i] and
>> a[0] (note that i starts at 1 in this loop). While we can
>> probably fix the vectorizer to handle this, I don't think that is
>> the right place for the fix, because this is really a LICM
>> problem. LICM could move a[0] out of the loop, but it currently
>> doesn't (which is pretty suboptimial in itself). The reason is
>> that LICM does not use SE (either directly or indirectly), and
>> BasicAA does not recognize that a[0] and a[i] will never overlap
>> in this loop.
>>
>> I see several possible solutions:
>>
>> 1. Add yet-another hack to BasicAA to catch this case.
>>
>> 2. Let LICM use SE directly. Instead of bailing out on hoisting a
>> loop-invariant value in an alias set with non-invariant values,
>> use SE to check independence with the other values in the alias
>> set.
>>
>> 3. Use ScalarEvolutionAliasAnalysis; scheduling SCEV AA prior to
>> running LICM fixes this problem.
>>
>> Using #3 seems like the simplest, and most natural, way to fix
>> this issue. I recall someone tell me (maybe at the developers'
>> meeting) that SCEV AA was 'fundamentally flawed', and now I'm
>> wondering what was meant by that. The pass manager does not know
>> how to reschedule it unless it is specifically requested, but if
>> that's the entirety of the problem, then I think that we could
>> manually schedule it in a few key places and get a lot of benefit
>> (there is a particular use case for prior to LICM, at least).
>>
>> FWIW, I don't think that adding an SCEV dependency before LICM
>> would be particularly problematic: Currently, LICM lives in the
>> same loop pass manager instance as LoopSimplify, LoopRotate and
>> LoopUnswitch, all of which preserve SE.
>>
>> Suggestions?
>
> I would be scared to enable SCEV AA because it's not well tested
> and updating SCEV across transforms can be problematic. LICM would
> require another run of SCEV (bad for compile time). As for other
> reasons it's fundamentally flawed, Dan G. may remember.
>
> I would be tempted to go with #1 just to fix LICM, since LICM has
> no need for generic dependence analysis.
>
> Of course, the vectorizer already uses SCEV to determine array
> element dependence. It should have a strong method for dependence
> analysis. Why does it miss this case?

Like you said it *should*. The simple dependence analysis we
currently have works by pair wise comparing (subtracting) the access
SCEVS.

It currently only handles constant dependences (SCEVConstant)

The resulting distance from Hal's example is {4,+,4} ReadWrite and so
should be safe.

We just don't handle this case yet.

The problem is that having the loop vectorizer handle this case, unless we teach it to also do LICM, would mean turning it into a scalar load and splat on every iteration. This is still not great. Having LICM fix this would be much better.

-Hal

> Hello,
>
> We currently generate suboptimal code for this loop:
>
> for (int i = 1; i < LEN; i++) {
> a[i] = a[0] + b[i];
> }
>
> The largest problem is that we won't vectorize the loop because the
> loop vectorizer does not grok the independence of a[i] and a[0]
> (note that i starts at 1 in this loop). While we can probably fix
> the vectorizer to handle this, I don't think that is the right
> place for the fix, because this is really a LICM problem. LICM
> could move a[0] out of the loop, but it currently doesn't (which
> is pretty suboptimial in itself). The reason is that LICM does not
> use SE (either directly or indirectly), and BasicAA does not
> recognize that a[0] and a[i] will never overlap in this loop.
>
> I see several possible solutions:
>
> 1. Add yet-another hack to BasicAA to catch this case.
>
> 2. Let LICM use SE directly. Instead of bailing out on hoisting a
> loop-invariant value in an alias set with non-invariant values,
> use SE to check independence with the other values in the alias
> set.
>
> 3. Use ScalarEvolutionAliasAnalysis; scheduling SCEV AA prior to
> running LICM fixes this problem.
>
> Using #3 seems like the simplest, and most natural, way to fix this
> issue. I recall someone tell me (maybe at the developers' meeting)
> that SCEV AA was 'fundamentally flawed', and now I'm wondering
> what was meant by that. The pass manager does not know how to
> reschedule it unless it is specifically requested, but if that's
> the entirety of the problem, then I think that we could manually
> schedule it in a few key places and get a lot of benefit (there is
> a particular use case for prior to LICM, at least).
>
> FWIW, I don't think that adding an SCEV dependency before LICM
> would be particularly problematic: Currently, LICM lives in the
> same loop pass manager instance as LoopSimplify, LoopRotate and
> LoopUnswitch, all of which preserve SE.
>
> Suggestions?

I would be scared to enable SCEV AA because it's not well tested

This seems like a solvable problem. SCEV AA itself is something like 30 lines of non-boilerplate code, and essentially does one thing (calls getMinusSCEV in both directions and uses getUnsignedRange on the result). We should look it over.

and
updating SCEV across transforms can be problematic. LICM would
require another run of SCEV (bad for compile time).

Why? 'Running' SE does not really do anything because we compute everything lazily. SE depends on DT, which I agree we don't want to constantly recompute, but all of those transformations preserve DT and SE (and already use DT), so I don't see why there would be any problem here.

Using SE itself on every AA query might certainly add extra expense, but that just seems like an argument for judicious use.

As for other
reasons it's fundamentally flawed, Dan G. may remember.

I would be tempted to go with #1 just to fix LICM, since LICM has no
need for generic dependence analysis.

I was afraid that you'd say that :wink:

Of course, the vectorizer already uses SCEV to determine array
element dependence. It should have a strong method for dependence
analysis. Why does it miss this case?

It has been improving, but I think it is still fairly ad hoc. I think that we should switch to using a DependenceAnalysis-based solution (although maybe not until after 3.4 branches).

Nevertheless, it would still be better to have the scalar load outside the loop, instead of loaded and splatted inside the loop.

-Hal

If SCEV AA itself is conservative, the worst that can happen is that it
won't detect all cases it could, but we can extend it later. It seems to me
that this piece of code was added as an experiment and never updated, but
since it's already there, we could come up with a few tests that would
check the boundaries of a 30 loc piece, no?

--renato

Yes SCEV AA was an experiment. The reason it wasn’t ripped out is that it’s potentially useful. The worst that will happen is that passes will crash in very confusing ways. SCEVs will be evaluated early, not properly updated, and someone will try to expand them. You won’t hit this problems with simple testing.

I’m not opposed to making SCEV AA a reliable thing, but I’m not sure what “fundamental problems” we’ll hit along the way.

My primary objection is that I don’t think SCEV should run before the LICM loop pass manager. It’s to expensive in terms of compile time and does have maintenance burden. I’m not yet aware of any long-term need to have SCEV running at this point. If we’re only talking about disambiguating constant array indices with loop variant ones with constant bounds, there might be a simpler way to handle it.

-Andy

Yes SCEV AA was an experiment. The reason it wasn’t ripped out is that it’s potentially useful. The worst that will happen is that passes will crash in very confusing ways. SCEVs will be evaluated early, not properly updated, and someone will try to expand them. You won’t hit this problems with simple testing.

I’m not opposed to making SCEV AA a reliable thing, but I’m not sure what “fundamental problems” we’ll hit along the way.

My primary objection is that I don’t think SCEV should run before the LICM loop pass manager. It’s to expensive in terms of compile time and does have maintenance burden. I’m not yet aware of any long-term need to have SCEV running at this point. If we’re only talking about disambiguating constant array indices with loop variant ones with constant bounds, there might be a simpler way to handle it.

…that said. If SCEV AA is much simpler, and we are confident that it will be invalidated before anything else uses it. (InstCombine may do it), then it’s a straightforward compile time vs perf trade off. That can be measured.

Andy

> I would be scared to enable SCEV AA because it's not well tested

This seems like a solvable problem. SCEV AA itself is something like
30 lines of non-boilerplate code, and essentially does one thing
(calls getMinusSCEV in both directions and uses getUnsignedRange on
the result). We should look it over.

If SCEV AA itself is conservative, the worst that can happen is that
it won't detect all cases it could, but we can extend it later. It
seems to me that this piece of code was added as an experiment and
never updated, but since it's already there, we could come up with a
few tests that would check the boundaries of a 30 loc piece, no?

Yes SCEV AA was an experiment. The reason it wasn’t ripped out is
that it’s potentially useful. The worst that will happen is that
passes will crash in *very* confusing ways. SCEVs will be evaluated
early, not properly updated, and someone will try to expand them.
You won’t hit this problems with simple testing.

FWIW, it looks like the other passes in that group of passes preserve SE by calling forgetLoop. Nevertheless, I'm certainly sympathetic to the augment that the SE-preservation property of these passes may essentially be untested.

I’m not opposed to making SCEV AA a reliable thing, but I’m not sure
what “fundamental problems” we’ll hit along the way.

Fair enough.

My primary objection is that I don’t think SCEV should run before the
LICM loop pass manager. It’s to expensive in terms of compile time
and does have maintenance burden. I’m not yet aware of any long-term
need to have SCEV running at this point. If we’re only talking about
disambiguating constant array indices with loop variant ones with
constant bounds, there might be a simpler way to handle it.

I agree; I just wanted to have some discussion about yet-again growing the number of BasicAA heuristics when a more-systematic solution exists.

Thanks again,
Hal