RFC for a design change in LoopStrengthReduce / ScalarEvolution

This is related to an issue in loop strength reduction [1] that I've
been trying to fix on and off for a while. [1] has a more detailed
description of the issue and an example, but briefly put, I want LSR
to consider formulae that have "Zext T" as base and/or scale
registers, and to appropriately rate such formulae.

My first attempt[2] at fixing this was buggy and had to be reverted.
While it is certainly possible to "fix" the bug in [2], the fixed
version is ugly and fragile, and I'd prefer having something cleaner.

The key issue here is that currently there is no way to represent
"Zext T" as a *distinct computation* within SCEV -- a SCEV expression
of the form "Zext T" (any SCEV expression, actually) is always
maximally simplified on construction. This means I cannot "factor
out" a Zext from a base register (say) and force SCEV to maintain it
as a "SCEVZeroExtend T"; SCEV is free to re-simplify "SCEVZeroExtend
T" to something equivalent on construction. This is not just an issue
in theory -- factoring out a 'SCEVZeroExtend' from an expression 'E'
requires us to prove that 'E' == 'SCEVZeroExtend T'. SCEV can use
that exact same fact to "simplify" 'SCEVZeroExtend T' back to 'E' on
construction. There is an analogous issue w.r.t. sign extension
expressions.

There are two possibilities I see here -- the problem can be solved within
LoopStrengthReduce or within ScalarEvolution.

# Within LoopStrengthReduce

Solving it within LoopStrengthReduce is tricky: due to the
representation issue in SCEV mentioned above, the fact that a scale or
base register is actually narrower than needed and has to be zero /
sign extended before use needs to be represented separately from the
SCEV*. Every place in LoopStrengthReduce that analyzes and
manipulates registers needs to be updated to understand this bit of
side information ([2] was buggy because not every place in LSR had
been updated this way in the change).

The cleanest approach I could come up with is to introduce a `class
Register` that wraps a SCEV and some associated metadata; and change
LSR to reason about registers using this `Register` class (and not raw
SCEV pointers). All of the "this i64 register is actually a
zero-extended i32 register" bookkeeping can then live within this
`Register` class, making the structure of the code more obvious.
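
As a rough sketch of the shape I have in mind (illustrative only --
these exact names and fields do not exist in LSR today):

  // Illustrative sketch only -- none of these names exist in LSR today.
  // A Register pairs the SCEV that LSR manipulates with the side
  // information that the value is really a narrower quantity which has
  // to be zero- or sign-extended before use.
  class Register {
    const SCEV *Expr;          // the computation LSR reasons about
    Type *NarrowTy = nullptr;  // non-null => Expr extends a narrower value
    bool IsSigned = false;     // sext vs. zext

  public:
    explicit Register(const SCEV *E) : Expr(E) {}
    Register(const SCEV *E, Type *NarrowTy, bool IsSigned)
        : Expr(E), NarrowTy(NarrowTy), IsSigned(IsSigned) {}

    const SCEV *getExpr() const { return Expr; }
    bool isExtended() const { return NarrowTy != nullptr; }
    Type *getNarrowType() const { return NarrowTy; }
    bool isSignExtended() const { return IsSigned; }
  };

The cost model and formula generation would then ask the `Register`
for this side information instead of trying to re-derive it from the
raw SCEV.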

# Within ScalarEvolution

Another way to fix the problem is to introduce a new kind of SCEV node
-- a `SCEVOpaqueExpr`. `SCEVOpaqueExpr` wraps another SCEV
expression, and prevents optimizations from optimizing across the
`SCEVOpaqueExpr` boundary. This will allow us to keep representing
registers as SCEV pointers as they are today while solving the zext
representation problem. After factoring a SCEV expression 'E' into
'SCEVZeroExtend T', 'SCEVZeroExtend T' can be "frozen" by representing
it as 'SCEVZeroExtend (SCEVOpaqueExpr T)'. Since 'SCEVZeroExtend
(SCEVOpaqueExpr T)' computes the same value as 'SCEVZeroExtend T' ==
'E', there is no need to update the parts of LSR that analyze / modify
registers to preserve correctness.

Note: this type of "opaque" node also can be used to force the SCEV
expander to generate a particular sequence of expressions; something
that is achieved by the "expand(getSCEVUnknown(expand(Op)))" idiom in
LSR currently.

So far I'm leaning towards the second approach, but given that it would
be a substantial change (though potentially useful elsewhere), I'd like to
hear what others think before diving in.

-- Sanjoy

[1]: http://lists.llvm.org/pipermail/llvm-dev/2015-April/084434.html
[2]: http://reviews.llvm.org/rL243939

From: "Sanjoy Das via llvm-dev" <llvm-dev@lists.llvm.org>
To: "llvm-dev" <llvm-dev@lists.llvm.org>, "Andrew Trick" <atrick@apple.com>, "Quentin Colombet"
<qcolombet@apple.com>, "Dan Gohman" <dan433584@gmail.com>
Sent: Sunday, August 16, 2015 7:42:16 PM
Subject: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution

This is related to an issue in loop strength reduction [1] that I've
been trying to fix on and off for a while. [1] has a more detailed
description of the issue and an example, but briefly put, I want LSR
to consider formulae that have "Zext T" as base and/or scale
registers, and to appropriately rate such formulae.

My first attempt[2] at fixing this was buggy and had to be reverted.
While it is certainly possible to "fix" the bug in [2], the fixed
version is ugly and fragile, and I'd prefer having something cleaner.

The key issue here is that currently there is no way to represent
"Zext T" as a *distinct computation* within SCEV -- a SCEV expression
of the form "Zext T" (any SCEV expression, actually) is always
maximally simplified on construction. This means I cannot "factor
out" a Zext from a base register (say) and force SCEV to maintain it
as a "SCEVZeroExtend T"; SCEV is free to re-simplify "SCEVZeroExtend
T" to something equivalent on construction. This is not just an
issue
in theory -- factoring out a 'SCEVZeroExtend' from an expression 'E'
requires us to prove that 'E' == 'SCEVZeroExtend T'. SCEV can use
that exact same fact to "simplify" 'SCEVZeroExtend T' back to 'E' on
construction. There is an analogous issue w.r.t. sign extension
expressions.

I don't understand why you want to factor out the information, exactly. It seems like what you need is a function like:

  unsigned getMinLeadingZeros(const SCEV *);

then, if you want to get the non-extended expression, you can just apply an appropriate truncation. I assume, however, that I'm missing something.
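
(For reference, something with roughly that behavior can be
approximated today on top of the unsigned range SCEV already computes;
getMinLeadingZeros itself is not an existing function, this is just a
sketch:)

  // Sketch: approximate the hypothetical helper using SCEV's existing
  // unsigned-range analysis.
  unsigned getMinLeadingZeros(ScalarEvolution &SE, const SCEV *S) {
    return SE.getUnsignedRange(S).getUnsignedMax().countLeadingZeros();
  }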

Adding simplification barriers into SCEV expressions seems like a dangerous idea, especially since LSR is not the only user of SCEV.

-Hal

> I don't understand why you want to factor out the information,
> exactly. It seems like what you need is a function like:
>
>   unsigned getMinLeadingZeros(const SCEV *);
>
> then, if you want to get the non-extended expression, you can just
> apply an appropriate truncation. I assume, however, that I'm missing
> something.

The problem is not about how to codegen a specific SCEV expression
efficiently, but about the overall solution LSR converges to (which
includes considering how it will codegen *other* SCEVs). In the
latter sense there is a difference between `X` and `(trunc (zext X))`
even though both of them produce the same value (and possibly are
equally efficient).

The example in [1] is roughly like this at the high level:

You have an i32 induction variable `V` == `{VStart,+,1}`.
`{VStart,+,1}` does not unsigned-overflow given the loop bounds. Two
interesting uses in the loop are something like (in this mail by zext
I'll always mean zext from i32 to i64) `GEP @Global, zext(V)` and
`V ult limit`.

Because SCEV can prove that `{VStart,+,1}` does not unsigned-overflow,
`GEP @Global, zext(V)` can be computed as `GEP (@Global + zext
VStart), {i64 0,+,1}`. Similarly, `V` == `trunc (zext V)` ==
`trunc {zext VStart,+,1}` ==
`trunc({i64 0,+,1}) + trunc (zext VStart)` ==
`trunc({i64 0,+,1}) + VStart`. This is the solution LSR ends up
choosing. However, the cheaper solution here is

`GEP @Global, zext(V)` == `GEP @Global, zext({VStart,+,1})`
and
`V ult limit` == `{VStart,+,1} ult limit`

(i.e. pretty much the direct naive lowering).

This naive lowering is better because we can use the increment in
`{VStart,+,1}` to zero extend the value implicitly on x86 (you can
find the generated machine code at [1]).

The reason LSR does not pick this cheaper solution is that it
"forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`. I'm
trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
with `zext({X,+,Y})` when possible on architectures that have "free"
zero extensions, so that if there is a solution involving the narrower
induction variable, it will be found.
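
Roughly, the check could look like this (a sketch on top of the
existing ScalarEvolution API; the function name is made up, and the
target-specific "free zext" check is elided):

  // Sketch: given a wide addrec of the form {zext X,+,zext Y}, recover
  // the narrow recurrence {X,+,Y} when SCEV can prove that zext({X,+,Y})
  // is the same value, i.e. that the narrow recurrence does not wrap
  // unsigned.
  static const SCEV *getNarrowRecurrence(ScalarEvolution &SE,
                                         const SCEVAddRecExpr *AR) {
    auto *Start = dyn_cast<SCEVZeroExtendExpr>(AR->getStart());
    auto *Step = dyn_cast<SCEVZeroExtendExpr>(AR->getStepRecurrence(SE));
    if (!Start || !Step ||
        Start->getOperand()->getType() != Step->getOperand()->getType())
      return nullptr;

    const SCEV *Narrow = SE.getAddRecExpr(Start->getOperand(),
                                          Step->getOperand(), AR->getLoop(),
                                          SCEV::FlagAnyWrap);
    // getZeroExtendExpr only folds zext({X,+,Y}) back into
    // {zext X,+,zext Y} when it can prove no unsigned wrap, so this
    // equality is the "when possible" test from above.
    if (SE.getZeroExtendExpr(Narrow, AR->getType()) != AR)
      return nullptr;
    return Narrow;  // the register is then "zext of Narrow"
  }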

Now I could use the trick you mentioned in the place in LSR where it
computes the cost of a solution (i.e. when looking for "common"
registers see if one register is the zero extension of another). I
have not tried it, but I suspect it will be fairly tricky -- a given
solution will now have more than one "cost" depending on how you
interpret specific registers (as truncations of some extended value or
as a native wide value) and you'd have to search over the possible
interpretations of a specific solution.

> Adding simplification barriers into SCEV expressions seems like a
> dangerous idea, especially since LSR is not the only user of SCEV.

So I don't intend for SCEV to produce SCEVOpaqueExpr nodes itself.
There would instead be a 'const SCEV *getOpaqueExpr(const SCEV *)'
function that clients of SCEV can use to create an opaque expression.
Then most of the simplification routines in SCEV will be changed to have
the moral equivalent of 'if (isa<SCEVOpaqueExpr>(E)) continue;'.
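
In other words, something of this flavor inside the folding routines
(illustrative only -- SCEVOpaqueExpr is the proposed node, and 'Ops'
stands in for e.g. getAddExpr's operand list):

  // Sketch of the guard: operands wrapped in the proposed SCEVOpaqueExpr
  // are passed through untouched instead of being folded with their
  // neighbours.
  for (const SCEV *Op : Ops) {
    if (isa<SCEVOpaqueExpr>(Op))
      continue;  // do not simplify across the opaque boundary
    // ... existing folding logic for Op ...
  }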

-- Sanjoy

[1]: http://lists.llvm.org/pipermail/llvm-dev/2015-April/084434.html

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Andrew Trick" <atrick@apple.com>, "Quentin Colombet"
<qcolombet@apple.com>, "Dan Gohman" <dan433584@gmail.com>
Sent: Sunday, August 16, 2015 9:43:54 PM
Subject: Re: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution

> I don't understand why you want to factor out the information,
> exactly. It seems like what you need is a function like:
>
> unsigned getMinLeadingZeros(const SCEV *);
>
> then, if you want to get the non-extended expression, you can just
> apply an appropriate truncation. I assume, however, that I'm
> missing
> something.

The problem is not about how to codegen a specific SCEV expression
efficiently, but about the overall solution LSR converges to (which
includes considering how it will codegen *other* SCEVs). In the
latter sense there is a difference between `X` and `(trunc (zext X))`
even though both of them produce the same value (and possibly are
equally efficient).

The example in [1] is roughly like this at the high level:

You have an i32 induction variable `V` == `{VStart,+,1}`.
`{VStart,+,1}` does not unsigned-overflow given the loop bounds. Two
interesting uses in the loop are something like (in this mail by zext
I'll always mean zext from i32 to i64) `GEP @Global, zext(V)` and
`V ult limit`.

Because SCEV can prove that `{VStart,+,1}` does not
unsigned-overflow,
`GEP @Global, zext(V)` can be computed as `GEP (@Global + zext
VStart), {i64 0,+,1}`. Similarly, `V` == `trunc (zext V)` ==
`trunc {zext VStart,+,1}` == `trunc {zext VStart,+,1}` ==
`trunc({i64 0,+,1}) + trunc (zext VStart)` ==
`trunc({i64 0,+,1}) + VStart`. This is the solution LSR ends up
choosing. However, the cheaper solution here is

`GEP @Global, zext(V)` == `GEP @Global, zext({VStart,+,1})`
and
`V ult limit` == `{VStart,+,1} ult limit`

(i.e. pretty much the direct naive lowering).

This naive lowering is better because we can use the increment in
`{VStart,+,1}` to zero extend the value implicitly on x86 (you can
find the generated machine code at [1]).

The reason why LSR does not pick this cheaper solution is because it
"forgets" that `{zext VStart,+,1}` == `zext({VStart,+,1})`. I'm
trying to fix this by adding a rule that replaces `{zext X,+,zext Y}`
with `zext({X,+,Y})` when possible on architectures that have "free"
zero extensions, so that if there is a solution involving the
narrower
induction variable, it will be found.

To back up for a second, how much of this is self-inflicted damage? IndVarSimplify likes to preemptively widen induction variables. Is that why you have the extensions here in the first place?

The cost model used by IndVarSimplify is a bit simple:

  // We should not widen an indvar if arithmetics on the wider indvar are more
  // expensive than those on the narrower indvar. We check only the cost of ADD
  // because at least an ADD is required to increment the induction variable. We
  // could compute more comprehensively the cost of all instructions on the
  // induction variable when necessary.
  if (TTI &&
      TTI->getArithmeticInstrCost(Instruction::Add, Ty) >
          TTI->getArithmeticInstrCost(Instruction::Add,
                                      Cast->getOperand(0)->getType())) {

and perhaps we need a bit more sophistication here. I've found that, generally speaking, the extensions on induction variables, and SCEV's preference for factoring them out to varying degrees, end up making SCEV's job more difficult.

Also, Clang likes to generate code that ends up looking like this (after some optimization):

  %idxprom = sext i32 %i.0 to i64
  %arrayidx = getelementptr inbounds i32, i32* %b, i64 %idxprom

but even here, this is actually unnecessary. An inbounds GEP is defined to use sign-extended arithmetic, and so this could just as easily be:

  %arrayidx = getelementptr inbounds i32, i32* %b, i32 %i.0

In short, I wonder if we're solving the right problem. Maybe the canonicalization of the IR is not as useful as it could be.

-Hal

> To back up for a second, how much of this is self-inflicted damage?
> IndVarSimplify likes to preemptively widen induction variables. Is
> that why you have the extensions here in the first place?

In the specific example I was talking about, the zext came from our
frontend (our FE used to insert these extensions for reasons that are
no longer relevant). But you can easily get the same behavior from an
expression like `a[(unsigned long)i]`.
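
(For instance, a loop of roughly this shape is enough to produce the
i32-to-i64 zext on the index; the function below is illustrative
only:)

  // Illustrative source pattern: a 32-bit induction variable that is
  // widened to 64 bits only at its use as an array index.
  void store_iota(int *a, unsigned n) {
    for (unsigned i = 0; i < n; ++i)
      a[(unsigned long)i] = i;  // index is 'zext i32 %i to i64'
  }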

This looked like a fairly straightforward LSR problem to me,
especially because SCEV already has all the smarts to infer the
required properties. The only thing missing is the right framing of
the problem (i.e. framing it in a way such that we can implement a
maintainable solution).

-- Sanjoy

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Andrew Trick" <atrick@apple.com>, "Quentin Colombet"
<qcolombet@apple.com>, "Dan Gohman" <dan433584@gmail.com>
Sent: Monday, August 17, 2015 4:38:15 PM
Subject: Re: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution

> To back up for a second, how much of this is self-inflicted damage?
> IndVarSimplify likes to preemptively widen induction variables. Is
> that why you have the extensions here in the first place?

In the specific example I was talking about the zext came from our
frontend (our FE used to insert these extensions for reasons that are
no longer relevant). But you can easily get the same behavior from
an
expression like `a[(unsigned long)i]`.

Of course, and the point is that, for example, on x86_64, the zext here is free. I'm still trying to understand the problem...

In the example you provided in your previous e-mail, we choose the solution:

  `GEP @Global, zext(V)` -> `GEP (@Global + zext VStart), {i64 0,+,1}`
  `V` -> `trunc({i64 0,+,1}) + VStart`

instead of the actually-better solution:

  `GEP @Global, zext(V)` -> `GEP @Global, zext({VStart,+,1})`
  `V` -> `{VStart,+,1}`

where LSR never considers the latter case because it transforms:

  `zext({VStart,+,1})` to `{zext VStart,+,1}`

and, thus, never considers the formula with zext on the outside? Your proposed solution is that LSR should be able to create:

  zext(opaque({VStart,+,1}))

forcing it to keep all extensions as the outermost operators. Is that right? Will this interfere with SCEV's ability to prove wrapping properties of those expressions? If so, does that matter?

> This looked like a fairly straightforward LSR problem to me,
> especially because SCEV already has all the smarts to infer the
> required properties. The only thing missing is the right framing of
> the problem (i.e. framing it in a way such that we can implement a
> maintainable solution).

Sure, but fixing LSR by creating a local-handicapping mechanism for SCEV seems somehow unfortunate. Are we going to oscillate from picking from one part of the solution space to picking from a different part, without particular regard for which might be better?

-Hal

Hi Sanjoy - I also noticed LSR was converging on suboptimal solutions.
I'm hoping you can clarify this discussion for me.
Given a simple copy loop:

void copy(int* src, int* dst, unsigned int size) {
    for (unsigned int i = 0; i < size; ++i)
        dst[i] = src[i];
}

The hot spot for i686 contains 6 instructions with LSR:

8b 32 mov (%edx),%esi
89 31 mov %esi,(%ecx)
83 c1 04 add $0x4,%ecx
83 c2 04 add $0x4,%edx
48 dec %eax
75 f3 jne 11 <copy+0x11>

but only 5 instructions without LSR.

8b 3c b2 mov (%edx,%esi,4),%edi
89 3c b1 mov %edi,(%ecx,%esi,4)
46 inc %esi
39 c6 cmp %eax,%esi
75 f5 jne 14 <copy+0x14>

My interpretation of this problem was that LSR starts analysis with an
SCEV that counts down toward zero even in an upward counting loop. A
more natural fit would be an initial IV going from 0 to some limit,
but there does not seem to be a way to express this sort of SCEV. The
initial condition puts an artificially high cost on solutions with
upward counting registers since none match the "free" down counting
SCEV. Notice the weird 'dec %eax' showing up as an artifact. The
best solution with base-index scale 4 addressing was considered by
LSR, but incorrectly deemed to be too expensive.

Is this the same LSR problem you describe?

Thanks,
-steve

> Of course, and the point is that, for example, on x86_64, the zext here is free. I'm still trying to understand the problem...
>
> In the example you provided in your previous e-mail, we choose the solution:
>
>   `GEP @Global, zext(V)` -> `GEP (@Global + zext VStart), {i64 0,+,1}`
>   `V` -> `trunc({i64 0,+,1}) + VStart`
>
> instead of the actually-better solution:
>
>   `GEP @Global, zext(V)` -> `GEP @Global, zext({VStart,+,1})`
>   `V` -> `{VStart,+,1}`
>
> where LSR never considers the latter case because it transforms:
>
>   `zext({VStart,+,1})` to `{zext VStart,+,1}`
>
> and, thus, never considers the formula with zext on the outside? Your proposed solution is that LSR should be able to create:
>
>   zext(opaque({VStart,+,1}))
>
> forcing it to keep all extensions as the outermost operators. Is that right?

Yes, precisely.

> Will this interfere with SCEV's ability to prove wrapping properties of those expressions? If so, does that matter?

SCEV won't be able (by design) to prove no-overflow for
`opaque({VStart,+,1})` but it should still be able to prove no
overflow for `{VStart,+,1}` as usual.

> Sure, but fixing LSR by creating a local-handicapping mechanism for
> SCEV seems somehow unfortunate. Are we going to oscillate from picking
> from one part of the solution space to picking from a different part,
> without particular regard for which might be better?

While I found the SCEVOpaqueExpr idea easier to reason with, by no
means am I suggesting that it is clearly the better solution. :)

Would you rather this be solved fully within LSR instead?

-- Sanjoy

> My interpretation of this problem was that LSR starts analysis with an
> SCEV that counts down toward zero even in an upward counting loop. A
> more natural fit would be an initial IV going from 0 to some limit,
> but there does not seem to be a way to express this sort of SCEV. The
> initial condition puts an artificially high cost on solutions with
> upward counting registers since none match the "free" down counting
> SCEV. Notice the weird 'dec %eax' showing up as an artifact. The
> best solution with base-index scale 4 addressing was considered by
> LSR, but incorrectly deemed to be too expensive.
>
> Is this the same LSR problem you describe?

No, these two issues look unrelated.

-- Sanjoy

From: "Sanjoy Das" <sanjoy@playingwithpointers.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Andrew Trick" <atrick@apple.com>, "Quentin Colombet"
<qcolombet@apple.com>, "Dan Gohman" <dan433584@gmail.com>
Sent: Tuesday, August 18, 2015 12:35:50 AM
Subject: Re: [llvm-dev] RFC for a design change in LoopStrengthReduce / ScalarEvolution

> Of course, and the point is that, for example, on x86_64, the zext
> here is free. I'm still trying to understand the problem...
>
> In the example you provided in your previous e-mail, we choose the
> solution:
>
> `GEP @Global, zext(V)` -> `GEP (@Global + zext VStart), {i64
> 0,+,1}`
> `V` -> `trunc({i64 0,+,1}) + VStart`
>
> instead of the actually-better solution:
>
> `GEP @Global, zext(V)` -> `GEP @Global, zext({VStart,+,1})`
> `V` -> `{VStart,+,1}`
>
> where LSR never considers the latter case because it transforms:
>
> `zext({VStart,+,1})` to `{zext VStart,+,1}`
>
> and, thus, never considers the formula with zext on the outside?
> Your proposed solution is that LSR should be able to create:
>
> zext(opaque({VStart,+,1}))
>
> forcing it to keep all extensions as the outermost operators. Is
> that right?

Yes, precisely.

> Will this interfere with SCEV's ability to prove wrapping
> properties of those expressions? If so, does that matter?

SCEV won't be able (by design) to prove no-overflow for
`opaque({VStart,+,1})` but it should still be able to prove no
overflow for `{VStart,+,1}` as usual.

> Sure, but fixing LSR by creating a local-handicapping mechanism for
> SCEV seems somehow unfortunate. Are we going to oscillate from
> picking
> from one part of the solution space to picking from a different
> part,
> without particular regard for which might be better?

While I found the SCEVOpaqueExpr idea easier to reason with, by no
means am I suggesting that it is clearly the better solution. :slight_smile:

> Would you rather this be solved fully within LSR instead?

I don't claim to understand LSR well enough to have an informed preference here, and of course, this might also depend on *how* you solve it in LSR. That having been said, in an abstract sense, solving the problem without blocking analysis within SCEV seems like something that should at least be explored. Also, it seems like doing this in LSR can provide benefits the SCEV-based approach can't (but I could have misunderstood and the two proposed solutions search equivalent parts of the solution space).

Thanks again,
Hal

I also need some fixes in LSR, though they may or may not be the same as Sanjoy's. One of the TODOs in the file is: "More sophistication in the way Formulae are generated and filtered."

I guess that `{zext VStart,+,1}` appears during induction variable simplify.

Some cases with address calculations and SIB:

  Reg1 = sext (offset) to i64
  Loop:
    Load* BaseAddr [BaseReg,Reg1,Scale];
    Reg1 = Reg1 + 1
  Backedge

Or

  Reg1 = sext (offset)*Scale to i64
  Loop:
    Load* BaseAddr [0,Reg1,1];
    Reg1 = Reg1 + Scale
  Backedge

Or

  Reg1 = sext (offset1) to i64
  Reg1 = Scale*Reg1
  Reg2 = offset2
  Loop:
    Reg3 = sext(Reg2)
    Load* BaseAddr [Reg1,Reg3,Scale]
    Reg2 = <somethingelse>
    Reg1 = Reg1 + Stride*Scale
  Backedge

If LSR finds and optimizes Reg1 = Reg1 + Stride*Scale, a sext on the Start value looks like a better thing to have, as the IV got promoted to i64, though maybe this is not all that important. If no-wrap (NW) propagation didn't help, the sext would be inside the loop. The cost of this may vary with different targets: on x86_64, the register cost of zext is 0, and though sext may or may not need an extra register, it needs an extra instruction (movslq or cltq).

Another issue is that LSR just looks at innermost loops, instead of all array accesses at the current loop level. Also, it looks like we need to input thick AddRec expressions to LSR; for example, for C1 * (X + {Y,+,C2}), C1 would need to be substituted into the add to obtain an N-ary AddRec prior to the LSR opt. Else it is going to consider {Y,+,C2} for LSR replacement.
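
(Spelling out that substitution: `C1 * (X + {Y,+,C2})` == `C1*X + C1*{Y,+,C2}` == `C1*X + {C1*Y,+,C1*C2}`, so the addrec LSR would then see is `{C1*Y,+,C1*C2}` rather than the bare `{Y,+,C2}`.)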

I am not able to make the connection between SIB and all this, as we have addrecs. An addrec is {X, +, Y} or + (X, {0,+,Y}), so basically the LSR will insert a new add instruction based on {0,+,Y}. I suppose if there is more than one such Addrec subexpression in a loop, i.e. involving different Stepvalues, LSR may not be doing the better thing vs having an offset register in a SIB instruction. Maybe I am missing something. Thoughts?