Working on FP SCEV Analysis

Hi,

I’m now working on extending SCEV Analysis to add FP support.
To begin with, I want to vectorize this loop:

float fp_inc;

float x = init;
for (int i = 0; i < N; i++) {
  A[i] = x;
  x += fp_inc; // loop-invariant variable or constant
}

In this loop “x” is an FP induction variable, but it is not the “primary” induction; the loop trip count still depends on the integer induction “i”.

In the future, I plan to work on “primary” FP inductions.
Intel compiler vectorizes FP-induction loops, GCC does not.

I wanted to check that the community has no fundamental objections to this work.
Thank you.

  • Elena

Adding Sanjoy Das as I think he will have lots of thoughts on this general subject.

[+CC Andy]

Hi Elena,

I don't have any fundamental issues with teaching SCEV about floating
point types, but given this will be a major change, I think a high
level roadmap should be discussed on llvm-dev before we start
reviewing and committing changes.

Here are some issues that I think are worth discussing:

  - Core motivation: why do we even care about optimizing floating
    point induction variables? What situations are they common in? Do
    programmers _expect_ compilers to optimize them well? (I haven't
    worked on our vectorizers, so pardon the possibly stupid question.)
    In the example you gave, why do you need SCEV to analyze the
    increment to vectorize the loop (i.e. how does it help)? What are
    some other concrete cases you'll want to optimize?

  - I presume you'll want SCEV expressions for `sitofp` and `uitofp`.
    (The most important question:) With these in the game, what is the
    canonical representation of SCEV expressions that can be expressed
    as, say, both `sitofp(A + B)` and `sitofp(A) + sitofp(B)`?

    Will we have a way to mark expressions (like we have `nsw` and
    `nuw` for `sext` and `zext`) which we can distribute `sitofp` and
    `uitofp` over?

    Same questions for `fptosi` and `fptoui`.

  - How will you partition the logic between floating and integer
    expressions in SCEV-land? Will you have, say, `SCEVAddExpr` do
    different things based on type, or will you split it into
    `SCEVIAddExpr` and `SCEVFAddExpr`? [0]

    * There are likely to be similarities too -- e.g. the "inductive"
      or "control flow" aspect of `SCEVAddRecExpr` is likely to be
      common between floating point add recurrences[1], and integer add
      recurrences; and part of figuring out the partitioning is also
      figuring out how to re-use these bits of logic.
  
[0]: I'll prefer the latter since e.g. integer addition is
associative, but floating point addition isn't; and it is better to
force programmers to handle the two operations differently.

[1]: For instance, things like this:
https://github.com/llvm-mirror/llvm/blob/master/lib/Analysis/ScalarEvolution.cpp#L7564
are likely to stay common between floating point and integer add recs.

-- Sanjoy

I don’t think it makes sense to consider a floating point induction variable an “add recurrence” unless the inductive value can be promoted to an integer. (The computational properties of chained recurrences are defined on integers.)

IndVarSimplify’s handleFloatingPointIV is supposed to promote your FP induction variables to integers so that SCEV can analyze them. After all the loop opts run, LSR is supposed to clean up any casts during optimizeShadowIV.

-Andy

That said, there are probably many reasons (e.g. a non-integer initial value) why you can’t promote an FP IV to an integer, but you may still want to vectorize it in fast-math mode. I’ll leave it to MichaelZ and Adam to comment on how best to prepare such loops for vectorization.

-Andy

One thought I’ve had about this in the past —

SCEV is useful for proving two expressions are equivalent (in general, not just specifically for loop induction variables). This gets a little bit trickier with fast-math; for example, suppose two expressions are equivalent, but only with reassociativity allowed?

This would be useful for making N-ary reassociate or similar algorithms work with float, as it currently uses SCEV to prove equivalence, but fast-math makes that much trickier (and without fast-math, there’s not much you can prove).

—escha

Hi Escha,

escha via llvm-dev wrote:

> One thought I’ve had about this in the past —
>
> SCEV is useful for proving two expressions are equivalent (in general,
> not just specifically for loop induction variables). This gets a little
> bit trickier with fast-math; for example, suppose two expressions are
> equivalent, but *only* with reassociativity allowed?

This isn't too different from "sext(a + b) == sext(a) + sext(b) only
if the addition is nsw".

However, it wasn't clear to me from the lang-ref what the fastmath
flags really mean. What is the semantics of a program that computes
((a +fast b) +fast c) when ((a + b) + c) != (a + (b + c))? Does it
have UB (in which case we can't hoist fastmath arithmetic) or is it
more like poison? Depending on the answer, for fastmath we can
consider doing what we do today for nuw/nsw (which has its own set
of issues, but at least wouldn't require us to start from scratch).

-- Sanjoy

From: "Sanjoy Das via llvm-dev" <llvm-dev@lists.llvm.org>
To: escha@apple.com
Cc: "llvm-dev" <llvm-dev@lists.llvm.org>, "Michael V Zolotukhin" <michael.v.zolotukhin@intel.com>
Sent: Monday, May 16, 2016 7:31:20 PM
Subject: Re: [llvm-dev] Working on FP SCEV Analysis

Hi Escha,

> One thought I’ve had about this in the past —
>
> SCEV is useful for proving two expressions are equivalent (in
> general,
> not just specifically for loop induction variables). This gets a
> little
> bit trickier with fast-math; for example, suppose two expressions
> are
> equivalent, but *only* with reassociativity allowed?

This isn't too different from "sext(a + b) == sext(a) + sext(b) only
if the addition is nsw".

However, it wasn't clear to me from the lang-ref what the fastmath
flags really mean. What is the semantics of a program that computes
((a +fast b) +fast c) when ((a + b) + c) != (a + (b + c))?

This is often true, and it is neither UB nor poison. The user has given the compiler the freedom to choose whatever the compiler decides will be most-efficient to compute. "fast math" is an odd concept in that sense. It is giving the compiler the freedom to change the program semantics to some extent.

-Hal


It’s possible to make an infinite loop finite with fast math if the primary IV is float. Is it acceptable behavior too? What I mean is that if we have
for (float i = 0.0; i < HUGE_FLOAT_NUMBER; i += 0.5) {
  ...
}
we might never reach HUGE_FLOAT_NUMBER, as at some point i+1.0 = i in float (or am I totally wrong here?). However, when analyzing this loop, we might find its trip count to be exactly HUGE_FLOAT_NUMBER*2. There are all kinds of odd stuff with floats in this area, but probably if ‘-ffast-math’ is provided, we’re fine.

I’m also very interested in specific cases we want to catch here. In the past Tyler (CCed) looked into this problem, but as far as I remember we decided that complexity/gains ratio was too low. Basically, I anticipate that the code to support this in SCEV will be at least as sophisticated as for integers, but in contrast to integers FP induction variables are much rarer in real world programs, at least to my impression.

Thanks,
Michael

From: "Michael Zolotukhin" <mzolotukhin@apple.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "Sanjoy Das" <sanjoy@playingwithpointers.com>, "llvm-dev"
<llvm-dev@lists.llvm.org>, "Michael V Zolotukhin"
<michael.v.zolotukhin@intel.com>, "tyler nowicki"
<tyler.nowicki@gmail.com>
Sent: Monday, May 16, 2016 8:20:05 PM
Subject: Re: [llvm-dev] Working on FP SCEV Analysis


> It’s possible to make an infinite loop finite with fast math if the
> primary IV is float. Is it acceptable behavior too? What I mean is
> that if we have
>
> for (float i = 0.0; i < HUGE_FLOAT_NUMBER; i += 0.5) {
>   ...
> }
>
> we might never reach HUGE_FLOAT_NUMBER, as at some point i+1.0 = i in
> float (or am I totally wrong here?). However, when analyzing this
> loop, we might find its trip count to be exactly
> HUGE_FLOAT_NUMBER*2. There are all kinds of odd stuff with floats in
> this area, but probably if ‘-ffast-math’ is provided, we’re fine.

Great question. So, if I paraphrase the intent of "fast math" to be, "I'm only thinking about these floating-point numbers as real numbers, you can do the same", then I think the answer to your question is yes. Furthermore, that would certainly speed things up. ;-)

> I’m also very interested in specific cases we want to catch here. In
> the past Tyler (CCed) looked into this problem, but as far as I
> remember we decided that complexity/gains ratio was too low.
> Basically, I anticipate that the code to support this in SCEV will
> be at least as sophisticated as for integers, but in contrast to
> integers FP induction variables are much rarer in real world
> programs, at least to my impression.

To be honest, I don't think that loops with a floating-point reduction variable, written as the loop counter, are common or worth much dedicated effort. I have seen this in real code, but it normally comes from people trying (perhaps too hard) to work around integer register pressure problems. The case that Elena presented, however:

float fp_inc;

float x = init;
for (int i = 0; i < N; i++) {
  A[i] = x;
  x += fp_inc; // loop-invariant variable or constant
}

is not uncommon. Being able to do something intelligent with this is worthwhile.

-Hal

Yes, we should handle floating-point reductions, not induction variables. Sorry if I misunderstood the intent. I thought that Tyler already solved the problem of vectorizing FP reductions though.

-Andy

Hi Sanjoy,

Please see my answers below:

  • Core motivation: why do we even care about optimizing floating
    point induction variables? What situations are they common in? Do
    programmers expect compilers to optimize them well? (I haven’t
    worked on our vectorizers so pardon the possibly stupid question)
    in the example you gave, why do you need SCEV to analyze the
    increment to vectorize the loop (i.e how does it help)? What are
    some other concrete cases you’ll want to optimize?

[Demikhovsky, Elena] I gave an example of a loop that can be vectorized in fast-math mode. The ICC compiler vectorizes loops with primary and secondary IVs:
This is the example for primary induction:

(1) for (float i = 0.5; i < 0.75; i += 0.05) {}  // i is a "primary" IV

And for secondary:

(2) for (int i = 0, float x = start; i < N; i++, x += delta) {}  // x is a "secondary" IV

Now I’m working only on (2)

  • I presume you’ll want SCEV expressions for sitofp and uitofp.

[Demikhovsky, Elena] I’m adding these expressions, of course. They are similar to “truncate” and “zext” in terms of implementation.

(The most important question:) With these in the game, what is the
canonical representation of SCEV expressions that can be expressed
as, say, both sitofp(A + B) and sitofp(A) + sitofp(B)?
[Demikhovsky, Elena] Meanwhile I have (start + delta * sitofp(i)).
I don’t know how far we can go with FP simplification and under what flags. The first implementation does not assume that sitofp(A + B) is equal to sitofp(A) + sitofp(B)

Will we have a way to mark expressions (like we have nsw and
nuw for sext and zext) which we can distribute sitofp and
uitofp over?
[Demikhovsky, Elena] I assume that sitofp and uitofp should be 2 different operations.

Same questions for fptosi and fptoui.
[Demikhovsky, Elena] the same answer as above, because I don’t want to combine these operations

  • How will you partition the logic between floating and integer
    expressions in SCEV-land? Will you have, say, SCEVAddExpr do
    different things based on type, or will you split it into
    SCEVIAddExpr and SCEVFAddExpr? [0]

[Demikhovsky, Elena] Yes, I’m introducing SCEVFAddExpr and SCEVFMulExpr - (start + delta * sitofp(i))

  • There are likely to be similarities too – e.g. the “inductive”
    or “control flow” aspect of SCEVAddRecExpr is likely to be
    common between floating point add recurrences[1], and integer add
    recurrences; and part of figuring out the partitioning is also
    figuring out how to re-use these bits of logic.
    [Demikhovsky, Elena] I’m adding SCEVFAddRecExpr to describe the recurrence of the FP IV


What situations are they common in?

The ICC Vectorizer made a paradigm shift a while ago:
if there isn’t a clear reason why something can’t be vectorized, we should try our best to vectorize.
The rest is a performance modeling (and priority-to-implement) question, not a capability question.
We believe this is a good paradigm to follow in vectorizer development. It was a big departure from
“vectorize when all things look nice to the vectorizer”.

We shouldn’t give up vectorizing simply because the programmer wrote FP induction code. (*)
Then, the next question is what’s the best solution for that problem, and extending SCEV
appears to be one of the obvious directions to explore.

Thanks,
Hideki Saito
Intel Compilers and Languages

>What situations are they common in?

> ICC Vectorizer made a paradigm shift a while ago.
> If there aren’t a clear reason why something can’t be vectorized, we
> should try our best to vectorize.
> The rest is a performance modeling (and priority to implement) question,
> not a capability question.
> We believe this is a good paradigm to follow in a vectorizer development.

In some sense, yes, but not at all possible costs.
There needs to be some actual motivating case to make it worth even writing
the code for.

> It was a big departure from
> “vectorize when all things look nice to vectorizer”.

These are not diametrically opposed.

I mean, it may be not worth the cost of maintaining the *compiler code* to
do it.
This isn't the same as "when things look nice to the vectorizer", it's more
"we're willing to vectorize whatever we can, as long as someone is going to
actually use it".

Nobody here has provided a useful set of cases/applications/etc. that
suggests it should be done. I'm not saying there are none, I'm saying,
literally, nobody has motivated this use case yet :)

> We shouldn’t give up vectorizing simply because programmer wrote a FP
> induction code.(*)

We shouldn't add code to the compiler just because we can.

I would similarly be against, for example, vectorizing loops with binary
coded decimal induction variables, and adding an entire BCD SCEV
infrastructure, without some motivating case *somewhere*.

So i suggest y'all start from: "Here are the cases we care about making
faster, and why we care about making them faster".

In compilers, building infrastructure first, then finding customers works a
lot worse than figuring out what customers want, and then building
infrastructure for them :)

This paradigm can have far-reaching consequences. The vectorizer is the performance cow to milk at the IR level. So under that paradigm, followed religiously, one would plug in any loop transformation (polyhedral or non-polyhedral), cost models, etc., to morph code into vectorizable form. And when that is not sufficient, one would probably start adding large numbers of run-time checks, multi-versioned code, and so on. This might be a good paradigm to follow from the peak-performance angle, but not from the compile-time or code-size angle. It seems best to pursue a paradigm like this in a peak-performance library rather than mainstream LLVM.

+1. I think a lot of people would be very interested in non-toy examples that show big performance differences between icc and clang. That would also allow us to dig deeper into questions like whether it is “vectorizer capability, dependence analysis and/or supporting transformations and/or ???” that explains the gap.

Graphics shading languages like GLSL or HLSL have floating point types as the “default” types. Integers weren’t even added until later revisions of GLSL. In that world, it’s not especially strange to imagine a loop counter written in floating point.

—Owen

But most of those are convertible to integer IVs which as Andy pointed out up the thread is something IndVarSimplify tries to do.

A potential way to show a motivating use case is what Andy already outlined: cases where non-integer values are fundamentally used as part of the IV.

Even then, I’d personally want to see further evidence of why the correct solution is to model the floating point IV in SCEV rather than find a more powerful way of converting the IV to an integer that models the non-integer values taken on by the IV. As an example, if the use case is the following code with appropriate flags to relax IEEE semantics so this looks like normal algebra etc:

for (float f = 0.01f; f < 1.0f; f += 0.01f)

I’d rather see us cleverly turn it into:

float f = 0.01f;
for (int i = 1; i < 100; i += 1, f += 0.01f)

(or whatever the transform would be, I’ve not spent lots of time thinking about the exact way to map this onto a synthetic “adding that value N times crosses threshold, so let’s replace IV with a counter to N”)

-Chandler

> Even then, I’d personally want to see further evidence of why the correct solution is to model the floating point IV in SCEV rather than find a more powerful way of converting the IV to an integer that models the non-integer values taken on by the IV. As an example, if the use case is the following code with appropriate flags to relax IEEE semantics so this looks like normal algebra etc:
>
> for (float f = 0.01f; f < 1.0f; f += 0.01f)   <-- A
>
> I’d rather see us cleverly turn it into:
>
> float f = 0.01f;
>
> for (int i = 1; i < 100; i += 1, f += 0.01f)   <-- B

I can later try to enhance IndVarSimplify::handleFloatingPointIV() in order to convert A to B.

But B is exactly the case I’m starting from. The main IV “i” is an integer. The variable “f” is also considered an IV in this loop.

And this loop is not vectorized because “f” is floating point.

I don’t think that the case B is uncommon.

  • Elena


> There needs to be some actual motivating case to make it worth even writing the code for.

This goes back into “priority to implement” question. If there aren’t any customers, priority goes down, by a lot.

> So under that paradigm - followed religiously - one would plug in any loop transformation, polyhedral or non-polyhedral etc cost models etc to morph code vectorizable

I won’t comment on other transformations. A powerful vectorizer certainly helps other optimizers make a case, and sometimes requires more optimizations to be fully appreciated.

> This might be a good paradigm to follow from the peak performance angle, but not so from the compile-time or code size angle.
> It seems best to pursue a paradigm like this with a peak performance library rather than mainstream llvm.

This should be evaluated feature by feature. I fully understand that LLVM is also used as a JIT compiler. I don’t think FP induction adds significantly more compile time and code size than integer induction.

> So i suggest y’all start from: "Here are the cases we care about making faster, and why we care about making them faster”.

+1

This was our thinking before the paradigm shift.

The following code vectorizes for TTT being int (might need a bit of extension in SCEV) but not when TTT is float/double (unless FP induction analysis is available). Adding 2 lines of code like this to a 1000-line loop suddenly stops vectorizing the entire loop. These are the things that greatly irritate programmers. Resolving programmer frustration is equally important as performance. In this case, a robust vectorizer should either 1) vectorize the FP induction or 2) tell the programmer that the FP induction needs to be converted to an integer induction. Either way, FP induction analysis is needed. Showing a backward dependence edge on “x” would certainly help, but not as much as 1) or 2). ICC Vectorizer customers appreciate improved “loop was not vectorized” messaging as much as functional and performance enhancements of the vectorizer.

In general, investing in making the vectorizer “robust” pays off very well, through performance and/or programmer satisfaction.

void foo(TTT *a, int N, TTT x, TTT y) {
  int i;
  for (i = 0; i < N; i++) {
    a[i] = x;
    x += y;
  }
}

FYI, I have a customer asking for an extension of OpenMP linear for non-POD types (I won’t bother getting into that discussion in llvm-dev).

When the vectorizer becomes stronger, more feature requests will come. :)

Thanks,

Hideki