First attempt at recognizing pointer reduction

Hi Nadav, Arnold,

I managed to find some time to work on the pointer reduction, and I got a patch that can make “canVectorize()” pass.

Basically, I teach AddReductionVar() about pointers, saying they don’t really have an exit instruction, and that (maybe) the final store is a good candidate (is it?).

This makes it recognize the writes and reads, but then “canVectorizeMemory()” bails because it can’t find the array bounds. That will be my next step, but I’d like to know whether I’m heading in the right direction, with the right assumptions about reduction detection, before proceeding to the more complicated bits.

An example of the debug output I get for my earlier example:

LV: Checking a loop in “fn1”
LV: Found a loop: for.body

== Induction ==

LV: Found an induction variable.

== Our write reduction ==

LV: Found a pointer add reduction PHI. %WRITE913 = phi i8* [ %incdec.ptr18, %for.body ], [ %WRITE, %for.body.preheader ]

== Not sure what this is… Will check. ==

LV: Found an non-int non-pointer PHI.

== Our read reduction ==

LV: Found a pointer add reduction PHI. %READ1012 = phi i8* [ %incdec.ptr2, %for.body ], [ %READ, %for.body.preheader ]

== Below is the memory operations’ validation, which I’ll deal with later ==

LV: Found an unidentified write ptr: i8* %WRITE
LV: Found an unidentified write ptr: i8* %WRITE
LV: Found an unidentified write ptr: i8* %WRITE
LV: Found an unidentified read ptr: i8* %READ
LV: Found an unidentified read ptr: i8* %READ
LV: Found an unidentified read ptr: i8* %READ
LV: Found a runtime check ptr: %WRITE913 = phi i8* [ %incdec.ptr18, %for.body ], [ %WRITE, %for.body.preheader ]

LV: Found a runtime check ptr: %incdec.ptr16 = getelementptr inbounds i8* %WRITE913, i64 1
LV: Found a runtime check ptr: %incdec.ptr17 = getelementptr inbounds i8* %WRITE913, i64 2
LV: Found a runtime check ptr: %READ1012 = phi i8* [ %incdec.ptr2, %for.body ], [ %READ, %for.body.preheader ]
LV: Found a runtime check ptr: %incdec.ptr = getelementptr inbounds i8* %READ1012, i64 1
LV: Found a runtime check ptr: %incdec.ptr1 = getelementptr inbounds i8* %READ1012, i64 2
LV: We need to do 18 pointer comparisons.
LV: We can’t vectorize because we can’t find the array bounds.

cheers,
–renato

pointer-reduction.patch (3.61 KB)

Renato,

This looks like the right direction. Did you run it on the LLVM test suite to check whether it finds new loops to vectorize?

Thanks,
Nadav

Thanks Nadav.

No, I haven't. I'm not expecting it to vectorize anything, but you're
right, maybe this is enough material on its own. I did run check-all
and found no regressions, but that means very little in this case.

I'll run with debug messages on and compare the LV lines.

cheers,
--renato

Renato,

Can you post a hand-created vectorized IR of how a reduction would work on your example?

I don’t think that recognizing this as a reduction is going to get you far. A reduction is beneficial if the value reduced is only truly needed outside of a loop.
This is not the case here (we are storing/loading from the pointer).

Your example is something like

WRITEPTR = phi i8* [ outsideval, preheader ], [ WPTR, loop ]
store WRITEPTR
...
WPTR = getelementptr WRITEPTR, 3

If you treat this as a reduction, you will end up with code like (say we used a VF=2):

WRITEPTR = phi <2 x i8*> [ <outsideval, 0>, preheader ], [ WPTR, loop ]
ptr = WRITEPTR<0> + WRITEPTR<1>
store ptr
...
WPTR = getelementptr WRITEPTR, <3,3>

I think the right approach to get examples like this is that we need to recognize strided pointer inductions (we only support strides by one currently).

Basically we need to be able to see

phi_ptr = phi double*

phi_ptr = gep phi_ptr, 2

and make a canonical induction variable out of this

phi_ind = phi [ 0, preheader ], [ phi_ind_next, loop ]
ptr = gep ptr_val_from_outside_loop, 2*phi_ind

phi_ind_next = add phi_ind, 1

(of course virtually).

Or in other words:

for (i++)
*ptr++ =
*ptr++ =

has to look like
for (i++)
ptr[2*i] =
ptr[2*i+1] =

to us when we are vectorizing this.

Thanks,
Arnold

Hi Arnold,

To sum up my intentions: I want to understand how the reduction/induction variable detection works in LLVM, so that I can better figure out how to detect different patterns in memory, not just the stride vectorization.

For instance, even if the relationship between iterations were complicated, I know that within each iteration all three reads are sequential. So, at least, I could use a Load<3> rather than three loads. I guess this is why Nadav was hinting at the SLP vectorizer, not the loop vectorizer. Since the operations on all three loaded values are exactly the same AND the writes are also sequential, I can reduce that to: load<3> → op<3> → store<3>. That should work even on machines that don’t have de-interleaved access (assuming they’re pointers to simple types, etc).
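The scalar shape being described here — three consecutive reads, the same operation on each value, three consecutive writes — can be written out as C. This is only an illustrative sketch; the function name, the concrete operation, and the constants are invented for the example, not taken from the patch:

```c
#include <stddef.h>

/* Illustrative scalar form of the pattern: per iteration, three
   consecutive reads, the same operation on each value, and three
   consecutive writes -- a candidate for load<3> -> op<3> -> store<3>. */
void kernel3(unsigned char *write, const unsigned char *read,
             size_t iterations, unsigned char delta) {
    for (size_t i = 0; i < iterations; i++) {
        write[0] = (unsigned char)(255 - read[0] - delta);
        write[1] = (unsigned char)(255 - read[1] - delta);
        write[2] = (unsigned char)(255 - read[2] - delta);
        write += 3;   /* pointer bumped by 3 each iteration, */
        read  += 3;   /* just like the WRITE/READ phis in the IR above */
    }
}
```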

Hi Arnold,

To sum up my intentions, I want to understand how the reduction/induction variable detection works in LLVM, so that I can know better how to detect different patterns in memory, not just the stride vectorization.

To detect memory access patterns you will want to look at the SCEV of a pointer (or at a set of SCEVs/pointers). This way you get a canonical form.

For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:

{ptr, +, 8}_loop
{ptr+4, +, 8}_loop

Each access on its own requires a gather/scatter (2 loads/stores when vectorized (VF=2) + inserts/extracts). But when we look at both at once, we see that we only need two loads/stores in total (plus some interleaving operations).
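As a concrete C counterpart (illustrative names only): with 4-byte ints, the loop below produces exactly those two SCEVs — the even stream starts at ptr and steps by 8 bytes per iteration, the odd stream starts at ptr+4 and also steps by 8:

```c
#include <stddef.h>

/* Two interleaved access streams: a[2*i] advances by 8 bytes per
   iteration from &a[0] ({ptr,+,8}), and a[2*i+1] advances by 8 bytes
   from &a[1], i.e. ptr+4 for 4-byte ints ({ptr+4,+,8}). */
void interleaved_store(int *a, size_t n, int x, int y) {
    for (size_t i = 0; i < n; i++) {
        a[2 * i]     = x;
        a[2 * i + 1] = y;
    }
}
```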

What other patterns (than strided accesses) do you want to vectorize?

For instance, even if the relationship between iterations were complicated, I know that within each iteration all three reads are sequential. So, at least, I could use a Load<3> rather than three loads.

Yes. Detecting this is what is described in the paper I referred to in an earlier post (Auto-vectorization of interleaved data for SIMD), and what gcc is doing (AFAICT). You want to take a set of accesses in the loop and recognize that they are consecutive in memory (currently we look at every access on its own; if an access is not sequential, we assume we have to gather/scatter). Once you know that you have consecutive accesses spread across different instructions, you can generate more efficient code instead of scatters/gathers. You would want to do the evaluation of which accesses are consecutive in SCEV, I think.

For your example, you want to recognize strided accesses (this is separate from the induction/reduction mechanism), first:

for (i = 0 .. n, +1)
a[2*i] = …
a[2*i+1] = …

You want this part first because, without it, the loop vectorizer is not going to vectorize, due to the cost of gather/scatter. But if it can recognize that in cases like this the “gather/scatter” is not as costly, and we emit the right code, we will start to vectorize such loops.

Once we can do that, we need to support strided pointer inductions to get your example.

for (i = 0..n, +1)
*a++=
*a++=

I guess this is why Nadav was hinting at the SLP vectorizer, not the loop vec. Since the operations on all three loaded values are exactly the same AND the writes are also sequential, I can reduce that to: load<3> -> op<3> -> store<3>. That should work even on machines that don't have de-interleaved access (assuming they're pointers to simple types, etc).

Getting this example in the SLP vectorizer is easier, but we won’t get the most efficient code (i.e. the one that gcc emits) because we would have <3 x i8> stores/loads. With vectorization of interleaved data you can load/store more elements (from several iterations) with a single load.

can you post a hand-created vectorized IR of how a reduction would work on your example?

I'm not there yet, just trying to understand the induction/reduction code first and what comes out of it.

I think the right approach to get examples like this is that we need to recognize strided pointer inductions (we only support strides by one currently).

I see. I'll have a look at IK_PtrInduction and see what patterns I can spot.

Do you envisage a new IK type for strided induction, or just working with PtrInduction to extend the concept to a non-unit stride (i.e. separating Step from ElementSize)?

Either representation should be fine. I think the bigger task is not recognizing the induction but recognizing consecutive strided memory accesses, though. First, I think we want to be able to do:

for (i = 0 … n, +1)
  a[3*i] =
  a[3*i+1] =
  a[3*i+2] =

And next,

for (i = 0 … n, +1)
  *a++ =
  *a++ =
  *a++ =
  
Because to get the latter, you need the former.
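As a sanity check that the two forms really are the same accesses (which is why the indexed case unlocks the pointer-bumping one), here is a minimal, illustrative comparison; both functions are invented names for the sketch:

```c
#include <stddef.h>

/* Indexed form: the stride-3 accesses are explicit, a[3*i+k]. */
void store_indexed(char *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[3 * i]     = 'x';
        a[3 * i + 1] = 'y';
        a[3 * i + 2] = 'z';
    }
}

/* Pointer-induction form: the same stores, disguised as *a++ = ... */
void store_bumped(char *a, size_t n) {
    for (size_t i = 0; i < n; i++) {
        *a++ = 'x';
        *a++ = 'y';
        *a++ = 'z';
    }
}
```

Both loops touch exactly the same bytes, so once the vectorizer handles the indexed form, the pointer form only needs the induction rewrite on top.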

Have you compared the performance of the kernel (gcc vectorized) you showed vs a scalar version? I would be curious about the speed-up.

Thanks,
Arnold

For example these should be the SCEVs of “int a[2*i] = ; a[2*i+1] =”:

{ptr, +, 8}_loop
{ptr+4, +, 8}_loop

Each access on its own requires a gather/scatter (2 loads/stores when
vectorized (VF=2) + inserts/extracts). But when we look at both at once we
see that we only need two loads/stores in total (plus some interleaving
operations).

Yes, I've been studying SCEV while trying to understand some other patterns
where the vectorizer was unable to detect the exit count (basically this
case, with a nested loop). It does make it easier to spot patterns in
the code.

The patch I attached here was not to help vectorize anything, but to let me
jump over the validation step, so that I could start working with the
patterns themselves during the actual vectorization. The review request was
only to understand if the checks I was making made sense, but it turned out
a lot more valuable than that.

Getting this example in the slp vectorizer is easier but we won’t get the

most efficient code (i.e. the one that gcc emits) because we would have <3
x i8> stores/loads. With vectorization of interleaved data you can
load/store more elements (from several iterations) with a single load.

So, these were the other patterns I was looking for, as a stepping stone into
the full vectorizer. But I'm not sure this will help the strided access in
any way.

Either representation should be fine. I think the bigger task is not

recognizing the induction but recognizing consecutive strided memory
accesses, though. First, I think we want to be able to do:

for (i = 0 … n, +1)
  a[3*i] =
  a[3*i+1] =
  a[3*i+2] =

And next,

for (i = 0 … n, +1)
  *a++ =
  *a++ =
  *a++ =

Because to get the latter, you need the former.

Makes total sense. I'll change my approach.

Have you compared the performance of the kernel (gcc vectorized) you showed

vs a scalar version? I would be curious about the speed-up.

4x faster, on both Cortex A9 and A15. :)

Thanks for the tips, I hope I can find more time to work on it this week,
since Linaro Connect is in the coming week and the US dev meeting is on the
next.

cheers,
--renato

Whenever you find time to work on this will be of great help!

I wanted to work on this at some point, but my near-term tasks won’t allow it. So I much appreciate any help!

Thanks,
Arnold

4x faster, on both Cortex A9 and A15. :)

Nice.

Hi Arnold, Nadav,

Let me resurrect this discussion a bit.

There are a few things that I need to check while validating a stride-access
loop:
1. Whether there is an induction variable with stride > 1. This is easy,
and can be done by extending isInductionVariable() to add a Step
to InductionInfo;
2. Whether the reduction variables' accesses are sequential and compatible
with the stride. This is the core of the change, and will require SCEV
computation, as extensively described by Arnold;
3. Whether the reduction variable has any use outside of the loop. This will
need to account for global variables, which are, in themselves, a use of the
computed value.

The problem I'm having, and why I'm resurrecting this thread, is in this
simple code:

#define OFFSET 255
extern char b[OFFSET];
extern char DELTA;
char *fn1 (char a[]) {
    unsigned i;

    for (i=0; i<OFFSET; i += 3) {
      a[i] = OFFSET - b[i] - DELTA;
      a[i+1] = OFFSET - b[i+1] - DELTA;
      a[i+2] = OFFSET - b[i+2] - DELTA;
    }
    return &a[0];
}

This is the most trivial example. The exit count is constant, the
operations are the same and compatible with the stride of 3, and the
variable "a" is used outside the loop. However, by the time the
vectorizer sees it, earlier optimizations have already run, leaving the IR
very similar to my original example:

for.body: ; preds = %entry,
%for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds [255 x i8]* @b, i64 0, i64 %indvars.iv
  %0 = load i8* %arrayidx, align 1
  %1 = xor i8 %0, -1
  %2 = load i8* @DELTA, align 1
  %sub2 = sub i8 %1, %2
  %arrayidx5 = getelementptr inbounds i8* %a, i64 %indvars.iv
  store i8 %sub2, i8* %arrayidx5, align 1
  ...
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 3
  %11 = trunc i64 %indvars.iv.next to i32
  %cmp = icmp ult i32 %11, 255
  br i1 %cmp, label %for.body, label %for.end

So, as you can see, my original patch to recognize pointer reductions is
still relevant, since I'll need to act on the list of induction and
reduction variables to discover if the reduction's stride is compatible
with the induction's step.

Also, I'll need that to make sure that the reduction in question is
accessible outside the loop (via global / extern qualifiers), while ignoring
load/store instructions that won't have any users, because the variables are
not used directly, but through &a[0].

Arnold,

I agree with you that the analysis should not be done exclusively trying to
manipulate the pointer reduction variables (but using SCEV as you
describe), but I can't see how I can get past the validation phase if, at
least, I don't recognize the GEP PHIs as a reduction variable.

cheers,
--renato

I don’t think that recognizing this as a reduction is going to get you far. A reduction is beneficial if the value reduced is only truly needed outside of a loop.
This is not the case here (we are storing/loading from the pointer).

Hi Arnold, Nadav,

Let me resurrect this discussion a bit.

There are a few things that I need to check while validating a stride-access loop:
1. Whether there is an induction variable with stride > 1. This is easy, and can be done by extending isInductionVariable() to add a Step to InductionInfo;

Yes.

2. Whether the reduction variables' accesses are sequential and compatible with the stride. This is the core of the change, and will require SCEV computation, as extensively described by Arnold.
3. Whether the reduction variable has any use outside of the loop. This will need to account for global variables, which are, in themselves, a use of the computed value.

In the examples you gave there are no reduction variables in the loop vectorizer’s sense, but they all have memory accesses that are strided.

So for your example below we would need two things:

1.) analyze the memory accesses in the loop for strided accesses and deal with them appropriately during analysis and transformation
2.) be able to handle loops with non-unit stride integer induction variables - this only makes sense if we have support for 1.)

There is an even simpler example that would just deal with strided accesses:

for (i=0; i < OFFSET/3; i++) {
  a[3*i] = OFFSET - b[3*i] - DELTA
  a[3*i+1] = ….
}

Once we can handle this example, we can extend our induction variable checks to handle loops with stride (because then it would make sense to handle them).

Best,
Arnold

In the examples you gave there are no reduction variables in the loop
vectorizer’s sense. But, they all have memory accesses that are strided.

This is what I don't get. As far as I understood, a reduction variable is
the one that aggregates the computation done by the loop, and is used
outside the loop.

In my example, I'm aggregating a computation in an array and returning this
array for later use; what am I missing here?

1.) analyze the memory accesses in the loop for strided accesses and deal

with them appropriately during analysis and transformation
2.) be able to handle loops with non-unit stride integer induction
variables - this only makes sense if we have support for 1.)

So, despite all this, we still have to pass validation, so canVectorize()
must return true. As it stands, none of my examples can vectorize because of
the extra memory PHI, and your example below can't vectorize because it
can't find the array bounds.

I'm assuming that, as soon as I teach the validation to accept your loop
(given additional checks), and teach about the strides and costs, it will
vectorize. But if I go back to my original code, it won't, because of the
reduction PHI.

So, again, I agree with you: one step at a time. I'll work with your loop,
because it's the straightest path from here. But at some point, I'll have to
either identify the memory PHIs or transform the loop to avoid them
altogether. Does that make sense?

cheers,
--renato

In the examples you gave there are no reduction variables in the loop vectorizer’s sense. But, they all have memory accesses that are strided.

This is what I don't get. As far as I understood, a reduction variable is the one that aggregates the computation done by the loop, and is used outside the loop.

Yes. But:

for (i = …)
  a[i] = b[i]
  a[i+1] = b[i+1]
  …

is not such a reduction. Not even when the code disguises itself like this:

for (i= …)
*a++ = *b++
*a++ = *b++

A reduction is something like:

for (i= …) {
r+= a[i];
}
return r;

here the value of “r” is only truly used by the reduction itself and outside of the loop.

In your example you have

for ()
  *a =
  a++;
  …

the “*a=” part being a true use in the loop. This is a problem because at that point you would have to reduce the vectorized reduction.

I.e if we wrote “a" as (pointer reduction, VF=2) as:

  %a_red = phi <2 x i32*> [preheader, <%a, 0>], [loop, %a_red_next]

  %a_red_next = gep %a_red, <3, 3>

you would have to compute the value of “a” for the loop on every iteration from the vectorized reduction.

  %a_red = phi <2 x i32*> [preheader, <%a, 0>], [loop, %a_red_next]

  %a = horizontal_add %a_red // This is executed on every iteration.
    = load a
  …
  %a_red_next = gep %a_red, <3, 3>

Something like the above wants to be a simple induction

  %a_ind = phi i32* [preheader, %a], [loop, %a_next]
   … use of %a_ind
   ...
  %a_next = gep %a_ind, 6

The horizontal reduction i mention above is not needed if you have no use inside the loop like in the case of:

r=0
for (i = …) {
  r += a[i];
}
return r;

This is simply (i am leaving out the induction variable for “i”):

  %r_red = phi <2 x i32> [preheader, <%r, 0>] , [loop, %r_red_next]
  %a_ptr = gep %a, %i
  %val_of_a_sub_i = load <2 x i32> * %a_ptr

  %r_red_next = add %r_red, %val_of_a_sub_i

Outside of the loop we reduce the vectorized reduction to the final value of “r”

%r= horizontal_add %r_red_next
ret %r
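Expressed as scalar C, the same structure looks like this — a sketch of what the vectorizer does virtually with VF=2, with all names invented for illustration. The two partial sums play the role of the <2 x i32> reduction phi, and the horizontal add happens exactly once, outside the loop:

```c
#include <stddef.h>

/* Virtual VF=2 reduction: r_red holds two independent partial sums
   (the vector lanes); they are combined by a single horizontal add
   after the loop, never inside it. */
int sum_reduce(const int *a, size_t n) {
    int r_red[2] = {0, 0};
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        r_red[0] += a[i];        /* lane 0 */
        r_red[1] += a[i + 1];    /* lane 1 */
    }
    int r = r_red[0] + r_red[1]; /* horizontal add, outside the loop */
    for (; i < n; i++)           /* scalar epilogue for odd n */
        r += a[i];
    return r;
}
```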

In my example, I'm aggregating a computation in an array and returning this array for later use, what am I missing here?

See above. You are not really aggregating in the spirit of reductions as used in the loop vectorizer. Every array element is only accessed once, so this is not really an aggregation.

1.) analyze the memory accesses in the loop for strided accesses and deal with them appropriately during analysis and transformation
2.) be able to handle loops with non-unit stride integer induction variables - this only makes sense if we have support for 1.)

So, despite all this, we still have to pass validation, so canVectorize() must return true. As it stands, none of my examples can vectorize because of the extra memory PHI, and your example below can't vectorize because it can't find the array bounds.

Yes because we have to teach the vectorizer about pointer induction variables that stride for your example.

Your code - as far as I can reconstruct it from memory - looks something like

preheader:

loop:
%strided_induction_pointer = phi [preheader, %b], [loop, %str_ind_ptr_inc]
… = load %strided_induction_pointer
%b_1 = gep %strided_induction_pointer, 1
… = load %b_1
%b_2 = gep %strided_induction_pointer, 2
… = load %b_2
%str_ind_ptr_inc = gep %strided_induction_pointer, 3 // induction variable that strides by 3
%cmp = …
br %cmp, loop, exit

exit:

%strided_induction_pointer here is a pointer induction not a reduction, because a reduction precludes uses like the loads in the loop.

My example probably gives you this answer because it is trying to add runtime checks that ensure that

a[3*i] =
a[3*i+1] =
a[3*i+2] =

don’t overlap for i in {0 … n}. Since we check overlapping by checking a range, this would fail anyway.

Teaching the legality logic that we can still vectorize this code - because the accesses are strided (and don’t overlap) - is part of the problem to solve when I say we have to teach the vectorizer how to handle interleaved/strided data.

I'm assuming that, as soon as I teach the validation to accept your loop (given additional checks), and teach about the strides and costs, it will vectorize. But if I go back to my original code, it won't, because of the reduction PHI.

So, again, I agree with you: one step at a time. I'll work with your loop, because it's the straightest path from here. But at some point, I'll have to either identify the memory PHIs or transform the loop to avoid them altogether. Does that make sense?

You don’t need to transform them in the legality phase. Believe me ;). Look at how we handle stride one pointer inductions at the moment (they are also memory phis) - they are based off a canonical induction variable that we create during the actual vectorization. Everything before that is done virtually without having to transform code until we actually know we want to vectorize it.

Basically for pointer inductions we store the start value. When we come to actually vectorize the loop we create a canonical induction variable that starts at zero and strides by the vector factor (VF). Any use of the original pointer induction variable in the vectorized loop is simply:

  Builder.CreateGEP(II.StartValue, canonical_induction_variable);

For a non-unit pointer induction this would be

  Builder.CreateGEP(II.StartValue, stride * canonical_induction_variable);

(I have simplified the problem a little; the code in the vectorizer is a bit more complicated, but it essentially boils down to the above.)
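In C terms, the rewrite sketched above replaces the strided pointer phi with an address computed from the loop's canonical induction variable. The two functions below are an illustrative sketch (names invented); they store to exactly the same locations:

```c
#include <stddef.h>

/* Pointer-induction form: the phi is p, bumped by 3 each iteration. */
void fill_ptr_phi(char *base, size_t n) {
    char *p = base;
    for (size_t i = 0; i < n; i++) {
        *p = (char)i;
        p += 3;                 /* non-unit stride pointer induction */
    }
}

/* Canonical-IV form: every use of the old phi becomes
   base + stride * iv, i.e. CreateGEP(StartValue, stride * iv). */
void fill_canonical_iv(char *base, size_t n) {
    for (size_t i = 0; i < n; i++)
        base[3 * i] = (char)i;
}
```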

Thanks,
Arnold

A reduction is something like:

for (i= …) {
r+= a[i];
}
return r;

Ok, so "reduction" is just a reduction in the map-reduce sense, and nothing
else.

You don’t need to transform them in the legality phase. Believe me ;). Look

at how we handle stride one pointer inductions at the moment (they are also
memory phis) - they are based off a canonical induction variable that we
create during the actual vectorization. Everything before that is done
virtually without having to transform code until we actually know we want
to vectorize it.

Oh, so that's what I was missing. When Nadav mentioned pointer reduction,
I thought that was how we should be dealing with memory PHIs in the
end. I'll see how stride-1 pointer induction traverses the code and where
to add stride N (but not before I try to teach strides to the non-pointer
cases).

I'll also add the reduction case, like:

for (i .. N/3) {
  r += a[3*i] ..;
  r += a[3*i+1] ..;
  r += a[3*i+2] ..;
}
return r;

And see how it should work. (later, too).

Basically for pointer inductions we store the start value. When we come to

actually vectorize the loop we create a canonical induction variable that
starts at zero and strides by the vector factor (VF). Any use of the
original pointer induction variable in the vectorized loop is simply:

Thanks, I'll have a look (again). ;)

I'll open some Bugzilla bugs and assign them to myself, one for each type of
loop to be vectorized, in sequence (depends on) and using some of the tests
we discussed on the mailing list. That way, we'll be able to agree on the
order and implementation beforehand. Feel free to share them with me if you
have time in the interim; I'll mostly be away for the next two weeks
(Connect, holidays, LLVM meeting).

So far, we have covered:

1. simple stride: for (N/3) { a[3*i] = b[3*i]; a[3*i+1] = b[3*i+1];
a[3*i+2] = b[3*i+2]; }
* The simplest form, needs only basic checks & vectorization logic

2. re-rolling possible: for (N/3) { a[3*i] = b[3*i] + K; a[3*i+1] =
b[3*i+1] + K; a[3*i+2] = b[3*i+2] + K; }
* can be re-rolled for simple optimization (already implemented)
* doesn't need interleaved data, even if vectorized directly via stride
access

3. re-rolling impossible: for (N/3) { a[3*i] = b[3*i] + I; a[3*i+1] =
b[3*i+1] + J; a[3*i+2] = b[3*i+2] + K; }
* can't be re-rolled and does need interleaved data access

4. reduction stride: for (N/3) { a += b[3*i]; a += b[3*i+1]; a += b[3*i+2];
} return a;
* May work with case above implemented, may need additional reduction logic
* can be any of the three types above

5. memory stride: for (N/3) { *a++ = *b++; *a++ = *b++; *a++ = *b++; }
* should be transformed into one of the cases above first, then vectorized
like them
* can be any of the three types above (1, 2, 3)

With the dependency being:

1 -> {2, 3} -> {4, 5}
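For reference, case 3 from the list written out as a compilable kernel; the function name and the I/J/K constants are illustrative, not from any test in the suite:

```c
#include <stddef.h>

/* Case 3 (re-rolling impossible): each of the three lanes adds a
   different constant, so the body cannot be re-rolled into a single
   stride-1 loop and needs interleaved access to vectorize.
   n3 is the number of 3-element groups (N/3). */
void case3(int *a, const int *b, size_t n3, int I, int J, int K) {
    for (size_t i = 0; i < n3; i++) {
        a[3 * i]     = b[3 * i]     + I;
        a[3 * i + 1] = b[3 * i + 1] + J;
        a[3 * i + 2] = b[3 * i + 2] + K;
    }
}
```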

cheers,
--renato

A reduction is something like:

for (i= …) {
r+= a[i];
}
return r;

Ok, so "reduction" is just a reduction in the map-reduce sense, and nothing else.

Yes.

You don’t need to transform them in the legality phase. Believe me ;). Look at how we handle stride one pointer inductions at the moment (they are also memory phis) - they are based off a canonical induction variable that we create during the actual vectorization. Everything before that is done virtually without having to transform code until we actually know we want to vectorize it.

Oh, so that's what I was missing. When Nadav mentioned pointer reduction, I thought that was how we should be dealing with memory PHIs in the end. I'll see how stride-1 pointer induction traverses the code and where to add stride N (but not before I try to teach strides to the non-pointer cases).

I'll also add the reduction case, like:

for (i .. N/3) {
  r += a[3*i] ..;
  r += a[3*i+1] ..;
  r += a[3*i+2] ..;
}
return r;

And see how it should work. (later, too).

Basically for pointer inductions we store the start value. When we come to actually vectorize the loop we create a canonical induction variable that starts at zero and strides by the vector factor (VF). Any use of the original pointer induction variable in the vectorized loop is simply:

Thanks, I'll have a look (again). ;)

I'll open some Bugzilla bugs and assign them to myself, one for each type of loop to be vectorized, in sequence (depends on) and using some of the tests we discussed on the mailing list. That way, we'll be able to agree on the order and implementation beforehand. Feel free to share them with me if you have time in the interim; I'll mostly be away for the next two weeks (Connect, holidays, LLVM meeting).

See you at the dev meeting then.

So far, we have covered:

1. simple stride: for (N/3) { a[3*i] = b[3*i]; a[3*i+1] = b[3*i+1]; a[3*i+2] = b[3*i+2]; }
* The simplest form, needs only basic checks & vectorization logic

Yes, I think this is what we should start with. This is similar to 3 (4 should just fall out).

2. re-rolling possible: for (N/3) { a[3*i] = b[3*i] + K; a[3*i+1] = b[3*i+1] + K; a[3*i+2] = b[3*i+2] + K; }
* can be re-rolled for simple optimization (already implemented)
* doesn't need interleaved data, even if vectorized directly via stride access

3. re-rolling impossible: for (N/3) { a[3*i] = b[3*i] + I; a[3*i+1] = b[3*i+1] + J; a[3*i+2] = b[3*i+2] + K; }
* can't be re-rolled and does need interleaved data access

4. reduction stride: for (N/3) { a += b[3*i]; a += b[3*i+1]; a += b[3*i+2]; } return a;
* May work with case above implemented, may need additional reduction logic
* can be any of the three types above

5. memory stride: for (N/3) { *a++ = *b++; *a++ = *b++; *a++ = *b++; }
* should be transformed into one of the cases above first, then vectorized like them
* can be any of the three types above (1, 2, 3)

This will need support for non-unit stride pointer inductions.

With the dependency being:

1 -> {2, 3} -> {4, 5}

Yep.