loop vectorizer

The loop vectorizer seems unable to vectorize the following code:

#include <cstdint>

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start ; i < end ; ++i ) {
     const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
     const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
     c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
   }
}

LV: Found a loop: for.body
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer %arrayidx11 = getelementptr inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw>
LV: Bad stride - Not an AddRecExpr pointer %arrayidx15 = getelementptr inbounds float* %c, i64 %add8 SCEV: ((4 * %add8)<nsw> + %c)<nsw>
LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 * %add8)<nsw> + %c)<nsw>(Induction step: 0)
LV: Distance for store float %add10, float* %arrayidx11, align 4 to store float %add14, float* %arrayidx15, align 4: ((4 * %add8)<nsw> + (-4 * %add2))
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

Here is the IR:

entry:
   %cmp14 = icmp ult i64 %start, %end
   br i1 %cmp14, label %for.body, label %for.end

for.body: ; preds = %entry, %for.body
   %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
   %div = lshr i64 %i.015, 2
   %mul = shl i64 %div, 3
   %rem = and i64 %i.015, 3
   %add2 = or i64 %mul, %rem
   %add8 = or i64 %add2, 4
   %arrayidx = getelementptr inbounds float* %a, i64 %add2
   %0 = load float* %arrayidx, align 4
   %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
   %1 = load float* %arrayidx9, align 4
   %add10 = fadd float %0, %1
   %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
   store float %add10, float* %arrayidx11, align 4
   %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
   %2 = load float* %arrayidx12, align 4
   %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
   %3 = load float* %arrayidx13, align 4
   %add14 = fadd float %2, %3
   %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
   store float %add14, float* %arrayidx15, align 4
   %inc = add i64 %i.015, 1
   %exitcond = icmp eq i64 %inc, %end
   br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry
   ret void

Why is it saying 'Bad stride'? Are the 'rem' and 'div' instructions asking too much of it?

The code should be vectorizable. Here are the index accesses for start=0, end=16:

loop count i = 0 index_0 = 0 index_1 = 4
loop count i = 1 index_0 = 1 index_1 = 5
loop count i = 2 index_0 = 2 index_1 = 6
loop count i = 3 index_0 = 3 index_1 = 7
loop count i = 4 index_0 = 8 index_1 = 12
loop count i = 5 index_0 = 9 index_1 = 13
loop count i = 6 index_0 = 10 index_1 = 14
loop count i = 7 index_0 = 11 index_1 = 15
loop count i = 8 index_0 = 16 index_1 = 20
loop count i = 9 index_0 = 17 index_1 = 21
loop count i = 10 index_0 = 18 index_1 = 22
loop count i = 11 index_0 = 19 index_1 = 23
loop count i = 12 index_0 = 24 index_1 = 28
loop count i = 13 index_0 = 25 index_1 = 29
loop count i = 14 index_0 = 26 index_1 = 30
loop count i = 15 index_0 = 27 index_1 = 31
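
For reference, the table above can be reproduced with a throwaway driver along these lines (a sketch of mine, not part of the library):

#include <cstdint>
#include <cstdio>

int main()
{
   const std::uint64_t inner = 4;
   // same index computation as in bar(), printed for start=0, end=16
   for (std::uint64_t i = 0; i < 16; ++i) {
      const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
      const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
      std::printf("loop count i = %llu index_0 = %llu index_1 = %llu\n",
                  (unsigned long long)i, (unsigned long long)ir0,
                  (unsigned long long)ir1);
   }
   return 0;
}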

Any ideas?

Frank

Hi Frank,

The access pattern to arrays a and b is non-linear. Unrolled loops are usually handled by the SLP-vectorizer. Are ir0 and ir1 consecutive for all values of i?

Thanks,
Nadav

Based on his list of values, it seems that the induction stride is linear
within each block of 4 iterations, but across blocks the relationship is
not linear.

As you say, it should be possible to spot that once the loop is unrolled,
and get the SLP to vectorize if the relationship becomes clear.

Maybe I'm wrong, but this looks like a missed opportunity, not something
technically hard to implement.

--renato

I ran the BB vectorizer, as I guessed that it is the SLP vectorizer.

BBV: using target information
BBV: fusing loop #1 for for.body in _Z3barmmPfS_S_...
BBV: found 2 instructions with candidate pairs
BBV: found 0 pair connections.
BBV: done!

However, this was run on the unrolled loop (I guess).

Here is the IR printed by 'opt':

entry:
   %cmp9 = icmp ult i64 %start, %end
   br i1 %cmp9, label %for.body, label %for.end

for.body: ; preds = %entry, %for.body
   %storemerge10 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
   %div = lshr i64 %storemerge10, 2
   %mul1 = shl i64 %div, 3
   %rem = and i64 %storemerge10, 3
   %add2 = or i64 %mul1, %rem
   %0 = lshr i64 %storemerge10, 1
   %add51 = shl i64 %0, 2
   %mul6 = or i64 %rem, %add51
   %add8 = or i64 %mul6, 4
   %arrayidx = getelementptr inbounds float* %a, i64 %add2
   %1 = load float* %arrayidx, align 4
   %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
   %2 = load float* %arrayidx9, align 4
   %add10 = fadd float %1, %2
   %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
   store float %add10, float* %arrayidx11, align 4
   %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
   %3 = load float* %arrayidx12, align 4
   %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
   %4 = load float* %arrayidx13, align 4
   %add14 = fadd float %3, %4
   %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
   store float %add14, float* %arrayidx15, align 4
   %inc = add i64 %storemerge10, 1
   %exitcond = icmp eq i64 %inc, %end
   br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry
   ret void

Are you saying that I should first unroll the loop by a given factor and then run SLP again? How would I do that, say, for a factor of 2?

Frank

I ran the BB vectorizer, as I guessed that it is the SLP vectorizer.

No; while the BB vectorizer does a form of SLP vectorization, there is a separate SLP vectorization pass which uses a different algorithm. You can pass -vectorize-slp to opt.
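
For example, writing the output to a new file so you can diff it against the input:

opt -vectorize-slp -S loop.ll -o loop_slp.ll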

-Hal

The access pattern to arrays a and b is non-linear. Unrolled loops
are usually handled by the SLP-vectorizer. Are ir0 and ir1
consecutive for all values of i?

One problem that you might run into currently is that the loop vectorizer, which is the only pass that, by default, will do partial unrolling, is run after SLP vectorization. The regular unroller can do partial unrolling, but unless the target overrides the relevant TTI interface and changes the default, the -unroll-allow-partial or -unroll-runtime flags would need to be specified.
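
For example, to partially unroll by a factor of 2 and then run the SLP vectorizer, something like this should work (untested sketch; -unroll-count is the generic opt option for forcing the unroll factor):

opt -loop-unroll -unroll-allow-partial -unroll-count=2 -S loop.ll -o loop_unrolled.ll
opt -vectorize-slp -S loop_unrolled.ll -o loop_slp.ll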

-Hal

The SLP vectorizer apparently did something in the prologue of the function (where storing of arguments on the stack happens) which then got eliminated later on (since I don't see any vector instructions in the final IR). Below is the debug output of the SLP pass:

Args: opt -O1 -vectorize-slp -debug loop.ll -S
SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 2 stores to vectorize.
SLP: Analyzing a store chain of length 2.
SLP: Trying to vectorize starting at PHIs (1)
SLP: Vectorizing a list of length = 2.

IR produced:

define void @_Z3barmmPfS_S_(i64 %start, i64 %end, float* noalias nocapture %c, float* noalias nocapture readonly %a, float* noalias nocapture readonly %b) #3 {
entry:
   %cmp14 = icmp ult i64 %start, %end
   br i1 %cmp14, label %for.body, label %for.end

for.body: ; preds = %entry, %for.body
   %i.015 = phi i64 [ %inc, %for.body ], [ %start, %entry ]
   %div = lshr i64 %i.015, 2
   %mul = shl i64 %div, 3
   %rem = and i64 %i.015, 3
   %add2 = or i64 %mul, %rem
   %add8 = or i64 %add2, 4
   %arrayidx = getelementptr inbounds float* %a, i64 %add2
   %0 = load float* %arrayidx, align 4
   %arrayidx9 = getelementptr inbounds float* %b, i64 %add2
   %1 = load float* %arrayidx9, align 4
   %add10 = fadd float %0, %1
   %arrayidx11 = getelementptr inbounds float* %c, i64 %add2
   store float %add10, float* %arrayidx11, align 4
   %arrayidx12 = getelementptr inbounds float* %a, i64 %add8
   %2 = load float* %arrayidx12, align 4
   %arrayidx13 = getelementptr inbounds float* %b, i64 %add8
   %3 = load float* %arrayidx13, align 4
   %add14 = fadd float %2, %3
   %arrayidx15 = getelementptr inbounds float* %c, i64 %add8
   store float %add14, float* %arrayidx15, align 4
   %inc = add i64 %i.015, 1
   %exitcond = icmp eq i64 %inc, %end
   br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry
   ret void
}

Does this ultimately mean that current LLVM cannot vectorize this function?

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start ; i < end ; ++i ) {
     const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
     const std::uint64_t ir1 = ( (i/inner) * 2 + 1 ) * inner + i%4;
     c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
   }
}

I was trying the following:

clang++ -emit-llvm -S loop.cc -std=c++11
(this writes loop.ll)

opt -O1 -bb-vectorize -debug loop.ll -S
(I found the -O1 useful to prepare the IR for the loop vectorizer, so I hoped this would work here as well)

opt -O1 -loop-vectorize -debug-only=loop-vectorize loop.ll -S

opt -loop-unroll -unroll-allow-partial -debug loop.ll -S
(this didn't unroll the loop). Here is the relevant loop-unroll output:
Loop Unroll: F[_Z3barmmPfS_S_] Loop %for.cond

Are there any other combinations of passes that I might try?

Frank

The debug messages are misleading. They should read "trying to vectorize a list of …". The problem is that the SCEV analysis is unable to detect that c[ir0] and c[ir1] are consecutive. Is this loop from an important benchmark?

Thanks,
Nadav

-O1 might not run all of the more aggressive passes that prepare the loop
for the vectorizers. Not all loops will vectorize at all optimization
levels, even though the vectorizers are enabled on most of them.

Still, this doesn't mean LLVM will vectorize this (or any) specific loop on
a specific optimization level. Using "-O3 -debug-only=..." will help you
understand the issues, reduce cases, report bugs and improvement requests.

cheers,
--renato

Well, they are not directly consecutive. They are consecutive with a constant offset or stride:

ir1 = ir0 + 4
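
Concretely, writing i = 4*q + r with r = i % 4 (so q = i / 4):

ir0 = (2*q + 0)*4 + r = 8*q + r
ir1 = (2*q + 1)*4 + r = 8*q + r + 4 = ir0 + 4

So the two accesses are always exactly 4 elements apart, and each index is affine in q and r, just not in i itself.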

If I rewrite the function in this form

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start ; i < end ; ++i ) {
     const std::uint64_t ir0 = ( (i/inner) * 2 + 0 ) * inner + i%4;
     const std::uint64_t ir1 = ir0 + 4;
     c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
   }
}

still neither the SLP nor the loop vectorizer has any effect:

SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 2 stores to vectorize.
SLP: Analyzing a store chain of length 2.
SLP: Trying to vectorize starting at PHIs (1)
SLP: Vectorizing a list of length = 2.

LV: Checking a loop in "_Z3barmmPfS_S_"
LV: Found a loop: for.body
LV: Found an induction variable.
LV: We need to do 0 pointer comparisons.
LV: Checking memory dependencies
LV: Bad stride - Not an AddRecExpr pointer %arrayidx6 = getelementptr inbounds float* %c, i64 %add2 SCEV: ((4 * %add2)<nsw> + %c)<nsw>
LV: Bad stride - Not an AddRecExpr pointer %arrayidx10 = getelementptr inbounds float* %c, i64 %add312 SCEV: ((4 * %add312)<nsw> + %c)<nsw>
LV: Src Scev: ((4 * %add2)<nsw> + %c)<nsw>Sink Scev: ((4 * %add312)<nsw> + %c)<nsw>(Induction step: 0)
LV: Distance for store float %add5, float* %arrayidx6, align 4 to store float %add9, float* %arrayidx10, align 4: ((4 * %add312)<nsw> + (-4 * %add2))
Non-consecutive pointer access
LV: We don't need a runtime memory check.
LV: Can't vectorize due to memory conflicts
LV: Not vectorizing.

To answer Nadav's question: this kind of loop is generated by a scientific library, and we are in the process of evaluating whether LLVM can be used for this research project. The target architectures will have (very wide) vector instructions, and these loops are performance-critical to the application. Thus it is important that these loops can make use of the vector units.

Right now, it seems LLVM cannot vectorize these loops. We might have some time to look into this, but that's not certain yet. However, high-level guidance from LLVM pros would be very useful.

What is the usual way of requesting a feature improvement? Is this mailing list the central place to communicate?

Thanks,
Frank

Hi Frank,

To answer Nadav's question: this kind of loop is generated by a scientific library, and we are in the process of evaluating whether LLVM can be used for this research project. The target architectures will have (very wide) vector instructions, and these loops are performance-critical to the application. Thus it is important that these loops can make use of the vector units.

Does your CPU have good scatter/gather support? It will be easy to add support for scatter/gather operations to the LLVM loop vectorizer. The current design focuses on SIMD vectors, and it probably does not have all of the features needed for wide-vector vectorization.

Right now, it seems LLVM cannot vectorize these loops. We might have some time to look into this, but that's not certain yet. However, high-level guidance from LLVM pros would be very useful.

What is the usual way of requesting a feature improvement? Is this mailing list the central place to communicate?

You can open bugs on Bugzilla, but I think that the best way to move forward is to continue the discussion on the mailing list. Are there other workloads that are important to you? Are there any other problems that you ran into?

Thanks,
Nadav

Hi Nadav,

We are looking at a variety of target architectures. Ultimately we aim to run on BG/Q and Intel Xeon Phi (native). However, running on those architectures with LLVM is planned for the future. As a first step, we would target vanilla x86 with SSE/AVX (128/256-bit) as a proof of concept.

Most of our generated functions implement pure data-parallel operations which suit vector instructions. There are of course some kernels that require scatter/gather but I don't worry about those right now.

What I don't understand: how can the loop vectorizer be good at a small vector size but not so good at a large one? (I guess this is what you mean by a SIMD vector being a 'small vector'.) Isn't this functionality completely generic in the loop vectorizer, so that its algorithm doesn't care about the actual 'width' of the vector?

Why did you bring up gather/scatter instructions? The test function doesn't make use of them. What's the role of gather/scatter in the loop vectorizer? I know one needs to insert/extract values to/from vectors in order to use them for scalar operations. But in the case here, there are no scalar operations. That's what I mean when I say these functions implement purely data-parallel/vector operations.

Regarding whether we have other problems, that's the good news: there are no other problems. Our application already runs (and is correct) using the LLVM JIT. However, only with a data layout that's not optimal for CPU architectures. In that case the functions get vectorized, but application performance suffers from cache thrashing. Applying an optimized data layout, which maximizes cache-line reuse, introduces the 'rem' and 'div' instructions mentioned earlier, which seem to make the vectorizer (or rather the scalar evolution analysis pass) fail.

Is there fundamental functionality missing in the auto-vectorizer when the target vector size increases to 512 bits (instead of, say, 128)? And if so, why?

What needs to be done (on a high level) in order to have the auto-vectorizer succeed on the test function as given earlier?

Frank

Hi Frank,

We are looking at a variety of target architectures. Ultimately we aim to run on BG/Q and Intel Xeon Phi (native). However, running on those architectures with LLVM is planned for the future. As a first step, we would target vanilla x86 with SSE/AVX (128/256-bit) as a proof of concept.

Great! It should be easy to support these targets. When you said wide vectors, I assumed that you meant old-school vector processors. Elena Demikhovsky is working on adding AVX512 support, and once she is done things should just work. We will need to support some of the new features of AVX512, such as predication and scatter/gather, to make the most of this CPU. I don't know too much about BG/Q, but maybe Hal can provide more info.

Most of our generated functions implement pure data-parallel operations which suit vector instructions. There are of course some kernels that require scatter/gather but I don't worry about those right now.

What I don't understand: how can the loop vectorizer be good at a small vector size but not so good at a large one? (I guess this is what you mean by a SIMD vector being a 'small vector'.) Isn't this functionality completely generic in the loop vectorizer, so that its algorithm doesn't care about the actual 'width' of the vector?
Why did you bring up gather/scatter instructions? The test function doesn't make use of them.

If scatter/gather were free (or low-cost), it could allow vectorization of many more loops, because often the high cost of non-consecutive memory operations prevents vectorization.

What's the role of gather/scatter in the loop vectorizer?

Simply to load/store non-consecutive memory locations.
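
In scalar form, these are the patterns in question (illustration only; the names are mine):

// a "gather": each lane reads from an arbitrary index
void gather(float *dst, const float *src, const int *idx, int n)
{
   for (int j = 0; j < n; ++j)
      dst[j] = src[idx[j]];
}

// a "scatter": each lane writes to an arbitrary index
void scatter(float *dst, const float *src, const int *idx, int n)
{
   for (int j = 0; j < n; ++j)
      dst[idx[j]] = src[j];
}

With hardware gather/scatter, the vectorizer can issue one vector memory operation for a whole vector of idx values instead of scalarizing the accesses.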

I know one needs to insert/extract values to/from vectors in order to use them for scalar operations. But in the case here, there are no scalar operations. That's what I mean when I say these functions implement purely data-parallel/vector operations.

Regarding whether we have other problems, that's the good news: there are no other problems. Our application already runs (and is correct) using the LLVM JIT. However, only with a data layout that's not optimal for CPU architectures. In that case the functions get vectorized, but application performance suffers from cache thrashing. Applying an optimized data layout, which maximizes cache-line reuse, introduces the 'rem' and 'div' instructions mentioned earlier, which seem to make the vectorizer (or rather the scalar evolution analysis pass) fail.

Is there fundamental functionality missing in the auto-vectorizer when the target vector size increases to 512 bits (instead of, say, 128)? And if so, why?

Scatter/Gather cost model (and possibly intrinsics), support for predicated instructions, AVX512 cost model.

What needs to be done (on a high level) in order to have the auto-vectorizer succeed on the test function as given earlier?

Maybe you could rewrite the loop in a way that will expose contiguous memory accesses. Is this something you could do?

Thanks,
Nadav

Hi Frank,

>> We are looking at a variety of target architectures. Ultimately we
>> aim to run on BG/Q and Intel Xeon Phi (native). However, running on
>> those architectures with LLVM is planned for the future.
>
> When you said wide vectors, I assumed that you meant old-school
> vector processors. Elena Demikhovsky is working on adding AVX512
> support, and once she is done things should just work. We will need
> to support some of the new features of AVX512, such as predication
> and scatter/gather, to make the most of this CPU. I don't know too
> much about BG/Q, but maybe Hal can provide more info.

I'm glad to hear that you're interested in BG/Q support. Somewhat off topic, so briefly: I've not pushed QPX vector support upstream yet (and may not end up with time before 3.4 branches -- starting now is probably not prudent). The current full BG/Q patchset (source and relocatable binary RPMs) is available from http://trac.alcf.anl.gov/projects/llvm-bgq; if you're running at ALCF, it is already installed for you. When you're interested in starting BG/Q work, please follow up with me (either directly, or on our llvm-bgq-discuss list (see the link on the web page)) if you have any questions or suggestions.

-Hal

What needs to be done (on a high level) in order to have the auto-vectorizer succeed on the test function as given earlier?

Maybe you could rewrite the loop in a way that will expose contiguous memory accesses. Is this something you could do?

Hi Nadav,

The only option I see is to unroll the loop by hand. Since the array access is consecutive over 4 loop iterations, I gave it a try and unrolled the loop by a factor of 4, which gives the following array accesses:

loop iter 0:
index_0 = 0 index_1 = 4
index_0 = 1 index_1 = 5
index_0 = 2 index_1 = 6
index_0 = 3 index_1 = 7

loop iter 1:
index_0 = 8 index_1 = 12
index_0 = 9 index_1 = 13
index_0 = 10 index_1 = 14
index_0 = 11 index_1 = 15

For completeness, here is the code:

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   const std::uint64_t inner = 4;
   for (std::uint64_t i = start ; i < end ; i+=4 ) {
     {
       const std::uint64_t ir0 = ( ((i+0)/inner) * 2 + 0 ) * inner + (i+0)%4;
       const std::uint64_t ir1 = ( ((i+0)/inner) * 2 + 1 ) * inner + (i+0)%4;
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
     }
     {
       const std::uint64_t ir0 = ( ((i+1)/inner) * 2 + 0 ) * inner + (i+1)%4;
       const std::uint64_t ir1 = ( ((i+1)/inner) * 2 + 1 ) * inner + (i+1)%4;
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
     }
     {
       const std::uint64_t ir0 = ( ((i+2)/inner) * 2 + 0 ) * inner + (i+2)%4;
       const std::uint64_t ir1 = ( ((i+2)/inner) * 2 + 1 ) * inner + (i+2)%4;
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
     }
     {
       const std::uint64_t ir0 = ( ((i+3)/inner) * 2 + 0 ) * inner + (i+3)%4;
       const std::uint64_t ir1 = ( ((i+3)/inner) * 2 + 1 ) * inner + (i+3)%4;
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
       c[ ir1 ] = a[ ir1 ] + b[ ir1 ];
     }
   }
}

This should be an ideal test case for the SLP vectorizer, right?

It seems I am out of luck:

opt -O3 -vectorize-slp -debug loop.ll -S

SLP: Analyzing blocks in _Z3barmmPfS_S_.
SLP: Found 8 stores to vectorize.
SLP: Analyzing a store chain of length 8.
SLP: Trying to vectorize starting at PHIs (1)
SLP: Vectorizing a list of length = 2.

But the resulting IR does not show any vector instructions. Maybe it's me; I never got the SLP vectorizer to do anything useful. Any idea what might be going wrong?

I also tried the loop vectorizer:

opt -O3 -loop-vectorize -debug-only=loop-vectorize -debug loop.ll -S

LV: Checking a loop in "_Z3barmmPfS_S_"
LV: Found a loop: for.body
LV: SCEV could not compute the loop exit count.
LV: Not vectorizing.

Hm... This was better with the original (not hand-unrolled) loop; at least there it could find the loop exit count. Any idea why it can't find it now?

Frank

The only option I see is to unroll the loop by hand. Since the array access is consecutive over 4 loop iterations, I gave it a try and unrolled the loop by a factor of 4, which gives the following array accesses:

loop iter 0:
index_0 = 0 index_1 = 4
index_0 = 1 index_1 = 5
index_0 = 2 index_1 = 6
index_0 = 3 index_1 = 7

loop iter 1:
index_0 = 8 index_1 = 12
index_0 = 9 index_1 = 13
index_0 = 10 index_1 = 14
index_0 = 11 index_1 = 15

The SLP vectorizer detects 8 stores, but it can't prove that they are consecutive, so it moves on. Can you simplify the address expression? Can you write "index_0 = i*8 + 0" and give it a try?

I tried the following on the hand-unrolled loop:

       const std::uint64_t ir0 = i*8+0; // working

       const std::uint64_t ir0 = i%4+0; // working

       const std::uint64_t ir0 = (i+0)%4; // not working

'+0' stands for +1, +2, +3 in the subsequent unrolled iterations.

'Working' means the SLP vectorizer succeeded.

Thus, as one works 'towards' the correct index function, auto-vectorization starts to fail. However, using a simpler index function is not an option for us.

Is it possible to make the SCEV pass smarter? Or would you strongly advise against such an endeavor?

Frank

I thought this would be the case when I saw the original expression. Maybe
we need to teach modulo arithmetic to SCEV?

--renato

A small yet complete example function which all vectorization passes fail to vectorize:

#include <cstdint>
#include <iostream>

void bar(std::uint64_t start, std::uint64_t end, float * __restrict__ c, float * __restrict__ a, float * __restrict__ b)
{
   for ( std::uint64_t i = start ; i < end ; i += 4 ) {
     {
       const std::uint64_t ir0 = (i+0)%4 + 8*((i+0)/4);
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     }
     {
       const std::uint64_t ir0 = (i+1)%4 + 8*((i+1)/4);
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     }
     {
       const std::uint64_t ir0 = (i+2)%4 + 8*((i+2)/4);
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     }
     {
       const std::uint64_t ir0 = (i+3)%4 + 8*((i+3)/4);
       c[ ir0 ] = a[ ir0 ] + b[ ir0 ];
     }
   }
}

The loop index and array accesses for the first 4 iterations:

i 0: 0 1 2 3
i 4: 8 9 10 11
i 8: 16 17 18 19
i 12: 24 25 26 27

For example, on an x86 processor with SSE (128-bit SIMD vectors), the loop body could be vectorized into 2 SIMD loads, 1 SIMD add, and 1 SIMD store.
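
This works because i advances in steps of 4: for j = 0..3 we have (i+j)%4 = j and (i+j)/4 = i/4, so ir0 = j + 8*(i/4) = 2*i + j. Each iteration therefore touches the contiguous block c[2*i] .. c[2*i+3] (and likewise for a and b).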

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll

All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.

How can the SCEV AA pass be extended to handle this type of arithmetic?

Frank

With current trunk I tried the following on the above example:

clang++ -emit-llvm -S loop_minimal.cc -std=c++11
opt -O3 -vectorize-slp -S loop_minimal.ll
opt -O3 -loop-vectorize -S loop_minimal.ll
opt -O3 -bb-vectorize -S loop_minimal.ll

All optimization passes miss the opportunity. It seems the SCEV AA pass doesn't understand modulo arithmetic.

Hi Frank,

IIRC, opt -O3 already runs all three, so you don't have to add them to
the command line.

How can the SCEV AA pass be extended to handle this type of arithmetic?

I'm not an SCEV expert, so it may be that SCEV does understand it, but
the loop vectorizer isn't making use of the evolution info.

What I recommend is to step through the loop vectorizer (canVectorize) and
see what SCEV returns to the vectorizer. If the info is there but not
accounted for, it's a vectorizer bug. If the SCEV gets lost, then you need
to add it to the SCEV logic.
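
For example, with a debug/assertions build of opt, something like this (a sketch; the exact symbol name may differ in your tree):

gdb --args opt -loop-vectorize -S loop_minimal.ll
(gdb) break LoopVectorizationLegality::canVectorize
(gdb) run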

cheers,
--renato