[BBVectorizer] Obvious vectorization benefit, but req-chain is too short

Hi Hal,

this is one of the first test cases for which I would love to have improved vectorizer support. I sent it out earlier, but now that the vectorizer has been committed I think it is a good time to look into it again.

The basic example is a set of scalar loads that load four consecutive elements and store them right back. For me this is an obvious case where vectorization is beneficial (scalar.ll):

define i32 @main() nounwind {
  %V1 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 0), align 16
  %V2 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 1), align 4
  %V3 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 2), align 8
  %V4 = load float* getelementptr ([1024 x float]* @A, i64 0, i64 3), align 4
  store float %V1, float* getelementptr ([1024 x float]* @B, i64 0, i64 0), align 16
  store float %V2, float* getelementptr ([1024 x float]* @B, i64 0, i64 1), align 4
  store float %V3, float* getelementptr ([1024 x float]* @B, i64 0, i64 2), align 8
  store float %V4, float* getelementptr ([1024 x float]* @B, i64 0, i64 3), align 4
  ret i32 0
}

opt -O3 -vectorize cannot optimize this straight away, as the req-chain is too short.
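
For reference, the invocations look roughly like this (-S and the output
file are only there to get textual IR; scalar.ll is the attached test
case and out.ll is an arbitrary output name):

  opt -O3 -vectorize scalar.ll -S -o out.ll
  opt -O3 -vectorize -bb-vectorize-req-chain-depth=2 scalar.ll -S -o out.ll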

Adding -bb-vectorize-req-chain-depth=2 allows us to vectorize the code:

define i32 @main() nounwind {
  %V1 = load <4 x float>* bitcast ([1024 x float]* @A to <4 x float>*), align 16
  store <4 x float> %V1, <4 x float>* bitcast ([1024 x float]* @B to <4 x float>*), align 16
  ret i32 0
}

Is there any way we can make this case work by default? Maybe we can decrease the req-chain to 2 and increase the cost for non-stride-one loads or stores?

Another, probably unrelated, point: I also tried a run with -bb-vectorize-req-chain-depth=1. The generated code is full of shufflevector instructions and eight-element vectors. For me this is entirely unexpected. Do you have any idea what is going on here?

Tobi

scalar.ll (1.1 KB)

vector.ll (586 Bytes)

> Is there any way we can make this case work by default? Maybe we can
> decrease the req-chain to 2 and increase the cost for non-stride-one
> loads or stores?

Making the default chain length 2 will lead to a lot of unprofitable
vectorization. I think we'll probably want to do something like make
getDepthFactor return 3 for loads and stores (or make the default chain
length 4 and make getDepthFactor return 2 for loads and stores). We
should experiment with this [this was already on my post-commit TODO
list].
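
A minimal sketch of the first variant (assuming getDepthFactor is the
per-instruction weight that gets summed along a candidate pair chain;
the real function in BBVectorize.cpp may differ in signature and
detail):

#include "llvm/IR/Instructions.h"

using namespace llvm;

// Hypothetical sketch, not the actual BBVectorize.cpp source: give
// memory operations a larger depth factor so that a plain load -> store
// chain satisfies the assumed default required chain depth of 6 on its
// own.
static size_t getDepthFactor(Value *V) {
  // Loads and stores each count for three "levels" of the chain; a
  // load feeding a store then sums to 6.
  if (isa<LoadInst>(V) || isa<StoreInst>(V))
    return 3;
  // Everything else contributes a single level, as before.
  return 1;
}

With such a weighting, each of the four load/store pairs in scalar.ll
would reach the threshold even though no arithmetic sits between the
loads and the stores.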

> Another, probably unrelated, point: I also tried a run with
> -bb-vectorize-req-chain-depth=1. The generated code is full of
> shufflevector instructions and eight-element vectors. For me this is
> entirely unexpected. Do you have any idea what is going on here?

A chain length of 1 means "vectorize any pairs that you possibly can",
and it will do this iteratively until it cannot do so any more. As the
iteration continues, it will pair the previously-paired instructions
until the requested bit limit is reached, and so you'll end up with long
vectors (of short types).

Thanks again,
Hal

> Is there any way we can make this case work by default? Maybe we can
> decrease the req-chain to 2 and increase the cost for non-stride-one
> loads or stores?

Try it now (after r149761). If this "solution" causes other problems,
then we may need to think of something more sophisticated.

-Hal

Hello,

Thanks for your work on the bb-vectorizer. It looks like a
promising pass to be used for multi-work-item vectorization in
pocl.

> Try it now (after r149761). If this "solution" causes other problems,
> then we may need to think of something more sophisticated.

I wonder if the case where a store is the last user of a value could be
treated as a special case. Code that reads, computes, and writes values
in a fully parallelizable (unrolled) loop is an optimal case for
vectorization, as it might not incur any unpack/pack overhead at all.

In the case of the bb-vectorizer (if I understood the parameters
correctly), if the final store (or, actually, any final consumer of a
value) is weighted more heavily in the "chain length computation", it
could allow using a large chain-length threshold and still pick up these
"embarrassingly parallel loop" cases where there are potentially many
updates to different variables in memory, but only short preceding
computation chains. This type of embarrassingly parallel loop is the
basic case when vectorizing multiple instances of OpenCL C kernels,
which are parallel by definition.

E.g. a case where the kernel does something like:

A = read mem
B = read mem
C = add A, B
write C to mem

D = read mem
E = read mem
F = mul D, E
write F to mem

When this is parallelized N times in the work group, the vectorizer
might fail to vectorize multiple "kernel iterations" properly, as the
independent computation chains/live ranges (e.g. from D to F) are quite
short. Still, vectorization is very beneficial here since, as we know,
fully parallel loops vectorize perfectly without unpack/pack overheads,
provided all the operations can be vectorized (as is the case here,
where one can scale the work-group size to match the vector width).

BR,

> Thanks for your work on the bb-vectorizer. It looks like a
> promising pass to be used for multi-work-item vectorization in
> pocl.
>
> > Try it now (after r149761). If this "solution" causes other problems,
> > then we may need to think of something more sophisticated.
>
> I wonder if the case where a store is the last user of a value could be
> treated as a special case. Code that reads, computes, and writes values
> in a fully parallelizable (unrolled) loop is an optimal case for
> vectorization, as it might not incur any unpack/pack overhead at all.

The fix that I implemented was to give loads and stores each half of the
required effective chain depth to trigger vectorization. I believe that
this reflects a useful bias toward vectorizing memory operations over
other things, although we'll need to see how it interacts with other
aspects of the heuristic. Some more context-sensitive rule, such as the
one you proposed, might be better, but we'd need to think about it
carefully. For example, is it better to prioritize chains that end with
a store and not those that start with a load? Maybe we should do this
only for aligned loads and stores?
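
Roughly, that rule looks like this (a hypothetical illustration, not the
actual r149761 change; how the required depth reaches this computation
is an assumption here):

#include "llvm/IR/Instructions.h"

using namespace llvm;

// Each load or store is worth half of the required chain depth, so a
// load feeding a store reaches the threshold with nothing else in the
// chain.
static size_t depthFactor(Value *V, size_t RequiredChainDepth) {
  if (isa<LoadInst>(V) || isa<StoreInst>(V))
    return RequiredChainDepth / 2;
  return 1;
}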

> In the case of the bb-vectorizer (if I understood the parameters
> correctly), if the final store (or, actually, any final consumer of a
> value) is weighted more heavily in the "chain length computation", it
> could allow using a large chain-length threshold and still pick up
> these "embarrassingly parallel loop" cases where there are potentially
> many updates to different variables in memory, but only short preceding
> computation chains.

Yes.

> This type of embarrassingly parallel loop is the basic case when
> vectorizing multiple instances of OpenCL C kernels, which are parallel
> by definition.

> E.g. a case where the kernel does something like:
>
> A = read mem
> B = read mem
> C = add A, B
> write C to mem
>
> D = read mem
> E = read mem
> F = mul D, E
> write F to mem
>
> When this is parallelized N times in the work group, the vectorizer
> might fail to vectorize multiple "kernel iterations" properly, as the
> independent computation chains/live ranges (e.g. from D to F) are quite
> short. Still, vectorization is very beneficial here since, as we know,
> fully parallel loops vectorize perfectly without unpack/pack overheads,
> provided all the operations can be vectorized (as is the case here,
> where one can scale the work-group size to match the vector width).

Indeed, and I agree that these cases should be vectorized by default
(and, after r149761, they now should be). If they are not, please send a
test case.

The tricky part comes in figuring out exactly how to penalize chains
that appear to require permutations in them. This is tricky because it
is really target dependent: the relative costs are different on
different hardware, and some kinds of permutations can be folded "for
free" into certain other operations on some targets. So currently I
ignore this problem, but it will need to be dealt with soon.

I think that in practice we'll need to experiment with different
approaches in order to see what works best. So try it out, and if it is
doing something detrimental or not doing something obviously beneficial,
then we'll fix it.

Thanks for looking at this,
Hal