[Proposal][RFC] Strided Memory Access Vectorization

Sorry for the spam. Copy-paste didn't capture the Subject properly. Resending with the correct Subject so that the thread is captured properly.

Thanks Hideki for going through this proposal.

I agree this can be done with the gather/scatter intrinsics as well; once we
enable those, we need to put the right costing in place. During costing we have
to estimate the cost of the load[s], store[s], and shuffle[s], and in CG Prepare
we have to lower them. In the proposed approach we have introduced a
SkipFactor, which helps reduce the number of load[s] & store[s]. E.g., for
stride 3 & VF 4 we generate only 2 loads (vs. 3) to model the load operation.
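
To make the SkipFactor idea concrete, here is a rough sketch (with made-up
value names, not the patch's exact output) of the stride-3, VF-4 load sequence:
the second load starts 6 elements after the first (SkipFactor 2), and a shuffle
extracts the strided lanes:

  %base = getelementptr inbounds i32, i32* %arr, i64 %index
  %ptr0 = bitcast i32* %base to <4 x i32>*
  %load0 = load <4 x i32>, <4 x i32>* %ptr0, align 4   ; elements 0..3
  %skip = getelementptr i32, i32* %base, i64 6         ; SkipFactor 2: elements 4 and 5 are never loaded
  %ptr1 = bitcast i32* %skip to <4 x i32>*
  %load1 = load <4 x i32>, <4 x i32>* %ptr1, align 4   ; elements 6..9
  %strided.vec = shufflevector <4 x i32> %load0, <4 x i32> %load1, <4 x i32> <i32 0, i32 3, i32 4, i32 7>   ; picks elements 0, 3, 6, 9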

> 4) Unfortunately, 3) most likely implies that the wide load used here needs
> to be based on the target HW physical vector width. There are reasons why
> such type legalization at the vectorizer should be avoided.

I did not understand this completely; the vectorizer already identifies the
maximum target HW physical vector width, chooses the VF based on it, and then
widens the instructions.

> If I remember correctly, Elena Demikhovsky (my colleague) attempted to
> introduce a stride load/store intrinsic some time ago and got pushback. I
> think it's best to address the following questions.
> a) What can this enable over a gather/scatter intrinsic w/ a constant index vector?
> b) What can this enable over a hypothetical "stride load/store" intrinsic?

In this proposal we do not generate any intrinsic; rather, we directly generate
the required instructions based on the vector factor. Here the instruction
widths are purely based on the vector factor. We have also introduced the
SkipFactor concept to reduce the number of load[s] & store[s].

But I accept that this can be done with a hypothetical "stride load/store"
intrinsic or the gather/scatter intrinsics: the vectorizer would generate these
intrinsics, and later CodeGen/CG-Prepare would generate the matching instructions.
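
For illustration only (made-up value names, and assuming the existing
llvm.masked.gather intrinsic with an all-true mask), the gather-based
alternative for arr[3*i] with VF 4 might look roughly like:

  %base = getelementptr inbounds i32, i32* %arr, i64 %index
  %ptrs = getelementptr i32, i32* %base, <4 x i64> <i64 0, i64 3, i64 6, i64 9>
  %strided.vec = call <4 x i32> @llvm.masked.gather.v4i32(<4 x i32*> %ptrs, i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)

The constant index vector encodes the stride, and CodeGen/CG-Prepare would then
be responsible for lowering this into concrete loads and shuffles.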

> c) Any chance of introducing stride load/store as a form of wide load/store?

We already widen the instructions, but purely based on the vector factor,
i.e. for an i32 load with VF 4 the generated wide load will be <4 x i32>.
We intentionally do not generate wide loads the way the interleaved-access
vectorizer does, because we transform everything in the vectorizer and the
wide load would not help us minimize the required load[s] & store[s]:
for VF 4 and stride 3, the interleaved-access vectorizer generates a <12 x i32> wide
load (which is effectively 3 <4 x i32> loads), but this approach generates only 2 <4 x i32> loads.
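
For comparison, a sketch (again with illustrative names) of the
interleaved-access style wide load for the same stride-3, VF-4 case, which
touches all 12 underlying elements:

  %ptr = bitcast i32* %base to <12 x i32>*
  %wide.vec = load <12 x i32>, <12 x i32>* %ptr, align 4
  %strided.vec = shufflevector <12 x i32> %wide.vec, <12 x i32> undef, <4 x i32> <i32 0, i32 3, i32 6, i32 9>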

Regards,
Ashutosh

> I agree this can be done with the gather/scatter intrinsics as well; once we enable those, we need to put the right costing in place. During costing we have to estimate the cost of the load[s], store[s], and shuffle[s], and in CG Prepare we have to lower them. In the proposed approach we have introduced a SkipFactor, which helps reduce the number of load[s] & store[s]. E.g., for stride 3 & VF 4 we generate only 2 loads (vs. 3) to model the load operation.

I'm all for properly accounting for the cost. I'm just against leaking unnecessary complexity into the IL.

> I did not understand this completely; the vectorizer already identifies the maximum target HW physical vector width, chooses the VF based on it, and then widens the instructions.

Well, the selection of the VF should be based on cost modeling. If you haven't, please follow the thread started by Michael Kuperstein:
  [llvm-dev] [RFC] Allow loop vectorizer to choose vector widths that generate illegal types
In general, a "logical vector" operation can be one full vector on the target, less than one full vector (such as half), or more than one full vector
(e.g., two or four vectors). LLVM IR instructions support logical vectors (and vector type legalization exists for them). I don't see why the vectorizer
should be restricted to using target-legal vectors only.

Now, consider the stride-2 load, VF=4 case, where the target HW can accommodate 4 elements in a vector. The elements in memory are
  A * B * C * D
where "*" marks the elements the scalar loop skips. We could try breaking this up into two loads as
  A * B * + C * D *
or
  A * B * + * C * D
The second one is fine if we can assume that any memory between A and D can be accessed.
However, the first one is bad, since the "*" after D may cause a SEGV.
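
To make this concrete, a sketch (illustrative names only) of the first split
for arr[2*i] with VF=4; the last lane of the second load is the problematic
"*" after D:

  %ptr0 = bitcast i32* %base to <4 x i32>*
  %load0 = load <4 x i32>, <4 x i32>* %ptr0, align 4   ; A * B *
  %next = getelementptr i32, i32* %base, i64 4
  %ptr1 = bitcast i32* %next to <4 x i32>*
  %load1 = load <4 x i32>, <4 x i32>* %ptr1, align 4   ; C * D *  <- the trailing lane may be past the end
  %strided.vec = shufflevector <4 x i32> %load0, <4 x i32> %load1, <4 x i32> <i32 0, i32 2, i32 4, i32 6>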

What if we want to use VF=8? Under the proposal in consideration, some operations (such as ADD)
consume the result of the load as an 8-way vector (say, <8 x float>), but the load is somehow split up
into 2x2x4. That's not good for optimizers. A load of an 8-way vector should look like a load of 8 elements ---
not a convoluted sequence of smaller loads and a bunch of shuffles that analyses/optimizations downstream
(or even a human) will have a hard time recognizing as a load of 8 elements.

> But I accept that this can be done with a hypothetical "stride load/store" intrinsic or the gather/scatter intrinsics: the vectorizer would generate these intrinsics, and later CodeGen/CG-Prepare would generate the matching instructions.

Thanks. That's what I'd recommend --- unless someone can clearly show that adding this much complexity to the vectorizer's output
enables some other optimizations downstream (without hurting other optimizations there).

The vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can
do a great job optimizing.

Thanks,
Hideki

> The vectorizer's output should be as clean as vector code can be so that analyses and optimizers downstream can
> do a great job optimizing.

I guess I should clarify this philosophical position of mine. In terms of vector code optimizations that complicate
the output of the vectorizer:

If the vectorizer is the best place to perform the optimization, it should do so.
     This includes cases like:
           If the vectorizer doesn't do it, we'll lose the opportunity.
           If the vectorizer doesn't do it, the complexity of performing the optimization goes up significantly.
           The optimization changes the way vectorization happens (this may be considered part of "lose the opportunity").
These should be kept distinct from:
           If the vectorizer doesn't do it, it'll add complexity to the vectorizer's cost model.
A compiler's cost modeling in general is already complex enough, since there are many things
happening downstream. Adding one more optimization downstream doesn't change
the big picture.

There is certainly a tradeoff against "if the vectorizer does this, we'll lose some other optimization downstream",
even if the vectorizer is the best place to perform a certain optimization --- but this isn't a new problem.

As Ashutosh wrote, the strided memory reference optimization in this RFC can be done later in compilation, and doing so
should not be much more complex than doing it in the vectorizer. That's why I recommend doing it outside of the
vectorizer (e.g., in CG Prepare), and I'm glad to hear that Ashutosh is considering that.

Thanks for reading.
Hideki

> Now, consider the stride-2 load, VF=4 case, where the target HW can
> accommodate 4 elements in a vector. The elements in memory are
>   A * B * C * D
> where "*" marks the elements the scalar loop skips. We could try breaking
> this up into two loads as
>   A * B * + C * D *
> or
>   A * B * + * C * D
> The second one is fine if we can assume that any memory between A and D
> can be accessed.
> However, the first one is bad, since the "*" after D may cause a SEGV.

True, the boundary element access may be risky.
To handle this we have introduced generation of a masked load
for the last required load in a loop iteration.
This can be enabled by passing the "-enable-strided-mask-load" option.

E.g., arr[2*i] with VF 8 will generate the instructions below:

  %12 = getelementptr inbounds i32, i32* %arr, i64 %11
  %13 = bitcast i32* %12 to <8 x i32>*
  %stride.load = load <8 x i32>, <8 x i32>* %13, align 4, !tbaa !1
  %14 = getelementptr i32, i32* %12, i64 8
  %15 = bitcast i32* %14 to <8 x i32>*
  %wide.masked.load = call <8 x i32> @llvm.masked.load.v8i32(<8 x i32>* %15, i32 4, <8 x i1> <i1 true, i1 false, i1 true, i1 false, i1 true, i1 false, i1 true, i1 false>, <8 x i32> undef), !tbaa !1
  %strided.vec = shufflevector <8 x i32> %stride.load, <8 x i32> %wide.masked.load, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>

NOTE: the last load is a masked load.

> What if we want to use VF=8? Under the proposal in consideration, some
> operations (such as ADD) consume the result of the load as an 8-way vector
> (say, <8 x float>), but the load is somehow split up into 2x2x4. That's not
> good for optimizers. A load of an 8-way vector should look like a load of 8
> elements --- not a convoluted sequence of smaller loads and a bunch of
> shuffles that analyses/optimizations downstream (or even a human) will have
> a hard time recognizing as a load of 8 elements.

This approach widens the instructions purely based on the vector factor.
For VF 8 the intent is to generate 8-wide loads (i.e. <8 x i32>), so that is what we generate.

E.g., arr[2*i] with VF 8 will generate the instructions below:

  %12 = getelementptr inbounds i32, i32* %arr, i64 %11
  %13 = bitcast i32* %12 to <8 x i32>*
  %stride.load = load <8 x i32>, <8 x i32>* %13, align 4, !tbaa !1
  %14 = getelementptr i32, i32* %12, i64 8
  %15 = bitcast i32* %14 to <8 x i32>*
  %stride.load27 = load <8 x i32>, <8 x i32>* %15, align 4, !tbaa !1
  %strided.vec = shufflevector <8 x i32> %stride.load, <8 x i32> %stride.load27, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>

NOTE: it generates only 2 loads of width 8.

Thanks,
Ashutosh

One common concern has been raised for cases where the Loop Vectorizer
generates bigger types than the target supports:

Currently, based on the VF, we check the cost and generate the expected set of
instruction[s] for the bigger type. This has two challenges: for bigger types
the cost is not always correct, and code generation may not produce efficient
instruction[s].

We can probably depend on the support proposed in the RFC below from Michael:
  "Allow loop vectorizer to choose vector widths that generate illegal types"
In that case the Loop Vectorizer will consider the cost of the bigger types and
generate a gather/scatter of the same type. Later, code generation will produce
the desired instruction[s] by breaking the bigger types into target-supported ones.
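
Purely as an illustration (assuming Michael's proposal is in place and using
made-up names), for arr[2*i] on a target whose widest legal vector is 256 bits,
the vectorizer could pick VF 16 and emit a single gather of the illegal type
<16 x i32>, leaving the splitting to type legalization:

  %ptrs = getelementptr i32, i32* %base, <16 x i64> <i64 0, i64 2, i64 4, i64 6, i64 8, i64 10, i64 12, i64 14, i64 16, i64 18, i64 20, i64 22, i64 24, i64 26, i64 28, i64 30>
  %strided.vec = call <16 x i32> @llvm.masked.gather.v16i32(<16 x i32*> %ptrs, i32 4, <16 x i1> <i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true, i1 true>, <16 x i32> undef)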

Regards,
Ashutosh

As a strong advocate of the logical vector representation, I'm counting on the community liking Michael's RFC and on it proceeding sooner rather than later.
I plan to pitch in (e.g., with perf experiments).

> We can probably depend on the support proposed in the RFC below from Michael:
>   "Allow loop vectorizer to choose vector widths that generate illegal types"
> In that case the Loop Vectorizer will consider the cost of the bigger types and generate a gather/scatter of the same type. Later, code generation will produce the desired instruction[s] by breaking the bigger types into target-supported ones.

Hopefully you can come up with a short-term workaround if the progress of Michael's work is behind yours.

Thanks,
Hideki

> We can probably depend on the support proposed in the RFC below from Michael:
>   "Allow loop vectorizer to choose vector widths that generate illegal types"
> In that case the Loop Vectorizer will consider the cost of the bigger types and
> generate a gather/scatter of the same type. Later, code generation will produce
> the desired instruction[s] by breaking the bigger types into target-supported ones.

> Hopefully you can come up with a short-term workaround if the progress of
> Michael's work is behind yours.

We can skip those cases for now: whenever the vector factor for a type goes
beyond what the target supports, we will disable strided memory vectorization.
Once Michael's work is accepted, we will enable it to consider these cases.

Regards,
Ashutosh