Data structure improvement for the SLP vectorizer

There was some discussion of this on the llvm-commits list, but I
wanted to raise the topic for discussion here. The background of the
-commits discussion was that r296863 added the ability to sort memory
accesses when the SLP vectorizer reached a load (the SLP vectorizer
starts at a store or some other sink, and tries to go up the tree
vectorizing as it goes along - if the input is in a different order
than the output, you need a shufflevector somewhere in the middle).
However, that commit had to be reverted because the SLP tree is not
actually a tree and the same load bundle could be used by several
different users, causing problems with representing the required
shuffle on the load bundle node (Please add any history I'm missing
here).

Now, I'm jumping into this because I want to teach the vectorizer
about a similar operation, namely a 2->4 (or 4->8 for AVX512)
broadcast. This operation also requires a load followed by a
shufflevector, so it would be nice to use the same representation for
both of these.

Now, the primary problem is that we don't really have anywhere to put
this permutation information. The two solutions seem to be duplicating
the nodes (and maybe deduplicating them later again when materializing
the tree), or taking the hit and representing the graph more
explicitly, such that we can annotate information on the actual edges
of the graph.
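To make the edge-annotation option concrete, here is a minimal C++ sketch (all names are invented for illustration; this is not the actual SLPVectorizer code) of a DAG in which each use-def edge carries an optional shuffle mask, so a shared load bundle never has to be duplicated per user:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch, not the real SLPVectorizer data structures:
// nodes hold scalar bundles, and every operand edge carries an optional
// shuffle mask describing how the producer's lanes feed this consumer.
struct BundleNode;

struct BundleEdge {
  BundleNode *Producer = nullptr;
  // An empty mask means identity (lanes consumed in producer order).
  std::vector<int> ShuffleMask;
};

struct BundleNode {
  std::vector<BundleEdge> Operands; // annotated edges, not duplicated children
};

// With masks on the edges, two users can share one load-bundle node
// while each requesting a different lane permutation.
inline bool shareProducer(const BundleNode &A, const BundleNode &B) {
  return !A.Operands.empty() && !B.Operands.empty() &&
         A.Operands[0].Producer == B.Operands[0].Producer;
}
```

The point of the sketch is that the permutation lives on the edge, so the problematic case from r296863 (one load bundle, several users, each needing a different order) falls out naturally.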

The latter seems nicer to me; we'd essentially have a DAG of scalar
bundles with vector shuffles in between them. Right now the vectorizer
can't really handle intermediate vector shuffles at all, as far as I
can tell (with a few hand-coded exceptions for common patterns). From
my perspective that seems a bit odd. AVX512 for example can represent
most (all?) shufflevector operations in one instruction, so failing to
vectorize a large tree because it would require one of them seems very
problematic. This is also related to my desire for vector->larger
vector broadcasting. Especially when targeting AVX512, there can be
whole 4-wide subtrees that the SLP vectorizer isn't currently looking
at, because it can't handle the 4->8 expansion. Whatever the new
representation, I'd like the SLP vectorizer to be able to handle that.

I hope that summarizes the problem. My understanding of the situation
is admittedly incomplete - I came to this because I need the
vectorizer to correctly handle my narrow-width subtrees, but I'm
hoping we can collectively come up with a design that would cover all
the relevant use cases.

Thanks, Keno, for your interest. My responses are inlined below.

From: Keno Fischer [mailto:keno@juliacomputing.com]
Sent: Wednesday, March 15, 2017 6:31 AM
To: llvm-dev@lists.llvm.org
Cc: Shahid, Asghar-ahmad <Asghar-ahmad.Shahid@amd.com>; Michael
Kuperstein <mkuper@google.com>; Matthias Braun <matze@braunis.de>
Subject: Data structure improvement for the SLP vectorizer

There was some discussion of this on the llvm-commits list, but I wanted to
raise the topic for discussion here. The background of the -commits
discussion was that r296863 added the ability to sort memory accesses when
the SLP vectorizer reached a load (the SLP vectorizer starts at a store or some
other sink, and tries to go up the tree vectorizing as it goes along - if the input
is in a different order than the output, you need a shufflevector somewhere
in the middle).
However, that commit had to be reverted because the SLP tree is not actually
a tree and the same load bundle could be used by several different users,
causing problems with representing the required shuffle on the load bundle
node (Please add any history I'm missing here).

Now, I'm jumping into this because I want to teach the vectorizer about a
similar operation, namely a 2->4 (or 4->8 for AVX512) broadcast. This
operation also requires a load followed by a shufflevector, so it would be
nice to use the same representation for both of these.

How do you plan to identify or mark a certain "load" bundle for broadcast,
as these "loads" presumably will not be unordered?

Now, the primary problem is that we don't really have anywhere to put this
permutation information. The two solutions seem to be duplicating the
nodes (and maybe deduplicating them later again when materializing the
tree), or taking the hit and representing the graph more explicitly, such that
we can annotate information on the actual edges of the graph.

The latter seems nicer to me; we'd essentially have a DAG of scalar bundles
with vector shuffles in between them.

That’s right.

Right now the vectorizer can't really handle intermediate vector shuffles at
all, as far as I can tell (with a few hand-coded exceptions for common
patterns). From my perspective that seems a bit odd. AVX512 for example can
represent most (all?) shufflevector operations in one instruction, so failing
to vectorize a large tree because it would require one of them seems very
problematic.

Do you mean widening the vector shuffles themselves?

This is also related to my desire for vector->larger vector broadcasting.
Especially when targeting AVX512, there can be whole 4-wide subtrees that the
SLP vectorizer isn't currently looking at, because it can't handle the 4->8
expansion. Whatever the new representation, I'd like the SLP vectorizer to be
able to handle that.

I hope that summarizes the problem. My understanding of the situation is
admittedly incomplete - I came to this because I need the vectorizer to
correctly handle my narrow-width subtrees, but I'm hoping we can
collectively come up with a design that would cover all the relevant use cases.

I did not get this part; can you please elaborate?

Regards,
Shahid

Maybe it would be illustrative to give an IR example of the case I'm
interested in. Consider:

define void @"julia_transform_bvn_derivs_hessian!"(double* %data,
double* %data2, double *%data3, double *%out) {
    %element11 = getelementptr inbounds double, double* %data, i32 1
    %load10 = load double, double* %data
    %load11 = load double, double* %element11

    %element21 = getelementptr inbounds double, double* %data2, i32 1
    %element22 = getelementptr inbounds double, double* %data2, i32 2
    %element23 = getelementptr inbounds double, double* %data2, i32 3
    %load20 = load double, double* %data2
    %load21 = load double, double* %element21
    %load22 = load double, double* %element22
    %load23 = load double, double* %element23

    %element31 = getelementptr inbounds double, double* %data3, i32 1
    %element32 = getelementptr inbounds double, double* %data3, i32 2
    %element33 = getelementptr inbounds double, double* %data3, i32 3
    %load30 = load double, double* %data3
    %load31 = load double, double* %element31
    %load32 = load double, double* %element32
    %load33 = load double, double* %element33

    %mul1 = fmul fast double %load20, %load10
    %mul2 = fmul fast double %load21, %load11
    %mul3 = fmul fast double %load22, %load10
    %mul4 = fmul fast double %load23, %load11

    %add1 = fadd fast double %load30, %mul1
    %add2 = fadd fast double %load31, %mul2
    %add3 = fadd fast double %load32, %mul3
    %add4 = fadd fast double %load33, %mul4

    %out0 = getelementptr inbounds double, double* %out, i32 0
    %out1 = getelementptr inbounds double, double* %out, i32 1
    %out2 = getelementptr inbounds double, double* %out, i32 2
    %out3 = getelementptr inbounds double, double* %out, i32 3

    store double %add1, double *%out0
    store double %add2, double *%out1
    store double %add3, double *%out2
    store double %add4, double *%out3

    ret void
}

Currently the SLP vectorizer generates (for an AVX2 target):

{
  %element11 = getelementptr inbounds double, double* %data, i64 1
  %load10 = load double, double* %data, align 8
  %load11 = load double, double* %element11, align 8
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = insertelement <4 x double> undef, double %load10, i32 0
  %6 = insertelement <4 x double> %5, double %load11, i32 1
  %7 = insertelement <4 x double> %6, double %load10, i32 2
  %8 = insertelement <4 x double> %7, double %load11, i32 3
  %9 = fmul fast <4 x double> %2, %8
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

I want it to generate

{
  %cast = bitcast double* %data to <2 x double>*
  %load = load <2 x double>, <2 x double>* %cast, align 8
  %shuffle = shufflevector <2 x double> %load, <2 x double> undef,
<4 x i32> <i32 0, i32 1, i32 0, i32 1>
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %9 = fmul fast <4 x double> %2, %shuffle
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

More generally though, rather than the loads, there could be a whole subtree
of vectorizable operations. This becomes an especially large problem for
larger vector widths, as the penalty for missing the vectorization
opportunity becomes larger.

What I was suggesting is that, to handle this case, the load could be a
two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the
fmul. I hope that gives an illustrative example of what I care about and why
I think it can be accomplished with the same solution.
Please let me know if that didn't answer your questions.

Comments inlined.

From: Keno Fischer [mailto:keno@juliacomputing.com]
Sent: Wednesday, March 15, 2017 7:27 PM
To: Shahid, Asghar-ahmad <Asghar-ahmad.Shahid@amd.com>
Cc: llvm-dev@lists.llvm.org; Michael Kuperstein <mkuper@google.com>;
Matthias Braun <matze@braunis.de>
Subject: Re: Data structure improvement for the SLP vectorizer

Maybe it would be illustrative to give an IR example of the case I'm
interested in. Consider:

define void @"julia_transform_bvn_derivs_hessian!"(double* %data,
double* %data2, double *%data3, double *%out) {
    %element11 = getelementptr inbounds double, double* %data, i32 1
    %load10 = load double, double* %data
    %load11 = load double, double* %element11

    %element21 = getelementptr inbounds double, double* %data2, i32 1
    %element22 = getelementptr inbounds double, double* %data2, i32 2
    %element23 = getelementptr inbounds double, double* %data2, i32 3
    %load20 = load double, double* %data2
    %load21 = load double, double* %element21
    %load22 = load double, double* %element22
    %load23 = load double, double* %element23

    %element31 = getelementptr inbounds double, double* %data3, i32 1
    %element32 = getelementptr inbounds double, double* %data3, i32 2
    %element33 = getelementptr inbounds double, double* %data3, i32 3
    %load30 = load double, double* %data3
    %load31 = load double, double* %element31
    %load32 = load double, double* %element32
    %load33 = load double, double* %element33

    %mul1 = fmul fast double %load20, %load10
    %mul2 = fmul fast double %load21, %load11
    %mul3 = fmul fast double %load22, %load10
    %mul4 = fmul fast double %load23, %load11

    %add1 = fadd fast double %load30, %mul1
    %add2 = fadd fast double %load31, %mul2
    %add3 = fadd fast double %load32, %mul3
    %add4 = fadd fast double %load33, %mul4

    %out0 = getelementptr inbounds double, double* %out, i32 0
    %out1 = getelementptr inbounds double, double* %out, i32 1
    %out2 = getelementptr inbounds double, double* %out, i32 2
    %out3 = getelementptr inbounds double, double* %out, i32 3

    store double %add1, double *%out0
    store double %add2, double *%out1
    store double %add3, double *%out2
    store double %add4, double *%out3

    ret void
}

Currently the SLP vectorizer generates (for an AVX2 target):

{
  %element11 = getelementptr inbounds double, double* %data, i64 1
  %load10 = load double, double* %data, align 8
  %load11 = load double, double* %element11, align 8
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %5 = insertelement <4 x double> undef, double %load10, i32 0
  %6 = insertelement <4 x double> %5, double %load11, i32 1
  %7 = insertelement <4 x double> %6, double %load10, i32 2
  %8 = insertelement <4 x double> %7, double %load11, i32 3
  %9 = fmul fast <4 x double> %2, %8
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

This is because currently the broadcast elements are being handled as scalars to be "gathered".

I want it to generate

{
  %cast = bitcast double* %data to <2 x double>*
  %load = load <2 x double>, <2 x double>* %cast, align 8
  %shuffle = shufflevector <2 x double> %load, <2 x double> undef,
<4 x i32> <i32 0, i32 1, i32 0, i32 1>
  %1 = bitcast double* %data2 to <4 x double>*
  %2 = load <4 x double>, <4 x double>* %1, align 8
  %3 = bitcast double* %data3 to <4 x double>*
  %4 = load <4 x double>, <4 x double>* %3, align 8
  %9 = fmul fast <4 x double> %2, %shuffle
  %10 = fadd fast <4 x double> %4, %9
  %11 = bitcast double* %out to <4 x double>*
  store <4 x double> %10, <4 x double>* %11, align 8
  ret void
}

Here, %load should be a 4-element load instead of 2, and you can still do the
required broadcast with the above shuffle. This is important because, with
our scheme of having a DAG with masks on edges to a single tree node,
generating a vector load shorter than the chosen vector factor will be
problematic. If we want to do so, we would require another tree entry with a
smaller number of scalar loads.

More generally though, rather than the loads, there could be a whole subtree
of vectorizable operations. This becomes an especially large problem for
larger vector widths, as the penalty for missing the vectorization
opportunity becomes larger.

What I was suggesting is that, to handle this case, the load could be a
two-element bundle with a (0, 1, 0, 1) shuffle mask on the edge into the
fmul. I hope that gives an illustrative example of what I care about and why
I think it can be accomplished with the same solution.
Please let me know if that didn't answer your questions.

Yes, that sums up the problem. IMO, this can be achieved by checking for the
broadcast case: this is essentially looking for partial consecutiveness of
the memory accesses and having a proper mask for that on the edge from this
USE node to the DEF node.
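As a sketch of what such a check might look like (partialConsecutiveRun is a name invented here, not an existing SLPVectorizer function): given the element offsets a "load" bundle reads, it reports the length of the consecutive run if the bundle is that run repeated, which is exactly the broadcast case discussed above:

```cpp
#include <cassert>
#include <vector>

// Invented illustration of a partial-consecutiveness check: returns the
// length of the consecutive run if Offsets is that run repeated to fill
// the bundle (e.g. {0, 1, 0, 1} -> 2), or 0 if there is no such run.
unsigned partialConsecutiveRun(const std::vector<unsigned> &Offsets) {
  for (unsigned Len = 1; Len < Offsets.size(); ++Len) {
    if (Offsets.size() % Len != 0)
      continue;
    bool Matches = true;
    for (unsigned I = 0; I < Offsets.size(); ++I)
      if (Offsets[I] != I % Len) {
        Matches = false;
        break;
      }
    if (Matches)
      return Len;
  }
  return 0;
}
```

For the example in this thread, {0, 1, 0, 1} yields a run of 2, from which the (0, 1, 0, 1) edge mask follows directly.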

Regards,
Shahid

Could you elaborate on why you think this is? There doesn't seem to me to be
a problem with having a 2-element bundle and then putting (0, 1, 0, 1) on
the edge to a 4-element bundle, but I may be missing something.

Thanks,
Keno

From: Keno Fischer [mailto:keno@juliacomputing.com]
Sent: Thursday, March 16, 2017 6:11 PM
To: Shahid, Asghar-ahmad <Asghar-ahmad.Shahid@amd.com>
Cc: llvm-dev@lists.llvm.org; Michael Kuperstein <mkuper@google.com>;
Matthias Braun <matze@braunis.de>
Subject: Re: Data structure improvement for the SLP vectorizer

> Here, %load should be a 4-element load instead of 2, and you can still
> do the required broadcast with the above shuffle. This is important
> because, with our scheme of having a DAG with masks on edges to a
> single tree node, generating a vector load shorter than the chosen
> vector factor will be problematic.

Could you elaborate on why you think this is? There doesn't seem to me to be
a problem with having a 2-element bundle and then putting (0, 1, 0, 1) on
the edge to a 4-element bundle, but I may be missing something.

That is because the number of elements for a bundle is decided by a chosen
vectorization factor (VF). The VF is chosen as the target vector register
length divided by the width of the data type used. If it is 4, then trees of
different lengths are built, profitability is checked, and the profitable
tree gets vectorized with the chosen VF.

Another way to look at it is: if you have a case where you need to use the
fully loaded vector and also a part of it, reloading the same values as a
shorter vector is not efficient; instead, extracting just the required
values from the fully loaded vector with a shuffle is fine.

Regards,
Shahid