[LoopVectorizer] Improving the performance of dot product reduction loop

Hello all,

This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication, so we use the vpmaddwd instruction, which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result.
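Roughly, the vectorized loop body we get today looks like this (a simplified sketch showing just one unrolled copy of the body; the value names are made up):

  vector.body:
    %acc = phi <8 x i32> [ zeroinitializer, %entry ], [ %acc.next, %vector.body ]  ; running sum
    %a = load <8 x i16>, <8 x i16>* %pa, align 2
    %b = load <8 x i16>, <8 x i16>* %pb, align 2
    %a.ext = sext <8 x i16> %a to <8 x i32>
    %b.ext = sext <8 x i16> %b to <8 x i32>
    %mul = mul nsw <8 x i32> %a.ext, %b.ext
    %acc.next = add nsw <8 x i32> %mul, %acc       ; accumulate into the reduction phi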

In the example code, because we are reducing the number of elements from 8 to 4 in the vpmaddwd step, we are left with a width mismatch between vpmaddwd and the vpaddd instruction that we use to sum with the results from previous loop iterations. We rely on the fact that a 128-bit vpmaddwd zeros the upper bits of the register, so we can use a 256-bit vpaddd instruction and the upper elements can keep going around the loop without being disturbed in case they weren't initialized to 0. But this still means the vpmaddwd instruction is doing half the work the CPU would be capable of if we were able to use a 256-bit vpmaddwd instruction. Additionally, future x86 CPUs will be gaining an instruction that can do VPMADDWD and VPADDD in one instruction, but that width mismatch makes that instruction difficult to utilize.

In order for the backend to handle this better, it would be great if we could have something like two v32i8 loads, two shufflevectors to extract the even elements and the odd elements to create four v16i8 pieces. Sign extend each of those pieces. Multiply the two even pieces and the two odd pieces separately, sum those results with a v8i32 add. Then another v8i32 add to accumulate the previous loop iterations. This ensures that no piece exceeds the target vector width and the final operation is correctly sized to go around the loop in one register. All but the last add can then be pattern matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636. And for the future CPU the whole thing can be matched to the new instruction.

Do other targets have a similar instruction or a similar issue to this? Is this something we can solve in the loop vectorizer? Or should we have a separate IR transformation that can recognize this pattern and generate the new sequence? As a separate pass we would need to pair two vector loads together, remove a reduction step outside the loop and remove half the phis assuming the loop was partially unrolled. Or if there was only one add/mul inside the loop we’d have to reduce its width and the width of the phi.

Thanks,

~Craig

Why v*i8 loads? I thought that we have 16-bit and 32-bit types here? Can you explain how the desired code from the vectorizer differs from the code that the vectorizer produces if you add ‘#pragma clang loop vectorize(enable) vectorize_width(16)’ above the loop? I tried it in your godbolt example and the generated code looks very similar to the icc-generated code. Thanks again, Hal

(specifically, I mean this: )

~Craig

Hello all,

This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication, so we use the vpmaddwd instruction, which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result.

That godbolt link seems wrong. It wasn’t supposed to be clang IR. This should be right.

In the example code, because we are reducing the number of elements from 8 to 4 in the vpmaddwd step, we are left with a width mismatch between vpmaddwd and the vpaddd instruction that we use to sum with the results from previous loop iterations. We rely on the fact that a 128-bit vpmaddwd zeros the upper bits of the register, so we can use a 256-bit vpaddd instruction and the upper elements can keep going around the loop without being disturbed in case they weren't initialized to 0. But this still means the vpmaddwd instruction is doing half the work the CPU would be capable of if we were able to use a 256-bit vpmaddwd instruction. Additionally, future x86 CPUs will be gaining an instruction that can do VPMADDWD and VPADDD in one instruction, but that width mismatch makes that instruction difficult to utilize.

In order for the backend to handle this better it would be great if we could have something like two v32i8 loads, two shufflevectors to extract the even elements and the odd elements to create four v16i8 pieces.

Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?

Oops that should have been v16i16. Mixed up my 256-bit types.

Sign extend each of those pieces. Multiply the two even pieces and the two odd pieces separately, sum those results with a v8i32 add. Then another v8i32 add to accumulate the previous loop iterations. This ensures that no piece exceeds the target vector width and the final operation is correctly sized to go around the loop in one register. All but the last add can then be pattern matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636. And for the future CPU the whole thing can be matched to the new instruction.
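For concreteness, a minimal IR sketch of that shape with the v16i16 correction applied (the value names are made up):

    %acc = phi <8 x i32> [ zeroinitializer, %entry ], [ %acc.next, %vector.body ]
    %a = load <16 x i16>, <16 x i16>* %pa, align 2
    %b = load <16 x i16>, <16 x i16>* %pb, align 2
    %a.even = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
    %a.odd  = shufflevector <16 x i16> %a, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
    %b.even = shufflevector <16 x i16> %b, <16 x i16> undef, <8 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14>
    %b.odd  = shufflevector <16 x i16> %b, <16 x i16> undef, <8 x i32> <i32 1, i32 3, i32 5, i32 7, i32 9, i32 11, i32 13, i32 15>
    %a.even.ext = sext <8 x i16> %a.even to <8 x i32>
    %a.odd.ext  = sext <8 x i16> %a.odd to <8 x i32>
    %b.even.ext = sext <8 x i16> %b.even to <8 x i32>
    %b.odd.ext  = sext <8 x i16> %b.odd to <8 x i32>
    %mul.even = mul nsw <8 x i32> %a.even.ext, %b.even.ext
    %mul.odd  = mul nsw <8 x i32> %a.odd.ext, %b.odd.ext
    %madd = add nsw <8 x i32> %mul.even, %mul.odd   ; everything up to here matches vpmaddwd
    %acc.next = add nsw <8 x i32> %madd, %acc       ; accumulate the previous loop iterations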

Do other targets have a similar instruction or a similar issue to this? Is this something we can solve in the loop vectorizer? Or should we have a separate IR transformation that can recognize this pattern and generate the new sequence? As a separate pass we would need to pair two vector loads together, remove a reduction step outside the loop and remove half the phis assuming the loop was partially unrolled. Or if there was only one add/mul inside the loop we’d have to reduce its width and the width of the phi.

Can you explain how the desired code from the vectorizer differs from the code that the vectorizer produces if you add ‘#pragma clang loop vectorize(enable) vectorize_width(16)’ above the loop? I tried it in your godbolt example and the generated code looks very similar to the icc-generated code.

It's similar, but the vpxor %xmm0, %xmm0, %xmm0 is being unnecessarily carried across the loop. It's then redundantly added twice in the reduction after the loop despite it being 0. This happens because we basically tricked the backend into generating a 256-bit vpmaddwd concatenated with a 256-bit zero vector going into a 512-bit vpaddd before type legalization. The 512-bit concat and vpaddd get split during type legalization, and the high half of the add gets constant folded away. I'm guessing we probably finished with 4 vpxors before the loop, but MachineCSE (or some other pass?) combined two of them when it figured out the loop didn't modify them.

My perspective, being a vectorizer guy, is that the vectorizer should

  1. Take this optimization into account in the cost modeling so that it favors full vector compute.

  2. but generate plain widened computation:
    full vector unit stride load of A[],
    full vector unit stride load of B[],
    sign extend both, (this makes it 2x full vector, on the surface)
    multiply
    add

    standard reduction last value sequence after the loop

and let the downstream optimizer, possibly in the Target, use instructions like (v)pmaddwd effectively.
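A minimal sketch of that straightforwardly widened loop body, assuming a VF of 16 (value names are made up):

  vector.body:
    %acc = phi <16 x i32> [ zeroinitializer, %entry ], [ %acc.next, %vector.body ]
    %a = load <16 x i16>, <16 x i16>* %pa, align 2   ; full vector unit-stride load of A[]
    %b = load <16 x i16>, <16 x i16>* %pb, align 2   ; full vector unit-stride load of B[]
    %a.ext = sext <16 x i16> %a to <16 x i32>        ; 2x full vector, on the surface
    %b.ext = sext <16 x i16> %b to <16 x i32>
    %mul = mul nsw <16 x i32> %a.ext, %b.ext
    %acc.next = add nsw <16 x i32> %mul, %acc
    ; standard reduction last-value sequence after the loop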

If needed, an IR-to-IR xform before hitting Target.

This mechanism also works if a programmer or another FE produces similar naïvely vectorized IR like the above.

Since the vectorizer should understand the existence of the optimization, it can certainly be arm-twisted to generate the IR desired by the Target. However, whether we want to do that is a totally different story. The vectorizer should focus on having a reasonable cost model and generating straightforward, optimizable IR, as opposed to generating convoluted IR (such as breaking up a unit-stride load into even/odd pieces, simply to put them back into unit stride again) wanted by the Target.

My recommendation is to first analyze the source of the current code generation deficiencies and then try to remedy them there.

Thanks,

Hideki

Hello all,

This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication, so we use the vpmaddwd instruction, which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result.

In the example code, because we are reducing the number of elements from 8 to 4 in the vpmaddwd step, we are left with a width mismatch between vpmaddwd and the vpaddd instruction that we use to sum with the results from previous loop iterations. We rely on the fact that a 128-bit vpmaddwd zeros the upper bits of the register, so we can use a 256-bit vpaddd instruction and the upper elements can keep going around the loop without being disturbed in case they weren't initialized to 0. But this still means the vpmaddwd instruction is doing half the work the CPU would be capable of if we were able to use a 256-bit vpmaddwd instruction. Additionally, future x86 CPUs will be gaining an instruction that can do VPMADDWD and VPADDD in one instruction, but that width mismatch makes that instruction difficult to utilize.

In order for the backend to handle this better it would be great if we could have something like two v32i8 loads, two shufflevectors to extract the even elements and the odd elements to create four v16i8 pieces.

Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?

Sign extend each of those pieces. Multiply the two even pieces and the two odd pieces separately, sum those results with a v8i32 add. Then another v8i32 add to accumulate the previous loop iterations. This ensures that no piece exceeds the target vector width and the final operation is correctly sized to go around the loop in one register. All but the last add can then be pattern matched to vpmaddwd as proposed in https://reviews.llvm.org/D49636. And for the future CPU the whole thing can be matched to the new instruction.

Do other targets have a similar instruction or a similar issue to this? Is this something we can solve in the loop vectorizer? Or should we have a separate IR transformation that can recognize this pattern and generate the new sequence? As a separate pass we would need to pair two vector loads together, remove a reduction step outside the loop and remove half the phis assuming the loop was partially unrolled. Or if there was only one add/mul inside the loop we'd have to reduce its width and the width of the phi.

Can you explain how the desired code from the vectorizer differs from the code that the vectorizer produces if you add '#pragma clang loop vectorize(enable) vectorize_width(16)' above the loop? I tried it in your godbolt example and the generated code looks very similar to the icc-generated code.

The vectorizer considers the largest data type size in the loop body and so caps the maximum possible VF at 8; hence in this example it generates the 128-bit vpmaddwd.
I'd like to share my observation: instead of forcing the vector factor to 16 (by using the pragma), I tried the option "vectorizer-maximize-bandwidth".
"vectorizer-maximize-bandwidth" considers the smallest data type size in the loop body and allows a VF of up to 16, but LV still selects a VF of 8 even though VF 16 has the same cost:
LV: Vector loop of width 8 costs: 1.
LV: Vector loop of width 16 costs: 1.

It's because of the below check in LV:
LoopVectorizationCostModel::selectVectorizationFactor() {

    if (VectorCost < Cost) {
      Cost = VectorCost;
      Width = i;
    }

I don't know the history behind this check, but I'm wondering why it's like this; at least for "vectorizer-maximize-bandwidth" it should be (VectorCost <= Cost).
By forcing the vector factor to 16, it generates the 256-bit vpmaddwd.

Regards,
Ashutosh

Ah, interesting. I was indeed wondering whether this was another case where we’d benefit from having vectorizer-maximize-bandwidth on by default. I recall that our defaults have long been to prefer smaller VF in cases where the costs are equal to avoid extra legalization costs (and, potentially, to retain more freedom for interleaving later). Should the costs here be equal, or should we have some extra target information to distinguish them? It sounds like they’re not really equal in practice. -Hal

Hi Hal,

Enabling vectorizer-maximize-bandwidth by default has been explored
before (see history in https://reviews.llvm.org/D46283) and it created
a handful of problems.

With the correctness problems fixed, we still were blocked on high
negative performance hits, which probably had to do with the cost
model. Adhemerval sent this patch instead:
https://reviews.llvm.org/D48332, which dealt with the specific case in
hand.

I remember hearing that some complex loops get worse performance for
wide loads (ex. AVX512) because it makes the loop too short and the
shuffles in and out too long, or increases the number of shuffles if
the loads are not trivial.

So, while enabling it by default is probably a good idea in a lot of
cases, we probably need to be careful with its usage as a wide scope
default. I don't have more info on the individual cases, though, so
this is just more of an FYI. :)

I’m still missing something. Why do you want to separate out the even and odd parts instead of just adding up the first half of the numbers and the second half? Thanks again, Hal

Ah, interesting. I was indeed wondering whether this was another case where we'd benefit from having vectorizer-maximize-bandwidth on by default. I recall that our defaults have long been to prefer smaller VF in cases where the costs are equal to avoid extra legalization costs (and, potentially, to retain more freedom for interleaving later). Should the costs here be equal, or should we have some extra target information to distinguish them? It sounds like they're not really equal in practice.

Hi Hal,

Enabling vectorizer-maximize-bandwidth by default has been explored
before (see history in https://reviews.llvm.org/D46283) and it created
a handful of problems.

With the correctness problems fixed, we still were blocked on high
negative performance hits, which probably had to do with the cost
model. Adhemerval sent this patch instead:
https://reviews.llvm.org/D48332, which dealt with the specific case in
hand.

This matches my recollection: the correctness problems had been
addressed and there were still some outstanding cost-modeling problems.

I remember hearing that some complex loops get worse performance for
wide loads (ex. AVX512) because it makes the loop too short and the
shuffles in and out too long, or increases the number of shuffles if
the loads are not trivial.

Sounds like another potential cost-modeling problem?

So, while enabling it by default is probably a good idea in a lot of
cases, we probably need to be careful with its usage as a wide scope
default. I don't have more info on the individual cases, though, so
this is just more of an FYI. :)

I don't really see any long-term path to addressing these kinds of
problems aside from working through the cost-modeling problems and
getting this enabled. We'll see. ;)

-Hal

Same here.

I think that this should be, generally, our strategy. We might have different reduction strategies (we already do, at least in terms of the final reduction tree), and one might include what x86 wants here, so long as we can reasonably create a cost-modeling interface that lets us differentiate it from other strategies at the IR level. Lacking the ability to abstract this behind a generalized strategy with an IR-level cost-modeling interface, I think that the vectorizer should produce straightforward IR (e.g., what we currently produce with VF=16, see the other discussion of the vectorizer-maximize-bandwidth option) and then the target can adjust it as necessary to take advantage of special isel opportunities. Thanks again, Hal

~Craig

Hello all,

This code https://godbolt.org/g/tTyxpf is a dot product reduction loop multiplying sign-extended 16-bit values to produce a 32-bit accumulated result. The x86 backend is currently not able to optimize it as well as gcc and icc. The IR we are getting from the loop vectorizer has several v8i32 adds and muls inside the loop. These are fed by v8i16 loads and sexts from v8i16 to v8i32. The x86 backend recognizes that these are addition reductions of multiplication, so we use the vpmaddwd instruction, which calculates 32-bit products from 16-bit inputs and does a horizontal add of adjacent pairs. A vpmaddwd given two v8i16 inputs will produce a v4i32 result.

That godbolt link seems wrong. It wasn’t supposed to be clang IR. This should be right.

In the example code, because we are reducing the number of elements from 8 to 4 in the vpmaddwd step, we are left with a width mismatch between vpmaddwd and the vpaddd instruction that we use to sum with the results from previous loop iterations. We rely on the fact that a 128-bit vpmaddwd zeros the upper bits of the register, so we can use a 256-bit vpaddd instruction and the upper elements can keep going around the loop without being disturbed in case they weren't initialized to 0. But this still means the vpmaddwd instruction is doing half the work the CPU would be capable of if we were able to use a 256-bit vpmaddwd instruction. Additionally, future x86 CPUs will be gaining an instruction that can do VPMADDWD and VPADDD in one instruction, but that width mismatch makes that instruction difficult to utilize.

In order for the backend to handle this better it would be great if we could have something like two v32i8 loads, two shufflevectors to extract the even elements and the odd elements to create four v16i8 pieces.

Why v*i8 loads? I thought that we have 16-bit and 32-bit types here?

Oops that should have been v16i16. Mixed up my 256-bit types.

Sign extend each of those pieces. Multiply the two even pieces and the two odd pieces separately, sum those results with a v8i32 add. Then another v8i32 add to accumulate the previous loop iterations.

I’m still missing something. Why do you want to separate out the even and odd parts instead of just adding up the first half of the numbers and the second half?

Doing even/odd matches up with a pattern I already have to support for the code in https://reviews.llvm.org/D49636. I wouldn't even need to detect it as a reduction to do the reassociation since even/odd exactly matches the behavior of the instruction. But you're right, we could also just detect the reduction and add the two halves.

With maximize-bandwidth I'd still end up with the extra vpxors above the loop and the extra addition reduction steps at the end that we get from forcing the VF to 16, right?

Right. Nothing changes there by purely going with the wider vector.

But you're right, we could also just detect the reduction and add the two halves.

If we take this approach, we also need to teach the vector type legalizer that the last add in the following is a reduction sum and thus should add into the same temp instead of creating a second temp. Should not be a big deal; just calling it out.

full vector unit stride load of A[],
full vector unit stride load of B[],
sign extend both, (this makes it 2x full vector, on the surface)
multiply
add

Hideki

Yeah, I think we'd need something else to help with that. My underlying thought here is that, based on your description, we really do want a VF of 16 (because we want to use 256-bit loads, etc.). And so, when thinking about how to fix things, we should look at the VF = 16 output, not the VF = 8 output, as the right starting point for the backend. -Hal