Patch to synthesize x86 hadd instructions; need help with the tablegen bits

This patch synthesizes haddps/haddpd/hsubps/hsubpd instructions from floating
point additions and subtractions of appropriate vector shuffles. To do this I
introduced new x86 FHADD and FHSUB opcodes. These need to be wired up somehow
in the .td file to the appropriate instructions. Since I have no idea how
tablegen works I just hacked it in horribly. It works, but breaks support for
the hadd etc intrinsics (if you take a look at how I did it you will see why!).
I'm sending the patch for comments, and in the hope that someone will explain
how I should be doing the tablegen bits.

Ciao, Duncan.

hadd.diff (16.9 KB)

Hi Duncan,

This patch synthesizes haddps/haddpd/hsubps/hsubpd instructions from
floating
point additions and subtractions of appropriate vector shuffles. To do this
I
introduced new x86 FHADD and FHSUB opcodes. These need to be wired up
somehow
in the .td file to the appropriate instructions. Since I have no idea how
tablegen works I just hacked it in horribly. It works, but breaks support
for
the hadd etc intrinsics (if you take a look at how I did it you will see
why!).
I'm sending the patch for comments, and in the hope that someone will
explain
how I should be doing the tablegen bits.

This is awesome :smiley:

Some comments:

+ // Try to synthesize horizontal adds from adds of shuffles.
+ if (((Subtarget->hasSSE3() && (VT == MVT::v4f32 || VT == MVT::v2f64)) ||
+ (Subtarget->hasAVX() && (VT == MVT::v8f32 || VT == MVT::v4f64))) &&
+ isHorizontalBinOp(LHS, RHS, true))

1) You probably want to do something like:

"bool HasHorizontalArith = Subtarget->hasSSE3() ||
Subtarget->hasAVX()" and check it for the first condition, because
when AVX is on, the SSE levels are all turned off (as to consider AVX
a reimplementation of all SSE levels).

For the second condition: Does this logic works for 256-bit vectors?
I'm asking that because although the 128-bit HADDPS and the 256-bit
HADDPD have the same number of elements, their horizontal operation
behavior is different (look at AVX manual for details)! If it doesn't,
just remove the 256-bit handling for now.

2) Rename horizontal.ll to sse3-haddsub.ll
3) Can you duplicate the testcase file to something like
avx-haddsub.ll, and check for the AVX 128-bit versions too?
4) Your tablegen modifications are totally fine, for the intrinsics just do:

let Predicates = [HasSSE3] in {
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), VR128:$src2),
          (HADDPSrr VR128:$src1, VR128:$src2)>;
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), (memop addr:$src2)),
          (HADDPSrm VR128:$src1, addr:$src2)>;
...

and

let Predicates = [HasAVX] in {
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), VR128:$src2),
          (VHADDPSrr VR128:$src1, VR128:$src2)>;
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), (memop addr:$src2)),
          (VHADDPSrm VR128:$src1, addr:$src2)>;
...

Thanks Duncan,

Hi Bruno,

Some comments:

+ // Try to synthesize horizontal adds from adds of shuffles.
+ if (((Subtarget->hasSSE3()&& (VT == MVT::v4f32 || VT == MVT::v2f64)) ||
+ (Subtarget->hasAVX()&& (VT == MVT::v8f32 || VT == MVT::v4f64)))&&
+ isHorizontalBinOp(LHS, RHS, true))

1) You probably want to do something like:

"bool HasHorizontalArith = Subtarget->hasSSE3() ||
Subtarget->hasAVX()" and check it for the first condition, because
when AVX is on, the SSE levels are all turned off (as to consider AVX
a reimplementation of all SSE levels).

For the second condition: Does this logic works for 256-bit vectors?
I'm asking that because although the 128-bit HADDPS and the 256-bit
HADDPD have the same number of elements, their horizontal operation
behavior is different (look at AVX manual for details)! If it doesn't,
just remove the 256-bit handling for now.

it's not clear whether it is correct for 256 bit operations. The AVX docs
only describe the integer horizontal operations, not the floating point ones;
for the integer ones the 256 bit ones work differently. If someone has a
machine with AVX to test on, I've attached avx-hadd.s. It should be possible
to do:
   gcc -o avx-hadd avx-hadd.s
   ./avx-hadd
and the result should make it clear.

In the meantime I'm removed the 256 bit logic.

2) Rename horizontal.ll to sse3-haddsub.ll

Done!

3) Can you duplicate the testcase file to something like
avx-haddsub.ll, and check for the AVX 128-bit versions too?

I added the avx checks to the same file (in which case calling it
sse3-haddsub.ll is not so great).

4) Your tablegen modifications are totally fine, for the intrinsics just do:

let Predicates = [HasSSE3] in {
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), VR128:$src2),
           (HADDPSrr VR128:$src1, VR128:$src2)>;
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), (memop addr:$src2)),
           (HADDPSrm VR128:$src1, addr:$src2)>;
...

and

let Predicates = [HasAVX] in {
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), VR128:$src2),
           (VHADDPSrr VR128:$src1, VR128:$src2)>;
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), (memop addr:$src2)),
           (VHADDPSrm VR128:$src1, addr:$src2)>;
...

I came up with a vim macro that added them for me (see attached patch).
Probably there is a way to compress this using tablegen magic, but I don't
know how.

OK to apply?

Ciao, Duncan.

avx-hadd.s (1.69 KB)

hadd.diff (22.8 KB)

The output of the avx-hadd program is

3 11 7 15

Preston

The output of the avx-hadd program is

3 11 7 15

Thanks Preston. That means that the 256 bit AVX floating point horizontal
add works the same as the integer 256 bits ones, i.e. inconsistently with
the 128 bit ones.

Ciao, Duncan.

I added the avx checks to the same file (in which case calling it
sse3-haddsub.ll is not so great).

Oops, I missed that.

4) Your tablegen modifications are totally fine, for the intrinsics just
do:

let Predicates = [HasSSE3] in {
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), VR128:$src2),
(HADDPSrr VR128:$src1, VR128:$src2)>;
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), (memop addr:$src2)),
(HADDPSrm VR128:$src1, addr:$src2)>;
...

and

let Predicates = [HasAVX] in {
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), VR128:$src2),
(VHADDPSrr VR128:$src1, VR128:$src2)>;
def : Pat<(int_x86_sse3_hadd_ps (v4f32 VR128:$src1), (memop addr:$src2)),
(VHADDPSrm VR128:$src1, addr:$src2)>;
...

I came up with a vim macro that added them for me (see attached patch).
Probably there is a way to compress this using tablegen magic, but I don't
know how.

Cool! This is fine! :slight_smile:

OK to apply?

LGTM! Yep!