RFC: generation of PSAD instruction

Hello,

I was looking at the following test case which is very relevant in imaging applications.

int sad(unsigned char *pix1, unsigned char *pix2)

{

int sum = 0;

for( int x = 0; x < 16; x++ )

{

sum += abs( pix1[x] - pix2[x] );

}

return sum;

}

The llvm IR generated after all the IR optimizations is

……

%9 = bitcast i8* %8 to <4 x i8>*

%wide.load.1 = load <4 x i8>* %9, align 1

%10 = zext <4 x i8> %wide.load.1 to <4 x i32>

%11 = getelementptr inbounds i8* %pix2, i64 4

%12 = bitcast i8* %11 to <4 x i8>*

%wide.load7.1 = load <4 x i8>* %12, align 1

%13 = zext <4 x i8> %wide.load7.1 to <4 x i32>

%14 = sub nsw <4 x i32> %10, %13

%15 = icmp uge <4 x i8> %wide.load.1, %wide.load7.1

%16 = sub nsw <4 x i32> zeroinitializer, %14

%17 = select <4 x i1> %15, <4 x i32> %14, <4 x i32> %16

%18 = add nsw <4 x i32> %17, %7

%19 = getelementptr inbounds i8* %pix1, i64 8

……

The test case is a perfect example for the generation of Sum of Absolute Difference (SAD) instruction which is present in most of the targets. The proposed generated IR is : ( directly x86 intrinsic is called just to demonstrate the difference )

……

%6 = bitcast i8* %5 to x86_mmx*

%wide.load8.1 = load x86_mmx* %6, align 1

%7 = getelementptr inbounds i8* %pix2, i64 8

%8 = bitcast i8* %7 to x86_mmx*

%wide.load79.1 = load x86_mmx* %8, align 1

%9 = call x86_mmx @llvm.x86.mmx.psad.bw(x86_mmx %wide.load8.1, x86_mmx %wide.load79.1)

%10 = bitcast x86_mmx %9 to i64

%11 = trunc i64 %10 to i32

%12 = add i32 %11, %4

……

This Loop optimization can reduce the loop to only one instruction for every 4 or 8 or 16 iterations of the loop ( depending on the target sad instruction).

Proposed Solution

The sad pattern shown above which is the abs pattern with reduction variable ( sequence of sub, icomp, sub, select, add ) can be replaced with an intrinsic call.

For a non-unrolled loop we can do this in LoopVectorization where all the infrastructure for identifying reduction variables is already there.

We just need to identify the sad pattern, remove those instructions and call an intrinsic.

This can be done in the loop body which Vectorizer creates (vector.body) without disturbing the scalar body of the loop.

For this to happen in LoopVectorizer, we also need to influence the cost modeling of LV. For computing cost for different Vectorization Factors, we should remove the cost of sad pattern and add the cost of intrinsic.

If selected vectorization factor comes to be equal to any operand vector size of basic SAD instruction for that particular target (8 for x86 target) then we can go ahead and replace with intrinsic.

This intrinsic call can be a call to llvm intrinsic which can be lowered into different target intrinsic calls depending on the VF selected.

For unrolled loops, we can target in SLPVectorizer.

The work in loopVectorizer using llvm intrinsic call is ready for x86 codegen and can send the patch after this discussion.

( Attached contains complete IR with and without 8 byte SAD for the above test case after optimizations )

Please provide your thoughts.

Thank you,

Vijender

IR with SAD.txt (951 Bytes)

IR without SAD.txt (2.74 KB)

Hi Vijender,

Thanks for posting this, there is wide support here for improving our support for reductions of various kinds, both in flavor and robustness. I've cc'd some others who have previously discussed this.

James has advocated in the past for an intrinsic for horizontal reductions, which I support. We also need better support for vectorization cost modeling of combinations of IR-level instructions that have efficient target representations. Taken together, I think this addresses the use case you're highlighting here (and more).

-Hal

Hi Vijender,

Thanks for posting this, there is wide support here for improving our support for reductions of various kinds, both in flavor and robustness. I've cc'd some others who have previously discussed this.

James has advocated in the past for an intrinsic for horizontal reductions, which I support. We also need better support for vectorization cost modeling of combinations of IR-level instructions that have efficient target representations. Taken together, I think this addresses the use case you're highlighting here (and more).

The second part (multi-instruction cost modeling) is on my todo list,
to vectorize saturation code. I created a thread recently, and
indeed, people have mentioned SAD as another potential user. I need
to look into how the cost model interacts with the reduction variable
identification (it doesn't?), and will come up with a patch soon!

-Ahmed

Hi all,

I have done some work on implementing generation of SAD in LoopVectorizer. I am sharing two patches which should help to understand my idea. There are many places where I am not sure about the right way, which I want to put forward for discussion. Let me talk about each patch individually.

LoopVectorizerSAD.patch

This patch contains:
  - Introduction of new ReductionKind "RK_IntegerSad".
  - Identify the sad pattern which is a sequence of SUB, ICMP, NEGATE, SELECT and ADD in isSadPattern() function .
  - Introduction of new list into the reduction Descriptor to store the SAD pattern instructions. This list is needed for cost computations and also needed for removal once the LLVM intrinsic is created.
  - Modification of the getWidestType function to return 8 if SAD pattern is present. This is because for SAD in x86, the minimum vectorization factor needed is 8. In the present implementation of cost model, the highest VF which is tested is WidestRegisterSize/WidestType. If WidesrRegisterSize is 128 and if there is any variable in the basic block which is i32 which is not actually related to vector data then also max VF will come to be 4 (128/32). So if I replace widestType with 8 then max VF will come to 16. My thought is that higher vectorization factors like 8 or 16 also needs to be given chance for computing its cost which makes sense for instructions like SAD. To make this happen we can divide the widestRegisterSize with SmallestType. I am not sure about the circumstances of this. We can also get target information and decide.
- For the cost calculations, the SAD intrinsic is not yet present in the basic block. So the relevant PHI node is passed to getInstructionCost() function. Inside this function, a dummy SAD intrinsic is passed and cost is computed.
- Right now the Basic Block created by LoopVectorizer is modified .This code works fine for known trip count. If the trip count is not known at compile time then we may need to have multiple versions of the loop body.

CostModelingSAD.patch

This patch includes:
- Addition of new function getSadInstrCost() in the targetTransformInfo.
- Addition of LLVM Intrinsic which takes anyvector type as arguments and returns I32.
- Addition of new SAD DAG node. This is also required for lowering the LLVM intrinsic.

The work on lowering LLVM intrinsic to X86 is in progress.

Please provide me with your inputs.

Note: These patches may not work if you directly integrate with the code. These are provided only for explanation. Do let me know if you want a complete working patch.

Thank you,
Vijender.

LoopVectorizerSAD.patch (16.4 KB)

costModelingSAD.patch (5.08 KB)

Hi all,

I have done some work on implementing generation of SAD in LoopVectorizer.
I am sharing two patches which should help to understand my idea. There are
many places where I am not sure about the right way, which I want to put
forward for discussion. Let me talk about each patch individually.

LoopVectorizerSAD.patch

I know this is a work-in-progress, but for formatting, please have a look
at the coding standards and try clang-format.

This patch contains:
  - Introduction of new ReductionKind "RK_IntegerSad".
  - Identify the sad pattern which is a sequence of SUB, ICMP, NEGATE,
SELECT and ADD in isSadPattern() function .
  - Introduction of new list into the reduction Descriptor to store the
SAD pattern instructions. This list is needed for cost computations and
also needed for removal once the LLVM intrinsic is created.
  - Modification of the getWidestType function to return 8 if SAD pattern
is present. This is because for SAD in x86, the minimum vectorization
factor needed is 8. In the present implementation of cost model, the
highest VF which is tested is WidestRegisterSize/WidestType. If
WidesrRegisterSize is 128 and if there is any variable in the basic block
which is i32 which is not actually related to vector data then also max VF
will come to be 4 (128/32). So if I replace widestType with 8 then max VF
will come to 16. My thought is that higher vectorization factors like 8 or
16 also needs to be given chance for computing its cost which makes sense
for instructions like SAD. To make this happen we can divide the
widestRegisterSize with SmallestType. I am not sure about the circumstances
of this. We can also get target information and decide.

I haven't had a good look at the rest yet, but this part looks dubious to
me. This should go in the cost model and TTI: determining whether the
vectorization is profitable is the cost model's job; plus, baking
target-specific information in the vectorizer also isn't desirable.

-Ahmed

- For the cost calculations, the SAD intrinsic is not yet present in the

Hi Vijender,

thanks for working on this! Comments below.

Hi all,

I have done some work on implementing generation of SAD in LoopVectorizer. I am sharing two patches which should help to understand my idea. There are many places where I am not sure about the right way, which I want to put forward for discussion. Let me talk about each patch individually.

LoopVectorizerSAD.patch

This patch contains:
- Introduction of new ReductionKind "RK_IntegerSad”.

Sounds good.

- Identify the sad pattern which is a sequence of SUB, ICMP, NEGATE, SELECT and ADD in isSadPattern() function .

I suspect that there will be other patterns than SAD. We should probably look for at least one other similar reduction case and then generalize the logic such that it handles both cases. What I mean by this is that most of the code should not talk about isSADPattern, rather, the specific pattern should be hidden behind a generalized function, for example, isSpecialPattern().

But in principle this is fine.

I think that we should probably only match an SAD pattern if the target supports it.

- Introduction of new list into the reduction Descriptor to store the SAD pattern instructions. This list is needed for cost computations and also needed for removal once the LLVM intrinsic is created.

This should be a set. We should test this set when we vectorize instructions (no need to vectorize and then erase again) and for cost computation (no need to count and then subtract again).

We also need to make sure that instructions in the pattern are not used other than by the pattern (I have not looked closely whether you did that).

- Modification of the getWidestType function to return 8 if SAD pattern is present. This is because for SAD in x86, the minimum vectorization factor needed is 8. In the present implementation of cost model, the highest VF which is tested is WidestRegisterSize/WidestType. If WidesrRegisterSize is 128 and if there is any variable in the basic block which is i32 which is not actually related to vector data then also max VF will come to be 4 (128/32). So if I replace widestType with 8 then max VF will come to 16. My thought is that higher vectorization factors like 8 or 16 also needs to be given chance for computing its cost which makes sense for instructions like SAD. To make this happen we can divide the widestRegisterSize with SmallestType. I am not sure about the circumstances of this. We can also get target information and decide.

In principle fine but this also needs some kind of abstraction: getWidestTypeForPattern or something the like. The pattern can then decide. For an SAD it will be the type of the inputs to the SAD. You don’t need target info for this.

- For the cost calculations, the SAD intrinsic is not yet present in the basic block. So the relevant PHI node is passed to getInstructionCost() function. Inside this function, a dummy SAD intrinsic is passed and cost is computed.

Keying off the PHI node is fine.

- Right now the Basic Block created by LoopVectorizer is modified .This code works fine for known trip count. If the trip count is not known at compile time then we may need to have multiple versions of the loop body.

I don't understand why the TripCount is important. The only thing you need is the vectorization factor VF. The loop vectorizer already generates a skeleton that will jump to a scalar copy of the loop if that is not the case.

CostModelingSAD.patch

This patch includes:
- Addition of new function getSadInstrCost() in the targetTransformInfo.
- Addition of LLVM Intrinsic which takes anyvector type as arguments and returns I32.

Why only i32?

From: "Vj Muthyala" <Vj.Muthyala@amd.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: "LLVM Dev" <llvmdev@cs.uiuc.edu>, "James Molloy" <james@jamesmolloy.co.uk>, "Michael Zolotukhin"
<mzolotukhin@apple.com>, "Nadav Rotem" <nrotem@apple.com>, "Arnold Schwaighofer" <aschwaighofer@apple.com>, "suyog
sarda" <suyog.sarda@samsung.com>
Sent: Monday, February 2, 2015 6:48:36 AM
Subject: RE: [LLVMdev] RFC: generation of PSAD instruction

Hi all,

I have done some work on implementing generation of SAD in
LoopVectorizer. I am sharing two patches which should help to
understand my idea. There are many places where I am not sure about
the right way, which I want to put forward for discussion. Let me
talk about each patch individually.

LoopVectorizerSAD.patch

This patch contains:
  - Introduction of new ReductionKind "RK_IntegerSad".
  - Identify the sad pattern which is a sequence of SUB, ICMP,
  NEGATE, SELECT and ADD in isSadPattern() function .
  - Introduction of new list into the reduction Descriptor to store
  the SAD pattern instructions. This list is needed for cost
  computations and also needed for removal once the LLVM intrinsic
  is created.
  - Modification of the getWidestType function to return 8 if SAD
  pattern is present. This is because for SAD in x86, the minimum
  vectorization factor needed is 8. In the present implementation of
  cost model, the highest VF which is tested is
  WidestRegisterSize/WidestType. If WidesrRegisterSize is 128 and if
  there is any variable in the basic block which is i32 which is not
  actually related to vector data then also max VF will come to be 4
  (128/32). So if I replace widestType with 8 then max VF will come
  to 16. My thought is that higher vectorization factors like 8 or
  16 also needs to be given chance for computing its cost which
  makes sense for instructions like SAD. To make this happen we can
  divide the widestRegisterSize with SmallestType. I am not sure
  about the circumstances of this. We can also get target
  information and decide.

I agree, trying VFs based on the size of the smallest type can make sense. Just make sure there is no significant compile-time impact (if there is, we may need to restrict when we do this, unless it leads more-generally to better outcomes).

- For the cost calculations, the SAD intrinsic is not yet present
in the basic block. So the relevant PHI node is passed to
getInstructionCost() function. Inside this function, a dummy SAD
intrinsic is passed and cost is computed.
- Right now the Basic Block created by LoopVectorizer is modified
.This code works fine for known trip count. If the trip count is
not known at compile time then we may need to have multiple
versions of the loop body.

I'd think you would handle this by having the vector loop handle the part of the trip count divisible by 8, and the tail loop handle the remainder.

-Hal

Hi,

Thank you for your feedback. Fews comments inlined.[VJ]

-Vijender

Comment below.

[VJ] I tired looking for other patterns which are similar but couldn’t find. Do let me know if you find any other pattern.

Perhaps matching a dot product (dpps/dppd) is similar enough?

I filed PR21975 for this a few weeks ago:
http://llvm.org/bugs/show_bug.cgi?id=21975