supporting SAD in loop vectorizer

Nadav and other vectorizer folks-

Is there any plan to support special idioms in the loop vectorizer, like sum of absolute differences (SAD)? We see some useful cases where LLVM loses performance at -O3 because SADs are not vectorized (and hence PSADBWs are not generated).

Also, since the abs() call is already lowered to an 'icmp; neg; select' sequence by simplifylibcalls (at -O3), we would then need to recognize this pattern in the loop vectorizer (as part of the reduction analysis) and handle it there.
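
For concreteness, the expanded form that the reduction analysis would have to recognise looks roughly like the following C-level sketch (illustrative names only, not actual compiler output):
-----------------------------------------------------
/* Sketch: what abs(a - b) becomes once simplifylibcalls expands the
   call; the reduction analysis would need to see through this.      */
static int abs_diff_expanded(int a, int b)
{
    int d  = a - b;             /* sub              */
    int nd = -d;                /* neg (0 - d)      */
    return (d < 0) ? nd : d;    /* icmp slt; select */
}
-----------------------------------------------------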

Thoughts?

-Thx
Dibyendu

It's been a while, but this could either be that the legalisation
phase is not recognising the reduction, or that the cost model is
not taking the lowered abs() into account.

What does -debug-only=loop-vectorize say about it?

cheers,
--renato

FWIW, I agree, this sounds like a cost-model problem. The loop vectorizer should be able to vectorize the 'icmp; neg; select' pattern, and then the backend can pattern-match that, together with the reduction (which is a series of shuffles and extract_elements), into the single PSADBW instruction -- we're quite likely missing the target code to do that.
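
For reference, a single 128-bit PSADBW computes two 64-bit lanes, each holding the sum of absolute differences of eight unsigned byte pairs; in rough C terms it behaves like the model below (a sketch of the semantics only, not of the actual lowering):
-----------------------------------------------------
#include <stdint.h>
#include <stdlib.h>

/* Rough model of one 128-bit PSADBW: each 64-bit output lane holds
   the sum of absolute differences of eight byte pairs.             */
static void psadbw_model(const uint8_t a[16], const uint8_t b[16],
                         uint64_t out[2])
{
    for (int half = 0; half < 2; half++) {
        uint64_t sum = 0;
        for (int i = 0; i < 8; i++)
            sum += (uint64_t)abs((int)a[half*8 + i] - (int)b[half*8 + i]);
        out[half] = sum;
    }
}
-----------------------------------------------------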

-Hal

I will get the debug dump and get back on this.

From: "Dibyendu Das" <Dibyendu.Das@amd.com>
To: "Hal Finkel" <hfinkel@anl.gov>, "Renato Golin" <renato.golin@linaro.org>
Cc: llvmdev@cs.uiuc.edu
Sent: Tuesday, November 4, 2014 12:15:12 PM
Subject: RE: [LLVMdev] supporting SAD in loop vectorizer

Here's the simple SAD code:
---------------------------------------------------
#include <stdlib.h>

extern int ly, lx;
int sad_c( unsigned char *pix1, unsigned char *pix2 )
{
    int i_sum = 0;
    for( int x = 0; x < lx; x++ )
        i_sum += abs( pix1[x] - pix2[x] );
    return i_sum;
}
-----------------------------------------------------

The loop vectorizer does vectorize the loop and then unrolls it
twice. The main body of the loop ends up containing the
'icmp; neg; select' pattern twice.
Are we saying we pattern-match this to PSADBW in the target?

Yes.

That seems
to have some challenges

It does, but we already have code in the backend that matches other horizontal operations (see isHorizontalBinOp and its callers in lib/Target/X86/X86ISelLowering.cpp), and I suspect this won't be significantly more complicated.

including the fact that we would need a
4-way unroll to use the full 128 bits of a PSADBW. Or am I
missing something?

No, each unrolled copy will get its own, so you'll get one PSADBW per unrolled copy of the loop body. Each copy is vectorized in terms of <4 x i32>, and that is the 128 bits you need.
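
Concretely, the shape after vectorizing by <4 x i32> and interleaving two copies is roughly the following (a C-level sketch only; the inner j-loops stand in for the vector operations, and sad_shape is just an illustrative name):
-----------------------------------------------------
/* Sketch of the vectorized (VF=4) and 2-way unrolled loop structure. */
static int sad_shape(const unsigned char *pix1, const unsigned char *pix2,
                     int lx)
{
    int sum0[4] = {0, 0, 0, 0}, sum1[4] = {0, 0, 0, 0};
    int i_sum = 0, x = 0;
    for (; x + 8 <= lx; x += 8) {
        for (int j = 0; j < 4; j++) {              /* first <4 x i32> copy  */
            int d = (int)pix1[x + j] - (int)pix2[x + j];
            sum0[j] += d < 0 ? -d : d;             /* icmp; neg; select     */
        }
        for (int j = 0; j < 4; j++) {              /* second <4 x i32> copy */
            int d = (int)pix1[x + 4 + j] - (int)pix2[x + 4 + j];
            sum1[j] += d < 0 ? -d : d;
        }
    }
    for (int j = 0; j < 4; j++)                    /* reduce the partial    */
        i_sum += sum0[j] + sum1[j];                /* sums after the loop   */
    for (; x < lx; x++) {                          /* scalar remainder loop */
        int d = (int)pix1[x] - (int)pix2[x];
        i_sum += d < 0 ? -d : d;
    }
    return i_sum;
}
-----------------------------------------------------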

If you'd like to contribute support for this, look at isHorizontalBinOp and go from there. Feel free to ask questions if you get stuck.

-Hal

FWIW, I've looked at isHorizontalBinOp for inspiration for matching AArch64 ADDV and friends (horizontal reduction operations), and found it rather temperamental and prone to breaking depending on the exact form of the IR. Given that we don't have a canonical form for reductions, I think it is wrong that we expect targets to undo quite complex patterns.

The reduction pattern is a log2(N) sequence of shuffles and binops that is really rather complex. These sorts of things should, IMHO, be intrinsics. I chatted with Arnold about this at the devmtg and was going to send a patch to do exactly that in a week or so.
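
For reference, the log2(N) reduction of a <4 x i32> partial sum amounts to something like the following (a C sketch; the array stands in for the vector, and each step corresponds to one shuffle plus one vector add in the IR):
-----------------------------------------------------
/* Tree reduction of a 4-lane partial sum: each step folds the high
   half onto the low half (one shuffle + one add per step).         */
static int reduce_add_v4(const int v[4])
{
    int t0 = v[0] + v[2];      /* step 1: shuffle lanes 2,3 down; add */
    int t1 = v[1] + v[3];
    return t0 + t1;            /* step 2: final add, then extract     */
}
-----------------------------------------------------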

Cheers,

James

Thx James and Hal. I will have a look at isHorizontalBinOp and get back.

Sounds good. We should try hard to canonicalize the shuffle sequence into the intrinsic in InstCombine (in addition to emitting the intrinsic directly from the vectorizer); it is likely easier to do that there than in the backend.

-Hal

Or maybe we should do this in CGP (CodeGenPrepare) -- would we want to do this if there is no actual target support?

-Hal

Hi Hal,

My current implementation introduces new SDNodes (ISD::REDUCE_ADD and friends); if the target node doesn't exist, it scalarises in SelectionDAGBuilder to the log2(N) method currently employed by the loop vectorizer.

James

While there is no single instruction in e.g. ARMv7 that does a horizontal reduction, you can still do better than the loop in the source code, and the easy way to get the optimal result is probably to transform the loop into a horizontal-reduction intrinsic and then lower that to a target-appropriate sequence of instructions.

e.g.

vpadd.i8 d1, d16, d17    @ pairwise add of adjacent bytes: d16,d17 (16 x u8) -> d1 (8 x u8)
vpaddl.u8 d1, d1         @ pairwise add long: 8 x u8 -> 4 x u16
vpaddl.u16 d1, d1        @ pairwise add long: 4 x u16 -> 2 x u32
vmov.32 r1, d1[1]        @ move the upper 32-bit lane to r1
vmov.32 r0, d1[0]        @ move the lower 32-bit lane to r0
add r0, r1               @ final scalar add: r0 = r0 + r1

(I'm not sure offhand whether doing an additional reduction stage in the vector unit and transferring just one result to the integer registers would be possible or desirable.)

Hal, James and others-

We have started looking into supporting this. We may get back to you for some clarifications.

-dibyendu