[RFC][PATCH] Adding absd/hadd/sad intrinsics

Hi All,

I would like to introduce intrinsics to generate efficient codes for 'absolute differences', 'horizontal add'
and 'sum of absolute differences' Idioms used by user programs.

Identifying these idioms at lower level (Codegen) is complex. These idioms can be identified in LV/SLP
and vectorized using above intrinsics to generate better code.

Proposal:
1. Add intrinsics for 'absolute differences', 'horizontal add' and 'sum of absolute differences' having variants of
Fixed / floating point data types and signed/unsigned operations.
2. Recognize these idioms at loop vectorizer or SLP and replace with corresponding intrinsics based on the target's
Backend support.

Having these intrinsic will also help in cost modeling for these idioms which is complex down the lane.

Pls find attached a patch regarding this which is incomplete & dirty at this moment, however, it may help in this
discussion.

Regards,
Shahid

absd_hadd_sad_intrinsic.patch (29.4 KB)

Hi Devs,

Pinging..., as didn't get any response yet. Or shall I assume it is acceptable to all?

Regards,
Shahid

Hi Asghar-Ahmed,

I saw your last ping - sorry, I’m away on vacation and back on Wednesday.

Generally, I’m not sure that having both absd/hadd and sad are compatible with the discussions going on in other threads, for example my thread about min and max.

Given that those two intrinsics are fairly trivial to match , I don’t see the need to have two different canonical forms.

James

Hi James,

Thanks for your response. You are correct that matching the two intrinsic is trivial.

My worry is regarding the query for cost calculation for specific SAD instructions

such as ‘psad’ (X86) or ‘usad’ (ARM) in Loop Vectorizer.

In case of absd/hadd, we have to query the cost separately for both intrinsics and the

returned cost may be bigger than actual cost of ‘psad’ or ‘usad’ instruction.

And hence vectorizer may decide not to generate these intrinsic at all.

I am not sure if my ‘worry’ is valid here, thoughts?

Regards,

Shahid

Hi Shahid,

The vectorizer's cost model has the ability to return different costs
for the same instruction based on the arguments (scalar/vector,
big/small, special cases), so I don't think that adding intrisics will
help you in defining the correct cost. This is true for all other
vectorizer's decisions and it works quite well.

If you find something missing, maybe we should fix the cost model, not
introduce more intrinsics.

cheers,
--renato

Hi Renato,

Thanks for your response. My concern was actually this. For example, take vector type V8i16 on X86 target

With llvm.sad() intrinsic:
VC1 (Vector Cost) = Cost associated with "PSAD" instruction.

W/ llvm.absd() and llvm.hadd()
VC2 = Cost associated with "absolute diff" + "horizontal add" ( ??? )

As I will be querying with getIntrinsicCost(ID) for these two intrinsics separately, Will VC1==VC2?

May be I am missing something obvious?

Regards,
Shahid

I see. You are correct to say that this is a crude approximation.

The way we do today is to get one of them and treat as "cheap", or if
not possible, to hope it'll dilute amidst other more expensive
instructions. Since the cost table is mostly to get it going, having
2/4 of the cost instead of 1/4 of the cost (for diff+add of 4-way
vectors instead of diff+add of 4 scalars) will count little to the
final score and it'll probably encourage vectorization. On the generic
cases that we fail to vectorize, we end up increasing the cost of the
scalar operations.

I agree this is far from ideal, but it works reasonably well. The
alternative would be to have instructions pattern support, which would
give us more fine grained control. I have suggested this many years
ago, but so far, the current model is working well enough so that we
haven't felt the need to implement a complicated pattern matching
support.

The cases where a pattern match would help are mainly:

* Detecting cases where the back end has special instructions for
multiple IR instructions. This is your case, and is common enough that
should benefit almost all back-ends.

* Hazard detection, for instance when moving in and out of VFP
registers, or when two instructions in sequence are really bad in
specific CPUs. This would also benefit multiple back-ends, but
probably has less impact on the quality of the choices.

However, we should first try the current model, and only go towards
the more complex model if we have enough patterns that would benefit
strongly enough to compensate for the increase in complexity. This
should be a consensus decision, I think.

In any case, not an argument to implement intrisics just because the
cost model is not accurate enough. If anything, we should fix the cost
model.

cheers,
--renato

Hi Renato,

That’s right. I agree with your *pattern vs complexity* thinking.

So I would drop llvm.sad() and go ahead with the remaining two.

Does it make sense in general?

Regards,
Shahid

We strive to keep things simple.

Creating too many intrinsics makes the IR brittle and hard to
optimise. Creating pattern matching rules can be complex, but they're
generally preferred to adding intrinsics. But creating a new pattern
matching engine just for the sake of one case is too much, so we end
up using heuristics instead.

However, heuristics are like connective tissue. They look functional,
but they add up to a big and ineffective blob, which ends up being
less effective every new data you add. So, in the end, having a plan
to generate a more efficient pattern matching structure could very
well reduce the amount of code, if (and only if), the plan accounts
for most current cases as well as the new ones.

For the time being, if you can get away with heuristics, and that
fills your allocated time for this task, that it's the best way
forward for now.

cheers,
--renato

For the time being, if you can get away with heuristics, and that fills your
allocated time for this task, that it's the best way forward for now.

Sorry that I could not get what exactly you mean with "heuristics".
Is it the "intrinsics approach" itself or something else?

BTW, now my plan is to just add the two intrinsics for 'absolute difference'
and 'horizontal add'.

Regards,
Shahid

Sorry that I could not get what exactly you mean with "heuristics".
Is it the "intrinsics approach" itself or something else?

No, that was about the cost tables.

BTW, now my plan is to just add the two intrinsics for 'absolute difference'
and 'horizontal add'.

That's ok, as long as they're impossible to represent in plain IR.

cheers,
--renato

Hi,
There is a related issue with horizontal max (and min), please see https://llvm.org/bugs/show_bug.cgi?id=23116
Thanks,
     Nick C