[RFC] Integer Saturation Intrinsics

Hi all,

The patches linked below introduce a new family of intrinsics for
integer saturation: @llvm.usat and @llvm.ssat (unsigned/signed).
Quoting the added documentation:

      %r = call i32 @llvm.ssat.i32(i32 %x, i32 %n)

is equivalent to the expression min(max(x, -2^(n-1)), 2^(n-1)-1), itself
implementable as the following IR:

      %min_sint_n = i32 ... ; the min. signed integer of bitwidth n, -2^(n-1)
      %max_sint_n = i32 ... ; the max. signed integer of bitwidth n, 2^(n-1)-1
      %0 = icmp slt i32 %x, %min_sint_n
      %1 = select i1 %0, i32 %min_sint_n, i32 %x
      %2 = icmp sgt i32 %1, %max_sint_n
      %r = select i1 %2, i32 %max_sint_n, i32 %1
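
The patches also define the unsigned variant; its documentation isn't
quoted above, so here is a hedged C sketch of the expansion I would
expect, assuming @llvm.usat clamps a signed input to [0, 2^n - 1] by
analogy with ARM's USAT (the name usat32 and the 0 < n < 32
precondition are mine, not the patch's):

    #include <stdint.h>

    /* Sketch only: clamp a signed 32-bit value to [0, 2^n - 1].
       Assumes 0 < n < 32 so that the shift below is well-defined. */
    uint32_t usat32(int32_t x, uint32_t n) {
        uint32_t max = (1u << n) - 1;  /* 2^n - 1 */
        if (x < 0)
            return 0;
        if ((uint32_t)x > max)
            return max;
        return (uint32_t)x;
    }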

As a starting point, here are two patches:
- http://reviews.llvm.org/D6976 Add Integer Saturation Intrinsics.
- http://reviews.llvm.org/D6977 [CodeGen] Add legalization for
Integer Saturation Intrinsics.

From there, we can generate several new instructions that are more
efficient than their expanded counterparts. Locally, I have worked on:
- ARM: the SSAT/USAT instructions (scalar)
- AArch64: the SQ/UQ ADD/SUB instructions (vector/scalar
saturating arithmetic)
- X86: PACK SS/US (vector, saturate+truncate)
- X86: PADD/SUB S/US (vector, saturating arithmetic)

Anyway, let's first agree on the intrinsics, so that further
development is done on trunk.

Thanks!
-Ahmed

At a very high level, why do we need these intrinsics?

What is the use case? What are typical values for N?

Have you looked at just generating the conditional IR and then pattern matching late? What's the performance trade off involved?

My default position is that we shouldn't add these unless there is a compelling use case. We *still* haven't gotten the *.with.overflow intrinsics piped through everywhere in the optimizer. I'm reluctant to keep adding new ones.

Philip

At a very high level, why do we need these intrinsics?

In short, to catch sequences you can't catch in the SelectionDAG.

What is the use case? What are typical values for N?

Typically, you get this from compression, DSP, or pixel-handling code
(categories that overlap somewhat).
Off the top of my head, this occurs in paq8p in the test-suite, as
well as a few other tests.

You'd have something like:
    a = x + y;
    if (a < -128)
      a = -128;
    if (a > 127)
      a = 127;

Have you looked at just generating the conditional IR and then pattern
matching late? What's the performance trade off involved?

That's a valid concern. The original problem is that we can't catch
this kind of thing in the SelectionDAG, because we're limited to a
single basic block. I guess we could (and I gather that's the
alternative you're presenting?) canonicalize the control flow to the
2icmp+2select sequence, but I wasn't sure that was "workable". Truth
be told, I didn't investigate this very thoroughly, as I didn't expect
reluctance on adding intrinsics! I'll look into it some more: avoid
adding the intrinsic, keep the codegen additions as is, match the
pattern in CGP instead of InstCombine.
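
For reference, the select form at the C level would look like the
following sketch (clamp_i8 is a name I made up; the bounds are the i8
ones from the example above). Once the branches become selects, the
whole clamp is a single basic block, which SelectionDAG could then
match:

    #include <stdint.h>

    int32_t clamp_i8(int32_t x, int32_t y) {
        int32_t a = x + y;
        a = (a < -128) ? -128 : a;   /* icmp slt + select */
        a = (a >  127) ?  127 : a;   /* icmp sgt + select */
        return a;
    }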

My default position is that we shouldn't add these unless there is a
compelling use case. We *still* haven't gotten the *.with.overflow
intrinsics piped through everywhere in the optimizer. I'm reluctant to keep
adding new ones.

I get your point, and agree with the general idea, but I think
saturation is easier to reason about, so it wouldn't be as big a
problem. At least it was pretty fast to implement the usual suspects:
vectorizability, known bits, etc. Do you have a specific part in
mind?

Anyway, thanks for the feedback!

-Ahmed

A couple of questions:

1) Should this really be an intrinsic and not a flag on add? The add instruction already allows overflow to be either undefined or defined to wrap. Making it defined to saturate seems a natural extension.

2) How do you imagine this being used and what are the guarantees for sequences of operations with respect to optimisation? If I do a+b-c (or +c where c is negative), and a+b would saturate, but a+(b-c) would not, then is it allowed for an optimiser to generate the second rather than the first? If it's an intrinsic that's opaque to optimisers, then that's not a problem for correctness, but then you'll miss some potentially beneficial optimisations.
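
For concreteness, here is a sketch of how the two forms can diverge,
with arbitrarily chosen values (the sat8 helper is hypothetical,
standing in for saturation at n = 8):

    #include <stdio.h>

    /* Hypothetical helper: clamp an int to the signed 8-bit range. */
    static int sat8(int v) { return v < -128 ? -128 : v > 127 ? 127 : v; }

    int main(void) {
        int a = 100, b = 60, c = 40;
        printf("%d\n", sat8(sat8(a + b) - c)); /* 87: a + b saturates to 127 */
        printf("%d\n", sat8(a + (b - c)));     /* 120: the sum never saturates */
        return 0;
    }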

David

A couple of questions:

1) Should this really be an intrinsic and not a flag on add? The add
instruction already allows overflow to be either undefined or defined to
wrap. Making it defined to saturate seems a natural extension.

I don't think this should be a flag on add. Flags are designed such that
the middle-end may be ignorant of them and nothing bad will happen: it
is always safe to ignore or drop flags when doing so is convenient (for a
concrete example, take a look at reassociate).

In this case, the saturating nature of the operation does not seem like
something that can be safely ignored.

This is true of metadata, not of flags. Consider the atomic memory order on loads and stores, for example. It is definitely not safe for an optimiser to ignore these.

David

I don't think this should be a flag on add. Flags are designed such
that the middle-end may be ignorant of them and nothing bad will happen:
it is always safe to ignore or drop flags when doing so is convenient (for
a concrete example, take a look at reassociate).

This is true of metadata, not of flags. Consider the atomic memory order
on loads and stores, for example. It is definitely not safe for an
optimiser to ignore these.

The arithmetic operations are *very* consistent in that flags imply a
relaxation of constraints: floating point has fast-math flags, division
has exact, etc.
Memory instructions seem to have gone in the other direction.

I would recommend that such optimizations not be done. Where I have seen saturating arithmetic used before is in things like digital signal processing. In that case, the programmers will generally be very careful to craft exactly what they want... efficiently.

The undefined vs. wrap distinction is a semantic difference that could affect optimization, too. The result of the operation is undefined in one case and well defined in the wrap case. The undefined case should allow some code motion optimizations that the wrap case doesn't. Algebraic transformations are affected by both wrap and saturation, although saturation is more restrictive in that it affects add and subtract, while wrap only blocks refactoring when multiply and divide are involved.
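
To make that concrete, here is a small sketch (values arbitrary; sat8
is a hypothetical stand-in for 8-bit saturation, and the int8_t
conversions assume the usual two's-complement wrapping):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical helper: clamp an int to the signed 8-bit range. */
    static int sat8(int v) { return v < -128 ? -128 : v > 127 ? 127 : v; }

    int main(void) {
        int a = 100, b = 60;
        /* Wrapping i8: (a + b) - b always cancels back to a (mod 2^8). */
        printf("%d\n", (int8_t)((int8_t)(a + b) - b)); /* 100 */
        /* Saturating i8: the intermediate clamps at 127, so it doesn't. */
        printf("%d\n", sat8(sat8(a + b) - b));         /* 67 */
        return 0;
    }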

At a very high level, why do we need these intrinsics?

In short, to catch sequences you can't catch in the SelectionDAG.

What is the use case? What are typical values for N?

Typically, you get this from compression, DSP, or pixel-handling code
(categories that overlap somewhat).
Off the top of my head, this occurs in paq8p in the test-suite, as
well as a few other tests.

You'd have something like:
     a = x + y;
     if (a < -128)
       a = -128;
     if (a > 127)
       a = 127;

Have you looked at just generating the conditional IR and then pattern
matching late? What's the performance trade off involved?

That's a valid concern. The original problem is that we can't catch
this kind of thing in the SelectionDAG, because we're limited to a
single basic block. I guess we could (and I gather that's the
alternative you're presenting?) canonicalize the control flow to the
2icmp+2select sequence, but I wasn't sure that was "workable". Truth
be told, I didn't investigate this very thoroughly, as I didn't expect
reluctance on adding intrinsics! I'll look into it some more: avoid
adding the intrinsic, keep the codegen additions as is, match the
pattern in CGP instead of InstCombine.

Just to be clear, I'm not saying "don't add an intrinsic". I am saying "make sure the cost of the intrinsic is worth it". In particular, I think you're going to give up a lot of optimization benefit in practice by using intrinsics unless you put a *lot* of effort into making it work everywhere.

One middle ground might be to have an intrinsic which is immediately lowered to the select form, let that run through most of the IR optimization passes, and then pattern match in either SelectionDAG or CGP back to the intrinsic. Not saying this is necessarily the right approach, but it might be worth thinking about.

My default position is that we shouldn't add these unless there is a
compelling use case. We *still* haven't gotten the *.with.overflow
intrinsics piped through everywhere in the optimizer. I'm reluctant to keep
adding new ones.

I get your point, and agree with the general idea, but I think
saturation is easier to reason about, so it wouldn't be as big a
problem. At least it was pretty fast to implement the usual suspects:
vectorizability, known bits, etc. Do you have a specific part in
mind?

known bits, inst simplify, inst combine, slp vectorize, loop vectorize, licm, gvn, lvi, early cse, etc., etc.

To resurrect an old thread, MHO is that adding intrinsics for these is the right way to go. As Ahmed points out, pattern matching these can be complicated and is best done by the middle end. There are clear optional expansion patterns for backends with various features (including the obvious generic expansion using cmove).

These are the same reasons that we have an intrinsic for bswap, and it has worked out really well. I agree that this shouldn’t be a flag on the add instruction.

-Chris