RFC: Integer saturation intrinsics

Hi all,

I'm proposing integer saturation intrinsics.

def int_ssat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;
def int_usat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;

The first operand is the integer value being saturated, and the second is the saturation bit position.

For scalar integer types, the semantics are:

int_ssat: x < -(1 << (bit-1)) ? -(1 << (bit-1)) : (x > (1 << (bit-1))-1 ? (1 << (bit-1))-1 : x)
int_usat: x < 0 ? 0 : (x > (1 << bit)-1 ? (1 << bit)-1 : x)

e.g., if bit is 8, the range is -128 to 127 for int_ssat and 0 to 255 for int_usat. This is useful for i16 / i32 / i64 to i8 saturation code like the following:
(x < -128) ? -128 : (x > 127 ? 127 : x)
(x < 0) ? 0 : (x > 255 ? 255 : x)
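
A minimal C sketch of the scalar semantics above (purely illustrative; the helper names are made up, and bit is assumed small enough that the shifts stay within 64 bits):

#include <stdint.h>

/* Reference model of the proposed semantics. "bit" is assumed to be small
   enough (e.g. 8, 16, 32) that the shifts below stay within 64 bits. */
static int64_t ref_ssat(int64_t x, unsigned bit) {
    int64_t hi = ((int64_t)1 << (bit - 1)) - 1;   /* e.g.  127 when bit == 8 */
    int64_t lo = -((int64_t)1 << (bit - 1));      /* e.g. -128 when bit == 8 */
    return x < lo ? lo : (x > hi ? hi : x);
}

static int64_t ref_usat(int64_t x, unsigned bit) {
    int64_t hi = ((int64_t)1 << bit) - 1;         /* e.g.  255 when bit == 8 */
    return x < 0 ? 0 : (x > hi ? hi : x);
}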

If the source type is an integer vector type, then each element of the vector is saturated.

For ARM, these intrinsics will map to the usat / ssat instructions; the semantics match exactly. X86 doesn't have scalar saturation instructions, but SSE does have packed add, packed sub, and pack with saturation, so it's possible to instruction-select patterns such as (int_{s|u}sat ({add|sub} x, y), c).
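
To illustrate the kind of per-element operation such a pattern describes, here is a C sketch of an unsigned saturating 16-bit subtract written in the (int_usat (sub x, y), 16) shape; the function name is made up:

#include <stdint.h>

/* One lane of an unsigned saturating 16-bit subtract: a plain subtract
   followed by a usat-style clamp to [0, 65535]. */
static uint16_t usat_sub16(uint16_t x, uint16_t y) {
    int32_t d = (int32_t)x - (int32_t)y;  /* (sub x, y) on widened operands */
    if (d < 0)     d = 0;                 /* lower bound of the usat clamp  */
    if (d > 65535) d = 65535;             /* upper bound (can't fire here)  */
    return (uint16_t)d;
}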

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Evan

Hi all,

I'm proposing integer saturation intrinsics.

def int_ssat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;
def int_usat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;

The first operand is the integer value being saturated, and the second is the saturation bit position.

For scalar integer types, the semantics are:

int_ssat: x < -(1 << (bit-1)) ? -(1 << (bit-1)) : (x > (1 << (bit-1))-1 ? (1 << (bit-1))-1 : x)
int_usat: x < 0 ? 0 : (x > (1 << bit)-1 ? (1 << bit)-1 : x)

e.g., if bit is 8, the range is -128 to 127 for int_ssat and 0 to 255 for int_usat. This is useful for i16 / i32 / i64 to i8 saturation code like the following:
(x < -128) ? -128 : (x > 127 ? 127 : x)
(x < 0) ? 0 : (x > 255 ? 255 : x)

If the source type is an integer vector type, then each element of the vector is saturated.

For ARM, these intrinsics will map to the usat / ssat instructions; the semantics match exactly. X86 doesn't have scalar saturation instructions, but SSE does have packed add, packed sub, and pack with saturation, so it's possible to instruction-select patterns such as (int_{s|u}sat ({add|sub} x, y), c).

The stated pattern simply doesn't work. A portable saturating
add/subtract intrinsic might be nice given that most vector
instruction sets have such an instruction, but this seems completely
orthogonal.

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Is there some reason why pattern-matching this in an ARM-specific
DAGCombine doesn't work?

-Eli

These will replace the existing int_arm_ssat and int_arm_usat intrinsics, right?

Hi all,

I'm proposing integer saturation intrinsics.

def int_ssat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;
def int_usat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;

The first operand is the integer value being saturated, and the second is the saturation bit position.

For scalar integer types, the semantics are:

int_ssat: x < -(1 << (bit-1)) ? -(1 << (bit-1)) : (x > (1 << (bit-1))-1 ? (1 << (bit-1))-1 : x)
int_usat: x < 0 ? 0 : (x > (1 << bit)-1 ? (1 << bit)-1 : x)

e.g., if bit is 8, the range is -128 to 127 for int_ssat and 0 to 255 for int_usat. This is useful for i16 / i32 / i64 to i8 saturation code like the following:
(x < -128) ? -128 : (x > 127 ? 127 : x)
(x < 0) ? 0 : (x > 255 ? 255 : x)

If the source type is an integer vector type, then each element of the vector is saturated.

For ARM, these intrinsics will map to the usat / ssat instructions; the semantics match exactly. X86 doesn't have scalar saturation instructions, but SSE does have packed add, packed sub, and pack with saturation, so it's possible to instruction-select patterns such as (int_{s|u}sat ({add|sub} x, y), c).

The stated pattern simply doesn't work. A portable saturating
add/subtract intrinsic might be nice given that most vector
instruction sets have such an instruction, but this seems completely
orthogonal.

Can you explain why you think the pattern (which?) would not work?

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Is there some reason why pattern-matching this in an ARM-specific
DAGCombine doesn't work?

It's not possible to look beyond a single BB at isel time.

Evan

Hi all,

I'm proposing integer saturation intrinsics.

def int_ssat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;
def int_usat : Intrinsic<[llvm_anyint_ty], [LLVMMatchType<0>, llvm_i32_ty]>;

The first operand is the integer value being saturated, and the second is the saturation bit position.

For scalar integer types, the semantics are:

int_ssat: x < -(1 << (bit-1)) ? -(1 << (bit-1)) : (x > (1 << (bit-1))-1 ? (1 << (bit-1))-1 : x)
int_usat: x < 0 ? 0 : (x > (1 << bit)-1 ? (1 << bit)-1 : x)

e.g., if bit is 8, the range is -128 to 127 for int_ssat and 0 to 255 for int_usat. This is useful for i16 / i32 / i64 to i8 saturation code like the following:
(x < -128) ? -128 : (x > 127 ? 127 : x)
(x < 0) ? 0 : (x > 255 ? 255 : x)

If the source type is an integer vector type, then each element of the vector is saturated.

For ARM, these intrinsics will map to the usat / ssat instructions; the semantics match exactly. X86 doesn't have scalar saturation instructions, but SSE does have packed add, packed sub, and pack with saturation, so it's possible to instruction-select patterns such as (int_{s|u}sat ({add|sub} x, y), c).

The stated pattern simply doesn't work. A portable saturating
add/subtract intrinsic might be nice given that most vector
instruction sets have such an instruction, but this seems completely
orthogonal.

Can you explain why you think the pattern (which?) would not work?

Suppose you want a paddsw. To express the equivalent using ssat, you
would have to write (trunc (ssat (add (sext x), (sext y)), c)). And I
wouldn't trust that to work.
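
Spelled out in C, one lane of that expression would look roughly like the following sketch (the name is made up):

#include <stdint.h>

/* One lane of a 16-bit signed saturating add, written in the
   (trunc (ssat (add (sext x), (sext y)), 16)) shape described above. */
static int16_t paddsw_lane(int16_t x, int16_t y) {
    int32_t wide = (int32_t)x + (int32_t)y;  /* add of the sign-extended operands */
    if (wide >  32767) wide =  32767;        /* the ssat(..., 16) clamp           */
    if (wide < -32768) wide = -32768;
    return (int16_t)wide;                    /* the trunc back to 16 bits         */
}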

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Is there some reason why pattern-matching this in an ARM-specific
DAGCombine doesn't work?

It's not possible to look beyond a single BB at isel time.

Anything that we can match to ssat should be of the form max(min(x,
SATMAX), SATMIN) (where max and min are icmp+select pairs). If the
min and max aren't in the same block, and we don't have an IR
transformation to put them in the same block, we should fix that
rather than introducing an intrinsic for this special case, I
think...

-Eli

Okay, thinking about it a bit more, I don't think this is practical.

I'm still skeptical that adding platform-independent intrinsics for
arbitrary ARM instructions is a good idea simply because we don't have
the infrastructure to handle them otherwise. It wouldn't be
especially hard to allow target-specific transforms on IR...

-Eli

The plan is to form calls to these intrinsics in InstCombine. The legalizer
can expand these intrinsics if they are not legal. The expansion should
be fairly straightforward and produce code that is at least as good as
what LLVM is currently generating for these code sequences.

SSAT/USAT may set the Q (saturation) flag, which might cause problems
for anyone relying on explicitly using saturating operations (QADD etc.)
and testing the Q flag. So there are several possibilities:

- you could do liveness analysis on Q and only introduce SSAT/USAT if
   Q is not live

- you could fall back to introducing SSAT/USAT only when you can tell there
   is no test of Q in the function (the ARM ABI says Q is not defined across
   ABI-compliant function boundaries)

- you could just go ahead and do it anyway (might be appropriate for a
   "fast and loose" optimization mode)

If you can easily factor Q into LLVM's liveness analysis, that might be
the best solution. (It also allows you to discard the instruction if
neither its result nor the Q flag is tested - you can't normally drop
Q-flag-affecting instructions as they may be executed for side effects).
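
To make the hazard concrete, here is a rough C sketch, assuming an ARM target where QADD is available and a compiler that accepts GNU inline asm (the helper names are made up), of code whose observable result depends on nothing else touching Q:

#include <stdint.h>

/* Saturating add via QADD; QADD sets the sticky Q flag on saturation.
   'volatile' because the Q side effect is invisible to the compiler. */
static int32_t qadd(int32_t a, int32_t b) {
    int32_t r;
    __asm__ volatile("qadd %0, %1, %2" : "=r"(r) : "r"(a), "r"(b));
    return r;
}

/* Read the Q flag: bit 27 of the APSR. */
static int q_flag_set(void) {
    uint32_t apsr;
    __asm__ volatile("mrs %0, apsr" : "=r"(apsr));
    return (apsr >> 27) & 1;
}

/* If the compiler inserted an SSAT/USAT between the qadd and the test,
   a spurious saturation could be reported here. */
int add_and_check(int32_t a, int32_t b, int32_t *out) {
    *out = qadd(a, b);
    return q_flag_set();
}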

There are similar issues with some other ARM instructions, e.g. SMLABB.

Q is a sticky bit, like the floating-point cumulative exception flags,
so the issues are a bit similar (the same issue would apply if e.g. you
wanted to exploit floating-point hardware for integer division in a mode
where the inexact exception was valid). You have the same issues of
modelling the use of machine status flags, and understanding what
optimizations are/aren't legal at any program point.

Al

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Is there some reason why pattern-matching this in an ARM-specific
DAGCombine doesn't work?

It's not possible to look beyond a single BB at isel time.

Anything that we can match to ssat should be of the form max(min(x,
SATMAX), SATMIN) (where max and min are icmp+select pairs). If the
min and max aren't in the same block, and we don't have an IR
transformation to put them in the same block, we should fix that
rather than introducing an intrinsic for this special case, I
think...

Okay, thinking about it a bit more, I don't think this is practical.

I think this will solve this particular problem, but I'm not certain it's the right thing to do in general; LLVM already has a tendency to form selects too aggressively. What's your concern?

I'm still skeptical that adding platform-independent intrinsics for
arbitrary ARM instructions is a good idea simply because we don't have
the infrastructure to handle them otherwise. It wouldn't be
especially hard to allow target-specific transforms on IR...

It's easy to expand these intrinsics at legalization time. Also, LLVM is already generating inferior code for these patterns on ARM and x86.

Evan

CPSR liveness is not a concern. The instructions will be modeled as such and isel and scheduling will do the right thing.

Evan

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Is there some reason why pattern-matching this in an ARM-specific
DAGCombine doesn't work?

It's not possible to look beyond a single BB at isel time.

Anything that we can match to ssat should be of the form max(min(x,
SATMAX), SATMIN) (where max and min are icmp+select pairs). If the
min and max aren't in the same block, and we don't have an IR
transformation to put them in the same block, we should fix that
rather than introducing an intrinsic for this special case, I
think...

Okay, thinking about it a bit more, I don't think this is practical.

I think this will solve this particular problem, but I'm not certain it's the right thing to do in general; LLVM already has a tendency to form selects too aggressively. What's your concern?

I wasn't really considering the issues with converting branches into
selects, actually... I didn't realize that's what you were running
into. :) I was thinking more in terms of instruction sinking, which
we really don't have an aggressive IR pass for at the moment. But
you're right, converting chains of branches like that into selects
without considering what we'll end up with is a bit suspicious.

I'm still skeptical that adding platform-independent intrinsics for
arbitrary ARM instructions is a good idea simply because we don't have
the infrastructure to handle them otherwise. It wouldn't be
especially hard to allow target-specific transforms on IR...

It's easy to expand these intrinsics at legalization time. Also, LLVM is already generating inferior code for these patterns on ARM and x86.

We also have to consider whether forming ssat will require substantial
changes to other passes; thinking about it, it won't be too bad here.

I still don't see the point of a usat intrinsic; we ought to already
handle constructs that should form into usat effectively.

-Eli

The plan is to form calls to these intrinsics in InstCombine. The legalizer can expand these intrinsics if they are not legal. The expansion should be fairly straightforward and produce code that is at least as good as what LLVM is currently generating for these code sequences.

Comments?

Is there some reason why pattern-matching this in an ARM-specific
DAGCombine doesn't work?

It's not possible to look beyond a single BB at isel time.

Anything that we can match to ssat should be of the form max(min(x,
SATMAX), SATMIN) (where max and min are icmp+select pairs). If the
min and max aren't in the same block, and we don't have an IR
transformation to put them in the same block, we should fix that
rather than introducing an intrinsic for this special case, I
think...

Okay, thinking about it a bit more, I don't think this is practical.

I think this will solve this particular problem, but I'm not certain it's the right thing to do in general; LLVM already has a tendency to form selects too aggressively. What's your concern?

I wasn't really considering the issues with converting branches into
selects, actually... I didn't realize that's what you were running
into. :) I was thinking more in terms of instruction sinking, which
we really don't have an aggressive IR pass for at the moment. But
you're right, converting chains of branches like that into selects
without considering what we'll end up with is a bit suspicious.

I ran into this from the other angle recently. The middle-end isn't being aggressive enough in folding branches into selects, which is blocking desired optimizations.

I was thinking we may want to add a backend pass that decides what to do about long chains of computations leading into select instructions. The backend would have a better idea of how expensive a branch is compared to the instructions, or whether the instructions could be predicated, etc.

Nick