max/min intrinsics

I have been working on a patch to add support for max/min reductions in LoopVectorize. One of the comments that came up in review is that the implementation could be simplified (and made less fragile) if max and min intrinsics were recognized rather than looking for compare-select sequences.

The suggestion was to change compare-selects into max and min intrinsic calls during instcombine.

The intrinsics to add are:
declare iN llvm.{smin,smax}.iN(iN %a, iN %b)
declare iN llvm.{umin,umax}.iN(iN %a, iN %b)
declare fN llvm.{fmin,fmax}.fN(fN %a, fN %b)
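
For instance, instcombine would then rewrite the usual signed-max idiom into a single call (a sketch, using the proposed i32 variant):

; before: compare + select computing the signed maximum
%c = icmp sgt i32 %a, %b
%m = select i1 %c, i32 %a, i32 %b

; after instcombine: one intrinsic call
%m = call i32 @llvm.smax.i32(i32 %a, i32 %b)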

What does the community think?

Paul

Redmond, Paul wrote:

I have been working on a patch to add support for max/min reductions in LoopVectorize. One of the comments that came up in review is that the implementation could be simplified (and made less fragile) if max and min intrinsics were recognized rather than looking for compare-select sequences.

The suggestion was to change compare-selects into max and min intrinsic calls during instcombine.

+1

Sebastian

Hi Paul,

Min/max intrinsics certainly make Loop Nest Optimization (including innermost-loop vectorization) a lot easier.
However, I would like them to be "lowered" in the lower-level scalar optimizer (sopt).

I feel that "raw" instructions are a bit easier to optimize in sopt than an integrated instruction, and that the "raw" instructions could expose more opportunities.

E.g., in the following snippet, if we break the max() into "raw" instructions, the cost of the comparison is reduced thanks to CSE, and it is also revealed that, more often than not, Z holds the value of min_v + 2. An opaque max() call obscures this information.
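
Something along these lines, where %x stands for some other value in the loop:

define i32 @example(i32 %x, i32 %min_v) {
entry:
  ; max(%x, %min_v + 2) broken into "raw" compare + select
  %t   = add i32 %min_v, 2
  %cmp = icmp sgt i32 %x, %t
  %Z   = select i1 %cmp, i32 %x, i32 %t
  ; CSE lets the branch reuse %cmp instead of comparing again,
  ; and on the common path %Z is known to equal %min_v + 2
  br i1 %cmp, label %rare, label %common

common:
  ret i32 %Z

rare:
  ret i32 %Z
}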

It seems inevitable. For the floating-point version, please make it very clear what the behavior of max(-0,+0) and related cases is. This also means that code matching compare/select idioms (e.g. llvm/Support/PatternMatch.h) will need to be updated.

-Chris

It seems inevitable. For the floating-point version, please make it very
clear what the behavior of max(-0,+0) and related cases is.

Along these lines, AArch64 has an instruction "FMAXNM". It returns the
maximum if neither value is NaN, but returns the number if just one
value is NaN. This is in addition to an "FMAX" which propagates NaNs.

I suspect you'll just want to consider this as an "oh yes, make sure
that the result is NaN if either input is" advisory notice, but I
haven't actually thought through the details of implementation yet.

Tim.

Given the recent fast-math flags commit, what about the inverse? ("I
guarantee that inputs are not NaNs") In this regard min/max look like
other floating point instructions to me. Does it make sense to add
them as instructions so that they can profit from fast-math flags?

Dmitri

Hi,

Along these lines, AArch64 has an instruction "FMAXNM". It returns the
maximum if neither value is NaN, but returns the number if just one
value is NaN. This is in addition to an "FMAX" which propagates NaNs.

I suspect you'll just want to consider this as an "oh yes, make sure
that the result is NaN if either input is" advisory notice, but I
haven't actually thought through the details of implementation yet.

Given the recent fast-math flags commit, what about the inverse? ("I
guarantee that inputs are not NaNs")

It's certainly a point to consider in general, but I don't think it
would help in our case. I've not looked at the instruction timings for
any of our CPUs, but I can't imagine FMAXNM being any different from
FMAX. If that's right, it doesn't matter whether NaNs are present
computationally; it becomes a question of what semantics you actually
want.

Tim.

The following is our current proposal for llvm.fmax/fmin.*:

[1] If exactly one argument is a NaN, the intrinsic returns the other argument.
[2] If both arguments are NaN, the intrinsic returns a NaN.
[3] An SNaN may behave as a QNaN.
[4] If the arguments compare equal, the intrinsic returns a value that compares equal to both arguments.
[5] Otherwise, the intrinsic returns the greater/lesser of the two arguments.

Rationale and notes:

Points [1] and [2] match the C/POSIX library functions' specs.

Point [3] matches the OpenCL library functions, and may permit some implementations to test for NaNs less expensively.

Point [4] accounts for fmax(-0,+0) in IEEE 754 arithmetic, and any similar cases that might exist in other systems (LLVM needs a VAX backend). IEEE specifies that comparisons ignore the sign of zero, so requiring fmax to order ±0 would be expensive on many systems, and is not necessary to support common library functions.
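
For instance, under point [4] the following call (a sketch, using the fN naming above) may return either operand, since -0.0 and +0.0 compare equal under IEEE 754:

%r = call float @llvm.fmax.f32(float -0.0, float 0.0)
; %r may be either -0.0 or +0.0; both compare equal to both arguments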

The intrinsics can replace calls to the C and OpenCL library functions.

The intrinsics can be implemented as calls to the C or OpenCL library functions. They can also be implemented by IEEE 754 maxNum()/minNum() operations (but not vice versa).

The intrinsics are not equivalent to an fcmp/select sequence.

This part worries me. The new min/max intrinsics will only be useful if we can pattern-match cmp/select into them.

Yes, that's the obvious alternative. I don't think we have any strong opinion either way, and fcmp/select is certainly easier to implement.

Maybe we can have two versions of the intrinsic function, "ordered" and "unordered", just like fcmp has [1]. Would that work?

[1] - http://llvm.org/docs/LangRef.html#fcmp-instruction

No; real fmin and fmax are symmetric, and fcmp+select patterns are not
symmetric.
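
Concretely, take the usual fmax idiom (a sketch):

%c = fcmp ogt float %x, %y
%r = select i1 %c, float %x, float %y
; %x NaN: ogt is false, so %r = %y (the non-NaN operand)
; %y NaN: ogt is false, so %r = %y (the NaN operand)

The pattern always falls through to its second operand when a NaN is
involved, so swapping the operands changes the result, whereas
fmax(x,y) must equal fmax(y,x).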

Also, on point [3] "An SNaN may behave as a QNaN", LLVM doesn't
correctly implement SNaN anyway, so there's no need to clutter up the
rules with SNaN exceptions. When (if?) someone ever implements SNaN,
they'll probably need to introduce a new operator flag to allow passes
to ignore it (nsnan?), and then fmin/fmax can use that.

Dan

Maybe we can have two versions of the intrinsic function, "ordered" and
"unordered", just like fcmp has [1]. Would that work?

If the intrinsics are to be produced by matching fcmp + select, I think we need four versions of each -- select (fcmp olt %a, %b), %a, %b is equivalent to select (fcmp uge %a, %b), %b, %a, and so on.
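
For example, in the float case (a sketch):

; these two idioms select the same value for every input, including NaNs:
%c1 = fcmp olt float %a, %b
%m1 = select i1 %c1, float %a, float %b   ; NaN input: yields %b

%c2 = fcmp uge float %a, %b
%m2 = select i1 %c2, float %b, float %a   ; NaN input: also yields %b

; whereas the unordered variant of the first form differs on NaNs:
%c3 = fcmp ult float %a, %b
%m3 = select i1 %c3, float %a, float %b   ; NaN input: yields %a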