RFC: Add New Set of Vector Math Builtins

Hi,

I would like to provide a convenient way for our users to perform additional operations like min, max, abs and round on vector (and possibly matrix) types. To do so, I'd like to propose adding a new set of builtins that operate on vector-like types. The new builtins can be used to perform each operation element-wise and as a reduction over the elements of a vector or a matrix. The proposal also includes arithmetic reductions. Those are performed pairwise, rather than sequentially, to ensure they can be lowered directly to vector instructions on targets like AArch64 and X86.

I considered overloading the existing math builtins, but I think we can provide a better and more consistent user experience with a new set of builtins. The drawbacks of overloading the math builtins are discussed at the end.

Below is the proposed specification for the element-wise and reduction builtins. They will be lowered to the corresponding LLVM intrinsics. Note that not all possible builtins are listed here. Once we agree on the general scheme, it should be easy to add additional builtins.

Specification of the element-wise builtins with one operand

Provided builtins:

• __builtin_elementwise_abs

• __builtin_elementwise_ceil

• __builtin_elementwise_floor

• __builtin_elementwise_rint

• __builtin_elementwise_round

• __builtin_elementwise_trunc

Hi Florian,

I have a few questions about the reduction builtins.

llvm.reduce.fadd is currently defined as ordered unless the reassociate fast math flag is present. Are you proposing to change that to make it pairwise?

llvm.reduce.fmin/fmax change behavior based on the nonans fast math flag. And I think they always imply no signed zeros regardless of whether the fast math flag is present. The vectorizers check the fast math flags before creating the intrinsics today. What are the semantics of the proposed builtin?

Thanks,

Hi Craig,

> Hi Florian,
>
> I have a few questions about the reduction builtins.

Thanks for taking a look!

> llvm.reduce.fadd is currently defined as ordered unless the reassociate fast math flag is present. Are you proposing to change that to make it pairwise?

That's a good point and I forgot to explicitly call this out! The reduction intrinsic unfortunately cannot express pairwise reductions, and the reassociate flag would be too permissive. An initial lowering in Clang could just generate the pairwise reduction tree directly, but down the road I anticipate extending the reduction intrinsic to allow expressing pairwise reductions. This would probably be helpful for parts of the middle-end too, which at the moment manually emit pairwise reduction trees (e.g. in the epilogue of vector loops with reductions).

> llvm.reduce.fmin/fmax change behavior based on the nonans fast math flag. And I think they always imply no signed zeros regardless of whether the fast math flag is present. The vectorizers check the fast math flags before creating the intrinsics today. What are the semantics of the proposed builtin?

I tried to specify NaN handling in the `Special Values` section. At the moment it says "If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN". This should match both the NaN handling of llvm.minnum and libm's fmin(f). Note that in the original email, the Special Values section still includes a mention of fmax. That reference should be removed.

The current proposal does not specifically talk about signed zeros, but I am not sure we have to. The proposal defines min/max as returning the smaller/larger value. Both -0 and +0 are equal, so either can be returned. I think this again matches libm's fmin(f)'s and llvm.minnum's behavior, although llvm.minnum's definition calls this out by stating explicitly what happens when called with equal arguments. Should the proposed definitions also spell that out?

Cheers,
Florian

> Hi Craig,

>> Hi Florian,
>>
>> I have a few questions about the reduction builtins.

> Thanks for taking a look!

>> llvm.reduce.fadd is currently defined as ordered unless the reassociate fast math flag is present. Are you proposing to change that to make it pairwise?

> That's a good point and I forgot to explicitly call this out! The reduction intrinsic unfortunately cannot express pairwise reductions, and the reassociate flag would be too permissive. An initial lowering in Clang could just generate the pairwise reduction tree directly, but down the road I anticipate extending the reduction intrinsic to allow expressing pairwise reductions. This would probably be helpful for parts of the middle-end too, which at the moment manually emit pairwise reduction trees (e.g. in the epilogue of vector loops with reductions).

I didn’t think the vectorizers used pairwise reductions. The cost modelling flag for it was removed in https://reviews.llvm.org/D105484

FWIW, the X86 backend barely uses the pairwise reduction instructions like haddps. They have a suboptimal implementation on most CPUs that makes them not good for reducing over a single register.

>> llvm.reduce.fmin/fmax change behavior based on the nonans fast math flag. And I think they always imply no signed zeros regardless of whether the fast math flag is present. The vectorizers check the fast math flags before creating the intrinsics today. What are the semantics of the proposed builtin?

> I tried to specify NaN handling in the Special Values section. At the moment it says "If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN". This should match both the NaN handling of llvm.minnum and libm's fmin(f). Note that in the original email, the Special Values section still includes a mention of fmax. That reference should be removed.

> The current proposal does not specifically talk about signed zeros, but I am not sure we have to. The proposal defines min/max as returning the smaller/larger value. Both -0 and +0 are equal, so either can be returned. I think this again matches libm's fmin(f)'s and llvm.minnum's behavior, although llvm.minnum's definition calls this out by stating explicitly what happens when called with equal arguments. Should the proposed definitions also spell that out?

I just noticed that the ExpandReductions pass uses fcmp+select for expanding llvm.reduce.fmin/fmax with nonans. But SelectionDAG expands it using ISD::FMAXNUM and ISD::FMINNUM. I only looked at ExpandReductions and saw the nonans check there, but didn’t realize it was using fcmp+select.

Hi Craig,

>>> Hi Florian,
>>>
>>> I have a few questions about the reduction builtins.

>> Thanks for taking a look!

>>> llvm.reduce.fadd is currently defined as ordered unless the reassociate fast math flag is present. Are you proposing to change that to make it pairwise?

>> That's a good point and I forgot to explicitly call this out! The reduction intrinsic unfortunately cannot express pairwise reductions, and the reassociate flag would be too permissive. An initial lowering in Clang could just generate the pairwise reduction tree directly, but down the road I anticipate extending the reduction intrinsic to allow expressing pairwise reductions. This would probably be helpful for parts of the middle-end too, which at the moment manually emit pairwise reduction trees (e.g. in the epilogue of vector loops with reductions).

> I didn't think the vectorizers used pairwise reductions. The cost modelling flag for it was removed in https://reviews.llvm.org/D105484

I was referring to the code in fixReduction (https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp#L4427)

It now emits a call to the reduction intrinsics, but the ExpandReductions pass should lower this to a reduction tree which matches the specification.

> FWIW, the X86 backend barely uses the pairwise reduction instructions like haddps. They have a suboptimal implementation on most CPUs that makes them not good for reducing over a single register.

Yes, thanks for raising that. I realized the original proposal was a bit ambiguous on which pairs exactly are being added. Originally the intention was to use even/odd pairs because this is what multiple architectures provide instructions for. Unfortunately the performance of those instructions varies across hardware implementations, especially on X86 as you mentioned.

It seems to me like whatever order we choose, we will leave performance on the table on some architectures/hardware implementations. The goal is to allow users to write high-performance code for the platforms they care about, so I think it would be best to let users pick the evaluation order. That way, they can make an informed decision and get the best performance on the targets they care about. One way to do that could be to expose 3 different versions of the fadd reduction builtin: 1) __builtin_reduce_unordered_fadd, 2) __builtin_reduce_adjacent_pairs_fadd and 3) __builtin_reduce_low_high_pairs_fadd. Alternatively we could add an additional parameter, but that seems less explicit than encoding it in the name.

What do you think? Do you have an alternative suggestion in mind?

For the other builtins, the evaluation order should not really matter.

>>> llvm.reduce.fmin/fmax change behavior based on the nonans fast math flag. And I think they always imply no signed zeros regardless of whether the fast math flag is present. The vectorizers check the fast math flags before creating the intrinsics today. What are the semantics of the proposed builtin?

>> I tried to specify NaN handling in the Special Values section. At the moment it says "If exactly one argument is a NaN, return the other argument. If both arguments are NaNs, return a NaN". This should match both the NaN handling of llvm.minnum and libm's fmin(f). Note that in the original email, the Special Values section still includes a mention of fmax. That reference should be removed.

>> The current proposal does not specifically talk about signed zeros, but I am not sure we have to. The proposal defines min/max as returning the smaller/larger value. Both -0 and +0 are equal, so either can be returned. I think this again matches libm's fmin(f)'s and llvm.minnum's behavior, although llvm.minnum's definition calls this out by stating explicitly what happens when called with equal arguments. Should the proposed definitions also spell that out?

> I just noticed that the ExpandReductions pass uses fcmp+select for expanding llvm.reduce.fmin/fmax with nonans. But SelectionDAG expands it using ISD::FMAXNUM and ISD::FMINNUM. I only looked at ExpandReductions and saw the nonans check there, but didn't realize it was using fcmp+select.

It looks like most backends do not directly lower the reduction intrinsics and still rely on the ExpandReductions pass. The AArch64 backend supports lowering the intrinsics directly and produces the expected results. For backends to get the most out of the proposed builtins, I think they have to make sure they lower the reduction intrinsics as well as they can.

Cheers,
Florian

Some notes on the subject of SIMD reductions:

  1. It’s desirable to have a third option other than the existing ordered and unordered. Ordered is too slow to be desirable on most implementations, and unordered blocks portability. Either obvious binary tree reduction (even/odd or high/low) is strictly more desirable for explicit SIMD code (which is what these builtins are meant to support). Even when one option is “worse” for a given platform (e.g. HADD on x86) it’s still vastly better than ordered while still allowing us to get the same result everywhere if we need to.

  2. even/odd (pairwise reduction) has the slight virtue that it probably maps more nicely to an unknown variable-vector length ISA. This is quite minor, though, because we’re talking about explicit SIMD code.

If it’s worth having a C builtin reduce for explicit SIMD types, I think it’s worth having it use a defined reduction tree. Even/odd is probably the more forward-thinking option, but is somewhat suboptimal on generic x86. Hi/lo is somewhat suboptimal on generic arm64, but not as much so as even/odd on x86.

Doing both at the C level seems slightly silly to me; I would rather just lower to undefined evaluation order when reassoc is set, I think, for people who want “whatever is fastest”.

– Steve

Thanks Steve!

I tried to write up the proposed builtins in a concise way and put up a patch which hopefully makes it easier to track & incorporate any further updates: https://reviews.llvm.org/D111529

For now I went with even/odd pairing for reductions. It should be straightforward for users to allow reassociation at builtin call sites using `#pragma clang fp reassociate(on)`, which hopefully provides enough flexibility to keep the set of reduction builtins compact.

So far it sounds like there are no concerns/objections to adding a new set of vector builtins in general, but I am going to add a few additional people explicitly to the thread & patch for extra visibility.

Cheers,
Florian

The patch to describe the vector builtins has gone through a number of iterations now to iron out some details and issues with wording: https://reviews.llvm.org/D111529.

I am planning on moving ahead with landing the patches next week. If you have any additional thoughts or concerns, please let me know.

I also put up patches to implement a few of the proposed builtins already:

• Elementwise min/max: https://reviews.llvm.org/D111985
• Elementwise abs: https://reviews.llvm.org/D111986
• min/max reduction: https://reviews.llvm.org/D112001

Cheers,
Florian

Sorry I'm late to the discussion (and thank you for working on this!) - I was wondering if you'd investigated making these builtins constexpr at all?

Simon.

So far I have not looked into making the builtins constexpr, but I think it would make a lot of sense. IIUC this would mostly be a question of whether someone is interested in implementing constexpr evaluation for the builtins in Clang :)

Hi,

Just a quick heads up, the first builtins have been implemented in Clang:

I am planning on implementing the remaining ones at a leisurely pace. If anybody is interested in helping out, I created https://bugs.llvm.org/show_bug.cgi?id=52444 to communicate and avoid duplicated work.