RFC: Generic IR reductions

Hi all,

During the Nov 2016 dev meeting, we had a hackers’ lab session where we discussed some issues about loop idiom recognition, IR representation and cost modelling. I took an action to write up an RFC about introducing reduction intrinsics to LLVM to represent horizontal operations across vectors.

Vector reductions have been discussed before, notably here: http://lists.llvm.org/pipermail/llvm-dev/2015-November/092379.html

This was in the context of recognizing patterns for matching more complex cases like SAD and dot product. Those discussions centered on adding a new vector-reduce instruction, or on annotating the reduction PHI to allow pattern recognition later at codegen. However, those motivations are different from the ones I'm giving here.

Current status

reductions-enable-aarch64.diff (77.9 KB)

reductions-ir.diff (32.6 KB)

Hi Amara,

We also had some discussions on the SVE side of reductions on the main
SVE thread, but this description is much more detailed than we had
before.

I don't want to discuss specifically about SVE, as the spec is not out
yet, but I think we can cover a lot of ground until very close to SVE
and do the final step when we get there.

        1) As our vector lengths are unknown at compile time (and also not a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize them. SVE also contains strict ordering FP reductions, which cannot be done using the tree-reduction.

Not SVE specific, for example fast-math.

        2) That we can use reduction intrinsics to implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false} holds. Without any ability to express reductions as intrinsics, we need a separate instruction for the same reasons as described for regular reductions.

We should leave this for later, as this is very SVE-specific.

        3) Other, non-vectorizer users of LLVM that may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree-shuffle pattern to ensure that they're matched to efficient instructions later.

This is a minor point, since bloated IR is "efficient" if it collapses
into a small number of instructions, for example dozens of shuffles
into a ZIP.

The argument that the intrinsic is harder to destroy through
optimisation passes is the same as other cases of stiff rich semantics
vs. generic pattern matching, so orthogonal to this issue.

We propose to introduce the following reduction intrinsics as a starting point:
int_vector_reduce_add(vector_src)

Is this a C intrinsic? Shouldn't an IR builtin be something like:

@llvm.vector.reduce.add(...) ?

These intrinsics do not do any type promotion of the scalar result. Architectures like SVE which can do type promotion and reduction in a single instruction can pattern match the promote->reduce sequence.

Yup.
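As a sketch of what such a promote-then-reduce sequence computes (the function name and parameters here are invented for illustration, not part of the proposal): each lane is widened before accumulation, so narrow lanes can be summed without wrapping, which is the sequence a target like SVE could match to a single widening-reduction instruction.

```python
def reduce_add_widen(lanes, lane_bits, result_bits):
    # Model of a widening integer reduction: each lane is first
    # zero-extended to the result width, then accumulated with
    # wrap-around at the result width (like zext + reduce.add).
    lane_mask = (1 << lane_bits) - 1
    result_mask = (1 << result_bits) - 1
    total = 0
    for v in lanes:
        total = (total + (v & lane_mask)) & result_mask
    return total

# 32 lanes of i8 value 200 sum to 6400, which would wrap in an i8
# accumulator but is exact in the widened i32 result.
```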

...
int_vector_reduce_fmax(vector_src, i32 NoNaNs)

A large list, and it probably doesn't even cover all that SVE can do,
let alone other reductions.

Why not simplify this into something like:

  %sum = add <N x float> %a, %b
  %red = @llvm.reduce(%sum, float %acc)
or
  %fast_red = @llvm.reduce(%sum)

For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

We have multiple options for expressing vector predication in reductions:
        1. The first is to simply add a predicate operand to the intrinsics, and require that targets without predication explicitly pattern match for an all-true predicate in order to select hardware instructions.
        2. The second option is to mask out inactive lanes in the input vector by using a select between the input, and a vector splat of the reduction's identity values, e.g. 0.0 for fp-add.

We believe option 2 will be sufficient for predication capable architectures, while keeping the intrinsics definitions simple and minimal. If there are targets for which the identity value for a reduction is different, then we could use an IR constant to express this in a generic way.
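A minimal Python sketch of option 2 (the names here are invented for illustration): selecting the reduction's identity value into inactive lanes makes an unpredicated reduction compute the predicated result.

```python
# Identity values per operation: reducing over the identity
# contributes nothing to the result.
IDENTITY = {"fadd": 0.0, "fmul": 1.0, "fmin": float("inf"), "fmax": float("-inf")}

def predicated_reduce(op, combine, vec, pred):
    ident = IDENTITY[op]
    # The "select" step: inactive lanes become the identity value.
    masked = [v if p else ident for v, p in zip(vec, pred)]
    acc = ident
    for v in masked:
        acc = combine(acc, v)
    return acc

vec = [3.0, 7.0, 2.0, 9.0]
pred = [True, False, True, False]
# fadd over active lanes {3.0, 2.0} gives 5.0; fmin gives 2.0, since
# the +inf identity keeps inactive lanes from affecting the minimum.
```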

I agree. I haven't followed Elena's work on the similar concept for
Intel, but I vaguely remember we reached a similar conclusion for
AVX512.

cheers,
--renato

+cc Simon who's also interested in reductions for the any_true,
all_true predicate vectors.

Hi Amara,

We also had some discussions on the SVE side of reductions on the main
SVE thread, but this description is much more detailed than we had
before.

I don't want to discuss specifically about SVE, as the spec is not out
yet, but I think we can cover a lot of ground until very close to SVE
and do the final step when we get there.

The goal of this proposal is to agree on a new representation, so we
are looking at more than SVE and improving things for LLVM as a whole.

        1) As our vector lengths are unknown at compile time (and also not a power of 2), we cannot generate the same reduction IR pattern as other targets. We need a direct IR representation of the reduction in order to vectorize them. SVE also contains strict ordering FP reductions, which cannot be done using the tree-reduction.

Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have
ordered reductions, I can't comment on that aspect. However, fast-math
vectorization is a case where you *don't* want ordered reductions, as
the relaxed fp-contract means that the conventional tree reduction
algorithm preserves the required semantics.
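To make the difference concrete, here is a small Python sketch (illustration only; the function names are invented and this is not LLVM code) of the two evaluation orders, showing why only relaxed fast-math semantics permit the tree form:

```python
def ordered_reduce(acc, vec):
    # Strict sequential semantics: ((acc + v0) + v1) + ...
    for v in vec:
        acc = acc + v
    return acc

def tree_reduce(vec):
    # Pairwise tree semantics: add the high half onto the low half,
    # log2(N) times, as the current shuffle-based idiom does.
    # Assumes a power-of-two number of lanes.
    while len(vec) > 1:
        half = len(vec) // 2
        vec = [vec[i] + vec[i + half] for i in range(half)]
    return vec[0]

vals = [1e16, 1.0, -1e16, 1.0]
# ordered from 0.0: ((0 + 1e16) + 1.0) absorbs the 1.0, giving 1.0
# tree: (1e16 + -1e16) + (1.0 + 1.0) == 2.0
```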

        2) That we can use reduction intrinsics to implement our proposed SVE "test" instruction, intended to check whether a property of a predicate, {first,last,any,all}x{true,false} holds. Without any ability to express reductions as intrinsics, we need a separate instruction for the same reasons as described for regular reductions.

We should leave this for later, as this is very SVE-specific.

As I stated earlier, this is going beyond SVE. I hope to achieve a
consensus between other targets as well, even if some don't yet have
efficient ways to handle it.

        3) Other, non-vectorizer users of LLVM that may want to generate vector code themselves. Front-ends which want to do this currently must deal with the difficulties of generating the tree-shuffle pattern to ensure that they're matched to efficient instructions later.

This is a minor point, since bloated IR is "efficient" if it collapses
into a small number of instructions, for example dozens of shuffles
into a ZIP.

The hassle of generating reductions may well be at most a minor
motivator, but my point still stands. If a front-end wants the target
to be able to generate the best code for a reduction idiom, they must
generate a lot of IR for many-element vectors. You still have to pay
the price in bloated IR; see the tests changed as part of the AArch64
NEON patch.
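As a rough model of that cost (assuming power-of-two vector lengths; the counting is illustrative, not taken from the patch): each round of the tree idiom is one shufflevector plus one vector binop, ending in an extractelement, so the instruction count grows with log2 of the lane count.

```python
def shuffle_tree_reduce(vec):
    # Emulate the idiom a front-end must emit today: per round, one
    # shufflevector (move the high half down) and one vector add;
    # the scalar result then needs a final extractelement.
    ops = []
    while len(vec) > 1:
        half = len(vec) // 2
        ops.append("shufflevector")
        vec = [vec[i] + vec[i + half] for i in range(half)]
        ops.append("add")
    ops.append("extractelement")
    return vec[0], ops

total, ops = shuffle_tree_reduce(list(range(16)))
# A <16 x i32> sum takes 4 rounds, i.e. 9 IR instructions, for what
# is logically a single operation.
```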

The argument that the intrinsic is harder to destroy through
optimisation passes is the same as other cases of stiff rich semantics
vs. generic pattern matching, so orthogonal to this issue.

We propose to introduce the following reduction intrinsics as a starting point:
int_vector_reduce_add(vector_src)

Is this a C intrinsic? Shouldn't an IR builtin be something like:

@llvm.vector.reduce.add(...) ?

You're right, in IR assembly it would appear like that. The ones I
proposed were the tablegen syntax of the intrinsics definitions, so in
practice they would look like @llvm.vector.reduce.[operation] in IR
asm.

These intrinsics do not do any type promotion of the scalar result. Architectures like SVE which can do type promotion and reduction in a single instruction can pattern match the promote->reduce sequence.

Yup.

...
int_vector_reduce_fmax(vector_src, i32 NoNaNs)

A large list, and it probably doesn't even cover all that SVE can do,
let alone other reductions.

Why not simplify this into something like:

  %sum = add <N x float> %a, %b
  %red = @llvm.reduce(%sum, float %acc)
or
  %fast_red = @llvm.reduce(%sum)

Because the semantics of an operation would not depend solely on the
operand value types and operation, but on a chain of computations
forming the operands. If the input operand is a phi, you then have to
do potentially inter-block analysis. If it's a function parameter or
simply a load from memory then you're pretty much stuck and you can't
resolve the semantics.

During the dev meeting, a reductions proposal where the operation to
be performed was a kind of opcode was discussed, and rejected by the
community. I don't believe having many intrinsics would be a problem.

For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator
instructions like add, sub, mul. The vector versions of those are not
horizontal operations, they instead produce vector results.
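In other words (a Python sketch with invented names): the existing vector operators are "vertical", producing one result lane per input lane, while a reduction is "horizontal", producing a single scalar across lanes.

```python
def vertical_max(a, b):
    # Like the existing vector binary operators: lane-wise, and the
    # result is still a vector.
    return [max(x, y) for x, y in zip(a, b)]

def horizontal_max(a):
    # Like the proposed reduction intrinsic: across lanes, and the
    # result is a single scalar.
    acc = a[0]
    for x in a[1:]:
        acc = max(acc, x)
    return acc

a, b = [1.0, 8.0, 3.0], [4.0, 2.0, 9.0]
# vertical_max(a, b) -> [4.0, 8.0, 9.0]  (still a vector)
# horizontal_max(a)  -> 8.0              (a scalar)
```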

+Elena, as she has done a lot of work for AVX512, which has similar concepts.

Not SVE specific, for example fast-math.

Can you explain what you mean here? Other targets may well have
ordered reductions, I can't comment on that aspect. However, fast-math
vectorization is a case where you *don't* want ordered reductions, as
the relaxed fp-contract means that the conventional tree reduction
algorithm preserves the required semantics.

That's my point. Fast-math can change the target's semantics regarding
reductions, independently of scalable vectors, so it's worth
discussing in a more general case, ie in this thread.

Sorry for being terse.

The hassle of generating reductions may well be at most a minor
motivator, but my point still stands. If a front-end wants the target
to be able to generate the best code for a reduction idiom, they must
generate a lot of IR for many-element vectors. You still have to pay
the price in bloated IR; see the tests changed as part of the AArch64
NEON patch.

This is a completely orthogonal discussion, as I stated here:

The argument that the intrinsic is harder to destroy through
optimisation passes is the same as other cases of stiff rich semantics
vs. generic pattern matching, so orthogonal to this issue.

One that we have had multiple times and the usual consensus is: if it
can be represented in plain IR, it must. Adding multiple semantics for
the same concept, especially stiff ones like builtins, adds complexity
to the optimiser.

Regardless of the merits in this case, builtins should only be
introduced IFF there is no other way. So first we should discuss
adding it to IR with generic concepts, just like we did for
scatter/gather and strided access.

Why not simplify this into something like:

  %sum = add <N x float> %a, %b
  %red = @llvm.reduce(%sum, float %acc)
or
  %fast_red = @llvm.reduce(%sum)

Because the semantics of an operation would not depend solely on the
operand value types and operation, but on a chain of computations
forming the operands. If the input operand is a phi, you then have to
do potentially inter-block analysis. If it's a function parameter or
simply a load from memory then you're pretty much stuck and you can't
resolve the semantics.

I think you have just described the pattern matching algorithm,
meaning it's possible to write that in a sequence of IR instructions,
thus using add+reduce should work. Same with pointer types and other
reduction operations.

If the argument comes from a function parameter that is a non-strict
pointer to memory, then all bets are off anyway and the front-end
wouldn't be able to generate anything more specific, unless you're
using SIMD intrinsics, in which case this point is moot.

During the dev meeting, a reductions proposal where the operation to
be performed was a kind of opcode was discussed, and rejected by the
community.

Well, that was certainly a smaller group than the list. Design
decisions should not be taken off list, so we must have this
discussion on the list again, I'm afraid.

I don't believe having many intrinsics would be a problem.

This is against every decision I remember. Saying it out loud in a
meeting is one thing; writing them down, implementing them, and having
to bear the maintenance costs is another entirely.

That's why the consensus has to happen on the list.

For a min/max reduction, why not just extend @llvm.minnum and @llvm.maxnum?

For the same reasons that we don't re-use the other binary operator
instructions like add, sub, mul. The vector versions of those are not
horizontal operations, they instead produce vector results.

Sorry, I meant min/max + reduce, just like above.

  %sum = add <N x float> %a, %b
  %min = @llvm.minnum(<N x float> %sum)
  %red = @llvm.reduce(%min, float %acc)

cheers,
--renato

One that we have had multiple times and the usual consensus is: if it can be represented in plain IR, it must. Adding multiple semantics for the same concept, especially stiff ones like builtins, adds complexity to the optimiser.
Regardless of the merits in this case, builtins should only be introduced IFF there is no other way. So first we should discuss adding it to IR with generic concepts, just like we did for scatter/gather and strided access.

I suppose we can let the target decide. A target may decide to use an intrinsic, since it will later be converted to a single instruction, or to keep plain IR, letting the optimizer work smoothly.
Actually, we do the same for gather/scatter: the vectorizer does not build the intrinsics if the target does not support them.

adds complexity to the optimizer

The optimizer should not have to deal with intrinsics, at least not with this kind of intrinsic.

The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false".
In this case we'll stay with plain IR.
But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

if it can be represented in plain IR, it must

The IR is plain and the ISAs are complex, so we miss optimizations in the end. The IR should be rich in order to serve complex ISAs.

As for the intrinsics set, we probably need "ordered" and "plain" modes for FP reductions. X86 does not have the "ordered" mode today, but might be interested in it for a loop-tail operation in FP fast mode.

- Elena

I suppose we can let the target decide. A target may decide to use an intrinsic, since it will later be converted to a single instruction, or to keep plain IR, letting the optimizer work smoothly.
Actually, we do the same for gather/scatter: the vectorizer does not build the intrinsics if the target does not support them.

Absolutely, this is a target-specific decision.

For gather / scatter we had some intrinsics, but we also used plain IR
for others, no? For strided access, we used mostly IR, despite the
original proposal to be mostly intrinsics.

I think the balance is: what is the cost across the compiler to
generate and keep the intrinsics syntactically correct and optimal?

I'm not claiming I know, as I don't know AVX or SVE too well, but we
need answers to those questions before we take any decision.

adds complexity to the optimizer

The optimizer should not have to deal with intrinsics, at least not with this kind of intrinsic.

That's why it adds complexity. To most optimisers, intrinsics are
function calls that could be replaced by a code block, a specific
instruction, or a library call. This may make certain transformations
opaque.

In the past we have had examples in NEON code generation where
inlining produced better code because new IR patterns were exposed;
had we used intrinsics, they wouldn't have been.

The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false".
In this case we'll stay with plain IR.
But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

I'm not sure what you mean. The cost analysis is per instruction,
which adds up to per block or function, be they intrinsics or legal IR
instructions.

The implementation of the functions that calculate profitability
already has to take non-intrinsics into account, so that shouldn't
make any difference.

The IR is plain and the ISAs are complex, so we miss optimizations in the end. The IR should be rich in order to serve complex ISAs.

Creating more intrinsics actually makes the IR *more* complex. What
kind of optimisations would we miss?

If you mean "patterns may not be matched, and reduction instructions
will not be generated, making the code worse", then this is just a
matter of making the patterns obvious and the back-ends robust enough
to cope with it, no?

I mean, we already do this for almost everything else. If the current
IR can express the required semantics, then we should use plain IR.

The alternative is to start forcing intrinsics for everything, which
would make IR really pointless.

As for the intrinsics set, we probably need "ordered" and "plain" modes for FP reductions.

The ordered and plain modes can be chosen via an additional parameter,
the accumulator.

X86 does not have the "ordered" mode today, but might be interested in it for a loop-tail operation in FP fast mode.

Indeed, same for NEON and I guess most other SIMD engines.

cheers,
--renato

If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with it, no?

The back-end should be as robust as possible, I agree. The problem I see is in adding another kind of complexity to the optimizer that works between the vectorizer and the back-end: it should be able to recognize all the "obvious" patterns in order to preserve them.

The target should be able to answer the question IsProfitable(ORDERED_REDUCTION_FADD, VF). Ordered reductions, for example, are not supported on X86, so the answer will probably be "false".
In this case we'll stay with plain IR.
But SVE will answer "true" and an intrinsic will be created. So, in my opinion, we need reduction intrinsics.

I'm not sure what you mean. The cost analysis is per instruction, which adds up to per block or function, be they intrinsics or legal IR instructions.

Now we look at a reduction PHI, and if the FP mode requires an "ordered" reduction that is not supported, the whole loop remains scalar.
If we instead left this decision to the cost model, it would provide a cost for scalarization, and in the end we might decide to scalarize the reduction operation inside the vector loop. Once the decision is to vectorize, whether to insert an intrinsic or generate plain IR should be a target decision.

- Elena

If you mean "patterns may not be matched, and reduction instructions will not be generated, making the code worse", then this is just a matter of making the patterns obvious and the back-ends robust enough to cope with it, no?

The back-end should be as robust as possible, I agree. The problem I see is in adding another kind of complexity to the optimizer that works between the vectorizer and the back-end: it should be able to recognize all the "obvious" patterns in order to preserve them.

Right!

Also, I may have been my own enemy again and muddled the question. Let
me try again... :)

I'm not against a reduction intrinsic. I'm against one reduction
intrinsic for {every kind} x {ordered, unordered}. At least until
further evidence comes to light.

My proposal was to have a reduction intrinsic that can infer the type
by the predecessors.

For example:

  @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

would generate a widening unordered reduction (fast-math).

Now we look at a reduction PHI, and if the FP mode requires an "ordered" reduction that is not supported, the whole loop remains scalar.

Right, but this is orthogonal to having separate intrinsics or not.

  %fast = @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))
  %order = @llvm.reduce(ext <N x double> ( add <N x float> %a, %b), double %acc)

If the IR contains %order, then this will *have* to be scalar if the
target doesn't support native ordered or a fast path to it. And this
is up to the cost model.

If we instead left this decision to the cost model, it would provide a cost for scalarization, and in the end we might decide to scalarize the reduction operation inside the vector loop. Once the decision is to vectorize, whether to insert an intrinsic or generate plain IR should be a target decision.

Hum, I'm beginning to see your point, I think. I agree this is again a
target decision, but it's also a larger compiler-wide decision, too.

The target's decision is: IR doesn't have the required semantics, I
*must* use intrinsics. It can't be: I'd rather have intrinsics because
it's easier to match in the back-end.

The first is a requirement, the second is a personal choice, and one
that can impact the generic instruction selection between IR and
target specific selection.

cheers,
--renato

No, this is wrong. I actually meant overriding the max/min intrinsics
to take vectors instead of two scalar operands.

The semantics of those intrinsics is to get a number of parameters and
return the max/min on their own type. A vector is just a different
packing of parameters, all of which have the same type, so there's no
semantic difference other than the number of arguments.

However, when they're lowered, they'll end up as a short sequence of
instructions (if supported) in the same way.

You'd only emit a max/min in vectorisation if the target supports it
and the cost is low. For instance, in AArch64 it would only emit it if
SVE support is enabled.

Makes sense?

cheers,
--renato

My proposal was to have a reduction intrinsic that can infer the type by the predecessors.

For example:

  > @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

And if we don't have %b? We just want to sum all elements of %a? Something like @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))
Don't we have a problem with constant propagation in this approach?

I proposed a "generic" intrinsic approach at the BOF (Nov 2016), like
    %scalar = @llvm.reduce(OPCODE, %vector_input) - OPCODE may be a string, integer or metadata.

- Elena

  > @llvm.reduce(ext <N x double> ( add <N x float> %a, %b))

And if we don't have %b? We just want to sum all elements of %a? Something like @llvm.reduce(ext <N x double> ( add <N x float> %a, zeroinitializer))

Hum, that's a good point. My examples were actually wrong, as they
weren't related to simple reductions. Your zeroinit is the thing I was
looking for.

Don't we have a problem with constant propagation in this approach?

I'm not sure. Can you expand this?

I proposed a "generic" intrinsic approach at the BOF (Nov 2016), like
    %scalar = @llvm.reduce(OPCODE, %vector_input) - OPCODE may be a string, integer or metadata.

I wouldn't use metadata. Integer would be cumbersome and lead to
eventual ABI breakages, and "text" would be the same as:

  %scalar = @llvm.reduce.add(%vector)

which is the same thing Amara proposed.

I'm not saying it is wrong, I'm just worried that, by mandating the
encoding of the reduction into an intrinsic, we'll force the
middle-end to convert high-level code patterns to the intrinsic or the
target will ignore it completely.

There is a pattern already for reductions, and the back-ends already
match it. This should not change, unless there is a serious flaw in it
- for the targets that *already* support it. This is an orthogonal
discussion.

SVE has more restrictions, for instance, one cannot know how many
shuffles to do because the vector size is unknown, so the current
representation is insufficient, in which case, we need the intrinsic.

But replacing everything else with intrinsics just because one target
can't cope with it doesn't work.

One thing that does happen is that code optimisations expose patterns
that would otherwise not be apparent. This includes potential
reduction or fusion patterns and can lead to massively smaller code or
even eliding the whole block. If you convert a block to an intrinsic
too early you may lose the ability to merge it back again later, as
we're doing today.

These are all hypothetical wrt SVE, but they did happen in NEON in the
past and were the reason why we only have a handful of NEON
intrinsics. Everything else are encoded with sequences of
instructions.

cheers,
--renato

Constant propagation:

%sum = add <N x float> %a, %b
@llvm.reduce(ext <N x double> %sum)

if %a and %b are vector of constants, the %sum also becomes a vector of constants.
At this point you have @llvm.reduce(ext <N x double> %sum) and don't know what kind of reduction you need.

- Elena

Well, %sum is an add node. If %a and %b have special semantics for the
target, then @reduce is meaningful and this means some type of
summation.

But the more I think of it, the less I think this could actually solve
the semantic issues around reductions... The zeroinit idiom would be
more of a hack than re-using similar concepts in IR.

So, let me take a step back, and assume that, for scalable vectors we
*have* to use all intrinsics. IR simply has no compatible idiom, and
unless we introduce some (like the stepvector), it won't work.

As I said before, adding intrinsics is better than new IR constructs,
so let's also assume this is the less costly way forward for now.

But having IR instructions is better than adding intrinsics, and I'm
not sure we want to completely replace what's there already, for
intrinsics.

Does AVX512 suffer from any cost in using the current extract/op
idiom? NEON has small vectors, so IR sequences end up being 1~4 ops,
which I don't consider a problem.

Also, the current idiom can cope with ordered/unordered reduction by
interleaving the operations or not:

  %sum0 = %vec[0] + %vec[1]
  %sum1 = %vec[2] + %vec[3]
  %sum = %sum0 + %sum1

or

  %sum0 = %vec[0] + %vec[1]
  %sum1 = %sum0 + %vec[2]
  %sum = %sum1 + %vec[3]

It may not cope with special semantics leading to use of
target-specific instructions, in which case we obviously need
intrinsics. It certainly can't cope with unknown vector sizes either.

cheers,
--renato

I think you mean take a single vector operand, not two vector
operands. If the intrinsic took two vectors then we would naturally
expect the result to be a vector, where each element is the result of
a scalar operation between each respective element in the vector
operands. Just like a normal binary operator.

Her point is that the %sum value will no longer be an add node, but
simply a constant vector. There is no way to resolve the semantics and
have meaningful IR after a simple constant prop. This means that other
simple transformations will all have to have special case logic just
to handle reductions, for example LCSSA.

One thing that does happen is that code optimisations expose patterns
that would otherwise not be apparent. This includes potential
reduction or fusion patterns and can lead to massively smaller code or
even eliding the whole block. If you convert a block to an intrinsic
too early you may lose the ability to merge it back again later, as
we're doing today.

These are all hypothetical wrt SVE, but they did happen in NEON in the
past and were the reason why we only have a handful of NEON
intrinsics. Everything else are encoded with sequences of
instructions.

Can you give a specific example? Reductions are actually very robust
to optimizations breaking the idiom, which is why I was able to
replace the reductions with intrinsics in my patches and with simple
matching generate identical code to before. No other changes were
required in the mid-end.

Precisely.

%vec = @llvm.maxnum(<N x float> %a, <N x float> %b)
%scalar = @llvm.maxnum(<N x float> %c)

In NEON, the former is a simple VMAX (VADD, etc), while the latter can
be a combination of VPMAX (VPADD etc).

cheers,
--renato

+CC Chandler and Philip.

Well, that was certainly a smaller group than the list. Design
decisions should not be taken off list, so we must have this
discussion on the list again, I'm afraid.

I don't believe having many intrinsics would be a problem.

This is against every decision I remember. Saying it out loud in a
meeting is one thing, writing them down and implementing and having to
bear the maintenance costs is another entirely.

That's fair. To bring some context, during Elena's BOF
Chandler and Philip made some arguments against the idea of
general-purpose single intrinsics that take an opcode parameter, which went
along the lines of: the design would lower the bar for introducing new
idioms to an extent that each one would not have to undergo as much
scrutiny as separate additions. I specifically recall someone, and I
think it was Chandler but correct me if I'm wrong, saying that many
intrinsics were not in itself a problem. Their points were also
broader than that of reductions, and covered the general approach of
handling groups of idioms. Chandler/Philip I hope I haven't
mischaracterized your thoughts.

Amara

Her point is that the %sum value will no longer be an add node, but
simply a constant vector. There is no way to resolve the semantics and
have meaningful IR after a simple constant prop. This means that other
simple transformations will all have to have special case logic just
to handle reductions, for example LCSSA.

Right.

Can you give a specific example? Reductions are actually very robust
to optimizations breaking the idiom, which is why I was able to
replace the reductions with intrinsics in my patches and with simple
matching generate identical code to before. No other changes were
required in the mid-end.

First, this is a demonstration that keeping them as IR is a good
thing, not a bad one. We keep the semantics as well as allow for
further introspection in the code block.

Examples of introspection are vector widening or instruction fusion
after inlining. Say you have a loop with a function call and a
reduction, and that call contains a reduction of its own. If the
function gets inlined after you have turned your loop's reduction into
a builtin, the optimiser has no visibility into whether the function's
reduction pattern can be merged, either by widening the vectors
(saving on loads/stores) or by fusing (e.g. MLA).

We have seen both cases with NEON after using IR instructions for
everything but the impossible.

In 2010 I went through a similar discussion with Bob Wilson, who
defended the position I'm defending now. I defend this position today
because I was categorically proven wrong by the results I describe
above.

Chandler's arguments are perfectly to the point. Intrinsics are not
only necessary when we can't represent things in IR; they're a *good*
way of representing odd things.

But if things are not odd (ie. many targets have them) or if we can
already represent them in IR, then it stands to reason that adding
duplicated stuff only adds complexity. It increases the maintenance
cost (more node types to consider), it increases the chance of missing
some of them (and either not optimising or generating bad code), and
it stops optimisers that know nothing about it (because it's too new)
from doing any inference, which can actually generate worse code than
before (shuffle explosion).

As they say: if it's not broken, don't fix it.

Let's talk about the reductions that AVX512 and SVE can't handle with
IR semantics, but let's not change the current IR semantics for no
reason.

cheers,
--renato

I was not at the BoF but I agree with the idea that lowering the bar has undesirable side-effects.

Usually, when considering the choice between the forms llvm.reduce.op(…) and llvm.reduce(op, …), I would expect the second to be motivated by there being enough interesting semantic properties that analyses can use *without* understanding `op`.
This is because it would allow writing passes / analyses that process llvm.reduce() generically.

I’m not sure this is the case here: knowing that a reduction is performed, without knowing the operation, is likely not very helpful.