RFC: Generic IR reductions

Hi All,

Renato wrote:

As they say: if it's not broken, don't fix it.
Let's talk about the reductions that AVX512 and SVE can't handle with IR semantics, but let's not change the current IR semantics for no reason.

Main problem for SVE: we can't write a straight-line IR instruction sequence for the reduction last-value
compute without knowing the number of elements in the vector to start from.

For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction
sequence) is functionally correct, but certainly not ideal: it's ugly, the optimal sequence is target
dependent, and it artificially inflates the outgoing IR for any optimizer not interested in optimizing
the reduction last-value compute, which translates to longer compile time, etc.
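
For concreteness, here is roughly what such a "reduce to half" sequence looks like for a <4 x i32>
add reduction (a sketch with hypothetical value names); larger vectors just add more shuffle+add
steps, one per halving:

  %rdx1 = shufflevector <4 x i32> %vec, <4 x i32> undef,
                        <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %sum1 = add <4 x i32> %vec, %rdx1   ; lanes 0..1 hold partial sums
  %rdx2 = shufflevector <4 x i32> %sum1, <4 x i32> undef,
                        <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %sum2 = add <4 x i32> %sum1, %rdx2  ; lane 0 holds the full sum
  %res  = extractelement <4 x i32> %sum2, i32 0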

So, it's not like we don't have reasons for asking to change the status quo.

Here's what I'd like to see at the end of this discussion.
     Nice and concise representation of reducing a vector into a scalar value.
     An IR instruction to do so is ideal, but I understand that the hurdle for that is very high.
     I'm okay with an intrinsic function call, and I heard that's a reasonable step towards getting an instruction.
     Let's say someone comes up with a 1024-bit vector working on char data. Nobody is really happy to see
     a sequence of "reduce to half" steps for 128 elements. Today, with AVX512BW, we already have the problem
     at half that size (only a few instructions fewer). I don't think anything that is proportional to "LOG(#elems)"
     is "nice and concise".
Such a representation is also useful inside a vectorized loop if the programmer wants a bitwise-identical
FP reduction value (obviously at a much slower speed), without losing vectorization of the rest of
the loop body. So, this isn't just "outside the vectorized loop" stuff.

Now, can we focus on the value of a "nice and concise representation"? If the community consensus
is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small
number of IR instructions (I don't know how, but someone might have brilliant ideas), etc. --- and then dig
deeper (e.g., an IR instruction for each reduction operator? whether the operator specification is part of an
operand or part of the intrinsic name? etc.). If the community consensus is "No", we can stop talking about
details right away.

Obviously, I'm voting for YES on "nice and concise representation".

Thanks,
Hideki Saito
Intel Compilers and Languages

Main problem for SVE: we can't write a straight-line IR instruction sequence for the reduction last-value
compute without knowing the number of elements in the vector to start from.

Hi Saito,

Indeed, I think there's at least consensus that, for scalable vectors,
there's no way around it.

For non-scalable vectors, a series of "reduce to half" steps (or any other straight-line IR instruction
sequence) is functionally correct, but certainly not ideal: it's ugly, the optimal sequence is target
dependent, and it artificially inflates the outgoing IR for any optimizer not interested in optimizing
the reduction last-value compute, which translates to longer compile time, etc.

I agree it's not pretty, but IR doesn't have to be pretty. IR is a
compiler intermediate language, not a user-facing language. So, if the
representation is accurate and the compile-time costs are within
noise, there's no reason to change.

Now, you touched on one subject that is worth considering: binary
reduction is target-specific.

I don't know AVX, AltiVec, and the others too well, but I assumed there are
only two types of reduction strategy: ordered and binary reduction.
The first exists because of precision, and the second because it's the most
logical generic step: you only need one instruction with multiple
register sizes to do any kind of reduction, and that reduces the
complexity of the ISA.

So if there are non-ordered / non-binary reductions, the
representation is, indeed, target-specific, and that's a very negative
point for it. But this only makes the case for non-ordered /
non-binary reductions to gain a builtin, not for moving everything into a
builtin.

So, it's not like we don't have reasons for asking to change the status quo.

I apologise if that's what got across. I'm not trying to shut anything
down, I'm just reiterating the reasons that I have received (and
given) in the past with regard to IR changes. We have done this for
NEON, strided access, etc., and it has worked well in general, exposing
optimisations that we wouldn't have gotten with a builtin.

Here's what I'd like to see at the end of this discussion.
     Nice and concise representation of reducing a vector into a scalar value.
     An IR instruction to do so is ideal, but I understand that the hurdle for that is very high.
     I'm okay with an intrinsic function call, and I heard that's a reasonable step towards getting an instruction.

Yes. If we have an intrinsic that can be built in a generic way,
that's half-way towards an instruction. That's what I was trying to do
in my first reply, when I suggested the @llvm.reduce() builtin.

But it seems impractical to have a single one. It'll most likely be a
family: @llvm.reduce.op.typein.typeout.ordered?(%vector), with op
being add/sub/max/min/etc., typein the vector type, typeout if
different for widening/narrowing, and ordered if FP precision is important.
It'll be harder to turn this into an instruction (mainly because
of the op), but it can be done; it just needs time and community support.
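
For illustration, a couple of hypothetical instances of that family (the
exact naming and type mangling are not settled, so treat these as sketches):

  %sum  = call i32 @llvm.reduce.add.v4i32.i32(<4 x i32> %vec)
  %fsum = call float @llvm.reduce.fadd.v4f32.f32.ordered(<4 x float> %fvec)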

     Let's say someone comes up with a 1024-bit vector working on char data. Nobody is really happy to see
     a sequence of "reduce to half" steps for 128 elements. Today, with AVX512BW, we already have the problem
     at half that size (only a few instructions fewer). I don't think anything that is proportional to "LOG(#elems)"
     is "nice and concise".

I agree it would be ugly, but I want to make sure we're clear that
this is mostly irrelevant, unless it impacts compile time or the
complexity of pattern matching beyond what's reasonable.

I would expect larger vectors to have longer patterns, but I wouldn't
oppose an intrinsic for 1024-bit vector reductions. :)

Such a representation is also useful inside a vectorized loop if the programmer wants a bitwise-identical
FP reduction value (obviously at a much slower speed), without losing vectorization of the rest of
the loop body. So, this isn't just "outside the vectorized loop" stuff.

I'm not sure what you mean by this. Do you mean that
@llvm.reduce.add.ordered(...) would still be useful scalarised?

We can get the same ordered behaviour by introducing instruction
dependencies (%sum = %a + %b, then %sum2 = %sum + %c, etc.).
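
Spelled out in IR, a strict in-order fadd reduction of a <4 x float> %v
would look something like this (a sketch with hypothetical value names):

  %e0 = extractelement <4 x float> %v, i32 0
  %e1 = extractelement <4 x float> %v, i32 1
  %e2 = extractelement <4 x float> %v, i32 2
  %e3 = extractelement <4 x float> %v, i32 3
  %s1 = fadd float %e0, %e1
  %s2 = fadd float %s1, %e2
  %s3 = fadd float %s2, %e3    ; strict order: ((e0 + e1) + e2) + e3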

Also, we'd have to teach the middle end or all back-ends to lower that
intrinsic, even if the target has no concept of reduction in its ISA,
in case it appears from some other pass.

Wouldn't that be risky?

Now, can we focus on the value of a "nice and concise representation"? If the community consensus
is "Yes", we can then talk about how to deliver that vision --- one IR instruction, one intrinsic call, a small
number of IR instructions (I don't know how, but someone might have brilliant ideas), etc. --- and then dig
deeper (e.g., an IR instruction for each reduction operator? whether the operator specification is part of an
operand or part of the intrinsic name? etc.). If the community consensus is "No", we can stop talking about
details right away.

As I said, I don't see why we need to have a concise representation at
all, unless it impacts compile time or pattern-matching complexity
(maintainability), or it's the only way (e.g., SVE).

So I vote "it depends". :slight_smile:

cheers,
--renato

Hi, Renato.

So I vote "it depends". :slight_smile:

My preference is to let the vectorizer emit one kind of "reduce vector into scalar"
instead of letting the vectorizer choose one of many different ways.

I'm perfectly fine with
  @llvm.reduce.op.typein.typeout.ordered?(%vector)
being that "one kind of reduce vector into scalar".

I think we are converging enough at the detail level, but have a big
difference of opinion at the "vision" level. :)

Thanks,
Hideki

Vision is always harder than engineering. :)

So, I have copied a lot more people now, to make sure we get a wider
audience. I want to give them a chance to chime in early enough.
Folks, feel free to copy more people if you think they need to have an
opinion.

Here's a summary:

Problem #1: Reductions are common, but:
* the code generated is large (log(lanes) operations + extracts) and
will get "too large" with 1024/2048-bit vectors
* this may cause compile-time slowdowns (we're not sure, we need data)
and increased complexity in the pattern-matching description for the
back-end
* the semantics can be target-specific (ordered / non-ordered /
non-binary reduction), but I don't know of any target that is not
ordered/binary
* constant propagation can hide the semantics half-way through, and
the back-end loses the ability to match

Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
* can't generate binary extracts + ops (meaning unordered, aka
fast), as we don't know the number of lanes
* will have to represent an ordered reduction as a loop based on the
(unknown at compile time) "compile-time constant" that represents the
vector length

So, Amara's proposal converged on solving all those problems with a new
intrinsic similar to:

  %red = call type-out @llvm.reduce.op.vec-ty.type-out.ordered(<vec-ty> %vector)

With:
* "op" one of { add, sub, mul, max, min, etc } x { '', 'f' } for
integer / float
* vec-ty the input type, which can be a scalable type in the future
(<n x N x type>).
* type-out not necessarily equal to the lane type in vec-ty, to allow
for widening/shortening or FP-to-int conversions, if supported.
* "ordered" some flag meaning FP precision is a concern (could be
"fast" meaning the opposite, doesn't matter)

The solution simplifies the loop vectoriser a lot, as well as the SLP
vectoriser, and, as Amara said, could even be used by front-ends or other
passes when the semantics is guaranteed (Haskell?).

However, it also incurs the same issue we had with the other vector
intrinsics: they stiffen the IR, making other optimisations discard
information, avoid valid transformations, or even generate plain bad code.

So far, on the specific topic of reductions, the only case I could
think of where we'd miss optimisations is when we have two compatible
reductions and we want to merge them. For example, NEON can
horizontally reduce one or two vectors at the same time, or we could
have three strided reductions and then combine them all into one after
inlining.

What stops us from doing so with intrinsics is just knowledge of them, so
we trade complexity in the back-end (matching long patterns) for
complexity in optimisation passes (knowing about the new intrinsics).

The arguments were:

Pro intrinsic:
* Simpler and shorter IR
* Easier to read, easier for the vectorisers to generate
* Easier for the back-end to match
* Easier transition towards scalable vectors in the future

Cons:
* Already existing patterns work and can expose further opportunities
* We may have to change more middle-end passes / back-ends than would
otherwise be affected

I really have no strong opinion either way. Both are challenging in
their own right, and this is worth doing anyway.

Did I miss something?

cheers,
--renato

Thanks for the summary; some more comments inline.

I think we are converging enough at the detail level, but have a big
difference of opinion at the "vision" level. :)

Vision is always harder than engineering. :)

So, I have copied a lot more people now, to make sure we get a wider
audience. I want to give them a chance to chime in early enough.
Folks, feel free to copy more people if you think they need to have an
opinion.

Here's a summary:

Problem #1: Reductions are common, but:
* the code generated is large (log(lanes) operations + extracts) and
will get "too large" with 1024/2048-bit vectors
* this may cause compile-time slowdowns (we're not sure, we need data)
and increased complexity in the pattern-matching description for the
back-end
* the semantics can be target-specific (ordered / non-ordered /
non-binary reduction), but I don't know of any target that is not
ordered/binary
* constant propagation can hide the semantics half-way through, and
the back-end loses the ability to match

Unless I'm mistaken, I think the constant propagation issue was with
the proposal to split the reduction into two stages, where the reduce step
depends on the instruction used to produce the input value. This is
unrelated to the current shuffle representation.

Problem #2: As we move to scalable vectors (of unknown length, like SVE), we:
* can't generate binary extracts + ops (meaning unordered, aka
fast), as we don't know the number of lanes
* will have to represent an ordered reduction as a loop based on the
(unknown at compile time) "compile-time constant" that represents the
vector length

So, Amara's proposal converged on solving all those problems with a new
intrinsic similar to:

  %red = call type-out @llvm.reduce.op.vec-ty.type-out.ordered(<vec-ty> %vector)

With:
* "op" one of { add, sub, mul, max, min, etc } x { '', 'f' } for
integer / float
* vec-ty the input type, which can be a scalable type in the future
(<n x N x type>).
* type-out not necessarily equal to the lane type in vec-ty, to allow
for widening/shortening or FP-to-int conversions, if supported.

I didn't include this in my original proposal, although I'm not
particularly opposed to it. In the spirit of keeping the IR as simple
as possible, the widening/narrowing behaviour doesn't need to be
bundled into the reduction: a promotion can be a separate ext of the
input, and I think it should be easily pattern-matched.
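
For example, a widening i8 -> i32 add reduction could then be written as a
separate extension feeding a plain reduction (hypothetical intrinsic name,
following the mangling sketched above):

  %wide = sext <16 x i8> %vec to <16 x i32>
  %sum  = call i32 @llvm.reduce.add.v16i32.i32(<16 x i32> %wide)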

* "ordered" some flag meaning FP precision is a concern (could be
"fast" meaning the opposite, doesn't matter)

The solution simplifies the loop vectoriser a lot, as well as the SLP
vectoriser, and, as Amara said, could even be used by front-ends or other
passes when the semantics is guaranteed (Haskell?).

However, it also incurs the same issue we had with the other vector
intrinsics: they stiffen the IR, making other optimisations discard
information, avoid valid transformations, or even generate plain bad code.

So far, on the specific topic of reductions, the only case I could
think of where we'd miss optimisations is when we have two compatible
reductions and we want to merge them. For example, NEON can
horizontally reduce one or two vectors at the same time, or we could
have three strided reductions and then combine them all into one after
inlining.

My implementation of this for AArch64 NEON (I've not looked at AArch32
yet) converts the generic reduce ISD nodes to AArch64-specific ones
with an extract. It doesn't try to do anything else, so once it gets
past the original tree-shuffle matching code, the DAG should be
identical.
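
As a rough sketch (node names approximate, not the exact patch), the mapping
is along the lines of:

  reduce_add(<4 x i32> %v)  -->  extract_vector_elt(AArch64ISD::UADDV(%v), 0)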

What stops us from doing so with intrinsics is just knowledge of them, so
we trade complexity in the back-end (matching long patterns) for
complexity in optimisation passes (knowing about the new intrinsics).

The arguments were:

Pro intrinsic:
* Simpler and shorter IR
* Easier to read, easier for the vectorisers to generate
* Easier for the back-end to match
* Easier transition towards scalable vectors in the future

I also want to reiterate that reductions are useful for testing bits of
predicate vectors, which is applicable to targets other than SVE
for things like early-exit vectorization. Simon mentioned this to me
off-list. Simon, could you comment here on whether this proposal would
work for your needs?

I also want to reiterate that reductions are useful for testing bits of predicate vectors, which is applicable to targets other than SVE for things like early-exit vectorization. Simon mentioned this to me off-list. Simon, could you comment here on whether this proposal would work for your needs?

For fixed-size vectors, bitcasting a vector of i1s into a single integer with the same number of bits takes
care of most cases. For AVX512, it is a K-register to GPR move. I don't think that works for SVE, though.
This is a place where I don't mind the vectorizer not generating one kind of representation; the
non-intrinsic alternative is too nice to give up. If we are talking about a reduction instruction, however,
that might convey the vectorizer's logic better than bitcasting. That's the line I'd draw.

For future outer loop vectorization, we'll be performing an "is the vector of predicates not all-false"
test, which is a bitcast and then a != 0 comparison.
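
For a fixed-size example, an "any lane set" test on a <16 x i1> predicate
could look like this (hypothetical value names):

  %bits = bitcast <16 x i1> %pred to i16
  %any  = icmp ne i16 %bits, 0    ; true if any predicate lane is set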

Thanks,
Hideki

Yes - I'm hoping that we can both vectorise early-out 'any_of' predicate-test code and perform early-out breaks from already-vectorised cases - nothing I've seen suggests this will get in the way. It's mainly going to be a case of correct recognition in the LV, handling dereferenced arrays, etc., and I don't think these intrinsics will obfuscate these cases/attributes.

Does SVE have an early-out ability, or is it an all-or-nothing mechanism?

Simon.

Yes, SVE can vectorize early-exit loops by using speculative
(first-faulting) loads, which essentially give a predicate of the
lanes that were loaded successfully. For uncounted loops with these
special loads, the loop predicate tests can be done using a 'ptest'
instruction, checking whether the last element is active.

Amara

Ping. Does anyone else have thoughts on this?

Amara

Hi Amara,

It seems the people who replied in this thread are mostly in sync with
the proposal, so why don't you push a review to Phabricator, and let's
take this to the next level?

cheers,
--renato

Agreed, I'll add some LangRef doc changes to the prototype patches and
post them for review.

Thanks,
Amara

Hi folks,

The first patch with LangRef updates is ready for review at
https://reviews.llvm.org/D30086

Thanks,
Amara