[RFC] Vector Predication

Hi,

There is now an RFC for a roadmap to native vector predication support in LLVM and a prototype implementation:

The prototype demonstrates:

- Predicated vector intrinsics with an explicit mask and vector length parameter at the IR level.
- First-class predicated SDNodes at the ISel level. Mask and vector length are value operands.
- An incremental strategy to generalize PatternMatch/InstCombine/InstSimplify and DAGCombiner to work on both regular instructions and EVL intrinsics.
- DAGCombiner example: FMA fusion.
- InstCombine/InstSimplify example: FSub pattern rewrites.
- Early experiments on the LNT test suite (Clang static release, O3 -ffast-math) indicate that compile time on non-EVL IR is not affected by the API abstractions in PatternMatch, etc.

We’d like to get your feedback, in particular on the following, to move forward:

- Can we agree on EVL intrinsics as a transitional step to predicated IR instructions?
- Can we agree on native EVL SDNodes for CodeGen?
- Are the changes to InstCombine/InstSimplify/DAGCombiner and the utility classes that go with them acceptable?

Thanks,
Simon

Can I start w/some basic questions? I’ve skimmed your proposal, but haven’t read it in detail, so if something I ask is already addressed, please feel free to cite existing docs/discussion/whatever.

I’m going to use fsub as my running example, just because it’s the first IR test you posted. :)

%result = call <4 x float> @llvm.evl.fsub.v4f32(<4 x float> %x, <4 x float> %y, <4 x i1> %M, i32 %L)

Question 1 - Why do we need separate mask and lengths? Can’t the length be easily folded into the mask operand?

e.g. newmask = (<4 x i1>)((i4)%M & ((1 << %L) - 1))
and then pattern matched in the backend if needed
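
Spelled out in IR (just a sketch with illustrative names, assuming a fixed 4-lane vector), the fold I have in mind would be roughly:

; Build lane indices, compare against the splatted length, and AND with %M.
%len.ins = insertelement <4 x i32> undef, i32 %L, i32 0
%len.vec = shufflevector <4 x i32> %len.ins, <4 x i32> undef, <4 x i32> zeroinitializer
%inbounds = icmp ult <4 x i32> <i32 0, i32 1, i32 2, i32 3>, %len.vec
%newmask = and <4 x i1> %inbounds, %M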

Question 2 - Have you explored using selects instead? What practical problems do you run into which make you believe explicit predication is required?

e.g. %sub = fsub <4 x float> %x, %y
%result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef

My context for these questions is that my experience recently w/ our existing masked intrinsics shows us missing fairly basic optimizations, precisely because they weren’t able to reuse all of the existing infrastructure. (I’ve been working on SimplifyDemandedVectorElts recently for exactly this reason.) My concern is that your EVL proposal will end up in the same state.

Philip

Philip Reames <listmail@philipreames.com> writes:

Question 1 - Why do we need separate mask and lengths? Can't the
length be easily folded into the mask operand?

e.g. newmask = (<4 x i1>)((i4)%M & ((1 << %L) - 1))
and then pattern matched in the backend if needed

I'm a little concerned about how difficult it will be to maintain enough
information throughout compilation to be able to match this on a machine
with an explicit vector length value.

Question 2 - Have you explored using selects instead? What practical
problems do you run into which make you believe explicit predication
is required?

e.g. %sub = fsub <4 x float> %x, %y
%result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef

That is semantically incorrect. According to IR semantics, the fsub is
fully evaluated before the select comes along. It could trap for
elements where %M is 0, whereas a masked intrinsic conveys the proper
semantics of masking traps for masked-out elements. We need intrinsics
and eventually (IMHO) fully first-class predication to make this work
properly.
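
To make that concrete (a sketch of the hazard, not code from the RFC): because one arm of the select is undef, the optimizer is entitled to drop the masking entirely:

%sub = fsub <4 x float> %x, %y ; evaluated on all lanes, masked or not
%result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef
; ...which InstSimplify may legitimately fold to just:
; %result = fsub <4 x float> %x, %y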

My context for these questions is that my experience recently w/ our
existing masked intrinsics shows us missing fairly basic
optimizations, precisely because they weren't able to reuse all of the
existing infrastructure. (I've been working on
SimplifyDemandedVectorElts recently for exactly this reason.) My
concern is that your EVL proposal will end up in the same state.

I think that's just the nature of the beast. We need IR-level support
for masking and we have to teach LLVM about it.

                           -David

Philip Reames <listmail@philipreames.com> writes:

Question 1 - Why do we need separate mask and lengths? Can't the
length be easily folded into the mask operand?

e.g. newmask = (<4 x i1>)((i4)%M & ((1 << %L) - 1))
and then pattern matched in the backend if needed

I'm a little concerned about how difficult it will be to maintain enough
information throughout compilation to be able to match this on a machine
with an explicit vector length value.

Does the hardware *also* have a mask register? If so, this is a likely minor code quality issue which can be incrementally refined on. If it doesn't, then I can see your concern.

Question 2 - Have you explored using selects instead? What practical
problems do you run into which make you believe explicit predication
is required?

e.g. %sub = fsub <4 x float> %x, %y
%result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef

That is semantically incorrect. According to IR semantics, the fsub is
fully evaluated before the select comes along. It could trap for
elements where %M is 0, whereas a masked intrinsic conveys the proper
semantics of masking traps for masked-out elements. We need intrinsics
and eventually (IMHO) fully first-class predication to make this work
properly.

If you want specific trap behavior, you need to use the constrained family of intrinsics instead. In IR, fsub is expected not to trap.

We have an existing solution for modeling FP environment aspects such as rounding and trapping. The proposed signatures for your EVL proposal do not appear to subsume those, and you've not proposed their retirement. We definitely don't want *two* ways of describing FP trapping.

In other words, I don't find this reason compelling since my example can simply be rewritten using the appropriate constrained intrinsic.
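
For reference, the constrained form of the running fsub example would look something like the following (a sketch using the in-tree llvm.experimental.constrained.fsub intrinsic; the metadata arguments select the rounding mode and exception behavior):

%sub = call <4 x float> @llvm.experimental.constrained.fsub.v4f32(
           <4 x float> %x, <4 x float> %y,
           metadata !"round.dynamic", metadata !"fpexcept.strict")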

My context for these questions is that my experience recently w/ our
existing masked intrinsics shows us missing fairly basic
optimizations, precisely because they weren't able to reuse all of the
existing infrastructure. (I've been working on
SimplifyDemandedVectorElts recently for exactly this reason.) My
concern is that your EVL proposal will end up in the same state.

I think that's just the nature of the beast. We need IR-level support
for masking and we have to teach LLVM about it.

I'm solidly of the opinion that we already *have* IR support for explicit masking in the form of gather/scatter/etc... Until someone has taken the effort to make masking in this context *actually work well*, I'm unconvinced that we should greatly expand the usage in the IR.
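
To be specific, I mean the existing masked memory intrinsics, along the lines of:

%v = call <4 x float> @llvm.masked.load.v4f32.p0v4f32(
         <4 x float>* %p, i32 4, <4 x i1> %M, <4 x float> %passthru)
call void @llvm.masked.store.v4f32.p0v4f32(
         <4 x float> %v, <4 x float>* %q, i32 4, <4 x i1> %M)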

Question 2 - Have you explored using selects instead? What practical
problems do you run into which make you believe explicit predication
is required?

e.g. %sub = fsub <4 x float> %x, %y
%result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef

That is semantically incorrect. According to IR semantics, the fsub is fully
evaluated before the select comes along. It could trap for elements where
%M is 0, whereas a masked intrinsic conveys the proper semantics of
masking traps for masked-out elements. We need intrinsics and eventually
(IMHO) fully first-class predication to make this work properly.

The LLVM language reference says, "The default LLVM floating-point environment assumes that floating-point instructions do not have side effects." So that's why this situation has been tolerated. As you're probably aware, we have work in progress to enable a mode where the default FP environment is not assumed and we properly handle FP status flags and exception unmasking. This will absolutely require masked versions of the operations.

I personally like the idea of having masked operations in the general case as opposed to using selects and hoping the selection DAG will pick the right instructions, because it doesn't always work out that way. But I suppose that needs to be weighed against whatever optimization opportunities are missed because of the less general representation. I agree that we should be able to mitigate this by teaching the optimizer to handle masked operations.

-Andy

Philip Reames <listmail@philipreames.com> writes:

Question 1 - Why do we need separate mask and lengths? Can’t the
length be easily folded into the mask operand?

e.g. newmask = (<4 x i1>)((i4)%M & ((1 << %L) - 1))
and then pattern matched in the backend if needed

I’m a little concerned about how difficult it will be to maintain enough
information throughout compilation to be able to match this on a machine
with an explicit vector length value.

Does the hardware also have a mask register? If so, this is a likely
minor code quality issue which can be incrementally refined on. If it
doesn’t, then I can see your concern.

Masking/predication is supported nearly universally, but I don’t think the code quality issue is minor. It would be on a typical packed-SIMD machine with 128/256/512 bit registers, but processors with a vector length register are usually built with much larger register files and without a corresponding increase in the number of functional units. For example, 4096 bit per vector register is really quite modest for this kind of machine, while the data path can reasonably be “only” 128 or 256 bit.

This changes the calculus quite a bit: vector lengths much shorter than or only minimally larger than one full register are suddenly reasonably common (in application code, not so much in HPC kernels), and because each vector instruction is split into many data-path-sized uops, it’s trivial and very rewarding to cut processing short halfway through a vector. The efficiency of “short vector code” then depends on the ability to finish each operation on those short vectors relatively quickly rather than padding everything to a full vector register.

For example, if a loop with a trip count of 20 is vectorized on a machine with 64 elements per vector (that’s 64b elements in a 4096b register, so this is lowballing it!), using only masks and not the vector length register makes your vector unit do about three times more work than it would have to if you set the vector length register to 20. That keeps the register file and functional units busy for no good reason. Some microarchitectures take on the burden of determining when a whole chunk of the vector is masked out and can then skip over it quickly, but many others don’t. So you’re likely burning a whole bunch of power and quite possibly taking up cycles that could be filled with useful work from other instructions instead.
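
In EVL terms (a sketch borrowing the RFC’s intrinsic naming; the v64f64 variant and %allones are my extrapolation), the whole 20-element computation is one operation that only has to touch 20 lanes:

%r = call <64 x double> @llvm.evl.fsub.v64f64(
         <64 x double> %x, <64 x double> %y,
         <64 x i1> %allones, i32 20) ; %allones = all-true mask; the length operand does the trimming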

Cheers,
Robin

We’re in the process of designing a RISC-V extension (http://lists.libre-riscv.org/pipermail/libre-riscv-dev/2019-January/000433.html) that would have variable-length vectors of short vectors (1 to 4):
<VL x <4 x float>>
where each predicate bit masks out a whole short vector. We’re using this extension to vectorize graphics code where variables in the pre-vectorization code are short vectors.
So, vectorizing code like:

for (int i = 0; i < 1000; i++)
{
    vec4 color = colors[i];
    vec3 normal = normals[i];
    color.rgb *= fmax(0.0, dot(normal, light_dir));
    colors[i] = color;
}

I’m planning on passing already vectorized code into LLVM and using LLVM as a backend for optimization and JIT code generation.

Do you think the EVL proposal would support an ISA like this as it’s currently written (by pattern matching on predicate expansion and vector-length multiplication)?
Or, do you think the EVL proposal would need modification to effectively support this (by adding an element group size argument to EVL intrinsics or something)?
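
For example, here is my own sketch of the predicate-expansion lowering, with hypothetical names: a masked operation on 4 vec4 groups becomes a 16-lane EVL operation whose mask repeats each group bit 4 times and whose vector length is multiplied by the group size:

; Expand one predicate bit per vec4 group into a per-lane mask.
%gmask = shufflevector <4 x i1> %groupmask, <4 x i1> undef,
             <16 x i32> <i32 0, i32 0, i32 0, i32 0, i32 1, i32 1, i32 1, i32 1,
                         i32 2, i32 2, i32 2, i32 2, i32 3, i32 3, i32 3, i32 3>
%vl16 = shl i32 %vl, 2 ; vector length scaled by the group size of 4
%r = call <16 x float> @llvm.evl.fsub.v16f32(
         <16 x float> %x, <16 x float> %y, <16 x i1> %gmask, i32 %vl16)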

Jacob Lifshay

Philip Reames <listmail@philipreames.com> writes:

Question 1 - Why do we need separate mask and lengths? Can't the
length be easily folded into the mask operand?

e.g. newmask = (<4 x i1>)((i4)%M & ((1 << %L) - 1))
and then pattern matched in the backend if needed

I'm a little concerned about how difficult it will be to maintain enough
information throughout compilation to be able to match this on a machine
with an explicit vector length value.

Does the hardware *also* have a mask register? If so, this is a likely
minor code quality issue which can be incrementally refined on. If it
doesn't, then I can see your concern.

Masking/predication is supported nearly universally, but I don't think the code quality issue is minor. It would be on a typical packed-SIMD machine with 128/256/512 bit registers, but processors with a vector length register are usually built with much larger register files and without a corresponding increase in the number of functional units. For example, 4096 bit per vector register is really quite modest for this kind of machine, while the data path can reasonably be "only" 128 or 256 bit.

This changes the calculus quite a bit: vector lengths much shorter than or only minimally larger than one full register are suddenly reasonably common (in application code, not so much in HPC kernels), and because each vector instruction is split into many data-path-sized uops, it's trivial and very rewarding to cut processing short halfway through a vector. The efficiency of "short vector code" then depends on the ability to finish each operation on those short vectors relatively quickly rather than padding everything to a full vector register.

For example, if a loop with a trip count of 20 is vectorized on a machine with 64 elements per vector (that's 64b elements in a 4096b register, so this is lowballing it!), using only masks and not the vector length register makes your vector unit do about three times more work than it would have to if you set the vector length register to 20. That keeps the register file and functional units busy for no good reason. Some microarchitectures take on the burden of determining when a whole chunk of the vector is masked out and can then skip over it quickly, but many others don't. So you're likely burning a whole bunch of power and quite possibly taking up cycles that could be filled with useful work from other instructions instead.

Thank you for the explanation.

Do such architectures frequently have arithmetic operations on the mask registers? (i.e. can I reasonably compute a conservative length given a mask register value?) If I can, then having a mask as the canonical form and re-deriving the length register from a mask for a sequence of instructions which share a predicate seems fairly reasonable. Note that I'm assuming this as a fallback, and that the common case is handled via the equivalent of ComputeKnownBits on the mask itself at compile time.

The only case that the combination of CKB and a dynamic mask->length fallback wouldn't handle reliably is when we have a mask loaded from an external source (memory, function call boundary, etc...) and a short sequence of vector ops. Are such cases really common enough that this needs to be a first-class element of the design?
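
(For a 4-lane mask, the dynamic fallback I have in mind is just a few scalar ops; a sketch, assuming lane 0 maps to the least significant bit:

%bits = bitcast <4 x i1> %M to i4
%lz = call i4 @llvm.ctlz.i4(i4 %bits, i1 false)
%lz.w = zext i4 %lz to i32
%len = sub i32 4, %lz.w ; index of the highest active lane, plus one

An all-false mask gives %len = 0, as desired.)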

p.s. To make sure my tone is coming across correctly, let me spell out that I'm not convinced, but I'm not actively objecting. I'm playing devil's advocate for the purposes of fleshing out a design, but if folks more knowledgeable than I strongly believe the right design requires both masks and lengths, I'm happy to defer on that point.

when we have a mask loaded from an external source (memory, function call boundary, etc…) and a short sequence of vector ops

A mask value arriving as a function call parameter is common. OpenMP declare simd functions do exactly that for the masked cases.

RISC-V has both masks and an active vector length and the semantics
are different.

TLDR: Masked-off elements in the destination register retain their
previous value, but elements past the active vector length are zeroed.


If masking is used (which it is usually not for loops without control
flow inside the vectorised loop) then, yes, logical operations on the
mask registers will happen at every basic block boundary.

But it is NOT the case that you can compute the active vector length
VL from an initial mask value. The active vector length is set by the
hardware based on the remaining application vector length. The VL can
change for each loop iteration -- the normal pattern is for VL to
equal VLMAX for initial executions of the loop, and then be less than
VLMAX for the final one or two iterations of the loop. For example if
VLMAX is 16 and there are 19 elements left in the application vector
then the hardware might choose to use 10 elements for the 2nd to last
iteration and 9 elements for the last iteration. Or not. Other
hardware might choose to perform the last three iterations as 12/12/11
instead of 16/10/9. (It is constrained to be monotonic).

VL can also be dynamically shortened in the middle of a loop iteration
by an unaligned vector load that crosses a protection boundary if the
later elements are inaccessible.

I'm curious what SVE will do if there is an if/then/else in the middle
of a vectorised loop with a shorter-than-maximum vector length. You
can't just invert the mask when going from the then-part to the
else-part because that would re-enable elements past the end of the
vector. You'd need to invert the mask and then AND it with the mask
containing the (bitwise representation of) the vector length.
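
In IR terms the else-mask computation would be something like this sketch, where %avl.mask is a hypothetical mask form of the active vector length:

%not.cond = xor <4 x i1> %cond, <i1 true, i1 true, i1 true, i1 true>
%then.mask = and <4 x i1> %cond, %avl.mask
%else.mask = and <4 x i1> %not.cond, %avl.mask ; plain NOT would re-enable lanes past VL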

Such a mask is at the application level, not at the vector strip-mining loop level.

As well as possibly being many times longer than the masks the hardware works with, it’s likely not even to be in the format the hardware uses: different library APIs might pack a mask into bits, or use one mask element per byte, short, or int.

I think you and I are talking two different things.

As far as Intel’s vector function ABI is concerned, unless the programmer specifically says otherwise, given an OpenMP declare simd function, the compiler will deduce the VF from the HW vector register size and other function signatures. Of course, there can be different vector function ABIs for different targets. The Intel compiler cost model uses the vector function VF as part of loop vectorization VF determination. So, it’s tightly coupled.

A hypothetical vector target may vectorize such a vector function for a 4096b vector, with an explicit VF parameter of 20 also passed to it, to execute only the lower 20-element part of the whole thing.

I think this scenario answers Philip’s question of why we need separate mask and VF parameters, and why the VF can’t be conservatively deduced from the mask / mask computation.

This situation can be handled easily in the standard RISC-V vector
extension. You'd do something like...

vsetvli t0, a0, vsew128,vnreg8,vdiv4

... to configure the vector unit to provide eight vector register
variables divided into a standard element width of 128 bits (some
instructions will widen or narrow one step to/from 64 bits or 256
bits), and then dividing each 128 bit element into 4 parts.

Arithmetic/logical/shift will happen on 32 bit elements, but
predication and loads and stores (including strided or scatter/gather)
will operate on 128 bit elements.

[I just made up "vnreg8" as an alias for the standard "vlmul4" because
"vlmul4,vdiv4" might look confusing. Either way it means to put 0b10
into bits [1:0] of the vtype CSR specifying that the 32 vector
registers should be ganged into 8 groups each 4x longer than standard
because (I'm assuming) we need more than four vector registers in this
loop, but no more than eight]

Hi,

Philip Reames <listmail@philipreames.com> writes:

Question 1 - Why do we need separate mask and lengths? Can't the
length be easily folded into the mask operand?

e.g. newmask = (<4 x i1>)((i4)%M & ((1 << %L) - 1))
and then pattern matched in the backend if needed

I'm a little concerned about how difficult it will be to maintain enough
information throughout compilation to be able to match this on a machine
with an explicit vector length value.

Does the hardware *also* have a mask register? If so, this is a likely minor code quality issue which can be incrementally refined on. If it doesn't, then I can see your concern.

Question 2 - Have you explored using selects instead? What practical
problems do you run into which make you believe explicit predication
is required?

e.g. %sub = fsub <4 x float> %x, %y
%result = select <4 x i1> %M, <4 x float> %sub, <4 x float> undef

That is semantically incorrect. According to IR semantics, the fsub is
fully evaluated before the select comes along. It could trap for
elements where %M is 0, whereas a masked intrinsic conveys the proper
semantics of masking traps for masked-out elements. We need intrinsics
and eventually (IMHO) fully first-class predication to make this work
properly.

If you want specific trap behavior, you need to use the constrained family of intrinsics instead. In IR, fsub is expected not to trap.

We have an existing solution for modeling FP environment aspects such as rounding and trapping. The proposed signatures for your EVL proposal do not appear to subsume those, and you've not proposed their retirement. We definitely don't want *two* ways of describing FP trapping.

In other words, I don't find this reason compelling since my example can simply be rewritten using the appropriate constrained intrinsic.

The existing constrained fp intrinsics do not take a mask or vlen. So you cannot have vectorized trapping fp math at the moment (beyond what LV can do...).

Masking has advantages even in the default non-trapping fp environment: it is not uncommon for fp hardware to be slow on denormal values. If you take the operation + select approach, spurious computation on denormals could occur, slowing down the program.

If your target has no masked fp ops (SSE, NEON, ..), you can still use EVL and have the backend lower it to a "select safe inputs on masked-off lanes + fp operation" pattern. If you emit that pattern too early, InstCombine etc. might fold it away, not least because IR optimizations cannot distinguish between a select that was part of the original program and a select that was inserted only to create a matchable pattern for the backend.
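
A sketch of that late-lowered pattern for the running fsub example (zeroinitializer is one safe input choice here; any value that cannot trap or hit a denormal would do):

%safe.x = select <4 x i1> %M, <4 x float> %x, <4 x float> zeroinitializer
%safe.y = select <4 x i1> %M, <4 x float> %y, <4 x float> zeroinitializer
%sub = fsub <4 x float> %safe.x, %safe.y ; masked-off lanes compute 0.0 - 0.0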

My context for these questions is that my experience recently w/ our
existing masked intrinsics shows us missing fairly basic
optimizations, precisely because they weren't able to reuse all of the
existing infrastructure. (I've been working on
SimplifyDemandedVectorElts recently for exactly this reason.) My
concern is that your EVL proposal will end up in the same state.

I think that's just the nature of the beast. We need IR-level support
for masking and we have to teach LLVM about it.

I'm solidly of the opinion that we already *have* IR support for explicit masking in the form of gather/scatter/etc... Until someone has taken the effort to make masking in this context *actually work well*, I'm unconvinced that we should greatly expand the usage in the IR.

What do you mean by "make masking *work well*"? LLVM's vectorization support is stuck in ~2007 (SSE, ..) with patched-in intrinsics to support masked load/store and gather/scatter on AVX2.

I think this is a chicken-and-egg problem: LLVM's LoopVectorizer is rather limited, and that is used to argue that better IR support for predication is not necessary. However, if we had better IR support, more aggressive vectorization schemes would become possible. Right now, if you are serious about exploiting a SIMD ISA, you use target-specific intrinsics to get the functionality you need.

Hi,

Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.

We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions. This allows most current or future standard/non-standard extensions, such as the B extension, to be vectorized, rather than having to wait for vector versions to be added to the V extension (one reason we are not using the V extension). Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions. We would also be reserving a 32-bit opcode for compressed variants of the most commonly used 48-bit prefixed instructions, similar in style to the C extension. Having a prefix also allows (assuming we don’t run out of encoding space) larger register fields; we’re planning on 128 integer and 128 fp registers.

Btw, for anyone interested, feel free to join us over on libre-riscv-dev or follow at https://www.crowdsupply.com/libre-risc-v/m-class (currently somewhat out-of-date).

Jacob

Sounds good to me. I haven’t checked if the current code allows for that.

Jacob

Neat! I did not know that about the V extension. So this sounds as though the V extension would like support for <VL x <4 x float>>-style vectors as well.

Yes. In general, support for <VL x <M x iN>> where M is in {2,4,8} and
N could be as small as 1, though support for smaller than i8 is
optional. (No distinction is drawn between int and float in the vector
configuration -- that's up to the operations performed.)

We are currently thinking of defining the extension in terms of a 16-bit prefix that changes standard 32-bit instructions into vectorized 48-bit instructions, allowing most future or current standard/non-standard extensions to be vectorized, rather than having to wait for additional extensions to have vector versions added to the V extension (one reason we are not using the V extension instead), such as the B extension.

Do you mean instructions following the standard 48-bit encoding
scheme, that happen to contain a standard 32 bit instruction as a
payload?

Having a prefix rather than, or in addition to, a layout configuration register allows intermixing vector operations on different group/element sizes without having to constantly change the vector configuration every few instructions.

No real difference. The standard RISC-V Vector extension is intended
to allow exactly those changes to the vector configuration every few
instructions. It's mostly the microcontroller people coming from
DSP/SIMD who want to do that, so it's up to them to make that
efficient on their cores -- they might even do macro-op fusion on it.
Big OoO/Supercomputer style code compiled from C/FORTRAN in general
doesn't want to do that kind of thing.

Example code that changes the configuration within a loop to do 16 bit
loads, 16x16->32 multiply, then 32 bit shift and store:

# Example: Load 16-bit values, widen multiply to 32b, shift 32b result
# right by 3, store 32b values.
loop:
    vsetvli a3, a0, vsew16,vlmul4 # vtype = 16-bit integer vectors
    vlh.v v4, (a1) # Get 16b vector
      slli t1, a3, 1 # Multiply element count by 2 bytes/element
      add a1, a1, t1 # Bump pointer
    vwmul.vs v8, v4, v1 # 32b in <v8--v15>

    vsetvli x0, a0, vsew32,vlmul8 # Operate on 32b values
    vsrl.vi v8, v8, 3
    vsw.v v8, (a2) # Store vector of 32b
      slli t1, a3, 2 # Multiply element count by 4 bytes/element
      add a2, a2, t1 # Bump pointer
      sub a0, a0, a3 # Decrement count
      bnez a0, loop # Any more?

(this example is probably only useful if 16x16->32 mul is
significantly faster than 32x32->32, otherwise you'd just load and
sign extend the 16 bit data into 32 bit elements)

A note on vector register numbering. There are registers 0..31. If you
specify vlmul4 then only v0,v4,v8,v12,v16,v20,v24,v28 are valid
register numbers. If you specify vlmul8 then only v0,v8,v16,v24 are
valid.