[RFC] Code generation for RISC-V V-extension

Hi all,

(I uploaded a PDF rendered of this email just in case some of its sections are difficult to read
https://pm.bsc.es/~rferrer/tmp/rfc-vector-code-generation.pdf )

This is a RFC on the strategy we have been developing at the Barcelona Supercomputing Center (BSC) to generate code for the RISC-V V-extension.

We (BSC and SiFive) have prepared a very first patch showing the strategy explained in this email, found here

Feedback is very welcome as we see this change as a foundation on which to build the rest of the V-extension codegen support.

Introduction

The V-extension of RISC-V brings vector computation capabilities to the RISC-V ecosystem. The full details of the ISA specification are found here: https://github.com/riscv/riscv-v-spec This proposal is based on draft version 0.9 found at https://github.com/riscv/riscv-v-spec/releases/tag/0.9.

The V-extension is rather flexible so it is difficult to summarize it in an reasonably sized email. Please refer to the above specification for all the details. I’ll try to describe here all the features that are relevant for code generation.

The ISA provides 32 vector registers v0,v1, …, v31. The size in bits of each vector register is an implementation-specific parameter called VLEN and must be a power of two. Vector registers are partitioned (i.e. densely packed) in elements whose size in bits is a power of two, ranging from 8 to ELEN. ELEN is also a power of two and ELEN<=VLEN.

Due to encoding constraints, not all the operands of a vector operation are encoded in the instructions themselves. Two CSR (control and status registers) are used instead:

  • vl: the number of elements being operated, called the vector length. A vector instruction will operate the elements 0 to vl-1
  • vtype: the vector type. This register encodes the element size of the operation, called the standard element width (sew) and a vector grouping mechanism called the length multiplier (lmul)

The length multiplier (lmul) can take values 1, 2, 4, 8, 1/2, 1/4, 1/8.

  • When lmul=1 the vector instructions operate on the (32) vector registers.
  • When lmul>1 the vector instructions operate on vector groups (see below) encoded in the instruction using the lowest numbered vector register of the group.
  • When lmul<1 the vector instructions operate on the lowest half, quarter or eighth of a vector register.

A vector group of size lmul>1 is the set of consecutive vector registers v{lmul*i}, v{lmul*i+1}, … , v{lmul*(i + 1) - 1}. So

  • lmul=2 has 16 groups: v0, v2, v4, v6, …, v24, v26, v28, v30
  • lmul=4 has 8 groups: v0, v4, v8, v12, v16, v20, v24, v28
  • lmul=8 has 4 groups: v0, v8, v16, v24

For instance, under lmul=4, a vector group v4 operand includes vector registers v4, v5, v6 and v7 as if they had been concatenated as a four times larger vector register. lmul is useful to align the number of elements in vector codes whose element sizes are different (say when combining vectors of 32- and 64-bit elements) or when doing widenings (zero, sign or fp extensions) or narrowings (truncations).

A program must ensure that both vl and vtype have the correct values for a vector operation before executing a vector instruction. This is done using the vsetvli instruction.

vsetvli rdest, rsrc, sew,lmul,tx,mx # We discuss tx,mx below

rsrc is the application vector length (AVL) and will be used when setting the vl. rdest is updated with the value of vl. The spec allows some latitude here but a simple functional model of what vsetvli does is the following

vl ← min(rsrc, lmul*VLEN/sew)
vtype ← sew,lmul,...

vsetvli has a couple of special cases

  • When rsrc is x0 and rdest is not x0 then vl ← lmul*VLEN/sew. In other words, sets vl to be the maximum vector length for a given lmul and sew. This is useful for whole-register operations.
vsetvli t0, x0, e32,m2,… # vl ← 2*VLEN/64
# vtype ← e32,m2,...
# t0 ← vl
  • When both rsrc and rdest are x0 (the hard-coded zero of RISC-V) then vl is used as the AVL. This can be used to change the vtype when we know the ratio sew/lmul will be preserved.
vsetvli x0, x0, e64,m4,… # changing vtype from e32,m2 to e64,m4 is OK (vl is unchanged)
# vtype ← e64,m4,...

Two simple examples (register x10 contains the AVL)

  • Add two 32-bit element vectors under lmul=1
vsetvli x0, x10, e32,m1,…
vadd.vv v1, v2, v3 # v1[0:vl-1] ← v2[0:vl-1] + v3[0:vl-1]
# where v[i:j] is all v[x] where i <= x <= j
  • Add two 64-bit element vectors under lmul=2
vsetvli x0, x10, e64,m2,…
vadd.vv v2, v4, v6 # Updates v2 and v3. Reads v4, v5 and v6, v7
# v2[0:x-1] ← v4[0:x-1] + v6[0:x-1] where x = min(VLEN/64, vl)
# v3[0:y-1] ← v5[0:y-1] + v7[0:y-1] where y = vl - x

Note: it is not necessary to emit a vsetvli instruction before every vector instruction if the current vl and vtype are still suitable for the intended vector operation. Removing redundant vsetvli instructions is not part of this proposal.

Masks and tails

RISC-V Vector extension supports masks in almost all of its instructions. There are no distinguished mask registers, instead vector registers can be used to represent masks. However an instruction whose execution is masked can only use the v0 register as the mask operand. Elements of the destination register that are masked off by the mask are called inactive elements (i.e. masked-off)

A vector instruction can be executed under a vl setting where vl < lmul*VLEN/sew. Elements of the destination register past the current vl are called the tail elements.

There are two modes for the tail and inactive elements

  • undisturbed, in which the element of the destination register is left unmodified
  • agnostic, in which the elements of the destination register is either left unmodified or all its bits set to 1 (for debugging purposes). In this mode we cannot assume anything about the bits of those elements

tx,mx in vsetvli above correspond to these two policies and can be combined in 4 ways:

  • tu,mu. Both tail and inactive are left undisturbed
  • ta,ma. Both tail and inactive are agnostic
  • tu,ma. Tail is left undisturbed and inactive are agnostic
  • ta,mu. Tail is agnostic and inactive are left undisturbed.

It may be reasonable to prefer the mode that is the least defined (such as ta,ma) for the intended IR operation. Some select operations, however, can be folded using masks in instructions (that would lead us to ta,mu). However, for the sake of simplicity and given the early stage of this works, we are going to use tu,mu mode everywhere. Relaxing to agnostic where possible is postponed as later work.

Mapping to LLVM IR Types

It seems reasonable to make VLEN an unknown constant to the compiler. This leads us to use scalable vector types (such as the ones used by Arm SVE target).

However in contrast to SVE, RISC-V V-extension does not prescribe a minimal vector size (other than ELEN<=VLEN). This minimal vector size impacts the LLVM types we can use because it impacts vscale. We propose that LLVM targets only VLEN>=64. We believe other permissible values under the V-ext spec (such as VLEN=8) are not realistically useful. Also we propose that LLVM supports only ELEN=32 and ELEN=64.

These constraints allow us to define vscale as VLEN/64. This makes the LLVM IR types stable between the two ELENs considered.

lmul=⅛ | lmul=¼ | lmul=½ | lmul=1 | lmul=2 | lmul=4 | lmul=8 |
---------------- | ------------ | -------------- | --------------- | ---------------- | ---------------- | ---------------- | ---------------- |
i64 (ELEN=64) | N/A | N/A | N/A | <v x 1 x i64> | <v x 2 x i64> | <v x 4 x i64> | <v x 8 x i64> |
i32 | N/A | N/A | <v x 1 x i32> | <v x 2 x i32> | <v x 4 x i32> | <v x 8 x i32> | <v x 16 x i32> |
i16 | N/A | <v x 1 x i16> | <v x 2 x i16> | <v x 4 x i16> | <v x 8 x i16> | <v x 16 x i16> | <v x 32 x i16> |
i8 | <v x 1 x i8> | <v x 2 x i8> | <v x 4 x i8> | <v x 8 x i8> | <v x 16 x i8> | <v x 32 x i8> | <v x 64 x i8> |
double (ELEN=64) | N/A | N/A | N/A | <v x 1 x double> | <v x 2 x double> | <v x 4 x double> | <v x 8 x double> |
float | N/A | N/A | <v x 1 x float> | <v x 2 x float> | <v x 4 x float> | <v x 8 x float> | <v x 16 x float> |
half | N/A | <v x 1 x half> | <v x 2 x half> | <v x 4 x half> | <v x 8 x half> | <v x 16 x half> | <v x 32 x half> |

(Read <v x k x ty> as <vscale x k x ty>)

One downside of this design is that doesn’t allow vectors of i128 (this is, ELEN=128). In that case vscale would have to be 1/2 under lmul=1. This type (and its fp counterpart float128) are not that common and in case of extreme necessity types for lmul=2 could be used instead.

As for mask vectors, they are physically represented using a layout of densely packed bits in a vector register. It seems reasonable to map them to the following LLVM IR types

<vscale x 1 x i1>
<vscale x 2 x i1>
<vscale x 4 x i1>
<vscale x 8 x i1>
<vscale x 16 x i1>
<vscale x 32 x i1>
<vscale x 64 x i1>

Two types with the same ratio sew/lmul will have the same related mask type. For instance, two different comparisons one under sew=64, lmul=2 and the other under sew=32, lmul=1 will both generate a mask <vscale x 2 x i1>.

Register classes

Given the vector registers and the different sizes of vector groups, we propose to define 4 register classes.

  • VR for vector registers (v0, v1, …, v31). lmul<=1 and mask registers
  • VRM2 for vector groups of length 2, this is lmul=2 (v0m2, v2m2, …, v30m2)
  • VRM4 for vector groups of length 4 (v0m4, v4m4, …, v28m4)
  • VRM8 for vector groups of length 8 (v0m8, v8m8, v16m8, v24m8)

So far it looks like lmul<1 types and mask types do not benefit from having a dedicated class, so VR would be used in that case.

What LLVM IR inputs we care bout

There are three kinds of inputs we believe are important to be selected as vector instructions of V-extension.

  • LLVM IR instructions that operate on vectors (whole-register operations)
vadd <vscale x 2 x i32> %a, %b
  • Vector Predication intrinsics (explicit vector length vector operations)
call <vscale x 2 x i32> @llvm.vp.add(..., i32 %evl)?

See http://www.llvm.org/docs/Proposals/VectorPredication.html

  • RISC-V V-extension target-specific intrinsics
call <vscale x 2 x i32> @llvm.rvv.vadd.nxv2i32(...)

See https://github.com/riscv/rvv-intrinsic-doc

Challenges for code generation

The actual V-extension instructions are not immediately useful for the code generation process (= the gap between LLVM IR and the emission of the instructions)

  • Two instances of the same instruction have different semantics depending on the precise values of vtype and vl.

  • To the best of our knowledge MachineInstrs cannot be overloaded per register class

  • The correct values of vtype and vl must be set before executing a vector instruction

  • MachineInstrs may need more information than just the encoded operands, at least during selection

Other targets also have to solve this problem of “having the right configuration/state before executing an instruction”. As far as I understood from the documentation and comments in llvm-dev, Intel AMX (see http://lists.llvm.org/pipermail/llvm-dev/2020-September/144848.html) also has some configuration step before being able to use the intructions that implement some tensor operations. Looks like our work started earlier than Intel AMX so their approach here might be interesting to compare in case either design/approach can be improved.

Another example of a “needs configuration/set state” target (even if deprecated today and probably of little interest to LLVM) is the Arm VFP extension found in some legacy Arm cores such as the one in the Raspberry Pi 1. I’m pretty sure I’m missing other notable examples. If someone is working on a target with a similar way of working, feel free to chime in!

The actual proposal

(Thanks for reading up to here!)

  1. Mirror all the RISC-V V-extension instructions (which are already on upstream) that depend either on vl or vtype in pseudo instructions. One pseudo instruction per lmul.

  2. Example: VADD_VVPseudoVADD_VV_M1, PseudoVADD_VV_M2, PseudoVADD_VV_M4, PseudoVADD_VV_M8, PseudoVADD_VV_MF2, PseudoVADD_VV_MF4, PseudoVADD_VV_MF8,

  3. Extend each pseudo instructions with additional operands.

  4. a merge operand that is used for undisturbed semantics (otherwise set to IMPLICIT_DEF). This operand is tied to the destination. If this is an actual value it entails tu,mu (see section above “Masks and tails”)

  5. a vector mask operand for masked instructions (set to RISCV::NoRegister if unmasked)

  6. a sew immediate operand

  7. a GPR containing the vector length of this operation (or RISCV::X0 if we want to use the maximum vector length)

  8. implicit uses of vl and vtype registers

  9. Use the pseudo instructions in the SelectionDAG patterns.

  10. Patterns would, at least, choose the right lmul version, set the proper sew immediate.

  11. Optionally set a merge operand if merge semantics is desired

  12. Each pseudo instruction would also have a custom inserter

  13. The custom inserter emits a vsetvli instruction before the instruction. The lmul is derived from the pseudo instruction.

  14. Now the sew and the vl operands are not useful anymore and can be set to -1, RISCV::NoRegister respectively

The final result is that each operation is prepended with a vsetvli that sets both the sew, lmul and the vl.

While correct, this is naive, so later passes can remove the redundant vsetvlis. This is out of the scope of this RFC.

Finally, at the end of CodeGen

  1. AsmPrinter emits the Pseudo Instructions as their corresponding real RISC-V V-extension MCInst instructions
  2. We can’t do that earlier because we want to correctly keep track the liveness of registers (their register classes). This is important for LMUL>1.

Example of a pseudo-instruction

This is a slightly simplified tablegen definition of a pseudo instruction

let Constraints = “$rd = $merge”, Uses = [VL, VTYPE],
usesCustomInserter = 1, BaseInstr = VADD_VV in
def PseudoVADD_VV_M2 : Pseudo<(outs VRM2:$rd),
(ins VRM2:$merge, VRM2:$rs2, VRM2:$rs1, VMaskOp:$vm,
GPR:$vl, ixlenimm:$sew), []>;

Tablegen allows generating systematically (for all lmuls) the pseudo instructions and link them to the real instruction.

GenericTable is used to create a table of the pseudo instructions that contains information about the extra operands.

Example of the codegen flow

Let’s consider a very simple case using a whole-register op (this example uses lmul=2)

%c = add <vscale x 4 x i32> %a, %b

Input ISEL DAG:

t5: nxv4i32 = add t2, t4

Output ISEL DAG:

t5: nxv4i32 = PseudoVADD_VV_M2 IMPLICIT_DEF:nxv4i32, t2, t4, Register:nxv4i1 $noreg, Register:i64 $x0, TargetConstant:i64<32>

MachineInstrs after InstrEmitter (before CustomInserter)

  • %3:vrm2 = IMPLICIT_DEF
  • early-clobber %2:vrm2 = PseudoVADD_VV_M2 %3:vrm2(tied-def 0), %0:vrm2, %1:vrm2, $noreg, $x0, 32, implicit $vl, implicit $vtype

(If you wonder about the early-clobber it is needed to fulfill some constraints between sources and destination registers under lmul>1)

MachineInstrs after CustomInserter

  • %3:vrm2 = IMPLICIT_DEF
  • dead %4:gprnox0 = PseudoVSETVLI $x0, 9, implicit-def $vl, implicit-def $vtype
  • early-clobber %2:vrm2 = PseudoVADD_VV_M2 %3:vrm2(tied-def 0), %0:vrm2, %1:vrm2, $noreg, $noreg, -1, implicit $vl, implicit $vtype

Now optimization passes could remove unnecesary PseudoVSETVLIs (if neither vtype or vl change) or use the special form that only sets vtype (if vl stays the same). Again, not covered in this RFC.

Post Register Allocation:

  • dead renamable $x10 = PseudoVSETVLI $x0, 9, implicit-def $vl, implicit-def $vtype
  • early-clobber renamable $v2m2 = PseudoVADD_VV_M2 undef renamable $v2m2(tied-def 0), killed renamable $v16m2, killed renamable $v18m2, $noreg, $noreg, -1, implicit $vl, implicit $vtype

Finally AsmPrinter lowers the PseudoInstructions into real MCInsts, discarding unneeded operands.

  • <MCInst #3745 <MCOperand Reg:45> <MCOperand Reg:35> <MCOperand Imm:9>>
    vsetvli a0, zero, e32,m2
  • <MCInst #3468 <MCOperand Reg:5> <MCOperand Reg:19> <MCOperand Reg:21> <MCOperand Reg:0>>
    vadd.vv v2, v16, v18

Discussion

Some cons:

  • We need to mirror all the RVV instructions into pseudo instructions (at least 7 pseudos for each RVV instruction). MachineInstrs do not seem to be overloadable per register class.
  • Usefulness of vl and sew operands in the pseudo instructions seems rather limited: only to communicate between InstrEmitter and the CustomInserter.

Pros:

  • We can express what we need (vl, vtype) and all what we want (merging semantics, whole register operations, explicit vl, optimize vsetvl, etc).

Acknowledgements

We want to acknowledge the SiFive team who has been collaborating with us. They have provided very valuable feedback during the internal discussions we have had while shaping this proposal.

This work has been done as part of the European Processor Initiative project. The European Processor Initiative (EPI) (FPA: 800928) has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement EPI-SGA1: 826647

Kind regards,