[RFC] Support RISCV vector tuple type in LLVM

Overview

Currently we have built-in C types for the RISCV vector tuple types, e.g. vint32m1x2_t; however, they are represented as structures of scalable vector types, i.e. {<vscale x 2 x i32>, <vscale x 2 x i32>}. This loses the num_fields (NF) information because the struct is flattened during SelectionDAG construction, which makes it impossible to handle inline assembly operands of vector tuple type. It also makes the calling-convention handling of vector tuple types not straightforward and the allocation code, i.e. RVVArgDispatcher, hard to implement.

This RFC proposes new LLVM types for RISCV vector tuples, represented as a TargetExtType that carries both the LMUL and NF (num_fields) information and keeps it all the way down to SelectionDAG, where it is matched to the corresponding MVT (supported in a follow-up patch). The LLVM IR for the example above is then represented as target("riscv_vec_tuple", <vscale x 8 x i8>, 2), in which the first type parameter is the scalable vector of i8 element type with the equivalent size, and the following integer parameter is the NF of the tuple.

New RISCV-specific vector insert/extract intrinsics, llvm.riscv.vector.insert and llvm.riscv.vector.extract, are also added to handle subvector insertion/extraction for the tuple types, since the generic intrinsics only operate on VectorType, not TargetExtType.
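
As a rough sketch of how this could look in IR for a vint32m1x2_t value: the ".example" suffixes, the index operand type and the operand order below are placeholders of mine for illustration; the RFC only fixes the llvm.riscv.vector.insert/extract names and the overload mangling is left to the patches.

; Placeholder declarations; only the llvm.riscv.vector.{insert,extract}
; prefixes come from this RFC, the rest of each name is illustrative.
declare <vscale x 2 x i32> @llvm.riscv.vector.extract.example(
    target("riscv_vec_tuple", <vscale x 8 x i8>, 2), i32)
declare target("riscv_vec_tuple", <vscale x 8 x i8>, 2) @llvm.riscv.vector.insert.example(
    target("riscv_vec_tuple", <vscale x 8 x i8>, 2), <vscale x 2 x i32>, i32)

; Read field 1 of the tuple; the tuple stays a single SSA value whose type
; carries both the register-group size (nxv8i8) and NF (= 2).
define <vscale x 2 x i32> @get_field1(target("riscv_vec_tuple", <vscale x 8 x i8>, 2) %tuple) {
  %f1 = call <vscale x 2 x i32> @llvm.riscv.vector.extract.example(
            target("riscv_vec_tuple", <vscale x 8 x i8>, 2) %tuple, i32 1)
  ret <vscale x 2 x i32> %f1
}

; Write %v into field 0 and return the updated tuple as a single value.
define target("riscv_vec_tuple", <vscale x 8 x i8>, 2) @set_field0(
    target("riscv_vec_tuple", <vscale x 8 x i8>, 2) %tuple, <vscale x 2 x i32> %v) {
  %res = call target("riscv_vec_tuple", <vscale x 8 x i8>, 2) @llvm.riscv.vector.insert.example(
             target("riscv_vec_tuple", <vscale x 8 x i8>, 2) %tuple, <vscale x 2 x i32> %v, i32 0)
  ret target("riscv_vec_tuple", <vscale x 8 x i8>, 2) %res
}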

In total, 32 LLVM types are added, one for each combination satisfying VREGS * NF <= 8, where VREGS is the number of vector registers needed for the given LMUL and NF is num_fields (a few concrete mappings from frontend types are sketched after the list).
The names of the types are:

target("riscv_vec_tuple", <vscale x 1 x i8>, 2)  // LMUL = mf8, NF = 2
target("riscv_vec_tuple", <vscale x 1 x i8>, 3)  // LMUL = mf8, NF = 3
target("riscv_vec_tuple", <vscale x 1 x i8>, 4)  // LMUL = mf8, NF = 4
target("riscv_vec_tuple", <vscale x 1 x i8>, 5)  // LMUL = mf8, NF = 5
target("riscv_vec_tuple", <vscale x 1 x i8>, 6)  // LMUL = mf8, NF = 6
target("riscv_vec_tuple", <vscale x 1 x i8>, 7)  // LMUL = mf8, NF = 7
target("riscv_vec_tuple", <vscale x 1 x i8>, 8)  // LMUL = mf8, NF = 8
target("riscv_vec_tuple", <vscale x 2 x i8>, 2)  // LMUL = mf4, NF = 2
target("riscv_vec_tuple", <vscale x 2 x i8>, 3)  // LMUL = mf4, NF = 3
target("riscv_vec_tuple", <vscale x 2 x i8>, 4)  // LMUL = mf4, NF = 4
target("riscv_vec_tuple", <vscale x 2 x i8>, 5)  // LMUL = mf4, NF = 5
target("riscv_vec_tuple", <vscale x 2 x i8>, 6)  // LMUL = mf4, NF = 6
target("riscv_vec_tuple", <vscale x 2 x i8>, 7)  // LMUL = mf4, NF = 7
target("riscv_vec_tuple", <vscale x 2 x i8>, 8)  // LMUL = mf4, NF = 8
target("riscv_vec_tuple", <vscale x 4 x i8>, 2)  // LMUL = mf2, NF = 2
target("riscv_vec_tuple", <vscale x 4 x i8>, 3)  // LMUL = mf2, NF = 3
target("riscv_vec_tuple", <vscale x 4 x i8>, 4)  // LMUL = mf2, NF = 4
target("riscv_vec_tuple", <vscale x 4 x i8>, 5)  // LMUL = mf2, NF = 5
target("riscv_vec_tuple", <vscale x 4 x i8>, 6)  // LMUL = mf2, NF = 6
target("riscv_vec_tuple", <vscale x 4 x i8>, 7)  // LMUL = mf2, NF = 7
target("riscv_vec_tuple", <vscale x 4 x i8>, 8)  // LMUL = mf2, NF = 8
target("riscv_vec_tuple", <vscale x 8 x i8>, 2)  // LMUL = m1, NF = 2
target("riscv_vec_tuple", <vscale x 8 x i8>, 3)  // LMUL = m1, NF = 3
target("riscv_vec_tuple", <vscale x 8 x i8>, 4)  // LMUL = m1, NF = 4
target("riscv_vec_tuple", <vscale x 8 x i8>, 5)  // LMUL = m1, NF = 5
target("riscv_vec_tuple", <vscale x 8 x i8>, 6)  // LMUL = m1, NF = 6
target("riscv_vec_tuple", <vscale x 8 x i8>, 7)  // LMUL = m1, NF = 7
target("riscv_vec_tuple", <vscale x 8 x i8>, 8)  // LMUL = m1, NF = 8
target("riscv_vec_tuple", <vscale x 16 x i8>, 2) // LMUL = m2, NF = 2
target("riscv_vec_tuple", <vscale x 16 x i8>, 3) // LMUL = m2, NF = 3
target("riscv_vec_tuple", <vscale x 16 x i8>, 4) // LMUL = m2, NF = 4
target("riscv_vec_tuple", <vscale x 32 x i8>, 2) // LMUL = m4, NF = 2

Background

The RISCV vector tuple types are Clang builtin types that model the NFIELDS vector register groups used by the Vector Load/Store Segment Instructions. For example, vint32m1x2_t is lowered to the LLVM type "{<vscale x 2 x i32>, <vscale x 2 x i32>}" in the current implementation. During register assignment, after calling-convention lowering or register allocation, such a value should be placed in the VRN2M1 register class.
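
As a minimal illustration (the function name is mine, not from any patch), this is all the information the current IR carries for a vint32m1x2_t parameter:

; Field access is a plain extractvalue; nothing in the type says the two
; fields form one NF=2 register group.
define <vscale x 2 x i32> @first_field({<vscale x 2 x i32>, <vscale x 2 x i32>} %seg) {
  %f0 = extractvalue {<vscale x 2 x i32>, <vscale x 2 x i32>} %seg, 0
  ret <vscale x 2 x i32> %f0
}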

Problem statement

The constraint on a vector tuple type is that its fields need to be allocated to adjacent vector registers, for example %0 → v2, %1 → v3; an assignment such as %0 → v2, %1 → v5 is illegal.
However, in the current approach, which uses a "struct of scalable vectors" as illustrated in Background, the tuple is flattened into multiple primitive types, i.e. nxv2i32, nxv2i32, which loses the grouping information during SelectionDAG construction. As a result, we cannot guarantee that the fields are placed in adjacent vector registers.
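
To make the contrast concrete, here is a minimal sketch (declarations only; the callee names are mine) of the same tuple argument today and under the proposal:

; Today: ComputeValueVTs decomposes the struct argument into two unrelated
; nxv2i32 values, so SelectionDAG sees no NF=2 grouping.
declare void @take_tuple.current({<vscale x 2 x i32>, <vscale x 2 x i32>})

; Proposed: NF and the per-field register-group size stay in the type, so the
; backend can map the whole value onto a vrn2m1-class register group.
declare void @take_tuple.proposed(target("riscv_vec_tuple", <vscale x 8 x i8>, 2))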

Related work

There are multiple patches that add initial support for the RISCV vector tuple types:

Pull requests

CC: @rofirrim @asb @wangpc-pp @nikic @lukel @preames @topperc @kito-cheng


Could you please provide some more context for people who are not familiar with RISC-V? What are these types for, what are the current problems with them, and how you are addressing these problems with your proposal?

Is it possible to use the current IR representation and only change how it gets lowered to SDAG?

What is the old type this target type corresponds to? I don’t really get what all these arguments are supposed to mean. What are the two i8s for? I also didn’t understand the reason for the +3.

Keep in mind that you can also have something like target("foo", <vscale x 2 x i32>, 2), if that makes a more natural representation. You don’t need to reduce it to primitives.

Thanks for your quick review, @nikic!

Sure, I will add more context and give a brief description of the solution I provide.

If I understand correctly, you mean using a single scalable vector type to represent the tuple type and lowering it to the corresponding MVT. However, it is then not possible, for example, to determine whether <vscale x 4 x i64>, which uses 4 vector registers in total, should be vint64m4 or vint64m2x2, as both use 4 vector registers.
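
As an illustration (declarations only, with the proposed type used for the tuple case): both arguments below occupy 4 vector registers, so a bare scalable vector cannot tell the two cases apart, while the target type can.

declare void @takes_m4(<vscale x 4 x i64>)                                  ; vint64m4_t
declare void @takes_m2x2(target("riscv_vec_tuple", <vscale x 16 x i8>, 2))  ; vint64m2x2_t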

As for the old type, it could represent every type with LMUL=1 (LMUL=n means a group of n contiguous registers) and NF=2 (NF=n means n contiguous register groups), e.g. vint32m1x2, vint64m1x2, vint8m1x2, etc.

Thanks for the reminder; I've modified the representation and added a comment for each type. I think it's now much clearer.

Do the vrn2m1 etc. reg classes not guarantee that they’re placed in adjacent vector registers? I tried out a quick C example and although it’s flattened in SelectionDAG to multiple nxv2i32 MVTs, the register allocation seems ok:

#include <riscv_vector.h>

void foo(int *x) {
  vint32m1x2_t seg = __riscv_vlseg2e32_v_i32m1x2(x, -1);

  vint32m1_t a = __riscv_vget_i32m1(seg, 0);
  a = __riscv_vadd(a, 2, -1);
  seg = __riscv_vset(seg, 0, a);  
  
  vint32m1_t b = __riscv_vget_i32m1(seg, 1);
  b = __riscv_vadd(b, 1, -1);
  seg = __riscv_vset(seg, 1, b);  
  __riscv_vsseg2e32_v_i32m1x2(x, seg, -1);
}
Optimized legalized selection DAG: %bb.0 'foo:entry'
SelectionDAG has 15 nodes:
  t0: ch,glue = EntryToken
  t2: i64,ch = CopyFromReg t0, Register:i64 %0
  t6: nxv2i32,nxv2i32,ch = llvm.riscv.vlseg2<(load unknown-size from %ir.x, align 4)> t0, TargetConstant:i64<10552>, undef:nxv2i32, undef:nxv2i32, t2, Constant:i64<-1>
      t17: nxv2i32 = llvm.riscv.vadd TargetConstant:i64<10347>, undef:nxv2i32, t6, Constant:i64<2>, Constant:i64<-1>
      t18: nxv2i32 = llvm.riscv.vadd TargetConstant:i64<10347>, undef:nxv2i32, t6:1, Constant:i64<1>, Constant:i64<-1>
    t13: ch = llvm.riscv.vsseg2<(store unknown-size, align 4)> t6:2, TargetConstant:i64<10789>, t17, t18, t2, Constant:i64<-1>
  t14: ch = RISCVISD::RET_GLUE t13
bb.0.entry:
  liveins: $x10
  %0:gpr = COPY $x10
  %1:vrn2m1 = PseudoVLSEG2E32_V_M1 $noreg(tied-def 0), %0:gpr, -1, 5, 2 :: (load unknown-size from %ir.x, align 4)
  %2:vr = COPY %1.sub_vrm1_1:vrn2m1
  %3:vr = COPY %1.sub_vrm1_0:vrn2m1
  %4:vr = PseudoVADD_VI_M1 $noreg(tied-def 0), killed %3:vr, 2, -1, 5, 0
  %5:vr = PseudoVADD_VI_M1 $noreg(tied-def 0), killed %2:vr, 1, -1, 5, 0
  %6:vrn2m1 = REG_SEQUENCE killed %4:vr, %subreg.sub_vrm1_0, killed %5:vr, %subreg.sub_vrm1_1
  PseudoVSSEG2E32_V_M1 killed %6:vrn2m1, %0:gpr, -1, 5 :: (store unknown-size, align 4)
  PseudoRET
foo:                                    # @foo
# %bb.0:                                # %entry
	vsetvli	a1, zero, e32, m1, ta, ma
	vlseg2e32.v	v8, (a0)
	vadd.vi	v8, v8, 2
	vadd.vi	v9, v9, 1
	vsseg2e32.v	v8, (a0)
	ret

Is there a scenario in which we'll fail to put it in one of the tuple reg classes and generate incorrect code?

The description was mostly about SelectionDAGBuilder. The tuples are currently represented as structs in IR. For each argument to a call or function, SelectionDAGBuilder calls ComputeValueVTs. This decomposes structs into a type for each field. The RISCV call lowering and argument lowering code has special code to figure out that each field started out as part of a struct in IR. We use this to assign registers based on the ABI. We don’t use the tuple register classes here.

ComputeValueVTs is also called if a struct is passed to inline assembly. We don't currently have a way to figure out that the individual fields used to be part of a struct as part of the inline assembly handling. This prevents them from being assigned contiguous registers.

Only the segment load/store intrinsics force the tuple-type arguments to be grouped and assigned to vrn*m* register classes when lowering to pseudo instructions in RISCVISelDAGToDAG. Other tuple-type arguments that do not pass through those intrinsics are decomposed into NF MVTs, and we can't guarantee that they are assigned adjacent registers.