Hi,
I am writing to propose a RISC-V Vector extension (RVV) dialect. The RISC-V vector extension v1.0 release candidate has been published, and LLVM currently supports the stable release v0.10. RVV is rapidly emerging, and I believe applications and optimizations will benefit from its features, but RVV is currently absent from MLIR's architecture-specific vector dialects. In MLIR, there are two types of vector-related dialects:
- Virtual vector level / general vector dialect: Vector Dialect
- Hardware vector level / architecture-specific vector dialects: amx Dialect, x86-vector Dialect, arm-neon Dialect, and arm-sve Dialect.
This RFC proposes an initial RVV dialect. Fortunately, the SVE dialect has already explored scalable vector types and operations, allowing me to refer to its design and simplify the implementation on the RVV side.
Motivation and Goal
RVV is a vector instruction set with scalable vector types, designed for vector architectures. Unlike SIMD, RVV can group multiple registers to provide a scalable vector length. It can also hide the length of the physical vector register and let us set the vector length we are operating on. These features help us avoid many of the disadvantages of SIMD:
- SIMD needs new instructions to support longer vector registers, while RVV instructions are not bound to the vector register length.
- SIMD requires more effort than RVV to deal with loop tails because of its fixed vector length.
- SIMD consumes more power on instruction fetch and decode than RVV because it needs more instructions to process a long vector.
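The tail-handling point above can be sketched in plain C. This is a minimal model, not real RVV code: `set_vl` is a hypothetical stand-in for the `vsetvl` instruction, and the inner loop models a single vector operation of `vl` elements.

```c
#include <stddef.h>

/* Hypothetical stand-in for the vsetvl instruction: returns the number of
 * elements the hardware will process this iteration, capped by both the
 * remaining work and the hardware maximum (VLMAX). */
static size_t set_vl(size_t remaining, size_t vlmax) {
    return remaining < vlmax ? remaining : vlmax;
}

/* Vector-length-agnostic addition: the same loop works for any VLMAX,
 * and the final partial iteration needs no separate scalar tail loop,
 * because set_vl simply returns a shorter vector length at the end. */
void vadd(const int *a, const int *b, int *c, size_t n, size_t vlmax) {
    for (size_t i = 0; i < n;) {
        size_t vl = set_vl(n - i, vlmax); /* elements this iteration */
        for (size_t j = 0; j < vl; ++j)   /* models one vector op */
            c[i + j] = a[i + j] + b[i + j];
        i += vl;
    }
}
```

A fixed-width SIMD version of the same loop would instead need a main loop plus an explicit scalar remainder loop for `n % vlmax` elements.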
RVV can thus do better than SIMD in tasks such as machine learning and multimedia processing. I propose the RVV dialect to expose these vector processing features to MLIR, which gives applications and compilers more optimization options and methods.
RVV Dialect First Patch
I have completed the RFC patch, which includes:
- RVV Dialect Definition
- RVV Scalable Vector Type
- RVV Operations
- RVV Intrinsic Operations
- Translation from RVV Dialect to LLVM Dialect
1. RVV Dialect
(1) RVV Scalable Vector Type
Before introducing the scalable type, let’s see some basic concepts for RVV.
- VLEN: the number of bits in a single vector register.
- ELEN: the maximum size, in bits, of a vector element that any operation can produce or consume.
- SEW: the dynamically selected element width.
- LMUL: the vector length multiplier, i.e., the number of vector registers combined to form a vector register group.
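These parameters determine how many elements a single operation can process: the specification defines VLMAX = LMUL * VLEN / SEW. A small sketch of that formula, with LMUL passed as a fraction so that the fractional settings (MF2/MF4/MF8) are representable:

```c
/* VLMAX = LMUL * VLEN / SEW, the maximum number of elements one
 * vector operation can process. LMUL is given as lmul_num/lmul_den
 * so that fractional register grouping (e.g. 1/8) is expressible. */
static unsigned vlmax(unsigned vlen, unsigned sew,
                      unsigned lmul_num, unsigned lmul_den) {
    return vlen * lmul_num / (sew * lmul_den);
}
```

For example, with VLEN=128, SEW=32, and LMUL=1, each operation covers 4 elements; raising LMUL to 8 raises that to 32.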
The mapping relationship between the RVV and LLVM types can be seen here. The following is the mapping table.
|                 | MF8 (LMUL=1/8) | MF4 (LMUL=1/4) | MF2 (LMUL=1/2) | M1 (LMUL=1) | M2 (LMUL=2) | M4 (LMUL=4) | M8 (LMUL=8) |
|-----------------|----------------|----------------|----------------|-------------|-------------|-------------|-------------|
| i64 (SEW=64)    | N/A            | N/A            | N/A            | nxv1i64     | nxv2i64     | nxv4i64     | nxv8i64     |
| i32 (SEW=32)    | N/A            | N/A            | nxv1i32        | nxv2i32     | nxv4i32     | nxv8i32     | nxv16i32    |
| i16 (SEW=16)    | N/A            | nxv1i16        | nxv2i16        | nxv4i16     | nxv8i16     | nxv16i16    | nxv32i16    |
| i8 (SEW=8)      | nxv1i8         | nxv2i8         | nxv4i8         | nxv8i8      | nxv16i8     | nxv32i8     | nxv64i8     |
| double (SEW=64) | N/A            | N/A            | N/A            | nxv1f64     | nxv2f64     | nxv4f64     | nxv8f64     |
| float (SEW=32)  | N/A            | N/A            | nxv1f32        | nxv2f32     | nxv4f32     | nxv8f32     | nxv16f32    |
| half (SEW=16)   | N/A            | nxv1f16        | nxv2f16        | nxv4f16     | nxv8f16     | nxv16f16    | nxv32f16    |
Therefore, we can infer the number of registers in a group and the element type from the LLVM scalable vector type. Similarly, we also need a scalable vector type in MLIR. The SVE dialect currently has a scalable vector type, but it is a dialect-specific version, so I define an RVV scalable vector type using the same method as on the SVE side. The built-in and scalable vector types share the same syntax but have different semantics.
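The `n` in each `nxv<n>` entry of the table follows a simple pattern: LLVM models RVV with a 64-bit scalable-vector unit (vscale = VLEN / 64), so the minimal element count is n = 64 * LMUL / SEW. A small sketch of that derivation, with LMUL again given as a fraction:

```c
/* Minimal element count n in the LLVM type <vscale x n x iSEW>,
 * assuming the 64-bit-per-vscale convention used in the table above:
 * n = 64 * LMUL / SEW, with LMUL passed as lmul_num/lmul_den. */
static unsigned min_elts(unsigned sew, unsigned lmul_num, unsigned lmul_den) {
    return 64 * lmul_num / (sew * lmul_den);
}
```

For instance, SEW=32 with LMUL=4 gives n = 64 * 4 / 32 = 8, matching the `nxv8i32` entry in the table.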
For example, if we want four vector registers to form a group (LMUL=4) operating on the i32 element type, we can use the following type.
!rvv.vector<8xi32>
Corresponding Type in LLVM Dialect:
!llvm.vec<? x 8 x i32>
Corresponding Type in LLVM IR:
<vscale x 8 x i32>
(2) Operations in RVV Dialect
The operations in RVV dialect can be divided into two categories:
- RVV Operations: interoperate with higher-level abstractions.
- RVV Intrinsic Operations: interoperate with LLVM IR and intrinsic.
In the RFC patch, I define basic arithmetic and memory access operations for the integer types. The arithmetic operations can work with a mask and support the vector-scalar form, which means we can operate on a vector with a scalar under a mask. The following table shows all the operations in my initial version.
RVV Operations | RVV Intrinsic Operations |
---|---|
rvv.load | rvv.intr.vle |
rvv.store | rvv.intr.vse |
rvv.add | rvv.intr.vadd |
rvv.sub | rvv.intr.vsub |
rvv.mul | rvv.intr.vmul |
rvv.div | rvv.intr.vdiv |
rvv.masked.add | rvv.intr.vadd_mask |
rvv.masked.sub | rvv.intr.vsub_mask |
rvv.masked.mul | rvv.intr.vmul_mask |
rvv.masked.div | rvv.intr.vdiv_mask |
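The intended semantics of a masked vector-scalar operation such as `rvv.masked.add` can be sketched with a scalar reference model. This is my own simplified model, not the dialect's implementation: active elements receive `src[i] + scalar`, inactive elements take the masked-off value, and tail policy (elements beyond `vl`) is ignored for simplicity.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference model for a masked vector-scalar addition
 * (rvv.masked.add, lowering to a masked vadd intrinsic):
 * out[i] = mask[i] ? src[i] + scalar : maskedoff[i], for i < vl.
 * Elements at or beyond vl are left untouched in this sketch. */
void masked_add_vx(int64_t *out, const int64_t *maskedoff,
                   const int64_t *src, int64_t scalar,
                   const bool *mask, size_t vl) {
    for (size_t i = 0; i < vl; ++i)
        out[i] = mask[i] ? src[i] + scalar : maskedoff[i];
}
```

The example values used below are assumptions for illustration only; the MLIR example later in this post exercises exactly this pattern.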
2. Lowering Path
There are two steps to lower the RVV operations to LLVM IR:
- RVV operations to RVV intrinsic operations: for the basic arithmetic operations, the conversion is a one-to-one lowering; for the memory access operations, the conversion adds some extra operations to convert a memref into a pointer type.
- RVV intrinsic operations to LLVM IR: RVV intrinsic operations sit at the same abstraction level as LLVM dialect operations. By definition, each RVV intrinsic operation binds one-to-one to an LLVM IR intrinsic, so the translation is naturally a one-to-one mapping.
Here I show the lowering paths of the load and add operations.
RVV Load Operation
%0 = rvv.load %m[%c0], %vl : memref<?xi64>, !rvv.vector<4xi64>, i64
RVV Load Intrinsic Operation
%1 = llvm.extractvalue %arg1[1] : !llvm.struct<(ptr<i64>, ptr<i64>, i64, array<1 x i64>, array<1 x i64>)>
%2 = llvm.getelementptr %1[%0] : (!llvm.ptr<i64>, i64) -> !llvm.ptr<i64>
%3 = llvm.bitcast %2 : !llvm.ptr<i64> to !llvm.ptr<vec<? x 4 x i64>>
%4 = "rvv.intr.vle"(%3, %arg2) : (!llvm.ptr<vec<? x 4 x i64>>, i64) -> !llvm.vec<? x 4 x i64>
LLVM IR Load Intrinsic
%4 = extractvalue { i64*, i64*, i64, [1 x i64], [1 x i64] } %1, 1
%5 = getelementptr i64, i64* %4, i64 0
%6 = bitcast i64* %5 to <vscale x 4 x i64>*
%7 = call <vscale x 4 x i64> @llvm.riscv.vle.nxv4i64.i64(<vscale x 4 x i64>* %6, i64 %2)
RVV Addition Operation
%0 = rvv.add %a, %b, %vl : !rvv.vector<4xi64>, !rvv.vector<4xi64>, i64
RVV Addition Intrinsic Operation
%0 = "rvv.intr.vadd"(%arg0, %arg1, %arg3) : (!llvm.vec<? x 4 x i64>, !llvm.vec<? x 4 x i64>, i64) -> !llvm.vec<? x 4 x i64>
LLVM IR Addition Intrinsic
%5 = call <vscale x 4 x i64> @llvm.riscv.vadd.nxv4i64.nxv4i64.i64(<vscale x 4 x i64> %0, <vscale x 4 x i64> %1, i64 %3)
The specific tools and commands used on the lowering path can be seen in the next section. How RVV dialect interoperates with higher-level dialects needs to be explored in the future, especially considering scalable vector types.
An Example
To demonstrate an executable version, I prepared an example (including the mask, mixed-precision, and vector-scalar forms). I define an MLIR function to perform an RVV addition operation and call the function from a C++ program to execute it.
func @vadd(%in1: memref<?xi64>, %in2: i32, %out: memref<?xi64>, %maskedoff: memref<?xi64>, %mask: memref<?xi1>) {
%c0 = constant 0 : index
%vl = constant 6 : i64
%input1 = rvv.load %in1[%c0], %vl : memref<?xi64>, !rvv.vector<4xi64>, i64
%off = rvv.load %maskedoff[%c0], %vl : memref<?xi64>, !rvv.vector<4xi64>, i64
%msk = rvv.load %mask[%c0], %vl : memref<?xi1>, !rvv.vector<4xi1>, i64
%output = rvv.masked.add %off, %input1, %in2, %msk, %vl: !rvv.vector<4xi64>, i32, !rvv.vector<4xi1>, i64
rvv.store %output, %out[%c0], %vl : !rvv.vector<4xi64>, memref<?xi64>, i64
return
}
The C++ program can be found here. Now let's start the journey.
Lowering to LLVM Dialect with MLIR Tools
$ <mlir-opt> <mlir file> -convert-vector-to-llvm="enable-rvv" -convert-scf-to-std -convert-memref-to-llvm -convert-std-to-llvm='emit-c-wrappers=1' | <mlir-translate> -mlir-to-llvmir -o <llvm file>
Translate to LLVM IR and Generate Object File with LLVM Tools
$ <llc> -mtriple riscv64 -target-abi lp64d -mattr=+m,+d,+experimental-v <llvm file> --filetype=obj -o <object file>
Compile and Link with RISC-V GNU Compiler Toolchain
$ <riscv64-unknown-linux-gnu-g++> -mabi=lp64d <C++ file> <object file> -o <executable file>
Run and Simulate with QEMU
Note that QEMU should be built from source code (the rvv-intrinsic branch of the RISC-V GNU compiler toolchain).
$ <qemu-riscv64> -L <sysroot path> -cpu rv64,x-v=true <executable file>
Then you can get the result. According to the mask, the first and last elements are the results of adding the scalar to the vector, and the middle four come from the masked-off register.
[ 7 99 99 99 99 17 ]
Future Work
This RFC only includes the basic arithmetic and memory access operations to express the main idea. In the future, the main direction is to explore how the RVV dialect can benefit higher-level dialects and workloads. There will be a project exploring how to improve convolution with the RVV dialect. We will add more RVV operations for our optimization algorithms, and I hope my group can make discoveries and keep improving the RVV dialect.
I am looking forward to receiving comments and suggestions.
Thanks!
Hongbin