[RFC] Add RISC-V Vector Extension (RVV) Dialect

Brilliant! :smiley: I believe the dialect is in good hands :slight_smile: Once this has landed I'll reach out to Hongbin; there's a lot of work to do around scalable vectors outside of backend dialects, and we should coordinate :slight_smile:

Thanks for taking the time to answer!

2 Likes

FYI, the type mapping from LMUL and SEW to LLVM vscale types all falls apart if VLEN==32 instead of >= 64. We haven't figured out how to address this yet. The implementation defines vscale as VLENB/8, but if VLEN==32 then VLENB==4 and VLENB/8==0. Changing the mapping to support VLEN=32 leaves us no way to encode LMUL=1/8 for SEW=8.
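
To spell out the arithmetic behind this (the formula is my reading of the current mapping, so treat it as an assumption rather than a spec):

$$ \mathrm{vscale} = \frac{\mathrm{VLENB}}{8} = \frac{\mathrm{VLEN}}{64}, \qquad N_{\text{elements}} = \mathrm{vscale} \times \frac{64 \times \mathrm{LMUL}}{\mathrm{SEW}} $$

With SEW=8 and LMUL=1/8 this gives vscale x 1 elements, i.e. <vscale x 1 x i8>. Redefining vscale as VLEN/32 to cover VLEN==32 would require half an element for that same configuration, which no scalable type can express.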

Is the plan to support every RISCV vector operation or just the basic arithmetic, loads, stores, conversions? There is an ongoing effort to add intrinsic versions of basic IR instructions that take a mask and vector length argument. https://llvm.org/docs/LangRef.html#vector-predication-intrinsics It might make sense to target those instead of RISCV vector intrinsics. In theory those are supposed to work on multiple targets.

Thanks for the RFC!
I'm trying to support multiple backends with MLIR, and this could be really helpful!

IMO, supporting all the RVV operations is the ideal end state, but we should add frequently used operations first and then gradually support the others. The reason I only implemented the basic arithmetic, loads, and stores for the initial patch is that I want to keep the RFC simple enough to show the basic idea (and flexible to modify or change direction), and these operations are enough to build an executable example.

Thanks for pointing this out! I think this work can help us create a unified vector abstraction layer in MLIR. I will learn more about the details of this work.

1 Like

Is the plan to support every RISCV vector operation or just the basic arithmetic, loads, stores, conversions?

There is no need to have hw-specific basic arithmetic operations. Standard arithmetic ops on scalable vectors already map neatly to whatever scalable hardware you want to target through LLVM IR. We should only need specific hw instructions for those operations that don't map cleanly into LLVM IR ones (e.g.: matrix multiply or dot products). If we find ourselves having a 1-1 map between an MLIR dialect and a whole ISA, we're very likely doing something wrong. 99% of the work will be adapting passes to work with scalable vectors and building new passes to deal with scalable vectorization. These dialects should be just an outlet for specialized instructions. The only reason we need these right now is because MLIR builtin vector types are fixed length only.
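
For example, a minimal sketch (assuming the proposed built-in scalable vector syntax) of a plain arithmetic op that needs no RVV- or SVE-specific counterpart:

func @scalable_add(%a: vector<[4]xf32>, %b: vector<[4]xf32>) -> vector<[4]xf32> {
  // Hardware-agnostic op on a scalable vector; the LLVM backend decides whether
  // this becomes SVE or RVV code based on the target, not on the dialect used.
  %sum = arith.addf %a, %b : vector<[4]xf32>
  return %sum : vector<[4]xf32>
}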

There is an ongoing effort to add intrinsic versions of basic IR instructions that take a mask and vector length argument. LLVM Language Reference Manual — LLVM 13 documentation It might make sense to target those instead of RISCV vector intrinsics. In theory those are supposed to work on multiple targets.

Indeed, those are the natural target for all masked vector operations. The reason why "masked" instructions in the Arm SVE dialect map to SVE intrinsics (which ended up replicated in RISC-V Vector) is because something was failing in the instruction selection, I was advised it's a work in progress, and I decided to work around that. Eventually, similarly to basic arithmetic instructions, masked operations in the Vector dialect should map to masked vector operations in LLVM IR. Whether those are fixed-length vectors or scalable vectors, RISC-V or SVE, can be determined by the type of the vector operands in the Vector dialect and the target hw in LLVM, respectively.
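
For memory operations, the Vector dialect already has masked ops of that target-agnostic shape; a minimal sketch (whether they accept scalable vector types today is an assumption on my part):

func @masked_load(%base: memref<?xf32>, %i: index,
                  %mask: vector<[4]xi1>, %passthru: vector<[4]xf32>) -> vector<[4]xf32> {
  // Generic masked load; it lowers to the LLVM masked-load intrinsic rather than
  // to a target-specific operation.
  %v = vector.maskedload %base[%i], %mask, %passthru
         : memref<?xf32>, vector<[4]xi1>, vector<[4]xf32> into vector<[4]xf32>
  return %v : vector<[4]xf32>
}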

1 Like

Please read and provide feedback on [RFC] Add built-in support for scalable vector types. If that patch or something to that effect gets accepted, it would significantly simplify this change, as well as the approval process for it.

Thank you!
Javier

1 Like

The built-in support will be very helpful for the RVV side. I have replied to your RFC and expressed my thoughts. In general, I think it is challenging to design a unified scalable vector type, and I am very willing to discuss and contribute to this direction :grin:

1 Like

Indeed, I'm counting on that :smiley: Thanks, Hongbin!

I am writing to share the current state of the dialect. This work depends on two ongoing pieces of work.

  • Built-in Scalable Vector Type

We have discussed this part in @javiersetoain's RFC. After the patch lands, I will replace the current RVV-specific type with the built-in scalable vector type.

  • Integration Test

The integration tests need lli or mlir-cpu-runner to work for the RISC-V backend. However, RuntimeDyld does not support RISC-V yet. My teammate suggests that we use JITLink instead, and we are working on supporting this. Once the JIT supports the RISC-V backend, the integration testing problem can be solved.

6 Likes

Update

  • Sync to the vector type with scalable dimensions.
  • Set the vta as an attribute.
  • Add setvl operation.
  • Some RISC-V + JIT progress (needed by integration test)

Here is the current RISCVV dialect patch.

Sync to the vector type with scalable dimensions.

Following the previous discussion, I synced the type to the built-in vector type with scalable dimensions.

Set the vta as an attribute.

The LLVM intrinsics add a vta argument to let users control the tail policy; see the patch for more details. Here I quote the patch to explain the meaning of tail agnostic and tail undisturbed:

Tail agnostic means users do not care about the values in the tail elements and tail undisturbed means the values in the tail elements need to be kept after the operation.

Since the vta parameter is a tail policy option, it is more appropriate to model it as an attribute in MLIR, and the lowering pass is responsible for converting the attribute into an intrinsic argument.
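
As a sketch of what this could look like (the op syntax below is hypothetical and only illustrates the idea; see the patch for the real definition):

// Hypothetical syntax: the tail policy is carried as an attribute rather than as
// an SSA operand; the lowering pattern materializes it as the trailing intrinsic argument.
%res = riscvv.masked.add %maskedoff, %a, %b, %mask, %vl {vta = 1 : i64}
         : vector<[4]xi64>, i64, vector<[4]xi1>, i64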

Add setvl operation.

vsetvli is a useful instruction in the RISC-V vector extension that sets the vector length according to the AVL, SEW, and LMUL configuration. RVV uses it to achieve a direct and portable strip-mining approach, intended to handle a large number of elements. The return value of this instruction is the number of elements processed in a single iteration. The vsetvli instruction can therefore drive strip-mined loop iterations, which differs from the SIMD style (using masks for tail processing).

After adding this operation, we can use strip-mining style loop iterations in MLIR for the RVV target. I have prepared an example to show this.

https://gist.github.com/zhanghb97/db87cd22d330ba6424b31c70b135b0ca#file-test-rvv-stripmining-mlir

Some RISC-V + JIT progress (needed by integration test)

My teammate has sent some patches aiming to support the JIT for the RISC-V side.
Here I quote part of his summary to show the key challenge:

In RISCV, temporary symbols will be used to generate dwarf, eh_frame sections..., and will be placed in object code's symbol table. However, LLVM does not use names on these temporary symbols.

For more details, please see his patches:

https://reviews.llvm.org/D116475

https://reviews.llvm.org/D116794

Update

Here is the current RISCVV dialect patch.

Integration Test

  • Build the integration test environment

Currently, there is no RVV hardware available, so an emulator is required for the integration tests. I provide an environment setup document that shows how to build the toolchain and run the integration tests.

  • Test cases

I add three cases for the integration tests.

  1. test-riscvv-arithmetic
  2. test-riscvv-memory
  3. test-riscvv-stripmining

Patterns for the mask/tail policy strategies

There are two strategies to control the mask/tail policy in the RISC-V LLVM IR intrinsics:

  1. Use the "policy" argument at the end of the argument list.
  2. Use the "passthrough" argument at the beginning of the argument list.

I add two patterns ("ConvertPolicyOperandOpToLLVMPattern" and "ConvertPassthruOperandOpToLLVMPattern") to deal with these two strategies.

Discussion

  • Unified integration test configurations for the emulator

The emulator configurations for the integration tests are target-specific. For now I use similar configurations on the RVV side, but it seems a little cumbersome. Should we design unified configurations for integration tests that run on emulators?

1 Like

Just wanted to chime in and say thank you for the good work. I can help set up access to real hardware with RVV support if you want to test on it.

1 Like

Hi @powderluv, thanks a lot for your help! I am very excited to hear that RVV hardware is available, and I do hope to test the RFC patch on it! With hardware support, the entire lowering process and the integration tests can be exercised further. Maybe we can discuss the details of how to access the hardware through private messages.

Hi @zhanghb97,
I built the RISCVV dialect patch with the following instructions.

  1. Clone the patch files
git clone https://github.com/llvm/llvm-project.git
cd llvm-project
arc patch D108536
  2. Build local MLIR
cd llvm-project
mkdir build-local-mlir
cd build-local-mlir
cmake -G Ninja ../llvm \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_TARGETS_TO_BUILD="host;RISCV" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON
ninja check-mlir
  3. Export mlir-opt path
export PATH=/llvm-project/build-local-mlir/bin:$PATH

But while lowering the test example vadd.mlir with the following commands, some errors occur.

  • vadd.mlir
func @vadd(%in1: memref<?xi64>, %in2: i32, %out: memref<?xi64>, %maskedoff: memref<?xi64>, %mask: memref<?xi1>) {
  %c0 = arith.constant 0 : index
  %vta = arith.constant 1 : i64
  %vl = arith.constant 6 : i64
  %input1 = riscvv.load %in1[%c0], %vl : memref<?xi64>, !riscvv.vector<!riscvv.m4,i64>, i64
  %off = riscvv.load %maskedoff[%c0], %vl : memref<?xi64>, !riscvv.vector<!riscvv.m4,i64>, i64
  %msk = riscvv.load %mask[%c0], %vl : memref<?xi1>, !riscvv.vector<!riscvv.mask16,i1>, i64
  %output = riscvv.masked.add %off, %input1, %in2, %msk, %vl, %vta : !riscvv.vector<!riscvv.m4,i64>, i32, !riscvv.vector<!riscvv.mask16,i1>, i64
  riscvv.store %output, %out[%c0], %vl : !riscvv.vector<!riscvv.m4,i64>, memref<?xi64>, i64
  return
}
  • Lowering instructions
mlir-opt vadd.mlir -convert-vector-to-llvm="enable-riscvv" -convert-scf-to-std -convert-memref-to-llvm -convert-std-to-llvm='emit-c-wrappers=1' | mlir-translate -mlir-to-llvmir -o vadd.ll
mlir-opt: Unknown command line argument '-convert-scf-to-std'.  Try: 'mlir-opt --help'
mlir-opt: Did you mean '--convert-scf-to-cf'?
mlir-opt: Unknown command line argument '-convert-std-to-llvm=emit-c-wrappers=1'.  Try: 'mlir-opt --help'
mlir-opt: Did you mean '--convert-cf-to-llvm=emit-c-wrappers=1'?

So I changed the command to

mlir-opt vadd.mlir -convert-vector-to-llvm="enable-riscvv" -convert-scf-to-cf -convert-memref-to-llvm -convert-cf-to-llvm='emit-c-wrappers=1' | mlir-translate -mlir-to-llvmir -o vadd.ll

or

mlir-opt vadd.mlir -convert-vector-to-llvm="enable-riscvv"

And the error is

vadd.mlir:5:65: error: dialect 'riscvv' provides no type parsing hook
  %input1 = riscvv.load %in1[%c0], %vl : memref<?xi64>, !riscvv.vector<!riscvv.m4,i64>, i64

Please give me some help.

For mlir-opt, I also tried the version with the flag -reconcile-unrealized-casts, but the same error occurs.

mlir-opt vadd.mlir -convert-vector-to-llvm="enable-riscvv" -convert-scf-to-cf -convert-memref-to-llvm -convert-cf-to-llvm='emit-c-wrappers=1' -reconcile-unrealized-casts | mlir-translate -mlir-to-llvmir -o vadd.ll
vadd.mlir:5:65: error: dialect 'riscvv' provides no type parsing hook
  %input1 = riscvv.load %in1[%c0], %vl : memref<?xi64>, !riscvv.vector<!riscvv.m4,i64>, i64
mlir-opt vadd.mlir -convert-vector-to-llvm="enable-riscvv" -convert-scf-to-std -convert-memref-to-llvm -convert-std-to-llvm='emit-c-wrappers=1' -reconcile-unrealized-casts | mlir-translate -mlir-to-llvmir -o vadd.ll
mlir-opt: Unknown command line argument '-convert-scf-to-std'.  Try: 'mlir-opt --help'
mlir-opt: Did you mean '--convert-scf-to-cf'?
mlir-opt: Unknown command line argument '-convert-std-to-llvm=emit-c-wrappers=1'.  Try: 'mlir-opt --help'
mlir-opt: Did you mean '--convert-cf-to-llvm=emit-c-wrappers=1'?

Hi @njru8cjo, thanks for reporting this!

I notice that you are using the old version of the example, which still uses the target-specific type (!riscvv.vector<!riscvv.m4,i64>). The RFC patch now uses the unified scalable vector type.

As for the latest example, you can see the integration tests in the patch, and you can also find the lowering pass pipeline at the head of the test file. Furthermore, if you want to cross-compile and run the example, please follow this doc to build the RISC-V environment.

1 Like

It works! Thank you very much for your answer and RFC! By the way,
the rvv-intrinsic branch in riscv-gnu-toolchain seems to have been merged into main. Thank you again for your help and this excellent work.

Hi Hongbin,

Thank you so much for working on a dialect for RVV and all the integration with QEMU! Really exciting to see this moving forward! I'm also working on RISC-V on the Google/IREE side and would be happy to collaborate on this work. In general, my feedback is aligned with what @javiersetoain and others have said so far: we should focus on leveraging all the reusable vector components of MLIR and the amazing work that the RISC-V community has been doing in LLVM. That would save us quite some maintainability work!

As mentioned before, it would be helpful if you could share more about your use case. Knowing how you plan to lower to the "RISCVV" dialect would help us identify and prioritize what we need upstream. From our side, some use cases rely on Vector Length Specific (VLS) vectorization at the MLIR (Linalg) and LLVM levels, so we have to make sure that the RISCVV dialect can handle both fixed-length and scalable vectors.

+1. That's the policy we've been following so far for other hardware-specific dialects. Reusing arithmetic/logical operations from hardware-agnostic dialects will significantly reduce the number of operations we have to maintain.

+1. We should leverage the vector predication intrinsics in LLVM. Some of them are supported by the RISC-V backend seamlessly. Bringing those extensions to the Vector dialect would be very useful since they would be reused across targets and automatically lowered to the LLVM counterparts by a common lowering.

IMO, the first version of the RISCVV dialect should be a bit higher level than the LLVM intrinsics to provide the specificity to target RISC-V instructions but also the flexibility to accommodate VLS and VLA vectorization. It should abstract away some low-level details (RISC-V specific vector registers, LMUL configuration, tail/mask policies, etc.) that are currently handled by the LLVM backend, unless, of course, we have a strong reason to make them explicit in MLIR. At the MLIR level we should leverage the concept of a virtual vector and let the backend map it to physical vector registers and their specific configurations. For instance, we could influence the LMUL register configuration by changing the size of the vector type (e.g., assuming VLEN = 128, vector<4xf32> should lead to LMUL=1, vector<8xf32> to LMUL=2, etc.). It's great to see that the iterations on the implementation have been moving in this direction!

I think a good exercise to move this forward might be to create a few examples in the Vector dialect and see what is missing to get them lowered to RISC-V. We could start with some basic kernels with unmasked arithmetic operations, conversions, masked/unmasked loads/stores, etc., and have them compiling and running for both fixed-length and scalable vectors. We may realize that for basic operations we don't need to add much to the RISCVV dialect. Once that is working, we could think about adding masked operations to the Vector dialect and lowering them to the vector predication intrinsics in LLVM.
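
For instance, one of those basic kernels could look something like this (a rough sketch; the scalable op support assumed here would be part of what the exercise verifies):

// Target-agnostic elementwise add through the Vector dialect: load, add, store.
func @vector_add_kernel(%src: memref<?xf32>, %dst: memref<?xf32>, %i: index) {
  %a = vector.load %src[%i] : memref<?xf32>, vector<[4]xf32>
  %b = vector.load %dst[%i] : memref<?xf32>, vector<[4]xf32>
  %sum = arith.addf %a, %b : vector<[4]xf32>
  vector.store %sum, %dst[%i] : memref<?xf32>, vector<[4]xf32>
  return
}

The same kernel with vector<4xf32> instead of vector<[4]xf32> would cover the fixed-length (VLS) case. WDYT?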

Happy to collaborate with you on this front! Let's catch up offline, if needed :slight_smile:.

Diego

Hi Diego,

Thanks a lot for your feedback! It really helps me understand the concerns and requirements on your side, and it also makes the direction and next steps clear.

Use cases on my side

Most of my use cases are (1) using the Vector dialect to implement some optimization algorithms, and (2) using auto-vectorization to achieve acceleration. In my practice, I found that the current vector abstractions cannot support the features of RVV very well. The most important issue is tail processing. We cannot specify a dynamic vector length at runtime, which in my opinion is a very important feature of the RVV architecture. With the dynamic vl setting, we can have strip-mining style loops, which adapt naturally to RVV hardware for tail processing. Auto-vectorization has a similar problem. As far as I know, auto-vectorization in LLVM still uses SIMD-style code generation for the RVV backend, which leads to the overhead of mask instructions. Thus, the purpose of my proposal is to expose these features explicitly so that they can be used from MLIR. Maybe the above explains the reason :arrow_down:

Reuse as much existing work as possible

I totally agree that we should maximize the reuse of existing work.

  • Reusing arithmetic/logical operations from hardware-agnostic dialects

Whether the current operations can be directly reused depends on the information they provide. Currently, I think the problem is that vector operations cannot express a dynamic vector length and cannot express explicit vector-scalar cases (broadcast can be used as a workaround).
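
To make that workaround concrete, a minimal sketch (whether vector.broadcast accepts scalable vector types yet is an assumption on my part):

func @vector_scalar_add(%v: vector<[4]xf32>, %s: f32) -> vector<[4]xf32> {
  // There is no explicit vector-scalar form, so splat the scalar first and rely
  // on the backend to fold the broadcast into a vector-scalar instruction.
  %splat = vector.broadcast %s : f32 to vector<[4]xf32>
  %sum = arith.addf %v, %splat : vector<[4]xf32>
  return %sum : vector<[4]xf32>
}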

  • Leveraging the vector predication intrinsics in LLVM.

I think I need to conduct more experiments to see the current level of support for the vector predication intrinsics. If it is strong enough for our cases, we can indeed save a lot of maintenance cost.

Abstraction level

Following the path of other backend-specific dialects, my original thought was to use a multi-level RISCVV dialect as a cornerstone providing intrinsic-level support, and then to implement conversions and transformations that bridge the generic Vector dialect to the RISCVV dialect and abstract away the low-level details.

I think this is a good idea! Maybe we should not rely only on intrinsics, and we also need to consider both fixed and scalable vector types. Furthermore, I guess you want to use VLS vectorization to support existing optimizations designed for SIMD backends, right? In your experience, are there any compatibility problems between the generic Vector dialect (without VLA and dynamic VL support) and the RISC-V toolchain when using the VLS strategy?

Next step

+1. I will create more examples and summarize the results in the following days. Based on those examples, we can discuss what we need to support for the first version. When I finish my experiments and examples, I will let you know and post the summary here.

Again, thank you for the feedback! Looking forward to discussion and collaboration!

Hongbin

Agreed! We should have a way to model dynamic vector lengths in MLIR. However, I think a "hardware-independent" (loosely tied to RVV) abstraction in the Vector dialect could have significant benefits in terms of reusability and generalization of the existing vectorization algorithms. Otherwise, we would have to introduce hardware-specific IR too early in the vectorization process and create specialized versions of the existing algorithms for RVV.

Regarding tail or epilogue processing, we could vectorize it or even "tail-fold" it (this is what it is called in LLVM) into the main vector loop, by using both masking and a dynamic vector length. Both are on our agenda, but we are currently prioritizing masking (see Linalg and masking) since it has broader applicability.

I'm aware of some work going on to add support for dynamic vector lengths in LLVM, leveraging the vector predication intrinsics. I see value in using a dynamic vector length vs. masking, but it would be interesting to know how they compare in overhead/performance, since the feedback I got about this is not conclusive. AFAIK, changing the vector length in RVV also has some overhead, so I would expect both approaches to be comparable in most cases, unless a smaller vector length led to lower-latency RVV instructions in some RISC-V implementations.

The vector predication intrinsics can take either a mask or a vector length as input so they should be enough to model arithmetic operations with dynamic vector length. Regarding vector-scalar computation, I would expect the LLVM backend to take care of it. It should be able to match a broadcast feeding a vector instruction and map them to a single vector-scalar instruction. We wouldn't have to model vector-scalar computation explicitly in MLIR.

We would have to look at the specific ops when the time comes, but note that many of the "multi-level" low-level dialects were designed when the LLVM dialect had an independent type system and couldn't be mixed easily with operations from other "core" dialects. Given that the LLVM dialect now shares the "core" type system, we could leverage the LLVM intrinsics directly in MLIR and avoid the lower dialect-specific layer. That would cut the number of operations in half!

Not that I'm aware of. We are generating RVV code for our models using the Vector dialect and VLS vectorization. Actually, we want to go even further: we would like to combine VLS with RISC-V scalable intrinsics in LLVM. I asked @javiersetoain and @topperc about this and they pointed at these experimental intrinsics to communicate VLS and VLA code in LLVM. That looks promising.