[RFC] VCIX Dialect

SiFive VCIX (Xsfvcp) is a RISC-V extension that allows users to easily add their own vector instructions and/or interact with their own co-processor through special instructions.

Motivation

The extension has been supported by Clang and LLVM IR for a while, so C/C++ users can efficiently utilize a co-processor that supports the Xsfvcp extension, but MLIR users cannot.

The purpose of this RFC is to add a VCIX dialect to MLIR so that users can target VCIX-compatible co-processors.

Proposal

The PR implements the VCIX dialect for the entire set of VCIX instructions:

Mnemonic     funct6   vm   rs2    rs1    funct3   rd      Destination   Sources
sf.vc.x      0000--   1    -----  xs1    100      -----   none          scalar xs1
sf.vc.i      0000--   1    -----  simm   011      -----   none          simm[4:0]
sf.vc.v.x    0000--   0    -----  xs1    100      vd      vector vd     scalar xs1
sf.vc.v.i    0000--   0    -----  simm   011      vd      vector vd     simm[4:0]
sf.vc.vv     0010--   1    vs2    vs1    000      -----   none          vector vs1, vector vs2
sf.vc.xv     0010--   1    vs2    xs1    100      -----   none          scalar xs1, vector vs2
sf.vc.iv     0010--   1    vs2    simm   011      -----   none          simm[4:0], vector vs2
sf.vc.fv     0010--   1    vs2    fs1    101      -----   none          scalar fs1, vector vs2
sf.vc.v.vv   0010--   0    vs2    vs1    000      vd      vector vd     vector vs1, vector vs2
sf.vc.v.xv   0010--   0    vs2    xs1    100      vd      vector vd     scalar xs1, vector vs2
sf.vc.v.iv   0010--   0    vs2    simm   011      vd      vector vd     simm[4:0], vector vs2
sf.vc.v.fv   0010--   0    vs2    fs1    101      vd      vector vd     scalar fs1, vector vs2
sf.vc.vvv    1010--   1    vs2    vs1    000      vd      none          vector vs1, vector vs2, vector vd
sf.vc.xvv    1010--   1    vs2    xs1    100      vd      none          scalar xs1, vector vs2, vector vd
sf.vc.ivv    1010--   1    vs2    simm   011      vd      none          simm[4:0], vector vs2, vector vd
sf.vc.fvv    10101-   1    vs2    fs1    101      vd      none          scalar fs1, vector vs2, vector vd
sf.vc.v.vvv  1010--   0    vs2    vs1    000      vd      vector vd     vector vs1, vector vs2, vector vd
sf.vc.v.xvv  1010--   0    vs2    xs1    100      vd      vector vd     scalar xs1, vector vs2, vector vd
sf.vc.v.ivv  1010--   0    vs2    simm   011      vd      vector vd     simm[4:0], vector vs2, vector vd
sf.vc.v.fvv  10101-   0    vs2    fs1    101      vd      vector vd     scalar fs1, vector vs2, vector vd

The VCIX dialect consists of unary, binary, ternary, and wide.ternary operations and their read-only variants (i.e., those that do not have a destination vector register). For example, the sf.vc.v.vv instruction will be represented as:

%0 = vcix.binary %const, %op2, %rvl { opcode = 3 : i2 } : (i5, vector<[4] x f32>, ui32) -> vector<[4] x f32>

The operations of the VCIX dialect accept fixed or scalable vectors when RVV encoding is possible.

The PR also implements conversion of the VCIX dialect to LLVM IR. Since the conversion requires the correct bit width for the VL parameter, which is determined by the target, RV64 is assumed by default. If the user wants to target RV32, the function attribute vcix.target_features="+32bit" must be set.
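As a minimal sketch of how that would look (the attribute name follows the text above; the function name and elided body are hypothetical), an RV32 target could be requested like this:

    func.func @exp_scalable_rv32(%arg0: vector<[16]xf32>, %arg1: ui32) -> vector<[16]xf32>
        attributes { vcix.target_features = "+32bit" } {
      // ... VCIX operations ...
    }

With the attribute set, the conversion would presumably emit VCIX intrinsics whose VL parameter is i32 rather than the default i64.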

Use in MLIR ecosystem

Since the dialect operates only on scalable or fixed vector types, it cannot be used directly by high-level dialects that operate on Tensor or MemRef, such as TOSA, StableHLO, ONNX, etc.

Example

The following simple example demonstrates a possible conversion of math.exp with fixed and scalable vector types to VCIX operations:

func.func @exp(%arg0: vector<32xf32>) -> vector<32xf32> {
  %0 = math.exp %arg0 : vector<32xf32>
  return %0 : vector<32xf32>
}

func.func @exp_scalable(%arg0: vector<[16]xf32>, %arg1: ui32) -> vector<[16]xf32> {
  %0 = math.exp %arg0 : vector<[16]xf32>
  return %0 : vector<[16]xf32>
}

After conversion to the VCIX dialect:

func.func @exp(%arg0: vector<32xf32>) -> vector<32xf32> {
  %const = arith.constant 1 : i5
  %0 = vcix.binary %const, %arg0 {opcode = 1 : i2, rs2 = 0 : i5} : (i5, vector<32xf32>) -> vector<32xf32>
  return %0 : vector<32xf32>
}

func.func @exp_scalable(%arg0: vector<[16]xf32>, %arg1: ui32) -> vector<[16]xf32> {
  %const = arith.constant 1 : i5
  %0 = vcix.binary %const, %arg0, %arg1 {opcode = 1 : i2, rs2 = 0 : i5} : (i5, vector<[16]xf32>, ui32) -> vector<[16]xf32>
  return %0 : vector<[16]xf32>
}

Compiling this down to machine code with the Zvl256b extension enabled produces:

…
li a0, 32
li a1, 1
vsetvli zero, a0, e32, m4, ta, ma
sf.vc.v.xv 1, v8, v8, a1
…
…
slli a0, a0, 32
srli a0, a0, 32
li a1, 1
vsetvli zero, a0, e32, m8, ta, ma
sf.vc.v.xv 1, v8, v8, a1

NOTE: Ideally, the operations should use RVV Dialect instead of scalable vectors to minimize verification and conversion logic.


Hey Kolya,

Thanks a lot for the contribution! In general, it looks good to me! I don’t have any major concerns with respect to the proposal (I haven’t had time to look at the PR yet). It looks like these ops are pretty low level so they should be at the RVV level of abstraction and compose well with what we are working on.

It would be good to know more about the use cases that you have in mind that would justify having this upstream in MLIR. Also if there are any end-to-end testing plans (simulator?).

Some feedback that might be useful:

For target-specific dialects, in general, we’ve been trying to keep them pretty minimalist, leveraging the LLVM dialect intrinsic mechanism to define and convert target specific intrinsics. We have only created target-specific dialects on top of the LLVM intrinsics when we needed to implement IR transformations at that level that would benefit from better abstractions than the ones provided by the LLVM intrinsics (e.g., SME dialect). I think we should follow a similar approach here. What kind of transformations other than lowering to LLVM do you foresee happening, if any?

Regarding the upstreaming plan, we usually follow an incremental approach where ops are introduced as they are needed for end-to-end cases. We do that to reduce the maintainability cost even if that means not having the complete spec implemented.

Hopefully that helps!

Thanks!
Diego

they should be at the RVV level of abstraction and compose well with what we are working on.

That will simplify lots of things in the implementation. Looking forward to seeing it, especially how RVV will interact with the current Vector dialect.

Pardon in advance, but my next answers won’t be concrete.

It would be good to know more about the use cases that you have in mind that would justify having this upstream in MLIR. Also if there are any end-to-end testing plans (simulator?).

The semantics of each VCIX instruction are defined by the co-processor. For instance, one vendor can implement a co-processor so that sf.vc.v.vv represents Softmax, while another can implement one so that sf.vc.v.ivv represents Softmax.
That makes e2e testing really complicated; even QEMU won’t help much yet. The best we can do for now is to rely on the expected lowering of each operation to an LLVM intrinsic.
At the same time, I do see a huge benefit in supporting VCIX upstream, as any MLIR-based compiler/framework can then use the extension without needing to generate inline assembly.

For target-specific dialects, in general, we’ve been trying to keep them pretty minimalist, leveraging the LLVM dialect intrinsic mechanism to define and convert target specific intrinsics. We have only created target-specific dialects on top of the LLVM intrinsics when we needed to implement IR transformations at that level that would benefit from better abstractions than the ones provided by the LLVM intrinsics (e.g., SME dialect). I think we should follow a similar approach here. What kind of transformations other than lowering to LLVM do you foresee happening, if any?

Thanks. That’s quite useful to know. Not having specific semantics makes it hard to think about any VCIX-specific optimizations upstream. However, some MLIR-based compilers may introduce some.

Regarding the upstreaming plan, we usually follow an incremental approach where ops are introduced as they are needed for end-to-end cases. We do that to reduce the maintainability cost even if that means not having the complete spec implemented.

Since use of the dialect will be driven by a specific co-processor, it’s really hard to predict which ops are most important. I understand that the community might not welcome supporting all of them from the beginning, so I can limit the initial patch to a few operations only and add the others in follow-up patch(es).

Thanks for the clarifications!

I guess it’s the first time that it makes sense to say that the semantics of an op are defined by its lowering(s) :slight_smile:. Out of curiosity, how do you treat these instructions with regard to side effects in general? I assume they should be treated as black boxes with potentially any arbitrary side effect.

I’ll let others chime in, but I think we would need specific and active users of this technology for it to be upstream. In other words, this shouldn’t be an implementation drop that we leave there with the hope that somebody will use it… and then nobody does. Will SiFive be directly using this technology downstream from the upstream source and update and maintain it frequently? Any other partners that will directly use this technology in MLIR? It would be great if they could speak up and expose their use cases/needs :slight_smile:.

Ok, then I would suggest that the VCIX dialect only contains the thin LLVM intrinsic layer for now. That should be enough for the lowering purposes.

That’s why I’m asking about users… The upstreaming should be driven by the needs and use cases of those.

Thanks for sharing this proposal!

I am not really qualified to comment on the finer details, so will focus on the higher level bits.

This resonates with how I feel. With emphasis on “specific”. Also:

Such e2e tests are really helpful in understanding the bigger picture. That can inform the design and would also help us understand the ultimate goal :slight_smile: But if such tests are not possible then it would suggest that these are very early days?

Does this mean that the intrinsics are fixed but the meaning of these intrinsics depends on hardware implementation? Also:

How is it used/tested when compiling from Clang?

This makes sense, but I think that it would be nice if every upstream dialect somehow composed well and/or complemented other upstream dialects. This is currently unclear. Would you be adding conversions from Linalg/Vector dialect later on?

-Andrzej

The C and LLVM intrinsics do have side-effecting and non-side-effecting versions. The VCIX dialect implements only the side-effecting versions for now, as you noticed.

Of course, maintenance is on our shoulders. The plan is to try to integrate this into IREE first.

Yes, exactly that. As I tried to express above, it’s up to the customer to define the semantics of each VCIX instruction, and different customers may have different things implemented for the same VCIX instruction.

The output is simply compared against the expected one.

Since it’s driven by the co-processor, this is quite hard to do without extra help from the user. An obvious approach is to have a custom compiler for each co-processor. The other approach I was thinking of is to use some sort of PDLL with dynamic loading. That would certainly help avoid rebuilding the compiler, but it requires the developer to know IR.

Thanks @dcaballe @banach-space. Overall, it looks like going with a full VCIX dialect is not justified right now. I assume that there’s no objection to adding VCIX intrinsics to the LLVMIR dialect, is that right?

Well, I guess as with everything else in MLIR, things have to be used and tested. I think adding the intrinsics (which requires creating a folder with the “dialect” name, etc.) would be a good starting point, given that these operations can be targeted in different ways (basically, one per hardware implementation). Once that has landed and you have something working in IREE and have built some expertise, we may want to revisit whether the SiFive-specific implementation should live upstream… Sounds like a plan? :slight_smile:

Thanks for bearing with us!


Yep. That sounds like a good plan.

I’ve opened another PR which extends LLVMIR only. I still added a dialect there, like ROCDL, for a couple of reasons:

  1. Unlike vp-intrinsics, VCIX intrinsics require including IntrinsicsRISCV.h in a common path, which does not look desirable.
  2. The conversion requires knowing xlen (the bitness of the target), so adding functions to infer it to a common path also does not look desirable.

@dcaballe @banach-space gentle ping

Thanks, and sorry for the delay. I’m taking a look at the PR. For testing, I think we could implement a simple Linalg or Arith conversion under the test folder to make sure this is exercised somehow. Let’s move the discussion to the PR?