MLIR VP Ops on RVV Backend Integration Test and Issues Report

Authors: Hongbin @zhanghb97, Yingchi @inclyc

Context

VP intrinsic discussion in the RISC-V Vector Dialect RFC
VP intrinsic discussion in the vector masking representation RFC
In recent discussions about vector abstractions, VP intrinsics have been a critical point in the lowering path. However, there is no integration test for VP intrinsics in MLIR. We mainly focus on the RVV side, so we tested all the MLIR VP Ops with both fixed and scalable vector types on the RVV backend.

Integration Test

You can find the test cases in our buddy-mlir repo and run them in our web application buddy-caas (Buddy Compiler As A Service). We provide a table of the test cases and their results. In short, llc crashes when it legalizes VPFRemOp, VPIntToPtrOp, VPPtrToIntOp, VPReduceFMulOp, and VPReduceMulOp.

PromoteIntegerOperand Op #2: t188: f32 = vp_frem t185, t186, Constant:i1<-1>, Constant:i64<8>

Do not know how to promote this operator's operand!
UNREACHABLE executed at llvm-project/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp:1629!
PromoteIntegerOperand Op #0: t231: i64 = vp_inttoptr t229, Constant:i1<-1>, Constant:i64<8>

Do not know how to promote this operator's operand!
UNREACHABLE executed at llvm-project/llvm/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp:1629!
llc: llvm-project/llvm/include/llvm/CodeGen/ValueTypes.h:309: 
unsigned int llvm::EVT::getVectorNumElements() const: Assertion `isVector() && "Invalid vector type!"' failed.
llc: llvm-project/llvm/include/llvm/CodeGen/ValueTypes.h:309: 
unsigned int llvm::EVT::getVectorNumElements() const: Assertion `isVector() && "Invalid vector type!"' failed.

Discussion and Fix

@inclyc (Yingchi) looked deeply into the issues and came to the following conclusions.

  • VPReduceFMulOp & VPReduceMulOp

The LLVM IR example:

declare i32 @llvm.vp.reduce.mul.v4i32(i32, <4 x i32>, <4 x i1>, i32)

define signext i32 @vpreduce_mul_v4i32(i32 signext %s, <4 x i32> %v, <4 x i1> %m, i32 zeroext %evl) {
  %r = call i32 @llvm.vp.reduce.mul.v4i32(i32 %s, <4 x i32> %v, <4 x i1> %m, i32 %evl)
  ret i32 %r
}

RVV has no multiply-reduction instruction, so VP_REDUCE_{F,}MUL must be unrolled in LLVM. The unrolling of these two VP intrinsics could be shared among backends. The open question is where in LLVM to implement it: in ExpandVectorPredicationPass, or in TLI.expandVecReduce?
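To make concrete what an unrolled expansion would have to compute, here is a minimal Python sketch of the integer case. It follows the LangRef description of llvm.vp.reduce.mul (disabled lanes behave like the neutral value 1). The function name and the `bits` parameter are hypothetical, values are treated as unsigned bit patterns, and this is a model of the semantics, not LLVM code.

```python
def vp_reduce_mul(start, vec, mask, evl, bits=32):
    """Scalar model of llvm.vp.reduce.mul on integer lanes (illustrative only)."""
    modulus = 1 << bits
    acc = start % modulus
    for i, (lane, active) in enumerate(zip(vec, mask)):
        # A lane participates only if its mask bit is set and its index is below evl.
        if active and i < evl:
            acc = (acc * lane) % modulus  # wrap around like an iN multiply
    return acc


# Lane 1 is masked off and lane 3 lies beyond evl, so the result is 2 * 3 * 5.
result = vp_reduce_mul(2, [3, 4, 5, 6], [1, 0, 1, 1], evl=3)
```

Whichever place ends up hosting the expansion, a scalarized loop of this shape is the behavior it would need to reproduce for fixed-length vectors.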

The SelectionDAGBuilder converts VP_REDUCE_MUL to VP_REDUCE_AND when the vector elements are i1. In this specific scenario, the VP intrinsic is compiled to the RVV vredand.vs instruction.

declare i1 @llvm.vp.reduce.mul.v4i1(i1, <4 x i1>, <4 x i1>, i32)

define signext i1 @vpreduce_mul_v4i1(i1 signext %s, <4 x i1> %v, <4 x i1> %m, i32 zeroext %evl) {
  %r = call i1 @llvm.vp.reduce.mul.v4i1(i1 %s, <4 x i1> %v, <4 x i1> %m, i32 %evl)
  ret i1 %r
}
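The i1 case works because multiplication modulo 2 is the same operation as logical AND. The following Python sketch checks this exhaustively for 4-element vectors. The helper names are hypothetical, and mask and evl are left out because a disabled lane contributes the neutral value 1 to both reductions.

```python
from itertools import product


def reduce_mul_i1(start, vec):
    """Multiply-reduce over i1 values, wrapping modulo 2."""
    acc = start
    for lane in vec:
        acc = (acc * lane) % 2
    return acc


def reduce_and_i1(start, vec):
    """AND-reduce over i1 values."""
    acc = start
    for lane in vec:
        acc &= lane
    return acc


def mul_matches_and(width=4):
    """Compare both reductions on every start value and every i1 vector of this width."""
    return all(
        reduce_mul_i1(s, v) == reduce_and_i1(s, v)
        for s in (0, 1)
        for v in product((0, 1), repeat=width)
    )
```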
  • VPFRemOp

The LLVM IR example:

; ModuleID = 'LLVMDialectModule'
define <8 x float> @vpfrem_v8f32(<8 x float> %v1, <8 x float> %v2, <8 x i1> %m, i32 %evl) {
  %ret = call <8 x float> @llvm.vp.frem.v8f32(<8 x float> %v1, <8 x float> %v2, <8 x i1> %m, i32 %evl)
  ret <8 x float> %ret
}

declare <8 x float> @llvm.vp.frem.v8f32(<8 x float>, <8 x float>, <8 x i1>, i32)

The LLVM community has discussed this problem in D104327. The frem node is unsupported due to a lack of available instructions.

For fixed-length vectors we could scalarize, but that option is not (currently) available for scalable-vector types. The support is intentionally left out so that it is equivalent for both vector types.
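As an illustration of what scalarizing a fixed-length vp.frem would compute, here is a small Python model. It assumes frem behaves like C's fmod (the result takes the sign of the dividend) and uses None to stand in for the unspecified result of disabled lanes. The function name is hypothetical and the sketch says nothing about scalable vectors.

```python
import math


def vp_frem(lhs, rhs, mask, evl):
    """Lane-by-lane model of llvm.vp.frem on a fixed-length vector (illustrative only)."""
    out = []
    for i, (a, b, active) in enumerate(zip(lhs, rhs, mask)):
        if active and i < evl:
            try:
                out.append(math.fmod(a, b))
            except ValueError:
                # Python raises for a zero divisor or an infinite dividend;
                # the IEEE remainder in those cases is NaN.
                out.append(math.nan)
        else:
            out.append(None)  # disabled lane: no defined result
    return out


# Lane 2 is masked off and lane 3 lies beyond evl.
lanes = vp_frem([5.5, -7.0, 9.0, 1.0], [2.0, 3.0, 4.0, 1.0], [True, True, False, True], evl=3)
```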

  • VPPtrToIntOp & VPIntToPtrOp

VPPtrToIntOp and VPIntToPtrOp were introduced in D122291. The scalar inttoptr instruction is lowered to zext / trunc in SelectionDAGBuilder, and introducing similar logic there for the VP versions is a possible solution. This generates redundant instructions in the DAG builder, e.g. zext (trunc), which should be reduced in InstCombine. Currently there is no similar logic for VP intrinsics, so vp.zext and vp.trunc may not be reduced or eliminated in the same way.
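The zext / trunc behavior described above can be sketched in Python as follows. It models the LangRef rule for inttoptr (zero-extend when the integer is narrower than the pointer, truncate when it is wider) and applies it per enabled lane. The function names and the default 64-bit pointer width are assumptions for illustration, and None stands in for the unspecified result of disabled lanes.

```python
def int_to_ptr(value, int_bits, ptr_bits=64):
    """Model of scalar inttoptr: zero-extend or truncate to the pointer width."""
    value &= (1 << int_bits) - 1          # read the input as an unsigned int_bits value
    return value & ((1 << ptr_bits) - 1)  # no-op when widening, truncation when narrowing


def vp_inttoptr(vec, mask, evl, int_bits, ptr_bits=64):
    """Apply int_to_ptr to each lane that is enabled by both mask and evl."""
    return [
        int_to_ptr(lane, int_bits, ptr_bits) if (active and i < evl) else None
        for i, (lane, active) in enumerate(zip(vec, mask))
    ]


# A <4 x i32> input converted to 64-bit pointers; lane 2 is masked off, lane 3 is beyond evl.
ptrs = vp_inttoptr([0xFFFFFFFF, 1, 2, 3], [1, 1, 0, 1], evl=3, int_bits=32)
```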

@inclyc has submitted a candidate patch here!

The LLVM IR example in the patch:

declare <4 x ptr> @llvm.vp.inttoptr.v4p0.v4i32(<4 x i32>, <4 x i1>, i32)

define <4 x ptr> @inttoptr_v4p0_v4i32(<4 x i32> %va, <4 x i1> %m, i32 zeroext %evl) {
; CHECK-LABEL: inttoptr_v4p0_v4i32:
; CHECK:       # %bb.0:
; CHECK-NEXT:    vsetvli zero, a0, e64, m2, ta, ma
; CHECK-NEXT:    vzext.vf2 v10, v8, v0.t
; CHECK-NEXT:    vmv.v.v v8, v10
; CHECK-NEXT:    ret
  %v = call <4 x ptr> @llvm.vp.inttoptr.v4p0.v4i32(<4 x i32> %va, <4 x i1> %m, i32 %evl)
  ret <4 x ptr> %v
}

Thanks @zhanghb97 and @inclyc! Really excited to finally see all these issues shared with the community! Perhaps @topperc and @asb can help with directions for the RISC-V backend issues?

However, there is no integration test for VP intrinsic in MLIR. We mainly focus on the RVV side, so we test all the MLIR VP Ops with both fixed and scalable vector types on the RVV backend.

I think bringing these integration tests to MLIR would be really valuable! Do you have plans to do that? It should be similar to the integration tests that we have for Intel’s AMX, using a simulator.

There is a bi-weekly vector predication meeting listed in the table on the "Getting Involved" page of the LLVM 16.0.0git documentation. Unfortunately, attendance has been low recently.

Adding @frasercrmck and @rofirrim as well

Yes! I would love to bring these integration tests upstream. I will prepare an initial boilerplate patch as the first step and add more test cases gradually.

That is great! We can follow the latest progress at the meeting.

@inclyc wants to fix the above issues, and we would like to get more suggestions. Thanks in advance.

I have submitted the patch D137816 to add the initial VP intrinsic integration test.

  • Run the test cases on the host by configuring the CMake option -DMLIR_INCLUDE_INTEGRATION_TESTS=ON. Both the X86 and Arm (Apple M1 Max) backends pass the VP intrinsic tests, but some other integration tests fail on the Arm side (complex correctness, memref_abi, some sparse tensor cases).
  • Build the RVV environment and run the test cases on RVV QEMU by following this doc.