How to get LMUL>1 in RVV

How does auto-vectorization choose the LMUL value for the RISC-V V extension? It always seems to choose LMUL=1, even when the option --riscv-v-fixed-length-vector-lmul-max=8 is specified.
The sample IR below (MLIR LLVM dialect) was generated internally by IREE for a fixed tensor size.

      builtin.module attributes {llvm.data_layout = "e-m:e-p:64:64-i64:64-i128:128-n64-S128", llvm.target_triple = "riscv64"}  {
        llvm.func internal @forward_dispatch_0(%arg0: !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>>, %arg1: !llvm.ptr<array<3 x i32>>, %arg2: !llvm.ptr<i8>) -> i32 attributes {sym_visibility = "private"} {
          %0 = llvm.mlir.constant(0 : index) : i64
          %1 = llvm.mlir.constant(4 : index) : i64
          %2 = llvm.mlir.constant(1 : index) : i64
          %3 = llvm.load %arg0 : !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>>
          %4 = llvm.extractvalue %3[5] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>
          %5 = llvm.mlir.constant(0 : i64) : i64
          %6 = llvm.getelementptr %4[%5] : (!llvm.ptr<ptr<i8>>, i64) -> !llvm.ptr<ptr<i8>>
          %7 = llvm.load %6 : !llvm.ptr<ptr<i8>>
          %8 = llvm.getelementptr %7[%0] : (!llvm.ptr<i8>, i64) -> !llvm.ptr<i8>
          %9 = llvm.bitcast %8 : !llvm.ptr<i8> to !llvm.ptr<f32>
          %10 = llvm.load %arg0 : !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>>
          %11 = llvm.extractvalue %10[5] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>
          %12 = llvm.mlir.constant(1 : i64) : i64
          %13 = llvm.getelementptr %11[%12] : (!llvm.ptr<ptr<i8>>, i64) -> !llvm.ptr<ptr<i8>>
          %14 = llvm.load %13 : !llvm.ptr<ptr<i8>>
          %15 = llvm.getelementptr %14[%0] : (!llvm.ptr<i8>, i64) -> !llvm.ptr<i8>
          %16 = llvm.bitcast %15 : !llvm.ptr<i8> to !llvm.ptr<f32>
          %17 = llvm.load %arg0 : !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>>
          %18 = llvm.extractvalue %17[5] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>
          %19 = llvm.mlir.constant(2 : i64) : i64
          %20 = llvm.getelementptr %18[%19] : (!llvm.ptr<ptr<i8>>, i64) -> !llvm.ptr<ptr<i8>>
          %21 = llvm.load %20 : !llvm.ptr<ptr<i8>>
          %22 = llvm.getelementptr %21[%0] : (!llvm.ptr<i8>, i64) -> !llvm.ptr<i8>
          %23 = llvm.bitcast %22 : !llvm.ptr<i8> to !llvm.ptr<f32>
          %24 = llvm.load %arg1 : !llvm.ptr<array<3 x i32>>
          %25 = llvm.extractvalue %24[0] : !llvm.array<3 x i32>
          %26 = llvm.zext %25 : i32 to i64
          %27 = llvm.load %arg0 : !llvm.ptr<struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>>
          %28 = llvm.extractvalue %27[0] : !llvm.struct<"iree_hal_executable_dispatch_state_v0_t", (array<3 x i32>, array<3 x i32>, i64, ptr<i32>, i64, ptr<ptr<i8>>, ptr<i64>)>
          %29 = llvm.extractvalue %28[0] : !llvm.array<3 x i32>
          %30 = llvm.zext %29 : i32 to i64
          %31 = llvm.mlir.constant(64 : index) : i64
          %32 = llvm.mul %26, %31  : i64
          %33 = llvm.mul %30, %31  : i64
          llvm.br ^bb1(%32 : i64)
        ^bb1(%34: i64):  // 2 preds: ^bb0, ^bb5
          %35 = llvm.icmp "slt" %34, %1 : i64
          llvm.cond_br %35, ^bb2, ^bb6
        ^bb2:  // pred: ^bb1
          %36 = llvm.mlir.constant(-1 : index) : i64
          %37 = llvm.mul %34, %36  : i64
          %38 = llvm.add %37, %1  : i64
          %39 = llvm.icmp "slt" %31, %38 : i64
          %40 = llvm.select %39, %31, %38 : i1, i64
          %41 = llvm.bitcast %9 : !llvm.ptr<f32> to !llvm.ptr<f32>
          %42 = llvm.mul %34, %2  : i64
          %43 = llvm.add %0, %42  : i64
          %44 = llvm.bitcast %16 : !llvm.ptr<f32> to !llvm.ptr<f32>
          %45 = llvm.bitcast %23 : !llvm.ptr<f32> to !llvm.ptr<f32>
          llvm.br ^bb3(%0 : i64)
        ^bb3(%46: i64):  // 2 preds: ^bb2, ^bb4
          %47 = llvm.icmp "slt" %46, %40 : i64
          llvm.cond_br %47, ^bb4, ^bb5
        ^bb4:  // pred: ^bb3
          %48 = llvm.add %43, %46  : i64
          %49 = llvm.getelementptr %41[%48] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
          %50 = llvm.load %49 : !llvm.ptr<f32>
          %51 = llvm.getelementptr %44[%48] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
          %52 = llvm.load %51 : !llvm.ptr<f32>
          %53 = llvm.fsub %50, %52  : f32
          %54 = llvm.getelementptr %45[%48] : (!llvm.ptr<f32>, i64) -> !llvm.ptr<f32>
          llvm.store %53, %54 : !llvm.ptr<f32>
          %55 = llvm.add %46, %2  : i64
          llvm.br ^bb3(%55 : i64)
        ^bb5:  // pred: ^bb3
          %56 = llvm.add %34, %33  : i64
          llvm.br ^bb1(%56 : i64)
        ^bb6:  // pred: ^bb1
          %57 = llvm.mlir.constant(0 : i32) : i32
          llvm.return %57 : i32
        }
      }

This produces:

               	vsetvli	s1, zero, e16, m1

for a fixed-size 128-element array, where m4 or m8 might be a better choice.

Hi,

Try the -riscv-v-vector-bits-min option for llc, e.g. -riscv-v-vector-bits-min=128.
In my experience, LMUL is then set to a suitable value according to the size of the MLIR vector type. Here is an example:

memref.global "private" @gv : memref<6x4xf32> = dense<[[0. , 1. , 2. , 3. ],
                                                       [10., 11., 12., 13.],
                                                       [20., 21., 22., 23.],
                                                       [30., 31., 32., 33.],
                                                       [40., 41., 42., 43.],
                                                       [50., 51., 52., 53.]]>

func @main() {
  %mem = memref.get_global @gv : memref<6x4xf32>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c2 = arith.constant 2 : index
  %load_vec1 = vector.load %mem[%c0, %c0] : memref<6x4xf32>, vector<16xf32>
  %load_vec2 = vector.load %mem[%c1, %c0] : memref<6x4xf32>, vector<16xf32>
  %load_vec3 = vector.load %mem[%c2, %c0] : memref<6x4xf32>, vector<16xf32>
  %res = vector.fma %load_vec1, %load_vec2, %load_vec3 : vector<16xf32>
  vector.print %res : vector<16xf32>
  return
}

Generate assembly with the MLIR and LLVM toolchain:

$ <mlir-opt> <filename> \
    --convert-vector-to-llvm --convert-memref-to-llvm --convert-std-to-llvm \
    --reconcile-unrealized-casts | \
  <mlir-translate> --mlir-to-llvmir | \
  <llc> -mtriple riscv64 -target-abi lp64d \
    -mattr=+m,+d,+experimental-v -riscv-v-vector-bits-min=128 \
    --filetype=asm -o log.s

The resulting assembly contains:

...
vsetivli	zero, 16, e32, m4, ta, mu
...
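
As a rough rule of thumb, the m4 here matches the vector's total size divided by the minimum register width: 16 x 32 bits = 512 bits, which needs four 128-bit registers. The C sketch below only does that arithmetic. The function name `lmul_for_fixed_vector` is made up for illustration, fractional LMUL is ignored, and the rule is an assumption read off the observed output, not a copy of what the backend actually does.

```c
#include <assert.h>

/* Hypothetical helper: estimate the LMUL needed to hold a fixed-length
 * vector of `elems` elements, each `sew_bits` wide, when a single vector
 * register is assumed to be `vlen_bits` wide (the value passed to
 * -riscv-v-vector-bits-min). Fractional LMUL is not modelled. */
unsigned lmul_for_fixed_vector(unsigned elems, unsigned sew_bits,
                               unsigned vlen_bits)
{
    assert(vlen_bits != 0);

    /* Number of whole registers needed, rounded up. */
    unsigned total_bits = elems * sew_bits;
    unsigned regs = (total_bits + vlen_bits - 1) / vlen_bits;

    /* LMUL is a power of two, so round the register count up to one. */
    unsigned lmul = 1;
    while (lmul < regs)
        lmul *= 2;

    /* A result above 8 would not fit in a single register group. */
    return lmul;
}
```

For example, `lmul_for_fixed_vector(16, 32, 128)` gives 4, which agrees with the `m4` in the vsetivli above.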

For more details, see here, where I give more examples.

I notice you are using IREE, but I’m not sure whether IREE's HAL does anything special to the LMUL setting.

Hope this helps you!

Hongbin

I have used -riscv-v-vector-bits-min=256. The issue is with automatic vectorization: the vector dialect is not being used.

Hi, have you managed to get this to work?
I can get LMUL>1 if I use pragmas, e.g. with

#pragma clang loop vectorize_width(64)

I can get the vsetvli I want:

vsetvli t0,a2,e16,m8,ta,ma

But if I don’t specify the width, the vectorization pass decides it should be 8, resulting in LMUL=1 in this case.
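
For context, here is a minimal sketch of where the pragma goes. The function name `sub_i16`, the 16-bit element type (guessed from the e16 in the vsetvli) and the subtraction itself are placeholders for illustration only; the pragma line is the part that matters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative kernel: element-wise dst[i] = a[i] - b[i] on 16-bit data.
 * `restrict` tells the compiler the arrays do not overlap, which makes
 * the loop easier to vectorize. */
void sub_i16(int16_t *restrict dst, const int16_t *restrict a,
             const int16_t *restrict b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);

    /* Ask clang's loop vectorizer for 64 elements per vector iteration.
     * Other compilers are expected to ignore this pragma. */
#pragma clang loop vectorize_width(64)
    for (size_t i = 0; i < n; ++i)
        dst[i] = (int16_t)(a[i] - b[i]);
}
```

Assuming a 128-bit minimum vector length, a width of 64 with 16-bit elements spans 64 x 16 = 1024 bits, i.e. eight registers, which would be consistent with the m8 above. A width of 8 spans only one register.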

It’s hard for me to follow how and where the RISC-V-specific information is propagated to the loop vectorizer. I think it can at least check that a vectorization width is legal, because if you go too high, it will (silently) fall back to a smaller width.

I had been forcing it to use LMUL=8 with command line arguments, not pragmas.