[MLIR][Arith][GPU] Arith expansion results in inefficient lowering

I’m generating GPU code with MLIR by invoking the gpu-lower-to-nvvm-pipeline pass, and comparing the result against a matrix transpose written in CUDA. On examining the PTX generated by MLIR, I noticed that the arith.floordivsi operation is expanded by the --arith-expand pass into a branch that checks the signs of the operands before performing the division. While this is the correct transformation when negative inputs are possible, it is unnecessary for index address calculations, where all addresses are non-negative. This transformation significantly impacts performance, causing nearly a 30% drop.
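
To illustrate, here is a minimal sketch (the function and value names are hypothetical): arith.floordivsi must round toward negative infinity, so --arith-expand emits sign-handling logic, while arith.divui needs none when both operands are known to be non-negative.

func.func @index_calc(%i: index, %stride: index) -> index {
  // --arith-expand rewrites this into compare/select logic that adjusts
  // the quotient when either operand is negative:
  %q_signed = arith.floordivsi %i, %stride : index
  // Branch-free alternative, equivalent whenever %i and %stride are
  // both known to be non-negative:
  %q_unsigned = arith.divui %i, %stride : index
  return %q_unsigned : index
}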

Additionally, thread and block indices are treated as 64-bit values in MLIR’s generated code, which slows down index calculations and leads to unnecessary branch divergence. To address this, I tried passing index-bitwidth=32 to the gpu-lower-to-nvvm-pipeline pass, but that approach failed, as noted in the pass documentation.
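
Roughly, the invocation I tried looked like this (the file name is a placeholder):

mlir-opt --gpu-lower-to-nvvm-pipeline="index-bitwidth=32" test.mlir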

I would greatly appreciate any assistance in resolving this issue.

The index type defaults to 64-bit to match the pointer width of the target’s memory addressing. Even if you set the index bitwidth to 32, LLVM will promote everything back to 64-bit, as far as I have observed.
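
For example, a minimal sketch with the default index bitwidth of 64:

// Host-side constant of type `index`:
%c1 = arith.constant 1 : index
// After conversion to the LLVM dialect with the default settings,
// `index` is lowered to i64:
%1 = llvm.mlir.constant(1 : index) : i64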

But as you’ve noticed with the CUDA compiler, there are cases where using 32-bit indexing is fine. This is especially true for operations involving shared and constant memory, since these have a 32-bit address space. We can opt into this by setting the nvptx-short-ptr flag and seeing how it affects performance.

You can set it with
mlir-opt -nvptx-short-ptr -gpu-lower-to-nvvm-pipeline yourcode.mlir

Why is arith.floordivsi introduced in the first place if you only have positive integers?
I would look into where it is coming from and how to get arith.divui instead.

Thank you. I’m still seeing the issue: the conversions to 64-bit introduce branches into the code that slow the program down.

Correct, but arith.floordivsi is generated by default in many places for address calculation, such as in affine maps for memory access and memory layout in MLIR. I stumbled upon this performance issue by accident.

This sort of signed-by-default division is a big part of why I added -arith-unsigned-when-equivalent and upstreamed it; it might be worth a look.

… come to think of it, that entire analysis probably needs to be updated for nsw and nuw being added to arith (if that’s happened)

… and there are two integer range analyses running around

But all that’s minor details, the main point is there’s a rewrite that hopefully does what you want.
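
For reference, a minimal sketch of the kind of rewrite the pass performs (the values are made up; the pass relies on integer range analysis to prove both operands non-negative):

// Before: signed division, even though both operands are provably >= 0.
%c16 = arith.constant 16 : i32
%c4 = arith.constant 4 : i32
%q = arith.divsi %c16, %c4 : i32

// After --arith-unsigned-when-equivalent: the signed op is replaced by its
// unsigned counterpart, which avoids the sign-handling expansion.
%q = arith.divui %c16, %c4 : i32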


@grypp
Hi. I was trying to get the 32-bit example running. I was able to compile the PTX, but now only the grid and block sizes are still passed as i64, which breaks the compilation. This is the command I was running:

mlir-opt --arith-expand  --lower-affine -convert-scf-to-cf --convert-cf-to-llvm=index-bitwidth=32 --convert-func-to-llvm="index-bitwidth=32" --gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 index-bitwidth=32 cubin-triple=nvptx-nvidia-cuda host-bare-ptr-calling-convention=1" test.mlir 

This is the error and the IR it fails on:

test.mlir:7:13: error: failed to legalize operation 'builtin.unrealized_conversion_cast' that was explicitly marked illegal
    %c256 = arith.constant 256 : index
            ^
test.mlir:7:13: note: see current operation: %8 = "builtin.unrealized_conversion_cast"(%7) : (i32) -> index
module attributes {gpu.container_module} {
  llvm.func @malloc(i64) -> !llvm.ptr
  llvm.func @main() {
    %0 = llvm.mlir.constant(0 : i8) : i8
    %1 = llvm.mlir.constant(1 : index) : i32
    %2 = llvm.mlir.constant(16 : index) : i32
    %3 = llvm.mlir.constant(4 : index) : i32
    %4 = llvm.mlir.constant(128 : index) : i32
    %5 = llvm.mlir.constant(32 : index) : i32
    %6 = llvm.mlir.constant(8 : index) : i32
    %7 = llvm.mlir.constant(256 : index) : i32
    %8 = builtin.unrealized_conversion_cast %7 : i32 to index
    %9 = builtin.unrealized_conversion_cast %8 : index to i64
    %10 = builtin.unrealized_conversion_cast %2 : i32 to index
    %11 = builtin.unrealized_conversion_cast %10 : index to i64
    %12 = builtin.unrealized_conversion_cast %1 : i32 to index
    %13 = builtin.unrealized_conversion_cast %12 : index to i64
    %14 = llvm.call @mgpuStreamCreate() : () -> !llvm.ptr
    %15 = llvm.mlir.zero : !llvm.ptr
    %16 = llvm.getelementptr %15[16384] : (!llvm.ptr) -> !llvm.ptr, f32
    %17 = llvm.ptrtoint %16 : !llvm.ptr to i64
    %18 = llvm.call @malloc(%17) : (i64) -> !llvm.ptr
    %19 = llvm.call @mgpuMemAlloc(%17, %14, %0) : (i64, !llvm.ptr, i8) -> !llvm.ptr
    llvm.call @mgpuMemcpy(%19, %18, %17, %14) : (!llvm.ptr, !llvm.ptr, i64, !llvm.ptr) -> ()
    %20 = llvm.call @mgpuEventCreate() : () -> !llvm.ptr
    llvm.call @mgpuEventRecord(%20, %14) : (!llvm.ptr, !llvm.ptr) -> ()
    %21 = llvm.call @mgpuMemAlloc(%17, %14, %0) : (i64, !llvm.ptr, i8) -> !llvm.ptr
    %22 = llvm.call @mgpuEventCreate() : () -> !llvm.ptr
    llvm.call @mgpuEventRecord(%22, %14) : (!llvm.ptr, !llvm.ptr) -> ()
    %23 = llvm.call @mgpuStreamCreate() : () -> !llvm.ptr
    llvm.call @mgpuStreamWaitEvent(%23, %22) : (!llvm.ptr, !llvm.ptr) -> ()
    llvm.call @mgpuStreamWaitEvent(%23, %20) : (!llvm.ptr, !llvm.ptr) -> ()
    llvm.call @mgpuEventDestroy(%22) : (!llvm.ptr) -> ()
    llvm.call @mgpuEventDestroy(%20) : (!llvm.ptr) -> ()
    gpu.launch_func <%23 : !llvm.ptr> @main_kernel::@main_kernel blocks in (%11, %13, %13) threads in (%9, %13, %13) : i64 args(%3 : i32, %5 : i32, %4 : i32, %6 : i32, %19 : !llvm.ptr, %21 : !llvm.ptr)
    llvm.return
  }
  gpu.binary @main_kernel  [#gpu.object<#nvvm.target<triple = "nvptx-nvidia-cuda", chip = "sm_80">, 

Thank you for your help.

As specified, the RHS of floordiv, ceildiv, and mod in affine maps is always positive (see the 'affine' Dialect documentation in MLIR).
So an unsigned floor division (arith.divui) can be used even when the RHS is unknown, as long as the LHS is known to be non-negative. In your case, and in most cases, the RHS might even be a known constant. An integer range analysis should help with using unsigned divisions in most places.
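
For instance, with a hypothetical map like the one below, the floordiv is materialized as a signed division even though d0 is a non-negative index:

// A typical affine map used for address calculation:
#map = affine_map<(d0) -> (d0 floordiv 16)>
// --lower-affine materializes the floordiv as arith.floordivsi,
// which --arith-expand then expands with sign checks:
%q = arith.floordivsi %d0, %c16 : index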


The runtime ABIs actually expect 64-bit grid/block sizes on the host, and this shouldn’t have any negative effect on performance.

Let’s give it a shot without index-bitwidth=32 and use -nvptx-short-ptr instead; let’s see if that makes any difference!
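
Something like this, reusing your pipeline options but dropping index-bitwidth=32 (a sketch; adjust the options to your setup):

mlir-opt --arith-expand --lower-affine -convert-scf-to-cf -nvptx-short-ptr --gpu-lower-to-nvvm-pipeline="cubin-chip=sm_80 cubin-triple=nvptx-nvidia-cuda host-bare-ptr-calling-convention=1" test.mlir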

Feel free to share your input IR; I can also give it a try.
