Error when lowering the GPU dialect to LLVM IR

Here is my MLIR file, named gpu-dialect.mlir (at the GPU dialect level):

module attributes {gpu.container_module, llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu"} {
  func @main_graph(%arg0: memref<3x2xf32>, %arg1: memref<3x2xf32>) -> memref<3x2xf32> attributes {input_names = ["X1", "X2"], output_names = ["Y"]} {
    %0 = memref.alloc() {alignment = 16 : i64} : memref<3x2xf32>
    %c0 = arith.constant 0 : index
    %c3 = arith.constant 3 : index
    %1 = arith.subi %c3, %c0 : index
    %c1 = arith.constant 1 : index
    %c0_0 = arith.constant 0 : index
    %c2 = arith.constant 2 : index
    %2 = arith.subi %c2, %c0_0 : index
    %c1_1 = arith.constant 1 : index
    %c1_2 = arith.constant 1 : index
    gpu.launch_func  @main_graph_kernel::@main_graph_kernel blocks in (%1, %c1_2, %c1_2) threads in (%2, %c1_2, %c1_2) args(%c0 : index, %c0_0 : index, %arg0 : memref<3x2xf32>, %arg1 : memref<3x2xf32>, %0 : memref<3x2xf32>)
    return %0 : memref<3x2xf32>
  }
  gpu.module @main_graph_kernel {
    gpu.func @main_graph_kernel(%arg0: index, %arg1: index, %arg2: memref<3x2xf32>, %arg3: memref<3x2xf32>, %arg4: memref<3x2xf32>) kernel {
      %0 = gpu.block_id  x
      %1 = gpu.block_id  y
      %2 = gpu.block_id  z
      %3 = gpu.thread_id  x
      %4 = gpu.thread_id  y
      %5 = gpu.thread_id  z
      %6 = gpu.grid_dim  x
      %7 = gpu.grid_dim  y
      %8 = gpu.grid_dim  z
      %9 = gpu.block_dim  x
      %10 = gpu.block_dim  y
      %11 = gpu.block_dim  z
      cf.br ^bb1
    ^bb1:  // pred: ^bb0
      %12 = arith.addi %arg0, %0 : index
      %13 = arith.addi %arg1, %3 : index
      %14 = memref.load %arg2[%12, %13] : memref<3x2xf32>
      %15 = memref.load %arg3[%12, %13] : memref<3x2xf32>
      %16 = arith.addf %14, %15 : f32
      memref.store %16, %arg4[%12, %13] : memref<3x2xf32>
      gpu.return
    }
  }
  "krnl.entry_point"() {func = @main_graph, numInputs = 2 : i32, numOutputs = 1 : i32, signature = "[    { \22type\22 : \22f32\22 , \22dims\22 : [3 , 2] , \22name\22 : \22X1\22 }\0A ,    { \22type\22 : \22f32\22 , \22dims\22 : [3 , 2] , \22name\22 : \22X2\22 }\0A\0A]\00@[   { \22type\22 : \22f32\22 , \22dims\22 : [3 , 2] , \22name\22 : \22Y\22 }\0A\0A]\00"} : () -> ()
}

I want to lower the GPU dialect to LLVM IR, and the pipeline I added is below:

void addAffineToGPUPasses(mlir::PassManager &pm) {
  pm.addNestedPass<FuncOp>(mlir::createAffineForToGPUPass()); 
  pm.addPass(mlir::createGpuKernelOutliningPass());
  
  // input: gpu-dialect.mlir
  pm.addPass(mlir::createGpuToLLVMConversionPass());
}

The following errors occurred:

error: 'gpu.module' op missing gpu.binary attribute
error: failed to legalize operation 'gpu.launch_func' that was explicitly marked illegal

I don’t have a clue how to solve this. Can anyone help me? (I’m new to MLIR.)

Targeting a GPU requires separate compilation of the host and device (kernel) modules. First, you need to convert the gpu.module into a binary blob that is understood by the GPU. Something like --convert-gpu-to-nvvm --gpu-to-cubin should work for CUDA, assuming a working and available toolkit installation. Only after that will you be able to convert the host code to LLVM with --gpu-to-llvm.
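
For concreteness, here is a minimal sketch of that ordering as a C++ pipeline, in the style of the addAffineToGPUPasses function above. The createGpuSerializeToCubinPass factory and its (triple, chip, features) arguments are an assumption about your MLIR version (check mlir/Dialect/GPU/Passes.h), and the cubin pass is typically only functional when MLIR was built with CUDA support.

// Sketch only: header paths and the cubin factory vary between MLIR versions.
#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
#include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
#include "mlir/Dialect/GPU/GPUDialect.h"
#include "mlir/Dialect/GPU/Passes.h"
#include "mlir/Pass/PassManager.h"

void addGpuToLLVMPasses(mlir::PassManager &pm) {
  // Outline gpu.launch regions into gpu.module / gpu.func.
  pm.addPass(mlir::createGpuKernelOutliningPass());

  // Device side, nested on gpu.module: lower the kernels to NVVM, then
  // serialize them to a cubin blob that is attached to the gpu.module as an
  // attribute for the host-side lowering to pick up.
  pm.addNestedPass<mlir::gpu::GPUModuleOp>(
      mlir::createLowerGpuOpsToNVVMOpsPass());
  pm.addNestedPass<mlir::gpu::GPUModuleOp>(
      // Assumed factory and arguments; adjust chip/features to your target.
      mlir::createGpuSerializeToCubinPass("nvptx64-nvidia-cuda", "sm_70",
                                          "+ptx60"));

  // Host side: rewrite gpu.launch_func into GPU runtime calls and lower the
  // remaining host code to the LLVM dialect.
  pm.addPass(mlir::createGpuToLLVMConversionPass());
}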

Thank you for replying and for your suggestion :grinning:
Update.
I read the code from mlir/test/lib/Dialect/GPU/TestConvertGPUKernelToCubin.cpp and copied it as a pass, shown below.

namespace mlir {

std::unique_ptr<Pass> createSerializeToCubin() {
  LLVMInitializeNVPTXTarget();
  LLVMInitializeNVPTXTargetInfo();
  LLVMInitializeNVPTXTargetMC();
  LLVMInitializeNVPTXAsmPrinter();
  return std::make_unique<TestSerializeToCubinPass>();
}
} // namespace mlir

And I added it to the pass manager:

void addAffineToGPUPasses(mlir::PassManager &pm) {
  pm.addNestedPass<FuncOp>(mlir::createAffineForToGPUPass()); 

  pm.addPass(mlir::createSerializeToCubin());

  pm.addPass(mlir::createGpuKernelOutliningPass());

  // pm.addPass(mlir::createGpuToLLVMConversionPass());
}

But the output MLIR doesn’t change at all; it’s still the same as the file I posted above.
Does --gpu-to-cubin play the same role as createSerializeToCubin()?

As its name indicates, this pass is used for testing. You need the actual pass; its constructor can be found by looking for "gpu-to-cubin" in the code base. I suppose it is llvm-project/SerializeToCubin.cpp at main · llvm/llvm-project · GitHub.

Your code also ignores the previous step, "--convert-gpu-to-nvvm", which is mandatory for the cubin conversion to work.

I used registerGpuSerializeToCubinPass(), provided by Passes.h.
It still doesn’t make any change to the MLIR.

void addAffineToGPUPasses(mlir::PassManager &pm) {
  pm.addNestedPass<FuncOp>(mlir::createAffineForToGPUPass()); 

  registerGpuSerializeToCubinPass();

  pm.addPass(mlir::createGpuKernelOutliningPass());

  pm.addPass(mlir::createGpuToLLVMConversionPass());
}

The gpu.module doesn’t get an attribute like gpu.module @kernel_module attributes { nvvm.cubin = "CUBIN", rocdl.hsaco = "HSACO" }, as shown in mlir/test/Conversion/GPUCommon/lower-launch-func-to-gpu-runtime-calls.mlir. The output is still:

module attributes {gpu.container_module, llvm.data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128", llvm.target_triple = "x86_64-unknown-linux-gnu"} {
  func @main_graph(%arg0: memref<3x2xf32>, %arg1: memref<3x2xf32>) -> memref<3x2xf32> attributes {input_names = ["X1", "X2"], output_names = ["Y"]} {
    %0 = memref.alloc() {alignment = 16 : i64} : memref<3x2xf32>
    %c0 = arith.constant 0 : index
    %c3 = arith.constant 3 : index
    %1 = arith.subi %c3, %c0 : index
    %c1 = arith.constant 1 : index
    %c0_0 = arith.constant 0 : index
    %c2 = arith.constant 2 : index
    %2 = arith.subi %c2, %c0_0 : index
    %c1_1 = arith.constant 1 : index
    %c1_2 = arith.constant 1 : index
    gpu.launch_func  @main_graph_kernel::@main_graph_kernel blocks in (%1, %c1_2, %c1_2) threads in (%2, %c1_2, %c1_2) args(%c0 : index, %c0_0 : index, %arg0 : memref<3x2xf32>, %arg1 : memref<3x2xf32>, %0 : memref<3x2xf32>)
    return %0 : memref<3x2xf32>
  }
  gpu.module @main_graph_kernel {
    gpu.func @main_graph_kernel(%arg0: index, %arg1: index, %arg2: memref<3x2xf32>, %arg3: memref<3x2xf32>, %arg4: memref<3x2xf32>) kernel {
      %0 = gpu.block_id  x
      %1 = gpu.block_id  y
      %2 = gpu.block_id  z
      %3 = gpu.thread_id  x
      %4 = gpu.thread_id  y
      %5 = gpu.thread_id  z
      %6 = gpu.grid_dim  x
      %7 = gpu.grid_dim  y
      %8 = gpu.grid_dim  z
      %9 = gpu.block_dim  x
      %10 = gpu.block_dim  y
      %11 = gpu.block_dim  z
      cf.br ^bb1
    ^bb1:  // pred: ^bb0
      %12 = arith.addi %arg0, %0 : index
      %13 = arith.addi %arg1, %3 : index
      %14 = memref.load %arg2[%12, %13] : memref<3x2xf32>
      %15 = memref.load %arg3[%12, %13] : memref<3x2xf32>
      %16 = arith.addf %14, %15 : f32
      memref.store %16, %arg4[%12, %13] : memref<3x2xf32>
      gpu.return
    }
  }
  "krnl.entry_point"() {func = @main_graph, numInputs = 2 : i32, numOutputs = 1 : i32, signature = "[    { \22type\22 : \22f32\22 , \22dims\22 : [3 , 2] , \22name\22 : \22X1\22 }\0A ,    { \22type\22 : \22f32\22 , \22dims\22 : [3 , 2] , \22name\22 : \22X2\22 }\0A\0A]\00@[   { \22type\22 : \22f32\22 , \22dims\22 : [3 , 2] , \22name\22 : \22Y\22 }\0A\0A]\00"} : () -> ()
}

Make sure you read the documentation thoroughly and get a sufficient understanding of the different components involved. That function registers the pass, i.e., makes it available for mlir-opt-like tools to construct from the command line (FAQ - MLIR). It does not add the pass to the pipeline. This should be obvious, since it does not interact with the pm variable in any way… You need to create a pass instance and add it to the pass pipeline, as sketched below.
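
To make the distinction concrete, here is a minimal sketch, assuming your MLIR version also exposes a createGpuSerializeToCubinPass factory next to the register function in mlir/Dialect/GPU/Passes.h (the factory name and argument list are an assumption to verify against your tree):

#include "mlir/Dialect/GPU/GPUDialect.h"
#include "mlir/Dialect/GPU/Passes.h"
#include "mlir/Pass/PassManager.h"

// Registration: done once at tool start-up. It only makes the pass spellable
// as "gpu-to-cubin" on the command line of mlir-opt-like tools; it runs
// nothing and never touches any PassManager.
void registerPassesForCommandLine() {
  mlir::registerGpuSerializeToCubinPass();
}

// Scheduling: this is what actually runs the pass, nested on gpu.module
// (assumed factory and arguments, as noted above).
void scheduleCubinSerialization(mlir::PassManager &pm) {
  pm.addNestedPass<mlir::gpu::GPUModuleOp>(
      mlir::createGpuSerializeToCubinPass("nvptx64-nvidia-cuda", "sm_70",
                                          "+ptx60"));
}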

The code also seems to ignore the lowering to NVVM that I mentioned twice, without which the cubin conversion is unlikely to succeed.

If you run the two passes as required, they will create precisely the nvvm.cubin attribute on the gpu.module.

In addition to what @ftynse says, if you are experimenting, I’d recommend using the pass-pipeline string specification to better understand what the various pieces are doing. Once you’ve converted to the GPU dialect, depending on the other operations you may have, something like this will work:

-pass-pipeline='gpu-kernel-outlining,canonicalize,gpu.module(strip-debuginfo,convert-gpu-to-nvvm{index-bitwidth=32},gpu-to-cubin{chip=sm_86 max-reg-per-thread=255 cu-jit-opt-level=4}),gpu-to-llvm,canonicalize'

(I’ve used some common options above as an example.)
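
If you prefer to stay inside an addAffineToGPUPasses-style C++ setup, the same textual pipeline can also be parsed into a PassManager. Here is a sketch, assuming the referenced passes have already been registered (e.g. via registerAllPasses() plus registerGpuSerializeToCubinPass()); the options shown above can be appended to the pass names in braces in the same way.

#include "mlir/Pass/PassManager.h"
#include "mlir/Pass/PassRegistry.h"
#include "mlir/Support/LogicalResult.h"

// Parse a textual pipeline into an existing PassManager. Parsing fails if any
// of the pass names have not been registered beforehand.
mlir::LogicalResult addGpuPipelineFromString(mlir::PassManager &pm) {
  constexpr const char *kPipeline =
      "gpu-kernel-outlining,canonicalize,"
      "gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin),"
      "gpu-to-llvm,canonicalize";
  return mlir::parsePassPipeline(kPipeline, pm);
}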

For reference, this work provides detail on the various pieces of a complete pipeline. The lower part of the pipeline that you are asking about is used vanilla from upstream in that work.
