[RFC] Extending MLIR GPU device codegen pipeline

The basis of the new mechanism is now in trunk. The idea is to migrate all GPU compilation to this mechanism and eventually deprecate and remove gpu-to-(cubin|hsaco). The ETA for complete removal is not yet determined; a notice will be added to Deprecations & Current Refactoring.

Documentation and a general overview can be found in the gpu Dialect docs or in D154153.

The main idea behind this mechanism is extensibility: compilation is handled by attribute interfaces, so any dialect can implement them. These interfaces are the GPU Target Attributes and the GPU Offloading LLVM Translation Attributes.

Target attributes handle serialization of GPU modules to a string representation, while offloading translation attributes handle the translation of the gpu.binary and gpu.launch_func ops to LLVM IR.

Together with the new gpu.binary op, these attributes can implement concepts like fat binaries and the CUDA or HIP kernel launching mechanisms.
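
As a sketch of how the pieces fit together, the hand-written IR below pairs a gpu.binary holding one object with a gpu.launch_func that the offloading translation attribute lowers to a runtime load-and-launch sequence. The kernel name @kernel, the symbol @kernels, and the "BLOB" placeholder object are made up for illustration:

gpu.binary @kernels [#gpu.object<#nvvm.target<chip = "sm_90">, "BLOB">]

func.func @main() {
  %c1 = arith.constant 1 : index
  // The offloading translation attribute on @kernels decides how this op is
  // lowered, e.g. to CUDA or HIP runtime calls that load the embedded object
  // and launch the (assumed) kernel @kernel from it.
  gpu.launch_func @kernels::@kernel
      blocks in (%c1, %c1, %c1)
      threads in (%c1, %c1, %c1)
  return
}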

The compilation attributes available in trunk are listed below; example syntax for each appears right after the list:

  1. #nvvm.target for compiling to cubin, PTX, or LLVM IR. Compiling to cubin requires a valid CUDAToolkit installation, as the mechanism invokes ptxas or links against nvptxcompiler. However, the mechanism is always present as long as the NVPTX backend was built; there are no hard CMake dependencies on the toolkit.
  2. #rocdl.target for compiling to hsaco, ISA, or LLVM IR. Compiling to hsaco requires a valid ROCm installation.
  3. #gpu.select_object for embedding a single object in the LLVM IR module and launching kernels the way the current GPU-to-LLVM mechanism does.
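
To make the attribute syntax concrete, here are a few hand-written examples; every option value below is illustrative rather than prescriptive:

#nvvm.target<O = 3, chip = "sm_90">              // NVIDIA target, -O3, sm_90.
#rocdl.target<O = 3, chip = "gfx90a">            // AMDGPU target, -O3, gfx90a.
#gpu.select_object<1>                            // Select the object at index 1.
#gpu.select_object<#nvvm.target<chip = "sm_70">> // Select the object matching this target.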

Currently, only compilation to cubin or hsaco generates valid executables. In a future patch, I’ll add runtime support for PTX, allowing execution and compilation without a CUDAToolkit.

Example:

gpu.module @mymodule [#nvvm.target<O = 3, chip = "sm_90">, #nvvm.target<O = 3, chip = "sm_70">] {
}
// After running `mlir-opt --gpu-module-to-binary`:
gpu.binary @mymodule [
    #gpu.object<#nvvm.target<O = 3, chip = "sm_90">, "sm_90 BINARY">,
    #gpu.object<#nvvm.target<O = 3, chip = "sm_70">, "sm_70 BINARY">
  ]
// By default gpu.binary embeds the first object. To select the second object:
gpu.binary @mymodule <#gpu.select_object<1>> [
    #gpu.object<#nvvm.target<O = 3, chip = "sm_90">, "sm_90 BINARY">,
    #gpu.object<#nvvm.target<O = 3, chip = "sm_70">, "sm_70 BINARY">
  ]
// Or:
gpu.binary @mymodule <#gpu.select_object<#nvvm.target<O = 3, chip = "sm_70">>> [
    #gpu.object<#nvvm.target<O = 3, chip = "sm_90">, "sm_90 BINARY">,
    #gpu.object<#nvvm.target<O = 3, chip = "sm_70">, "sm_70 BINARY">
  ]

Compilation workflow:

# nvvm-attach-target attaches an NVVM target to each gpu.module op;
# convert-gpu-to-nvvm converts the GPU dialect to NVVM;
# gpu-to-llvm converts the host-side GPU ops to LLVM;
# gpu-module-to-binary serializes the GPU modules to binaries.
mlir-opt example.mlir \
  --pass-pipeline="builtin.module( \
    nvvm-attach-target{chip=sm_90 O=3}, \
    gpu.module(convert-gpu-to-nvvm), \
    gpu-to-llvm, \
    gpu-module-to-binary \
  )" -o example-nvvm.mlir
# Translate the result to LLVM IR.
mlir-translate example-nvvm.mlir \
  --mlir-to-llvmir \
  -o example.ll
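
The AMDGPU path is analogous; a minimal sketch, assuming a ROCm installation and using gfx90a purely as an example chip:

mlir-opt example.mlir \
  --pass-pipeline="builtin.module( \
    rocdl-attach-target{chip=gfx90a}, \
    gpu.module(convert-gpu-to-rocdl), \
    gpu-to-llvm, \
    gpu-module-to-binary \
  )" -o example-rocdl.mlir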

If you have any lingering concerns, a bug report, or ideas for improving the mechanism, you can post them here or on Discord; my DMs are also open.

Shoutout to @mehdi_amini for all the feedback in the reviews, as well as @krzysz00 .
