[RFC] Extending MLIR GPU device codegen pipeline

Here’s a brief recap of the conversation for discussion.

End goal:

A single, robust pipeline for GPU code generation without the shortcomings of the current one, where the heavy lifting of device code compilation is performed mostly by LLVM infrastructure.

This pipeline will be built slowly across many patches and discussions, as it involves moving certain bits from Clang to LLVM, as well as creating some new components.

Concrete proposed changes:

  1. The introduction of target attributes on gpu.module. This attribute will hold device target information about the module, such as whether it targets NVVM or ROCDL, as well as the target triple, features, and architecture. This could eventually lead to the removal of --convert-gpu-to-(nvvm|rocdl) in favor of a single gpu-to-llvm pass. The format of such an attribute might look like:
```mlir
gpu.module @foo [nvvm.target<chip = "sm_70">] {
  ...
}
```
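Since the proposal mentions carrying the triple and features as well as the chip, a fuller form of the attribute might look like the following; the exact field names and syntax are illustrative, not final:

```mlir
// Hypothetical: triple and features fields alongside chip.
gpu.module @foo [nvvm.target<triple = "nvptx64-nvidia-cuda",
                             chip = "sm_70",
                             features = "+ptx70">] {
  ...
}
```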
  2. gpu.launch_op will no longer be lowered by gpu-to-llvm, but by a different pass. This allows more flexible handling of the op, as there are many ways to launch a kernel (cudaLaunchKernel, cudaLaunchCooperativeKernel, etc.) and thus no 1-to-1 mapping between this op and LLVM.
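To sketch why a single fixed lowering does not fit (the pass output below is hypothetical; the launch op syntax is abbreviated), the same launch op may need to target different runtime entry points depending on the launch kind:

```mlir
// Input: a kernel launch produced by earlier GPU passes.
gpu.launch_func @kernels::@foo
    blocks in (%gx, %gy, %gz) threads in (%bx, %by, %bz)
    args(%arg0 : f32)

// Possible lowerings (hypothetical pass output):
//   regular launch     -> llvm.call @cudaLaunchKernel(...)
//   cooperative launch -> llvm.call @cudaLaunchCooperativeKernel(...)
```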
  3. The introduction of --gpu-embed-kernel. This pass would have to be executed after gpu-to-llvm and would serialize the gpu.module to an LLVM module. Why a separate pass? To allow running passes over the full LLVM MLIR IR, i.e.:
```mlir
builtin.module {
  gpu.module ... {
    llvm.func @device_foo ...
  }
  llvm.func @host_foo ...
}
```
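To make the ordering concrete, a driver invocation might look like the following. This is illustrative only: --gpu-embed-kernel is the proposed pass and does not exist yet, and the intermediate passes are placeholders for whatever should run over the combined host/device LLVM IR:

```
mlir-opt input.mlir \
  --gpu-to-llvm \
  --canonicalize \
  --gpu-embed-kernel
```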
  4. Migrate the current serialization pipelines into this GPU codegen structure, while addressing some shortcomings of the current serialization passes, such as the lack of general device bitcode linking in trunk. This would allow downstream users to link against libdevice without having to patch the tree to obtain this functionality.
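As an illustration of what general device bitcode linking could enable, the target attribute might carry a list of bitcode libraries to link before serialization. The link field below is hypothetical, and the libdevice path is just a common install location:

```mlir
// Hypothetical: device bitcode libraries linked during serialization.
gpu.module @foo [nvvm.target<chip = "sm_70",
    link = ["/usr/local/cuda/nvvm/libdevice/libdevice.10.bc"]>] {
  ...
}
```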
  5. Once the work on the LLVM infrastructure side is ready, migrate all GPU MLIR code compilation into this pipeline.

No JIT or AOT functionality will be lost at any point; we’ll only gain features. Upon agreement, the first four items could be rolled out in the coming weeks.

Things outside this proposal that are also open for discussion:

  • Migrating from the CUDA driver API to the CUDA runtime API.