[RFC] Extending MLIR GPU device codegen pipeline


The current pipeline for generating GPU code has several drawbacks: it cannot link against bytecode libraries such as libdevice (or perform device linking in general), and it requires building MLIR on the target system to obtain executable code, to name a few.

This proposal would introduce an alternative pipeline for generating device code. Essentially, it would generate clang-compatible offload LLVM IR and leave the task of generating the final executable to clang, removing that burden from MLIR and leveraging clang’s capabilities for offload code generation.

Important notice: this will not introduce a clang build dependency in MLIR.

Edit: due to Discourse restrictions I had to remove several links and user mentions, so here is a GitHub gist with the original RFC.

Context & Motivation

The current approach for lowering the GPU dialect down can be summed up by:

mlir-opt --convert-gpu-to-nvvm \
         --convert-nvgpu-to-nvvm \
         --gpu-to-cubin \
         --gpu-to-llvm input.mlir

Device code generation happens inside either `--gpu-to-cubin` or `--gpu-to-hsaco`. These serialization passes follow the process:

(NVVM | ROCDL) + LLVM -> LLVM IR -> Device binary -> Embedding the device binary as constant into the host IR

This pipeline works in many cases; however, as soon as users start using things like math.sqrt they run into issues. Most of these issues stem from the fact that there is no linking step during the serialization pass (see SerializeToCubin.cpp), so the pipeline cannot scale under the current system.

Another issue with the current approach is that users are forced to build MLIR on every target system (or at least on a system with a full pipeline), as the serialization passes and the ExecutionEngine depend on an installation of either ROCm or CUDA. One important thing to note is that this requirement is not imposed on pure host code: the user can translate down to LLVM IR on one machine, copy the file to the target system, and generate the executable with clang there.

Why is this last point an issue? MLIR moves fast, so weekly recompiles are not uncommon; these weekly builds might be prohibitive for users with limited and precious compute time on target systems (e.g. DOE scientists) but otherwise unlimited time on other systems where they can build MLIR. Hence the ability to generate LLVM IR on one machine and compile it with clang on a different one would prove useful.
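The split this would enable can be sketched as follows. These commands are illustrative only, assuming the proposed pipeline exists; the exact pass names and flags would follow the final patches, and the emitted IR would carry the offload annotations discussed later.

```shell
# Build machine (no CUDA/ROCm installation required): lower to LLVM IR only.
mlir-opt --convert-gpu-to-nvvm --gpu-to-llvm input.mlir -o lowered.mlir
mlir-translate --mlir-to-llvmir lowered.mlir -o input.ll

# Target system: clang drives the vendor tools (ptxas, etc.)
# and produces the executable.
clang input.ll -O3 -lcudart -o app
```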

Clang Offload

Clang can handle offloading to multiple target architectures like NVPTX and AMDGPU with ease, calling vendor tools like ptxas automatically and even passing arguments to these tools, which makes the process extensible.

Clang currently has two offload drivers: the default one and --offload-new-driver. The new driver is not yet available on Windows/macOS; however, it has some important benefits:

  • Simpler to use.
  • It will eventually (probably) become the default, once some remaining issues on other platforms are fixed.
  • It’s more powerful in its capabilities and provides a more uniform compilation process.
  • Capable of performing device LTO.

For more information see Offloading Design.

Offloading example

// example.cu
__global__ void fill(int n, float a, float *x) {
  int i = threadIdx.x + blockIdx.x * blockDim.x;
  if (i < n)
    x[i] = a;
}

int main(void) {
  int n = 1 << 20;
  float *x;
  cudaMallocManaged(&x, n * sizeof(float));
  fill<<<(n + 255) / 256, 256>>>(n, 2.0f, x);
  cudaDeviceSynchronize();
  cudaFree(x);
  return 0;
}
New driver compilation process:

# Create host & device IR.
clang -fgpu-rdc --offload-new-driver --offload-device-only --offload-arch=sm_70 -o example_dev.bc -c -emit-llvm example.cu
clang -fgpu-rdc --offload-new-driver --offload-host-only -o example_host.bc -c -emit-llvm example.cu

# Compile everything together.
clang-offload-packager -o example_dev.bin --image=file=example_dev.bc,arch=sm_70,triple=nvptx64-nvidia-cuda,kind=cuda
clang -fgpu-rdc --offload-new-driver example_host.bc -Xclang -fembed-offload-object=example_dev.bin -O3 -lcudart -o example.exe

The last step can be replaced by:

# The output of this command will have host & device code, with the device code embedded as a constant, and some other necessary annotations.
clang -fgpu-rdc --offload-new-driver example_host.bc -Xclang -fembed-offload-object=example_dev.bin -O3 -c -emit-llvm -o example_full.bc
clang -fgpu-rdc --offload-new-driver example_full.bc -lcudart -o example.exe

The output from the first command (example_dev.bc) is plain LLVM IR device offload code.
The output from the second command (example_host.bc) is LLVM IR code with 2 additional annotations:

@.omp_offloading.entry_name = internal unnamed_addr constant [12 x i8] c"_Z4fillifPf\00"
@.omp_offloading.entry._Z4fillifPf = weak constant %struct.__tgt_offload_entry { ptr @_Z19__device_stub__fillifPf, ptr @.omp_offloading.entry_name, i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1

The output from embedding example_dev.bin into example_host.bc and producing example_full.bc just adds the following annotations:

@llvm.embedded.object = private constant [<num> x i8] c"...", section ".llvm.offloading", align 8, !exclude !0
@llvm.compiler.used = appending global [1 x ptr] [ptr @llvm.embedded.object], section "llvm.metadata"

The key step of transforming everything into an executable is performed under the hood by clang, which calls the tool clang-linker-wrapper; this tool compiles the device LLVM IR down to a fatbin and generates the necessary code to register the kernels.

Default driver compilation process:

It’s more rigid: unlike the new driver, it cannot be invoked from a combination of host IR and device IR files. It requires either a CUDA input file or an LLVM IR file with many more target-specific annotations, including kernel registration code.

Shout-out to @jhuber6 for all the help in deciphering the steps.

Possible routes

From easiest to implement to hardest:

  1. Keep most of the existing infrastructure, but instead of serializing everything in one pass, have a pass that emits device LLVM IR, let the user manually compile this IR down to cubin, and create a standalone tool for embedding that IR into the host IR.
    • Drawbacks: no device LTO, the extensibility issues remain, and there’s an extra tool to maintain.
  2. Create a new serialization pipeline, but instead of compiling down to binary, serialize to device bytecode, introduce clang-compatible offload annotations for the new driver, and let clang generate the executable.
    • Drawbacks: it would only work on Linux until the new clang offload driver becomes available on all platforms. No JIT. It requires a new runner library, as clang uses the CUDA runtime API instead of the driver API.
  3. Do 2, but with the old driver.
    • Drawbacks: it will be deprecated once clang makes the switch, and the amount of extra IR to generate is considerable.

I’m inclined to go with 2: it’s relatively simple to implement, and in the long run it’s just the better option, as it provides more extensibility for generating device code, offers more optimization opportunities, and is easier to maintain.

Note: On every step that says CUDA or related terms, these can be swapped for HIP.



I do like the idea of handing off the actual IR->GPU binary compilation to clang (route #2). We will need to make the CUDA SDK available if/when we actually need to generate GPU binaries, but the LLVM/MLIR build would not depend on it, only the GPU tests.

serialization passes and the ExecutionEngine depend on an installation of either ROCM or CUDA .

I would point out that the root of the problem here is MLIR’s design decision to rely on libcuda.so (which is part of the NVIDIA GPU driver) for compilation of PTX to cubin, which is what currently forces compilation on a machine that actually has a GPU installed.

On top of that, the libdevice is actually part of the CUDA SDK, which is orthogonal to the GPU driver. You can have machine with NVIDIA GPU and the driver, but no CUDA SDK or libdevice available and vice versa.

If we were to redo compilation to the GPU binary in a way that does not depend on the libcuda availability, we’d have much less of a problem.

In the end, in order to produce a GPU binary for NVIDIA GPUs we will need to depend on NVIDIA’s proprietary tools or libraries. E.g. we could change MLIR build to use PTX compiler and linker libraries shipped with recent CUDA versions, but that would make CUDA SDK a dependency for MLIR build.

Handing the whole IR->GPU compilation to clang solves all of that in a rather neat way, IMO. Let mlir-opt handle IR optimizations and let clang driver deal with logistics of wrangling the compilation/linking process that ultimately relies on third-party SDKs.


This seems fine to me: have a flag where “enable CUDA support” requires the SDK at build time?

Something isn’t clear to me here in the interplay between the clang driver and the part of the tools that actually process IR: LLVM IR isn’t a stable format, and I am concerned that the proposed flow creates a revlock between MLIR and the clang that needs to be available.

I did a quick search and I think no test currently depends on any SDK (ROCm or CUDA).

There wouldn’t be any build dependency on either ROCm or CUDA, as MLIR would stop at bytecode in all circumstances. The only requirement is enabling the AMDGPU and NVPTX targets when building LLVM.

That’s a possible issue. Right now the maintenance overhead on us is small: we just need to add four clang-related annotations to the IR. However, we would need to start keeping track of any changes to these annotations.

Some of these tools, clang-linker-wrapper and clang-offload-packager, could become LLVM standalone tools; their source code is not tied to clang.

I believe in the past we’ve tried to avoid it, but these days LLVM appears to be more relaxed about it as long as that dependency is optional. Enabling GPU-related functionality in MLIR depending on whether CUDA SDK is available/enabled at build time is fine.

I think I may still have around a rough patch converting SerializeToCubin.cpp to use CUDA SDK APIs.

LLVM IR isn’t a stable format

That is a good point. I was only thinking about interaction within the same build only. If the producer of IR and the clang that would process it are from different builds, that will indeed be potentially problematic.


LLVM indeed does not have direct hard dependency on CUDA.

However, the NVPTX tests that normally run llc have a conditional dependency on ptxas, which does come from the CUDA SDK and has to be manually passed to the LLVM build.
See ⚙ D121727 [NVPTX] Integrate ptxas to LIT tests

I think we have a handful of other conditional third-party dependencies in the tree. E.g Z3. llvm-project/CMakeLists.txt at main · llvm/llvm-project · GitHub

I didn’t get the point of this comment in my previous reply; I thought you were talking about maintaining the annotations. Yeah, that’s a potential issue.
That’s one of the reasons I’m saying extending: I would leave the current mechanism in place and make this one opt-in in CMake, as this proposal is also unable to deliver JIT.

Here I was referring to MLIR: MLIR itself wouldn’t depend on it, and there are no MLIR tests checking PTX; obviously clang would need these tools on the target system to do its magic.

I was answering the previous message, which referred to another approach we planned for a while where we’d use a C++ API to invoke Nvidia SDK during compilation, so that we can produce binaries on any system without a GPU driver available, and deploy on another system that does not have MLIR/LLVM, which was one of the problem you seem to target.
(edit: our messages crossed with your followup)

I think you missed a point here: the bytecode format itself is not stable. Your new MLIR would generate LLVM IR that potentially an older clang cannot load or have weird behavior if you try to do so. The requirements on LLVM IR (AFAIK) is that:

  1. It provides compatibility across releases of LLVM
  2. Backward compatibility only

That means we could have MLIR from LLVM release v17 producing IR and clang v19 consuming it, but not the other way around, and definitely not arbitrary non-release versions.

It’d be great if you can dig it up! I’m happy to help finishing it if you don’t have time, I remember we discussed this a while back and I was looking forward to this, it seemed like a nice improvement to me.

You’re right, and we need look no further than the whole opaque-pointers transition. I would still consider this approach an extension, not a replacement, as it would enable more things, but there are restrictions on its usage.

Route one would improve the existing infrastructure and doesn’t need clang, only LLVM and the SDK.

I would also be happy to help with this.

On the offload (tooling) side we have even fewer guarantees as of now. That the IR from v17 can be used with v19 is not something I would promise right now. That said, with the new offload driver (and all the plumbing that was changed) we might be able to provide backwards compatibility from now on. Though we might want to codify that properly.


So, in notes from my experience dealing with our requirements as a ROCm kernel generator:

  • If you go look at -serialize-to-hsaco, you’ll note a whole bunch of code for handling device libraries
  • LLVM IR is very much not a stable format and this will cause issues. I, in a related experiment, tried to swap our usage of the GPU serialization pipeline for AMD’s COMgr (which wraps up Clang, etc.), and, because the released LLVM/Clang/… were older than what MLIR was building IR with, I ended up with a text-mangling IR downgrader, which isn’t guaranteed to be possible in any case (see [DO NOT MERGE] Use COMgr to compile our GPU binaries by krzysz00 · Pull Request #1018 · ROCmSoftwarePlatform/rocMLIR · GitHub )
  • One advantage to the current approach is that you can do offloading by cloning your kernels once per target. This allows, for example, for being able to choose matrix multiply instructions on a per-target basis. Shifting all the offloading to clang makes that tricky.

On top of that, some things that ship our kernel generator really don’t like hardcoded paths or assumptions about what else lives on the system (e.g. we eventually end up in a PyTorch wheel). So, would it be possible to set up this new pipeline so that we don’t have to statically link in all of clang? (We already need to carry around large parts of LLVM and lld.)


But that handling is immutable; for example, you cannot pass in an external library with device bytecode.

This is an issue. But it can become a restriction for this pipeline. At the moment I’m not proposing a full replacement.

That wouldn’t be affected; this concerns just the very last step of serialization: whatever ends up being serialized is what gets serialized. Maybe I’ll have to do some name mangling, but it would be invisible to the user.

There aren’t going to be any hardcoded paths, and in no case would it create a clang build dependency; again, this is not a replacement yet, for now it would be opt-in.

Found them! The changes did actually land in the LLVM tree… for about 10 minutes. 🙂



Cool, I’ll take a deeper look later. Right now I’m implementing approach 2 (even if it doesn’t make it to trunk, I need it); I think I should have it working by tomorrow.

Having said that, is there a reason you call nvPTXCompiler instead of invoking ptxas as a sub-command, as clang would have done? I’d think that using temp files and sub-commands would make it easier to add more steps to the pipeline.

nvPTXCompiler is a build-time dependency. Once mlir-opt is built, it does not need anything else. If we rely on ptxas to do the compilation, it becomes a runtime dependency and we’ll need to make sure it’s available whenever we run the tool. It’s one more thing to worry about and it complicates things. E.g. if we want to distribute the tests, we now need to make sure ptxas is available on all remote worker machines.

Another potential problem is that it makes the tests non-hermetic. If the ptxas found at runtime is a different version than the one we’ve tested with, all bets are off.

Clang’s use of ptxas is forced by the fact that it has to be able to deal with multiple different CUDA versions. Compiling in a single version of the PTX compiler would not work with all the CUDA SDK versions users may want to use it with: for some CUDA environments it would lack recent enough features; for others it would produce binaries that are too new.


Here is the Phabricator diff for option 2, plus about 85% of a re-implementation of the current pipeline (more or less option 1):


I’m trying to understand some of the premises here.

As I posted at this link, it’s easy to link libdevice within SerializeToBlob. MLIR GPU libdevice linking support - #8 by bondhugula

This doesn’t appear correct. Why do you need to build MLIR on the target system to obtain executable code? You could generate code while specifying the right sm_<nn> version to serialize-to-cubin and perform AOT compilation (just translate and compile the generated LLVM dialect). The code can then run on another system without MLIR, on GPUs compatible with the specified GPU target (the CUDA driver and runtime API have to be compatible). What else might a user want to link to the device-side code besides libdevice? Linking things on the host side is possible anyway after translation to LLVM IR.

With the current setup, you are just pre-compiling and creating the GPU binary and encoding/encapsulating that device binary into a single MLIR file via an attribute with the surrounding host-side code and CUDA driver API – IMO, it’s a really novel design. (Just to be clear: I’m not the designer/author/contributor to this core approach) It doesn’t prevent AOT compilation in any way: you can perform the host code compilation on the generated LLVM, link to it, and create standalone binaries that run on the GPU.

The discussion and the proposed alternative on the MLIR side in the linked revision (including the comments there) appear to be based on several incorrect premises/assumptions:

  • that linking to libdevice in the current setup isn’t possible,
  • that optimizing LLVM during MLIR → LLVM → PTX isn’t possible (doing this is trivial; we do this in the MLIR ExecutionEngine for CPU-side compilation by creating an LLVM transformer at the specified optimization level; the same can be done prior to translation to PTX in SerializeToBlob)
  • that AOT for GPUs isn’t somehow possible with the current setup or that MLIR is needed on the target system to generate executables: both of these aren’t true – the current setup doesn’t prevent AOT compilation.

Whatever you do from clang for GPUs could be done from the serialize-to-cubin using the same LLVM APIs. There may be things that could move from clang to LLVM land to share more common infrastructure if that’s appropriate for those things to reside in LLVM. clang is a driver, and so is the thing using the MLIR infrastructure to perform JIT or AOT compilation for GPUs. One could argue that we could rely on clang and move the lower part of the compilation and GPU binary generation out of MLIR land and leave it to clang (just like it is for CPUs), but all the premises for such an alternative listed here and in the revision discussion appear to be either incorrect or not fully spelt out. It would also break MLIR JIT compilation for GPUs which currently works via ORC JIT – or you would have two paths: the existing one to do the JIT compilation (either in SerializeToBlob or in MLIRExecutionEngine) and the clang one for AOT. What exists is already an approach unified for both JIT and AOT in MLIR land.

Also, note that it’s currently possible to do AOT compilation from MLIR for both CPUs and GPUs without using clang! (E.g., you can take the generated LLVM IR and go through ... | opt ... | llc ... to generate CPU- or GPU-executable binaries.) clang is a C-family compiler, and it shouldn’t be needed to compile MLIR generated from various frontends into executables. MLIR for CPUs is also JITed without any dependency on clang, either via the test tool mlir-cpu-runner or via the Python bindings to MLIR’s execution engine. The latter also lets you dump binaries for AOT.

I see an abrupt jump to the conclusion here. What are the things you would like to link against (besides libdevice)? I assume you’d like the GPU module to link against libraries; libdevice linking is achievable in no more than 20 lines of code or so, as mentioned above. Also, see NVPTX codegen for llvm.sin (and friends) - #16 by bondhugula for some larger discussions. Other things can also be done and reused from LLVM land.

I’m not opposed to a new alternative path, but I’m trying to better understand the rationale for something major and the duplication/additional effort to maintain such an alternative path.

CC: @herhut @csigg @ftynse


I am wondering if you really went through all of the discussion in the review, because I thought we covered all of this already, and I actually insisted that we dig enough to align the pieces so that we have a single flow for JIT and AOT without any duplication: that is, if the LLVM offloading annotations are the way to go, they should move from clang to LLVM, so that we can reuse them from MLIR for both an AOT and a JIT flow.
The difference would just be that the AOT flow can decouple the code generation on the MLIR side, which would not depend on the availability of the CUDA environment, and then ship this artifact to another system that does not have MLIR but can take it from there seamlessly. These aspects are the novelty here in terms of infra for the AOT flow, as far as I can tell.

This comment ⚙ D149559 [mlir][gpu] Adds a gpu serialization pipeline for offloading GPUDialect Ops to clang compatible annotations. (which starts with “it’s time to potentially self sabotage this diff”, which is a testimony of how the discussion in there changed the direction from the original RFC here), explains where this is headed.

Let’s take it concretely: do you have specific concerns about lost functionality or regressions on the JIT path that need to be addressed for this work to move forward?
We don’t have patches yet, but my understanding of @fabianmc’s plan is that we shouldn’t lose anything there (I would have expressed concerns otherwise), and we’ll have a single flow shared by AOT and JIT.

Adding the ability to load extra bitcode libraries for our JIT infra (whether it is CPU or GPU) is a nice improvement: we should do it regardless of the work done here.


Please note that, in the context of sparsification, I also consolidated some of the passes required to get GPU code running into a single pipeline. The setup is simple but functional, and it provides a path for direct CUDA code generation as well as conversion into cuSPARSE calls (I linked the command lines of two end-to-end examples to illustrate both pipeline setups). Here too, however, I often found pinning down the exact lowering passes a bit brittle, and I would love to see how you further enhance the setup!


I’ll start by saying that at no point will we lose JIT, AOT, or any other kind of functionality; we should only gain features.

Also, this discussion might make more sense next week once I have the new patches for review, as any potential shortcomings can be discussed knowing the actual implementation.

Here’s the current plan (it’s subject to change until a final patch is implemented and presented for discussion):

  1. Move the serialization passes to translation. Why? From a conceptual point of view they make more sense there, as they perform translation. From an implementation point of view, it would let us reuse any existing LLVM IR offloading infrastructure more easily, as it requires having both the host and device LLVM IR representations available at the same time.
  2. Move several offloading bits from Clang to LLVM.
  3. Introduce this pipeline in the translation context, leveraging steps 1 & 2.

Features to be gained:

  1. The ability to link device bitcode libraries in general. Why do we want this? Interoperability: e.g. someone created a bitcode library using CUDA or HIP, and a user wants to call functions from that library. libdevice is a concrete example of such a library.
  2. Flexibility of the process. For example, adding the option to generate offload annotations allows using all of Clang’s features w.r.t. binary generation, linking, etc., including generating and linking relocatable device code (rdc).
  3. The inclusion of new features, such as fast math.
  4. Having an always-available GPU codegen pipeline, even in the absence of the CUDA driver, the CUDA toolkit, or a ROCm installation, as all of those dependencies only need to be satisfied on the target system with Clang.
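As an illustration of feature 1, linking a device bitcode library could happen with stock LLVM tooling before clang ever sees the IR. This is an illustrative command sketch: the libdevice path varies per CUDA installation, and `example_dev.bc` refers to the device IR produced earlier in this thread.

```shell
# Link libdevice (or any CUDA/HIP-produced bitcode library) into the device IR
# before handing it off to clang for binary generation.
llvm-link example_dev.bc "${CUDA_HOME}/nvvm/libdevice/libdevice.10.bc" \
  -o example_dev_linked.bc
```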