[RFC] Extending MLIR GPU device codegen pipeline

Coming late to this thread and trying to untangle it, because on the IREE side we are somehow doing these four things with the current infra, and we do not have a build dependency on the CUDA SDK tools in the compiler (though we do allow the SDK to be configured, and use it to extract a libdevice if one is not provided separately).

I’m moderately -1 on introducing a hard (CMake) dependency on the CUDA SDK for the general compiler infra: it is a fragile thing for a parent project to impose on child projects, most of the general IR manipulation, tests, etc. don’t require it (I can go into details – they are not pretty), and I disagree that bundling at that level improves the situation. Having the dependency for some slice of optional in-tree PoC “runtime”/“driver” code seems more palatable.

I could be wrong on this – I’m mainly trying to rationalize why I thought we already had the goals satisfied.

(I’ve cross-posted this to our Discord so the folks who know this better can track it.)

Hey, could you post a link to how IREE does this? I’m very interested in learning more. Honestly, I only proposed this because I found the current pipeline not flexible enough for my project (without having to introduce extra patches), and if there’s a better or existing solution I’d be happy to go along with it.

None of the proposed pipelines introduces a CUDA toolkit dependency at build time. In fact, the pipeline based on LLVM offloading would work and be available even if no driver or toolkit is found (in this particular case, it needs clang to get to an executable).

Having said that, at some point we did discuss moving from the driver to the toolkit for generating cubin objects. If you could go into details that would be great!

Let me see what I can find. I’m honestly not the expert here; I just noticed the incongruence and am trying to pull in the folks who know.

As the last line of build-system defense on some projects I’m involved in, I merely know the pain I have endured and supported downstreams through. The CMake CUDA SDK finder is unreliable in the extreme: I’ve found that it frequently “latches” incorrectly in platform-specific ways, it is not robust to probing strategies, and we never really got it to work right with the CI-friendly mechanisms that NVIDIA provides for fetching parts of the SDK (i.e. NVIDIA’s build-system-archive-import-examples repo on GitHub, with examples for importing precompiled binary tarball and zip archives into various build and packaging systems). It’s just software and can obviously be fixed… but I take notice when I spend an inordinate amount of time patching and providing field support.

Here’s what we eventually did: iree/build_tools/third_party/cuda in the openxla/iree repo on GitHub.

Basically, a sufficiently advanced user/downstream can bring their own, but by default with no other knobs, we download the needed SDK components and:

  • Encapsulate libdevice into a build rule for the compiler side and don’t take/require a further dep on the SDK.
  • Encapsulate the SDK headers for the runtime side (if enabled) and use dynamic loading trampolines.
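To give the flavor of the trampoline approach, here’s a minimal sketch (hypothetical code, much simpler than what IREE actually generates): the driver is resolved at run time with dlopen/dlsym, so neither the build nor the resulting binary links against the SDK.

cat > cuda_stub.c <<'EOF'
#include <dlfcn.h>
#include <stdio.h>

/* Prototype declared by hand, so no SDK header is needed at build time. */
typedef int (*cuInit_t)(unsigned int);

int main(void) {
  void *lib = dlopen("libcuda.so.1", RTLD_LAZY); /* driver, not toolkit */
  if (!lib) { fprintf(stderr, "CUDA driver not found\n"); return 1; }
  cuInit_t cu_init = (cuInit_t)dlsym(lib, "cuInit");
  return cu_init ? cu_init(0) : 1;
}
EOF
cc cuda_stub.c -ldl -o cuda_stub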

Yeah, I know what you mean now, and you’re right. At some point I tried locating libdevice at build time, and CMake’s FindCUDAToolkit was broken because it depended on a file that no longer ships in the toolkit…

The offload pipeline and the updated pipeline in this proposal let you set a path pointing to any bitcode library for linking, including libdevice, removing some of those difficulties. In the case of the offload pipeline, it also lets you ship your code and resolve those dependencies later, on the target system, with clang.

Thank you! Also, it would be great to speak with them and hear their perspective; it would be great if my problem has already been solved.


I want to point out that recent versions of CUDA do allow redistribution of libdevice.bc.

I wonder if it’s permissive enough for us to just carry it as a binary blob within MLIR/LLVM/clang?

The EULA also says: “If the distribution terms in this Agreement are not suitable for your organization, or for any questions regarding this Agreement, please contact NVIDIA at nvidia-compute-license-questions@nvidia.com.”


It would be great if it could be shipped with MLIR; it might be worth asking. I wonder what the deal is on the AMD side.

In any case, I do think the general ability to link device code is needed, and moving the pipelines to translation makes sense.

Here’s one use case of mine: currently, shuffles are not supported on AMDGPU targets – some intrinsics needed to support them are missing in ROCDL (they’re on my todo list). However, I needed to target the MI250X for my project, and since I was using the proposed pipeline, I simply created the shuffle function with clang, linked against it, and it worked perfectly.
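Roughly, the workaround looked like this (a sketch with made-up names; gfx90a is the MI250X architecture, and ds_bpermute takes a byte address, hence the scaling by 4):

cat > shuffle.c <<'EOF'
/* Hand-written wave-wide shuffle built on an AMDGPU builtin. */
int gpu_shuffle_idx(int value, int lane) {
  return __builtin_amdgcn_ds_bpermute(lane * 4, value);
}
EOF
clang -c -emit-llvm --target=amdgcn-amd-amdhsa -mcpu=gfx90a -O3 \
      shuffle.c -o shuffle.bc
# shuffle.bc can then be handed to the pipeline like any other bitcode library.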

The previous use case is trivial and something that should be solved within MLIR; however, it’s easy to see how general linking allows for extensibility.

The only change that was considered in this respect is the PTX->cubin path, where we already require the CMake dependency today: as far as I can tell, it would just have replaced a call to ptxas with a library call. This is guarded by an opt-in CMake flag, and only the tests that need to JIT on the GPU have this dependency. Am I missing something about your concern?

No – just retelling battle scars and trying to avoid new ones. Sounds good. It wasn’t clear to me when the thread hit this simpler point.

This is fantastic. We’ve been asking about this for a long time and it relieves a lot of pressure. I had given up and stopped watching. Thanks for the catch!

Very interesting proposal… I have read this a couple of times (because some of these concepts are new to me, even though I do understand libdevice and fat binaries). I probably have some more questions on how this all fits in.

  1. If I understand this proposal correctly, it will essentially link in libdevice (from the NVIDIA SDK) as a binary blob of device-side NVVM IR. That might be enough for functionality, but one real advantage (especially on NVIDIA hardware) is that libdevice provides implementations of several __device__-side primitives. If available during the link stage, these could be inlined, which is almost required for any meaningful use of libdevice: function calls on NVIDIA hardware are really expensive. I know ptxas has some inlining capabilities, but linking at the LLVM IR level and using LLVM to inline and optimize is going to be far superior (and I would claim almost required for any realistic use case). So does this proposal just provide linking to supply definitions, or would it also allow for aggressive inlining and optimization past that?

  2. Instead of using clang, it would be really useful to move the utilities to LLVM… There is no need to force a dependency on building a C++ frontend onto every project that needs to use the offloading mechanism described here.

Some responses to posts in this thread…

I second this point… MLIR should not be using the driver API to generate cubin; instead, the compiler should stop at PTX generation. The runtime can have a dependence on libcuda.so and JIT the PTX while launching the code. So mlir-cuda-runner (I don’t know if it actually exists; I don’t keep track of that part of the code base) can have a dependence on libcuda.so, i.e. on the CUDA driver being installed on the machine. (There are very few cases where you should be adding a dependence on the CUDA runtime using stock MLIR.)
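To make the split concrete, here is a hedged sketch of run-time PTX JITing through the driver API (it assumes the driver API header is reachable; the PTX string is an empty but valid module, just for illustration):

cat > jit.c <<'EOF'
#include <cuda.h>   /* CUDA driver API; links against libcuda.so only */
#include <stdio.h>

/* Stand-in for the PTX the compiler would emit. */
static const char *kernel_ptx =
    ".version 7.0\n.target sm_70\n.address_size 64\n";

int main(void) {
  CUdevice dev; CUcontext ctx; CUmodule mod;
  cuInit(0);
  cuDeviceGet(&dev, 0);
  cuCtxCreate(&ctx, 0, dev);
  /* The driver JIT-compiles the PTX for whatever GPU is present. */
  if (cuModuleLoadData(&mod, kernel_ptx) != CUDA_SUCCESS) {
    fprintf(stderr, "PTX JIT failed\n");
    return 1;
  }
  puts("PTX JITed by the driver");
  return 0;
}
EOF
cc jit.c -lcuda -o jit   # add -I<path to cuda.h> if it is not on the default path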

This is indeed unfortunate… the only real solution I can think of is to have the clang/LLVM ecosystem build a native implementation of libdevice. These are just math functions that are linked in at the .bc stage… the clang/LLVM ecosystem needs the ability to provide these implementations for linking at the bitcode stage. This is actually needed for all backends. In IREE, one of the big issues we have using LLVM is that on the x86 architecture, instruction selection ends up lowering to libc calls, forcing everyone compiling to x86 to also have a libc dependence. Apart from being a dependence issue, this is also a performance issue: all vectorization etc. is lost due to the external function call… In IREE we solved it by essentially generating our own libdevice for CPU architectures (using implementations from MUSL) and linking that in (similar to how NVIDIA’s libdevice works). This is a bit of a long-term thing, but it is probably needed for downstream users like IREE.

Trying to understand these points. Linking with libdevice is indeed possible, but libdevice is part of the CUDA SDK (as pointed out previously; the CUDA driver is not the SDK). So in some cases the MLIR → LLVM → PTX path will end up with undefined symbols for math functions whose definitions are provided by libdevice in the CUDA SDK… so I think the issue is this dependence. In other words, AOT compilation adds a dependence on the CUDA SDK, so you need it where you compile and not just where you run the code. Am I missing something here?

(Apologies ahead of time if any of these questions are rewinding the stack a bit, or if I am just missing something obvious here in my questions above).

Although the following won’t apply 100% to the patches I’m submitting next week, I think it’s easier to understand this proposal through a brief example:

The compilation process to obtain an executable would be:

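# Lower the GPU dialect, serialize the device code, and attach
# clang-compatible offload annotations: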
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
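# Translate the resulting module (LLVM dialect plus offload annotations)
# to LLVM IR: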
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
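# Let clang's offload driver generate the device binaries and link the
# host executable: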
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

For a full breakdown of the compilation steps, see D149559: [mlir][gpu] Adds a gpu serialization pipeline for offloading GPUDialect Ops to clang compatible annotations.

The above pipeline has the benefit of being able to leverage clang’s device codegen infrastructure: the generation of PTX, cubin, etc. is all handled by clang. This is why this new pipeline can be built as long as the NVPTX or AMDGPU targets are enabled; only clang needs to know the specifics of the toolkits. This also allows generating the IR on one machine and compiling the executable on a different one, among many other things.

This pipeline links against libdevice (in reality, against any bitcode library), importing the symbols, and is thus able to inline all the calls.
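In standalone LLVM tools, that step amounts to something like this (a sketch; the pass does the equivalent through the LLVM API, and my_kernel stands in for whatever entry points must survive internalization):

llvm-link kernel.ll libdevice.10.bc -o linked.bc
# Internalize the imported libdevice definitions so the inliner can fold
# them into the kernels, then drop whatever is left over.
opt -passes='internalize,inline,globaldce' \
    -internalize-public-api-list=my_kernel linked.bc -o linked.opt.bc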

For obtaining the full executable, clang is still the desired tool, as it handles linking, RDC, and many other things that might not fit in LLVM. However, we do plan to move some of those utilities to LLVM; that’s one of the reasons the above diff was abandoned.

This is interesting. I personally don’t mind incorporating this approach; however, I don’t know whether users would complain about the added runtime overhead.

No, you just need those on the machine where you compile. However, some device-specific utilities are not always available on the machine with MLIR; that’s the difference I’m trying to address. For example, the machine I use for MLIR doesn’t have ROCm, AFAIR, but with this pipeline I don’t need it: ROCm is only needed by clang on the target machine. Hence I can generate LLVM IR from MLIR on one machine with no knowledge of ROCm, ship it to a different machine without MLIR, and let clang figure things out.
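A minimal sketch of that split, reusing the example from above:

# Machine A: has MLIR, but no ROCm/CUDA toolkit; stop at LLVM IR.
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
# Machine B: has clang and the vendor toolkit, but no MLIR.
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp -o test.exe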

Also, I just want to acknowledge that this pipeline is possible thanks to all the offloading work done by the Clang and OpenMP teams.

I’ve asked NVIDIA whether LLVM/MLIR can incorporate or ship libdevice bitcode and whether they will consider open-sourcing libdevice. Will update the thread if I hear back from them.


I don’t see why clang is desirable (or needed). If you are starting from CUDA C++, I understand, but that’s not true for MLIR… so needing clang from MLIR is like adding a dependence from LLVM → Clang, and I think that is not desirable. The right home might not be LLVM, but maybe other tools; I see clang as a C/C++ front end, and nothing related to linking, RDC, etc.

I am talking about in-tree binaries like mlir-cuda-runner. I don’t think initialization cost is a concern for these. Users who need to amortize the initialization cost can build their own binaries out of tree and manage dependencies… All the mlir-* binaries are really for testing, AFAICS.

OK, this pipeline doesn’t pretend to replace the existing one (though the patches I’m introducing next week will update it); it enables additional capabilities closer to production, and it doesn’t introduce a build dependency on MLIR. Using it or not is left to the user, with the requirement that the target machine needs clang.

I agree with you: I could see a lot of this infrastructure living in its own LLVM offloading project or in LLVM itself (we actually plan to move some of it to LLVM). However, clang as a front end handles many more things, and I don’t know whether it’s even worth suggesting taking them out of clang – I honestly don’t know.

For example, with clang you can compile IR and C++, generate device binaries, and link against libraries (both host and device), all in one step. As a front end, clang is much more than a C/C++ frontend, especially with respect to offloading.

I feel there is some confusion here between “clang the driver” and “clang the frontend”.
As a driver, clang is convenient for setting up the right invocations of the right tools, etc. This is entirely disconnected from the C++ frontend (and in fact the flang driver depends on the clang driver).

There is a very simple anchor in the design to ensure “no dependency on clang”: if the MLIR JIT can emit the same LLVM bitcode as the AOT path and run it, then LLVM provides everything we need, and clang is just convenient command-line orchestration for users: it’s not load-bearing.
@MaheshRavishankar let me know if I misunderstand your concerns here?

Let me make an analogy to where I think we should be able to get with this work: clang (the C++ frontend) emits LLVM bitcode, and then runs LLVM and its backend to get an object file.
You can also invoke clang with LLVM bitcode directly; it will skip the C++ frontend and run LLVM.
However, LLVM has everything needed to generate code for that same bitcode without using clang to set up LLVM: you could directly use opt/llc to do the same thing, or invoke the LLVM APIs yourself.

Here it is the same thing: the MLIR infra should provide everything needed to get to LLVM bitcode, and then it should be all LLVM! clang should not be used for anything other than as a convenient driver that sets up LLVM to process the LLVM bitcode generated by MLIR.

So for the sake of the discussion, we should really talk about “how do I do the AOT from MLIR without clang?” Can we take clang out of the picture and write down the LLVM tool invocations that will process the MLIR output down to the final object file?
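For NVIDIA, one plausible clang-free sequence would look like this (a sketch; what it leaves out – embedding the cubin and generating the host-side registration code – is exactly the part clang’s offload driver automates today):

llvm-link kernel.ll libdevice.10.bc -o linked.bc   # resolve device math calls
opt -O3 linked.bc -o linked.opt.bc                 # inline and optimize
llc -mtriple=nvptx64-nvidia-cuda -mcpu=sm_70 linked.opt.bc -o kernel.ptx
ptxas -arch=sm_70 kernel.ptx -o kernel.cubin       # ptxas ships with the SDK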

I don’t quite get that? To me it is a matter of toolchain setup: if you’re on a machine with just a CUDA driver and no toolkit, there is no point in even starting to talk about libdevice.
If you’re providing a compiler built with MLIR that emits calls to libdevice, and you want to deploy this compiler as a toolchain in such an environment, then either you take the SDK as a dependency of your toolchain deployment (but you’re deploying a toolchain, so why not?), or you bundle libdevice with your deployment.


+1 - that would clarify things if we had it. I understand what you are saying, but seeing it outside of a testing tool flow that used clang would help me a lot.


+1 from me as well. I understand clang is just used as a driver for now, but with JIT mode we would still have to add a dependency on clang headers… ideally all of that should be moved so that JIT mode has no dependence on clang itself (‘rdc’ etc. are not an issue; that is just generating position-independent code, similar to building object files with -fPIC for a shared library).

This is probably a tangent to the question at hand, but I’m not sure I follow this. The CUDA driver provides compilation from PTX → CUDA device binaries. So you need the CUDA driver installed to run your application, but it is not required for generating the compiled code. You can generate PTX in your compiler without a dependence on the CUDA driver (or SDK), as long as the deployment has the CUDA driver installed.
This breaks down, though, since the NVPTX backend has an implicit dependence on libdevice, which is available only with the CUDA SDK… so I need the CUDA SDK during the compilation phase. It comes down to different deployment scenarios for different use cases:

Use case 1: Maintaining minimum dependencies on external packages/libraries (which is IREE’s deployment scenario). In this scenario:

  • The LLVM-based compiler has no dependence on external/proprietary libraries or SDKs.
  • The runtime has a dependence on the CUDA driver.

Use case 2: Don’t care about dependencies; the deployment packages everything needed. In this case the difference between the CUDA SDK and the driver doesn’t matter: compilation assumes the CUDA SDK is available during both compilation and execution.

I think the main point of this proposal is that use case 1 is not really supported by LLVM. (The same is true for x86 and libc… it is a massive pain that LLVM generates calls into libc…)