[RFC] Use the 'new' offloading driver for CUDA and HIP compilation by default

Overview

The new offloading driver is a unified interface for creating applications from single-source offloading languages such as CUDA, OpenMP, or HIP. It has been the default method for creating OpenMP offloading programs since the LLVM 15 release, but has remained opt-in for CUDA and HIP through the --[no-]offload-new-driver flags. I am proposing that we update both HIP and CUDA to use this compilation method by default.

The new offloading driver supports several features:

  • A unified and simplified interface that allows linking and provides interoperability between different offloading languages
  • Linking and compiling relocatable device code with -fgpu-rdc
  • Support for static libraries containing device code
  • Device-side LTO support through -foffload-lto
  • Support for compilation on both Windows and Linux
  • Full compatibility with standard builds; no CUDA_SEPARABLE_COMPILATION needed

New driver internals

Single-source languages pose several challenges to standard build systems, because a single source file generates multiple output objects that must all be linked separately. To support this, the new driver primarily uses two utilities: the clang-offload-packager and the clang-linker-wrapper.

The clang-offload-packager takes multiple device images and bundles them into a single binary blob containing metadata about each image. This blob is then inserted into a section called .llvm.offloading by the offloading toolchain so each compilation step yields a single output file.
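For illustration, the packaging step the driver performs internally looks roughly like this (a sketch of the clang-offload-packager interface as documented; the image file names here are made up):

$ clang-offload-packager -o devices.bin \
    --image=file=sm_70.img,triple=nvptx64-nvidia-cuda,arch=sm_70,kind=cuda \
    --image=file=sm_80.img,triple=nvptx64-nvidia-cuda,arch=sm_80,kind=cuda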

The clang-linker-wrapper wraps around the user’s standard linker job so it can preprocess the embedded device images. It will scan the input files for embedded device code and use the embedded metadata to link it into a valid GPU image. This image will then be wrapped into a registration module that makes the necessary runtime calls to register the image with the CUDA, HIP, or OpenMP runtime. This module then gets appended to the link job and the linker runs as normal.

These steps can be inspected during a standard compilation of a simple “hello world” CUDA kernel…
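For reference, hello.cu can be any trivial kernel that prints from the device; something like:

$ cat > hello.cu << 'EOF'
#include <cstdio>

// Trivial kernel that prints from the device.
__global__ void hello() { printf("Hello World!\n"); }

int main() {
  hello<<<1, 1>>>();       // launch a single thread
  cudaDeviceSynchronize(); // wait so the device printf is flushed
  return 0;
}
EOF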

$ clang hello.cu --offload-arch=sm_70,sm_80 --offload-new-driver -fgpu-rdc -c
$ llvm-objdump --offloading hello.o
hello.o:	file format elf64-x86-64

OFFLOADING IMAGE [0]:
kind            cubin
arch            sm_70
triple          nvptx64-nvidia-cuda
producer        cuda

OFFLOADING IMAGE [1]:
kind            cubin
arch            sm_80
triple          nvptx64-nvidia-cuda
producer        cuda
$ clang hello.o --offload-link -lcudart
$ ./a.out
Hello World!

The above steps are identical for HIP or OpenMP modulo flags, and the resulting objects can be linked together. More information is provided in the Clang documentation under Offloading Design & Internals.
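For example, the equivalent HIP flow would look something like the following (a sketch; the gfx90a target and the -lamdhip64 runtime library are just illustrative choices):

$ clang hello.hip --offload-arch=gfx90a --offload-new-driver -fgpu-rdc -c
$ clang hello.o --offload-link -lamdhip64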

What is required

The OpenMP offloading toolchain has been using this compilation method for over a year and a half. The functional change will simply be toggling a default-false switch to true and updating the tests. We will also want to test the default change on the existing builders. Downstream compilers may wish to retain the old default, but that is a simple matter of keeping the boolean set to false.

Potentially observable changes

These are a few small issues that may need to be addressed first.

  • Currently has no flag to enable HIP image compression
  • Currently does not embed PTX code
  • Currently does not register HIP __managed__ variables
  • LTO linking may produce different results from HIP’s current llvm-link invocation
  • Obtaining linked intermediate files requires -save-temps, as these steps are no longer visible in the driver’s -### output (see the example below).
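For example, my understanding is that keeping the temporaries looks like this (an illustrative invocation mirroring the earlier one; -save-temps at the link step is forwarded to the linker wrapper):

$ clang hello.cu --offload-arch=sm_70,sm_80 --offload-new-driver -fgpu-rdc -c -save-temps
$ clang hello.o --offload-link -lcudart -save-temps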

Conclusion

I want to toggle the switch to use the new driver by default, so users would need to opt out with --no-offload-new-driver. This will simplify our offloading toolchain and make it easier to interoperate as we move forward with more GPU infrastructure.

@jdoerfert @Artem-B @yxsamliu @MaskRay

Thanks for reading, comments welcome.


This is a great milestone! Thank you for all the effort you’ve put into improving offloading infrastructure in LLVM.

But… there are still i’s to dot and t’s to cross.

I think this is going to be a pretty disruptive change, and we should be very careful with the transition planning. If there are past precedents of major user-facing compilation changes, it would be great to revisit the lessons learned there.

Before we flip the switch we need to make sure that compilation will still work out of the box for most of the existing users.

E.g. we should talk to folks working on tensorflow and NVIDIA’s cccl and check what kinds of issues they will run into if they try to do the build with the new driver.

We may want to reach out to cmake folks and get the new driver support in place before we make it the default. This should make things “just work” for those who use cmake, which should help with the transition.

Scaling may also be something that needs to be documented. My understanding is that we will need to unpack GPU blobs for the final linking. This may be an issue for larger builds. E.g. we are already running into resource limits on the build workers that need to link large executables. If the amount of temporary space they need all of a sudden skyrockets, that will be an issue.

I would also be curious to see the impact on the final linking step for a large binary, especially for CUDA, which would need to use nvlink. I’ve never used it to link anything huge, so I have no idea how well it would scale.

Another question is interoperability with existing binaries. Some of the CUDA libraries (cuFFT) currently ship some RDC code in them. It would be great if we could interoperate with them if possible, though it’s not a prerequisite for flipping the switch; we’re not breaking anything, as the old driver can’t do much with NVCC-compiled RDC code at all. With the new driver we may be able to interoperate, eventually.

Thanks for the quick response.

And thank you for reviewing a lot of those patches.

Yep, this is mostly just a declaration of intent if you will. For the common case of -fno-gpu-rdc I think the only change is that it no longer embeds PTX by default. Similar story for HIP.

I’d love to hear their feedback, they can email me at joseph.huber@amd.com or message me here.

Does anyone know how to get in contact with the CMake development team? The new driver doesn’t require any CMake workarounds to make it work, so the CUDA_SEPARABLE_COMPILATION flag is unnecessary so long as you pass --offload-link to the link phase to make it use the linker wrapper (or just pass --offload-arch or any other flag that enables offloading).

I think this is par for the course with rdc-mode linking. However, you can use a relocatable link with the linker wrapper to split large sections into smaller ones if needed. I don’t know about Nvidia’s tools, but I’m assuming ptxas does something similar to nvlink under the hood, so it’ll probably be a similar case. Unfortunately nvlink isn’t publicly available, so I can’t really speak confidently on that front.

We may be able to produce some kind of “binary converter” tool that changes formats. Though I don’t know if that’s permitted under the CUDA toolkit EULA.

I believe they did implement their own way of making RDC compilation work, so the new driver will likely interfere with that. They may also want to take advantage of the new driver as that’s arguably a much nicer way of making CUDA compilation work in a way that benefits from other nice things like… linking.

As for the contacts, I think @tambre did a lot of CUDA work in CMake.

My concern is that normal linking of the final executable ships N GB of object files to a worker and expects to produce an executable of roughly comparable or smaller size. If we need additional scratch space comparable to the size of the input objects, that will be an issue for some users. E.g. our internal builds do have pretty hard constraints on the size of the inputs and the amount of memory used.

ptxas and compilation of an individual file are much less of a problem, as those are much smaller. The final linking is where we often collect everything and the kitchen sink, and for large enough apps a step-function increase in memory/space/time requirements will be an issue.

I definitely support HIP switching to the new offloading driver in the long run.

However, to switch to the new driver in trunk, we at least need to make sure it supports most existing HIP apps; otherwise it may be too disruptive to users.

Did you try it with internal CI and see how many HIP apps work if we switch to the new driver by default? Thanks.

No, I haven’t yet run it against the internal AMD tests. What’s the easiest way to do that? I could make a draft patch to turn it on by default and see how it fares.

I believe the default -fno-gpu-rdc case will be pretty much unaffected. For -fgpu-rdc mode I’ve tested it against a few applications and also enabled it on existing -fno-gpu-rdc applications without much issue. However, I do know that I will need to handle __managed__ variables there. I’m putting that off because the current registration scheme requires more arguments than I can fit in the offload_entry struct. I’ve also been floating the idea of reworking that struct, but if we could simplify the registration scheme that would also help.

I think I should also make a patch to enable the compression you added for the HIP embedding scheme.

If you can send me a patch to enable it by default, I can kick off an internal PSDB.

I saw your patch to enable compression in the new driver has been merged. Thanks.

About -fno-gpu-rdc: I have a feeling that currently it is handled the -fgpu-rdc way, since the clang-linker-wrapper links all of the device bitcode together, does one LTO, generates one fat binary, and generates one set of registration code for everything. That will make the program work, but it will incur long LTO times for projects containing many object files. To do it the -fno-gpu-rdc way, the device bitcode needs to be extracted for each object file and go through LTO to generate one fat binary per object, with registration code generated per object; a CUID may be needed to make the variables holding the fat binaries unique.

Thanks, I’ll try to get that done. Hopefully there’s not much divergence with the AMD fork in this area.

So, I think in the -fno-gpu-rdc case both the new and old drivers just send the single IR file to lld to make an executable per TU. In the -fgpu-rdc case, the current HIP driver also sends it through lld, which will implicitly use LTO, except that it passes the -plugin-opt=-amdgpu-internalize-symbols flag, which is likely an override to work around the ROCm-Device-Libs being exported with protected visibility instead of hidden, or perhaps HIP device functions not being hidden by default.

I think functionally both approaches will incur similar link times, the main observable difference is that the clang-linker-wrapper approach does the lld call internally. (You can almost think of the clang-linker-wrapper as a linker driver).

The -fno-gpu-rdc case also needs -amdgpu-internalize-symbols. The pass internalizes all device functions, not just device library functions. The old driver launches the backend in clang to do LLVM codegen for -fno-gpu-rdc, whereas the new driver uses LTO for it. The LTO optimization pipeline differs from the default LLVM optimization pipeline. LTO has an internal option to use the default pipeline, but it is not exposed.

For the same project, the LTO optimization time differs significantly between -fno-gpu-rdc mode and -fgpu-rdc mode, since -fno-gpu-rdc LTO only needs to optimize individual modules while -fgpu-rdc LTO needs to optimize all modules together. We have seen this happen in PyTorch, since it was originally compiled with -fgpu-rdc mode for HIP.

Okay, the handling there is pretty much identical for -fno-gpu-rdc on the new and old drivers: they both pass the file to lld with that flag set. The new driver only differs in the -fgpu-rdc case, where it does indeed do LTO. I could potentially pass the internalize-symbols option for HIP linking at least.

I realized that SPIR-V support is something I haven’t really handled yet in RDC mode. I might need to look into it further.

I made the pull request “[Offload] Move HIP and CUDA to new driver by default” (llvm/llvm-project#84420) to flip the switch and keep all the tests passing. @yxsamliu is this sufficient to run some basic tests in PSDB?

Correct, I’m the de-facto maintainer for Clang’s CUDA C++ support in CMake.

At Clevon we only use Clang’s CUDA C++ support instead of NVIDIA’s own compiler, so it would definitely be in my interest to keep things compiling after the default is switched. However, I’m a bit short on time right now, with the situation hopefully improving in a month or so.

Yes, but it never made sense to me why we needed a custom flag to handle this. I would assume the standard LTO pipeline would handle internalizing everything by default.

Like I said earlier, I think this is a legacy hack around the fact that we didn’t set the default visibility correctly in the compiler and libraries. AMDGPU executables are fundamentally shared libraries, and the HSA / HIP runtimes use the dynamic symbol table to look up symbols like kernels. Standard ELF semantics state that if something does not have hidden visibility it is present in the symbol table, which means LTO can’t internalize it because someone else might want to read it. I know the ROCm device libraries still use protected visibility, but we usually forcibly internalize that with -mlink-builtin-bitcode.

Thanks for the background. Ideally this change would mean that CUDA_SEPARABLE_COMPILATION on clang simply adds the -fgpu-rdc compilation flag and the --offload-link linker flag. I would also recommend setting up some way to allow the user to pass -foffload-lto, if such a thing doesn’t exist already.
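In other words, the underlying commands CMake would need to produce are roughly the following (a sketch; the architecture, file names, and runtime library are illustrative):

$ clang++ a.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -foffload-lto -c
$ clang++ b.cu --offload-arch=sm_80 --offload-new-driver -fgpu-rdc -foffload-lto -c
$ clang++ a.o b.o --offload-link -lcudart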

We need this LLVM option because clang may compile HIP code with -fgpu-rdc -O3 for each TU and generate optimized bitcode. In this case we cannot internalize the device functions; otherwise they won’t be available to other TUs later in LTO. Therefore this option can only be passed to LTO in the -fgpu-rdc case. In the -fno-gpu-rdc case each TU is self-contained, so it is safe to internalize device functions.

That will be controlled by the INTERPROCEDURAL_OPTIMIZATION target property once implemented. :slightly_smiling_face:


I’m unclear on the exact set of tradeoffs here, but I am strongly in favour of making OpenMP, CUDA, HIP, and any other single-source GPU language compilation as similar as we can. All using the same “new driver” sounds good, especially if we delete the “old driver” relatively promptly after getting any apparent regressions fixed. Making clang simpler is a good thing.
