Overview
The new offloading driver is a unified interface to create applications from single-source offloading languages such as CUDA, OpenMP, or HIP. It has been the default method used to create OpenMP offloading programs since the LLVM 15 release but has remained opt-in for CUDA and HIP through the --[no-]offload-new-driver flags. I am proposing that we update both HIP and CUDA to use this compilation method by default.
The new offloading driver supports several features.
- A unified and simplified interface for linking that provides interoperability between different offloading languages
- Linking and compiling relocatable device code with -fgpu-rdc
- Support for static libraries containing device code
- Device-side LTO support through -foffload-lto
- Compilation on both Windows and Linux
- Completely compatible with standard builds, no CUDA_SEPARABLE_COMPILATION needed
New driver internals
Single-source languages pose several challenges to standard build systems. This is because a single source file will generate multiple output objects which all must be separately linked. In order to support this, the new driver primarily uses two utilities, the clang-offload-packager and the clang-linker-wrapper.
The clang-offload-packager takes multiple device images and bundles them into a single binary blob containing metadata about each image. This blob is then inserted into a section called .llvm.offloading by the offloading toolchain so each compilation step yields a single output file.
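As a rough illustration of the bundling idea, here is a toy Python model that packs several device images, each with its metadata, into one length-prefixed blob and reads them back. The layout (a JSON header plus two 32-bit lengths) and the names pack_images and unpack_images are invented for this sketch; the real packager uses its own binary format, which is not reproduced here.

```python
import json
import struct


def pack_images(images):
    """Bundle (metadata, payload) pairs into one blob (toy layout, not the real format)."""
    blob = bytearray()
    for meta, payload in images:
        header = json.dumps(meta, sort_keys=True).encode("utf-8")
        # Each entry: 4-byte header length, 4-byte payload length, header, payload.
        blob += struct.pack("<II", len(header), len(payload))
        blob += header
        blob += payload
    return bytes(blob)


def unpack_images(blob):
    """Walk the blob entry by entry and recover every (metadata, payload) pair."""
    images = []
    offset = 0
    while offset < len(blob):
        header_len, payload_len = struct.unpack_from("<II", blob, offset)
        offset += 8
        meta = json.loads(blob[offset:offset + header_len].decode("utf-8"))
        offset += header_len
        payload = blob[offset:offset + payload_len]
        offset += payload_len
        images.append((meta, payload))
    return images


# Placeholder payloads standing in for real device binaries.
images = [
    ({"kind": "cubin", "arch": "sm_70", "triple": "nvptx64-nvidia-cuda"}, b"\x00fake-sm70"),
    ({"kind": "cubin", "arch": "sm_80", "triple": "nvptx64-nvidia-cuda"}, b"\x00fake-sm80"),
]
blob = pack_images(images)
```

The point of the single blob is that the host object can carry any number of device images while the build system still sees one output file per compilation step.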
The clang-linker-wrapper wraps around the user’s standard linker job so it can preprocess the embedded device images. It will scan the input files for embedded device code and use the embedded metadata to link it into a valid GPU image. This image will then be wrapped into a registration module that makes the necessary runtime calls to register the image with the CUDA, HIP, or OpenMP runtime. This module then gets appended to the link job and the linker runs as normal.
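The flow described above can be modelled in a few lines of Python. This is only a conceptual sketch: the dictionaries standing in for object files, the byte concatenation standing in for a device link, and the name registration_module.o are all assumptions made for illustration and do not reflect the wrapper's actual implementation.

```python
from collections import defaultdict


def group_device_images(object_files):
    """Scan inputs and group embedded images by (triple, arch) using their metadata."""
    groups = defaultdict(list)
    for obj in object_files:
        for image in obj["embedded_images"]:
            key = (image["triple"], image["arch"])
            groups[key].append(image["payload"])
    return dict(groups)


def link_device_images(groups):
    """Stand-in for the device link: one output image per (triple, arch) group."""
    return {key: b"".join(payloads) for key, payloads in groups.items()}


def build_link_job(object_files, linked_images):
    """Append a (hypothetically named) registration module to the host link inputs."""
    host_inputs = [obj["name"] for obj in object_files]
    if linked_images:
        return host_inputs + ["registration_module.o"]
    return host_inputs


def make_object(name, tag):
    # Toy object file carrying one placeholder image per architecture.
    return {
        "name": name,
        "embedded_images": [
            {"triple": "nvptx64-nvidia-cuda", "arch": arch, "payload": (tag + arch[-2:]).encode()}
            for arch in ("sm_70", "sm_80")
        ],
    }


objects = [make_object("a.o", "a"), make_object("b.o", "b")]
linked = link_device_images(group_device_images(objects))
link_job = build_link_job(objects, linked)
```

Grouping by target before linking is what lets images from different translation units be combined per architecture, while the host linker only sees one extra input.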
These steps can be inspected during a standard compilation of a simple “hello world” CUDA kernel…
$ clang hello.cu --offload-arch=sm_70,sm_80 --offload-new-driver -fgpu-rdc -c
$ llvm-objdump --offloading hello.o
hello.o: file format elf64-x86-64
OFFLOADING IMAGE [0]:
kind cubin
arch sm_70
triple nvptx64-nvidia-cuda
producer cuda
OFFLOADING IMAGE [1]:
kind cubin
arch sm_80
triple nvptx64-nvidia-cuda
producer cuda
$ clang hello.o --offload-link -lcudart
$ ./a.out
Hello World!
The above steps are identical for HIP or OpenMP, modulo flags, and the resulting objects can be linked together. More information is provided in the Clang documentation page "Offloading Design & Internals" (Clang 19.0.0git documentation).
What is required
The OpenMP offloading toolchain has been using this compilation method for over a year and a half. The functional change will simply be toggling a default false switch to true and updating the tests. We will want to test the default change on the existing builders as well. Downstream compilers may wish to retain the old default, but it will be a simple matter of keeping the boolean set to false.
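To make the effect of the default concrete, here is a small Python model of how a paired on/off flag with a default might be resolved. It assumes the usual last-occurrence-wins behavior for such flag pairs; the function name use_new_driver and its default parameter are hypothetical and are not taken from the driver source.

```python
def use_new_driver(argv, default):
    """Resolve --[no-]offload-new-driver: the last occurrence wins, else the default."""
    enabled = default
    for arg in argv:
        if arg == "--offload-new-driver":
            enabled = True
        elif arg == "--no-offload-new-driver":
            enabled = False
    return enabled


# Today (default False), CUDA and HIP users must opt in.
today = use_new_driver(["clang", "hello.cu"], default=False)

# Under the proposal (default True), the same command line selects the new driver,
# and users who want the old behavior opt out explicitly.
proposed = use_new_driver(["clang", "hello.cu"], default=True)
opted_out = use_new_driver(["clang", "hello.cu", "--no-offload-new-driver"], default=True)
```

In this model a downstream compiler that wants the old behavior only changes the value passed as default; command lines that already spell out either flag are unaffected.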
Potentially observable changes
These are a few small issues that may need to be addressed first.
- Currently does not have a flag to enable HIP image compression
- Currently does not embed PTX code
- Currently does not register HIP __managed__ variables
- LTO linking may provide different results to HIP’s current llvm-link invocation
- Obtaining linked intermediate files requires -save-temps as these steps no longer appear in the driver's -### output
Conclusion
I want to toggle the switch to use the new driver by default so users need to opt out with --no-offload-new-driver. This will simplify our offloading toolchain and make it easier to interoperate as we move forward with more GPU infrastructure.
@jdoerfert @Artem-B @yxsamliu @MaskRay
Thanks for reading, comments welcome.