Summary
The current pipeline for generating GPU code has several drawbacks: it cannot link against bytecode libraries like libdevice (or perform device linking in general), and it requires building MLIR on the target system to obtain executable code, to name a few.
This proposal would introduce an alternative pipeline for generating device code. Essentially, it would generate clang-compliant offload LLVM IR, and the task of generating the final executable would be left to clang, removing that burden from MLIR and leveraging clang's capabilities for offload code generation.
Important notice: this won't introduce a clang build dependency on MLIR.
Edit: Due to Discourse restrictions I had to remove several links and user mentions, so here is a GitHub gist with the original RFC.
Context & Motivation
The current approach for lowering the GPU dialect can be summed up by:
mlir-opt --convert-gpu-to-nvvm \
--convert-nvgpu-to-nvvm \
--gpu-to-cubin \
--gpu-to-llvm input.mlir
Where device code generation happens inside either --gpu-to-cubin or --gpu-to-hsaco. These serialization passes follow the process:
(NVVM | ROCDL) + LLVM -> LLVM IR -> Device binary -> Embedding the device binary as constant into the host IR
This pipeline works in many cases; however, as soon as users start using ops like math.sqrt, they run into issues. Most of these issues appear because there's no linking during the serialization pass (see SerializeToCubin.cpp). Thus the pipeline cannot scale under the current system.
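As a concrete illustration (the specific libdevice symbol below is my assumption of the usual lowering and may differ), consider a kernel that uses math.sqrt:
# kernel.mlir contains a gpu.func that uses math.sqrt; during
# --convert-gpu-to-nvvm the op is lowered to a libdevice call
# (e.g. __nv_sqrtf).
mlir-opt --convert-gpu-to-nvvm --gpu-to-cubin --gpu-to-llvm kernel.mlir
# Serialization hands ptxas PTX that still references __nv_sqrtf and
# fails, because libdevice.bc is never linked in during the pass.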
Another issue with the current approach is that users are forced to build MLIR on every target system (or at least on a system with a full pipeline), as the serialization passes and the ExecutionEngine depend on an installation of either ROCm or CUDA. One important thing to note is that this requirement is not imposed on pure host code: the user can translate down to LLVM IR on one machine, copy the file to the target system, and generate the executable with clang there.
Why is this last point an issue? MLIR moves fast, so weekly recompiles are not uncommon. These weekly builds might be prohibitive for users with limited and precious compute time on target systems (e.g. DOE scientists) but otherwise unlimited time on other systems where they can build MLIR. Hence the ability to generate LLVM IR on one machine and compile it with clang on a different one would prove useful.
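For pure host code, that split workflow looks roughly like this (a sketch; the exact lowering passes depend on the input):
# On the build machine (no CUDA / ROCm installation needed for host-only code):
mlir-opt --convert-func-to-llvm input.mlir | mlir-translate --mlir-to-llvmir -o input.ll
# Copy input.ll to the target system and produce the executable there:
clang -O3 input.ll -o input.exe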
Clang Offload
Clang is able to handle offload to multiple target architectures like NVPTX and AMDGPU with ease, calling vendor tools like ptxas automatically and even passing arguments to these tools, making the process extensible.
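For example (assuming an NVPTX target and a standard CUDA installation), extra flags can be forwarded to ptxas straight from the clang invocation:
# -Xcuda-ptxas forwards the following argument to ptxas; -v makes it
# print register and memory usage for each kernel.
clang --offload-arch=sm_70 -Xcuda-ptxas -v example.cu -lcudart -o example.exe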
Clang currently has two offload drivers: the default one and --offload-new-driver. The new driver is not yet available on Windows / macOS; however, it has some important benefits:
- Simpler to use.
- It will (probably) eventually become the default (there are some issues to fix to support other platforms).
- It’s more powerful in its capabilities and provides a more uniform compilation process.
- Capable of performing device LTO.
For more information see Offloading Design.
Offloading example
// example.cu
__global__ void fill(int n, float a, float *x) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n)
x[i] = a;
}
int main(void) {
int n = 1 << 20;
float *x;
cudaMallocManaged(&x, n * sizeof(float));
fill<<<(n + 255) / 256, 256>>>(n, 2.0f, x);
cudaDeviceSynchronize();
cudaFree(x);
}
New driver compilation process:
# Create host & device IR.
clang -fgpu-rdc --offload-new-driver --offload-device-only --offload-arch=sm_70 -o example_dev.bc -c -emit-llvm example.cu
clang -fgpu-rdc --offload-new-driver --offload-host-only -o example_host.bc -c -emit-llvm example.cu
# Compile everything together.
clang-offload-packager -o example_dev.bin --image=file=example_dev.bc,arch=sm_70,triple=nvptx64-nvidia-cuda,kind=cuda
clang -fgpu-rdc --offload-new-driver example_host.bc -Xclang -fembed-offload-object=example_dev.bin -O3 -lcudart -o example.exe
The last step can be replaced by:
# The output of this command will have host & device code, with the device code embedded as a constant, and some other necessary annotations.
clang -fgpu-rdc --offload-new-driver example_host.bc -Xclang -fembed-offload-object=example_dev.bin -O3 -c -emit-llvm -o example_full.bc
clang -fgpu-rdc --offload-new-driver example_full.bc -lcudart -o example.exe
The output from the first command (example_dev.bc) is plain LLVM IR device offload code.
The output from the second command (example_host.bc) is LLVM IR code with 2 additional annotations:
@.omp_offloading.entry_name = internal unnamed_addr constant [12 x i8] c"_Z4fillifPf\00"
@.omp_offloading.entry._Z4fillifPf = weak constant %struct.__tgt_offload_entry { ptr @_Z19__device_stub__fillifPf, ptr @.omp_offloading.entry_name, i64 0, i32 0, i32 0 }, section "cuda_offloading_entries", align 1
The output from embedding example_dev.bin into example_host.bc and producing example_full.bc just adds the following annotations:
@llvm.embedded.object = private constant [<num> x i8] c"...", section ".llvm.offloading", align 8, !exclude !0
@llvm.compiler.used = appending global [1 x ptr] [ptr @llvm.embedded.object], section "llvm.metadata"
The key step of transforming everything into an executable is performed under the hood by clang, which calls the tool clang-linker-wrapper. This tool compiles the device LLVM IR down to a fatbinary and generates the necessary code to register the kernels.
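One way to observe this (assuming a recent clang build) is the -### flag, which prints the commands clang would execute without running them:
# The printed link step shows clang-linker-wrapper being invoked; it
# unpacks the embedded device IR, compiles it to a fatbinary, and emits
# the kernel registration code before the final host link.
clang -### -fgpu-rdc --offload-new-driver example_full.bc -lcudart -o example.exe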
Default driver compilation process:
The default driver is more rigid: unlike the new driver, it cannot be invoked on a combination of host IR and device IR files. It requires either a CUDA input file or an LLVM IR file with many more target-specific annotations, including kernel registration code.
Shout-out to @jhuber6 for all the help in deciphering the steps.
Possible routes
From easiest to implement to hardest:
- Keep most of the existing infrastructure, but instead of serializing everything in one pass, have a pass that emits device LLVM IR, let the user manually compile this IR down to cubin, and create a standalone tool for embedding that IR into the host IR.
  - Drawbacks: no device LTO, the extensibility issues remain, and there's an extra tool to maintain.
- Create a new serialization pipeline, but instead of compiling down to a binary, serialize to device bytecode, introduce clang-compatible offload annotations for the new driver, and let clang generate the executable.
  - Drawbacks: it would only work on Linux until the new clang offload driver becomes available on all platforms. No JIT. It requires the introduction of a new runner library, as clang uses the CUDA runtime instead of the driver API.
- Do 2, but with the old driver.
  - Drawbacks: it's going to be deprecated in the future when clang makes the switch. The amount of extra IR to be generated is considerable.
I'm inclined to go with option 2: it's relatively simple to implement, and in the long run it's just the better option, as it provides more extensibility for generating device code, more optimization opportunities, and easier maintenance.
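To make option 2 concrete, the end-to-end flow could look like the sketch below; the pass name --gpu-to-offload-llvm is a placeholder for the new serialization pipeline, and the final step reuses the clang invocation shown earlier:
# Hypothetical pass (name is a placeholder): emit host IR carrying the
# clang compatible offload annotations with the device bytecode embedded,
# analogous to example_full.bc above.
mlir-opt --gpu-to-offload-llvm input.mlir | mlir-translate --mlir-to-llvmir -o input.bc
# clang takes care of device compilation, kernel registration, and linking.
clang -fgpu-rdc --offload-new-driver input.bc -lcudart -o input.exe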
Note: wherever CUDA or CUDA-related terms appear above, they can be swapped for HIP.
cc:
@ftynse