Although the following won’t apply 100% to the patches I’m submitting next week, I think a brief example makes this proposal easier to understand:
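For instance, test.mlir could hold a trivial GPU-dialect kernel along these lines (the kernel below is a hypothetical sketch for illustration, not code from the actual patches; any gpu.launch-based module goes through the same passes):

cat > test.mlir <<'EOF'
func.func @vec_add(%a: memref<128xf32>, %b: memref<128xf32>) {
  %c1 = arith.constant 1 : index
  %c128 = arith.constant 128 : index
  // One block of 128 threads; each thread adds a single element.
  gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
             threads(%tx, %ty, %tz) in (%sx = %c128, %sy = %c1, %sz = %c1) {
    %va = memref.load %a[%tx] : memref<128xf32>
    %vb = memref.load %b[%tx] : memref<128xf32>
    %sum = arith.addf %va, %vb : f32
    memref.store %sum, %b[%tx] : memref<128xf32>
    gpu.terminator
  }
  return
}
EOF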
The compilation process to obtain an executable would be:
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
-O3 -o test.exe
For a full breakdown of the compilation steps, see D149559: [mlir][gpu] Adds a gpu serialization pipeline for offloading GPUDialect Ops to clang compatible annotations.
The above pipeline has the benefit of leveraging clang’s device codegen infrastructure: the generation of PTX, cubins, etc. is handled entirely by clang. This is why the new pipeline can be built as long as the NVPTX or AMDGPU targets are enabled; only clang needs to know the specifics of the toolkits. This also allows generating the IR on one machine and compiling the executable on a different machine, among many other things.
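As a rough sketch of that split, reusing the commands from above (the machine labels are illustrative):

# Machine A: has MLIR built with the NVPTX/AMDGPU backends and produces
# portable LLVM IR; no device binaries (PTX, cubin, ...) are generated here.
mlir-opt test.mlir <pipeline from above> -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll

# Machine B: clang plus the vendor toolkit turn the annotated IR into the
# final executable; MLIR is not needed on this machine at all.
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
  -L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart -O3 -o test.exe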
This pipeline links against libdevice (in reality, against any bitcode library) by importing its symbols, so it is able to inline all the calls.
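To make that concrete: for CUDA, the bitcode library in question is libdevice, which ships with the toolkit (this is what the cuda-path option above lets the pass find), and importing from it behaves like a link with internalize/only-needed semantics. A rough standalone approximation with llvm-link, where device.ll stands in for a hypothetical extracted device module:

# Standard CUDA toolkit layout for the libdevice bitcode library:
ls "<cuda toolkit path>/nvvm/libdevice/libdevice.10.bc"

# Pull in only the __nv_* symbols the device module actually uses, and
# internalize them so they can be inlined and dead-stripped afterwards.
llvm-link --internalize --only-needed device.ll \
  "<cuda toolkit path>/nvvm/libdevice/libdevice.10.bc" -S -o device_linked.ll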
For obtaining the full executable, clang is still the desired tool, as it handles linking, RDC, and many other things that might not fit in LLVM. However, we do plan to move some of those utilities to LLVM; that’s one of the reasons the above diff was scrapped.
This is interesting. I personally don’t mind incorporating this approach; however, I don’t know whether users would complain about the added runtime overhead.
No, you just need those on the machine where you compile. However, some device-specific utilities are not always available on the machine running MLIR; that’s the difference I’m trying to address. For example, the machine I use for MLIR doesn’t have ROCm (as far as I recall), but with this pipeline it doesn’t need to: ROCm is only needed by clang on the target machine. Hence I can generate LLVM IR from MLIR on one machine with no knowledge of ROCm, ship it to a different machine without MLIR, and let clang figure out the rest.
Also, I just want to acknowledge that this pipeline is possible thanks to all the offloading work done by the Clang and OpenMP teams.