Converting CUDA program to mlir (gpu, linalg etc.)

Hi all,

I am trying to convert a CUDA program (.cu) that does vector addition. I was able to lower the CUDA (.cu) to llvm (.ll) and then to llvm-mlir using clang and mlir-translate. But I need to lower it to mlir with other dialects like linalg or gpu etc. Is there any way of doing so? Thanks!

PS: I tried Polygiest but was not able to figure out.

ClangIR (incubator, upstream hasn’t reached that point yet) has support to go from CU → CIR → LLVM, see tests in clangir/clang/test/CIR/CodeGen/CUDA at main · llvm/clangir · GitHub

Since ClangIR also has a CU → CIR → Core Dialects pipeline, you could try playing with that and see what you can get (note sure anyone has tested that yet though).

Also @wsmoses @chelini for Polygeist, it should support CUDA. It would have been more helpful if you specifically explained what you tried and what the problem was.

Thank you. Will explore it.

Sure, so I did this using Polygeist:

One of the .cu file:

#include <stdio.h>
__global__ void hello() {
    printf("Hello from GPU!\n");
}
int main() {
    hello<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}

Polygeist build commands:

cmake -G Ninja ../llvm \
  -DLLVM_ENABLE_PROJECTS="mlir;clang" \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
  -DMLIR_ENABLE_CUDA_RUNNER=ON \
  -DMLIR_ENABLE_EXECUTION_ENGINE=ON \
  -DLLVM_ENABLE_RTTI=ON \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES=/usr/local/cuda/include

And,

cmake -G Ninja .. \
  -DMLIR_DIR=$PWD/../llvm-project/build/lib/cmake/mlir \
  -DCLANG_DIR=$PWD/../llvm-project/build/lib/cmake/clang \
  -DPOLYGEIST_ENABLE_CUDA=1 \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda \
  -DLLVM_ENABLE_ASSERTIONS=ON \
  -DLLVM_ENABLE_RTTI=ON \
  -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CUDA_TOOLKIT_INCLUDE_DIRECTORIES=/usr/local/cuda/include

After building the cgeist and polygeist-opt, I run:

Exp/Polygeist/build/bin/cgeist --immediate --cuda-gpu-arch=sm_86 --cuda-path=/usr/lib/cuda  --nocudainc  -I/usr/lib/gcc/x86_64-linux-gnu/11/include -I/usr/include test2.cu

I get this error:

test2.cu:4:1: error: unknown type name '__global__'
    4 | __global__ void hello() {
      | ^
test2.cu:9:10: error: use of undeclared identifier __cudaPushCallConfiguration
    9 |     hello<<<1, 1>>>();
      |          ^
test2.cu:10:5: error: use of undeclared identifier 'cudaDeviceSynchronize'
   10 |     cudaDeviceSynchronize();
      |     ^

Got it working. Had to set the environment variable as export CPLUS_INCLUDE_PATH=Exp/Polygeist/llvm-project/build/lib/clang/18/include and then ran Exp/Polygeist/build/bin/cgeist test2.cu --cuda-gpu-arch=sm_86 --cuda-path=/usr/local/cuda --nocudalib -I/usr/local/cuda/include -I/usr/lib/gcc/x86_64-linux-gnu/11/include --immediate

The IR consists of llvm, arith, scf, gpu etc. dialects though.

I was going to say --nocudainc looks suspicious, you don’t want to remove CUDA definitions from consideration. You may also need to pass --resource-dir to cgeist explicitly by first querying it from clang --print-resource-dir using the clang you built.

This is correct. What did you expect?

Ok, will run with the modified flags. I was expecting these only except the llvm one. Thanks.

Ran with the modified flags, it generates the same code with or without the --nocudalib, I added the --resource-dir. Also, Polegeist dialect op also appears.

nocudalib disables linking with libdevice, so it will not have any effect on compilation itself. The original compilation string had nocudainc, which disables inclusion of system-level CUDA headers leading to undeclared symbol errors.

LLVM dialect remains for things that cannot be easily (or at all) converted to a higher-level dialect. This proejct GitHub - EnzymeAD/Enzyme-JAX: Custom Bindings for Enzyme Automatic Differentiation Tool and Interfacing with JAX. has more raising passes, namely -affine-cfg (which incidentally does other kinds of raising) and -libdevice-funcs-raise that converts libdevice calls to arith/math dialect.