MLIR GPU execution path - non-JIT trial

The existing tests under mlir/test/Integration/GPU/CUDA/ use mlir-cpu-runner to JIT-compile and execute MLIR on GPUs. I wanted to see whether the same execution could be reproduced ahead of time, by compiling to an object file or executable and then running it. This works for CPU execution, for example, where opt and llc can be used and the output then assembled and executed. However, there appear to be a few issues when trying the same thing for GPUs; the steps I followed are listed below:

  1. Let’s take any one of the example test cases, say test/Integration/GPU/CUDA/all-reduce-op.mlir. Run the way that test is set up, with JIT execution, it executes correctly:
$ mlir-opt ../../test/Integration/GPU/CUDA/all-reduce-op.mlir -gpu-kernel-outlining  -pass-pipeline="gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)" -gpu-to-llvm |  ../../../build/bin/mlir-cpu-runner -O3 -entry-point-result=void --shared-libs=../../../build/lib/libmlir_runner_utils.so --shared-libs=../../../build/lib/libmlir_cuda_runtime.so --shared-libs=../../../build/lib/libmlir_c_runner_utils.so  
Unranked Memref base@ = 0x556873ff3ab0 rank = 3 offset = 0 sizes = [2, 4, 13] strides = [52, 13, 1] data = 
[[[5356,    5356,    5356,    5356,    5356,    5356,    5356,    5356,    5356,    5356,    5356,    5356,    5356],
...
  2. In order to compile the same thing down to a binary and execute, I tried:
mlir-opt ../../test/Integration/GPU/CUDA/all-reduce-op.mlir -gpu-kernel-outlining  -pass-pipeline="gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)" -gpu-to-llvm | mlir-translate -mlir-to-llvmir | opt -O3 -S | llc -O3 -march=nvptx64 -o test.ptx

which generates the PTX. However, trying to assemble it:

$ ptxas test.ptx 
ptxas test.ptx, line 444; fatal   : Parsing error near '-': syntax error
ptxas fatal   : Ptx assembly aborted due to errors

At line 444, we have:

  .section  .debug_pubnames
  {
.b32 LpubNames_end0-LpubNames_start0    // Length of Public Names Info

The PTX header shows:

//
// Generated by LLVM NVPTX Back-End
//

.version 3.2
.target sm_20, debug
.address_size 64

...

Guessing wildly that this was a syntax mismatch (hyphens vs. underscores), I changed the - in the names to _. This resolves the parse error, but ptxas then reports:

$ ptxas test.ptx 
ptxas test.ptx, line 445; error   : Feature 'Defining labels in .section' requires PTX ISA .version 7.0 or later
ptxas test.ptx, line 457; error   : Feature 'Defining labels in .section' requires PTX ISA .version 7.0 or later
ptxas test.ptx, line 462; error   : Feature 'Defining labels in .section' requires PTX ISA .version 7.0 or later
ptxas test.ptx, line 468; error   : Feature 'Defining labels in .section' requires PTX ISA .version 7.0 or later

To get PTX ISA 7.0 or later, which I would be able to run anyway, I then used:

... | llc -O3 -march=nvptx64 -mcpu=sm_80 | sed -e 's/-Lpu/_Lpu/g' | ptxas - --gpu-name=sm_80 --compile-only -o test

And this finally works (I guess one could also have stripped the debug info to avoid this):

$ file test
test: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
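
For reference, the sed rewrite used above can be tried on the offending line in isolation. This is only a sketch: fix_debug_labels is a hypothetical helper name, and the sed expression is the one from the command above. Note that the substitution fuses the two label names into a single identifier, so it presumably only gets the file past the parser rather than preserving the intended length expression.

```shell
#!/bin/sh
# fix_debug_labels is a hypothetical helper name; the sed expression itself
# is copied from the workaround command in this post.
fix_debug_labels() {
    # Rewrite "-Lpu..." to "_Lpu...", i.e. replace the '-' that ptxas
    # rejects between the two LpubNames labels with an underscore.
    sed -e 's/-Lpu/_Lpu/g'
}

# Apply it to the line that ptxas complained about (line 444 above).
printf '%s\n' '.b32 LpubNames_end0-LpubNames_start0    // Length of Public Names Info' \
    | fix_debug_labels
```

Lines that do not contain -Lpu pass through unchanged, so the filter can sit between llc and ptxas as in the pipeline above.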

It looks like the assembly printed by LLVM cannot be parsed back by ptxas? Should this be reported to LLVM or to NVIDIA (albeit a minor issue)? The debug info labels apparently don’t have the right format, and they are also being emitted for lower PTX ISA versions that don’t support them.

  3. More importantly, does anyone know the right way to link the above object with the MLIR runtime shared libraries? For CPUs, one would just use clang++, g++, or ld on it, but here we get an error like the following, which is expected because we are cross-compiling:
clang++  test ../../../build/lib/libmlir_cuda_runtime.so
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: Relocations in generic ELF (EM: 190)
/usr/bin/ld: test: error adding symbols: file in wrong format
clang: error: linker command failed with exit code 1 (use -v to see invocation)

I couldn’t tell what target options/approach to use.
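
The EM: 190 in the linker message is the ELF machine number of the object being linked; in elf.h, 190 is EM_CUDA (NVIDIA CUDA) and 62 is EM_X86_64. A small sketch for checking this on any object file follows. elf_machine is a hypothetical helper name, and the code assumes a little-endian (LSB) ELF file, where e_machine is the 16-bit field at byte offset 18 of the header.

```shell
#!/bin/sh
# elf_machine is a hypothetical helper: print the e_machine field of an ELF
# file as a decimal number. Assumes a little-endian (LSB) ELF header, in
# which e_machine occupies the two bytes at offset 18.
elf_machine() {
    # Dump the two bytes as unsigned decimals ("low high") and make them
    # the positional parameters of this function.
    set -- $(od -An -tu1 -j18 -N2 "$1")
    # Combine them in little-endian order.
    echo $(( $1 + 256 * $2 ))
}

# Example (not run here): elf_machine test
# 190 would indicate a CUDA device object, 62 an x86-64 host object.
```

A host linker can only combine objects for its own machine type, which is why ld rejects an object whose machine number is 190.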

Looks like I was completely on the wrong path here! The following works:

$ mlir-opt ../../test/Integration/GPU/CUDA/all-reduce-op.mlir -gpu-kernel-outlining  -pass-pipeline="gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)" -gpu-to-llvm | mlir-translate -mlir-to-llvmir | opt -O3 -S | llc -O3 | as - -o test.o
$ clang++ test.o <mlir runtime libraries>  -o exec -lcuda
$ ./exec

I shouldn’t have been using any nvptx targets in the first place! The device part had already been compiled to PTX and assembled into a cubin before the MLIR was even translated to LLVM IR.
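
To keep the two working commands together, here is a sketch of a dry-run helper that only prints them for a given input file and set of runtime libraries. host_build_commands is a hypothetical name, the flags are copied from the commands above, and the runtime libraries are deliberately left as arguments because the exact set to link is not spelled out here.

```shell
#!/bin/sh
# host_build_commands is a hypothetical helper: it prints (does not run) the
# host-only build steps from this post. All flags are copied from the
# working commands above; no nvptx target appears anywhere.
host_build_commands() {
    input=$1   # MLIR source file
    shift      # remaining arguments: runtime libraries to link against
    pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)'
    # Step 1: lower with mlir-opt (the device code becomes a cubin inside
    # the pass pipeline), then compile the host LLVM IR to test.o.
    printf '%s\n' "mlir-opt $input -gpu-kernel-outlining -pass-pipeline=\"$pipeline\" -gpu-to-llvm | mlir-translate -mlir-to-llvmir | opt -O3 -S | llc -O3 | as - -o test.o"
    # Step 2: link the host object against the given runtime libraries
    # and libcuda.
    printf '%s\n' "clang++ test.o $* -o exec -lcuda"
}
```

Calling it with the path to all-reduce-op.mlir followed by the runtime library paths prints two lines that can be reviewed and then pasted into a shell.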

Yes, MLIR runs the PTX compilation in a pass. We should document this in some visible place; it’s not a trivial setup.

I think it is important to keep in mind that the design here is to have the CPU (host) and GPU (device) code in a single IR. That is somewhat novel, and I think it is where the confusion comes from.

I am a little surprised that -gpu-to-llvm is enough to lower the whole host-side of the program. Maybe because it is trivial? Otherwise, I would have expected to see some lowering of std, as well.

Where would be a good place to document this? Would it suffice to do a small tutorial style example somewhere?

It subsumes std-to-llvm as well as other things, which could benefit from a cleanup similar to what I did for other passes.

I’d say a top-level doc “targeting GPUs” or something similar is justified.

Absolutely - I think this is a great piece of engineering!

I had been well aware that the gpu-to-cubin pass already generated the final binary for the device, but I was also trying to link another clang-compiled C++ source with this MLIR and somehow ended up assuming that the whole lowered MLIR file contained device IR yet to be compiled and assembled for GPUs. OTOH, it was really host code making calls to CUDART.