How to lower scf.parallel to run on GPU with mlir-cuda-runner

Hi all,

I want to run an `scf.parallel` loop on a GPU with mlir-cuda-runner. Can anyone help me with how to lower `scf.parallel` so it runs with mlir-cuda-runner? The following is an example of the code:

%0 = alloc() : memref<10000xi32>
// initialization for %0 omitted
%1 = alloc() : memref<10000xi32>
%c0 = constant 0 : index
%c10000 = constant 10000 : index
%c1 = constant 1 : index
scf.parallel (%arg0) = (%c0) to (%c10000) step (%c1) {
    %2 = load %0[%arg0] : memref<10000xi32>
    %castarg = index_cast %arg0 : index to i32
    %3 = addi %2, %castarg : i32
    store %3, %1[%arg0] : memref<10000xi32>
}

Thank you in advance!

You first need to put mapping annotations onto the parallel loop to identify which iteration dimension is mapped to the available gpu resources (thread/block x/y/z).
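For the 1-D loop above, the annotation could look something like this (a sketch: `processor = 0` stands for block dimension x in the `Processor` enum from ParallelLoopMapper.h, and the identity affine maps leave the loop bounds unchanged — check the header for the exact encoding in your version):

```mlir
scf.parallel (%arg0) = (%c0) to (%c10000) step (%c1) {
  // loop body as before
} {mapping = [{processor = 0,
               map = affine_map<(d0) -> (d0)>,
               bound = affine_map<(d0) -> (d0)>}]}
```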

There is a greedy pass that does this for you; see mlir/include/mlir/Dialect/GPU/ParallelLoopMapper.h for details. It is not exposed as a pass you can run via a textual pipeline at the moment, but you could call it from your own code.
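If you want to trigger the greedy mapper from your own pass or tool, a minimal sketch could look like the following (this assumes the `greedilyMapParallelSCFToGPU` helper declared in that header; the exact name and signature may differ in your MLIR revision):

```cpp
#include "mlir/Dialect/GPU/ParallelLoopMapper.h"
#include "mlir/IR/Function.h"

// Annotate every scf.parallel loop in `func` with gpu mapping attributes,
// greedily assigning loop dimensions to blocks/threads.
void annotateForGPU(mlir::FuncOp func) {
  mlir::greedilyMapParallelSCFToGPU(func.getBody());
}
```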

Once you have annotations, you can use mlir-opt -convert-parallel-loops-to-gpu to lower this to gpu code. See mlir/test/Conversion/SCFToGPU/parallel_loop.mlir for an example.
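The command line for that step might look roughly like this (flag spelling as in that test; verify against `mlir-opt --help` for your build, and note the file names here are placeholders):

```shell
mlir-opt annotated.mlir -convert-parallel-loops-to-gpu -o gpu.mlir
```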

Thank you so much for your reply. Another question: do we need to use convert-gpu-to-nvvm if we run on a CUDA GPU?

If you want to use the cuda runner, you can have a look at mlir/test/mlir-cuda-runner/all-reduce-xor.mlir for an example of how to invoke it. It will lower the gpu dialect correctly for you.
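Modeled on the RUN line in that test, an invocation could look like the following (a sketch; the shared-library names are placeholders for wherever your build puts the CUDA runtime wrappers and runner utils):

```shell
mlir-cuda-runner input.mlir \
  --shared-libs=libcuda-runtime-wrappers.so,libmlir_runner_utils.so \
  --entry-point-result=void
```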

If you want to build your own pipeline then you would have to lower gpu to nvvm for cuda (that handles the device side) and also lower gpu to cuda (which handles the host side).

Thank you so much for your reply.

Lowering gpu to nvvm has the pass -convert-gpu-to-nvvm. Is there also a pass for lowering gpu to cuda? I didn’t find one in mlir-opt --help or mlir-cuda-runner --help. Any suggestions on this? Thank you so much!

I missed your reply, so this comes a bit late. But for documentation’s sake: That pass is called --gpu-to-llvm as it lowers the gpu dialect to llvm with runtime calls.
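Putting the pieces together, a hand-rolled pipeline might look roughly like this (a sketch only: in some MLIR versions the NVVM conversion has to be nested inside the gpu.module via -pass-pipeline, and flag spellings change between revisions, so check `mlir-opt --help`):

```shell
mlir-opt annotated.mlir -convert-parallel-loops-to-gpu -gpu-kernel-outlining \
  | mlir-opt -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm)' \
  | mlir-opt -gpu-to-llvm -o lowered.mlir
```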

The naming is not ideal and this probably should be cleaned up.