How to lower scf.for to run on GPU with mlir-cuda-runner

rqtian · November 17, 2020, 6:19am

Hi all,

I want to run an scf.parallel loop on a GPU with mlir-cuda-runner. Can anyone help me with how to lower scf.parallel so that it runs with mlir-cuda-runner? Here is an example of the code:

%0 = alloc() : memref<10000xi32>
// initialization for %0 omitted
%1 = alloc() : memref<10000xi32>
%c0 = constant 0 : index
%c10000 = constant 10000 : index
%c1 = constant 1 : index
scf.parallel (%arg0) = (%c0) to (%c10000) step (%c1) {
    %2 = load %0[%arg0] : memref<10000xi32>
    %castarg = index_cast %arg0 : index to i32
    %3 = addi %2, %castarg : i32
    store %3, %1[%arg0] : memref<10000xi32>
}

Thank you in advance!

herhut · November 17, 2020, 11:12am

You first need to put mapping annotations onto the parallel loop to specify which iteration dimension is mapped to which of the available GPU resources (thread/block x/y/z).

There is a greedy pass that does this for you; see mlir/include/mlir/Dialect/GPU/ParallelLoopMapper.h for details. It is not exposed as a pass you can run via a textual pipeline at the moment, but you could add it to the Passes.td file.

Once you have annotations, you can use mlir-opt -convert-parallel-loops-to-gpu to lower this to gpu code. See mlir/test/Conversion/SCFToGPU/parallel_loop.mlir for an example.
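
For illustration, the loop from the question might look as follows once annotated by hand. This is a sketch, not output of the greedy mapper: it assumes the mapping attribute layout used in mlir/test/Conversion/SCFToGPU/parallel_loop.mlir around the time of this thread, and it assumes that processor = 0 denotes block x (an assumption based on the Processor enum in ParallelLoopMapper.h; check it against your checkout). The function name @main is hypothetical.

```mlir
// Sketch only: hand-annotated version of the loop from the question.
// Assumption: processor = 0 means "block x"; map and bound are identity maps.
func @main() {
  %0 = alloc() : memref<10000xi32>
  // initialization for %0 omitted
  %1 = alloc() : memref<10000xi32>
  %c0 = constant 0 : index
  %c10000 = constant 10000 : index
  %c1 = constant 1 : index
  scf.parallel (%arg0) = (%c0) to (%c10000) step (%c1) {
    %2 = load %0[%arg0] : memref<10000xi32>
    %castarg = index_cast %arg0 : index to i32
    %3 = addi %2, %castarg : i32
    store %3, %1[%arg0] : memref<10000xi32>
  } { mapping = [{processor = 0, map = affine_map<(d0) -> (d0)>, bound = affine_map<(d0) -> (d0)>}] }
  return
}
```

The single loop dimension is mapped to blocks rather than threads here because 10000 iterations would exceed the 1024 threads-per-block limit of CUDA devices. With an annotation like this in place, mlir-opt -convert-parallel-loops-to-gpu should have what it needs to rewrite the loop.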

rqtian · November 17, 2020, 4:47pm

Thank you so much for your reply. Another question: do we need to use convert-gpu-to-nvvm if we run on a CUDA GPU?

herhut · November 18, 2020, 10:51am

If you want to use the cuda runner, you can have a look at mlir/test/mlir-cuda-runner/all-reduce-xor.mlir for an example of how to invoke it. It will lower the gpu dialect correctly for you.

If you want to build your own pipeline, then you would have to lower gpu to nvvm for cuda (which handles the device side) and also lower gpu to cuda (which handles the host side).
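
To make the device/host split a bit more concrete, here is roughly what the loop from the question could look like as a gpu.launch, i.e. after the parallel loop has been converted to the gpu dialect. This is a hand-written sketch under the assumption that the loop dimension was mapped to block x; the actual pass output may differ (for example, it may contain extra ops that compute indices and bounds). The function name @main is hypothetical.

```mlir
// Sketch only: hand-written gpu.launch form of the loop from the question.
func @main() {
  %0 = alloc() : memref<10000xi32>
  // initialization for %0 omitted
  %1 = alloc() : memref<10000xi32>
  %c1 = constant 1 : index
  %c10000 = constant 10000 : index
  // Host side: the launch configuration (10000 blocks of 1 thread each).
  gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c10000, %grid_y = %c1, %grid_z = %c1)
             threads(%tx, %ty, %tz) in (%block_x = %c1, %block_y = %c1, %block_z = %c1) {
    // Device side: the former loop body, indexed by the block id.
    %2 = load %0[%bx] : memref<10000xi32>
    %castarg = index_cast %bx : index to i32
    %3 = addi %2, %castarg : i32
    store %3, %1[%bx] : memref<10000xi32>
    gpu.terminator
  }
  return
}
```

Once the launch body has been outlined into a separate kernel function, the device-side lowering (gpu to nvvm) applies to that kernel, while the host-side lowering replaces the launch itself with calls into the runtime.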

rqtian · November 18, 2020, 5:46pm

Thank you so much for your reply.

The gpu-to-nvvm lowering has the pass -convert-gpu-to-nvvm. Does the gpu-to-cuda lowering also have a pass? I didn’t find one in mlir-opt --help or mlir-cuda-runner --help. Do you have any suggestions? Thank you so much!

herhut · December 4, 2020, 10:41am

I missed your reply, so this comes a bit late. But for documentation’s sake: That pass is called --gpu-to-llvm as it lowers the gpu dialect to llvm with runtime calls.

The naming is not ideal and this probably should be cleaned up.
