Recently I’ve been working on assigning different workloads to multiple GPUs on my host using MLIR, so I wrote a piece of code to test the capability of gpu.set_default_device, shown below:
module attributes {gpu.container_module} {
  func.func @main() {
    %len = arith.constant 10000000000 : index
    %c0 = arith.constant 0 : index
    %c0i32 = arith.constant 0 : i32
    %c1i32 = arith.constant 1 : i32
    %c1 = arith.constant 1 : index
    %c100 = arith.constant 100 : index
    %c512 = arith.constant 512 : index

    gpu.set_default_device %c1i32
    %mem = memref.alloc(%len) : memref<?xi8>
    %dev = gpu.alloc (%len) : memref<?xi8>
    gpu.memcpy %dev, %mem : memref<?xi8>, memref<?xi8>
    gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
               threads(%tx, %ty, %tz) in (%block_x = %c512, %block_y = %c1, %block_z = %c1) {
      %val = memref.load %dev[%tx] : memref<?xi8>
      scf.for %arg0 = %c0 to %c100 step %c1 {
        %tmp = arith.addi %val, %val : i8
      }
      gpu.terminator
    }
    gpu.memcpy %mem, %dev : memref<?xi8>, memref<?xi8>
    gpu.dealloc %dev : memref<?xi8>

    gpu.set_default_device %c0i32
    %len1 = arith.constant 5000000000 : index
    %mem1 = memref.alloc(%len1) : memref<?xi8>
    %dev1 = gpu.alloc (%len1) : memref<?xi8>
    gpu.memcpy %dev1, %mem1 : memref<?xi8>, memref<?xi8>
    gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
               threads(%tx, %ty, %tz) in (%block_x = %c512, %block_y = %c1, %block_z = %c1) {
      %val = memref.load %dev1[%tx] : memref<?xi8>
      scf.for %arg0 = %c0 to %c100 step %c1 {
        %tmp = arith.addi %val, %val : i8
      }
      gpu.terminator
    }
    gpu.memcpy %mem1, %dev1 : memref<?xi8>, memref<?xi8>
    gpu.dealloc %dev1 : memref<?xi8>
    memref.dealloc %mem : memref<?xi8>
    memref.dealloc %mem1 : memref<?xi8>
    return
  }
}
There are two GPUs on the machine, each with 15 GB of memory. I first use gpu.set_default_device %c1i32 to assign a task to the second GPU, and then gpu.set_default_device %c0i32 to assign another task to the first GPU. However, the nvtop command shows that both tasks are executed on the second GPU, which means the second gpu.set_default_device does not take effect!
Is this a bug in gpu.set_default_device? If so, is there another way to achieve my goal?
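One thing I have been considering (a sketch of an idea, not a confirmed fix): in the pipeline below, -gpu-async-region and -gpu-to-llvm decide where the CUDA stream carrying the work is created, and if that stream was created before gpu.set_default_device %c0i32 executed, the second batch of work could stay on the second GPU. Writing the async dependencies by hand makes that ordering explicit; for the second task it might look like this (the %t0…%t5 token names are new here, and the kernel body is unchanged):

```mlir
gpu.set_default_device %c0i32
// Fresh token after the device switch, so the stream carrying the
// ops below is (hopefully) created on device 0.
%t0 = gpu.wait async
%dev1, %t1 = gpu.alloc async [%t0] (%len1) : memref<?xi8>
%t2 = gpu.memcpy async [%t1] %dev1, %mem1 : memref<?xi8>, memref<?xi8>
%t3 = gpu.launch async [%t2] blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
                 threads(%tx, %ty, %tz) in (%block_x = %c512, %block_y = %c1, %block_z = %c1) {
  // same kernel body as above
  gpu.terminator
}
%t4 = gpu.memcpy async [%t3] %mem1, %dev1 : memref<?xi8>, memref<?xi8>
%t5 = gpu.dealloc async [%t4] %dev1 : memref<?xi8>
gpu.wait [%t5]
```

Since these ops already carry explicit async tokens, this would replace what -gpu-async-region does for them; I'm not sure whether that pass handles a mix of explicit and implicit tokens, so treat this as an experiment rather than a known-good pattern.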
I’m using the llvmorg-16.0.6 release of LLVM, and the pass pipeline for running the above test code is pasted below:
mlir-opt test.mlir -gpu-kernel-outlining | \
mlir-opt -convert-scf-to-cf | \
mlir-opt -convert-arith-to-llvm | \
mlir-opt -convert-index-to-llvm | \
mlir-opt -convert-memref-to-llvm | \
mlir-opt -convert-vector-to-llvm | \
mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))' | \
mlir-opt -gpu-async-region | \
mlir-opt -gpu-to-llvm | \
mlir-opt -convert-func-to-llvm | \
mlir-opt -reconcile-unrealized-casts | \
mlir-cpu-runner --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_cuda_runtime.so --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_runner_utils.so --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_c_runner_utils.so --entry-point-result=void -O0
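Separately, as a sanity check (my own suggestion, using the standard CUDA_VISIBLE_DEVICES environment variable rather than anything MLIR-specific): pinning the whole process to one physical GPU confirms that each task runs correctly on either device in isolation, independent of gpu.set_default_device. The pin_to_gpu helper below is hypothetical, just a thin wrapper around the environment variable:

```shell
# pin_to_gpu N CMD...: run CMD with only physical GPU N visible.
# Inside the process, that GPU then shows up as device 0.
pin_to_gpu() {
  gpu="$1"; shift
  CUDA_VISIBLE_DEVICES="$gpu" "$@"
}

# e.g. pin the whole mlir-opt | ... | mlir-cpu-runner pipeline above
# to the second GPU:  pin_to_gpu 1 sh run_pipeline.sh
pin_to_gpu 1 sh -c 'echo "visible GPUs: $CUDA_VISIBLE_DEVICES"'
# prints: visible GPUs: 1
```

If the test behaves correctly under each pinning but not with the in-process switch, that would point at the gpu.set_default_device lowering rather than the kernels themselves.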