`gpu.set_default_device` does not work when specified multiple times

Recently I have been working on assigning different workloads to multiple GPUs on my host using MLIR. To that end, I wrote a piece of code to test the behavior of `gpu.set_default_device`, shown below:

module attributes {gpu.container_module} {
    func.func @main() {
        %len = arith.constant 10000000000 : index
        %c0 = arith.constant 0 : index
        %c0i32 = arith.constant 0 : i32
        %c1i32 = arith.constant 1 : i32
        %c1 = arith.constant 1 : index
        %c100 = arith.constant 100 : index
        %c512 = arith.constant 512 : index

        gpu.set_default_device %c1i32
        %mem = memref.alloc(%len) : memref<?xi8>
        %dev = gpu.alloc (%len) : memref<?xi8>
        gpu.memcpy %dev, %mem : memref<?xi8>, memref<?xi8>
        gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
                   threads(%tx, %ty, %tz) in (%block_x = %c512, %block_y = %c1, %block_z = %c1) {
            %val = memref.load %dev[%tx] : memref<?xi8>
            scf.for %arg0 = %c0 to %c100 step %c1 {
                %tmp = arith.addi %val, %val : i8
            }
            gpu.terminator
        }
        gpu.memcpy %mem, %dev : memref<?xi8>, memref<?xi8>
        gpu.dealloc %dev : memref<?xi8>

        gpu.set_default_device %c0i32
        %len1 = arith.constant 5000000000 : index
        %mem1 = memref.alloc(%len1) : memref<?xi8>
        %dev1 = gpu.alloc (%len1) : memref<?xi8>
        gpu.memcpy %dev1, %mem1 : memref<?xi8>, memref<?xi8>
        gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
                   threads(%tx, %ty, %tz) in (%block_x = %c512, %block_y = %c1, %block_z = %c1) {
            %val = memref.load %dev1[%tx] : memref<?xi8>
            scf.for %arg0 = %c0 to %c100 step %c1 {
                %tmp = arith.addi %val, %val : i8
            }
            gpu.terminator
        }
        gpu.memcpy %mem1, %dev1 : memref<?xi8>, memref<?xi8>
        gpu.dealloc %dev1 : memref<?xi8>

        memref.dealloc %mem : memref<?xi8>
        memref.dealloc %mem1 : memref<?xi8>
        return
    }
}

There are two GPUs on the machine, each with 15 GB of memory. I first use `gpu.set_default_device %c1i32` to assign a task to the second GPU, and then use `gpu.set_default_device %c0i32` to assign another task to the first GPU. However, the nvtop command shows that both tasks are executed on the second GPU, which means the second `gpu.set_default_device` does not take effect!

Is this a bug in `gpu.set_default_device`? If so, is there another way to achieve my goal?

I'm using LLVM version llvmorg-16.0.6, and the pass pipeline for running the above test code is pasted below:

mlir-opt test.mlir -gpu-kernel-outlining | \
mlir-opt -convert-scf-to-cf | \
mlir-opt -convert-arith-to-llvm | \
mlir-opt -convert-index-to-llvm | \
mlir-opt -convert-memref-to-llvm | \
mlir-opt -convert-vector-to-llvm | \
mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,reconcile-unrealized-casts,gpu-to-cubin))' | \
mlir-opt -gpu-async-region | \
mlir-opt -gpu-to-llvm | \
mlir-opt -convert-func-to-llvm | \
mlir-opt -reconcile-unrealized-casts | \
mlir-cpu-runner --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_cuda_runtime.so --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_runner_utils.so --shared-libs=${LLVM_BUILD_DIR}/lib/libmlir_c_runner_utils.so --entry-point-result=void -O0

The CUDA context is created as a singleton, and it only reads the defaultDevice value once: https://github.com/llvm/llvm-project/blob/main/mlir/lib/ExecutionEngine/CudaRuntimeWrappers.cpp#L96
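For reference, here is roughly what that wrapper does (a paraphrased sketch of the relevant part of CudaRuntimeWrappers.cpp with error checking elided, not the verbatim source). The primary context is cached in a function-local static, so defaultDevice is only consulted the first time any GPU runtime call runs; later gpu.set_default_device calls update the global but never the cached context:

#include "cuda.h"
#include <cstdint>

static int32_t defaultDevice = 0;

// Lowering target of gpu.set_default_device: it only updates a global.
extern "C" void mgpuSetDefaultDevice(int32_t device) { defaultDevice = device; }

// Every runtime wrapper (alloc, memcpy, launch, ...) enters a ScopedContext.
// The CUcontext is a function-local static, so the lambda runs exactly once
// and reads defaultDevice at that single point in time.
class ScopedContext {
public:
  ScopedContext() {
    static CUcontext context = [] {
      cuInit(/*flags=*/0);
      CUdevice device;
      cuDeviceGet(&device, /*ordinal=*/defaultDevice); // read once, ever
      CUcontext ctx;
      cuDevicePrimaryCtxRetain(&ctx, device);
      return ctx;
    }();
    cuCtxPushCurrent(context);
  }
  ~ScopedContext() { cuCtxPopCurrent(nullptr); }
};

This explains the observed behavior: the first GPU op runs while defaultDevice is 1, so the context is created on the second GPU, and every later op reuses it.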

We had an RFC, "Proposal to add stream/queue as an optional argument to few GPU dialect ops", for an explicit context/stream/queue argument on GPU ops to allow interleaving execution on multiple devices, but it didn't get much traction.

Thanks a lot for your reply! Your answer helps me better understand the underlying mechanism!

After reading the discussion you mentioned, I do think the abstraction of a stream or queue is of vital importance. When I tried to overlap streams to improve performance, the current way of using !gpu.async.token to create streams did not seem to work: comparing the execution times of the single-stream and multi-stream versions showed no difference. (I created a related topic earlier, but no one replied.)

Back to the current issue, your answer raises a new question for me, described below:

Since the CUDA context is created as a singleton, I decided to rewrite the test code into the following form:

module attributes {gpu.container_module} {
    func.func @test(%idx : i32) {
        gpu.set_default_device %idx
        %len = arith.constant 5000000000 : index
        %c0 = arith.constant 0 : index
        %c1 = arith.constant 1 : index
        %c100 = arith.constant 100 : index
        %c512 = arith.constant 512 : index
        %mem = memref.alloc(%len) : memref<?xi8>
        %dev = gpu.alloc (%len) : memref<?xi8>
        gpu.memcpy %dev, %mem : memref<?xi8>, memref<?xi8>
        gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
                   threads(%tx, %ty, %tz) in (%block_x = %c512, %block_y = %c1, %block_z = %c1) {
            %val = memref.load %dev[%tx] : memref<?xi8>
            scf.for %arg0 = %c0 to %c100 step %c1 {
                %tmp = arith.addi %val, %val : i8
            }
            gpu.terminator
        }
        gpu.memcpy %mem, %dev : memref<?xi8>, memref<?xi8>
        gpu.dealloc %dev : memref<?xi8>

        memref.dealloc %mem : memref<?xi8>
        return
    }
}

It is essentially the same as the one above, except that the device id is now taken as an argument of the func.func. I then wrote a main.cpp that calls the test function from two CPU threads:

#include <iostream>
#include <thread>

extern "C" {
    // C interface wrapper emitted for func.func @test by
    // the -llvm-request-c-wrappers pass.
    void _mlir_ciface_test(int idx);
}

int main() {
    // Run the MLIR test function on two CPU threads, one per device id.
    std::thread first(_mlir_ciface_test, 0);
    std::thread second(_mlir_ciface_test, 1);

    first.join();
    second.join();
    return 0;
}

I added the -llvm-request-c-wrappers pass to make the test function callable from main.cpp. However, the two tasks still run on the same GPU.

Could you please tell me why this happens? If the creation of the CUDA context is not tied to the CPU thread, what is it tied to?

The singleton is not a thread-local variable; there is no reason that threading would give you a different context here.
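A minimal standalone illustration of why (a hypothetical example, not code from the MLIR runtime): in C++, a function-local static is initialized exactly once, by whichever thread reaches it first, and every other thread then observes that same value:

#include <cstdio>
#include <thread>

// Plays the role of the cached CUcontext: initialized once by the first
// caller, then shared by every thread in the process.
int initOnce(int requested) {
  static int chosen = requested; // only the first caller's argument counts
  return chosen;
}

int main() {
  std::thread a([] { std::printf("thread a sees device %d\n", initOnce(0)); });
  std::thread b([] { std::printf("thread b sees device %d\n", initOnce(1)); });
  a.join();
  b.join();
  // Both threads print the same ordinal; which one wins depends on which
  // thread performed the initialization first.
}

So in the two-thread experiment, both calls share the single cached context, just as both regions of the original single-threaded test did.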

I see, thank you! So there is currently no way to support multiple GPUs? Or is related support in progress?

There is no way right now to use multiple GPUs with this particular runtime implementation. Work in this direction would be welcome, I think, but I am not aware of anyone investing in improving it at the moment. Most downstream users have their own runtime mapping (see IREE as an example).

Your reply is really helpful, thanks a lot! I will check out IREE later for more information. :smiling_face:

As a potential suggestion, we could add gpu.create_stream %deviceNum : index -> !gpu.async.token, which starts off an async chain and lowers to stream creation, thus making set_default_device rather pointless.
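On the runtime side, such an op could plausibly lower to a wrapper along these lines (entirely hypothetical; mgpuCreateStreamOnDevice does not exist in CudaRuntimeWrappers.cpp, and the name is made up for illustration):

#include "cuda.h"
#include <cstdint>

// Hypothetical wrapper: retain the primary context of the requested device
// and create a stream in it, instead of relying on a single cached context.
// Error checking elided.
extern "C" CUstream mgpuCreateStreamOnDevice(int32_t ordinal) {
  CUdevice device;
  cuDeviceGet(&device, ordinal);
  CUcontext ctx;
  cuDevicePrimaryCtxRetain(&ctx, device);
  cuCtxPushCurrent(ctx);
  CUstream stream;
  cuStreamCreate(&stream, CU_STREAM_NON_BLOCKING);
  cuCtxPopCurrent(nullptr);
  return stream;
}

Each GPU op taking the resulting !gpu.async.token would then push that stream's context before issuing work, which is what would let two devices run concurrently.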
