'cuStreamSynchronize(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS' on MLIR CUDA regression test

I am compiling MLIR like this:

cmake -G Ninja ../llvm \
   -DCMAKE_BUILD_TYPE=Release \
   -DCMAKE_C_COMPILER=clang-11 \
   -DCMAKE_CXX_COMPILER=clang++-11 \
   -DLLVM_ENABLE_PROJECTS=mlir \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="X86;NVPTX;AMDGPU" \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DLLVM_ENABLE_LLD=ON \
   -DLLVM_INCLUDE_EXAMPLES=ON \
   -DMLIR_ENABLE_CUDA_RUNNER=ON \
   -DMLIR_ENABLE_CUDA_CONVERSIONS=ON \
   -DMLIR_INCLUDE_TESTS=ON \
   -DMLIR_INCLUDE_INTEGRATION_TESTS=ON

The build completes without errors. However, when I run the tests, every test in llvm-project/mlir/test/Integration/GPU/CUDA fails. For example:

FAIL: MLIR :: Integration/GPU/CUDA/all-reduce-and.mlir (604 of 1035)
******************** TEST 'MLIR :: Integration/GPU/CUDA/all-reduce-and.mlir' FAILED ********************
Script:
--
: 'RUN: at line 1';   $HOME/opt/llvm-project/build/bin/mlir-opt $HOME/opt/llvm-project/mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir    -gpu-kernel-outlining    -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)'    -gpu-to-llvm  | $HOME/opt/llvm-project/build/bin/mlir-cpu-runner    --shared-libs=$HOME/opt/llvm-project/build/lib/libmlir_cuda_runtime.so    --shared-libs=$HOME/opt/llvm-project/build/lib/libmlir_runner_utils.so    --entry-point-result=void  | $HOME/opt/llvm-project/build/bin/FileCheck $HOME/opt/llvm-project/mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir
--
Exit Code: 1

Command Output (stderr):
--
'cuStreamSynchronize(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuStreamDestroy(stream)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
'cuModuleUnload(module)' failed with 'CUDA_ERROR_ILLEGAL_ADDRESS'
$HOME/opt/llvm-project/mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir:64:12: error: CHECK: expected string not found in input
 // CHECK: [0, 2]
           ^
<stdin>:1:1: note: scanning from here
Unranked Memref base@ = 0x31e84b0 rank = 1 offset = 0 sizes = [2] strides = [1] data = 
^
<stdin>:2:8: note: possible intended match here
[53934704, 0]
       ^

Input file: <stdin>
Check file: $HOME/opt/llvm-project/mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir

-dump-input=help explains the following input dump.

Input was:
<<<<<<
            1: Unranked Memref base@ = 0x31e84b0 rank = 1 offset = 0 sizes = [2] strides = [1] data =  
check:64'0     X~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ error: no match found
            2: [53934704, 0] 
check:64'0     ~~~~~~~~~~~~~~
check:64'1            ?       possible intended match
>>>>>>
--
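
A single failing test can be rerun on its own with the `llvm-lit` wrapper that the build places in `bin/`. This is a minimal sketch, assuming the source and build directories shown in the log above; `LLVM_SRC_DIR` and `LLVM_BUILD_DIR` are hypothetical variable names chosen for this sketch, not something the build defines.

```shell
# Assumption: paths match the ones in the log above; override via the environment.
LLVM_SRC_DIR="${LLVM_SRC_DIR:-$HOME/opt/llvm-project}"
LLVM_BUILD_DIR="${LLVM_BUILD_DIR:-$LLVM_SRC_DIR/build}"
TEST_FILE="$LLVM_SRC_DIR/mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir"
LIT="$LLVM_BUILD_DIR/bin/llvm-lit"

if [ -x "$LIT" ]; then
    # -v prints the RUN script and the captured output of any failing test.
    "$LIT" -v "$TEST_FILE"
    LIT_STATUS=$?
else
    # Nothing to run; report where the wrapper was expected.
    echo "llvm-lit not found at $LIT (set LLVM_BUILD_DIR)" >&2
    LIT_STATUS=127
fi
echo "lit exit status: $LIT_STATUS"
```

Passing the directory `mlir/test/Integration/GPU/CUDA` instead of a single file runs the whole CUDA integration group.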

I tested with the all-reduce-and.mlir code and it works if I comment out this line:

    %reduced = "gpu.all_reduce"(%val) ({}) { op = "and" } : (i32) -> (i32)

It looks as though gpu.all_reduce cannot access the memory where the matrix is stored.
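
To look at what the lowering produces, the RUN line from the log can be split into two stages, dropping FileCheck so the raw runner output stays visible. This is only a sketch: the flags are copied from the RUN line above, while `LLVM_SRC_DIR`, `LLVM_BUILD_DIR`, and `LOWERED` are hypothetical names introduced here.

```shell
# Assumption: paths match the ones in the log above; override via the environment.
LLVM_SRC_DIR="${LLVM_SRC_DIR:-$HOME/opt/llvm-project}"
LLVM_BUILD_DIR="${LLVM_BUILD_DIR:-$LLVM_SRC_DIR/build}"
TEST_FILE="$LLVM_SRC_DIR/mlir/test/Integration/GPU/CUDA/all-reduce-and.mlir"
LOWERED="${TMPDIR:-/tmp}/all-reduce-and.lowered.mlir"

if [ -x "$LLVM_BUILD_DIR/bin/mlir-opt" ]; then
    # Stage 1: outline the kernel, lower it to NVVM, embed the cubin,
    # and save the result so it can be inspected.
    if "$LLVM_BUILD_DIR/bin/mlir-opt" "$TEST_FILE" \
            -gpu-kernel-outlining \
            -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm,gpu-to-cubin)' \
            -gpu-to-llvm > "$LOWERED"; then
        # Stage 2: execute the lowered module with the CUDA runtime wrappers.
        "$LLVM_BUILD_DIR/bin/mlir-cpu-runner" "$LOWERED" \
            --shared-libs="$LLVM_BUILD_DIR/lib/libmlir_cuda_runtime.so" \
            --shared-libs="$LLVM_BUILD_DIR/lib/libmlir_runner_utils.so" \
            --entry-point-result=void
        echo "mlir-cpu-runner exit status: $?"
    fi
else
    echo "mlir-opt not found under $LLVM_BUILD_DIR/bin (set LLVM_BUILD_DIR)" >&2
fi
```

Running the same two stages on a copy of the test with the `gpu.all_reduce` line commented out gives a second lowered file to compare against.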

How do I fix this?

OS:

Linux 5.4.0-1030-gcp #32-Ubuntu SMP 2020 x86_64 GNU/Linux

Compiler:

Ubuntu clang version 11.0.0-2~ubuntu20.04.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

GPU:

Thu Jul 15 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.05    Driver Version: 450.51.05    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:04.0 Off |                    0 |
| N/A   47C    P8    31W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 00000000:00:05.0 Off |                    0 |
| N/A   72C    P8    34W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

Thanks in advance.

Hello, have you solved this problem?

Take a look at https://bugs.llvm.org/show_bug.cgi?id=51107.

The behavior you are seeing may be caused by the relatively old GPU you are using (a Tesla K80). The bug report above describes a possible mitigation.