[OpenMP offloading] Getting `"CUDA" error: Unrecognized "CUDA" error code 4`

My program is:

#include "omp.h"

int main() {
  int a[100], b[100], c[100];
  int i;
  for (i = 0; i < 100; i++) {
    a[i] = i;
    b[i] = i;
  }
  #pragma omp parallel for
  for (i = 0; i < 100; i++) {
    c[i] = a[i] + b[i];
  }
  return 0;
}

The build script is:

set -e
export LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE="tmp.ir"
export LIBOMPTARGET_JIT_OPT_LEVEL="3"
# /usr/lib/llvm-16/lib/
export LD_LIBRARY_PATH="/usr/local/lib"
/usr/local/bin/clang-18 -fopenmp -fopenmp-targets=nvptx64 --libomptarget-nvptx-bc-path="/usr/local/lib" omp_test.c -o omp_test
./omp_test

The llvm is built with:

cmake  -G Ninja ../llvm \
   -DCUDAToolkit_INCLUDE_DIRECTORIES="/usr/local/cuda-11.8" \
   -DLLVM_ENABLE_PROJECTS="clang" \
   -DLLVM_ENABLE_RUNTIMES="openmp" \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="Native;NVPTX;AMDGPU" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DCMAKE_C_COMPILER=clang-16 \
   -DCMAKE_CXX_COMPILER=clang++-16 \
   -DLLVM_USE_LINKER=lld-16 \
   -DLLVM_CCACHE_BUILD=ON
#    -DLLVM_USE_SANITIZER="Address;Undefined"

I got the following error:

Thread 46, Total threads 56
Thread 7, Total threads 56
Thread 28, Total threads 56
Thread 31, Total threads 56
Thread 29, Total threads 56
Thread 18, Total threads 56
Thread 43, Total threads 56
Thread 30, Total threads 56
Thread 41, Total threads 56
Thread 27, Total threads 56
Thread 37, Total threads 56
"CUDA" error: Unrecognized "CUDA" error code 4
"CUDA" error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
"CUDA" error: Unrecognized "CUDA" error code 4
"CUDA" error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
"CUDA" error: Unrecognized "CUDA" error code 4
"CUDA" error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
"CUDA" error: Unrecognized "CUDA" error code 4
"PluginInterface" error: Failed to deinitialize plugin: Error in cuCtxSetCurrent: Unknown error

I can’t reproduce this using upstream. Just looking at the error message, however, I can make an educated guess at the underlying problem. This is likely a dynamic loader issue: libcuda.so has already been destroyed by the environment before libomptarget.rtl.cuda.so can finish its cleanup. I’d guess this is just another symptom of our runtime library having a completely arbitrary teardown order across the (at least) three shared objects involved. I’ve been planning on fixing that, but I keep getting distracted by other things.

Does this problem persist if you use the main branch of LLVM instead? Also, as a heads-up, in LLVM 19 the libomptarget libraries have moved to lib/<host-triple>.

Before we jump to that, could we check a few things first?

@syheliel Do all threads print their result?
Does the problem occur if you don’t pipe in a custom kernel module, i.e., could the module itself be broken? (printf is a weird beast in the CUDA device toolchain.)

~ J

Thanks @jdoerfert. Yes, printf works as normal. I changed the program to:

#include <omp.h>
#include <stdio.h>

int main() {
    int A[100], B[100], C[100];

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        A[i] = i;
        B[i] = i * 2;
    }
    printf("initialized A and B\n");

    #pragma omp target teams distribute parallel for
    for (int i = 0; i < 100; i++) {
        C[i] = A[i] + B[i];
    }

    return 0;
}

The output shows that the failure happens at the #pragma omp target teams distribute parallel for line:

initialized A and B
"CUDA" error: Unrecognized "CUDA" error code 4
"CUDA" error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
"CUDA" error: Unrecognized "CUDA" error code 4
"CUDA" error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
"CUDA" error: Unrecognized "CUDA" error code 4
"CUDA" error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
"CUDA" error: Unrecognized "CUDA" error code 4
"PluginInterface" error: Failed to deinitialize plugin: Error in cuCtxSetCurrent: Unknown error

@jhuber6 Thanks for the hint. I will try the latest clang-19.

@jhuber6 @jdoerfert Thanks for your guidance! I have successfully resolved the problem by updating LLVM and changing the CUDA path when building LLVM. Here is my build script now:

cmake  -G Ninja ../llvm \
   -DCUDAToolkit_INCLUDE_DIRECTORIES="/usr/local/cuda-12.2" \
   -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra;compiler-rt" \
   -DLLVM_ENABLE_RUNTIMES="openmp" \
   -DLLVM_BUILD_EXAMPLES=ON \
   -DLLVM_TARGETS_TO_BUILD="Native;NVPTX" \
   -DCMAKE_BUILD_TYPE=Release \
   -DLLVM_ENABLE_ASSERTIONS=ON \
   -DCMAKE_C_COMPILER=clang-16 \
   -DCMAKE_CXX_COMPILER=clang++-16 \
   -DLLVM_USE_LINKER=lld-16 \
   -DLLVM_CCACHE_BUILD=ON

My compile command:

set -e
export LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE="tmp.ir"
export LIBOMPTARGET_JIT_OPT_LEVEL="3"
export LD_LIBRARY_PATH="/usr/local/lib/x86_64-unknown-linux-gnu/:$LD_LIBRARY_PATH"
/usr/local/bin/clang-19 -fopenmp -fopenmp-targets=nvptx64 --libomptarget-nvptx-bc-path="/usr/local/lib" omp_test.c -o omp_test
./omp_test

By the way, is it possible to dump the IR generated by OMP offloading? I have set LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE and LIBOMPTARGET_JIT_OPT_LEVEL, but no IR file is generated.

Great, I’m roughly 95% confident you were hitting a library teardown error, because those are exactly the types of errors I saw when triggering it intentionally. That situation will hopefully get better in the future.

The JIT path is only used if you passed -fopenmp-target-jit at compile time. Generally the way to get the IR is to use -save-temps: the clang-linker-wrapper will spit out a file called something like <exe>.nvptx64-nvidia-cuda.sm_75.postopt.bc, and clang itself emits <input>-openmp-nvptx64-nvidia-cuda-sm_75.bc.

Thanks, -save-temps -fopenmp-target-jit works like magic :wink:

/usr/local/bin/clang-19 -save-temps -fopenmp-target-jit -fopenmp -fopenmp-targets=nvptx64  omp_test.c -o omp_test

There is no need for -save-temps; -fopenmp-target-jit is a standalone option that works fine on its own.

Can you run with LIBOMPTARGET_DEBUG=1 and LIBOMPTARGET_INFO=-1 and post the output? No JIT needed, just the simplest program that fails.
EDIT: Never mind, seems to work now. It was likely just picking up wrong libraries.

(FWIW, you might want to look into explicit memory management rather than using stack arrays.)
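As a minimal sketch of what that could look like (not the only way to do it), using the standard OpenMP device memory routines omp_target_alloc / omp_target_memcpy / omp_target_free together with an is_device_ptr clause; all variable names here are made up for illustration:

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  int dev = omp_get_default_device();
  int host = omp_get_initial_device();
  size_t bytes = 100 * sizeof(int);

  int *A = (int *)malloc(bytes);
  for (int i = 0; i < 100; i++)
    A[i] = i;

  /* Allocate device memory explicitly and copy A over. */
  int *dA = (int *)omp_target_alloc(bytes, dev);
  int *dC = (int *)omp_target_alloc(bytes, dev);
  omp_target_memcpy(dA, A, bytes, /*dst_offset=*/0, /*src_offset=*/0, dev, host);

  /* Use the raw device pointers directly; no implicit mapping involved. */
  #pragma omp target teams distribute parallel for is_device_ptr(dA, dC)
  for (int i = 0; i < 100; i++)
    dC[i] = 2 * dA[i];

  int *C = (int *)malloc(bytes);
  omp_target_memcpy(C, dC, bytes, 0, 0, host, dev);
  printf("C[10] = %d\n", C[10]);

  omp_target_free(dA, dev);
  omp_target_free(dC, dev);
  free(A);
  free(C);
  return 0;
}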

@jdoerfert Here is my start script; I put LIBOMPTARGET_DEBUG and LIBOMPTARGET_INFO in there:

set -e
export LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE="tmp.ir"
export LIBOMPTARGET_JIT_OPT_LEVEL="3"
export LIBOMPTARGET_DEBUG=1
export LIBOMPTARGET_INFO=-1
export LD_LIBRARY_PATH="/usr/local/lib/x86_64-unknown-linux-gnu/:$LD_LIBRARY_PATH"
/usr/local/bin/clang-19 -g -fopenmp -fopenmp-targets=nvptx64  1.c -o omp_test
./omp_test

I also changed my code to use malloc:

#include <stdlib.h> // For malloc and free
#include <omp.h>
#include <stdio.h>

int main() {
    int *A = (int *)malloc(100 * sizeof(int));
    int *B = (int *)malloc(100 * sizeof(int));
    int *C = (int *)malloc(100 * sizeof(int));

    if (A == NULL || B == NULL || C == NULL) {
        // Handle memory allocation failure
        return 1;
    }

    #pragma omp parallel for
    for (int i = 0; i < 100; i++) {
        A[i] = i;
        B[i] = i * 2;
    }
    printf("initialized A and B\n");

    #pragma omp target teams distribute parallel for map(tofrom:A[0:100],B[0:100],C[0:100]) 
    for (int i = 0; i < 100; i++) {
        C[i] = A[i] + B[i];
    }

    free(A);
    free(B);
    free(C);

    return 0;
}

The output is:

initialized A and B

Also, when I put -fopenmp-target-jit into the compile flags, I got:

initialized A and B
ptxas fatal   : Optimized debugging not supported
"PluginInterface" error: Failure to jit IR image 0x56065e92e8e0 on device 0: Running 'ptxas' failed: 

omp_test: /home/zzh/llvm-project/openmp/libomptarget/plugins-nextgen/common/src/PluginInterface.cpp:1684: int32_t __tgt_rtl_load_binary(int32_t, __tgt_device_image *, __tgt_device_binary *): Assertion `Image != nullptr && "Invalid Image"' failed.
Aborted (core dumped)

Is this the right way to use explicit memory management?

This is a classic error. For whatever reason, NVIDIA just hard-errors if it finds debug symbols instead of simply ignoring them. I’ve lost track of the 50 places we try to work around this in the compiler. Generally the easiest way around it is to not use device debugging: you can use -Xarch_host -g to pass the -g flag only to the host compilation. Either that, or set the JIT optimization level to zero.
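For example, adapting the compile command from your earlier script (install path and source file name assumed from above), host-only debug info would look roughly like this:

/usr/local/bin/clang-19 -Xarch_host -g -fopenmp -fopenmp-targets=nvptx64 -fopenmp-target-jit omp_test.c -o omp_test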

Looks fine to me. The runtime will take the base pointers and the sizes, then allocate a new device address range of the same size for each. We then put the host base pointer and the newly created device pointer into a big table so they can be looked up later.
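As a small illustration of that lookup table (a sketch, assuming a runtime new enough to implement the OpenMP 5.1 routine omp_get_mapped_ptr):

#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main() {
  int *A = (int *)malloc(100 * sizeof(int));
  int dev = omp_get_default_device();

  /* Entering the data region allocates device memory for A[0:100] and
     records the (host pointer -> device pointer) association. */
  #pragma omp target enter data map(alloc:A[0:100])

  /* omp_get_mapped_ptr looks that association up (OpenMP 5.1). */
  void *dev_A = omp_get_mapped_ptr(A, dev);
  printf("host %p maps to device %p\n", (void *)A, dev_A);

  #pragma omp target exit data map(delete:A[0:100])
  free(A);
  return 0;
}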

@jhuber6 With --save-temps I see the following 3 IR files:
1-openmp-nvptx64-nvidia-cuda.bc
omp_test-jit-nvptx64-nvidia-cuda.bc
omp_test.nvptx64-nvidia-cuda.sm_75.postlink.bc

Could you please summarize their uses and at which stage each is created?

Also, should the option --print-wrapped-module work?
clang-19 -Xarch_host -g -save-temps --print-wrapped-module -fopenmp-target-jit -fopenmp -fopenmp-targets=nvptx64 file.c -o omp_test

Based on llvm-project/clang/tools/clang-linker-wrapper/ClangLinkerWrapper.cpp at b274b23665dec30f3ae4fb83ccca8b77e6d3ada3 · llvm/llvm-project · GitHub, it should work.

If yes, then what causes the following error for me?
clang-19: error: unknown argument: '--print-wrapped-module'

1-openmp-nvptx64-nvidia-cuda.bc is the one that clang creates when it first compiles your C / C++ to LLVM-IR. (Is your source file called 1.c?)

omp_test-jit-nvptx64-nvidia-cuda.bc is the final linked LLVM-IR module. It’s what the clang-linker-wrapper embeds into your program for the runtime to execute. Because you are using JIT mode, we output linked LLVM-IR instead of an executable image.

omp_test.nvptx64-nvidia-cuda.sm_75.postlink.bc is the post-LTO-linking output. Because we’re doing JIT here, it’s pretty much the same as the above.

Generally these either come from clang or the clang-linker-wrapper. If you only want the clang temps, use -c. If you only want the wrapper temps, use -Wl,--save-temps.
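For example, reusing the compile command from earlier in the thread (paths and flags as above), the two cases would look roughly like:

# Only the clang temps (stop before linking):
/usr/local/bin/clang-19 -c -save-temps -fopenmp -fopenmp-targets=nvptx64 omp_test.c
# Only the wrapper temps:
/usr/local/bin/clang-19 -fopenmp -fopenmp-targets=nvptx64 -fopenmp-target-jit -Wl,--save-temps omp_test.c -o omp_test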

That’s a clang-linker-wrapper option, not a clang option. The linker wrapper is basically the linker, so you can pass arguments to it in the standard way, i.e. via -Wl,. That being said, I don’t know why you would ever want to use that option as a user; it’s only for testing the output.
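So, taking your earlier command as the starting point, passing it through to the wrapper would look something like:

clang-19 -Xarch_host -g -save-temps -Wl,--print-wrapped-module -fopenmp-target-jit -fopenmp -fopenmp-targets=nvptx64 file.c -o omp_test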

I hit a similar error while working on something else. Stepping through with a debugger confirmed that the libomptarget teardown code was being executed after _dl_fini had already unloaded the dynamic objects, resulting in calls into a destroyed CUDA interface. [Offload] Change unregister library to use `atexit` instead of destructor by jhuber6 · Pull Request #86830 · llvm/llvm-project · GitHub fixes the issue I was running into; I’m unsure whether it will address yours as well.
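For anyone curious why switching to atexit changes the ordering, here is a tiny standalone sketch. The behavior shown is typical glibc ordering and is not guaranteed by the standard: handlers registered with atexit during main run from exit() before _dl_fini tears down dynamic objects, while __attribute__((destructor)) functions run as part of _dl_fini itself.

#include <stdio.h>
#include <stdlib.h>

/* Runs during _dl_fini, i.e., while dynamic objects are being torn down. */
__attribute__((destructor)) static void elf_destructor(void) {
  puts("ELF destructor");
}

/* Runs from exit(), before _dl_fini on a typical glibc system. */
static void atexit_handler(void) {
  puts("atexit handler");
}

int main(void) {
  atexit(atexit_handler);
  return 0;
  /* Typical glibc output:
       atexit handler
       ELF destructor */
}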