[GSoC 2024] GPU Delta Debugging

Description

LLVM-reduce and similar tools perform delta debugging, but they are less useful when many implicit constraints exist and violating them easily leads to errors similar to the one being isolated. This project is about developing a GPU-aware version, especially for execution-time bugs, that can be used in conjunction with LLVM/OpenMP GPU record-and-replay, or simply a GPU loader script, to minimize GPU test cases more efficiently and effectively.

Expected outcomes

A tool to reduce GPU errors without losing the original error. Optionally, other properties could be the focus of the reduction, not only errors.

Confirmed mentors and their contacts

@jdoerfert

Required / desired skills

Required:

  • Good understanding of C++
  • Familiarity with GPUs and LLVM-IR

Desired:

  • Compiler knowledge including data flow and control flow analysis is a plus.
  • Experience with debugging and bug reduction techniques (llvm-reduce) is helpful

Size of the project

medium

An easy, medium or hard rating if possible

medium


Hey @jdoerfert, I am fairly competent in C/C++ and have a good understanding of LLVM-IR. While I am familiar with shader programming and GPU internals, it would be helpful if you could clarify what level of GPU knowledge would be required. Apart from that, I would be grateful if you could point me towards some reading material / PRs to deepen my understanding of the issue.

If you have worked with GPUs that's probably fine. The LLVM-IR angle is more important since you'll need to modify the IR as you reduce it.

I would suggest you look at record and replay (try it out with upstream and/or read up on it: https://dl.acm.org/doi/pdf/10.1145/3581784.3607098). This is a good way to isolate GPU kernels.
You'll also want to combine that with the JIT capabilities to get an IR version of the kernel you can modify, and then replay; see the LLVM/OpenMP Runtimes documentation (LLVM/OpenMP 19.0.0git) and the JIT-related env vars around it.

Let me know if this allows you to try stuff out, e.g., run a kernel, record it as IR, modify the IR, replay it with the modified IR. You can, for example, add a trap into the kernel as your modification to see if it worked.
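In shell terms, that experiment might look roughly like the sketch below. It assumes the JIT environment variables from the LLVM/OpenMP runtimes documentation (LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE to dump the kernel IR, LIBOMPTARGET_JIT_REPLACEMENT_MODULE to substitute a modified module) and that the application is built with offload LTO so the runtime sees LLVM-IR; file names and flags are illustrative:

# 1. Build with offload LTO so the device JIT gets LLVM-IR, not an object.
clang -fopenmp --offload-arch=native -foffload-lto omp_test.c -o omp_test
# 2. Record: run once and dump the kernel's IR module to a file.
LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE=kernel.ll ./omp_test
# 3. Modify kernel.ll by hand, e.g., insert a call to @llvm.trap().
# 4. Replay: run again with the modified module swapped in.
LIBOMPTARGET_JIT_REPLACEMENT_MODULE=kernel.ll ./omp_test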


I think one good thing is that llvm-reduce can also regenerate LLVM LTO metadata after reduction. This may be helpful, as OMP JIT can also be done on LTO-created LLVM-IR.

OMP JIT works fine (almost only) with LTO-created LLVM-IR.
I'm unsure what metadata we need to create in llvm-reduce.

I see.
I was under the impression that the way to do delta debugging is the following:
LTO → reduce → JIT
However, after reading the record-and-replay paper, it seems to be:
capture IR to be reduced → reduce → LTO → JIT

Kindly pardon my ignorance as I am yet to try RR and look around the code.

Hey @jdoerfert, had a few doubts.

  1. I read the record-and-replay paper and I understand it quite well now. I had a doubt about the Replay Validation part, where you talk about validating the replay run by bitwise comparison of memory. You said it can signal false positives if the kernel is nondeterministic by nature. Could you please elaborate on this a little more?
  2. As you instructed, I got the record-and-replay implementation from upstream, which I believe is present here: llvm-project/openmp/libomptarget/tools/kernelreplay, but I can't seem to record my OpenMP files using the flag -fopenmp-record. Can you please guide me on how to properly build OpenMP and then use its record-and-replay feature to get access to the IR of a kernel?
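For reference, a typical cmake configure for a clang build that includes the OpenMP runtime looks roughly like the sketch below; the project and target lists are illustrative and depend on your LLVM version and GPU:

cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_ENABLE_RUNTIMES="openmp" \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX"
ninja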

I am getting this error when I try to build llvm with the OpenMP runtime:

[ 96%] Performing build step for 'runtimes'
[ 0%] Built target libomp-needed-headers
[ 0%] Building CXX object openmp/runtime/src/CMakeFiles/omp.dir/kmp_alloc.cpp.o
In file included from /home/shogo/master/dev/low/llvm-project/openmp/runtime/src/kmp_alloc.cpp:13:
/home/shogo/master/dev/low/llvm-project/openmp/runtime/src/kmp.h:80:10: fatal error: 'limits' file not found
   80 | #include <limits>
      |          ^~~~~~~~
1 error generated.
make[5]: *** [openmp/runtime/src/CMakeFiles/omp.dir/build.make:76: openmp/runtime/src/CMakeFiles/omp.dir/kmp_alloc.cpp.o] Error 1
make[4]: *** [CMakeFiles/Makefile2:1043: openmp/runtime/src/CMakeFiles/omp.dir/all] Error 2
make[3]: *** [Makefile:136: all] Error 2
make[2]: *** [runtimes/CMakeFiles/runtimes.dir/build.make:89: runtimes/runtimes-stamps/runtimes-build] Error 2
make[1]: *** [CMakeFiles/Makefile2:110882: runtimes/CMakeFiles/runtimes.dir/all] Error 2
make: *** [Makefile:156: all] Error 2

Perhaps try installing libstdc++-12-dev, but I feel building clang and libcxx and using those may be better. I am struggling with a similar problem.


Also, would it be necessary to have an AMD / NVIDIA GPU? I have an integrated graphics card on my PC; will that work?

Hi @jdoerfert. I have some experience with LLVM and have learned some basics about OpenMP, OpenACC, and clang that I can devote to this project. I'm currently stuck on recording the kernel; here is the script I use:

set -e
# Dump the pre-optimization IR module the device JIT sees.
export LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE="tmp.ir"
# JIT optimization level for the device code.
export LIBOMPTARGET_JIT_OPT_LEVEL="3"
clang-16 -fopenmp -fopenmp-targets=nvptx64 --libomptarget-nvptx-bc-path="/usr/lib/llvm-16/lib" omp_test.c -o omp_test
./omp_test

I'm using an NVIDIA V100, but I got the following error:

CUDA error: Unrecognized CUDA error code 4
CUDA error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
CUDA error: Unrecognized CUDA error code 4
"PluginInterface" error: Failed to deinitialize plugin: Error in cuCtxSetCurrent: Unknown error

I believe it's caused by the old clang version, and I'm going to build llvm-18 on my local machine later. Two questions:

  1. Am I on the right track for recording a kernel?
  2. I see OpenMC and two proxy apps are used in your paper for evaluation, but I've checked the codebases and they are too huge to kickstart with. Could you recommend some small benchmarks so that I can experiment more easily?

Let's discuss the error in the other thread.
Wrt. a simple kernel, you should probably just write something in source. Later you can go to XSBench or other "real" proxy apps.
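Something as small as the following would do as a first target; the file name is illustrative, and --offload-arch=native assumes you build on the machine with the GPU:

cat > vecadd.c <<'EOF'
#include <stdio.h>
int main(void) {
  double a[1024];
  // A trivial kernel: each iteration writes one element on the device.
  #pragma omp target teams distribute parallel for map(from: a[0:1024])
  for (int i = 0; i < 1024; ++i)
    a[i] = 2.0 * i;
  printf("a[512] = %f\n", a[512]); // expect 1024.000000
  return 0;
}
EOF
clang -fopenmp --offload-arch=native -foffload-lto vecadd.c -o vecadd
./vecadd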

If the integrated card is an Intel, we can't support that right now. I'm unsure what the status of (our) GCloud machines with GPUs is right now. You'd need AMD/NVIDIA GPU access somewhere.

This looks like the clang you build as part of the LLVM build is not able to find the C++ standard library. It should just use what your compiler (the one used for the LLVM build) uses. You can set the gcc toolchain (in cmake), or use config files, e.g., in your home folder, to point to the right paths. To debug the problem, don't build openmp at all, but first try out the clang you get on C++ code to make sure it works.
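A quick way to check that, as a sketch (the --gcc-toolchain value is an assumption; point it wherever a GCC with C++ headers lives):

printf '#include <limits>\nint main() { return 0; }\n' > /tmp/t.cpp
# The freshly built clang should compile plain C++ on its own ...
./build/bin/clang++ /tmp/t.cpp -o /tmp/t
# ... and if it cannot find <limits>, pointing it at a full GCC install
# usually helps:
./build/bin/clang++ --gcc-toolchain=/usr /tmp/t.cpp -o /tmp/t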

@jdoerfert Thanks for your guidance. I have learned about llvm-reduce by reading the docs and the presentation from the LLVM Dev Mtg. In my understanding, it applies different passes to modify the original IR/MIR, and a test script checks that the modified IR still behaves the same.

I have some experience with fuzzing, so this reminds me of afl-tmin, a tool provided by AFL to minimize the input while maintaining the original execution trace :slight_smile: I think that delta debugging shares the same logic as fuzzing. All we need to do is design some mutation/generation rules for OpenMP's IR and find a good way to evaluate it.
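The evaluation side could look roughly like an llvm-reduce interestingness test. The sketch below assumes the JIT replacement variable mentioned earlier and a greppable error message; both will differ in practice, and paths may need to be absolute since llvm-reduce runs the test from temporary directories:

cat > interesting.sh <<'EOF'
#!/bin/sh
# llvm-reduce hands us the candidate module as $1; exit 0 iff the
# original failure still reproduces when we replay with it.
LIBOMPTARGET_JIT_REPLACEMENT_MODULE="$1" ./omp_test 2>&1 | grep -q "CUDA error"
EOF
chmod +x interesting.sh
llvm-reduce --test=./interesting.sh kernel.ll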

But I'm not quite familiar with OpenMP in clang, so I'm wondering what the internal representation of an OpenMP program is. Is it the clang AST? Or some kind of IR?

You donā€™t have to. The project deals 99% of the time with IR or the C++ in the offload runtime.

@jdoerfert I've written some OpenMP programs and dumped the generated IR. Now I'm trying to find some interesting tests for GPU errors (like race conditions) and to see what the bug looks like at the LLVM-IR level. Is there a test benchmark containing files with multi-threading bugs?

There are things like "DataRaceBench" on GitHub, IIRC. That said, race bugs are hard to reduce. I'd look into a segfault, e.g., accessing non-mapped memory.
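A minimal sketch of such a reducible failure, assuming the default (non-unified) memory model so the host pointer is invalid on the device:

cat > segv.c <<'EOF'
#include <stdlib.h>
int main(void) {
  int *p = malloc(sizeof(int));
  // 'p' is deliberately not mapped, so the target region dereferences
  // a host address on the device; on most configurations this faults,
  // giving a nicely reproducible error to reduce.
  #pragma omp target
  { *p = 42; }
  return 0;
}
EOF
clang -fopenmp --offload-arch=native segv.c -o segv
./segv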