[GSoC 2024] GPU Delta Debugging

Description

LLVM-reduce and similar tools perform delta debugging, but they are less useful when many implicit constraints exist and violating them easily leads to errors similar to the one being isolated. This project is about developing a GPU-aware version, especially for execution-time bugs, that can be used in conjunction with LLVM/OpenMP GPU record-and-replay, or simply a GPU loader script, to minimize GPU test cases more efficiently and effectively.

Expected outcomes

A tool to reduce GPU errors without losing the original error. Optionally, other properties could be the focus of the reduction, not only errors.

Confirmed mentors and their contacts

@jdoerfert

Required / desired skills

Required:

  • Good understanding of C++
  • Familiarity with GPUs and LLVM-IR

Desired:

  • Compiler knowledge including data flow and control flow analysis is a plus.
  • Experience with debugging and bug reduction techniques (llvm-reduce) is helpful

Size of the project

medium

An easy, medium or hard rating if possible

medium


Hey @jdoerfert, I am fairly competent in C/C++ and have a good understanding of LLVM-IR. While I am familiar with shader programming and GPU internals, it would be helpful if you could clarify what level of GPU knowledge would be required. Apart from that, I would be grateful if you could point me towards some reading material / PRs to deepen my understanding of the issue.

If you have worked with GPUs that's probably fine. The LLVM-IR angle is more important since you'll need to modify the IR as you reduce it.

I would suggest you look at record and replay (try it out with upstream and/or read up on it: https://dl.acm.org/doi/pdf/10.1145/3581784.3607098). This is a good way to isolate GPU kernels.
You'll also want to combine that with the JIT capabilities to get an IR version of the kernel you can modify, and then replay; see the LLVM/OpenMP Runtimes documentation (LLVM/OpenMP 19.0.0git) and the JIT-related env vars around it.

Let me know if this allows you to try stuff out, e.g., run a kernel, record it as IR, modify the IR, replay it with the modified IR. You can, for example, add a trap into the kernel as your modification to see if it worked.
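In shell terms, that experiment might look roughly like the sketch below. It assumes the JIT environment variables from the LLVM/OpenMP runtimes documentation (LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE to dump the kernel IR, LIBOMPTARGET_JIT_REPLACEMENT_MODULE to substitute a modified module) and that the application is built with offload LTO so the runtime sees LLVM-IR; file names and flags are illustrative:

# 1. Build with offload LTO so the device JIT gets LLVM-IR, not an object.
clang -fopenmp --offload-arch=native -foffload-lto omp_test.c -o omp_test
# 2. Record: run once and dump the kernel's IR module to a file.
LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE=kernel.ll ./omp_test
# 3. Modify kernel.ll by hand, e.g., insert a call to @llvm.trap().
# 4. Replay: run again with the modified module swapped in.
LIBOMPTARGET_JIT_REPLACEMENT_MODULE=kernel.ll ./omp_test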


I think one good thing is that llvm-reduce can also regenerate LLVM LTO metadata after reduction. This may be helpful, as OMP JIT can also be done on LTO-created LLVM-IR.

OMP JIT works fine (almost only) with LTO-created LLVM-IR.
I'm unsure what metadata we need to create in llvm-reduce.

I see.
I was under the impression that the way to do delta debugging is the following:
LTO → reduce → JIT
However, after reading the record-and-replay paper, it seems to be:
capture IR to be reduced → reduce → LTO → JIT

Kindly pardon my ignorance as I am yet to try RR and look around the code.

Hey @jdoerfert, had a few doubts.

  1. I read the record-and-replay paper and I understand it quite well now. I had a doubt about the Replay Validation part, where you talk about validating the replay run by bitwise comparison of memory. You said it can signal false positives if the kernel is nondeterministic by nature. Could you please elaborate on this a little more?
  2. As you instructed, I got the record-and-replay implementation from upstream, which I believe is present here: llvm-project/openmp/libomptarget/tools/kernelreplay, but I can't seem to record my OpenMP files using the flag -fopenmp-record. Can you please guide me on how to properly build OpenMP and then use its record-and-replay feature to get access to the IR of a kernel?
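For reference, a typical cmake configure for a clang build that includes the OpenMP runtime looks roughly like the sketch below; the project and target lists are illustrative and depend on your LLVM version and GPU:

cmake -G Ninja ../llvm \
  -DCMAKE_BUILD_TYPE=Release \
  -DLLVM_ENABLE_PROJECTS="clang;lld" \
  -DLLVM_ENABLE_RUNTIMES="openmp" \
  -DLLVM_TARGETS_TO_BUILD="X86;NVPTX"
ninja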

I am getting this error when I try to build llvm with the OpenMP runtime:

[ 96%] Performing build step for 'runtimes'
[ 0%] Built target libomp-needed-headers
[ 0%] Building CXX object openmp/runtime/src/CMakeFiles/omp.dir/kmp_alloc.cpp.o
In file included from /home/shogo/master/dev/low/llvm-project/openmp/runtime/src/kmp_alloc.cpp:13:
/home/shogo/master/dev/low/llvm-project/openmp/runtime/src/kmp.h:80:10: fatal error: 'limits' file not found
   80 | #include <limits>
      |          ^~~~~~~~
1 error generated.
make[5]: *** [openmp/runtime/src/CMakeFiles/omp.dir/build.make:76: openmp/runtime/src/CMakeFiles/omp.dir/kmp_alloc.cpp.o] Error 1
make[4]: *** [CMakeFiles/Makefile2:1043: openmp/runtime/src/CMakeFiles/omp.dir/all] Error 2
make[3]: *** [Makefile:136: all] Error 2
make[2]: *** [runtimes/CMakeFiles/runtimes.dir/build.make:89: runtimes/runtimes-stamps/runtimes-build] Error 2
make[1]: *** [CMakeFiles/Makefile2:110882: runtimes/CMakeFiles/runtimes.dir/all] Error 2
make: *** [Makefile:156: all] Error 2

Perhaps try installing libstdc++-12-dev, but I feel building clang and libcxx and using those may be better. I am struggling with a similar problem.


Also, would it be necessary to have an AMD / NVIDIA GPU? I have an integrated graphics card on my PC; will that work?

Hi @jdoerfert. I have some experience with LLVM and have learned some basics about OpenMP, OpenACC, and clang that I can devote to this project. I'm currently stuck on recording the kernel; here is the script I use:

set -e
# Dump the pre-optimization IR module the device JIT sees.
export LIBOMPTARGET_JIT_PRE_OPT_IR_MODULE="tmp.ir"
# JIT optimization level for the device code.
export LIBOMPTARGET_JIT_OPT_LEVEL="3"
clang-16 -fopenmp -fopenmp-targets=nvptx64 --libomptarget-nvptx-bc-path="/usr/lib/llvm-16/lib" omp_test.c -o omp_test
./omp_test

I'm using an NVIDIA V100, but I got the following error:

CUDA error: Unrecognized CUDA error code 4
CUDA error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
CUDA error: Unrecognized CUDA error code 4
"PluginInterface" error: Failed to deinitialize plugin: Error in cuCtxSetCurrent: Unknown error

I believe it's caused by the old clang version, and I'm going to build llvm-18 on my local machine later. Two questions:

  1. Am I on the right track for recording a kernel?
  2. I see OpenMC and two proxy apps are used in your paper for evaluation, but I've checked the codebases and they are too huge to kickstart with. Could you recommend some small benchmarks so that I can experiment more easily?

Let's discuss the error in the other thread.
Wrt. a simple kernel, you should probably just write something in source. Later you can go to XSBench or other "real" proxy apps.
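Something as small as the following would do as a first target; the file name is illustrative, and --offload-arch=native assumes you build on the machine with the GPU:

cat > vecadd.c <<'EOF'
#include <stdio.h>
int main(void) {
  double a[1024];
  // A trivial kernel: each iteration writes one element on the device.
  #pragma omp target teams distribute parallel for map(from: a[0:1024])
  for (int i = 0; i < 1024; ++i)
    a[i] = 2.0 * i;
  printf("a[512] = %f\n", a[512]); // expect 1024.000000
  return 0;
}
EOF
clang -fopenmp --offload-arch=native -foffload-lto vecadd.c -o vecadd
./vecadd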

If the integrated card is an Intel, we can't support that right now. I'm unsure what the status of (our) GCloud machines with GPUs is right now. You'd need AMD/NVIDIA GPU access somewhere.

This looks like the clang you build as part of the LLVM build is not able to find the C++ standard library. It should just use what your compiler (the one used for the LLVM build) uses. You can set the gcc toolchain (in cmake), or use config files, e.g., in your home folder, to point to the right paths. To debug the problem, don't build openmp at all, but first try out the clang you get on C++ code to make sure it works.
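A quick way to check that, as a sketch (the --gcc-toolchain value is an assumption; point it wherever a GCC with C++ headers lives):

printf '#include <limits>\nint main() { return 0; }\n' > /tmp/t.cpp
# The freshly built clang should compile plain C++ on its own ...
./build/bin/clang++ /tmp/t.cpp -o /tmp/t
# ... and if it cannot find <limits>, pointing it at a full GCC install
# usually helps:
./build/bin/clang++ --gcc-toolchain=/usr /tmp/t.cpp -o /tmp/t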

@jdoerfert Thanks for your guidance. I have learned about llvm-reduce by reading the docs and the presentation from the LLVM Dev Mtg. In my understanding, it applies different passes to modify the original IR/MIR, and a test script checks that the modified IR still behaves the same.

I have some experience with fuzzing, so this reminds me of afl-tmin, a tool provided by AFL to minimize the input while maintaining the original execution trace :slight_smile: I think that delta debugging shares the same logic as fuzzing. All we need to do is design some mutation/generation rules for OpenMP's IR and find a good way to evaluate it.
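The evaluation side could look roughly like an llvm-reduce interestingness test. The sketch below assumes the JIT replacement variable mentioned earlier and a greppable error message; both will differ in practice, and paths may need to be absolute since llvm-reduce runs the test from temporary directories:

cat > interesting.sh <<'EOF'
#!/bin/sh
# llvm-reduce hands us the candidate module as $1; exit 0 iff the
# original failure still reproduces when we replay with it.
LIBOMPTARGET_JIT_REPLACEMENT_MODULE="$1" ./omp_test 2>&1 | grep -q "CUDA error"
EOF
chmod +x interesting.sh
llvm-reduce --test=./interesting.sh kernel.ll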

But I'm not quite familiar with OpenMP in clang, so I'm wondering what the internal representation of an OpenMP program is. Is it the clang AST? Or some kind of IR?

You donā€™t have to. The project deals 99% of the time with IR or the C++ in the offload runtime.

@jdoerfert I've written some OpenMP programs and dumped the generated IR. Now I'm trying to find some interesting tests for GPU errors (like race conditions) and to see what the bug looks like at the LLVM-IR level. Is there a test benchmark containing files with multi-threading bugs?

There are things like "DataRaceBench" on GitHub, IIRC. That said, race bugs are hard to reduce. I'd look into a segfault, e.g., accessing non-mapped memory.
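A minimal sketch of such a reducible failure, assuming the default (non-unified) memory model so the host pointer is invalid on the device:

cat > segv.c <<'EOF'
#include <stdlib.h>
int main(void) {
  int *p = malloc(sizeof(int));
  // 'p' is deliberately not mapped, so the target region dereferences
  // a host address on the device; on most configurations this faults,
  // giving a nicely reproducible error to reduce.
  #pragma omp target
  { *p = 42; }
  return 0;
}
EOF
clang -fopenmp --offload-arch=native segv.c -o segv
./segv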