LLVM-reduce and similar tools perform delta debugging, but they are less useful when many implicit constraints exist and violating them can easily lead to errors resembling the one being isolated. This project is about developing a GPU-aware version, especially for execution-time bugs, that can be used in conjunction with LLVM/OpenMP GPU record-and-replay, or simply a GPU loader script, to minimize GPU test cases more efficiently and effectively.
Expected outcomes
A tool to reduce GPU errors without losing the original error. Optionally, other properties could be the focus of the reduction, not only errors.
Hey @jdoerfert, I am fairly competent in C/C++ and have a good understanding of LLVM IR. While I am familiar with shader programming and GPU internals, it would be helpful if you could clarify what level of knowledge of these would be required. Apart from that, I would be grateful if you could point me towards some reading material / PRs to deepen my understanding of the issue.
Let me know if this allows you to try stuff out, e.g., run a kernel, record it as IR, modify the IR, replay it with the modified IR. You can, for example, add a trap into the kernel as your modification to see if it worked.
I think one good thing is that llvm-reduce can also regenerate LLVM LTO metadata after reduction. This may be helpful, as the OpenMP JIT can also operate on LTO-created LLVM IR.
I see.
I was under the impression that the way to do delta debugging is the following:
LTO → Reduce → JIT
However, after reading the record-and-replay paper, it seems to be:
capture the IR to be reduced → reduce → LTO → JIT
Kindly pardon my ignorance as I am yet to try RR and look around the code.
I read the record-and-replay paper and I understand it quite well now. I have a doubt about the Replay Validation part, where you talk about validating the replay run by bitwise comparison of memory. You said it can signal false positives if the kernel is nondeterministic by nature. Could you please elaborate on this a little more?
As you instructed, I got the record-and-replay implementation from upstream, which I believe is present here: llvm-project/openmp/libomptarget/tools/kernelreplay, but I can't seem to record my OpenMP files using the flag -fopenmp-record. Can you please guide me on how to properly build OpenMP and then use its record-and-replay feature to get access to the IR of a kernel?
Hi @jdoerfert. I have some experience with LLVM and have learned some basics of OpenMP, OpenACC, and Clang to devote to this project. I'm currently stuck on recording the kernel; here is the script I use:
I'm using an NVIDIA V100, but I got the following error:
CUDA error: Unrecognized CUDA error code 4
CUDA error: Failure to free memory: Error in cuCtxSetCurrent: Unknown error
CUDA error: Unrecognized CUDA error code 4
"PluginInterface" error: Failed to deinitialize plugin: Error in cuCtxSetCurrent: Unknown error
I believe it's caused by the old version of clang, and I'm going to build llvm-18 on my local machine later. Two questions:
Am I on the right track for recording the kernel?
I see OpenMC and two proxy apps are used in your paper for evaluation, but I've checked the codebases and they are too large to start with. Could you recommend some small benchmarks so that I can experiment more easily?
Let's discuss the error in the other thread.
Wrt. a simple kernel, you should probably just write something in source. Later you can move on to XSBench or other "real" proxy apps.
If the integrated card is an Intel one, we can't support that right now. I'm unsure what the status of (our) GCloud machines with GPUs is right now. You'd need AMD/NVIDIA GPU access somewhere.
This looks like the clang you built as part of the LLVM build is not able to find the C++ standard libraries. It should just use whatever your compiler (the one used for the LLVM build) uses. You can set the GCC toolchain (in CMake), or use config files, e.g., in your home folder, to point to the right paths. To debug the problem, don't build openmp at all; first try out the clang you get on C++ code to make sure it works.
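A sketch of that debugging step, assuming a build tree at `./build` and a GCC install at `/usr/local/gcc-12` (both placeholder paths for your own setup):

```shell
# First check the freshly built clang on plain C++, without openmp.
cat > hello.cpp <<'EOF'
#include <iostream>
int main() { std::cout << "ok\n"; }
EOF
./build/bin/clang++ hello.cpp -o hello && ./hello

# If headers or libstdc++ are not found, point clang at the GCC
# toolchain that was used to build LLVM, either per invocation ...
./build/bin/clang++ --gcc-toolchain=/usr/local/gcc-12 hello.cpp -o hello

# ... or via a clang config file (e.g. passed with --config, or
# picked up from a default location) containing the same flag:
echo '--gcc-toolchain=/usr/local/gcc-12' > my.cfg
./build/bin/clang++ --config my.cfg hello.cpp -o hello
```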
@jdoerfert Thanks for your guidance. I have learned about llvm-reduce by reading the docs and the presentation from the LLVM Dev Meeting. In my understanding, it applies different passes to modify the original IR/MIR, while a test script checks that the modified IR still behaves the same.
I have some experience with fuzzing, so it reminds me of afl-tmin, a tool provided by AFL to minimize an input while maintaining the original execution trace. I think delta debugging shares the same logic as fuzzing: all we need to do is design some mutation/generation rules for OpenMP's IR and find a good way to evaluate the result.
But I'm not quite familiar with OpenMP in Clang, so I'm wondering what the internal representation of an OpenMP program is. Is it the Clang AST, or some kind of IR?
@jdoerfert I've written some OpenMP programs and dumped the generated IR. Now I'm trying to find some interesting tests for GPU errors (like race conditions) and to see what the bug looks like at the LLVM IR level. Is there a test benchmark containing files with multi-threading bugs?
There are things like "DataRaceBench" on GitHub, IIRC. That said, race bugs are hard to reduce. I'd look into a segfault, e.g., accessing non-mapped memory.