After looking through some material on profiling and the existing benchmark code, what would be the benefits of writing test infrastructure in C++ (like in the libc/benchmarks directory) over something like a Python script that runs clang and then calls a CLI tool to do the profiling (e.g. nsys)? The CLI tool should handle all the cycle counting and measurements that were done manually in testing frameworks like google-benchmark, right? Or is there something else that the C++ infrastructure makes possible?
I made a hacky fix by treating redhat the same way as OpenSUSE since the comments indicate a similar issue. I just started running a full LLVM build on the HPC cluster - hopefully everything works out.
You will need to write C++ to stimulate the functions regardless. Testing this stuff accurately would likely require some size sweeping / divergence testing like Google benchmark. The results from nsys or rocprof are not going to have high resolution for specific functions because they are built around checking runtime calls into the offloading runtime.
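For illustration, a GPU-side size sweep might look something like the sketch below, run as a normal program through one of the loader utilities. Everything here is hypothetical: the choice of memcpy, the sizes, the repetition count, and the use of clang's __builtin_readcyclecounter() as the timer are my assumptions, and a real harness would also vary the inputs per lane to exercise divergence.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Keep results observable so the compiler cannot delete the timed work.
static volatile char sink;

// Hypothetical helper: time a single memcpy of `size` bytes. Assumes
// __builtin_readcyclecounter() lowers to the GPU's cycle counter.
static uint64_t bench_memcpy_once(char *dst, const char *src, size_t size) {
  uint64_t start = __builtin_readcyclecounter();
  memcpy(dst, src, size);
  uint64_t end = __builtin_readcyclecounter();
  sink = dst[0];
  return end - start;
}

int main() {
  static char src[4096];
  static char dst[4096];
  // Size sweep, similar in spirit to google-benchmark's Range().
  for (size_t size = 1; size <= sizeof(src); size *= 4) {
    uint64_t best = ~0ull;
    for (int rep = 0; rep < 16; ++rep) {
      uint64_t cycles = bench_memcpy_once(dst, src, size);
      if (cycles < best)
        best = cycles;
    }
    printf("memcpy %zu bytes: %llu cycles (best of 16)\n", size,
           (unsigned long long)best);
  }
  return 0;
}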
That's great! If you get it working, please open a PR and tag me or someone else from the libc team on it.
That makes sense. I also opened a PR for the redhat fix, though it looks like GitHub isn't allowing me to request reviewers (the gear icon is missing for some odd reason).
Hello, I'm quite green when it comes to LLVM. As far as I understand, this project is about writing a program that can measure the performance of C library functions when called from the GPU. Am I understanding that right?
I have some programming experience with C++, but I have had no exposure to GPU programming or LLVM, let alone writing any microbenchmark programs. Would it still be possible for me to get accepted to contribute to this project, and if so, where should I start?
Thank you very much!
I think it would be difficult without at least basic knowledge of how GPU programming works or of computer architecture. This particular project doesn't have much to do with LLVM itself, since it's related more closely to language runtimes. I wouldn't say it's impossible, but it would definitely be difficult, since most of the allotted time would likely be taken up by learning the basics rather than working on the project. You're free to apply if you want, however; I don't really know how the process works in depth.
When benchmarking, would you want to compare the time taken on the GPU itself (i.e. the time the function takes just to run on the GPU), or would you compare the time from the host (i.e. the time taken including transferring data to/from the GPU) when comparing against the host libc? I'm assuming you can measure both accurately, but I was wondering which would be better for comparison to other benchmarks.
We don't care about runtime overhead as we're only interested in GPU runtime. I think we will want both: information gathered from the vendor's profiling tools, e.g. nsys and rocprof, and also some microbenchmarking done simply using the processor clock and very careful hacks. One problem with microbenchmarking on the GPU is that memory latency is extremely high, so if anything goes to main memory it will pretty much dominate everything else.
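As a rough sketch of what "the processor clock and very careful hacks" could mean in practice, here is one way to time a pure-compute function such as sinf while keeping the operands in registers, so memory latency never lands inside the timed region. The use of __builtin_readcyclecounter() and the overhead-subtraction trick are my assumptions, not existing libc code.

#include <math.h>
#include <stdint.h>
#include <stdio.h>

// Rough sketch: time a chain of dependent sinf calls using only registers so
// no memory traffic ends up inside the timed region. Assumes
// __builtin_readcyclecounter() reads the GPU's cycle counter.
int main() {
  const int iters = 1024;
  volatile float seed = 0.5f; // volatile so the input is not constant-folded.
  float x = seed;

  // Measure the empty timing overhead first so it can be subtracted.
  uint64_t t0 = __builtin_readcyclecounter();
  uint64_t t1 = __builtin_readcyclecounter();
  uint64_t overhead = t1 - t0;

  uint64_t start = __builtin_readcyclecounter();
  for (int i = 0; i < iters; ++i)
    x = sinf(x); // Each call depends on the previous result (serialized).
  uint64_t end = __builtin_readcyclecounter();

  volatile float keep = x; // Keep the result live.
  (void)keep;

  uint64_t cycles = (end - start) - overhead;
  printf("sinf: ~%llu cycles per call\n",
         (unsigned long long)(cycles / iters));
  return 0;
}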
Also, AMDGPU has llvm-mca support, so we can compare with that. I remember doing my initial round of microbenchmarking and comparing against llvm-mca to see if my results made sense.
Another interesting question is lane utilization within the GPU's warps, i.e. "what % of my threads in each warp are active during execution?" However, I can't think of a way to measure that without some instrumentation passes in LLVM.
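For a single, hand-instrumented point in a kernel there is at least a manual way to sample this on AMDGPU; a sketch follows, assuming 64-wide wavefronts and clang's __builtin_amdgcn_read_exec() builtin. Doing this automatically across arbitrary benchmarks is where an instrumentation pass would still be needed.

#include <stdint.h>
#include <stdio.h>

// Manual lane-utilization probe for AMDGPU (assumes wave64). Call it inside a
// divergent region to see how many lanes of the wavefront are active there.
static inline uint32_t active_lanes() {
  uint64_t exec = __builtin_amdgcn_read_exec(); // Current execution mask.
  return (uint32_t)__builtin_popcountll(exec);
}

int main() {
  uint32_t id = __builtin_amdgcn_workitem_id_x();
  if (id % 3 == 0) {
    // Only a third of the lanes take this branch, so utilization drops here.
    printf("active lanes in branch: %u / 64\n", active_lanes());
  }
  return 0;
}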
By main memory do you mean the host's RAM? Or GPU memory?
An unrelated question I had was how clang-gpu fits into the project. After building gpu-libc, I've successfully used it with both OpenMP and direct compilation/loader utilities. IIUC, clang-gpu won't help with libc benchmarking (I'm thinking it would be best to use direct compilation and then a loader for that part), but I'm not sure where exactly clang-gpu fits in. Did you have any ideas?
I mean any read or write done by the GPU; this could be pinned memory, shared memory, constant memory, etc. They're all slow in their own ways.
Which clang-gpu are you referring to here? This project just uses a standard build of clang.
I think clang-gpu was the one referred to in one of the papers and the post for GPU-first ([OpenMP][GSoC 2024] Improve GPU First Framework - #2 by Polygonalr); it took any generic host code and wrapped it in an OpenMP construct to offload to the GPU. I think it was used for running the test-suite on the GPU.
Ah, yeah. That's not used by libc currently. They cover different cases.
I guess I was confused, then, about when clang-gpu should be used. What case was it specifically designed to cover? Just applications like running the entire test-suite on the GPU?
Yes, it was a research project to automatically convert whole programs to run on the GPU. That means having the compiler insert the magic required to do any of the RPC calls.
I'm starting to put together a proposal and was a little confused on a couple of points:
- Would it make more sense to have the benchmarking utility run mainly on the host, with the actual execution/timing taking place on the GPU? I was thinking that this "offloading" could be done by something like OpenMP, CUDA kernels, or commands that compile code for the GPU and use nvptx-loader on select portions. This could help with managing a lot of the divergence testing and size sweeping in gbenchmark, although an alternative approach would be to just compile the entire benchmark infrastructure for the GPU and run it directly with nvptx-loader.
- I'm still a little unclear on how to provide an officially supported configuration for testing. If clang-gpu can already handle the RPC calls for code originally written for the CPU, why can't we compile binaries with something like CLANG_CC=clang-gpu and pass those binaries to llvm-lit? IIUC, each lit test has its own compilation commands in the form of RUN: clang -o test.o test.c. Is there something obvious I'm missing?
There are many angles we could go for here. Any sort of microbenchmarking to get very accurate cycle counts will need to take place on the GPU itself. Stuff like kernel timing and resource usage can just be done with rocprof or nsys.
The loader utilities realistically just launch a kernel called main. If we just make main call some random function, then each file can be benchmarked regardless. If we want instrumentation (i.e. similar to profile-guided optimization), we could just put some globals in the device code and copy them back in the loader.
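To make that concrete, a benchmark file in this scheme might look something like the sketch below: main just calls the function under test, and a plain global acts as the instrumentation counter. The names, the choice of strlen, and the cycle-count builtin are my own assumptions; copying the global back would also need a small loader change, so here it is simply printed from the device instead.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Global instrumentation data that a (modified) loader could copy back to the
// host after the kernel finishes; here we just print it from the device.
static uint64_t total_cycles = 0;
static uint64_t total_calls = 0;

// The "random function" under test, here strlen, exercised from main.
static void bench_strlen(const char *str) {
  uint64_t start = __builtin_readcyclecounter();
  volatile size_t len = strlen(str); // volatile so the call is kept.
  uint64_t end = __builtin_readcyclecounter();
  (void)len;
  total_cycles += end - start;
  total_calls += 1;
}

// The loader utilities just launch a kernel named main, so this whole file is
// one benchmark that can be run with, e.g., nvptx-loader.
int main() {
  // volatile pointers so the compiler cannot constant-fold the strlen calls.
  static const char *volatile inputs[] = {"", "a", "hello world",
                                          "a considerably longer test string"};
  for (const char *input : inputs)
    bench_strlen(input);
  printf("strlen: %llu cycles over %llu calls\n",
         (unsigned long long)total_cycles, (unsigned long long)total_calls);
  return 0;
}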
The easy way is to just pretend that these are single applications. Microbenchmarks could either be heavyweight applications that basically implement something like google-benchmark on the GPU, printing and all, or we could just make main minimal and use the vendor tools like above. clang-gpu isn't really related here; it's not upstream, and the libc repository has everything it needs to compile C code, more or less. Right now I'm just imagining this the same way we handle the unit tests, i.e.
add_custom_target(foo_benchmark COMMAND nvptx-loader bench)
Would this approach correctly handle RPC calls for the libc functions that need it? Or should we just ignore any tests that would break on the GPU without RPC?
RPC is just an implementation detail; for practical purposes, just pretend that you can call puts on the GPU and everything works.
Got it. Just out of curiosity, how is the call implemented? I remember watching the talk and have a general idea of what the code is supposed to do, but I guess I'm confused about how exactly the compiled code knows when it needs to call back to the host. Does the compiler insert some code when the required functions are called? And which process runs the RPC server listener?