[libc][GSoC 2024] Performance and testing in the GPU libc

After looking through some material on profiling and the existing benchmark code, what would be the benefits of writing test infrastructure in C++ (like in the libc/benchmarks directory) over something like a Python script that runs clang and then calls a CLI tool to do the profiling (e.g. nsys)? The CLI tool should handle all the cycle counting and measurements that were done manually in testing frameworks like google-benchmark, right? Or is there something else that the C++ infrastructure makes possible?

I made a hacky fix by treating redhat the same way as OpenSUSE since the comments indicate a similar issue. I just started running a full LLVM build on the HPC cluster - hopefully everything works out.

You will need to write C++ to exercise the functions regardless. Testing this stuff accurately would likely require some size sweeping / divergence testing like google-benchmark does. The results from nsys or rocprof are not going to have high resolution for specific functions because those tools are built around tracking calls into the offloading runtime.
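As a rough illustration of the divergence-testing side, a sweep could look something like the sketch below: plain C++ compiled straight for the GPU target, where lane_id and the report callback are hypothetical placeholders for whatever harness ends up existing, and expf is just a stand-in for the function under test.

#include <math.h>
#include <stdint.h>

// Rough sketch of a divergence sweep: only the first `active` lanes of each
// warp/wavefront call the function under test, so we can see how the cost
// changes as fewer lanes do useful work. `lane_id` and `report` are
// hypothetical; a real harness would provide them.
void divergence_sweep(uint32_t lane_id, float input,
                      void (*report)(uint32_t active, float result)) {
  for (uint32_t active = 1; active <= 32; active *= 2) {
    float result = 0.0f;
    if (lane_id < active) // the remaining lanes sit masked off
      result = expf(input);
    report(active, result);
  }
}

A size sweep for the string/memory functions would follow the same shape, just varying a buffer length instead of the number of active lanes.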

That's great. If you get it working, please open a PR and tag me or someone else from the libc team on it.

That makes sense. I also opened a PR for the redhat fix, though it looks like GitHub isn't allowing me to request reviewers (the gear icon is missing for some odd reason).

Hello, LLVM green here. As far as I understand, this project is about writing a program that can measure the performance of C library functions when called from the GPU. Am I understanding that right?

I have some programming experience with C++, but I have had no exposure to GPU programming or LLVM, let alone writing any microbenchmark programs. Would it still be possible for me to get accepted to contribute to this project, and if so, where should I start?

Thank you very much

I think it would be difficult without at least basic knowledge of how GPU programming works or of computer architecture. This particular project doesn't have much to do with LLVM since it's related more closely to language runtimes. I wouldn't say it's impossible, but it would definitely be difficult since most of the allotted time would likely be taken up by learning the basics rather than working on the project. You're free to apply if you want, however; I don't really know how the process works in depth.

When benchmarking, would you want to compare the time taken on the GPU itself (i.e. the time the function takes just to run on the GPU), or would you compare the time from the host (i.e. the time taken including transferring data to/from the GPU) when comparing against the host libc? I'm assuming you can actually measure both accurately, but I was wondering which would be better for comparison to other benchmarks.

We don't care about runtime overhead, as we're only interested in GPU runtime. I think we will want both information gathered from the vendors' profiling tools (e.g. nsys and rocprof) and some microbenchmarking done simply using the processor clock and very careful hacks. One problem with microbenchmarking on the GPU is that memory latency is extremely high, so if anything goes to main memory it will pretty much dominate everything else.
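For a sense of what the "processor clock and careful hacks" approach might look like, here is a minimal sketch. It assumes clang's __builtin_readcyclecounter() lowers to the GPU's cycle counter (treat that as an assumption), uses expf purely as a stand-in for the function under test, and keeps the result out of memory until after the timed region so the store latency doesn't dominate the measurement.

#include <math.h>
#include <stdint.h>

// Times a single call in cycles. The empty asm statements act as compiler
// barriers so the call is not hoisted above the first read or sunk below the
// second; this is approximate rather than exact.
static uint64_t time_one_call(float input, float *sink) {
  uint64_t start = __builtin_readcyclecounter();
  __asm__ volatile("" ::: "memory");
  float result = expf(input); // stand-in for whatever libc function we benchmark
  __asm__ volatile("" ::: "memory");
  uint64_t stop = __builtin_readcyclecounter();
  *sink = result; // store after the timed region so memory traffic stays out of it
  return stop - start;
}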

Also, AMDGPU has llvm-mca support, so we can compare against that. I remember doing my initial round of microbenchmarking and comparing against llvm-mca to see if my results made sense.

Another interesting question is lane utilization within the GPU's warps, i.e. "what % of my threads in each warp are active during execution?". However, I can't think of a way to do that without some instrumentation passes in LLVM.
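For a hand-written benchmark you can at least sample the active-lane mask manually at points of interest; doing it for arbitrary code is where the instrumentation pass would come in. A sketch for AMDGPU (wave64) using the __builtin_amdgcn_read_exec() clang builtin; the CUDA-side analogue would be something like __activemask(), which I have not shown here.

#include <stdint.h>

// Counts how many lanes of the current wavefront are executing at this point.
// AMDGPU-specific: __builtin_amdgcn_read_exec() returns the 64-bit EXEC mask.
static inline uint32_t active_lanes() {
  return (uint32_t)__builtin_popcountll(__builtin_amdgcn_read_exec());
}

// Example: report utilization as a percentage of a 64-wide wavefront.
static inline uint32_t lane_utilization_percent() {
  return active_lanes() * 100 / 64;
}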

By main memory do you mean the host's RAM? Or GPU memory?

An unrelated question I had was how clang-gpu fits into the project. After building gpu-libc I've successfully used it with both OpenMP and the direct compilation/loader utilities. IIUC, clang-gpu won't help with libc benchmarking (I'm thinking it would be best to use direct compilation and a loader for that part), but I'm not sure where exactly it fits in. Did you have any ideas?

I mean any read or write done by the GPU; this could be pinned memory, shared memory, constant memory, etc. They're all slow in their own ways.

Which clang-gpu are you referring to here? This project just uses a standard build of clang.

I think clang-gpu was the one referred to in one of the papers and the post for GPU-first ([OpenMP][GSoC 2024] Improve GPU First Framework - #2 by Polygonalr) that took any generic host code and wrapped it in an OpenMP construct to offload to the GPU. I think it was used for running the test-suite on the GPU.

Ah, yeah. That's not used by libc currently. They cover different cases.

I guess I was confused about when clang-gpu should be used, then. What case was it specifically designed to cover? Just applications like running the entire test-suite on the GPU?

Yes, it was a research project to automatically convert whole programs to run on the GPU. That means having the compiler insert the magic required to do any of the RPC calls.

I'm starting to put together a proposal and was a little confused on a couple of points:

  1. Would it make more sense to have the benchmarking utility run mainly on the host, with the actual execution/timing taking place on the GPU? I was thinking that this "offloading" could either be done by something like OpenMP, CUDA kernels, or commands that compile code for the GPU and use nvptx-loader on select portions.
    I was thinking this could help with managing a lot of the divergence testing and size sweeping in gbenchmark, although an alternative approach would be to just compile the entire benchmark infrastructure for the GPU and run it directly with nvptx-loader.
  2. I'm still a little unclear on how to provide an officially supported configuration for testing - if clang-gpu can already handle the RPC calls for code originally written for the CPU, why can't we compile binaries with something like $CLANG_CC=clang-gpu and pass those binaries to llvm-lit? IIUC each lit test has its own compilation command in the form of RUN: clang -o test.o test.c. Is there something obvious I'm missing?

There are many angles we could go for here. Any sort of microbenchmarking to get very accurate cycle counts will need to take place on the GPU itself. Stuff like kernel timing and resource usage can just be done with rocprof or nsys.

The loader utilities realistically just launch a kernel called main. If we just make main call some random function, then each file can be benchmarked regardless. If we want instrumentation (i.e. something similar to profile-guided optimization) we could just put some globals in the device code and copy them back in the loader.
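As a concrete (hypothetical) sketch of that idea: the loader launches main as the kernel, memset stands in for the function under test, and the global is the kind of counter the loader could copy back afterwards. All names here are placeholders.

#include <stdio.h>
#include <string.h>

// Hypothetical counter the loader could read back from the device image if we
// wanted instrumentation-style data.
unsigned long benchmark_iterations = 0;

int main() {
  char buffer[64];
  memset(buffer, 0, sizeof(buffer)); // the libc function under test
  ++benchmark_iterations;
  puts("benchmark finished"); // works on the GPU via RPC
  return 0;
}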

The easy way is to just pretend these are single applications. Microbenchmarks could either be heavyweight applications that basically implement something like google-benchmark on the GPU, printing and all, or we could just make main minimal and use the vendor tools as above. clang-gpu isn't really related here; it's not upstream, and the libc repository has everything it needs to compile C code, more or less. Right now I'm just imagining this the same way we handle the unit tests, i.e.

add_custom_target(foo_benchmark COMMAND nvptx-loader bench)

Would this approach correctly handle RPC calls for the libc functions that need it? Or should we just ignore any tests that would break on the GPU without RPC?

RPC is just an implementation detail; for practical purposes, just pretend that you can call puts on the GPU and everything works.

Got it. Just out of curiosity, how is the call implemented? I remember watching the talk and have a general idea of what the code is supposed to do, but I guess I'm confused about how exactly the compiled code knows when it needs to call back to the host. Does the compiler insert some code when the required functions are called? Then what process does the RPC server listener run?