[libc][GSoC 2024] Performance and testing in the GPU libc

Description: The GPU port of the LLVM C library is an experimental target to provide standard system utilities to GPU programs. As this is a new target, it would be worthwhile to extend testing and get specific performance numbers for various C library functions when called from the GPU. This would include microbenchmarking individual functions, comparing them to existing alternatives where applicable, and porting the LLVM test suite to run on the GPU.

Expected Results:

  • Writing benchmarking infrastructure in the LLVM C library. Measurements would include resource usage, wall time, and clock cycles.

  • Performance results for various functions, to be published online.

  • Running the LLVM test suite directly on the GPU where applicable.

  • If time permits, checking alternate methods for I/O via remote procedure calls.

Project Size: Small/Medium

Requirements: Moderate to advanced C++ knowledge, familiarity with GPU programming and architecture, profiling experience, GPU access

Difficulty: Easy/Medium

Confirmed Mentors: Joseph Huber, Johannes Doerfert

Hi, I had a couple of quick questions:

  1. IIUC, the LLVM test suite is expected to run (when applicable) on the GPU without modification. Would there be any case where a test only uses features supported by the GPU port but cannot be run successfully on GPUs?
  2. The post on the LLVM open projects page mentions an implementation of malloc. Is a new implementation within the scope of this project, or is it intended to focus mostly on benchmarking/profiling?

Thanks!

Sure, let me clarify some things.

The general idea would be to run the tests unmodified if possible. However, there are plenty of things we cannot currently run on the GPU. For example, thread_local variables, varargs (will be supported soon), or recursive global initializers. The hope is that we could simply detect a sufficiently large subset of the tests that fit the criteria.

Sorry for the confusion, it was worded poorly. The goal of the project is to create infrastructure that allows us to easily microbenchmark GPU code, maybe similar to Google Benchmark. Implementing a good general-purpose malloc that runs on GPU architectures is a comparatively much more difficult project. I’ve begun doing some basic work toward that end but don’t have anything finished yet. It was mentioned as a use case for said benchmarking code.
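
To make the idea concrete, below is a minimal sketch of what a device-side microbenchmark could look like, loosely modeled on Google Benchmark. The benchmark_cycles helper and the reliance on Clang’s __builtin_readcyclecounter() are assumptions for illustration, not existing libc interfaces.

// Hypothetical harness: time a libc call in cycles from device code.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

// Run `fn` `iters` times and return the average cycles per call. Whether
// __builtin_readcyclecounter() maps to a real cycle counter depends on the
// target, so this is a sketch rather than a finished measurement scheme.
template <typename F> static uint64_t benchmark_cycles(F fn, int iters) {
  uint64_t start = __builtin_readcyclecounter();
  for (int i = 0; i < iters; ++i) {
    fn();
    __asm__ volatile("" ::: "memory"); // Keep the compiler from folding the loop.
  }
  return (__builtin_readcyclecounter() - start) / iters;
}

int main() {
  char dst[256], src[256];
  memset(src, 1, sizeof(src));
  uint64_t cycles =
      benchmark_cycles([&] { memcpy(dst, src, sizeof(src)); }, 1000);
  // Results can be reported through the RPC-backed printf.
  printf("memcpy(256 bytes): ~%llu cycles\n", (unsigned long long)cycles);
  return dst[0] == 1 ? 0 : 1; // Use the result so the copy is not elided.
}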

  1. IIUC, the LLVM test suite is expected to run (when applicable) on the GPU without modification. Would there be any case where a test only uses features supported by the GPU port but cannot be run successfully on GPUs?

There is existing work that “successfully” got the LLVM test suite running on GPUs. The infrastructure and utilities introduced in the LLVM GPU libc project, combined with the techniques in that existing work as well as another GSoC project, can get most of the LLVM test suite running on a GPU without any code changes.

The paper was a really interesting read! Though I’m still a bit confused about the need for new testing infrastructure - why doesn’t running the test-suite with clang-gpu (like in the paper) work? From the bit I’ve read on benchmarking LLVM, it seems like most benchmarks are run with perf, including the libc math benchmarks linked in the root post.

I believe it worked in the paper, but this was mostly a proof of concept. Basically, I would like a working configuration that runs the test suite on the GPU to be officially supported. I don’t think anything in the standard LLVM benchmarking will be relevant to the GPU case; however, we may make use of things like rocprof or nsys to measure certain events.

Got it, I’ve just got a couple more clarifying questions.

  1. For GPU libc performance testing, were you thinking of something similar to the existing code used to test math function diffs, but extended to cover the rest of the libc functions on both CPU and GPU? Or did you have something else in mind?
  2. For the entire LLVM test-suite, would an officially supported configuration mean some combination of CMake options used to configure before running something like llvm-lit -v?

I’m still a little new, so please let me know if anything doesn’t sound right!

Yes, we have a libc/benchmarks directory. The code written for the GPU will likely need to be highly specialized since it will need to use different tools.

I’ve done some cursory looks at this in the past. The LLVM test suite supports cross-compiling and using “emulators”. This means we would compile all the code to an executable and then the infrastructure would use one of the GPU loaders to execute it. Compilation and running would look something like this, but done by llvm-lit:

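# Cross-compile the test directly for the AMD GPU target; -mcpu=native selects the locally installed card.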
clang test.c --target=amdgcn-amd-amdhsa -mcpu=native -flto
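# Launch the resulting GPU image with the loader utility, which runs it on the device and services I/O (e.g. printf) over RPC.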
amdhsa-loader a.out

Compilation is done by CMake, so CMake has to set up the target and architecture accordingly (or just use clang-gpu, which wraps the actual command), and running is done via llvm-lit.

I think I’m interested in this project and although I have some understanding of GPU architecture/programming, I can’t say I have much experience with profiling in general. You mentioned rocprof and nsys earlier, are there any other tools or resources that you would recommend I take a look at?

From what I understand so far, the benchmarking portion of this project will mostly consist of working on C++ code in libc/benchmarks, adding specific tooling to profile GPU performance, and running benchmarks on individual libc functions, correct? Then the test-suite portion would mainly involve creating a way to get llvm-lit to use the new infrastructure to compile and run the test suite?

Glad you’re interested. I’m not the foremost expert on profiling either, but I would recommend scanning through the user documentation from somewhere like User Guide — nsight-systems 2024.1 documentation or ROCProfilerV1 User Manual — rocprofiler 2.0.0 Documentation.

So, we will likely want some utilities that go through standard GPU profiling tools like the above, and ones that just do microbenchmarking directly on the GPU. I have an old, abandoned patch that attempted to do the latter through expert usage of inline assembly hacks https://reviews.llvm.org/D158320. The idea there would be to get cycle-accurate counts, similar to what something like llvm-mca would spit out for the architecture. Function stimulus would likely want to vary between different levels of divergence. Traditional profiling tools will likely be better at picking up hardware events or resource usage.
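
As a rough illustration, the target-specific part of such a cycle read might be wrapped per architecture as sketched below. The builtins named here do exist in Clang, but whether they are the right counters for cycle-accurate measurement on a given card (and how to account for divergence) is exactly what the project would need to work out; treat this as an assumption-laden sketch.

// Sketch of a per-target timestamp read for device-side microbenchmarks.
#include <stdint.h>

static inline uint64_t read_timestamp() {
#if defined(__AMDGCN__)
  // Shader-clock timestamp; availability and resolution vary by generation.
  return __builtin_amdgcn_s_memtime();
#elif defined(__NVPTX__)
  // The %clock64 special register on NVIDIA targets.
  return __nvvm_read_ptx_sreg_clock64();
#else
  // Generic fallback, e.g. for comparison runs on the host.
  return __builtin_readcyclecounter();
#endif
}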

Do you have access to a GPU you can use for development? Unfortunately I cannot get AMD to provide server access to external students so you will likely be on your own on that front. I believe the LLVM foundation has some basic access to computing resources and I can test things on your behalf in the worst case.

I don’t have a GPU right now, but I can ask the professors at my school to see if any of the labs have something I could use. Would it make more sense to ask for an Nvidia or AMD GPU?

Both would be ideal if you could manage it, as the GPU libc targets AMD and NVPTX. I work at AMD, so of course I’d encourage you to try an AMD GPU, but just go with whatever you can get access to. It would be quite difficult to do this work without a reliable, modern-ish GPU, so hopefully you can find one.

Got it, I can make sure to ask for both.

While looking through some more websites, I read the OpenMP parallelism paper referenced in the first paper @shiltian linked earlier. From what I could tell, it looks like the paper tries to take an unmodified user program and effectively wrap it in an OpenMP target offload region for the GPU to run, instead of only offloading specific parts like you usually do with OpenMP. IIUC, there was a noticeable slowdown because of factors like RPC and the sequential initialization code running slower on GPU threads compared to CPU threads. If I read it correctly, what is the benefit of trying to run entire programs on the GPU over offloading? Is it to improve developer experience at the expense of some performance?

The main utility of running programs directly on the GPU is testing both applications and the GPU backends. This is how the GPU libc runs its tests; here’s an example of the NVPTX builder running the tests on an NVIDIA card, printing and everything (Buildbot). Because this runs entirely on the GPU, it shares the same source as the CPU tests. Also, CPU applications don’t look a lot like GPU applications, and running these kinds of workloads has exposed a good number of backend bugs.
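
For illustration, the programs in question are just ordinary C code; a hypothetical example like the one below can be cross-compiled for the GPU target and launched with the loader shown earlier, with printf serviced on the host over RPC.

#include <stdio.h>

// Runs entirely on the GPU; the output is forwarded to the host via RPC.
int main() {
  printf("Hello from the GPU\n");
  return 0;
}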

As for the purpose of the GPU C library, I think it’s mostly to improve the developer experience, allowing people to use standard C functions in CUDA, HIP, OpenMP, whatever. And it will also help with building the C++ library at some point. Here’s a talk I did at last year’s LLVM developers’ conference if you’re more interested in that part: https://www.youtube.com/watch?v=_LLGc48GYHc.

Ah, I totally didn’t think of that! It makes much more sense now.

My school has offered access to Nvidia GPUs via a couple of HPC clusters that I can use by submitting Slurm jobs. I’m checking to see if I can build LLVM and find a good workflow, but would there potentially be any problems with this approach (e.g., something influencing benchmark scores)?

BTW, it looks like libc runs into a CMake error when building on redhat, which I think one of the clusters uses. Is there any specific reason for that?

Not really, you should be able to launch an interactive job with Slurm to make your life easier. Obviously timing will vary highly depending on the card, but as long as we have a way to generate consistent results it should be simple enough to run it on whatever machines we care about. Once we have something working I could test it out on more GPUs.

Unsure, could you share the error message? I haven’t built the CPU version of the C library in a while.

Yeah, the message is “Unsupported libc target operating system redhat”, which seems to be getting triggered in LLVMLibCArchitectures.cmake, and then in the main libc CMake file after I tried to allow redhat as an option in LLVMLibCArchitectures.cmake. Possibly LIBC_TARGET_OS is getting set to redhat instead of linux? I think this may be a local issue and I’ll keep looking.

There are already some hacks around the autodetection there. It probably expects the triple to look like x86_64-unknown-linux-gnu and it’s trying to parse out linux. I’m unsure what triple Red Hat is using here.