[RFC] Implementing GPU headers in the LLVM C library

This RFC concerns generating the headers for the GPU port of the LLVM C Library; see the “libc for GPUs” page of the LLVM C Library documentation.

Background

We currently compile the libc source for the GPU as if it were standard C++, without using any existing offloading support, with an invocation like the following.

clang++ --target=amdgcn-amd-amdhsa -mcpu=gfx90a -c -fvisibility=hidden -nogpulib fputs.cpp

The GPU is a completely freestanding environment with no existing C library. Including the system headers when doing a freestanding compilation like the one above will cause issues, as unsupported definitions are pulled in. For this reason we generate our own headers from the libc interface using the libc-hdrgen tool. Here is what the GPU port currently emits for the stdio.h header.

#include <__llvm-libc-common.h>
#include <llvm-libc-macros/file-seek-macros.h>
#include <llvm-libc-macros/stdio-macros.h>

#define EOF -1

#include <llvm-libc-types/FILE.h>
#include <llvm-libc-types/size_t.h>

__BEGIN_C_DECLS
int puts(const char *__restrict) __NOEXCEPT;

int fputs(const char *__restrict, FILE *__restrict) __NOEXCEPT;

extern FILE * stdout;
extern FILE * stderr;
__END_C_DECLS

This works very well for the freestanding target the libc uses for its internal testing. However, almost all users of this library will use it through an existing GPU offloading language such as OpenMP, CUDA, or HIP. These languages generally work by splitting a single source file into separate compilations, typically one for the CPU (host) and one for the GPU (device). For these languages to work we need to obey the following restrictions:

  1. The same headers need to be included from both the host and device
    This is necessary because the compilation via an offloading language is single source. In general, we need both sides to agree on the values of macros, constants, etc., or else we will get strange divergent behavior.
  2. Objects present on the GPU must be marked on the GPU
    This must be done on both the host and device sides for the offloading language. For example, if we declare that stderr is on the GPU only for the device compilation but not the host and try to use it, we will miscompile because the host and device sides do not agree on what is present on the GPU.
  3. Types cannot conflict with the system headers
    The LLVM C Library is not a full replacement for the user’s system headers, so the host will need to include its own headers. This is problematic because we would then declare the same types twice if the system headers were combined with the LLVM C headers.

Currently, these offloading languages use existing wrappers around the system headers, which can be found in clang/lib/Headers in the llvm-project repository. For offloading languages we do not have the same issues with picking up the system headers, because we can eagerly cull things that are not actually on the GPU, whereas a freestanding compilation must include everything. Also, the offloading languages will pass definitions for the auxiliary triple (e.g. x86_64), which bypasses a lot of the failure modes.

Proposal

The proposal here is to provide a single header that is compatible both with the freestanding GPU target and with inclusion from an existing offloading language. This will allow us to precisely control the libc implementation when it is used internally, and to provide compatible headers when included from one of these existing languages. Taking the header example from above, we can transform it to provide the necessary utilities depending on the compilation.

#if !defined(_OPENMP) && !defined(__CUDA__) && !defined(__HIP__)
#include <__llvm-libc-common.h>
#include <llvm-libc-macros/file-seek-macros.h>
#include <llvm-libc-macros/stdio-macros.h>

#define EOF -1

#include <llvm-libc-types/FILE.h>
#include <llvm-libc-types/size_t.h>
#else
#include_next <stdio.h>
#endif

#include <llvm-libc-macros/gpu-macros.h>

__BEGIN_C_DECLS
__BEGIN_OPENMP_DECLS
int puts(const char *__restrict) __NOEXCEPT __DEVICE;

int fputs(const char *__restrict, FILE *__restrict) __NOEXCEPT __DEVICE;

extern FILE * stdout __DEVICE;
extern FILE * stderr __DEVICE;
__END_OPENMP_DECLS
__END_C_DECLS

Where the additional header <llvm-libc-macros/gpu-macros.h> could look like the following,

#if defined(_OPENMP)
#define __BEGIN_OPENMP_DECLS _Pragma("omp begin declare target")
#define __END_OPENMP_DECLS _Pragma("omp end declare target")
#else
#define __BEGIN_OPENMP_DECLS
#define __END_OPENMP_DECLS
#endif

#if defined(__CUDA__) || defined(__HIP__)
#define __DEVICE __attribute__((device))
#else
#define __DEVICE
#endif

#if defined(_OPENMP) || defined(__CUDA__) || defined(__HIP__)
#undef __NOEXCEPT
#define __NOEXCEPT
#endif

This will allow us to use the headers as-is with the LLVM C library’s defined types when compiling directly for the GPU target as we do currently. However, if we are compiling for OpenMP, CUDA, or HIP we will instead use #include_next to get the next header in the search path and obtain the system’s stdio.h, which will provide the types instead. We will then use the generated entrypoints to precisely define which system utilities are available on the GPU.

This will then allow us to install these headers to the current include/gpu-none-llvm/ directory and prepend that search path when targeting the GPU. Long-term this will allow us to remove the wrapper headers in clang. We propose that this is done in a single header, rather than a separate header per platform, for ease of use. If we were to generate separate headers we would need to run libc-hdrgen multiple times in a single build and disambiguate where each set was installed.

What’s required

This will require some additions to the libc-hdrgen tool. Most likely we will need an extra flag indicating that we are operating in GPU mode to perform the wrapping. Because we cannot unconditionally include any of the LLVM libc headers, we will also need to change the interface. The proposed method is to take the existing headers and convert them to something like this.

#ifndef LLVM_LIBC_STDIO_H 
#define LLVM_LIBC_STDIO_H

%%include(__llvm-libc-common.h)
%%include(llvm-libc-macros/file-seek-macros.h) 
%%include(llvm-libc-macros/stdio-macros.h)

%%public_api()

#endif // LLVM_LIBC_STDIO_H    

This would allow us to then cause the normal targets to emit the #include as normal, while the GPU target could defer that until it wraps it in the GPU check.
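For illustration, here is a hypothetical sketch of the two outputs (the exact formatting hdrgen would emit is an assumption): a normal target expands each %%include directly, while the GPU target defers the expansion and wraps it in the offloading check.

```c
/* Normal target: %%include(llvm-libc-macros/stdio-macros.h) becomes: */
#include <llvm-libc-macros/stdio-macros.h>

/* GPU target: the same directive is deferred and wrapped: */
#if !defined(_OPENMP) && !defined(__CUDA__) && !defined(__HIP__)
#include <llvm-libc-macros/stdio-macros.h>
#else
#include_next <stdio.h>
#endif
```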

Caveats

This will require some fine-tuning to work in general, which should be handled in the platform file definitions in the common case. For example, the GNU libc provides isalnum as a macro. That means in order to use our implementation we need to #undef isalnum on the GPU. I believe these can be handled on a case-by-case basis.

Feedback would be appreciated. This is required to actually ship the GPU libc as a product and would allow us to simply generate our own headers for the GPU instead of using the clang wrappers, so I am eager to see this through.

This is mostly looking good. Couple of things:

  1. There is a %%include_file command already, so maybe you should name the new command %%header.
  2. In the general case, there can be type collisions between the system libc and the GPU libc. As far as I can tell this proposal does not address such scenarios. Am I missing something?
  1. Did not know that, I will change it.
  2. The idea is that we do not include any LLVM libc types in the headers when using an offloading language, instead getting them all from the system headers. I’m relying on the fact that the C standard defines all of the regular types, and I expect those to be available and consistent.

Depends on what “consistent” means. For struct types, for example, the C standard specifies the members but not their order. In the general case, the order can affect the ABI. Even for simple types like div_t, which you brought up on Discord, the order is relevant, as we want both sides to have the same view of the world.

This is definitely one of those edge cases; for the ones I’m aware of, the layout seems common between implementations. But in a worst-case scenario we can statically assert on the offsets of those members and refuse to compile if they do not match.

It is beginning to appear more and more like we are trying to mix overlay mode and full build mode. Can we build the GPU libc in overlay mode for the offloading-language use case?

The system headers that overlay mode would use cannot be included on the GPU, so we must generate them. We have a few existing wrapper headers in clang/lib/Headers in the llvm-project repository, but these are nowhere near complete and require special handling from the offloading language to even work, which we don’t have with a standalone build. Writing libc headers there would seem backwards, because then the libc wouldn’t have any clue which things it actually implements.

In both cases we want to generate headers, and in generating headers that can work with offloading languages we can eliminate the code in clang/lib/Headers as well.

The semantics of overlay mode as the libc knows it are a little different here, as we cannot use the system headers internally and we also need to generate headers to specify which “system” utilities we put on the GPU.

You will have to elaborate the problem for the benefit of people like me who do not understand enough of the GPU side. Reading “cannot be included on the GPU” does not give me any information. I can only speculate that may be there are some constructs in normal system libc headers which the GPU compilers don’t like.

Also, “nowhere near complete and require special handling from the offloading language to even work” makes me wonder how things are working currently. In the extreme, it makes me wonder again: are we trying to solve an unrelated problem in the libc?

Bottom line from my side: mixing headers with the hope of it just working and “we will patch if not” reflects on the libc development discipline. Maybe there is no better way to solve the problems. But I need help in appreciating the problems in the first place.

Sure thing. The issue is that the system headers are not intended to be included on anything but the host system. Offloading languages like CUDA, HIP, and OpenMP can bypass this primarily because of two things. First, these offloading languages know the auxiliary triple, that is, the one running on the host system. Because offloading compilation is not cross-compilation, the compiler can define the host information that the system headers typically key off of, e.g. __x86_64__. Second, offloading languages on the GPU can eagerly cull things that are not on the GPU once the AST is parsed. Normally, if we included the following file we could get a linker error if a is not defined,

extern int a;
void foo() { a = 1; }

But if we include this on the GPU, foo will never be emitted and so will never reference a, because we did not place it on the GPU using one of the offloading languages’ constructs. This is the importance of not blindly declaring everything on the GPU and instead having the libc individually specify only what is supported.

The other issues, like needing the host and device to agree on the definitions, are mentioned in the original RFC.

The existing headers provide a small subset of shims to get a handful of commonly used utilities to run on the GPU. They are not trying to implement libc; they are mostly just a few functions that a customer wanted to use, so someone wrote a workaround.

Yes, it’s not an ideal situation, but as far as I can tell this is the best solution to the multitude of problems at this moment. The ideal would be if the LLVM C Library were fully featured and could provide a unified GPU / CPU implementation, but that is a long, long way off and would require a lot more from the user.

What I’d like to point out is that we are still implementing the libc according to the “true” headers, so the implementation of the LLVM C library does not need to worry about any of this. The only difference here is that we need to add some ugly workarounds to the header to get this library in a form that can be exported and used by the users. There will be no bleeding of this mess into the actual implementation. I’m simply stating that if we do have ABI violations we can in the worst case statically assert on it, but I don’t think it will be very common.

Mirroring from ⚙ D153897 [libc][hdr-gen] Add special offloading handling for the GPU target, but I’m presenting an alternative method that separates the offloading languages more completely. We would add a special mode to libc-hdrgen that simply exports all the entrypoints and places them in a subdirectory, e.g. clang/lib/Headers/llvm-libc-declarations/stdio.h. That file would contain just the declarations, no header guard or anything, so it would look like,

#ifndef __ATTRS
#define __ATTRS
#endif

int fputs(const char *__restrict, FILE *__restrict) __ATTRS;
extern FILE * stderr __ATTRS;

This would then be included by a wrapper header in clang/lib/Headers/__libc_wrappers/stdio.h, which looks like the following,

#ifndef __CLANG_LIBC_WRAPPERS_STDIO_H__
#define __CLANG_LIBC_WRAPPERS_STDIO_H__
#include_next <stdio.h>

#if defined(__CUDA__) || defined(__HIP__)
#define __ATTRS __attribute__((device))
#endif

#pragma omp begin declare target
#include <llvm-gpu-none/llvm-libc-declarations/stdio.h>
#pragma omp end declare target

#endif // __CLANG_LIBC_WRAPPERS_STDIO_H__

The downside of this solution is that we now need to keep a wrapper header in clang/lib/Headers, while the other solution would allow us to remove that directory entirely. The upside is that the hacks required for this compatibility are shuffled away from the implementation and towards the interface. I will attempt to implement this solution as an alternative.