[RFC][libc] Exporting the RPC interface for the GPU libc

I’ve been working on the GPU port of the LLVM C library, see libc for GPUs — The LLVM C Library.

I’ll first cover a bit of background for those unfamiliar with the project. Because the GPU has no operating system, we use a Remote Procedure Call (RPC) mechanism to implement what would normally be done by Linux syscalls or a similar kernel interface. Functionally, this RPC interface is a buffer of shared memory that both the host (CPU) and device (GPU) can access atomically and asynchronously. It is used to communicate between the “client” running on the GPU and the “server” running on the CPU. We have implemented this interface inside of src/__support/RPC/rpc.h and define a CPU server inside of utils/gpu/loader/Server.h. Currently, this is a completely freestanding, header-only C++ library, meaning you can include the same header when compiling for the GPU and the CPU and stand up a client or server. Here’s an example of what a device memory allocation looks like with the existing interface.

// Client running on the GPU
void *malloc(size_t size) {
  void *ptr = nullptr;
  rpc::Client::Port port = rpc::client.open<rpc::Opcode::MALLOC>();
  port.send_and_recv([=](rpc::Buffer *buffer) { buffer->data[0] = size; },
                     [&](rpc::Buffer *buffer) {
                       ptr = reinterpret_cast<void *>(buffer->data[0]);
                     });
  port.close();
  return ptr;
}

// Server running on the CPU
void run_server() {
  rpc::Server::Port port = server.open();
  switch (port.get_opcode()) {
  case rpc::Opcode::MALLOC: {
    port.recv_and_send([&](rpc::Buffer *buffer) {
      buffer->data[0] =
          reinterpret_cast<uintptr_t>(allocator(buffer->data[0]));
    });
    break;
  }
  }
  port.close();
}

As you can see, we use the same interface to implement both the server code running on the CPU and the client code running on the GPU. This means the GPU implementation of libc will somehow need to provide this functionality and make it available externally so the GPU runtime can spin up a server and handle work. There are a few options we can consider:

  1. Provide the headers directly: The headers are completely freestanding and can be compiled for both the CPU and GPU. This is what we do currently in the loader utility to run the GPU unit tests. The downside is that we would expose libc internals; the upside is that the offloading runtime could then implement its own server. I used this approach for a proof-of-concept implementing OpenMP offloading to the host: GitHub - jhuber6/OpenMP-reverse-offloading: A demonstration of implementing OpenMP 5.1 reverse offloading using shared memory remote procedure calls.
  2. Maintain separate headers: We could take the same interface but split it into two headers. The server could then replace dependencies on things like cpp::optional, cpp::function, and cpp::Atomic with the standard library implementations. This would prevent us from exposing internals, but maintaining two copies of the same underlying scheme is not very desirable. Any future changes or optimizations would need to be applied to both more or less in parallel.
  3. Export the interface as a C library: We could wrap the whole RPC interface in a C library (see the sketch after this list). The problem is that this would then need to be compiled for all supported architectures, whereas a header-only library works by default. The interface would also be heavily restricted because we make heavy use of lambdas.
  4. Implement a fully functional server internally in libc: We could implement the server internally in libc and export only a static library. This would allow us to hide the RPC interface while still using it to implement the server. The downside is that the GPU runtime requires a lot of extra state, so we would need a lot of function pointers to let the offloading runtime call things like GPU memory allocation, memory copies, etc. I have an old WIP of that approach here: ⚙ D147054 [libc] Begin implementing a library for the RPC server.
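
To illustrate why option 3 is awkward, here is a rough sketch of what a C wrapper might look like. None of these names exist today; the lambda-based callbacks in rpc.h would have to become function pointers with explicit user-data arguments:

#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

// Hypothetical C API; the names are illustrative, not real libc symbols.
typedef struct rpc_port_s *rpc_port_t;
typedef void (*rpc_callback_t)(uint64_t *buffer, void *data);

rpc_port_t rpc_server_open(void);
void rpc_port_recv_and_send(rpc_port_t port, rpc_callback_t work, void *data);
void rpc_port_close(rpc_port_t port);

#ifdef __cplusplus
}
#endif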

I’d like to get the opinions of the libc developers as to what the approach here should be. This is a bizarre edge case because we have a target that needs to define its own “operating system” potentially externally. @sivachandra is concerned about exposing the internals of libc.

Personally, I would prefer 1 because it’s the easiest solution, followed by 4. Implementing a generic library has some benefits, so we could go down that route if needed. The downside of not providing the full interface, as option 1 would, is that we lose the ability to make extensions in the OpenMP or similar runtimes.

The only options I would choose are 1 and 4. Option 2 seems like a lot of toil, and option 3 seems like trying to fit a square peg into a round hole.

My biggest concern with option 1 is that anything external depending on our internal libraries might require frequent updates, since our internal implementations are still in flux. Additionally, they would have to deal with the parts of our library that make sense for our uses but perhaps not for general use, such as how we handle errno.

Option 4 sounds like it would be more hermetic, but still tightly tied to our implementation. I wonder if it’s possible to have an intermediary library (provided for a specific target?) that hooks deeply into LLVM’s libc like in option one, but provides a cleaner interface. It’s very possible that that’s basically what you’re describing in option 4, and if so feel free to ignore this suggestion.

I strongly prefer option 4, along with everything @michaelrj-google has said about actually providing a user-facing library with a much cleaner interface. For example, instead of the server implementation reading out messages from the message buffer/stream directly, it should be a concrete implementation of an abstract base class:

class GPUSystemService {
public:
  ...
  virtual MallocResponse malloc(const MallocRequest &request) = 0;
  ...
};
...
class GPUSystemServiceOnCPU : public GPUSystemService {
  ...
  MallocResponse malloc(const MallocRequest &request) override {
    return {allocator(request.size)};
  }
  ...
};

int main() {
  auto service = std::make_unique<GPUSystemServiceOnCPU>();
  GPUServer server(std::move(service));
  ...
  return server.run();
}

The actual functionality of reading message packets out of the buffer is hidden somewhere in the implementation of GPUServer, maybe in its run method.
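
As a rough sketch of what that run loop could look like, in the style of the sketches above (Packet, next_packet, reply, and the pack/unpack helpers are hypothetical):

int GPUServer::run() {
  for (;;) {
    // Read the next message packet out of the shared buffer; this is the
    // only place that touches the wire format.
    Packet packet = next_packet();
    switch (packet.opcode) {
    case Opcode::MALLOC: {
      MallocResponse response = service->malloc(unpack<MallocRequest>(packet));
      reply(packet, pack(response));
      break;
    }
    // ... other opcodes dispatch to the other virtual methods
    }
  }
}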

Will you also support odd SYCL devices, where the pointer widths on the host and device differ?

Given that we own both libc and openmp, we can provide a custom interface that hides libc details but offers exactly what we need wrt extensibility, no?
That’s what I understand was proposed above as a variation of 4. It would make sense to me and should solve our problems nicely.

The libc library should be usable with anything that can link in LLVM-IR. However, right now I only export a static library using the magic fat binary format we use for OpenMP (see OpenMP Offloading Design & Internals — Clang 17.0.0git documentation). The only requirements for the shared memory buffer are that it supports asynchronous updates from the device and the host, relaxed atomic loads and stores, and acquire and release fences. The address space doesn’t need to be uniform, as long as both pointers somehow map to the same underlying memory.
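
To make those requirements concrete, here is a minimal sketch using standard C++ atomics (the in-tree implementation uses cpp::Atomic; SharedMailbox, post, and wait are illustrative names, not the real interface):

#include <atomic>
#include <cstdint>

// Illustrative only: synchronizing a mailbox flag over the shared buffer
// using nothing but relaxed atomics plus acquire/release fences.
struct SharedMailbox {
  std::atomic<uint32_t> flag;
  uint64_t data[8];
};

void post(SharedMailbox &box, uint64_t value) {
  box.data[0] = value;
  std::atomic_thread_fence(std::memory_order_release); // publish the buffer
  box.flag.store(1, std::memory_order_relaxed);
}

uint64_t wait(SharedMailbox &box) {
  while (box.flag.load(std::memory_order_relaxed) != 1)
    ; // spin until the other side posts
  std::atomic_thread_fence(std::memory_order_acquire); // observe the buffer
  return box.data[0];
}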

So, the HIP group at AMD may want to use this in the future once it’s more fully implemented. I don’t want to rule out supporting external projects. Right now it’s trivial to use if you’re in-tree, because you can just do -I ./libc since it’s a header-only library. However, we also need to support an LLVM_ENABLE_RUNTIMES build, which more or less works via findLLVM.cmake, so we’d need to export it somehow in that case.

So, right now the things we depend on inside libc are cpp::optional, cpp::Atomic, and cpp::function, along with the architecture macros and GPU wrappers. I’d assume these are reasonably stable, but it would still be exporting the internal interface.

So, the biggest difficulty here is that in order to hide the src/__support/RPC/rpc.h header we’d need to put this behind an implementation. This means that we would be using std::function at least.

Doing 1 would probably require making a header-only library in CMake; the leaked dependencies would be relatively minimal but still tied to some implementation. The benefit is that it’s easy and allows the user to easily interface with the GPU.

Doing 4 would prevent us from easily extending the interface. However, there would be some benefit in having a common library to handle things that don’t depend on the GPU runtime, like printing or exiting.

I don’t see why this is an issue? This is what namespacing is for?

I agree on this front; the __llvm_libc namespace prevents users from confusing the __llvm_libc::cpp::optional or __llvm_libc::cpp::function that we use internally for the implementation with their standard counterparts. However, the libc team seems concerned about having these “accessible” outside of the library entry points. The one difficulty is that the headers all live in src/__support/, so we’d need to install some directory that contains those files.

There is no problem if it is only a matter of the libc internal constructs getting exposed. But, two more serious problems in this case are:

  1. The public interface in this header file uses internal constructs.
  2. The internal constructs are actually defined in internal header files and not in this header file in question. So, if this header file is to be made public, then the internal header files also have to be made public.

Yeah, we use and define macros like LIBC_TARGET_ARCHITECTURE_IS_GPU at the top level. It would require some plumbing to separate out uses of non-reserved identifiers. I wish there were a preprocessing step that only resolved include directives or something.

I do not think we should approach this problem by conflating it with existing implementation details. I would view it this way: not exposing internal constructs in the interface and not requiring installation of internal header files is a hard constraint. If the solution under this constraint requires that existing uses of lambdas etc. have to be redesigned, then that is just part of the deal here.

If we’re defining a library, we could probably just wrap the interface in some PIMPL formulation with std::function. This would require keeping at least the buffer type in sync, and would incur some runtime penalty because of all the calls that can’t be inlined, but it might work if the goal is to hide the interface.

#include <cstdint>
#include <functional>

namespace __llvm_libc::rpc {
class Port;
}

// Must be kept in-sync with `rpc.h`
using Buffer = uint64_t[8];

class Port {
  __llvm_libc::rpc::Port *port;

public:
  // Implemented elsewhere to hide the interface.
  void send(std::function<void(Buffer *)>);
  void recv(std::function<void(Buffer *)>);
};

void register_callback(std::function<void(Port)>);
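
The hidden implementation file could then delegate to the internal interface. This is a sketch, not working code; it assumes the internal Port::send takes a callable, as in the example at the top of the thread:

// Compiled against the internal header (and the public one above) so that
// rpc.h never leaks into the installed interface.
#include "src/__support/RPC/rpc.h"

void Port::send(std::function<void(Buffer *)> fill) {
  // Forward the type-erased callback to the internal lambda-based interface.
  port->send([&](__llvm_libc::rpc::Buffer *buffer) {
    fill(reinterpret_cast<Buffer *>(buffer->data));
  });
}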

At the least, the goal is to:

  1. Prevent exposure of internal constructs in the public interface.
  2. Not require installing internal header files.

To achieve those, we have to pick option 4. Have we reached a conclusion on that? I would consider the design of the adapter layer to be a separate problem. We can discuss it here, but we should first reach a conclusion that option 4 is what we pick.

I would not choose to introduce more virtual calls on the GPU than necessary for any reason.

So, this portion would only be for implementing the server that runs on the CPU. For the GPU we will probably need to make an “extension” to the C library to emit an entry point to at least initialize the client with the shared memory.
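
Something along these lines, where the entry-point name and the existence of a reset method on the client are purely illustrative:

// Hypothetical device-side entry point; not a real libc symbol today.
extern "C" void __llvm_libc_rpc_client_init(void *shared_buffer) {
  // Bind the global client to the host-allocated shared memory region.
  __llvm_libc::rpc::client.reset(shared_buffer);
}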

I think in general it is unfortunate that we cannot provide the full interface. An option more in line with 2 would be to pretty much copy what we have, trim it down, remove the libc-specific bits, and have that exist as a standalone header. We could place this in utils instead and then include it in libc to handle the GPU side.

Normally, a client-server API does not expect user code to read out packets from the communication stream. I have given an example of how it is normally done on the server side. Likewise, the client API is typically like this:

class GPUSystemService {
...
public:
...
  class Client {
  public:
    Client(... /* connection related arguments */);
    MallocResponse malloc(const MallocRequest &request);
    ... // Other request-response methods
  };
};

The packet read/write is hidden behind those client-side request-response methods (which are not virtual, by the way 🙂). So, the user code just does:

int main() {
  GPUSystemService::Client client(...);
  auto response = client.malloc(...);
  ...
}

We haven’t worked out how to handle extensions to this yet, as far as I know. There’s one shared enum defining everything. It seems likely that how we choose to extend that (at minimum so that libc and openmp aren’t bound to the same set of operations, but maybe also to allow user code to register their own functions) will constrain how the code is exported. One possible partitioning is sketched below.
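
Purely as a sketch, the shared enum could reserve ranges for each producer; none of these ranges or names exist today:

#include <cstdint>

enum class Opcode : uint16_t {
  // 0x0000 to 0x3FFF: reserved for libc-internal operations.
  MALLOC = 0x0001,
  FREE = 0x0002,
  // 0x4000 to 0x7FFF: reserved for the offloading runtime (e.g. OpenMP).
  FIRST_RUNTIME = 0x4000,
  // 0x8000 to 0xFFFF: available for user-registered functions.
  FIRST_USER = 0x8000,
};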

I’d suggest we leave it as a raw header only, i.e. the current approach, until we work out how to handle extensions.

Yeah, the option there would be to rework the header to include all of its dependencies and then replace uses of LIBC_TARGET_ARCH_IS_GPU with __AMDGPU__ and __NVPTX__. That would be unfortunate because it would mean we fail to track changes to functional, optional, etc., but it would at least leave us with a single header we could copy.
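
For example, a standalone copy might carry a small shim like this, relying only on compiler-provided target macros (RPC_TARGET_IS_GPU is a made-up name):

#if defined(__AMDGPU__) || defined(__NVPTX__)
// Compiler-provided target macros stand in for libc's internal ones.
#define RPC_TARGET_IS_GPU 1
#else
#define RPC_TARGET_IS_GPU 0
#endif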