Emissary APIs: a general-purpose framework for GPU-initiated host execution of native host APIs

We propose a flexible infrastructure for GPU-initiated host services, called Emissary APIs. This infrastructure would sit on top of the current offload RPC created by Joseph Huber. It enables offload kernels and offload code regions to use interfaces identical to their host counterparts. This is useful for OpenMP, where runtime-optional host fallback is often encountered. More importantly, it allows more complex kernels and code regions without the need to terminate the kernel or code region.

Example Emissary APIs would be MPI, HDF5, the Fortran I/O runtime, and the coarray Fortran runtime. Since clang already has a device libc built on the same LLVM offload RPC that Emissary APIs would use, an Emissary API for libc is not necessary, although using the Emissary API infrastructure could improve its performance.

We have implemented Emissary APIs in AMD’s downstream ROCm compiler. We have also recently tuned that implementation to prepare for upstream review.

Using Emissary APIs reduces the need to terminate target regions in order to perform a host service. That is, calls to MPI_Send() and MPI_Recv() can be added directly to target regions or kernels. Such an MPI implementation enables GPU-initiated MPI. This is a powerful feature on top of existing GPU-aware MPI communications, where the host is needed to control RDMA communications directly between GPUs.

WRITING DEVICE STUBS AND HOST VARIADIC WRAPPERS

One way to understand this proposed infrastructure is to imagine what currently happens when a call to a function such as MPI_Send() is encountered in a kernel or target region without Emissary APIs. You naturally get an unresolved reference during the offload link phase.

Maintainers of Emissary APIs provide the device stubs for the desired functions, eliminating the unresolved reference. Each stub simply calls the variadic function "_emissary_exec" with the same arguments as the stub parameters, plus an identifier so the host service thread can tell which service to call.

Here is an example of a device stub definition for MPI_Send():

extern "C" int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                        int dest, int tag, MPI_Comm comm) {
  return (int)_emissary_exec(_PACK_EMIS_IDS(EMIS_ID_MPI, _MPI_Send_idx), buf,
                             count, datatype, dest, tag, comm);
}

The host in turn prepares calls to the appropriate service by unpacking the argument buffer and calling the corresponding host variadic wrapper function, which calls the actual host function.

Here is an example of the host variadic wrapper function for MPI_Send.

extern "C" int V_MPI_Send(void *fnptr, ...) {
  va_list args;
  va_start(args, fnptr);
  void *buf = va_arg(args, void *);
  int count = va_arg(args, int);
  MPI_Datatype datatype = va_arg(args, MPI_Datatype);
  int dest_rank = va_arg(args, int);
  int tag = va_arg(args, int);
  MPI_Comm comm = va_arg(args, MPI_Comm);
  va_end(args);
  int rval = MPI_Send(buf, count, datatype, dest_rank, tag, comm);
  return rval;
}

The host variadic wrapper function ensures proper argument types for the eventual call to the actual host function.

One goal of Emissary APIs is to distribute their development and maintenance, especially to the owners of existing host libraries. The Emissary API infrastructure makes extending a host API to GPUs relatively easy: for each desired host function, a device stub and a host variadic wrapper are written. These steps could almost be automated from the interfaces found in host API header files. However, some added intelligence is often needed. For example, the example above assumes unified shared memory; more code would be required if an argument were a device pointer in a non-shared-memory environment. The implementer would either call another function that assumes a device pointer or fetch the data from the GPU.

EMISSARY INTERNALS

When the compiler frontend codegen encounters a call to the variadic function _emissary_exec, it emits code to replace that call. First, it generates code to calculate the total size of a buffer to contain all the arguments to _emissary_exec. It then generates a call to __llvm_omp_emissary_premalloc, which returns a pointer to a buffer obtained with device malloc; see that function below. It then generates code to fill the buffer with each argument. Lastly, it generates a call to the device function __llvm_omp_emissary_rpc. This function coordinates with the host using the LLVM offload RPC system. This code generation is in a new clang/lib/CodeGen file called CGEmitEmissaryExec.cpp, which is too long to post here.

The definitions for device functions __llvm_omp_emissary_premalloc and __llvm_omp_emissary_rpc are added to openmp/device/src/Misc.cpp and are shown below.

void *__llvm_omp_emissary_premalloc(uint32_t sz) {
  return omp_alloc((size_t)sz, omp_default_mem_alloc);
}

unsigned long long __llvm_omp_emissary_rpc(uint32_t sz32, void *data) {
  size_t size = (size_t)sz32;
  rpc::Client::Port Port = ompx::impl::Client.open<OFFLOAD_EMISSARY>();
  Port.send_n(data, size);
  unsigned long long Ret;
  Port.recv([&](rpc::Buffer *Buffer, uint32_t) {
    Ret = static_cast<unsigned long long>(Buffer->data[0]);
  });
  omp_free(data, omp_default_mem_alloc);
  Port.close();
  return Ret;
}

On the host side, the following stanza is added to invoke the main host function Emissary(buffer_ptr) for each active lane in the warp.

case OFFLOAD_EMISSARY: {
  uint64_t Sizes[NumLanes] = {0};
  unsigned long long Results[NumLanes] = {0};
  void *Args[NumLanes] = {nullptr};
  Port.recv_n(Args, Sizes, [&](uint64_t Size) { return new char[Size]; });
  uint32_t id = 0;
  for (void *buffer_ptr : Args)
    if (buffer_ptr)
      Results[id++] = Emissary((char *)buffer_ptr);
  Port.send([&](rpc::Buffer *Buffer, uint32_t ID) {
    Buffer->data[0] = static_cast<uint64_t>(Results[ID]);
    delete[] reinterpret_cast<char *>(Args[ID]);
  });
  break;
}

The device function __llvm_omp_emissary_rpc and the host runtime case stanza above implement the necessary handshaking between the device and the host for all Emissary API functions. Using the offload RPC infrastructure directly (without the proposed Emissary API infrastructure) would require each RPC function to implement this handshaking itself, to account for the variability in each function's argument list. With Emissary APIs, all functions share the same handshaking (shown above) because any set of arguments can be packed into the argument buffer.

EMISSARY API MAINTENANCE

Emissary API maintainers do NOT need to maintain upstream llvm-project code to distribute updates to Emissary APIs (although that would bring joy to the world). A reference Emissary API ID is reserved for external development and maintenance. API maintainers can implement the stubs and wrappers as a header library using the reserved ID. Adding a new set of functions for another API is relatively easy, and once created, maintenance of the API (device stubs and host variadic wrappers) can be confined to a single file for that API.

We demonstrate single-file maintenance with the EmissaryMPI.h file in the upcoming PR. Having the simple variadic wrapper functions compile from a user-included header file solves an important maintenance issue: compiling the wrappers into the compiler runtime would require linking the runtime against a specific native host library to resolve references to the functions called in the wrappers (such as MPI_Send in the example above). Since the user must link to the host library anyway, deferring wrapper compilation to the user's compile eliminates the need for the compiler runtime to link to that library.

SUMMARY

In summary, the Emissary API infrastructure eliminates the complexity of implementing GPU-initiated host services with LLVM offload RPC. The Emissary API maintainer adds enum identifiers for new functions, then writes the device stub and host variadic wrapper for each function.

The PR to support Emissary APIs is [OpenMP] support for Emissary APIs as discussed in 89169 by gregrodgers · Pull Request #175265 · llvm/llvm-project · GitHub.

A new feature in the PR supports a general-purpose method for passing device pointers through the API.

This example demonstrates the use of MPI_Send and MPI_Recv in a target region.

#include <omp.h>
#include <mpi.h>
#include <stdio.h>
#include <EmissaryMPI.h>
#define VSIZE 5000

int main(int argc, char *argv[]) {
  int numranks, rank;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numranks);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  printf("R%d: Number of Ranks = %d ArraySize: %d\n", rank, numranks, VSIZE);

  MPI_Comm _mpi_comm = MPI_COMM_WORLD;
  MPI_Datatype _mpi_int = MPI_INT;

  Emissary_Initialize_MPI(); // See notes in clang/lib/Headers/EmissaryMPI.h

  int rc = 0;
  int *send_recv_buffer = (int *) malloc(VSIZE * sizeof(int));

  #pragma omp target teams distribute parallel for map(to:send_recv_buffer[0:VSIZE]) map(tofrom:rc) num_threads(256)
  for (int i = 0; i < VSIZE; i++) {
    if (rank == 0) {
      send_recv_buffer[i] = -i;
      MPI_Send(&send_recv_buffer[i], 1, _mpi_int , 1, i, _mpi_comm);
    } else {
      MPI_Recv(&send_recv_buffer[i], 1, _mpi_int , 0, i, _mpi_comm, MPI_STATUS_IGNORE);
      if (send_recv_buffer[i] != -i)
        rc = 1; // FLAG AS ERROR IF NOT EXPECTED.
    }
  }
  MPI_Finalize();
  printf("R%d: === POST MPI_Finalize === rc = %d\n",rank,rc);
  return 0; //rc
}