[RFC] SYCL Kernel Lowering

SYCL Kernel Lowering

A SYCL construct such as parallel_for or single_task takes a named function object or a lambda as one of its arguments. The contents of this function object are executed on the device. However, as the SYCL runtime can rely on other offload APIs like OpenCL or CUDA to execute the function object, it needs to respect the calling convention of these API. To enable this, the function object is converted into the format of an OpenCL kernel.

Consider the following code snippet:

#include <sycl.hpp>

int main() {
  sycl::queue q;
  sycl::buffer<char> b{sycl::range{1024}};

  q.submit([&](sycl::handler &cgh) {
    sycl::accessor acc{b, cgh};
    int i;
    struct S {
      char c;
      int i;
    } test_s;
    test_s.c = 14;

    cgh.single_task([=] {
      if (i == 13 && test_s.c == 14) {
        acc[0] = 'a';
      }
    });
  });
}

In this example, the lambda passed to the single_task construct needs to be executed on the device. The corresponding function object looks like:

struct FuncObj {
  int i;
  struct S test_s;
  sycl::accessor acc1;

  void operator () {  // Function call operator
    if (i == 13 && test_s.c == 14) {
      acc[0] = 'a';
    }
  }
};

The device compiler then generates a caller in the form of an OpenCL kernel function that calls this function object. It does so by walking the function object data member and generating a parameter for each of them. Some special types like accessor are treated a bit differently (see below). Inside the OpenCL kernel, the function object is rebuilt and then called.

The device compiler transforms this into (pseudo-code):

    void Caller(
       int i,
       struct S test_s,
       __global int* accData, // arg1 of accessor init function
       range<1> accR1,        // arg2 of accessor init function
       range<1> accR2,        // arg3 of accessor init function
       id<1> accId            // arg4 of accessor init function
    )
    {
        // Local capture object
        struct FuncObj local;

        // Reassemble capture object from parts
        local.i = i;
        local.s = s;
        // Call acc1 accessor's init function
        sycl::accessor::init(&local.acc1, accData, accR1, accR2, accId);

        // Call the kernel body
        Callee(&local);
    }

    void Callee(struct FuncObj* this)
    {
        // body of the kernel invocation
    }

The SYCL specification defines rules for allowable types for a kernel parameter.

The proposed implementation passes the copyable types to the device as separate parameters. The current implementation is aware of some types such as sycl::accessor, for example, which cannot be simply copied from host to device. (The specification permits this to account for difference in host/device layouts, absence of some fields on either the host or the device, or to allow conversion of pointer values for correct behavior.) To enable all of this, these special types have an __init function. The parameters of this function are transfered from host to device separately. The values received on the device are passed to the init functions executed on the device, which results in the reassembly of the SYCL object in a form usable on the device. Note that when such types are elements of an array or a field of a struct or both, special traversal is necessary to pass the type properly. The proposed mechanism accounts for handling these special instances.

Location of this logic

Currently in our implementation, this logic is located in the Sema phase. Similar to what we are considering for generating the Integration Header and Footer, we have an open question between two options that we are considering - one, move this to the CodeGen phase (e.g., OpenMP does their equivalent transformation in CodeGen) and two, move it out of the clang FE and do it in an LLVM IR pass.

1 Like

Thank you for posting this RFC! I don’t see anything objectionable here, but there is an open question regarding whether this should be down from Sema, CodeGen, or an LLVM IR pass. I’d like to hear from @jdoerfert, @alexey-bataev, @efriedma-quic, @rjmmcall on the open question, but my inclination is that this should be spread out a bit. The AST should have the correct source fidelity so we can perform functions like AST matching on the original code. I think the closest analogy to something we already support would be lambdas in C++. Based on that, my intuition is that Sema will have some of the (semantic) logic and CodeGen will have some of the (ABI-related) logic. But I’m curious what the other experts think.

Note: we typically want to avoid the compiler needing to know about library types. Some library types are kind of special (size_t is a library type but the compiler is the only thing that knows what the underlying type should be, similar for nullptr_t, etc) but I don’t think sycl::accessor is the same kind of special.