RFC: SYCL support for unnamed SYCL kernel functions

The SYCL 2020 specification requires implementations to support the use of a lambda expression as the entry point to a SYCL kernel. Kernel invocation requires that host and device compilers agree on kernel names. Kernel names may be explicitly chosen by programmers by passing a type name as a template argument to a SYCL kernel invocation function like sycl::handler::single_task. In the following example, a kernel name is deterministically generated based on the my_kernel_name type.

#include <sycl/sycl.hpp>
class my_kernel_name;
int main() {
  sycl::queue q(sycl::cpu_selector_v);
  q.submit([](sycl::handler &cgh) {
    cgh.single_task<my_kernel_name>([]{}); // kernel name provided explicitly
  });
  q.wait();
}

If an explicit kernel name is not provided, then the implementation is required to implicitly generate a kernel name based on the kernel entry point and surrounding context. The above example can be written more simply as follows. Note that the declaration of my_kernel_name and all uses of it have been removed.

#include <sycl/sycl.hpp>
int main() {
  sycl::queue q(sycl::cpu_selector_v);
  q.submit([](sycl::handler &cgh) {
    cgh.single_task([]{}); // unnamed SYCL kernel function
  });
  q.wait();
}

When the kernel entry point is a lambda expression, it is technically challenging for host and device compilers to independently generate a matching kernel name. The existing C++ name mangling schemes specify how names are generated for the closure type of a lambda expression, but these names are not generally intended to be stable, particularly in the presence of conditionally included code. The SYCL 2020 specification therefore allows conforming implementations to support a reduced feature set that does not include support for unnamed SYCL kernel functions. This is specified in Appendix B.2, “Reduced feature set”. Per section 5.6, “Preprocessor directives and macros”, implementations that do not support unnamed SYCL kernel functions are required to predefine the SYCL_FEATURE_SET_REDUCED macro with the value 1 and to not define SYCL_FEATURE_SET_FULL. SYCL programs that use unnamed SYCL kernel functions are therefore more limited in their portability.
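
A program that needs to remain portable across both feature sets can guard its use of unnamed kernel functions on these macros. A minimal sketch (the fallback_name class below is illustrative):

#include <sycl/sycl.hpp>
class fallback_name; // used only on reduced-feature-set implementations
void submit_work(sycl::queue &q) {
  q.submit([](sycl::handler &cgh) {
#if defined(SYCL_FEATURE_SET_FULL)
    cgh.single_task([]{});                // unnamed SYCL kernel function
#else
    cgh.single_task<fallback_name>([]{}); // explicit kernel name
#endif
  });
}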

Support for unnamed SYCL kernel functions is more easily achieved when the host compiler is SYCL-aware since this allows the host and device compilers to use more surrounding context to generate a stable kernel name than is incorporated in the existing C++ name mangling schemes used for lambdas. Clang already supports a __builtin_sycl_unique_stable_name builtin function to assist with production of a stable kernel name. Modulo defects in its design and implementation, this suffices to provide full support of unnamed SYCL kernel functions when Clang is used as both the host and device compiler.
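
For illustration, the builtin maps a type (such as a lambda’s closure type) to a string literal; a minimal sketch of the kind of use involved (this is not the actual library code):

// Clang-specific builtin: given a type, yields a string literal naming it
// in a way intended to be stable across host and device compilation.
auto kernel = []{};
constexpr const char *name =
    __builtin_sycl_unique_stable_name(decltype(kernel));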

Support for SYCL-unaware host compilers could be provided by preprocessing a SYCL translation unit, instrumenting the resulting preprocessed output with a SYCL-aware tool to insert stable kernel names in calls to SYCL kernel invocation functions, and then passing the result to the SYCL-unaware host compiler. This approach has several downsides, including:

  • Loss of preprocessing information during host compilation. This may affect when diagnostics are issued and their presentation form.
  • Loss of precise source location information. This may affect diagnostic presentation form and the quality of debugging information.
  • Performing the instrumentation would be technically challenging. The tool would have to consume preprocessed output intended for the host compiler while parsing the code well enough either to deterministically generate kernel names that match those produced by the device compiler, or to correlate kernel invocations with a table of kernel names produced by the device compiler.

At present, the __builtin_sycl_unique_stable_name builtin function does not generate matching names during host and device compilation for the following example. Running this program results in a run-time error due to a failure to resolve kernel names. This is presumed to be either a defect in the implementation of the builtin function or a limitation of its design; it has not yet been determined which of these is the case.

#include <sycl/sycl.hpp>
int main() {
  sycl::queue q(sycl::cpu_selector_v);
  q.submit([](sycl::handler &cgh) {
#if !defined(__SYCL_DEVICE_ONLY__)
    // A host-only lambda to misalign discriminators.
    []{}();
#endif
    cgh.single_task([]{});
  });
  q.wait();
}

Proposed:

  • Support for unnamed SYCL kernel functions will only be provided when Clang is used as both the host and device compiler.
  • Support for unnamed SYCL kernel functions will not be provided if the -fsycl-host-compiler option is used. Such support could be provided in the future, either using the preprocessed output approach described above or another mechanism, based on demand.
  • The SYCL_FEATURE_SET_FULL and SYCL_FEATURE_SET_REDUCED predefined macros will be defined as specified in the SYCL specification based on whether support for unnamed SYCL kernel functions is enabled.
  • The __builtin_sycl_unique_stable_name builtin will be enhanced to address known deficiencies like for the example above. This might require changing the name mangling scheme used for kernels; if so, those details will be described and addressed in code reviews.

The limitation in __builtin_sycl_unique_stable_name is well known, and was known at the time of design. IMO, the problem is that we intended to use integration headers to communicate between the host and device compilations, rather than some other form of out-of-band communication.

IMO, the REAL solution (as I proposed repeatedly, and had well architected before my departure) is to do the ‘host’ vs ‘device’ split later in the compilation process (either at backend code-generation time or at IR-generation time), and not to support arbitrary host compilers, particularly for the upstream product. There is no real value added to Clang by supporting them, and it ends up being a massive complication.

If we NEED to support 2 pass compilation (of which I’m absolutely thoroughly unconvinced, even after spending 4+ years involved in the language implementation), we need to come up with a better side-channel to communicate, not some half-solution like __builtin_sycl_unique_stable_name.

Bullets 1, 2, and 3 are otherwise a positive direction, but I don’t think 4 is the way to go. I also don’t think 1 or 2 need to come AFTER however we decide to handle this (as I still suspect a later split is the RIGHT way to go).

I like that.
And then we can do “the same thing” as OpenMP offload is already shipping, no?

That is definitely one course of action, yes. I’m not a huge fan of it, but it is orders of magnitude better than what is being proposed.

One big benefit is that if someone “makes it better” later, we all benefit. No matter if it is an incremental improvement or a rewrite.

I don’t believe __builtin_sycl_unique_stable_name() is relevant for integration headers; it is used to generate stable names that are then passed to the SYCL run-time to dynamically resolve symbols at load time for the devices that are enabled in the run-time environment. This builtin is only relevant when Clang is used as both the host and device compiler; a SYCL-aware host compiler is effectively required to support unnamed SYCL kernel functions, which is why this RFC doesn’t propose such support for third-party host compilers. Even if we choose a mechanism other than integration headers to support separate host and device compilation, this builtin will still be needed and will need to be enhanced to address examples like the one presented.

Here is another example for which enhancements to the builtin function are needed. The Intel compiler successfully compiles this, but produces an error at run-time.

#include <sycl/sycl.hpp>
inline void f(sycl::queue &q) {
#if !defined(__SYCL_DEVICE_ONLY__)
  // Host-only declaration of a local class named S; on the host it shifts
  // the discriminator assigned to the S defined below.
  { struct S; }
#endif
  {
    struct S {
      void mf(sycl::queue &q) {
        q.submit([](sycl::handler &cgh) {
          cgh.single_task([]{});
        });
      }
    };
    S{}.mf(q);
  }
}
int main() {
  sycl::queue q(sycl::cpu_selector_v);
  f(q);
  q.wait();
}

The builtin currently fails to produce a matching name for the host and device whenever any portion of the mangled name for the lambda includes a discriminator that is assigned differently for the host and device. The issues are not limited to mismatched discriminators for the lambda closure type itself.
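
As a simplified illustration (with the SYCL library wrapping stripped away, and assuming the Itanium mangling scheme), a closure type is numbered by the lambda’s position among the lambdas in the enclosing function, so conditionally compiled code shifts the numbering:

inline void g() {
#if !defined(__SYCL_DEVICE_ONLY__)
  []{}(); // Host only: occupies the first discriminator slot (UlvE_).
#endif
  []{}(); // Host: second slot (UlvE0_); device: first slot (UlvE_).
}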

For comparison purposes: CUDA does not support lambda expressions as kernels, but it does allow lambda expressions to be passed as arguments to kernels and has similar requirements for stable name mangling in that case. As far as I can tell, nvcc handles this well; I haven’t been able to get it to misbehave for similar examples that have conditional code guarded by __CUDA_ARCH__. However, I believe Clang’s CUDA implementation does not handle such cases correctly. Unfortunately, I don’t have a machine available where I can actually run such examples in order to prove this (and https://godbolt.org doesn’t seem to have an option to run CUDA code compiled with Clang).

It IS relevant to integration headers (note again I was the implementer, and participated extensively in the design/engineering effort of the feature), as it is a workaround for the fact that we didn’t have an out-of-band communication mechanism between the two compilations.

If we were to have a true level of communication between the host and device compilers (either as 1 pass, or a better out of band mechanism), this builtin isn’t necessary.

As far as your example, yes, we know that is a limitation, and it was explicitly decided that altering the existence of lambdas like that was UB for exactly the reason you’re seeing.

I’m aware of, but not at liberty to discuss, a number of limitations for lambdas. The NVCC implementers have shared similar issues/concerns with me regarding lambdas, and have made different tradeoffs (based on different value judgements of what is permitted).

Thinking further, I don’t actually SEE the need for __builtin_sycl_unique_stable_name at ALL without the integration headers. The way it is used is as follows:

1- During device code generation, we generate the ‘name’ of the kernel to match what the builtin will say for it, and put that ‘name’ in the Integration Header as a KernelInfo specialization.

2- During host-code generation, the library uses the builtin to pick up said specialization from the integration headers, which it can then use to get the information for dispatching the kernel.
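
Sketched loosely (the KernelInfo shape and all names below are illustrative, not the exact intel/llvm declarations):

// (1) Emitted into the integration header by the device compiler
//     (illustrative shape only):
class Kernel0; // illustrative stand-in for a generated kernel-name type
template <typename KernelName> struct KernelInfo;
template <> struct KernelInfo<Kernel0> {
  static constexpr const char *name = "_ZTSZ4mainEUlvE_"; // example string
};

// (2) Host-side library code recovers the same key via the builtin and
//     uses it to locate the kernel at run time (illustrative):
template <typename KernelType> const char *kernel_key() {
  return __builtin_sycl_unique_stable_name(KernelType);
}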

With a better out-of-band communication mechanism, the builtin isn’t necessary at all, because the host/device CFEs communicate with each other and can simply exchange which locations in their ASTs correspond.

That said, I see you want to make changes, but I don’t see any actionable changes proposed. What are you actually proposing for your #4 bullet? #1 and 2 are effectively what is already implemented in intel/llvm, and I’m pretty sure #3 is as well (as those are from the SYCL standard, right?).

I am aware that you implemented this. That is why I double checked my understanding of the implementation to be sure I wasn’t spouting nonsense. Alas, I completely failed that double checking and ended up spouting nonsense anyway. You are right, of course.

For single-pass, yes. For multiple-pass though, we still have two ASTs to reconcile and therefore still have the same challenge of correlating kernel calls between them. A mechanism other than the integration headers would be helpful to avoid the language restrictions the SYCL specification imposes on the types used as kernel names, but I don’t think it would otherwise help with these inconsistent naming issues.

Understood. Thank you for mentioning that.

Per my statement above, I don’t see a clear way this would be accomplished if consistent names are not generated for the host and device. Can you elaborate on how you see this being done?

We’ve been discussing some ideas internally, but I don’t have details to share yet. We’ve been looking at how OpenMP kernels are named by clang and gcc for comparison purposes. Assuming we identify a solution that we believe to be an improvement over what is implemented now, I’ll follow up either in this RFC or in a new one. Perhaps there will be an opportunity to collaborate on naming methods that help with some of the CUDA cases you are aware of as well.

Mostly correct. The Intel compiler does not implement #3 correctly (it always defines SYCL_FEATURE_SET_FULL and never defines SYCL_FEATURE_SET_REDUCED, even when support for unnamed lambdas is not enabled).

I’m a bit straw-manning here TBH. I don’t have a great design in my head, but the OpenMP mechanism as an alternative for the purposes of kernel naming seems reasonable, but my understanding of it is limited.

Once you have an out-of-band mechanism for 2-pass compilation, the name actually doesn’t matter, as long as it is unique among the kernel names in the program. You could name them with a randomly generated GUID. The only problem is reconciling between the two runs of the compiler with different preprocessing.

Presumably some out-of-band mechanism could be taught to check source location information, but this gets tough when you realize that these lambdas are often wrapped by the library/etc. If it weren’t for the library, we could probably use AST matchers for it.

I’ll note the 1st version of this builtin did use source location, but the maintainer at the time said we shouldn’t be counting on that, so we switched to the ‘order in the function’ mechanism for lambdas we have now.

Speaking of AST Matchers… if we had said ‘out of band’ communication mechanism, we could probably use something like that to make the types ‘match’ based on contents. It means the lambdas have to be the same ‘contents’ (at least at the high level), but if we hold our nose ‘less’ with the ODR implications in this language, this is presumably acceptable.

Ah! Yeah, #3 is the one I was unsure about and figured we would have done correctly. FWIW, I’m in favor of that part of the change. I THINK there is a macro indicating whether unnamed lambdas are enabled, but it isn’t the standard one.

That is the challenge; how the host and device compilations arrive at a matching name. Or, equivalently, how corresponding kernel calls are identified in the host vs device ASTs so that they can be assigned a common name.

Clang’s support for OpenMP relies on source location information. That helps, but isn’t sufficient because of macros and templates. A function signature still gets mangled into the name and a suffix is appended to handle collisions. See Compiler Explorer for an example. Clang generates the following names:

  • __omp_offloading_10302_7e286_void ft<int>()_l13
  • __omp_offloading_10302_7e286_void ft<int>()_l13.2
  • __omp_offloading_10302_7e286_void ft<short>()_l13
  • __omp_offloading_10302_7e286_void ft<short>()_l13.4
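
The linked example isn’t reproduced here, but a hypothetical reconstruction of source that produces such colliding names, with two target regions expanding from a single source line and instantiated for two types, looks something like this:

// Hypothetical reconstruction: two target regions on one source line via a
// macro, so the line-based names collide and Clang appends the ".N" suffix.
#define TWO_TARGETS _Pragma("omp target") {} _Pragma("omp target") {}
template <typename T> void ft() { TWO_TARGETS }
int main() {
  ft<int>();
  ft<short>();
}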

I suspect this scheme can be broken in some of the same ways as the earlier SYCL examples, but I haven’t bothered to construct a test case. I suspect the likelihood of such shenanigans is lower than with SYCL.

For reference, gcc uses (target?) clones for these names:

  • _ZZ2ftIiEvvENKUlvE_clEv._omp_fn.0 (ft<int>()::{lambda()#1}::operator()() const [clone ._omp_fn.0])
  • _ZZ2ftIiEvvENKUlvE_clEv._omp_fn.1 (ft<int>()::{lambda()#1}::operator()() const [clone ._omp_fn.1])
  • _ZZ2ftIsEvvENKUlvE_clEv._omp_fn.0 (ft<short>()::{lambda()#1}::operator()() const [clone ._omp_fn.0])
  • _ZZ2ftIsEvvENKUlvE_clEv._omp_fn.1 (ft<short>()::{lambda()#1}::operator()() const [clone ._omp_fn.1])

Yep, exactly. The two runs of the CFE have to figure it out ‘somehow’. So far, I’m pretty convinced that ‘some name from source’ ends up being always unreliable thanks to macros and template instantiations.

An AST-type compare with names could be more successful (and actually FIX the absurd named-kernels issues), since it can count on type names, and in the case of lambdas, the ‘contents’ of the lambda.

Yeah, that is unfortunate; it seems like it would have issues similar to the Clang ones. It does have the advantage, though, of not being wrapped in a significant number of library function calls (in SYCL, the actual location of a kernel is always DEEP inside of the library, as is sometimes the kernel object itself).

It appears to me that they’re using a straight ordinal, which makes me wonder if they’re doing some OTHER sort of out-of-band communication, OR if they are just requiring that the same kernel calls exist in both runs. I think OpenMP is ‘less tolerant’ of the macro nonsense around kernel calls that SYCL tried to allow (and again, it gets even messier with SYCL being implemented as a library).

Aren’t they just using normal Itanium C++ name mangling rules? The way the Itanium rules disambiguate lambdas is just counting the number of lambdas in the scope.

No, those aren’t. Everything after the ‘.’ is a place for extensions outside of the ABI, or ‘clones’. This is what attribute target, target_clones, cpu_dispatch, and cpu_specific use as well.

Lambdas ARE differentiated by an integer, but it is before the ‘.’.

See:

_ZZ3foovENK3$_0clEv: 
_ZZ3foovENK3$_1clEv: 
_ZZ3foovENK3$_2clEv: 

are the lambda names; the differentiator is the _# after the $.

Just to be a pedant, those use an extension of the Itanium mangling scheme, since those lambda closure classes have internal linkage. When the function is declared inline, we get actual Itanium names. See Compiler Explorer:

_ZZ3foovENKUlvE_clEv
_ZZ3foovENKUlvE0_clEv
_ZZ3foovENKUlvE1_clEv
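
For reference, a reconstruction of the kind of source behind those manglings (the linked example itself isn’t reproduced here) is three lambdas invoked inside an inline function, compiled without optimization; without the inline keyword, Clang falls back to the $_N internal-linkage scheme shown earlier:

inline void foo() {
  []{}(); // operator() mangles as _ZZ3foovENKUlvE_clEv
  []{}(); // _ZZ3foovENKUlvE0_clEv
  []{}(); // _ZZ3foovENKUlvE1_clEv
}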