Preferred alternative to a C++ dialect for device library functions

Hello OpenMP dev,

A motivating example is atomicInc for amdgcn. There is ISA support for this so a good implementation folds to a single instruction. There is no corresponding clang intrinsic, though there is an llvm intrinsic.

I see the following options:

  • Implement it in IR, linked into deviceRTL
  • Inline assembly
  • Delay implementation until the intrinsic can be added to clang
  • Implement in terms of CAS
  • Your suggestion here

Adding atomicInc.ll to the source tree is the easy short-term fix. Its drawbacks are exposure to future ABI changes, added build complexity, and limited precedent: libclc does this, but nothing else does.

Inline assembly works (modulo getting the syntax right) and hits the right instruction.

Implementing in terms of CAS means one can stay in HIP or OpenCL, but performance suffers.
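
For illustration, the CAS fallback could look roughly like the sketch below. It is host-side C++ using std::atomic rather than device code, and it rests on some assumptions: the function name atomic_inc_cas is hypothetical, the wrap-around semantics are taken from CUDA's atomicInc (store 0 if old >= limit, else old + 1, and return old), and relaxed memory ordering is just a placeholder choice.

```cpp
#include <atomic>

// Hypothetical stand-in for a device atomicInc built on a CAS loop.
// Assumed semantics (as in CUDA's atomicInc):
//   old = *addr; *addr = (old >= limit) ? 0 : old + 1; return old;
// Relaxed ordering is an assumption; a real device library may need more.
inline unsigned atomic_inc_cas(std::atomic<unsigned> &addr, unsigned limit) {
  unsigned old = addr.load(std::memory_order_relaxed);
  unsigned desired;
  do {
    desired = (old >= limit) ? 0u : old + 1u;
    // On failure, compare_exchange_weak refreshes `old` with the current
    // value, so the next iteration recomputes `desired` from fresh data.
  } while (!addr.compare_exchange_weak(old, desired,
                                       std::memory_order_relaxed,
                                       std::memory_order_relaxed));
  return old;
}
```

The retry loop is where the performance cost comes from: under contention each failed exchange repeats the load and compare, whereas the ISA instruction does the whole update in one step.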

Which of these options would you prefer?

Thanks,

Jon

Hello OpenMP dev,

A motivating example is atomicInc for amdgcn. There is ISA support for this so a good implementation folds to a single instruction. There is no corresponding clang intrinsic, though there is an llvm intrinsic.

Do you mean a target-specific intrinsic, or a target-independent intrinsic?

-Hal

Hi Hal,

In this case (atomicInc) it’s a target-specific IR intrinsic, but there are other cases where there is target-independent IR support but no clang builtin; OpenCL memory fences, for one.

So the general question is what to do about functions with IR/asm support but no clang builtins.

Thanks!

Jon

My general preference is just to add the Clang intrinsic. Adding target-specific intrinsics is generally pretty easy: one line in include/clang/Basic/BuiltinsAMDGPU.def, a few lines of code in CodeGenFunction::EmitAMDGPUBuiltinExpr in lib/CodeGen/CGBuiltin.cpp, and some lines for testing in test/CodeGen/builtins-amdgcn.c (which, unfortunately, doesn't seem to exist, but just make one like test/CodeGen/builtins-nvptx.c or test/CodeGen/builtins-nvptx-sm_70.cu).

For the target-independent ones, please post an RFC to cfe-dev about adding the intrinsics so that we can settle on that before you need them.

If there's something complicated about the frontend work, I recommend a .ll file as a work-around.

-Hal

I'm not certain I understand the problem. Could you elaborate a bit?

One thing that might help:
  Because we pre-include the CUDA headers to get math functions in NVPTX
  compilation, we get all the CUDA intrinsics as well. These can be used
  in target compilation as if it were CUDA code. We should do the same
  for HIP and others. Would it solve your issue to allow HIP/OpenCL
  intrinsics?