CUCLANG struggle with cooperative groups headers

Hello everyone,
In my code, I’m using CUDA cooperative groups functions from these two header files:

  • cooperative_groups/reduce.h;
  • cooperative_groups/memcpy_async.h.
    When I try to compile for sm_80 with clang using CUDA 11.8, I get errors such as
    error: use of undeclared identifier ‘__reduce_add_sync’.
    I examined the CUDA header file and it seems that for some reason clang doesn’t automatically include crt/sm_80_rt.hpp where these functions are defined for the target architecture.

Since sm_75 compiles fine, I suspect the error is in the definition of _CG_HAS_OP_REDUX for CUDA_ARCH >= 800 in include/cooperative_groups/details/info.h:132.

To fix this, I’ve included it manually:
#if CUDA_ARCH >= 800 && !(defined(NVCC))
#include “crt/sm_80_rt.hpp”
However, I want to understand if there is a more elegant solution, or if I should open an issue.

If you want to reproduce the error, you can find here a snippet of the code.


Well, it’s a little bit more complicated than just adding #include “crt/sm_80_rt.hpp”. It will avoid compilation error, but it would not actually provide the implementation of those functions, so any attempt to use them will result in ptxas failure.

If you dig a bit deeper, you will see that __reduce_add_sync is defined as

__SM_80_RT_DECL__ unsigned __reduce_add_sync(unsigned mask, unsigned value) {
  return __reduce_add_sync_unsigned_impl(mask, value);

There’s no implementation provided for __reduce_add_sync_unsigned_impl() by the CUDA headers, so clang will need to provide its own. So, if you really need to use those headers, in addition to explicitly including crt/sm_80_rt.hpp you will need to provide the implementation of __reduce_add_sync_unsigned_impl using inline asm.

I do plan to get it fixed, but it may take some time until I get to it.

We already provide a handful of _impl functions CUDA headers depend on:

Adding few more would be fairly straightforward, once we know what instructions exactly those functions are supposed to generate.

Hi there,
Thanks for your reply.
At the moment I’ve added #include "crt/sm_80_rt.hpp" inside clang/lib/Headers/__clang_cuda_intrinsics.h after the declaration of #include "crt/sm_70_rt.hpp". Is this correct?
In any case, as you mention, it broke at link time because __reduce_add_sync_unsigned_impl was not found.

__reduce_add_sync_unsigned_impl is not supposed to be defined in the shared library that ships with CUDA? Why is this not the case?
And if not, how can I find out what asm instructions this function is made up of? Maybe I can try to implement those functions.
Otherwise I should compile with CUDA_ARCH < 800 correct?

Kind regards.

Following some hint I’ve found in the source code, I’ve added the following to clang/lib/Headers/__clang_cuda_intrinsics.h:

#if __CUDA_ARCH__ >= 800

__device__ inline int __reduce_add_sync_signed_impl(unsigned mask, int value){
  return __nvvm_redux_sync_add(value,mask);


I don’t know if this is correct or not. But in any case upon linking nvlink returns a segmentation fault.


I have no idea what’s the problem with nvlink in your case.

If you do need to rely on RDC compilation, I’d suggest taking a look at the --offload-new-driver flag in recent clang, which handles GPU-side linking in a more convenient way. 08 Improving the OpenMP Offloading Driver: LTO, libraries, and toolchains - YouTube