Hello everyone,
In my code, I’m using CUDA cooperative groups functions from these two header files:
cooperative_groups/reduce.h;
cooperative_groups/memcpy_async.h.
When I try to compile for sm_80 with clang using CUDA 11.8, I get errors such as
error: use of undeclared identifier ‘__reduce_add_sync’.
I examined the CUDA header file and it seems that for some reason clang doesn’t automatically include crt/sm_80_rt.hpp where these functions are defined for the target architecture.
Since sm_75 compiles fine, I suspect the error is in the definition of _CG_HAS_OP_REDUX for CUDA_ARCH >= 800 in include/cooperative_groups/details/info.h:132.
To fix this, I’ve included it manually: #ifCUDA_ARCH >= 800 && !(defined(NVCC)) #include “crt/sm_80_rt.hpp” #endif
However, I want to understand if there is a more elegant solution, or if I should open an issue.
If you want to reproduce the error, you can find here a snippet of the code.
Well, it’s a little bit more complicated than just adding #include “crt/sm_80_rt.hpp”. It will avoid compilation error, but it would not actually provide the implementation of those functions, so any attempt to use them will result in ptxas failure.
If you dig a bit deeper, you will see that __reduce_add_sync is defined as
There’s no implementation provided for __reduce_add_sync_unsigned_impl() by the CUDA headers, so clang will need to provide its own. So, if you really need to use those headers, in addition to explicitly including crt/sm_80_rt.hpp you will need to provide the implementation of __reduce_add_sync_unsigned_impl using inline asm.
I do plan to get it fixed, but it may take some time until I get to it.
We already provide a handful of _impl functions CUDA headers depend on:
Adding few more would be fairly straightforward, once we know what instructions exactly those functions are supposed to generate.
Hi there,
Thanks for your reply.
At the moment I’ve added #include "crt/sm_80_rt.hpp" inside clang/lib/Headers/__clang_cuda_intrinsics.h after the declaration of #include "crt/sm_70_rt.hpp". Is this correct?
In any case, as you mention, it broke at link time because __reduce_add_sync_unsigned_impl was not found.
__reduce_add_sync_unsigned_impl is not supposed to be defined in the shared library that ships with CUDA? Why is this not the case?
And if not, how can I find out what asm instructions this function is made up of? Maybe I can try to implement those functions.
Otherwise I should compile with CUDA_ARCH < 800 correct?