[CUDA][RFC] Clang support for CUDA 11.4, issues with <cooperative_groups.h>

Hi!

I wanted to ask what the expected support for CUDA 11.4 in Clang is. We want to update from 10.2 to 11.4 and came across the following issue when compiling:

// foo.cu
#include <cooperative_groups.h>

Just by including this header and compiling with "--cuda-gpu-arch=sm_61", we get the following error when building with Clang:

/usr/local/cuda-11.4/targets/x86_64-linux/include/cooperative_groups/details/partitioning.h:85:32: error: use of undeclared identifier ‘__match_any_sync’
unsigned int subMask = __match_any_sync(thisMask, pred);
^

The error goes away if we use sm_70 or later, but we need sm_61 in our application.

Cooperative groups have been working just fine since they were introduced in CUDA 9, so I wonder why this would break now. NVCC has no problems with the above code. Clang can also build it just fine on CUDA 10.2.

Digging a bit deeper, I can narrow it down to this:

// foo.cu

template <typename T>
__device__ void foo()
{
__match_any_sync(0,0);
}

The above code compiles fine with NVCC + sm_61. However, if I remove the template, then NVCC complains that __match_any_sync is not defined. Clang complains regardless.
Does this make any sense? Is NVCC doing some magic that allows it to ignore undeclared functions if they are used in a non-instantiated template function?

PS: from what I understand, __match_any_sync is indeed only available on SM 7.0 or later. I still wonder how NVCC can handle this, is it a bug or a feature? Should Clang be updated to match NVCC?

PS2: The problem goes away in CUDA 11.5, even though it’s not officially supported in Clang. I get a warning but the code compiles. I can still see the __match_any_sync function in the CUDA header so I don’t understand what’s happening. Is Clang really compiling, or what does it do when using a “too new” CUDA version?

Thanks!

It's not a CUDA issue per se (IMHO) and not related to the builtin at all.

C++ says this ain't alright, nvcc is just very forgiving apparently:
https://godbolt.org/z/14xYfbKdz

I'd recommend either declaring the builtin yourself in your header or using
a CUDA version that comes with a declaration.
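
To illustrate the C++ rule at play, here is a minimal plain C++ sketch (no CUDA involved). match_any_stub is a hypothetical stand-in for __match_any_sync, not the real builtin, and its body is made up for the example. Because both call arguments are non-dependent, the name is looked up where the template is defined rather than when it is instantiated, so a declaration has to be visible at that point:

```cpp
// Hypothetical stand-in for __match_any_sync (NOT the real builtin).
// It is declared before the template, which is the workaround suggested above.
constexpr unsigned int match_any_stub(unsigned int mask, unsigned int value) {
    return mask & value;  // placeholder body, only here so the sketch can be checked
}

template <typename T>
constexpr unsigned int foo() {
    // Neither argument depends on T, so this call is a non-dependent name:
    // the compiler resolves match_any_stub at the template definition.
    return match_any_stub(0xFu, 0x5u);
}

// Instantiating the template works because the declaration was already visible.
static_assert(foo<int>() == 5u, "0xF & 0x5 should be 5");
```

If the declaration of match_any_stub is moved below foo, Clang and GCC should reject the template even when foo is never instantiated, which matches the behaviour described for Clang in this thread.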

~ Johannes

Thanks for the reply! Indeed declaring the function myself somewhere is probably the easiest workaround.

I asked in the Nvidia forums, and it seems they forgot to guard the problematic code inside an #ifdef that is only active for SM 7.x or later. I can see they've fixed it in CUDA 11.5, and indeed using CUDA 11.5 with Clang solves the problem (modulo the Clang warning about a "too new CUDA version").

Therefore my next question is: would it be possible to bump the supported CUDA version in Clang to 11.5? I could do it myself following the steps done for 11.4, but I’m not sure what things I should be looking out for or how to test it. Let me know if I should open a new thread for this new question.

/Carlos