Clang and CUDA with C++11 features

Hi,

I am experimenting with CUDA language support in Clang, and so far
the kernel-call test in Clang trunk (r158426) compiles and runs, on
a Tesla C2050.

Now I would like to enable C++11 support in Clang to use compile-time
C++11 features in GPU code, primarily variadic templates and lambda
functions.

The following code compiles fine:

    // kernel-call.cu

    #include <cuda_runtime.h>

    __attribute__((global)) void g1(int x) {}

    int main(void) {
      // CHECK: call{{.*}}cudaConfigureCall
      // CHECK: icmp
      // CHECK: br
      // CHECK: call{{.*}}g1
      g1<<<1, 1>>>(42);
    }

    clang++ -std=cuda -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu

After adding a lambda function (in the host code), compilation fails:

    #include <cuda_runtime.h>

    __attribute__((global)) void g1(int x) {}

    int main(void) {
      // CHECK: call{{.*}}cudaConfigureCall
      // CHECK: icmp
      // CHECK: br
      // CHECK: call{{.*}}g1
      g1<<<1, 1>>>(42);

      auto lambda = [](){};
    }

    clang++ -std=cuda -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu
    kernel-call.cu:14:3: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
      auto lambda = [](){};
      ^
    kernel-call.cu:14:17: error: expected expression
      auto lambda = [](){};
                    ^
    1 warning and 1 error generated.

As expected, -std=c++11 does not work, since it disables CUDA support:

    clang++ -std=c++11 -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu
    kernel-call.cu:5:16: warning: global attribute ignored
    __attribute__((global)) void g1(int x) {}
                   ^
    kernel-call.cu:12:7: error: expected expression
      g1<<<1, 1>>>(42);
          ^
    kernel-call.cu:12:14: error: expected expression
      g1<<<1, 1>>>(42);
                 ^
    1 warning and 2 errors generated.

Is there a way to tell Clang to enable C++11 extensions?

Thanks,
Peter

Sadly no, it appears we enable CUDA features based on whether we're
using -std=cuda, not whether we're using -x cuda, so even though we
accept -x cuda -std=c++11, that disables CUDA support! I expect we'd
accept a patch to fix that :)

For your own experimentation, try modifying
include/clang/Frontend/LangStandards.def as follows:

--- include/clang/Frontend/LangStandards.def (revision 158416)
+++ include/clang/Frontend/LangStandards.def (working copy)
@@ -115,6 +115,6 @@
 // CUDA
 LANGSTANDARD(cuda, "cuda",
              "NVIDIA CUDA(tm)",
-             BCPLComment | CPlusPlus | Digraphs)
+             BCPLComment | CPlusPlus | CPlusPlus0x | Digraphs)
 
 #undef LANGSTANDARD

Looks like Richard and I came up with exactly the same patch.

–Sean Silva

Thanks, Richard and Sean, indeed that works! I could compile a
__attribute__((global)) function calling a local lambda function,
and a variadic function with __attribute__((device)).

There seems to be an issue with C++11 headers, e.g. <iostream>

    clang++ -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu
    In file included from kernel-call.cu:7:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/iostream:39:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/ostream:39:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/ios:40:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/bits/char_traits.h:40:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/bits/stl_algobase.h:65:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/bits/stl_pair.h:61:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/bits/move.h:57:
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:441:38: error: expected '>'
                              is_void<_Tp>>>::type
                                         ^
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:441:38: error: expected a type
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:442:8: error: expected a type
        { };
           ^
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:442:8: error: expected class
          name
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:442:8: error: expected
          '{' after base class list
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:563:61: error: expected '>'
        : public __and_<is_arithmetic<_Tp>, __not_<is_signed<_Tp>>>::type
                                                                ^
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:563:61: error: expected a type
    /home/peter/usr/rhel6-x86_64/gcc-4.7.0/lib/gcc/x86_64-unknown-linux-gnu/4.7.0/../../../../include/c++/4.7.0/type_traits:564:8: error: expected a type
        { };
    …

The parser interprets the compressed C++11 template parameter syntax
as a call to a CUDA kernel function. Is there a way to disable parsing
of the CUDA call syntax <<< >>>? I would be using a C++ wrapper around
cudaConfigureCall, cudaSetupArgument and cudaLaunch anyway.

Besides, this is very impressive. C++11 on GPUs appears to be near :-).

Peter

Hmmm, the old cudaLaunch trick does not seem to work:

    #include <cuda_runtime.h>

    #include <cstdio>
    #include <cstdlib>

    #define CUDA_REQUIRE( x ) \
        { \
            cudaError_t err = (x); \
            if (err != cudaSuccess) { \
                fprintf( \
                    stderr \
                  , "%s (%d): error: CUDA: %s\n" \
                  , __FILE__ , __LINE__ \
                  , cudaGetErrorString(err) \
                ); \
                exit(1); \
            } \
        }

    __attribute__((global)) void g1(int x, int* g_array)
    {
        g_array[0] = x;
    }

    int main()
    {
        int* g_array = 0;
        CUDA_REQUIRE( cudaMalloc(&g_array, sizeof(*g_array)) );
        CUDA_REQUIRE( cudaMemset(g_array, 0, sizeof(*g_array)) );

        int dev = -1;
        CUDA_REQUIRE( cudaGetDevice(&dev) );
        printf("Using CUDA device #%d\n", dev);

        struct arguments
        {
            int x;
            int* g_array;
        };

        int x = 42;
    #ifdef USE_CUDA_CALL_SYNTAX
        g1<<<1, 1>>>(x, g_array);
    #else
        CUDA_REQUIRE( cudaConfigureCall(1, 1) );
        CUDA_REQUIRE( cudaSetupArgument(&x, sizeof(x), offsetof(arguments, x)) );
        CUDA_REQUIRE( cudaSetupArgument(&g_array, sizeof(g_array), offsetof(arguments, g_array)) );
        CUDA_REQUIRE( cudaLaunch(reinterpret_cast<char const*>(&g1)) );
    #endif
        CUDA_REQUIRE( cudaDeviceSynchronize() );

        int result = 0;
        CUDA_REQUIRE( cudaMemcpy(&result, g_array, sizeof(*g_array), cudaMemcpyDeviceToHost) );
        printf("42 == %d\n", result);
    }

Compile with Clang using <<< >>> syntax:

    clang++ -DUSE_CUDA_CALL_SYNTAX -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu

    ./kernel-call
    Using CUDA device #0
    42 == 0

Compile with Clang using manual cudaLaunch:

    clang++ -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu

    ./kernel-call
    Using CUDA device #0
    kernel-call.cu (48): error: CUDA: invalid device function

Compile with nvcc using manual cudaLaunch:

    nvcc -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu

    ./kernel-call
    Using CUDA device #0
    42 == 42

How does the glue between host and GPU kernel work?

Could I somehow obtain a cudaLaunch-callable pointer to the GPU kernel?

The other issue is the result of kernel execution. The kernel appears
not to write to the global device memory array. I guess this also
involves the NVPTX backend. Should I forward this to the llvm-dev
mailing list, or are the NVPTX developers reading cfe-dev as well?

Thanks,
Peter

Try:

    find $HEADER_DIR | xargs sed -i.bak -e 's/>>>/> > >/'

I don't think there are any syntactic dark corners where that will break otherwise valid C++.

No guarantees though (that's what the -i.bak is for ;).

–Sean Silva

Something is probably getting messed up in the IR generation for the kernel functions (we recently got rid of the old PTX back-end, on which the CUDA integration is based, in favor of the NVPTX back-end). If you post the IR, I can take a look at it.

The other way around: CUDA violates the C++ standard with regard to
template parameter syntax, so I would like to disable the CUDA
execution syntax, and use cudaLaunch with a pointer instead.

Clang's lexer is straightforward :-).

--- lib/Lex/Lexer.cpp 2012-06-14 10:48:34.675569175 -0400
+++ lib/Lex/Lexer.cpp 2012-06-14 10:48:51.246645348 -0400
@@ -3054,10 +3054,6 @@
         // If this is '<<<<' and we're in a Perforce-style conflict marker,
         // ignore it.
         goto LexNextToken;
-      } else if (LangOpts.CUDA && After == '<') {
-        Kind = tok::lesslessless;
-        CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result),
-                             SizeTmp2, Result);
       } else {
         CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
         Kind = tok::lessless;
@@ -3110,10 +3106,6 @@
       } else if (After == '>' && HandleEndOfConflictMarker(CurPtr-1)) {
         // If this is '>>>>>>>' and we're in a conflict marker, ignore it.
         goto LexNextToken;
-      } else if (LangOpts.CUDA && After == '>') {
-        Kind = tok::greatergreatergreater;
-        CurPtr = ConsumeChar(ConsumeChar(CurPtr, SizeTmp, Result),
-                             SizeTmp2, Result);
       } else {
         CurPtr = ConsumeChar(CurPtr, SizeTmp, Result);
         Kind = tok::greatergreater;

Peter

Oh, yeah. idk why but I got the impression that you were trying to get it to work while keeping the cuda call syntax (I haven’t done any cuda programming, but the alternative to the cuda call syntax in the example you gave seemed painful to write), even though you explicitly said “Is there a way to disable parsing of the CUDA call syntax <<< >>>?”

D’oh.

–Sean Silva

The CUDA kernel execution syntax is really just syntactic sugar, and
not needed for CUDA programming. My example code using cudaConfigure,
cudaSetupArgument and cudaLaunch seems impractical at first, but with
a C++ wrapper it becomes very convenient to use:

    cuda::function<void (int, int*)> f = &kernel_function;
    cuda::configure(blocks, threads);
    f(args...);

With nvcc and C++98, I was using the following wrapper. It's not the
nicest code, but it will look better with C++11 and variadic templates
to implement cuda::function.

http://git.colberg.org/cuda-wrapper.git/plain/cuda_wrapper/function.hpp

Anyway, the kernel launching is only a side issue ;-).

Peter

I attached the CUDA source and the LLVM IR.

clang++ -S -emit-llvm -I/usr/local/cuda-4.2/cuda/include -o kernel-call.s kernel-call.cu

clang++ -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o kernel-call kernel-call.cu

./kernel-call
Using CUDA device #0
42 == 0

clang++ --version
clang version 3.2 (http://llvm.org/git/clang.git f1492f970c7c6eb85dc18f13fb864b185bed1d23) (http://llvm.org/git/llvm.git cba91230c0beef79e5042d8e983198b26aac5616)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Thanks,
Peter

kernel-call.cu (1.21 KB)

kernel-call.s (10.4 KB)


The attached IR is for the host, not the device. I haven’t played around with the CUDA front-end in Clang, but I don’t think the plumbing is hooked up to generate PTX device code and embed it into the final binary. Someone who works on the front-end would be better able to comment.

I reduced the source file to this:

    // kernel.cu

    __attribute__((global)) void f(int* array)
    {
        array[0] = 42;
    }

clang++ -I/usr/local/cuda-4.2/cuda/include -S -emit-llvm -o kernel-x86_64.s kernel.cu

    ; ModuleID = 'kernel.cu'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64-S128"
    target triple = "x86_64-unknown-linux-gnu"

    define void @_Z1fPi(i32* %array) uwtable {
    entry:
      %array.addr = alloca i32*, align 8
      store i32* %array, i32** %array.addr, align 8
      %0 = bitcast i32** %array.addr to i8*
      %1 = call i32 @cudaSetupArgument(i8* %0, i64 ptrtoint (i1** getelementptr (i1** null, i32 1) to i64), i64 0)
      %2 = icmp eq i32 %1, 0
      br i1 %2, label %setup.next, label %setup.end

    setup.next: ; preds = %entry
      %3 = call i32 @cudaLaunch(i8* bitcast (void (i32*)* @_Z1fPi to i8*))
      br label %setup.end

    setup.end: ; preds = %setup.next, %entry
      ret void
    }

    declare i32 @cudaSetupArgument(i8*, i64, i64)

    declare i32 @cudaLaunch(i8*)

clang -cc1 -fcuda-is-device -I/usr/local/cuda-4.2/cuda/include -emit-llvm -triple nvptx64-unknown-unknown -o kernel-nvptx64.s kernel.cu

    ; ModuleID = 'kernel.cu'
    target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64"
    target triple = "nvptx64-unknown-unknown"

    define ptx_kernel void @_Z1fPi(i32* %array) nounwind {
    entry:
      %array.addr = alloca i32*, align 8
      store i32* %array, i32** %array.addr, align 8
      %0 = load i32** %array.addr, align 8
      %arrayidx = getelementptr inbounds i32* %0, i64 0
      store i32 42, i32* %arrayidx, align 4
      ret void
    }

The second output does look like LLVM IR of device code.

How do I compile the PTX to object code?

How do I link device and host code together?

Why is there a cudaLaunch in the host LLVM IR despite the omitted <<< >>> call?

Thanks,
Peter

As Justin mentioned, I don't think the right plumbing exists that uses the
frontend support for parsing and lowering the CUDA syntax and connects it
to the NVPTX backend. For instance, from the IR, it looks like the kernel
function pointer is just cast to i8* and passed to cudaLaunch. cudaLaunch
requires the name of the kernel to be passed as a char string.

Definitely not, otherwise my cudaLaunch wrapper would have been failing since 2007 ;-).

The advantage of the CUDA runtime library, as opposed to the CUDA
driver library, lies in the convenient execution of kernels.

When the nvcc frontend converts the <<< >>> syntax to proper C++ code,
it does not pass a char string to cudaLaunch, but an actual pointer.
The only question is how to get that pointer in Clang…

If you are interested, have a look at the preprocessed C++ output of
the nvcc frontend of a simple kernel call.

I feel CUDA support needs a lot of work before becoming usable.

Yes, that is indeed the case… :(

Peter

I have to admit that I never looked at the memory pointed to by a
kernel function pointer, so I did (please tell me if this is too naïve…).

    // function.cu

    __global__ void f(int* array)
    {
        array[0] = 42;
    }

    int main()
    {
        void (*p)(int*) = &f;
        printf("%s\n", reinterpret_cast<char const*>(p));
    }

    nvcc -Xcompiler -Wall -Xptxas -v -o function function.cu
    ptxas info : Compiling entry function '_Z1fPi' for 'sm_10'
    ptxas info : Used 2 registers, 8+16 bytes smem

    ./function
    UH��H��H�}�H�E�H�������UH��SH��(�P@

So the function pointer to the GPU kernel is indeed not an ASCII string.

What is it then?

Peter


Regardless of how nvcc works under the hood, the point is that Clang is not currently set up to fully support CUDA. You can invoke it once with -fcuda-is-device and once without to get the IR for both the host and device, but there is not yet a good way to link those together. The path of least resistance right now is to compile the device code to PTX and then invoke it with the Driver API. I'm not sure what the maintainer of the CUDA front-end in Clang is planning; you would have to ask him/her when full support is planned.

> The parser interprets the compressed C++11 template parameter syntax
> as a call to a CUDA kernel function. Is there a way to disable parsing
> of the CUDA call syntax <<< >>>? I would be using a C++ wrapper around
> cudaConfigureCall, cudaSetupArgument and cudaLaunch anyway.


I've just improved the diagnostic for this in Clang ToT:

test/Parser/cuda-kernel-call.cu:13:12: error: a space is required between
      consecutive right angle brackets (use '> >')
  S<S<S<int>>> s; // expected-error 2{{use '> >'}}
           ^~
           > >
test/Parser/cuda-kernel-call.cu:13:13: error: a space is required between
      consecutive right angle brackets (use '> >')
  S<S<S<int>>> s; // expected-error 2{{use '> >'}}
            ^~
            > >

If we want to support CUDA + C++11, I think the right fix for this
issue is to extend C++11's rule for '>>'-fission to CUDA's '>>>'
token:

Index: lib/Parse/ParseTemplate.cpp

With the above patch, C++11 template parameters and CUDA syntax coexist nicely:

    #include <cuda_runtime.h>

    template <typename T>
    struct S;

    template <typename T>
    __attribute__((global)) void f() {}

    int main()
    {
        f<S<S<int>>><<<1, 1>>>();
    }

When including <iostream>, there is still a minor problem:

    // template.cu

    #include <cuda_runtime.h>
    #include <iostream>

    int main() {}

    clang++ -Wall -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o template template.cu

    In file included from template.cu:1:
    /usr/local/cuda-4.2/cuda/include/cuda_runtime.h:267:10: warning: function 'cudaMallocHost' has internal linkage but is used in an inline function with external
          linkage [-Winternal-linkage-in-inline]
      return cudaMallocHost((void**)(void*)ptr, size, flags);
             ^
    /usr/local/cuda-4.2/cuda/include/cuda_runtime.h:261:1: note: use 'static' to give inline function 'cudaMallocHost' internal linkage
    __inline__ __host__ cudaError_t cudaMallocHost(
    ^
    static
    /usr/local/cuda-4.2/cuda/include/cuda_runtime.h:222:40: note: 'cudaMallocHost' declared here
    static __inline__ __host__ cudaError_t cudaMallocHost(
                                           ^
    In file included from template.cu:2:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/iostream:39:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/ostream:39:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/ios:42:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/ios_base.h:42:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/locale_classes.h:41:
    In file included from /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/string:53:
    /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/basic_string.h:504:9: error: call to
          implicitly-deleted copy constructor of 'std::basic_string<char, std::char_traits<char>, std::allocator<char>>::_Alloc_hider'
          : _M_dataplus(__str._M_dataplus)
            ^ ~~~~~~~~~~~~~~~~~
    /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/basic_string.h:2863:12: note: in
          instantiation of member function 'std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string' requested here
      { return __gnu_cxx::__to_xstring<string>(&std::vsnprintf, 4 * sizeof(int),
               ^
    /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/basic_string.h:268:29: note: copy
          constructor of '_Alloc_hider' is implicitly deleted because base class 'std::allocator<char>' has no copy constructor
          struct _Alloc_hider : _Alloc
                                ^
    /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/basic_string.h:504:9: error: call to
          implicitly-deleted copy constructor of 'std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t>>::_Alloc_hider'
          : _M_dataplus(__str._M_dataplus)
            ^ ~~~~~~~~~~~~~~~~~
    /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/basic_string.h:2964:12: note: in
          instantiation of member function 'std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >::basic_string' requested here
      { return __gnu_cxx::__to_xstring<wstring>(&std::vswprintf, 4 * sizeof(int),
               ^
    /home/peter/usr/rhel6-x86_64/gcc-4.7.1/lib/gcc/x86_64-unknown-linux-gnu/4.7.1/../../../../include/c++/4.7.1/bits/basic_string.h:268:29: note: copy
          constructor of '_Alloc_hider' is implicitly deleted because base class 'std::allocator<wchar_t>' has no copy constructor
          struct _Alloc_hider : _Alloc
                                ^
    1 warning and 2 errors generated.

Is there something else that needs to be added for C++11 parsing,
besides CPlusPlus0x in include/clang/Frontend/LangStandards.def?

The same compiles fine in C++ mode (modulo warning):

    // template.cpp

    #include <cuda_runtime.h>
    #include <iostream>

    int main() {}

    clang++ -std=c++11 -Wall -I/usr/local/cuda-4.2/cuda/include -L/usr/local/cuda-4.2/cuda/lib64 -lcudart -o template template.cpp
    In file included from template.cpp:1:
    /usr/local/cuda-4.2/cuda/include/cuda_runtime.h:267:10: warning: function 'cudaMallocHost' has internal linkage but is used in an inline function with external
          linkage [-Winternal-linkage-in-inline]
      return cudaMallocHost((void**)(void*)ptr, size, flags);
             ^
    /usr/local/cuda-4.2/cuda/include/cuda_runtime.h:261:1: note: use 'static' to give inline function 'cudaMallocHost' internal linkage
    __inline__ __host__ cudaError_t cudaMallocHost(
    ^
    static
    /usr/local/cuda-4.2/cuda/include/cuda_runtime.h:222:40: note: 'cudaMallocHost' declared here
    static __inline__ __host__ cudaError_t cudaMallocHost(
                                           ^
    1 warning generated.

Regards,
Peter


If we want to support CUDA + C++11, I think the right fix for this
issue is to extend C++11's rule for '>>'-fission to CUDA's '>>>'
token:

Index: lib/Parse/ParseTemplate.cpp

--- lib/Parse/ParseTemplate.cpp (revision 158652)
+++ lib/Parse/ParseTemplate.cpp (working copy)
@@ -785,7 +785,8 @@
     Hint2 = FixItHint::CreateInsertion(Next.getLocation(), " ");
 
   unsigned DiagId = diag::err_two_right_angle_brackets_need_space;
-  if (getLangOpts().CPlusPlus0x && Tok.is(tok::greatergreater))
+  if (getLangOpts().CPlusPlus0x &&
+      (Tok.is(tok::greatergreater) || Tok.is(tok::greatergreatergreater)))
     DiagId = diag::warn_cxx98_compat_two_right_angle_brackets;
   else if (Tok.is(tok::greaterequal))
     DiagId = diag::err_right_angle_bracket_equal_needs_space;

This will allow us to correctly deal with '>>' in almost all cases in
CUDA mode, and the exceptions are truly bizarre and implausible
constructs like "(SomeType)&FnTmpl<ClassTmpl<T>>>>3;" -- here the
'>>>>' splits into '> > >>' in C++11 and would split into '> > > >' in
CUDA/C++11 with the proposed rule.


This "has no copy constructor" condition should not be possible. Looks
like the problem is here (end of Sema/SemaDeclCXX.cpp):

    Sema::CUDAFunctionTarget Sema::IdentifyCUDATarget(const FunctionDecl *D) {
      // Implicitly declared functions (e.g. copy constructors) are
      // __host__ __device__
      if (D->isImplicit())
        return CFT_HostDevice;

    bool Sema::CheckCUDATarget(CUDAFunctionTarget CallerTarget,
                               CUDAFunctionTarget CalleeTarget) {
      if (CallerTarget == CFT_HostDevice && CalleeTarget != CFT_HostDevice)
        return true;

So... an implicit copy constructor is __host__ __device__, and thus
can't call an explicitly-declared copy constructor (which is just
__host__). Oops. Is this a bug in the CUDA spec or in Clang?

Please correct me if I am wrong, but I have never seen a CUDA
specification. The CUDA programming guide does not mention
copy constructors.

Why are these attributes needed at all? Can't the compiler just fail
on code that will not compile to a GPU kernel, e.g. due to use of C++
run-time support?

I have never understood why the nvcc compiler rejects perfectly
valid C++ code simply because of a nonexistent __device__ attribute.

Peter