Proposal: CUDA support; outline, initial patches

Hi,

This is intended to outline how we could add CUDA support to Clang,
feedback appreciated. I've also attached patches with initial
progress.

Architecture

0001-Add-CUDA-language-option.patch (1.14 KB)

0002-Lexer-add-CUDA-kernel-call-tokens.patch (2.07 KB)

0003-Frontend-add-cuda-flag.patch (1.58 KB)

0004-Parse-basic-support-for-parsing-CUDA-kernel-calls.patch (2.72 KB)

0005-AST-support-for-extra-subexpressions-on-CallExpr-sub.patch (3.76 KB)

0006-AST-add-CUDAKernelCallExpr.patch (9.23 KB)

0007-Sema-implement-Sema-PrepareArgument.patch (3.39 KB)

0008-Sema-support-for-building-CUDAKernelCallExpr-from-Ac.patch (6.75 KB)

0009-Parse-pass-execution-configuration-to-Sema.patch (1.01 KB)

0010-Sema-diagnostics-for-too-few-many-exec-config-args-t.patch (2.6 KB)

0011-Lex-support-for-CUDA-attributes.patch (2.35 KB)
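For reference, the construct the lexer and parser patches target is the
standard CUDA kernel-call syntax, in which a global function is invoked
with a triple-angle-bracket execution configuration. A minimal sketch
(illustrative only, not code taken from the patches):

    // scale is a kernel: declared with __global__, callable from host
    // code, executed on the device.
    __global__ void scale(float *data, float k) {
        data[threadIdx.x] *= k;
    }

    void host_code(float *d_data) {
        // The execution configuration <<<...>>> gives the grid and block
        // dimensions (optionally also shared-memory size and stream); the
        // parser patches handle this form, and the diagnostics patch
        // covers too few or too many configuration arguments.
        scale<<<1, 256>>>(d_data, 2.0f);
    }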

Why not OpenCL as more people can actually run it and on far more devices?

OvermindDL1 wrote:
> Why not OpenCL as more people can actually run it and on far more devices?

If you have to ask, then you're likely out of touch with GPGPU usage trends. (Basically, there are a lot more people using CUDA and a lot more CUDA code out there.)

But this raises an interesting question: If we can build an AST from CUDA source, is there any reason we cannot build OpenCL code from that?

-- Owen Shepherd
http://www.owenshepherd.net/
owen.shepherd@e43.eu (general) / oshepherd1@shef.ac.uk (university)

If it is possible to do that, it could also help port projects from
CUDA to OpenCL (assuming the output is human-readable).

But in either case, you may want to make sure that the LLVM IR you
generate from CUDA code will be compatible with what Mesa's clover
will eventually generate, or can use.
I've lost track of which IR Mesa will standardise on (is it TGSI,
Mesa-IR, GLSL2 IR, or LLVM IR?); you should probably ask that on the
Mesa list.

Best regards,
--Edwin

> > OvermindDL1 wrote:
> >> Why not OpenCL as more people can actually run it and on far more
> >> devices?
> >>
> > If you have to ask then you're likely out of touch with GPGPU usage
> > trends.... (Basically there's a lot more people using CUDA and a
> > lot more CUDA code out there..)
>
> But this raises an interesting question: If we can build an AST from
> CUDA source, is there any reason we cannot build OpenCL code from
> that?

> If it is possible to do that, it could also help port projects from
> CUDA to OpenCL (assuming the output is human-readable).

This is more difficult than it sounds. For one thing, CUDA is based
on C++ while OpenCL is based on C. We might be able to do something
with (a modified version of?) the C backend, though the output wouldn't
be very human-readable, and we'd be restricted in the CUDA we could
accept.

A better approach, IMHO, would be to add more LLVM backends for GPU
targets. If we had an AMDIL backend, that would be great.

> But in either case, you may want to make sure that the LLVM IR you
> generate from CUDA code will be compatible with what Mesa's clover
> will eventually generate, or can use.

We'll need to introduce a number of target-specific parameters for CUDA.
It will be a matter for Mesa to tune those parameters for their needs.

Thanks,

After further investigation, I determined that this won't work.
CUDA (at least the NVIDIA SDK headers) depends on certain macros and
declarations being subtly different for host and device code.

The new strategy is to parse the source file twice, once for the host and
once for the device. CodeGen would still be responsible for filtering
declarations.
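
To illustrate the kind of divergence meant here, consider a hypothetical
header fragment (not a quote from the SDK) in which a declaration changes
shape depending on whether device code is being compiled; no single AST
can represent both views:

    // __CUDA_ARCH__ is defined only while compiling device code, so the
    // host and device passes see different declarations for the same name.
    #ifdef __CUDA_ARCH__
    __device__ int clock_value(void);   // device-side declaration
    #define USE_FAST_MATH 1
    #else
    long clock_value(void);             // host-side declaration, other type
    #define USE_FAST_MATH 0
    #endif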

This new patch series is built on top of the old series and replaces
patch 11 (which was based on the mistaken assumption that the type
qualifiers weren't attributes hiding behind a #define) and implements:

- Initial support for device, global, host, constant, shared and
   launch_bounds attributes. The attributes are recognised and added
   to the AST but no significant semantic analysis (e.g. checking
   for incompatible combinations) is performed.

- CodeGen support for filtering declarations based on attributes.
   A new flag, -fcuda-is-device, is added to the frontend to select
   between host and device code generation (see the sketch after this
   list).

- Initial PTX target in Basic and CodeGen. This target sets up
   the correct calling conventions for device vs. global functions,
   but not much else.
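
As a rough illustration of how these attributes appear in source once
the SDK's qualifier macros have expanded (assumed typical usage, not
code from the patches):

    __constant__ float coeffs[16];                // device constant memory
    __device__ float table[256];                  // device global memory

    __device__ __host__ float square(float x) {   // compiled for both sides
        return x * x;
    }

    __global__ void __launch_bounds__(256)        // cap on threads per block
    fill(float *out) {
        __shared__ float scratch[256];            // per-block shared memory
        scratch[threadIdx.x] = square((float)threadIdx.x);
        __syncthreads();
        out[threadIdx.x] = scratch[255 - threadIdx.x];
    }

The -fcuda-is-device flag would then select which subset CodeGen emits:
presumably the device-side definitions (square and fill here) in the
device pass, and the host-side declarations in the host pass.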

Any reviews are much appreciated.

Thanks,

0001-Basic-Sema-initial-support-for-CUDA-location-attribu.patch (9.1 KB)

0002-Basic-Sema-Add-launch_bounds-attribute.patch (4.29 KB)

0003-Frontend-add-CodeGenOptions-CUDAIsDevice.patch (1.1 KB)

0004-CodeGen-have-EmitGlobal-act-on-CUDAIsDevice.patch (1.4 KB)

0005-Frontend-add-fcuda-is-device-flag.patch (1.39 KB)

0006-Basic-add-PTXTargetInfo.patch (3.28 KB)

0007-CodeGen-add-PTXTargetCodeGenInfo.patch (2.08 KB)

Hi,

+cfe-commits, as I think this series is now ready for pre-commit review.

This new patch series replaces series 1 and 2 entirely, and includes
the following changes:

- Driver and Frontend support for the .cu file extension and '-x cuda',
   which both enable CUDA features.

- Improved parsing error recovery for kernel call expressions.

- Added a few diagnostics related to global functions and kernel calls.

- Redesigned CUDAKernelCallExpr to store the cudaConfigureCall function
   call as a subexpression. The cudaConfigureCall declaration is
   now also used to parse the execution configuration. This has
   a number of advantages:

   - Reduces hardcoded surface area
   - Same behaviour as the NVIDIA toolchain
   - Therefore future-proof; if the cudaConfigureCall parameter list
     changes, there is no need to change CUDAKernelCallExpr
   - Simplifies CUDAKernelCallExpr::child_end(), which previously had
     to handle the optional subexpressions carefully
   - Can reuse the existing Sema and CodeGen support for CallExprs

- CodeGen support for CUDAKernelCallExpr. This calls cudaConfigureCall
   to set the execution configuration before calling the global function
   (the global function itself is not yet CodeGen'd for the host).

- Added small test suite, which tests Parse, Sema and CodeGen.

Future work:

- Host-side CodeGen support for global functions. This will involve
   generating a local device stub which uses cudaSetupArgument and
   cudaLaunch to set up the argument vector and launch the kernel,
   similar to the NVIDIA toolchain (a rough sketch follows below).
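
To make the intended host-side lowering concrete, here is a rough sketch
of what a call such as scale<<<grid, block>>>(ptr, 2.0f) could turn into
on the host. The stub name __device_stub_scale is made up for the
illustration, and the argument-marshalling details (alignment, kernel
handle registration) are simplified; the runtime entry points are the
cudaConfigureCall/cudaSetupArgument/cudaLaunch API referred to above.

    #include <cuda_runtime.h>

    __global__ void scale(float *data, float k);   // the kernel being called

    // Generated device stub (the future work described above): copy each
    // argument into the launch buffer, then launch the kernel.
    static void __device_stub_scale(float *data, float k) {
        size_t offset = 0;
        cudaSetupArgument(&data, sizeof(data), offset); offset += sizeof(data);
        cudaSetupArgument(&k, sizeof(k), offset);       offset += sizeof(k);
        cudaLaunch((const char *)scale);   // really a registered handle
    }

    // What CodeGen for CUDAKernelCallExpr emits at the call site: set the
    // execution configuration, then invoke the kernel through its stub.
    void call_site(dim3 grid, dim3 block, float *ptr) {
        if (cudaConfigureCall(grid, block, /*sharedMem=*/0, /*stream=*/0) ==
            cudaSuccess)
            __device_stub_scale(ptr, 2.0f);
    }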

Reviews appreciated.

Thanks,

0001-Basic-Add-CUDA-language-option.patch (1.14 KB)

0002-Driver-Frontend-add-CUDA-language-support.patch (5.26 KB)

0003-AST-support-for-extra-subexpressions-on-CallExpr-sub.patch (4.77 KB)

0004-AST-add-CUDAKernelCallExpr.patch (8.56 KB)

0005-AST-Sema-keep-track-of-cudaConfigureCall.patch (2.54 KB)

0006-Sema-support-for-building-CUDAKernelCallExpr-from-Ac.patch (6.08 KB)

0007-Lexer-add-CUDA-kernel-call-tokens.patch (2.07 KB)

0008-Parse-add-support-for-parsing-CUDA-kernel-calls.patch (5.76 KB)

0009-Basic-Sema-add-support-for-CUDA-location-attributes.patch (10.7 KB)

0010-Basic-Sema-Add-launch_bounds-attribute.patch (4.29 KB)

0011-Sema-diagnose-kernel-calls-to-non-global-functions.patch (2.31 KB)

0012-Sema-diagnose-kernel-functions-and-kernel-function-c.patch (3.32 KB)

0013-Sema-improve-ConvertArgumentsForCall-diagnostics.patch (2.08 KB)

0014-Sema-add-separate-diagnostics-for-too-few-many-exec-.patch (5.88 KB)

0015-Basic-add-PTXTargetInfo.patch (3.29 KB)

0016-CodeGen-add-PTXTargetCodeGenInfo.patch (2.08 KB)

0017-CodeGen-support-for-CUDAKernelCallExpr.patch (3.9 KB)

0018-Frontend-add-CodeGenOptions-CUDAIsDevice.patch (1.1 KB)

0019-CodeGen-filter-declarations-based-on-attributes-and-.patch (1.42 KB)

0020-Frontend-add-fcuda-is-device-flag.patch (2.69 KB)

> > With a Clang-based driver, we can do better, by parsing the source
> > file once to produce a single AST,

> After further investigation, I determined that this won't work.
> CUDA (at least the NVIDIA SDK headers) depends on certain macros and
> declarations being subtly different for host and device code.
> The new strategy is to parse the source file twice, once for the host and
> once for the device. CodeGen would still be responsible for filtering
> declarations.

Ok.

> This new patch series is built on top of the old series and replaces
> patch 11 (which was based on the mistaken assumption that the type
> qualifiers weren't attributes hiding behind a #define) and implements:
> - Initial support for device, global, host, constant, shared and
>   launch_bounds attributes. The attributes are recognised and added
>   to the AST but no significant semantic analysis (e.g. checking
>   for incompatible combinations) is performed.

Ok, please trickle the patches in one at a time. Starting here makes sense. In this patch, please use attribute names like __attribute__((cuda_device)) etc. instead of just "device" to avoid ambiguity. Also, these attributes should be rejected when not in cuda language mode. The prerequisite for that is to add a cuda language mode.

-Chris

Hi Chris,

> This new patch series is built on top of the old series and replaces
> patch 11 (which was based on the mistaken assumption that the type
> qualifiers weren't attributes hiding behind a #define) and implements:
> - Initial support for device, global, host, constant, shared and
> launch_bounds attributes. The attributes are recognised and added
> to the AST but no significant semantic analysis (e.g. checking
> for incompatible combinations) is performed.

> Ok, please trickle the patches in one at a time. Starting here makes sense.

Agree.

> In this patch, please use attribute names like __attribute__((cuda_device))
> etc. instead of just "device" to avoid ambiguity.

The problem with renaming the attributes is that the NVIDIA SDK headers
use the plain attribute names without the cuda_ prefix. It's important
that we are able to parse those headers so that clang can be used as a
frontend for unmodified CUDA programs. Instead, we could rename the Attr
classes to CUDA*Attr and keep the attribute spellings as they are.
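
For context, the SDK qualifier macros expand to plain GNU attribute
spellings, roughly along the following lines (an approximation of the
headers, not an exact quote), which is why the "device" spelling itself
has to keep parsing:

    // Approximate shape of the NVIDIA SDK definitions:
    #define __device__   __attribute__((device))
    #define __global__   __attribute__((global))
    #define __host__     __attribute__((host))
    #define __constant__ __attribute__((constant))
    #define __shared__   __attribute__((shared))

    __device__ int x;   // reaches the parser as: __attribute__((device)) int x;

Keeping those spellings while naming the AST classes CUDADeviceAttr,
CUDAGlobalAttr, and so on would avoid both the ambiguity and the header
incompatibility.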

> Also, these attributes should be rejected when not in cuda language mode.

Agree.

> The prerequisite for that is to add a cuda language mode.

Yes, that will be patches 1, 2 from series 3.

Thanks,

> > In this patch, please use attribute names like __attribute__((cuda_device))
> > etc. instead of just "device" to avoid ambiguity.

> The problem with renaming the attributes is that the NVIDIA SDK headers
> use the plain attribute names without the cuda_ prefix. It's important
> that we are able to parse those headers so that clang can be used as a
> frontend for unmodified CUDA programs. Instead, we could rename the Attr
> classes to CUDA*Attr and keep the attribute spellings as they are.

Sounds great to me. Parsing "kernel" to CUDAKernelAttr is the right approach when in cuda mode.

> > The prerequisite for that is to add a cuda language mode.

> Yes, that will be patches 1, 2 from series 3.

Ok, but please start this patch series by adding cuda language mode.

-Chris