Parallelism TS implementation and feasibility of GPU execution policies

Newcomer here, I hope this isn’t off-topic, but this seemed to be the most appropriate place to ask:

Are there plans to implement Parallelism-TS in libc++/Clang? If so, what execution policies might be supported?

Besides multi-threaded CPU execution policies, are there plans, or would it even be feasible, to implement a GPU execution policy in libc++/Clang that targets the NVPTX backend, using the Clang frontend only (i.e., without NVCC)?

This would be extremely useful: even with the latest CUDA 7 release, NVCC remains slow and buggy, and consumes massive amounts of memory compared to Clang. Compiling my Thrust-based code with Clang against Thrust's TBB backend takes just a minute or so and consumes only a few gigabytes of memory. Compiling the exact same code against Thrust's CUDA backend with NVCC consumes ~20 GB of memory and takes well over an hour (on my 24 GB workstation; on my 16 GB laptop it never finishes). Obviating the need for NVCC when compiling code targeting NVIDIA GPUs via a Parallelism TS implementation would be a huge improvement.

Finally, are there plans, or would it even be feasible, to target OpenCL/SYCL/SPIR(-V) via Parallelism-TS? I am aware of existing OpenCL-based parallel algorithms libraries, but I am really hoping for a Parallelism TS execution policy for OpenCL devices, so that there is a single-source, fully integrated approach where one can pass C++ function objects directly, as opposed to being restricted to passing strings of OpenCL C, or having to pre-instantiate template functors with macro wrappers.
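
For concreteness, here is a rough sketch of the kind of code I am hoping to write. This is illustrative only: the TS specifies the seq, par, and par_vec policies, and the gpu policy mentioned below is hypothetical.

    // Parallelism TS (illustrative): pass a plain C++ lambda to a
    // standard algorithm, selecting where it runs via a policy object.
    #include <experimental/algorithm>
    #include <experimental/execution_policy>
    #include <vector>

    namespace parallel = std::experimental::parallel;

    void scale(std::vector<float>& v, float a) {
      // Runs in parallel on the CPU with the TS's par policy:
      parallel::for_each(parallel::par, v.begin(), v.end(),
                         [a](float& x) { x *= a; });
      // What I am asking about is a hypothetical policy along the
      // lines of:
      //   parallel::for_each(gpu, v.begin(), v.end(), ...);
      // dispatching the same lambda to an NVPTX- or SPIR-compiled
      // kernel.
    }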

Andrew Corrigan

> Newcomer here, I hope this isn't off-topic, but this seemed to be the
> most appropriate place to ask: Are there plans to implement
> Parallelism-TS in libc++/Clang? If so, what execution policies might
> be supported?
>
> Besides multi-threaded CPU execution policies, are there plans, or
> would it even be feasible, to implement a GPU execution policy in
> libc++/Clang that targets the NVPTX backend, using the Clang frontend
> only (i.e., without NVCC)?
>
> [...] Obviating the need for NVCC when compiling code targeting
> NVIDIA GPUs via a Parallelism TS implementation would be a huge
> improvement.

I can't speak for Parallelism-TS, but for the past year or so we've been
steadily trickling more CUDA support into upstream Clang. Internally, we
use a Clang-based compiler for CUDA-to-PTX (alongside nvcc), and as you
mention, one of its strengths vs. nvcc is compilation time and resource
consumption. For template-metaprogramming-heavy code, Clang's frontend
is an order of magnitude faster.

The pace of our upstreaming is picking up. Take a look at
http://reviews.llvm.org/D8463, for example, and feel free to help out
with reviews.

> Finally, are there plans, or would it even be feasible, to target
> OpenCL/SYCL/SPIR(-V) via Parallelism-TS? [...] I am really hoping for
> a Parallelism TS execution policy for OpenCL devices, so that there is
> a single-source, fully integrated approach where one can pass C++
> function objects directly, as opposed to being restricted to passing
> strings of OpenCL C, or having to pre-instantiate template functors
> with macro wrappers.

It is certainly *possible* to target something like SPIR(-V) from Clang
for CUDA, since Clang now just generates LLVM IR. I'm not sure anyone is
planning it at this time, though.

Eli

Hi.

Sorry to interrupt, but do I understand correctly that there is a way to emit LLVM IR from CUDA code?

Is there any documentation on that?

I’m really interested.

Thanks

> Sorry to interrupt, but do I understand correctly that there is a way
> to emit LLVM IR from CUDA code?

In general, the Clang frontend (-cc1) can generate LLVM IR for the nvptx
triples/targets, when passed -fcuda-is-device. To use this in practice,
you'll need to supply a bunch of things in headers (definitions of
builtins, CUDA types and such), and no such headers exist in the open yet.
Clang won't be able to parse the NVIDIA headers as these collide with the
standard C++ headers in some ways.

Look at the code review I linked to earlier for more progress in making
Clang a viable compiler for CUDA.
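
To make that concrete, here is a minimal sketch of what this looks like
in practice. The flags are the ones mentioned above; the stand-in
declarations are assumptions, since, as noted, no suitable open headers
exist yet, and the exact spellings may vary by Clang version.

    // kernel.cu, compiled with the Clang frontend alone, e.g.:
    //   clang -cc1 -triple nvptx64-nvidia-cuda -fcuda-is-device \
    //     -emit-llvm kernel.cu -o kernel.ll
    // Stand-ins for what the CUDA headers would normally provide:
    #define __global__ __attribute__((global))
    #define __device__ __attribute__((device))

    __device__ int tid_x() {
      // Clang exposes the PTX special registers as builtins:
      return __nvvm_read_ptx_sreg_tid_x();
    }

    __global__ void fill(float* p, float v) { p[tid_x()] = v; }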

> Is there any documentation on that?

Not that I know of, at this time.

Eli

The old CUDA headers used to be permissively licensed (see below).

Those headers are probably sufficient to get things "rolling", but I
don't know if they are really a good start. (Not to mention you'd be
missing a runtime.)

The problem is that you'd have to do two passes with different
(conflicting) defines: once for host and once for device. Getting host
and device to play nice together is a b*. We have this resolved in our
Clang, but it's really specific to our compilation flow. A general
solution would most likely involve extensive changes to the headers, or
a rewrite. :-/

I've fought with this for the past 5 years. I'll try to help in a
general way if/where I can.

Old CUDA header license

> The problem is that you'd have to do two passes with different
> (conflicting) defines: once for host and once for device. [...]

Yep, this (two passes with different defines) is the path taken in our
approach, and the one we're pushing in http://reviews.llvm.org/D8463.

Note that, headers notwithstanding, the two-pass compilation flow is
enforced by the definition of the CUDA language itself: __CUDA_ARCH__ is
defined only for device code and undefined for host code, even though
both can live in the same TU. So you *have* to compile the code twice.
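
To make that concrete, a trivial function that means two different
things in the two passes:

    // One TU, two compilations: __CUDA_ARCH__ is defined only during
    // the device pass (to the target's compute capability, e.g. 350).
    __host__ __device__ int block_size() {
    #ifdef __CUDA_ARCH__
      return 256;  // device pass
    #else
      return 1;    // host pass
    #endif
    }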

Back to the headers: these will remain a problem. We're now looking at
different options.

Eli

Thanks

The thing is, I’m actually writing those headers (because I have to), so
builtins are not really an issue.

My problems were __global__ and __shared__, which I defined as custom
attributes, and the more-than-painful kernel<<<32, 128>>> launch syntax,
which I can’t handle without modifying the frontend myself (and that
doesn’t look easy :) ).
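
Roughly, the mapping is along these lines; the attribute spellings below
are Clang's CUDA attributes, though whether they behave identically
outside of a full CUDA compilation is something I still need to verify:

    // Custom-attribute stand-ins for the CUDA keywords:
    #define __global__ __attribute__((global))
    #define __shared__ __attribute__((shared))
    #define __device__ __attribute__((device))
    #define __host__   __attribute__((host))
    // No such escape hatch exists for the kernel<<<grid, block>>>
    // launch syntax; that needs support in the frontend itself.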

I’ll try -fcuda-is-device and keep you posted.

thanks

> Newcomer here, I hope this isn't off-topic, but this seemed to be the
> most appropriate place to ask: Are there plans to implement
> Parallelism-TS in libc++/Clang? If so, what execution policies might
> be supported?
>
> [...]
>
> Finally, are there plans, or would it even be feasible, to target
> OpenCL/SYCL/SPIR(-V) via Parallelism-TS? I am aware of existing
> OpenCL-based parallel algorithms libraries, but I am really hoping for
> a Parallelism TS execution policy for OpenCL devices, so that there is
> a single-source, fully integrated approach where one can pass C++
> function objects directly, as opposed to being restricted to passing
> strings of OpenCL C, or having to pre-instantiate template functors
> with macro wrappers.

I know that Codeplay has a Parallelism-TS implementation based on OpenCL
SYCL.

I guess they will talk about this at IWOCL (http://www.iwocl.org/) next
month, where there will be several sessions on SYCL.

They presented their Clang/LLVM-based SYCL implementation at the
Workshop on the LLVM Compiler Infrastructure in HPC at Supercomputing
2014
(http://www.codeplay.com/public/uploaded/publications/SC2014_LLVM_HPC.pdf),
but I just checked, and it does not mention Parallelism-TS yet.
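
To illustrate the single-source model in question, here is a rough
sketch along the lines of the provisional SYCL 1.2 interface
(illustrative only, and not Codeplay's actual Parallelism-TS layer):

    #include <CL/sycl.hpp>
    #include <vector>

    void scale(std::vector<float>& v, float a) {
      namespace sycl = cl::sycl;
      sycl::queue q;  // selects a default OpenCL device
      sycl::buffer<float, 1> buf(v.data(), sycl::range<1>(v.size()));
      q.submit([&](sycl::handler& cgh) {
        auto acc = buf.get_access<sycl::access::mode::read_write>(cgh);
        // The kernel is a plain C++ lambda, not a string of OpenCL C:
        cgh.parallel_for<class scale_kernel>(
            sycl::range<1>(v.size()),
            [=](sycl::id<1> i) { acc[i] *= a; });
      });
      // The buffer writes its data back to v when it goes out of scope.
    }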