RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

At Google we’re doing a lot of work on parallel programming models for CPUs, GPUs and other platforms. One place where we’re investing a lot is parallel libraries, especially those closely tied to compiler technology like runtime and math libraries. We would like to develop these in the open, and the natural place seems to be a subproject in LLVM, if others in the community are interested.

Initially, we’d like to open source our StreamExecutor runtime library, which is used for simplifying the management of data-parallel workflows on accelerator devices and can also be extended to support other hardware platforms. We’d like to teach Clang to use StreamExecutor when targeting CUDA and work on other integrations, but that makes much more sense if it is part of the LLVM project.

However, we think the LLVM subproject should be organized as a set of several libraries with StreamExecutor as just the first instance. As just one example of how creating a unified parallelism subproject could help with code sharing, the StreamExecutor library contains some nice wrappers around the CUDA driver API and OpenCL API that create a unified API for managing all kinds of GPU devices. This unified GPU wrapper would be broadly applicable for libraries that need to communicate with GPU devices.
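As a rough illustration of what such a unified wrapper layer could look like (the class and method names below are hypothetical, chosen for this sketch rather than taken from the actual StreamExecutor interface), a platform-neutral executor might expose memory and kernel-launch operations that concrete CUDA-driver-API and OpenCL backends implement:

  // Illustrative sketch only; names are hypothetical.
  #include <cstddef>

  // Platform-neutral handle to memory living on an accelerator.
  struct DeviceMemory {
    void *opaque_handle;  // e.g. wraps a CUdeviceptr or a cl_mem internally
    size_t size;
  };

  // Unified interface; concrete subclasses would forward to the CUDA driver
  // API (cuMemAlloc, cuLaunchKernel, ...) or to OpenCL (clCreateBuffer,
  // clEnqueueNDRangeKernel, ...).
  class DeviceExecutor {
   public:
    virtual ~DeviceExecutor() = default;
    virtual DeviceMemory Allocate(size_t bytes) = 0;
    virtual void Deallocate(DeviceMemory mem) = 0;
    virtual void CopyHostToDevice(const void *src, DeviceMemory dst, size_t bytes) = 0;
    virtual void CopyDeviceToHost(DeviceMemory src, void *dst, size_t bytes) = 0;
    // Enqueue a previously loaded kernel with a launch geometry; argument
    // packing and kernel handles are platform-specific details hidden here.
    virtual void LaunchKernel(const char *kernel_name, unsigned grid_dim,
                              unsigned block_dim, void **args) = 0;
  };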

Of course, there is already an LLVM subproject for a parallel runtime library: OpenMP! So there is a question of how it would fit into this picture. Eventually, it might make sense to pull in the OpenMP project as a library in this proposed new subproject. In particular, there is a good chance that OpenMP and StreamExecutor could share code for offloading to GPUs and managing workloads on those devices. This is discussed at the end of the StreamExecutor documentation below. However, if it turns out that the needs of OpenMP are too specialized to fit well in a generic parallelism project, then it may make sense to leave OpenMP as a separate LLVM subproject so it can focus on serving the particular needs of OpenMP.

Documentation for the StreamExecutor library that is being proposed for open-sourcing is included below to give a sense of what it is and of how it might fit into a general parallelism LLVM subproject.

What do folks think? Is there general interest in something like this? If so, we can start working on getting a project in place and sketching out a skeleton for how it would be organized, as well as contributing StreamExecutor to it. We’re happy to iterate on the particulars to figure out what works for the community.

FWIW, LLVM sub-project stuff probably is best discussed on llvm-dev. Maybe re-send there?

Thanks for the heads-up Chandler. I’ve moved this thread to llvm-dev. Let’s consider this thread closed and move the discussion there.

It’s all good. It’s also good for the Clang folks to be aware of the discussion and context so they know to follow along on the llvm-dev thread. It might also be good to drop a line to openmp-dev that this discussion is taking place.

Sounds like a neat project!

Some side questions to help with perspective:
1) How well does this align with what the C++ standard is doing for accelerator parallelism?
2) Do you have any benchmarks showing how much it costs to use this wrapper vs bare CUDA?
3) What sort of changes exactly would be needed inside clang/llvm to make it do what you need?
4) How is this different from, say, Thrust, AMD's wrapper libs, Raja, etc.?
5) Does it handle collapse, reductions and complex types?
6) On the CPU side, does it just lower to pthreads?

Thanks for your interest. I think you bring up some very good questions.

  1. How well does this align with what the C++ standard is doing for accelerator parallelism?

I think that StreamExecutor will basically live independently of any accelerator-specific changes to the C++ standard. StreamExecutor only wraps the host-side code that launches kernels and has no real opinion about how those kernels are created. If C++ introduces annotations or other constructs to allow functions or blocks to be run on an accelerator, I would expect that C++ would then become another supported accelerator programming language, in the same way that CUDA and OpenCL are currently supported.
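To make "host-side code that launches kernels" concrete, here is a hedged sketch of the kind of sequence such a wrapper hides, shown with the plain CUDA runtime API; the kernel itself could equally be written in OpenCL or, someday, standard C++:

  // Hedged sketch: the host-side steps a wrapper like StreamExecutor manages.
  #include <cuda_runtime.h>
  #include <vector>

  __global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
  }

  int main() {
    const int n = 1 << 20;
    std::vector<float> host(n, 1.0f);

    float *device = nullptr;
    cudaMalloc(&device, n * sizeof(float));              // allocate device memory
    cudaMemcpy(device, host.data(), n * sizeof(float),
               cudaMemcpyHostToDevice);                  // move input to the device
    scale<<<(n + 255) / 256, 256>>>(device, 2.0f, n);    // launch the kernel
    cudaMemcpy(host.data(), device, n * sizeof(float),
               cudaMemcpyDeviceToHost);                  // move the result back
    cudaFree(device);
    return 0;
  }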

  2. Do you have any benchmarks showing how much it costs to use this wrapper vs bare CUDA?

I think the appropriate comparison here would be between StreamExecutor and the Nvidia CUDA runtime library. I don’t have numbers for that comparison, but we do measure the time spent in StreamExecutor calls as a fraction of the total runtime of several of our real applications. In those measurements, we find that the StreamExecutor calls take up less than 1% of the total runtime, so we have been satisfied with that level of performance so far.

  3. What sort of changes exactly would be needed inside clang/llvm to make it do what you need?

The changes would all be inside of clang. Clang already supports compiling CUDA code by lowering to calls to the Nvidia CUDA runtime library. We would introduce a new option into clang for using the StreamExecutor library instead. Some changes need to be made to Sema, because it is currently hardcoded to look for an Nvidia CUDA runtime library function with a specific name in order to determine which types are allowed as arguments in the CUDA triple-angle-bracket launch syntax. Then there would be changes to CodeGen to optionally lower CUDA kernel calls to StreamExecutor library calls, whereas now they are lowered to Nvidia CUDA runtime library calls.
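For context, here is roughly what happens with the triple-angle-bracket syntax today. This is a hedged sketch; the exact emitted calls vary by clang and CUDA version, and the StreamExecutor replacement mentioned at the end is purely hypothetical:

  // Hedged sketch of the current lowering; details differ across versions.
  __global__ void axpy(float a, float *x, float *y) {
    y[threadIdx.x] = a * x[threadIdx.x] + y[threadIdx.x];
  }

  void launch(float a, float *x, float *y) {
    // Written by the user:
    axpy<<<1, 256>>>(a, x, y);

    // Approximately what CodeGen emits today for the line above. Sema
    // type-checks the <<< >>> configuration arguments against a CUDA runtime
    // declaration (cudaConfigureCall in clang at the time of writing),
    // presumably the hardcoded function name referred to above.
    //
    //   cudaConfigureCall(dim3(1), dim3(256), /*sharedMem=*/0, /*stream=*/0);
    //   cudaSetupArgument(&a, sizeof(a), /*offset=*/...);
    //   cudaSetupArgument(&x, sizeof(x), /*offset=*/...);
    //   cudaSetupArgument(&y, sizeof(y), /*offset=*/...);
    //   cudaLaunch(...);
    //
    // With the proposed option, the source stays the same and CodeGen would
    // emit calls into StreamExecutor here instead.
  }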

  4. How is this different from, say, Thrust, AMD’s wrapper libs, Raja, etc.?

My understanding is that Thrust only supports STL operations, whereas StreamExecutor will support general user-defined kernels.

I’m not personally familiar with AMD’s wrapper libs or Raja. If you have links to point me in the right direction, I would be happy to comment on any similarities or differences.
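To illustrate the distinction I mean (a hedged sketch, not a claim about either library's full feature set): Thrust expresses work through its STL-style algorithms, whereas the wrapper approach is about launching arbitrary kernels that the user wrote:

  #include <thrust/device_vector.h>
  #include <thrust/transform.h>
  #include <thrust/functional.h>

  // Arbitrary user-written kernel: the kind of thing a wrapper library
  // launches on the user's behalf.
  __global__ void my_custom_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0f;  // any logic the user wants
  }

  // Algorithm-shaped work expressed through Thrust's STL-like interface.
  void thrust_style(thrust::device_vector<float> &a,
                    thrust::device_vector<float> &b,
                    thrust::device_vector<float> &c) {
    thrust::transform(a.begin(), a.end(), b.begin(), c.begin(),
                      thrust::plus<float>());
  }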

  5. Does it handle collapse, reductions and complex types?

I don’t think I fully understand the question here. If you mean reductions in the sense of the generic programming operation, there is no direct support. A user would currently have to write their own kernel for that, but StreamExecutor does support some common “canned” operations and that set of operations could be extended to include reductions.

For complex types, the support will depend on what the kernel language (such as CUDA or OpenCL) supports. StreamExecutor will basically just treat the data as bytes and shuttle them to and from the accelerator as needed.
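As a hedged sketch of what "write their own kernel" would mean for the reduction case above (standard CUDA, nothing StreamExecutor-specific):

  // Each block reduces its slice in shared memory, then one atomicAdd per
  // block folds the partial sums into *out. Launch with
  // sharedMem = blockDim.x * sizeof(float) and *out initialized to zero.
  __global__ void sum_reduce(const float *in, float *out, int n) {
    extern __shared__ float scratch[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    scratch[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
      if (tid < stride) scratch[tid] += scratch[tid + stride];
      __syncthreads();
    }

    if (tid == 0) atomicAdd(out, scratch[0]);
  }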

  6. On the CPU side, does it just lower to pthreads?

Yes, it is just pthreads under the hood. The host executor is not very clever at this point. It was developed mostly as a way for us to keep in mind how the interface would need to look for different platforms.
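To give a feel for what that host path amounts to (an illustrative sketch under my own assumptions, not the actual host-executor code), a data-parallel "launch" on the CPU can be emulated by splitting the index space across pthreads:

  #include <pthread.h>
  #include <algorithm>
  #include <vector>

  struct Slice {
    void (*body)(int index, void *arg);  // the "kernel" body, called per index
    void *arg;
    int begin, end;
  };

  static void *RunSlice(void *p) {
    Slice *s = static_cast<Slice *>(p);
    for (int i = s->begin; i < s->end; ++i) s->body(i, s->arg);
    return nullptr;
  }

  // Run `body` over the index range [0, n) using `num_threads` host threads.
  void HostLaunch(void (*body)(int, void *), void *arg, int n, int num_threads) {
    std::vector<pthread_t> threads(num_threads);
    std::vector<Slice> slices(num_threads);
    int chunk = (n + num_threads - 1) / num_threads;
    for (int t = 0; t < num_threads; ++t) {
      slices[t] = {body, arg, t * chunk, std::min(n, (t + 1) * chunk)};
      pthread_create(&threads[t], nullptr, RunSlice, &slices[t]);
    }
    for (int t = 0; t < num_threads; ++t) pthread_join(threads[t], nullptr);
  }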

I think my comments are more generic than for/against this sort of proposal - I hope they help start a discussion in general.

Sorry - just trying to get implementation details here

So it's pure C++ syntax exposed to the user, but your runtime. Is there "CUDA" or OpenCL hidden in the headers, and is that where the actual offload portion is happening?

Is there anything stopping you from exposing "wrapper" interfaces which are the same as the NVIDIA runtime? To avoid overhead you can just force-inline them.

Where is the StreamExecutor runtime source now? Does StreamExecutor wrap public or private CUDA/OpenCL runtimes?

/*
I have said this before and I really get uncomfortable with the generic term "CUDA" in clang, until someone from NVIDIA (lawyers) puts something in writing. CUDA is an NV trademark, and the clang/llvm project can't claim to be "CUDA" and needs to make a distinction. Informally this is all friendly now, but I do hope it's officially clarified at some point. Maybe it's as simple as saying "CUDA compatible" - I don't know..
*/

I think having a nice model that lowers cleanly (with high performance) to at least some targets is (should be) very important. From my experience, if you have complex or perfectly nested loops, how would you take this sort of algorithm and map it to StreamExecutor? Getting reductions right or wrong can also have a performance impact. If your goal is to create a "one wrapper rules them all" approach, I'm hoping you can find a common way to also make it easier for basic needs to be expressed to the underlying target (in a target-agnostic way).

From: "C Bergström via cfe-dev" <cfe-dev@lists.llvm.org>
To: "Jason Henline" <jhen@google.com>
Cc: "clang developer list" <cfe-dev@lists.llvm.org>
Sent: Wednesday, March 9, 2016 10:33:25 PM
Subject: Re: [cfe-dev] RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

I think my comments are more generic than for/against this sort of
proposal - I hope it helps start a discussion in general

> Thanks for your interest. I think you bring up some very good
> questions.
>
> 1) How well does this align with what the C++ standard is doing for
> accelerator parallelism?
>
> I think that StreamExecutor will basically live independently from
> any
> accelerator-specific changes to the C++ standard. StreamExecutor
> only wraps
> the host-side code that launches kernels and has no real opinion
> about how
> those kernels are created. If C++ introduces annotations or other
> constructs
> to allow functions or blocks to be run on an accelerator, I would
> expect
> that C++ would then become another supported accelerator
> programming
> language, in the same way that CUDA and OpenCL are now currently
> supported.
>
> 2) Do you have any benchmarks showing how much it costs to use this
> wrapper vs bare cuda
>
> I think the appropriate comparison here would be between
> StreamExecutor and
> the Nvidia CUDA runtime library. I don't have numbers for that
> comparison,
> but we do measure the time spent in StreamExecutor calls as a
> fraction of
> the total runtime of several of our real applications. In those
> measurements, we find that the StreamExecutor calls take up less
> than 1% of
> the total runtime, so we have been satisfied with that level of
> performance
> so far.
>
> 3) What sort of changes would exactly be needed inside clang/llvm
> to
> make it do what you need
>
> The changes would all be inside of clang. Clang already supports
> compiling
> CUDA code by lowering to calls to the Nvidia CUDA runtime library.
> We would
> introduce a new option into clang for using the StreamExecutor
> library
> instead. There are some changes that need to be made to Sema
> because it is
> currently hardcoded to look for a Nvidia CUDA runtime library
> function with
> a specific name in order to determine which types are allowed as
> arguments
> in the CUDA triple angle bracket launch syntax. Then there would be
> changes
> to CodeGen to optionally lower CUDA kernel calls onto
> StreamExecutor library
> calls, whereas now they are lowered to Nvidia CUDA runtime library
> calls.

Sorry - just trying to get implementation details here

So it's pure C++ syntax exposed to the user, but your runtime. Is
there "CUDA" or OpenCL hidden in the headers and that's where the
actual offload portion is happening

Is there anything stopping you from exposing "wrapper" interfaces
which are the same as the NVIDIA runtime? To avoid overhead you can
just force inline them.

Where is the StreamExecutor runtime source now? Does StreamExecutor
wrapper around public or private CUDA/OpenCL runtimes?

/*
I have said this before and I really get uncomfortable with the
generic term "CUDA" in clang. Until someone from NVIDIA (lawyers) put
something in writing. CUDA is an NV trademark and clang/llvm project
can't claim to be "CUDA" and need to make a distinction. Informally
this is all friendly now, but I do hope it's officially clarified at
some point. Maybe it's as simple as saying "CUDA compatible" - I
don't
know..
*/

>
> 4) How is this different from say Thrust, AMD's wrapper libs,
> Raja.. etc
>
> My understanding is that Thrust only supports STL operations,
> whereas
> StreamExecutor will support general user-defined kernels.
>
> I'm not personally familiar with AMD's wrapper libs or Raja. If you
> have
> links to point me in the right direction, I would be happy to
> comment on any
> similarities or differences.
>
> 5) Does it handle collapse, reductions and complex types?
>
> I don't think I fully understand the question here. If you mean
> reductions
> in the sense of the generic programming operation, there is no
> direct
> support. A user would currently have to write their own kernel for
> that, but
> StreamExecutor does support some common "canned" operations and
> that set of
> operations could be extended to include reductions.
>
> For complex types, the support will depend on what the kernel
> language (such
> as CUDA or OpenCL) supports. StreamExecutor will basically just
> treat the
> data as bytes and shuttle them to and from the accelerator as
> needed.

I think having a nice model that lowers cleanly (high performance) to
at least some targets is (should be) very important. From my
experience - if you have complex or perfectly nested loops - how
would
you take this sort of algorithm and map it to StreamExecutor? Getting
reductions right or wrong can also have a performance impact - If
your
goal is to create a "one wrapper rules them all" approach - I'm
hoping
you can find a common way to also make it easier for basic needs to
be
expressed to the underlying target. (In a target agnostic way)
------------
Regarding Hal's question about unified memory:

Unified memory - I can't see this solving much of anything. Most of the roadmaps I have seen will introduce high-bandwidth memory, which isn't unified if you want best performance, at some point in the near future. So your latencies will admittedly change (hopefully for the better), but to really program with performance in mind, there are still going to be multiple layers of memory which should be considered for data movement.

While I'm going to withhold judgment until the relevant future hardware arrives, I'm inclined to agree with you. Using unified memory, in raw form, will probably not give you the best performance. That is, however, perhaps not the point. Many applications have complex configuration data structures that need to be shared between host and devices, and unified memory can map those transparently. For the remaining data, for which the transfers are performance sensitive, you'll want to explicitly manage the transfers (or at least hint to the driver to transfer the data ahead of time). The overall result, however, should be significantly simpler code.

-Hal
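A hedged sketch of the split Hal describes, using the CUDA runtime: managed memory for the complex configuration that is shared transparently, explicit asynchronous copies for the bandwidth-critical data. The structure and names here are illustrative assumptions, not taken from any particular application:

  #include <cuda_runtime.h>

  struct Config {        // complex configuration shared by host and device
    int levels;
    float tolerance;
    // ... pointers to other managed structures, etc.
  };

  void setup(const float *host_samples, int n, cudaStream_t stream) {
    // Unified (managed) memory: one pointer valid on host and device; the
    // driver migrates pages transparently. Convenient, not necessarily fast.
    Config *cfg = nullptr;
    cudaMallocManaged(&cfg, sizeof(Config));
    cfg->levels = 4;
    cfg->tolerance = 1e-6f;

    // Performance-sensitive bulk data: explicit device allocation plus an
    // asynchronous copy issued ahead of time on a stream.
    float *device_samples = nullptr;
    cudaMalloc(&device_samples, n * sizeof(float));
    cudaMemcpyAsync(device_samples, host_samples, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // ... kernels launched on `stream` can then read *cfg and device_samples ...
  }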