RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

At Google we’re doing a lot of work on parallel programming models for CPUs, GPUs and other platforms. One place where we’re investing a lot is parallel libraries, especially those closely tied to compiler technology, like runtime and math libraries. We would like to develop these in the open, and the natural place seems to be a subproject in LLVM, if others in the community are interested.

Initially, we’d like to open source our StreamExecutor runtime library, which is used for simplifying the management of data-parallel workflows on accelerator devices and can also be extended to support other hardware platforms. We’d like to teach Clang to use StreamExecutor when targeting CUDA and work on other integrations, but that makes much more sense if it is part of the LLVM project.

However, we think the LLVM subproject should be organized as a set of several libraries with StreamExecutor as just the first instance. As just one example of how creating a unified parallelism subproject could help with code sharing, the StreamExecutor library contains some nice wrappers around the CUDA driver API and OpenCL API that create a unified API for managing all kinds of GPU devices. This unified GPU wrapper would be broadly applicable for libraries that need to communicate with GPU devices.
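To make the shape of such a unified wrapper concrete, here is a minimal sketch in C++. All of the names here (`Platform`, `FakeCudaPlatform`, `SynchronousMemcpyH2D`, `CopyToDevice`) are invented for illustration and are not StreamExecutor's actual API; the "CUDA" backend is faked with host memory so the example runs anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical unified device-platform interface. Each backend would wrap
// one vendor API (CUDA driver API, OpenCL, ...) behind the same virtual
// calls, so client code is written once against Platform.
class Platform {
 public:
  virtual ~Platform() = default;
  virtual std::string Name() const = 0;
  virtual int DeviceCount() const = 0;
  virtual bool SynchronousMemcpyH2D(int device_ordinal, const void* host_src,
                                    void* device_dst, std::size_t size) = 0;
};

// A stand-in backend. A real CUDA backend would call cuInit/cuMemcpyHtoD;
// here "device memory" is plain host memory so the sketch is runnable.
class FakeCudaPlatform : public Platform {
 public:
  std::string Name() const override { return "CUDA"; }
  int DeviceCount() const override { return 1; }
  bool SynchronousMemcpyH2D(int device_ordinal, const void* host_src,
                            void* device_dst, std::size_t size) override {
    if (device_ordinal < 0 || device_ordinal >= DeviceCount()) return false;
    std::memcpy(device_dst, host_src, size);
    return true;
  }
};

// Backend-agnostic client code: copies a host vector to "device" memory on
// device 0 of whatever platform it is handed.
bool CopyToDevice(Platform& platform, const std::vector<float>& src,
                  std::vector<float>& device_buffer) {
  device_buffer.resize(src.size());
  return platform.SynchronousMemcpyH2D(0, src.data(), device_buffer.data(),
                                       src.size() * sizeof(float));
}
```

The point of the design is that `CopyToDevice` never mentions CUDA or OpenCL; swapping backends requires no changes to client code.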

Of course, there is already an LLVM subproject for a parallel runtime library: OpenMP! So there is a question of how it would fit into this picture. Eventually, it might make sense to pull in the OpenMP project as a library in this proposed new subproject. In particular, there is a good chance that OpenMP and StreamExecutor could share code for offloading to GPUs and managing workloads on those devices. This is discussed at the end of the StreamExecutor documentation below. However, if it turns out that the needs of OpenMP are too specialized to fit well in a generic parallelism project, then it may make sense to leave OpenMP as a separate LLVM subproject so it can focus on serving the particular needs of OpenMP.

Documentation for the StreamExecutor library being proposed for open-sourcing is included below to give a sense of what it is and how it might fit into a general parallelism LLVM subproject.

What do folks think? Is there general interest in something like this? If so, we can start working on getting a project in place and sketching out a skeleton for how it would be organized, as well as contributing StreamExecutor to it. We’re happy to iterate on the particulars to figure out what works for the community.

From: "Jason Henline via llvm-dev" <llvm-dev@lists.llvm.org>
To: llvm-dev@lists.llvm.org
Sent: Wednesday, March 9, 2016 4:20:15 PM
Subject: [llvm-dev] RFC: Proposing an LLVM subproject for parallelism
runtime and support libraries

The document starts by talking about work you're doing on "CPUs, GPUs and other platforms", but you've only really discussed accelerators here. I'm wondering if there is any overlap, either current or planned, with the functionality that host-side OpenMP provides. For example, is there some kind of host-side thread pool / task queue?

Thanks in advance,
Hal

P.S. I'm really happy that it looks like you have a sane API here for handling multi-GPU systems. Dealing with cudaSetDevice is a real pain.

Hi Hal,

Thanks for taking a look at the proposal.

The current version of StreamExecutor has partial support for a “host” platform which performs work on the CPU. Its interface is the same as that of the CUDA platform discussed in the design documentation, but right now it does not support launching user-defined kernels, so it is very limited. The host platform does manage a thread pool internally and uses those threads to execute the “canned” StreamExecutor operations (BLAS, FFT, etc.), but that’s all it can currently do. I think it would be relatively easy to extend the host platform to support launching user-defined kernels, and then its functionality might overlap quite a bit with OpenMP, but we don’t have any active plans to add that support at this time. However, this is something we may pursue in the long run because of the added flexibility it would provide for porting accelerator code to a device without an accelerator, etc.
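A host platform of this kind could be structured roughly as follows. This is a hypothetical sketch, not StreamExecutor code: `HostPlatform`, `Enqueue`, and `BlockHostUntilDone` are invented names, and the "canned operation" is a plain closure standing in for a BLAS or FFT call.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Hypothetical host "platform": canned operations are enqueued as closures
// and executed asynchronously by an internal pool of worker threads.
class HostPlatform {
 public:
  explicit HostPlatform(unsigned num_threads) {
    for (unsigned i = 0; i < num_threads; ++i)
      workers_.emplace_back([this] { WorkerLoop(); });
  }
  ~HostPlatform() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_all();
    for (auto& t : workers_) t.join();
  }
  // Enqueue a "canned" operation (e.g. a BLAS call) for async execution.
  void Enqueue(std::function<void()> op) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(op));
      ++pending_;
    }
    cv_.notify_one();
  }
  // Block the host until every enqueued operation has finished.
  void BlockHostUntilDone() {
    std::unique_lock<std::mutex> lock(mu_);
    idle_cv_.wait(lock, [this] { return pending_ == 0; });
  }

 private:
  void WorkerLoop() {
    for (;;) {
      std::function<void()> op;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (done_ && queue_.empty()) return;  // drained; shut down
        op = std::move(queue_.front());
        queue_.pop();
      }
      op();  // run the canned operation outside the lock
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (--pending_ == 0) idle_cv_.notify_all();
      }
    }
  }
  std::mutex mu_;
  std::condition_variable cv_, idle_cv_;
  std::queue<std::function<void()>> queue_;
  std::vector<std::thread> workers_;
  unsigned pending_ = 0;
  bool done_ = false;
};
```

Supporting user-defined kernels on such a platform would then amount to accepting arbitrary closures through the same queue, rather than only the prepackaged library calls.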

From: "Jason Henline" <jhen@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvm-dev@lists.llvm.org
Sent: Wednesday, March 9, 2016 5:04:53 PM
Subject: Re: [llvm-dev] RFC: Proposing an LLVM subproject for
parallelism runtime and support libraries

I think that having support for running host-side tasks makes this library much more useful, not only for acceleratorless systems, but even for those with accelerators (especially if you can place dependency edges between the host tasks and the accelerator tasks).

Also, does your implementation support, or do you plan on supporting, CUDA-style unified memory between host and device?

Thanks again,
Hal

Thanks for your input, Hal.

I think that having support for running host-side tasks makes this library much more useful, not only for acceleratorless systems, but even for those with accelerators (especially if you can place dependency edges between the host tasks and the accelerator tasks).

Based on your comments, I think that supporting host-side tasks sounds like something that should be added to our roadmap, and it should be pretty simple to do within the current model. However, supporting dependency edges between different “platforms” (as StreamExecutor calls them) such as host and GPU could be slightly more challenging. The current model organizes each stream of execution as belonging to a parent platform, and streams are the structures that are meant to manage dependency edges. It will probably take some thought to decide how to do that in the right way.

Also, does your implementation support, or do you plan on supporting, CUDA-style unified memory between host and device?

I’m not sure how much this had been considered before you mentioned it. It is not supported right now, but I think it would fit naturally into the design model. Currently a custom C++ type is used to represent device memory, so we could add a sub-type to represent unified memory. In fact, a similar thing is already done for the host platform, where memcpy operations between host and device are converted to nops. I think this would be a pretty easy change in the current framework.
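The sub-typing idea can be illustrated with a small sketch. This is hypothetical code, not StreamExecutor's actual types: `DeviceMemory`, `UnifiedMemory`, and `CopyHostToDevice` are invented names, and the "device" allocation is backed by host memory so the example is runnable without a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical wrapper for a device allocation. In a real GPU backend the
// pointer would come from the driver (e.g. cuMemAlloc); here it is host
// memory so the sketch runs anywhere.
template <typename T>
class DeviceMemory {
 public:
  explicit DeviceMemory(std::size_t count) : storage_(count) {}
  virtual ~DeviceMemory() = default;
  // True when host and device share this allocation, so copies can be elided.
  virtual bool IsHostVisible() const { return false; }
  T* data() { return storage_.data(); }
  std::size_t count() const { return storage_.size(); }

 private:
  std::vector<T> storage_;
};

// Sub-type representing CUDA-style unified memory: the same pointer is valid
// on host and device, so host<->device memcpys can become no-ops.
template <typename T>
class UnifiedMemory : public DeviceMemory<T> {
 public:
  using DeviceMemory<T>::DeviceMemory;
  bool IsHostVisible() const override { return true; }
};

// A copy routine that skips the transfer when source and destination already
// alias, mirroring the "memcpy converted to a nop" behavior described above.
// Returns the number of bytes actually transferred.
template <typename T>
std::size_t CopyHostToDevice(const T* host_src, DeviceMemory<T>& dst,
                             std::size_t count) {
  if (dst.IsHostVisible() && host_src == dst.data()) return 0;  // elided
  std::memcpy(dst.data(), host_src, count * sizeof(T));
  return count * sizeof(T);
}
```

Because the copy routine dispatches on the memory type rather than on the platform, client code stays the same whether it is handed discrete or unified memory.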

From: "Jason Henline" <jhen@google.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvm-dev@lists.llvm.org
Sent: Wednesday, March 9, 2016 7:16:01 PM
Subject: Re: [llvm-dev] RFC: Proposing an LLVM subproject for
parallelism runtime and support libraries

Based on your comments, I think that supporting host-side tasks
sounds like something that should be added to our roadmap, and it
should be pretty simple to do within the current model.

Great!

However, supporting dependency edges between different "platforms"
(as StreamExecutor calls them) such as host and GPU could be
slightly more challenging. The current model organizes each stream
of execution as belonging to a parent platform, and streams are the
structures that are meant to manage dependency edges. It will
probably take some thought to decide how to do that in the right
way.

You might get a fair amount of mileage just by allowing host tasks to be inserted into the stream of device tasks (I'm making assumptions here based on how CUDA streams work). Do you currently support inter-stream synchronization generally?

Also, does your implementation support, or do you plan on supporting,
CUDA-style unified memory between host and device?

I'm not sure how much this had been considered before you mentioned
it. It is not supported right now, but I think it would fit
naturally into the design model. Currently a custom C++ type is used
to represent device memory, and so we could add a sub-type to
represent unified memory. In fact, a similar type of thing is
already done for the host platform where memcpy operations between
host and platform are converted to nops. I think this would be a
pretty easy change in the current framework.

Interesting. Definitely worth talking about (although probably on some other dedicated thread).

In case you can't already tell, I'm supportive of this kind of functionality in LLVM's ecosystem. I'm excited that this might be a near-term possibility (especially once you have the ability to execute host tasks). There might also be a relationship between this library and what we need to implement the upcoming C++17 parallel algorithms library.

Thanks again,
Hal

Do you currently support inter-stream synchronization generally?

At this point, it can only be done with the cooperation of the host: the host has to call a blocking wait first on one stream and then on the other.
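That host-mediated scheme can be sketched with a toy stream type. This is hypothetical code, not StreamExecutor's API: each `Stream` here runs enqueued closures in FIFO order on its own thread (loosely mirroring a CUDA stream), and `EnqueueAfter` shows the only cross-stream ordering available under this scheme, where the host blocks on the first stream before enqueueing on the second.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

// Hypothetical single-threaded stream: operations enqueued on one stream run
// in FIFO order on a dedicated worker thread.
class Stream {
 public:
  Stream() { worker_ = std::thread([this] { Loop(); }); }
  ~Stream() {
    {
      std::lock_guard<std::mutex> lock(mu_);
      done_ = true;
    }
    cv_.notify_one();
    worker_.join();
  }
  void Enqueue(std::function<void()> op) {
    {
      std::lock_guard<std::mutex> lock(mu_);
      queue_.push(std::move(op));
      ++pending_;
    }
    cv_.notify_one();
  }
  // Blocking host-side wait: returns once everything enqueued so far is done.
  void BlockHostUntilDone() {
    std::unique_lock<std::mutex> lock(mu_);
    idle_cv_.wait(lock, [this] { return pending_ == 0; });
  }

 private:
  void Loop() {
    for (;;) {
      std::function<void()> op;
      {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return done_ || !queue_.empty(); });
        if (done_ && queue_.empty()) return;  // drained; shut down
        op = std::move(queue_.front());
        queue_.pop();
      }
      op();  // run outside the lock
      {
        std::lock_guard<std::mutex> lock(mu_);
        if (--pending_ == 0) idle_cv_.notify_all();
      }
    }
  }
  std::mutex mu_;
  std::condition_variable cv_, idle_cv_;
  std::queue<std::function<void()>> queue_;
  std::size_t pending_ = 0;
  bool done_ = false;
  std::thread worker_;
};

// Host-mediated cross-stream dependency: the host blocks until stream `a`
// drains, then enqueues the dependent operation on stream `b`.
void EnqueueAfter(Stream& a, Stream& b, std::function<void()> op) {
  a.BlockHostUntilDone();
  b.Enqueue(std::move(op));
}
```

The limitation is visible in `EnqueueAfter`: the host thread stalls at the synchronization point, which is exactly what device-side events (as in CUDA's stream-wait-event mechanism) would avoid.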

In case you can’t already tell, I’m supportive of this kind of functionality in LLVM’s ecosystem. I’m excited that this might be a near-term possibility (especially once you have the ability to execute host tasks). There might also be a relationship between this library and what we need to implement the upcoming C++17 parallel algorithms library.

I’m very glad to hear it, and thank you for all your input.