RFC: Proposing an LLVM subproject for parallelism runtime and support libraries

Jason,

It’s great that Google are interested in contributing to the development of LLVM in this area, and that you have code to support offload.

However, I’m not sure that all of it is needed, since LLVM already has the offload library, which has been developed in the context of OpenMP but actually provides a general facility. It has been part of LLVM since April 2014, and is already being used to offload to both Intel Xeon Phi and (at least NVIDIA) GPUs. (The IBM folks can tell you more about that!)

The main difference I see (at a very first glance!) is that your StreamExecutor interfaces seem to be aimed more at end-user code, whereas the interface to the existing offload library has not been designed for the user, but to be an interface from the compiler. That has advantages and disadvantages:

Advantages:

· It is a C-level interface, so it is callable from C, C++, and Fortran

Disadvantages:

· Using it directly from C++ user code may be harder than using StreamExecutor.

However, there is nothing in the interface that prevents it from being used with CUDA or OpenCL, and it already seems to support the low-level features you cited as StreamExecutor’s advantages, though not the “looks just like CUDA” aspects, since it’s explicitly vendor-neutral.

StreamExecutor:

  • abstracts the underlying accelerator platform (avoids locking you into a single vendor, and lets you write code without thinking about which platform you’ll be running on).

Liboffload does this (and has a specific design for how to abstract new devices and support them using device specific libraries).

  • provides an open-source alternative to the CUDA runtime library.

I am not a CUDA expert, so I can’t comment on this! As before, IBM should comment.

  • gives users a stream management model whose terminology matches that of the CUDA programming model.

This is not abstract, but seems specific to the CUDA target, which is, if anything, worrying for a supposedly vendor-neutral interface!

  • makes use of modern C++ to create a safe, efficient, easy-to-use programming interface.

No, because liboffload is an implementation layer, not intended to be user-visible.

StreamExecutor makes it easy to:

  • move data between host and accelerator (and also between peer accelerators).

Liboffload supports this.

  • execute data-parallel kernels written in the OpenCL or CUDA kernel languages.

I believe this should be easy; IBM can comment better, since they have been working on GPU support.

  • inspect the capabilities of a GPU-like device at runtime.
  • manage multiple devices.

Liboffload supports this.

We’d therefore be very interested in seeing an approach that implemented a C++ specific user-friendly interface on top of the existing liboffload functionality, but we don’t see a reason to rework the OpenMP implementation to use StreamExecutor (since what LLVM already has is working fine, and supporting offload to both GPUs and Xeon Phi).

– Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

I'd support some of James's comments if liboffload wasn't glued to OMP
as it is now. My attempts to decouple it into something with better
design layering, outside the OMP source repo, have failed. For it to
be advocated as "the" offload lib, it needs a home (imnsho) outside
of OMP. Somewhere that others can easily play with it and not pay the
OMP tax. It may tick some of the boxes which have been mentioned, but
I'm curious how well it does when put under real workloads.

I'd support some of James's comments if liboffload wasn't glued to OMP as it is now.

I certainly have no objection to moving liboffload elsewhere if that makes it more useful to people.
There is no real "glue" holding it there; it simply ended up in the OpenMP directory structure because that
was an easy place to put it, not because that's the optimal place for it.

To some extent it has stayed there because no one has put in the effort to move it.

-- Jim

James Cownie <james.h.cownie@intel.com>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438

/* ignorable rant */
I've publicly advocated it shouldn't have been there in the 1st place.
I have been quite vocal that the work wasn't for everyone else to pay, but
should have been part of the initial design. (Basically getting it
right the 1st time, instead of forcing someone else to wade through a
bunch of cmake.)

I think it would be great if StreamExecutor could use liboffload to perform its offloading under the hood. Right now offloading is handled in StreamExecutor using platform plugins, so I think it could be very natural for us to write a plugin which basically forwards to liboffload. If that worked out, we could delete our current plugins and depend only on those based on liboffload, and then all the offloading code would be unified. Then, just as James said, StreamExecutor would provide a nice C++ interface on top of liboffload, and liboffload could continue to support OpenMP directly.
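
To sketch the shape of that idea: the class and method names below are hypothetical stand-ins, not the real StreamExecutor plugin API, and the liboffload calls are deliberately left as placeholder comments rather than invented entry points. Wiring them up is exactly the work being discussed.

    #include <cstddef>

    // Hypothetical, simplified stand-in for a StreamExecutor-style
    // platform plugin interface; the real interface lives in the
    // design docs referenced earlier in this thread.
    class PlatformInterface {
     public:
      virtual ~PlatformInterface() = default;
      virtual void *allocateDeviceMemory(size_t bytes) = 0;
      virtual void freeDeviceMemory(void *ptr) = 0;
      virtual bool copyHostToDevice(void *dst, const void *src,
                                    size_t bytes) = 0;
      virtual bool launchKernel(const void *kernel, int gridDim,
                                int blockDim, void **args) = 0;
    };

    // A plugin that forwards each operation to liboffload instead of
    // talking to a vendor driver directly.
    class LiboffloadPlatform : public PlatformInterface {
     public:
      void *allocateDeviceMemory(size_t bytes) override {
        // Forward to liboffload's device-buffer allocation here.
        (void)bytes;
        return nullptr;  // placeholder
      }
      void freeDeviceMemory(void *ptr) override {
        // Forward to liboffload's buffer release here.
        (void)ptr;
      }
      bool copyHostToDevice(void *dst, const void *src,
                            size_t bytes) override {
        // Forward to liboffload's host-to-device transfer here.
        (void)dst; (void)src; (void)bytes;
        return false;  // placeholder
      }
      bool launchKernel(const void *kernel, int gridDim, int blockDim,
                        void **args) override {
        // Forward to liboffload's kernel invocation here.
        (void)kernel; (void)gridDim; (void)blockDim; (void)args;
        return false;  // placeholder
      }
    };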

In this plan, I think it would make sense to move liboffload to the new project being proposed by this RFC, and hopefully that would also make liboffload more usable as a stand-alone project. Before moving forward with any of these plans, I think it is right to wait to hear what IBM thinks.

Jason,

It looks like a neat idea for providing a high-level interface for CUDA-based development. So far, my understanding is that StreamExecutor is designed as an abstract runtime layer over the target architecture for offloading, while still requiring target-specific code from the user. In my understanding this contradicts the parallel programming model (PPM) idea: the abstraction of the programming itself. To draw an analogy: you provide inline assembly to give the user the best performance and flexibility, while OpenMP is more like a Fortran compiler that abstracts away the machine entirely. I would name this the biggest difference between the projects. Am I right?

The document you refer to was published a year ago, and there has been significant progress since then, with a prototype compiler implemented at https://github.com/clang-omp. The implementation supports NVIDIA and x86 targets, providing an abstraction of the target platform. Other parties have contributed to the design, and we have yet to see open-source contributions from them. Nevertheless, I can say that the design was thoroughly reviewed and approved by the full list of authors; hence, their targets are satisfied as well. As per the design document, the abstraction layer is the libomptarget library, which dispatches different target binaries at runtime. Along with dispatching, the library keeps track of all data mapping between the host and all target devices, hiding this from the user and thus removing the burden of bookkeeping. This could be a good point for StreamExecutor to integrate its interface. Still, we need to review both sides: the libomptarget interface provided to the compiler and the StreamExecutor internal interfaces.
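
For illustration, a minimal OpenMP 4.x sketch (names illustrative): the map clauses are the point where libomptarget tracks and transfers host/device data on the user's behalf.

    #include <cstdio>

    int main() {
      const int N = 1024;
      static float a[N], b[N];
      for (int i = 0; i < N; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

      // libomptarget maps the arrays to the default device, runs the
      // loop there (falling back to the host if no device is found),
      // and copies 'a' back; no device pointers appear in user code.
      #pragma omp target teams distribute parallel for \
          map(tofrom: a[0:N]) map(to: b[0:N])
      for (int i = 0; i < N; ++i)
        a[i] += b[i];

      printf("a[0] = %f\n", a[0]);  // 3.0
      return 0;
    }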

While the OpenMP model provides the convenience of allowing the author to write their kernel code in standard C/C++, the StreamExecutor model allows for the use of any kernel language (e.g. CUDA C++ or OpenCL C). This lets authors use platform-specific features that are only present in platform-specific kernel definition languages.
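
As an illustration of the platform-specific features in question, a sketch of a CUDA C++ kernel using NVIDIA warp shuffles, which have no portable OpenMP equivalent (illustrative only):

    // Tree reduction across the 32 lanes of a warp, entirely in
    // registers via __shfl_down_sync (an NVIDIA-specific intrinsic).
    // A kernel like this is what StreamExecutor would load and launch
    // through its CUDA platform.
    __global__ void warpReduceSum(const float *in, float *out) {
      float val = in[threadIdx.x];
      for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
      if (threadIdx.x % 32 == 0)
        out[threadIdx.x / 32] = val;
    }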

Per my understanding, the OpenMP standard allows calling functions inside target regions. Those functions can be generated for the target by the compiler, where the user annotates the appropriate functions with “#pragma omp declare target”. But the standard also allows using functions added to the target binaries in any other way. For example, on the Xeon Phi platform one can use any shared library that has been put on the device beforehand. As for GPUs and other targets, binaries obtained from other compilers/build tools should be passed to the target image link explicitly.
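
For example, a minimal sketch of the “declare target” mechanism (names illustrative):

    #include <cstdio>

    // 'scale' is compiled for both host and device, so target regions
    // may call it; functions compiled into device binaries by other
    // means can be made available in an analogous way.
    #pragma omp declare target
    float scale(float x) { return 2.0f * x; }
    #pragma omp end declare target

    int main() {
      float v = 21.0f;
      #pragma omp target map(tofrom: v)
      { v = scale(v); }
      printf("%f\n", v);  // 42.0
      return 0;
    }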

The OpenMP programming model aims for the program to be written once and compiled with many compilers for multiple targets; in addition, with the same compiler it can be compiled in serial mode. This fits well with complex build systems and provides support for a number of targets without additional interference with build scripts. My understanding is that StreamExecutor requires different targets to be built separately and then put together to enable offloading.

I see some other differences in the programming model as well; for example, the executor project you referred to supports C++ only. Are there any plans to support plain C and Fortran? I would suggest we do more analysis on extending the existing library to support StreamExecutor.

Collaboration is always a plus, and having more interested parties is beneficial both ways. I would like to start with a description and review of the interfaces. The interfaces of libomptarget (both the compiler side and the target RTL) are described in the document. Could you prepare a small overview of the (perhaps proposed) StreamExecutor internal interfaces to CUDA and OpenCL, so that we can derive the key common points of abstraction and try to map them to the libomptarget interface?

Regards,

Sergos

Intel Compiler Team

Sergos,

It looks like a neat idea for providing a high-level interface for CUDA-based development. So far, my understanding is that StreamExecutor is designed as an abstract runtime layer over the target architecture for offloading, while still requiring target-specific code from the user. In my understanding this contradicts the parallel programming model (PPM) idea: the abstraction of the programming itself. To draw an analogy: you provide inline assembly to give the user the best performance and flexibility, while OpenMP is more like a Fortran compiler that abstracts away the machine entirely. I would name this the biggest difference between the projects. Am I right?

Yes, it sounds to me that you have the right idea here.

The document you refer to was published a year ago, and there has been significant progress since then, with a prototype compiler implemented at https://github.com/clang-omp. The implementation supports NVIDIA and x86 targets, providing an abstraction of the target platform. Other parties have contributed to the design, and we have yet to see open-source contributions from them. Nevertheless, I can say that the design was thoroughly reviewed and approved by the full list of authors; hence, their targets are satisfied as well. As per the design document, the abstraction layer is the libomptarget library, which dispatches different target binaries at runtime. Along with dispatching, the library keeps track of all data mapping between the host and all target devices, hiding this from the user and thus removing the burden of bookkeeping. This could be a good point for StreamExecutor to integrate its interface. Still, we need to review both sides: the libomptarget interface provided to the compiler and the StreamExecutor internal interfaces.

Thanks for letting me know about the implementation on GitHub. I will take a look at the code for libomptarget there to see how I think it could work with StreamExecutor. In general terms, I really like the idea of StreamExecutor being able to call into libomptarget rather than implementing the offloading itself, so I think it will be great if we can get those details to work out.

Per my understanding, the OpenMP standard allows calling functions inside target regions. Those functions can be generated for the target by the compiler, where the user annotates the appropriate functions with “#pragma omp declare target”. But the standard also allows using functions added to the target binaries in any other way. For example, on the Xeon Phi platform one can use any shared library that has been put on the device beforehand. As for GPUs and other targets, binaries obtained from other compilers/build tools should be passed to the target image link explicitly.

I was not aware that OpenMP had a mode for running code compiled by a different compiler. That sounds very nice. I would like to learn more about what the user interface looks like for this, specifically in the case of CUDA.

The OpenMP programming model aims for the program to be written once and compiled with many compilers for multiple targets; in addition, with the same compiler it can be compiled in serial mode. This fits well with complex build systems and provides support for a number of targets without additional interference with build scripts. My understanding is that StreamExecutor requires different targets to be built separately and then put together to enable offloading.

Yes, I think that is an accurate representation of StreamExecutor. Our hope is to integrate StreamExecutor into clang itself so that clang can manage the bundling of device code in object files and the launching of that code.

I see some other differences in the programming model as well; for example, the executor project you referred to supports C++ only. Are there any plans to support plain C and Fortran? I would suggest we do more analysis on extending the existing library to support StreamExecutor.

We are only interested in supporting C++. One of the main goals of StreamExecutor is to create a nice interface specifically for C++.

Collaboration is always a plus, and having more interested parties is beneficial both ways. I would like to start with a description and review of the interfaces. The interfaces of libomptarget (both the compiler side and the target RTL) are described in the document. Could you prepare a small overview of the (perhaps proposed) StreamExecutor internal interfaces to CUDA and OpenCL, so that we can derive the key common points of abstraction and try to map them to the libomptarget interface?

Yes, I will prepare a short overview of the internal StreamExecutor interfaces to CUDA and OpenCL. These interfaces are already well defined, so I will just need to copy and paste some things into a document. I plan to have that completed some time tomorrow so you can see how StreamExecutor would like to interact with libomptarget.

Thanks very much for your input on this,
-Jason

I think that having a liboffload plugin would be nice, but I don’t think we should really base everything on top of this for a few reasons:

  1. I think we already have a nice plugin interface specifically designed to support out-of-tree platforms with StreamExecutor, and it wouldn’t make a lot of sense to force them to re-implement their stuff.

  2. Some platforms may not want or be able to use the liboffload style plugin.

It seems like if the OpenMP folks want to add a liboffload plugin to StreamExecutor, that would be an awesome additional platform, but I don’t see why we need to force the coupling here.

My 2 cents.
-Chandler

Chandler,

That raises a more meta-question for me, which is “Why should StreamExecutor be in LLVM at all?”

AFAICS, with your approach

· It is not a runtime library whose interface the compiler needs to understand.

· It does not depend on any LLVM runtime libraries.

· It is expected to be used with out-of-tree plugins.

If I got all of that right, what connection does it have with LLVM that makes having it in the LLVM tree necessary, or an improvement over simply having it on github (or whatever your favourite open-source hosting location is)?

Did I misunderstand something?

– Jim

James Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

Chandler,

That raises a more meta-question for me, which is “Why should StreamExecutor be in LLVM at all?”

AFAICS, with your approach

· It is not a runtime library whose interface the compiler needs to understand.

The original email pretty clearly spells out how it is specifically intended to be a target for the compiler?

· It does not depend on any LLVM runtime libraries.

· It is expected to be used with out-of-tree plugins.

Note that this doesn’t mean there won’t be in-tree plugins. It’s analogous to how LLVM supports out-of-tree targets.

If I got all of that right, what connection does it have with LLVM that makes having it in the LLVM tree necessary, or an improvement over simply having it on github (or whatever your favourite open-source hosting location is)?

If there were nothing compiler related to it, then the argument would be much more tenuous. But I think there is a great deal of compiler relevance here.

OK, thanks.

– Jim

James Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)

Tel: +44 117 9071438

I see the distinction of being "agnostic" and welcoming different
plugins, but at the same time I asked how Google was engaging hardware
stakeholders. The feedback from Intel, if I heard them correctly, is
that they would warmly welcome some liboffload integration. (Which
would enable Phi support, if I'm not mistaken?)

While I don't think anyone will try to block the integration of this
on whether it does or doesn't have support for liboffload - I think
you may win more friends if it does.

IMHO, until this gets market traction I don't think it's ready for
inclusion. There are lots of really great ideas, but there should be
some threshold of _____________ (importance?) before it's included. (I
apologize I can't word this previous sentence perfectly.) /* I really
wish there was some way to have it be an "incubator" before formal
inclusion */

Hi

Sorry for replying to this thread late - I see that most of the questions have already found an answer.
As Jim, I am also very happy to hear about Google’s work on accelerators.

A couple of notes that caught my eye before going into the integration discussion:

provides an open-source alternative to the CUDA runtime library.

Interesting: this still means that you have to go through the CUDA driver API, right?
I am specifically referring to what is described here:
http://docs.nvidia.com/cuda/cuda-driver-api/#axzz42zF30dZw

libomptarget’s RTL library for CUDA-enabled devices uses the CUDA driver API directly, not the CUDA runtime library.
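
To show what that means in practice, here is a rough sketch of the driver-API sequence such an RTL performs under the hood (the module and kernel names are illustrative, and error handling is elided):

    #include <cuda.h>  // driver API (cu*), not the runtime API (cuda*)

    int main() {
      cuInit(0);
      CUdevice dev;   cuDeviceGet(&dev, 0);
      CUcontext ctx;  cuCtxCreate(&ctx, 0, dev);

      // Load a pre-compiled device binary and look up a kernel in it.
      CUmodule mod;   cuModuleLoad(&mod, "kernels.cubin");
      CUfunction fn;  cuModuleGetFunction(&fn, mod, "vector_add");

      CUdeviceptr d_buf;
      cuMemAlloc(&d_buf, 1024 * sizeof(float));

      void *args[] = { &d_buf };
      cuLaunchKernel(fn, /*grid*/ 4, 1, 1, /*block*/ 256, 1, 1,
                     /*sharedMem*/ 0, /*stream*/ 0, args, nullptr);
      cuCtxSynchronize();

      cuMemFree(d_buf);
      cuModuleUnload(mod);
      cuCtxDestroy(ctx);
      return 0;
    }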

However, if it turns out that the needs of OpenMP are too specialized to fit well in a generic parallelism project,

Of course, the libomptarget interface was designed to support OpenMP, not any parallelism project. It is still very general, as the OpenMP programming language is, and independent of any acceleration type because of the huge multi-company and user interest in the language.

execute data-parallel kernels written in the OpenCL or CUDA kernel languages.

Libomptarget does not currently have an OpenCL interface in its device-dependent RTL list. However, I am pretty sure that other companies working on OpenCL (ARM? AMD?) would be interested in having this, as well as a clang/llvm path that is capable of going from OpenMP to their native languages.

About integration of the two projects:

  • If I read Chandler’s comment correctly, I do agree that we do not need to force coupling between the two libraries.

  • libomptarget could add an RTL plug-in which interfaces to StreamExecutor. My question is: what does StreamExecutor do beyond the current CUDA RTL implementation? I would be specifically interested in performance-related points here.
    To do this, StreamExecutor would need to implement the interface that is defined between the target-agnostic part of libomptarget and the target-dependent part. What would be the interest of StreamExecutor in doing this?

Finally, one interesting bit about OpenMP is that it does not expose (in the language) any CUDA-like mechanisms like streams. The user can specify that certain tasks (e.g. target data-parallel tasks running on a device) can be executed asynchronously and can define dependencies between them. Under the hood, libomptarget will make extensive use of CUDA streams to satisfy dependencies living exclusively on the GPU, but the user does not get exposed to it.
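
For example (a sketch assuming an OpenMP 4.5 compiler): the two target tasks below are ordered by depend clauses, and the runtime may place them on CUDA streams internally, but no stream ever appears in the source.

    // Two dependent device tasks; any synchronization between them can
    // live entirely on the device without the user naming a stream.
    void pipeline(float *a, float *b, int n) {
      #pragma omp target map(tofrom: a[0:n]) nowait depend(out: a[0:n])
      for (int i = 0; i < n; ++i) a[i] *= 2.0f;

      #pragma omp target map(to: a[0:n]) map(from: b[0:n]) \
          nowait depend(in: a[0:n])
      for (int i = 0; i < n; ++i) b[i] = a[i] + 1.0f;

      #pragma omp taskwait  // block until both target tasks complete
    }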

I hope this helps bringing the discussion forward.

Thanks!

– Carlo


Hola Chandler,

I created a GitHub repo that contains the documentation I have been creating for StreamExecutor. https://github.com/henline/streamexecutordoc

It contains the design docs from the original email in this thread, and it contains a new doc I just made that gives a more detailed sketch of the StreamExecutor platform plugin interface. This shows which methods must be implemented to support a new platform in StreamExecutor, or to provide a new implementation for an existing platform (e.g. using liboffload to implement the CUDA platform).

I wrote up this doc in response to a lot of good questions I am getting about the details of how StreamExecutor might work with the code OpenMP already has in place.

Best Regards,
-Jason

I did a more thorough read through liboffload and wrote up a more detailed doc describing how StreamExecutor platforms relate to libomptarget RTL interfaces. The doc also describes why the lack of support for streams in libomptarget makes it impossible to implement some of the most important StreamExecutor platforms in terms of libomptarget (https://github.com/henline/streamexecutordoc/blob/master/se_and_openmp.rst). When I was originally optimistic about using liboffload to implement StreamExecutor platforms, I was not aware of this issue with streams. Thanks to Carlo Bertolli for bringing this to my attention.

After having looked in detail at the liboffload code, it sounds like the best thing to do at this point is to keep StreamExecutor and liboffload separate, but to leave the door open to implement future StreamExecutor platforms in terms of liboffload. From the recent messages on this subject from Carlo and Andrey it seems like there is a general consensus on this, so I would like to move forward with the StreamExecutor project in this spirit.

Hi Jason,

This is probably because I’m not aware of the details, but it was claimed in this thread that liboffload can target Xeon Phi and NVIDIA GPUs. Adding a new library that the compiler has to be aware of has to bring significant benefit,
so it is not clear to me yet why the compiler should target two different runtime libraries that seem to have a large chunk of overlapping functionality.
At a high level, these libraries seem to have the same goals with respect to what they provide to the compiler.

Can you elaborate?

Thanks,


To butt in with a peanut-gallery comment: I suspect it's because
liboffload really just provides a bare set of non-portable APIs
mostly tailored to OpenMP 4. Having it support any other programming
model is probably going to take real work refactoring
"liboffload". (*cough* *cough* good design)

In the end, I pessimistically suspect each programming model wanting
inclusion will reinvent the wheel and make the same argument each
time. So we'll end up with lots of libraries doing mostly the same
thing, with duplicated code/support, etc.