It’s great that Google are interested in contributing to the development of LLVM in this area, and that you have code to support offload.
However, I’m not sure that all of it is needed, since LLVM already has the offload library, which has been developed in the context of OpenMP but actually provides a general facility. It has been a part of LLVM since April 2014, and is already being used to offload to both Intel Xeon Phi and (at least NVIDIA) GPUs. (The IBM folks can tell you more about that!)
The main difference I see (at first glance!) is that your StreamExecutor interfaces seem to be aimed more at end-user code, whereas the interface to the existing offload library has not been designed for the user, but to be an interface from the compiler. That has advantages and disadvantages:
· It is a C-level interface, so it is callable from C, C++ and Fortran.
· Using it directly from C++ user code may be harder than using StreamExecutor.
However, there is nothing in the interface that prevents it from being used with CUDA or OpenCL, and it already seems to support the low-level features you cited as StreamExecutor’s advantages, though not the “looks just like CUDA” aspects, since it’s explicitly vendor neutral.
- abstracts the underlying accelerator platform (avoids locking you into a single vendor, and lets you write code without thinking about which platform you’ll be running on).
Liboffload does this (and has a specific design for how to abstract new devices and support them using device specific libraries).
- provides an open-source alternative to the CUDA runtime library.
I am not a CUDA expert, so I can’t comment on this! As before, IBM should comment.
- gives users a stream management model whose terminology matches that of the CUDA programming model.
This is not an abstraction; it seems specific to the CUDA target, which is, if anything, worrying for a supposedly vendor-neutral interface!
- makes use of modern C++ to create a safe, efficient, easy-to-use programming interface.
Liboffload does not do this, because it is an implementation layer, not intended to be user-visible.
StreamExecutor makes it easy to:
- move data between host and accelerator (and also between peer accelerators).
Liboffload supports this.
- execute data-parallel kernels written in the OpenCL or CUDA kernel languages.
I believe this should be easy; IBM can comment better, since they have been working on GPU support.
- inspect the capabilities of a GPU-like device at runtime.
- manage multiple devices.
Liboffload supports this.
We’d therefore be very interested in seeing an approach that implemented a C++-specific, user-friendly interface on top of the existing liboffload functionality, but we don’t see a reason to rework the OpenMP implementation to use StreamExecutor (since what LLVM already has is working fine, and supports offload to both GPUs and Xeon Phi).
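To make that layering concrete, here is a minimal sketch of a C++ wrapper sitting on top of a C-level offload interface. Everything named in it is an assumption made for illustration: the `mock_offload_*` functions are hypothetical stand-ins (implemented here with host memory so that the example is self-contained), not the actual liboffload entry points, and `DeviceBuffer` and `round_trip` are likewise invented names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <new>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical C-level interface, NOT the real liboffload API. The "device"
// is simulated with host memory so that the sketch compiles and runs alone.
extern "C" {
void *mock_offload_alloc(int device, std::size_t bytes) {
  (void)device;
  return std::malloc(bytes ? bytes : 1); // never request a zero-sized block
}
void mock_offload_free(int device, void *ptr) {
  (void)device;
  std::free(ptr);
}
int mock_offload_copy_to(int device, void *dst, const void *src,
                         std::size_t bytes) {
  (void)device;
  if (bytes == 0) return 0;
  assert(dst && src);
  std::memcpy(dst, src, bytes);
  return 0; // 0 means success in this mock
}
int mock_offload_copy_from(int device, void *dst, const void *src,
                           std::size_t bytes) {
  (void)device;
  if (bytes == 0) return 0;
  assert(dst && src);
  std::memcpy(dst, src, bytes);
  return 0;
}
}

// Hypothetical C++ layer: RAII ownership and typed transfers over the C calls.
template <typename T> class DeviceBuffer {
  static_assert(std::is_trivially_copyable<T>::value,
                "only trivially copyable types can be copied byte-wise");

public:
  DeviceBuffer(int device, std::size_t count)
      : device_(device), count_(count),
        data_(mock_offload_alloc(device, count * sizeof(T))) {
    if (!data_) throw std::bad_alloc();
  }
  ~DeviceBuffer() {
    if (data_) mock_offload_free(device_, data_);
  }
  DeviceBuffer(const DeviceBuffer &) = delete;
  DeviceBuffer &operator=(const DeviceBuffer &) = delete;
  DeviceBuffer(DeviceBuffer &&other) noexcept
      : device_(other.device_), count_(other.count_), data_(other.data_) {
    other.data_ = nullptr; // the moved-from object no longer owns the memory
  }

  // Copy host data to the (simulated) device; sizes must match.
  void upload(const std::vector<T> &host) {
    if (host.size() != count_) throw std::invalid_argument("size mismatch");
    if (mock_offload_copy_to(device_, data_, host.data(),
                             count_ * sizeof(T)) != 0)
      throw std::runtime_error("copy to device failed");
  }
  // Copy the buffer contents back into a fresh host vector.
  std::vector<T> download() const {
    std::vector<T> host(count_);
    if (mock_offload_copy_from(device_, host.data(), data_,
                               count_ * sizeof(T)) != 0)
      throw std::runtime_error("copy from device failed");
    return host;
  }

private:
  int device_;
  std::size_t count_;
  void *data_;
};

// Convenience helper: send data to the device and read it straight back.
std::vector<int> round_trip(const std::vector<int> &host, int device = 0) {
  DeviceBuffer<int> buffer(device, host.size());
  buffer.upload(host);
  return buffer.download();
}
```

A wrapper along these lines could give C++ users type-safe, exception-safe buffers while the C entry points underneath remain callable from C and Fortran, and from compiler-generated code.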
James Cownie <firstname.lastname@example.org>
SSG/DPD/TCAR (Technical Computing, Analyzers and Runtimes)
Tel: +44 117 9071438