Parallelism TS implementation and feasibility of GPU execution policies

Hi Andrew!

Although I am not a regular reader of the mailing list, your mail was brought to my attention. It may not be the answer you were looking for, but I would like to say a few words about the topic.

While having a GPU-capable Parallelism TS execution policy would be great, it will likely not happen, at least not in the straightforward sense that one might hope for. How would such a library work?

One would create a std::vector of data, call a GPU algorithm, have the data moved to the device, have it moved back afterwards (both implicitly!), and then do something with the data. This is how most GPU algorithm libraries have been developed. The problem is: what if one wants to do two transformations on the dataset? Each call copies the data to the device and back again, so there is a great deal of needless data movement involved.
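To make the cost concrete, here is a minimal sketch that only simulates a device on the host and counts transfers. Everything in it (FakeDevice, implicit_transform, resident_transform and the two helper functions) is a hypothetical name made up for illustration, not the API of any real library. It assumes that "device memory" can be modelled by a second std::vector.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a GPU: "device memory" is just another
// std::vector, and every host<->device transfer is counted.
struct FakeDevice {
    std::size_t transfers = 0;

    std::vector<float> upload(const std::vector<float>& host) {
        ++transfers;
        return host;
    }
    std::vector<float> download(const std::vector<float>& dev) {
        ++transfers;
        return dev;
    }
};

// Implicit style: every algorithm call moves the data in and back out.
template <class F>
void implicit_transform(FakeDevice& d, std::vector<float>& host, F f) {
    std::vector<float> dev = d.upload(host);
    std::transform(dev.begin(), dev.end(), dev.begin(), f);
    host = d.download(dev);
}

// Explicit style: the data is already resident, no transfer happens here.
template <class F>
void resident_transform(std::vector<float>& dev, F f) {
    std::transform(dev.begin(), dev.end(), dev.begin(), f);
}

// Two transformations with implicit movement: four transfers in total.
std::size_t transfers_with_implicit_movement() {
    FakeDevice d;
    std::vector<float> v{1.0f, 2.0f, 3.0f};
    implicit_transform(d, v, [](float x) { return x * 2.0f; });
    implicit_transform(d, v, [](float x) { return x + 1.0f; });
    assert(v[0] == 3.0f);
    return d.transfers;
}

// The same two transformations with explicit residency: two transfers.
std::size_t transfers_with_explicit_residency() {
    FakeDevice d;
    std::vector<float> v{1.0f, 2.0f, 3.0f};
    std::vector<float> dev = d.upload(v);
    resident_transform(dev, [](float x) { return x * 2.0f; });
    resident_transform(dev, [](float x) { return x + 1.0f; });
    v = d.download(dev);
    assert(v[0] == 3.0f);
    return d.transfers;
}
```

With N chained transformations the implicit style performs 2N transfers, while the explicit style stays at two regardless of N.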

Implicitly hiding data movement will impose setbacks as soon as you start optimizing your code. Getting rid of that would require you to write custom allocators for your containers that specify which device you wish to allocate memory on. The problem, as far as I understand it, is that the standard has historically allowed containers to assume that allocators are stateless (all instances of a given allocator type being interchangeable), which undermines any attempt to create an allocator bound to a specific device. The current trend is polymorphic allocators, so one can erase the type of the allocator from the container’s type. (If someone knows whether the interface of allocators changes to a stateful, non-static one in Poly Alloc, that would be great.)

Taking allocators a little further, it would be possible to have std::copy behave in the expected way if iterators could access the allocators of their containers: when the source and destination allocators belong to different devices, std::copy would do an inter-device copy.
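A rough sketch of that idea follows. Since standard iterators cannot reach their container's allocator, this version takes the containers themselves. DeviceAllocator, device_vector, CopyPath, residency_aware_copy and demo_copy are all hypothetical names, the "device" is only an integer tag, and the memory is ordinary host memory. It also assumes a C++11 library, where allocator_traits lets an allocator carry per-instance state.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical allocator that remembers which device it belongs to.
// It still allocates plain host memory; only the tag is simulated.
template <class T>
struct DeviceAllocator {
    using value_type = T;
    int device = 0;

    DeviceAllocator() = default;
    explicit DeviceAllocator(int d) : device(d) {}
    template <class U>
    DeviceAllocator(const DeviceAllocator<U>& other) : device(other.device) {}

    T* allocate(std::size_t n) { return std::allocator<T>{}.allocate(n); }
    void deallocate(T* p, std::size_t n) { std::allocator<T>{}.deallocate(p, n); }
};

// Two allocators compare equal only if they refer to the same device.
template <class T, class U>
bool operator==(const DeviceAllocator<T>& a, const DeviceAllocator<U>& b) {
    return a.device == b.device;
}
template <class T, class U>
bool operator!=(const DeviceAllocator<T>& a, const DeviceAllocator<U>& b) {
    return !(a == b);
}

template <class T>
using device_vector = std::vector<T, DeviceAllocator<T>>;

enum class CopyPath { same_device, inter_device };

// Copies src into dst and reports which path a real implementation
// would have to take, based on the allocators of the two containers.
template <class T>
CopyPath residency_aware_copy(const device_vector<T>& src, device_vector<T>& dst) {
    dst.resize(src.size());
    std::copy(src.begin(), src.end(), dst.begin());  // simulated transfer
    return src.get_allocator() == dst.get_allocator() ? CopyPath::same_device
                                                      : CopyPath::inter_device;
}

// Small helper that builds two vectors on the given devices and copies.
CopyPath demo_copy(int src_device, int dst_device) {
    DeviceAllocator<int> src_alloc(src_device);
    DeviceAllocator<int> dst_alloc(dst_device);
    device_vector<int> a(src_alloc);
    device_vector<int> b(dst_alloc);
    a.assign({1, 2, 3});
    CopyPath path = residency_aware_copy(a, b);
    assert(b.size() == 3 && b[2] == 3);
    return path;
}
```

A real implementation would replace the std::copy call with the appropriate driver-level transfer. The point is only that the decision can be made from information the containers already carry.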

The problem with implementing a useful GPU execution policy is that the current STL design does not let you easily express data residency.

If you are content with Thrust-like functionality, I would advise you to give C++AMP and the C++AMP Parallel Algorithms library a try. AMP is very similar to CUDA in some sense: the “restrict(amp)” clause that it uses to identify GPU-capable functions is very similar to the “__device__” qualifier of CUDA, but it interacts with the rest of the language (function overload resolution) in a well-defined and intuitive manner.

Microsoft’s own implementation relies on DirectX compute shaders instead of NVPTX, which is more portable in terms of vendors. There is a Linux implementation from AMD and Multicoreware available for download. The Linux implementation uses SPIR and OpenCL as a back-end. The interesting fact about it is that it goes well beyond the C++AMP 1.0 specification and allows you to capture ANYTHING in a GPU kernel, given that you are using HSA-compatible hardware. An HSA-compatible IGP (such as those found in Kaveri) allows you to directly capture an entire std::vector, for example, and calling std::vector::data() will give you the same pointer, which you can dereference in the kernel lambda just as you could host-side.

Bottom line: GPU execution policies as a non-standard C++ extension to the Parallelism TS will likely not happen, as it is very hard to “hack” them on top of the rest of the STL in a useful manner. Creating something with the same limitations as all the other similar projects is duplicate work, in my opinion. I would very much like to see something useful working in standard C++, but that is a long way down the road, I fear. If someone knows of a good project, let us know.

Máté Nagy-Egri

Newcomer here, I hope this isn’t off-topic, but this seemed to be the most appropriate place to ask:

Are there plans to implement Parallelism-TS in libc++/Clang? If so, what execution policies might be supported?

Besides multi-threaded CPU execution policies, are there plans, or would it even be feasible, to implement a GPU execution policy in libc++/Clang that targets the NVPTX backend using the Clang frontend only (i.e., without NVCC)?

This would be extremely useful, since even with the latest CUDA 7 release, NVCC remains slow, buggy, and consumes massive amounts of memory in comparison to Clang. If I compile my Thrust-based code against Thrust’s TBB backend using Clang, it takes just a minute or so and consumes just a few gigabytes. If I compile that exact same code using Thrust’s CUDA backend with NVCC, it consumes ~20 gigabytes of memory and takes well over an hour to compile on my 24-GB workstation; on my 16-GB laptop it never finishes. Obviating the need for NVCC when compiling code targeting NVIDIA GPUs, via a Parallelism TS implementation, would be extremely useful.

Finally, are there plans, or would it even be feasible, to target OpenCL/SYCL/SPIR(-V) via Parallelism-TS? I am aware of existing OpenCL-based parallel algorithms libraries, but I am really hoping for a Parallelism TS execution policy for OpenCL devices, so that it is a single-source, fully integrated approach that one can pass C++ function objects to directly, as opposed to being restricted to passing strings containing OpenCL C99 syntax, or having to pre-instantiate template functors with macro wrappers.

Andrew Corrigan