[libc++] Working on the parallel STL algorithms

Hello all,

I’m an intern at Apple this summer, and my project is to work on the C++17 parallel algorithms library. In a recent thread, Hal mentioned [0] that he had spoken with Marshall about the design of the library. I’m currently working on a prototype that is pretty similar to what was committed into lib/Support in r302748. I’m wondering what you folks talked about, and if any consensus with respect to the design has been reached. Even if not, I’d love to hear about what thoughts were exchanged, or any reference material that could bring me up to speed!

It might also be worth having a conversation about how to divide up the work to be done here. Were any of you intending to start working on the library?

Thanks!
Erik

[0] http://lists.llvm.org/pipermail/llvm-dev/2017-May/112988.html

Hi, Erik,

That’s great!

Gor, Marshall, and I discussed this after some past committee meeting. We wanted to architect the implementation so that we could provide different underlying concurrency mechanisms, including:

a. A self-contained thread-pool-based implementation using a work-stealing scheme.

b. An implementation that wraps Grand Central Dispatch (for Mac and any other platforms providing libdispatch).

c. An implementation that uses OpenMP.

The OpenMP-based provider is essential for interoperability in applications using OpenMP, GCD seemed most natural on Mac, and the self-contained implementation is available for anyone else. CUDA+UVM should be possible in the future as well. All of these mechanisms can export an interface for scheduling tasks, and we specifically discussed the interface having a “bulk enqueue” capability. This is essential for efficiently mapping onto several potential underlying technologies (e.g. OpenMP). You don’t want to generate individual tasks for each item, and for underlying technologies that can do intelligent things, you want the underlying technology to make the coarsening decisions.
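To make the "bulk enqueue" idea concrete, here is a minimal sketch of what such a hook might look like; the names (__pstl_backend, __bulk_invoke) are hypothetical, not a committed design. The algorithm layer hands the whole index range to the provider, and the provider decides how to chunk it:

    namespace __pstl_backend {

    // Invoke __f(__i) for every __i in [__first, __last). Each provider
    // (thread pool, libdispatch, OpenMP, ...) partitions the range however
    // its underlying technology sees fit.
    template <class _Index, class _Func>
    void __bulk_invoke(_Index __first, _Index __last, _Func __f);

    // Trivial serial provider, useful as a fallback and for testing.
    template <class _Index, class _Func>
    void __bulk_invoke_serial(_Index __first, _Index __last, _Func __f) {
      for (; __first != __last; ++__first)
        __f(__first);
    }

    } // namespace __pstl_backend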

par_unseq should map, via pragmas (either Clang-specific and/or OpenMP), to enabling vectorization. As a longer-term exercise, we might need additional compiler extensions to make this work well for some algorithms (such as sorting), dealing with exceptions, etc.
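For example (a sketch only; the helper name __unseq_for_each_n and the choice of pragmas are assumptions, not a proposed interface), the unsequenced inner loop could be annotated with either the Clang loop pragma or the OpenMP simd directive:

    template <class _Iterator, class _Size, class _Func>
    _Iterator __unseq_for_each_n(_Iterator __first, _Size __n, _Func __f) {
    #if defined(_OPENMP)
      // OpenMP path: ask for explicit vectorization of the loop.
      #pragma omp simd
    #else
      // Clang-specific path: hint the vectorizer directly.
      #pragma clang loop vectorize(enable) interleave(enable)
    #endif
      for (_Size __i = 0; __i < __n; ++__i)
        __f(__first[__i]);
      return __first + __n;
    }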

-Hal

Sorry to butt in, but I'm kinda curious how these will be substantially
different under the hood

"OpenMP" is a pretty vague term and I'm curious what that means in terms of
actual directives used. All non-accelerator OpenMP implementations lower
down to threading currently. (Even if you use tasks it still ends up being
a thread)

GCD (libdispatch) is essentially a task-based execution model, but again on
non-OSX platforms lowers to threads. (I doubt that GCD offers any
performance benefit over native threads or the Intel OMP runtime on OSX.)

How would the above offer any benefit over a native thread pool? Would you
be just duplicating code which is already working?

No need to be sorry; this is a good question. I think that there are a few high-level goals here:

1. Provide a solution that works for everybody

2. Take advantage of compiler technology as appropriate

3. Provide useful interoperability. In practice: don't oversubscribe the system.

The motivation for providing an implementation based on a libc++ thread pool is to satisfy (1). Your suggestion of using our OpenMP runtime's low-level API directly is a good one. Personally, I really like this idea. It does imply, however, that organizations that distribute libc++ will also end up distributing libomp. If libomp has matured (in the open-source sense) to the point where this is a suitable solution, then we should do this. As I recall, however, we still have at least several organizations that ship Clang/LLVM/libc++-based toolchains that don't ship libomp, and I don't know how generally comfortable people will be with this dependency.

That having been said, to point (2), using the OpenMP compiler directives is superior to calling the low-level API directly. OpenMP directives do translate into API calls, as you point out, but they also provide optimization hints to the compiler (e.g. about lack of loop-carried dependencies). Over the next couple of years, I expect to see a lot more in the compiler optimization capabilities around OpenMP (and perhaps other parallelism) directives (parallel-region fusion, etc.). OpenMP also provides a standard way to access many of the relevant vectorization hints, and taking advantage of this is useful for compiling with Clang and also other compilers.

Regarding why you'd use GCD on Mac, and similarly why it is important for many users to use OpenMP underneath, it is important, to the extent possible, to use the same underlying thread pool as other things in the application. This is to avoid over-subscription and other issues associated with conflicting threading runtimes. If parts of the application are already using GCD, then we probably want to do this too (or at least not compete with it). Otherwise, OpenMP's runtime is probably better :wink:

I had in mind basic host-level OpenMP directives (i.e. OpenMP 3 style plus simd directives for vectorization, although using taskloop is a good thing to consider as well). I don't think we can transparently use OpenMP accelerator directives in their current state because we can't identify the memory dependencies. When OpenMP grows some way to deal with accelerators in a global address space (e.g. the new NVIDIA UVM technology), then we should be able to use that too. CUDA+UVM will be an option in the shorter term here as well, however. Given that Clang can function as a CUDA compiler, this is definitely worth exploring.

Thanks again,
Hal
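To illustrate point (2) above (purely a sketch; __omp_backend and __parallel_for are made-up names, not a proposed interface): the directive below both dispatches work to the OpenMP runtime and tells the compiler that the iterations are independent, which a raw call into the runtime's entry points would not convey, and it stays entirely inside the library's execution provider rather than in user code.

    namespace __omp_backend {

    // Run __f(__i) for __i in [0, __n) across the OpenMP thread pool. The
    // directive carries the "iterations are independent" hint to the compiler.
    template <class _Index, class _Func>
    void __parallel_for(_Index __n, _Func __f) {
      #pragma omp parallel for
      for (_Index __i = 0; __i < __n; ++__i)
        __f(__i); // the user's lambda, inlined into the annotated loop
    }

    } // namespace __omp_backend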

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We
wanted to architect the implementation so that we could provide different
underlying concurrency mechanisms, including:

   a. A self-contained thread-pool-based implementation using a
work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and
any other platforms providing libdispatch).

   c. An implementation that uses OpenMP.

Sorry to butt in, but I'm kinda curious how these will be substantially
different under the hood

No need to be sorry; this is a good question. I think that there are a few
high-level goals here:

1. Provide a solution that works for everybody

2. Take advantage of compiler technology as appropriate

3. Provide useful interoperability. In practice: don't oversubscribe the
system.

The motivation for providing an implementation based on a libc++ thread
pool is to satisfy (1). Your suggestion of using our OpenMP runtime's
low-level API directly is a good one. Personally, I really like this idea.
It does imply, however, that organizations that distribute libc++ will also
end up distributing libomp. If libomp has matured (in the open-source
sense) to the point where this is a suitable solution, then we should do
this. As I recall, however, we still have at least several organizations
that ship Clang/LLVM/libc++-based toolchains that don't ship libomp, and I
don't know how generally comfortable people will be with this dependency.

If "people" aren't comfortable with llvm-openmp then kick it out as a
project. I use it and I know other projects that use it just fine. I can
maybe claim the title of OpenMP hater and yet I don't know any legitimate
reason against having this as a dependency. It's a portable parallel
runtime that exposes an API and works.. I hope someone does speak up about
specific concerns if they exist.

That having been said, to point (2), using the OpenMP compiler directives
is superior to calling the low-level API directly. OpenMP directives do
translate into API calls, as you point out, but they also provide
optimization hints to the compiler (e.g. about lack of loop-carried
dependencies). Over the next couple of years, I expect to see a lot more in
the compiler optimization capabilities around OpenMP (and perhaps other
parallelism) directives (parallel-region fusion, etc.). OpenMP also
provides a standard way to access many of the relevant vectorization hints,
and taking advantage of this is useful for compiling with Clang and also
other compilers.

If projects can't even ship llvm-openmp runtime then I have a very strong
concern with bootstrap dependencies which may start relying on external
tools.

Further, I'm not sure I understand your point here. The directives wouldn't
be in the end user code, but would be in the STL implementation side.
Wouldn't that implementation stuff be fixed and an abstract layer exposed
to the end user? It almost sounds like you're expressing the benefits of
OMP here and not the parallel STL side. (Hmm.. in the distance I
hear.. "*premature
optimization* is the root of *all evil")*

Once llvm OpenMP can do things like handle nested parallelism and a few
more advanced things properly all this might be fun (We can go down a big
list if anyone wants to digress)

Regarding why you'd use GCD on Mac, and similarly why it is important for
many users to use OpenMP underneath, it is important, to the extent
possible, to use the same underlying thread pool as other things in the
application. This is to avoid over-subscription and other issues associated
with conflicting threading runtimes. If parts of the application are
already using GCD, then we probably want to do this too (or at least not
compete with it). Otherwise, OpenMP's runtime is probably better :wink:

Again this detail isn't visible to the end user? We pick an implementation
that makes sense. If other applications use GCD and we use OpenMP, if
multiple thread heavy applications are running, over-subscription would be
a kernel issue and not userland. I don't see how you can always avoid that
situation and creating two implementations to try kinda seems funny. btw
GCD is a marketing term and libdispatch is really what I'm talking about
here. It's been quite a while since I hands on worked with it, but I wonder
how much the API overlaps with similar interfaces to llvm-openmp. If the
interfaces are similar and the "cost" in terms of complexity is low, who
cares, but I don't remember that being the case. (side note: I worked on an
older version of libdispatch and ported it to Solaris. I also played around
and benchmarked OMP tasks lowering directly down to libdispatch calls
across multiple platforms. At the time our runtime always beat it in
performance. Maybe newer versions of libdispatch are better)

I'm not trying to be combative, but your points just don't make
sense....... (I take the blame and must be missing something)

That's correct. The OpenMP pragmas would be an implementation detail. However, we'd design this so that the lambda that gets passed into the algorithm can be inlined into the code that has the compiler directives, thus reaping the benefit of OpenMP's compiler integration. This is why I said we might consider using taskloop :wink:

There are other ways of handling nesting as well (colleagues of mine work on one), but we should probably have a separate thread on OpenMP and nesting to discuss this aspect of things.

The detail is invisible to the user at the source-code level. Obviously they might notice if we're oversubscribing the system. Yes, on many systems the kernel can manage oversubscription, but that does not mean it will perform well. As I'm sure you understand, because of cache locality and many other effects, just running a bunch of threads and letting the kernel switch them is often much slower than running a smaller number of threads and having them pull from a task queue. There are exceptions worth mentioning, however, such as when the threads are mostly themselves blocked on I/O.

Sounds great.

-Hal
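A minimal sketch of the taskloop variant (the wrapper name __taskloop_for is hypothetical): the user's lambda ends up inside the directive-annotated loop where the optimizer can see it, and because the iterations are expressed as tasks, a nested algorithm call could issue just the taskloop and feed work into an enclosing team rather than always spawning a new one.

    template <class _Index, class _Func>
    void __taskloop_for(_Index __n, _Func __f) {
      #pragma omp parallel
      #pragma omp single
      #pragma omp taskloop
      for (_Index __i = 0; __i < __n; ++__i)
        __f(__i); // user-supplied lambda, visible to the optimizer here
    }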

Hello all,

I’m an intern at Apple this summer, and my project is to work on the C++17
parallel algorithms library. In a recent thread, Hal mentioned [0] that he
had spoken with Marshall about the design of the library. I’m currently
working on a prototype that is pretty similar to what was committed into
lib/Support in r302748. I’m wondering what you folks talked about, and if
any consensus with respect to the design has been reached. Even if not, I’d love to hear
about what thoughts were exchanged, or any reference material that could
bring me up to speed!

Awesome!

You may want to read my (long) post in the other thread if you haven't
already - I described some of my thoughts about the design of C++17
parallel algorithm runtimes.

http://llvm.1065342.n5.nabble.com/llvm-dev-PSA-Parallel-STL-algorithms-available-in-LLVM-td108822.html#a108873

It might also be worth having a conversation about how to divide up the work
to be done here. Were any of you intending to start working on the library?

Yes, I am hoping to contribute to this effort starting sometime in
June. I need to speak with Marshall to figure out how I can be most
effective here.

I'd be happy to collaborate. Some of the other HPX developers may be
interested in contributing as well.

Some thoughts:

* I pretty much agree with Hal - his vision for what we should do here
closely matches mine.
* I agree with Chris that we may want to target the Intel OMP Runtime
interface - but Hal is right that this may be sub-optimal.
Additionally, it may be quite tricky to do correctly.
* I am a little concerned that it will be hard to de-couple the
"parallel engine" and "algorithms" components of the library if we're
using OpenMP pragmas. I am also a little concerned about using pragmas
because they don't always interact well with C++ abstractions.
* I want to make sure we end up with one generic "algorithms" core and
O(3) backends that libc++ supports/maintains, and an interface that
3rd parties can implement to plug their "parallel engines" into
libc++'s algorithms. I'm a little worried that the pragma-based OpenMP
one might be tightly coupled (although 3rd parties could plug in by
implementing the OpenMP runtime API, that'd be sub-optimal).

I don't see why you have that concern. We have plenty of existing parallel abstraction libraries (e.g. Kokkos) that have multiple backends, including OpenMP, and don't have this problem. You can certainly have a set of templates that implement bulk task dispatch, etc. in terms of OpenMP and a set of templates that implement that interface in terms of other things. The key thing here is to reuse the thread pool and get the added compiler optimizations that the pragmas can imply (when OpenMP is available). To be clear, I'm not proposing that we put OpenMP pragmas on the algorithm implementations themselves, only use it for an underlying parallelism-execution provider.
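A rough illustration of that decoupling (all names here are invented for the sketch): the algorithm core is written once against a bulk-dispatch entry point, and each provider supplies its own implementation of it.

    // OpenMP-backed provider: the directive lives here, not in the algorithms.
    struct __openmp_backend {
      template <class _Index, class _Func>
      static void __bulk_for(_Index __n, _Func __f) {
        #pragma omp parallel for
        for (_Index __i = 0; __i < __n; ++__i)
          __f(__i);
      }
    };

    // Serial provider with the same interface; a thread-pool or libdispatch
    // provider would slot in the same way.
    struct __serial_backend {
      template <class _Index, class _Func>
      static void __bulk_for(_Index __n, _Func __f) {
        for (_Index __i = 0; __i < __n; ++__i)
          __f(__i);
      }
    };

    // One generic algorithm core, parameterized over the provider.
    template <class _Backend, class _RandomIt, class _Func>
    void __for_each_impl(_RandomIt __first, _RandomIt __last, _Func __f) {
      auto __n = __last - __first;
      _Backend::__bulk_for(__n, [&](decltype(__n) __i) { __f(__first[__i]); });
    }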

What is true, AFAIK, is that the abstractions don't play well with OpenMP pragmas that take variable names (reductions, for example) in a general way. Thus, for example, it is not straightforward/possible to have a variadic template that takes a list of variables on which to perform a reduction and transform that into a loop with a pragma with the right list of variables in a reduction clause (in general). You can handle specific cases (e.g. one reduction variable for specific operations). For the general case, we need to do our own thing using underlying primitives (which is what we'd need to do for the non-OpenMP implementations as well). I don't even know that we run into this problem with the current set of standardized algorithms.
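For instance, the one-variable case maps cleanly (a sketch; __reduce_plus is a hypothetical helper), whereas there is no way to splice a variadic pack of reduction variables into the clause:

    template <class _RandomIt, class _Tp>
    _Tp __reduce_plus(_RandomIt __first, _RandomIt __last, _Tp __init) {
      _Tp __sum = __init;
      auto __n = __last - __first;
      // One known reduction variable and operation: easy to express.
      #pragma omp parallel for reduction(+ : __sum)
      for (decltype(__n) __i = 0; __i < __n; ++__i)
        __sum += __first[__i];
      return __sum;
    }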

  -Hal

Hi, Erik,

That's great!

Gor, Marshall, and I discussed this after some past committee meeting. We
wanted to architect the implementation so that we could provide different
underlying concurrency mechanisms, including:

   a. A self-contained thread-pool-based implementation using a
work-stealing scheme.

   b. An implementation that wraps Grand Central Dispatch (for Mac and any
other platforms providing libdispatch).

   c. An implementation that uses OpenMP.

The OpenMP-based provider is essential for interoperability in
applications using OpenMP, GCD seemed most natural on Mac, and the
self-contained implementation is available for anyone else. CUDA+UVM should
be possible in the future as well. All of these mechanisms can export an
interface for scheduling tasks, and we specifically discussed the interface
having a "bulk enqueue" capability. This is essential for efficiently
mapping onto several potential underlying technologies (e.g. OpenMP). You
don't want to generate individual tasks for each item, and for underlying
technologies that can do intelligent things, you want the underlying
technology to make the coarsening decisions.

par_unseq should map, via pragmas (either Clang-specific and/or OpenMP), to
enabling vectorization. As a longer-term exercise, we might need additional
compiler extensions to make this work well for some algorithms (such as
sorting), dealing with exceptions, etc.

-Hal

Don't forget about ConcRT on Windows.

- Michael Spencer

I don't see why you have that concern. We have plenty of existing parallel
abstraction libraries (e.g. Kokkos) that have multiple backends, including
OpenMP, and don't have this problem. You can certainly have a set of
templates that implement bulk task dispatch, etc. in terms of OpenMP and a
set of templates that implement that interface in terms of other things. The
key thing here is to reuse the thread pool and get the added compiler
optimizations that the pragmas can imply (when OpenMP is available).

Upon reflection, I think you're correct. I probably let some of my
prior experience with OMP (pragmas sprinkled all over the place in
applications, historical avoidance of nested OMP and composition, and
some of the rough edges that you get when using OMP on iterator loops
with certain compilers) bias my thinking a bit. The right approach,
like the design you suggested, makes sense to me, and as you
mentioned there's plenty of prior art.

You make an excellent point about leveraging existing OpenMP compiler
optimizations (this is actually one of the reasons that we wrote an
OpenMP compatibility layer for HPX!).

To be clear, I'm not proposing that we put OpenMP pragmas on the algorithm
implementations themselves, only use it for an underlying
parallelism-execution provider.

+1 - I think my concern was mostly to make sure we do this.

What is true, AFAIK, is that the abstractions don't play well with OpenMP
pragmas that take variable names (reductions, for example) in a general way.
Thus, for example, it is not straightforward/possible to have a variadic
template that takes a list of variables on which to perform a reduction and
transform that into a loop with a pragma with the right list of variables in
a reduction clause (in general). You can handle specific cases (e.g. one
reduction variable for specific operations).

I think this is the only case we have in the current algorithms.

I don't even know that we run into this problem with the current set of standardized algorithms.

Yep, I believe this is correct.