PSA: Parallel STL algorithms available in LLVM

Hi all,

This is just a PSA that as of r302752, 3 parallel algorithms (for_each, for_each_n, and sort) are available in llvm/Support/Parallel.h.

Effort was made to match the C++ Parallelism TS N4507 [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4507.pdf] as closely as possible, but some aspects were intentionally omitted.
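
Usage looks roughly like this (a sketch only; I'm assuming the TS-style spellings here, i.e. a `parallel` namespace and a `par` policy object, and the exact names in the header may differ slightly):

```cpp
// Rough usage sketch against llvm/Support/Parallel.h; names assumed from the
// TS-style interface described above.
#include "llvm/Support/Parallel.h"
#include <vector>

void example(std::vector<int> &Values) {
  // Parallel sort of the whole range.
  llvm::parallel::sort(llvm::parallel::par, Values.begin(), Values.end());

  // Parallel for_each over every element.
  llvm::parallel::for_each(llvm::parallel::par, Values.begin(), Values.end(),
                           [](int &V) { V *= 2; });
}
```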

No support is added for the executor proposal N4406 [http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4406.pdf], but I plan to try to work on this in the future, with no specified timeline.

The spec doesn’t seem to say anything about recursive calls to parallel functions. In my mind that means the functions must support it, since it doesn’t explicitly say it does not need to support it. Do you think that’s accurate?

If so, I’ll rely on that behavior in LLDB, and extend the implementation in LLVM accordingly.
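
For concreteness, the pattern I have in mind in LLDB is roughly the following (illustrative names only, not the actual LLDB code, and again assuming the policy-based spellings from the announcement above):

```cpp
// Illustrative only: an outer parallel loop whose element function itself
// calls a parallel algorithm.
#include "llvm/Support/Parallel.h"
#include <vector>

void indexAllModules(std::vector<std::vector<int>> &Modules) {
  llvm::parallel::for_each(
      llvm::parallel::par, Modules.begin(), Modules.end(),
      [](std::vector<int> &Symbols) {
        // Nested parallel call made from inside a parallel task.
        llvm::parallel::sort(llvm::parallel::par, Symbols.begin(),
                             Symbols.end());
      });
}
```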

It’s hard to say. By definition it appears undefined (in the sense that the TS literally does not define it), but on the other hand it is a TS and this issue would (hopefully) come up and be specified before it made it to standardization.

Supporting recursive parallel calls certainly seems like desirable behavior, so from my point of view it would be nice to make sure it works. Not sure if others feel differently.

It seems to me like this would be the responsibility of the executor being used. You might have one parallel executor which uses a fixed size thread pool, and another which can dynamically add more threads if they are needed. In any case, I would still hope that it be specified by the time of standardization. Since the current executor stuff is all magically hidden away and inaccessible, I guess it wouldn’t hurt to add this logic there?
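
To illustrate why this lands on the executor: with a naive fixed-size pool, a task that posts sub-tasks and then blocks waiting for them can leave no worker free to run those sub-tasks. A minimal sketch (not anything in LLVM) of such a pool:

```cpp
// Minimal fixed-size pool sketch, for illustration only.
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

class NaiveFixedPool {
  std::deque<std::function<void()>> Queue;
  std::mutex Mutex;
  std::condition_variable Cond;
  std::vector<std::thread> Workers;
  bool Done = false;

public:
  explicit NaiveFixedPool(unsigned ThreadCount) {
    for (unsigned I = 0; I != ThreadCount; ++I)
      Workers.emplace_back([this] {
        for (;;) {
          std::function<void()> Task;
          {
            std::unique_lock<std::mutex> Lock(Mutex);
            Cond.wait(Lock, [this] { return Done || !Queue.empty(); });
            if (Queue.empty())
              return; // Done and drained.
            Task = std::move(Queue.front());
            Queue.pop_front();
          }
          Task(); // If this blocks waiting on nested work, the worker is lost.
        }
      });
  }

  void post(std::function<void()> Fn) {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Queue.push_back(std::move(Fn));
    }
    Cond.notify_one();
  }

  ~NaiveFixedPool() {
    {
      std::lock_guard<std::mutex> Lock(Mutex);
      Done = true;
    }
    Cond.notify_all();
    for (std::thread &W : Workers)
      W.join();
  }
};
// A nesting-friendly executor avoids the deadlock either by letting a waiting
// thread run queued tasks itself while it waits, or by growing the pool when
// all workers are blocked.
```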

It’s hard to say. By definition it appears undefined (in the sense that the TS literally does not define it), but on the other hand it is a TS and this issue would (hopefully) come up and be specified before it made it to standardization.

You mean the parallelism TS that was voted into C++17? :wink:

Bryce, did they end up defining this?

-Hal

TL;DR: … $*!#

I’ll take a full look at this tomorrow

Alright, I thought about this today and now I have a less terse response.

High-level summary on nested parallelism:

The spec doesn't seem to say anything about recursive calls to parallel
functions. In my mind that means the functions must support it, since it
doesn't explicitly say it does not need to support it.

* ^ That sounds correct. Conforming C++17 and Parallelism TS
implementations should support nested parallel algorithms; I am fairly
confident about this.
* My implementation (HPX) supports nested parallel algorithms
implicitly (due to the design of our system - we have an M:N user-space
tasking system) - at least for our CPU-centric executors. So, I've
never thought about this before.
* I am concerned that nested parallel algorithms will prove to be a
big implementation burden for GPU and accelerator architectures.
* I don't think having some executors support this and some not
support this makes sense. Executors represent the /where/ of
execution: on what resources (be they software abstractions, like
thread pools, or concrete hardware resources) work should be
executed. Execution policies describe the /how/ of execution, e.g.
what parallelism is allowable and what the contract is between the
element access functions (aka users) and the parallel algorithms
runtime (see the sketch after this list). I could imagine an
execution policy that specifies UB for nested parallelism.
* I don't think we've discussed this on the committee in the 1.5 years
I've been involved with the standardization of the parallel
algorithms. That time period covers the TS process and the IS process
- I don't remember any NB comments about this on either. I checked the
minutes briefly, as well. Perhaps I am very wrong and it was discussed
earlier in the process before I arrived, or I am remembering
incorrectly; but it is entirely possible this issue was never
considered directly.
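
For reference, in the C++17 IS the /how/ is carried entirely by the execution-policy argument, while the /where/ is left to whatever implicit executor the implementation uses. A minimal example:

```cpp
#include <algorithm>
#include <execution>
#include <vector>

void scale(std::vector<double> &Values) {
  // std::execution::par says parallel execution is permitted (the "how");
  // it says nothing about which threads or hardware run the work (the
  // "where") - that is what executors are meant to describe.
  std::for_each(std::execution::par, Values.begin(), Values.end(),
                [](double &X) { X *= 2.0; });
}
```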

I took a look at your initial work on llvm/Support/Parallel.h - it
looks like a good start!

However, I want to gently urge caution and then take this thread on a
bit of a tangent, because I suspect this work will start leading
towards whatever goes into the libc++ C++17 parallel algorithms
implementation. Even if the intention is to have a separate,
stand-alone implementation in LLVM Support, I think that
implementation should share some interfaces and design philosophy with
whatever goes into libc++. I think there is a very important step that
needs to be taken before substantial work is done on libc++'s parallel
algorithms.

Let me explain by analogy: A few years ago, my research group was
tasked with delivering an OpenMP compatibility story for our software
stack. We decided the best approach was for us to implement the
interface of the Intel OpenMP runtime library because this would allow
us to "hijack" that interface, which is targeted by multiple compilers
(Clang, ICPC, and one DoE research compiler thingy). Much to my
surprise, this effort actually worked pretty well, with very few of
the nightmarish stories that you might expect when gluing multiple
large projects together.

Why was this effort successful?

* The Intel OpenMP interface was fairly stable.
* It was well documented (relative to other internal interface layers).
* It was designed to support this "hijacking" model.
* The interface was not overly restrictive; there was not an
excessive number of places where we had to force square pegs through
round holes.

I feel strongly that a parallel algorithms implementation should take a
similar approach, where the parallel runtime layer is encapsulated,
well defined, and separate from the "frontend" (aka the algorithms
themselves). One option, of course, would be to use executors for
this. However, executors will likely not enable the "hijacking" model
without code changes and recompilation. Even with executors, we will
always have the "implicit default executor" that notionally exists
today in C++17 and the Parallelism TS. I'd really like end-users and
people who are shipping LLVM toolchains to be able to switch the
parallel runtime layer by simply linking against a different library
that implements some agreed-upon interface.
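
Purely to make that concrete, here is a hypothetical sketch of the flavor of interface I mean - none of these names exist anywhere today, it's just illustrative:

```cpp
// Hypothetical, illustrative-only backend interface. The algorithms
// "frontend" would lower onto a handful of entry points like these, and a
// vendor could swap in a different runtime just by linking a different
// library that implements them.
#include <cstddef>

extern "C" {
// Invoke Fn(Context, Index) for every Index in [0, Count), possibly in
// parallel; returns once all invocations have completed.
void __cxx_parallel_bulk_execute(void *Context,
                                 void (*Fn)(void *Context, std::size_t Index),
                                 std::size_t Count);

// A hint for chunking decisions: how much concurrency the backend can
// usefully provide.
std::size_t __cxx_parallel_concurrency();
}
```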

I've spoken with both the libc++ and libstdc++ devs at past committee
meetings about this; I think they have had similar thoughts.

The challenge is to come up with something that meets everyone's requirements:

* Everyone will want clear encapsulation between the parallel runtime
engine and the algorithms implementation - including a well-defined
and fairly stable (both API and ABI stable) interface between the two.
* The default implementation will need to work for hundreds of
thousands of users; it has to be robust and MUST avoid inflicting any
unnecessary runtime costs (e.g. no one should pay for what they aren't
using).
* High-performance users will want a very different type of
implementation - they'll want a more aggressive runtime layer that
will assume large amounts of parallel work will be executed and will
try to amortize, not minimize, costs.
* Some users will need implementations with very strict real-time or
latency constraints.
* Some users will want to plug in special implementations for
heterogeneous systems.

At the end of the day, crazy people like me will make our runtimes
work with whatever interface is provided. However, I think it would be
very beneficial to try and come up with a design before major
implementation work is done, and get feedback and guidance from
interested parties.

The potential benefit of this approach: one day in the future, we
might have multiple different backend implementations of the parallel
algorithms that work with multiple different compiler frameworks, all
of which interoperate through this agreed-upon C++17 parallel
algorithms runtime layer. I think we'll thank ourselves later for
sketching out a concrete, overall design first.

So... the question is: are others interested in this approach and
willing to help? As one of the culprits of parallel algorithms in C++17, I've
suspected that I'd end up contributing some sweat and blood to
libc++'s design and implementation, but it is not a one-person
project.

TL;DR - We should encapsulate the parallelism engine from the
algorithms themselves in libc++ and stick a flexible and robust
interface layer (distinct from executors and completely implicit)
between them. I tentatively volunteer to collaborate with
Marshall/Eric on this, but would need other interested parties to
help.

I'd be happy to quietly follow the discussion and chime in on the GPU or
accelerator side of things if needed.

* I am concerned that nested parallel algorithms will prove to be a
big implementation burden for GPU and accelerator architectures.

Can't they fall back on serial execution? I thought the executor is a
hint, not a requirement (certainly the standard doesn't say it has to
execute on multiple threads, just that it may).

However, I want to gently urge caution and then take this thread on a
bit of a tangent, because I suspect this work will start leading
towards whatever goes into the libc++ C++17 parallel algorithms
implementation. Even if the intention is to have a separate,
stand-alone implementation in LLVM Support, I think that
implementation should share some interfaces and design philosophy with
whatever goes into libc++. I think there is a very important step that
needs to be taken before substantial work is done on libc++'s parallel
algorithms.

I fear what you're describing is another 1.5 year long standards
committee-like process, involving multiple stakeholders, discussions, etc.

The background of how this came to be in LLVM is roughly:
1. I wanted to parallelize more work in LLDB, which has its own
non-nestable task execution model. It involved creating individual tasks,
rather than describing the iteration requested, so I put together my own
parallel::for_each-like implementation just for LLDB.
2. It was suggested that, rather than have each LLVM subproject implement
its own framework, it should instead be put into LLVM proper. Zachary
volunteered to do that, taking code from LLD and pulling it into LLVM.
3. It was then suggested that the interface should look more like the C++17
standard, presumably to make it easier to transition to the standard
library and drop LLVM's own implementation once the time was right.
4. Back to LLDB, I want to make more code use for_each, but that code may
itself call for_each, so in order to avoid creating Yet Another parallelism
model, I want to make sure LLVM's for_each supports nested calling.

As it is, we have a real use case today for this behavior, but not the
resources/time to invest in defining how a shared library interface
should look, separate from the C++17 parallelism interface, just so
that libc++ may (or may not) pick it up somewhere down the line.

IMO it makes more sense to continue with the separate implementation of
"kinda mostly C++17 parallelism" with minimal changes/improvements as
necessary inside LLVM, and then switch to libc++/libstdc++/etc standard
implementation later once those interfaces are implemented and pervasive
across all the architectures that LLVM needs to work on. Otherwise, we
hold up progress in LLVM/LLDB today.

I agree. I’ve chatted a couple of times with Marshall about the design for parallel algorithms in libc++. I think that we have a reasonable idea of what we’d like to do there, at least in some general sense. I’d not hold up this work waiting for a libc++-appropriate implementation. Having the use case, however, is definitely valuable. -Hal

Even without a concrete use case, I agree that it’s absolutely imperative for the standard to require this of a conforming implementation. It’s going to be the source of so many problems otherwise.

I agree that it should work in some sense, but I meant that having the use case against which we can evaluate different implementation strategies is important. -Hal

Apologies for my delayed reply - I've been on travel the past few days.

Can't they fall back on serial execution? I thought the executor is a hint,
not a requirement (certainly the standard doesn't say it has to execute on
multiple threads, just that it may).

Yes. However, it is a quality-of-implementation issue.

I fear what you're describing is another 1.5 year long standards
committee-like process, involving multiple stakeholders, discussions, etc.

Yep. I don't think it will take quite that long, but it's not a trivial task.

As it is, we have a real use case today for this behavior, but not the
resources/time to invest in defining how a shared library interface
should look, separate from the C++17 parallelism interface, just so
that libc++ may (or may not) pick it up somewhere down the line.

Ah, alright. I think I understand your requirements a bit better now.

IMO it makes more sense to continue with the separate implementation of
"kinda mostly C++17 parallelism" with minimal changes/improvements as
necessary inside LLVM, and then switch to libc++/libstdc++/etc standard
implementation later once those interfaces are implemented and pervasive
across all the architectures that LLVM needs to work on. Otherwise, we hold
up progress in LLVM/LLDB today.

I agree, this sounds quite sensible to me.

Sounds good to me. This might be a good time to start on this sort of
thing, given that there's interest and there are internal use cases in the LLVM
ecosystem.

To confirm - what you'd like is clarification in the standard that
recursive parallelism is supported?

I think this is feasible; I'd suggest a non-normative note. I could
write a short committee paper on this (targeting C++20) for the next
meeting.

Yes, I would hate for some library implementer to either interpret the standard differently or simply not consider the issue of recursive parallelism at all and end up with an implementation that doesn’t support it (not that unlikely, considering that, as you said, it went through 1.5 years of committee and the topic never came up).

When you say a “non-normative note”, does this imply that a library implementer would be free to ignore the note?

Yes.
That's what "non-normative" means.

-- Marshall

P.S. I expect to land all the new algorithms (the non-parallel versions!)
in libc++ this week or next week.

P.P.S. I would be happy to work with other people on the infrastructure for
libc++ that Bryce described. That idea (having a parallelism engine that
could be replaced by rebuilding libc++) is what I had in mind.