RFC: An Extension Mechanism for Parallel Compilers Based on LLVM

Hey LLVM Dev,

We in the LLVMPar working group have been working hard on this RFC, which we hope to discuss on the list and during the upcoming dev meeting. It’s long, and the associated Google doc is even longer, but we hope you’ll take a look and let us know your thoughts and feedback.

Cheers,
The LLVMPar working group

Hi,

We have experience with a similar approach in our compiler at Cray -- we represent parallel regions with begin/end intrinsic operators. This approach does potentially enable some optimization opportunities, but we've also experienced some real challenges in getting this approach to work properly. The proposal seems to touch on many of the concerns we would raise, but we still have some questions.

The main challenges we have experienced are: (1) maintaining the required region structure and nesting; (2) preventing illegal motion across region boundaries; and (3) establishing and maintaining proper object allocation/lifetime.

It seems that a region will begin as a single-entry, single-exit block of code; the single-entry property is required to be maintained throughout transformations, but is it correct that the single-exit property is allowed to be relaxed? If so, then I assume that the multiple exit points would still need to "collectively post dominate" the single entry point? That is, any path leaving a region must pass through an exit point (and that exit point must be dominated by the entry point). Is this correct?

Do you have thoughts about how to handle issues that arise from unreachable code or unusual control flow? For example: an infinite loop in a region could result in the end marker being deleted, creating a region with an entry but no exit. Is this going to be allowed (and handled), or prohibited (and if so, how)? What about control flow that might exit from a region, like exception handling? Will this be prohibited, or handled in some manner?

Are region entry/exit intrinsics allowed to be reordered (with respect to other region entry/exit intrinsics)? If so, is there any mechanism that would prevent two regions from becoming interleaved (i.e., overlapped but not properly nested)?

The proposal states that SSA values are not allowed to be defined in one region and used in another. Is a nested region considered the same or a different region? That is, can an SSA value defined in one region be referenced in a nested region?

What mechanisms would prevent an SSA value from flowing out of a region? The language about dominator relations and direct SSA references in C(i) in the SOUNDNESS section seems to suggest that a region will have an implicit control-flow path from the start of the region to the exit (i.e., around the body of the region). Is this correct? (If not, what prevents an SSA value defined in one region to be directly referenced in another region?) Also, the proposal mentions that enforcing property C(i) would require changes to PHI-node construction. Will there be additional checks/asserts added to ensure that future transformations are not allowed to introduce these prohibited PHI nodes?

Is the intent that all necessary privatization will be performed by the PREPARE translation? Or, is it possible for a region to contain "shared" accesses with corresponding meta-data indicating they will eventually become private accesses? The big challenge with the latter is that the IR cannot be interpreted without also considering the meta-data, since the meaning is entirely different. For example, the address of an object will change when the object is converted from shared to private.

Is the proposed framework intended/expected to support heterogeneous compilation? That is, could different regions be compiled for different targets?

Has there been any thought about adopting a model that represents regions as outlined functions (similar to how OpenMP is represented today), but abstracts the "fork" call into a general LLVM intrinsic? I believe a lot or all of the same optimizations could be performed with this representation, though the analysis and transformations would necessarily be inter-procedural (and probably require writing new passes, rather than leveraging existing passes). But, the main advantage is that it would be less invasive to the standard LLVM IR and infrastructure.

Is there a middle ground, with first-class support for nested functions or lambdas? This would allow outlining a region into a nested function or lambda. Optimizations on this form would still require inter-procedural analysis and transformation, but it could be limited to a smaller scope (i.e., not an entire module, but perhaps a single function plus a set of nested functions or associated lambdas).

I believe Cray will have a representative attending the BoF on this topic (David Greene), and we are certainly interested in following this proposal and participating in future discussions.

Regards,
Jeff

Hi, I need to say that I agree with Jeff’s concerns. The provided solution won’t help with something like Raja or with upcoming parallel C++ extensions, which are lambda-based. The more I think about it, the more I think that it’s better to have something like “inlined function call”-like extension, which, I think, is similar to the proposal from Cray.

For Raja and Kokko comment, there are OpenMP usage in Raja and Kokko framework. So, the optimizations that apply to OpenMP should help Raja and Kokko.

Hi Jeff,

Has there been any thought about adopting a model that represents
regions as outlined functions (similar to how OpenMP is represented
today), but abstracts the "fork" call into a general LLVM intrinsic?
I believe a lot or all of the same optimizations could be performed
with this representation, though the analysis and transformations
would necessarily be inter-procedural (and probably require writing
new passes, rather than leveraging existing passes). But, the main
advantage is that it would be less invasive to the standard LLVM IR
and infrastructure.

Is there a middle ground, with first-class support for nested
functions or lambdas? This would allow outlining a region into a
nested function or lambda. Optimizations on this form would still
require inter-procedural analysis and transformation, but it could be
limited to a smaller scope (i.e., not an entire module, but perhaps a
single function plus a set of nested functions or associated lambdas).

I have been looking into a way to enable "existing sequential
optimizations" with minimal overhead and a "soundness guarantee". There
have been multiple versions along the way but I think we have a solution
now that makes a lot of sense. I'll explain in the following what I
talked about at the LLVM-Dev'18 Meeting [0,1] and the LCPC'18 workshop
[2]. For some context on what we did before, and how we want to enable
parallel optimizations, see our EuroLLVM'18 Talk [3,4] and our IWOMP'18
paper [5].

When we want to "enable existing optimizations" we should keep in mind
that these are just optimizing sequential aspects of a program. They do
not know, nor do they need to know, about parallelism. They are not
applied today due to the "two levels of indirection" which we introduced
to guarantee a sound compilation in the presence of parallel semantics.

Let's say we want to keep the indirections for now but enable existing
optimizations in their presence. While intra-procedural optimizations
won't be applicable, we can enable inter-procedural optimizations. The
"only problem" is the runtime library call which hides the connection
between the call site (as well as the arguments) and the outlined
parallel function (and its parameters). To overcome this indirection we
introduced a "transitive call site" that exposes this connection
seamlessly through an interface which is a subset of the existing
CallSite abstraction. The transitive call site models the fact that the
runtime library will eventually call the outlined function and pass some
of its arguments along. In contrast to regular call sites, transitive
ones are partially defined. Thus, arguments of the actual call
instruction might not map to parameters of the transitively called
function. Similarly, not all parameters of the transitively called
function have corresponding arguments in the actual call instruction.

Note that transitive call sites have _nothing_ to do with parallelism.
They model a call that is performed by a "broker function" which is
called directly. As a consequence, transitive call sites can be reused
for other indirections that involve "function pointers", e.g., qsort in
C, or std::functions that are passed and called in C++.

I will push patches to phabricator in the next two weeks so everybody
can take a look and provide feedback. I'm happy to answer questions or
start a discussion before that already.

Cheers,
  Johannes

[0] https://llvm.org/devmtg/2018-10/talk-abstracts.html#talk20
[1] https://llvm.org/devmtg/2018-10/slides/Doerfert-Johannes-Optimizing-Indirections-Slides-LLVM-2018.pdf
[2] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt_lcpc18.pdf
[3] http://llvm.org/devmtg/2018-04/slides/Finkel-Representing%20Parallelism%20Within%20LLVM.pdf
[4] https://www.youtube.com/watch?v=u2Soj49R-i4
[5] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf