[RFC][PIR] Parallel LLVM IR -- Stage 0 -- IR extension

Dear all,

This RFC proposes three new LLVM IR instructions to express high-level
parallel constructs in a simple, low-level fashion. For this first stage
we prepared two commits that add the proposed instructions and a pass to
lower them to obtain sequential IR. Both patches have be uploaded for
review [1, 2]. The latter patch is very simple and the former consists
of almost only mechanical changes needed to add new instructions.

The rest of this email contains (1) an introduction of the IR extension
(2) the reasoning behind this approach, (3) a comparison to other ideas
proposed so far, (4) a validation of the feasibility and potential
impact, and (5) an outlook on the next steps.

(1) IR extension:
Parallel IR adds three new terminator instructions that define the
beginning and the end of parallel regions in the CFG. A parallel region
is a connected subgraph of the CFG that is potentially executed by two
threads in parallel. It can only be entered with a fork instruction and
spreads till a join instruction is reached. Therefor parallel regions
are single-entry-multiple-exit regions. Parallel regions can be nested
and if they are, they form a parallel region tree similar to the loop
tree maintained by the natural loop info pass. Each parallel region
defines two independent “sibling” tasks, namely the forked and
continuation task.

The new instructions are defined as follows:

1. fork: marks the beginning of parallel region. Every fork has two
         successor blocks which represent two parallel tasks. We call
         these two “sibling” tasks the forked and continuation tasks.
         Nested forking is supported, meaning that another fork can be
         reach prior to the join.

2. halt: marks the end of a forked task. The "sibling" continuation block
         (see fork above) is the operand of the halt terminator. This
         represents the idea of asymmetric parallelism as introduced by
         [1]. One advantage of asymmetric parallelism is that sequential
         semantics of the program are clear from its CFG (ref. [1]).
         Note that the edge from a forked block to a continuation block
         (the one introduced by the halt) represents the control flow
         when the two successors of a fork execute sequentially, not
         when they execute in parallel. In the latter case there is no
         “control transfer” happening via this edge but only
         synchronization between the tasks.

3. join: marks a synchronization point and the end of a parallel region.
         Once a join terminator is reached by a thread, execution stops
         in that thread until all tasks spawned by that thread finish
         their work, thus reach their respective halt instruction. A
         join shall only be reached by the continuation task of a fork,
         the forked task shall reach a halt with the continuation as a
         successor.

Here is an example of a parallel OpenMP loop and its idiomatic lowering
to Parallel IR. We set up a wiki [0] with additional examples.

#pragma omp parallel
for(int i = 0; i < n; ++i) {
A[i] = C[i];
}

preheader:
  br label %header

header:
  %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
  %done = icmp ge %i, %n
  br i1 %done, label %exit, label %body

body:
  fork label %task, label %latch

task:
  %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
  %aval = load i32* %aptr
  %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
  store i32 %aval, i32* %aptr
  halt label %latch

latch:
  %inc = add i32, i32 %i, i32 1
  br label %header

exit:
  join label %afterloop

afterloop:
...

(2) Reasoning:
The proposed approach is crafted such that the semantics of the parallel
program is represented correctly in almost native, low-level IR right
after front-end and preserved at any point till the final lowering to
sequential IR or parallel runtime library calls. To this end, asymmetric
parallelism is employed, a concept that uses control flow and the common
concept of dominance to represent parts of the parallel semantics. In
this model the parallel tasks do not dominate each other and only one
parallel task dominates the code after the parallel region. As a
consequence, various transformations that would break assumptions we
make about parallel regions cannot happen (see [3,4]). While the
explicitly modeled control flow together with dominance prevents various
code motion problems, the use of terminators helps to minimize the
changes needed to educate passes about parallel regions. Only a fraction
of analysis and transformation passes deal with terminators explicitly.
Most passes either test for known terminators (like branches), rely on
dominance information, or work on a basic block level. To even further
reduce changes to the existing passes, high-level concepts are broken
down to already available low-level concepts instead of introducing new,
semantically rich instructions/intrinsics (see the last paragraph of [5]
and section 4 in the PIR white paper [6] for examples). Finally, this
scheme allows a pass to simply reason about the sequential semantics of
a parallel region, transform it back to one if needed or deemed
beneficial and employ existing tooling solutions to debug and analyze
the code [7].

(3) Comparison:
The BoF discussion sheet [8] and the recent “[RFC] on IR-level region
annotations” [9] both list pros and cons of different proposed schemes
and implementations. We summarize and comment the discussion on the ones
listen in the recent RFC here:
  (a) Metadata: It seems a consensus has been reached that metadata is
        not the solution but only a means to enhance a different solution.
  (c) One Intrinsic per directive/clause: This approach basically embeds
        a high-level (parallel) language in LLVM IR using intrinsics. It
        seems there is little to no support for this approach at the
        moment.
  (d) Parallel loop/region annotations: Here, intrinsics enclosing a
        parallel loop/region are used to represent parallelism.
        High-level knowledge is represented as attached metadata or in
        separate intrinsics. For more details please see the original
        RFC [9]. In the discussion several potential drawbacks have been
        mentioned:
        - The annotations might be too general [10].
        - The IR is not semantically correct (or ready for optimization)
          after the front-end and needs an additional “prepare phase for
          pre-privatization" [11].
        - The currently available “potential side effect for intrinsic
          calls” seem not to suffice for the proposed intrinsics as they
          do not have "call semantics" [12].
  (b) Parallel instructions (this approach): The table in the region RFC
        [9] lists two drawbacks with this approach, both of which have
        already been called into question [5]. The first drawback is the
        effort needed to implement this scheme which is discussed in
        more detail in section (4) of this mail. The second drawback is
        the need for additional representation of high-level information
        that is not part of the semantics of the new fork-join
        instructions. As mentioned above, the choice to keep the new
        instructions as simple as possible is deliberate. This parallel
        IR is intended to be extensible, and in particular, compatible
        with representations of high-level parallel concepts that might
        be developed in the future. For the time being, the parallel IR
        is compatible with approach taken today of lowering high-level
        parallel linguistics, such as reductions and private memory, to
        existing IR constructs, such as parallel-runtime calls,
        atomicrmw instructions, and well placed alloca’s [5,6]. Although
        other extensions to the IR might allow LLVM to compile these
        higher-level constructs more effectively, we see no reason the
        parallel IR would conflict with any such extensions. (On the
        contrary, the parallel IR would seem to help compiler analyses
        of higher-level parallel constructs by exposing logical
        parallelism.)

(4) Feasibility and Impact:
The Tapir and PIR prototypes demonstrate the feasibility of this
approach. The Tapir prototype [13] has recently proven its robustness as
the standard compiler in the MIT class on parallel programming. It was
implemented in ~ 5k LOC. However, >1k are explicit parallel
optimization, 1k is used to add new instructions (thus mechanical) and
2k are used to lower the parallelism (basically needed for any scheme).
Only the rest is required to make it work with existing analysis and
transformation passes. While Tapir added explicit optimization passes
for parallel regions/loops, the representation allows for a variety of
classic optimizations (CSE, GVN, LICM, loop unrolling, TRE) to work with
little to no modifications. Potential speedups compared to a classic
“early-outlining” approaches can also be seen in the Tapir paper [13].
For the PIR prototype [14] we modified only three transformation passes
(<20 LOC) [15] before we could run the O3 pipeline successfully on a
parallel matrix multiplication.

Together, these prototypes show how little passes actually inspect new
(or “unknown”) terminators. The default assumption passes have to make,
namely that control might be transferred to any successors at runtime,
has, in terms of potential compiler transformations, a similar effect as
the parallel semantics we want to model, namely that control is
transferred to all successors.

(5) Outlook:
This first stage will only introduce and test the new instructions and
the sequentialization pass. Afterwards we intend to start additions in
different, partially overlapping but often orthogonal directions. We do
welcome comments as well as developers for each of them:
- Analysis and optimization:
    * A “parallel region info” pass to keep track of parallel regions
      and their nesting. The information can be made accessible in a
      “parallel region tree” similar to the loop tree maintained by the
      loop info pass. [stage 1, immediate next goal]
    * Extension of the verifier that allow to check parallel IR for
      “well-formedness”. [stage 1, immediate next goal]
    * Documentation of the PIR instructions in the language reference.
      [stage 1, immediate next goal]
    * A cost analysis for parallel tasks that can be queried by
      optimizations. The cost model needs to take the hardware, the
      runtime library and the parallel tasks into account.
    * Vectorizer enhancements to enable the vectorization of parallel
    * loops and tasks.
    * Parallelization centric optimizations:
      a) Parallel tasks can be balanced, merged or split as well as created
         from and lowered to sequential code.
      b) Barriers can be eliminated.
      c) Parallel loops can be statically scheduled or created from
         parallel recursive calls [13]
    * Analysis to extract high-level information (reductions, private
      memory, ...) from the low-level representation.
- Front-end:
    * Lowering of simple OpenMP and Cilk++ annotations to PIR, including
      parallel sections and parallel loops with limited support for
      clauses (at first) (examples can be found here [1]). [milestone 1]
    * Generation of PIR code through automatic parallelization. A
      patched version of Polly exists that emits parallel loops using PIR instead of
      OpenMP runtime calls or llvm.parallel.loop metadata. [milestone 1]
    * Representation of more evolved high-level features like assignment
      of computation units.
- Back-end:
    * Lowering of PIR regions to calls to the OpenMP (GOMP) and Cilk++
      runtime library. [milestone 1]
    * A simple parallel library, e.g., based on pthreads, to be shipped
      with LLVM as a fallback implementation for parallel regions.

Thank you all for your time and hopefully constructive input on this proposal!

Cheers,
  Johannes, on behalf of the PIR team

Disclaimer:
This RFC, the patches, the wiki, etc. are a joint effort by Tao B.
Schardl (MIT), Charles E. Leiserson (MIT), Kareem Ergawy (Saarland
University), Simon Moll (Saarland University) and myself. However, ideas
and feedback came from many people, including the members of the
LLVM-HPC IR Extensions working group (Hal Finkel, Xinmin Tian, ...), the
participants in the BoF at the US Developers’ meeting, everybody that
commented on the BoF discussion sheet [16] and the recent RFC on
IR-level region annotations [9] (Mehdi Amini, Sanjoy Das, Daniel Berlin,
...).

[0] https://github.com/Parallel-IR/llvm-pir/wiki
[1] https://reviews.llvm.org/D29250
[2] https://reviews.llvm.org/D29251
[3] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109302.html
[4] http://lists.llvm.org/pipermail/llvm-dev/2015-March/083348.html
[5] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109264.html
[6] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf
[7] http://supertech.csail.mit.edu/papers/spbags.pdf & www.cse.wustl.edu/~angelee/papers/cilkprof.pdf
[8] https://goo.gl/Blp2Xr
[9] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108906.html
[10] http://lists.llvm.org/pipermail/llvm-dev/2017-January/108997.html
[11] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109377.html
[12] http://lists.llvm.org/pipermail/llvm-dev/2017-January/109351.html
[13] http://wsmoses.com/tapir.pdf
[14] https://github.com/jdoerfert/llvm-pir/tree/feature/fork-join
[15] https://github.com/jdoerfert/llvm-pir/commit/854259881d24d71f9f1f17e52547758c7be0618a
[16] https://goo.gl/wKps3c

Hi Johannes,

Sorry for the delayed response! I have some basic questions inline:

This RFC proposes three new LLVM IR instructions to express high-level
parallel constructs in a simple, low-level fashion. For this first stage
we prepared two commits that add the proposed instructions and a pass to
lower them to obtain sequential IR. Both patches have be uploaded for
review [1, 2]. The latter patch is very simple and the former consists
of almost only mechanical changes needed to add new instructions.

The rest of this email contains (1) an introduction of the IR extension
(2) the reasoning behind this approach, (3) a comparison to other ideas
proposed so far, (4) a validation of the feasibility and potential
impact, and (5) an outlook on the next steps.

(1) IR extension:
Parallel IR adds three new terminator instructions that define the
beginning and the end of parallel regions in the CFG. A parallel region
is a connected subgraph of the CFG that is potentially executed by two
threads in parallel. It can only be entered with a fork instruction and
spreads till a join instruction is reached. Therefor parallel regions
are single-entry-multiple-exit regions. Parallel regions can be nested
and if they are, they form a parallel region tree similar to the loop
tree maintained by the natural loop info pass. Each parallel region
defines two independent “sibling” tasks, namely the forked and
continuation task.

The new instructions are defined as follows:

1. fork: marks the beginning of parallel region. Every fork has two
         successor blocks which represent two parallel tasks. We call
         these two “sibling” tasks the forked and continuation tasks.
         Nested forking is supported, meaning that another fork can be
         reach prior to the join.

2. halt: marks the end of a forked task. The "sibling" continuation block
         (see fork above) is the operand of the halt terminator. This
         represents the idea of asymmetric parallelism as introduced by
         [1]. One advantage of asymmetric parallelism is that sequential
         semantics of the program are clear from its CFG (ref. [1]).
         Note that the edge from a forked block to a continuation block
         (the one introduced by the halt) represents the control flow
         when the two successors of a fork execute sequentially, not
         when they execute in parallel. In the latter case there is no
         “control transfer” happening via this edge but only
         synchronization between the tasks.

3. join: marks a synchronization point and the end of a parallel region.
         Once a join terminator is reached by a thread, execution stops
         in that thread until all tasks spawned by that thread finish
         their work, thus reach their respective halt instruction. A
         join shall only be reached by the continuation task of a fork,
         the forked task shall reach a halt with the continuation as a
         successor.

Here is an example of a parallel OpenMP loop and its idiomatic lowering
to Parallel IR. We set up a wiki [0] with additional examples.

#pragma omp parallel
for(int i = 0; i < n; ++i) {
A[i] = C[i];
}

preheader:
  br label %header

header:
  %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
  %done = icmp ge %i, %n
  br i1 %done, label %exit, label %body

body:
  fork label %task, label %latch

task:
  %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
  %aval = load i32* %aptr
  %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
  store i32 %aval, i32* %aptr
  halt label %latch

latch:

Can we have a PHI node in this block? If yes, how is the incoming
value for %task computed when %task and %latch are running in
parallel?

  %inc = add i32, i32 %i, i32 1
  br label %header

exit:
  join label %afterloop

afterloop:
...

Looks like there are no edges to %exit from "inside the loop"? What
is the control flow here?

(2) Reasoning:
The proposed approach is crafted such that the semantics of the parallel
program is represented correctly in almost native, low-level IR right
after front-end and preserved at any point till the final lowering to
sequential IR or parallel runtime library calls. To this end, asymmetric
parallelism is employed, a concept that uses control flow and the common
concept of dominance to represent parts of the parallel semantics. In
this model the parallel tasks do not dominate each other and only one
parallel task dominates the code after the parallel region. As a

Can you give an example to show what you mean by "only one parallel
task dominates the code after the parallel region"?

What about cases like these (in quasi-llvm syntax):

body:
   fork label %a, label %b
a:
   x = alloca
   use(x) // but not escape
   halt label %b
b:
   y = alloca
   use(y) // but not escape
   br label %cont
cant:
   ...

=>

   common_alloca = alloca
body:
   fork label %a, label %b
a:
   use(common_alloca) // but not escape
   halt label %b
b:
   use(common_alloca) // but not escape
   br label %cont
cant:
   ...

As far as I can tell, nothing in the IR tells LLVM that %a and %b may
"interfere" with each other (by running in parallel).

Hi Sanjoy,

Sorry for the delayed response!

No worries.

I have some basic questions inline:

Answers inlined.

> This RFC proposes three new LLVM IR instructions to express high-level
> parallel constructs in a simple, low-level fashion. For this first stage
> we prepared two commits that add the proposed instructions and a pass to
> lower them to obtain sequential IR. Both patches have be uploaded for
> review [1, 2]. The latter patch is very simple and the former consists
> of almost only mechanical changes needed to add new instructions.
>
> The rest of this email contains (1) an introduction of the IR extension
> (2) the reasoning behind this approach, (3) a comparison to other ideas
> proposed so far, (4) a validation of the feasibility and potential
> impact, and (5) an outlook on the next steps.
>
> (1) IR extension:
> Parallel IR adds three new terminator instructions that define the
> beginning and the end of parallel regions in the CFG. A parallel region
> is a connected subgraph of the CFG that is potentially executed by two
> threads in parallel. It can only be entered with a fork instruction and
> spreads till a join instruction is reached. Therefor parallel regions
> are single-entry-multiple-exit regions. Parallel regions can be nested
> and if they are, they form a parallel region tree similar to the loop
> tree maintained by the natural loop info pass. Each parallel region
> defines two independent “sibling” tasks, namely the forked and
> continuation task.
>
> The new instructions are defined as follows:
>
> 1. fork: marks the beginning of parallel region. Every fork has two
> successor blocks which represent two parallel tasks. We call
> these two “sibling” tasks the forked and continuation tasks.
> Nested forking is supported, meaning that another fork can be
> reach prior to the join.
>
> 2. halt: marks the end of a forked task. The "sibling" continuation block
> (see fork above) is the operand of the halt terminator. This
> represents the idea of asymmetric parallelism as introduced by
> [1]. One advantage of asymmetric parallelism is that sequential
> semantics of the program are clear from its CFG (ref. [1]).
> Note that the edge from a forked block to a continuation block
> (the one introduced by the halt) represents the control flow
> when the two successors of a fork execute sequentially, not
> when they execute in parallel. In the latter case there is no
> “control transfer” happening via this edge but only
> synchronization between the tasks.
>
> 3. join: marks a synchronization point and the end of a parallel region.
> Once a join terminator is reached by a thread, execution stops
> in that thread until all tasks spawned by that thread finish
> their work, thus reach their respective halt instruction. A
> join shall only be reached by the continuation task of a fork,
> the forked task shall reach a halt with the continuation as a
> successor.
>
>
> Here is an example of a parallel OpenMP loop and its idiomatic lowering
> to Parallel IR. We set up a wiki [0] with additional examples.
>
> #pragma omp parallel
> for(int i = 0; i < n; ++i) {
> A[i] = C[i];
> }
>
>
> preheader:
> br label %header
>
> header:
> %i = phi [ i32 0, %preheader ], [ %inc, %latch ]
> %done = icmp ge %i, %n
> br i1 %done, label %exit, label %body
>
> body:
> fork label %task, label %latch
>
> task:
> %aptr = getelementptr i32, i32* %A, i32 0, i32 %i
> %aval = load i32* %aptr
> %cptr = getelementptr i32, i32* %C, i32 0, i32 %i
> store i32 %aval, i32* %aptr
> halt label %latch
>
> latch:

Can we have a PHI node in this block? If yes, how is the incoming
value for %task computed when %task and %latch are running in
parallel?

You _cannot_ place PHI nodes here. As you noticed, it is not obvious
which value should be forwarded to the PHI as _both_ predecessors are
executed.

> %inc = add i32, i32 %i, i32 1
> br label %header
>
> exit:
> join label %afterloop
>
> afterloop:
> ...

Looks like there are no edges to %exit from "inside the loop"? What
is the control flow here?

Right, %exit is _not_ part of the loop but it is the synchronization point
_after_ the loop. The loop spawns one iteration at a time and after all
have been started it will wait at the join till all have finished.

> (2) Reasoning:
> The proposed approach is crafted such that the semantics of the parallel
> program is represented correctly in almost native, low-level IR right
> after front-end and preserved at any point till the final lowering to
> sequential IR or parallel runtime library calls. To this end, asymmetric
> parallelism is employed, a concept that uses control flow and the common
> concept of dominance to represent parts of the parallel semantics. In
> this model the parallel tasks do not dominate each other and only one
> parallel task dominates the code after the parallel region. As a

Can you give an example to show what you mean by "only one parallel
task dominates the code after the parallel region"?

I mean that %a below is not dominating %cont but only %b is.

What about cases like these (in quasi-llvm syntax):

body:
   fork label %a, label %b
a:
   x = alloca
   use(x) // but not escape
   halt label %b
b:
   y = alloca
   use(y) // but not escape
   br label %cont
cont:
   ...

=>

   common_alloca = alloca
body:
   fork label %a, label %b
a:
   use(common_alloca) // but not escape
   halt label %b
b:
   use(common_alloca) // but not escape
   br label %cont
cont:
   ...

As far as I can tell, nothing in the IR tells LLVM that %a and %b may
"interfere" with each other (by running in parallel).

Correct, this is something we have to teach LLVM. The fork does have an
effect to the memory _and_ to the stack. Since allocas (in loops) that
could escape have to be treated differently from other allocas we should
have code paths and checks in the pipeline we can leverage here.

To (re)start a discussion here we published two patches to the
phabricator:

1) A documentation draft (including LangRef) for the PIR instructions
   and the parallel regions [0].

2) A draft implementation for a ParallelRegionInfo pass that identifies
   and maintains parallel regions in the CFG [1]. It also serves as a
   verifier pass for now.

Please take a look and comment the patches either in the phabricator or
here on the list.

Thanks!

[0] https://reviews.llvm.org/D30353
[1] https://reviews.llvm.org/D30354