LLVM Parallel IR

I’m part of a research group at MIT looking to create an extension of LLVM that inherently allows one to nicely code a parallel loop.

Most parallel frameworks tend to take the body of a parallel loop and stick it inside of a function for the parallel runtime to call when appropriate. However, this makes optimizations significantly more difficult as most compiler optimizations tend to be intraprocedural. The goal of this parallel IR is to allow LLVM to more easily optimize parallel code – which often gets the short end of the optimization stick.

Long story short: I was wondering if anyone in the LLVM community has already begun a similar project or knows of something that begins to accomplish this.

Additionally, does anyone who has worked with parallelism in LLVM have any specific comments or suggestions regarding this project?

Sincerely,
William Moses
MIT Computer Science and Artificial Intelligence Laboratory

Hi William,

This sounds like a very interesting idea, but it is also not an easy problem.

You probably want to have a look at how LLVM models parallel loops using the loop.parallel metadata. At the time this metadata was introduced (see the archives), we had a longer debate on how to model parallelism. The first ideas suggested just adding a metadata flag to the loop header to indicate that the loop is parallel. However, the problem with such an approach is that scalar transformations might introduce memory operations that destroy the parallelism. As these transformations commonly have no idea about loop-level metadata, the metadata needs to be designed in such a way that newly introduced scalar dependences automatically invalidate the loop-level metadata.

There were also relevant discussions at the time LLVM's OpenMP support was introduced. People considered using some kind of parallelism metadata and introducing only the OpenMP runtime calls at the IR level. The hope was to be able to perform useful optimizations at the IR level. If my memory serves, one of the critical issues (besides other engineering considerations) was that parallelism metadata in LLVM is optional and can always be dropped. However, for OpenMP it is sometimes incorrect to execute a loop sequentially that has been marked parallel in the source code.

Best regards,
Tobias

Hi William,

A similar discussion was recently started on this list by a grad
student at MIT (the student was Lefteris Ioannidis, and the discussion
was titled "LLVM parallel annotations"), so you might also want to
participate in that discussion.

I am working on automatic parallelization in LLVM and have frequently
come to the point during my research where I wished I had started by
discussing and implementing a parallel extension of LLVM IR.
Unfortunately I did not, or at least not in the way it should be
implemented. So I am happy to hear that someone plans to work on this.

As you might be aware, there is already a representation for parallel loops in LLVM IR. It is based on metadata added both to the loop (llvm.loop <id>) and to all memory accesses within that loop that are known to be safe (llvm.mem.parallel_loop_access <id>). The loop with id <id> can then be executed in parallel (i.e., every iteration thereof), provided all potential memory accesses it contains carry a corresponding mark with the same id. Optimizations that are not aware of the parallelism can thus potentially break it, but the result would be sequential execution, which is at least sound.
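A rough, hand-written sketch of what this looks like in the IR (simplified; the exact textual syntax varies between LLVM versions, and !0 is the self-referential loop id):

  for.body:
    %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
    %p = getelementptr inbounds i32* %A, i64 %i
    %v = load i32* %p, !llvm.mem.parallel_loop_access !0
    %v.inc = add i32 %v, 1
    store i32 %v.inc, i32* %p, !llvm.mem.parallel_loop_access !0
    %i.next = add i64 %i, 1
    %done = icmp eq i64 %i.next, %N
    br i1 %done, label %for.end, label %for.body, !llvm.loop !0

  !0 = !{!0}   ; self-referential loop id

If a transformation later introduces a load or store inside the loop without the llvm.mem.parallel_loop_access mark, consumers of the metadata have to assume the loop is no longer parallel and fall back to sequential execution.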

You will find details and discussions about those annotations in the
archives of this list and in the sources (e.g., LoopInfo).

Although I understand why such a minimally invasive approach to parallelism in the IR is desirable for most use cases, I think that parallelism is invasive by nature and would have to influence most optimizations. During my PhD work I always wished for a completely parallel IR, with all analyses and transformations being aware of parallelism, or at least those that have to be. Such an IR would be usable as the target for parallel source languages like Cilk(+) and OpenMP, or simply for automatic parallelization.

For that, however, it would have to be able to express parallelism not only for loops. Instead, one should also be able to "fork" arbitrary code regions for parallel execution. Think of a Cilk spawn.
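Purely for illustration (these marker functions are invented for this sketch; nothing like them exists in LLVM today), a spawned region could be bracketed at the IR level roughly like this:

  ; hypothetical markers, invented for this sketch only
  declare void @parallel.task.start(i32)
  declare void @parallel.task.end(i32)

    call void @parallel.task.start(i32 1)
    call void @some_fun(i64 %m)            ; body of the spawned task
    call void @parallel.task.end(i32 1)
    ; code following the marked region may run concurrently with task 1

Whether such markers should be plain calls, intrinsics with special semantics, or real first-class instructions is exactly the design question.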

I think it should be possible to come up with a minimal, clearly
defined set of parallelism constructs in the IR. The goal should be to
be able to map the most important forms of parallelism to these
constructs. Completely parallel loops (DOALL loops) are just one form.
A first guide might be OpenMP constructs, including parallel sections
and tasks.

I'd be interested in hearing more details about your ideas in that direction.

Cheers,

Exactly. The fact that metadata goes stale quickly is not a flaw but a design decision. If the producing pass is close enough to the consuming one, you should be fine. If not, then proving legality might be tricky and time-consuming. The problem with OpenMP and other vectorizer pragmas is that the metadata is introduced by the front end, and it has a long way to travel down the pipeline before it is consumed. Having said that, it works well enough if you keep your code simple.

I'd be interested in knowing what, for parallelization, cannot be accomplished in the IR in terms of metadata, and what new constructs would need to be added to the IR in order to do that. If there is a benefit for your project at the same time as for OpenMP and our internal vectorizer annotations, changing the IR wouldn't be impossible. We have done the same for exception handling...

cheers,
--renato

> If my memories are right, one of the critical issues (besides
> other engineering considerations) was that parallelism metadata in LLVM is
> optional and can always be dropped. However, for
> OpenMP it sometimes is incorrect to execute a loop sequential that has been
> marked parallel in the source code.

> Exactly. The fact that metadata goes stale quickly is not a flaw, but
> a design decision. If the producing pass is close enough from the
> consuming one, you should be ok. If not, then proving legality might
> be tricky and time consuming. The problem with OpenMP and other
> vectorizer pragmas is that the metadata is introduced by the
> front-end, and it's a long way down the pipeline for it to be
> consumed. Having said that, it works ok if you keep your code simple.

I know that this was a long discussion and that the "breakability" of parallel loop info is the result of a design decision. And I also believe that this is a good approach as long as parallelism is not part of the contract with the user (i.e., the programmer placing explicit parallelism annotations, or the language designer introducing parallelism into the semantics of the language). Tobias already mentioned problems with breaking OpenMP semantics. Similarly, different forms of parallelism, like task parallelism, could certainly be represented using metadata, say by extracting the task code into a function and annotating the call as being spawned for parallel execution. Again, optimizations could break it, violating a possible contract with the user.

Alternatively, we could introduce intrinsics, which is what I currently do. This forbids certain optimizations, such as moving potential memory accesses into and out of "parallel code sections", and therefore does not break parallelism that often. The headache that this approach causes me is that basic analyses like dominance, reachability and the like are broken in that setting: something computed in one parallel task that is followed by another parallel task in the CFG does not, in fact, dominate or even reach that second task. This of course affects the precision and correctness of optimizations, for instance redundant code elimination or GVN.

> I'd be interested in knowing what in the IR cannot be accomplished in
> terms of metadata for parallelization, and what would be the new
> constructs that needed to be added to the IR in order to do that. If
> there is benefit for your project at the same time as for OpenMP and
> our internal vectorizer annotation, changing the IR wouldn't be
> impossible. We have done the same for exception handling...

I understand that parallelism is a very invasive concept and that introducing it into a so-far "sequential" IR will cause severe breakage and headaches. But I am afraid that if we accept parallelism as a first-class citizen, then I would prefer making it a core part of the IR. One possibility to do this gradually might be to have a separate, parallel IR, say PIR, that is lowered to regular IR at some point (however that point is chosen). Existing optimizations could then be gradually moved from the regular IR phase to the PIR phase where appropriate and useful. Nevertheless, I do not propose doing such a thing in LLVM right now. I think this might be an option for a (bigger) research project at first.

I'd be happy to hear further thoughts about that.

Cheers,

From: "Kevin Streit" <streit@mailbox.org>
To: "Renato Golin" <renato.golin@linaro.org>, "Tobias Grosser" <tgrosser@inf.ethz.ch>
Cc: "William Moses" <wmoses@csail.mit.edu>, "LLVM Developers Mailing List" <llvmdev@cs.uiuc.edu>
Sent: Tuesday, March 10, 2015 3:36:10 AM
Subject: Re: [LLVMdev] LLVM Parallel IR

>
>
> > If my memories are right, one of the critical issues (besides
> > other engineering considerations) was that parallelism metadata
> > in LLVM is
> > optional and can always be dropped. However, for
> > OpenMP it sometimes is incorrect to execute a loop sequential
> > that has been
> > marked parallel in the source code.
>
> Exactly. The fact that metadata goes stale quickly is not a flaw,
> but
> a design decision. If the producing pass is close enough from the
> consuming one, you should be ok. If not, then proving legality
> might
> be tricky and time consuming. The problem with OpenMP and other
> vectorizer pragmas is that the metadata is introduced by the
> front-end, and it's a long way down the pipeline for it to be
> consumed. Having said that, it works ok if you keep your code
> simple.

I know that this was a long discussion and that the "breakability" of
parallel
loop infos is the result of a design decision. And I also believe
that this is
a good way as long as parallelism is not part of the contract with
the user
(i.e., the programmer when placing explicit parallelism annotations,
or the
language designer introducing parallelism into the semantics of the
language).
Tobias already mentioned problems with breaking OpenMP semantics.
Similarly
different forms of parallelism, like task parallelism, could
certainly be
represented using metadata, say by extracting the task code to a
function and
annotating the call as being spawned for parallel execution. Again,
optimizations could break it, violating a possible contract with the
user.

Alternatively we could introduce intrinsics, which I currently do.
This would
forbid certain optimizations like moving potential memory accesses in
and out
of "parallel code sections" and therefore does not break parallelism
that
often. The headaches that this approach causes me are that basic
analyses like
dominance, reachability and the like are broken in that setting as
everything
computed in one parallel task, followed by another parallel task in
the cfg
does not dominate or even reach the second task. This of course
influences the
precision and correctness of optimizations, like for instance
redundant code
elimination or GVN.

> I'd be interested in knowing what in the IR cannot be accomplished
> in
> terms of metadata for parallelization, and what would be the new
> constructs that needed to be added to the IR in order to do that.
> If
> there is benefit for your project at the same time as for OpenMP
> and
> our internal vectorizer annotation, changing the IR wouldn't be
> impossible. We have done the same for exception handling...

I understand that parallelism is a very invasive concept and
introducing it
into a so far "sequential" IR will cause severe breakage and
headaches. But I
am afraid that if we accept parallelism as being a first class
citizen, then I
would prefer doing it as a core part of the IR. One possibility to
do this
gradually might also be to have a seperate, parallel, IR, say PIR,
that will be
lowered to regular IR at some point (however this point is chosen).
Existing
optimizations can then be gradually moved from the regular IR phase
to the PIR
phase where appropriate and useful. Nevertheless I do not propose to
do such a
thing in LLVM right now. I think this might be an option for a
(bigger)
research project at first.

I'd be happy to hear further thoughts about that.

Part of the issue here is that the benefit of new IR features versus enhancing LLVM (TLI, BasicAA, etc.) to understand more about the semantics of the runtime calls is not clear. To play devil's advocate:

- Currently, runtime calls interfere with optimizations such as LICM because of the conservative answer BasicAA must provide for unknown external functions. However, BasicAA already has knowledge of certain external function calls, and knowledge of these runtime calls could be added as well.

- We'd like to perform some optimizations, such as duplicate barrier removal. However, an optimization can recognize consecutive calls to a barrier runtime library function pretty easily -- we don't need special IR features for that, perhaps only a good abstraction layer if we support different runtimes.
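To sketch that second point in the IR (assuming an OpenMP-style runtime; the __kmpc_barrier entry point and its argument types here are only an illustration of what such a runtime exposes):

  ; two barriers with no intervening memory access or synchronization
  call void @__kmpc_barrier(%ident_t* @.loc, i32 %tid)
  call void @__kmpc_barrier(%ident_t* @.loc, i32 %tid)

A pass that knows these calls are barriers (via TLI or a thin runtime-abstraction layer) could simply delete the second call; no new IR construct is required, only knowledge about the callee.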

-Hal

> Again, optimizations could break it, violating a possible contract with the user.

AFAIK, all these annotations are more hints than contracts. Adding
#pragma omp simd on a loop that has forward dependencies that cannot
be solved should not make the loop vectorize (incorrectly). Same goes
for standard OMP, threads and other types of parallel computation.

> The headaches that this approach causes me are that basic analyses like
> dominance, reachability and the like are broken

Yes, intrinsics are heavy handed and will stop the compiler from being
smart in many ways.

> One possibility to do this gradually might also be to have a seperate, parallel, IR, say PIR, that will be
> lowered to regular IR at some point (however this point is chosen).

So, we have Polly, which does that, and it's not a trivial issue. Adding
yet another representation is worrying. The IR Language Reference is
comprehensive and authoritative, and is a great resource for building
IR from ASTs or adding optimisations. For every new representation, we
have to add a similar document, plus another describing how it is
represented in the other IRs, so that transformations are well defined. I
don't think we have that for Polly, and adding PIR would double that
problem.

I'd be much more open to slow moving changes to the current IR
(especially if they benefit Polly, too). This path worked very well
for exception handling (which is an equally complex case), and should
work for parallelism.

But that's just my opinion. :)

cheers,
--renato

> Again, optimizations could break it, violating a possible contract with the
> user.

> AFAIK, all these annotations are more hints than contracts. Adding
> #pragma omp simd on a loop that has forward dependencies that cannot
> be solved should not make the loop vectorize (incorrectly). Same goes
> for standard OMP, threads and other types of parallel computation.

I am actually not completely familiar with the semantics of all OpenMP directives, but in the case of Cilk, for instance, a spawn means "execute in parallel if the runtime deems it beneficial". In my understanding this does not cover sequential execution caused by other (static) optimizations breaking the parallelism. Other parallel source languages might be even stricter in this regard.

> The headaches that this approach causes me are that basic analyses like
> dominance, reachability and the like are broken

> Yes, intrinsics are heavy handed and will stop the compiler from being
> smart in many ways.

Just to be clear here. Code like this:

  parallel.task.start(1)
    some_fun(n/2)
  parallel.task.end(1)

  ...

  parallel.task.start(2)
    other_fun(n/2)
  parallel.task.end(2)

Would be transformed to

  parallel.task.start(1)
    tmp = n/2
    some_fun(tmp)
  parallel.task.end(1)

  ...

  parallel.task.start(2)
    other_fun(tmp)
  parallel.task.end(2)

This is considered fine by the respective optimizations today, even if the parallel.task intrinsics might throw. The problem stems from dominance analysis concluding that the new location of the division dominates all its uses, which is wrong. This is not just "not smart", it is incorrect, and it is the result of insufficiently integrating parallelism into a sequential IR. One possibility would be to add a "barrier" property to the intrinsics, with the semantics of not allowing anything to be moved across them. But this again would prohibit other useful optimizations.

Basically, the parallelism and the different dominance relation resulting from it might affect most flow analyses. Of course we could make them all aware of the intrinsics, but how far away are we then from integrating parallelism into the one, or a different, IR?

> One possibility to do this gradually might also be to have a seperate,
> parallel, IR, say PIR, that will be
> lowered to regular IR at some point (however this point is chosen).

> So, we have Polly that does that, and it's not a trivial issue. Adding
> yet another representation is worrying. The IR Language Reference is
> comprehensive and authoritative, and is a great resource for building
> IR from ASTs or adding optimisations. For every new representation, we
> have to add a similar document, plus another to tell how that is
> represented in other IRs, so that transformations are well defined. I
> don't think we have that for Polly, and adding PIR would double that
> problem.

You are completely right on that and I fully agree. That's why, from the perspective of LLVM, I would definitely not perform such a big, risky and error-prone step. I think defining a parallel IR (with first-class parallelism) is a research topic, just as Polly was in the beginning (and still is, for that matter).

> I'd be much more open to slow moving changes to the current IR
> (especially if they benefit Polly, too). This path worked very well
> for exception handling (which is an equally complex case), and should
> work for parallelism.

> But that's just my opinion. :)

As said, I completely agree with you on that. It's just that, to play parallelism's advocate here, parallelism is invasive and consequently has, or should have, a significant impact on most analyses and transformations. I am not sure how much effort and how many headaches will be necessary in the long run to gradually move over to a parallelism-aware toolchain.

Cheers,

Any choice (metadata, intrinsics, a new IR, changing the current IR) will take time
and will be hard. What matters most is where you want to go from
there.

If your goal is mainly research (produce a PhD and be done with it), then creating a whole new IR, where you control how it converts to/from LLVM IR on a very specific version of LLVM, is probably the fastest way to get there. This will also ensure that you're not adding unknown delays to an already troubled enterprise (PhDs). As an example, we have Polly, which is still not fully integrated into LLVM and maybe never will be. It has some very interesting transformations, but the integration with the rest of the workflow has proven not to be easy.

If you want this to be widely used after the research is done, then I'd suggest a little more discussion on the merits of the remaining alternatives (metadata, intrinsics, IR changes). Your short description has already given some food for thought, and I can see why you'd use intrinsics (barriers?) and why metadata could be cumbersome (annotate everything?) and fragile (we'd have to teach most passes not to throw it away under specific conditions). Other heavyweight optimisations, such as the vectorizer and LTO, were quickly integrated and evolved inside LLVM because that was the original intention.

One thing that would help a lot is to know where the current IR constructs are short-sighted. So, like you did just now, give some examples of the problems you'll face and why the current IR can't handle them. We can then discuss the other options and see which ones are more appropriate. Based on that discussion, you can decide which parts of your project will change the IR and which will use other techniques (metadata, intrinsics).

cheers,
--renato

Hi William,

Currently, LLVM lowers parallel constructs of OpenMP at the front-end level to runtime calls; although simple, this ad-hoc solution prevents high-level reasoning about parallel languages while possibly introducing semantic inconsistencies in existing sequential program analyses. Worse, users implicitly rely upon the lack of interprocedural analyses within many compilers to preserve the semantics of programs even in the presence of code optimizations.

As part of my PhD work, I worked on introducing a parallel intermediate representation extension for parallel compilers which I called SPIRE. SPIRE was successfully applied to automatically parallelize sequential programs within the PIPS source to source compiler. (Here is a link to my PhD thesis: https://tel.archives-ouvertes.fr/pastel-00935483/document).

As part of my postdoc work at the University of Houston, I am currently applying SPIRE to LLVM in order to optimize PGAS languages, specifically OpenSHMEM. The idea is to be able to recognize OpenSHMEM constructs and represent them explicitly in the LLVM IR using SPIRE, to take advantage of the sequential optimizations offered by LLVM. The ultimate goal is to develop new parallel optimizations based on SPIRE(LLVM IR). The tricky part here is proving that the enabled optimizations will generate correct code. I am using the formal operational semantics of SPIRE, which I developed, to prove the correctness of optimizations performed on parallel programs.

I can send you a submitted paper regarding this work on LLVM if you are interested.

Best,

These discussions bring back memories… :)

I suggest that everyone who is looking into extending LLVM IR to express parallel semantics read the list archives from September and October of 2012:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-September/thread.html
http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-October/thread.html

looking for all threads with “OpenMP” in their names.

Some things were discussed extensively back then; for example, this proposal from Kevin:

Hi William,

As has been mentioned by Kevin, the idea of building an explicit parallel IR and associated tools has its merits. With the Insieme Project (http://insieme-compiler.org) we went down that road and built an explicit parallel high-level IR for analysing, manipulating and tuning parallel codes. In particular, we managed to design a compact IR unifying various parallel language extensions and APIs, including OpenMP, Cilk, OpenCL and -- to some extent -- MPI.

As part of this work, we came to believe that, in the general case, compiler-supported coordination and tuning of parallelism in programs is beyond the scope of conventional low-level IRs. If you are interested in our experiences, concepts, design and results, an overview can be found in [1] and a rather extensive description and formalization in the associated thesis [2].

Maybe those can provide you some insights regarding potential alternative approaches for your work.

Best,
Herbert

[1] http://dl.acm.org/citation.cfm?id=2523721.2523727
[2] http://www.dps.uibk.ac.at/~csaf7445/pub/phd_thesis_jordan.pdf - in particular Chapter 3