[RFC] Abstract Parallel IR Optimizations

This is an RFC to add analyses and transformation passes into LLVM to
optimize programs based on an abstract notion of a parallel region.

  == this is _not_ a proposal to add a new encoding of parallelism ==

We currently perform poorly when it comes to optimizations for parallel
codes. In fact, parallelizing your loops might actually prevent various
optimizations that would have been applied otherwise. One solution to
this problem is to teach the compiler about the semantics of the
parallel representation in use. While this sounds tedious at first, it turns
out that we can perform key optimizations with reasonable implementation
effort (and thereby also reasonable maintenance costs). However, we have
various parallel representations that are already in use (KMPC,
GOMP, CILK runtime, ...) or proposed (Tapir, IntelPIR, ...).

Our proposal seeks to introduce parallelism-specific optimizations for
multiple representations while minimizing the implementation overhead.
This is done through an abstract notion of a parallel region which hides
the actual representation from the analysis and optimization passes. In
the schema below, our current five optimizations (described in detail
here [0]) are shown on the left, the abstract parallel IR interface is
in the middle, and the representation-specific implementations are on
the right.

         Optimization (A)nalysis/(T)ransformation Impl.

Hi Johannes,

apologies in advance if the questions following are silly or don't
make sense. I lack a bit of context here and I'm not sure I fully
understand your proposal.

Currently clang (and flang) are lowering OpenMP when building LLVM IR
(this is because LLVM IR can't express the parallel/concurrent
concepts of OpenMP so they have to be lowered first). So, can I assume
that your proposal starts off in a context where that lowering is not
happening anymore in the front end but it'd happen later in an LLVM IR
pass? If so, then you'd be assuming that there is already a way of
representing OpenMP constructs in the LLVM IR, is my understanding
correct here? I think that the Intel proposal [1] could be one way
(not necessarily the one) to do this (disregarding the fact that it is
tailored for OpenMP), does this still make sense?

If this is the case, and given that you explicitly state that this is
not a Parallel IR of any sort, is your suggestion to improve
optimisation of OpenMP code, based on a "side-car"/ancillary
representation built on top of the existing IR, which as I understand
should already be able to represent OpenMP? But then this looks a bit
redundant to me. So I'm pretty sure one of my assumptions is
incorrect. Unless your auxiliary representation is more an alternative
to the W-regions [1].

Or, maybe I am completely wrong here: you didn't say anything about
the FE lowering, which would still happen, and then your proposal
builds on top of that. I don't think you meant that, given that your
proposal mentions KMP and GOMP (and the current lowering done by clang
targets only KMP).

Thank you very much,
Roger

[1] https://dl.acm.org/citation.cfm?id=3148191

Hi Roger,

> apologies in advance if the questions following are silly or don't
> make sense. I lack a bit of context here and I'm not sure I fully
> understand your proposal.

No worries, I'm glad if people ask questions!

> Currently clang (and flang) are lowering OpenMP when building LLVM IR
> (this is because LLVM IR can't express the parallel/concurrent
> concepts of OpenMP so they have to be lowered first). So, can I assume
> that your proposal starts off in a context where that lowering is not
> happening anymore in the front end but it'd happen later in an LLVM IR
> pass? If so, then you'd be assuming that there is already a way of
> representing OpenMP constructs in the LLVM IR, is my understanding
> correct here? I think that the Intel proposal [1] could be one way
> (not necessarily the one) to do this (disregarding the fact that it is
> tailored for OpenMP), does this still make sense?

My proposal does _not_ assume that we change clang in any way, nor does
it require such a change. However, the initial patch [1] will only work
with the OpenMP lowering used by clang right now.

The idea is as follows:

  We have different representations of parallelism in the IR, for
  example the KMP runtime library calls emitted by clang or the Intel
  parallel IR you mentioned. For each of them we write a piece of code
  that (1) extracts domain-specific information and (2) allows
  modification of the parallel representation. This is the only piece of
  code that has to be adapted for each parallel representation we want
  to optimize. On top of this are abstract interfaces that expose the
  information and modification options to parallel optimization passes.
  The patch [1] only contains the attribute annotator, but we have more
  passes, as explained in the paper [0]. The analysis/optimization logic
  is part of these passes and not aware of the underlying
  representation. We can consequently use the same passes to optimize
  code that was lowered to use different parallel runtime libraries
  (GOMP, KMP, Cilk runtime, TBB, ...) or into a native parallel IR (of
  any shape). This is especially useful as the native parallel IR might
  not always be usable. If that happens we have to fall back to early
  outlining, that is, to runtime library calls emitted by the front-end.
  Even if we at some point have a native parallel
  representation that is always used, we can simply remove the
  abstraction introduced by this approach but keep the
  analysis/optimizations around.
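
To make the layering concrete, here is a minimal, self-contained C++
sketch of the idea. It is only an illustration under stated assumptions
and not the code from the patch: the Call struct is a toy stand-in for
an IR call instruction, and every class and function name
(ParallelRepresentation, KMPRepresentation, GOMPRepresentation,
annotateParallelRegions, runExample) as well as the attribute string is
hypothetical. The only real names are the runtime entry points
__kmpc_fork_call and GOMP_parallel.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Toy stand-in for a call instruction in the IR; NOT an LLVM class.
struct Call {
  std::string Callee;
  std::vector<std::string> Attrs;
};

// Representation-specific part (hypothetical interface): the only code
// that has to be written for each parallel representation.
class ParallelRepresentation {
public:
  virtual ~ParallelRepresentation() = default;
  // (1) Extract domain-specific information.
  virtual bool isRegionEntry(const Call &C) const = 0;
  // (2) Modify the parallel representation.
  virtual void addAttribute(Call &C, const std::string &Attr) const = 0;
};

// Recognizes the fork call emitted for the KMP runtime.
class KMPRepresentation : public ParallelRepresentation {
public:
  bool isRegionEntry(const Call &C) const override {
    return C.Callee == "__kmpc_fork_call";
  }
  void addAttribute(Call &C, const std::string &Attr) const override {
    C.Attrs.push_back(Attr);
  }
};

// Recognizes the parallel-region entry point of the GOMP runtime.
class GOMPRepresentation : public ParallelRepresentation {
public:
  bool isRegionEntry(const Call &C) const override {
    return C.Callee == "GOMP_parallel";
  }
  void addAttribute(Call &C, const std::string &Attr) const override {
    C.Attrs.push_back(Attr);
  }
};

using Representations =
    std::vector<std::unique_ptr<ParallelRepresentation>>;

// Representation-agnostic "pass": it only talks to the abstract
// interface and never inspects callee names itself. Returns the number
// of parallel regions it annotated.
int annotateParallelRegions(std::vector<Call> &Calls,
                            const Representations &Reps) {
  int NumAnnotated = 0;
  for (Call &C : Calls) {
    for (const auto &Rep : Reps) {
      if (!Rep->isRegionEntry(C))
        continue;
      Rep->addAttribute(C, "toy-parallel-region"); // made-up attribute
      ++NumAnnotated;
      break; // at most one representation claims a call
    }
  }
  return NumAnnotated;
}

// Usage: one pass run over calls lowered to two different runtimes.
int runExample() {
  Representations Reps;
  Reps.push_back(std::make_unique<KMPRepresentation>());
  Reps.push_back(std::make_unique<GOMPRepresentation>());
  std::vector<Call> Calls = {
      {"__kmpc_fork_call", {}}, {"printf", {}}, {"GOMP_parallel", {}}};
  int NumAnnotated = annotateParallelRegions(Calls, Reps);
  assert(Calls[1].Attrs.empty()); // the non-parallel call is untouched
  return NumAnnotated;
}
```

In this sketch, supporting another runtime or a native parallel IR would
mean adding one more ParallelRepresentation subclass, while
annotateParallelRegions stays unchanged.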

[0] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
[1] D47300: [RFC] Abstract parallel IR analyzes & optimizations + OpenMP implementations

> If this is the case, and given that you explicitly state that this is
> not a Parallel IR of any sort, is your suggestion to improve
> optimisation of OpenMP code, based on a "side-car"/ancillary
> representation built on top of the existing IR, which as I understand
> should already be able to represent OpenMP? But then this looks a bit
> redundant to me. So I'm pretty sure one of my assumptions is
> incorrect. Unless your auxiliary representation is more an alternative
> to the W-regions [1].
>
> Or, maybe I am completely wrong here: you didn't say anything about
> the FE lowering, which would still happen, and then your proposal
> builds on top of that. I don't think you meant that, given that your
> proposal mentions KMP and GOMP (and the current lowering done by clang
> targets only KMP).

I'm not sure if these paragraphs are still relevant. Does the above
"explanation" answer your questions already? If not, please continue
asking!

Cheers,
  Johannes

Hi Johannes,

thanks a lot, all clear now!

Kind regards,
Roger