[RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Hal, Andrey, Alexey,

From the LLVM design viewpoint, there is a fundamental problem with both Hal's approach and the Intel approach: both are quite language-specific. OpenMP is a particular parallel language, with particular constructs (e.g., parallel regions) and semantics. LLVM is a language-neutral IR and infrastructure, and OpenMP-specific concepts should not creep into it. I've included an excerpt from Hal's proposal below, which shows what I mean: the design is couched in terms of OpenMP parallel regions. Other parallel languages, e.g., Cilk, have no such notion. The latest Intel proposal is at least as OpenMP-specific.

I do agree with the general goal of trying to support parallel programming languages in a more first-class manner in LLVM than we do today. But the right approach for that is to be as language-neutral as possible. For example, any parallelism primitives in the LLVM IR and any LLVM- or machine-level optimizations that apply to parallel code should be applicable not only to OpenMP but also to languages like Cilk, Java/C#, and several others. I think "libraries" like Intel's TBB should be supported, too: they have a (reasonably) well-defined semantics, just like languages do, and are becoming widely used.

I also do not think LLVM metadata is the way to represent the primitives, because they are simply too fragile. But you don't have to argue that one with me :-), others have argued this already. You really need more first-class, language-neutral LLVM mechanisms for parallelism. I'm not pretending I know how to do this, though there are papers on the subject, including one from an Intel team (Pillar: A Parallel Implementation Language, LCPC 2007).

--Vikram
Professor, Computer Science
University of Illinois at Urbana-Champaign
http://llvm.org/~vadve

To mark this function as a parallel region, a module-level 'parallel'
metadata entry is created. The call site(s) of this function are marked
with this metadata. The metadata has the following entries:
- The string "region"
- A reference to the parallel-region function
- If applicable, a list of metadata references specifying
special-handling child regions (parallel loops and serialized/critical
regions)

If the special-handling region metadata is no longer referenced by code
within the parallel region, then the region has become invalid, and
will be removed (meaning all parallelization metadata will be removed)
by the ParallelizationCleanup. The same is true for all other
cross-referenced metadata below.

Note that parallel regions can be nested.

As a quick example, something like:
int main() {
  int a;
#pragma omp parallel firstprivate(a)
  do_something(a);
  ...
}

becomes something like:

define private void @parreg(i32 %a) {
entry:
  call void @do_something(i32 %a)
  ret void
}

define i32 @main() {
entry:
...
  call void @parreg(i32 %a) !parallel !0
  ...
}

!0 = metadata !{ metadata !"region", @parreg }
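
To illustrate the special-handling children mentioned above, here is a
source-level sketch (do_work is a placeholder, and the metadata names in
the comments are illustrative, following the scheme described earlier):
the critical section inside the parallel region would get its own
metadata, cross-referenced from the region's child list:

void do_work(int);

int main() {
#pragma omp parallel          // region: outlined, call site tagged !0
  {
#pragma omp critical          // serialized child region, tagged, say, !1;
    do_work(0);               // !1 would appear in !0's child-region list
  }
}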


Hi,

> constructs (e.g., parallel regions) and semantics. LLVM is a
> language-neutral IR and infrastructure, and OpenMP-specific concepts
> should not creep into it.

This is exactly the reason I proposed [1] -- mirroring the OpenMP
directives in LLVM IR doesn't seem very elegant. The parallelization
information in the IR should be general and orthogonal.

I do realize that boxing the loops into procedures in the frontend
will initially inhibit loop optimizations, but that can be resolved
with some work. Ideally, I'd like to have a different class
(ParallelLoop, maybe) altogether representing parallel loops and make
the relevant passes aware of it. More work, yes, but I think such an
approach will pay off eventually.

[1] http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-September/053798.html
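
For concreteness, "boxing" a parallel loop into a procedure might look
roughly like the sketch below. The runtime entry point __parallel_for,
its signature, and the captures struct are placeholders invented for
illustration, not part of [1]:

#include <stddef.h>

struct captures { float *a; const float *b; };

// Outlined loop body: each worker is handed a sub-range of iterations.
static void body(void *env, size_t begin, size_t end) {
  captures *c = static_cast<captures *>(env);
  for (size_t i = begin; i < end; ++i)
    c->a[i] += c->b[i];
}

// Placeholder runtime hook; a real one would partition [0, n) among
// workers. A trivial serial stand-in keeps the sketch self-contained.
static void __parallel_for(void (*fn)(void *, size_t, size_t),
                           void *env, size_t n) {
  fn(env, 0, n);
}

void vadd(float *a, const float *b, size_t n) {
  captures c = { a, b };
  __parallel_for(body, &c, n); // was: a parallel loop over [0, n)
}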

Vikram,

I agree with basically everything you said.

Yes, our (Intel) proposal is clearly OpenMP-specific. This is
obviously a deficiency that probably makes it unsuitable for inclusion
as a "first class" part of LLVM IR. What we intended is to have a
simple extension of LLVM IR (based on existing IR constructs) that
establishes common ground for front-end, back-end and runtime library
writers -- and thus enables further progress of OpenMP support in the
LLVM compiler toolchain.

Yours,
Andrey

> Hi,
>
> > constructs (e.g., parallel regions) and semantics. LLVM is a
> > language-neutral IR and infrastructure, and OpenMP-specific
> > concepts should not creep into it.
>
> This is exactly the reason I proposed [1] -- mirroring the OpenMP
> directives in LLVM IR doesn't seem very elegant. The parallelization
> information in the IR should be general and orthogonal.
>
> I do realize that boxing the loops into procedures in the frontend
> will initially inhibit loop optimizations, but that can be resolved
> with some work.

It is hard to evaluate this without more details. Would you convert
LoopInfo, SE, etc. into module-level passes? Would you inline these
functions early with some special attached semantics? If the second,
how is this different from attaching the special semantics using
intrinsics or metadata?

> Ideally, I'd like to have a different class
> (ParallelLoop, maybe) altogether representing parallel loops and make
> the relevant passes aware of it. More work, yes, but I think such an
> approach will pay off eventually.

Can you be more specific? Pay off how?

Thanks again,
Hal

Hal Finkel <hfinkel@anl.gov> writes:

> It is hard to evaluate this without more details. Would you convert
> LoopInfo, SE, etc. into module-level passes? Would you inline these
> functions early with some special attached semantics? If the second,
> how is this different from attaching the special semantics using
> intrinsics or metadata?

One important way the latter is different from intrinsics is that the
additional semantic information is pass-specific. No other part of LLVM
need be aware of it. That helps contain the "parallelism problem" to
only those passes that care.
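
As a concrete (hypothetical) sketch of what that containment looks
like, using the C++ pass API of the day: only a pass that explicitly
asks for the "parallel" metadata kind ever interprets it, and nothing
else in LLVM needs to know it exists. The pass below is invented for
illustration; the "parallel" kind name is taken from Hal's excerpt:

#include "llvm/Pass.h"
#include "llvm/Function.h"        // pre-3.3 header locations
#include "llvm/Instructions.h"
#include "llvm/Metadata.h"
#include "llvm/Support/Casting.h"
using namespace llvm;

namespace {
// Hypothetical pass: the only place the "parallel" tag is interpreted.
struct ParallelAware : public FunctionPass {
  static char ID;
  ParallelAware() : FunctionPass(ID) {}

  virtual bool runOnFunction(Function &F) {
    for (Function::iterator BB = F.begin(), E = F.end(); BB != E; ++BB)
      for (BasicBlock::iterator I = BB->begin(); I != BB->end(); ++I)
        if (CallInst *CI = dyn_cast<CallInst>(&*I))
          if (MDNode *MD = CI->getMetadata("parallel"))
            (void)MD; // ...treat this call as a parallel-region entry...
    return false; // analysis only in this sketch; nothing is modified
  }
};
}
char ParallelAware::ID = 0;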

                          -David

> Hal, Andrey, Alexey,
>
> From the LLVM design viewpoint, there is a fundamental problem with
> both Hal's approach and the Intel approach: both are quite
> language-specific. OpenMP is a particular parallel language, with
> particular constructs (e.g., parallel regions) and semantics. LLVM
> is a language-neutral IR and infrastructure, and OpenMP-specific
> concepts should not creep into it.

This is a matter of perspective. One could also argue that the LLVM IR
should be target neutral. Nevertheless, we have target-specific
intrinsics. Similarly, there is a legitimate case to be made for
producing code that targets existing OpenMP runtime ABIs. The most
natural way to do this is for some of the ABI's semantics, and thus some
of OpenMP's language semantics, to leak into the IR level. Otherwise,
one ends up playing too much of a double-translation game.

Consider, for example, an OpenMP loop with runtime scheduling. This
implies a certain interaction with the OpenMP runtime library (and so is
explicitly OpenMP-specific). Nevertheless, we don't want to lower the
parallelization too early because we'd like to perform loop analysis
and transformations (like LICM) first. The only way to do this properly
seems to be to push some of the OpenMP-specific nature of the loop into
the IR. This is not necessarily bad.
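
As a small, self-contained illustration of that tension (the code is
invented for this example): x / y below is loop invariant, and LICM can
hoist it, but only while the optimizer still sees the loop as a loop.
Outline the body into a runtime callback too early and the opportunity
is hidden:

void scale(float *a, int n, float x, float y) {
#pragma omp parallel for schedule(runtime)
  for (int i = 0; i < n; ++i)
    a[i] *= x / y; // loop invariant: hoistable only before lowering
}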

> I've included an excerpt from
> Hal's proposal below, which shows what I mean: the design is couched
> in terms of OpenMP parallel regions. Other parallel languages, e.g.,
> Cilk, have no such notion.

The approach that I proposed was certainly inspired by OpenMP, and
designed to fully support OpenMP, but was not limited to it. As a
practical matter, OpenMP includes both loop-based parallelism and
task-based parallelism, which is a pretty broad foundation for
supporting parallelism in general.

I looked at the Cilk documentation when writing my proposal. Is there a
reason why Cilk's semantics cannot be mapped onto the proposed support
for parallel tasks?
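
For instance, a direct mapping of Cilk's spawn/sync onto the proposed
task support might look like the sketch below (written with OpenMP task
pragmas for concreteness, and assumed to run inside an enclosing
parallel region); whether such a mapping preserves all of Cilk's
guarantees (e.g., its work-stealing and exception semantics) is exactly
the open question:

int fib(int n) {
  if (n < 2) return n;
  int x, y;
#pragma omp task shared(x) // roughly: x = cilk_spawn fib(n - 1);
  x = fib(n - 1);
  y = fib(n - 2);
#pragma omp taskwait       // roughly: cilk_sync;
  return x + y;
}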

> The latest Intel proposal is at least as
> OpenMP-specific.
>
> I do agree with the general goal of trying to support parallel
> programming languages in a more first-class manner in LLVM than we do
> today. But the right approach for that is to be as language-neutral
> as possible. For example, any parallelism primitives in the LLVM IR
> and any LLVM- or machine-level optimizations that apply to parallel
> code should be applicable not only to OpenMP but also to languages
> like Cilk, Java/C#, and several others. I think "libraries" like
> Intel's TBB should be supported, too: they have a (reasonably)
> well-defined semantics, just like languages do, and are becoming
> widely used.
>
> I also do not think LLVM metadata is the way to represent the
> primitives, because they are simply too fragile. But you don't have
> to argue that one with me :-), others have argued this already.

I've never argued that the mechanism is not fragile. However, I do
think that, with a proper design, it is possible to use the existing
metadata infrastructure (with some minimal changes, for example, to
inhibit inlining). I am not committed to a metadata-based approach, but
I think such an approach is workable.

> You
> really need more first-class, language-neutral LLVM mechanisms for
> parallelism. I'm not pretending I know how to do this, though there
> are papers on the subject, including one from an Intel team (Pillar:
> A Parallel Implementation Language, LCPC 2007).

I'll look at the paper -- thanks for the reference! The problem is not
just in supporting parallelism in general; it is specifically in
supporting OpenMP, with its mix of language semantics, runtime
semantics, and the interaction of the two, while not inhibiting
optimization.

Thanks again,
Hal

I am not an optimizer guy, but I am wondering whether we can solve the problems discussed in this mail chain by introducing a middle-end between the front-end and LLVM. We may need to introduce a GCC GIMPLE-like IR (or any new suitable IR) in the middle-end, so that the front-end can produce this new IR, and the middle-end can consume it, do all the parallelization and subsequent optimizations, and generate LLVM IR for LLVM to take forward. Going through the middle-end could be made optional, depending on the requirements. This means the front-end must be capable of producing either the new IR for the middle-end or LLVM IR directly for LLVM.

> I am not an optimizer guy, but I am wondering whether we can solve
> the problems discussed in this mail chain by introducing a middle-end
> between the front-end and LLVM. We may need to introduce a GCC
> GIMPLE-like IR (or any new suitable IR) in the middle-end, so that
> the front-end can produce this new IR, and the middle-end can consume
> it, do all the parallelization and subsequent optimizations, and
> generate LLVM IR for LLVM to take forward. Going through the
> middle-end could be made optional, depending on the requirements.
> This means the front-end must be capable of producing either the new
> IR for the middle-end or LLVM IR directly for LLVM.

I don't necessarily object to this, but it would be a major change to
the frontends, plus there would be significant effort involved in
implementing the analysis and transformation passes on this
intermediate IR. I see this as little different from, but more work
than, adding metadata/intrinsics to the LLVM IR and then teaching
select existing (and some new) transformation passes to preserve this
additional information.

I could certainly be convinced otherwise, but I think that I'd need to
see a detailed proposal.

-Hal

> Hal, Andrey, Alexey,
>
> From the LLVM design viewpoint, there is a fundamental problem with
> both Hal's approach and the Intel approach: both are quite
> language-specific. OpenMP is a particular parallel language, with
> particular constructs (e.g., parallel regions) and semantics. LLVM
> is a language-neutral IR and infrastructure, and OpenMP-specific
> concepts should not creep into it. I've included an excerpt from
> Hal's proposal below, which shows what I mean: the design is couched
> in terms of OpenMP parallel regions. Other parallel languages, e.g.,
> Cilk, have no such notion. The latest Intel proposal is at least as
> OpenMP-specific.
>
> I do agree with the general goal of trying to support parallel
> programming languages in a more first-class manner in LLVM than we do
> today. But the right approach for that is to be as language-neutral
> as possible. For example, any parallelism primitives in the LLVM IR
> and any LLVM- or machine-level optimizations that apply to parallel
> code should be applicable not only to OpenMP but also to languages
> like Cilk, Java/C#, and several others. I think "libraries" like
> Intel's TBB should be supported, too: they have a (reasonably)
> well-defined semantics, just like languages do, and are becoming
> widely used.
>
> I also do not think LLVM metadata is the way to represent the
> primitives, because they are simply too fragile. But you don't have
> to argue that one with me :-), others have argued this already. You
> really need more first-class, language-neutral LLVM mechanisms for
> parallelism. I'm not pretending I know how to do this, though there
> are papers on the subject, including one from an Intel team (Pillar:
> A Parallel Implementation Language, LCPC 2007).

I took a quick look at this paper; essentially, their language
introduces continuations and several different types of 'parallel call'
functions. Do we want continuations? If we want to go down this road,
then I think that something like Sanjoy's parallel_map is a good idea. My
worry with these restricted approaches is that correctly implementing
OpenMP's semantics may prove inefficient (if not impossible). The
specification dictates specific interactions between the runtime
library, certain environment variables, and the pragmas. I think that
showing that this will work will require a specific proposal and an
explicit construction of the mapping. Moreover, I fear that restricting
parallelism to some special types of call instructions will inhibit
useful loop optimizations on parallelized loops. Since parallel loop
performance is one of the most important measures for the quality of a
parallelizing compiler, we need to consider the impact on loop
optimizations carefully.
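
To make the shape of the concern concrete, here is a sketch of a
parallel_map-style primitive (the name is borrowed from the discussion
above; the signature is my assumption, not the actual proposal). Once
the caller's loop is hidden behind a call like this, loop passes at the
call site no longer see a loop to unroll, interchange, or hoist out of:

#include <stddef.h>

// Serial stand-in: a real runtime would partition [0, n) among workers.
template <typename T, typename F>
void parallel_map(T *out, const T *in, size_t n, F f) {
  for (size_t i = 0; i < n; ++i)
    out[i] = f(in[i]);
}

float twice(float v) { return 2.0f * v; }

// From the caller's perspective, the loop has vanished into the callee.
void example(float *dst, const float *src, size_t n) {
  parallel_map(dst, src, n, twice);
}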

-Hal

> Hal, Andrey, Alexey,
>
> From the LLVM design viewpoint, there is a fundamental problem with
> both Hal's approach and the Intel approach: both are quite
> language-specific. OpenMP is a particular parallel language, with
> particular constructs (e.g., parallel regions) and semantics.

Also, after thinking about it, I object to this characterization.
OpenMP is a parallelization model and runtime interface (with some
specific language bindings pre-specified). There is very little in the
standard that is language specific. It would not be difficult to create
OpenMP for Python, or Go, etc. OpenMP covers both task-based and
region/loop-based parallelism. As far as choosing a conceptual
substrate on which to base LLVM's parallelism, using OpenMP would not
be a bad idea. If LLVM has parallelization support, such support will
also be "a particular parallel language, with particular constructs
(e.g., parallel regions) and semantics."

I think that there is a dangerous inclination to oversimplify the
parallelization interface. While it is true that parallel algorithms
can often be modeled using a few distinct operators, implementing a
programming language this way is not a good idea. You would not, for
example, argue that C should eliminate all integer types except for the
largest. It is true that integer operations on shorter types can be
implemented in terms of the larger types combined with appropriate
masking operations. However, it would be neither convenient nor
aesthetically pleasing, nor would it be easy to optimize in practice.

-Hal