[RFC] Parallelization metadata and intrinsics in LLVM (for OpenMP, etc.)

Chris,

My comment was mostly in response to the Intel proposal, which effectively translates OpenMP pragmas directly into llvm intrinsics + metadata. I can't imagine a way to make this work *correctly* without massive changes to the optimizer.

There are four ways to make this work correctly:

1) Ignore OpenMP-related intrinsics and associated metadata. Least
effort, least benefit (no OpenMP support). Yet OpenMP programs are
compiled correctly, as if no pragmas were present -- including *exactly
the same* number of routines and call graph (thanks to no
procedurization in the front-end). The OpenMP specification allows such
compilation. This might be the choice for targets that don't support
the OpenMP runtime library.

2) Perform procedurization (including all runtime calls -- no intrinsics
left after this step) at the very start of the LLVM optimizer. No changes
to optimizations, but no opportunity to optimize parallel code. About as
cheap and easy as OpenMP support can get. This might be a good
choice for an initial implementation.

3) Do some carefully chosen optimizations before procedurization. Do
the heavy lifting (like loop restructuring optimizations) after
procedurization. Some effort, a lot of benefit. This is essentially
what is described in [Tian05] (referenced in our proposal).

4) Make all optimizations thread-aware. Best approach in theory, but no
existing compiler goes that far.

Our proposal makes all these choices possible. One can implement 1) in
half an hour, yet keep the door open for a better solution.
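
For readers unfamiliar with the term, "procedurization" (outlining) can be sketched roughly as follows. This is a minimal illustration in C rather than LLVM IR, and it is not taken from the proposal: `toy_fork_call`, `scale_ctx` and the other names are hypothetical, and the stand-in "runtime" runs the chunks one after another instead of on real threads.

```c
#include <stddef.h>

/* Hypothetical signature of an outlined parallel-region body. */
typedef void (*outlined_fn)(size_t lo, size_t hi, void *ctx);

/* Hypothetical stand-in for a runtime entry point. A real OpenMP runtime
   would hand the chunks to different threads; this sketch simply runs
   them sequentially so that it stays self-contained. */
static void toy_fork_call(outlined_fn body, size_t n, size_t chunks, void *ctx) {
    if (chunks == 0) chunks = 1;
    size_t step = (n + chunks - 1) / chunks;   /* ceil(n / chunks) */
    for (size_t lo = 0; lo < n; lo += step) {
        size_t hi = (lo + step < n) ? lo + step : n;
        body(lo, hi, ctx);
    }
}

/* Before procedurization: one routine, the directive attached to the loop
   (shown as a comment here so the sketch builds without OpenMP). */
static void scale_original(double *a, size_t n, double k) {
    /* #pragma omp parallel for */
    for (size_t i = 0; i < n; ++i) a[i] *= k;
}

/* After procedurization: variables shared with the region travel through
   a context structure ... */
struct scale_ctx { double *a; double k; };

/* ... the loop body becomes a separate routine ... */
static void scale_outlined(size_t lo, size_t hi, void *p) {
    struct scale_ctx *c = p;
    for (size_t i = lo; i < hi; ++i) c->a[i] *= c->k;
}

/* ... and the original routine only packs the context and calls the runtime. */
static void scale_procedurized(double *a, size_t n, double k) {
    struct scale_ctx c = { a, k };
    toy_fork_call(scale_outlined, n, 4, &c);
}
```

Both versions compute the same result, but the second has an extra routine and an indirect call through the runtime. That is why option 1) preserves the routine count and call graph, and why the point at which this step happens matters to the optimizer.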

Yours,
Andrey

Andrey Bokhanko <andreybokhanko@gmail.com> writes:

There are four ways to make this work correctly:

1) Ignore OpenMP-related intrinsics and associated metadata. Least
effort, least benefit (no OpenMP support). Yet OpenMP programs are
compiled correctly, as if no pragmas were present -- including *exactly
the same* number of routines and call graph (thanks to no
procedurization in the front-end). The OpenMP specification allows such
compilation. This might be the choice for targets that don't support
the OpenMP runtime library.

Actually, it is perfectly possible to have a program with OpenMP
directives that is NOT valid when those directives are ignored. In
other words, it's possible to write a legal OMP program that relies on
parallelism to function correctly. In practice this doesn't happen in
production codes but it's wrong to say the compiler can just ignore
directives with no problems whatsoever.
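
One shape such a program might take is sketched below. This is an illustrative guess, not an example from this thread: one section waits for a flag that only the other section sets. The function name `handshake` and the `max_spins` bound are assumptions added here so that the sketch terminates. A real program of this kind would typically wait without a bound, and whether it is strictly conforming is debatable, since nothing guarantees that the two sections run on different threads.

```c
/* Returns 1 if the waiting section observed the flag, 0 if it gave up.
   `max_spins` bounds the wait so that a build which ignores the
   directives terminates instead of hanging. */
static int handshake(long max_spins) {
    int flag = 0;
    int seen = 0;

    #pragma omp parallel sections shared(flag, seen) num_threads(2)
    {
        #pragma omp section
        {
            /* Waits for the other section to publish `flag`. */
            for (long i = 0; i < max_spins; ++i) {
                int f;
                #pragma omp atomic read
                f = flag;
                if (f) { seen = 1; break; }
            }
        }
        #pragma omp section
        {
            /* Publishes the flag the first section is waiting for. */
            #pragma omp atomic write
            flag = 1;
        }
    }
    return seen;
}
```

When the directives are ignored, the two blocks run in textual order: the first block exhausts its spins before the second ever sets the flag, so the function returns 0. With the bound removed, the serial build would never finish.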

2) Perform procedurization (including all runtime calls -- no intrinsics
left after this step) at the very start of the LLVM optimizer. No changes
to optimizations, but no opportunity to optimize parallel code. About as
cheap and easy as OpenMP support can get. This might be a good
choice for an initial implementation.

This should work fine, but then why support intrinsics in LLVM at all?
I understand you're talking about an initial implementation.

3) Do some carefully chosen optimizations before procedurization. Do
the heavy lifting (like loop restructuring optimizations) after
procedurization. Some effort, a lot of benefit. This is essentially
what is described in [Tian05] (referenced in our proposal).

What are the important optimizations?

4) Make all optimizations thread-aware. Best approach in theory, but no
existing compiler goes that far.

This is probably not practical. It may be fine in academia but in
production environments the resources don't exist, unfortunately.

                          -David

Chris,

My comment was mostly in response to the Intel proposal, which effectively translates OpenMP pragmas directly into llvm intrinsics + metadata. I can't imagine a way to make this work *correctly* without massive changes to the optimizer.

There are four ways to make this work correctly:

1) Ignore OpenMP-related intrinsics and associated metadata. Least
effort, least benefit (no OpenMP support).

This is trivially true, but the entire point of supporting OpenMP in the IR would be to have some sort of late "procedurization" pass that actually exposes the parallelism through some runtime. Saying that we could just ignore this is silly: if we wanted to ignore OpenMP, we can do that in the frontend with far less complexity. In fact, we're already done! ;-)

2) Perform procedurization (including all runtime calls -- no intrinsics
left after this step) at the very start of the LLVM optimizer. No changes
to optimizations, but no opportunity to optimize parallel code. About as
cheap and easy as OpenMP support can get. This might be a good
choice for an initial implementation.

3) Do some carefully chosen optimizations before procedurization. Do
the heavy lifting (like loop restructuring optimizations) after
procedurization. Some effort, a lot of benefit. This is essentially
what is described in [Tian05] (referenced in our proposal).

I think you're missing the point here. The whole idea of LLVM IR is that it doesn't have various "forms" that are valid at different points in the optimizer. Even very late lowering passes like strength reduction are pure IR to IR passes that do not introduce special forms. This is in stark contrast to other compilers (e.g. Open64) which have several levels of lowering.

My whole objection comes from the (possibly incorrect, I am not an OpenMP expert!) idea that there are only two reasonable implementation approaches:

1. Early procedurization (e.g. in the frontend that produces LLVM IR). This is very easy to preserve and correctness is trivial, but you lose some (theoretical?) optimization benefits by doing procedurization early.

2. Late procedurization where the IR has explicit parallelism constructs and all optimizers preserve its correctness requirements (this is your #4). While this is possible in theory, I'm skeptical that this could make sense, and your proposal certainly isn't the right way to do it.

4) Make all optimizations thread-aware. Best approach in theory, but no
existing compiler goes that far.

It's not clear to me exactly what sorts of optimizations late procedurization is attempting to allow. I understand that this is the design that the Intel compiler uses, and you are motivated to make LLVM fit that model. However, the technical benefits of this design are not clear to me, and I also understand that late procedurization has been a continuous source of subtle correctness bugs that are still being found even though the product is mature. This is exactly the sort of thing that I want to avoid in LLVM.

-Chris

Chris,

I think you're missing the point here. The whole idea of LLVM IR is that it doesn't have various "forms" that are valid at different points in the optimizer. Even very late lowering passes like strength reduction are pure IR to IR passes that do not introduce special forms. This is in stark contrast to other compilers (e.g. Open64) which have several levels of lowering.

Well, at some point the compiler *has* to insert runtime library calls.
This is true for all proposals, both existing and potential ones. Do
you mean that runtime calls must be inserted either strictly before
the LLVM optimizer or strictly after it -- no other place? More on this
later.

As for treating IR with/without OpenMP intrinsics as separate forms,
this is a matter of personal taste and design choice, I guess. Does
strength reduction (which replaces multiplications with additions)
transform IR into another "form"?
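
As a reminder of what the transformation does, here is a minimal sketch at the C level (not LLVM IR; the function names are made up for illustration). The multiplication by the loop index is replaced with a running addition:

```c
#include <stddef.h>

/* Before: one multiplication per iteration. */
static void fill_mul(long *out, size_t n, long stride) {
    for (size_t i = 0; i < n; ++i)
        out[i] = (long)i * stride;
}

/* After strength reduction: the product is carried in an accumulator
   and updated with an addition on each iteration. */
static void fill_add(long *out, size_t n, long stride) {
    long acc = 0;
    for (size_t i = 0; i < n; ++i) {
        out[i] = acc;
        acc += stride;
    }
}
```

Both loops produce the same values, and both use only ordinary operations that exist before and after the pass. Whether that still counts as another "form" is the question being asked here.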

My whole objection comes from the (possibly incorrect, I am not an OpenMP expert!) idea that there are only two reasonable implementation approaches:

1. Early procedurization (e.g. in the frontend that produces LLVM IR). This is very easy to preserve and correctness is trivial, but you lose some (theoretical?) optimization benefits by doing procedurization early.

2. Late procedurization where the IR has explicit parallelism constructs and all optimizers preserve its correctness requirements (this is your #4). While this is possible in theory, I'm skeptical that this could make sense, and your proposal certainly isn't the right way to do it.

I understand your point... and respectfully disagree with it.

You basically say that it is all or nothing: either *no*
optimizations on parallel code (runtime calls inserted before the LLVM
optimizer), or *all* optimizations workable on parallel code (calls
inserted after the LLVM optimizer). In the former case we lose *all*
optimizations, not some. As for the latter, I share your skepticism -- and
duplicate it.

I understand that this is the design that the Intel compiler uses, and you are motivated to make LLVM fit that model.

Yes and yes.

And one more thing: "the proof is in the pudding", or so they say. The Intel
Compiler (which, as you correctly noted, uses essentially the same
design) is the metaphorical "pudding" that proves the viability and good
performance potential of the approach we proposed.

I also understand that late procedurization has been a continuous source of subtle correctness bugs that are still being found even though the product is mature.

Hmmm... One would have to analyze Intel Compiler bug statistics to make this
assertion, but this is certainly not my impression.

Yours,
Andrey

David,

Actually, it is perfectly possible to have a program with OpenMP
directives that is NOT valid when those directives are ignored. In
other words, it's possible to write a legal OMP program that relies on
parallelism to function correctly. In practice this doesn't happen in
production codes but it's wrong to say the compiler can just ignore
directives with no problems whatsoever.

You might be right. But this is as good as one can do compiling an
OpenMP program for a target with no OpenMP support.

What are the important optimizations?

You mean "that should be done before procedurization"?

As you understand, there is only one way to know -- try it.

As has been mentioned elsewhere, the Intel Compiler employs essentially the
same design as we proposed. [Tian05] (use this link to access the
paper: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3763&rep=rep1&type=pdf)
describes the phase ordering that Intel Compiler developers found to
provide good performance while preserving correctness.

4) Make all optimizations thread-aware. Best approach in theory, but no
existing compiler goes that far.

This is probably not practical. It may be fine in academia but in
production environments the resources don't exist, unfortunately.

I do agree! :-)

That's why we propose what we propose -- the design leaves all doors open.

Yours,
Andrey