[RFC] OpenMP Representation in LLVM IR

Hi All,

We'd like to make a proposal for OpenMP representation in LLVM IR.

Our goal is to reach an agreement in the community on a simple,
complete and extensible representation of OpenMP language constructs
in LLVM IR. Hopefully, this would serve as common ground and enable
further development of OpenMP support in both Clang and the LLVM
compiler toolchain.

We seek feedback on the proposed design and ways to improve it.

The main authors of the proposal are Andrey Bokhanko and Alexey Bataev.
Also, we'd like to acknowledge valuable contributions and advice
provided by Kevin B Smith, Xinmin Tian, Stefanus Du Toit, Brian
Minard, Dmitry Babokin and other colleagues from Intel and Diego
Novillo from Google. NB, this *does not* automatically imply support
of the proposal by said individuals.

Please find the proposal in *.pdf (attached to the message, for
reading convenience) and plain text (below the message, for quoting
convenience) formats. Their content is identical.

Full disclosure: both of us are associated with Intel and the Intel Compiler team.

Yours,
Andrey Bokhanko
Alexey Bataev

OpenMP_LLVM_IR.pdf (651 KB)

Andrey,

I am very glad to see that you're interested in working on this! I
have a few comments:

As you may know, this is the third such proposal over the past two
months, one by me
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html)
and the other, based somewhat on mine, by Sanjoy
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-September/053798.html)

In order for your proposal to work well, there will be a lot of
infrastructure work required (more than with my proposal); many passes
will need to be made explicitly aware of how they can, or can't, reorder
things with respect to the parallelization intrinsics; loop
restructuring may require special care, etc. How this is done depends
in part on where the state information is stored: Do we keep the
parallelization information in the intrinsics during mid-level
optimization, or do we move its state into an analysis pass? In any
case, I don't object to this approach so long as we have a good plan
for how this work will be done.

When we discussed this earlier this year, there seemed to be some
consensus that we wanted to avoid, to the extent possible, introducing
OpenMP-specific intrinsics into the LLVM layer. Rather, we should
define some kind of parallelization API (in the form of metadata,
intrinsics, etc.) onto which OpenMP can naturally map along with other
paradigms. There is interest in supporting OpenACC, for example, which
will require data copying clauses, and it would make sense to share
as much of the infrastructure as possible with OpenMP. Are you
interested in providing Cilk support as well? We probably don't want to
have NxM slightly-different ways of expressing 'this is a parallel
region'. There are obviously cases in which things need to be specific
to the interface (like runtime loop scheduling in OpenMP which implies
a specific interaction with the runtime library), but such cases may be
the exception rather than the rule.

We don't need 'omp' in the intrinsic names and also 'OMP_' on all of
the string specifiers. Maybe, to my previous point, we could call the
intrinsics 'parallel' and use 'OMP_' only when something is really
OpenMP-specific?

You don't seem to want to map thread-private variables onto the
existing TLS support. Why?

Sincerely,
Hal

I'd like to point out that more generic parallelism constructs in LLVM
would also help the Portable OpenCL implementation's cause. The SPMD
multi-work-item work-group functions (produced from OpenCL C kernels)
should be easily mappable to such constructs, after which the actual
parallelization (DLP, TLP, ILP, or a combination of them) can be
target-specific and done in generic LLVM passes that, e.g., OpenMP also
benefits from.
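
As a rough illustration (a C-level sketch with made-up names, not our
actual output), the work-group function for a simple vector-add kernel
boils down to a loop over the work-items, which is exactly the kind of
construct a generic parallel-loop representation could mark:

  #include <stddef.h>

  /* Illustrative sketch of a work-group function produced from an
     OpenCL C kernel such as:  c[gid] = a[gid] + b[gid];
     the names and calling convention here are invented. */
  void vecadd_workgroup(const float *a, const float *b, float *c,
                        size_t group_offset, size_t local_size) {
    /* The loop over work-items is what a generic "parallel loop"
       construct could mark; generic LLVM passes could then decide,
       per target, whether to vectorize it, thread it, or both. */
    for (size_t lid = 0; lid < local_size; ++lid) {
      size_t gid = group_offset + lid;
      c[gid] = a[gid] + b[gid];
    }
  }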

Hal,

Thank you for the reply!

As you may know, this is the third such proposal over the past two
months, one by me
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html)
and the other, based somewhat on mine, by Sanjoy
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-September/053798.html)

Yes, I was aware of your proposal. I hesitated to make any comments or
criticism -- as I am, obviously, biased.

In my opinion, the two most important differences between our proposals are:

1) Your design employs explicit procedurization done in front-end,
while our design allows both early (right after front-end) and late
(later in the back-end) procedurization.
2) You aim to provide general support for all (or at least most)
parallel standards, while our aim is more modest -- just OpenMP.

Please see the discussion of 1) in the "Function Outlining" section of our proposal.

As for 2), there are many arguments one might use in favor of a more
general or a more specialized solution. What is easier to implement?
What is better for LLVM IR development? Are we sure what we see as
necessary and sufficient today would be suitable for future parallel
standards -- given all the developments happening in this area as we
speak? Whatever one answers, it would be quite subjective. My personal
preference is for the simplest and most focused solution -- but then again
this is subjective.

In order for your proposal to work well, there will be a lot of
infrastructure work required (more than with my proposal); many passes
will need to be made explicitly aware of how they can, or can't, reorder
things with respect to the parallelization intrinsics; loop
restructuring may require special care, etc. How this is done depends
in part on where the state information is stored: Do we keep the
parallelization information in the intrinsics during mid-level
optimization, or do we move its state into an analysis pass? In any
case, I don't object to this approach so long as we have a good plan
for how this work will be done.

No -- only passes that happen before procedurization should be aware
of these intrinsics.

I agree that it is not so easy to make optimizations "thread-aware".
But the problem is essentially the same, no matter how the parallel
extension is manifested in the IR.

When we discussed this earlier this year, there seemed to be some
consensus that we wanted to avoid, to the extent possible, introducing
OpenMP-specific intrinsics into the LLVM layer. Rather, we should
define some kind of parallelization API (in the form of metadata,
intrinsics, etc.) onto which OpenMP can naturally map along with other
paradigms. There is interest in supporting OpenACC, for example, which
will require data copying clauses, and it would make sense to share
as much of the infrastructure as possible with OpenMP. Are you
interested in providing Cilk support as well? We probably don't want to
have NxM slightly-different ways of expressing 'this is a parallel
region'. There are obviously cases in which things need to be specific
to the interface (like runtime loop scheduling in OpenMP which implies
a specific interaction with the runtime library), but such cases may be
the exception rather than the rule.

We don't need 'omp' in the intrinsic names and also 'OMP_' on all of
the string specifiers. Maybe, to my previous point, we could call the
intrinsics 'parallel' and use 'OMP_' only when something is really
OpenMP-specific?

As I said before, our aim was quite simple -- OpenMP support only.

Can the design be extended to allow more general form of parallel
extensions support? Probably... but this is definitely more than what
we intended.

You don't seem to want to map thread-private variables onto the
existing TLS support. Why?

Because we don't employ explicit procedurization. What happens after
procedurization (including how thread-private variables are manifested
in the IR) is heavily dependent on the OpenMP runtime library one
relies upon, and is out of the scope of our proposal.

Yours,
Andrey Bokhanko

Hal,

Thank you for the reply!

> As you may know, this is the third such proposal over the past two
> months, one by me
> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html)
> and the other, based somewhat on mine, by Sanjoy
> (http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-September/053798.html)

Yes, I was aware of your proposal. I hesitated to make any comments or
criticism -- as I am, obviously, biased.

In my opinion, the two most important differences between our proposals
are:

1) Your design employs explicit procedurization done in front-end,
while our design allows both early (right after front-end) and late
(later in the back-end) procedurization.

Yes.

2) You aim to provide general support for all (or at least most)
parallel standards, while our aim is more modest -- just OpenMP.

To be fair, my proposal was also fairly OpenMP specific.

Please see the discussion of 1) in the "Function Outlining" section of
our proposal.

As for 2), there are many arguments one might use in favor of a more
general or a more specialized solution. What is easier to implement?

I feel that my proposal is easier to implement because it is safer:
because of the procedurization and the cross-referencing of the
metadata, passes that don't know about the parallelization metadata and
drop it will cause parallel regions to be lost, but should not
otherwise lead to miscompiled code, and with inlining, most
optimization opportunities are preserved.

I agree that your proposal allows more optimization opportunities to be
preserved. On the other hand, it will require more auditing of existing
code, and new infrastructure just to make sure the new intrinsics don't
interfere with existing optimizations. I trust that you have sufficient
resources to do these things, and that being the case, I don't object.

What is better for LLVM IR development? Are we sure what we see as
necessary and sufficient today would be suitable for future parallel
standards

I guarantee that the answer is no ;-) -- but there are a number of
current standards that can be considered.

-- given all the developments happening in this area as we
speak? Whatever one answers, it would be quite subjective. My personal
preference is for the simplest and most focused solution -- but then again
this is subjective.

> In order for your proposal to work well, there will be a lot of
> infrastructure work required (more than with my proposal); many
> passes will need to be made explicitly aware of how they can, or
> can't, reorder things with respect to the parallelization
> intrinsics; loop restructuring may require special care, etc. How
> this is done depends in part on where the state information is
> stored: Do we keep the parallelization information in the
> intrinsics during mid-level optimization, or do we move its state
> into an analysis pass? In any case, I don't object to this approach
> so long as we have a good plan for how this work will be done.

No -- only passes that happen before procedurization should be aware
of these intrinsics.

This answer is fairly ambiguous because you haven't explained exactly
when this will happen. I assume that it will happen fairly late. For
some things, like atomics lowering, we may want to wait until just
prior to code generation to allow late customization by target-specific
code.

I agree that it is not so easy to make optimizations "thread-aware".
But the problem is essentially the same, no matter how the parallel
extension is manifested in the IR.

> When we discussed this earlier this year, there seemed to be some
> consensus that we wanted to avoid, to the extent possible,
> introducing OpenMP-specific intrinsics into the LLVM layer. Rather,
> we should define some kind of parallelization API (in the form of
> metadata, intrinsics, etc.) onto which OpenMP can naturally map
> along with other paradigms. There is interest in supporting
> OpenACC, for example, which will require data copying clauses, and
> it would make sense to share as much of the infrastructure as
> possible with OpenMP. Are you interested in providing Cilk support
> as well? We probably don't want to have NxM slightly-different ways
> of expressing 'this is a parallel region'. There are obviously
> cases in which things need to be specific to the interface (like
> runtime loop scheduling in OpenMP which implies a specific
> interaction with the runtime library), but such cases may be the
> exception rather than the rule.
>
> We don't need 'omp' in the intrinsic names and also 'OMP_' on all of
> the string specifiers. Maybe, to my previous point, we could call
> the intrinsics 'parallel' and use 'OMP_' only when something is
> really OpenMP-specific?

As I said before, our aim was quite simple -- OpenMP support only.

Fair enough, but that does not explain why, even with a restricted
scope, we need to repeat 'omp' in both the intrinsic name and its
associated metadata.

As far as I can tell, what you've proposed is a fairly generic way to
pass pragma-type information from the frontend to the backend. Going
through all of the effort to implement that only to arbitrarily
restrict it to OpenMP pragmas seems silly. Having this capability would
be great, and we could use it for other things. For example, I'd like
to have a '#pragma unroll(n)' for loops. If we have a generic way to
pass such contextual pragmas to the backend, it would make supporting
such extensions much easier.
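
For instance (hypothetical syntax; neither the pragma nor the metadata
key exists today):

  /* Hypothetical example of a generic contextual pragma; both the
     pragma spelling and the metadata key are invented. */
  void scale(float *a, float s, int n) {
  #pragma unroll(4)                /* front end would attach, say,    */
    for (int i = 0; i < n; ++i)    /* "unroll.count" = 4 to this loop */
      a[i] *= s;                   /* in the IR; the backend decides  */
  }                                /* whether and how to honor it     */

The same front-end-to-back-end channel could then carry OpenMP clauses,
OpenACC data-copying clauses, and so on.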

Can the design be extended to allow more general form of parallel
extensions support? Probably... but this is definitely more than what
we intended.

> You don't seem to want to map thread-private variables onto the
> existing TLS support. Why?

Because we don't employ explicit procedurization. What happens after
procedurization (including how thread-private variables are manifested
in the IR) is heavily dependent on the OpenMP runtime library one
relies upon, and is out of the scope of our proposal.

I thought that thread-private variables in OpenMP could be declared
only at global scope. This makes them map cleanly to existing TLS
support, and I don't see how the intrinsics will work in this case
(because you can't call intrinsics at global scope). That having been
said, I recommend that we introduce a new 'omp' TLS mode so that the
implementation is free to choose the most-appropriate lowering.
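
Concretely, I'd expect something like this (sketch only) to map
directly onto TLS, with the new 'omp' mode just giving the
implementation latitude in the lowering:

  /* Sketch only: threadprivate applies to variables with static
     storage duration, which is exactly what TLS covers. */
  int counter;
  #pragma omp threadprivate(counter)

  void bump(void) { counter++; }  /* each thread increments its own
                                     copy inside parallel regions */

  /* A front end could lower the declaration to the moral equivalent
     of "__thread int counter;", ideally tagged with a dedicated 'omp'
     TLS mode in the IR so the target/runtime picks the exact
     lowering. */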

Thanks again,
Hal

Andrey Bokhanko <andreybokhanko@gmail.com> writes:

Hi All,

We'd like to make a proposal for OpenMP representation in LLVM IR.

I'm providing some brief comments after a skim of this..

Our goal is to reach an agreement in the community on a simple,
complete and extensible representation of OpenMP language constructs
in LLVM IR.

I think this is a bad idea. OpenMP is not Low Level and I can't think
of a good reason to start putting OpenMP support in the IR. Cray has
a complete, functioning OpenMP stack that performs very well without any
LLVM IR support at all.

As can be seen in the following sections, the IR extension we propose
doesn’t involve explicit procedurization. Thus, we assume that
function outlining should happen somewhere in the LLVM back-end, and
usually this should be aligned with how the chosen OpenMP runtime library
works and what it expects. This is a deliberate decision on our part.
We believe it provides the following benefits (when compared with
designs involving procedurization done in a front-end):

This is a very high-level transformation. I don't think it belongs
in a low-level backend.

1) Function outlining doesn’t depend on source language; thus, it can
be implemented once and used with any front-ends.

A higher-level IR would be more appropriate for this, either something
provided by Clang or another frontend, or some other mid-level IR.

2) Optimizations are usually restricted by a single function boundary.
If procedurization is done in a front-end, this effectively kills
optimizations -- even ones as simple as loop invariant code motion.
Refer to
[Tian2005] for more information on why this is important for efficient
optimization of OpenMP programs.

You're assuming all optimization is done by LLVM. That's not true in
general.

It should be stressed, though, that in order to preserve correct
semantics of a user program, optimizations should be made thread-aware
(which, to the best of our knowledge, is not the case with LLVM
optimizations).

Another reason not to do this in LLVM.

We also included a set of requirements for front-ends and back-ends,
which establishes mutual expectations and is an important addition to
the design.

This will increase coupling between the "front ends" and LLVM. That
would be very unfortunate. One of LLVM's great strengths is its
flexibility.

I didn't look at the details of how you map OMP directives to LLVM IR.
I think this is really the wrong way to go.

                       -David

Hal Finkel <hfinkel@anl.gov> writes:

Hi Hal,

As you may know, this is the third such proposal over the past two
months, one by me
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-August/052472.html)

This link seems to be broken. I missed your earlier proposal and would
like to read it. As with this proposal, I fear any direct
parallelization support in LLVM is going to take us away from the "low
level" nature of LLVM, which is a huge strength.

and the other, based somewhat on mine, by Sanjoy
(http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-September/053798.html)

I read this proposal quickly. I don't understand why we need
intrinsics. Won't calls to runtime routines work just fine?
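
For a parallel region, the front end can simply outline the body and
call the runtime directly, roughly like this (a sketch using the
libgomp-style entry points; declarations inlined here for
illustration):

  #include <stdio.h>

  /* Sketch of early lowering of:
         #pragma omp parallel
         { printf("hello\n"); }
     using libgomp-style entry points. */
  extern void GOMP_parallel_start(void (*fn)(void *), void *data,
                                  unsigned num_threads);
  extern void GOMP_parallel_end(void);

  static void region_fn(void *data) {
    (void)data;
    printf("hello\n");
  }

  void run_region(void) {
    GOMP_parallel_start(region_fn, 0, 0); /* 0 = default thread count */
    region_fn(0);                         /* master thread runs the body */
    GOMP_parallel_end();
  }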

Ah, Sanjoy had a link to your proposal in his message.

Again, I only skimmed the document, but I was left with the question,
"why not just make calls to runtime routines?" What is the reason for
the "paralleliation metadata?" It seems to me this implies/requires that
LLVM have knowledge of parallel semantics. That would be very
unfortunate.

                     -David

Andrey Bokhanko <andreybokhanko@gmail.com> writes:

> Hi All,
>
> We'd like to make a proposal for OpenMP representation in LLVM IR.

I'm providing some brief comments after a skim of this..

> Our goal is to reach an agreement in the community on a simple,
> complete and extensible representation of OpenMP language constructs
> in LLVM IR.

I think this is a bad idea. OpenMP is not Low Level and I can't think
of a good reason to start putting OpenMP support in the IR. Cray has
a complete, functioning OpenMP stack that performs very well without any
LLVM IR support at all.

OpenMP provides a mechanism to express parallel semantics, and the
best way to implement those semantics is highly target-dependent. On
some targets early lowering into a runtime library will perform well,
and optimization opportunities lost by doing so will prove fairly
insignificant in many cases. I can believe that this is true on those
systems that Cray targets. However, that will not be true everywhere.

> As can be seen in the following sections, the IR extension we
> propose doesn’t involve explicit procedurization. Thus, we assume
> that function outlining should happen somewhere in the LLVM
> back-end, and usually this should be aligned with how the chosen OpenMP
> runtime library works and what it expects. This is a deliberate
> decision on our part. We believe it provides the following benefits
> (when compared with designs involving procedurization done in a
> front-end):

This is a very high-level transformation. I don't think it belongs
in a low-level backend.

> 1) Function outlining doesn’t depend on source language; thus, it
> can be implemented once and used with any front-ends.

A higher-level IR would be more appropriate for this, either something
provided by Clang or another frontend, or some other mid-level IR.

For some things, yes, but at the moment we don't have anything else
besides the LLVM IR. The LLVM IR is currently where vectorization is
done, loop-restructuring is done, aliasing analysis is performed, etc.
and so it is where parallelization should be done as well. For other
things, like atomics, lowering later may be better.

> 2) Optimizations are usually restricted by a single function
> boundary. If procedurization is done in a front-end, this
> effectively kills optimizations -- even ones as simple as loop
> invariant code motion. Refer to [Tian2005] for more information on
> why this is important for efficient optimization of OpenMP programs.

You're assuming all optimization is done by LLVM. That's not true in
general.

Even if LLVM has support for parallelization, no customer is required
to use it. If you'd like to lower parallelization semantics into
runtime calls before lowering in LLVM, you're free to do that.

Nevertheless, LLVM is where, for example, loop-invariant code motion is
performed. We don't want to procedurize parallel loops before that
happens.
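
To give a trivial illustration (mine, not from the proposal):

  /* Illustration only. With early outlining, 'x' and 'y' typically
     reach the outlined loop body through a shared-data struct, so the
     loop-invariant 'x / y' sits behind pointer indirections that LICM
     and alias analysis must then see through; on the original IR,
     hoisting it out of the loop is trivial. */
  void scale_add(float *a, const float *b, float x, float y, int n) {
  #pragma omp parallel for
    for (int i = 0; i < n; ++i)
      a[i] = b[i] * (x / y);  /* x / y is loop-invariant */
  }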

> It should be stressed, though, that in order to preserve correct
> semantics of a user program, optimizations should be made
> thread-aware (which, to the best of our knowledge, is not the case
> with LLVM optimizations).

Another reason not to do this in LLVM.

> We also included a set of requirements for front-ends and back-ends,
> which establishes mutual expectations and is an important addition to
> the design.

This will increase coupling between the "front ends" and LLVM. That
would be very unfortunate. One of LLVM's great strengths is its
flexibility.

Most users do not use every feature of a programming language, and LLVM
is no different.

I didn't look at the details of how you map OMP directives to LLVM IR.
I think this is really the wrong way to go.

Respectfully, I disagree.

-Hal

Not to distract, but the word 'procedurization' is not an English word. It just keeps leaping out at me, when the word should be either procedure(s) (noun) or proceduralize (verb). Even 'processes' would make sense. I couldn't help myself because the word was distracting.

  • Marc

P.S. Not that my vote counts, but I'm more in Hal's camp: his approach of tackling the parallelization foundation within LLVM [OpenMP being just one aspect] makes more sense, hitting that now by providing a clean and generic container without being hardwired to any one particular third-party extension, a la the 'complete and extensible representation of OpenMP.'

Speaking as someone without the necessary foundation to really sway things either direction, the spirit of Hal's project seems more in line with how Lattner and company have historically preferred to keep LLVM: as clean and extensible a design as possible, without being interdependent upon other projects' hooks for extending the project's value.

As a former Apple/NeXT alum, I would have expected Enderby and company to have designed LLVM/Clang with OpenMP hooks already if it were intended to have an intimate relationship with the OpenMP project. It seems to me that OpenCL, OpenMP and other solutions are best served as add-ons to the project, freeing the project from having its design compromised away from its original goals.

I'd make the same observations whether AMD, ARM or anyone else made a proposal attempting to interweave OpenMP the way you hope to sway the community to allow. I prefer Hal's approach to the whole problem space.

Hi David,

Thank you for your comments.

Basically, I agree with Hal's answers -- nothing substantial to add to
what he already said.

As for

Again, I only skimmed the document, but I was left with the question,
"why not just make calls to runtime routines?"

Granted, this is the easiest and cheapest way to support OpenMP...
that throws away the whole notion of "optimizing compilation" and
"front-end / back-end decoupling".

Using the same logic: why bother with virtual registers? Why not use
the target machine's physical registers in the IR code generated by a
front-end?

Right?

Wait a sec... LLVM IR is meant to be portable and to support
"life-long program analysis and transformation". Locking it to a
target machine's OpenMP runtime calls from the very beginning is not
the best way to achieve these goals. The same goes for physical
registers -- this must be why virtual ones were invented several
decades ago.

What is the reason for
the "paralleliation metadata?" It seems to me this implies/requires that
LLVM have knowledge of parallel semantics. That would be very
unfortunate.

The reasons are listed in the "Function Outlining" section of my
proposal. You simply dismissed them with:

This is a very high-level transformation. I don't think it belongs in a low-level backend.

A higher-level IR would be more appropriate for this, either something provided by Clang or another frontend or a some other mid-level IR.

You're assuming all optimization is done by LLVM. That's not true in general.

Sorry, but what you said are your opinions, not data-driven arguments.
We all have opinions. The arguments I listed are supported by data
presented in [Tian05], referenced in our proposal. Do you have data
supporting your opinions?

Yours,
Andrey

Marc,

Speaking as someone without the necessary foundation to really sway
things either direction, the spirit of Hal's project seems more in line
with how Lattner and company have historically preferred to keep LLVM:
as clean and extensible a design as possible, without being
interdependent upon other projects' hooks for extending the project's
value.

I have a hard time understanding why one proposal is different from
the other in this regard.

Both rely on intrinsics and metadata... why are the intrinsics used in
one proposal better (or worse) than the ones used in the other?

Yours,
Andrey

Hal Finkel <hfinkel@anl.gov> writes:

OpenMP provides a mechanism to expresses parallel semantics, and the
best way to implement those semantics is highly target-dependent. On
some targets early lowering into a runtime library will perform well,
and optimization opportunities lost by doing so will prove fairly
insignificant in many cases. I can believe that this is true on those
systems that Cray targets. However, that will not be true everywhere.

Granted. However, it's still not clear to me that LLVM IR is the right
level for this. I'm not going to oppose the idea or anything, I'm just
expressing thoughts.

A higher-level IR would be more appropriate for this, either something
provided by Clang or another frontend, or some other mid-level IR.

For some things, yes, but at the moment we don't have anything else
besides the LLVM IR.

That's simply because no one has addressed the issue. Obviously, it
would be a lot of work to develop a new IR but the long-term benefits
may be worth it.

The LLVM IR is currently where vectorization is done,
loop-restructuring is done, aliasing analysis is performed, etc., and
so it is where parallelization should be done as well.

I made a presentation at the llvmdev conference some years ago in which
I (briefly) argued that LLVM IR is not the best fit for this stuff. I
still believe that. Obviously it _can_ be done because LLVM IR is
Turing-complete. But would it be easier/more effective to do it with a
different IR representation? I believe so. 'Course I don't expect any
of you to just take my word for it. :-)

Even if LLVM has support for parallelization, no customer is required
to use it. If you'd like to lower parallelization semantics into
runtime calls before lowering in LLVM, you're free to do that.

Yep, and we'll continue doing that. I'm not objecting because the
proposal will hurt us in some way. I'm not even objecting, really, just
pointing out alternatives for thought.

Nevertheless, LLVM is where, for example, loop-invariant code motion is
performed. We don't want to procedurize parallel loops before that
happens.

Yes, there are benefits to delaying outlining to allow other passes to
run. I do understand the whys of the various proposals.

                           -David

Andrey Bokhanko <andreybokhanko@gmail.com> writes:

Again, I only skimmed the document, but I was left with the question,
"why not just make calls to runtime routines?"

Granted, this is the easiest and cheapest way to support OpenMP...
that throws away the whole notion of "optimizing compilation" and
"front-end / back-end decoupling".

How? Nothing prevents the use of an intermediate layer to handle
high-level transformations.

Wait a sec... LLVM IR is meant to be portable and supporting
"life-long program analysis and transformation". Locking it with
target machine's OpenMP runtime calls from the very beginning is not
the best way to acheive these goals. Same with physical registers --
this must be the reason why they invented virtual ones several decades
ago.

Sure, I understand the whys of the proposal and I understand that
intrinsics can be valuable. I don't want to lose the low-level nature of
LLVM and these proposals (not just parallelization, but GPU IR,
eval-style inline bitcode, etc.) are starting to feel like
mission-creep.

That's all. I'm not forcibly objecting to anything. Just passing on
thoughts.

The reasons are listed in the "Function Outlining" section of my
proposal. You simply dismissed them with:

I apologize for offending you. That was certainly not my intent.

                             -David

Andrey,

While I think that it will be relatively easy to have the intrinsics
serve as code-motion barriers for other code that might be
thread-sensitive (like other external function calls), we would need to
think through exactly how this would work. The easiest thing would be to
mark the intrinsics as having unmodeled side effects, although we might
want to do something more intelligent.

Where do you propose placing the parallel loop intrinsics calls
relative to the loop code? Will this inhibit restructuring (like loop
interchange), fusion, etc. if necessary?

-Hal

Hal,

While I think that it will be relatively easy to have the intrinsics
serve as code-motion barriers for other code that might be
thread-sensitive (like other external function calls), we would need to
think through exactly how this would work. The easiest thing would be to
mark the intrinsics as having unmodeled side effects, although we might
want to do something more intelligent.

Yes, that's exactly the idea.

Where do you propose placing the parallel loop intrinsics calls
relative to the loop code?

In preloop ("opening" intrinsic) and postloop ("closing" one).
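
Schematically, it looks like this (a C-level sketch; the calls below
are just placeholders, not the actual intrinsic names from the
proposal):

  /* Placeholders only, standing in for the proposed "opening" and
     "closing" intrinsics around a parallel loop. */
  void __parallel_loop_open(void);   /* preloop: region setup, schedule  */
  void __parallel_loop_close(void);  /* postloop: implicit barrier, etc. */

  void vadd(float *a, const float *b, const float *c, int n) {
    __parallel_loop_open();
    for (int i = 0; i < n; ++i)
      a[i] = b[i] + c[i];
    __parallel_loop_close();
  }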

Will this inhibit restructuring (like loop
interchange), fusion, etc. if necessary?

I guess so... Loops usually deal with reading/writing memory, and if
an intrinsic is marked as "modifies everything", this hardly leaves
any possibility for [at least] the optimizations you mentioned.

But this is different from what I have in mind. Basically, the plan is
to perform analysis and some optimizations before procedurization, and
do the rest (including loop restructuring) after it. This is not
mentioned in the proposal (we tried to be succinct -- only 20 pages
long! :-)), but explained in detail in [Tian05] (sorry, the link in
the proposal doesn't lead you directly to the pdf file; use this one
instead: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3763&rep=rep1&type=pdf).

Yours,
Andrey

From: "Andrey Bokhanko" <andreybokhanko@gmail.com>
To: "Hal Finkel" <hfinkel@anl.gov>
Cc: llvmdev@cs.uiuc.edu
Sent: Wednesday, October 3, 2012 3:15:54 AM
Subject: Re: [LLVMdev] [RFC] OpenMP Representation in LLVM IR

Hal,

> While I think that it will be relatively easy to have the intrinsics
> serve as code-motion barriers for other code that might be
> thread-sensitive (like other external function calls), we would need
> to think through exactly how this would work. The easiest thing would
> be to mark the intrinsics as having unmodeled side effects, although
> we might want to do something more intelligent.

Yes, that's exactly the idea.

Right. You should verify that using the 'unmodeled side effects' tag does not inhibit the optimizations you seek to preserve. If we need to work out some other less-restrictive semantics, then we should discuss that.

> Where do you propose placing the parallel loop intrinsics calls
> relative to the loop code?

In preloop ("opening" intrinsic) and postloop ("closing" one).

> Will this inhibit restructuring (like loop
> interchange), fusion, etc. if necessary?

I guess so... Loops usually deal with reading/writing memory, and if
an intrinsic is marked as "modifies everything", this hardly leaves
any possibility for [at least] the optimizations you mentioned.

But this is different from what I have in mind. Basically, the plan is
to perform analysis and some optimizations before procedurization, and
do the rest (including loop restructuring) after it. This is not
mentioned in the proposal (we tried to be succinct -- only 20 pages
long! :-)), but explained in detail in [Tian05] (sorry, the link in
the proposal doesn't lead you directly to the pdf file; use this one
instead:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.97.3763&rep=rep1&type=pdf).

With regard to what you're proposing, the paper actually leaves a lot unexplained. The optimizations that it discusses prior to OpenMP lowering seem to be, "classical peephole optimizations within basic-blocks", inlining, and "OpenMP construct-aware constant propagation" (in addition to some aliasing analysis). If this is what you plan to do in LLVM as well, are you planning on implementing special-purpose passes for these transformations, or re-using existing ones? If you're reusing existing ones, which ones? And how would they need to be modified to become 'OpenMP aware'?

Can you please comment on the loop-analysis use case that I outline here:
http://lists.cs.uiuc.edu/pipermail/llvmdev/2012-October/054078.html
Would this kind of simplification fall under the 'constant propagation' transformation, or will something else be required?

What might be most useful is if we develop a set of optimization tests so that we have a set of use cases from which we can base a design. Do you already have a set of such tests? I'd be happy to work with you on this.

Thanks again,
Hal

Andrey,

Are you still working on this?

Thanks again,
Hal

Hal,

Our proposal effectively got scrapped by the community, so we are not
pushing the approach we proposed before any further.

How about meeting at the LLVM conference to discuss this?

Yours,
Andrey

Andrey,

Yes, that would be great.

-Hal

The "opt" in LLVM transforms bitcode into bitcode, which can be looked at as a "high-level" optimizer. Low-level optimizations would be those that work on machine instructions.

I think that the bitcode optimizer is the right place to do the OMP implementation. Parallelization infrastructure will be complex wherever we choose to add it. It is necessary for LLVM/clang to be competitive, so I don't see it as mission creep, more like the next step in LLVM's evolution.

-Krzysztof