[RFC] IR-level Region Annotations

Hi Johannes,

I am especially curious where you get your data from. Tapir [0] (and to
some degree PIR [1]) have shown that, counterintuitively, only a few changes
to LLVM passes are needed. Tapir was recently used in an MIT class with a
lot of students and it seemed to work well with only minimal changes
to analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.

However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses. For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural). Reduction recognition is doable for common cases, but there are hard cases here as well. These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).
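For concreteness, here is a toy example of the kind of information such clauses carry (the code and names are purely illustrative): the reduction(+:sum) and private(tmp) clauses state facts about sum and tmp that a compiler would otherwise have to prove.

double dot(const double *x, const double *y, int n) {
  double sum = 0.0;
  double tmp;
  #pragma omp parallel for private(tmp) reduction(+:sum)
  for (int i = 0; i < n; ++i) {
    tmp = x[i] * y[i];   // tmp gets a per-thread copy (private)
    sum += tmp;          // sum is combined across threads (reduction)
  }
  return sum;
}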

If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for PIR) would probably have to be modified in LLVM.

[...]
(b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
   join, barrier, etc.
[...]

For me, fork and spawn serve the same purpose; most new schemes suggested three new instructions in total.

A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.

Does that make sense?

Also I think we already have a lot of
simple constructs in the IR to express high-level information properly,
e.g. 'atomicrmw' instructions for high-level reduction. While we currently
lack analysis passes to extract information from such low-level
representations, these are certainly possible [2,3]. I would argue that such
analyses are a better way to do things than placing "high-level intrinsics" in the IR to mark things like reductions.

IMO, this is too fragile for ensuring performance for realistic applications. While there is a lot of work on reduction recognition in the literature going back to the 1980s, there will always be cases where these algorithms miss a reduction pattern and you get terrible parallel performance. Second, I’m not convinced that you can use specific operations like “atomicrmw” to find candidate loop nests in the first place that *might* be reductions, because the atomic op may in fact be in a library operation linked in as binary code. Third, this greatly increases the complexity of a “first working prototype” for a language like OpenMP because you can’t exclude REDUCTION clauses. Fourth, I believe array privatization is even harder; more generally, you cannot rely on heroic compiler optimizations to take the place of explicit language directives.

I may be showing my age, but I co-wrote a full HPF-to-MPI source-to-source compiler in the late 1990s at Rice University with some quite sophisticated optimizations [1,2], and we and others in the HPF compiler community — both product and research teams — encountered many cases where compiler analyses were inadequate. There is a credible view that HPF failed despite very high hopes and significant industry and research investment because it relied too much on heroic compiler technology.

[1] http://vadve.web.engr.illinois.edu/Papers/set_framework.pldi98.ps (PLDI 1998)
[2] http://vadve.web.engr.illinois.edu/Papers/SC98.NASBenchmarksStudy/dhpfsc98.pdf (SC 1998)

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org

Hi Johannes,

I am especially curious where you get your data from. Tapir [0] (and to
some degree PIR [1]) have shown that, counterintuitively, only a few changes
to LLVM passes are needed. Tapir was recently used in an MIT class with a
lot of students and it seemed to work well with only minimal changes
to analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think the idea of asymmetric parallel tasks and control flow is a clever way to express parallelism with serial semantics, as in Cilk. Encoding the control flow extensions as explicit instructions is orthogonal to that, though arguably more elegant than using region tags + metadata.

However, Cilk is a tiny language compared with the full complexity of other languages, like OpenMP. To take just one example, TAPIR cannot express the ORDERED construct of OpenMP. A more serious concern, IMO, is that TAPIR (like Cilk) requires serial semantics, whereas there are many parallel languages, OpenMP included, that do not obey that restriction. Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.

I agree with this, but I’m also wondering if it needs to be first class in the IR?
For example, we know our alias analysis is very basic, and C/C++ have higher constraints thanks to their type system, but we didn’t inject this higher-level information that helps the optimizer as first-class IR constructs.
I wonder if the same wouldn’t apply to an OpenMP reduction clause, for instance, where you could use the “basic” IR construct and the analysis would use metadata emitted by the frontend instead of trying to infer the reduction.

Just a thought, I haven't spent much time studying other constructs and how they map to the IR :-)

For example, automatic array privatization in parallel loops is a very hard problem (automatic scalar privatization is easier, but even that is interprocedural). Reduction recognition is doable for common cases, but there are hard cases here as well. These are all standard features of parallel programs, not specific to OpenMP (e.g., C++17 parallel template operators are likely to produce these as well).

If you support all these capabilities in the IR, a *lot* more than 6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for PIR) would probably have to be modified in LLVM.

[...]
(b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
  join, barrier, etc.
[...]

For me fork and spawn are serving the same purpose, most new schemes suggested three new instructions in total.

A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.

Does that make sense?

Now that I read this, I wonder if it isn’t close to what I tried to express above :-)

FWIW, while I agree with the general point, I wouldn't use this example,
because we pretty much still suffer to this day because of it (both in AA,
and devirt, and ...) :-)
We can't always even tell fields apart.

<snip>

Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.

I agree with this, but I’m also wondering if it needs to be first class in the IR?
For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.
I wonder if the same wouldn’t apply to an openmp reduction clause for instance, where you could use the “basic” IR construct and the analysis would use a metadata emitted by the frontend instead of trying to infer the reduction.

As Danny said (next message), lack of strong AA does limit some of our optimizations today. Even beyond that, however, parallel performance is a different beast from scalar optimizations. If you don’t devirtualize a subset of indirect calls, you may take a few % hit. If you fail to recognize a reduction and run the loop serially instead of in parallel, you could easily take a 10x or greater slowdown on that loop, leading to a substantial slowdown overall, depending on the importance of the reduction.

Put another way, individual parallel optimizations (whether done manually or by a compiler) have far larger performance effects, generally speaking, than individual scalar optimizations.

Just a thought, I haven't spent much time studying other constructs and how they map to the IR :-)

[...]
(b) Add several new LLVM instructions such as, for parallelism, fork, spawn,
join, barrier, etc.
[...]

For me fork and spawn are serving the same purpose, most new schemes suggested three new instructions in total.

A reasonable question is whether to use (#b) first-class instructions for some features, *in combination with* (#d) — i.e., region markers + metadata — or to use #d exclusively. There are almost certainly far too many essential features in parallel programs to capture them all as new instructions. I don’t see a need to answer this question on Day 1. Instead, we can begin with regions and metadata annotations, and then “promote” a few features to first-class instructions if the benefit is justified.

Does that make sense?

Now that I read this, I wonder if it isn’t close to what I tried to express above :-)

:-).

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org

Is it inherent to the infrastructure, i.e., using metadata instead of a first-class IR construct, or is it just a “quality of implementation” issue?

Not to derail this conversation:

IMHO, at some point there is no real difference :-)

Because otherwise, everything is a QOI issue.

IE if it's super tricky to get metadata that works well and works right,
doesn't get lost, etc, and that's inherent to using metadata, that to me is
not a QOI issue.

So could it be done with metadata? Probably?
But at the same time, if it had been done with more first class
constructs, it would have happened years ago and been much lower cost.

So it's all tradeoffs.

This is what I meant by “inherent to the infrastructure”, thanks for clarifying.

To clarify, we were proposing metadata that is used as arguments to the region-annotation intrinsics. This metadata has the nice property that it does not get dropped (so it is just being used as a way of encoding whatever data structures are necessary without predefining a syntactic schema). -Hal

Hi,

My bias is to use both (b) and (d), since they have complementary
strengths. We should use (b) for expressing concepts that can't be
semantically modeled as a call or invoke (this branch takes both its
successors), and (d) for expressing things that can be (this call may
never return), and annotation like things (this region (denoted by
def-use of a token) is a reduction).

I don't grok OpenMP, but perhaps we can come up with one or two
"generalized control flow"-type instructions that can be used to model
the non-call/invoke like semantics we'd like LLVM to know about, and
model the rest with (d)?

-- Sanjoy

Hi Sanjoy,

Yes, that's exactly what we have been looking at recently here, but the region tags seem to make it possible to express the control flow as well, so I think we could start with regions+metadata, as Hal and Xinmin proposed, and then figure out what needs to be first class instructions.

--Vikram Adve

Hi Vikram,

Hi Sanjoy,

Yes, that's exactly what we have been looking at recently here, but
the region tags seem to make it possible to express the control flow
as well, so I think we could start with regions+metadata, as Hal and

I'm not yet convinced that region tags are sufficient to model exotic
control flow.

(I don't know OpenMP so this is a copy-pasted-edited example)

Say we have:

void main() {
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val;
    compute_something_into_val(&val, i);
    a[i] = val;
  }
}

I presume the (eventual) intended lowering is something like this (if
the intended lowering is different than this, and avoids the issue I'm
trying to highlight then my point is moot):

void main() {
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  i32* val = alloca i32
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

However, LLVM is free to hoist the alloca to the entry block:

void main() {
  i32* val = alloca i32
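  // (alloca hoisted to the entry block: all four parallel forks now share this one slot)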
  tok = llvm.experimental.intrinsic_a();

  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok);
}

and now you have a race between the four parallel forks.

The problem here is that nothing in the IR expresses that we have four
copies of the region running "at the same time". In fact, such a
control flow is alien to LLVM today.

For instance, another evil optimization may turn:

void main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return a[0] + a[1];
}

to

void main() {
  int a[4];
  #pragma omp parallel num_threads(4)
  {
    int i = omp_get_thread_num();
    int val = compute_something_into_val(i);
    a[i] = val;
  }

  return undef;
}

since a[i] = val could have initialized at most one element in a.

Now you could say that the llvm.experimental.intrinsic_a and
llvm.experimental.intrinsic_b intrinsics are magic, and even such
"obvious" optimizations are not allowed to happen across them; but then
calls to these intrinsics are pretty fundamentally different from
"normal" calls, and are probably best modeled as new instructions.
You're going to have to do the same kind of auditing of passes either
way, and the only extra cost of a new instruction is the extra bitcode
reading / writing code.

I hope I made sense.

-- Sanjoy

Hi Vikram,

Thanks for the elaborate answer. I will try to be brief with my comment.

Hi Johannes,

> I am especially curious where you get your data from. Tapir [0] (and to
> some degree PIR [1]) have shown that, counterintuitively, only a few changes
> to LLVM passes are needed. Tapir was recently used in an MIT class with a
> lot of students and it seemed to work well with only minimal changes
> to analysis and especially transformation passes.

TAPIR is an elegant, small extension and, in particular, I think the
idea of asymmetric parallel tasks and control flow is a clever way to
express parallelism with serial semantics, as in Cilk. Encoding the
control flow extensions as explicit instructions is orthogonal to
that, though arguably more elegant than using region tags + metadata.

I agree that any form of first class parallelism seems more elegant than
region tags + metadata. Choosing a clever structure for the parallel
constructs can additionally save a lot of work wrt. existing passes,
e.g., the dominance relationship will prevent most illegal
transformations in our proposed fork-join scheme.

However, Cilk is a tiny language compared with the full complexity of
other languages, like OpenMP. To take just one example, TAPIR cannot
express the ORDERED construct of OpenMP.

As with reductions, ORDERED is something we might want to express with
low-level constructs we already have, e.g., atomics. Alternatively, the
fork instructions can be extended with "attributes" to express more
elaborate schemes, e.g., the "lockstep" annotation we introduced for the
PIR white paper.
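For illustration, one such low-level encoding of an ordered region at the source level is a shared ticket counter that each iteration waits on before entering the ordered part. This is only an illustrative sketch with made-up names; a real lowering would use a proper doacross/ordered wait rather than this simplified spin:

#include <stdatomic.h>

atomic_int turn;               // next iteration allowed into the ordered part

void unordered_work(int i);
void ordered_work(int i);

void ordered_loop(int n) {
  atomic_store(&turn, 0);
  #pragma omp parallel for
  for (int i = 0; i < n; ++i) {
    unordered_work(i);                  // may execute out of order
    while (atomic_load(&turn) != i) {   // wait until it is our turn
      /* spin (simplified) */
    }
    ordered_work(i);                    // executes in iteration order
    atomic_fetch_add(&turn, 1);         // release the next iteration
  }
}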

A more serious concern, IMO, is that TAPIR (like Cilk) requires serial
semantics, whereas there are many parallel languages, OpenMP included,
that do not obey that restriction.

Serial semantics are good to have but do not need to be enforced. In PIR
we used a "force" attribute to force parallel semantics and disable
potential serialization, though OpenMP does not necessarily enforce
parallelism.

Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE,
that are needed because without that, you’d be dependent on
fundamentally hard compiler analyses to extract the same information
for satisfactory parallel performance; realistic applications cannot
depend on the success of such analyses. For example, automatic array
privatization in parallel loops is a very hard problem (automatic
scalar privatization is easier, but even that is interprocedural).
Reduction recognition is doable for common cases, but there are hard
cases here as well. These are all standard features of parallel
programs, not specific to OpenMP (e.g., C++17 parallel template
operators are likely to produce these as well).

In the end it boils down to how much of this information we want/need to
encode as a special case (intrinsics/metadata) in the IR and what
information should be analyzed from constructs built with common IR.
Additionally, one has to consider which is easier to deal with/optimize
in the existing passes. In the PIR white paper we showed how OpenMP
(FIRST)PRIVATE clauses can be expressed using only well-placed alloca
instructions. From there, analysis is not as hard as automatic
parallelization.
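At the source level the idea is roughly the following (an illustrative sketch, not the PIR encoding itself): the private copy is simply a local declared inside the region, i.e., an alloca placed inside the region in the IR, with an explicit copy-in for FIRSTPRIVATE.

#include <stdio.h>

int main(void) {
  int x = 42;

  // Clause form: every thread gets its own x, initialized to 42.
  #pragma omp parallel firstprivate(x)
  {
    printf("%d\n", x + 1);        // each thread prints 43
  }

  // Clause-free encoding: the private copy is a local declared inside
  // the region (an alloca inside the region in IR), with an explicit
  // copy-in from the shared variable.
  #pragma omp parallel
  {
    int x_priv = x;               // per-thread storage + firstprivate copy-in
    printf("%d\n", x_priv + 1);   // each thread prints 43
  }
  return 0;
}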

If you support all these capabilities in the IR, a *lot* more than
6000 LOC (Tapir’s statistic; I’m sorry I don’t recall the number for
PIR) would probably have to be modified in LLVM.

All in all, TAPIR was implemented in 5k LOC. However, >1k of that is
explicit parallel optimization, 1k is used to add the new instructions (thus
mechanical), 2k is used to lower the parallelism (thus needed for
basically all schemes), and the rest is needed to make it work with
existing analysis and transformation passes. To this end, I do not see
how any other scheme would require much less.

>> [...] (b) Add several new LLVM instructions such as, for
>> parallelism, fork, spawn, join, barrier, etc. [...]
>
> For me fork and spawn are serving the same purpose, most new schemes
> suggested three new instructions in total.

A reasonable question is whether to use (#b) first-class instructions
for some features, *in combination with* (#d) — i.e., region markers +
metadata — or to use #d exclusively. There are almost certainly far
too many essential features in parallel programs to capture them all
as new instructions. I don’t see a need to answer this question on Day
1. Instead, we can begin with regions and metadata annotations, and
then “promote” a few features to first-class instructions if the
benefit is justified.

Does that make sense?

Given parallelism as a first-class citizen we can still add new features
in various ways, including intrinsics or metadata. However, if we walk
down a path it is hard to change the direction completely, especially
since region annotations and fork-join parallel regions do serve
overlapping needs.

> Also I think we already have a lot of simple constructs in the IR to
> express high-level information properly, e.g. 'atomicrmw'
> instructions for high-level reduction. While we currently lack
> analysis passes to extract information from such low-level
> representations, these are certainly possible [2,3]. I would argue
> that such analysis are a better way to do things than placing
> "high-level intrinsics" in the IR to mark things like reductions.

IMO, this is too fragile for ensuring performance for realistic
applications. While there is a lot of work on reduction recognition in
the literature going back to the 1980s, there will always be cases
where these algorithms miss a reduction pattern and you get terrible
parallel performance.

This might be true, but it is also a problem of current optimizations,
e.g., loop vectorization. Missed optimization opportunities will not go
away one way or another.

Second, I’m not convinced that you can use specific operations like
“atomicrmw” to find candidate loop nests in the first place that
*might* be reductions, because the atomic op may in fact be in a
library operation linked in as binary code.

I do not get the library operation comment, as atomicrmw is an instruction in the IR.
Also, the front-end would place the atomicrmw instructions in a parallel
loop, so there is no finding of candidate loop nests involved, right? (I
have the feeling you are mixing automatic parallelization with representation
of existing parallelism here.)

Third, this greatly increases the complexity of a “first working
prototype” for a language like OpenMP because you can’t exclude
REDUCTION clauses. Fourth, I believe array privatization is even
harder; more generally, you cannot rely on heroic compiler
optimizations to take the place of explicit language directives.

As above, it is not about automatic parallelization of sequential
program but expressing parallel constructs and reasoning about them
afterwards. The latter is a different and often much easier problem.

I may be showing my age, but I co-wrote a full HPF-to-MPI
source-to-source compiler in the late 1990s at Rice University with
some quite sophisticated optimizations [1,2], and we and others in the
HPF compiler community — both product and research teams — encountered
many cases where compiler analyses were inadequate. There is a
credible view that HPF failed despite very high hopes and significant
industry and research investment because it relied too much on heroic
compiler technology.

[1] http://vadve.web.engr.illinois.edu/Papers/set_framework.pldi98.ps (PLDI 1998)
[2] http://vadve.web.engr.illinois.edu/Papers/SC98.NASBenchmarksStudy/dhpfsc98.pdf (SC 1998)

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org

As a final remark I want to stress that a lot of the problems or
complications you mention do not go away with region annotations and/or
intrinsics/metadata. We always have to extract high-level knowledge from
low-level constructs; the question is only what these constructs will
look like. I try to argue that simple low-level constructs are just
better suited for the job, as they are easier for existing passes to
understand and transform. Nevertheless, there will always be cases
we cannot handle well in IR and which should be optimized beforehand, similar
to the optimizations functional languages do before reaching LLVM IR.

Cheers,
  Johannes

<snip>
>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are needed because without that, you’d be dependent on fundamentally hard compiler analyses to extract the same information for satisfactory parallel performance; realistic applications cannot depend on the success of such analyses.
>
> I agree with this, but I’m also wondering if it needs to be first class in the IR?
> For example we know our alias analysis is very basic, and C/C++ have a higher constraint thanks to their type system, but we didn’t inject this higher level information that helps the optimizer as first class IR constructs.
> I wonder if the same wouldn’t apply to an openmp reduction clause for instance, where you could use the “basic” IR construct and the analysis would use a metadata emitted by the frontend instead of trying to infer the reduction.

As Danny said (next message), lack of strong AA does limit some of our
optimizations today. Even beyond that, however, parallel performance
is a different beast from scalar optimizations. If you don’t
devirtualize a subset of indirect calls, you may take a few % hit. If
you fail to recognize a reduction and run the loop serially instead
of in parallel, you could easily take a 10x or greater slowdown on
that loop, leading to a substantial slowdown overall, depending on the
importance of the reduction.

This is not about automatic detection of parallelism and/or reductions.
The front-end would emit parallel code and encode the reduction one way
or another. As long as the encoding is semantically correct, we would
always get a correctly running parallel result. The question here is:
  if the reduction is recognized in the parallel lowering step, we can
  use the reduction support of the parallel runtime library; if not,
  we would "have to" implement it the way the front-end encoded it, e.g.,
  with atomicrmw operations.
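At the source level, such a front-end encoding could look roughly like the sketch below (names are illustrative; the atomic update is what would become an atomicrmw in the IR, and a lowering pass that recognizes the pattern can still switch to the runtime's reduction support):

double sum_array(const double *a, int n) {
  double sum = 0.0;
  #pragma omp parallel for shared(sum)
  for (int i = 0; i < n; ++i) {
    // Semantically correct fallback encoding of reduction(+:sum):
    // every iteration updates the shared accumulator atomically.
    #pragma omp atomic
    sum += a[i];
  }
  return sum;
}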

Put another way, individual parallel optimizations (whether done
manually or by a compiler) have far larger performance effects,
generally speaking, than individual scalar optimizations.

As these proposals are all about parallel representation and not yet
optimization I think it is hard to argue that one is inherently better
suited to optimize already represented parallel regions than the other.

>> >
>> > Hi Johannes,
>> >
>> >> I am especially curious where you get your data from. Tapir [0] (and to
>> >> some degree PIR [1]) have shown that, counterintuitively, only a few
>> changes
>> >> to LLVM passes are needed. Tapir was recently used in an MIT class
>> with a
>> >> lot of students and it seemed to work well with only minimal changes
>> >> to analysis and especially transformation passes.
>> >
>> > TAPIR is an elegant, small extension and, in particular, I think the
>> idea of asymmetric parallel tasks and control flow is a clever way to
>> express parallelism with serial semantics, as in Cilk. Encoding the
>> control flow extensions as explicit instructions is orthogonal to that,
>> though arguably more elegant than using region tags + metadata.
>> >
>> > However, Cilk is a tiny language compared with the full complexity of
>> other languages, like OpenMP. To take just one example, TAPIR cannot
>> express the ORDERED construct of OpenMP. A more serious concern, IMO, is
>> that TAPIR (like Cilk) requires serial semantics, whereas there are many
>> parallel languages, OpenMP included, that do not obey that restriction.
>> Third, OpenMP has *numerous* clauses, e.g., REDUCTION or PRIVATE, that are
>> needed because without that, you’d be dependent on fundamentally hard
>> compiler analyses to extract the same information for satisfactory parallel
>> performance; realistic applications cannot depend on the success of such
>> analyses.
>>
>> I agree with this, but I’m also wondering if it needs to be first class
>> in the IR?
>> For example we know our alias analysis is very basic, and C/C++ have a
>> higher constraint thanks to their type system, but we didn’t inject this
>> higher level information that helps the optimizer as first class IR
>> constructs.
>>
>
> FWIW, while i agree with the general point, i wouldn't use this example.
> Because we pretty much still suffer to this day because of it (both in AA,
> and devirt, and ...) :-)
> We can't always even tell fields apart
>
>
> Is it inherent to the infrastructure, i.e. using metadata instead of first
> class IR construct or is it just a “quality of implementation” issue?
>
>
Not to derail this conversation:

IMHO, at some point there is no real difference :-)

Because otherwise, everything is a QOI issue.

IE if it's super tricky to get metadata that works well and works right,
doesn't get lost, etc, and that's inherent to using metadata, that to me is
not a QOI issue.

So could it be done with metadata? Probably?
But at the same time, if it had been done with more first class
constructs, it would have happened years ago and been much lower cost.

So it's all tradeoffs.

Right. And while I think the metadata Hal proposed is different (see the
answer by Hal), prior experience (llvm.parallel.loop) tought us that it
takes a lot of effort to keep arbitrary attached metadata up to date and
working well.

Hi Sanjoy,

thanks for the example! It is pretty similar to what we have in our PIR
white paper (see page 10 in [0]). We think such cases nicely showcase
that we should use actual control flow (in combination with SSA and
dominance) to prevent current optimizations from breaking parallel invariants
even without them knowing about parallelism.

Cheers,
  Johannes

[0] http://compilers.cs.uni-saarland.de/people/doerfert/parallelcfg.pdf

Sanjoy, the IR would be something like the following. It is ok to hoist the alloca instruction outside the region. There are some small changes in the optimizer to understand the region-annotation intrinsics.

void main() {
  i32* val = alloca i32
  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* val), "QUAL.NUM_THREADS"(i32 4)]

  int i = omp_get_thread_num();
  compute_something_into_val(val, i);
  a[i] = val;

  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

With the above representation, we can do privatization and outlining as below.

void main() {
  i32* val = alloca i32
  i32* i = alloca i32
  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)]

  %i = omp_get_thread_num();
  compute_something_into_val(%val, %i);
  a[i] = %val;

  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

1. Create "i32* priv_val = alloca i32" and "priv_i = ..." in the region, and replace all %val with %priv_val (and %i with %priv_i) in the region.
2. Perform outlining.

Caller code
....
omp_push_num_threads(4)
omp_fork_call( .... outline_par_region....) ....

Callee code:
Outlined_par_region {
   i32* priv_val = alloca i32
   i32* priv_i = ....

    ret
}

For OpenMP, we do support it at -O0, -O1, -O2 and -O3. We had to make sure it runs correctly w/ and w/o optimizations and advanced analyses, so we need to preserve all source information for the BE.
You can take a look at our LLVM-HPC paper for some details. There is still a ton of work to be done. Thanks.

Xinmin

In the case, "val" is shared per OpenMP language rule. There is no privatization needed. %val is on the stack of master, to share %val among all threads, &val is passed to the outlined function.

void main() {
  int val;
  #pragma omp parallel num_threads(4)
  {
    // Really bad naming, won't pass code review. :-)
    compute_something_into_val(&val, omp_get_thread_num());
  }
}

The IR would be.

void main() {
  i32* val = alloca i32
   tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(), "QUAL.SHARED"(i32* %val), "QUAL.NUM_THREADS"(i32 4)]
      %1 = omp_get_thread_num();
      compute_something_into_val(%val, %1);
   llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
}

Xinmin

Here we come to the interesting part: hoisting "i32* i = alloca i32" above the intrinsic requires updating the intrinsic's "QUAL.PRIVATE" information.
This means that the optimizer has to be aware of it; am I missing the magic here?
I understand that an OpenMP-specific optimization can do it; the question is how an OpenMP-agnostic pass is supposed to behave in the face of llvm.experimental.intrinsic_a?

This means that the optimizer has to be aware of it; am I missing the magic here?

This is one option.

The other option is the one I mentioned for our implementation in our LLVM-HPC paper: we have a "prepare phase for pre-privatization" that can be invoked by both the Clang FE and the Fortran FE right after LLVM IR is generated. This way, we are able to minimize the impact of optimizations on the original val and i:

void main() {
  i32* val = alloca i32
  i32* i = alloca i32
  i32* priv_val = alloca i32
  i32* priv_i = alloca i32
  tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),"QUAL.PRIVATE"(i32* %priv_val, i32 %priv_i), "QUAL.NUM_THREADS"(i32 4)]

  %priv_i = omp_get_thread_num();
  compute_something_into_val(%priv_val, %priv_i);
   a[priv_i] = %priv_val;

  llvm.experimental.intrinsic_b(tok)["DIR.END.PARALLEL"()];
  ....

  i =
  val =
   Foo(val, i)
}

"Prepare phase" is our way of minimizing the impact to existing optimizations.

Xinmin

Xinmin,

Outlining turns a parallel program into a sequential one from the compiler's
perspective, and that is why most of the parallelism-ignorant passes would hurt it.
In your IR description for Sanjoy's example, does the current approach of
outlining impact the way the IR should be enhanced for parallelism?

For that specific example (or other SPMD analyses and optimizations) and
what is implemented in clang, I am not sure whether we are going to change
the frontend so as not to outline the parallel region, or to allow
certain optimizations such as hoisting that alloca in clang, which I believe
is not desired. Or annotate the outlined function together with the
intrinsic_a so that hoisting can be performed, in which case the
intrinsic_a would look like this:

tok = llvm.experimental.intrinsic_a()[ "DIR.PARALLEL"(),
"QUAL.PRIVATE"(i32* %val, i32 %i), "QUAL.NUM_THREADS"(i32 4)] plus info for
the OUTLINED_call.

Mehdi,

I think I am asking the same question as you asked me.

Yonghong