[RFC] Late (OpenMP) GPU code "SPMD-zation"

Where we are

Could you elaborate on what you refer to w.r.t. data sharing? What do we currently do in the Clang code generation that we could not effectively implement in the runtime, potentially with support of an LLVM pass?

Thanks,

James

The globalization of local variables, for example. It must be implemented in the compiler, not in the runtime, to get good performance.

We could still do that in clang, couldn’t we?

But we need to know the execution mode, SPMD or “guarded”

We would still know that. We can do exactly the same reasoning as we do now.

I think the important question is, how different is the code generated for either mode and can we hide (most of) the differences in the runtime.

If I understand you correctly, you say the data sharing code looks very different and the differences cannot be hidden, correct?

It would be helpful for me to understand your point if you could give me a piece of OpenMP code for which the data sharing in SPMD mode and “guarded” mode is as different as possible. I can compile it in both modes myself, so high-level OpenMP is fine (I will disable SPMD mode manually in the source if necessary).

Thanks,

Johannes

No, we don’t. We need to perform a different kind of analysis for SPMD-mode constructs than for non-SPMD ones.

For SPMD mode we need to globalize only reduction/lastprivate variables. For non-SPMD mode, we need to globalize all private/local variables that may escape their declaration context in the construct.
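
A minimal illustrative example (not from the thread) of such an escaping local variable: in non-SPMD (generic) mode the team master's local `base` is used inside the nested parallel region, so it has to be globalized (moved to memory the worker threads can access), whereas in SPMD mode no such handling is needed for it.

  // Hypothetical kernel, for illustration only.
  void kernel(int *out, int n) {
    #pragma omp target teams map(from: out[0:n])
    {
      int base = 42;                 // local to the team master
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        out[i] = base + i;           // `base` escapes into the parallel region
    }
  }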

What do you refer to with “No, we don’t”?

Again, I do not propose to remove the SPMD “detection” in Clang. We will still identify SPMD mode based on the syntactic criteria we have now.

The Clang analysis is also not affected. Thus, we will globalize/localize the same variables as we do now. I don’t see why this should be any different.

After an IRC discussion, I think Alexey and I are pretty much in agreement (on the general feasibility at least).

I try to sketch the proposed idea again below, as the initial RFC was simply not descriptive enough.

After that, I briefly summarize how I see these changes being developed and committed so that we

  • never have any regressions,

  • can make an educated decision before removing any existing code.

What we want to do:

The intermediate goal is that the code generated by Clang for the SPMD and non-SPMD (earlier denoted as “guarded”) case is conceptually/structurally very similar. The current non-SPMD code is, however, a state machine generated into the user code module. This state machine is very hard to analyze and optimize. If the code looked like the SPMD code but *behaved the same way it does now*, we could “easily” switch from the non-SPMD to the SPMD version after a (late) analysis determined legality.

To make the code look the same but behave differently, we propose to hide the semantic difference in the runtime library calls. That is, the runtime calls emitted in the two modes are (slightly) different, or there is a flag which indicates the (initial) mode. If that mode is SPMD, the runtime behavior does not change compared to the way it is now. If that mode is non-SPMD, the runtime would separate the master and worker threads, as we do now in the user code module, and keep the workers in an internal state machine waiting for the master to provide them with work. Only the master would return from the runtime call, and the mechanism to distribute work to the worker threads would (for now) stay the same.
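
To make the idea concrete, here is a hedged C-level sketch of what such a kernel skeleton could look like. The function names are only loosely modeled on the prototype in D57460 and should not be taken as the actual interface; the point is that the emitted structure is identical in both modes and only the flag (or the chosen runtime entry point) differs.

  #include <stdbool.h>

  // Hypothetical runtime entry points -- illustrative names only,
  // not the actual D57460 interface.
  extern bool __kmpc_target_kernel_init(bool is_spmd_mode);
  extern void __kmpc_target_kernel_deinit(bool is_spmd_mode);

  void kernel_skeleton(bool is_spmd_mode) {
    // SPMD mode: every thread returns true and executes the body.
    // Non-SPMD mode: only the master returns true; the workers stay
    // inside the call, in a runtime-internal state machine, until the
    // master publishes work or terminates the kernel.
    if (__kmpc_target_kernel_init(is_spmd_mode)) {
      // ... user code of the target region ...
      __kmpc_target_kernel_deinit(is_spmd_mode);
    }
  }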

Preliminary implementation (and integration) steps:

  1. Design and implement the necessary runtime extensions and determine feasibility.

  2. Allow Clang code generation to use the new runtime extensions if explicitly chosen by the user.

  2b) Compare the performance of the unoptimized new code path vs. the original code path on test cases and real use cases.

  3. Implement the middle-end pass to analyze and optimize the code using the runtime extensions.

  3b) Compare the performance of the optimized new code path vs. the original code path on real use cases.

  4. If no conceptual problem was found and 2b)/3b) determined that the new code path is superior, switch to the new code path by default.

  5. If no regressions/complaints are reported after a grace period, remove the old code path from the Clang front-end.

Again, this is an early design RFC for which I welcome any feedback!

Thanks,

Johannes

We are working on OpenMP target offloading for GPUs in Flang and are adopting the same code generation strategy. This proposal affects us, so it would be nice to know more details about it so that we can prepare to adapt Flang (if everything goes ahead).

Have you found a solution for data sharing? How are you going to manage data sharing for SPMD and non-SPMD mode?

We are working on OpenMP target offloading for GPUs in Flang and are adopting the same code generation strategy.

Great to hear that Flang is making progress on this front. I hope we can find ways to generalize/uncouple the OpenMP code generation in Clang so we do not need to re-implement it completely. That could also allow other language front-ends easier access to the OpenMP runtime for parallelization. Anyway, that is a different topic we should discuss separately.

This proposal affects us, so it would be nice to know more details about it so that we can prepare to adapt Flang (if everything goes ahead).

As I mentioned before, this is an early design RFC; a lot of details still need to be figured out. However, I do not think you will need to adapt much if you have (or reuse) code generation similar to the SPMD code generation path in Clang. The goal of the first step is to make the code generation always look like that, so in the best case you won’t need to implement the non-SPMD code generation.

Have you found a solution for data sharing? How are you going to manage data sharing for SPMD and non-SPMD mode?

Initially, we generate the same SPMD code as we do now. Data sharing for non-SPMD code is also not altered. The interesting question is how we can analyze non-SPMD code and transform it to SPMD code as easily as possible. That might require us to change the “encoding” later down the road, but for now we won’t change anything in this regard.

Hi Alexey and Johannes,

I believe that deferring the use of the state machine can help to make a better decision; I agree with that. However, I am not sure how feasible that is. I look forward to hearing more details once the RFC is ready.

In Flang, we have currently implemented SPMD mode, but not non-SPMD mode.

Hi Johannes,

First of all thanks for looking into the matter of improving non-SPMD mode!

I have a question regarding the state machine that you said you’d like to replace/improve. There are cases (such as target regions that span multiple compilation units) where the switch statement is required. Is this something that your changes will touch in any way?

My next question is about the workloads which are in the same compilation unit. There is a trick that code gen performs (or could perform, I’m not sure if this has been upstreamed) where it can check for the specific name of an outlined function and then just call it directly, thus making that function inline-able (and erasing most, if not all, of the overhead of having the state machine in the first place). In other words, the “worst” part of the switch statement will only apply to outlined functions from other compilation units. With this in mind, what would the impact of your changes be in the end? If this part isn’t clear, I can do some digging to find out how this actually works in more detail; it’s been too long since I’ve had to look at this part.

Can you share some performance numbers given an example you have been looking at? I see you have one that uses “#pragma omp atomic”. I would avoid using something like that since it may have other overheads not related to your changes. I would put together an example with this directive structure:

#pragma omp target teams distribute
for(...){
  <code1>
  #pragma omp parallel for
  for(...) {
    <code2>
  }
  <code3>
}

which forces the use of the master-worker scheme (non-SPMD mode) without any other distractions.

It would then be interesting to understand how you plan to change the LLVM code generated for this, what the overheads that you’re targeting are (register usage, synchronization cost etc), and then what the performance gain is compared to the current scheme.

Thanks,

–Doru

Hi Doru,

[+ llvm-dev and cfe-dev]

Hi Johannes,

First of all thanks for looking into the matter of improving non-SPMD mode!

I have a question regarding the state machine that you said you'd like to
replace/improve. There are cases (such as target regions that span multiple
compilation units) where the switch statement is required. Is this something
that your changes will touch in any way?

There will not be a difference. Let me explain in some detail, as there
seems to be a lot of confusion on this state machine topic:

Now:

Build a state machine in the user code (module) with all the parallel
regions as explicit targets of the switch statement and a fallback
default that does an indirect call to the requested parallel region.

Proposed, after Clang:

Use the runtime state machine implementation [0] which reduces the
switch to the default case, thus an indirect call to the requested
parallel region. This will always work, regardless of the translation
unit that contained the parallel region (pointer).

Proposed, after OpenMP-Opt pass in LLVM (assuming SPMD wasn't achieved):

All reachable parallel regions in a kernel are collected and used to
create the switch statement in the user code (module) [1, line 111] with
a fallback if there are potentially [1, line 212] hidden parallel
regions.
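
To make the three stages concrete, here is a hedged C-level sketch of the worker side. All names (`get_published_work`, `parallel_region_0`, `barrier_wait`, ...) are illustrative placeholders, not the actual identifiers emitted by Clang or defined in D57460.

  typedef void (*work_fn_t)(void);

  extern void parallel_region_0(void);        // outlined parallel regions
  extern void parallel_region_1(void);        // known in this module
  extern work_fn_t get_published_work(void);  // set by the master thread
  extern void barrier_wait(void);

  // Now: Clang emits this state machine into the user code module.
  void worker_state_machine_current(void) {
    for (;;) {
      barrier_wait();                          // wait for the master
      work_fn_t fn = get_published_work();
      if (!fn)                                 // kernel is finished
        return;
      if (fn == parallel_region_0)
        parallel_region_0();                   // explicit target, inlinable
      else if (fn == parallel_region_1)
        parallel_region_1();                   // explicit target, inlinable
      else
        fn();                                  // fallback: indirect call
      barrier_wait();                          // signal completion
    }
  }

  // Proposed, after Clang (before the OpenMP-Opt pass): the loop and the
  // barriers live in the runtime; the worker body reduces to the default
  // case, an indirect call that works across translation units. The
  // OpenMP-Opt pass can later re-introduce explicit, inlinable targets
  // for all parallel regions it can prove reachable.
  void worker_body_proposed(work_fn_t fn) {
    fn();
  }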

Does that make sense?

[0] D57460: [OpenMP][Offloading] A generic and simple OpenMP target kernel interface, https://reviews.llvm.org/D57460
[1] D57460: [OpenMP][Offloading] A generic and simple OpenMP target kernel interface, https://reviews.llvm.org/D57460

My next question is about the workloads which are in the same compilation unit. There is a trick that code gen performs (or could perform, I'm not sure if this has been upstreamed) where it can check for the specific name of an outlined function and then just call it directly, thus making that function inline-able (and erasing most, if not all, of the overhead of having the state machine in the first place). In other words, the "worst" part of the switch statement will only apply to outlined functions from other compilation units. With this in mind, what would the impact of your changes be in the end? If this part isn't clear, I can do some digging to find out how this actually works in more detail; it's been too long since I've had to look at this part.

See the answer above.

Can you share some performance numbers given an example you have been looking
at? I see you have one that uses "#pragma omp atomic". I would avoid using
something like that since it may have other overheads not related to your
changes. I would put together an example with this directive structure:

#pragma omp target teams distribute
for(...){
  <code1>
  #pragma omp parallel for
  for(...) {
    <code2>
  }
  <code3>
}

which forces the use of the master-worker scheme (non-SPMD mode) without any
other distractions.

The atomic stuff I used to determine correctness. I haven't yet looked
at performance. I will do so now and inform you on my results.

It would then be interesting to understand how you plan to change the LLVM code
generated for this,

The examples show how the LLVM-IR is supposed to look, right?

what the overheads that you're targeting are (register usage,
synchronization cost etc), and then what the performance gain is
compared to the current scheme.

I can also compare register usage in addition to performance but there
is no difference in synchronization. The number and (relative) order of
original runtime library calls stays the same. The number of user code
-> runtime library calls is even decreased.

Please let me know if this helps and what questions remain.

Thanks,
  Johannes


Hi Doru,

maybe I should clarify something I mentioned in an earlier email already
but it seems there are things getting lost in this thread:

  While the prototype replaces code generation parts in Clang, the
  actual patches will add alternative code generation paths, guarded
  under a cmd flag. Once, and obviously only if, everything is in place
  and has been shown to improve the current situation, the default path
  would be switched.

Hi Johannes,

Thank you for the explanation.

I think we need to clarify some details about code generation in Clang today:

I'm not really sure why you feel the need to do that but OK.

1. non-SPMD mode, or generic mode, uses the master-worker code gen scheme where
the master thread and the worker threads are disjoint sets of threads (when one
set runs the other set is blocked and doesn't participate in the execution):

workers | master

BLOCKED | RUNNING
------- sync -------
RUNNING | BLOCKED
------- sync -------
BLOCKED | RUNNING

I agree, and for the record, this is not changed by my prototype, see
[1, line 295].

[1] D57460: [OpenMP][Offloading] A generic and simple OpenMP target kernel interface, https://reviews.llvm.org/D57460

2. the worker threads, in their RUNNING state above, contain a state machine
which chooses the parallel region to be executed. Today this choice happens in
one of two ways: explicit targets (where you know what outlined region you are
calling and you just call it) and indirect targets (via function pointer set by
master thread in one of its RUNNING regions):

workers | master

BLOCKED | RUNNING
------- sync -------
RUNNING |
 state  | BLOCKED
machine |
------- sync -------
BLOCKED | RUNNING

Partially agreed. Afaik, it will always be decided through a function
pointer set by the master thread and communicated to the workers through
the runtime. The workers use a switch, or in fact an if-cascade, to
check if the function pointer points to a known parallel region. If so
it will be called directly, otherwise there is the fallback indirect
call of the function pointer.

Your intended changes (only target the RUNNING state machine of the WORKERS):
- remove explicit targets from the current code gen (by itself this is a major step back!!)
- introduce a pass in LLVM which will add back the explicit targets.

Simplified but correct. From my perspective this is not a problem
because in production I will always run the LLVM passes after Clang.
Even if you do not run the LLVM passes, the below reasoning might be
enough to convince people to run a similar pass in their respective
pipeline. If that is not enough, we can also keep the Clang state
machine generation around (see the top comment).

Can you point out any major improvements this will bring compared to the
current state?

Sure, I'll give you three for now:

[FIRST]
Here is the original motivation from the first RFC mail (in case you
have missed it):

2) Implement a middle-end LLVM-IR pass that detects the guarded mode,
    e.g., through the runtime library calls used, and that tries to
    convert it into the SPMD mode potentially by introducing lightweight
    guards in the process.

    Why:
    - After the inliner, the canonicalizations, dead code elimination,
      code movement [2, Section 7 on page 8], we have a clearer picture
      of the code that is actually executed in the target region and all
      the side effects it contains. Thus, we can make an educated
      decision on the required amount of guards that prevent unwanted
      side effects from happening after a move to SPMD mode.
    - At this point we can more easily introduce different schemes to
      avoid side effects by threads that were not supposed to run. We
      can decide if a state machine is needed, conditionals should be
      employed, masked instructions are appropriate, or "dummy" local
      storage can be used to hide the side effect from the outside
      world.

[2] http://compilers.cs.uni-saarland.de/people/doerfert/par_opt18.pdf
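
As a hedged, OpenMP-level illustration of what such lightweight guards amount to (this is not code from the prototype, just standard OpenMP written by hand): the kernel is expressed as if it were fully parallel, and the team-sequential side effect is guarded so only one thread performs it.

  #include <omp.h>

  // Hand-written SPMD-style version with a guard; illustrative only.
  void spmdized_equivalent(int *counter, int *out, int n) {
    #pragma omp target teams num_teams(1) map(tofrom: counter[0:1]) map(from: out[0:n])
    #pragma omp parallel
    {
      // Guard: only one thread performs the formerly sequential side effect.
      if (omp_get_thread_num() == 0)
        *counter += 1;
      #pragma omp barrier
      // Work of the (formerly nested) parallel region, executed directly.
      #pragma omp for
      for (int i = 0; i < n; ++i)
        out[i] = i;
    }
  }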

Let me give you the canonical example that shows the need for this:

  #pragma omp target teams
  {
    foo(i + 0);
    foo(i + 1);
    foo(i + 2);
  }

  void foo(int i) {
  #pragma omp parallel
  ...
  }

The target region can be executed in SPMD mode but we cannot decide that
syntactically when the region is encountered. Agreed?

[SECOND]
Now there are other benefits with regards to the above mentioned state
machine. In the LLVM pass we can analyze the kernel code
interprocedurally and detect all potentially executed parallel regions,
together with a relation between them, and the need for the fallback
case. That means we can build a state machine that __takes control
dependences into account__, __after inlining and dead code elimination__
canonicalized the kernel.

If inlining and code canonicalization resulted in the following structure, the state machine we can build late can know that after section0 the workers will execute section1, potentially multiple times, before they move on to section3. In today's scheme, this is something we cannot simply do, causing us to traverse the if-cascade from top to bottom all the time (which grows linearly with the number of parallel regions). A sketch of such a specialized worker loop follows the example below.

  if (...) {
    #pragma omp parallel
    section0(...)
    do {
      #pragma omp parallel
      section1(...)
    } while (...)
  }
  #pragma omp parallel
  section3(...)
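
A hedged sketch of the control-dependence-aware worker loop the late pass could build for this structure; the helper names (`next_work`, etc.) are illustrative placeholders, not actual runtime functions.

  typedef void (*work_fn_t)(void);
  extern work_fn_t next_work(void);   // blocks until the master publishes work
  extern void section0(void), section1(void), section3(void);

  // Specialized state machine: after section0 the workers expect section1
  // (possibly repeated) and then section3, instead of re-checking a full
  // if-cascade from the top for every parallel region.
  void specialized_worker_loop(void) {
    work_fn_t fn = next_work();
    if (fn == section0) {
      section0();
      fn = next_work();
      while (fn == section1) {        // the do-while loop above
        section1();
        fn = next_work();
      }
    }
    if (fn == section3)
      section3();                     // known last parallel region
    else if (fn)
      fn();                           // fallback for hidden parallel regions
  }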

[THIRD]
Depending on the hardware, we need to make sure, or at least try really
hard, that there is no fallback case in the state machine, which is an
indirect function call. This can be done best at link time which
requires us to analyze the kernel late and modify the state machine at
that point anyway.

From your answer below you mention a lower number of function calls. Since
today we inline everything anyway how does that help?

If we inline, it doesn't for performance purposes. If we do not inline,
it does. In either case, it helps to simplify middle-end analyses and
transformations that work on kernels. Finally, it prevents us from
wasting compile time looking at the (unoptimizable) state machine of
every target region.

Maybe it is worth asking the opposite question:
  What are the reasons against these general runtime calls that hide the
  complexity we currently emit into the user code module?
[Note that I discuss the only drawback I came up with, a non-customized
state machine, already above.]

If you haven't considered performance so far how come you're proposing all
these changes? What led you to propose all these changes?

See above.

In SPMD mode all threads execute the same code. Using the notation in the
schemes above you can depict this as:

    all threads

      RUNNING

No state machine being used, no disjoints sets of threads. This is as
if you're executing CUDA code.

Agreed.

Could you explain what your proposed changes are in this context?

None, at least after inlining the runtime library calls there is
literally the same code executed before and after the changes.

Could you also explain what you mean by "assuming SPMD wasn't achieved"?

That is one of the two motivations for the whole change. I explained
that in the initial RFC and again above. The next comment points you to
the code that tries to achieve SPMD mode for inputs that were generated
in the non-SPMD mode (master-worker + state machine) by Clang.

Do you expect to write another LLVM pass which will transform the
master-worker scheme + state machine into an SPMD scheme?

I did already, as that was the main motivation for the whole thing.
It is part of the prototype, see [3, line 321].

[3] D57460: [OpenMP][Offloading] A generic and simple OpenMP target kernel interface, https://reviews.llvm.org/D57460

Cheers,
  Johannes

Hi Johannes,

Your clarifications helped a lot; having all the details gathered in one place helped me understand better what you are proposing.

Thanks a lot for taking the time to explain.

Thanks,

–Doru

Please consider reviewing the code for the proposed approach here:
  https://reviews.llvm.org/D57460

Initial tests, e.g., on the nw (Needleman-Wunsch) benchmark in the
Rodinia 3.1 benchmark suite, showed a 30% improvement after SPMD mode was
enabled automatically. The code in nw is conceptually equivalent to the
first example in the "to_SPMD_mode.ll" test case that can be found here:
  https://reviews.llvm.org/D57460#change-sBfg7kuN4Bid

The implementation is missing key features but one should be able to see
the overall design by now. Once accepted, the missing features and more
optimizations will be added.

There are too(!) many changes; I don’t know who’s going to review such a big patch. You definitely need to split it into several smaller patches. Also, I don’t like the idea of adding one more class for NVPTX codegen. All your changes should be on top of the existing solution.