Proposal: Resolve combined directives in parsing phase

At the moment, combined directives have their own ast representation for type-checking and code generation. For some of the combined constructs, the code generation is implemented as inlined function what results in ignoring the semantic meaning of these directives.

This is true for e.g.
EmitOMPTargetParallelForSimdDirective
EmitOMPTargetSimdDirective
EmitOMPTeamsDistributeDirective
EmitOMPTargetTeamsDistributeDirective
EmitOMPTargetTeamsDistributeParallelForDirective
and more

One solution would be the proper codegen implementation for these directives.
However, I would like to propose a simpler and closer-to-spec approach:
By resolving combined directives in the parsing phase into nested AST nodes.

E.g. an OMPTargetTeamsDistributeDirective would be resolved into
OMPTargetDirective
     >- OMPTeamsDirective
         >- OMPDistributeDirective

whereas type-checking and codegen for these single directives is already implemented.
The advantages are:
- Much simpler type-checking and code generation
- We match the specification stating that combined directives have the semantic meaning of one construct immediately followed by the other construct
- All combined directives are fully supported if their derived constructs are supported

Potential disadvantages:
- The AST representation differs from the input. However, this is already the case due to inserted implicit parameters.
- Code optimizations for combined directives may be harder to implement

In my opinion the benefits outweigh the disadvantages, but I may not be aware of some implications. Please let me know your thoughts about this idea. And tell me if I missunderstood anything related that led to the decision for the actual design.

Unrelated question:
I don't understand the necessity of the __kmpc_fork_teams() run-time call as the __tgt_target_teams() implementation should be able to handle this case.

Daniel

This would have some impact on us long term and I have a slight preference that the sema is per unique combined directive since I’m not sure the error handling on a parsing issue would cascade like you expect. (This may require more thought… hmmm)

(btw I’m not thinking of this purely from the codegen/compiler perspective, but also IDE where libclang can (does) get used for syntax highlighting and other editor features.

If it’s important to you I’ll poke around internally to see how our implementation may differ from pure upstream and give feedback.

Daniel,

Cons: - Code optimizations for combined directives may be harder to implement

From the perspective of GPU code generation treating combined directives as a special case is immensely important for performance. Knowing that there is no serial section in the target region makes it possible to generate simplified and low overhead code that resembles typical CUDA kernels. I am worried that splitting combined directives will make it much harder to get this performance back, particularly because we must do OpenMP lowering in Clang.

the code generation is implemented as inlined function what results in
> ignoring the semantic meaning of these directives.

Can the current code generation be altered to fix the issues that you see?

> Unrelated question:
> I don't understand the necessity of the __kmpc_fork_teams() run-time
> call as the __tgt_target_teams() implementation should be able to handle
> this case.

My understanding is that the __tgt* calls are implemented in the target offload library while the __kmpc* calls are in the OpenMP runtime. On CPUs, forking of teams is done in the __kmpc_fork_teams() call. On the GPU, the offload call __tgt_target_teams() launches a kernel with multiple teams so the __kmpc_fork_teams() is a no-op.

Regards,
Arpith

graycol.gifDaniel Schürmann via Openmp-dev —06/02/2017 09:08:11 AM—At the moment, combined directives have their own ast representation for type-checking and code gen

What kind of impact long term does this have on you? Error handling should be unaffected by this change as long as src loc is propagated correctly. Furthermore, sema doesn’t need to be changed.

Syntax highlighting and other editor features should as well be unaffected. However, as a compromise the AST changes can also be performed in sema although this would be more complicated.

It is no urgent request as there is an easy workaround. I would be glad for feedback, but don’t trouble yourself on my account.

Actually, it is not an easy thing to do. You need to create a captured region for each independent standalone OpenMP construct, joined into a combined directives. Also, you will need to properly capture arguments of some of the clauses that are used in inner OpenMP constructs, like in something like '#pragma omp target teams distribute parallel for schedule(static, a)'. The arguments of the 'schedule' clause must be bypassed into the inner oulined region for the 'omp for' coonstruct for correct codegen.

6/2/17 4:12 PM, Schürmann, Daniel via Openmp-dev пишет:
What kind of impact long term does this have on you? Error handling should be unaffected by this change as long as src loc is propagated correctly. Furthermore, sema doesn't need to be changed.

Syntax highlighting and other editor features should as well be unaffected. However, as a compromise the AST changes can also be performed in sema although this would be more complicated.

It is no urgent request as there is an easy workaround. I would be glad for feedback, but don’t trouble yourself on my account.

Thank you for your feedback, Arpith.

I see that the nvptx codegen relies in parts on combined directives and understand the possible difficulties.

However, codegen is already able to do this with the target and teams directive. Remember that the split directives are immediate child nodes in the AST.

But you are right, this change can only be made after changes to codegen.

In my opinion the code generation should in general be able to handle both cases equivally, combined directives and the separated version.

The reason is that the user would expect the same outcome/performance from both, e.g.

#pragma omp target parallel for

as well as

#pragma omp target

#pragma omp parallel

#pragma omp for

like the spec states. (The only exception seem to be the distribute- combined directives.

Kind regards,

Daniel

graycol.gif

> However, codegen is already able to do this with the target and
> teams directive.

That is not an equivalent example.

The spec guarantees that there can be no user code between the target and the teams directive. This is not the case with the other combined directives.

The reason is that the user would expect the same outcome/performance from both,

The user should not expect the same performance.

Consider, if you write

#pragma omp parallel for

for (…)

and expand it as described by the standard, you have (where the implicit barrier at the end of omp for is made visible)

#pragma omp parallel

{

pragma omp for nowait

for (…)

pragma omp barrier

}

That barrier is unnecessary since it immediately precedes a join which is itself a barrier (the mast thread cannot leave until all threads arrive).

Therefore, the code that is notmally generated elides the unnecessary barrier and expands into

#pragma omp parallel

{

pragma omp for nowait

for (…)

}

Thus it can easily be the case that omp parallel do/for is faster than omp parallel + omp do/for.

(Of course you can fix this if you expand to parallel; do/for nowait…)

graycol.gif

Thank you all for your feedback and suggestions!

I would like to update my proposal while taking your considerations into account.

Also, I hope it is okay to answer in one mail instead of spread out discussions.

Briefly again the motivation:

  • some combined constructs are unhandled in the code generation.

  • codegen is very cumbersome to match all directive combinations.

  • combined constructs and separate nested constructs have potentially different performance characteristics.

Section 2.11 of the specification about Combined Constructs states:

The semantics of the combined constructs are identical to that of explicitly specifying

the first construct containing one instance of the second construct and no other statements.

To match this semantic rule, the idea is to expand these combined constructs already in the AST construction. This enables unimplemented combined constructs to use the already implemented code generation. Simultaneously, it provides same performance for combined constructs as separate ones.

After reconsidering some implications, it seems easier to leave parsing and type-checking as is and do the expansion in the AST construction (Sema::ActOnOpenMPxyzDirective()).

This way, the AST should look exactly the same whether the code contains combined constructs or not. The issue of performance regressions due to losing information about the close nesting should be solvable by flags in cases where this is really necessary. On the upside, it should be possible to derive the close nesting information if the constructs are previously not combined.

Now, I would like to reply to some of the points raised:

C Bergström:

I’m not sure the error handling on a parsing issue would cascade like you expect.

This updated proposal is taking this into account by delaying the expansion to the AST construction.

Alexey Bataev:

Also, you will need to properly capture arguments of some of the clauses that are used in inner OpenMP constructs.

Although I was more concerned about clauses related to the outer constructs, this is the main reason to better not do the expansion in the parsing phase. In Sema, all clauses are parsed and available. The clauses can be added to either both constructs or have to be splitted. I’m not sure if ‘wrong’ clauses would do any harm later (e.g. a num_teams clause added to a target construct).

Jim:

Thus it can easily be the case that omp parallel do/for is faster than omp parallel + omp do/for.

This is another good motivation for this proposal as I think, it is but should not be the case.

Btw, thank you for this very good example and provided solution. Question is, if we can resolve all combined constructs that easily.

Arpith:

The spec guarantees that there can be no user code between the target and the teams directive. This is not the case with the other combined directives.

I was a little bit unspecific in my response. I meant that a close nesting, if present, can also be derived. Might be that this is easier for target teams combination, but we already use the nesting information for typechecking.

I know I’m proposing a not-so-small rework, but I think the benefit could be a cleaner implementation of the spec. As it is no urgent request, we could also slowly work in this direction, e.g. starting only with combined directives which remain working the same or are broken anyway.

Thanks again for taking the time!

Best regards,

Daniel

Jim:

Thus it can easily be the case that omp parallel do/for is faster than omp parallel + omp do/for.

This is another good motivation for this proposal as I think, it is but should not be the case.

Btw, thank you for this very good example and provided solution. Question is, if we can resolve all combined constructs that easily.

I’m not quite sure what you’re saying here; are you saying that there should be an unnecessary barrier in the omp parallel do/for ?

If so I disagree.

Or are you saying that the compiler should optimise omp parallel; {omp do/for} to remove the unnecessary barrier?

In which case I agree.

Like many standards, OpenMP is all predicated by “as if”, so the standard lays down the user-visible behaviour, and any implementation which provides that is fine. The unnecessary barriers implied by the simple transformation of omp parallel do/for => omp parallel; {omp do/for} are not user visible and can be removed by the implementation.

You may choose to note, in particular, that there is language in TR4 that makes it clear that the OMPT profiling interface cannot be used to check whether this unnecessary barrier is present. In other words optimizations that are not visible to user-code are not outlawed because you can see them by using the OMPT profiling interfaces.

I’m not quite sure what you’re saying here; are you saying that there should be an unnecessary barrier in the omp parallel do/for ?

If so I disagree.

Or are you saying that the compiler should optimise omp parallel; {omp do/for} to remove the unnecessary barrier?

In which case I agree.

I meant the latter. Maybe “semantic meaning” isn’t the right term as optimizations preserve semantics anyway. However, if we merge these two cases in the AST construction and optimize away the unnecessary barrier, we gain easier Codegen with same performance for both cases:

  • decompose “omp parallel for” into “omp parallel; {omp do/for}”
  • check for closely nested omp for (we could also do this more generic I think)
  • in Codegen add barrier only if no omp for is closely nested

I think this could be applicable to more if not all combined directives.

Kind regards
Daniel

James is really clear on the point of “AS-IF” - The compiler can do anything as long as the user can’t observe a difference in the behavior. The standard can say add a barrier, but if the compiler can make the code visibly act the same with our without a barrier that’s all that matters.

I think we are all in agreement.

Daniel is just pointing out that the compiler should be optimizing the case where the user implied the unneeded barrier (by writing omp parallel {omp for/do {}}), and that if it does that down the line, it’s safe for the front-end to do the omp parallel for => omp parallel { omp for {} } transformation.

– Jim

Jim Cownie james.h.cownie@intel.com
SSG/DPD/TCAR (Technical Computing, Analyzers, and Runtimes)

Tel: +44 117 9071438