[RFC] IR-level Region Annotations

(d) Add a small number of LLVM intrinsics for region or loop annotations,
    represent the directive/clause names using metadata and the remaining
    information using arguments.

Here we're proposing (d),

I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level. Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.
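
To make (d) concrete, here is a minimal sketch of what such annotations could look like in the IR. The intrinsic names (llvm.region.entry / llvm.region.exit), signatures, and tag strings are purely illustrative assumptions, not existing LLVM intrinsics; the point is that directive/clause names travel as metadata strings, values travel as ordinary arguments, and the annotated code stays inline:

    ; Hypothetical intrinsics; names, signatures, and tags are illustrative only.
    declare token @llvm.region.entry(...)
    declare void @llvm.region.exit(...)

    define void @foo(i32* %a, i32 %n, i32 %nt) {
    entry:
      ; e.g., "#pragma omp parallel for num_threads(nt)" might become:
      %r = call token (...) @llvm.region.entry(metadata !"omp.parallel.for", metadata !"omp.num_threads", i32 %nt)
      ; ... the loop body remains inline here, visible to ordinary passes ...
      call void (...) @llvm.region.exit(token %r, metadata !"omp.parallel.for")
      ret void
    }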

However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP. Two near-term examples I can think of are C++17 and Cilk (or Cilk++). I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC. Many of the *parallel* optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).

Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible. A number of parallel constructs are common to many, perhaps most, of these parallel languages: parallel loops, barriers, associative and non-associative reductions, loop scheduling specifications, affinity specifications, etc. It seems to me that we could define a core set of parallel constructs that captures *most* of the common requirements of these languages. How this is encoded is really a separate issue, e.g., perhaps(*) we could use approach (d). We'd essentially be doing what you're proposing for OpenMP, but with more language-neutral structure and semantics for the parallel constructs. The corresponding language features would be lowered directly to these constructs. The features that are not covered could use the per-language encodings you're describing, but with a good design there should be relatively few of them. The overall design complexity and the impact on the existing passes should be no greater, and potentially less, than what you've proposed.
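
As a sketch of what such a language-neutral core might look like, reusing the illustrative region intrinsics from above and inventing neutral tag names purely for this example, different front-ends could lower their parallel loops directly to something like:

    ; Invented, language-neutral tags (illustrative only):
    ;   !"llvm.parallel.loop"      parallel loop over an iteration space
    ;   !"llvm.parallel.barrier"   barrier across the enclosing region
    ;   !"llvm.parallel.reduce"    associative or non-associative reduction
    ;   !"llvm.parallel.schedule"  loop scheduling specification
    ;   !"llvm.parallel.affinity"  affinity/placement specification
    %r = call token (...) @llvm.region.entry(metadata !"llvm.parallel.loop", metadata !"llvm.parallel.schedule", metadata !"static")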

(*) Or possibly some combination of (b) and (d). In particular, after thinking about this quite a bit recently, I'm not convinced that flexible parallel control flow, e.g., task creation and joins, is best encoded as intrinsics. Adding a very small number of first-class instructions for these constructs may achieve the goal more effectively. But I'd be happy to be convinced otherwise. In any case, that's a detail we can discuss separately; it doesn't change my main point that we should keep most of the parallel IR as language-neutral as possible.

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org

(d) Add a small number of LLVM intrinsics for region or loop annotations,
    represent the directive/clause names using metadata and the remaining
    information using arguments.
Here we're proposing (d),

I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level. Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.

However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP. Two near-term examples I can think of are C++17 and Cilk (or Cilk++). I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC. Many of the *parallel* optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).

Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible.

We obviously need to work out the details here, but one motivation is to allow the same facility to represent both concepts common to many programming models and programming-model-specific concepts. Also, I'd like to be able to transition from programming-model-specific representations (where I imagine most things will start) toward abstracted concepts. The goal is to retain programming-model-specific semantics while allowing the creation of transformations and analyses that deal with abstract concepts. One way we might accomplish this is by using both, like this:

1. A frontend generates region annotations. A frontend like Clang will generate (mostly) programming-model-specific region annotations. Frontends for other languages might directly use the abstract concepts for their region annotations.

2. During optimization, a transformation pass analyzes programming-model-specific region annotations and, if legal, transforms them into abstract-concept annotations. It might, for example, rewrite:

   !"omp.barrier" -> !"llvm.parallel.barrier", !"openmp"

The barrier now becomes a general concept that transformations can understand (and use, for example, to eliminate redundant barriers). It is tagged with !"openmp" so that in the end, should it survive, the concept will be lowered using OpenMP.

3. During optimization, transformations optimize abstract-concept annotations (e.g., eliminate redundant barriers, fuse parallel regions, etc.).

4. Later in the pipeline, programming-model-specific code lowers the annotations for each programming model into concrete IR (e.g., runtime function calls). For abstract concepts without a specific programming-model tag, some default programming model is selected.

The programming-model-specific to abstract-concept translation in (2) can sometimes be done on a syntactic basis alone (we already do this, in fact, for atomics), but sometimes will require analysis that can be done only after inlining/IPA (to make sure, for example, that the parallel region does not contain certain classes of runtime-library calls). Plus, this allows the translation logic to be shared easily by different frontends.
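
To illustrate the rewrite in step (2), again with made-up names (llvm.region.mark here stands for a standalone annotation in the same hypothetical scheme as the region intrinsics sketched earlier), the pass would keep the annotation in place and only change its tags:

    ; Before: OpenMP-specific, opaque to generic parallel optimizations.
    call void (...) @llvm.region.mark(metadata !"omp.barrier")

    ; After: a generic barrier that generic transformations can reason about,
    ; still tagged with its origin so late lowering can target the OpenMP runtime.
    call void (...) @llvm.region.mark(metadata !"llvm.parallel.barrier", metadata !"openmp")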

Thoughts?

  -Hal

(d) Add a small number of LLVM intrinsics for region or loop annotations,
    represent the directive/clause names using metadata and the remaining
    information using arguments.
Here we're proposing (d),

I think this would serve the goal of communicating source-level directives and annotations down to LLVM passes and back-ends, while deferring inlining and allowing optimizations and code-generation for parallel code to happen more effectively at the IR level. Essentially, you’re tunneling language-specific information through the front-end, incorporating the new information in the IR using minimally invasive mechanisms, but the contents (captured in the metadata strings) are entirely language-specific, both in detailed syntax and semantics.

However, as you also said, a parallel IR needs to be able to support a wide range of parallel languages, besides OpenMP. Two near-term examples I can think of are C++17 and Cilk (or Cilk++). I can imagine adding support for other existing languages like Halide and perhaps the PGAS languages like UPC. Many of the *parallel* optimizations and code-generation tasks are likely to be common for these different languages (e.g., optimizations of reductions, parallel loop fusion, or GPU code generation).

Fundamentally, I think we should keep most of the parallel IR extensions as language-neutral as possible.

We obviously need to work out the details here, but one motivation is to allow the same facility to represent both concepts common to many programming models and programming-model-specific concepts.

Yes, I agree. There will inevitably be programming-model-specific features that are not supported by any generic abstraction. The hope is that they will be few, especially for the "smaller" languages.

Also, I'd like to be able to transition from programming-model-specific representations (where I imagine most things will start) toward abstracted concepts. The goal is to retain programming-model-specific semantics while allowing the creation of transformations and analyses that deal with abstract concepts. One way we might accomplish this is by using both, like this:

1. A frontend generates region annotations. A frontend like Clang will generate (mostly) programming-model-specific region annotations. Frontends for other languages might directly use the abstract concepts for their region annotations.

2. During optimization, a transformation pass analyzes programming-model-specific region annotations and, if legal, transforms them into abstract-concept annotations. It might, for example, rewrite:

!"omp.barrier" -> !"llvm.parallel.barrier", !"openmp"

The barrier now becomes a general concept that transformations can understand (and use, for example, to eliminate redundant barriers). It is tagged with !"openmp" so that in the end, should it survive, the concept will be lowered using OpenMP.

Yes, this is exactly what I have in mind too. We can discuss the details (what particular front-ends should generate directly; what back-end components can be shared even when doing programming-model-specific code generation), but this flow has many advantages.

Some specific goals I’d like to see are:
+ Have as many optimizations and back-end components as possible be driven by the programming-model-agnostic information and shared among multiple languages, i.e., minimize the need for passes that use the programming-model-specific information.
+ Allow concepts from different languages to be mixed and matched, to maximize performance, e.g., a work-stealing scheduler used with an OpenMP parallel loop; a static schedule used with a Cilk_for parallel loop; a SIMD directive and hints used with a Cilk_for loop; etc.
+ In the same vein, allow optimization and code generation passes to leverage available features of run-time systems and target hardware, to maximize performance in similar ways.
+ (This is not a separate goal, but rather a strategy to enable the previous two goals.) Use the annotations to decouple front-ends and upstream auto-parallelization passes from optimizations and code generation, so that the optimizations and code generation phases "don’t care" what source language(s) or other mechanisms were used to parallelize code.
+ Allow a flexible parallel run-time system that can span multiple hardware targets, e.g., a pipeline that runs some pipeline stages on a shared memory multicore host and some on one or more GPUs.

I didn't explicitly spell out other goals, like the ones in your original email, especially making sure that standard optimization passes (constant propagation, redundancy elimination, strength reduction, etc.) continue to be as effective as possible while minimizing the need to rewrite them to respect parallel semantics. For example, avoiding outlining in the front-end is likely to be an important requirement.
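
To illustrate the outlining point, here is a rough sketch (with the same illustrative intrinsic names as earlier in the thread and a made-up runtime entry point) contrasting a front-end that outlines the parallel body early with one that keeps it inline under an annotation:

    ; Outlined in the front-end: the body is hidden behind an opaque runtime
    ; call, so constant propagation, redundancy elimination, etc. can no
    ; longer see it from the caller.
    call void @__rt_fork(void (i32*)* @parallel_body, i32* %x)

    ; Annotated instead: the body stays in the enclosing function, and the
    ; standard scalar passes keep working across and inside the region.
    %r = call token (...) @llvm.region.entry(metadata !"llvm.parallel.loop")
    %v = load i32, i32* %x        ; still visible to ordinary passes
    ; ... loop body ...
    call void (...) @llvm.region.exit(token %r)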

3. During optimization, transformations optimize abstract-concept annotations (e.g., eliminate redundant barriers, fuse parallel regions, etc.).

4. Later in the pipeline, programming-model-specific code lowers the annotations for each programming model into concrete IR (e.g., runtime function calls). For abstract concepts without a specific programming-model tag, some default programming model is selected.

For code with programming-model-specific tags, it may still be possible to map it onto a more general run-time; see the examples above.

The programming-model-specific to abstract-concept translation in (2) can sometimes be done on a syntactic basis alone (we already do this, in fact, for atomics), but sometimes will require analysis that can be done only after inlining/IPA (to make sure, for example, that the parallel region does not contain certain classes of runtime-library calls). Plus, this allows the translation logic to be shared easily by different frontends.

Thoughts?

I generally agree. My main additional point (perhaps also what you had in mind) is that we should aim to maximize flexibility in the optimization and code-generation passes, while minimizing the dependence on programming-model-specific semantics.

--Vikram

// Vikram S. Adve
// Professor, Department of Computer Science
// University of Illinois at Urbana-Champaign
// vadve@illinois.edu
// http://llvm.org

Some specific goals I’d like to see are:
+ Have as many optimizations and back-end components as possible be driven by the programming-model-agnostic information and shared among multiple languages, i.e., minimize the need for passes that use the programming-model-specific information.
+ Allow concepts from different languages to be mixed and matched, to maximize performance, e.g., a work-stealing scheduler used with an OpenMP parallel loop; a static schedule used with a Cilk_for parallel loop; a SIMD directive and hints used with a Cilk_for loop; etc.
+ In the same vein, allow optimization and code generation passes to leverage available features of run-time systems and target hardware, to maximize performance in similar ways.

Yes, these goals are aligned with major features we have supported in the Intel compilers, and with the LLVM IR and compiler development we are targeting.

+ Allow a flexible parallel run-time system that can span multiple hardware targets, e.g., a pipeline that runs some pipeline stages on a shared memory multicore host and some on one or more GPUs.

At this point, it sounds like a very good research topic / project.

Xinmin