RFC: User-directed code transformations with #pragma clang transform

Dear Clang community,

I am working on adding more powerful compiler optimization directives
that allow the programmer to steer optimization decisions in the
mid-end. I most recently presented this work at the LLVM Dev Meeting [2]
and I am looking for reviewers: https://reviews.llvm.org/D69088.

Hi Michael, thanks for this RFC! I just have a few questions and
comments to start off.

Are these pragmas meant to be advisory or prescriptive (when legal)?

From your description and motivation I assume the latter but I wanted to
double-check.

* Hints are emitted in form of metadata (MDNode) that can be dropped
by mid-end optimizers

Below you state that metadata will be used to encode the transformations
and order. Doesn't this suffer from the same problem?

The selection is currently limited to the passes LLVM currently
supports. I am working on more transformations that currently are only
picked-up by Polly. The largest difference to loop hints is that it
allows to specify in which order the transformations are applied,
which is ignored by clang's current LoopHint attribute. That is, the
following for reverses the loop, then unrolls it.

    #pragma clang transform unroll partial(2)
    #pragma clang transform reverse
    for (int i = 0; i < 128; i+=1)
      body(i);

This seems unintuitive to me. I would expect this to unroll first and
then reverse. I get the "inner-to-outer" ordering you present here, but
I wonder if it will be too easy for users to get something unexpected.

Will it be possible to list multiple transformations with one directive?

     #pragma clang transform reverse, unroll partial(2)

If so, then the "inner-to-outer" ordering seems even more problematic:

     #pragma clang transform reverse, unroll partial(2)
     #pragma clang transform distribute, vectorize

In what order are the transformations applied?

Furthermore, I intend to implement assigning identifiers to loop to
reference them in followup transformations (e.g. tile a loop,
parallelize the generated outer loop and vectorize the inner),

Do you have an example of this idea?

                   -David

Are these pragmas meant to be advisory or prescriptive (when legal)?
From your description and motivation I assume the latter but I wanted to
double-check.

For loop hint style directives, they are advisory. However, I would
expect the compiler to be able to emit a diagnostic (not an error) if
applying one fails, as #pragma clang loop vectorize(enable) already
does.
Determining the safety of a transformation depends on compiler
capability (which may change between versions of the compiler), hence
these directives cannot be prescriptive.

For OpenMP style directives (and possibly assume_safety), they should
be prescriptive.

> * Hints are emitted in form of metadata (MDNode) that can be dropped
> by mid-end optimizers

Below you state that metadata will be used to encode the transformations
and order. Doesn't this suffer from the same problem?

I did not claim to be able to solve every problem ;-)
We should work towards removing the places where the metadata can be
lost (e.g. https://reviews.llvm.org/D53876 and
https://reviews.llvm.org/D66892), but it will always be best effort by
the definition of MDNode.

However, emitting already transformed IR using the OpenMPIRBuilder
would eliminate this source of transformations being forgotten.

> The selection is currently limited to the passes LLVM currently
> supports. I am working on more transformations that currently are only
> picked-up by Polly. The largest difference to loop hints is that it
> allows to specify in which order the transformations are applied,
> which is ignored by clang's current LoopHint attribute. That is, the
> following for reverses the loop, then unrolls it.
>
> #pragma clang transform unroll partial(2)
> #pragma clang transform reverse
> for (int i = 0; i < 128; i+=1)
> body(i);

This seems unintuitive to me. I would expect this to unroll first and
then reverse. I get the "inner-to-outer" ordering you present here, but
I wonder if it will be too easy for users to get something unexpected.

I find this order more intuitive, and it matches the OpenMP semantics. For instance,

    #pragma omp parallel for
    for (int i = 0; i < 128; i+=1)

has the same semantics as

    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < 128; i+=1)

which is the same as

    #pragma omp parallel
    {
       #pragma omp for
       for (int i = 0; i < 128; i+=1)
       ..
    }
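
To make the difference between the two readings concrete, here is a
hand-written C sketch of what each order would produce for the
reverse/unroll example. It is only an illustration under stated
assumptions: the trip count is shrunk to 8, body() is a hypothetical
function that records the iteration order, and "reversing an unrolled
loop" is assumed to reverse the loop's iterations while leaving the
unrolled body as it is.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 8, UNROLL = 2 };   /* small N so the traces are easy to read */

static int trace[N];          /* order in which iterations executed */
static size_t trace_len;

/* Hypothetical loop body: records the index it was called with. */
static void body(int i) {
    assert(trace_len < N);
    trace[trace_len++] = i;
}

/* Reading used in the RFC: the pragma closest to the loop applies first,
   so the loop is reversed and the reversed loop is then unrolled by 2. */
static void reverse_then_unroll(void) {
    trace_len = 0;
    for (int i = N - 1; i >= 1; i -= UNROLL) {
        body(i);
        body(i - 1);
    }
}

/* Alternative reading: unroll by 2 first, then reverse the unrolled loop.
   The two copies of the body keep their original relative order. */
static void unroll_then_reverse(void) {
    trace_len = 0;
    for (int i = N - UNROLL; i >= 0; i -= UNROLL) {
        body(i);
        body(i + 1);
    }
}
```

With N = 8, reverse_then_unroll() visits 7 6 5 4 3 2 1 0, while
unroll_then_reverse() visits 6 7 4 5 2 3 0 1, so the order of the
directives is observable whenever body() has side effects.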

I think the difference in interpretation comes from either seeing the pragmas as
1. a collection of attributes to the next statement (like
AttributedStmt/LoopHint)

or

2. as a statement taking another statement as argument (like
OMPExecutableDirective)

In trying to unify both implementations, I will have to use the latter.
It also avoids the problems you mention below. Moreover, for
transformations that do not apply to loops, this interpretation makes
it clear that the directive consumes a statement with its own scope:

    // Compared to "#pragma omp target", the compiler has to do a
    // legality analysis.
    #pragma clang transform offload
    {
      do_something();
      do_something_else();
    }

Will it be possible to list multiple transformations with one directive?

No, and that is one of the reasons I decided against reusing the
#pragma clang loop syntax.

> Furthermore, I intend to implement assigning identifiers to loop to
> reference them in followup transformations (e.g. tile a loop,
> parallelize the generated outer loop and vectorize the inner),

Do you have an example of this idea?

The same example as code:

    #pragma clang transform vectorize on(innername) width(4)
    #pragma clang transform parallelize_thread on(outername)

    #pragma clang transform tile sizes(32) floor(id(outername)) tile(id(innername))
    for (int i = 0; i < 128; i+=1)

Without ids, by nesting further transformations inside the clauses, it
could also be written as

    // parallelize_thread applies to the outermost loop of the previous
    // transformation
    #pragma clang transform parallelize_thread
    #pragma clang transform tile sizes(32) tile(vectorize width(4))
    for (int i = 0; i < 128; i+=1)
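
For readers less familiar with tiling, the loop nest that a
tile sizes(32) directive is meant to produce can be sketched by hand
roughly as follows. This is an illustration, not compiler output: the
summation body, the function names and the floor_iterations counter are
hypothetical, and the trip count is assumed to be a multiple of the tile
size so that no remainder loop is needed.

```c
#include <assert.h>

enum { N = 128, TILE = 32 };

/* Helper for experiments: a[i] = i. */
static void fill_iota(int *a) {
    for (int i = 0; i < N; i += 1)
        a[i] = i;
}

/* Reference: the untransformed loop. */
static long sum_plain(const int *a) {
    long s = 0;
    for (int i = 0; i < N; i += 1)
        s += a[i];
    return s;
}

/* Hand-tiled version; floor_iterations counts trips of the outer loop. */
static long sum_tiled(const int *a, int *floor_iterations) {
    assert(N % TILE == 0);   /* assumption: no remainder tile */
    long s = 0;
    *floor_iterations = 0;
    /* "floor" loop: steps from tile to tile; this is the loop that
       parallelize_thread would target in the example above. */
    for (int ii = 0; ii < N; ii += TILE) {
        *floor_iterations += 1;
        /* "tile" loop: walks the elements of one tile; this is the loop
           that vectorize width(4) would target. */
        for (int i = ii; i < ii + TILE; i += 1)
            s += a[i];
    }
    return s;
}
```

Both functions return the same sum; the tiled version runs its outer loop
N / TILE = 4 times, which is the granularity at which threads would be
assigned work.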

Ids are required when a follow-up transformation applies to multiple
loops, and they are handy for grouping the transformations for a
specific target together.

    #ifdef OPTIMIZE_FOR_COARSE_GRAIN
      #pragma clang transform interchange on(j,thefloor) permutation(thefloor,j)
    #endif

    #pragma clang transform id(j)
    for (int j = 0; j < n; j+=1) {
      #pragma clang transform tile sizes(32) floor(id(thefloor))
      for (int i = 0; i < 128; i+=1)

Michael

Hi Michael,

Thanks for the RFC! While I see that the identifier is needed to be able to apply transform directives to new loops (which is an important capability), I’m wondering if the identifier is meant to generically separate the source location from the optimization directive. Will I be able to add identifiers to my loops and have the optimization directives live elsewhere? If that is the case, then the ordering of the transformations could be difficult to understand.

Thanks again,
Brian

The identifiers also disambiguate the transformation order because a
loop can only be 'consumed' by a single transformation. Any
transformation that should apply after the first transformation must
use a different loop name. That is,

#pragma clang transform unroll on(myloop)
#pragma clang transform vectorize on(myloop)
...
#pragma clang transform id(myloop)
for (int i = 0; i < n; ++i)
  Body(i);

is a compiler error since that transformation order is ambiguous.
Instead, only one of the transformations may apply to myloop, and the
output of that transformation must be given a new name:

#pragma clang transform unroll on(mysimdloop)
#pragma clang transform vectorize on(myloop) apply(vectorized:id(mysimdloop))
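
The "consumed once" rule can be modelled with a small table of loop
names. The following C sketch is a toy model only and makes no claim
about how Clang implements the check: loop_table, define_loop and
apply_transform are hypothetical names, and the directives are processed
in dependency order rather than in source order.

```c
#include <assert.h>
#include <string.h>

enum { MAX_LOOPS = 8 };

/* Toy symbol table: every named loop may be consumed at most once. */
struct loop_table {
    const char *name[MAX_LOOPS];
    int consumed[MAX_LOOPS];
    int count;
};

/* Register a loop name, as id(name) or apply(...:id(name)) would.
   Returns 0 on success, -1 if the table is full. */
static int define_loop(struct loop_table *t, const char *name) {
    assert(name != NULL);
    if (t->count == MAX_LOOPS)
        return -1;
    t->name[t->count] = name;
    t->consumed[t->count] = 0;
    t->count += 1;
    return 0;
}

/* Model a directive with on(name). Returns -1 if the loop is unknown or
   was already consumed. If result_name is not NULL, the loop produced by
   the transformation is registered under that new name. */
static int apply_transform(struct loop_table *t, const char *name,
                           const char *result_name) {
    for (int i = 0; i < t->count; i += 1) {
        if (strcmp(t->name[i], name) != 0)
            continue;
        if (t->consumed[i])
            return -1;               /* ambiguous order: reject */
        t->consumed[i] = 1;
        return result_name ? define_loop(t, result_name) : 0;
    }
    return -1;                       /* no loop with that name */
}
```

In this model, vectorizing myloop and naming the result mysimdloop
succeeds, a second transformation on(myloop) is rejected, and unrolling
mysimdloop succeeds, which mirrors the two examples above.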

The order in which the pragmas appear in the source code is up to the
programmer. I hope they will arrange them in a way that improves
understandability, but I think misinterpreting the transformation order
is a lesser concern.

Michael