loop pragmas

Hi,

I've been looking at several loop pragma discussion threads, and every now and then people note that it is difficult to ensure that metadata (that would represent the information from the pragma) is not destroyed during optimizations. I understand that in theory CFG transformations might destroy loops, but in practice I notice that even after heavy optimization with opt, and after transformations such as if-conversion in llc, the IR seems to remember where there were loops in the code. At least that is what I make of information such as

BB#28: derived from LLVM BB %for.body47.i
  ...
BB#43: derived from LLVM BB %while.body.i.preheader.i
  ...
BB#44: derived from LLVM BB %while.body.i.i
  ...

that gets printed when I run llc with -print-after-all. So apparently, at least some loop information is passed to the compiler backend successfully. As are line numbers, by the way, and pragmas also have line numbers.

So I wonder, what is the practical problem with passing information about loop pragmas as metadata on basic blocks?

Thanks,

Bjorn De Sutter
Computer Systems Lab
Ghent University

The problem is that "BB derived from %for.loop.preheader" doesn't really mean that the BB is still a preheader.

-Krzysztof

There are two more points:

o Annotations may become invalid

Even if the loop structure is unchanged, annotations may become invalid. A loop may, for example, no longer be parallel, as loop-invariant code motion can introduce additional dependences.
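
To make the LICM case concrete, here is a fragment (the !parallel kind is hypothetical, and the surrounding function is elided):

for.body:
  %i = phi i64 [ 0, %preheader ], [ %i.next, %latch ]
  %ai = getelementptr i32* %a, i64 %i
  store i32 1, i32* %ai              ; disjoint per iteration: fine
  %cond = icmp eq i64 %i, 0
  br i1 %cond, label %set.flag, label %latch

set.flag:
  store i32 1, i32* %flag            ; shared cell, but idempotent, so
  br label %latch                    ; the annotation calls it benign

latch:
  %i.next = add i64 %i, 1
  %done = icmp eq i64 %i.next, %n
  br i1 %done, label %exit, label %for.body, !parallel !0

!0 = metadata !{metadata !"parallel"}

If LICM later promotes %flag to a register, the promoted value gets merged by a phi node and carried from one iteration to the next, with a single store after the loop. That phi chain is a genuine loop-carried dependence, so a consumer that still trusts the stale !parallel marker is now wrong.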

o Blocks future optimizations

At the moment LLVM performs few loop optimizations. Hence, the original structure is not changed too much. However, future optimizations within core LLVM may change this behavior. Relying on the absence of certain transformations does not seem to be a future-proof concept.

There are probably ways to carefully design annotations on loops, but it is a difficult problem. I propose that we discuss the kind of annotation you want to add, and we can then try to find a proper way to represent it.

For annotations like '#pragma ivdep', it may, for example, be better not to mark a certain loop, but to directly mark the memory loads/stores within that loop as not carrying dependences of a certain kind.
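
A minimal sketch of what that could look like (the !ivdep kind and its payload are made up for illustration):

for.body:
  ...
  %v = load i32* %src.addr, !ivdep !0      ; the fact rides on the
  store i32 %v, i32* %dst.addr, !ivdep !0  ; accesses, not on the loop
  ...

!0 = metadata !{metadata !"no_loop_carried_dependence"}

One advantage: if the loop is later unrolled, fused, or otherwise rewritten, each surviving access still carries its own marker, so the information does not depend on the loop structure staying recognizable.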

Cheers
Tobi

I'm thinking of this in terms of parallelization directives. The optimizations that rely on such annotations would need to be done as early as possible, before any optimization that could invalidate them. If the annotation can become false, you are right---it's probably not a good idea to use metadata as the medium. Other types of annotations that are "harmless" are probably good to have, for example "unroll-by" (assuming that this is a suggestion to the compiler, not an order).
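
For a hint like that, one conceivable encoding is metadata on the loop's backedge branch (a sketch; the !loop.hint kind and its payload are hypothetical):

  br i1 %exitcond, label %for.end, label %for.body, !loop.hint !0

!0 = metadata !{metadata !"unroll-by", i32 4}

Because it is only a suggestion, a consumer that finds the hint stale or implausible can simply drop it; nothing miscompiles if the metadata disappears.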

-Krzysztof

>> o Blocks future optimizations
>>
>> At the moment LLVM performs few loop optimizations. Hence, the original structure is not changed too much.
>
> I'm thinking of this in terms of parallelization directives. The optimizations that rely on such annotations would need to be done as early as possible, before any optimization that could invalidate them. If the annotation can become false, you are right---it's probably not a good idea to use metadata as the medium.

If we use metadata to model annotations, we need to ensure that it is either correct or, in case a transformation cannot guarantee the correctness of the metadata, that it is removed.

The next point is that we should choose a metadata representation that is not removed by the most trivial transformations.
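
One representational detail: plain metadata tuples are uniqued by content, so two loops annotated with identical payloads would share a single node, and updating the annotation of one loop would silently affect the other. A self-referencing node avoids that (sketch; the !loopid kind is hypothetical):

  br i1 %exitcond, label %for.end, label %for.body, !loopid !1

!1 = metadata !{metadata !1}    ; refers to itself, so every loop gets
                                ; its own distinct, trackable node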

> Other types of annotations that are "harmless" are probably good to have, for example "unroll-by" (assuming that this is a suggestion to the compiler, not an order).

To my knowledge, we avoid letting the user 'tune' the compiler. Manual tuning may be good for a certain piece of hardware, but will have negative effects on other platforms.

Instead of providing facilities to tune for the hardware, we should understand why LLVM does not choose the right unrolling factor. Maybe there is additional information that can help LLVM derive it.

Cheers
Tobi

>> I'm thinking of this in terms of parallelization directives. The optimizations that rely on such annotations would need to be done as early as possible, before any optimization that could invalidate them. If the annotation can become false, you are right---it's probably not a good idea to use metadata as the medium.
>
> If we use metadata to model annotations, we need to ensure that it is either correct or, in case a transformation cannot guarantee the correctness of the metadata, that it is removed.

Yes, that is not hard to accomplish.

>> Other types of annotations that are "harmless" are probably good to have, for example "unroll-by" (assuming that this is a suggestion to the compiler, not an order).
>
> To my knowledge, we avoid letting the user 'tune' the compiler. Manual tuning may be good for a certain piece of hardware, but will have negative effects on other platforms.

A lot of ISV code is meant to run on a particular platform, or on a small set of target platforms. Such code is often hand-tuned, and the tuning directives will be different for different targets. I see no reason why we shouldn't allow that. As a matter of fact, not allowing it will make us less competitive.

> Instead of providing facilities to tune for the hardware, we should understand why LLVM does not choose the right unrolling factor.

Because, in general, it's impossible. A user's hints are always welcome.

-Krzysztof

>> If we use metadata to model annotations, we need to ensure that it is either correct or, in case a transformation cannot guarantee the correctness of the metadata, that it is removed.
>
> Yes, that is not hard to accomplish.

Actually, I believe this is _not_ easy. Though it only makes sense to discuss this with a specific annotation in mind.

>> To my knowledge, we avoid letting the user 'tune' the compiler. Manual tuning may be good for a certain piece of hardware, but will have negative effects on other platforms.
>
> A lot of ISV code is meant to run on a particular platform, or on a small set of target platforms. Such code is often hand-tuned, and the tuning directives will be different for different targets. I see no reason why we shouldn't allow that. As a matter of fact, not allowing it will make us less competitive.

This was the impression I got. Feel free to show examples/patches that convince people to push this.

Cheers
Tobi

Agreed. If there is a dedicated consumer of certain annotations, then after it's done, it should remove them. This is not hard when the consumer can run early enough that the annotations are not altered by prior transformations. Now, if there are special annotations for several dedicated consumers, or the consumer has to run late, then it does become difficult.

What would be nice is if we came up with a design that allowed relatively easy validation of the annotations (something conceptually similar to error-detecting codes). The consumer would then disregard and remove the annotations that are no longer valid.

-Krzysztof

From: "Krzysztof Parzyszek" <kparzysz@codeaurora.org>
To: "Tobias Grosser" <tobias@grosser.es>
Cc: llvmdev@cs.uiuc.edu
Sent: Wednesday, November 21, 2012 12:59:48 PM
Subject: Re: [LLVMdev] loop pragmas

>>
>> Yes, that is not hard to accomplish.
>
> Actually, I believe this is _not_ easy. Though, it makes only sense
> to
> discuss this with a specific annotation in mind.

Agreed. If there is a dedicated consumer of certain annotations,
then
after it's done, it should remove the annotations. This is not hard
when it can run early enough not to have the annotations altered by
prior transformations. Now, if there are special annotations for
several dedicated consumers, or the consumer has to run late, then it
does become difficult.

What would be nice is if we came up with a design that would allow a
relatively easy validation of the annotations (something conceptually
similar to error-detecting codes). The consumer then would disregard
and remove the annotations that are no longer valid.

I proposed something that has this property for OpenMP handling a few months ago. If any of it were dropped, or things otherwise became inconsistent, the cross-referencing would make it possible to detect the problem and drop the entire parallel region. The problem with using this for OpenMP is that there are cases where dropping the entire parallel region is not a valid thing to do.
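
Roughly, the idea was something like this (hypothetical intrinsic and metadata names; a sketch, not the actual proposal text):

  call void @llvm.parallel.region.begin(), !par.region !0
  ...
  call void @llvm.parallel.region.end(), !par.region !0

!0 = metadata !{metadata !"parallel.region", i32 1}

A consumer verifies that the begin and end markers are both still present, properly nested, and reference the same !0; if any check fails, everything referring to !0 is treated as inconsistent and dropped.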

-Hal

In another compiler that I know of, the OpenMP directives were processed very early, before any other transformation that could have damaged them.

-Krzysztof

>> Other types of annotations that are "harmless" are probably good to have, for example "unroll-by" (assuming that this is a suggestion to the compiler, not an order).
>
> To my knowledge, we avoid letting the user 'tune' the compiler. Manual tuning may be good for a certain piece of hardware, but will have negative effects on other platforms.

Note that LLVM is typically the "backend" portion of a combined compiler that does the language-specific work. It would be useful for such a front-end compiler to be able, if it so desires, to annotate transformations like loop unroll factors, given that there's machinery to implement the transformation. I'd actually prefer such directives to be orders rather than suggestions, unless they specify impossible things.

In a personal capacity, I'm very interested in auto-tuning pieces of software -- think ATLAS or SPIRAL -- where the ability to empirically try various alternatives on a new hardware platform is very useful.

> Instead of providing facilities to tune for the hardware, we should understand why LLVM does not choose the right unrolling factor. Maybe there is additional information that can help LLVM derive it.

This is a laudable goal, but there is always the issue that developing an understanding of why things perform a particular way on a modern CPU/memory/chipset combination requires manpower, often more manpower than is available.

Regards,
Dave

From: "David Tweed" <david.tweed@arm.com>
To: "Tobias Grosser" <tobias@grosser.es>, "Krzysztof Parzyszek" <kparzysz@codeaurora.org>
Cc: llvmdev@cs.uiuc.edu
Sent: Thursday, November 22, 2012 3:01:07 AM
Subject: Re: [LLVMdev] loop pragmas

> Other types of annotations that are
> "harmless" are probably good to have, for example "unroll-by"
> (assuming
> that this is a suggestion to the compiler, not an order).

> To my knowledge, we are avoiding to allow the user to 'tune' the
> compiler. Manual tuning may be good for a certain piece of
> hardware, but
> will have negative effects on other platforms.

Note that LLVM is typically the "backend" compiler portion of a
combined
compiler that does language specific stuff. It would be useful to
have a way
for such a front-end compiler to, if it so desires, annotate
transformation
like loop unroll factors if there's machinery to implement the
transformation. I'd actually prefer such directives to actually be
orders
rather than suggestions, unless they specify impossible things.

I agree.

> In a personal capacity, I'm very interested in auto-tuning pieces of software -- think ATLAS or SPIRAL -- where the ability to empirically try various alternatives on a new hardware platform is very useful.

FWIW, I also have my eye on (something like) this project:
http://projects.csail.mit.edu/petabricks/

>> Instead of providing facilities to tune for the hardware, we should understand why LLVM does not choose the right unrolling factor. Maybe there is additional information that can help LLVM derive it.
>
> This is a laudable goal, but there is always the issue that developing an understanding of why things perform a particular way on a modern CPU/memory/chipset combination requires manpower, often more manpower than is available.

I agree that we should try harder than we currently do. Nevertheless, providing a user override is also important.

Full analysis turns out to be very difficult: it would require modeling instruction scheduling, accounting for characteristics of the memory hierarchy, etc. In the end, the best that can be done is profile-guided optimization to capture some of these factors, which I think requires much of the same infrastructure as providing user-accessible pragmas (and then much more).

-Hal

I am thinking about another use of annotations that fits into a longer-term vision, which centers on feeding compilers with information from higher-level tools such as precompilers.

Deciding how to map a portable piece of software onto a heterogeneous multicore processor and get the best performance for a range of widely varying architectures requires much higher-level code analysis than is possible in "standard" compilers on their relatively low-level IRs. To get portable performance and high programmer productivity, an application will need to be written in a higher-level language, like Julia, MATLAB, ..., that can deliver much of the needed information to the compilers, for example by using parallel programming patterns (see Berkeley Parlab), and that allows a compiler to choose among different implementations of algorithms and data structures (see Petabricks). Building compiler front ends, middle ends, and back ends from scratch that can use all the information available in such programs and that produce high-quality code for a range of architectures is undoable for most if not all research labs.

And it should not be necessary either: many excellent lower-level language compilers already exist, some of which support a wide range of architectures, such as LLVM (OoO CPUs, VLIWs, GPUs, CGRAs in my own backend and in Samsung's proprietary SRP backend, etc.). So for research purposes, and hopefully later also real-world development if the research results are good, ideally we would only have to develop the necessary precompiler tools that take in very high-level code and produce tuned lower-level code (C or bitcode or LLVM IR or ...) after high-level analysis and target-dependent optimisations and selections, after which LLVM takes care of the further low-level optimizations and the specific code generation for the different targets in the heterogeneous multicore.

To facilitate research in this direction and to use LLVM in such a tool flow, it is absolutely necessary that both the precompiler and a researcher doing manual experiments can steer the low-level LLVM compiler with code annotations such as loop pragmas or attributes, simply because once the code is in a lower-level form such as C or bitcode or LLVM IR, there is not enough information available in the code itself to select the best low-level transformations.

It is interesting to note that in such an approach, the precompiler would probably be trained with machine learning. If this is the case, it might also be able to learn which annotations actually influence the compiler and which do not, for example because they are destroyed before some optimization pass is executed. So the precompiler can learn to generate code tuned for specific targets, taking into account all limitations of the compiler that will do the actual code generation, including its incomplete or stupid or whatever support for pragmas and other such things...

Best,

Bjorn

From: "Bjorn De Sutter" <bjorn.desutter@elis.ugent.be>
To: llvmdev@cs.uiuc.edu
Sent: Tuesday, November 27, 2012 6:49:39 AM
Subject: Re: [LLVMdev] loop pragmas

I am thinking about another use of annotations that fits in a longer
term vision, which centers around feeding compilers with information
from higher-level tools such as precompilers.

Deciding how to map a portable piece of software to a heterogeneous
multicore processor and get the best performance for a range of
widely varying architectures, requires much higher level code
analysis than what is possible in "standard" compilers on their
relatively low-level IRs. To have portable performance and high
programmer productivity, an application will need to be written in a
higher-level language, like Julia, MATLAB, ... that can deliver much
of the needed information to the compilers, for example by using
parallel programming patterns (see Berkeley Parlab), and that allows
a compiler to choose among different implementations of algorithms
and data structures (see Petabricks). Building compiler front-ends,
middle-ends and back-ends from scratch that can use all information
available in such programs and that produce high-quality code for a
range of architectures is undoable for most if not all research
labs.

And it should also not be necessary: many excellent lower-level
language compilers already exist, some of which support a wide range
of architectures, such as LLVM (OoO CPUs, VLIWs, GPUs, CGRAs in my
own backend and in Samsung's proprietary SRP backend, etc.). So for
research purposes and hopefully later also real-world development if
the research results are good, ideally we would have to only develop
the necessary precompiler tools that take in very high-level code
and that produce tuned lower-level (C or bitcode or LLVM IR or ...)
after high-level analysis and target-dependent optimisations and
selections, after which LLVM then takes care of the further
low-level optimizations and specific code generation for the
different targets in the heterogeneous multicore.

To facilitate research in this direction and use LLVM in such a tool
flow, it is absolutely necessary that both the precompiler and the
researcher doing manual experiments can steer the low-level LLVM
compiler with code annotations such as loop pragmas or attributes,
simply because once the code is in a lower-level form such as C or
bitcode or LLVM IR, there is not enough information available in the
code itself to select the best low-level transformations.

It is interesting to note that in such an approach, the precompiler
probably would be trained with machine learning. If this is the
case, it might also be able to learn which annotations actually
influence the compiler and which do not, for example because they
are destroyed before some optimization pass is executed. So the
precompiler can learn to generate code tuned for specific targets
taking into account all limitations of the compiler that will do the
actual code generation, including its incomplete or stupid or
whatever support for pragmas and other such things...

+1

This still leaves the question of exactly how to attach metadata to loops, etc.

-Hal

What about the following:

1) During parsing, pragmas, attributes, etc. are collected and stored in some kind of annotation container. Every occurrence of a pragma, attribute, etc. ends up as a pseudo instruction in the IR that references one annotation in the container. Per annotation, generic information is stored, such as its type, potential parameters, line number information, and so on (see the sketch after point 3).

2) Additional passes can extend the information tracked for each annotation: for example, for loop pragmas, some small pass could compute the nesting depth of each pragma. For data-dependence-related loop parameters, the loop annotation can be converted into per-instruction metadata, etc. Additional pseudo instructions can also be inserted. Similar passes can check which annotations are still meaningful, or aggregate information from multiple annotations. This might be useful, for example, to omit loop annotations on loops that were unrolled completely, or to aggregate annotations when loops are fused (by Polly, e.g.).

3) Each and every code transformation pass is responsible for either maintaining the annotations or invalidating them. The standard behavior would be to invalidate all annotations; it is up to the developers of the passes that use the annotations to make sure they survive until their pass runs. A second step would be to invalidate only the annotations referenced by pseudo instructions that are involved in the transformation. This can also be done in post-processing steps of transformations: for example, if some pseudo instruction is not at the loop nest level indicated by the annotation it references, the data is invalid. Or if more than one pseudo instruction refers to an annotation after some transformation, this points to duplication, which might invalidate the annotation.
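
A minimal sketch of points 1) and 3) together (the intrinsic, the named metadata, and the payload layout are all hypothetical):

  ; one pseudo instruction per pragma occurrence, pointing into a
  ; module-level annotation container
  call void @llvm.pragma.marker(metadata !1)

!pragma.annotations = !{!1}
!1 = metadata !{metadata !"unroll-by", i32 4, metadata !"line", i32 42}

A pass that cannot maintain an annotation erases the marker call; a post-processing step then deletes container entries that are referenced by no marker (invalidated) or by more than one marker (suspect duplication).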

Bjorn

In one implementation, the loops had a fixed structure: guard branch, preheader, header and loop body, optional epilog. The structure was clearly identified by various means (flags, bits, etc.). No optimization in the optimizer (roughly a counterpart of LLVM's bitcode optimizer) was allowed to alter it. For example, CFG simplification could change branches all it wanted, *except* the branches that were part of the loop structure. A "continue" inside a loop would not create a nested loop, etc. This worked like a dream (hence my limited appreciation for SimplifyCFG...).

In that implementation, the metadata was placed in the loop header or the loop preheader, depending on what information it was. Since any code that tried to move things around in a loop had to know what it was doing (i.e., had to be aware of the loop structure and what the rules were), there was no risk that the metadata would become accidentally separated from the loop.
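
In LLVM IR terms, that fixed shape would look roughly like this (a sketch of the structure described above, labels and terminators only):

guard:                            ; guard branch
  br i1 %trip.nonzero, label %preheader, label %exit
preheader:                        ; loop-level metadata could live here
  br label %header
header:                           ; ... or here, depending on the info
  br i1 %cond, label %body, label %epilog
body:
  br label %header                ; the single backedge
epilog:                           ; optional
  br label %exit
exit:
  ...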

I'd love to see something like that in LLVM, help make it happen, etc. Of course, other ideas are welcome as well. :-)

-Krzysztof

Krzysztof Parzyszek wrote:

>> This still leaves the question of exactly how to attach metadata to loops, etc.
>
> In one implementation, the loops had a fixed structure: guard branch, preheader, header and loop body, optional epilog. [...]

Can you please write up a description of exactly what you have in mind? What information would go in the header or preheader (or maybe attached to the backedges, etc.)? Would this be metadata or intrinsics (or both)? Would basic-block-level metadata be a better fit? (We had a patch for this proposed some months ago by Ralf Karrenberg -- I just figured I'd mention it in case it is useful here.) Obviously we don't want to limit the optimization space available in order to preserve this metadata, so we need to work out what happens in the case of loop fusion, splitting, etc.

-Hal

Sure. Will do.

-Krzysztof