Vectorization metadata

Hal,

I'm opening a new discussion on vectorization metadata, since it has
little to do with fp-math. :wink:

What kind of metadata would you annotate in the instructions? If I
remember from your talk, you're not doing any loop or whole-function
analysis, possibly leaving it for Polly to help you along the way.

I remember discussing it with Tobias that Polly could have three main steps:

1. Early analysis and annotation: a step that wouldn't modify code,
but would extensively annotate it (with metadata), so that Polly
itself, and other passes like yours, could benefit from the
polyhedral model.

2. Full polyhedral code modification: use the annotations from the
previous step to extensively modify code. This is what Polly does
today, but the result of the analysis benefits no one except Polly.

This step can be fused with step 1 for performance reasons, but it
would be good to be able to run only the analysis part, just for the
annotations, without the heavy modifications. This will be fundamental
for independently testing vectorization passes that depend on Polly's
metadata.

3. Code generation steps. As you said in your talk, and as we discussed
in the fp-math thread, some code-generation steps could be aware of
the optimizations done, via the metadata left in the IR.

That will require some guarantees on metadata semantics and
persistence that are not available today... Anyway, I'm not sure any
metadata hardening will be very well accepted... :wink:

> Hal,
>
> I'm opening a new discussion on vectorization metadata, since it has
> little to do with fp-math. :wink:

Fair enough, but I was actually talking about how fp-math, etc. metadata
is updated during vectorization. When vectorization fuses
originally-independent instructions, it has the same metadata issues as
GVN, etc.

Metadata specifically for vectorization is another interesting topic,
but I don't have any specific ideas for this at the moment. That having
been said, I think that we do need to think about metadata that will
help with vectorization; we might want to tag instructions as safe for
speculative execution, for example. We might want to tag loops with a
specific unrolling factor. We might want to be able to pass along
specific alias independence results. None of these things are really
specific to vectorization, but will generally have an impact on it.

-Hal

> Hal,
>
> I'm opening a new discussion on vectorization metadata, since it has
> little to do with fp-math. :wink:
>
> What kind of metadata would you annotate in the instructions? If I
> remember from your talk, you're not doing any loop or whole-function
> analysis, possibly leaving it for Polly to help you along the way.
>
> I remember discussing it with Tobias that Polly could have three main steps:
>
> 1. Early analysis and annotation: a step that wouldn't modify code,
> but extensively annotate (with metadata), so that itself, and other
> passes like yours, could benefit from the polyhedral model.

Hi Renato,

Instead of exporting the polyhedral model of the program with
metadata, another possible solution is designing a generic "Loop
Parallelism" analysis interface just like the AliasAnalysis group.
For a particular loop, the interface simply answers how many loop
iterations can run in parallel. With information provided by this
interface we can unroll the loop to expose vectorizable iterations and
apply vectorization to the unrolled loop with BBVectorizer.

Like AliasAnalysis, we can have different implementations of the loop
parallelism analysis, i.e., we can have a lightweight implementation
based on SCEV (or the LoopDependence analysis), and we can also have
an implementation based on the polyhedral model implemented in Polly
(call it the polyhedral loop parallelism analysis). But the analysis
results of Polly are not visible at the scope of a
FunctionPass/LoopPass, as all Polly passes are RegionPasses right now.

To allow Polly to export its analysis results to a
FunctionPass/LoopPass, we need to make the polyhedral loop parallelism
analysis a FunctionPass, schedule it before all Polly passes, but do
nothing in its runOnFunction method; after that we can let another
Polly pass fill the actual analysis results into the polyhedral loop
parallelism analysis pass. By doing this, other
FunctionPasses/LoopPasses can query the parallelism information
calculated by Polly.

If the parallelism information is available outside Polly, we can also
find some way to move the code generation support for OpenMP,
vectorization and CUDA from Polly to the LLVM transformation library;
after that we can also generate such code based on the analysis
results of the SCEV-based parallelism analysis.

best regards
ether

Hi Ether,

> Instead of exporting the polyhedral model of the program with
> metadata, another possible solution is designing a generic "Loop
> Parallelism" analysis interface just like the AliasAnalysis group.
> For a particular loop, the interface simply answers how many loop
> iterations can run in parallel. With information provided by this
> interface we can unroll the loop to expose vectorizable iterations and
> apply vectorization to the unrolled loop with BBVectorizer.

In the long run, this kind of parallelism detector should be replaced
by Polly, but it could be a starting point. I only fear that the
burden might outweigh the benefits in the short term.

> To allow Polly to export its analysis results to a
> FunctionPass/LoopPass, we need to make the polyhedral loop parallelism
> analysis a FunctionPass, schedule it before all Polly passes, but do
> nothing in its runOnFunction method; after that we can let another
> Polly pass fill the actual analysis results into the polyhedral loop
> parallelism analysis pass. By doing this, other
> FunctionPasses/LoopPasses can query the parallelism information
> calculated by Polly.

That's the idea. It would be good if you could have both
analysis+transform AND analysis-only pre-passes, to allow more
fine-grained control over vectorization (and to ease testing of other
passes that use Polly's info).

> If the parallelism information is available outside Polly, we can also
> find some way to move the code generation support for OpenMP,
> vectorization and CUDA from Polly to the LLVM transformation library;
> after that we can also generate such code based on the analysis
> results of the SCEV-based parallelism analysis.

LLVM already has OpenMP support, maybe we should follow a similar
standard, or common them up.

CUDA would be closer to OpenCL than to OpenMP or Polly; I'm not sure
there is a feasible way to make sure the semantics remain the same
across such a drastic change of paradigm.

I think this is a very important feature for vectorization. If we
start building small passes for small vectorization steps (like one
for hoisting loop constants, another to simplify the induction range,
another to unroll loops), we might not be able to predict the best
strategy, since early changes might shadow better strategies later.

Having metadata allows one to infer the best strategy as a whole, and
apply it, rather than hoping for a good sequence of passes... We can
still have separate passes for each task, but not run them all on all
code all the time.

So, if an early analysis pass annotates one loop saying you should
only hoist the loop constants (aggressive inlining is possible, for
example), while on another you should actually unroll, then each pass
can run independently and trust the metadata on each
loop/block/instruction.

> Hi Ether,
>
>> Instead of exporting the polyhedral model of the program with
>> metadata, another possible solution is designing a generic "Loop
>> Parallelism" analysis interface just like the AliasAnalysis group.
>> For a particular loop, the interface simply answers how many loop
>> iterations can run in parallel. With information provided by this
>> interface we can unroll the loop to expose vectorizable iterations
>> and apply vectorization to the unrolled loop with BBVectorizer.

> In the long run, this kind of parallelism detector should be replaced
> by Polly, but it could be a starting point. I only fear that the
> burden might outweigh the benefits in the short term.

>> To allow Polly to export its analysis results to a
>> FunctionPass/LoopPass, we need to make the polyhedral loop
>> parallelism analysis a FunctionPass, schedule it before all Polly
>> passes, but do nothing in its runOnFunction method; after that we
>> can let another Polly pass fill the actual analysis results into the
>> polyhedral loop parallelism analysis pass. By doing this, other
>> FunctionPasses/LoopPasses can query the parallelism information
>> calculated by Polly.

> That's the idea. It would be good if you could have both
> analysis+transform AND analysis-only pre-passes, to allow more
> fine-grained control over vectorization (and to ease testing of other
> passes that use Polly's info).

>> If the parallelism information is available outside Polly, we can
>> also find some way to move the code generation support for OpenMP,
>> vectorization and CUDA from Polly to the LLVM transformation
>> library; after that we can also generate such code based on the
>> analysis results of the SCEV-based parallelism analysis.

> LLVM already has OpenMP support, maybe we should follow a similar
> standard, or common them up.

I wish that were true; unless you know something I don't know, there
is no parallelization support at this time. Polly has some ability
to lower directly to the libgomp runtime, but that is not the same
as OpenMP support. This is, however, something I'd like to work on.

-Hal

Hi Hal,

> We might want to be able to pass along
> specific alias independence results. None of these things are really
> specific to vectorization, but will generally have an impact on it.

For what it's worth, in pocl we do something along these lines now [1].

We annotate the OpenCL C kernel instructions with the OpenCL work-item
id and the "parallel region id" (the region between barriers).
As you probably know, in OpenCL C the work items are fully independent
"threads of execution" between the barrier regions, which is useful
information to pass along.

This metadata is used both to guide a (modified) bb-vectorizer to
perform the work-group auto-vectorization (whole-function
vectorization, if you will) more efficiently, and to improve the alias
analysis for instruction scheduling (and other optimizations that
might benefit).

The benefit of not just vectorizing the parallel regions directly is
that we can choose to wg-vectorize and/or to statically
instruction-parallelize using the same input from pocl.

It would be really nice to have a set of "standard independence
metadata" in LLVM that would cover also this scenario.

[1] http://bazaar.launchpad.net/~pocl/pocl/trunk/revision/237

BR,

Well, a few months ago there were some people (weren't there?)
implementing support to read the pragmas, but I'm not sure how far
they got.

It might be just my imagination, though...