Proposal: Generic auto-vectorization and parallelization approach for LLVM and Polly

Hi,

I just had a detailed look at the code of Polly[1]; it seems that Polly
is starting to support some basic auto-parallelization. I have some
ideas to improve the current auto-vectorization and parallelization
approach in Polly.

The main idea is to separate the transform passes and the codegen passes
for auto-parallelization and vectorization (Graphite[2] for GCC seems
to take a similar approach for auto-vectorization).

That means Polly (or a similar framework) should perform the necessary
code transformations and then just generate sequential code, together
with the necessary parallelism information (this information could be
encoded as metadata, just like TBAA); later passes can then generate
the parallel code from that parallelism information.
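
To illustrate, here is a rough sketch of what the annotation side could
look like (the metadata kind "polly.parallel" and the helper function
are invented for this example, and the calls follow the current LLVM
C++ API):

  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/Metadata.h"
  using namespace llvm;

  // Hypothetical: after the framework emits the sequential loop, tag its
  // back-edge branch so later passes can recognize the loop as parallel,
  // much like TBAA tags memory accesses.
  void markLoopParallel(BranchInst *BackEdge) {
    LLVMContext &Ctx = BackEdge->getContext();
    Metadata *Ops[] = {MDString::get(Ctx, "polly.parallel")};
    BackEdge->setMetadata("polly.parallel", MDNode::get(Ctx, Ops));
  }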

The main benefits of separating the transform passes and codegen passes are:
1. It decouples the autopar framework from the various parallel runtime
environments, so we can keep both the autopar framework and the code
generation pass for a specific parallel runtime environment simple, and
we can support more parallel runtime environments easily.

2. It allows generic parallelism information to outlive the specific
autopar framework, so this information can benefit more passes in
LLVM. For example, the X86 and PTX backends could use this information
to perform target-specific auto-vectorization.

Implementation considerations:

We could define some kind of generic "parallelism analysis", or a
generic version of the "LoopDependenceAnalysis" interface just like
AliasAnalysis; or we could encode the parallelism information as
metadata. Combining both should work, too.

Any comments are appreciated.

best regards
ether

[1] http://wiki.llvm.org/Polly
[2] Graphite/Parallelization - GCC Wiki

Hi,

> I just had a detailed look at the code of Polly[1]; it seems that Polly
> is starting to support some basic auto-parallelization.

This is true; however, it is still work in progress. I hope we can soon
show some interesting results.

> I have some ideas to improve the current auto-vectorization
> and parallelization approach in Polly.

> The main idea is to separate the transform passes and the codegen passes
> for auto-parallelization and vectorization (Graphite[2] for GCC seems
> to take a similar approach for auto-vectorization).

This is just historical. Sebastian is planning to couple them a lot
closer. Graphite and auto-parallelization/vectorization are not yet
playing well together, even though Sebastian has made some huge progress
with the Graphite<->Autovec interaction.

> That means Polly (or a similar framework) should perform the necessary
> code transformations and then just generate sequential code, together
> with the necessary parallelism information (this information could be
> encoded as metadata, just like TBAA); later passes can then generate
> the parallel code from that parallelism information.

I also thought about this approach.

> The main benefits of separating the transform passes and codegen passes are:
> 1. It decouples the autopar framework from the various parallel runtime
> environments, so we can keep both the autopar framework and the code
> generation pass for a specific parallel runtime environment simple, and
> we can support more parallel runtime environments easily.

Yes, as soon as we start to support more targets we should move some of
the logic out into separate functions/files/passes. This could either be
done inside Polly or as a generic LLVM pass, depending on which is more
suitable.

> 2. It allows generic parallelism information to outlive the specific
> autopar framework, so this information can benefit more passes in
> LLVM. For example, the X86 and PTX backends could use this information
> to perform target-specific auto-vectorization.

What other types of parallelism are you expecting? We currently support
thread-level parallelism (as in OpenMP) and vector-level parallelism (as
in LLVM-IR vectors). At least for X86 I do not see any reason for
target-specific auto-vectorization, as LLVM-IR vectors are lowered
extremely well to x86 SIMD instructions. I suppose this is the same for
all CPU targets. I still need to look into GPU targets.

> Implementation considerations:

> We could define some kind of generic "parallelism analysis", or a
> generic version of the "LoopDependenceAnalysis" interface just like
> AliasAnalysis; or we could encode the parallelism information as
> metadata. Combining both should work, too.

Thanks ether for starting to reason about this.

My current plan is as follows:

Create stable support for OpenMP and LLVM-IR vector generation inside
Polly. This is not very difficult, as the needed analysis can easily be
performed in the polyhedral framework. Code generation is not difficult
either. SIMD vector support is almost complete and OpenMP code
generation is also halfway through. Based on that infrastructure we can
take advantage of both thread-level as well as SIMD-level parallelism,
which covers the two most common targets.
The next step will be to evaluate loop transformations that enable
thread- and vector-level parallelism, as is done e.g. in Pluto[1].
I am pretty confident we can show good improvements. Stay tuned. ;-)

Regarding your proposal, I hope to move some of the code into more
generic frameworks. I am thinking e.g. about introducing OpenMP
intrinsics to support different OpenMP libraries like libgomp and
mpc[2]. LLVM-IR vector instructions, however, are generic SIMD
instructions, so I do not see any reason to create target-specific
auto-vectorizer passes.
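
To sketch the intrinsics idea (all names here are invented;
"polly.parallel.for" is not a real intrinsic), the code generator could
emit a call to a runtime-neutral entry point that a later lowering pass
maps to libgomp or mpc:

  #include "llvm/IR/IRBuilder.h"
  #include "llvm/IR/Module.h"
  using namespace llvm;

  // Hypothetical runtime-neutral fork: a later pass would lower this to
  // the libgomp or mpc runtime calls for a parallel loop.
  void emitParallelFor(IRBuilder<> &B, Module &M, Function *Body,
                       Value *LB, Value *UB, Value *Stride) {
    Type *I64 = B.getInt64Ty();
    FunctionType *FTy = FunctionType::get(
        B.getVoidTy(), {Body->getType(), I64, I64, I64}, false);
    FunctionCallee Fork = M.getOrInsertFunction("polly.parallel.for", FTy);
    B.CreateCall(Fork, {Body, LB, UB, Stride});
  }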

The remaining question is how/where to generate parallel/vector
code.

The current approach is to do this directly at Polly code generation
time; the solution you proposed would be to generate sequential code
inside Polly, annotate it with metadata, reparse it, and finally create
OpenMP/vector code.

The second approach would have the advantage that multiple LLVM passes
could use the information and generate code from it, and several
analyses could create the needed metadata.

It has, however, the drawback that instead of just doing code generation
once after Polly, we do sequential code generation -> reparsing/analysis
-> parallel code generation. Furthermore, the infrastructure needs to
pass along all the information needed for efficient parallelization,
which is at least the access strides, the alignment, and the privatized
variables. Recomputing this information using scalar evolution might be
difficult, as Polly may introduce loop IVs using e.g. ceil/floor
divisions.

As Polly - based on the current approach - will soon support
target-agnostic auto-vectorization and OpenMP code generation, I plan to
now focus on polyhedral loop nest optimizations that enable efficient
vectorization and thread-level parallelism. As these transformations are
performed on the abstract polyhedral description, no further changes to
the code generation are needed. In case a larger framework is shown to
be useful, I will definitely support this. For the moment, however, I am
fine with the existing lightweight approach.

Cheers
Tobi

[1] http://pluto-compiler.sourceforge.net/
[2] http://mpc.sf.net

> The main idea is to separate the transform passes and the codegen passes
> for auto-parallelization and vectorization (Graphite[2] for GCC seems
> to take a similar approach for auto-vectorization).

I agree with Ether.

A two-stage vectorization would allow you to use the simple
loop-unroller already in place to generate vector/mp intrinsics from
the unrolled loops, and, if more parallelism is required, to use the
expensive Poly framework to skew loops and remove dependencies, so the
loop-unroller and other cheap bits can do their job where they couldn't
before.

So, in essence, this is a three-stage job: the optional heavy-duty
Poly analysis, the cheap loop-optimizer, and the mp/vector
transformation pass. The best feature of having all three is being
able to choose the level of vectorization you want and to reuse the
current loop analyses in the scheme.

> What other types of parallelism are you expecting? We currently support
> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target-specific auto-vectorization, as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.

I'd suggest trying to transform sequential instructions into vector
instructions (in the third stage) when it is proven correct to do so.

So, when Poly skews a loop and the loop analysis unrolls it to, say,
4 calls to the same instruction, metadata binding them together can
hint to the third stage that it can turn them into a vector operation
with the same semantics.

> LLVM-IR vector instructions, however, are generic SIMD instructions,
> so I do not see any reason to create target-specific auto-vectorizer
> passes.

If you're assuming the original code is using intrinsics, that is
correct. But if you want to generate the vector code from Poly, then
you need to add that support, too.

ARM also has good vector instruction selection (on Cortex-A* with
NEON), so you also get that for free. ;-)

cheers,
--renato

Hi Tobi,

> 2. It allows generic parallelism information to outlive the specific
> autopar framework, so this information can benefit more passes in
> LLVM. For example, the X86 and PTX backends could use this information
> to perform target-specific auto-vectorization.

> What other types of parallelism are you expecting? We currently support
> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
> in LLVM-IR vectors). At least for X86 I do not see any reason for
> target-specific auto-vectorization, as LLVM-IR vectors are lowered
> extremely well to x86 SIMD instructions. I suppose this is the same for
> all CPU targets. I still need to look into GPU targets.

I just think the vector units in different targets may have different
widths, so the best unroll count of a loop for vectorization is not
known in the high-level optimization passes.

> It has, however, the drawback that instead of just doing code generation
> once after Polly, we do sequential code generation -> reparsing/analysis
> -> parallel code generation. Furthermore, the infrastructure needs to
> pass along all the information needed for efficient parallelization,
> which is at least the access strides, the alignment, and the privatized
> variables. Recomputing this information using scalar evolution might be
> difficult, as Polly may introduce loop IVs using e.g. ceil/floor
> divisions.

To overcome this, we can encode this kind of "hard to recover"
information as metadata while generating the sequential code; all the
later "Polyhedral Parallelism Analysis" pass needs to do is read this
information from the metadata and recompute the other information,
which is easy to recover. So the process becomes:
sequential code generation and metadata annotation -> read metadata
(and perform some cheap reparsing/analysis) -> parallel code generation
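
As a rough sketch of the read-back side (reusing the invented
"polly.parallel" metadata kind from my first mail, here assumed to
carry the access stride as an extra operand):

  #include "llvm/IR/Constants.h"
  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/Metadata.h"
  #include <optional>
  using namespace llvm;

  // Hypothetical: recover the annotated access stride, which would be
  // hard to recompute from ceil/floor-based induction variables.
  std::optional<uint64_t> getAnnotatedStride(BranchInst *BackEdge) {
    MDNode *MD = BackEdge->getMetadata("polly.parallel");
    if (!MD || MD->getNumOperands() < 2)
      return std::nullopt;
    if (auto *S = mdconst::dyn_extract<ConstantInt>(MD->getOperand(1)))
      return S->getZExtValue();
    return std::nullopt;
  }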

The bigger picture is:
1. Define a common interface for "Parallelism Analysis" or
"LoopDependenceAnalysis", just like AliasAnalysis (see the sketch after
this list).
2. Then we can have different implementations of the Parallelism Analysis.
    For example, we may have an "SCEVParallelismAnalysis", which
computes the parallelism information based on SCEV.
    And we can also have a "PolyhedralParallelismAnalysis", which
reads the "hard to recover" information from metadata, recomputes the
cheap information, and then provides all of it via the common
"Parallelism Analysis" interface.
3. The auto-vectorization and parallelization codegen passes can then
just ask the common "Parallelism Analysis" interface for the necessary
information.
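
A rough sketch of what such an interface could look like (all names
invented, loosely modeled on how AliasAnalysis admits several chained
implementations):

  #include "llvm/ADT/SmallVector.h"
  #include "llvm/Analysis/LoopInfo.h"
  using namespace llvm;

  // Invented interface: a common way to query parallelism facts,
  // independent of how they were computed.
  class ParallelismAnalysis {
  public:
    virtual ~ParallelismAnalysis() = default;
    // True if the iterations of L can run in parallel (no loop-carried
    // dependences, after privatization).
    virtual bool isParallel(const Loop *L) = 0;
    // Values that must be privatized per thread/lane for L to be parallel.
    virtual void getPrivatizedValues(const Loop *L,
                                     SmallVectorImpl<Value *> &Out) = 0;
  };

  // Possible implementations, as described above:
  //   SCEVParallelismAnalysis       - derives the facts from SCEV.
  //   PolyhedralParallelismAnalysis - reads the "hard to recover" facts
  //                                   from the metadata annotations.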

The new approach may also make the current work on OpenMP support
easier. Instead of generating the subfunction directly from the clast
and inserting a new function in a region pass (it seems that we can
only insert new functions in a ModulePass or CallGraphSCC pass), we can
extract the body of the parallel for to a new function with the
existing CodeExtractor in LLVM.
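
A sketch of what the extraction could look like (the exact CodeExtractor
interface has changed across LLVM versions; this follows the current
class-based one):

  #include "llvm/Analysis/AssumptionCache.h"
  #include "llvm/Transforms/Utils/CodeExtractor.h"
  using namespace llvm;

  // Outline the blocks of a parallel loop body into a new function that
  // can then be handed to the OpenMP runtime.
  Function *outlineLoopBody(ArrayRef<BasicBlock *> BodyBlocks, Function &F) {
    CodeExtractor Extractor(BodyBlocks);
    if (!Extractor.isEligible())
      return nullptr;
    CodeExtractorAnalysisCache CEAC(F);
    return Extractor.extractCodeRegion(CEAC);
  }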

best regards
ether

> Hi Tobi,

>> 2. It allows generic parallelism information to outlive the specific
>> autopar framework, so this information can benefit more passes in
>> LLVM. For example, the X86 and PTX backends could use this information
>> to perform target-specific auto-vectorization.

>> What other types of parallelism are you expecting? We currently support
>> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
>> in LLVM-IR vectors). At least for X86 I do not see any reason for
>> target-specific auto-vectorization, as LLVM-IR vectors are lowered
>> extremely well to x86 SIMD instructions. I suppose this is the same for
>> all CPU targets. I still need to look into GPU targets.

> I just think the vector units in different targets may have different
> widths, so the best unroll count of a loop for vectorization is not
> known in the high-level optimization passes.

I believe we can obtain this information from the target data. If this
information is not yet available, target data should be extended, as
high-level loop nest transformations should also have knowledge of the
vector width, and at best even of the number of registers, if we want
to support effective register tiling.
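
For what it's worth, a sketch of the query (note: in current LLVM the
vector register width ended up being exposed through
TargetTransformInfo rather than the target data itself, so take the
exact home of the API as an assumption):

  #include "llvm/Analysis/TargetTransformInfo.h"
  using namespace llvm;

  // Derive a vectorization unroll count from the target's fixed vector
  // register width, e.g. 128 / 32 = 4 lanes of i32/float on SSE or NEON.
  unsigned vectorUnrollCount(const TargetTransformInfo &TTI,
                             unsigned ElementBits) {
    unsigned RegBits =
        TTI.getRegisterBitWidth(TargetTransformInfo::RGK_FixedWidthVector)
            .getFixedValue();
    return RegBits ? RegBits / ElementBits : 1;
  }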

>> It has, however, the drawback that instead of just doing code generation
>> once after Polly, we do sequential code generation -> reparsing/analysis
>> -> parallel code generation. Furthermore, the infrastructure needs to
>> pass along all the information needed for efficient parallelization,
>> which is at least the access strides, the alignment, and the privatized
>> variables. Recomputing this information using scalar evolution might be
>> difficult, as Polly may introduce loop IVs using e.g. ceil/floor
>> divisions.

> To overcome this, we can encode this kind of "hard to recover"
> information as metadata while generating the sequential code; all the
> later "Polyhedral Parallelism Analysis" pass needs to do is read this
> information from the metadata and recompute the other information,
> which is easy to recover. So the process becomes:
> sequential code generation and metadata annotation -> read metadata
> (and perform some cheap reparsing/analysis) -> parallel code generation

I believe this is a reasonable amount of work, but in terms of
vectorization for Polly I _currently_ see limited benefits. The current
advantage is - as Renato pointed out - that we could create a very
lightweight vectorizer by taking advantage of the existing loop passes.
Also in terms of OpenMP code generation, this might be a good way to go.

> The bigger picture is:
> 1. Define a common interface for "Parallelism Analysis" or
> "LoopDependenceAnalysis", just like AliasAnalysis.
> 2. Then we can have different implementations of the Parallelism Analysis.
>     For example, we may have an "SCEVParallelismAnalysis", which
> computes the parallelism information based on SCEV.
>     And we can also have a "PolyhedralParallelismAnalysis", which
> reads the "hard to recover" information from metadata, recomputes the
> cheap information, and then provides all of it via the common
> "Parallelism Analysis" interface.
> 3. The auto-vectorization and parallelization codegen passes can then
> just ask the common "Parallelism Analysis" interface for the necessary
> information.

A reasonable approach.

> The new approach may also make the current work on OpenMP support
> easier. Instead of generating the subfunction directly from the clast
> and inserting a new function in a region pass (it seems that we can
> only insert new functions in a ModulePass or CallGraphSCC pass), we can
> extract the body of the parallel for to a new function with the
> existing CodeExtractor in LLVM.

I agree we need to improve the implementation of the OpenMP support.
The reason I did not propose an integrated framework yet is that I
still need to understand OpenMP a little bit better. I hope that after
the basic OpenMP support in Polly is finished, we can move to an
LLVM-integrated approach. As we will already have a working
implementation and test cases we can compare against, this will
probably be an easier shift.

Maybe we can start in that area by first introducing some generic
OpenMP intrinsics, and later automatically generating them based on
metadata annotations.

Cheers
Tobi

>> The main idea is to separate the transform passes and the codegen passes
>> for auto-parallelization and vectorization (Graphite[2] for GCC seems
>> to take a similar approach for auto-vectorization).

> I agree with Ether.
>
> A two-stage vectorization would allow you to use the simple
> loop-unroller already in place to generate vector/mp intrinsics from
> the unrolled loops, and, if more parallelism is required, to use the
> expensive Poly framework to skew loops and remove dependencies, so the
> loop-unroller and other cheap bits can do their job where they couldn't
> before.
>
> So, in essence, this is a three-stage job: the optional heavy-duty
> Poly analysis, the cheap loop-optimizer, and the mp/vector
> transformation pass. The best feature of having all three is being
> able to choose the level of vectorization you want and to reuse the
> current loop analyses in the scheme.

OK. First of all, to agree on a name: we decided to call the polyhedral
analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
misleading choice?

In general, as I explained, I agree that a three-stage approach is
useful for the reasons you gave; however, it is more overhead (and just
more implementation work) than the approach we use now. I currently do
not have the time to implement the proposed approach. In case anybody
is interested in working on patches, I am happy to support this.

>> What other types of parallelism are you expecting? We currently support
>> thread-level parallelism (as in OpenMP) and vector-level parallelism (as
>> in LLVM-IR vectors). At least for X86 I do not see any reason for
>> target-specific auto-vectorization, as LLVM-IR vectors are lowered
>> extremely well to x86 SIMD instructions. I suppose this is the same for
>> all CPU targets. I still need to look into GPU targets.

> I'd suggest trying to transform sequential instructions into vector
> instructions (in the third stage) when it is proven correct to do so.
>
> So, when Poly skews a loop and the loop analysis unrolls it to, say,
> 4 calls to the same instruction, metadata binding them together can
> hint to the third stage that it can turn them into a vector operation
> with the same semantics.

I know; this is the classical approach to vector code generation. The
difference in Polly is that we do not have a loop represented in
LLVM-IR which we would like to vectorize, but a loop body whose content
we want to emit as vector code. So instead of creating the LLVM-IR loop
structure, writing metadata, unrolling the loop, and then merging
instructions into vector instructions, the only change in Polly is that
it either generates N scalar instructions per original instruction or
one vector instruction (if N, the number of loop iterations, is
equivalent to the vector width). So vectorization in Polly was very
easy to implement and already works reasonably well.
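
To illustrate the contrast with a small IRBuilder-level sketch (names
invented; this is not Polly's actual code generator):

  #include "llvm/IR/DerivedTypes.h"
  #include "llvm/IR/IRBuilder.h"
  using namespace llvm;

  // Either emit N scalar copies of an instruction, one per iteration...
  void emitScalarCopies(IRBuilder<> &B, Type *EltTy,
                        ArrayRef<Value *> LanePtrs) {
    for (Value *Ptr : LanePtrs)
      B.CreateLoad(EltTy, Ptr);
  }

  // ...or, when N matches the vector width, one wide operation.
  Value *emitVectorOp(IRBuilder<> &B, Type *EltTy, Value *VecPtr,
                      unsigned N) {
    return B.CreateLoad(FixedVectorType::get(EltTy, N), VecPtr);
  }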

>> LLVM-IR vector instructions, however, are generic SIMD instructions,
>> so I do not see any reason to create target-specific auto-vectorizer
>> passes.

> If you're assuming the original code is using intrinsics, that is
> correct. But if you want to generate the vector code from Poly, then
> you need to add that support, too.

Why are target-specific vectorization passes needed to generate vector
instructions from Polly? The only target-specific information I
currently see is the vector width, which a generic vectorization pass
can obtain from the target data information. Could you explain which
features would need target-specific vectorization?

> ARM also has good vector instruction selection (on Cortex-A* with
> NEON), so you also get that for free. ;-)

I have read this and they look interesting. I suppose they are created
out of the box if a pass generates LLVM-IR vector instructions?

> cheers,
> --renato

Thanks for your comments

Tobi

> OK. First of all, to agree on a name: we decided to call the polyhedral
> analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
> misleading choice?

I never realised... ;-) Polly it is!

> In general, as I explained, I agree that a three-stage approach is
> useful for the reasons you gave; however, it is more overhead (and just
> more implementation work) than the approach we use now. I currently do
> not have the time to implement the proposed approach. In case anybody
> is interested in working on patches, I am happy to support this.

Good. If it's just a matter of time (not design), things can be left
ready for future implementation without breaking the current model. I
thought there was a fundamental flaw with the three-stage design (and
was eager to learn what it was).

> the only change in Polly is that it either generates N scalar
> instructions per original instruction or one vector instruction (if N,
> the number of loop iterations, is equivalent to the vector width). So
> vectorization in Polly was very easy to implement and already works
> reasonably well.

OK, this ties in with another current change in LLVM: OpenCL. Let me
explain.

OpenCL has very large (and odd) vector sizes that, if mapped to real
vector units (like SSE or NEON), need to be legalised.

Such a legalisation pass has to be target specific, and Polly could
make use of that. If Polly always generates vector code (instead of
reasoning about whether the number of unrolled operations matches the
target currently being compiled for), the later legalisation pass can
deal with the odd-sized vectors and transform them into multiples of
legal vectors, plus the surplus as normal scalar instructions.

Also, if the target doesn't have vector units, there could be a
generic (or not) lowering to plain CPU instructions (if there isn't
one already), and that makes your Polly pass completely target
agnostic.

> Why are target-specific vectorization passes needed to generate vector
> instructions from Polly? The only target-specific information I
> currently see is the vector width, which a generic vectorization pass
> can obtain from the target data information. Could you explain which
> features would need target-specific vectorization?

Not target-specific, generic vectors. See above.

> I have read this and they look interesting. I suppose they are created
> out of the box if a pass generates LLVM-IR vector instructions?

Yup. It's pretty neat. SSE is probably similar, but with NEON a
pattern match is done when the variable type is a vector.

So, a multiplication followed by an addition in the right way is
transformed into a multiply-and-add NEON instruction.

An example (in simplified IR, just to make the point), where %a, %b,
and %c are <4 x i32> values:

%mul = mul <4 x i32> %b, %c
%acc = add <4 x i32> %mul, %a

gets transformed into:

VMLA.I32 q0, q1, q2

Multiplying vectors (of the correct size) gets into VMUL, adding gets
to VADD and so on...

cheers,
--renato

>> OK. First of all, to agree on a name: we decided to call the polyhedral
>> analysis we develop PoLLy, as in Polly the parrot. ;-) Maybe it was a
>> misleading choice?

> I never realised... ;-) Polly it is!

>> In general, as I explained, I agree that a three-stage approach is
>> useful for the reasons you gave; however, it is more overhead (and just
>> more implementation work) than the approach we use now. I currently do
>> not have the time to implement the proposed approach. In case anybody
>> is interested in working on patches, I am happy to support this.

> Good. If it's just a matter of time (not design), things can be left
> ready for future implementation without breaking the current model. I
> thought there was a fundamental flaw with the three-stage design (and
> was eager to learn what it was).

Not at all. It was just so much easier to create vector instructions
directly inside Polly that I did not start another large project.

>> the only change in Polly is that it either generates N scalar
>> instructions per original instruction or one vector instruction (if N,
>> the number of loop iterations, is equivalent to the vector width). So
>> vectorization in Polly was very easy to implement and already works
>> reasonably well.

> OK, this ties in with another current change in LLVM: OpenCL. Let me
> explain.
>
> OpenCL has very large (and odd) vector sizes that, if mapped to real
> vector units (like SSE or NEON), need to be legalised.
>
> Such a legalisation pass has to be target specific, and Polly could
> make use of that. If Polly always generates vector code (instead of
> reasoning about whether the number of unrolled operations matches the
> target currently being compiled for), the later legalisation pass can
> deal with the odd-sized vectors and transform them into multiples of
> legal vectors, plus the surplus as normal scalar instructions.
>
> Also, if the target doesn't have vector units, there could be a
> generic (or not) lowering to plain CPU instructions (if there isn't
> one already), and that makes your Polly pass completely target
> agnostic.

I believe LLVM already lowers wide LLVM-IR vectors to a set of
operations on vectors of target-specific width. This is done in the
back end and works correctly for all cases I tested. The problem with
this is just performance. The fastest loop nest is different on
architectures with different vector widths and numbers of vector
registers. A generic polyhedral optimizer will therefore not generate
incorrect code, but the transformations it proposes may be a poor fit.
If Polly e.g. always generated vector operations 32 floats wide, this
would produce not very pleasant code, but it might still be lowered to
something well-performing on CPUs supporting AVX. Matching the target
vector width in our heuristics will obviously give the best
performance. So to get optimal performance Polly needs to take target
data into account.

Talking about OpenCL: the lowering you described for the large vector
instructions sounds reasonable. Optimal code, however, would probably
be produced by revisiting the whole loop structure and generating one
that is performance-wise optimal for the target architecture.

>> Why are target-specific vectorization passes needed to generate vector
>> instructions from Polly? The only target-specific information I
>> currently see is the vector width, which a generic vectorization pass
>> can obtain from the target data information. Could you explain which
>> features would need target-specific vectorization?

> Not target-specific, generic vectors. See above.

>> I have read this and they look interesting. I suppose they are created
>> out of the box if a pass generates LLVM-IR vector instructions?

> Yup. It's pretty neat. SSE is probably similar, but with NEON a
> pattern match is done when the variable type is a vector.
>
> So, a multiplication followed by an addition in the right way is
> transformed into a multiply-and-add NEON instruction.
>
> An example (in simplified IR, just to make the point), where %a, %b,
> and %c are <4 x i32> values:
>
> %mul = mul <4 x i32> %b, %c
> %acc = add <4 x i32> %mul, %a
>
> gets transformed into:
>
> VMLA.I32 q0, q1, q2
>
> Multiplying vectors (of the correct size) gets into VMUL, adding gets
> to VADD and so on...

Yes. I have seen this for PowerPC as well, and I hope we can soon use
the AVX multiply-and-add instruction too.

Cheers
Tobi

> Matching the target vector width in our heuristics will obviously give
> the best performance. So to get optimal performance Polly needs to take
> target data into account.

Indeed! And even if you lack target information, you won't generate
wrong code. ;-)

> Talking about OpenCL: the lowering you described for the large vector
> instructions sounds reasonable. Optimal code, however, would probably
> be produced by revisiting the whole loop structure and generating one
> that is performance-wise optimal for the target architecture.

Yes, and this is an important point in OpenCL for CPUs. If we could
run a sub-pass of Polly (just the vector fiddling) after the
legalization, that would make it much easier for OpenCL
implementations.

However, none of this applies to GPUs, and any pass you run could
completely destroy the semantics for a GPU back-end. The AMD
presentation at last year's meeting exposes some of that.

cheers,
--renato

>> Matching the target vector width in our heuristics will obviously give
>> the best performance. So to get optimal performance Polly needs to take
>> target data into account.

> Indeed! And even if you lack target information, you won't generate
> wrong code. ;-)

>> Talking about OpenCL: the lowering you described for the large vector
>> instructions sounds reasonable. Optimal code, however, would probably
>> be produced by revisiting the whole loop structure and generating one
>> that is performance-wise optimal for the target architecture.

> Yes, and this is an important point in OpenCL for CPUs. If we could
> run a sub-pass of Polly (just the vector fiddling) after the
> legalization, that would make it much easier for OpenCL
> implementations.

I do not get this one. Why would you use just a part of Polly? Was I
wrong in assuming that LLVM, even today and without any special pass,
generates correct code for the wide OpenCL vectors? For me, Polly is
just an optimization that could revisit the whole vectorization
decision by looking at the big picture of the whole loop nest and
generating a target-specific loop nest with target-specific
vectorization (and OpenMP parallelisation).

> However, none of this applies to GPUs, and any pass you run could
> completely destroy the semantics for a GPU back-end. The AMD
> presentation at last year's meeting exposes some of that.

I have seen the AMD presentation and believe we can generate efficient
vector code for GPUs as well. Obviously with some adaptations, but I am
convinced this is doable.

Cheers
Tobi

> I do not get this one. Why would you use just a part of Polly?

Oh, you can. It's just that maybe you don't need to run all of Polly
if OpenCL already has the vector semantics done in the front-end.

> Was I wrong in assuming that LLVM, even today and without any special
> pass, generates correct code for the wide OpenCL vectors? For me, Polly
> is just an optimization that could revisit the whole vectorization
> decision by looking at the big picture of the whole loop nest and
> generating a target-specific loop nest with target-specific
> vectorization (and OpenMP parallelisation).

I'm really not the OpenCL expert, but I hear that it's not as trivial
as one would think.

I know from generating NEON code in the front-end that any fiddling
with the semantics of the instructions can make the pattern-matching
algorithm fail, and then you fall back to normal instructions.

I'm just trying to be cautious here and not fall into false hopes, but
someone with more knowledge of OpenCL would know better.

> I have seen the AMD presentation and believe we can generate efficient
> vector code for GPUs as well. Obviously with some adaptations, but I am
> convinced this is doable.

Great! Even better than I thought! ;-)

cheers,
--renato