> Hi,
> I just had a detailed look at the code of Polly[1]; it seems that
> Polly is starting to support some basic auto-parallelization.
This is true. However, it is still work in progress. I hope we can soon show some interesting results.
> I have some idea to improve the current auto-vectorization
> and parallelization approach in Polly.
> The main idea is to separate the transform passes and the codegen
> passes for auto-parallelization and vectorization (Graphite[2] for
> GCC seems to take a similar approach for auto-vectorization).
This is just historical. Sebastian is planning to couple them a lot closer. Graphite and auto-parallelization/vectorization are not yet playing well together, even though Sebastian has made some huge progress on the Graphite<->Autovec interaction.
> That means Polly (or a similar framework) should perform the
> necessary code transformations, then generate only sequential code
> and provide the necessary parallelism information (this information
> could be encoded as metadata, just like TBAA); later passes can then
> generate parallel code from that parallelism information.
I also thought about this approach.
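To make the metadata idea concrete, here is a minimal sketch (in LLVM's C++ API) of how a pass could tag all memory accesses of a loop it has proven parallel. The metadata kind name "polly.parallel" and the helper are purely illustrative; no such convention exists:

  #include "llvm/Analysis/LoopInfo.h"
  #include "llvm/IR/Metadata.h"

  using namespace llvm;

  // Illustrative sketch: attach a made-up "polly.parallel" metadata
  // kind to every memory access in a loop that the polyhedral
  // analysis proved free of loop-carried dependences.
  static void markLoopParallel(Loop *L) {
    LLVMContext &Ctx = L->getHeader()->getContext();
    MDNode *MD = MDNode::get(Ctx, MDString::get(Ctx, "polly.parallel"));

    for (BasicBlock *BB : L->getBlocks())
      for (Instruction &I : *BB)
        if (I.mayReadOrWriteMemory())
          I.setMetadata("polly.parallel", MD);
  }

A later OpenMP or vector code generation pass could then query this metadata instead of redoing the dependence analysis.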
> The main benefits of separating the transform passes and the codegen
> passes are:
> 1. It decouples the autopar framework from the various parallel
> runtime environments, so we can keep both the autopar framework and
> the code generation pass for a specific parallel runtime environment
> simple, and we can support more parallel runtime environments easily.
Yes, as soon as we start to support more targets we should move some of the logic out into separate functions/files/passes. This could be done either inside Polly or as a generic LLVM pass, depending on which is more suitable.
> 2. It allows some generic parallelism information to live outside a
> specific autopar framework, so this information can benefit more
> passes in LLVM. For example, the X86 and PTX backends could use it
> to perform target-specific auto-vectorization.
What other types of parallelism are you expecting? We currently support thread-level parallelism (as in OpenMP) and vector-level parallelism (as in LLVM-IR vectors). At least for X86 I do not see any reason for target-specific auto-vectorization, as LLVM-IR vectors are lowered extremely well to x86 SIMD instructions. I suppose this is the same for all CPU targets. I still need to look into GPU targets.
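As a small illustration (generic clang vector extensions, not Polly output):

  // Generic vector code: clang turns this into an LLVM-IR
  // 'fadd <4 x float>', which the X86 backend lowers to a single
  // SSE 'addps' without any target-specific vectorizer pass.
  typedef float v4sf __attribute__((vector_size(16)));

  v4sf add4(v4sf a, v4sf b) {
    return a + b;
  }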
> Implementation considerations:
> We could define some kind of generic "parallelism analysis", or a
> generic version of the "LoopDependenceAnalysis" interface just like
> AliasAnalysis, or we could encode the parallelism information as
> metadata. Combining both should also be fine.
Thanks ether for starting to reason about this.
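For the record, a generic interface in the spirit of AliasAnalysis might look roughly like the following sketch; the class and its methods are invented here and do not exist in LLVM:

  #include "llvm/Analysis/LoopInfo.h"

  namespace llvm {

  // Invented for illustration: an abstract analysis that different
  // implementations (polyhedral or classical dependence analysis)
  // could provide, similar to how AliasAnalysis is pluggable.
  class ParallelismInfo {
  public:
    virtual ~ParallelismInfo() {}

    // True if the iterations of L can execute in parallel.
    virtual bool isParallel(const Loop *L) const = 0;

    // Largest vector width (in elements) for which the body of L
    // is known to be free of loop-carried dependences.
    virtual unsigned getVectorizableWidth(const Loop *L) const = 0;
  };

  } // end namespace llvm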
My current plan is as follows:
Create stable support for OpenMP and LLVM-IR vector generation inside Polly. This is not very difficult, as the needed analyses can easily be performed in the polyhedral framework, and code generation is not difficult either. SIMD vector support is almost complete and OpenMP code generation is about halfway done. Based on that infrastructure we can take advantage of both thread-level and SIMD-level parallelism, which covers the two most common targets.
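To give an idea of the OpenMP code generation, the emitted code roughly follows the shape below, written here as a simplified C++ sketch against libgomp's entry points (the real generated code also passes loop bounds and privatized values, and divides the iteration space among the threads):

  extern "C" {
  void GOMP_parallel_start(void (*fn)(void *), void *data,
                           unsigned num_threads);
  void GOMP_parallel_end(void);
  }

  struct Payload { float *A; long N; };  // values captured for the body

  // The loop body, outlined into a subfunction.
  static void body(void *raw) {
    Payload *P = static_cast<Payload *>(raw);
    for (long i = 0; i < P->N; ++i)  // simplified: really a per-thread chunk
      P->A[i] *= 2.0f;
  }

  void run(float *A, long N) {
    Payload P = {A, N};
    GOMP_parallel_start(body, &P, 0);  // 0 = let the runtime pick
    body(&P);                          // the master thread works too
    GOMP_parallel_end();
  }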
The next step will be to evaluate loop transformations that enable thread- and vector-level parallelism, as done e.g. in Pluto[1].
I am pretty confident we can show good improvements. Stay tuned. 
Regarding your proposal, I hope to move some of the code into more generic frameworks. I am thinking e.g. about introducing OpenMP intrinsics to support different OpenMP libraries like libgomp and mpc[2]. LLVM-IR vector instructions, however, are already generic SIMD instructions, so I do not see any reason to create target-specific auto-vectorizer passes.
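Neither such intrinsics nor a lowering pass exist yet; the following sketch only illustrates the idea. Polly would emit one generic call, and a small pass would rewrite it for the chosen runtime (all names are invented, and the sketch assumes the generic call already matches libgomp's (fn, data, num_threads) signature):

  #include "llvm/IR/Instructions.h"
  #include "llvm/IR/Module.h"

  using namespace llvm;

  // Rewrite a call to a made-up generic "start parallel region"
  // intrinsic into libgomp's GOMP_parallel_start. A variant of the
  // same pass could target another runtime such as mpc instead.
  static void lowerToGOMP(CallInst *CI) {
    Module &M = *CI->getModule();
    LLVMContext &Ctx = M.getContext();

    Type *VoidTy = Type::getVoidTy(Ctx);
    PointerType *PtrTy = PointerType::getUnqual(Ctx);
    FunctionType *FnTy = FunctionType::get(
        VoidTy, {PtrTy, PtrTy, Type::getInt32Ty(Ctx)}, false);

    FunctionCallee Start =
        M.getOrInsertFunction("GOMP_parallel_start", FnTy);
    CI->setCalledOperand(Start.getCallee());
  }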
The remaining question is how and where to generate parallel/vector code. The current approach is to do this directly at Polly code generation time; the solution you proposed would be to generate sequential code inside Polly, annotate it with metadata, reparse it, and finally create OpenMP/vector code.
The second approach would have the advantage that multiple LLVM passes could consume the information and generate code from it, and several different analyses could create the needed metadata.
It has, however, the drawback that instead of doing code generation just once after Polly, we do sequential code generation -> reparsing/analysis -> parallel code generation. Furthermore, the infrastructure needs to pass along all the information needed for efficient parallelization, which includes at least the access strides, the alignment, and the privatized variables. Recomputing this information using scalar evolution might be difficult, as Polly may introduce loop induction variables using e.g. ceil/floor divisions.
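As an illustration of that last point, polyhedral code generators such as CLooG emit tiled loops of roughly the following shape; the floor division in the bounds is what makes the strides hard to recover with scalar evolution:

  #include <algorithm>

  // Floor division, as emitted by polyhedral code generators.
  inline long floord(long n, long d) {
    return n >= 0 ? n / d : -((-n + d - 1) / d);
  }

  void tiled(float *A, long N) {
    for (long t = 0; t <= floord(N - 1, 64); ++t)             // tile loop
      for (long i = 64 * t; i <= std::min(64 * t + 63, N - 1); ++i)
        A[i] += 1.0f;
  }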
As Polly, based on the current approach, will soon support target-agnostic auto-vectorization and OpenMP code generation, I plan to focus next on polyhedral loop nest optimizations that enable efficient vectorization and thread-level parallelism. As these transformations are performed on the abstract polyhedral description, no further changes to the code generation are needed. If a larger framework is shown to be useful, I will definitely support it. For the moment, however, I am fine with the existing lightweight approach.
Cheers
Tobi
[1] http://pluto-compiler.sourceforge.net/
[2] http://mpc.sf.net