GSoC 2009: Auto-vectorization

Hi all,

I'd like to have a first stab at a loop-based auto-vectorization pass as
part of 2009's Google Summer of Code program. As far as I can tell from
searching the mailing list archives, no work on such an auto-vectorizer
seems to be currently in progress.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in
the specific context of LLVM and to build a foundation from which future
work can progress.

So, initially, I aim at supporting only the simplest loops such as:

    int a[256], b[256], c[256];
    for (int i = 0; i < 256; ++i)
      c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions; i.e. the core of the desired result would look like:

    %va = load <256 x i32>* %a
    %vb = load <256 x i32>* %b
    %vc = add <256 x i32> %va, %vb
    store <256 x i32> %vc, <256 x i32>* %c

After this transformation, the code could be left as is (which would
generate a fully unrolled vector-add, as long as the vector length
doesn't exceed the lowering capabilities of the code generator), or
strip-mined into a loop using fixed-size vectors (preferably
corresponding to the SIMD register width of the target architecture).
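
To illustrate the strip-mined form, here is a rough C sketch (using the
GCC vector extension rather than IR) for a hypothetical 4-lane, 128-bit
SIMD target; it assumes 16-byte-aligned arrays:

    typedef int v4si __attribute__((vector_size(16)));

    void vadd(int a[256], int b[256], int c[256]) {
      /* 64 iterations, each one adding a <4 x i32> worth of elements */
      for (int i = 0; i < 256; i += 4) {
        v4si va = *(const v4si *)&a[i];  /* assumes 16-byte alignment */
        v4si vb = *(const v4si *)&b[i];
        *(v4si *)&c[i] = va + vb;
      }
    }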

Auto-vectorization has numerous dependencies on other
analysis/transformation passes (loop normalization, constant/scalar
expansion, dependence analysis, etc). A significant part of the proposed
project is to evaluate already-available LLVM passes regarding their
applicability to support auto-vectorization and to extend/implement
what's missing.

As for my personal background, I'm a graduate student at Vienna
University of Technology (TU Wien) with a strong background in compiler
theory, acquired in a wide variety of undergraduate- and graduate-level
courses. I recently implemented two simple compilers using LLVM and conducted
various experiments using LLVM's vector support. My general interest in
data-parallel computation stems from a great fondness of APL (especially
of its modern offspring K). While I'm familiar with the concepts behind
auto-vectorization, I haven't implemented an auto-vectorizer before. I'm
author of and contributor to several smaller open source projects.

I appreciate any suggestions and would be very excited if someone is
interested in mentoring this.

Andreas Bolka wrote:

Hi all,

I'd like to have a first stab at a loop-based auto-vectorization pass as
part of 2009's Google Summer of Code program. As far as I can tell from
searching the mailing list archives, no work on such an auto-vectorizer
seems to be currently in progress.

Hi Andreas,

Actually, you'd be the third person to try writing one, with myself being the second. :) Don't let that discourage you; I don't think anyone's actively working on it now.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in
the specific context of LLVM and to build a foundation from which future
work can progress.

There are two types of autovectorization, SLP (superword level parallelism) and ILP (instruction level parallelism). You can do ILP, which is loop-ignorant autovectorization of straight-line code (which turns out to be the code that runs inside the loop).

So, initially, I aim at supporting only the simplest loops such as:

    int a[256], b[256], c[256];
    for (int i = 0; i < 256; ++i)
      c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions; i.e. the core of the desired result would look like:

    %va = load <256 x i32>* %a
    %vb = load <256 x i32>* %b
    %vc = add <256 x i32> %va, %vb
    store <256 x i32> %vc, <256 x i32>* %c

After this transformation, the code could be left as is (which would
generate a fully unrolled vector-add, as long as the vector length
doesn't exceed the lowering capabilities of the code generator), or
strip-mined into a loop using fixed-size vectors (preferably
corresponding to the SIMD register width of the target architecture).

Auto-vectorization has numerous dependencies on other
analysis/transformation passes (loop normalization, constant/scalar
expansion, dependence analysis, etc). A significant part of the proposed
project is to evaluate already-available LLVM passes regarding their
applicability to support auto-vectorization and to extend/implement
what's missing.

There's no loop dependence analysis in LLVM yet. How are you planning to implement it? Can you point to a paper, or write up a plain English description?

As for my personal background, I'm a graduate student at Vienna
University of Technology (TU Wien) with a strong background in compiler
theory, acquired in a wide variety of undergraduate- and graduate-level
courses. I recently implemented two simple compilers using LLVM and conducted
various experiments using LLVM's vector support. My general interest in
data-parallel computation stems from a great fondness of APL (especially
of its modern offspring K). While I'm familiar with the concepts behind
auto-vectorization, I haven't implemented an auto-vectorizer before. I'm
author of and contributor to several smaller open source projects.

I appreciate any suggestions and would be very excited if someone is
interested in mentoring this.

I'm interested, but I'd like to see more details about exactly how (i.e., with what algorithm) you intend to find and transform code first.

Nick

Nick Lewycky wrote:

Andreas Bolka wrote:

Hi all,

I'd like to have a first stab at a loop-based auto-vectorization pass as
part of 2009's Google Summer of Code program. As far as I can tell from
searching the mailing list archives, no work on such an auto-vectorizer
seems to be currently in progress.

Hi Andreas,

Actually, you'd be the third person to try writing one, with myself being the second. :) Don't let that discourage you; I don't think anyone's actively working on it now.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in
the specific context of LLVM and to build a foundation from which future
work can progress.

There are two types of autovectorization, SLP (superword level parallelism) and ILP (instruction level parallelism). You can do ILP, which is loop-ignorant autovectorization of straight-line code (which turns out to be the code that runs inside the loop).

Pardon me. Owen Anderson has disabused me of what I wrote above.

SLP is loop-ignorant. ILP is a rather generic term and probably shouldn't be described as a type of autovectorization. Cross-iteration and intra-iteration better describe the distinction.

In any event, the paper I was thinking of when I wrote the above is:
http://www.cag.lcs.mit.edu/slp/SLP-PLDI-2000.pdf

My concern is that writing both loop dependence analysis and loop autovectorization would be difficult in the time span allotted for GSoC, and you may want to consider just doing SLP.

Nick

Hi all,
I'd like to have a first stab at a loop-based auto-vectorization pass as
part of 2009's Google Summer of Code program. As far as I can tell from
searching the mailing list archives, no work on such an auto-vectorizer
seems to be currently in progress.

Hi Andreas,

This would be a very interesting project! The most important part of the proposal is to break it down into small pieces.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in
the specific context of LLVM and to build a foundation from which future
work can progress.

This is a great approach. I'm very much in favor of solving small problems first, instead of trying to solve all the world's issues at once. :)

So, initially, I aim at supporting only the simplest loops such as:

   int a[256], b[256], c[256];
   for (int i = 0; i < 256; ++i)
     c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions;

Sounds great. Some important steps:

1. We need an abstract dependence analysis interface (like we have for AA) and one implementation (probably handling trivial loops like the one above to start with). Many people have talked about this, but no code has gone in yet.

2. We need some interface that can be implemented by a target machine to describe what the vector capabilities of the machine are.

3. We need the transformation code. Starting with a simple loop vectorizer (as opposed to an SLP system) would make sense to me.

Once the basics are in place, it can be feature-crept to support new idioms (reductions, alignment analysis, etc). The first cut doesn't need to provide an immediate speedup, but should have many testcases that demonstrate the loop forms it can handle.

i.e. the core of the desired result would look like:

   %va = load <256 x i32>* %a
   %vb = load <256 x i32>* %b
   %vc = add <256 x i32> %va, %vb
   store <256 x i32> %vc, <256 x i32>* %c

Actually, IMO this is probably not the direction we want to go. Vectorizing to extremely wide vectors like this isn't along the lines of what we want: we should get something like the original loop unrolled 4 times to do operations on <4 x i32>'s (for x86/ppc).

Auto-vectorization has numerous dependencies on other
analysis/transformation passes (loop normalization, constant/scalar
expansion, dependence analysis, etc). A significant part of the proposed
project is to evaluate already-available LLVM passes regarding their
applicability to support auto-vectorization and to extend/implement
what's missing.

Sounds good. I'm also fine with it only working in carefully controlled circumstances. Depending on how ambitious you are, you could also do a GSoC project just around a) building dependence analysis interface + some *non*-trivial implementations, and b) build some dep analysis clients around it (e.g. loop interchange, etc). This may be a less risky and higher impact project in the short term, if you are interested in it.

As for my personal background, I'm a graduate student at Vienna
University of Technology (TU Wien) with a strong background in compiler
theory, acquired in a wide variety of undergraduate- and graduate-level
courses. I recently implemented two simple compilers LLVM and conducted
various experiments using LLVM's vector support. My general interest in
data-parallel computation stems from a great fondness of APL (especially
of its modern offspring K). While I'm familiar with the concepts behind
auto-vectorization, I haven't implemented an auto-vectorizer before. I'm
author of and contributor to several smaller open source projects.

Great! Either approach (building a very simple auto-vec system from end-to-end, or building some non-trivial dep analysis implementations + a client or two) sounds like a very promising GSoC project!

-Chris

Chris Lattner wrote:

Depending on how ambitious you are, you
could also do a GSoC project just around a) building dependence analysis interface + some *non*-trivial implementations, and b) build some dep analysis clients around it (e.g. loop interchange, etc). This may be a less risky and higher impact project in the short term, if you are interested in it.

This would be in great demand independent of the auto-vectorization.

Given that the amount of work is fixed, I would rather see all
the work go into a brilliant dependence analysis than into an ok
dependence analysis and an ok vectorization. I encourage you to
consider it, Andreas.

Once you have decent loop dependence analysis, performing autovectorization becomes fairly mechanical *if* you start off by restricting the kinds of instructions you allow to be vectorized. For example, if you begin with loops containing:

  - only instructions whose operands have the same size in bytes, and thus will work well with the same "vectorization width" or "lane count"
  - loop counts that are known at compile time and divisible by the lane count
  - no other control flow
  - only well-aligned loads and stores
  - only instructions that have good vector support in llvm today

Then, given decent dependence analysis, you should be able to write a vectorizer that handles loops meeting these conditions pretty trivially. Once you've done that, you can start relaxing the assumptions, e.g. getting rid of the loop-count constraint by adding a scalar loop to "finish up" trip counts that don't match the vectorization width (see the sketch below), performing dynamic alignment analysis, etc.
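
For instance, with the trip-count constraint relaxed via a scalar
epilogue, the generated code might look roughly like the following C
sketch (using the GCC vector extension; the vector body still assumes
16-byte-aligned arrays):

    typedef int v4si __attribute__((vector_size(16)));

    void vadd(const int *a, const int *b, int *c, int n) {
      int i = 0;
      /* vector body: handles the largest multiple of 4 <= n */
      for (; i + 4 <= n; i += 4) {
        v4si va = *(const v4si *)&a[i];
        v4si vb = *(const v4si *)&b[i];
        *(v4si *)&c[i] = va + vb;
      }
      /* scalar epilogue: finishes the remaining n % 4 elements */
      for (; i < n; ++i)
        c[i] = a[i] + b[i];
    }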

There may be other issues in LLVM that I'm not aware of that make this more complicated, but with a reasonable initial set of constraints, the "vectorization" part of a loop autovectorizer should be feasible to do in a short timeframe.

Hi Andreas,

So, initially, I aim at supporting only the simplest loops such as:

   int a[256], b[256], c[256];
   for (int i = 0; i < 256; ++i)
     c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions; i.e. the core of the desired result would look like:

   %va = load <256 x i32>* %a
   %vb = load <256 x i32>* %b
   %vc = add <256 x i32> %va, %vb
   store <256 x i32> %vc, <256 x i32>* %c

After this transformation, the code could be left as is (which would
generate a fully unrolled vector-add, as long as the vector length
doesn't exceed the lowering capabilities of the code generator), or
strip-mined into a loop using fixed-size vectors (preferably
corresponding to the SIMD register width of the target architecture).

I think the biggest problem with this approach, apart from the fact that it doesn't mirror how vectors are typically used today in LLVM, is that vectors in LLVM are of fixed size. This is going to severely limit the usefulness of this transformation. I think you may be better off getting information about vector widths for the architecture (e.g. from TargetData) and vectorizing directly with a particular width in mind.

Overall, this would be a great GSOC project, and I guarantee you would get a lot of interest in this :). Good luck!

Stefanus

Hi Andreas,

Hi all,

I'd like to have a first stab at a loop-based auto-vectorization pass as
part of 2009's Google Summer of Code program.

Good idea!

As far as I can tell from
searching the mailing list archives, no work on such an auto-vectorizer
seems to be currently in progress.

Yes.

Whereas auto-vectorization is a well-researched topic, complexity seems
to quickly explode once more advanced loop constructs are to be
supported. Therefore my aim is to start with the most minimal
implementation possible, to explore the difficulties encountered in
the specific context of LLVM and to build a foundation from which future
work can progress.

So, initially, I aim at supporting only the simplest loops such as:

   int a[256], b[256], c[256];
   for (int i = 0; i < 256; ++i)
     c[i] = a[i] + b[i];

My goal is to implement the necessary analyses and transformations to
turn IR corresponding to such code into IR utilizing vector
instructions; i.e. the core of the desired result would look like:

   %va = load <256 x i32>* %a
   %vb = load <256 x i32>* %b
   %vc = add <256 x i32> %va, %vb
   store <256 x i32> %vc, <256 x i32>* %c

After this transformation, the code could be left as is (which would
generate a fully unrolled vector-add, as long as the vector length
doesn't exceed the lowering capabilities of the code generator), or
strip-mined into a loop using fixed-size vectors (preferably
corresponding to the SIMD register width of the target architecture).

Auto-vectorization has numerous dependencies on other
analysis/transformation passes (loop normalization, constant/scalar
expansion, dependence analysis, etc). A significant part of the proposed
project is to evaluate already-available LLVM passes regarding their
applicability to support auto-vectorization and to extend/implement
what's missing.

As for my personal background, I'm a graduate student at Vienna
University of Technology (TU Wien) with a strong background in compiler
theory, acquired in a wide variety of undergraduate- and graduate-level
courses. I recently implemented two simple compilers using LLVM and conducted
various experiments using LLVM's vector support. My general interest in
data-parallel computation stems from a great fondness of APL (especially
of its modern offspring K). While I'm familiar with the concepts behind
auto-vectorization, I haven't implemented an auto-vectorizer before. I'm
author of and contributor to several smaller open source projects.

I appreciate any suggestions and would be very excited if someone is
interested in mentoring this.

As Chris said, there are many approaches you can take for a GSoC project whose title includes "auto vectorization". IMO, building an initial foundation for data dependence analysis is a good start.

I encourage you to prepare a proposal that includes a list of measurable and achievable milestones with realistic goals, where each milestone is a small step in the right direction. I am ready to mentor this project.

Hi Stefanus,

> i.e. the core of the desired result would look like:
>
> %va = load <256 x i32>* %a
> %vb = load <256 x i32>* %b
> %vc = add <256 x i32> %va, %vb
> store <256 x i32> %vc, <256 x i32>* %c

I think the biggest problem with this approach, apart from the fact
that it doesn't mirror how vectors are typically used today in LLVM,
is that vectors in LLVM are of fixed size. This is going to severely
limit the usefulness of this transformation. I think you may be better
off getting information about vector widths for the architecture (e.g.
from TargetData) and vectorizing directly with a particular width in
mind.

Thanks for the remark. My initial thinking was that, independent of
auto-vectorization, a general "wide vector strip-mining" transformation
would be worthwhile to have anyway. I.e. a transformation which would
convert such wide vector operations as above to a loop over
(target-dependent) smaller width vectors. And as the auto-vectorizer I
am aiming for would initially only be able to vectorize loops with
statically known constant trip counts, I could squash some complexity by
leaving this target-dependent aspect to such a hypothetical strip-mining
pass.

But I understand the limits of this, and based on the feedback I've
gotten so far, I'll generate a loop with fixed-size vectors instead. A
command-line option to set the vectorization factor should suffice for
quick bootstrapping if querying the necessary register width from
TargetData turns out to be problematic.

Overall, this would be a great GSOC project, and I guarantee you would
get a lot of interest in this :). Good luck!

Thanks!

Hi Chris,

> My goal is to implement the necessary analyses and transformations to
> turn IR corresponding to such code into IR utilizing vector
> instructions;

Sounds great. Some important steps:

1. We need an abstract dependence analysis interface (like we have for
AA) and one implementation (probably handling trivial loops like the
one above to start with). Many people have talked about this, but no
code has gone in yet.

My original plan was to implement a loop dependence analysis only for
non-nested loops using the "GCD" method of determining interdependent
statements. This limits the array subscripts to simple linear
expressions of the form x1*i + x0, where i is the induction variable of
the loop.
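
As a rough sketch of the test itself (the helper names below are mine,
not existing LLVM code): for two references a[x1*i + x0] and
a[y1*i + y0], a cross-iteration dependence can exist only if
gcd(x1, y1) divides y0 - x0:

    /* Sketch of the GCD dependence test; names are illustrative. */
    static int gcd(int a, int b) {
      while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
      }
      return a;
    }

    /* Returns 0 when independence is proven; 1 when a dependence
       between a[x1*i + x0] and a[y1*i + y0] is still possible. */
    int gcd_test_may_depend(int x1, int x0, int y1, int y0) {
      int g = gcd(x1 < 0 ? -x1 : x1, y1 < 0 ? -y1 : y1);
      if (g == 0)            /* both subscripts are loop-invariant */
        return x0 == y0;
      return (y0 - x0) % g == 0;
    }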

I haven't thought yet about generalising this to ease future
implementation of more complex loop dependency analyses.

2. We need some interface that can be implemented by a target machine
to describe what the vector capabilities of the machine are.

Initially the vectorization pass would basically only need the SIMD
register width; later, the supported "packing formats" may come in handy
(i.e. the information that 128-bit registers can be used as e.g.
16 x i8, 8 x i16, and 4 x i32).
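
Something as small as the following might do for a start (a
hypothetical C sketch; the names are mine and do not correspond to an
existing LLVM interface):

    /* Hypothetical target capability record; illustrative only. */
    typedef struct {
      unsigned simd_width_bits;     /* e.g. 128 for SSE2/AltiVec */
      unsigned num_element_sizes;
      const unsigned *element_bits; /* e.g. {8, 16, 32}, i.e. the
                                       128-bit register usable as
                                       16 x i8, 8 x i16, or 4 x i32 */
    } VectorCapabilities;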

3. We need the transformation code. Starting with a simple loop
vectorizer (as opposed to an SLP system) would make sense to me.

I'm also inclined to prefer "classical" Allen & Kennedy-style
auto-vectorization over "Vectorization by unrolling"/SLP approaches.
Based on the responses on this list, it seems that this sentiment is
shared by the LLVM community. An SLP pass would certainly be a nice
complement to loop-based auto-vectorization, though.

Once the basics are in place, it can be feature-crept to support new
idioms (reductions, alignment analysis, etc). The first cut doesn't
need to provide an immediate speedup, but should have many testcases
that demonstrate the loop forms it can handle.

That's exactly my understanding of it. I want to limit this to the most
basic transformation possible; i.e.:

- non-nested loops, having
- a statically known constant trip count, which is
- divisible by the vectorization factor;
- a single basic block as loop body, with
- a single arithmetic instruction, using
- only same-sized vector operands, and
- ignoring alignment issues.

Vectorizing to extremely wide vectors like this isn't along the lines
of what we want: we should get something like the original loop
unrolled 4 times to do operations on <4 x i32>'s (for x86/ppc).

Yes, I agree. See also my reply to Stefanus.

> Auto-vectorization has numerous dependencies on other
> analysis/transformation passes (loop normalization, constant/scalar
> expansion, dependence analysis, etc). A significant part of the
> proposed project is to evaluate already-available LLVM passes
> regarding their applicability to support auto-vectorization and to
> extend/implement what's missing.

Depending on how ambitious you are, you could also do a GSoC project
just around a) building dependence analysis interface + some
*non*-trivial implementations, and b) build some dep analysis clients
around it (e.g. loop interchange, etc). This may be a less risky and
higher impact project in the short term, if you are interested in it.

I understand the importance and necessity of a general loop dependence
interface, but I'm not too familiar with the more complex
approaches (which generally require integer linear programming, if I'm
not mistaken). So I can't really assess whether there's a realistic chance to
build a generally usable non-trivial implementation in the time
available.

But as a simple loop dependence analysis pass will be one of the first
steps in this project anyway, I'll be in a much better position to
re-evaluate that question once this first pass is done. I'd certainly be
willing to modify the project, if it turns out that this direction would
be more valuable and/or more realistic to pursue.

Great! Either approach (building a very simple auto-vec system from
end-to-end, or building some non-trivial dep analysis implementations
+ a client or two) sounds like a very promising GSoC project!

Thanks for the extensive feedback!

Andreas,

I agree this would be a great project. One comment:

Hi Stefanus,

i.e. the core of the desired result would look like:

  %va = load <256 x i32>* %a
  %vb = load <256 x i32>* %b
  %vc = add <256 x i32> %va, %vb
  store <256 x i32> %vc, <256 x i32>* %c

I think the biggest problem with this approach, apart from the fact
that it doesn't mirror how vectors are typically used today in LLVM,
is that vectors in LLVM are of fixed size. This is going to severely
limit the usefulness of this transformation. I think you may be better
off getting information about vector widths for the architecture (e.g.
from TargetData) and vectorizing directly with a particular width in
mind.

Thanks for the remark. My initial thinking was that, independent of
auto-vectorization, a general "wide vector strip-mining" transformation
would be worthwhile to have anyway. I.e. a transformation which would
convert such wide vector operations as above to a loop over
(target-dependent) smaller width vectors. And as the auto-vectorizer I
am aiming for would initially only be able to vectorize loops with
statically known constant trip counts, I could squash some complexity by
leaving this target-dependent aspect to such a hypothetical strip-mining
pass.

But I understand the limits of this, and based on the feedback I've
gotten so far, I'll generate a loop with fixed-size vectors instead. A
command-line option to set the vectorization factor should suffice for
quick bootstrapping if querying the necessary register width from
TargetData turns out to be problematic.

Even if you decide to generate loops with fixed-size vectors, I think it could be worth separating your implementation into
(a) an auto-vectorization analysis that identifies vectorizable statements and transforms the code to capture them, independent of vector length;
(b) a trivial strip-mining pass that generates loops with fixed-size vectors; and
(c) an *optional* loop alignment step that adjusts vector operations so vector loads/stores are aligned (and creates the extra scalar loop at the end).

Doing (a) in terms of arbitrary-length vectors will simplify the problem by allowing you to isolate two relatively unrelated tasks, viz., auto-vectorization analysis (#a) and target-dependent code generation (#b and #c). It also lets you leave the third step as a separate pass, which might be difficult otherwise. Note that (c) is optional only in the sense that you don't need it for simple cases like you described; in general, it is not optional.

On a related note, the following paper describes such a "back end", and a reasonable IR for interfacing a front end (steps a-c above) with it:
  http://llvm.cs.uiuc.edu/pubs/2006-06-15-VEE-VectorLLVA.html

Overall, this would be a great GSOC project, and I guarantee you would
get a lot of interest in this :). Good luck!

Thanks!

--
Andreas

--Vikram
Associate Professor, Computer Science
University of Illinois at Urbana-Champaign
http://llvm.org/~vadve

1. We need an abstract dependence analysis interface (like we have for
AA) and one implementation (probably handling trivial loops like the
one above to start with). Many people have talked about this, but no
code has gone in yet.

My original plan was to implement a loop dependence analysis only for
non-nested loops using the "GCD" method of determining interdependent
statements. This limits the array subscripts to simple linear
expressions of the form x1*i + x0, where i is the induction variable of
the loop.

I haven't thought yet about generalising this to ease future
implementation of more complex loop dependency analyses.

Sounds fine, I'm ok with starting simple, so long as the client talks through an abstract interface of some sort, and the impl is in a different pass.

2. We need some interface that can be implemented by a target machine
to describe what the vector capabilities of the machine are.

Initially the vectorization pass would basically only need the SIMD
register width; later, the supported "packing formats" may come in handy
(i.e. the information that 128-bit registers can be used as e.g.
16 x i8, 8 x i16, and 4 x i32).

Sounds good. Perhaps a list of supported types would be enough.

Once the basics are in place, it can be feature-crept to support new
idioms (reductions, alignment analysis, etc). The first cut doesn't
need to provide an immediate speedup, but should have many testcases
that demonstrate the loop forms it can handle.

That's exactly my understanding of it. I want to limit this to the most
basic transformation possible; i.e.:

- non-nested loops, having
- a statically known constant trip count, which is
- divisible by the vectorization factor;
- a single basic block as loop body, with
- a single arithmetic instruction, using
- only same-sized vector operands, and
- ignoring alignment issues.

Sounds great!

Depending on how ambitious you are, you could also do a GSoC project
just around a) building dependence analysis interface + some
*non*-trivial implementations, and b) build some dep analysis clients
around it (e.g. loop interchange, etc). This may be a less risky and
higher impact project in the short term, if you are interested in it.

I understand the importance and necessity of a general loop dependence
interface, but I'm not too familiar with the more complex
approaches (which generally require integer linear programming, if I'm
not mistaken). So I can't really assess whether there's a realistic chance to
build a generally usable non-trivial implementation in the time
available.

Ok!

-Chris

Hi Nick,

Nick Lewycky wrote:

There's no loop dependence analysis in LLVM yet. How are you planning
to implement it? Can you point to a paper, or write up a plain English
description?

Currently I'm planning to implement dependence analysis as described in
e.g. Allen, R. and Kennedy, K. (1987). Automatic translation of FORTRAN
programs to vector form. ACM Trans. Program. Lang. Syst. 9, 4.

As mentioned in other replies, I'd only want to support non-nested loops
and array index expressions that are linear functions of the induction
variable, so that the GCD test can be used to determine statement
interdependence.
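
As a concrete instance of the test: for references a[2*i] and
a[2*i + 1], a dependence would require 2*i1 = 2*i2 + 1 for some
integers i1 and i2; since gcd(2, 2) = 2 does not divide 1, the GCD test
proves the two references independent.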

For the simple loop I showed in my original example, which uses three
distinct arrays, dependence checking via LLVM's MemoryDependenceAnalysis
interface may even be sufficient. But I'll have to investigate this in
more detail.

In any event, the paper I was thinking of [..] is:
http://www.cag.lcs.mit.edu/slp/SLP-PLDI-2000.pdf

My concern is that writing both loop dependence analysis and loop
autovectorization would be difficult in the time span allotted for
GSoC, and you may want to consider just doing SLP.

Thanks for the suggestion! I considered SLP/"vectorization by unrolling"
as well, but my personal interests currently lean more towards
"classical" vectorization, given the usefulness of loop dependence
analysis for other transformations. The general interest expressed on
this list seems to confirm this.