[llvm-commits] [PATCH] BasicBlock Autovectorization Pass

Ralf, et al.,

Attached is the latest version of my autovectorization patch. llvmdev
has been CC'd (as had been suggested to me); this e-mail contains
additional benchmark results.

First, these are preliminary results because I did not do the things
necessary to make them real (explicitly quiet the machine, bind the
processes to one cpu, etc.). But they should be good enough for
discussion.

I'm using LLVM head r143101, with the attached patch applied, and clang
head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5. gcc was run with
-O3 and no other optimization flags. opt was run with -vectorize
-unroll-allow-partial -O3 and no other optimization flags (the patch
adds the -vectorize option). llc was given just -O3.

Below I've included results using the benchmark program by Maleki, et
al. See:
An Evaluation of Vectorizing Compilers - PACT'11
(http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
their benchmark program was retrieved from:
http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz

Also, when using clang, I had to pass -Dinline= on the command line:
when using -emit-llvm, clang appears not to emit code for functions
declared inline. This is a bug, but I've not yet tracked it down. There
are two such small functions in the benchmark program, and the regular
inliner *should* catch them anyway.

Results:
0. Name of the loop
1. Time using LLVM with vectorization
2. Time using LLVM without vectorization
3. Time using gcc with vectorization
4. Time using gcc without vectorization

Loop llvm-v llvm gcc-v gcc

llvm_bb_vectorize-20111029.diff (70.1 KB)

> Ralf, et al.,
>
> Attached is the latest version of my autovectorization patch. llvmdev
> has been CC'd (as had been suggested to me); this e-mail contains
> additional benchmark results.
>
> First, these are preliminary results because I did not do the things
> necessary to make them real (explicitly quiet the machine, bind the
> processes to one cpu, etc.). But they should be good enough for
> discussion.
>
> I'm using LLVM head r143101, with the attached patch applied, and clang
> head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5. gcc was run with
> -O3 and no other optimization flags. opt was run with -vectorize
> -unroll-allow-partial -O3 and no other optimization flags (the patch
> adds the -vectorize option).

And opt had also been given the flag: -bb-vectorize-vector-bits=256

-Hal

> Ralf, et al.,
>
> Attached is the latest version of my autovectorization patch. llvmdev
> has been CC'd (as had been suggested to me); this e-mail contains
> additional benchmark results.
>
> First, these are preliminary results because I did not do the things
> necessary to make them real (explicitly quiet the machine, bind the
> processes to one cpu, etc.). But they should be good enough for
> discussion.
>
> I'm using LLVM head r143101, with the attached patch applied, and clang
> head r143100 on an x86_64 machine (some kind of Intel Xeon). For the gcc
> comparison, I'm using the Ubuntu build 4.4.3-4ubuntu5. gcc was run with
> -O3 and no other optimization flags. opt was run with -vectorize
> -unroll-allow-partial -O3 and no other optimization flags (the patch
> adds the -vectorize option).

And opt had also been given the flag: -bb-vectorize-vector-bits=256

And this was a mistake (because the machine on which the benchmarks were
run does not have AVX). I've rerun, see better results below...

-Hal

> llc was just given -O3.
>
> Below I've included results using the benchmark program by Maleki, et
> al. See:
> An Evaluation of Vectorizing Compilers - PACT'11
> (http://polaris.cs.uiuc.edu/~garzaran/doc/pact11.pdf). The source of
> their benchmark program was retrieved from:
> http://polaris.cs.uiuc.edu/~maleki1/TSVC.tar.gz
>
> Also, when using clang, I had to pass -Dinline= on the command line:
> when using -emit-llvm, clang appears not to emit code for functions
> declared inline. This is a bug, but I've not yet tracked it down. There
> are two such small functions in the benchmark program, and the regular
> inliner *should* catch them anyway.
>
> Results:
> 0. Name of the loop
> 1. Time using LLVM with vectorization
> 2. Time using LLVM without vectorization
> 3. Time using gcc with vectorization
> 4. Time using gcc without vectorization

Here are improved results where the correct (and default)
vector-register size was used.

Loop llvm-v llvm gcc-v gcc

http://clang.llvm.org/compatibility.html#inline

Thanks,

> Also, when using clang, I had to pass -Dinline= on the command line:
> when using -emit-llvm, clang appears not to emit code for functions
> declared inline. This is a bug, but I've not yet tracked it down.

http://clang.llvm.org/compatibility.html#inline

Thanks! (Of course, the standard does not govern the relationship
between the compiler frontend and the backend, so it *could* work some
other way).

-Hal

As Peter Collingbourne indirectly pointed out to me, clang's
optimizations are still important (even if it is only emitting LLVM IR).
I've rerun the llvm code-generation steps, adding -O3 to clang. Here are
the results (they are significantly better):

Loop llvm-v llvm gcc-v gcc

I've attached the latest version of my autovectorization patch. This
version is significantly faster (in compile time) than the version I
posted a couple of days ago, and generally produces better output.

At this point, next steps in enhancing the vectorization include:
1. Add an add/sub and/or alternating-negation vector intrinsic to
provide for generating add-subtract, and more generally, asymmetric fma
instructions.
2. Make -vectorize imply -unroll-allow-partial [Is there an easy way to
do this?]
3. Add a -fvectorize flag to clang along the same lines.

Updated vectorization benchmark:
Loop llvm-v llvm gcc-v gcc

llvm_bb_vectorize-20111031-2.diff (75.3 KB)

I've attached the latest version of my autovectorization patch.

Working through the test suite has proved to be a productive
experience ;) -- and almost all of the bugs that it revealed have now
been fixed. There are still two programs that don't compile with
vectorization turned on, and I'm working on those now, but in case
anyone feels like playing with vectorization, this patch will probably
work for you.

The largest three performance speedups are:
SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

The largest three performance slowdowns are:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
slowdown
SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

(from these, I've excluded tests that took less than 0.1 seconds to
run).

Largest three compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
1276% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Not everything slows down: MultiSource/Benchmarks/Prolangs-C++/city/city,
for example, compiles 10% faster with vectorization enabled; but, for
the most part, things certainly take longer to compile with
vectorization enabled. The average slowdown over all tests was 29%, the
median was 11%. On the other hand, the average speedup over all tests
was 5.2%, the median was 1.3%.

Compared to previous patches, which had a minimum required chain length
of 3 or 4, I've now made the default 6. While using a chain length of 4
worked well for targeted benchmarks, it caused an overall slowdown on
almost all test-suite programs. Using a minimum length of 6 causes, on
average, a speedup; so I think that is a better default choice.

-Hal

llvm_bb_vectorize-20111107.diff (77.6 KB)

> I've attached the latest version of my autovectorization patch.
>
> Working through the test suite has proved to be a productive
> experience ;) -- and almost all of the bugs that it revealed have now
> been fixed. There are still two programs that don't compile with
> vectorization turned on, and I'm working on those now, but in case
> anyone feels like playing with vectorization, this patch will probably
> work for you.

Hey Hal,

that is great news, especially as the numbers seem to show that
vectorization has a significant performance impact. What exactly did you
compare: 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

> The largest three performance speedups are:
> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>
> The largest three performance slowdowns are:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 114% slowdown
> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> slowdown
> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

Interesting. Do you understand what causes these slowdowns? Can your heuristic be improved?

> Largest three compile-time slowdowns:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 1276% slowdown
> SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
> MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

Yes, that is a lot. Do you understand whether this time is well invested (does it give significant speedups)?

If I understood correctly, your vectorizer seems to have quadratic complexity, which may cause large slowdowns. Do you think it may be useful/possible to make it linear by introducing a constant upper bound somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are lucky and most of the vectorization opportunities are close by (in some sense), such that we get most of the speedup by looking at a subset of the problem.

> Not everything slows down: MultiSource/Benchmarks/Prolangs-C++/city/city,
> for example, compiles 10% faster with vectorization enabled; but, for
> the most part, things certainly take longer to compile with
> vectorization enabled. The average slowdown over all tests was 29%, the
> median was 11%. On the other hand, the average speedup over all tests
> was 5.2%, the median was 1.3%.

Nice. I think this is a great start.

> Compared to previous patches, which had a minimum required chain length
> of 3 or 4, I've now made the default 6. While using a chain length of 4
> worked well for targeted benchmarks, it caused an overall slowdown on
> almost all test-suite programs. Using a minimum length of 6 causes, on
> average, a speedup; so I think that is a better default choice.

I am also trying to understand whether it is possible to use your vectorizer for Polly. My idea is to do some clever loop unrolling.

Starting from this loop:

for (int i = 0; i < 4; i++) {
    A[i] += 1;
    A[i] = B[i] + 3;
    C[i] = A[i];
}

The classical unroller would create this code:

    A[0] += 1;
    A[0] = B[0] + 3;
    C[0] = A[0];

    A[1] += 1;
    A[1] = B[1] + 3;
    C[1] = A[1];

    A[2] += 1;
    A[2] = B[2] + 3;
    C[2] = A[2];

    A[3] += 1;
    A[3] = B[3] + 3;
    C[3] = A[3];

However, if I can prove this loop is parallel, I want to create this code:

    A[0] += 1;
    A[1] += 1;
    A[2] += 1;
    A[3] += 1;

    A[0] = B[0] + 3;
    A[1] = B[1] + 3;
    A[2] = B[2] + 3;
    A[3] = B[3] + 3;

    C[0] = A[0];
    C[1] = A[1];
    C[2] = A[2];
    C[3] = A[3];

I assume this will allow the vectorization of test cases that failed
because of possible aliasing. However, I am more interested in whether
the change of execution order could also improve the vectorization
outcome or reduce the compile-time overhead of your vectorizer.

Thanks for working on the vectorization
Cheers

Tobi

> I've attached the latest version of my autovectorization patch.
>
> Working through the test suite has proved to be a productive
> experience ;) -- and almost all of the bugs that it revealed have now
> been fixed. There are still two programs that don't compile with
> vectorization turned on, and I'm working on those now, but in case
> anyone feels like playing with vectorization, this patch will probably
> work for you.

> Hey Hal,
>
> that is great news, especially as the numbers seem to show that
> vectorization has a significant performance impact. What exactly did you
> compare: 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

Yes. [I've tested the current patch directly using opt -vectorize
-unroll-allow-partial; for running the test suite I recompiled
llvm/clang to hardcode the options as I wanted them].

> The largest three performance speedups are:
> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>
> The largest three performance slowdowns are:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 114% slowdown
> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> slowdown
> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown
>
> Interesting. Do you understand what causes these slowdowns? Can your
> heuristic be improved?

I've not specifically looked at these cases.

Generally, I've observed slowdowns from the introduction of too many
permutations per chain (the chain selection procedure can be changed to
help this, and I'll work on that). Also, sometimes legalization of
non-native vector operations creates inefficient code, and I'll also
look at these cases in more detail.

> Largest three compile-time slowdowns:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 1276% slowdown
> SingleSource/Benchmarks/Misc/salsa20 - 1000% slowdown
> MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 508% slowdown

> Yes, that is a lot. Do you understand whether this time is well invested
> (does it give significant speedups)?

No, not always. Actually, the security-rijndael test not only takes the
longest to vectorize, but it also shows the largest slowdown. This is
certainly something that I'll investigate.

> If I understood correctly, your vectorizer seems to have quadratic
> complexity, which may cause large slowdowns. Do you think it may be
> useful/possible to make it linear by introducing a constant upper bound
> somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are lucky and
> most of the vectorization opportunities are close by (in some sense),
> such that we get most of the speedup by looking at a subset of the
> problem.

Yes, I agree. That makes a lot of sense.

What would be even better is if the loop unroller would intermix
statements from the loops where possible instead of leaving it to the
vectorizer to do all of the grouping after the fact. That, I fear, is a
whole other project.

> Not everything slows down: MultiSource/Benchmarks/Prolangs-C++/city/city,
> for example, compiles 10% faster with vectorization enabled; but, for
> the most part, things certainly take longer to compile with
> vectorization enabled. The average slowdown over all tests was 29%, the
> median was 11%. On the other hand, the average speedup over all tests
> was 5.2%, the median was 1.3%.
> Nice. I think this is a great start.

Thanks!

> Compared to previous patches, which had a minimum required chain length
> of 3 or 4, I've now made the default 6. While using a chain length of 4
> worked well for targeted benchmarks, it caused an overall slowdown on
> almost all test-suite programs. Using a minimum length of 6 causes, on
> average, a speedup; so I think that is a better default choice.

> I am also trying to understand whether it is possible to use your
> vectorizer for Polly. My idea is to do some clever loop unrolling.
>
> Starting from this loop:
>
> for (int i = 0; i < 4; i++) {
>     A[i] += 1;
>     A[i] = B[i] + 3;
>     C[i] = A[i];
> }
>
> The classical unroller would create this code:
>
>     A[0] += 1;
>     A[0] = B[0] + 3;
>     C[0] = A[0];
>
>     A[1] += 1;
>     A[1] = B[1] + 3;
>     C[1] = A[1];
>
>     A[2] += 1;
>     A[2] = B[2] + 3;
>     C[2] = A[2];
>
>     A[3] += 1;
>     A[3] = B[3] + 3;
>     C[3] = A[3];
>
> However, if I can prove this loop is parallel, I want to create
> this code:
>
>     A[0] += 1;
>     A[1] += 1;
>     A[2] += 1;
>     A[3] += 1;
>
>     A[0] = B[0] + 3;
>     A[1] = B[1] + 3;
>     A[2] = B[2] + 3;
>     A[3] = B[3] + 3;
>
>     C[0] = A[0];
>     C[1] = A[1];
>     C[2] = A[2];
>     C[3] = A[3];
>
> I assume this will allow the vectorization of test cases that failed
> because of possible aliasing. However, I am more interested in whether
> the change of execution order could also improve the vectorization
> outcome or reduce the compile-time overhead of your vectorizer.

Yes, this would certainly help.

By the way, by default the current implementation will create unaligned
vector loads and stores, but these are generally broken up by the
legalizer. This behavior can be suppressed using the
-bb-vectorize-aligned-only flag. It would be nice if the loop unroller
chose the unrolling factor to preserve the maximum available alignment,
but I don't think that it currently does so.

One problem with the current implementation is that it relies on
GetPointerBaseWithConstantOffset to determine if two loads or stores
share the same base address. This fails with partially-unrolled loops
because it cannot "undo" all of the additions to the offset induction
variable in order to understand that some of the loads and stores are
really adjacent in memory. This is something that I think can be
improved within the vectorizer itself, and I'm planning on doing some
work on this in the future.

Thanks for your feedback,
Hal

> I've attached the latest version of my autovectorization patch.
>
> Working through the test suite has proved to be a productive
> experience ;) -- and almost all of the bugs that it revealed have now
> been fixed. There are still two programs that don't compile with
> vectorization turned on, and I'm working on those now, but in case
> anyone feels like playing with vectorization, this patch will probably
> work for you.

> Hey Hal,
>
> that is great news, especially as the numbers seem to show that
> vectorization has a significant performance impact. What exactly did you
> compare: 'clang -O3' against 'clang -O3 -mllvm -vectorize'?

> Yes. [I've tested the current patch directly using opt -vectorize
> -unroll-allow-partial; for running the test suite I recompiled
> llvm/clang to hardcode the options as I wanted them].

You should not need to hack clang. As shown above, you should be able to
pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your CFLAGS. Did you run the test suite at -O3?

Also, I think it would be interesting to compare, for the test-suite,
the performance of
'clang -O3 -mllvm -unroll-allow-partial' with
'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. That will
show how much of the runtime overhead is due to the unrolling (it
produces more code that needs to be optimized) and which part is due to
vectorization. The same applies to the speedup: how much is caused by
unrolling and how much is actually caused by your pass?

> The largest three performance speedups are:
> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup
> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup
> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup
>
> The largest three performance slowdowns are:
> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
> 114% slowdown
> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
> slowdown
> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

> Interesting. Do you understand what causes these slowdowns? Can your
> heuristic be improved?

> I've not specifically looked at these cases.
>
> Generally, I've observed slowdowns from the introduction of too many
> permutations per chain (the chain selection procedure can be changed to
> help this, and I'll work on that). Also, sometimes legalization of
> non-native vector operations creates inefficient code, and I'll also
> look at these cases in more detail.

Sure. That sounds reasonable. I am confident you can improve this gradually.

> If I understood correctly, your vectorizer seems to have quadratic
> complexity, which may cause large slowdowns. Do you think it may be
> useful/possible to make it linear by introducing a constant upper bound
> somewhere, e.g. limiting it to 10/20/100 steps? Maybe we are lucky and
> most of the vectorization opportunities are close by (in some sense),
> such that we get most of the speedup by looking at a subset of the
> problem.

> Yes, I agree. That makes a lot of sense.
>
> What would be even better is if the loop unroller would intermix
> statements from the loops where possible instead of leaving it to the
> vectorizer to do all of the grouping after the fact. That, I fear, is a
> whole other project.

First, I do not understand the 'even better' here. To me it would be
great if, on the one hand, the vectorizer could be constrained such that
the compile time is predictable, and if, on the other hand, we could
make the loop unroller create code in a way such that the constrained
vectorizer still performs the relevant transformations.

Also, what do you mean by 'if the loop unroller would intermix
statements from the loops where possible'? Are you referring to the
grouped unrolling as shown in my last mail?

> Compared to previous patches, which had a minimum required chain length
> of 3 or 4, I've now made the default 6. While using a chain length of 4
> worked well for targeted benchmarks, it caused an overall slowdown on
> almost all test-suite programs. Using a minimum length of 6 causes, on
> average, a speedup; so I think that is a better default choice.

> I am also trying to understand whether it is possible to use your
> vectorizer for Polly. My idea is to do some clever loop unrolling.
>
> Starting from this loop:
>
> for (int i = 0; i < 4; i++) {
>     A[i] += 1;
>     A[i] = B[i] + 3;
>     C[i] = A[i];
> }
>
> The classical unroller would create this code:
>
>     A[0] += 1;
>     A[0] = B[0] + 3;
>     C[0] = A[0];
>
>     A[1] += 1;
>     A[1] = B[1] + 3;
>     C[1] = A[1];
>
>     A[2] += 1;
>     A[2] = B[2] + 3;
>     C[2] = A[2];
>
>     A[3] += 1;
>     A[3] = B[3] + 3;
>     C[3] = A[3];
>
> However, if I can prove this loop is parallel, I want to create
> this code:
>
>     A[0] += 1;
>     A[1] += 1;
>     A[2] += 1;
>     A[3] += 1;
>
>     A[0] = B[0] + 3;
>     A[1] = B[1] + 3;
>     A[2] = B[2] + 3;
>     A[3] = B[3] + 3;
>
>     C[0] = A[0];
>     C[1] = A[1];
>     C[2] = A[2];
>     C[3] = A[3];
>
> I assume this will allow the vectorization of test cases that failed
> because of possible aliasing. However, I am more interested in whether
> the change of execution order could also improve the vectorization
> outcome or reduce the compile-time overhead of your vectorizer.

> Yes, this would certainly help.

In which way? How much? Would it be possible to restrict the vectorizer
such that its complexity is linear, while at the same time ensuring that
it will still vectorize code that is 'group unrolled' (meaning unrolled
in the way shown above)?

> By the way, by default the current implementation will create unaligned
> vector loads and stores, but these are generally broken up by the
> legalizer. This behavior can be suppressed using the
> -bb-vectorize-aligned-only flag. It would be nice if the loop unroller
> chose the unrolling factor to preserve the maximum available alignment,
> but I don't think that it currently does so.

I don't know, but it sounds like a good thing to do.

> One problem with the current implementation is that it relies on
> GetPointerBaseWithConstantOffset to determine if two loads or stores
> share the same base address. This fails with partially-unrolled loops
> because it cannot "undo" all of the additions to the offset induction
> variable in order to understand that some of the loads and stores are
> really adjacent in memory. This is something that I think can be
> improved within the vectorizer itself, and I'm planning on doing some
> work on this in the future.

Here you may also want to look into ScalarEvolution. Basically, two
loads access adjacent memory if the difference of the scalar evolutions
of the two load addresses is equal to sizeof(element_type).
ScalarEvolution should be a lot more general than
GetPointerBaseWithConstantOffset().

Cheers
Tobi

>>> I've attached the latest version of my autovectorization patch.
>>>
>>> Working through the test suite has proved to be a productive
>>> experience ;) -- and almost all of the bugs that it revealed have now
>>> been fixed. There are still two programs that don't compile with
>>> vectorization turned on, and I'm working on those now, but in case
>>> anyone feels like playing with vectorization, this patch will probably
>>> work for you.
>>
>> Hey Hal,
>>
>> those are great news. Especially as the numbers seem to show that
>> vectorization has a significant performance impact. What did you compare
>> exactly. 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
>
> Yes. [I've tested the current patch directly using opt -vectorize
> -unroll-allow-partial; for running the test suite I recompiled
> llvm/clang to hardcode the options as I wanted them].

> You should not need to hack clang. As shown above, you should be able to
> pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your
> CFLAGS. Did you run the test suite at -O3?

That's good to know. Yes, it was run at -O3.

> Also, I think it would be interesting to compare, for the test-suite,
> the performance of
> 'clang -O3 -mllvm -unroll-allow-partial' with
> 'clang -O3 -mllvm -unroll-allow-partial -mllvm -vectorize'. That will
> show how much of the runtime overhead is due to the unrolling (it
> produces more code that needs to be optimized) and which part is due to
> vectorization. The same applies to the speedup: how much is caused by
> unrolling and how much is actually caused by your pass?

Just turning on partial unrolling yields, on average, a 4.7% slowdown
(0% median), and a 28.6% increase in compile time (21.8% median).

>>> The largest three performance speedups are:
>>> SingleSource/Benchmarks/BenchmarkGame/puzzle - 59.2% speedup

With just partial unrolling: 64.8% speedup

>>> SingleSource/UnitTests/Vector/multiplies - 57.7% speedup

With just partial unrolling: 58.2% speedup

>>> SingleSource/Benchmarks/Misc/flops-7 - 50.75% speedup

With just partial unrolling: 11.8% speedup

>>>
>>> The largest three performance slowdowns are:
>>> MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
>>> 114% slowdown

With just partial unrolling: 0% slowdown (47.8% slowdown in compile
time)

>>> MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 66.6%
>>> slowdown

With just partial unrolling: 8.3% speedup

>>> SingleSource/Benchmarks/Misc/flops-8 - 64.2% slowdown

With just partial unrolling: 8.3% speedup

>>>

Here are the top-3 lists comparing vectorization plus partial unrolling
against partial unrolling alone:

Top speedups:
SingleSource/Benchmarks/Misc/flops-7 - 44.1% speedup
SingleSource/Benchmarks/Misc/flops-1 - 42.3% speedup
MultiSource/Applications/sqlite3/sqlite3 - 42.3% speedup

Top slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
114% slowdown
MultiSource/Benchmarks/MiBench/network-patricia/network-patricia - 81.8%
slowdown
SingleSource/Benchmarks/Misc/flops-8 - 79% slowdown

Top compile-time slowdowns:
MultiSource/Benchmarks/MiBench/security-rijndael/security-rijndael -
832% slowdown
SingleSource/Benchmarks/Misc/salsa20 - 600% slowdown
MultiSource/Benchmarks/Trimaran/enc-3des/enc-3des - 263% slowdown

For this comparison, however (unlike comparing to plain -O3), there are
a significant number of compile-time speedups (I guess that this is
because vectorization can reduce the number of instructions processed by
later passes). Top compile-time speedups:
SingleSource/Benchmarks/Stanford/Oscar - 46% speedup
SingleSource/Regression/C/bigstack - 45% speedup
SingleSource/Benchmarks/BenchmarkGame/fannkuch - 45% speedup

>> Interesting. Do you understand what causes these slowdowns? Can your
>> heuristic be improved?
>
> I've not specifically looked at these cases.
>
> Generally, I've observed slowdowns from the introduction of too many
> permutations per chain (the chain selection procedure can be changed to
> help this, and I'll work on that). Also, sometimes legalization of
> non-native vector operations creates inefficient code, and I'll also
> look at these cases in more detail.
> Sure. That sounds reasonable. I am confident you can improve this gradually.

>> If I understood correctly it seems your vectorizer has quadratic
>> complexity which may cause large slowdowns. Do you think it may be
>> useful/possible to make it linear by introducing a constant upper bound
>> somewhere? E.g. limiting it to 10/20/100 steps. Maybe we are lucky and
>> most of the vectorization opportunities are close by (in some sense),
>> such that we get most of the speedup by looking at a subset of the problem.
>
> Yes, I agree. That makes a lot of sense.
>
> What would be even better is if the loop unroller would intermix
> statements from the loops where possible instead of leaving it to the
> vectorizer to do all of the grouping after the fact. That, I fear, is a
> whole other project.

> First, I do not understand the 'even better' here. To me it would be
> great if, on the one hand, the vectorizer could be constrained such
> that the compile time is predictable, and if, on the other hand, we
> could make the loop unroller create code in a way such that the
> constrained vectorizer still performs the relevant transformations.

I was not clear; I meant that imposing a cut-off is likely to work, but
it would work even better if the loop unroller produced code more
conducive to vectorization.

> Also, what do you mean by 'if the loop unroller would intermix
> statements from the loops where possible'? Are you referring to the
> grouped unrolling as shown in my last mail?

Yes.

Also, the code necessary to take advantage of grouped unrolling already
exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
causes the vectorizer to stop searching for instruction pairings after
the first use of an instruction.

>>> Compared to previous patches, which had a minimum required chain length
>>> of 3 or 4, I've now made the default 6. While using a chain length of 4
>>> worked well for targeted benchmarks, it caused an overall slowdown on
>>> almost all test-suite programs. Using a minimum length of 6 causes, on
>>> average, a speedup; so I think that is a better default choice.
>>
>> I am also trying to understand whether it is possible to use your
>> vectorizer for Polly. My idea is to do some clever loop unrolling.
>>
>> Starting from this loop:
>>
>> for (int i = 0; i < 4; i++) {
>>     A[i] += 1;
>>     A[i] = B[i] + 3;
>>     C[i] = A[i];
>> }
>>
>> The classical unroller would create this code:
>>
>>     A[0] += 1;
>>     A[0] = B[0] + 3;
>>     C[0] = A[0];
>>
>>     A[1] += 1;
>>     A[1] = B[1] + 3;
>>     C[1] = A[1];
>>
>>     A[2] += 1;
>>     A[2] = B[2] + 3;
>>     C[2] = A[2];
>>
>>     A[3] += 1;
>>     A[3] = B[3] + 3;
>>     C[3] = A[3];
>>
>> However, if I can prove this loop is parallel, I want to create
>> this code:
>>
>>     A[0] += 1;
>>     A[1] += 1;
>>     A[2] += 1;
>>     A[3] += 1;
>>
>>     A[0] = B[0] + 3;
>>     A[1] = B[1] + 3;
>>     A[2] = B[2] + 3;
>>     A[3] = B[3] + 3;
>>
>>     C[0] = A[0];
>>     C[1] = A[1];
>>     C[2] = A[2];
>>     C[3] = A[3];
>>
>> I assume this will allow the vectorization of test cases that failed
>> because of possible aliasing. However, I am more interested in whether
>> the change of execution order could also improve the vectorization
>> outcome or reduce the compile-time overhead of your vectorizer.
>
> Yes, this would certainly help.

> In which way? How much? Would it be possible to restrict the vectorizer
> such that its complexity is linear, while at the same time ensuring
> that it will still vectorize code that is 'group unrolled' (meaning
> unrolled in the way shown above)?

It would help because we could stop the pairing search after an
instruction's first use; that is significantly faster and generates
many fewer potential pairings that need to be considered by the later
stages. Exactly "how much" it would help I can't yet answer; I'd need to
construct actual benchmarks for that. If you have some group-unrolled
code, then it will be simple to test: just pass -bb-vectorize-fast-dep.
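As a rough illustration of why stopping at the first use helps, here is a toy model of the candidate-pairing scan (invented names; this is not the actual BBVectorize code, just a sketch of the cost behavior):

```cpp
#include <vector>

// Toy model: for each instruction I, the vectorizer considers later
// instructions J as pairing partners. With the fast-dep heuristic the scan
// for I stops at I's first use, since pairing beyond that point would
// require moving a user of I.
struct Inst {
  int FirstUseIdx; // index of I's first user in the block, or -1 if unused
};

int countCandidates(const std::vector<Inst> &Block) {
  int Count = 0;
  for (int I = 0; I < (int)Block.size(); ++I) {
    int Limit = Block[I].FirstUseIdx < 0 ? (int)Block.size()
                                         : Block[I].FirstUseIdx;
    for (int J = I + 1; J < Limit; ++J)
      ++Count; // J is still a potential partner for I
  }
  return Count;
}
```

When values are used soon after they are defined (as in straight-line unrolled code), the inner scan terminates quickly and far fewer candidate pairs reach the later selection stages.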

> By the way, the current implementation, by default, it will create
> unaligned vector loads and stores, but these are generally broken up by
> the legalizer. This behavior can be suppressed using the
> -bb-vectorize-aligned-only flag. It would be nice if the loop unroller
> chose the unrolling factor to preserve the maximum available alignment,
> but I don't think that it currently does so.
I don't know, but it sounds like a good thing to do.

> One problem with the current implementation is that it relies on
> GetPointerBaseWithConstantOffset to determine if two loads or stores
> share the same base address. This fails with partially-unrolled loops
> because it cannot "undo" all of the additions to the offset induction
> variable in order to understand that some of the loads and stores are
> really adjacent in memory. This is something that I think can be
> improved within the vectorizer itself, and I'm planning on doing some
> work on this in the future.
Here you may also want to look into ScalarEvolution. Basically two loads
access adjacent memory if the difference of the scalar evolution of the
two load addresses is equal to sizeof(element_type). ScalarEvolution
should be a lot more general than GetPointerBaseWithConstantOffset().
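The adjacency test described here can be illustrated with plain integer addresses (`areAdjacent` is an invented helper; ScalarEvolution performs the same subtraction symbolically on SCEV expressions, which is why it also works when the addresses involve an induction variable):

```cpp
#include <cstdint>

// Two loads access adjacent memory if the difference of their addresses
// equals the size of one element. Here the addresses are plain integers;
// with ScalarEvolution the subtraction is done on SCEV expressions instead.
bool areAdjacent(std::uint64_t AddrA, std::uint64_t AddrB,
                 std::uint64_t ElemSize) {
  return AddrB - AddrA == ElemSize;
}
```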

Thanks! That sounds great; I'll have to look at that.

-Hal

>>> I've attached the latest version of my autovectorization patch.
>>>
>>> Working through the test suite has proved to be a productive
>>> experience ;) -- and almost all of the bugs that it revealed have now
>>> been fixed. There are still two programs that don't compile with
>>> vectorization turned on, and I'm working on those now, but in case
>>> anyone feels like playing with vectorization, this patch will probably
>>> work for you.
>>
>> Hey Hal,
>>
>> this is great news, especially as the numbers seem to show that
>> vectorization has a significant performance impact. What did you compare
>> exactly: 'clang -O3' against 'clang -O3 -mllvm -vectorize'?
>
> Yes. [I've tested the current patch directly using opt -vectorize
> -unroll-allow-partial; for running the test suite I recompiled
> llvm/clang to hardcode the options as I wanted them].

You should not need to hack clang. As shown above, you should be able to
pass '-vectorize' to the optimizer by using '-mllvm -vectorize' in your
CFLAGS.

-mllvm -unroll-allow-partial works because -unroll-allow-partial is an
option to the pass, but I had added -vectorize as an option to opt
itself, and those options do not get picked up by clang. Would it be
better to add -vectorize as an option in IPO/PassManagerBuilder? Maybe
I'll try doing it that way instead.

-Hal

Yes, that's probably better.

Cheers
Tobi

[A lot of performance results skipped]

OK. As expected, part of the speedup is because of unrolling; however, it also shows that vectorization itself improves performance. There is still a lot of room for improvement, but it seems to be a good start.

If I understood correctly, it seems your vectorizer has quadratic
complexity, which may cause large slowdowns. Do you think it may be
useful/possible to make it linear by introducing a constant upper bound
somewhere, e.g., limiting it to 10/20/100 steps? Maybe we are lucky and
most of the vectorization opportunities are close by (in some sense),
such that we get most of the speedup by looking at a subset of the problem.
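A constant bound of this kind can be modeled as a fixed scan window (a sketch only; `pairScanSteps` is an invented name and the real vectorizer may count its work differently):

```cpp
// With a window of K, each of the N instructions is compared against at most
// K later instructions, so the total number of steps is bounded by N*K,
// i.e., linear in N for fixed K, instead of the full N*(N-1)/2.
long pairScanSteps(long N, long K) {
  long Steps = 0;
  for (long I = 0; I < N; ++I) {
    long End = (I + 1 + K < N) ? (I + 1 + K) : N; // clamp window to block end
    Steps += End - (I + 1);
  }
  return Steps;
}
```

For N = 100 and K = 10 this gives 945 steps rather than the 4950 of a full quadratic scan, and the bound N*K holds for any N.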

Yes, I agree. That makes a lot of sense.

What would be even better is if the loop unroller would intermix
statements from the loops where possible instead of leaving it to the
vectorizer to do all of the grouping after the fact. That, I fear, is a
whole other project.

First, I do not understand the 'even better' here. To me it would be
great if, on the one hand, the vectorizer could be constrained such that
the compile time is predictable, and, on the other hand, we could make
the loop unroller create code in a way such that the constrained
vectorizer still performs the relevant transformations.

I was not clear; I meant that imposing a cutoff is likely to work, but
it would work even better if the loop unroller produced code more
conducive to vectorization.

Is such a cutoff already integrated, and can it be enabled by default? I believe it would be great if we could show that the compile-time increase is limited by a low constant factor (maybe 100%), while still showing an overall improvement in run time.

Also, what do you mean by 'if the loop unroller would intermix
statements from the loops where possible'? Are you referring to the
grouped unrolling as shown in my last mail?

Yes.

Also, the code necessary to take advantage of grouped unrolling already
exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
causes the vectorizer to stop searching for instruction pairings after
the first use of an instruction.

Nice. I just tried one of the very basic examples from the Polly test suite (simple_vec.ll). Running opt -O3 -vectorize on it does not create any vector instructions. I also played a little with the vectorizer options, but was not able to get this example vectorized. Is this because the chain is too short?

One problem with the current implementation is that it relies on
GetPointerBaseWithConstantOffset to determine if two loads or stores
share the same base address. This fails with partially-unrolled loops
because it cannot "undo" all of the additions to the offset induction
variable in order to understand that some of the loads and stores are
really adjacent in memory. This is something that I think can be
improved within the vectorizer itself, and I'm planning on doing some
work on this in the future.

Here you may also want to look into ScalarEvolution. Basically two loads
access adjacent memory if the difference of the scalar evolution of the
two load addresses is equal to sizeof(element_type). ScalarEvolution
should be a lot more general than GetPointerBaseWithConstantOffset().

Thanks! That sounds great; I'll have to look at that.

Talking about this I looked again into ScalarEvolution.

To analyze a load, you would do:

LoadInst *Load = ...
Value *Pointer = Load->getPointerOperand();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase =
  dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
  return; // analysis failed

const Value *BaseValue = PointerBase->getValue();

You get the offset between two load addresses with SE->getMinusSCEV(). The size of an element is SE->getSizeOfExpr().

Hope that helps
Tobi

simple_vec.ll (1.14 KB)

[A lot of performance results skipped]

OK. As expected, part of the speedup is because of unrolling; however,
it also shows that vectorization itself improves performance. There is
still a lot of room for improvement, but it seems to be a good start.

>>>> If I understood correctly, it seems your vectorizer has quadratic
>>>> complexity, which may cause large slowdowns. Do you think it may be
>>>> useful/possible to make it linear by introducing a constant upper bound
>>>> somewhere, e.g., limiting it to 10/20/100 steps? Maybe we are lucky and
>>>> most of the vectorization opportunities are close by (in some sense),
>>>> such that we get most of the speedup by looking at a subset of the problem.
>>>
>>> Yes, I agree. That makes a lot of sense.
>>>
>>> What would be even better is if the loop unroller would intermix
>>> statements from the loops where possible instead of leaving it to the
>>> vectorizer to do all of the grouping after the fact. That, I fear, is a
>>> whole other project.
>>
>> First, I do not understand the 'even better' here. To me it would be
>> great if on one hand the vectorizer could be constrained, such that the
>> compile time is predictable. And on the other hand, we could make the
>> loop unroller create code in a way such that the constrained vectorizer
>> still performs the relevant transformations.
>
> I was not clear; I meant that imposing a cutoff is likely to work, but
> it would work even better if the loop unroller produced code more
> conducive to vectorization.

Is such a cutoff already integrated, and can it be enabled by default?

No, but I'll add one.

I believe it would be great if we could show that the compile-time
increase is limited by a low constant factor (maybe 100%), while still
showing an overall improvement in run time.

>> Also, what do you mean by 'if the loop unroller would intermix
>> statements from the loops where possible'? Are you referring to the
>> grouped unrolling as shown in my last mail?
>
> Yes.
>
> Also, the code necessary to take advantage of grouped unrolling already
> exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> causes the vectorizer to stop searching for instruction pairings after
> the first use of an instruction.

Nice. I just tried one of the very basic examples from the Polly test
suite (simple_vec.ll). Running opt -O3 -vectorize on it does not create
any vector instructions. I also played a little with the vectorizer
options, but was not able to get this example vectorized. Is this
because the chain is too short?

You are correct. It will vectorize with:
opt -O3 -vectorize -bb-vectorize-req-chain-depth=2

>>> One problem with the current implementation is that it relies on
>>> GetPointerBaseWithConstantOffset to determine if two loads or stores
>>> share the same base address. This fails with partially-unrolled loops
>>> because it cannot "undo" all of the additions to the offset induction
>>> variable in order to understand that some of the loads and stores are
>>> really adjacent in memory. This is something that I think can be
>>> improved within the vectorizer itself, and I'm planning on doing some
>>> work on this in the future.
>> Here you may also want to look into ScalarEvolution. Basically two loads
>> access adjacent memory if the difference of the scalar evolution of the
>> two load addresses is equal to sizeof(element_type). ScalarEvolution
>> should be a lot more general than GetPointerBaseWithConstantOffset().
>
> Thanks! That sounds great; I'll have to look at that.

Talking about this I looked again into ScalarEvolution.

To analyze a load, you would do:

LoadInst *Load = ...
Value *Pointer = Load->getPointerOperand();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase =
  dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
  return; // analysis failed

const Value *BaseValue = PointerBase->getValue();

You get the offset between two load addresses with SE->getMinusSCEV().
The size of an element is SE->getSizeOfExpr().

This is great! Thanks!

-Hal

[A lot of performance results skipped]

OK. As expected, part of the speedup is because of unrolling; however,
it also shows that vectorization itself improves performance. There is
still a lot of room for improvement, but it seems to be a good start.

>>>> If I understood correctly, it seems your vectorizer has quadratic
>>>> complexity, which may cause large slowdowns. Do you think it may be
>>>> useful/possible to make it linear by introducing a constant upper bound
>>>> somewhere, e.g., limiting it to 10/20/100 steps? Maybe we are lucky and
>>>> most of the vectorization opportunities are close by (in some sense),
>>>> such that we get most of the speedup by looking at a subset of the problem.
>>>
>>> Yes, I agree. That makes a lot of sense.
>>>
>>> What would be even better is if the loop unroller would intermix
>>> statements from the loops where possible instead of leaving it to the
>>> vectorizer to do all of the grouping after the fact. That, I fear, is a
>>> whole other project.
>>
>> First, I do not understand the 'even better' here. To me it would be
>> great if on one hand the vectorizer could be constrained, such that the
>> compile time is predictable. And on the other hand, we could make the
>> loop unroller create code in a way such that the constrained vectorizer
>> still performs the relevant transformations.
>
> I was not clear; I meant that imposing a cutoff is likely to work, but
> it would work even better if the loop unroller produced code more
> conducive to vectorization.

Is such a cutoff already integrated, and can it be enabled by default? I
believe it would be great if we could show that the compile-time
increase is limited by a low constant factor (maybe 100%), while still
showing an overall improvement in run time.

>> Also, what do you mean by 'if the loop unroller would intermix
>> statements from the loops where possible'? Are you referring to the
>> grouped unrolling as shown in my last mail?
>
> Yes.
>
> Also, the code necessary to take advantage of grouped unrolling already
> exists in the vectorizer. Enabling the flag -bb-vectorize-fast-dep
> causes the vectorizer to stop searching for instruction pairings after
> the first use of an instruction.

Nice. I just tried one of the very basic examples from the Polly test
suite (simple_vec.ll). Running opt -O3 -vectorize on it does not create
any vector instructions. I also played a little with the vectorizer
options, but was not able to get this example vectorized. Is this
because the chain is too short?

>>> One problem with the current implementation is that it relies on
>>> GetPointerBaseWithConstantOffset to determine if two loads or stores
>>> share the same base address. This fails with partially-unrolled loops
>>> because it cannot "undo" all of the additions to the offset induction
>>> variable in order to understand that some of the loads and stores are
>>> really adjacent in memory. This is something that I think can be
>>> improved within the vectorizer itself, and I'm planning on doing some
>>> work on this in the future.
>> Here you may also want to look into ScalarEvolution. Basically two loads
>> access adjacent memory if the difference of the scalar evolution of the
>> two load addresses is equal to sizeof(element_type). ScalarEvolution
>> should be a lot more general than GetPointerBaseWithConstantOffset().
>
> Thanks! That sounds great; I'll have to look at that.

Talking about this I looked again into ScalarEvolution.

To analyze a load, you would do:

LoadInst *Load = ...
Value *Pointer = Load->getPointerOperand();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase =
  dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
  return; // analysis failed

const Value *BaseValue = PointerBase->getValue();

You get the offset between two load addresses with SE->getMinusSCEV().
The size of an element is SE->getSizeOfExpr().

The AliasAnalysis class has a set of interfaces that can be used to
preserve the analysis even when some things are changed. Does
ScalarEvolution have a similar capability?

Thanks again,
Hal

You can state that your pass preserves ScalarEvolution. In this case
all analysis results are by default preserved, and it is your job to
invalidate the scalar evolution for the loops/values where it needs to
be recalculated.

The relevant functions are

ScalarEvolution::forgetValue(Value *)
ScalarEvolution::forgetLoop(Loop *)

Cheers
Tobi

>> Talking about this I looked again into ScalarEvolution.
>>
>> To analyze a load, you would do:
>>
>> LoadInst *Load = ...
>> Value *Pointer = Load->getPointerOperand();
>> const SCEV *PointerSCEV = SE->getSCEV(Pointer);
>> const SCEVUnknown *PointerBase =
>>   dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));
>>
>> if (!PointerBase)
>>   return; // analysis failed
>>
>> const Value *BaseValue = PointerBase->getValue();
>>
>> You get the offset between two load addresses with SE->getMinusSCEV().
>> The size of an element is SE->getSizeOfExpr().
>>
>
> The AliasAnalysis class has a set of interfaces that can be used to
> preserve the analysis even when some things are changed. Does
> ScalarEvolution have a similar capability?

You can state that your pass preserves ScalarEvolution. In this case all
analysis results are by default preserved and it is your job to
invalidate the scalar evolution for the loops/values where it needs to
be recalculated.

The relevant functions are

ScalarEvolution::forgetValue(Value *)

Since the vectorization pass is currently just a basic-block pass, I
think that I should need only forgetValue, right? I suppose that I would
call that on all of the values that are fused.

Also, using getPointerBase to get the base pointer seems simple enough,
but how should I use getMinusSCEV to get the offset? Should I call it on
each load pointer and its base pointer, or between the two load pointers
once I know that they share the same base? And once I do that, how do I
get the offset (if known)? I see the get[Uns|S]ignedRange functions, but
if there is a way to directly get a constant value, then that would be
more straightforward.

Thanks again,
Hal

On 11/08/2011 11:29 PM, Hal Finkel wrote:

Talking about this I looked again into ScalarEvolution.

To analyze a load, you would do:

LoadInst *Load = ...
Value *Pointer = Load->getPointerOperand();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase =
  dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
  return; // analysis failed

const Value *BaseValue = PointerBase->getValue();

You get the offset between two load addresses with SE->getMinusSCEV().
The size of an element is SE->getSizeOfExpr().

The AliasAnalysis class has a set of interfaces that can be used to
preserve the analysis even when some things are changed. Does
ScalarEvolution have a similar capability?

You can state that your pass preserves ScalarEvolution. In this
case all analysis results are by default preserved and it is your
job to invalidate the scalar evolution for the loops/values where
it needs to be recalculated.

The relevant functions are

ScalarEvolution::forgetValue(Value *)

Since the vectorization pass is currently just a basic-block pass, I
think that I should need only forgetValue, right? I suppose that I
would call that on all of the values that are fused.

You call it on all the values/instructions that are removed and on
those whose calculated result has changed.

Also, using getPointerBase to get the base pointer seems simple
enough, but how should I use getMinusSCEV to get the offset?

This should give you the offset from the base pointer:

LoadInst *Load = ...
Value *Pointer = Load->getPointerOperand();
const SCEV *PointerSCEV = SE->getSCEV(Pointer);
const SCEVUnknown *PointerBase =
  dyn_cast<SCEVUnknown>(SE->getPointerBase(PointerSCEV));

if (!PointerBase)
  return; // analysis failed

const SCEV *OffsetFromBase = SE->getMinusSCEV(PointerSCEV, PointerBase);

Should I call it on each load pointer and its base pointer, or
between the two load pointers once I know that they share the same
base?

That depends on what you want.

And once I do that, how do I get the offset (if known)? I see the
get[Uns|S]ignedRange functions, but if there is a way to directly get
a constant value, then that would be more straightforward.

I assume you want to know if two load addresses are identical, have
stride one (an offset of +1), or have some other, more complicated
relationship.

What might work is the following (entirely untested):

LoadInst *LoadOne = ...
LoadInst *LoadTwo = ...

Value *PointerOne = LoadOne->getPointerOperand();
Value *PointerTwo = LoadTwo->getPointerOperand();

const SCEV *PointerOneSCEV = SE->getSCEV(PointerOne);
const SCEV *PointerTwoSCEV = SE->getSCEV(PointerTwo);

// If this is a trivial offset we get something like 1*sizeof(long)
const SCEV *Offset = SE->getMinusSCEV(PointerOneSCEV, PointerTwoSCEV);

// Now we divide it by the element size
Type *ElemTy = LoadOne->getType(); // the type of the loaded element
const SCEV *SizeOfSCEV = SE->getSizeOfExpr(ElemTy);
const SCEV *OffsetInElements = SE->getUDivExpr(Offset, SizeOfSCEV);

if (const SCEVConstant *IntOffsetSCEV
      = dyn_cast<SCEVConstant>(OffsetInElements)) {
  ConstantInt *IntOffset = IntOffsetSCEV->getValue();
  return IntOffset;
} else {
  return 0; // this seems to be a complicated offset
}
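The byte-offset-to-element-offset step at the end can be checked with plain arithmetic (`offsetInElements` is an invented helper mirroring the getMinusSCEV/getUDivExpr combination above; ScalarEvolution does the same thing on symbolic expressions):

```cpp
#include <cstdint>

// Mirror of the SCEV computation: subtract the two addresses to get a byte
// offset, then divide by the element size to get the offset in elements.
// A result of 1 corresponds to the stride-one (adjacent) case.
std::uint64_t offsetInElements(std::uint64_t AddrA, std::uint64_t AddrB,
                               std::uint64_t ElemSize) {
  return (AddrB - AddrA) / ElemSize;
}
```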


Let me know if this or something similar worked.
Tobi