[GSoC 2016] Improvement of vectorization process in Polly - Final report

Dear LLVM contributors,

during the Google Summer of Code 2016 I have been working on the
"Improvement of vectorization process in Polly" project.

I planned to add a mode that uses the same Polly tool flow and
processes statements, which contain only matrix multiplication of the
following form, in a special way.

C := AB + C, where C, A, and B are matrices of sizes m × n, m × k, and
k × n, respectively.

It was also planned to implement determination of vectorization
factors and profitability of vectorization.

We decided to adjust the goals and focus only on the optimization of
matrix multiplication and its generalization. In the result, we
optimize a class of statements (see [1] for further details), which
also contains the matrix multiplication.

The result of evaluations of the corresponding code compiled with
r279395 of Clang on the Intel Core i7-3820 SandyBridge, for different
values of m, n, k and different compiler options can be found on the
following link [2].

According to the figure [2], the algorithm for the computation of the
optimal values of the parameters of matrix multiplication can be
improved. One way is a new algorithm for the computation of the
parameter Mc (see [1] for further details). If we, for example,
manually equate the Mc to Nc we get the following results [3]. Its
improvement is planned for the future.

The class also contains general matrix multiplication of the form C :=
αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k
× n, respectively. In case, for example, of m = n = 1056 and k = 1024,
α = 32412, β = 2123, the corresponding code compiled with r278666 of
Clang with following options

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME
-march=native -mllvm -polly -mllvm
-polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB
-mllvm -polly-target-cache-level-associativity=8,8 -mllvm
-polly-target-cache-level-sizes=32768,262144 -mllvm
-polly-target-througput-vector-fma=1 -mllvm

takes about 0.30 on the Intel Core i7-3820 SandyBridge. In case the
described optimization is disabled, it takes about 0.75 and 2.1 in
case Polly is not used.

Furthermore, using the optimization we can optimize the following
examples of code:

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1
             C[i][j] += 32412 Op1 B[k][j]

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1
Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1
Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1
                   C[i][j] = C[i][j] Op1 A[i][k] Op2 B[k][j]

where Opi is an operation from the set {+, -, /, *} (see [4], [5],
[6], [7] for results of their evaluations).

The described optimization is implemented in [8], [9], [10], [11],
[12], [13], [14] that are already committed and [15] that is under
review. For further details about the project, what code got merged,
what code did not get merged, and what is left to do, please see my
blog posts [1], [16].

I am planning to continue working on the project as part of my Ph.D.
thesis and implement the following:

1. Prefetching of the certain cache lines

The BLIS implementation of matrix-matrix multiplication prefetches
only the first cache line, before traversing a given interval of
memory. This could confirm that the implementation relies on hardware
preteching to prefetch the subsequent lines [17]. Consequently, there
is a possibility that its utilization could improve the described

2. Reduction of the number of the parameters of a target architecture
that require to be specified as command line parameters

At the moment, to be able to use the implemented optimization, one
should specify parameters of a target architecture like, for example,
throughput of vector instructions per clock cycle (see [1] for further
details). Reduction of the number of such parameters using the
information about the target architecture of the LLVM (e.g.,
TargetTransformInfo::getArithmeticInstrCost) could be preferable.

3. Generalization of determination of parameters of the transformation

At the moment, for this purpose, we use an algorithm described in
"Analytical Modeling is Enough for High Performance BLIS" [1] that was
designed to optimize a subclass of the class and could possibly cause
performance regression, if we try to apply it to all statements of the
class. We could try to create a new algorithm specialized for the
whole class.

I would like to thank Michael Kruse, Albert Cohen, Johannes Doerfert
and the LLVM community for your comments, reviews, ideas and the
opportunity to work on this project.

I would like to especially thank Tobias Grosser for his encouragement
and excellent support.


[1] - http://romangareev.blogspot.ru/2016/06/gsoc-2016-report-i.html
[2] - https://drive.google.com/file/d/0B2Wloo-931AoVTRINXp0dDZPSjQ/view?usp=sharing
[3] - https://drive.google.com/file/d/0B2Wloo-931AoWG9jTjg3aEVxRG8/view?usp=sharing
[4] - https://docs.google.com/spreadsheets/d/1RtioBokRePdX2zdFxGxOxTbtY_WnI4moYFCfbujr7S8/pubhtml?gid=0&single=true
[5] - https://docs.google.com/spreadsheets/d/1hUtXH9JsgYUhHCdydVtE7ayNupc9ViZuAG8z_JKevP0/pubhtml?gid=0&single=true
[6] - https://docs.google.com/spreadsheets/d/1NQHqCuhc1A8pVJtN3FuTHrQ2gcyXUYs5T2L8tMXqLE4/pubhtml?gid=0&single=true
[7] - https://docs.google.com/spreadsheets/d/1nr4qZpG5V--Cq5cTLX1TLA7OpfdFqhIGr6-IFuoq1yw/pubhtml?gid=0&single=true
[8] - http://reviews.llvm.org/D21140
[9] - http://reviews.llvm.org/D21491
[10] - http://reviews.llvm.org/D20575
[11] - https://reviews.llvm.org/D22187
[12] - https://reviews.llvm.org/D22828
[13] - https://reviews.llvm.org/D23740
[14] - https://reviews.llvm.org/D23759
[15] - https://reviews.llvm.org/D23260
[16] - http://romangareev.blogspot.ru/2016/08/gsoc-2016-report-ii.html
[17] - http://lists.llvm.org/pipermail/llvm-dev/2016-May/100229.html