Dear LLVM contributors,

during the Google Summer of Code 2016 I have been working on the

"Improvement of vectorization process in Polly" project.

I planned to add a mode that uses the same Polly tool flow and

processes statements, which contain only matrix multiplication of the

following form, in a special way.

C := AB + C, where C, A, and B are matrices of sizes m × n, m × k, and

k × n, respectively.

It was also planned to implement determination of vectorization

factors and profitability of vectorization.

We decided to adjust the goals and focus only on the optimization of

matrix multiplication and its generalization. In the result, we

optimize a class of statements (see [1] for further details), which

also contains the matrix multiplication.

The result of evaluations of the corresponding code compiled with

r279395 of Clang on the Intel Core i7-3820 SandyBridge, for different

values of m, n, k and different compiler options can be found on the

following link [2].

According to the figure [2], the algorithm for the computation of the

optimal values of the parameters of matrix multiplication can be

improved. One way is a new algorithm for the computation of the

parameter Mc (see [1] for further details). If we, for example,

manually equate the Mc to Nc we get the following results [3]. Its

improvement is planned for the future.

The class also contains general matrix multiplication of the form C :=

αAB + βC, where C, A, and B are matrices of sizes m × n, m × k, and k

× n, respectively. In case, for example, of m = n = 1056 and k = 1024,

α = 32412, β = 2123, the corresponding code compiled with r278666 of

Clang with following options

clang -O3 gemm.c -I utilities/ utilities/polybench.c -DPOLYBENCH_TIME

-march=native -mllvm -polly -mllvm

-polly-pattern-matching-based-opts=true -DPOLYBENCH_USE_SCALAR_LB

-mllvm -polly-target-cache-level-associativity=8,8 -mllvm

-polly-target-cache-level-sizes=32768,262144 -mllvm

-polly-target-througput-vector-fma=1 -mllvm

-polly-target-latency-vector-fma=8

takes about 0.30 on the Intel Core i7-3820 SandyBridge. In case the

described optimization is disabled, it takes about 0.75 and 2.1 in

case Polly is not used.

Furthermore, using the optimization we can optimize the following

examples of code:

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] += 32412 Op1 B[k][j]

endfor

endfor

endfor

Loop 1 for i = 0, …, UpperBound1 - 1 in steps of 1

Loop 2 for j = 0, …, UpperBound2 - 1 in steps of 1

Loop 3 for k = 0, …, UpperBound3 - 1 in steps of 1

C[i][j] = C[i][j] Op1 A[i][k] Op2 B[k][j]

endfor

endfor

endfor

where Opi is an operation from the set {+, -, /, *} (see [4], [5],

[6], [7] for results of their evaluations).

The described optimization is implemented in [8], [9], [10], [11],

[12], [13], [14] that are already committed and [15] that is under

review. For further details about the project, what code got merged,

what code did not get merged, and what is left to do, please see my

blog posts [1], [16].

I am planning to continue working on the project as part of my Ph.D.

thesis and implement the following:

1. Prefetching of the certain cache lines

The BLIS implementation of matrix-matrix multiplication prefetches

only the first cache line, before traversing a given interval of

memory. This could confirm that the implementation relies on hardware

preteching to prefetch the subsequent lines [17]. Consequently, there

is a possibility that its utilization could improve the described

optimization.

2. Reduction of the number of the parameters of a target architecture

that require to be specified as command line parameters

At the moment, to be able to use the implemented optimization, one

should specify parameters of a target architecture like, for example,

throughput of vector instructions per clock cycle (see [1] for further

details). Reduction of the number of such parameters using the

information about the target architecture of the LLVM (e.g.,

TargetTransformInfo::getArithmeticInstrCost) could be preferable.

3. Generalization of determination of parameters of the transformation

At the moment, for this purpose, we use an algorithm described in

"Analytical Modeling is Enough for High Performance BLIS" [1] that was

designed to optimize a subclass of the class and could possibly cause

performance regression, if we try to apply it to all statements of the

class. We could try to create a new algorithm specialized for the

whole class.

I would like to thank Michael Kruse, Albert Cohen, Johannes Doerfert

and the LLVM community for your comments, reviews, ideas and the

opportunity to work on this project.

I would like to especially thank Tobias Grosser for his encouragement

and excellent support.

Refs.:

[1] - http://romangareev.blogspot.ru/2016/06/gsoc-2016-report-i.html

[2] - https://drive.google.com/file/d/0B2Wloo-931AoVTRINXp0dDZPSjQ/view?usp=sharing

[3] - https://drive.google.com/file/d/0B2Wloo-931AoWG9jTjg3aEVxRG8/view?usp=sharing

[4] - https://docs.google.com/spreadsheets/d/1RtioBokRePdX2zdFxGxOxTbtY_WnI4moYFCfbujr7S8/pubhtml?gid=0&single=true

[5] - https://docs.google.com/spreadsheets/d/1hUtXH9JsgYUhHCdydVtE7ayNupc9ViZuAG8z_JKevP0/pubhtml?gid=0&single=true

[6] - https://docs.google.com/spreadsheets/d/1NQHqCuhc1A8pVJtN3FuTHrQ2gcyXUYs5T2L8tMXqLE4/pubhtml?gid=0&single=true

[7] - https://docs.google.com/spreadsheets/d/1nr4qZpG5V--Cq5cTLX1TLA7OpfdFqhIGr6-IFuoq1yw/pubhtml?gid=0&single=true

[8] - http://reviews.llvm.org/D21140

[9] - http://reviews.llvm.org/D21491

[10] - http://reviews.llvm.org/D20575

[11] - https://reviews.llvm.org/D22187

[12] - https://reviews.llvm.org/D22828

[13] - https://reviews.llvm.org/D23740

[14] - https://reviews.llvm.org/D23759

[15] - https://reviews.llvm.org/D23260

[16] - http://romangareev.blogspot.ru/2016/08/gsoc-2016-report-ii.html

[17] - http://lists.llvm.org/pipermail/llvm-dev/2016-May/100229.html