Micro-kernel analysis
We are working on the problem of generating the best micro-kernel to improve the performance of GEMM (and, more generally, of many linear algebra operations). We will share on this page what we have found so far. We are mostly interested in Arm architectures, so any reference to assembly implies the Arm ISA.
What is a micro-kernel?
In any loop-based computation, the micro-kernel is the innermost for-loop of the computation.
In the case of GEMM, which we are using as an example, this is the loop-nest of the computation:
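(The sketch below uses the usual GotoBLAS/BLIS blocking structure and the same slice notation as the micro-kernel that follows; the cache-blocking parameters `nc`, `kc`, `mc` and the packing steps are the standard convention rather than anything specific to our setup.)

```
for jc = 0, ..., n-1 in steps of nc
  for pc = 0, ..., k-1 in steps of kc
    Bc = pack(B(pc:pc+kc-1, jc:jc+nc-1))        // kc x nc panel, packed
    for ic = 0, ..., m-1 in steps of mc
      Ac = pack(A(ic:ic+mc-1, pc:pc+kc-1))      // mc x kc block, packed
      for jr = 0, ..., nc-1 in steps of nr
        for ir = 0, ..., mc-1 in steps of mr
          // micro-kernel
          for pr = 0, ..., kc-1 in steps of 1
            Cc(ir:ir+mr-1, jr:jr+nr-1) += Ac(ir:ir+mr-1, pr) * Bc(pr, jr:jr+nr-1)
```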
And the micro-kernel is represented by:
```
for pr = 0, ..., kc-1 in steps of 1
  Cc(ir:ir+mr-1, jr:jr+nr-1) += Ac(ir:ir+mr-1, pr) * Bc(pr, jr:jr+nr-1)
```
Please note that `Ac` and `Bc` are laid out (packed) so that the micro-kernel's memory accesses are contiguous.
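To make the layout concrete, here is a minimal scalar C sketch of such a micro-kernel under the usual packed layouts (the function and argument names are ours, for illustration only; the real kernel is vectorized):

```c
// Scalar sketch of an mr x nr micro-kernel over packed panels.
// Ac: kc slices of mr contiguous floats (one column of the A panel per step).
// Bc: kc slices of nr contiguous floats (one row of the B panel per step).
// Cc: the mr x nr output tile, with row stride ldc.
static void micro_kernel(int kc, int mr, int nr,
                         const float *Ac, const float *Bc,
                         float *Cc, int ldc) {
  for (int pr = 0; pr < kc; ++pr) {           // one rank-1 update per iteration
    for (int i = 0; i < mr; ++i)
      for (int j = 0; j < nr; ++j)
        Cc[i * ldc + j] += Ac[i] * Bc[j];     // Cc(i,j) += Ac(i,pr) * Bc(pr,j)
    Ac += mr;                                 // next packed column of A
    Bc += nr;                                 // next packed row of B
  }
}
```

Both `Ac` and `Bc` are walked strictly sequentially, which is the point of the packing.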
How to write a micro-kernel
Many BLAS libraries (ACL, OpenBLAS, etc.) write the micro-kernel in hand-written assembly.
This has the advantage of squeezing the last drop of performance out of the machine, but it has several disadvantages:
- The library designer can get stuck in local minima
- It is very hard to generalize to new micro- and macro-architectures
The compiler, on the other hand, might find it hard to generate the optimal micro-kernel on its own, but it can expose a series of optimizations that an auto-tuner can tentatively apply to improve performance. We call a sequence of those optimizations a *strategy*.

The main condition for this to work is that, when the auto-tuner asks to apply a given strategy (e.g., `strategy=<Tile<3,3,3>, unroll_x_2, pipeline>`), the compiler generates the best possible code for it.
Let's say that, given a strategy `S` and a program `P`, we consider:

```
Lopt = S(P)  // strategy S applied on program P by a library developer
Copt = S(P)  // strategy S applied on program P by the compiler
```

We would like to have `Lopt == Copt` (one way of defining the `==` operator is through performance, i.e., `T(Lopt) == T(Copt)`).
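To make the performance-based definition concrete, a check along these lines could be used (a minimal sketch; the kernel handles, repetition count, and tolerance are placeholders rather than part of our setup):

```c
#include <time.h>

typedef void (*kernel_fn)(void);   /* hypothetical handle to a compiled micro-kernel */

/* Run the kernel `reps` times and return the average time per run in seconds. */
static double time_kernel(kernel_fn k, int reps) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < reps; ++i) k();
  clock_gettime(CLOCK_MONOTONIC, &t1);
  return ((t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9) / reps;
}

/* Lopt == Copt (performance-wise) if the compiler kernel's runtime is within
 * a small relative tolerance of the library kernel's runtime. */
static int perf_equal(kernel_fn lopt, kernel_fn copt, int reps, double tol) {
  double tl = time_kernel(lopt, reps);
  double tc = time_kernel(copt, reps);
  return (tc - tl) / tl < tol;
}
```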
We (@stevenvar, @allenaann, and I) have been analyzing the optimizations that can compose those strategies, and this is a summary of what we have found so far.
Overall results
So far, these are the best results we have been able to obtain:
| Tile (mr x nr) | MLIR transformations applied | MLIR/Theoretical |
|---|---|---|
| 8x8 | Unrolling (x32) | >72% |
| 4x16 | Unrolling (total) | >82% |
| 12x8 | Unrolling (x2) + pipelining + legalization + scheduling | >81% |
| 12x8 | Unrolling (x2) + pipelining + legalization + scheduling + prefetching | >84% |
A few words to clarify some of the techniques above:
- Total unrolling means unrolling the entire micro-kernel, i.e., removing the kernel for-loop and computing all `kc` of the `mr x nr` outer products explicitly.
- 12x8 is the tiling that ACL uses (and it is the one that gives the best performance so far).
- Scheduling is applied manually at the MLIR level, just before entering LLVM-land.
- Legalizing means producing MLIR instructions that have a 1:1 correspondence in assembly. For instance, a `vector<12xf32>` is not legal in (Arm) assembly, which means that the LLVM back-end has to split it up into chunks of `vector<4xf32>` (see the sketch after this list). Apparently, LLVM's handling of non-power-of-two vectors is sub-optimal, and in general illegal operations in the LLVM world generate sub-optimal code.
- Prefetching is achieved through auto-tuning (more about this later).
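To make the legalization point concrete, here is a rough Arm NEON intrinsics sketch (not our actual kernel) of one rank-1 update step for a 12-element column of the tile: what MLIR can express as a single operation on a `vector<12xf32>` has to become three 4-wide registers and three `fmla`s at the ISA level.

```c
#include <arm_neon.h>

// One pr-step of a 12 x nr micro-kernel, for a single column j of the tile.
// a points at the packed 12-element column Ac(:, pr); b is the scalar Bc(pr, j).
// What MLIR sees as one fma on vector<12xf32> must be legalized into three
// 4-wide operations, because AArch64 NEON registers hold vector<4xf32>.
static inline void update_column_12(float *c, const float *a, float b) {
  float32x4_t b_dup = vdupq_n_f32(b);            // broadcast Bc(pr, j)

  float32x4_t c0 = vld1q_f32(c + 0);             // rows 0..3 of the accumulator
  float32x4_t c1 = vld1q_f32(c + 4);             // rows 4..7
  float32x4_t c2 = vld1q_f32(c + 8);             // rows 8..11

  c0 = vfmaq_f32(c0, vld1q_f32(a + 0), b_dup);   // three fmla v?.4s, ... in assembly
  c1 = vfmaq_f32(c1, vld1q_f32(a + 4), b_dup);
  c2 = vfmaq_f32(c2, vld1q_f32(a + 8), b_dup);

  vst1q_f32(c + 0, c0);
  vst1q_f32(c + 4, c1);
  vst1q_f32(c + 8, c2);
}
```

In the real micro-kernel the accumulators stay in registers across the whole `kc` loop instead of being loaded and stored every step; the fragment only illustrates the 12 → 3 × 4 splitting.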
Assembly inspection
While the assembly for the totally unrolled case is hard to read, a skim through the assembly for the 12x8 tiling strategy shows that the micro-kernel is in good shape:
- All registers (`v0`-`v31`) are used
- The instruction dependencies (`ldr -> fmla`) are minimized
- Loads into a register `vX` seem to be placed immediately after the `fmla` instructions that use `vX`
All in all, the generated assembly is quite nice, albeit still sub-optimal.
Prefetching
We wrote a simple auto-tuner to inject prefetch instructions into the MLIR micro-kernel. While prefetching seems to be quite effective in ACL and OpenBLAS, we only got a 3% speed-up. We will let the tuner run longer to see if we can identify better prefetching values.
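As a rough illustration of the knob being tuned (shown in C with `__builtin_prefetch` for readability; in our setup the prefetches are injected at the MLIR level), the search space is essentially the prefetch distances and which streams to prefetch. The distances below are made-up placeholders of exactly the kind the auto-tuner explores:

```c
// Hypothetical prefetch placement inside the pr-loop (mr = 12, nr = 8).
// A_DIST / B_DIST are prefetch distances in elements; they are the kind of
// knob the auto-tuner searches over, not values we have validated.
enum { MR = 12, NR = 8, A_DIST = MR * 8, B_DIST = NR * 8 };

static void micro_kernel_with_prefetch(int kc, const float *Ac, const float *Bc,
                                       float *Cc, int ldc) {
  for (int pr = 0; pr < kc; ++pr) {
    __builtin_prefetch(Ac + A_DIST, /*rw=*/0, /*locality=*/3);
    __builtin_prefetch(Bc + B_DIST, /*rw=*/0, /*locality=*/3);
    for (int i = 0; i < MR; ++i)               // rank-1 update, scalar for brevity
      for (int j = 0; j < NR; ++j)
        Cc[i * ldc + j] += Ac[i] * Bc[j];
    Ac += MR;
    Bc += NR;
  }
}
```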
LLVM compiler and limitations
We are using the LLVM compiler as a sort of black-box (mostly for lack of expertise).
The main issue we are facing is that there seems to be a dependence between the MLIR that we generate and the performance of the code LLVM produces. In other words, we cannot assume that two functionally equivalent MLIR programs are optimized in the same way by the LLVM compiler. Is this expected?
The clearest example is scheduling (@nicolasvasilache, you were right): the placement of independent operations in MLIR and the scheduling + register allocation done by the LLVM compiler are connected.
This MLIR/LLVM IR dependency raises the question of whether scheduling should really be done in MLIR at all. Could we instead enhance LLVM's scheduling algorithm with something more exotic (given that we are less time-constrained)?
Conclusion
The main take-away from all this is that there is a way of writing MLIR that produces (almost) optimal assembly.
This is good news for us, because we “only” need to provide and use the right set of transformations to accomplish the task.
Now it is time to push these transformations into the compiler. Some are already available (pipelining), some need a bit of improvement (legalization), and for some we need to understand how to split the responsibility between MLIR and LLVM (scheduling and prefetching).