TLDR: Our new proposal boils down to adding 4 matrix-related intrinsics (transpose, multiply, and strided load & store) that carry shape information, plus an IR pass that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break operations down into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARMv8.6-A) and to reduce spilling. We’ve designed the approach so it’s flexible and easy to extend.
With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments, and a column-major layout is assumed initially. With the new proposal, only very few intrinsics actually care about the memory layout, and they can easily be extended to also support row-major layouts.
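To make this more concrete, the sketch below shows what declarations of the four intrinsics could look like for 2 x 4 matrices of double, flattened to <8 x double> in column-major order. The names used for the strided load & store and the exact argument order are illustrative and open for discussion.
; Transpose a Rows x Cols matrix: llvm.matrix.transpose(%In, i32 Rows, i32 Cols).
declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
; Multiply an M x K matrix with a K x N matrix: llvm.matrix.multiply(%A, %B, i32 M, i32 K, i32 N);
; e.g. a 2 x 4 matrix times a 4 x 2 matrix yields a 2 x 2 result.
declare <4 x double> @llvm.matrix.multiply(<8 x double>, <8 x double>, i32, i32, i32)
; Strided (column-wise) load & store of a Rows x Cols matrix, with Stride elements
; between the starts of consecutive columns:
; llvm.matrix.columnwise.load(%Ptr, i32 Stride, i32 Rows, i32 Cols)
; llvm.matrix.columnwise.store(%Value, %Ptr, i32 Stride, i32 Rows, i32 Cols)
declare <8 x double> @llvm.matrix.columnwise.load(double*, i32, i32, i32)
declare void @llvm.matrix.columnwise.store(<8 x double>, double*, i32, i32, i32)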
The floating point versions of the intrinsics also take fast-math flags, which can be used to opt in to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for the related optimisations.
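As an illustration of the intended usage (assuming the multiply intrinsic sketched above, with an (M, K, N) argument order), the flags would simply be attached to the call:
; contract opts in to FMA formation; nnan/ninf enable additional folds.
%c = call contract nnan ninf <4 x double> @llvm.matrix.multiply(<8 x double> %a, <8 x double> %b, i32 2, i32 4, i32 2)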
The intrinsics will be lowered to regular LLVM vector operations in an IR lowering pass. This means that, by default, we can lower the intrinsics on all targets. Additionally, targets can implement their own lowering to a specialised ISA extension. We have implemented such a lowering for an internal matrix accelerator.
The example below shows how the llvm.matrix.transpose intrinsic will be lowered in the IR lowering pass.
define void @transpose(<8 x double>* %A, <8 x double>* %B) {
entry:
%a = load <8 x double>, <8 x double>* %A, align 16
%b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
store <8 x double> %b, <8 x double>* %B, align 16
ret void
}
declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)
This will get lowered to something like:
define void @transpose(<8 x double>* %A, <8 x double>* %B) {
entry:
%0 = bitcast <8 x double>* %A to <2 x double>*
%1 = load <2 x double>, <2 x double>* %0, align 8
%2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
%3 = bitcast double* %2 to <2 x double>*
%4 = load <2 x double>, <2 x double>* %3, align 8
%5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
%6 = bitcast double* %5 to <2 x double>*
%7 = load <2 x double>, <2 x double>* %6, align 8
%8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
%10 = bitcast double* %9 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
%14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
%15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
%16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
%17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
%18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 5>
%19 = bitcast <8 x double>* %B to <4 x double>*
store <4 x double> %15, <4 x double>* %19, align 8
%20 = getelementptr <8 x double>, <8 x double>* %B, i64 0, i64 4
%21 = bitcast double* %20 to <4 x double>*
store <4 x double> %18, <4 x double>* %21, align 8
ret void
}
Before we do the actual lowering, we propagate the shape information from the intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrices embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to %a = load <8 x double>, <8 x double>* %A, align 16 and lower it to a series of load <2 x double>, <2 x double>* instructions, which helps avoid a large number of shufflevector instructions to get the column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but it is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower, for example, a matrix multiply.
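For illustration, without shape propagation the columns of the flattened 2 x 4 matrix %a from the example above would be extracted with shuffles at the use site, roughly along these lines (the value names are just for illustration):
%a.col0 = shufflevector <8 x double> %a, <8 x double> undef, <2 x i32> <i32 0, i32 1>
%a.col1 = shufflevector <8 x double> %a, <8 x double> undef, <2 x i32> <i32 2, i32 3>
; ... and similarly for the remaining columns.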
This infrastructure can also be used to improve codegen for large vector operations in general. For example, consider loading 2 <100 x double> vectors, adding them and storing the result. Currently this will be lowered naively, doing most of the loads first, then most of the adds, and then most of the stores, which causes plenty of spilling on AArch64. We can significantly improve the generated code by breaking it up into load/add/store blocks that fit into the target's registers.
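Roughly sketched (with %A, %B and %C standing for the two source pointers and the destination, and a <2 x double> chunk size chosen purely for illustration), each chunk would be handled as a self-contained load/add/store group:
%a.ptr = bitcast <100 x double>* %A to <2 x double>*
%a.0 = load <2 x double>, <2 x double>* %a.ptr, align 8
%b.ptr = bitcast <100 x double>* %B to <2 x double>*
%b.0 = load <2 x double>, <2 x double>* %b.ptr, align 8
%sum.0 = fadd <2 x double> %a.0, %b.0
%c.ptr = bitcast <100 x double>* %C to <2 x double>*
store <2 x double> %sum.0, <2 x double>* %c.ptr, align 8
; ... repeated for the remaining chunks, with getelementptrs advancing the
; element offset, so each group stays within the available registers.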
We also plan to implement some additional optimisations as part of the lowering, like generating fused/tiled loops for matrix multiplications and emitting FMAs directly.
As future work, we are also evaluating adding additional operations, including clamping a vector or matrix, min/max of matrices, inverting a matrix and computing matrix determinants. Some of those might require additional intrinsics, which we can discuss on a case by case basis.