Hi,

We are proposing first-class type support for a new matrix type. This is a natural extension of the current vector type with an extra dimension.

For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16

%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16

%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)

store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16

Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, and extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.
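
As a rough illustration of the submatrix load mentioned above, here is a minimal C++ sketch that assumes a column-major layout. The name `load_submatrix` and its `stride` parameter are hypothetical and are not the names of the proposed intrinsics; the sketch only shows the addressing involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of the proposal: copies a rows x cols
// submatrix out of a larger column-major matrix. `base` points at the
// first element of the submatrix, and `stride` is the number of elements
// between the starts of consecutive columns of the enclosing matrix.
std::vector<float> load_submatrix(const float *base, std::size_t stride,
                                  std::size_t rows, std::size_t cols) {
  assert(stride >= rows);
  std::vector<float> out(rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r)
      out[c * rows + r] = base[c * stride + r]; // result is packed column-major
  return out;
}
```

For a 4x4 enclosing matrix stored at `big`, the 2x2 submatrix starting at row 1, column 2 would be loaded with `load_submatrix(big + 2 * 4 + 1, 4, 2, 2)`.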

All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).

These are exposed in clang via builtins. For example, the matrix multiply above looks like this in C/C++:

typedef float mf4x4_t __attribute__((matrix_type(4, 4)));

mf4x4_t add(mf4x4_t a, mf4x4_t b) {
  return __builtin_matrix_multiply(a, b);
}

** Benefits **

Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors. Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level, which is the usual sticking point with adopting FP-contraction.

Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient **intermediate layout** for the matrix values between ABI observable points (calls, memory operations).

The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial, for example, because rows/columns would then be aligned to the HW vector boundary (e.g. for a 3x3 matrix).

The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit run-time alias checks.
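
To make the tiling idea concrete, here is a minimal scalar C++ sketch of a tiled multiply over column-major data. It is only an illustration under stated assumptions: the function name `tiled_multiply` and the `tile` parameter are hypothetical, and the actual lowering pass would operate on target vectors rather than on scalars.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: C = A * B with A (m x k), B (k x n), C (m x n),
// all stored column-major without padding. Only one tile of each operand
// is touched in the innermost loops, which is what bounds the number of
// values that must be live at once.
std::vector<float> tiled_multiply(const std::vector<float> &a,
                                  const std::vector<float> &b,
                                  std::size_t m, std::size_t k, std::size_t n,
                                  std::size_t tile) {
  assert(tile > 0 && a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t j0 = 0; j0 < n; j0 += tile)
    for (std::size_t i0 = 0; i0 < m; i0 += tile)
      for (std::size_t p0 = 0; p0 < k; p0 += tile) {
        const std::size_t j1 = std::min(j0 + tile, n);
        const std::size_t i1 = std::min(i0 + tile, m);
        const std::size_t p1 = std::min(p0 + tile, k);
        // Accumulate the contribution of one A tile and one B tile
        // into one C tile.
        for (std::size_t j = j0; j < j1; ++j)
          for (std::size_t p = p0; p < p1; ++p) {
            const float bv = b[j * k + p]; // B(p, j)
            for (std::size_t i = i0; i < i1; ++i)
              c[j * m + i] += a[p * m + i] * bv; // C(i, j) += A(i, p) * B(p, j)
          }
      }
  return c;
}
```

The sketch writes into a fresh output buffer, so it sidesteps the aliasing question. When the output may overlap the inputs in memory, the reordered loads and stores are exactly what the run-time alias checks would have to guard.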

Having a dedicated first-class type also allows for dedicated target-specific **ABIs** for matrices. This is a pretty rich area. It includes whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.

Having all this knowledge at the IR level means that **front-ends** are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.

At Apple, we also need to support **interoperability** between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
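
To show why the conversion is costly, here is a minimal C++ sketch of a row-major to column-major conversion. The function name is hypothetical and is not one of the proposed conversion instructions. Every element is read from one position and written to a different one, so the work is proportional to the whole matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: converts an unpadded rows x cols matrix from
// row-major storage to column-major storage.
std::vector<float> row_major_to_column_major(const std::vector<float> &src,
                                             std::size_t rows,
                                             std::size_t cols) {
  assert(src.size() == rows * cols);
  std::vector<float> dst(src.size());
  for (std::size_t r = 0; r < rows; ++r)
    for (std::size_t c = 0; c < cols; ++c)
      dst[c * rows + r] = src[r * cols + c]; // element (r, c) in both layouts
  return dst;
}
```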

** ABI **

We currently default to column-major order with no padding between the columns in memory. We have plans to also support row-major order and we would probably have to support padding at some point for targets where unaligned accesses are slow. In order to make the IR self-contained I am planning to make the defaults explicit in the DataLayout string.
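
For illustration, the in-memory addressing can be sketched in C++ as follows. The struct `ColumnMajorLayout` and its `column_stride` field are assumptions made for this example and say nothing about how the DataLayout string would encode the defaults. Setting `column_stride` equal to `rows` gives the current unpadded default, and a larger value models padding between columns.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical description of a column-major matrix in memory.
struct ColumnMajorLayout {
  std::size_t rows;
  std::size_t cols;
  std::size_t column_stride; // elements between column starts, >= rows
};

// Offset, in elements, of element (r, c) from the start of the matrix.
std::size_t element_offset(const ColumnMajorLayout &layout, std::size_t r,
                           std::size_t c) {
  assert(r < layout.rows && c < layout.cols);
  assert(layout.column_stride >= layout.rows);
  return c * layout.column_stride + r;
}
```

For a 3x3 matrix, element (1, 2) sits at offset 7 with no padding and at offset 9 if each column is padded to four elements.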

For function parameters and return values, matrices are currently placed in memory. Moving forward, we should pass small matrices in vector registers. Treating matrices as structures of vectors seems a natural default. This works well for AArch64, since Homogeneous Short-Vector Aggregates (HVA) can use all 8 SIMD argument registers. Thus we could pass, for example, two 4 x 4 x float matrices in registers. However, on X86 we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.
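
The arithmetic behind those two examples can be spelled out with a small C++ sketch. The helper names are hypothetical, the 16-byte vector register size is an assumption, and the code only counts sizes. It does not model the actual argument classification rules of either ABI.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of an unpadded rows x cols matrix.
std::size_t matrix_bytes(std::size_t rows, std::size_t cols,
                         std::size_t element_bytes) {
  return rows * cols * element_bytes;
}

// Number of 16-byte vector registers needed to hold `bytes` bytes
// (assumes 128-bit SIMD registers).
std::size_t vector_registers_needed(std::size_t bytes) {
  return (bytes + 15) / 16;
}

// Number of eightbytes (8-byte units) occupied by `bytes` bytes.
std::size_t eightbytes_needed(std::size_t bytes) {
  return (bytes + 7) / 8;
}
```

A 4 x 4 x float matrix is 64 bytes, or four 16-byte vectors, so two of them fill 8 registers. A 2 x 2 x float matrix is 16 bytes, or two eightbytes, so two of them reach the four-eightbyte limit.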

Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments. This would allow using all 8 vector argument registers on X86 too.

Alignment of the matrix type is the same as the alignment of its first row/column vector.

** Flow **

Clang at this point mostly just forwards everything to LLVM. Then in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors. As with vectors, matrices can be of any static size with any of the primitive types as the element type.

After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors. CodeGen then lowers the arguments and forwards the vector values. CodeGen is already capable of further lowering vectors of any size to scalars if the target does not support vectors.

The lowering pass is also run at -O0, rather than legalizing the matrix type during CodeGen as is done for structure values or invalid vectors. I don’t really see much value in duplicating this logic across the IR and CodeGen. We just need a lighter mode in the pass at -O0.

** Roll-out and Maintenance **

Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed until we lift the experimental status.

We are obviously interested in maintaining and improving this code in the future.

Looking forward to comments and suggestions.

Thanks,

Adam