RFC: Matrix math support

Hello,# This is a follow-up to Adam’s original RFC (RFC: First-class Matrix type, [1]) and his follow-up ([[RFC] Matrix support (take 2, 2]). The main sticking points of the last thread were centred around representing padding, propagating shape information (dimensions) and the number of proposed intrinsics [8].We’ve incorporated the feedback into our design and would like to share our updated proposal. We stripped it down to the minimum required to realize the major benefits, like elimination of loads/stores to temporaries and generation of tuned loops for matrix multiplies. We discuss further extensions and points that came up during the earlier discussions after the new proposal.

TLDR: Our new proposal boils down to adding 4 matrix related intrinsics (transpose, multiply, and strided load & store) with shape information and an IR pass, that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break down operations into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and to reducing spilling. We’ve designed the approach so it’s flexible and easy to extend.

ProposalAfter the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].

With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support row-major layouts.

Initially, we would like to propose the following intrinsics:

Hi Florian, thanks for the update!

After the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].

Nice!

With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support row-major layouts.

Initially, we would like to propose the following intrinsics:

This all seems reasonable to me - a constrained extension that can start out as experimental. I personally don’t have any significant objections to this.

The floating point versions of the intrinsics also take fast-math flags, which can be used to opt-in to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.

What is your though on the FP semantics of matmul? There is a lot of room for interpretation and various ‘fast’ approximations that are occasionally interesting. Should this use the existing fast math flags or are they different?

The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets.

Nice!

Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to %a = load <8 x double>, <8 x double>* %A, align 16 and lower it to a series of load <2 x double>, <2 x double>*, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.

I’m very glad to hear that this only affects performance but not correctness. This ensures that the extension is correct in the face of outlining and various code motion passes that can introduces weird phi nodes that may (in unusual cases) escape the limits of your reasoning.

As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.

+1. It would be interesting to extend the set of target independent vector intrinsics in general.

Potential Future Extensions-

Random thought, but would it be enough to model these as undef elements if they became important?

We propose adding a new matrix value type, that can be declared via a matrix_type attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred.

I don’t care strongly about this at all, but I have a weak preference for introducing a new attribute for this. There is effectively no cost to doing so, and this would make the documentation in the clang extensions manual much easier to understand.

In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.

I’m sure the clang folks will want to know what you mean by ‘constant’s in terms of ICE, constexpr, integer template arguments, etc…

We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.

+1, sounds like a very promising direction, thank you!

-Chris

Hi Chris,

Thanks for taking a look!

Hi Florian, thanks for the update!

After the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].

Nice!

With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support row-major layouts.

Initially, we would like to propose the following intrinsics:

This all seems reasonable to me - a constrained extension that can start out as experimental. I personally don’t have any significant objections to this.

The floating point versions of the intrinsics also take fast-math flags, which can be used to opt-in to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.

What is your though on the FP semantics of matmul? There is a lot of room for interpretation and various ‘fast’ approximations that are occasionally interesting. Should this use the existing fast math flags or are they different?

Initially we thought of using the existing fast math flags directly, apply them to the generated instruction while lowering and rely on the existing related optimizations in InstCombine & co. But it should be possible to handle additional FP semantics directly while doing the lowering, if there are users that could benefit.

The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets.

Nice!

Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to %a = load <8 x double>, <8 x double>* %A, align 16 and lower it to a series of load <2 x double>, <2 x double>*, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.

I’m very glad to hear that this only affects performance but not correctness. This ensures that the extension is correct in the face of outlining and various code motion passes that can introduces weird phi nodes that may (in unusual cases) escape the limits of your reasoning.

As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.

+1. It would be interesting to extend the set of target independent vector intrinsics in general.

Potential Future Extensions-

Random thought, but would it be enough to model these as undef elements if they became important?

Interesting! Yes, I think we could use undef while lowering the builtins in clang, to deal with padding.

We propose adding a new matrix value type, that can be declared via a matrix_type attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred.

I don’t care strongly about this at all, but I have a weak preference for introducing a new attribute for this. There is effectively no cost to doing so, and this would make the documentation in the clang extensions manual much easier to understand.

In all cases, we require known constants as dimensions and we do not plan to support dynamic dimensions for now.

I’m sure the clang folks will want to know what you mean by ‘constant’s in terms of ICE, constexpr, integer template arguments, etc…

We think our current proposal addresses the concerns raised previously, especially the concerns around the high cost of adding a new IR type, adding too many new intrinsics and generalising the approach to N dimensions. Unless there are any additional major concerns, we should be able to share patches for review soon.

+1, sounds like a very promising direction, thank you!

Thanks for all the valuable input during the earlier discussions!

Cheers,
Florian

Thanks a lot for the comments so far!

I’m planning to start preparing a set of initial patches for the LLVM side in a couple of days. Please let me know if there are any remaining high-level concerns.

I received additional feedback off-list to try to remove the transpose and load/store intrinsics initially. I'll probably exclude them from the initial set of patches (we currently mostly have them for convenience) and add them as we go along, if required.

Cheers,
Florian

Hi,

I just uploaded the initial patch adding the matrix intrinsics as described, as well as an initial lowering pass: https://reviews.llvm.org/D70456

The first version of the patch works without any shape propagation and lowers each intrinsic independently. We plan to iterate on the pass once it is in, to improve the generated code as discussed in the RFC.

In a bit, I’ll also share a patch for the clang side, to get a concrete discussion started about the Clang side.

Thanks,
Florian

Hello,# This is a follow-up to Adam’s original RFC (RFC: First-class Matrix type, [1]) and his follow-up ([[RFC] Matrix support (take 2, 2]). The main sticking points of the last thread were centred around representing padding, propagating shape information (dimensions) and the number of proposed intrinsics [8].We’ve incorporated the feedback into our design and would like to share our updated proposal. We stripped it down to the minimum required to realize the major benefits, like elimination of loads/stores to temporaries and generation of tuned loops for matrix multiplies. We discuss further extensions and points that came up during the earlier discussions after the new proposal.

TLDR: Our new proposal boils down to adding 4 matrix related intrinsics (transpose, multiply, and strided load & store) with shape information and an IR pass, that lowers the intrinsics to regular vector operations. The IR pass can also be extended to break down operations into chunks suitable for specialised matrix ISAs (like the matrix instructions included in ARM v8.6) and to reducing spilling. We’ve designed the approach so it’s flexible and easy to extend.

ProposalAfter the initial feedback, we evaluated using ‘flattened’ vectors to hold matrix values, instead of adding a new matrix type, as suggested originally. This was option #1 suggested in [3].

With that in mind, we propose to add a set of experimental intrinsics for matrix operations that require information about the shape/layout of the underlying matrix. The suggested intrinsics take the shape information as additional (constant) integer arguments and column major layout is assumed initially. With the new proposal, there are very few intrinsics that actually care about the memory layout and they can be easily extended to also support row-major layouts.

Initially, we would like to propose the following intrinsics:

The floating point versions of the intrinsics also take fast-math flags, which can be used to opt-in to FMA generation and/or constant folding opportunities via NoInfs and NoNaNs. We plan to add them to the lowered instructions and rely on InstCombine & Co for related optimisations.

The intrinsics will be lowered to regular LLVM vector operations in a IR lowering pass. This means per default, we can lower the builtins on all targets. Additionally, targets can implement their own lowering to a specialized ISA extension. We implemented such a lowering for an internal matrix accelerator.

The example below shows how the matrix_transpose builtin will be lowered in the IR lowering pass.

define void @transpose(<8 x double> * %A, <8 x double>* %B) {
entry:
%a = load <8 x double>, <8 x double>* %A, align 16
%b = call <8 x double> @llvm.matrix.transpose(<8 x double> %a, i32 2, i32 4)
store <8 x double> %c, <8 x double>* %B, align 16
ret void
}
declare <8 x double> @llvm.matrix.transpose(<8 x double>, i32, i32)

Will get lowered to something like

define void @transpose(<8 x double> * %A, <8 x double>* %B) {
entry:
%0 = bitcast <8 x double>* %A to <2 x double>*
%1 = load <2 x double>, <2 x double>* %0, align 8
%2 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 2
%3 = bitcast double* %2 to <2 x double>*
%4 = load <2 x double>, <2 x double>* %3, align 8
%5 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 4
%6 = bitcast double* %5 to <2 x double>*
%7 = load <2 x double>, <2 x double>* %6, align 8
%8 = shufflevector <2 x double> %7, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%9 = getelementptr <8 x double>, <8 x double>* %A, i64 0, i64 6
%10 = bitcast double* %9 to <2 x double>*
%11 = load <2 x double>, <2 x double>* %10, align 8
%12 = shufflevector <2 x double> %11, <2 x double> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%13 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
%14 = shufflevector <4 x double> %13, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 4, i32 undef>
%15 = shufflevector <4 x double> %14, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 4>
%16 = shufflevector <2 x double> %1, <2 x double> %4, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
%17 = shufflevector <4 x double> %16, <4 x double> %8, <4 x i32> <i32 0, i32 1, i32 5, i32 undef>
%18 = shufflevector <4 x double> %17, <4 x double> %12, <4 x i32> <i32 0, i32 1, i32 2, i32 5>
%19 = bitcast <8 x double>* %C to <4 x double>*
store <4 x double> %15, <4 x double>* %19, align 8
%20 = getelementptr <8 x double>, <8 x double>* %C, i64 0, i64 4
%21 = bitcast double* %20 to <4 x double>*
store <4 x double> %18, <4 x double>* %21, align 8
ret void
}

Before we do the actual lowering, we propagate the shape information from intrinsics to connected instructions. This allows us to improve the code we generate for regular IR operations on matrixes embedded in a flattened vector. In the example above, we propagate the information that we are loading a matrix with 2 rows and 4 columns to %a = load <8 x double>, <8 x double>* %A, align 16 and lower it to a series of load <2 x double>, <2 x double>*, which helps with avoiding a large number of shufflevector instructions to get column vectors. Please note that propagating the shape information allows us to improve the code we generate during lowering, but is not required for correctness. Without propagating shape information, we would just need additional shuffles to extract the rows/columns at the point where we lower a matrix multiply for example.

This infrastructure can also be used to improve codegen for large vector operations in general. For example, consider loading 2 <100 x double> vectors, adding them and storing the result. Currently this will be lowered naively doing most of the loads first, then most of the adds and then most of the stores, causing plenty of spilling on AArch64. We can significantly improve the generated code by breaking it into load/add/store blocks that fit into the target registers.
We also plan to implement some additional optimisations as part of the lowering, like generating fused/tiled loops for matrix multiplications and direct FMA generation.

As future work, we are also evaluating adding additional operations including clamping a vector or matrix, min/max of matrixes, inverting a matrix and computing matrix determinates. Some of those might require additional intrinsics, which we can discuss on a case by case basis.

In terms of integration, we are planning on adding a separate MatrixIRBuilder utility (as suggested in [4]), which can be used by frontends to create various matrix operations. For point wise operations, they will lower to vector instructions directly.

Potential Future ExtensionsDuring the previous discussions, a few extensions were discussed, but we would like exclude them from the initial implementation. Those are

  • N-Dimension
    Given that the dimensions are represented as arguments to intrinsics, we can go from supporting 2 dimensions to N dimensions by using variadic arguments for shape information and handle it accordingly while lowering.
  • Padding
    For small matrixes, it might be desirable to automatically add additional padding to columns/rows, e.g. add 1 element padding to each column in a 3 x 3 matrix, to allow for using vector instructions operating on power-of-2 number of elements or satisfy an alignment requirement by a target. This allows for additional optimizations, but is not required for lowering the intrinsics. We also haven’t seen this being an issue so far. We should be able to iterate on that, once it becomes an issue. (Earlier discussion [5] )

Support in ClangWe are also planning to expose the matrix support in Clang via a matrix type on the C/C++ level (similar to the ext_vector_type attribute) together with a set of matrix builtins. Similar to the LLVM part, we would like to propose a pragmatic solution for common matrix operations, while being flexible enough to allow iterative improvements to accommodate additional requirements. The main motivation for the matrix support on the Clang side is to give users a way to

  • Guarantee generating high-quality code for matrix operations and trees of matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
  • Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
  • Move optimisations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigen-style matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.

We propose adding a new matrix value type, that can be declared via a matrix_type attribute. Alternatively we could also generalise the existing ext_vector_type attribute ([6]), if that is preferred. Initially, the matrix type supports 2 dimensions (rows & columns). For example, a 4x4 float matrix would be declared as typedef float m4x4_t __attribute__((matrix_type(4, 4)));.

Hi Florian,

You can find Clang’s policy for accepting language extensions described at http://clang.llvm.org/get_involved.html

This extension clearly satisfies point 2, and I’m happy to trust that we can iterate to satisfying points 6 and 7. I would like to hear your thoughts on the other criteria.

Hi Richard,

Thank you very much for taking a look! We are working on a clang-focused update to the proposal to get a discussion rolling on the Clang integration. Thanks for pointing me to the policy for language extensions, we will make sure to cover the points mentioned there.

With Thanksgiving coming up in the US, it will probably take us until after the holidays to share the update.

Cheers,
Florian