# RFC: First-class Matrix type

Hi,

We are proposing first-class type support for a new matrix type. This is a natural extension of the current vector type with an extra dimension.
For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16

Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.
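To make the strided access concrete, here is a scalar C sketch of the submatrix load, assuming column-major storage; the function name and signature are hypothetical, not the proposed intrinsic:

```c
#include <stddef.h>

/* Load a rows x cols submatrix starting at element (r0, c0) of a larger
 * column-major matrix whose columns are ld elements apart. The result is
 * packed densely, column by column. */
static void load_submatrix(const float *m, int ld, int r0, int c0,
                           int rows, int cols, float *out) {
  for (int c = 0; c < cols; ++c)
    for (int r = 0; r < rows; ++r)
      out[c * rows + r] = m[(size_t)(c0 + c) * ld + (r0 + r)];
}
```

The matrix-load intrinsic would carry the same information (base pointer, stride, dimensions), just as a single IR-level operation.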

All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).

These are exposed in clang via builtins. E.g., the multiply above looks like this in C/C++:

typedef float mf4x4_t __attribute__((matrix_type(4, 4)));

mf4x4_t multiply(mf4x4_t a, mf4x4_t b) {
  return __builtin_matrix_multiply(a, b);
}

** Benefits **

Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors. Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FP-contraction.

Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).

The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).

The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit run-time alias checks.
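As an illustration of the tiling, here is a scalar C sketch of a 4x4 single-precision multiply processed in 2x2 output blocks, so that only a slice of each input needs to be live at a time; the sizes and names are illustrative, not what the lowering pass actually emits:

```c
enum { N = 4, TILE = 2 };  /* illustrative sizes */

/* C = A * B for N x N column-major matrices, computed tile by tile. */
static void matmul_tiled(const float *A, const float *B, float *C) {
  for (int jb = 0; jb < N; jb += TILE)        /* output column block */
    for (int ib = 0; ib < N; ib += TILE)      /* output row block */
      for (int j = jb; j < jb + TILE; ++j)
        for (int i = ib; i < ib + TILE; ++i) {
          float acc = 0.0f;
          for (int k = 0; k < N; ++k)         /* A(i,k) * B(k,j) */
            acc += A[k * N + i] * B[j * N + k];
          C[j * N + i] = acc;
        }
}
```

A vectorized version would additionally keep each tile of A and B in registers across the inner loops, which is exactly where reordering the loads (and hence the alias checks) comes in.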

Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices: whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.

Having all this knowledge at the IR level means that front-ends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.

At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.

** ABI **

We currently default to column-major order with no padding between the columns in memory. We have plans to also support row-major order and we would probably have to support padding at some point for targets where unaligned accesses are slow. In order to make the IR self-contained I am planning to make the defaults explicit in the DataLayout string.
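Concretely, with column-major order and no padding, element (r, c) of a matrix with R rows lives at offset c * R + r; padding simply rounds the column stride up. The sketch below assumes a hypothetical target that wants the columns of a 3 x 3 matrix padded out to 4 floats:

```c
enum { ROWS = 3, COLS = 3, STRIDE = 4 };  /* columns padded from 3 to 4 floats */

/* Element (r, c) of a column-major matrix with a padded column stride. */
static float mat_get(const float *m, int r, int c) { return m[c * STRIDE + r]; }
static void  mat_set(float *m, int r, int c, float v) { m[c * STRIDE + r] = v; }
```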

For function parameters and return values, matrices are currently placed in memory. Moving forward, we should pass small matrices in vector registers. Treating matrices as structures of vectors seems a natural default. This works well for AArch64, since Homogeneous Short-Vector Aggregates (HVA) can use all 8 SIMD argument registers. Thus we could pass, for example, two 4 x 4 x float matrices in registers. However, on X86, we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.

Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments. This would allow using all 8 vector argument registers on X86 too.

Alignment of the matrix type is the same as the alignment of its first row/column vector.

** Flow **

Clang at this point mostly just forwards everything to LLVM. Then in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors. As with vectors, matrices can be of any static size with any of the primitive types as the element type.

After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors. CodeGen then lowers the arguments and forwards the vector values. CodeGen is already capable of further lowering vectors of any size to scalars if the target does not support vectors.

The lowering pass is also run at -O0, rather than legalizing the matrix type during CodeGen like it’s done for structure values or invalid vectors. I don’t really see big value in duplicating this logic across the IR and CodeGen. We just need a lighter mode in the pass at -O0.

** Roll-out and Maintenance **

Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.

We are obviously interested in maintaining and improving this code in the future.

Looking forward to comments and suggestions.

Thanks,

This sounds like it would be really useful for 3D graphics APIs, as SPIR-V (the Vulkan/OpenCL 2.1 intermediate representation) has matrices as first-class types. This will promote round-trip-ability. I have also heard that RISC-V’s V extension wants to support matrices (not sure if it’s as an additional extension on top of V or as part of V proper).

For the ABI, I recommend considering compatibility with the layouts required for 3D Graphics APIs. For Vulkan, I think the relevant section in the spec is 14.5.4.

Jacob Lifshay

Adam Nemet via cfe-dev <cfe-dev@lists.llvm.org> writes:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float>
@llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0,
<4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16

This sounds very interesting. Would it make sense to later expand the
idea to allow an arbitrary number of dimensions? Maybe that doesn't
make sense if we're restricted to statically-known dimensions.

How would this relate to scalable vectors? Most of the time matrix
dimensions are not known statically. Would <n x m x float> be possible?

Do you have a prototype of this?

-David


This sounds very interesting. Would it make sense to later expand the
idea to allow an arbitrary number of dimensions? Maybe that doesn't
make sense if we're restricted to statically-known dimensions.

Yes, that matches my current feeling too.

How would this relate to scalable vectors?

Scalable vectors would be a possible lowering of the matrix type. I *believe* you’d need to generate loops or at least some conditional code at run time due to the unknown scale factor.

Most of the time matrix dimensions are not known statically. Would <n x m x float> be possible?

No, we only support statically-known dimensions.

Do you have a prototype of this?

Yes. I can make it available if there is interest.


This seems to me to be proposing two distinct things – built-in matrix support in the frontend as a language extension, and support for matrices in LLVM IR – and I think it makes sense to discuss them at least somewhat separately.

On the Clang side: our general policy for frontend language extensions is described here: http://clang.llvm.org/get_involved.html

I’m totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I’d note that they don’t even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?

On the LLVM IR side, I’m personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there’s some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit. However, I do wonder: how is this different from, say, complex numbers, for which we don’t have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.) How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?


I had mentioned it earlier: I believe RISC-V’s V extension is planning on supporting matrices in hardware (maybe as an extension on top of V), so that satisfies there being a target with support for matrix registers. Additionally, SPIR-V has matrices as a first-class type, and I believe that there is support for targeting SPIR-V either in-progress or planned.

Jacob

On the LLVM IR side, I’m personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there’s some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.

Recent NVIDIA GPUs do support some matrix operations natively in the hardware: https://devblogs.nvidia.com/programming-tensor-cores-cuda-9/ .

Ciao,
.Andrea


This seems to me to be proposing two distinct things – built-in matrix support in the frontend as a language extension, and support for matrices in LLVM IR – and I think it makes sense to discuss them at least somewhat separately.

On the Clang side: our general policy for frontend language extensions is described here: http://clang.llvm.org/get_involved.html

I’m totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I’d note that they don’t even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?

Yes, we’re certainly interested in pushing this into the appropriate language standards. As I was saying in the proposal, right now this is experimental. We see great performance improvements with this, but we’d like to get feedback and data from the wider community and also continue to improve the implementation. Also, the surface area for this extension is pretty minimal and piggybacks on the vector support. As you’ve seen in the thread, there is a diverse set of people from GPU to CPU expressing interests and preferences. My goal was to put this out there and have people collaborate and then propose what’s actually working. Obviously ABI is a big consideration here and I don’t want to design in a vacuum. Would this model work for you?

I also strongly believe that the right approach for good matrix performance is gradual lowering, rather than lowering all the way to loops and then trying to recover the original intent of the program, which is what the alternative approach requires.

On the LLVM IR side, I’m personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there’s some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.

Since people already spoke up about direct HW needs, you’re probably satisfied here but as I said there are other benefits. For a chain of matrix operations it is very beneficial to fuse the operations and then have it operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.

However, I do wonder: how is this different from, say, complex numbers, for which we don’t have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)

That’s hard for me to answer as I didn’t look at complex-number performance.

How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?

As you say, complex numbers are not a first-class type so this is not supported just like it’s not for vectors, i.e. you’d have to use an array of complex numbers. That said this may tip the scales to introduce complex numbers as well.

Can you please write more about the cases you’re seeing where this provides benefits? What kinds of InstCombine(-like) rules have you implemented to go along with this? Is this something that we could get if we had better loop transformations (i.e., loop fusion)? Thanks again, Hal


Yes, we’re certainly interested in pushing this into the appropriate language standards. As I was saying in the proposal, right now this is experimental. We see great performance improvements with this, but we’d like to get feedback and data from the wider community and also continue to improve the implementation. Also, the surface area for this extension is pretty minimal and piggybacks on the vector support. As you’ve seen in the thread, there is a diverse set of people from GPU to CPU expressing interests and preferences. My goal was to put this out there and have people collaborate and then propose what’s actually working. Obviously ABI is a big consideration here and I don’t want to design in a vacuum. Would this model work for you?

Yes, I think that’s OK, but I also think you should start talking to the relevant people from WG14 and WG21 early (ie, right now) so that they can provide feedback to you and vice versa about the needs of the most likely eventual language model and of the optimizer. We don’t want to find in a year or two that this somehow fundamentally fails to satisfy the frontend language needs, or conversely that WG21 has standardized something that’s problematic to lower to our optimizable primitives.

If you’ve not done so already, it’d be a good idea to also talk to the Eigen folks to make sure that what you’re adding is something they could use in the cases where it’s applicable.

I also strongly believe that the right approach for good matrix performance is gradual lowering, rather than lowering all the way to loops and then trying to recover the original intent of the program, which is what the alternative approach requires.

On the LLVM IR side, I’m personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there’s some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.

Since people already spoke up about direct HW needs, you’re probably satisfied here

Yes, I am.

but as I said there are other benefits. For a chain of matrix operations it is very beneficial to fuse the operations and then have it operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.

Well, modern C++ matrix libraries deal with similar issues using expression templates, but such an approach is certainly not without drawbacks.

However, I do wonder: how is this different from, say, complex numbers, for which we don’t have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)

That’s hard for me to answer as I didn’t look at complex-number performance.

How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?

As you say, complex numbers are not a first-class type so this is not supported just like it’s not for vectors, i.e. you’d have to use an array of complex numbers. That said this may tip the scales to introduce complex numbers as well.

This seems worth thinking about early, even if you choose to defer implementing anything in this space.

I think there are a few potential problems with adding these into LLVM IR as a first-class type:

- With vectors, we have a simple extension from scalar operations to vector ones. A scalar operation on a vector is exactly the same as the same scalar operation applied pairwise to the elements. This is not the case for matrix types.

- A lot of analysis and transform passes need to have special cases for vectors. This is already a bit of a maintenance problem, but one that we put up with because of the big wins from vector types. Do matrix types have the same tradeoff?

- What does a matrix of pointers look like? What is a GEP on a matrix of pointers? If I do GEP on a matrix of pointers and a matrix of integers and I replace it with ptrtoint, add, inttoptr, is that equivalent?

- Does the IR representation enforce an in-memory representation? If different targets represent matrixes in different orders (e.g. column-major vs row-major) then how much do optimisers need to be aware of this?

- Can all of the more complex operations that you might want to implement on matrixes be implemented in terms of a smaller number of primitives, without resorting to long extract and insert element sequences?

- Can the same optimisations be used on sparse matrix representations that are then lowered by something custom nearer the back end?

- As others have said, how does this interact with variable-sized matrixes?

David

Can you please write more about the cases you’re seeing where this provides benefits? What kinds of InstCombine(-like) rules have you implemented to go along with this?

Sure. We have a case, for example, where the result of a matrix multiply is added in place to a submatrix. The intermediate matrix values can’t fit in the vector registers, but even if they could, using the minimal number of registers is very beneficial to provide more register-renaming bandwidth to the OoO cores. So we just load enough from the input matrices to compute a single native-vector-register-size chunk of the multiply, add the matching chunk from the submatrix, and then store it back. What the optimal input chunks are depends on whether the inputs are transposed or not, which is where the InstCombine-like rules come in.
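A scalar C sketch of that fused computation, assuming column-major 4 x 4 operands; the chunk here is one output column, standing in for whatever the native vector register size dictates:

```c
enum { M = 4 };  /* illustrative operand size */

/* Sub += A * B, where Sub is an M x M submatrix (leading dimension ldc) of
 * a larger column-major matrix. One column-sized chunk of the product is
 * computed, accumulated, and stored back at a time, so the full
 * intermediate product never has to be materialized. */
static void matmul_accumulate(const float *A, const float *B,
                              float *Sub, int ldc) {
  for (int j = 0; j < M; ++j) {
    float chunk[M];
    for (int i = 0; i < M; ++i) {
      float acc = 0.0f;
      for (int k = 0; k < M; ++k)
        acc += A[k * M + i] * B[j * M + k];  /* A(i,k) * B(k,j) */
      chunk[i] = acc;
    }
    for (int i = 0; i < M; ++i)  /* add the matching chunk, store back */
      Sub[j * ldc + i] += chunk[i];
  }
}
```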

This case takes advantage of another semantic rule with matrices that the new representation allows. Because we know we are loading from the same submatrix that we store to, we know that different columns of the load and the store can’t alias, which allows the loads and the stores to be reordered in order to enable the fusion above.

Is this something that we could get if we had better loop transformations (i.e., loop fusion)?

As we both know, aliasing usually makes this pretty difficult. For relatively small matrices run-time memchecks can be costly. Also we need to be able to evaluate these fusion opportunities in the inliner so having first-class type representation makes this cheap.
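For what it’s worth, the shape of such a run-time check is simple; here is a hedged sketch of the kind of interval-overlap test a fused lowering might guard reordered accesses with (illustrative only, not the actual emitted check):

```c
#include <stdint.h>
#include <stddef.h>

/* Returns nonzero if the two float ranges may overlap in memory.
 * Comparing via uintptr_t sidesteps UB on ordering unrelated pointers. */
static int may_overlap(const float *a, size_t an, const float *b, size_t bn) {
  uintptr_t as = (uintptr_t)a, ae = as + an * sizeof(float);
  uintptr_t bs = (uintptr_t)b, be = bs + bn * sizeof(float);
  return as < be && bs < ae;
}
```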

Richard Smith via cfe-dev <cfe-dev@lists.llvm.org> writes:

However, I do wonder: how is this different from, say, complex
numbers, for which we don't have a native IR representation? (Maybe
the answer is that we should have native IR support for complex
numbers too.)

We've got at least one person here with decades of experience handling
complex operations in compilers who thinks LLVM IR should absolutely
have native support for complex, FWIW. I don't have enough experience
with it myself to make a definitive statement but I can certainly see
the advantages.

Most of us here also think that LLVM IR should absolutely have native
support for complex.

As for matrices, I again can see the advantages but have no practical
experience to draw upon. But some people at Apple seem to think it's
advantageous and I'm interested in learning more.

-David

Adam Nemet via llvm-dev <llvm-dev@lists.llvm.org> writes:

Is this something that we could get if we had better loop
transformations (i.e., loop fusion)?

As we both know, aliasing usually makes this pretty difficult. For
relatively small matrices run-time memchecks can be costly. Also we
need to be able to evaluate these fusion opportunities in the inliner
so having first-class type representation makes this cheap.

Fortran defines away many of these problems. Would better local
restrict support help with C-family languages? There are several old
patches from Hal that need some attention.

-David

Agreed these patches would be neat to revive. I think we’d also want someone to pursue wg21.link/n4150.
However, none of that seems like it should gate matrix support in the IR.

JF Bastien via llvm-dev <llvm-dev@lists.llvm.org> writes:

Agreed these patches would be neat to revive.

We've been trying but haven't got responses to Phab comments we've
posted.

I think we’d also want someone to pursue wg21.link/n4150.

Definitely!

However, none of that seems like it should gate matrix support in the
IR.

Agreed. But we should at least be aware of alternatives so we have a
good sense of the pros/cons of the proposal.

-David

Recent NVIDIA GPUs do support some matrix operations natively in the hardware

The inputs to the operations in question are actually defined as opaque. You have to load the inputs using a special instruction, the semantics of which are not precisely specified. This instruction loads the matrix’s values distributed across the threads in a warp, and you are not allowed to assume anything about which values are where. In fact, they claim that the arrangement of values is architecture-dependent.

I suppose one might be able to spec the corresponding LLVM intrinsics for loading data in preparation for a matmul as taking a matrix pointer rather than a plain float*, but since as soon as you load it you no longer have a matrix in hand (each thread has some undefined subset of the matrix’s elements), I’m not seeing any real use there other than a sort of handwavy “type-safety” argument.

Richard Smith via cfe-dev <cfe-dev@lists.llvm.org> writes:

However, I do wonder: how is this different from, say, complex
numbers, for which we don’t have a native IR representation? (Maybe
the answer is that we should have native IR support for complex
numbers too.)

We’ve got at least one person here with decades of experience handling
complex operations in compilers who thinks LLVM IR should absolutely
have native support for complex, FWIW. I don’t have enough experience
with it myself to make a definitive statement but I can certainly see
the advantages.

Most of us here also think that LLVM IR should absolutely have native
support for complex.

I can certainly agree.


I think there are a few potential problems with adding these into LLVM IR as a first-class type:

- With vectors, we have a simple extension from scalar operations to vector ones. A scalar operation on a vector is exactly the same as the same scalar operation applied pairwise to the elements. This is not the case for matrix types.

Vectors also have things like reductions, masked load/store and scatter/gather. Matrices also have element-wise operations and then some custom ones. I don’t think the situation is all that different.

- A lot of analysis and transform passes need to have special cases for vectors. This is already a bit of a maintenance problem, but one that we put up with because of the big wins from vector types. Do matrix types have the same tradeoff?

I don’t think that’s the right way to look at this proposal. As I said, I think it’s better to look at matrices as an extension to vectors, adding a different view on the same underlying data. Thus most of the special cases for vectors would also apply to matrices, e.g. by treating both as SequentialTypes or introducing a new common base class. So effectively isVector becomes isVectorOrMatrix in many cases. This has definitely been the experience so far in the prototype.

With this extension we can cover a new class of applications that otherwise would require heroic front-end or user-level efforts. And since all these approaches perform premature lowering, enabler passes like the inliner can’t reason about the representation. Leaving them as first-class constructs makes this reasoning trivial.

- What does a matrix of pointers look like? What is a GEP on a matrix of pointers? If I do GEP on a matrix of pointers and a matrix of integers and I replace it with ptrtoint, add, inttoptr, is that equivalent?

Unlike vectors, I don’t think that’s a case that we want to support unless one of the HW implementations requires this. Do you think this would be useful?

- Does the IR representation enforce an in-memory representation? If different targets represent matrices in different orders (e.g. column-major vs row-major) then how much do optimisers need to be aware of this?

Most of this should be limited to the lowering pass. That is where we would reason about the layout, fusion, register pressure and vectorization.

How much InstCombine and similar passes would be affected is something we can decide based on the trade-offs for the specific cases we want to optimize. I don’t think that’s a priority or a requirement for now.

I feel that keeping this extension experimental would allow us to evaluate the pros and cons in this regard.

- Can all of the more complex operations that you might want to implement on matrices be implemented in terms of a smaller number of primitives, without resorting to long extract and insert element sequences?

I’d like to keep the canonical representation for these operations short so that high-level optimizations like the inliner can easily reason about them.

- Can the same optimisations be used on sparse matrix representations that are then lowered by something custom nearer the back end?

I feel this is related to your scatter/gather question but I am not sure I completely get it. Can you please elaborate?

- As others have said, how does this interact with variable-sized matrices?

I’d like to experiment with lowering variable-sized matrix operations to this representation. I have some hand-wavy ideas about this. Something where we would have a lambda expressing operations on fixed-sized tiles and then a mapping function that applies/extends that to variable-sized matrices. An expression-template library may be able to gather all the required pieces for this.
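To make the hand-wavy idea slightly more concrete, here is a minimal Python sketch of the tile-plus-mapping-function shape described above. All names, the 2x2 tile size, and the element-wise-add kernel are purely illustrative, not part of the proposal:

```python
# Hypothetical sketch: lower a variable-sized element-wise operation by
# tiling it into fixed-size kernels plus a mapping function.

def tile_kernel(a, b):
    # Fixed-size 2x2 kernel: stands in for an operation on the
    # fixed-size IR matrix type.
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]

def tiled_add(a, b, n):
    # Mapping function: applies the fixed-size kernel over an n x n
    # matrix (n assumed to be a multiple of 2 to keep the sketch short).
    out = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, 2):
        for bj in range(0, n, 2):
            ta = [row[bj:bj + 2] for row in a[bi:bi + 2]]
            tb = [row[bj:bj + 2] for row in b[bi:bi + 2]]
            t = tile_kernel(ta, tb)
            for i in range(2):
                for j in range(2):
                    out[bi + i][bj + j] = t[i][j]
    return out
```

An expression-template library would play the role of gathering the operations that end up inside the kernel.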

Hi,

We are proposing first-class type support for a new matrix type.

Interesting! Here are some thoughts, I’m sorry but I haven’t read the responses downthread.

This is a natural extension of the current vector type with an extra dimension.
For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16

LLVM already has a pretty general vector type (arbitrary number of elements). I’m aware of hardware that has rectangular vectors, e.g. NVIDIA tensor cores, Google has a proprietary in-house design with non-square vector registers, etc.

Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.

All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).

Ok. Makes sense, I agree that supporting the existing pointwise vector operations makes sense.

These are exposed in clang via builtins. E.g. the above operations look like this in C/C++:

typedef float mf4x4_t __attribute__((matrix_type(4, 4)));

mf4x4_t add(mf4x4_t a, mf4x4_t b) {
  return __builtin_matrix_multiply(a, b);
}

I’d recommend splitting the clang discussion from the LLVM discussion; there are completely different tradeoffs involved. I’ll focus on the LLVM IR side of things.

** Benefits **

Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors.

Right, it is basically the same benefit as having a vector type. You also get the ability to have specific alignments etc.

I think there are several options in the design space here:

1. Do nothing to the type system, but just use the existing vector types (<16 x float> in this case) with a new set of operations.
2. Introduce a “matrix” concept and associated operations.
3. Introduce N-dimensional vector registers and associated operations.

Personally, I’d be in favor of going with #1 followed by #3 followed distantly by #2.

The reason I’m opposed to a matrix type is that this is far too specific a concept to put into LLVM. We don’t even have signedness of integers in the type system: the instruction set is the major load-bearing part of the IR design, and the instruction set is extensible through intrinsics.

Arguing in favor of #1: AFAICT, you only need to add the new intrinsics to do matmul etc. You could just define them to take 1D vectors but apply math to them that interprets them as a 2D space. This is completely an IR level modeling issue, and would be a very non-invasive patch. You’re literally just adding a few intrinsics. All the pointwise operations and insert/extract/shuffles will “just work”. The frontend handles mapping 2d indices to 1D indices.
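As a concrete illustration of the modeling in #1, here is a small Python sketch of the 2D-to-1D index mapping the frontend would perform, and a matmul whose math reinterprets flat "vectors" as 2D. Row-major order and all names are assumptions for the sketch, not part of the proposal:

```python
# Sketch of option #1: a matrix is stored as a flat 1D vector
# (e.g. a 4x4 float matrix as a <16 x float>), and the matmul
# intrinsic's math reinterprets it as 2D.

def idx(i, j, cols):
    # The 2D -> 1D mapping the frontend would perform (row-major).
    return i * cols + j

def flat_matmul(a, b, m, k, n):
    # a models an m*k matrix, b a k*n matrix, both flat lists;
    # the result is a flat m*n list.
    c = [0.0] * (m * n)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[idx(i, p, k)] * b[idx(p, j, n)]
            c[idx(i, j, n)] = acc
    return c
```

Nothing here touches the type system: only the intrinsic's semantics know about the 2D interpretation.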

If you are interested in expanding the type system, I think that #3 is the way to go: extend VectorType to support multiple constant dimensions, like <4 x 32 x float>. I think that all of the existing vector intrinsics and instructions are still defined (pointwise). You’d then add matrix-specific intrinsics like your matmul above. If you are going to extend the type system, here are things to think about:

1. You will need to introduce legalization infra to lower all valid (as in, passing the verifier) cases to arbitrary targets, e.g. turning N-d element-wise add operations into <4 x float> adds or scalar operations. Legalization has a lot of the infra to do this, but it would be a big project.

2. Extending the type system has non-local effects, because you have to audit (e.g.) the mid-level optimizer to make sure that everything touching vector types or “add” is doing the right thing for the new case.

3. Extending the IR like this is plausible, but would need to be carefully considered by the community.
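As a rough model of the legalization step in point 1, this Python sketch splits an element-wise add over a flattened N-d value into fixed-width chunks, each chunk standing in for one legal <4 x float> operation. The names and the chunking strategy are illustrative only, not an actual legalization design:

```python
# Rough model of legalizing an N-d element-wise add: split the
# flattened value into legal fixed-width pieces.

def legalize_add(a, b, legal_width=4):
    # a and b model a flattened N-d value; emit one "legal" add per
    # chunk of legal_width elements.
    assert len(a) == len(b) and len(a) % legal_width == 0
    out = []
    for base in range(0, len(a), legal_width):
        chunk = [a[base + i] + b[base + i] for i in range(legal_width)]
        out.extend(chunk)  # each chunk stands in for one <4 x float> fadd
    return out
```

A real implementation would also have to handle remainders, illegal element types, and targets with no vectors at all, which is where the "big project" part comes in.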

Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FP-contraction.

Just to nit-pick here, but I’d suggest following the existing precedent of the IR in terms of contraction etc. You shouldn’t assume that all matmul-adds are contractable just because your current clients want that. Consistency with the rest of the IR is far more important, and we would WANT other clients to use this someday.

Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).

I don’t understand this point at all.

The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).

LLVM already has alignment support for memory objects, and the frontend can handle complex lowering like turning a 3x3 matrix into an LLVM IR <16 x float> (i.e. a 4x4 matrix) if that is more efficient for some reason. If there is a design to put this logic into the code generator and make it frontend-independent etc., then putting it in the LLVM IR type system is reasonable, so long as you “do it right” and implement full support for this (including legalization for all targets etc.).
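For concreteness, here is a Python sketch of the 3x3-into-<16 x float> idea: the matrix is laid out column-major inside a 4x4 buffer, so each column starts on a 4-element (HW vector) boundary. Column-major storage and the zero padding are assumptions for illustration, not a specified layout:

```python
# Sketch: lay out a rows x cols matrix column-major inside a padded
# 4x4 buffer so every column is vector-aligned.

def pad_to_4x4(m, rows=3, cols=3):
    buf = [0.0] * 16  # models a <16 x float>
    for j in range(cols):
        for i in range(rows):
            # Column j starts at offset j*4, i.e. on a 4-element boundary.
            buf[j * 4 + i] = m[i][j]
    return buf
```

The open question in the reply above is whether this belongs in the frontend or behind a frontend-independent interface in the code generator.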

The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit run-time alias checks.

LLVM isn’t really set up to do these sorts of optimizations on the IR: it treats SSA values as units, and tiling within an N-d vector will be challenging. If you’re interested in tackling this, it would be great to start doing it for 1D vectors, which have exactly the same issue: we generate atrocious code for chains of <256 x float> operations on machines that support <4 x float>, because of register pressure etc.

Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices. It includes whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.

There are other ways to model this. You can just add a new LLVM parameter attribute if necessary.

Having all this knowledge at the IR level means that front-ends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.

Agreed, but this only happens through a significant investment in making the mid-level IR general. I’m not opposed to this, but this is a significant engineering task.

At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.

I don’t see how your proposal helps with this. How do you represent layout in the IR?
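To illustrate why layout conversions are costly enough to deserve explicit instructions: converting a column-major buffer to row-major is a full permutation of the storage (a transpose of the layout), not a no-op. A tiny Python sketch, where the flat-buffer representation is an assumption for illustration:

```python
# Sketch: column-major -> row-major conversion is a full permutation
# of the underlying buffer, so it has real cost.

def col_major_to_row_major(buf, rows, cols):
    # buf[j*rows + i] holds element (i, j); produce out[i*cols + j].
    return [buf[j * rows + i] for i in range(rows) for j in range(cols)]
```

The question in the reply stands, though: this cost model only helps if the IR actually tracks which layout each value has.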

** Roll-out and Maintenance **

Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.

I think it is really important to understand the design and get buy-in from many folks before starting to land patches. I’d also love to chat about this at the devmtg if you’re available.

-Chris