RFC: First-class Matrix type

Hi,

For those of you at the Dev Meeting interested in this topic, we’re organizing a round table at 3:30 on Thursday.

See you there.

Adam

Hi,

We are proposing first-class type support for a new matrix type.

Interesting! Here are some thoughts. I'm sorry, but I haven't read the responses downthread.

This is a natural extension of the current vector type with an extra dimension.
For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16

LLVM already has a pretty general vector type (arbitrary number of elements). I’m aware of hardware that has rectangular vectors, e.g. nvidia tensor cores, Google has a proprietary in-house design with non-square vector registers, etc.

Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.

All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).

OK, I agree that supporting the existing pointwise vector operations makes sense.

These are exposed in clang via builtins. E.g. the above operations look like this in C/C++:

typedef float mf4x4_t __attribute__((matrix_type(4, 4)));

mf4x4_t multiply(mf4x4_t a, mf4x4_t b) {
  return __builtin_matrix_multiply(a, b);
}

I’d recommend splitting the clang discussion from the LLVM discussion; there are completely different tradeoffs involved. I’ll focus on the LLVM IR side of things.

** Benefits **

Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors.

Right, it is basically the same benefit as having a vector type. You also get the ability to have specific alignments etc.

I think there are several options in the design space here:

  1. Do nothing to the type system, but just use the existing vector types (<16 x float> in this case) with a new set of operations.
  2. Introduce a “matrix” concept and associated operations.
  3. Introduce N-dimensional vector registers and associated operations.

Personally, I’d be in favor of going with #1 followed by #3 followed distantly by #2.

FWIW, I strongly prefer #1 to the other options.

The reason I’m opposed to a matrix type is that this is far too specific a concept to put into LLVM. We don’t even have signedness of integers in the type system: the instruction set is the major load-bearing part of the IR design, and the instruction set is extensible through intrinsics.

Strongly agree.

Arguing in favor of #1: AFAICT, you only need to add the new intrinsics to do matmul etc. You could just define them to take 1D vectors but apply math to them that interprets them as a 2D space. This is completely an IR-level modeling issue, and would be a very non-invasive patch. You’re literally just adding a few intrinsics. All the pointwise operations and insert/extract/shuffles will “just work”. The frontend handles mapping 2D indices to 1D indices.

Even better, it makes it easy to support interesting row-major and col-major style operations w/o further diversification of the type system.

Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).

I don’t understand this point at all.

I think what it says is that a matrix type like <4 x 4 x i32> can be designed in a way that it does not imply the data layout (row major, column major, etc.), so that passes are free to transpose the data into another layout if it’s profitable.

It also seems to say that there can be such a global analysis pass to assign one layout per use, then insert the necessary transposes. Such a pass tries to achieve a global maximum of performance.

However, the argument seems to imply that a vector type like <16 x i32> can’t do so. In favor of option #1, I argue that the plain <16 x i32> enables the same optimization opportunities, as long as the uses are not on ABI boundaries.

Given that, for example, a unit stride along a 2nd or 3rd axis translates into a constant stride when flattened into a 1D vector: yes, that is essentially an equally easy pattern to recognise.

The aim will always be to interchange loops so as to get unit stride in an underlying 1D vector, which can be done directly.

I imagine that the information about the ABI would be handled solely by the front-end, which includes compile-time constant padding on all dimensions, making it feasible to add offsets when iterating through matrix rows/cols in 1D.

If the padding is different or badly aligned, it would cost the front-end the same to lower the right adds to the induction variables at the right time as it would cost the middle/back-end, but we already have (all?) such decisions being made by the front-ends.

Even if we were to try to simplify the IR codegen to expose more parallelism, the change could be an IR annotation on a new kind of scalar evolution (with arbitrary padding at arbitrary times), and not necessarily a new type.

Someone mentioned reductions, but we already have reduction intrinsics that deal with that:

https://llvm.org/docs/LangRef.html#experimental-vector-reduction-intrinsics

Perhaps just adding more of those would satisfy all needs?

With regards to aliasing, couldn't the front-end just force noalias when the "source code type" is guaranteed not to alias, as in Fortran? If we carry that through the optimisation passes, then it should be easy to avoid adding unnecessary runtime checks.

Finally, I do agree that trickling all of that matrix-specific knowledge down to generic passes like the loop vectoriser, inliner, etc. would perhaps be too big and imperfect an effort, but I don't see why a new type would fix this, other than by having to teach all the passes about the new type (which is the same cost as intrinsics).

An alternative would be a new, simpler pass, or even a simplified VPlan, based on a quick analysis to check for such patterns, guarded by the existence of "matrix attributes" having ever been set by the front-end, for example (just guessing here).

Hi,

Thanks to everybody who attended the round table and everybody who commented here or discussed this with me at the Dev Meeting. As the next step, I’d like to contrast the main alternatives (flattened vector with shape info on intrinsics vs. N-dimensional vector) in more detail with pros and cons and circulate it here.

Adam

Adam and I discussed this at the devmtg, and indeed his idea is to have a “codegen prepare” sort of pass that does some amount of pre-legalization of matrices (which should also be applicable to large vectors) with the goal of reducing register pressure etc.

Adam, can you please summarize the discussions you had and what you see as the next steps here? Thanks!

-Chris

Hi Chris,

I’d like to write up the main alternatives (flattened vector + shape/layout-aware intrinsics vs. N-dimensional vector) and contrast them with IR at the various stages.

I am busy with some internal stuff at the moment but hoping to get to this next week.

Thanks,
Adam

Awesome, thanks. Please don’t feel any pressure from me; I was just trying to be helpful and didn’t want your work blocked on me being otherwise busy.

-Chris