[RFC] Matrix support (take 2)

Hi all,

After the previous RFC[1], there were multiple discussions on the ML and in person at the DevMtg. I will summarize the options discussed and propose a path forward.

In general, IR type alignment is all but meaningless. A well-behaved frontend should be setting an explicit alignment on every declaration, definition, allocation, and access it generates (if it's ABI, at least), and it must always lower aggregate types in a way that understands that IR type alignment can arbitrarily differ from the actual alignment of the frontend's original type. So while I'm certainly supportive of allowing vector types to carry additional information besides just an element type and count, I'm a little skeptical specifically about the value of it carrying an alignment.

John.

-------------------------
Proper padding for small matrices
-------------------------

We want the individual rows/columns (for row/column-major) to be naturally aligned by no more than the max vector alignment. E.g. for a 3 x 3 x float with column major order, we should have a single element padding after each column which is a total of 48 bytes. For option A, since it’s a new type we can just define this in the new ABI.

For option B and C, on the other hand, vector alignment and padding is already mandated by the VectorType. This is part of the ABI when people use vector_size or ext_vector_type attributes on the clang side.

Alignment and allocation size (including padding) is the original size of vector rounded up to the next power-of-2. So for example a 3 x 3 x float pad(1) or 12 x float is rounded up to 64 bytes. This is excessive. I also need to support unpadded matrices that are effectively ABI-compatible with arrays.

Front-ends could lower to arrays instead of vectors and then cast them to vectors specifying the proper alignment. This would complicate front-ends and the IR. A more reasonable solution would be allow reducing the alignment and in turn the padding of the vector type itself. We could have an align attribute on the type itself:

For example <12 x float align(16)> for naturally aligned columns or <9 x float align(1)> for the unpadded case.

In summary, option B and C can be made to work with some IR extensions.

In general, IR type alignment is all but meaningless.

Sure, that’s clear. I may not have been precise above that I actually tried to lower the alignment in other ways but failed.

A well-behaved frontend should be setting an explicit alignment on every declaration, definition, allocation, and access it generates (if it's ABI, at least), and it must always lower aggregate types in a way that understands that IR type alignment can arbitrarily differ from the actual alignment of the frontend's original type.

I wasn’t able to lower the alignment (and alloc size) for alloca through the align attribute. I assumed that it was meant to only allow increasing alignment.

I am also unclear how lowering the alignment should work for a vector data member of a structure type. Would I have to make the structure packed and manually fill in the padding? Is that desirable every time a flattened matrix is used inside a structure?

Thanks,
Adam

-------------------------
Proper padding for small matrices
-------------------------

We want the individual rows/columns (for row/column-major) to be naturally aligned by no more than the max vector alignment. E.g. for a 3 x 3 x float with column major order, we should have a single element padding after each column which is a total of 48 bytes. For option A, since it’s a new type we can just define this in the new ABI.

For option B and C, on the other hand, vector alignment and padding is already mandated by the VectorType. This is part of the ABI when people use vector_size or ext_vector_type attributes on the clang side.

Alignment and allocation size (including padding) is the original size of vector rounded up to the next power-of-2. So for example a 3 x 3 x float pad(1) or 12 x float is rounded up to 64 bytes. This is excessive. I also need to support unpadded matrices that are effectively ABI-compatible with arrays.

Front-ends could lower to arrays instead of vectors and then cast them to vectors specifying the proper alignment. This would complicate front-ends and the IR. A more reasonable solution would be allow reducing the alignment and in turn the padding of the vector type itself. We could have an align attribute on the type itself:

For example <12 x float align(16)> for naturally aligned columns or <9 x float align(1)> for the unpadded case.

In summary, option B and C can be made to work with some IR extensions.

In general, IR type alignment is all but meaningless.

Sure, that’s clear. I may not have been precise above that I actually tried to lower the alignment in other ways but failed.

A well-behaved frontend should be setting an explicit alignment on every declaration, definition, allocation, and access it generates (if it's ABI, at least), and it must always lower aggregate types in a way that understands that IR type alignment can arbitrarily differ from the actual alignment of the frontend's original type.

I wasn’t able to lower the alignment (and alloc size) for alloca through the align attribute.

Oh, I see, this is meant to avoid the rounding-up of the alloc size; somehow I missed that part of your post. Yes, it makes sense that you need to be able to specify that.

I assumed that it was meant to only allow increasing alignment.

Are you talking about the Clang attribute? Yes, IIRC, that attribute is defined as not lowering alignment unless it's combined with packed.

I am also unclear how lowering the alignment should work for a vector data member of a structure type. Would I have to make the structure packed and manually fill in the padding? Is that desirable every time a flattened matrix is used inside a structure?

That's what Clang does when the IR type for a field is aligned such that it would end up at the wrong offset, yes.

John.

The clang attribute does not lower alignment on structs (or struct members) without packed, but it does lower alignment on typedefs.

Oh, yes, of course.

John.


Proper padding for small matrices

We want the individual rows/columns (for row/column-major) to be naturally aligned by no more than the max vector alignment. E.g. for a 3 x 3 x float with column major order, we should have a single element padding after each column which is a total of 48 bytes. For option A, since it’s a new type we can just define this in the new ABI.

For option B and C, on the other hand, vector alignment and padding is already mandated by the VectorType. This is part of the ABI when people use vector_size or ext_vector_type attributes on the clang side.

Alignment and allocation size (including padding) is the original size of vector rounded up to the next power-of-2. So for example a 3 x 3 x float pad(1) or 12 x float is rounded up to 64 bytes. This is excessive. I also need to support unpadded matrices that are effectively ABI-compatible with arrays.

Front-ends could lower to arrays instead of vectors and then cast them to vectors specifying the proper alignment. This would complicate front-ends and the IR. A more reasonable solution would be allow reducing the alignment and in turn the padding of the vector type itself. We could have an align attribute on the type itself:

For example <12 x float align(16)> for naturally aligned columns or <9 x float align(1)> for the unpadded case.

In summary, option B and C can be made to work with some IR extensions.

In general, IR type alignment is all but meaningless.

Sure, that’s clear. I may not have been precise above that I actually tried to lower the alignment in other ways but failed.

A well-behaved frontend should be setting an explicit alignment on every declaration, definition, allocation, and access it generates (if it’s ABI, at least), and it must always lower aggregate types in a way that understands that IR type alignment can arbitrarily differ from the actual alignment of the frontend’s original type.

I wasn’t able to lower the alignment (and alloc size) for alloca through the align attribute.

Oh, I see, this is meant to avoid the rounding-up of the alloc size; somehow I missed that part of your post. Yes, it makes sense that you need to be able to specify that.

I assumed that it was meant to only allow increasing alignment.

Are you talking about the Clang attribute? Yes, IIRC, that attribute is defined as not lowering alignment unless it’s combined with packed.

No, I meant IR. It obviously works on loads and stores to indicate an unaligned access but not on alloca. It’s probably bug, I’ll look into it.

I am also unclear how lowering the alignment should work for a vector data member of a structure type. Would I have to make the structure packed and manually fill in the padding? Is that desirable every time a flattened matrix is used inside a structure?

That’s what Clang does when the IR type for a field is aligned such that it would end up at the wrong offset, yes.

OK, we can adopt that model for now.

Adam

Hi all,

After the previous RFC[1], there were multiple discussions on the ML and in person at the DevMtg. I will summarize the options discussed and propose a path forward.

===========================
Options

A. Extend VectorType to be multidimensional

B. Flatten matrices into the current VectorType. Matrix shape and layout information is passed to matrix intrinsics. All matrix operations including element-wise matrix operations are implemented via intrinsics

C. Same as B but padding is explicitly managed by shufflevector instructions, element-wise operations are implemented via built-in operators (e.g. fadd)

===========================
tl;dr

There was some support for option A to introduce first-class matrices (or multidimensional vectors) but also many concerns. I have sketched out many examples in IR and flattening matrices to vectors does not seem to present any clear show-stoppers. Thus, I am scaling back the proposal to option B which is a more incremental step.

Seems reasonable to start small, learn, then build out from there.

Throughout this work, an important goal is to provide a matrix-aware IRBuilder API. E.g.:

Value *CreateMatrixAdd(Value Op0, Value Op1,
unsigned Rows, unsigned Cols,
MatrixLayout ML /
row/column-major, padding */);

This will allow for simpler front-ends and would allow us to swap out the generated IR if the design needs to change.

Why does this have to be on IRBuilder? If you were writing this in Swift, then these would be an extension on IRBuilder that added some methods like this, but could be kept out of core.

C++ isn’t as sophisticated, but you could still define these as a separate header file with functions like:

Value *BuilderCreateMatrixAdd(IRBuilder &B, … other args)

This would avoid adding a bunch of matrix specific stuff to the core IRBuilder class and header.

===========================
Details


Introduction

Representing multi-dimensional vectors in the IR through types makes the IR more expressive (option A). Additionally, if we have a new type we have the freedom to implicitly map it to a layout. E.g. <3 x 3 x float> could imply column-major order and one element of padding between each column. When it’s passed or returned from functions it should be passed in 3 vector registers.

This is a sample IR to add two 3x3 matrices followed by a matrix multiply with a 3x2:

%a = load <3 x 3 x float>, <3 x 3 x float>* %A
%b = load <3 x 3 x float>, <3 x 3 x float>* %B
%c = load <3 x 2 x float>, <3 x 2 x float>* %C

%add = fadd <3 x 3 x float> %a, %b
%mul = call <3 x 2 x float> @llvm.matrix.multiply(<3 x 3 x float> %add,
<3 x 2 x float> %c)

store <3 x 2 x float> %mul, <3 x 2 x float>* %MUL

LLVM requires name mangling the type into the intrinsic, but yeah.

Note that the type always implies a layout here. If we have multiple layouts appear in the same module beyond the default specified by DataLayout, we would have these represented in the type, e.g. <3 x 3 x float column-major pad(1)> specifying column-major layout with a single element of padding after each 3-element column vector.

Right.

Also note that we’re using built-in fadd operator for element-wise operation and an intrinsic for non-element-wise operations like matrix multiply.

Instead of extending the type system, we can map the matrix instances onto existing types. The vector type is a natural fit as it can be considered a SequenceType with Row*Column elements of the element type. For operations, like matrix multiply we just need to pass the shape information for the extra dimension.

One question arises though, how padding should be handled. For one, performing operations like division with the padding can cause spurious faults. But even for non-trapping operations excluding padding should be an option. For example in the case of a <3 x 3 x double>, we may want to lower a single row/column into the combination of a 128B vector register (2 elts) and a scalar rather than two vectors. This should be more beneficial for power. Thus we want to make padding explicit in the IR.

One option is to expose the shape to all operations including element-wise operations. This is option B. With that, the above sequence looks like this:

%a = load <12 x float>, <12 x float>* %A, align 16
%b = load <12 x float>, <12 x float>* %B, align 16
%c = load <8 x float>, <8 x float>* %C, align 16

%add = call <12 x float> @llvm.matrix.fadd(<12 x float> %a, <12 x float> %b,
; 3 x 3 column-major:
i32 3, i32 3, i1 true)

%mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add, <8 x float> %c,
; 3 x 3 3 x 2 column-major:
i32 3, i32 3, i32 3, i32 2, i1 true)
store <8 x float> %mul, <8 x float>* %MUL, align 16

Each computation takes full shape information. The matrix shape is described with the row and columns dimensions and are passed to the intrinsic as constant parameters. We can also pass layout information like whether the matrices are laid out in row-major or column-major order. We’re using column major order in the example and as such the 3 x 3 x float matrix is flattened into a 12 x float vector with one element padding at the end of each column.

I don’t understand this. What is the benefit of providing layout info to element wise operations? This defeats the goal of having simple lowering and representation: you are encoding an ND vector form into the IR in a really ugly way, and this will cause a proliferation of intrinsics that are redundant with the core ops.

Also, are 2d matrices really as general as we want to go here? Generally you go from 1 to 2 to N, and it seems lik you are proposing going from 1 (scalar) to 2 (vectors) to 3 (2d arrays) without giving N. If you want to provide layout info for general ND arrays, a single bit is not going to be enough nor is your row/col size representation.

The amount of padding does not require any new parameters. We can compute it using the shape information and the size of the flattened matrix (e.g. %c which is a <3 x 2 x float> also has one element of padding: Elts / Columns - Rows = 8 / 2 - 3 = 1).

In order to expose the padding elements to element-wise operation (fadd), option B maps those to intrinsics. We can expose the padding bytes in other ways such that we can still use the built-in element-wise operators. One way would be to extend the vector types with specifying the padding, something like <12 x float pad(3, 7, 11)> (John McCall’s idea) or removing the padding with explicit shufflevectors (Chandler’s idea). I explored the lattter under option C. With option C, the same sequence of operations look like this:

%a.padded = load <12 x float>, <12 x float>* %A, align 16
; remove padding
%a = shufflevector <12 x float> %a.padded, <12 x float> undef,
<9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>

%b.padded = load <12 x float>, <12 x float>* %B, align 16
%b = shufflevector <12 x float> %b.padded, <12 x float> undef,
<9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>

%c.padded = load <8 x float>, <8 x float>* %C, align 16
%c = shufflevector <8 x float> %c.padded, <8 x float> undef,
<6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6>

%add = fadd <9 x float> %a, %b

I don’t understand why you’re trying to avoid adding padding. If you are worried about snans, then it seems that you could arrange for the producers of padding to have some guaranteed properties instead of being undef.


Matrix Operation Lowering and Fusion


Common to all of these options is that we are proposing a new IR pass that pre-legalizes matrix operations by lowering them to operations that are natively supported by the HW. This means decomposing the operations into native SIMD operations.

This pass will be used to also de-interleave a chain of matrix operations to manage register pressure.

Cool. While I’m not really thrilled with yet another “codegenprepare” style pass, I agree it is the most pragmatic given the lack of pervasive global isel etc.

Just an observation: this pass can be dropped in today and would be useful for large vectors, independent of your matrix work.

Note that we only have shape and layout information on computations. We don’t have them on other instructions like: load, store, phi, select, bitcast, memcpy intrinsic etc. Since the shape and layout information is critical to avoid unnecessary shuffles when working on rows/columns we need to recover this by propagating this information to all matrix operations.

My sense is that this info is important for your lowering, and your approach of using dataflow analysis to recover this will fail in some cases.

Since layout and padding information is important, it seems most logical to put this into the type. Doing so would make it available in all these places.

That said, I still don’t really understand why you need it.

-Chris

Hi Chris,

Hi all,

After the previous RFC[1], there were multiple discussions on the ML and in person at the DevMtg. I will summarize the options discussed and propose a path forward.

===========================
Options

A. Extend VectorType to be multidimensional

B. Flatten matrices into the current VectorType. Matrix shape and layout information is passed to matrix intrinsics. All matrix operations including element-wise matrix operations are implemented via intrinsics

C. Same as B but padding is explicitly managed by shufflevector instructions, element-wise operations are implemented via built-in operators (e.g. fadd)

===========================
tl;dr

There was some support for option A to introduce first-class matrices (or multidimensional vectors) but also many concerns. I have sketched out many examples in IR and flattening matrices to vectors does not seem to present any clear show-stoppers. Thus, I am scaling back the proposal to option B which is a more incremental step.

Seems reasonable to start small, learn, then build out from there.

Throughout this work, an important goal is to provide a matrix-aware IRBuilder API. E.g.:

Value *CreateMatrixAdd(Value Op0, Value Op1,
unsigned Rows, unsigned Cols,
MatrixLayout ML /
row/column-major, padding */);

This will allow for simpler front-ends and would allow us to swap out the generated IR if the design needs to change.

Why does this have to be on IRBuilder? If you were writing this in Swift, then these would be an extension on IRBuilder that added some methods like this, but could be kept out of core.

C++ isn’t as sophisticated, but you could still define these as a separate header file with functions like:

Value *BuilderCreateMatrixAdd(IRBuilder &B, … other args)

This would avoid adding a bunch of matrix specific stuff to the core IRBuilder class and header.

Sure, that works too.

===========================
Details


Introduction

Representing multi-dimensional vectors in the IR through types makes the IR more expressive (option A). Additionally, if we have a new type we have the freedom to implicitly map it to a layout. E.g. <3 x 3 x float> could imply column-major order and one element of padding between each column. When it’s passed or returned from functions it should be passed in 3 vector registers.

This is a sample IR to add two 3x3 matrices followed by a matrix multiply with a 3x2:

%a = load <3 x 3 x float>, <3 x 3 x float>* %A
%b = load <3 x 3 x float>, <3 x 3 x float>* %B
%c = load <3 x 2 x float>, <3 x 2 x float>* %C

%add = fadd <3 x 3 x float> %a, %b
%mul = call <3 x 2 x float> @llvm.matrix.multiply(<3 x 3 x float> %add,
<3 x 2 x float> %c)

store <3 x 2 x float> %mul, <3 x 2 x float>* %MUL

LLVM requires name mangling the type into the intrinsic, but yeah.

Yes, we have that in our prototype. I just removed it from here for readability.

Note that the type always implies a layout here. If we have multiple layouts appear in the same module beyond the default specified by DataLayout, we would have these represented in the type, e.g. <3 x 3 x float column-major pad(1)> specifying column-major layout with a single element of padding after each 3-element column vector.

Right.

Also note that we’re using built-in fadd operator for element-wise operation and an intrinsic for non-element-wise operations like matrix multiply.

Instead of extending the type system, we can map the matrix instances onto existing types. The vector type is a natural fit as it can be considered a SequenceType with Row*Column elements of the element type. For operations, like matrix multiply we just need to pass the shape information for the extra dimension.

One question arises though, how padding should be handled. For one, performing operations like division with the padding can cause spurious faults. But even for non-trapping operations excluding padding should be an option. For example in the case of a <3 x 3 x double>, we may want to lower a single row/column into the combination of a 128B vector register (2 elts) and a scalar rather than two vectors. This should be more beneficial for power. Thus we want to make padding explicit in the IR.

One option is to expose the shape to all operations including element-wise operations. This is option B. With that, the above sequence looks like this:

%a = load <12 x float>, <12 x float>* %A, align 16
%b = load <12 x float>, <12 x float>* %B, align 16
%c = load <8 x float>, <8 x float>* %C, align 16

%add = call <12 x float> @llvm.matrix.fadd(<12 x float> %a, <12 x float> %b,
; 3 x 3 column-major:
i32 3, i32 3, i1 true)

%mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add, <8 x float> %c,
; 3 x 3 3 x 2 column-major:
i32 3, i32 3, i32 3, i32 2, i1 true)
store <8 x float> %mul, <8 x float>* %MUL, align 16

Each computation takes full shape information. The matrix shape is described with the row and columns dimensions and are passed to the intrinsic as constant parameters. We can also pass layout information like whether the matrices are laid out in row-major or column-major order. We’re using column major order in the example and as such the 3 x 3 x float matrix is flattened into a 12 x float vector with one element padding at the end of each column.

I don’t understand this. What is the benefit of providing layout info to element wise operations? This defeats the goal of having simple lowering and representation: you are encoding an ND vector form into the IR in a really ugly way, and this will cause a proliferation of intrinsics that are redundant with the core ops.

The reason we need that information so that for example we can lower an operation on a 3-element column into a vector of 2 and a scalar op. This should be beneficial for power consumption since for example in the case of a 3x3 with a single element padding rather than operating on 12 elements you’d operate only on 9 (vector ops consume more power than their scalar counterparts).

That said we should be able to remove these intrinsics in the long term. Once we have masking on the core ops in the IR, we should be able to express the same semantics without dedicated intrinsics.

Also, are 2d matrices really as general as we want to go here? Generally you go from 1 to 2 to N, and it seems lik you are proposing going from 1 (scalar) to 2 (vectors) to 3 (2d arrays) without giving N. If you want to provide layout info for general ND arrays, a single bit is not going to be enough nor is your row/col size representation.

Yes, we could start with an ND-ready interface too. I was going to start with just matrix because that is my immediate need and then generalize from there but I guess we can start with the more generalized intrinsic even if we first only focus on the 2D implementation?

The amount of padding does not require any new parameters. We can compute it using the shape information and the size of the flattened matrix (e.g. %c which is a <3 x 2 x float> also has one element of padding: Elts / Columns - Rows = 8 / 2 - 3 = 1).

In order to expose the padding elements to element-wise operation (fadd), option B maps those to intrinsics. We can expose the padding bytes in other ways such that we can still use the built-in element-wise operators. One way would be to extend the vector types with specifying the padding, something like <12 x float pad(3, 7, 11)> (John McCall’s idea) or removing the padding with explicit shufflevectors (Chandler’s idea). I explored the lattter under option C. With option C, the same sequence of operations look like this:

%a.padded = load <12 x float>, <12 x float>* %A, align 16
; remove padding
%a = shufflevector <12 x float> %a.padded, <12 x float> undef,
<9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>

%b.padded = load <12 x float>, <12 x float>* %B, align 16
%b = shufflevector <12 x float> %b.padded, <12 x float> undef,
<9 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6, i32 8, i32 9, i32 10>

%c.padded = load <8 x float>, <8 x float>* %C, align 16
%c = shufflevector <8 x float> %c.padded, <8 x float> undef,
<6 x i32> <i32 0, i32 1, i32 2, i32 4, i32 5, i32 6>

%add = fadd <9 x float> %a, %b

I don’t understand why you’re trying to avoid adding padding. If you are worried about snans, then it seems that you could arrange for the producers of padding to have some guaranteed properties instead of being undef.

It’s more the extra work/power consumed by the padding elements that concerns me.


Matrix Operation Lowering and Fusion


Common to all of these options is that we are proposing a new IR pass that pre-legalizes matrix operations by lowering them to operations that are natively supported by the HW. This means decomposing the operations into native SIMD operations.

This pass will be used to also de-interleave a chain of matrix operations to manage register pressure.

Cool. While I’m not really thrilled with yet another “codegenprepare” style pass, I agree it is the most pragmatic given the lack of pervasive global isel etc.

Just an observation: this pass can be dropped in today and would be useful for large vectors, independent of your matrix work.

Note that we only have shape and layout information on computations. We don’t have them on other instructions like: load, store, phi, select, bitcast, memcpy intrinsic etc. Since the shape and layout information is critical to avoid unnecessary shuffles when working on rows/columns we need to recover this by propagating this information to all matrix operations.

My sense is that this info is important for your lowering, and your approach of using dataflow analysis to recover this will fail in some cases.

Since layout and padding information is important, it seems most logical to put this into the type. Doing so would make it available in all these places.

That said, I still don’t really understand why you need it.

This seems like the main sticking point so let’s close on this first and see if my answers above are satisfying.

Thanks for taking a look!

Adam

Adam Nemet via llvm-dev <llvm-dev@lists.llvm.org> writes:

That said we should be able to remove these intrinsics in the long
term. Once we have masking on the core ops in the IR, we should be
able to express the same semantics without dedicated intrinsics.

Is there actually a proposal for adding core masking support? It's been
brought up several times in the past but always ends up in the "too hard
to do" bin.

                          -David

There may be some cases where this holds (maybe with 5x5 or something), but most of the time I would expect to get better power from doing a four-element vector op with one wasted lane than doing two arithmetic ops (plus possibly extracts and inserts, depending on physical layout details).

Explicit masking or arranging for zero in padding lanes seems like a better way forward to me.
– Steve

I spent some time chatting with Adam about this and have a better understanding of his concerns here. It seems to me that if having masking intrinsics is the long-term solution we want, we should do that now (for add and sub) rather than building arbitrary matrix layout info into intrinsics, since a mask has all the information that we actually need.

– Steve

I think that sounds like a reasonable compromise. We already have masked load/store intrinsics so adding add and sub just follows that precedent. If the decision is made to move masking to the core operations, the new intrinsics would just move as well.

So an add->multiply for option B + masking intrinsics would look like this:

%a = load <12 x float>, <12 x float>* %A, align 16
%b = load <12 x float>, <12 x float>* %B, align 16
%c = load <8 x float>, <8 x float>* %C, align 16

%add = call <12 x float> @llvm.masked.fadd(<12 x float> %a, <12 x float> %b,
; mask, if false element is taken from passthrough
<12 x i1> <i1 true, i1 true, i1 true, i1 false,
i1 true, i1 true, i1 true, i1 false,
i1 true, i1 true, i1 true, i1 false >
; passthrough:
<12 x float> <float undef, float undef, float undef, float undef,
float undef, float undef, float undef, float undef,
float undef, float undef, float undef, float undef >)

%mul = call <8 x float> @llvm.matrix.multiply(<12 x float> %add, <8 x float> %c,
; 3 x 3 3 x 2 column-major:
i32 3, i32 3, i32 3, i32 2, i1 true)
store <8 x float> %mul, <8 x float>* %MUL, align 16

Adam Nemet via llvm-dev <llvm-dev@lists.llvm.org> writes:

    I spent some time chatting with Adam about this and have a better
    understanding of his concerns here. It seems to me that if having
    masking intrinsics is the long-term solution we want, we should do
    that now (for add and sub) rather than building arbitrary matrix
    layout info into intrinsics, since a mask has all the information
    that we actually need.

I think that sounds like a reasonable compromise. We already have
masked load/store intrinsics so adding add and sub just follows that
precedent. If the decision is made to move masking to the core
operations, the new intrinsics would just move as well.

How will existing passes be taught about the new intrinsics? For
example, what would have to be done to instcombine to teach it about
these intrinsics? Let's suppose every existing operation had an
equivalent masked intrinsic. Would it be easier to teach all of the
passes about them or would it be easier to teach the passes about a mask
operand on the existing Instructions? Would it be easier to teach isel
about all the intrinsics or would it be easier to teach isel about a
mask operand?

I honestly don't know the answers to these questions. But I think they
are important to consider, especially if intrinsics are seen as a bridge
to first-class IR support for masking.

                                 -David

Hi,

We’ve started an RFC that proposes exactly this:

Hi,

Adam Nemet via llvm-dev <llvm-dev@lists.llvm.org> writes:

     I spent some time chatting with Adam about this and have a better
     understanding of his concerns here. It seems to me that if having
     masking intrinsics is the long-term solution we want, we should do
     that now (for add and sub) rather than building arbitrary matrix
     layout info into intrinsics, since a mask has all the information
     that we actually need.

I think that sounds like a reasonable compromise. We already have
masked load/store intrinsics so adding add and sub just follows that
precedent. If the decision is made to move masking to the core
operations, the new intrinsics would just move as well.

How will existing passes be taught about the new intrinsics? For
example, what would have to be done to instcombine to teach it about
these intrinsics? Let's suppose every existing operation had an
equivalent masked intrinsic. Would it be easier to teach all of the
passes about them or would it be easier to teach the passes about a mask
operand on the existing Instructions? Would it be easier to teach isel
about all the intrinsics or would it be easier to teach isel about a
mask operand?

Consider that over night we introduce optional mask parameters to vector instructions. Then, since you can not safely ignore the mask, every transformation and analysis that is somehow concerned with vector instructions is potentially broken and needs to be fixed.

If you go with masking intrinsics, and set the attributes right, it is clear that transformations won't break your code and you will need to teach InstCombine, DAGCombiner, etc that a `masked.fadd` is just an `fadd` with a mask. However, this gives you the opportunity to "re-enable" one optimization add a time each time making sure that the mask is handled correctly. In case of InstCombine, the vector instruction patterns transfer to mask intrinsics: if all mask intrinsics in the pattern have the same mask parameter you can apply the transformation, the resulting mask intrinsics will again take the same mask parameter.

Also, this need not be a hard transition from vector instructions to masking intrinsics.. you can add new types of masking intrinsics in batches along with the required transformations. Masking intrinsics and vector instruction can live side by side (as they do today, anyway).

I honestly don't know the answers to these questions. But I think they
are important to consider, especially if intrinsics are seen as a bridge
to first-class IR support for masking.

I think its sensible to use masking intrinsics (or EVL ⚙ D53613 RFC: Explicit Vector Length Intrinsics and Attributes) on IR level and masked SD nodes in the backend. However, i agree that intrinsics should just be a bridge to native support mid term.

- Simon

Simon Moll <moll@cs.uni-saarland.de> writes:

How will existing passes be taught about the new intrinsics? For
example, what would have to be done to instcombine to teach it about
these intrinsics? Let's suppose every existing operation had an
equivalent masked intrinsic. Would it be easier to teach all of the
passes about them or would it be easier to teach the passes about a mask
operand on the existing Instructions? Would it be easier to teach isel
about all the intrinsics or would it be easier to teach isel about a
mask operand?

Consider that over night we introduce optional mask parameters to
vector instructions. Then, since you can not safely ignore the mask,
every transformation and analysis that is somehow concerned with
vector instructions is potentially broken and needs to be fixed.

True, but is there a way we could do this incrementally? Even if we
start with intrinsics and then migrate to first-class support, at some
point passes are going to be broken with respect to masks on
Instructions.

If you go with masking intrinsics, and set the attributes right, it is
clear that transformations won't break your code and you will need to
teach InstCombine, DAGCombiner, etc that a `masked.fadd` is just an
fadd` with a mask. However, this gives you the opportunity to
"re-enable" one optimization add a time each time making sure that the
mask is handled correctly. In case of InstCombine, the vector
instruction patterns transfer to mask intrinsics: if all mask
intrinsics in the pattern have the same mask parameter you can apply
the transformation, the resulting mask intrinsics will again take the
same mask parameter.

Right.

Also, this need not be a hard transition from vector instructions to
masking intrinsics.. you can add new types of masking intrinsics in
batches along with the required transformations. Masking intrinsics
and vector instruction can live side by side (as they do today,
anyway).

Of course.

I honestly don't know the answers to these questions. But I think they
are important to consider, especially if intrinsics are seen as a bridge
to first-class IR support for masking.

I think its sensible to use masking intrinsics (or EVL
⚙ D53613 RFC: Explicit Vector Length Intrinsics and Attributes) on IR level and masked SD nodes in
the backend. However, i agree that intrinsics should just be a bridge
to native support mid term.

The biggest question I have is how such a transition would happen.
Let's say we have a full set of masking intrinsics. Now we want to take
one IR-level operation, say fadd, and add mask support to it. How do we
do that? Is it any easier because we have all of the intrinsics, or
does all of the work on masking intrinsics get thrown away at some
point?

I'm reminded of this now decade-old thread on gather/scatter and masking
from Don Gohman, which I also mentioned in an SVE thread earlier this
year:

https://lists.llvm.org/pipermail/llvm-dev/2008-August/016284.html

The applymask idea got worked through a bit and IIRC at some later point
someone found issues with it that need to be addressed, but it's an
interesting idea to consider. I wasn't too hot on it at the time but it
may be a way forward.

In that thread, Tim Foley posted a summary of options for mask support,
one of which was adding intrinsics:

https://lists.llvm.org/pipermail/llvm-dev/2008-August/016371.html

                                -David

Simon Moll <moll@cs.uni-saarland.de> writes:

How will existing passes be taught about the new intrinsics? For
example, what would have to be done to instcombine to teach it about
these intrinsics? Let's suppose every existing operation had an
equivalent masked intrinsic. Would it be easier to teach all of the
passes about them or would it be easier to teach the passes about a mask
operand on the existing Instructions? Would it be easier to teach isel
about all the intrinsics or would it be easier to teach isel about a
mask operand?

Consider that over night we introduce optional mask parameters to
vector instructions. Then, since you can not safely ignore the mask,
every transformation and analysis that is somehow concerned with
vector instructions is potentially broken and needs to be fixed.

True, but is there a way we could do this incrementally? Even if we
start with intrinsics and then migrate to first-class support, at some
point passes are going to be broken with respect to masks on
Instructions.

Here is path an idea for an incremental transition:

a) Create a new, distinct type. Let's say its called the "predicated vector type", written "{W x double}".

b) Make the predicate vector type a legal operand type for all binary operators and add an optional predicate parameter to them. Now, here is the catch: the predicate parameter is only legal if the data type of the operation is "predicated vector type". That is "fadd <8 x double>" will for ever be unpredicated. However, "fadd {8 x double} %a, %b" may have an optional predicate argument. Semantically, these two operations would be identical:

fadd <8 x double>, %a, %b

fadd {8 x double}, %a, %b, predicate(<8 x i1><1, 1, 1, 1, 1, 1, 1, 1>)

In terms of the LLVM type hierachy, PredicatedVectorType would be distinct from VectorType and so no transformation can break it. While you are in the transition (from unpredicated to predicated IR), you may see codes like this:

%aP = bitcast <8 x double> %a to {8 x double}
%bP = bitcast <8 x double> %b to {8 x double}
%cP = fdiv %aP, %bP, mask(11101110) ; predicated fdiv
%c = bitcast <8 x double> %c to %cP
%d = fadd <8 x double> %a, %c ; no predicated fadd yet

Eventually, when all optimizations/instructions/analyses have been migrated to run well with the new type, 1. deprecate the old vector type, 2. promote it to PredicatedVectorType when parsing BCand, after a grace period, rename {8 x double} to <8 x double>

If you go with masking intrinsics, and set the attributes right, it is
clear that transformations won't break your code and you will need to
teach InstCombine, DAGCombiner, etc that a `masked.fadd` is just an
fadd` with a mask. However, this gives you the opportunity to
"re-enable" one optimization add a time each time making sure that the
mask is handled correctly. In case of InstCombine, the vector
instruction patterns transfer to mask intrinsics: if all mask
intrinsics in the pattern have the same mask parameter you can apply
the transformation, the resulting mask intrinsics will again take the
same mask parameter.

Right.

Also, this need not be a hard transition from vector instructions to
masking intrinsics.. you can add new types of masking intrinsics in
batches along with the required transformations. Masking intrinsics
and vector instruction can live side by side (as they do today,
anyway).

Of course.

I honestly don't know the answers to these questions. But I think they
are important to consider, especially if intrinsics are seen as a bridge
to first-class IR support for masking.

I think its sensible to use masking intrinsics (or EVL
⚙ D53613 RFC: Explicit Vector Length Intrinsics and Attributes) on IR level and masked SD nodes in
the backend. However, i agree that intrinsics should just be a bridge
to native support mid term.

The biggest question I have is how such a transition would happen.
Let's say we have a full set of masking intrinsics. Now we want to take
one IR-level operation, say fadd, and add mask support to it. How do we
do that? Is it any easier because we have all of the intrinsics, or
does all of the work on masking intrinsics get thrown away at some
point?

The masking intrinsics are just a transitional thing. Eg, we could add them now and let them mature. Once the intrinsics are stable and proven start migrating for core IR support (eg as sketched above).

I'm reminded of this now decade-old thread on gather/scatter and masking
from Don Gohman, which I also mentioned in an SVE thread earlier this
year:

[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR

The applymask idea got worked through a bit and IIRC at some later point
someone found issues with it that need to be addressed, but it's an
interesting idea to consider. I wasn't too hot on it at the time but it
may be a way forward.

In that thread, Tim Foley posted a summary of options for mask support,
one of which was adding intrinsics:

[LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR

                                 -David

Thank for you for the pointer! Is this documented somewhere? (say in a wiki or some proposal doc). Otherwise, we are bound to go through these discussions again and again until a consensus is reached. Btw, different to then, we are also talking about an active vector length now (hence EVL).

AFAIU apply_mask was proposed to have less (redundant) predicate arguments. Unless the apply_mask breaks a chain in a matcher pattern, the approach should be prone to the issue of transformations breaking code as well.

Has something like the PredicatedVectorType approach above been proposed before?

- Simon

Simon Moll via llvm-dev <llvm-dev@lists.llvm.org> writes:

True, but is there a way we could do this incrementally? Even if we
start with intrinsics and then migrate to first-class support, at some
point passes are going to be broken with respect to masks on
Instructions.

Here is path an idea for an incremental transition:

a) Create a new, distinct type. Let's say its called the "predicated
vector type", written "{W x double}".

b) Make the predicate vector type a legal operand type for all binary
operators and add an optional predicate parameter to them. Now, here
is the catch: the predicate parameter is only legal if the data type
of the operation is "predicated vector type". That is "fadd <8 x
>" will for ever be unpredicated. However, "fadd {8 x double}
%a, %b" may have an optional predicate argument. Semantically, these
two operations would be identical:

fadd <8 x double>, %a, %b

fadd {8 x double}, %a, %b, predicate(<8 x i1><1, 1, 1, 1, 1, 1, 1, 1>)

In terms of the LLVM type hierachy, PredicatedVectorType would be
distinct from VectorType and so no transformation can break it. While
you are in the transition (from unpredicated to predicated IR), you
may see codes like this:

%aP = bitcast <8 x double> %a to {8 x double}
%bP = bitcast <8 x double> %b to {8 x double}
%cP = fdiv %aP, %bP, mask(11101110) ; predicated fdiv
%c = bitcast <8 x double> %c to %cP
%d = fadd <8 x double> %a, %c ; no predicated fadd yet

Eventually, when all optimizations/instructions/analyses have been
migrated to run well with the new type, 1. deprecate the old vector
type, 2. promote it to PredicatedVectorType when parsing BCand, after
a grace period, rename {8 x double} to <8 x double>

That's an interesting idea. It strikes me that if we went this route,
we wouldn't need intermediate intrinsics at all. I'm trying to avoid
throw-away work, which it seems the intrinsics would be if core IR
support is the ultimate goal.

I'm not sure the renaming could ever happen. What are the rules for
supporting older BC/textual IR files?

The masking intrinsics are just a transitional thing. Eg, we could add
them now and let them mature. Once the intrinsics are stable and
proven start migrating for core IR support (eg as sketched above).

What are the tradeoffs between adding these intrinsics and adding
PredicatedVectorType from the get-go? The latter is probably harder to
get past review, but if, say, an RFC were developed to gain consensus,
then maybe it would be easier? I recognize that there are probably time
constraints in getting some kind of better masking support in soon and
intrinsics seems like the easiest way to do that.

Thank for you for the pointer! Is this documented somewhere? (say in a
wiki or some proposal doc). Otherwise, we are bound to go through
these discussions again and again until a consensus is reached. Btw,
different to then, we are also talking about an active vector length
now (hence EVL).

It's not documented anywhere else AFAIK.

AFAIU apply_mask was proposed to have less (redundant) predicate
arguments. Unless the apply_mask breaks a chain in a matcher pattern,
the approach should be prone to the issue of transformations breaking
code as well.

The idea is that applymask of a single value against two different mask
values would result in two new SSA values, so existing pattern matches
would not match if the masks did not match.

Has something like the PredicatedVectorType approach above been
proposed before?

I believe Tim's #5 is the closest:

https://lists.llvm.org/pipermail/llvm-dev/2008-August/016371.html

I don't think it was ever fleshed out.

Perhaps we should start on an RFC to explore the design space and get
community input. In the meantime, if we urgently need intrinsics we
should probably go ahead and start adding them.

                              -David

> Simon Moll <moll@cs.uni-saarland.de> writes:
>
>>> How will existing passes be taught about the new intrinsics? For
>>> example, what would have to be done to instcombine to teach it about
>>> these intrinsics? Let's suppose every existing operation had an
>>> equivalent masked intrinsic. Would it be easier to teach all of the
>>> passes about them or would it be easier to teach the passes about a mask
>>> operand on the existing Instructions? Would it be easier to teach isel
>>> about all the intrinsics or would it be easier to teach isel about a
>>> mask operand?
>> Consider that over night we introduce optional mask parameters to
>> vector instructions. Then, since you can not safely ignore the mask,
>> every transformation and analysis that is somehow concerned with
>> vector instructions is potentially broken and needs to be fixed.
> True, but is there a way we could do this incrementally? Even if we
> start with intrinsics and then migrate to first-class support, at some
> point passes are going to be broken with respect to masks on
> Instructions.

Here is path an idea for an incremental transition:

a) Create a new, distinct type. Let's say its called the "predicated
vector type", written "{W x double}".

b) Make the predicate vector type a legal operand type for all binary
operators and add an optional predicate parameter to them. Now, here is
the catch: the predicate parameter is only legal if the data type of the
operation is "predicated vector type". That is "fadd <8 x double>" will
for ever be unpredicated. However, "fadd {8 x double} %a, %b" may have
an optional predicate argument. Semantically, these two operations would
be identical:

fadd <8 x double>, %a, %b

fadd {8 x double}, %a, %b, predicate(<8 x i1><1, 1, 1, 1, 1, 1, 1, 1>)

In terms of the LLVM type hierachy, PredicatedVectorType would be
distinct from VectorType and so no transformation can break it. While
you are in the transition (from unpredicated to predicated IR), you may
see codes like this:

%aP = bitcast <8 x double> %a to {8 x double}
%bP = bitcast <8 x double> %b to {8 x double}
%cP = fdiv %aP, %bP, mask(11101110) ; predicated fdiv
%c = bitcast <8 x double> %c to %cP
%d = fadd <8 x double> %a, %c ; no predicated fadd yet

Eventually, when all optimizations/instructions/analyses have been
migrated to run well with the new type, 1. deprecate the old vector
type, 2. promote it to PredicatedVectorType when parsing BCand, after a
grace period, rename {8 x double} to <8 x double>

I'm likely missing things,
but i strongly suspect that the amount of effort needed is underestimated.

Vector support works because, with some exceptions,
vector is simply interpreted as several scalars concatenated.

Much like as with native fixed-point type support,
a whole new incompatible type is suggested to be added here.
I *suspect*, *every* single transform in instcombine/instsimplify
will need to be *duplicated*.

That is a lot. Intrinsics sound like less intrusive solution, in both cases.

>> If you go with masking intrinsics, and set the attributes right, it is
>> clear that transformations won't break your code and you will need to
>> teach InstCombine, DAGCombiner, etc that a `masked.fadd` is just an
>> fadd` with a mask. However, this gives you the opportunity to
>> "re-enable" one optimization add a time each time making sure that the
>> mask is handled correctly. In case of InstCombine, the vector
>> instruction patterns transfer to mask intrinsics: if all mask
>> intrinsics in the pattern have the same mask parameter you can apply
>> the transformation, the resulting mask intrinsics will again take the
>> same mask parameter.
> Right.
>
>> Also, this need not be a hard transition from vector instructions to
>> masking intrinsics.. you can add new types of masking intrinsics in
>> batches along with the required transformations. Masking intrinsics
>> and vector instruction can live side by side (as they do today,
>> anyway).
> Of course.
>
>
>>> I honestly don't know the answers to these questions. But I think they
>>> are important to consider, especially if intrinsics are seen as a bridge
>>> to first-class IR support for masking.
>> I think its sensible to use masking intrinsics (or EVL
>> ⚙ D53613 RFC: Explicit Vector Length Intrinsics and Attributes) on IR level and masked SD nodes in
>> the backend. However, i agree that intrinsics should just be a bridge
>> to native support mid term.
> The biggest question I have is how such a transition would happen.
> Let's say we have a full set of masking intrinsics. Now we want to take
> one IR-level operation, say fadd, and add mask support to it. How do we
> do that? Is it any easier because we have all of the intrinsics, or
> does all of the work on masking intrinsics get thrown away at some
> point?
The masking intrinsics are just a transitional thing. Eg, we could add
them now and let them mature. Once the intrinsics are stable and proven
start migrating for core IR support (eg as sketched above).
>
> I'm reminded of this now decade-old thread on gather/scatter and masking
> from Don Gohman, which I also mentioned in an SVE thread earlier this
> year:
>
> [LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
>
> The applymask idea got worked through a bit and IIRC at some later point
> someone found issues with it that need to be addressed, but it's an
> interesting idea to consider. I wasn't too hot on it at the time but it
> may be a way forward.
>
> In that thread, Tim Foley posted a summary of options for mask support,
> one of which was adding intrinsics:
>
> [LLVMdev] Ideas for representing vector gather/scatter and masks in LLVM IR
>
> -David

Thank for you for the pointer! Is this documented somewhere? (say in a
wiki or some proposal doc). Otherwise, we are bound to go through these
discussions again and again until a consensus is reached. Btw, different
to then, we are also talking about an active vector length now (hence EVL).

AFAIU apply_mask was proposed to have less (redundant) predicate
arguments. Unless the apply_mask breaks a chain in a matcher pattern,
the approach should be prone to the issue of transformations breaking
code as well.

Has something like the PredicatedVectorType approach above been proposed
before?

- Simon

--

Simon Moll
Researcher / PhD Student

Compiler Design Lab (Prof. Hack)
Saarland University, Computer Science
Building E1.3, Room 4.31

Tel. +49 (0)681 302-57521 : moll@cs.uni-saarland.de
Fax. +49 (0)681 302-3065 : Simon Moll - Compiler Design Lab, Saarland University

Roman.