[RFC] Vector Masking Representation in MLIR

(Authors: Diego Caballero & Nicolas Vasilache)

This proposal is aimed at introducing generic support for vector masking in MLIR. Having a generic way to represent masked operations will pave the way for representing and optimizing more complex vector scenarios. This proposal is also a step towards filling the vector representation gap between MLIR and LLVM. We plan to extend this approach to also support vector operations with a dynamic vector length in the future.

Background and Motivation

Dealing with out-of-bounds memory accesses, side-effecting instructions and control flow that may diverge within the lanes of a vector is one of the biggest challenges of vectorization. Most recent SIMD and vector architectures address this problem by introducing dedicated hardware support to selectively enable/disable the execution of each vector lane using masks. However, compiler support is needed to effectively take advantage of these hardware features.

Modeling masked operations in LLVM has been a hairy topic. The community attempted to model masked computation several times and discussions on the matter went on for years. Luckily, the Vector Predication proposal gained support and landed recently. This approach separates the concern of masking from the rest of the IR by introducing a dedicated set of vector operations (~dialect? :slight_smile:) that can take both a vector mask and a dynamic vector length as arguments. The approach presented in this RFC relies on the Vector Predication proposal as a solid foundation to build a hardware-independent vector/predication model in MLIR.

Nonetheless, modeling masking/predication in MLIR is even more challenging than in LLVM. MLIR has a growing and unbounded IR with multiple levels of abstraction. Most of these levels are vectorizable and, therefore, subject to masking. Creating a dedicated set of operations with support for masking/dynamic vector lengths for every potentially maskable operation would lead to a prohibitive level of duplication. On the other hand, introducing native mask/vector length arguments to every maskable operation would be extremely invasive and would leak vectorization details into each and every vectorizable operation in MLIR.

The lack of a common vector layer with masking and dynamic vector length support is also a concern. Modeling these vector concepts in a generic way will allow us to share implementations across targets with similar vector/predication features (e.g., SVE, RVV), making some target-specific dialects unnecessary.

We propose a generic and less invasive approach to masking in MLIR that scales properly with the unbounded, multi-abstraction-level nature of this intermediate representation.

Proposal

This proposal comprises three parts: i) a new vector.mask operation with a region to model masked computation, ii) new interfaces to characterize operations that can mask and be masked, and iii) a set of masked elementary vector operations that will serve as a low-level landing pad for masked operations and facilitate the conversion to LLVM and other backends with masking support.

The vector.mask Operation

We propose the vector.mask operation as a generic vehicle to apply masking to any operation in MLIR, regardless of its level of abstraction. The following snippet shows how an arith.addf can be masked using this approach:

%0 = vector.mask %mask, %passthru : vector<2x8xi1> -> vector<2x8xf32> {
    // '%a' and '%b' are captured from above.
    %1 = arith.addf %a, %b : vector<2x8xf32>
    vector.yield %1 : vector<2x8xf32>
}

The vector.mask operation takes a vector mask and an optional pass-through vector as arguments. A vector.yield-terminated region encloses the operation to be masked (the nested operation). Other values used within the region are captured from above. The vector mask argument holds a bit for each vector lane and determines which vector lanes should execute the nested operation and which ones should not. The vector.mask operation returns the value produced by the masked execution of the nested operation, if any. The masked-off lanes in the result vector are taken from the corresponding lanes of the pass-through value, if provided, or left unmodified otherwise.
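To make the lane semantics concrete, here is a minimal sketch (the lane values in the comments are illustrative):

// Suppose %mask = [1, 1, 0, 0] and %passthru = [9.0, 9.0, 9.0, 9.0].
%0 = vector.mask %mask, %passthru : vector<4xi1> -> vector<4xf32> {
    // With %a = [1.0, 2.0, 3.0, 4.0] and %b = [1.0, 1.0, 1.0, 1.0], the
    // active lanes 0-1 of %0 hold the sums [2.0, 3.0] and the masked-off
    // lanes 2-3 take the pass-through value: %0 = [2.0, 3.0, 9.0, 9.0].
    %1 = arith.addf %a, %b : vector<4xf32>
    vector.yield %1 : vector<4xf32>
}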

The vector.mask operation does not prescribe how a specific operation should be masked or how a masked operation should be lowered. Masking requirements will be provided by each maskable operation through the MaskableOp interface (see next section). Lowering of masked operations is implementation-defined, and multiple lowering strategies for masked operations are expected.
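For instance, one conceivable lowering of the example above (a sketch, not a prescribed strategy) is to execute the nested operation unmasked and select the result per lane, which is legal here because arith.addf has no side effects:

// Unmasked execution followed by a per-lane select; masked-off lanes
// receive the pass-through value.
%1 = arith.addf %a, %b : vector<2x8xf32>
%0 = arith.select %mask, %1, %passthru : vector<2x8xi1>, vector<2x8xf32>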

Alternative Assembly Format

The following snippet shows a more compact assembly format for the vector.mask operation that we implemented in our prototype. We think this alternative could be further improved by removing more redundant tokens to make it even shorter. We will use this textual form throughout the rest of this RFC.

%0 = vector.mask %mask, %passthru { %1 = arith.addf %arg3, %arg4 : vector<2x8xf32> } : vector<2x8xi1>, vector<2x8xf32>

Example

Let’s take a look at how vector.mask can be used to vectorize the following dynamically-shaped linalg.matmul and how the mask can be propagated through progressive lowering and levels of abstraction.

func.func @masked_matmul(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>,
                         %arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %0 = linalg.matmul ins(%arg0, %arg1: tensor<?x?xf32>, tensor<?x?xf32>)
                     outs(%arg2: tensor<?x?xf32>) -> tensor<?x?xf32>
  return %0 : tensor<?x?xf32>
}

We assume that linalg.matmul is vectorized to a vector.contract operation with vector sizes {2, 4, 8} for dimensions i, j and k, respectively. We also assume that the dynamic sizes of dimensions i, j and k are bounded by 2, 4 and 8, respectively. This can be achieved by tiling (not shown in this example for the sake of simplicity). The resulting masked vector code is as follows:

func.func @masked_matmul(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %cst = arith.constant 0.000000e+00 : f32
  %0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
  %1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
  %2 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
  %3 = vector.create_mask %0, %2 : vector<2x8xi1>
  %4 = vector.mask %3 { %12 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<2x8xf32> } : vector<2x8xi1>, vector<2x8xf32>
  %5 = vector.create_mask %2, %1 : vector<8x4xi1>
  %6 = vector.mask %5 { %12 = vector.transfer_read %arg1[%c0, %c0], %cst {in_bounds = [true, true], permutation_map = affine_map<(d0, d1) -> (d1, d0)>} : tensor<?x?xf32>, vector<8x4xf32> } : vector<8x4xi1>, vector<8x4xf32>
  %7 = vector.create_mask %0, %1 : vector<2x4xi1>
  %8 = vector.mask %7 { %12 = vector.transfer_read %arg2[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<2x4xf32> } : vector<2x4xi1>, vector<2x4xf32>
  %9 = vector.create_mask %0, %1, %2 : vector<2x4x8xi1>
  %10 = vector.mask %9 { %12 = vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %4, %6, %8 : vector<2x8xf32>, vector<8x4xf32> into vector<2x4xf32> } : vector<2x4x8xi1>, vector<2x4xf32>
  %11 = vector.mask %7 { %12 = vector.transfer_write %10, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<2x4xf32>, tensor<?x?xf32> } : vector<2x4xi1>, tensor<?x?xf32>
  return %11 : tensor<?x?xf32>
}

Note that the vector.contract operation is masked with a canonical mask that comprises all the vector dimensions of the iteration space. The masked vector.contract is then lowered to masked lower-level operations, where the canonical mask is decomposed into multiple masks, depending on how the canonical vector iteration space is projected/permuted for each lower-level operation:

func.func @masked_matmul(%arg0: tensor<?x?xf32>, %arg1: tensor<?x?xf32>, %arg2: tensor<?x?xf32>) -> tensor<?x?xf32> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %cst = arith.constant 0.000000e+00 : f32
  %0 = tensor.dim %arg0, %c0 : tensor<?x?xf32>
  %1 = tensor.dim %arg1, %c1 : tensor<?x?xf32>
  %2 = tensor.dim %arg0, %c1 : tensor<?x?xf32>
  %3 = vector.create_mask %0, %2 : vector<2x8xi1>
  %4 = vector.mask %3 { %13 = vector.transfer_read %arg0[%c0, %c0], %cst {in_bounds = [true, true, true], permutation_map = affine_map<(d0, d1) -> (d0, 0, d1)>} : tensor<?x?xf32>, vector<2x4x8xf32> } : vector<2x8xi1>, vector<2x4x8xf32>
  %5 = vector.create_mask %2, %1 : vector<8x4xi1>
  %6 = vector.mask %5 { %13 = vector.transfer_read %arg1[%c0, %c0], %cst {in_bounds = [true, true, true], permutation_map = affine_map<(d0, d1) -> (0, d1, d0)>} : tensor<?x?xf32>, vector<2x4x8xf32> } : vector<8x4xi1>, vector<2x4x8xf32>
  %7 = vector.create_mask %0, %1 : vector<2x4xi1>
  %8 = vector.mask %7 { %13 = vector.transfer_read %arg2[%c0, %c0], %cst {in_bounds = [true, true]} : tensor<?x?xf32>, vector<2x4xf32> } : vector<2x4xi1>, vector<2x4xf32>
  %9 = vector.create_mask %0, %1, %2 : vector<2x4x8xi1>
  %10 = vector.mask %9 { %13 = arith.mulf %4, %6 : vector<2x4x8xf32> } : vector<2x4x8xi1>, vector<2x4x8xf32>
  %11 = vector.mask %9 { %13 = vector.multi_reduction <add>, %10, %8 [2] : vector<2x4x8xf32> to vector<2x4xf32> } : vector<2x4x8xi1>, vector<2x4xf32>
  %12 = vector.mask %7 { %13 = vector.transfer_write %11, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<2x4xf32>, tensor<?x?xf32> } : vector<2x4xi1>, tensor<?x?xf32>
  return %12 : tensor<?x?xf32>
}

The resulting code could then be lowered further and eventually be converted to LLVM using the Vector Predication intrinsics.
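As a rough illustration of that last step, a masked 1-D elementwise addition could plausibly map onto the VP intrinsics exposed by the LLVM dialect (shown in generic op form; the exact conversion patterns are out of scope here, and n-D vectors would first be flattened/unrolled for LLVM):

// %evl is the explicit vector length; it covers all 16 lanes here, so only
// the mask disables lanes.
%evl = arith.constant 16 : i32
%r = "llvm.intr.vp.fadd"(%a, %b, %mask, %evl)
    : (vector<16xf32>, vector<16xf32>, vector<16xi1>, i32) -> vector<16xf32>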

This example demonstrates how the masking process throughout the compiler pipeline is independent of the actual operations being masked, separating the concerns and preventing an explosion of new vector/masked operations throughout MLIR. Even existing operations with native masking or close semantics, such as vector.transfer_read, vector.transfer_write and vector.warp_execute_on_lane_0, could be simplified in the future by deferring masking semantics to the new vector.mask operation.

The MaskableOp and MaskingOp Interfaces

We propose two new interfaces. The MaskableOp interface defines operations that can be masked. This will help vectorization algorithms rule out operations where masking wouldn't make sense, such as scf.if, scf.for, affine.if, affine.for, func.func, etc. The interface will also provide methods to determine whether an operation is masked and how a canonical mask should be projected/permuted to mask that operation. Further details will be decided at implementation time.

The MaskingOp interface defines operations that can mask other operations (e.g., vector.mask) or operations with native support for masking (e.g., vector.transfer_read, vector.transfer_write). This interface will help with the incremental transition of masking responsibilities from operations with native masking support to the vector.mask operation. It will also help reduce dependencies between dialects with ops implementing the MaskableOp interface and dialects with ops implementing the MaskingOp interface (e.g., the Vector dialect).
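For reference, vector.transfer_read is an example of an operation with native masking support today: the mask is just another operand of the op itself (sketch below; the MaskingOp interface would let transformations treat this form and the vector.mask form uniformly):

// Native masking: the mask %m is a regular operand of the read.
%v = vector.transfer_read %t[%c0, %c0], %pad, %m {in_bounds = [true, true]}
    : tensor<?x?xf32>, vector<2x8xf32>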

Vector Predication Operations

In the long term, we plan to introduce a new set of low-level operations with a native mask (and vector length) argument. These operations will serve as a low level landing pad for higher level operations masked with vector.mask and will align MLIR with the Vector Predication intrinsics in LLVM. We don’t plan to reinvent the wheel here other than adjusting the Vector Predication instructions to the vector ecosystem in MLIR. We don’t plan to work on this in the short term either, so contributions are welcome! :slight_smile:
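As a purely hypothetical sketch of what such an operation could look like (the vector.vp.fadd name and operand order are made up for illustration; nothing here is designed yet):

// Hypothetical: elementwise add with a native mask and a dynamic vector
// length %evl, mirroring LLVM's llvm.vp.fadd intrinsic.
%r = vector.vp.fadd %a, %b, %mask, %evl : vector<[4]xf32>, vector<[4]xi1>, index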

Op Naming and Dialect Logistics

Currently, the vector.mask operation provides an abstraction for masking. However, this operation should be extended in the future to also model dynamic vector length information. In that context, the mask name might not be the most appropriate and we should consider predicate or a more generic term.

The vector.mask operation has been prototyped in the Vector dialect. However, we also plan to introduce Vector Predication operations, which are quite numerous. An alternative location for all the masking/predication related operations could be a new Predication/Masking dialect. That would keep vectorization and predication/masking concerns separate. The Predication/Masking dialect would be an optional extension of the Vector dialect, which won’t be loaded by targets without predication/masking support.

Suggestions are welcome!

Future Work

  • Extend proposal to support scalable/dynamic vector length vectorization.
  • Implement full-masking lowering.
  • Extend masking to tensor land.

@nicolasvasilache, @zhanghb97, @javiersetoain, @giuseros, @MaheshRavishankar, @ThomasRaoux


Thanks for the proposal! I think this generic mask representation and the future dynamic vector length extension will save the vector dialect from fragmentation. I found that the dynamic vector length is crucial to the RISC-V Vector (RVV) backend, so I previously proposed an RVV-specific dialect, and the feedback suggested the generic approach and the VP intrinsics instead. I did some experiments and realized that the VP intrinsics can support RVV features like using the vsetvl instruction to set the vector length or using mask instructions to deal with control flow. So I think this proposal is a good start, and I am willing to work together on the vector length side. As for the details, I have several questions.

Do operations like vector.maskedload and vector.maskedstore belong to the MaskingOp interface? My previous optimization passes depend on them. Should I modify them to work together with the proposed approach? Is it possible for the proposed approach to coexist with the original masked ops?

I think the vector abstraction hierarchy is: a generic vector dialect (providing general operations and types) + hardware-specific dialects (exposing target-specific features, such as arm_sve.smmla). This proposal can avoid the redevelopment of operations like arm_sve.masked.addi and riscvv.masked.addi, and lets the target-specific dialects focus on actual special instructions and features. Is my understanding correct?

The VP intrinsics are a great abstraction for hardware-independent vector support, and I am wondering if there is a summary of the completeness of the VP intrinsics on different backends? According to my experiments, some operations (VPFRemOp, VPReduceFMulOp, VPReduceMulOp, etc.) do not work well on the RVV backend. And it seems that there is no integration test for the VP intrinsics in the MLIR testing framework.

Will there be vector.mask and tensor.mask ops? Or is there a Predication dialect with, idk, a predication.predicate op that applies to all predicatable ops?

In LLVM fixed-length and scalable vectors are different types. Does the predicate/mask op need to be aware of the vector type?

Great! I look forward to this collaboration!

vector.maskedload and vector.maskedstore could implement the MaskingOp interface, but that doesn't mean you have to change your optimization passes. If those optimizations only target vector.maskedload and vector.maskedstore, they should continue to do so. These operations are mostly a copy of their LLVM counterparts and will stay in one form or another. The situation with vector.transfer_read and vector.transfer_write is different because these ops have grown a lot in terms of semantics and might benefit from offloading part of their semantics to other ops, such as vector.mask. In any case, that's something we shouldn't discuss as part of this RFC. I mentioned it in the RFC to illustrate some of the possibilities that the vector.mask op opens up.

Yes, exactly!

Vector Predication is relatively new so I wouldn't be surprised if we have some implementation gaps (contributions are welcome :slight_smile:). No idea about the current backend support. Perhaps you can ask the LLVM/RVV backend developers directly.

We haven’t thought too much about predication/masking at tensor level. That’s one of the bullets in the Future Work section. I think @nicolasvasilache had some ideas but I think masking should settle down at vector level before going any further than that.

The mask argument in a vector.mask op has an i1 vector type. The shape could be fixed or scalable so I think this is similar to any other operation that can work on fixed and scalable vectors. @javiersetoain did some work around scalable masks in MLIR so perhaps he can provide his point of view but I don’t see any major issue with scalable vectors.
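For example, creating a scalable mask would look just like the fixed-length case modulo the type (a minimal sketch):

// Fixed-length mask: the first %n of 8 lanes are active.
%m0 = vector.create_mask %n : vector<8xi1>
// Scalable mask: the first %n of (vscale x 4) lanes are active.
%m1 = vector.create_mask %n : vector<[4]xi1>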

Thanks!

Are we allowing nested mask creation in a masked region?

The vector.mask operation can only have one nested operation to be masked. This greatly simplifies the design and transformations (we tried allowing multiple operations within the same vector.mask region and things get really complicated, especially when n-D vectorization and progressive lowering come into play). The computation of the mask itself is independent of the vector.mask operation, so we can represent, let's say, the predicated counterpart of nested if-else operations. For example, given something like:

scf.if %cond1 {
  // op1
  scf.if %cond2 {
    // op2
  }
} else {
  // op3
}

we could generate:

%mask1 = // derived from %cond1
vector.mask %mask1 { op1 }
%mask2 = // derived from %cond2 and %mask1
vector.mask %mask2 { op2 }
%mask3 = // derived from %cond1
vector.mask %mask3 { op3 }

IOW, we can use vector.mask to model the output IR of a traditional predication algorithm.
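As a minimal sketch of how those masks could be derived, assuming the scalar conditions were themselves vectorized into the i1 vectors %vcond1 and %vcond2 (illustrative names):

%true = arith.constant dense<true> : vector<4xi1>
// %mask1 would simply be the vectorized outer condition %vcond1.
// Lanes entering the inner if: both conditions hold.
%mask2 = arith.andi %vcond1, %vcond2 : vector<4xi1>
// Lanes entering the else branch: the outer condition is false.
%mask3 = arith.xori %vcond1, %true : vector<4xi1>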

Overall this looks great and will definitely be a useful tool to deal with potential out-of-bounds accesses (along with padding and specialization). This is something that fits very well with what I would need for some of the GPU vectorization.

A few things I'm wondering:

  • Can vector.mask only contain one operation? That means after vectorization we would end up with a bunch of vector.mask operations, and this won't compose with any of the other transformations.
  • Is the idea to merge common vector.mask regions? Would we try to remove the vector.mask when not needed and speculatively execute some operations? It sounds like even if an operation doesn't have side effects, it cannot always be speculatively executed, as the mask may also apply to the source for reduction kinds of ops.
  • How do you see this evolving so that canonicalization and other transformations can still be applied to the IR?

If the mask's meaning is dependent on the operation in the region, how do you figure out how to apply masking? Is this something MaskableOp would have to define?

Do we need to clarify the semantics of the lane values returned by vector.mask when the mask is off?

Thanks, Thomas! Great questions. They reflect very well how our thinking on the proposal has evolved over the past months of implementation and testing.

The short answer is yes, mostly for consistency and actually to make the composition with other transformations easier. See rationale below.

As mentioned above, the idea is that vector.mask will have a single operation so no merging would be possible. See rationale below.

We can refine this part. A vector.mask could be removed if that doesn’t alter the execution state/output of the program (better wording needed). The vector.mask op is not enforcing a specific implementation as long as the outcome of the execution doesn’t change. It could also be lowered to scalar if-else statements, for example.

Masking non-side-effecting ops may have a performance/energy impact on some architectures, so we should be able to mask them. Linalg-specific implementation details are out of the scope of this RFC, but the idea is to have a simple vectorization pass where all the ops are masked and then have different lowering strategies for the masked ops. One strategy could preserve the masks for all the operations. Another one could preserve the mask only for side-effecting ops. We plan to provide only the latter in the short term.
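As a sketch of that latter strategy (illustrative IR, assuming the add is deemed safe to execute unmasked because the masked-off lanes of its result are never written):

// Before: every vectorized op is masked.
%0 = vector.mask %m { %1 = arith.addf %a, %b : vector<8xf32> } : vector<8xi1>, vector<8xf32>
vector.mask %m { vector.transfer_write %0, %buf[%c0] {in_bounds = [true]} : vector<8xf32>, memref<?xf32> } : vector<8xi1>

// After: the mask is dropped for the side-effect-free addf but kept for the
// side-effecting write.
%2 = arith.addf %a, %b : vector<8xf32>
vector.mask %m { vector.transfer_write %2, %buf[%c0] {in_bounds = [true]} : vector<8xf32>, memref<?xf32> } : vector<8xi1>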

Our prototype started with a vector.mask that could mask multiple operations at a time (example here, the op was called vector.predicate at that time). That worked well for 1-D vector cases as vectorization/masking is simpler. However, n-D vectorization makes masking harder. n-D vector ops under the influence of the same mask may require different permutations of the mask (where permutation here means transposition of the dimensions, projection of the dimensions or a combination of the two). For example, for a simple matmul, we would still need five independent vector.mask operations (A vload, B vload, C vload, computation, C vstore) even if we allowed multiple operations per vector.mask .

Implementation-wise, allowing multiple operations within a vector.mask makes things complicated as well. Imagine we wanted to lower something like:

vector.mask %m {
  A
  B
  C
}

and the lowerings of A and C reuse the same mask %m but the lowering of B needs a new mask %m2. When lowering B, we would have to split the original vector.mask, move the lowering of B to a new vector.mask with the new vector mask %m2, and move C to a new vector.mask with the original mask %m:

vector.mask %m { A' }
vector.mask %m2 { B' }
vector.mask %m { C }

As part of the split, we would also have to rewire all the use-def chains properly. All these changes would happen when lowering B, which makes the lowering of B convoluted and not self-contained. The pass-through and the future dynamic vector length may also lead to similar splits in different scenarios.

An important premise that derives from the above reasoning is that even if we allowed multiple ops within a vector.mask, it wouldn't be guaranteed that two ops with the same mask would always be within the same vector.mask op, as A and C show in the previous example, so we would have to support an arbitrary number of combinations in our patterns/transformations.

Restricting the number of operations within a vector.mask to just one significantly simplifies the problems above and the processing of the mask because the vector.mask and the single masked operation behave as a unified entity. This is the key motivation for this restriction. The vector.mask operation will never have to be split, and the masking and lowering of masked operations are self-contained and independent.

In terms of transformations/canonicalization, it's important to note that mask, pass-through and dynamic vector length values cannot be taken for granted: folding two operations with different masks or different pass-through values, for example, might be incorrect. However, I expect the MaskingOp and MaskableOp interfaces to simplify processing or skipping the mask, as needed. For example, the MaskingOp interface will have methods like getMaskedOperation to retrieve the single operation being masked. The MaskableOp interface will have methods to retrieve the unmasked operands of that operation or check properties of the operands' masks. This should make mask processing similar to processing a native mask operand added to the operation, except that now more operations can be masked.

The mask's meaning is not dependent on the operation, but the mask to be used on an operation will be provided through the MaskableOp interface. Given a canonical mask for the vector iteration space, the MaskableOp interface will return the permutation of the canonical mask needed by the operation implementing the interface.

Having an undef value would definitely help define this better but, since we don't have it in MLIR and we don't specifically need to pass an undef value at this level of abstraction, I think the best we can say is that the masked-off lanes are left unmodified if no pass-through value is provided. Any other suggestions?

Thanks for the detailed explanations, Diego. Overall I agree this is the best first step and we can develop on top of that to allow better composition with the rest of the vector operations.

Looking forward to the patches!

You can refer to the as-if rule from C++.
https://en.cppreference.com/w/cpp/language/as_if

Thanks, Thomas!

“do not change the observable behavior of the program.” That’s exactly what we want to say. Thanks, @tschuett!

Thanks for this work, Diego, it’s looking great :slightly_smiling_face:

I did push some patches to make sure mask-related operations worked for scalable vectors. The only issues I found were related to the lowering of some ops that use constant attributes to initialize and/or manipulate the vectors. Other than that, there shouldn't be any problems with scalable vectors, masks or otherwise.

@JoeyYe @stevenvar