[RFC] Vector Distribution for CPU (convert vector to physical register size vector)

Introduction & Motivation

This pass is proposed so that we can use tile vectors and vectorize `for` loops according to the hardware SIMD instructions. A tile vector is a vector that carries no hardware information, such as how large a vector a CPU register can hold. We operate directly on tile vectors because some passes do not need to care about hardware details, which reduces the difficulty of those transformations. Meanwhile, some passes want the tile vector to fit within the L1 cache size for better performance. After the tile vector pass, we vectorize the innermost `for` loop according to the data length supported by the SIMD instructions, which we call the physical register vector pass.

Why we need this

  1. There is no pass to do this now.

  2. This pass allows us to control the actual scf.for loop transformation and vectorization instead of relying solely on LLVM’s vectorization mechanism, giving us finer control over how the for loop is optimized.

Related Work

A tile vector abstraction does not exist in the current MLIR community, and there are mainly three existing functions or passes that can do for-loop vectorization.

Tile Vector

This converts operations that operate on tensors into operations on vectors. The linalg vectorization function can already convert linalg operations on tensors to vector operations, but we may need to support more operations such as tensor::BitcastOp, tensor::ExpandShapeOp, tensor::ConcatOp, etc. This should become an independent pass that converts operations on tensors into operations on vectors.

Loop Vectorization

Affine

The affine dialect implements a super-vectorization pass; see SuperVectorize.cpp for more details.

The affine methods only transform affine.for and memref operations, but we need to transform scf.for and operations that operate on vectors (like arith/math). We can borrow its transformation ideas, but it cannot be used directly for the pass we designed. In addition, the affine approach focuses on vectorizing scalar code, while we focus on maximizing the use of SIMD for the operations in an scf.for loop body.

SCF

SCF mainly uses tiling for this purpose, e.g. tileUsingSCF. We may need this function to transform the scf.for bounds and step, but it only supports operations that implement TilingInterface, and almost no operation in the vector dialect has this interface.

Another related work is VectorToSCF, but it only lowers transfer_read and transfer_write.

Vector Distribution For GPU

This is the original Vector Distribution RFC, but that work currently only targets GPUs.

Implementation

Lower to tile vector pass (this abstraction layer can preserve the size of the upper-level tiled tensor, which is chosen for hardware performance reasons such as the L1 cache size on a CPU)
           |
           |
Lower to physical register pass (vectorize the innermost `for loop` according to the data length supported by the CPU's SIMD instructions)

Convert tensor to tile vector pass

We convert operations on tensors to operations on vectors, mainly using vector.transfer_read, vector.transfer_write, etc.

Example:

func.func @add_tensor_test0(%arg0: tensor<4x8x1024xf32>, %arg1: tensor<4x8x1024xf32>) -> tensor<4x8x1024xf32> {
  %0 = tensor.empty() : tensor<4x8x1024xf32>
  %1 = linalg.add ins(%arg0, %arg1 : tensor<4x8x1024xf32>, tensor<4x8x1024xf32>) outs(%0: tensor<4x8x1024xf32>) -> tensor<4x8x1024xf32>
  return %1 : tensor<4x8x1024xf32>
}

After tile vector pass:

func.func @add_tensor_test0(%arg0: tensor<4x8x1024xf32>, %arg1: tensor<4x8x1024xf32>) -> tensor<4x8x1024xf32> {
    %cst = arith.constant 0.000000e+00 : f32
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<4x8x1024xf32>
    %1 = vector.transfer_read %arg0[%c0, %c0, %c0], %cst {in_bounds = [true, true, true]} : tensor<4x8x1024xf32>, vector<4x8x1024xf32>
    %2 = vector.transfer_read %arg1[%c0, %c0, %c0], %cst {in_bounds = [true, true, true]} : tensor<4x8x1024xf32>, vector<4x8x1024xf32>
    %3 = arith.addf %1, %2 : vector<4x8x1024xf32>
    %4 = vector.transfer_write %3, %0[%c0, %c0, %c0] {in_bounds = [true, true, true]} : vector<4x8x1024xf32>, tensor<4x8x1024xf32>
    return %4 : tensor<4x8x1024xf32>
  }

Physical register vector pass

Example:

func.func @add_tensor_test0(%arg0: tensor<4x8x1024xf32>, %arg1: tensor<4x8x1024xf32>) -> tensor<4x8x1024xf32> {
  %cst = arith.constant 0.000000e+00 : f32
  %0 = tensor.empty() : tensor<4x8x1024xf32>
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  %c8 = arith.constant 8 : index
  %1 = scf.for %arg2 = %c0 to %c4 step %c1 iter_args(%arg3 = %0) -> (tensor<4x8x1024xf32>) {
    %2 = scf.for %arg4 = %c0 to %c8 step %c1 iter_args(%arg5 = %arg3) -> (tensor<4x8x1024xf32>) {
      %c16 = arith.constant 16 : index
      %c1024 = arith.constant 1024 : index
      %3 = scf.for %arg6 = %c0 to %c1024 step %c16 iter_args(%arg7 = %arg5) -> (tensor<4x8x1024xf32>) {
        %4 = vector.transfer_read %arg0[%arg2, %arg4, %arg6], %cst {in_bounds = [true]} : tensor<4x8x1024xf32>, vector<16xf32>
        %5 = vector.transfer_read %arg1[%arg2, %arg4, %arg6], %cst {in_bounds = [true]} : tensor<4x8x1024xf32>, vector<16xf32>
        %6 = arith.addf %4, %5 : vector<16xf32>
        %7 = vector.transfer_write %6, %arg7[%arg2, %arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<4x8x1024xf32>
        scf.yield %7 : tensor<4x8x1024xf32>
      }
      scf.yield %3 : tensor<4x8x1024xf32>
    }
    scf.yield %2 : tensor<4x8x1024xf32>
  }
  return %1 : tensor<4x8x1024xf32>
}
  1. We need to query how many values the hardware’s SIMD instructions can operate on at once. This gives us the legal vector length for vectorizing the for loop.
  2. Convert the indexing map to the current for-loop form. We also need to apply masks properly according to the indexing map.
  3. To avoid data dependency issues, we need to classify all operations. For example, a transpose operation may not be able to share a common outer for loop with other elementwise operations. For example:
// assume we have vector  = [100x100x100xf32]
scf.for 0...100:
  A[100, 100] = arith.add
  B[100, 100] = vector.transpose A permutation[1, 0]
  C[100, 100] = arith.add A, B
// if we generate a common outer for loop, that will be:
scf.for 0...100:
  scf.for 0...100:
    scf.for 0...100, step = 16:
      A_vector[16xf32] = add
      B_vector[16, 1] = transpose A_vector
      // the result would be wrong: C_vector should add B_vector[1, 16], but B_vector is [16, 1] here, and [16, 1] is also invalid IR at this point.
      C_vector[1, 16] = add A_vector, B_vector
// but it should be:
scf.for 0...100:
  scf.for 0...100:
    scf.for 0...100, step = 16:
      A_vector[16xf32] = add
  scf.for 0...100:
    scf.for 0...100:
      B_vector[16x16] = vector.transpose A_vector[16x16] permutation [1,0] 
  scf.for 0...100:
    scf.for 0...100, step = 16:
      C_vector[16] = add A_vector, B_vector

So, we divide operations into three categories to check whether a common outer for loop can be generated between different operations:

  1. Complex operations: reorder, transpose, typecast and other complex operations. There may be data dependencies between such operations. It is necessary to determine whether the dimension of the current operation is a data-dependent dimension. If it is not a data-dependent dimension, then the operation can reuse the current for loop.
  2. Elementwise operations: add, sub, mul, div, max, min, etc. A for loop can be reused between such operations.
  3. Other operations.
    1. Reduce
      • It is necessary to determine whether the previous for loop can be reused based on the reduce axis.
    2. Broadcast
      • It is necessary to determine whether the previous for loop can be reused based on the broadcast axis.

At the same time, to achieve the best performance, when the tile sizes of two operations are inconsistent during loop-body generation, the operation with the smaller tile size may need to be loop-unrolled. For example:

arith.addf : vector<32xf32>
arith.addi : vector<32xi8>

 ==> no loop unroll:
 scf.for 0...32 step = 16:
     arith.addf : vector<16xf32>
 scf.for 0...32 step = 32:
     arith.addi : vector<32xi8>

 ==> with loop unroll, these two operations can share a common for loop:

 scf.for 0...32 step = 32:
   arith.addf : vector<16xf32>
   arith.addf : vector<16xf32>
   arith.addi : vector<32xi8>

What features will be provided

  1. As mentioned above, simple vector-based operation fusion. Operations without data dependencies can use the same for loop.
  2. Perform vectorization to adapt to hardware for operations without data dependencies.

Users can either use the fusion we provide to vectorize the current operations, or supply their own set of operations without data dependencies and directly call the method from item 2 to perform vectorization.

Thank you for your attention. Feel free to join the discussion for better design or code implementation.

@ThomasRaoux @nicolasvasilache @mehdi_amini


I think having a pass that converts tile vectors to a loop nest with certain CPU hardware considerations is a great idea. However, it’s unclear to me how far this RFC is proposing to go with structuring the innermost loop. e.g.

What’s the plan for the remainder/residual trips? Does another pass consider the actual inner loop vector interleave/unroll factor? There are a lot of considerations for that inner loop beyond selecting the VF when trying to optimize to a specific CPU. I think it’s better for a generic vectorizer to handle a lot of those decisions before they are baked into the IR.

Perhaps this RFC would focus primarily on structuring the loop that’s the most optimal for the nest as a whole?

Thanks~

Usually vectorization refers to the conversion from scalar to vector. Here, however, we are dealing with converting a vector that is not suitable for the current hardware into a vector that is suitable for hardware execution, so our main concern is whether the vector form can be executed by the hardware. You can think of it as a process of legalizing the vector size. We do not deal with interleave/unroll factors here; the interleave and unroll factors that further optimize the for loop will be handled by other passes.

For this example, we will use a mask to handle lengths that are not divisible by the vectorization step.
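A minimal sketch of what we have in mind for the remainder iterations, assuming a 16-wide register, a loop bound of 1000, and placeholder values %t / %pad for the source tensor and padding (the transfer_write would be masked the same way):

// inside the innermost loop with induction variable %iv and step %c16
%remaining = arith.subi %c1000, %iv : index        // elements left
%vl = arith.minsi %remaining, %c16 : index         // active lanes this iteration
%mask = vector.create_mask %vl : vector<16xi1>
%v = vector.transfer_read %t[%iv], %pad, %mask : tensor<1000xf32>, vector<16xf32>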

As mentioned before, what we mainly want to do is to convert the vector size into a vector size suitable for hardware execution before passing the IR to llvm. This allows us to control the vector size ourselves instead of relying on the llvm backend.

Since the RFC post cannot be edited, I would like to state here that it is inappropriate to call this pass vector distribution; the term distribution is more suitable for GPUs.

Therefore, I renamed this pass Tiling Vector for CPU.

In actual usage, we usually first tile the linalg operation to a medium-sized tensor, e.g.:

func.func @add_tensor_test0(%arg0: tensor<4x8x1024xf32>, %arg1: tensor<4x8x1024xf32>) -> tensor<4x8x1024xf32> {
  %0 = tensor.empty() : tensor<4x8x1024xf32>
  %2 = scf.for 4:
            // omit some read operations
            %1 = linalg.add ins(%tiledArg0, %tiledArg1 : tensor<8x1024xf32>, tensor<8x1024xf32>) outs(%tiled0: tensor<8x1024xf32>) -> tensor<8x1024xf32>
            // omit some write operations
            scf.yield 

  return %2 : tensor<4x8x1024xf32>
}

Then we lower the tiled linalg operation into vector operations and run the tiling vector pass (the physical register pass).

func.func @add_tensor_test0(%arg0: tensor<4x8x1024xf32>, %arg1: tensor<4x8x1024xf32>) -> tensor<4x8x1024xf32> {
  %0 = tensor.empty() : tensor<4x8x1024xf32>
  %2 = scf.for 4:
         scf.for 8:
           scf.for 1024, step=16
            %4 = vector.transfer_read %arg0[%arg2, %arg4, %arg6], %cst {in_bounds = [true]} : tensor<4x8x1024xf32>, vector<16xf32>
            %5 = vector.transfer_read %arg1[%arg2, %arg4, %arg6], %cst {in_bounds = [true]} : tensor<4x8x1024xf32>, vector<16xf32>
            %6 = arith.addf %4, %5 : vector<16xf32>
            %7 = vector.transfer_write %6, %arg7[%arg2, %arg4, %arg6] {in_bounds = [true]} : vector<16xf32>, tensor<4x8x1024xf32>
            // omit some operations

  return %2 : tensor<4x8x1024xf32>
}

If I understand this proposal correctly, it wants to vectorize high-level operations very early in the pipeline and then perform tiling and fusion on the vector abstraction. I don’t see the point of doing so. This seems to have little to do with actual vector distribution / large-to-small decomposition for target matching.

We already can do tiling and fusion on tensors, then vectorize the resulting code with loops. All the loop classification, data dependency avoidance, and other implementation details are already taken care of at the tensor level. The proposal doesn’t seem to argue why that wouldn’t be possible and what the drawbacks of that approach are. Certainly, we may want to select tile sizes based on some hardware-related criteria, like divisibility of the innermost dimension by the HW vector size, but we can do so on the tensor level perfectly fine. The compiler will have to know this HW-related information regardless of the abstraction the tiling is applied to.

IMO, it’s only a matter of repeatedly tiling until the innermost tile size matches that of the HW vector along one of the dimensions and all other dimensions are 1. Tiling to a multiple of that or a non-1 along other dimensions will trigger “vector unrolling”, which produces opA opA opB opB as opposed to the regular opA opB opA opB induced by a loop.
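Concretely (a sketch only; the surrounding loops, constants and init values are omitted), tiling the add example from the RFC with sizes [1, 1, 16] and then vectorizing each tile already yields an innermost loop that operates on hardware-sized vectors:

scf.for %k = %c0 to %c1024 step %c16 iter_args(%acc = %init) -> (tensor<4x8x1024xf32>) {
  %a = vector.transfer_read %arg0[%i, %j, %k], %cst {in_bounds = [true]} : tensor<4x8x1024xf32>, vector<16xf32>
  %b = vector.transfer_read %arg1[%i, %j, %k], %cst {in_bounds = [true]} : tensor<4x8x1024xf32>, vector<16xf32>
  %s = arith.addf %a, %b : vector<16xf32>
  %w = vector.transfer_write %s, %acc[%i, %j, %k] {in_bounds = [true]} : vector<16xf32>, tensor<4x8x1024xf32>
  scf.yield %w : tensor<4x8x1024xf32>
}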

A tile vector abstraction does not exist in the current MLIR community.

I am not familiar with such an abstraction. Could you elaborate or provide references?


All the loop classification, data dependency avoidance, and other implementation details are already taken care of at the tensor level.

We need to first tile the operation on the tensor to the size we want, such as the tensor<16x16xf32> required by the transpose operation (so that the transpose optimization algorithm can be used) or the tensor<32x32xf32> required by the matmul calculation (so that brgemm optimization can be used), etc. Then we convert the tiled tensor operation to a vector operation, and then convert the op to the hardware physical register size.

The main reason we do this is to achieve the best performance. The second consideration is that the semantics of linalg ops are too high-level to use at physical register granularity: linalg.transpose is a standalone op, but during lowering to arith/math on vectors we further break it into several shuffle ops for an in-register transpose algorithm. By not tiling the tensor directly down to the size required by the hardware, such algorithms can still be used to optimize the operation.

I am not familiar with such an abstraction. Could you elaborate or provide references?

This abstraction means that all the upper-level tiled tensor operations are converted to vector operations. In this way, we can transform these operations into vectors of physical register size. The community does not have a separate pass for this, but some implementations exist, such as linalg vectorization.

e.g.:

func.func @add_tensor_test0(%arg0: tensor<256x32x32xf32>, %arg1: tensor<256x32x32xf32>) -> tensor<256x32x32xf32> {
  %0 = tensor.empty() : tensor<256x32x32xf32>
  %2 = scf.for 256:
            // omit some read operations
            %1 = linalg.add ins(%tiledArg0, %tiledArg1 : tensor<32x32xf32>, tensor<32x32xf32>) outs(%tiled0: tensor<32x32xf32>) -> tensor<32x32xf32>
            // omit some write operations
            scf.yield 

  return %2 : tensor<256x32x32xf32>
}

We then lower the tiled linalg operations into arith/math operations on vectors. (In addition to lowering linalg operations, we also need to lower the commonly used tensor operations such as concat, expand_shape, etc. to vector operations.)

func.func @add_tensor_test0(%arg0: tensor<256x32x32xf32>, %arg1: tensor<256x32x32xf32>) -> tensor<256x32x32xf32> {
  %0 = tensor.empty() : tensor<256x32x32xf32>
  %2 = scf.for 256:
         // omit some operations
         scf.for 32:
           scf.for 32, step=16
            %4 = vector.transfer_read %tiledArg0, %cst {in_bounds = [true]} : tensor<32x32xf32>, vector<16xf32>
            %5 = vector.transfer_read %tiledArg1, %cst {in_bounds = [true]} : tensor<32x32xf32>, vector<16xf32>
            %6 = arith.addf %4, %5 : vector<16xf32>
            %7 = vector.transfer_write %6, %tiled0 {in_bounds = [true]} : vector<16xf32>, tensor<32x32xf32>
            // omit some operations

  return %2 : tensor<256x32x32xf32>
}

It looks like the overwhelming majority of the transformations you need are readily available; they just require some careful application.

It is possible to tile repeatedly. It is also possible to rewrite linalg operations to, e.g., vector primitives (we have vector.transpose!) or some other operation.

The main semantic problem I see is that tensors don’t have any information about how the data is stored in memory most of the time. Neither do large vectors. This information mostly gets introduced by bufferization. I don’t see how what’s proposed here addresses that problem, other than, I suppose, assuming dense contiguous row-major storage for everything.

Linalg vectorization is perfectly operational and exercised in multiple downstream projects. There are upstream tests that show how to use it. MLIR does not provide passes for all its functionality. (Check out this talk https://www.youtube.com/watch?v=lXAp6ZAWyBY if you haven’t already.)

I suppose converting those operations should be the crux of this proposal, not reinventing tiling+fusion on vectors. It’s not very clear to me why you insist on converting them to vectors early; it looks like handling memory is a non-issue in your case for some reason, but okay, let’s hear more about that. (Note that expand shape is close to a noop when you introduce buffers because it doesn’t actually reshuffle data in memory, but it may reshuffle data between vector registers.) Just as a reminder: absolutely nothing in MLIR requires that all operations in a translation unit use a certain type, vector or not; also, tensor of vectors is a thing.

Hi~ Alex, thank you for your very helpful advice!
Let me add some context to further discuss this pass.

It is possible to tile repeatedly.

Why not tile the tensor in one step?

First, on the CPU we often consider the size of the L1 cache for performance, so we tile the operation’s tensor into a tensor that fits the L1 cache. For example, for matmul, we first use the existing community tiling-and-fusion methods to produce tensors like tensor<32x32xf32>.

It’s not very clear to me why you insist on converting them to vectors early…

Tiling the tensor to an L1-cache-sized tensor also allows some operations to be optimized with hardware instructions. For example, matmul on an Intel CPU can be implemented with AMX instructions, and transpose can be implemented with shuffles, etc. Implementations like the transpose one require specific hardware support, so the operations need to be converted into vectors of physical register size, which is why I want to propose this pass.

Other hardware may also need this pass for hardware-specific optimization, so we need a tile vector abstraction that lowers operations into vectors for different hardware. Since we lower tiled tensor operations onto vectors, this abstraction is named tile vector, as written in the RFC above.

A tile vector abstraction does not exist in the current MLIR community.

Second, we think the goal of tiling on tensors is to tile into a medium-sized tensor, such as tensor<32x32xf32>, to provide further optimization opportunities for subsequent passes, rather than tiling into a very small tensor. This also preserves the original design intent of linalg; after all, linalg operations have high-level semantics.

  1. Why not use affine super-vector for vectorization

As mentioned above, for better performance, we hope that some operations have their own specific optimization implementations, rather than fully automatic implementations.

The existing pass pipelines in the community may be designed more with GPUs in mind, tiling into tensors sized for GPU registers. Due to the computing characteristics of a GPU, these tensors are not as small as on a CPU (for example, an AVX-512 register holds at most 16 f32 values). If the tensor is tiled too small, we lose many optimization opportunities on the CPU. I think we need a pass that can produce the best code for different hardware.

It’s rare that one step of tiling is sufficient. There are multiple levels of cache, TLB, registers, etc.

L1 is not related to AMX instructions as far as I know, caches are not software controllable. You may need to ensure that tiling produces memory accesses (transfer_read/write) of the appropriate structure, but that should be already possible.

And that’s already possible. We even have some lowerings for other instruction sets, exercisable via the Transform Dialect or patterns.

I still don’t understand what this is and how it is different from (a) just vectors and (b) tensor<A x B x vector< C x D x f32>>.

Nothing prevents one from tiling again after those subsequent passes have been applied. The goal of not only Linalg but structured ops (see paper: https://arxiv.org/pdf/2202.03293) is to preserve parts of the semantics required for transformations instead of rediscovering them through analysis. It has nothing to do with high or low level. Vector dialect follows the same design principles.

I have not suggested that.


I really think there is enough reusable functionality for tiling and vectorization in the codebase to support this use case. Folks in the community must have handled CPU vectorization before… There may be some small bits missing to target AMX specifically, assuming that’s your goal, but you should (1) try hard to compose existing tiling/vectorization/lowering, (2) extend the existing functionality if support is missing, (3) add the lowering specific to AMX, and ask questions along the way. Only if nothing of the above works should we consider adding yet another incarnation of tiling.

Joining a bit late, sorry.

Indeed, we use the Linalg Vectorizer that you suggested upthread - it works really well! And in general, scanning this thread I’m under the impression that it is basically proposing another … Linalg Vectorizer :sweat_smile:

Are you interested specifically in AMX? It might be worth reaching out to @aartbik who authored the AMX dialect. Also, if you’d like some pointers on targeting some novel CPU extension via the Vector dialect, it might be worth watching this (sorry, shameless self-promotion):

Hope this helps,
-Andrzej

assuming that’s your goal, but you should (1) try hard to compose existing tiling/vectorization/lowering, (2) extend the existing functionality if support is missing, (3) add the lowering specific to AMX,

Thanks Alex for the detailed explanation.

L1 is not related to AMX instructions as far as I know, caches are not software controllable.

Yeah, I agree~ The AMX example may have caused some misunderstanding. What I mean is that we have an optimization that requires an L1-sized tensor to implement, and that implementation uses AMX and other related techniques. So we have to tile to an L1-cache-sized tensor here.

It has nothing to do with high or low level. Vector dialect follows the same design principles.

Certainly, we may want to select tile sizes based on some hardware-related criteria, like divisibility of the innermost dimension by the HW vector size, but we can do so on the tensor level perfectly fine.

There are some differences between operations on tensors (like linalg/tensor) and operations on vectors (like arith/math). Take linalg as an example: some operations in linalg have difficulty expressing certain precise information. For example, for data type promotion the existing design converts data types in arith passes, such as bf16→f32 or f32→bf16. If tiling happens entirely on tensor operations, we cannot accurately perform hardware-related vectorization. For instance, if we tile to tensor<32xbf16> but the operation needs to compute in f32 for a more accurate result, then the promotion happens in the arith dialect operating on vectors, which means we need to convert:

 linalg.operation : tensor<32xbf16>        // tiled op, vectorized to vector<32xbf16>
 // after running the promotion pass (on vectors):
 %2 = arith.extf %1 : vector<32xbf16> to vector<32xf32>
 %3 = arith.operation %2 : vector<32xf32>  // pseudo op computing in f32

and the promoted vector<32xf32> now exceeds the register size of an AVX-512 CPU.

If the current community flow passes a vector of inappropriate size directly to LLVM, the only option today is to unroll it.

To achieve more appropriate tiling, we have to perform some tiling on vectors, tiling them down to the size of the hardware registers.
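What we would like to end up with is roughly the following (a sketch only, assuming a 512-bit register, i.e. 16 x f32, and placeholder buffers %src/%dst and padding %pad):

scf.for %i = %c0 to %c32 step %c16 {
  %x  = vector.transfer_read %src[%i], %pad {in_bounds = [true]} : memref<32xbf16>, vector<16xbf16>
  %xf = arith.extf %x : vector<16xbf16> to vector<16xf32>
  // ... compute in f32 on vector<16xf32> ...
  %r  = arith.truncf %xf : vector<16xf32> to vector<16xbf16>
  vector.transfer_write %r, %dst[%i] {in_bounds = [true]} : vector<16xbf16>, memref<32xbf16>
}

This way both the extension and the computation stay within the physical register width.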

On the other hand, LLVM has auto-vectorization, and our solution can perform similar vectorization down to the hardware register size in MLIR.

scanning this thread I’m under the impression that it is basically proposing another … Linalg Vectorizer

No, there may be some misunderstanding here.
We do need the existing linalg vectorizer from the community, but it needs to be extended to more operations. For example, tensor.expand_shape may also need to be vectorized. If many people use it, the vectorizer could be exposed as rewrite patterns (like this) or as a pass.

if you’d like some pointers on targeting some novel CPU extension via the Vector dialect, it might be worth watching this

Thanks~ Very helpful~
It seems we do the similar things~
In the video, the shown code produces a vector.broadcast : vector<32xf32>. But the size of vector<32xf32> does not match the hardware register size (e.g. 512 bits) (assuming we need to tile to a medium-size vector to enable some optimization, as in the video), so we must unroll. This is the point I want to resolve; with vector tiling we can avoid the unrolling:

vector.broadcast : vector<32xf32>
===> tiling:
scf.for 32, step=16:
  vector.broadcast : vector<16xf32>

Yes, we can totally extend the vectoriser :slight_smile:

Could you give an example showing how you’d expect tensor.expand_shape to be vectorised? Once we have an idea, we can discuss how to add that to the Linalg vectoriser. Also, please keep in mind that in many cases Ops can be folded away before “vectorisation” - that’s why we’ve not really needed to support all the Tensor Ops so far.

You can tweak the size of the output vector by changing the tile sizes when tiling.

What’s wrong with letting the backend split the vectors? Apologies if you’ve already explained this.

Btw …

Broken link?

I want to add more overall information to help the reviewer understand the design purpose and the problem to solve:

  • Why do we convert operations to vectors early?

Today we have tile-and-fuse passes on the Linalg dialect, which can fuse ops at a given tile size. Take fp32 matmul + transpose as an example: after Linalg fusion, several layers of scf.for are created, and inside the innermost loop there are linalg.matmul and linalg.transpose operating on a temporary tensor of tile size (say 32x32).

But on a CPU there is no single register that can hold 32x32xf32, nor a single instruction that can do a 32x32 matmul. When lowering that 32x32 matmul into the arith and math dialects, a series of FMA instructions is generated, each operating only on 16xf32. We want the transpose to happen directly on each 16xf32 result for a register-level fusion.
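For illustration only (a sketch, not the exact lowering; %a_elem, %B, %k, %j and %acc are placeholders), one FMA step of such a microkernel on a 512-bit target looks roughly like:

// broadcast one A element and multiply-accumulate against 16 contiguous B elements
%a_bcast = vector.broadcast %a_elem : f32 to vector<16xf32>
%b_row   = vector.load %B[%k, %j] : memref<32x32xf32>, vector<16xf32>
%acc_new = vector.fma %a_bcast, %b_row, %acc : vector<16xf32>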

If we want to achieve this with Linalg fusion, we need to further break down the 32x32 matmul into linalg.generic ops that represent the FMA computation, which is quite trivial and poorly organized. For a highly efficient 32x32 matmul we also want software pipelining and data prefetching for the FMA instruction sequence; those details are beyond the description scope of the Linalg dialect.

The alternative solution this RFC proposes is to do tile-and-fuse on linalg with a tile size that does not need to match the physical register size exactly. After lowering linalg to arith/math, we convert the tile-size vectors into physical-register-size vectors.

This means linalg fusion focuses on register-group-level or L1-cache-level fusion. Register group level means that, by utilizing all the 512-bit physical registers, we can implement a highly efficient microkernel that produces a 32x32xf32 output kept in several registers. Then we neither need the tile size to match the physical register size, nor need to break the 32x32 matmul down in the Linalg dialect, so the Linalg fusion result stays at a high-level, largely target-independent abstraction (target-independent here means we do not break matmul down into FMAs, which is a target-dependent implementation).

  • What problems can this pass solve?

This RFC aims to solve the vector size mismatch issue in the vector dialect. To legalize a tile-size vector into physical-register-size vectors, some transformations already exist in upstream MLIR:

  1. Unroll the tile-size vector into many physical-register-size vectors; the arith/math ops are unrolled as well (sketched after this list).
  2. Convert the tile-size vector into scalars and let auto-vectorization convert it back to physical-register-size vectors.
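For reference, option 1 applied to a vector<32xf32> add produces roughly the following (a sketch under the assumption of a 16-wide native size; %lhs, %rhs and %zero are placeholders):

// vector<32xf32> addf unrolled into two native 16-wide ops
%l0  = vector.extract_strided_slice %lhs {offsets = [0], sizes = [16], strides = [1]} : vector<32xf32> to vector<16xf32>
%r0  = vector.extract_strided_slice %rhs {offsets = [0], sizes = [16], strides = [1]} : vector<32xf32> to vector<16xf32>
%s0  = arith.addf %l0, %r0 : vector<16xf32>
%a0  = vector.insert_strided_slice %s0, %zero {offsets = [0], strides = [1]} : vector<16xf32> into vector<32xf32>
%l1  = vector.extract_strided_slice %lhs {offsets = [16], sizes = [16], strides = [1]} : vector<32xf32> to vector<16xf32>
%r1  = vector.extract_strided_slice %rhs {offsets = [16], sizes = [16], strides = [1]} : vector<32xf32> to vector<16xf32>
%s1  = arith.addf %l1, %r1 : vector<16xf32>
%res = vector.insert_strided_slice %s1, %a0 {offsets = [16], strides = [1]} : vector<16xf32> into vector<32xf32>

For a 32-wide add this is harmless, but for a 32x32 tile it already means hundreds of straight-line ops.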

Those two solutions may work well for some devices, like SIMT GPUs. But CPUs prefer a loop instead of unrolling, to avoid instruction bloat. Also, because we keep the high-level semantics of the tile-size vector, this method can provide higher-quality vectorized code than scalar-to-vector vectorization.

So even setting aside which tile size should be used in linalg fusion, extending the ways of converting large vectors into physical-register-size vectors is meaningful by itself. This RFC is trying to add a third option for CPU-like devices.

There are at least two solutions to this problem that appear simpler than bringing tiling to vectors:

  1. Making the logic that chooses tile sizes aware of the need to convert data types later on. When compiling for a given target, it is most likely known in advance that said target does not support bf16. At this point, the tile size selection logic can use that information when deciding tile sizes by simply doubling the amount of memory that would be needed to store a tile of data… It is perfectly expected for optimization heuristics to be driven by a target description, and this aligns well with ongoing work on target descriptions.
  2. Rewriting Linalg operations instead of waiting until IR reaches Arith. Linalg even supports some form of mixed precision, e.g., when the accumulator is using a larger type than operands in contractions.
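For example, something along these lines is accepted by Linalg (a sketch; %a, %b, %acc are placeholders), with the operand-to-accumulator extension implicit in the op’s region rather than materialized as arith.extf on large vectors:

%res = linalg.matmul ins(%a, %b : tensor<32x32xbf16>, tensor<32x32xbf16>)
                     outs(%acc : tensor<32x32xf32>) -> tensor<32x32xf32>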

The former can be done entirely downstream. The latter is most likely a small and welcome upstream extension. Both have much smaller maintenance cost for upstream.

LLVM vectorization is unrelated to this discussion. It starts from straight-line code or CFG loops to rediscover information readily available in MLIR.

How would this be different from doing tiling+fusion with some sizes on Linalg, doing further transforms, and then tiling each Linalg operation again to the hardware register size? I see above that element type change is not supported at this level, and I propose two alternatives that look viable. Are there other transformations that must happen before the second tiling for some reason?

The problem is not that other transformations might have to happen before the second tiling; if that were the case, we could still achieve it by splitting the linalg tiling into two passes and inserting the other transformations in between.

The main reason we don’t want to do the second tiling on the linalg dialect is that it is too trivial and not efficient. Let’s look at your alternative solutions:

This will introduce complexity into the Linalg tiling and fusion logic. There are two kinds of passes in the arith/math dialects that determine the data type used for each op: one is legalize-to-f32 and emulate-unsupported-floats, which promote unsupported bf16/fp16 ops to fp32; the other is the canonicalization pass, which eliminates a pair of arith::extf/arith::truncf under a fast-math flag. If you want to do the second tiling in linalg, then you need to implement the same transformations on linalg, which seems redundant. [MLIR][Arith] add fastMathAttr on arith::extf and arith::truncf by crazydemo · Pull Request #93443 · llvm/llvm-project · GitHub provides more details about this.

Another, more important reason is the design principle that different dialects should provide different levels of abstraction, and different transformations should be implemented on the suitable dialect.

The Linalg dialect provides a high-level abstraction that aligns with framework ops like matmul and transpose, while the arith and math dialects are a lower level of abstraction that aligns with machine ops like FMA and vector_shuffle.

The second tiling is an optimization targeting the hardware register size, which matches the abstraction of the vector dialect perfectly. If we applied it in the Linalg dialect, we would need to expose FMA and vector_shuffle ops in Linalg. This would blur the abstraction boundaries and make the vector dialect useless.

I’m confused with terminology here. Please elaborate what you mean by “trivial” and “not efficient”.

Is the transformation itself trivial, as in, easy to perform? That’s exactly the design goal of structured ops.

Is the generated code not efficient on some target platform? Is compile time too high?

Certainly this may introduce some complexity. My point is that implementing yet another tiling but on vectors will introduce more complexity in this flow in particular and the ecosystem in general than lifting what’s essentially peephole rewrites to Linalg. Depending on the implementation of emulation passes, it may be just a matter of letting them see inside Linalg operations that contain arith in their body… The canonicalizations can still happen down the pipeline, after we converted Linalg to loops or vectorized. If by complexity for tiling and fusion you mean the complexity of the driver/cost model that controls the fusion, we currently don’t have any, so it’s the desired outcome to have one.

Following this principle, tiling should be implemented on the corresponding abstraction level, which is Linalg + tensor. Renaming tiling to something else won’t change the nature of this transformation, it’s the one materializing loops that were previously implicit.

I don’t see how you arrive at the conclusion that one needs to expose vector_shuffle in Linalg (FMA is already “exposed” in the sense that math.fma can be used inside a linalg.generic similarly to any other scalar operation) in order to perform tiling or represent the result thereof. It looks perfectly possible to me to tile to the desired vector size, vectorize to produce transfer_read/write and eventually transpose operations, and lower those to shuffles when appropriate. Moreover, all of that exists in the ecosystem. If somehow you cannot arrive at the desired result, please provide a specific example, potentially as a github bug, and we can see whether we are lacking transformation controllability somewhere.

I’m confused with terminology here. Please elaborate what you mean by “trivial” and “not efficient”.

Let’s take matmul+relu as an example. During linalg tiling and fusion, if we select 32x32 as the tile size, the IR will look like:

      scf.for 0...4 {
        scf.for 0...4 {
          %tmp0 = tensor.extract_slice %arg0: tensor<32x32xf32>
          %tmp1 = tensor.extract_slice %arg1: tensor<32x32xf32>
          linalg.matmul ins(%tmp0, %tmp1 : tensor<32x32xf32>, tensor<32x32xf32>) outs(%0: tensor<32x32xf32>) -> tensor<32x32xf32>
          linalg.relu ins(%0 : tensor<32x32xf32>) outs(%1 : tensor<32x32xf32>) -> tensor<32x32xf32> // pseudo op
          tensor.insert_slice %1 : tensor<32x32xf32> into tensor<4x4x32x32xf32>
        }
      }

The result is still at a high level of abstraction and the fusion result is easy to analyze.
But if we do physical-register tiling with 1x16, then tons of broadcast and FMA ops are created, and the IR will look like:

      scf.for 0...4 {
        scf.for 0...4 {
          %tmp0 = tensor.extract_slice %arg0: tensor<1xf32>
          %tmp1 = tensor.extract_slice %arg1: tensor<16xf32>
          %tmp2 = tensor.extract_slice %arg0: tensor<1xf32>
          %tmp3 = tensor.extract_slice %arg1: tensor<16xf32>

... hundreds of extract_slice

          linalg.broadcast ins(%tmp0 : tensor<1xf32>) outs(%0 : tensor<16xf32>) -> tensor<16xf32> // broadcast scalar to vector
          %fma0 = linalg.generic {indexing_maps = [...], iterator_types = ["parallel"]} ins(%0, %tmp1: tensor<16xf32>, tensor<16xf32>) outs(%1: tensor<16xf32>) { // FMA
          ^bb0(%in1: f32, %in2: f32, %out: f32):
              %2 = math.fma %in1, %in2, %out : f32
              linalg.yield %2 : f32
          } -> tensor<16xf32>
          linalg.broadcast ins(%tmp2 : tensor<1xf32>) outs(%0 : tensor<16xf32>) -> tensor<16xf32> // broadcast scalar to vector
          %fma1 = linalg.generic {indexing_maps = [...], iterator_types = ["parallel"]} ins(%0, %tmp3: tensor<16xf32>, tensor<16xf32>) outs(%1: tensor<16xf32>) { // FMA
          ^bb0(%in1: f32, %in2: f32, %out: f32):
              %2 = math.fma %in1, %in2, %out : f32
              linalg.yield %2 : f32
          } -> tensor<16xf32>

... hundreds of broadcast and FMA

          linalg.relu ins(%500 : tensor<16xf32>) outs(%501 : tensor<16xf32>) -> tensor<16xf32> // pseudo op
          linalg.relu ins(%502 : tensor<16xf32>) outs(%503 : tensor<16xf32>) -> tensor<16xf32> // pseudo op

... hundreds of relu ops

          tensor.insert_slice %1000 : tensor<16xf32> into tensor<4x4x32x32xf32>
          tensor.insert_slice %1001 : tensor<16xf32> into tensor<4x4x32x32xf32>

... hundreds of insert_slice
        }
      }

Then Linalg completely loses the abstraction it is supposed to provide, the generated IR becomes hard to analyze and transform, and the subsequent compilation flow becomes very slow.

tiling should be implemented on the corresponding abstraction level, which is Linalg + tensor.

Linalg + tensor represents a high-level abstraction; that is why there are no low-level ops like prefetch or shuffle defined in Linalg. Linalg is also not good at describing masked ops, which are very important for physical-register tiling. Forcing the tensor to be the same size as a physical register before lowering to vector is a very strong restriction; with it, a lot of device details would need to be considered during middle-end optimization, putting an unnecessary burden on Linalg transformations.

The last thing I want to mention is that we don’t need to choose between the two implementations; they can work together to solve different problems. With this RFC, a downstream project can still do physical-register-size tiling with Linalg + tensor, and the subsequent transformations don’t have the restriction above, because when lowering to the vector dialect this RFC will help legalize any unsupported vector into a supported one.

Hey! I’m a bit late but I’m thrilled to see more people interested in vectorization! :blush:

The tile vector concept seems to describe what we call “virtual vectors”, which are vectors that do not necessarily adhere to the dimensionality, sizes, constraints and other characteristics of the specific target vector registers. We widely use virtual vectors in the Linalg vectorizer to model multi-dimensional unrolling strategies over the vector iteration space. For example, we can generate vector<8x32xf32> for a target with 512-bit vector registers. This configuration effectively applies a 2x unroll factor to the innermost vector dimension and an 8x unroll factor to the outer one.

It seems like you are missing the actual target legalization part, which would allow you to apply further target-specific optimizations in MLIR once the vector types are in their final form. While we do not have full-blown legalization passes, we have a few transformations that could be useful. For example, we use vector unrolling to turn a multi-dimensional vector into 1-D vectors. Unfortunately, not all operations are currently supported, and this won’t legalize the contiguous vector dimension. You might also want to look at vector linearization. With these transformations in place, it should be relatively simple to implement a pass that splits 1-D vectors into smaller vectors that fit the physical vector registers.
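For instance (a sketch only; %v is a placeholder), linearizing a virtual vector and then peeling off register-sized pieces could look like:

%flat  = vector.shape_cast %v : vector<8x32xf32> to vector<256xf32>
%piece = vector.extract_strided_slice %flat {offsets = [0], sizes = [16], strides = [1]} : vector<256xf32> to vector<16xf32>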

Going back to tiling, I would be concerned if we used the concept of vector to model any kind of memory or cache level tiling. Vectors and memory/caches are independent concepts and should be modeled accordingly, and that’s where the tensor type plays the right role, IMO.

I think there might be a misunderstanding here. If we apply vector unrolling and similar transformations to the 32x32 vector targeting the physical vector register size, we would end up with the code you described above. However, if we apply an extra level of tiling with the physical vector register size, we should end up with two more scf.for ops (one for each dimension) and just a few operations working on the actual physical vector register.

In general, I agree with @ftynse in that Linalg is designed to work on tensors and heavily relies on tiling and tensor slicing to apply different transformations. It would be hard to change these fundamental principles.