Hi~
We need to tile a large vector into small vectors, e.g.

```mlir
%3 = arith.addf %1, %2 : vector<32x1024xf32>
```

tiled as:

```mlir
scf.for ... {
  scf.for ... {
    // some read operations
    %s = arith.addf %a, %b : vector<16xf32>
  }
}
```
The community now has `tileUsingSCF` to tile tensor-based operations. But this method requires the operation to implement the `TilingInterface`, and operations that operate on vectors do not have this interface.
Is there any reason not to add this interface to operations like `arith.addf` and `vector.transpose` so that we can reuse the method to tile vectors?
Is there any chance that this interface could be extended to operations in the arith and vector dialects?
Or should tiling of vectors be implemented in some new way?
I currently use my own method to tile vectors, and I am not sure which solution is more reasonable.
I think the expectation is that you do all your tiling at the Linalg level.
Thanks~
But the semantics of Linalg ops are too high-level, and Linalg tiling is designed for medium-size tiles; we think it is not suitable for small sizes like the physical register size. For example, `linalg.transpose` is a standalone op, but during lowering to arith/math on vectors we need to further break it into several shuffle ops for an in-register transpose algorithm.
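To make the in-register transpose point concrete, here is a minimal sketch of a 2x2 transpose expressed as shuffles instead of a standalone transpose op (the value names are made up):

```mlir
// %row0 = [a, b], %row1 = [c, d]; a 2x2 in-register transpose becomes
// two shuffles that pick elements out of the concatenated operands.
%t0 = vector.shuffle %row0, %row1 [0, 2] : vector<2xf32>, vector<2xf32> // [a, c]
%t1 = vector.shuffle %row0, %row1 [1, 3] : vector<2xf32>, vector<2xf32> // [b, d]
```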
If you need to tile and create inter-tile loops, you use `TilingInterface`.
Once you get the problem tiled to the shape you want, then you use VectorUnrolling to break up the virtual multi-dimensional register into a shape that gives you vectors that map to the physical register sizes.
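As a rough illustration of what vector unrolling produces (shapes and names are made up), a large op is rewritten into straight-line strided-slice extracts and inserts rather than loops:

```mlir
// Unroll %a + %b : vector<4x8xf32> into native vector<1x8xf32> pieces.
%zero = arith.constant dense<0.0> : vector<4x8xf32>
%a0 = vector.extract_strided_slice %a {offsets = [0, 0], sizes = [1, 8], strides = [1, 1]}
        : vector<4x8xf32> to vector<1x8xf32>
%b0 = vector.extract_strided_slice %b {offsets = [0, 0], sizes = [1, 8], strides = [1, 1]}
        : vector<4x8xf32> to vector<1x8xf32>
%s0 = arith.addf %a0, %b0 : vector<1x8xf32>
%r0 = vector.insert_strided_slice %s0, %zero {offsets = [0, 0], strides = [1, 1]}
        : vector<1x8xf32> into vector<4x8xf32>
// ... and likewise for offsets [1, 0], [2, 0], [3, 0].
```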
W.r.t. vector distribution, I think there is no upstream agreement on the way forward yet. Different projects (like Triton and IREE) have tried their own approaches with different degrees of success.
Thanks for the advice~
Vector processing in the community is generally fully unrolled, and then the IR is handed off to LLVM.
First, if every vector operation is unrolled, the instruction cache on the CPU fills up, which causes serious performance problems.
Second, the generated LLVM binary code will be very large; for example, we often see long runs of unrolled instructions.
This is the point I want to solve: tiling a vector into operations suitable for hardware execution by wrapping them in for loops. I think the community should consider adding support for tiling vectors.
Linalg tiling gets you the "problem size of the innermost loop".
The vector dialect allows you to essentially implement unroll-and-jam, and is meant to be straight-line code.
So I don't see much point in tiling at the vector dialect level itself. Essentially, think of the computation that your "straight-line innermost loop" represents. You then tile to that in Linalg, vectorize, and then unroll.
That is the theory. It definitely comes with its challenges. You need to really control the size of the vector you start with, so the tile sizes you use for tiling become load-bearing, but that is a separate problem of heuristics.
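Under that flow, the loop nest from the original question falls out of Linalg tiling, and vectorizing the already-tiled body yields the register-sized ops; a hedged sketch on memrefs (constants, tile sizes, and names are assumptions):

```mlir
// Loops produced by tiling at the Linalg level (tile sizes illustrative).
scf.for %i = %c0 to %c32 step %c1 {
  scf.for %j = %c0 to %c1024 step %c16 {
    // Vectorizing the tiled body gives physical-register-sized ops.
    %a = vector.transfer_read %A[%i, %j], %pad : memref<32x1024xf32>, vector<16xf32>
    %b = vector.transfer_read %B[%i, %j], %pad : memref<32x1024xf32>, vector<16xf32>
    %s = arith.addf %a, %b : vector<16xf32>
    vector.transfer_write %s, %C[%i, %j] : vector<16xf32>, memref<32x1024xf32>
  }
}
```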
My high-level take on this is that, while it is mechanically possible to tile/fuse arith/vector operations, we should not. This would be contrary to the notion of “basic arithmetic” from the dialect charter:
> The arith dialect is intended to hold basic integer and floating point mathematical operations.
(I further think that arith shouldn’t operate on tensors, but it’s a separate long-running discussion).
Could you explain how you arrive at the IR where vectors are so big that tiling them is necessary? To me, it sounds like the IR wasn’t sufficiently tiled before performing vectorization.
Also as a random note, we floated the idea of linalg-on-vectors a couple of years ago.