Hi~
We need to tile a large vector into small vectors, e.g.

```mlir
%3 = arith.addf %1, %2 : vector<32x1024xf32>
```

tiled as:

```mlir
scf.for ... {
  scf.for ... {
    // some read operations
    %s = arith.addf %a, %b : vector<16xf32>
  }
}
```
The community now has `tileUsingSCF` to tile tensor-based operations. But this method requires the operation to implement the `TilingInterface`, and operations that operate on vectors do not have this interface.
Is there any reason not to add this interface to operations like `arith.addf` and `vector.transpose` so that we can reuse the method to tile vectors?
Is there any chance that this interface could be extended to operations in the arith and vector dialects?
Or should tiling of vectors be implemented in some new way?
I currently use my own method to tile vectors, and I am not sure which solution is more reasonable.
I think the expectation is that you do all your tiling at the Linalg level.
Thanks~
But the semantics of Linalg ops are too high-level, and Linalg tiling is designed for medium-size tiles; we think it is not suitable for small sizes like the physical register size. For example, `linalg.transpose` is a standalone op, but during lowering to arith/math on vectors we need to further break it into several shuffle ops for an in-register transpose algorithm.
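To make the in-register transpose point concrete, here is a minimal sketch of a 2x2 transpose expressed as shuffles instead of a standalone transpose op (the value names are made up):

```mlir
// %row0 = [a, b], %row1 = [c, d]; a 2x2 in-register transpose becomes
// two shuffles that pick elements out of the concatenated operands.
%t0 = vector.shuffle %row0, %row1 [0, 2] : vector<2xf32>, vector<2xf32> // [a, c]
%t1 = vector.shuffle %row0, %row1 [1, 3] : vector<2xf32>, vector<2xf32> // [b, d]
```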
If you need to tile and create inter-tile loops, you use `TilingInterface`.
Once you get the problem tiled to the shape you want, then you use VectorUnrolling to break up the virtual multi-dimensional register into a shape that gives you vectors that map to the physical register sizes.
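As a rough illustration of what vector unrolling produces (shapes and names are made up), a large op is rewritten into straight-line strided-slice extracts and inserts rather than loops:

```mlir
// Unroll %a + %b : vector<4x8xf32> into native vector<1x8xf32> pieces.
%zero = arith.constant dense<0.0> : vector<4x8xf32>
%a0 = vector.extract_strided_slice %a {offsets = [0, 0], sizes = [1, 8], strides = [1, 1]}
        : vector<4x8xf32> to vector<1x8xf32>
%b0 = vector.extract_strided_slice %b {offsets = [0, 0], sizes = [1, 8], strides = [1, 1]}
        : vector<4x8xf32> to vector<1x8xf32>
%s0 = arith.addf %a0, %b0 : vector<1x8xf32>
%r0 = vector.insert_strided_slice %s0, %zero {offsets = [0, 0], strides = [1, 1]}
        : vector<1x8xf32> into vector<4x8xf32>
// ... and likewise for offsets [1, 0], [2, 0], [3, 0].
```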
W.r.t. vector distribution, I think there is no upstream agreement on the way forward yet. Different projects (like Triton and IREE) have tried their own approaches with different degrees of success.
Thanks for the advice~
Vector processing in the community is generally fully unrolled, and then the IR is handed off to LLVM.
First, if every vector operation is unrolled, the instruction cache on the CPU fills up, which causes serious performance problems.
Second, the generated LLVM binary code will be very large; for example, we often see long runs of unrolled instructions.
This is the point I want to solve: tiling a vector into operations suitable for hardware execution by wrapping them in for loops. I think the community should consider adding support for tiling vectors.
Linalg tiling gets you the "problem size of the innermost loop".
The vector dialect allows you to essentially implement unroll-and-jam, and is meant to be straight-line code.
So I don't see much point in tiling at the vector dialect level itself. Essentially, think of the computation that your "straight-line innermost loop" represents. You then tile to that in Linalg, vectorize, and then unroll.
That is the theory. It definitely comes with its challenges. You need to really control the size of the vector you start with, so the tile sizes you use for tiling become load-bearing, but that is a separate problem of heuristics.
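Under that flow, the loop nest from the original question falls out of Linalg tiling, and vectorizing the already-tiled body yields the register-sized ops; a hedged sketch on memrefs (constants, tile sizes, and names are assumptions):

```mlir
// Loops produced by tiling at the Linalg level (tile sizes illustrative).
scf.for %i = %c0 to %c32 step %c1 {
  scf.for %j = %c0 to %c1024 step %c16 {
    // Vectorizing the tiled body gives physical-register-sized ops.
    %a = vector.transfer_read %A[%i, %j], %pad : memref<32x1024xf32>, vector<16xf32>
    %b = vector.transfer_read %B[%i, %j], %pad : memref<32x1024xf32>, vector<16xf32>
    %s = arith.addf %a, %b : vector<16xf32>
    vector.transfer_write %s, %C[%i, %j] : vector<16xf32>, memref<32x1024xf32>
  }
}
```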
My high-level take on this is that, while it is mechanically possible to tile/fuse arith/vector operations, we should not. This would be contrary to the notion of “basic arithmetic” from the dialect charter:
> The arith dialect is intended to hold basic integer and floating point mathematical operations.
(I further think that arith shouldn’t operate on tensors, but it’s a separate long-running discussion).
Could you explain how you arrive at the IR where vectors are so big that tiling them is necessary? To me, it sounds like the IR wasn’t sufficiently tiled before performing vectorization.
Also as a random note, we floated the idea of linalg-on-vectors a couple of years ago.