I’m vectorizing a linalg.matmul op using the vector dialect. The linalg.matmul op is already tiled. My vectorization pass performs the following steps:
- Vectorization using linalg::vectorize
- Unrolling
- Casting away vector leading one dim
- Hoisting
- Lowering to LLVM IR
Everything works fine if the matmul size is divisible by the tile size. If it is not, vectorization introduces a vector.mask op:
%12 = vector.mask %11 { vector.contract {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d2)>, affine_map<(d0, d1, d2) -> (d2, d1)>, affine_map<(d0, d1, d2) -> (d0, d1)>], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %8, %9, %10 : vector<4x4xf32>, vector<4x4xf32> into vector<4x4xf32> } : vector<4x4x4xi1> -> vector<4x4xf32>
It seems that vector.mask %11 { vector.contract } cannot be unrolled using mlir::vector::populateVectorUnrollPatterns(...). Does anyone know how to solve this problem?
Hi there!
We usually apply vector unrolling relatively late in the pipeline, when vector.mask has already been folded away, so unrolling vector.mask is not implemented in mlir::vector::populateVectorUnrollPatterns. However, we also apply unrolling as part of lowering vector.contract to vector.outerproduct or vector.fma, so you can try that with mlir::vector::populateVectorContractLoweringPatterns. vector.mask is supported there.
You can also try to apply peeling with mlir::linalg::peelLoops before running linalg::vectorize. That should generate a main loop, which would vectorize without masks, and a remainder loop. FTR, I was not very successful applying masking to vector.contract in the past. Although it’s supported in MLIR, I struggled to represent masked scalar loads in LLVM in a way that the LLVM backends would generate efficient asm for.
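To make the peeling idea concrete, here is a minimal sketch in plain C++. It is only a scalar model of the loop structure, not the MLIR API: `PeeledRange`, `peel`, and `peeledSum` are hypothetical names invented for illustration, and a 1-D sum stands in for the matmul.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an MLIR API): splits [0, n) into a main range
// whose length is a multiple of `tile`, plus a remainder range.
struct PeeledRange {
  std::size_t mainEnd;  // main loop covers [0, mainEnd) in steps of `tile`
  std::size_t end;      // remainder loop covers [mainEnd, end)
};

PeeledRange peel(std::size_t n, std::size_t tile) {
  assert(tile > 0 && "tile size must be positive");
  return {n - n % tile, n};
}

// Sums `data` with a peeled loop: full tiles first, then the leftovers.
float peeledSum(const std::vector<float>& data, std::size_t tile) {
  PeeledRange r = peel(data.size(), tile);
  float acc = 0.0f;
  // Main loop: every iteration handles a full tile, so no mask is needed.
  for (std::size_t i = 0; i < r.mainEnd; i += tile)
    for (std::size_t j = 0; j < tile; ++j)
      acc += data[i + j];
  // Remainder loop: fewer than `tile` elements are left.
  for (std::size_t i = r.mainEnd; i < r.end; ++i)
    acc += data[i];
  return acc;
}
```

With 10 elements and a tile of 4, the main loop covers indices 0 to 7 and the remainder loop covers 8 and 9. Only the remainder part would need masking or scalar code.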
Happy to answer any other questions that you may have.
Diego
Thanks!
Using mlir::vector::populateVectorContractLoweringPatterns directly with the outerproduct option gives the same result as the unroll method in my pipeline. I suspect mlir::linalg::peelLoops might cause some performance issues for matrices that are tiled multiple times (tile size=[[8, 32, 0], [4, 4, 0], [0, 0, 4]]). I’m also not sure whether introducing arith.select will cause a performance loss; I’ll test it later.
In the beginning, I used a pass pipeline based on lei.caht()'s practice in IREE. But it seems that mlir::vector::populateVectorContractLoweringPatterns can now directly replace the unroll step mentioned in the article?
It depends on the target but using a select to blend the new result with the pass-through one is a well-known pattern and should be peepholed into masked instruction if available in the target.
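As a rough illustration of the select-blend pattern, here is a scalar model in plain C++. It is a sketch under assumptions, not code from MLIR or LLVM: `maskedAdd` is a hypothetical name, and the per-element ternary plays the role of arith.select.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of a masked update: the new value is computed
// for every lane, then a select keeps it only where the mask is set and
// otherwise passes the old accumulator value through.
std::vector<float> maskedAdd(const std::vector<float>& acc,
                             const std::vector<float>& update,
                             const std::vector<bool>& mask) {
  assert(acc.size() == update.size() && acc.size() == mask.size());
  std::vector<float> out(acc.size());
  for (std::size_t i = 0; i < acc.size(); ++i) {
    float computed = acc[i] + update[i];   // unmasked computation
    out[i] = mask[i] ? computed : acc[i];  // blend with the pass-through value
  }
  return out;
}
```

Whether a backend folds the compute-then-select pair into a single masked instruction depends on the target, as noted above.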
Things are moving quickly! That’s probably a question for @antiagainst 