Hello,
I’m playing with register blocking (also called register-level tiling, or loop unrolling plus scalar replacement) in MLIR, and I’m trying to take advantage of the 16 hardware registers of a Core i7. I found some related passes: --affine-loop-unroll, --affine-loop-unroll-jam, and --affine-scalrep. I used these passes but did not get the expected result. The case study is a dense matrix multiplication C += A*B, where A, B, and C are all 2048x2048xf32.
func.func @matmul(%A: memref<2048x2048xf32>, %B: memref<2048x2048xf32>, %C: memref<2048x2048xf32>) {
  affine.for %i = 0 to 2048 {
    affine.for %j = 0 to 2048 {
      affine.for %k = 0 to 2048 {
        // load operands: C[i][j] += A[i][k] * B[k][j]
        %a = affine.load %A[%i, %k] : memref<2048x2048xf32>
        %b = affine.load %B[%k, %j] : memref<2048x2048xf32>
        %ci = affine.load %C[%i, %j] : memref<2048x2048xf32>
        // computation
        %p = arith.mulf %a, %b : f32
        %co = arith.addf %ci, %p : f32
        // store output to the same element that was loaded
        affine.store %co, %C[%i, %j] : memref<2048x2048xf32>
      }
    }
  }
  return
}
I would like to apply register blocking with an unroll factor of 3 on the i-loop and an unroll factor of 4 on the j-loop. The resulting code would look similar to this:
for (i = 0; i + 3 <= N; i += 3) {    // full 3-row blocks only
  for (j = 0; j + 4 <= N; j += 4) {  // full 4-column blocks only
    C_00 = C[i][j];
    .....
    C_23 = C[i + 2][j + 3];
    for (k = 0; k < N; k++) {
      // 3 registers for A
      a  = A[i][k];
      a1 = A[i + 1][k];
      a2 = A[i + 2][k];
      b = B[k][j];  // 1 register for B
      C_00 += a * b;
      C_10 += a1 * b;
      C_20 += a2 * b;
      .....
      b = B[k][j + 3];
      C_03 += a * b;
      C_13 += a1 * b;
      C_23 += a2 * b;
    }
    C[i][j] = C_00;
    ...
    C[i + 2][j + 3] = C_23;
  }
}
As you can see, the register management inside the innermost kernel is a bit difficult for --affine-scalrep to handle (I can do that part on my own). I only want to unroll the loops with different unroll factors.
Note that the unroll factor on the i-loop does not evenly divide 2048 (2048 mod 3 = 2), so I need padding or cleanup code (a prologue or epilogue) for the remaining iterations.
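To make the remainder handling concrete, here is a minimal C sketch of the iteration-space split, not MLIR output and not a claim about what any pass generates. It assumes row-major n x n matrices; the names matmul_blocked, matmul_region, UI, and UJ are hypothetical and chosen only for illustration. The main loops cover full 3x4 blocks, and two epilogue calls handle the leftover columns and rows (for n = 2048, that is rows 2046 and 2047).

```c
#include <stddef.h>

enum { UI = 3, UJ = 4 }; /* unroll factors for the i- and j-loops */

/* Plain C += A*B restricted to rows [i0, i1) and columns [j0, j1). */
void matmul_region(size_t n, const float *A, const float *B, float *C,
                   size_t i0, size_t i1, size_t j0, size_t j1) {
    for (size_t i = i0; i < i1; ++i)
        for (size_t j = j0; j < j1; ++j) {
            float c = C[i * n + j];
            for (size_t k = 0; k < n; ++k)
                c += A[i * n + k] * B[k * n + j];
            C[i * n + j] = c;
        }
}

/* C += A*B with a UI x UJ register block plus epilogue loops. */
void matmul_blocked(size_t n, const float *A, const float *B, float *C) {
    size_t i_main = n - n % UI; /* 2046 for n = 2048 */
    size_t j_main = n - n % UJ; /* 2048 for n = 2048 */

    for (size_t i = 0; i < i_main; i += UI)
        for (size_t j = 0; j < j_main; j += UJ) {
            float acc[UI][UJ]; /* plays the role of C_00 .. C_23 */
            for (size_t ii = 0; ii < UI; ++ii)
                for (size_t jj = 0; jj < UJ; ++jj)
                    acc[ii][jj] = C[(i + ii) * n + (j + jj)];

            for (size_t k = 0; k < n; ++k)
                for (size_t jj = 0; jj < UJ; ++jj) {
                    float b = B[k * n + (j + jj)]; /* one value of B reused UI times */
                    for (size_t ii = 0; ii < UI; ++ii)
                        acc[ii][jj] += A[(i + ii) * n + k] * b;
                }

            for (size_t ii = 0; ii < UI; ++ii)
                for (size_t jj = 0; jj < UJ; ++jj)
                    C[(i + ii) * n + (j + jj)] = acc[ii][jj];
        }

    /* Epilogue: columns left over by the j-blocking, for the blocked rows. */
    matmul_region(n, A, B, C, 0, i_main, j_main, n);
    /* Epilogue: rows left over by the i-blocking, across all columns. */
    matmul_region(n, A, B, C, i_main, n, 0, n);
}
```

The small acc array stands in for the twelve named scalars; whether a compiler keeps it entirely in registers depends on it fully unrolling the constant-bound ii/jj loops, so treat this as a description of the desired loop structure rather than a tuned kernel.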
Can you suggest a way to do this?