Register Blocking in MLIR (Loop Unroll + Scalar Replacement)

Hello,

I’m experimenting with register blocking (also called register-level tiling, or loop unroll + scalar replacement) in MLIR, and I’m trying to take advantage of the 16 hardware registers of a Core i7. I found some related passes: --affine-loop-unroll, --affine-loop-unroll-jam, --affine-scalrep. I tried these passes but did not get the expected result. The case study is a dense matrix multiplication C += A*B with A: 2048x2048xf32 and B: 2048x2048xf32.

func.func @matmul(%A: memref<2048x2048xf32>, %B: memref<2048x2048xf32>, %C: memref<2048x2048xf32>) {
    affine.for %i = 0 to 2048 {
        affine.for %j = 0 to 2048 {
            affine.for %k = 0 to 2048 {
                // load the operands: C[i][j] += A[i][k] * B[k][j]
                %a = affine.load %A[%i, %k] : memref<2048x2048xf32>
                %b = affine.load %B[%k, %j] : memref<2048x2048xf32>
                %ci = affine.load %C[%i, %j] : memref<2048x2048xf32>
                // computation
                %p = arith.mulf %a, %b : f32
                %co = arith.addf %ci, %p : f32
                // store the result to the same element that was loaded
                affine.store %co, %C[%i, %j] : memref<2048x2048xf32>
            }
        }
    }
    return
}

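I would like to apply register blocking with an unroll factor of 3 on the i-loop and an unroll factor of 4 on the j-loop. That gives 3x4 = 12 accumulators for C, plus 3 registers for A and 1 for B, i.e. 16 registers in total. The code would look similar to this: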

for (i = 0; i + 3 <= N; i += 3) {
    for (j = 0; j + 4 <= N; j += 4) {
        C_00 = C[i][j];
        .....
        C_23 = C[i + 2][j + 3];
        for (k = 0; k != N; k++) {
            // 3 registers for A
            a = A[i][k]; a1 = A[i + 1][k];
            a2 = A[i + 2][k];

            b = B[k][j];            // 1 register for B
            C_00 += a * b;
            C_10 += a1 * b;
            C_20 += a2 * b;
            .....
            b = B[k][j + 3];
            C_03 += a * b;
            C_13 += a1 * b;
            C_23 += a2 * b;
        }
        C[i][j] = C_00;
        ...
        C[i + 2][j + 3] = C_23;
    }
}

As you can see, the register management inside the innermost kernel is a bit too difficult for --affine-scalrep to handle (I can do it on my own). I only want to unroll the loops with different unroll factors.
Note also that the unroll factor on the i-loop does not evenly divide 2048 (2048 modulo 3 != 0), so I need cleanup code (a prologue or epilogue) for the remaining iterations.
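
To make the remainder handling concrete, here is a minimal C sketch of the structure I am after: a main loop nest over full 3x4 tiles, followed by cleanup loops for the leftover columns and rows. The function names matmul_ref and matmul_blocked_3x4, the flat row-major layout, and the small acc array standing in for the C_00 ... C_23 scalars are illustrative assumptions on my side, not something MLIR generates. Whether acc actually ends up in registers depends on the compiler fully unrolling the constant-trip loops.

```c
#include <assert.h>
#include <stddef.h>

/* Reference triple loop: C += A * B on row-major n x n matrices. */
void matmul_ref(size_t n, const float *A, const float *B, float *C) {
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

/* Hypothetical 3x4 register-blocked variant with cleanup loops. */
void matmul_blocked_3x4(size_t n, const float *A, const float *B, float *C) {
    enum { UI = 3, UJ = 4 };      /* unroll factors for i and j */
    assert(A && B && C);

    size_t ni = n - n % UI;       /* last row covered by full tiles */
    size_t nj = n - n % UJ;       /* last column covered by full tiles */

    for (size_t i = 0; i < ni; i += UI) {
        for (size_t j = 0; j < nj; j += UJ) {
            /* 12 accumulators, playing the role of C_00 ... C_23. */
            float acc[UI][UJ];
            for (size_t r = 0; r < UI; ++r)
                for (size_t c = 0; c < UJ; ++c)
                    acc[r][c] = C[(i + r) * n + j + c];

            for (size_t k = 0; k < n; ++k) {
                /* 3 values of A, reused across the 4 columns. */
                float a0 = A[(i + 0) * n + k];
                float a1 = A[(i + 1) * n + k];
                float a2 = A[(i + 2) * n + k];
                for (size_t c = 0; c < UJ; ++c) {
                    float b = B[k * n + j + c];   /* 1 value of B */
                    acc[0][c] += a0 * b;
                    acc[1][c] += a1 * b;
                    acc[2][c] += a2 * b;
                }
            }

            for (size_t r = 0; r < UI; ++r)
                for (size_t c = 0; c < UJ; ++c)
                    C[(i + r) * n + j + c] = acc[r][c];
        }

        /* Column cleanup: the n % UJ leftover columns of these UI rows. */
        for (size_t j = nj; j < n; ++j)
            for (size_t r = 0; r < UI; ++r)
                for (size_t k = 0; k < n; ++k)
                    C[(i + r) * n + j] += A[(i + r) * n + k] * B[k * n + j];
    }

    /* Row epilogue: the n % UI leftover rows, all columns. */
    for (size_t i = ni; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}
```

For N = 2048 the column cleanup is empty (2048 is a multiple of 4) and the row epilogue handles the last 2 rows (2048 = 3 * 682 + 2).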

Could you suggest a way to do this?