Tiling - need advice on L3 cache

Hello,

I’m playing with tiling in MLIR, and I’ve tried to take advantage of the L3 cache of a Core i7. I get rather paradoxical results; maybe someone can explain to me what is happening. The case study is a dense matrix multiplication C += A*B with A:2048x2048xf64 and B:2048x2048xf64.

In a nutshell, inside MLIR I get virtually no gains (code attached at the end of the message). If I write the code in C and compile it with clang and gcc (-O3), I also get nothing, unless I transpose the B matrix, in which case not only does the original code run faster, but L3 tiling also gains a nice 38% speedup.

On the target architecture the L3 cache is shared between cores (and thus between threads and the OS) and has 8 MB for 4 cores. I stopped all memory-intensive tasks before taking measurements.

I tiled over index j (the columns of the output matrix). The resulting loop nest looks like:

   for (j0 = 0; j0 < 16; j0++)
     for (i = 0; i < D0; i++)
       for (j1 = j0*128; j1 < (j0+1)*128; j1++)
         for (k = 0; k < D2; k++)
           ...

The objective was to keep the 128 columns of B resident in L3 for the entire traversal of A.
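For concreteness, here is a minimal C sketch of the tiled nest described above, assuming a transposed copy of B (Bt[j][k] == B[k][j]); the identifiers and fixed sizes are illustrative, not the exact file I measured:

/* Minimal sketch of the j-tiled loop nest above, assuming 2048x2048 f64
   matrices and a transposed copy of B (Bt[j][k] == B[k][j]). Illustrative only. */
#define N    2048
#define TILE 128

void matmul_tiled_l3(const double A[N][N], const double Bt[N][N], double C[N][N]) {
  for (int j0 = 0; j0 < N / TILE; j0++)                /* 16 tiles of 128 columns */
    for (int i = 0; i < N; i++)                        /* full traversal of A per tile */
      for (int j1 = j0 * TILE; j1 < (j0 + 1) * TILE; j1++)
        for (int k = 0; k < N; k++)
          C[i][j1] += A[i][k] * Bt[j1][k];             /* contiguous tile access */
}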

C compilation is done using “CC -O3 -ffast-math matmul.c”.
MLIR compilation and execution is done using:
mlir-opt --convert-linalg-to-affine-loops --memref-dataflow-opt --affine-loop-unroll-jam --lower-affine --convert-loop-to-std -convert-std-to-llvm --canonicalize gemm.mlir | /Users/dpotop/svn/llvm-mlir/llvm-project/build/bin/mlir-cpu-runner -O3 -e main -entry-point-result=void -shared-libs=…/mylib.dylib

Can someone tell me if these results are normal? My objective is to be able to use tiling (and for that I need to understand how it works).

Best regards,
Dumitru

PS: here is the MLIR code I’m manipulating:

> // C += A * B, basic implementation
> func @matmul(%A: memref<2048x2048xf64>, %B: memref<2048x2048xf64>, %C: memref<2048x2048xf64>) {
>   affine.for %arg3 = 0 to 2048 {
>     affine.for %arg4 = 0 to 2048 {     
>       affine.for %arg5 = 0 to 2048 {
>         %a = affine.load %A[%arg3, %arg5] : memref<2048x2048xf64>
>         //%b = affine.load %B[%arg5, %arg4] : memref<2048x2048xf64>
>         %b = affine.load %B[%arg4, %arg5] : memref<2048x2048xf64>
>         %ci = affine.load %C[%arg3, %arg4] : memref<2048x2048xf64>
>         %p = mulf %a, %b : f64
>         %co = addf %ci, %p : f64
>         affine.store %co, %C[%arg3, %arg4] : memref<2048x2048xf64>
>       }
>     }
>   }
>   return
> }
> 
> // C += A * B, tiled for L3 on dimension 1 (j)
> #map0 = affine_map<(d0) -> (d0*32)>
> #map1 = affine_map<(d0) -> (d0*32+32)>
> func @matmul_tiled(%A: memref<2048x2048xf64>, %B: memref<2048x2048xf64>, %C: memref<2048x2048xf64>) {
>   affine.for %arg0 = 0 to 64 {
>     affine.for %arg3 = 0 to 2048 {
>       affine.for %arg4 = #map0(%arg0) to #map1(%arg0) {
>         affine.for %arg5 = 0 to 2048 {
>           %a  = affine.load %A[%arg3, %arg5] : memref<2048x2048xf64>
>           // %b  = affine.load %B[%arg5, %arg4] : memref<2048x2048xf64>
>           %b  = affine.load %B[%arg4, %arg5] : memref<2048x2048xf64>
>           %ci = affine.load %C[%arg3, %arg4] : memref<2048x2048xf64>
>           %p  = mulf %a, %b : f64
>           %co = addf %ci, %p : f64
>           affine.store %co, %C[%arg3, %arg4] : memref<2048x2048xf64>
>         }
>       }
>     }
>   }
>   return
> }
> 
> func @main() {
>   %A = alloc() : memref<2048x2048xf64>
>   %B = alloc() : memref<2048x2048xf64>
>   %C = alloc() : memref<2048x2048xf64>
> 
>   %cf1 = constant 1.00000e+00 : f64
> 
>   linalg.fill(%A, %cf1) : memref<2048x2048xf64>, f64
>   linalg.fill(%B, %cf1) : memref<2048x2048xf64>, f64
>   linalg.fill(%C, %cf1) : memref<2048x2048xf64>, f64
> 
>   %t0 = call @rtclock() : () -> (f64)
>   call @matmul_tiled(%A, %B, %C) : (memref<2048x2048xf64>, memref<2048x2048xf64>, memref<2048x2048xf64>) -> ()
>   %t1 = call @rtclock() : () -> (f64)
> 
> 
>   // call @print_memref_2d_f64(%C): (memref<2048x2048xf64>) -> ()
>   %ci1 = constant 17179869184 : i64 // Number of flops to compute
>   call @print_flops(%t0, %t1, %ci1): (f64,f64,i64) -> ()
>   return
> }
> 
> func @print_memref_2d_f64(memref<2048x2048xf64>)
> func @print_flops(f64,f64,i64)
> func @rtclock() -> (f64)

Keep in mind that performing L3 tiling alone will give you a really limited improvement over untiled code. If you are tiling for only one level, it might just be better to tile for L2. Second, optimizing for misses on only one of the matrices will give you limited mileage or none. You may want to check what’s happening to the other accesses: your end result will almost entirely depend on the weakest link.

The passes you are running don’t include anything that performs scalar replacement for the accesses to %C (%C[%arg3, %arg4] is invariant in the innermost loop of your snippet). I don’t think LLVM will eliminate those loads/stores and keep the value in a register, but you can confirm by looking at opt’s output or at the innermost loop in assembly with:

.... | mlir-translate -mlir-to-llvmir  | opt -O3  -S | llc -O3 

If the replacement happened, there shouldn’t be any store operations left in the innermost loop. Such scalar replacement doesn’t exist in MLIR upstream - my branch here has it (-affine-scalrep).
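In C terms, the transformation being described amounts to something like the following (an illustration of the effect, not the output of any particular pass; the names are made up):

/* Scalar replacement of the loop-invariant C[i][j] access: the load/store is
   hoisted out of the k loop and the accumulation stays in a register.
   Arow/Btrow point to one row of A and one row of the transposed B, and
   Cij points to C[i][j]. */
static void dot_row(const double *Arow, const double *Btrow, double *Cij, int n) {
  double acc = *Cij;              /* single load of C[i][j] */
  for (int k = 0; k < n; k++)
    acc += Arow[k] * Btrow[k];    /* no loads/stores of C inside the loop */
  *Cij = acc;                     /* single store of C[i][j] */
}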

Finally, the -affine-loop-unroll-jam pass is a test pass - it should actually be moved out of the test passes. It has the mechanics of unroll-and-jam, but no cost model to drive it yet.

Thanks, that starts to clear things up. I started with L3 because I wanted to apply a systematic approach and understand what happens.

To perform the scalar replacement, I used the loop-carried values (iter_args) of loop.for. No change. I also removed -affine-loop-unroll-jam (no change either). Here is the resulting code:

func @matmul_tiled2(%A: memref<2048x2048xf64>, %B: memref<2048x2048xf64>, %C: memref<2048x2048xf64>) {
  affine.for %arg0 = 0 to 32 {
    affine.for %arg3 = 0 to 2048 {
      affine.for %arg4 = #map0(%arg0) to #map1(%arg0) {
        %acc_init = affine.load  %C[%arg3, %arg4] : memref<2048x2048xf64>
        %zero = constant 0 : index
        %one = constant 1 : index
        %bound = constant 2048 : index
        %acc = loop.for %arg5 = %zero to %bound step %one iter_args(%acc_current = %acc_init) -> (f64) {
          %a  = load %A[%arg3, %arg5] : memref<2048x2048xf64>
          // %b  = affine.load %B[%arg5, %arg4] : memref<2048x2048xf64> // normal B
          %b  = load %B[%arg4, %arg5] : memref<2048x2048xf64> // transposed B
          %p  = mulf %a, %b : f64
          %acc_out = addf %acc_current, %p : f64
          loop.yield %acc_out : f64
        }
        affine.store %acc, %C[%arg3, %arg4] : memref<2048x2048xf64>
      }
    }
  }
  return
}

How would you tile this matrix multiplication to get at least some gain just from L3 tiling? Given that A and C are traversed linearly, I was expecting some improvement from what I did…

Best,
Dumitru

You don’t have to use a loop.for with yields here. Just use a memref<1xf64> and an alloca to allocate it (even alloc will do), and LLVM’s optimizer will get rid of the loads/stores and eliminate the allocation as well.

A and C may be accessed contiguously, but it’s important to know whether they are coming from memory or from the cache. If one or both of them always comes from memory, you may see only marginal or no improvement from keeping B in the L3 cache. Now, C is already going to be accessed from registers; so that leaves A for you to focus on for improving reuse.

This is exactly what I tried to do: tiling B over D1 (J), taking 128 columns at a time, and multiplying the whole of A with it. As far as I understand, the consequences should be the following:

  • Each row of A is loaded into L3 only once for the multiplication with all the columns of a tile (so each element of A is loaded 128 times less often).
  • Each tile of B is loaded into L3 only once for the multiplication with all the rows of A (so 2048 times fewer loads).

This is because the L3 cache should be able to hold both a tile of B (2 MB) and a row of A (16 KB), or at least that was my assumption. I also tried this with 1 MB tiles of B (64 columns each); the result was the same.
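For reference, the footprints above correspond to (assuming 8-byte f64 elements):

  B tile: 2048 rows x 128 columns x 8 B = 2 MB
  A row:  2048 x 8 B = 16 KB

so a tile plus a row occupy roughly a quarter of the 8 MB shared L3.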

This is my reasoning, and this is why I don’t understand the results.

Maybe it’s just that the L3 cache, due to interference from the other cores, is more unpredictable and cannot be relied upon.

To be sure, you’ll have to use performance counters to check on misses - it is possible that you are reducing the number of L3 misses, but that it’s inconsequential here in the absence of other optimizations. Note that it’s also possible that you have conflict misses between rows of the matrices.

Is there some standard way of reading performance counters? For instance, some library? At least on x86-64/Linux/macOS and ARM/Linux?

OK, I found my error - the tiling was fine, but without copying I had lots of cache line conflicts. L3 tiling plus the (mandatory) copying gives a 5x speed-up.
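For readers trying to reproduce this, the copy-in step looks roughly like the sketch below (same 2048x2048 / 128-column assumptions as before; Bpack and the function name are made up). The compute loops then read Bpack[j1][k] in the innermost k loop, exactly like the transposed-B variant:

/* Sketch of the copy-in (packing) step: the 128 columns of the current tile of
   B are copied, transposed, into a contiguous buffer, so the tile no longer
   interferes with itself in the cache. Illustrative only. */
#define N    2048
#define TILE 128

static double Bpack[TILE][N];                   /* 2 MB contiguous tile buffer */

void pack_B_tile(const double B[N][N], int j0) {
  for (int k = 0; k < N; k++)                   /* walk B row by row ... */
    for (int j1 = 0; j1 < TILE; j1++)           /* ... picking out the tile's columns */
      Bpack[j1][k] = B[k][j0 * TILE + j1];
}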

There’s the Intel VTune profiler, available for free, which I’ve used in the past, but not with mlir-cpu-runner. @dcaballe may have more insights here.

Thanks for the pointers.

I have a further question (to avoid creating another thread):

I’ve now tiled for L3, L2, and L1. I was surprised to see that each additional level gains less and less:

  • no tiling: 67.68s
  • L3 tiling and copying: 12.72s
  • L3 and L2 tiling and copying: 7.56s
  • L3, L2, and L1 tiling and copying: 6.82s

Is this normal? Does it mean that I’ve hit a pipeline bottleneck, and that I have to do other things to unleash more performance, like tiling for registers and loop unrolling?

Dumitru

Could you please say how you are executing things - is this still via mlir-opt ... | mlir-cpu-runner ...? The trend you see is along expected lines - improving cache locality alone is going to give you diminishing returns at that point.
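As an illustration of the register tiling / unrolling mentioned in the question, here is a hand-written C sketch (the 2x2 blocking factor, the transposed Bt and the jt tile offset are all illustrative choices, not a recommendation):

/* Sketch of a 2x2 register tile (unroll-and-jam of i and j): four elements of C
   stay in scalars across the k loop, so every loaded element of A and Bt is
   used twice. Names, sizes and the 2x2 factor are illustrative. */
#define N    2048
#define TILE 128

void matmul_regtile(const double A[N][N], const double Bt[N][N],
                    double C[N][N], int jt) {       /* jt: first column of the tile */
  for (int i = 0; i < N; i += 2)
    for (int j = jt; j < jt + TILE; j += 2) {
      double c00 = C[i][j],   c01 = C[i][j+1];
      double c10 = C[i+1][j], c11 = C[i+1][j+1];
      for (int k = 0; k < N; k++) {
        double a0 = A[i][k],  a1 = A[i+1][k];
        double b0 = Bt[j][k], b1 = Bt[j+1][k];
        c00 += a0 * b0;  c01 += a0 * b1;
        c10 += a1 * b0;  c11 += a1 * b1;
      }
      C[i][j]   = c00;  C[i][j+1]   = c01;
      C[i+1][j] = c10;  C[i+1][j+1] = c11;
    }
}

Together with vectorization, this kind of register-level reuse is typically where most of the remaining headroom is for a matmul.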

Here is my compilation/execution line:

mlir-opt --convert-linalg-to-affine-loops --memref-dataflow-opt --lower-affine --convert-loop-to-std -convert-std-to-llvm --canonicalize gemm.mlir | /Users/dpotop/svn/llvm-mlir/llvm-project/build/bin/mlir-cpu-runner -O3 -e main -entry-point-result=void -shared-libs=…/mylib.dylib

We had some code to enable VTune for MLIR, but it was more of a temporary hack. It modified LLVM’s ExecutionEngine. It’s here: Support vtune profiling of mlir generated code · NervanaSystems/ngraph@fb82f20 · GitHub. @jbobba could provide further info, if needed.

Could you also use perf?
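For the earlier question about counters: on Linux, something along these lines gives a first picture of the cache behaviour (the binary name is a placeholder for whatever you are measuring, and the LLC events are not exposed on every CPU/kernel):

perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses ./matmul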

Hello all!
This question is mostly addressed to Mr. Bondhugula:

Is such a nice technique as this scalar replacement going to be implemented in MLIR upstream?

There are currently certain compile-time efficiency issues with it (pass running time) to be addressed - it’s not as fast as one would expect it to be on large unrolled bodies in a production compiler. Otherwise, I do intend to submit it upstream. It’ll integrate well with the upstream -affine-scalrep pass.

Are you going to extend this technique so that it would be able to process memref::LoadOps/memref::StoreOps together with AffineLoadOps/AffineStoreOps?

Not really - this is only meant for affine read/write ops (can be extended to affine.vector_load/store) just like the rest of affine-scalrep upstream. (Its access/subscript equality checking won’t work for unrestricted load/store ops.)

Okay, thank you for the answer

Please, when implementing this pass, give the user the opportunity to scalar-replace only the specific loop nests they want processed, similar to the discussion here: Can't AffineNormalize the desired loops only.
Thanks.