Hello,
I’m playing with tiling in MLIR, and I’ve tried to take advantage of the L3 cache of a Core i7. I get rather paradoxal results, maybe someone can explain me what happens. The case study is a dense matrix multiplication C+=A*B with A:2048x2048xf64 and B:2048x2048xf64.
I a nutshell, inside MLIR I get virtually no gains (code attached at the end of the message). If I write the code in C and compile it with clang and gcc (-O3) I also get nothing, unless I transpose the B matrix, in which case not only the original code runs faster, but L3 tiling gains a nice 38% speedup.
On the target architecture the L3 cache is shared between cores (and thus between threads and OS) and has 8Mo for 4 cores. I have stopped all memory-intensive tasks before measures.
I tiled over index j (the columns of the output matrix). The resulting loop nest looks like:
for(j0=0;j0<16;j0++) for(i=0;i<D0;i++) for(j1=j0*128;j1<(j0+1)*128;j1++) for(k=0;k<D2;k++)
The objective was to let the 128 columns of B remain in L3 over all the traversal of A.
C compilation is done using “CC -O3 -ffast-math matmul.c”.
MLIR compilation and execution is done using:
mlir-opt --convert-linalg-to-affine-loops --memref-dataflow-opt --affine-loop-unroll-jam --lower-affine --convert-loop-to-std -convert-std-to-llvm --canonicalize gemm.mlir | /Users/dpotop/svn/llvm-mlir/llvm-project/build/bin/mlir-cpu-runner -O3 -e main -entry-point-result=void -shared-libs=…/mylib.dylib
Can someone tell me if these results are normal? My objective is to be able to use tiling (and therefore understand how it works).
Best regards,
Dumitru
PS: here is the MLIR code I’m manipulating:
> // C += A * B, basic implementation > func @matmul(%A: memref<2048x2048xf64>, %B: memref<2048x2048xf64>, %C: memref<2048x2048xf64>) { > affine.for %arg3 = 0 to 2048 { > affine.for %arg4 = 0 to 2048 { > affine.for %arg5 = 0 to 2048 { > %a = affine.load %A[%arg3, %arg5] : memref<2048x2048xf64> > //%b = affine.load %B[%arg5, %arg4] : memref<2048x2048xf64> > %b = affine.load %B[%arg4, %arg5] : memref<2048x2048xf64> > %ci = affine.load %C[%arg3, %arg4] : memref<2048x2048xf64> > %p = mulf %a, %b : f64 > %co = addf %ci, %p : f64 > affine.store %co, %C[%arg3, %arg4] : memref<2048x2048xf64> > } > } > } > return > } > > // C += A * B, tiled for L3 on dimension 1 (j) > #map0 = affine_map<(d0) -> (d0*32)> > #map1 = affine_map<(d0) -> (d0*32+32)> > func @matmul_tiled(%A: memref<2048x2048xf64>, %B: memref<2048x2048xf64>, %C: memref<2048x2048xf64>) { > affine.for %arg0 = 0 to 64 { > affine.for %arg3 = 0 to 2048 { > affine.for %arg4 = #map0(%arg0) to #map1(%arg0) { > affine.for %arg5 = 0 to 2048 { > %a = affine.load %A[%arg3, %arg5] : memref<2048x2048xf64> > // %b = affine.load %B[%arg5, %arg4] : memref<2048x2048xf64> > %b = affine.load %B[%arg4, %arg5] : memref<2048x2048xf64> > %ci = affine.load %C[%arg3, %arg4] : memref<2048x2048xf64> > %p = mulf %a, %b : f64 > %co = addf %ci, %p : f64 > affine.store %co, %C[%arg3, %arg4] : memref<2048x2048xf64> > } > } > } > } > return > } > > func @main() { > %A = alloc() : memref<2048x2048xf64> > %B = alloc() : memref<2048x2048xf64> > %C = alloc() : memref<2048x2048xf64> > > %cf1 = constant 1.00000e+00 : f64 > > linalg.fill(%A, %cf1) : memref<2048x2048xf64>, f64 > linalg.fill(%B, %cf1) : memref<2048x2048xf64>, f64 > linalg.fill(%C, %cf1) : memref<2048x2048xf64>, f64 > > %t0 = call @rtclock() : () -> (f64) > call @matmul_tiled(%A, %B, %C) : (memref<2048x2048xf64>, memref<2048x2048xf64>, memref<2048x2048xf64>) -> () > %t1 = call @rtclock() : () -> (f64) > > > // call @print_memref_2d_f64(%C): (memref<2048x2048xf64>) -> () > %ci1 = constant 17179869184 : i64 // Number of flops to compute > call @print_flops(%t0, %t1, %ci1): (f64,f64,i64) -> () > return > } > > func @print_memref_2d_f64(memref<2048x2048xf64>) > func @print_flops(f64,f64,i64) > func @rtclock() -> (f64)