MLIR/Linalg bad performance

Hi all,
After my previous post (thanks @ezhulenev for the reply) I was able to do some benchmarking of a linalg.matmul operation. I am using 1000x1000 matrices and this is the MLIR code I have:

func @main(%A : memref<1000x1000xf32>, %B : memref<1000x1000xf32>, %C : memref<1000x1000xf32>) {

  linalg.matmul ins(%A, %B : memref<1000x1000xf32>, memref<1000x1000xf32>)
                outs(%C : memref<1000x1000xf32>)

  return 
}

I compile with the following command:

mlir-opt test.mlir -convert-linalg-to-loops  -convert-scf-to-std  -convert-std-to-llvm > test.llvm.mlir

Then I wrote a benchmark program that reads the LLVM-dialect file, lowers it to LLVM IR, compiles it, and runs it:

mlir::OwningModuleRef module;
mlir::MLIRContext context;
context.getOrLoadDialect<mlir::LLVM::LLVMDialect>();
loadMLIR(..., module); // similar to the Toy tutorial example
runJit(*module);

In runJit I basically create the ExecutionEngine, look up the entry point, and run the function:

auto maybeEngine = mlir::ExecutionEngine::create(module); // module is the mlir::ModuleOp passed to runJit
auto &engine = maybeEngine.get();                         // unwrap the llvm::Expected
pack_args(...);                                           // pack the arguments into the args array
auto expectedFPtr = engine->lookup(entryPoint);
void (*fptr)(void **) = *expectedFPtr;
(*fptr)(args.data());
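
A minimal sketch of how an optimizing transformer can be passed when creating the engine, following the Toy tutorial (the exact ExecutionEngine::create signature has changed across MLIR versions, so treat this as an approximation rather than the benchmark's actual code):

#include "mlir/ExecutionEngine/ExecutionEngine.h"
#include "mlir/ExecutionEngine/OptUtils.h"

// Run the LLVM -O3 pipeline over the JIT-compiled module, as the Toy
// tutorial does. The argument order follows the Toy tutorial and may
// differ in other MLIR versions.
auto optPipeline = mlir::makeOptimizingTransformer(
    /*optLevel=*/3, /*sizeLevel=*/0, /*targetMachine=*/nullptr);
auto maybeEngine = mlir::ExecutionEngine::create(
    module, /*llvmModuleBuilder=*/nullptr, optPipeline);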

Timing the call to (*fptr), I get a result that is about 100x slower than a naive three-loop C++ program!
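
For reference, the naive three-loop baseline is essentially the textbook triple loop; a minimal sketch (the function name and row-major layout are illustrative, not taken from the original benchmark):

// Naive 1000x1000 single-precision matmul, row-major, computing C += A * B
// (the same accumulate semantics as linalg.matmul).
constexpr int N = 1000;

void naive_matmul(const float *A, const float *B, float *C) {
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j) {
      float acc = C[i * N + j];
      for (int k = 0; k < N; ++k)
        acc += A[i * N + k] * B[k * N + j];
      C[i * N + j] = acc;
    }
}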

I tried different options, different passes, etc., but nothing seemed to help. My main question is: am I doing something wrong? Is something more than the plain computation happening inside the (*fptr) call?

Any insight is more than welcome!

Thank you so much,
Giuseppe

So, after a bit of googling around, I found this amazing tutorial, https://arxiv.org/pdf/2003.00532.pdf, and was able to get substantial performance improvements.

Now the MLIR version is only about 20% slower than a naive three-loop version compiled with gcc -O3. I was wondering whether there is still something I am missing, or whether this is (without further optimization) the best I can achieve.
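
Loop tiling is one of the main transformations that paper applies; in plain C++ it corresponds roughly to the sketch below (the tile size is illustrative and not tuned; the paper combines tiling with packing, unroll-and-jam, and vectorization):

#include <algorithm>

// Cache-tiled variant of the naive loop nest, row-major, computing C += A * B.
constexpr int N = 1000;
constexpr int T = 64; // illustrative tile size, not tuned

void tiled_matmul(const float *A, const float *B, float *C) {
  for (int ii = 0; ii < N; ii += T)
    for (int kk = 0; kk < N; kk += T)
      for (int jj = 0; jj < N; jj += T)
        // Process one tile at a time so the working set stays in cache.
        for (int i = ii; i < std::min(ii + T, N); ++i)
          for (int k = kk; k < std::min(kk + T, N); ++k) {
            float a = A[i * N + k];
            for (int j = jj; j < std::min(jj + T, N); ++j)
              C[i * N + j] += a * B[k * N + j];
          }
}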

Thanks,
Giuseppe

Hi Giuseppe, it's worth having a look here: https://github.com/mmperf/mmperf/blob/main/matmul/matmul-compile/matmul-compile.cpp

Those of us using this downstream in IREE are doing various optimization/search/tuning on top of the base linalg operations, and in addition enable further transformations to tiled-layout matmuls where possible. The mmperf work linked above tracks the approaches used in that work and is a good starting point. Much of this will land upstream eventually, but we try to keep core MLIR general rather than experimental on that front (so it tends to lag behind).

None of those conversion passes performs any optimization; they only lower from one abstraction to the next.

That tutorial uses polyhedral optimization, which is different from, and more or less complementary to, what Linalg does. There is no fully automated approach in-tree right now; IREE and the code @lorenzo_chelini linked are good places to start.

Why compare JIT against AOT? You can compile the .ll files with clang (and comparing against clang -O3 instead of gcc would also give you a better baseline).
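
For example, something along these lines (the driver file name is hypothetical, and exact flags may vary with the MLIR version):

mlir-translate --mlir-to-llvmir test.llvm.mlir -o test.ll
clang -O3 test.ll benchmark_driver.cpp -o benchmark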

Thank you all for the help! Moving to AOT actually made the timings exactly the same (and it's also very convenient, since I can easily read the generated assembly).

At least now I have a starting point :ballot_box_with_check:

I will read/study IREE and the benchmarks @lorenzo_chelini mentioned.

Thank you all once more!