To get to the bottom of GEMM performance quickly, I would like to inject some assembly directly into MLIR. I imagine an opaque operation that calls into a function and passes it pointers to the memory regions. Does MLIR offer such an option?
I added an InlineAsmOp (in the LLVM dialect) a while back.
It is used e.g. here: https://sourcegraph.com/github.com/google/iree/-/blob/iree/compiler/Codegen/LLVMCPU/VectorContractToAArch64InlineAsmOp.cpp?L106
But I think it is definitely a footgun, and I have no expertise using it myself.
Examples would be most welcome, though.
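As a starting point for such an example, here is a minimal C++ sketch of the kind of opaque function described in the question: an inner kernel that receives raw pointers to the memory regions and does its multiply-accumulate through GCC/Clang extended inline assembly. The names `mul_add_asm` and `gemm_kernel` are hypothetical, the kernel is deliberately scalar and unoptimized, and nothing here is taken from the IREE code linked above. The operand constraint strings are the part most relevant to an inline-asm op, since those are where mistakes are easy to make.

```cpp
#include <cstddef>

// Hypothetical helper: returns acc + a * b, computed with inline assembly
// where the target is known, and with plain C++ otherwise.
static inline float mul_add_asm(float acc, float a, float b) {
#if defined(__aarch64__)
  // "+w": acc is read and written in an FP/SIMD register; %s0 prints it as s<N>.
  // fmadd is fused (single rounding).
  asm("fmadd %s0, %s1, %s2, %s0" : "+w"(acc) : "w"(a), "w"(b));
#elif defined(__x86_64__)
  // AT&T syntax, SSE registers ("x"): a *= b, then acc += a (two roundings).
  asm("mulss %2, %1\n\t"
      "addss %1, %0"
      : "+x"(acc), "+x"(a)
      : "x"(b));
#else
  acc += a * b;  // portable fallback
#endif
  return acc;
}

// Hypothetical inner kernel: C (MxN) += A (MxK) * B (KxN), all row-major.
// extern "C" keeps the symbol unmangled so generated code can call it by name.
extern "C" void gemm_kernel(const float *A, const float *B, float *C,
                            std::size_t M, std::size_t N, std::size_t K) {
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      float acc = C[i * N + j];
      for (std::size_t k = 0; k < K; ++k) {
        acc = mul_add_asm(acc, A[i * K + k], B[k * N + j]);
      }
      C[i * N + j] = acc;
    }
  }
}
```

A real kernel would replace the scalar helper with vector instructions and register blocking; the sketch only shows the calling shape and the constraint syntax.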
Hi @nicolasvasilache,
Very cool, thanks!
I totally agree that it is a footgun, but I am looking for a quick way to prototype the inner kernel for GEMM (we can abstract it into high-level transformations at a later stage).
Thank you once more,
Quick heads-up in case this is relevant: I’ve been looking a little deeper into vector.shape_cast and vector.transpose.
There are some inefficiencies that I am working on ironing out.
Thanks for the heads-up! I would also be curious to understand why it runs slower without the transposition.
Anyway, I am mostly focused on the inner kernel for now (and probably for the next month).
Early experiments today showed that I could reach 80% of peak (about the same performance as ACL without prefetching) if I can improve the inner kernel.
I will write a more detailed post about this next week.
Have a nice weekend,