Over the past few weeks, @nicolasvasilache and I have been writing a few MLIR case-study docs on CPU code generation for the Vector dialect, following two principles: (1) build the technology bottom-up, i.e. first make sure one level works really well before building the next level, and (2) keep low-level code generation as architecture-neutral as possible, for example by using generic intrinsics (rather than CPU-specific intrinsics, or even an intermediate, CPU-specific dialect). This enables the LLVM backend to generate good code for e.g. x86-64 and AArch64 alike, with only a few simple parameters changed in the lowering strategies.
So far we have:
AVX512 Codegen for the Vector Dialect Ops
Sparse Matrix Times Vector in the Vector Dialect
Transfer Operations in the Vector Dialect
A Simple Retargetable Matmul Strategy
The docs focus on AVX512, although the principles are more widely applicable. Furthermore, the docs are simple case studies, not fully worked-out academic papers. Nevertheless, if there is general interest, we can post the docs here on this forum (after some internal cleanup). Please let us know if that is something we should invest time in.
I haven’t been following the details of the vector dialect, so I’d love to see this. How extensible is it to architectures with variable-length vectors, and to calls into high-performance numeric libraries?
Thanks! At the moment, the “vectors” in the vector dialect are statically shaped, but we are thinking about how to extend this to variable-length vectors, with an eye on upcoming vector ISAs. Some of the docs indeed compare pure codegen with library calls, either done through alternative codegen paths or just done for comparison purposes.
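To make the static-shape point concrete, here is a minimal sketch (value names are illustrative, using current Vector dialect syntax): every vector type in the dialect carries a compile-time shape.

```mlir
// Every vector type carries a static, compile-time shape;
// %s is an assumed scalar value defined elsewhere.
%v = vector.broadcast %s : f32 to vector<8xf32>      // fixed 8 lanes
%m = vector.broadcast %s : f32 to vector<4x8xf32>    // static 2-D vector
```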
It would be helpful to implement dialect support for AArch64 scalable vectors.
Our hope is to extend the vector dialect for this, rather than introducing a new dialect (the second principle, i.e. keeping even “low-level” code in MLIR as architecture-neutral as possible). Do you foresee any major difficulties with that?
Other than that, I am extremely happy to see the interest! We will start posting PDFs in this thread as the docs get cleaned up.
That’s a very interesting perspective. We’d like to have more information on that intent of the Vector dialect. Would the intended documentation also describe such a philosophical aim for the Vector dialect and some concrete detail too, or perhaps an RFC describing this more? This would enable us to respond better, but we would certainly also prefer to work within the ethos of the Vector dialect.
Would the intended documentation also describe such a philosophical aim for the Vector dialect and some concrete detail too, or perhaps an RFC describing this more?
No, the docs scheduled for posting here are merely a qualitative and quantitative analysis of all the vector ops in simple case-study form. The variable-length part will probably come in the form of an RFC in the future.
But please keep the vector dialect in mind. A lot of the work we are doing is meant so that MLIR can obtain high performance for all backends without committing “too early” to any one backend. For instance, by selecting the right generic intrinsic during lowering to LLVM IR, our hope is that LLVM knows what to do for every possible backend, and for every possible SIMD flavor. So far we have been very happy with how well LLVM fares in that regard.
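As an illustration of that approach (a sketch, not taken from the docs): an architecture-neutral vector.fma lowers to LLVM’s generic fmuladd intrinsic, and each backend’s instruction selector then picks the best SIMD sequence for its target.

```mlir
// Architecture-neutral op in the Vector dialect:
%0 = vector.fma %a, %b, %c : vector<8xf32>

// After lowering to the LLVM dialect, the generic fmuladd intrinsic
// is emitted; every LLVM backend knows how to select on it:
%1 = llvm.intr.fmuladd(%a, %b, %c)
       : (vector<8xf32>, vector<8xf32>, vector<8xf32>) -> vector<8xf32>
```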
The recent SIG presentation on the Vector dialect was quite revealing indeed - a lot of work has gone into it since the last time it was presented in around May and it would be ideal if we can leverage that.
I can’t think of an immediate reason why we shouldn’t do what you advise; however, we need to look at the Vector dialect more closely. We’ve been busy with other parts of our MLIR stack and need to catch up with your progress!
Here is the first document in the series, providing an exploratory qualitative and quantitative analysis of the AVX512 code that is generated for Vector dialect operations.
And here is the second document in the series, focusing on 1-D vector transfers. Note that this much shorter case study really just supplements the first one, conducting a few experiments that did not fit well in the first document.
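For readers unfamiliar with these ops, a minimal sketch of a 1-D transfer (value names are illustrative; the padding value fills out-of-bounds lanes):

```mlir
%f0 = arith.constant 0.0 : f32
// Read 16 contiguous f32 elements starting at %A[%i], padding
// out-of-bounds lanes with %f0:
%v = vector.transfer_read %A[%i], %f0 : memref<?xf32>, vector<16xf32>
// Write the vector back, starting at %B[%i]:
vector.transfer_write %v, %B[%i] : vector<16xf32>, memref<?xf32>
```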
And the third document in the series. This smaller document really started as a supplement to the first case study, focusing on the newly introduced gather and scatter operations, but rather than just looking at microbenchmarks, my passion for sparse computations led me to look at something slightly more interesting.
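The core idiom of that case study can be sketched as follows (illustrative names, current gather syntax): for one row of the sparse matrix, gather the b[j] values at the row’s nonzero column indices.

```mlir
// Pass-through value and an all-true mask for 16 lanes:
%c0   = arith.constant 0 : index
%pass = arith.constant dense<0.0> : vector<16xf32>
%mask = vector.constant_mask [16] : vector<16xi1>
// Load 16 column indices of the sparse row, then gather b[idx]:
%idx  = vector.load %indices[%j] : memref<?xi32>, vector<16xi32>
%vals = vector.gather %b[%c0][%idx], %mask, %pass
    : memref<?xf32>, vector<16xi32>, vector<16xi1>, vector<16xf32>
      into vector<16xf32>
```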
Here is a document benchmarking and analyzing the assembly code generated for three different flavors of matmul micro-kernels targeting AArch64, which achieve 90% of theoretical peak performance. AArch64_Codegen_For_Vector_Dialect.pdf (240.5 KB)
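The heart of such a micro-kernel is an outer-product update of a vector accumulator; a minimal sketch (names illustrative, tile size assumed 8x8):

```mlir
// One rank-1 update of an 8x8 f32 accumulator tile:
// %acc1 = %a_col (outer) %b_row + %acc, where %acc is a
// vector<8x8xf32>; this lowers to a sequence of vector FMAs.
%acc1 = vector.outerproduct %a_col, %b_row, %acc
          : vector<8xf32>, vector<8xf32>
```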