This is a follow-up to:
- transform dialect cheatsheet (link)
- our paper from last year on structured codegen (link)
We are now at a point where we have built enough infrastructure to systematically go after parallel code generation performance on CPU and GPU.
There have been questions recently about how to use some of the structured code generation capabilities.
I have thus started to create a set of examples that demonstrate how to build declarative codegen pipelines for cases of interest. I aim to use them to:
- Demonstrate proper usage of structured codegen with a series of examples (ideally culminating in some sort of tutorial).
- Provide precise explainability of “what happens where” in the declarative transformation pipeline: a full codegen strategy can be “held in one’s hand” in a single MLIR file.
- Allow fast iteration, dissemination and integration of new strategies and ideas without having to build and land complex C++ pipelines into a separate project.
- Provide a simple place and tooling to “bring-your-own” codegen strategy for particular cases of interest.
- Provide a batteries-included experience with end-to-end retargetable execution to any device thanks to integration with IREE.
- Provide a simple way to better scrutinize the abstractions + their layering. As we iterate and find missing pieces for particular purposes, new transformations will be added/refined/generalized.
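To make the “strategy held in one’s hand” idea a bit more concrete, here is a toy sketch in plain Python. It is not MLIR and does not use the transform dialect; it only illustrates the separation between a computation (a matmul) and a small declarative description of how to transform it (tile sizes). All names (`STRATEGY`, `matmul_reference`, `matmul_tiled`) are hypothetical and chosen for illustration.

```python
# Toy illustration only: plain Python, not MLIR and not the transform dialect.
# All names here (STRATEGY, matmul_reference, matmul_tiled) are hypothetical.

# A "strategy" expressed as data: tile sizes for the (i, j, k) loops of a matmul.
STRATEGY = {"tile_i": 2, "tile_j": 3, "tile_k": 4}


def matmul_reference(a, b):
    """Naive triple loop: the untransformed computation."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def matmul_tiled(a, b, strategy):
    """Same computation with the loops tiled according to `strategy`."""
    m, k, n = len(a), len(b), len(b[0])
    ti, tj, tk = strategy["tile_i"], strategy["tile_j"], strategy["tile_k"]
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, ti):
        for j0 in range(0, n, tj):
            for p0 in range(0, k, tk):
                # Inner loops cover one tile; min() handles partial tiles.
                for i in range(i0, min(i0 + ti, m)):
                    for j in range(j0, min(j0 + tj, n)):
                        for p in range(p0, min(p0 + tk, k)):
                            c[i][j] += a[i][p] * b[p][j]
    return c
```

Changing the strategy only means editing the `STRATEGY` dictionary; the computation itself stays untouched, and the tiled result can be checked against the reference for any tile sizes.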
The entry point for a CPU matmul can be found here:
Please do not hesitate to ask for new use cases to be added, finer-grained explanations or even send your own examples.
Note: this is an area of fast experimentation and iteration: the various scripts are expected to change regularly and are not yet systematically tested in CI.
This will stabilize over time.
I’m a bit confused by this post: are you intending to upstream this and looking to discuss how / where to put all this? Or is this simply “advertisement” for work happening in IREE?
It’s not always straightforward to me (and, I suspect, many others) to understand how to apply the amazing work happening e2e in IREE to the upstream development flow (and I actually wish for the opposite, so I am very interested in approaches to build better flows upstream).
The first objective is to socialize some of this and give a place for simple demonstrations of “how do I do X”, esp. as there have been a few questions recently.
Over time, as things mature, I expect the vast majority of codegen to be pure upstream, which also guarantees we don’t have runtime-specific blind spots.
On the CPU side, simple e2e upstream integration should be relatively feasible if there is concrete interest: I did have things hooked up ~1 year ago using python + numpy as the basic programming substrate, to avoid writing boilerplate IR by hand (a.k.a. the basic framework parts), + the threadpool.
If you (and/or someone else) are volunteering, I am happy to help connect more things here.
On the GPU / other targets side, this is a different story: I have not invested in, say, connecting PyTorch or another CUDA-supporting framework to hook things up, but I know from a prior life that this is doable…
Now doing those integrations properly and building a good tool is a very different discussion.
At which point IREE makes a lot more sense than trying to reinvent the wheel (and making it square in the process…)
I am definitely amongst the “many others” on this one. Recently I’ve been trying to flesh out an e2e linalg -> gpu pipeline using only MLIR passes, and I’ve found it pretty rough going, since it seems many things are effectively migrating over time from MLIR to IREE, e.g. various linalg passes and approaches being deprecated in favor of transform, which, while wholly upstream, lacks examples of how to use it in a pipeline that generates e.g. fully bufferized code. Note, this is not an accusation but more a cry for help.
I would (at least tentatively) volunteer but I am indeed interested in keeping it wholly upstream; IREE is great but it’s a whole 'nother assortment of (albeit useful!) concepts.
I’m not sure I understand which terminal endpoint of the pipeline PyTorch would be connected to. Are you saying you want to use e.g. PyTorch as a frontend, lower all the way to NVPTX, and then compare to PyTorch’s native CUDA perf?