I am fairly new to the world of compilation and MLIR, and I confess I am a little overwhelmed by the breadth of it.
I am working on a novel accelerator architecture mainly targeting deep learning workloads and I am trying to figure out how much work can be done at the compilation stage.
The accelerator would comprise a few computing components, each with a different set of operations and its own properties (different latency costs, some could introduce numerical errors, maybe some have SIMD support while others do not, etc.).
The sets of operations differ between the computing components, although their intersection is non-empty.
So optimising a program for this kind of hardware requires taking into account all the costs, memory movements, and so on.
Is it something that MLIR can help me solve?
Has there been any similar work done?
What resources should I look into?
MLIR provides infrastructure to help someone build a compiler for many architectures. And for some more mainstream targets (CPUs and GPUs of various types), it has quite a bit more provided directly. I am aware of multiple users of MLIR who have built compilers for various accelerators, and in my experience, many such accelerators share characteristics like the ones you describe. However, the bespoke and proprietary nature of these accelerators tends to keep them from generalizing or being developed in the open like their CPU and GPU counterparts. As such, there can be little to see publicly. While these designs and the corresponding compilers can be simple, they are often not; building a good compiler for such an architecture can be a large amount of work no matter which tools you are using, MLIR included.
If looking to start somewhere at the high level, I’d recommend looking at torch-mlir or TOSA (and the corresponding support in TFLite to get models in this form). At the lower level, a compiler toolkit for ML devices like IREE provides pluggable backends and higher-level infra for building such compilers (but no public examples exist for such parts). The space is pretty large.
Since you are in the exploratory phase, it might be interesting to check out CIRCT’s scheduling infrastructure. CIRCT uses MLIR to provide tools for EDA: not for targeting novel hardware, but for designing it in the first place.
One task that comes up in some hardware design flows (often using HLS) is to “schedule” computation onto hardware components. This needs to take into account concerns similar to the ones you mentioned, such as components that provide different but potentially overlapping operations, components with different latencies, respecting memory access dependences, etc.
It’s not much more than a proof of concept, but in CIRCT we have one example of scheduling computations from the Affine dialect into a set of provided hardware operator types. That pass and the analyses it uses might be useful: both to show what kind of information the Affine dialect provides, as well as how to use such information to construct a scheduling problem in CIRCT.
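To give a feel for what such a scheduling problem looks like, here is a minimal sketch (purely illustrative, not CIRCT’s actual API; all names and latencies are invented) of ASAP scheduling for a dataflow graph whose operations are linked to hardware operator types with different latencies:

```python
# Hypothetical sketch of an HLS-style scheduling problem (not CIRCT's API):
# each operation is linked to an operator type with a fixed latency, and
# start cycles must respect data dependences (ASAP scheduling).

# Operator types available in the hardware, with invented latencies in cycles.
OPERATOR_LATENCY = {"adder": 1, "multiplier": 3}

# Which operator type implements each operation kind (the "linking" step).
OP_TO_OPERATOR = {"add": "adder", "mul": "multiplier"}

def asap_schedule(ops):
    """ops: list of (name, kind, [dependency names]) in topological order.
    Returns {name: start_cycle} respecting dependence latencies."""
    start = {}
    for name, kind, deps in ops:
        # An operation can start once every one of its producers has finished.
        ready = 0
        for d in deps:
            d_kind = next(k for n, k, _ in ops if n == d)
            ready = max(ready, start[d] + OPERATOR_LATENCY[OP_TO_OPERATOR[d_kind]])
        start[name] = ready
    return start

# Example: r = (a*b) + (c*d); the add must wait for both multiplies.
dfg = [
    ("t0", "mul", []),
    ("t1", "mul", []),
    ("r", "add", ["t0", "t1"]),
]
print(asap_schedule(dfg))  # {'t0': 0, 't1': 0, 'r': 3}
```

A real flow adds resource constraints (a finite number of each operator type) and an objective, which is where the dedicated scheduling infrastructure comes in.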
We’ve often felt the scheduling infrastructure might be useful not just for those designing novel hardware, but also for those targeting novel hardware. But those involved have been focused on specific HLS flows, so the infrastructure is currently incubating in CIRCT alongside those use cases. If you’re interested in learning more, feel free to drop a line in the CIRCT channel of LLVM’s Discord; everyone involved in that project hangs out there.
I think the part I am missing is the path to a low-level IR for my hardware.
Most of the compilers I see lower to LLVM/SPIR-V and hence don’t need to bother much about code generation, which is one of the key parts in my case.
I am not sure yet whether LLVM can suit my needs (representing a heterogeneous computing device); if so, then investing in writing an LLVM backend could make sense.
Otherwise, the path would be to write a low-level IR for my hardware in MLIR, whether with an existing dialect or a new one, and then build a code generator on top of that.
Thank you, I will definitely look into it!
It all depends on what the low-level codegen for your target looks like. For something bespoke/limited like what we often see in the accelerator space, I wouldn’t make an LLVM backend. The question is indeed where you want to “tap off”, and I’ve seen multiple answers to that question. A good question to ask is whether you want to tap off at the high-level tensor “ISA” level (i.e. like MPSGraph on Apple, or TOSA) and take that directly to your hardware, or whether you want some level of vectorization and memory planning done for you and to go from lower-level vector operations on strided buffers, or something in between.
Either of those extremes can be supported purely within MLIR without going to LLVM (and there are some priors). One fringe benefit of staying at that level is that your compiler stays within the range of what can be shipped on device easily, and such compilers tend to be a lot faster than LLVM backends.
Again, it comes back to accelerators being like snowflakes: it is hard to make general recommendations without more details.
Going as low as memory planning; deciding which component should execute each operation, given that the components’ operation sets have a non-empty (but different) intersection; taking into account that if an operation is followed by another one that can execute on the same component, keeping both there may be faster than running the first on the fastest component, which cannot execute the second; etc.
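To make that trade-off concrete, here is a small illustrative sketch (all component names, latencies, and costs invented) of the assignment problem for a straight-line chain of operations: a dynamic program that picks a component per operation, balancing per-op latency against a transfer penalty when consecutive operations land on different components:

```python
# Illustrative sketch (invented numbers): assign each op in a chain to a
# component, minimizing total latency plus a penalty for moving data
# between components. Components support overlapping but different op sets.

COMPONENTS = {
    "fast_alu": {"add": 1, "mul": 2},           # fast, but cannot do "exp"
    "vector_unit": {"add": 2, "mul": 3, "exp": 5},
}
TRANSFER_COST = 4  # cost of moving data between components

def assign(chain):
    """chain: list of op kinds executed in sequence.
    Returns (total_cost, [component per op]) via dynamic programming."""
    # best[c] = (cost so far, assignments) given the last op ran on c
    best = {c: (0, []) for c in COMPONENTS}
    for op in chain:
        nxt = {}
        for c, lat in COMPONENTS.items():
            if op not in lat:
                continue  # this component cannot execute op
            nxt[c] = min(
                (cost + lat[op]
                 + (0 if prev == c else TRANSFER_COST * bool(path)),
                 path + [c])
                for prev, (cost, path) in best.items()
            )
        best = nxt
    return min(best.values())

# "mul" alone is cheapest on fast_alu, but since "exp" only runs on the
# vector_unit, keeping the whole chain there avoids a transfer:
# 3 + 5 = 8, versus 2 + 4 + 5 = 11 when switching components.
print(assign(["mul", "exp"]))  # → (8, ['vector_unit', 'vector_unit'])
```

Real memory planning is of course much richer than this, but this is the shape of the cost model the question describes.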
The code format is still an open question, but right now I am only investigating how much of that low-level optimisation can be done within MLIR.
I am still curious though about the existence of tooling for writing a purely MLIR-based codegen.
In short, MLIR provides a lot of tooling for this, and my experience is that the more unique and unlike a CPU the platform is, the less likely it is that lowering to LLVM IR will be a good experience. Typically you want to hook into something higher level and lower from there. This is what (Vulkan) SPIR-V does. We have some other examples lying around of special-purpose things, but like I said: since these are often tied to specialty hardware that is quite unique, there is not a lot in the public repos to point you to.
Practically, MLIR is missing algorithms for some of the things that LLVM does: instruction selection, alias analysis, etc. It isn’t for lack of a desire but lack of a strong enough central need.
Just as a thought exercise, I would look at a few levels of abstraction and ask yourself which is the closest to what you would like to “take all the way”:
In IREE, we have a couple of other “bottom layers” that we take things to: VM, VMVX. I don’t think these are particularly relevant to you, but I present them as two alternative/specialty flows that do not go through LLVM-IR but do produce executable artifacts.
Depending on the level of abstraction of your hardware, you may find it beneficial to go all the way to the top of the stack. Or you may find that it is useful to “cut in” lower down. There is not a one size fits all answer that I have yet seen.
Oh right, I was thinking that instruction selection would be the way to solve the problem of “different components with a non-empty intersection of operations”; that’s too bad.
Well, it is just a lack of need, as you say; if I ever really need it and implement it, I will be happy to contribute it to the community.
This is all very informative, thank you. I really appreciate that you took the time to answer.