CIRCT/MLIR dialect(s) for on-the-fly compiler generation for programmable heterogeneous (AI) accelerators?

Hi everyone,

I’m reposting something I originally posted on the Calyx forums here, since I feel this forum may be a better fit. I would appreciate your help and insights on the proposal below. Thanks!

Problem

Creating a compiler for a programmable heterogeneous SoC with (AI) accelerators is difficult and labour-intensive, and it almost always happens after the fact, when the hardware is (almost) taped out.
I find this quite odd, since the way a compiler should target and optimize for hardware is implicitly captured in the hardware description (e.g. in how many processing elements an accelerator has, how many cores a GPU has, or how big a CPU’s cache is).

Proposed solution

I would like to make the hardware-compiler interface explicit in the hardware description with a DSL (or perhaps one or multiple MLIR/CIRCT dialects?) so that two things can come out of the compiler:

  1. A more concrete/detailed low-level hardware description (like Calyx, or maybe even (System)Verilog?).
  2. A very detailed target description which serves as an input for a compiler middle-end and back-end.
    We are currently interested in using this on-the-fly generated compiler to ingest neural network descriptions so the created hardware can target TinyML workloads.

I’m not proposing an HLS flow here. In an HLS flow a single algorithm gets lowered to an efficient hardware implementation. I’m proposing a flow in which you create or mark certain parts of the hardware so that a code-generating compiler is produced on the fly.
IIUC you could compare this to the ESI dialect, except that you are creating an interface between compilers and programmable hardware, instead of an interconnect between different hardware blocks.
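
To make the "one description, two outputs" idea a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: `PEArray`, `emit_hardware_stub` and `emit_target_description` are illustration names only, the emitted text is a placeholder rather than real Calyx or (System)Verilog syntax, and none of this is an existing CIRCT/MLIR API.

```python
from dataclasses import dataclass

# Hypothetical single source of truth for one programmable block.
@dataclass(frozen=True)
class PEArray:
    name: str
    rows: int
    cols: int
    supported_ops: tuple  # operations a compiler may map onto this block


def emit_hardware_stub(arr: PEArray) -> str:
    """Output 1: placeholder text standing in for a lower-level hardware description."""
    lines = [f"component {arr.name} {{"]
    for r in range(arr.rows):
        for c in range(arr.cols):
            lines.append(f"  pe_{r}_{c} = processing_element();")
    lines.append("}")
    return "\n".join(lines)


def emit_target_description(arr: PEArray) -> dict:
    """Output 2: facts a compiler middle-end/back-end could consume."""
    return {
        "name": arr.name,
        "num_pes": arr.rows * arr.cols,
        "ops": sorted(arr.supported_ops),
    }


# Both artifacts are derived from the same description, so they cannot drift apart.
array = PEArray("toy_array", rows=2, cols=2, supported_ops=("mac", "relu"))
hardware_text = emit_hardware_stub(array)
target = emit_target_description(array)
```

The point of the sketch is only that the hardware text and the target description share one origin. In a real flow each emitter would presumably be a lowering pass rather than a string or dict builder.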

Rationale

This proposed solution should enable us to:

  • Easily build acceleration hardware, other than common CPUs or GPUs, that can ingest common ML framework workload descriptions.
  • Easily adapt currently existing compilation frameworks (like TVM or MLIR) to existing or newly created hardware.
  • More accurately characterize the hardware at design time for different ML workloads.
  • Perform automated design space exploration for programmable hardware.

Discussion

I already got great suggestions on how to get started with this project from @rachitnigam and @adrian in the aforementioned Calyx forum post, but I’m still looking for opinions from other CIRCT developers, since I am convinced that this is something where the heterogeneous and extensible nature of the MLIR/CIRCT infrastructure is really required and could/should really shine.

Currently I think the systolic array generator of Calyx and TVM’s VTA accelerator project are very interesting directions to dig deeper into. Please let me know if you have any other suggested (existing) projects, CIRCT/MLIR dialects, or other thoughts.

Thank you very much!

Best regards!

Just to clarify: what you’re looking for at the end of the day is a compiler to translate some ML code to run on some already (or mostly) defined hardware? Presumably to be used for co-design of the accelerators and ML workloads?

The general idea of creating hardware along with the compiler to map “software” into it sounds interesting. ML in particular has some well-defined boxes which can simplify the problem significantly. I think this is one area where CIRCT/MLIR could really shine!

FYI: We (Microsoft) will shortly be working on high-level, CIRCT-based systolic array functionality to help our physical design efforts on FPGAs. We’re exploring lowering from the affine dialect, automating the scheduling with the CIRCT scheduling framework, and doing post-placement pipelining between the PEs. I’ll be giving a talk at LATTE on this and a few other PD topics.

I agree with this sentiment, and I think this is a really exciting area to explore. I don’t know the TVM stuff as well, but I will give some thoughts on the concrete proposal:

I’m curious what you’d want to capture, along the lines of @adrian’s earlier comment. Are there dialects in CIRCT that represent the things you care about? If not, could this be a place to create a new dialect, and provide transformations to lower it into CIRCT’s dialects?

Since you mentioned a “more concrete/detailed” description, I think of lowering. One good option might be to define your dialect like I mentioned above and lower it to a level of abstraction in CIRCT. This could be Calyx, a hypothetical systolic-array representation, the HW/Comb/Seq dialects, whatever captures the details you care about. You could potentially transform this dialect further, export it to other tools, emit SystemVerilog, or anything else CIRCT can help with.

We have had some discussions around target descriptions for hardware, but that was more about things like how many LUTs are on an FPGA, not necessarily what you are looking for here. One thing that comes to mind is ILA, which we had a talk about in the CIRCT ODM. Perhaps this or something like it could be useful. Again, the question is what information do you hope to capture with this representation?

Hi @jdd and @mikeurbach,

Thanks for your comments and your enthusiasm!

I’d like to point you to two hardware platforms that were recently published by our research group which I think should be targetable by this (set of) dialect(s):

  • DIANA: Dense ML workload processor with a RISC-V control core, Scalable precision Digital Accelerator and low-precision/high efficiency Analog in-memory compute core.
  • DPU: Sparse ML workload processor with 64 compute units supporting scalable posit arithmetic which operate according to a load-store mechanism. A separate compiler was developed for this platform, which is described here.

I think both of these platforms can be regarded, at a high level, as a set of homogeneous or heterogeneous accelerator or CPU cores with a certain memory hierarchy.

I’ve tried to come up with some goals which should be narrowed down (@mikeurbach I’ve tried to inline some existing MLIR dialects here, but I’m not experienced, so my assumptions of what they can do might be wrong):

  • It should be able to describe the capabilities of individual accelerators. E.g. the analog core of DIANA does not have the same ISA as the digital core, which is quite important in scheduling. ILAng indeed seems like a very good candidate for this. I think this can also serve as a description for how low-level codegen should be performed on these platforms.
  • It should be able to deal with multi-core (e.g. DPU) and heterogeneous (e.g. DIANA) platforms.
    I think the MLIR infrastructure readily supports this.
  • It should capture how the memory management/synchronization/communication is done. E.g. DPU has explicit memory barrier insertion. I believe a few dialects already exist in MLIR that effectively capture (part of) this, e.g. acc, omp and async, but I might be wrong here.
  • It should be possible to describe the amount of memory and the hierarchy available on the platform to better guide scheduling and memory tiling. I think this is similar to @jdd’s earlier post on FPGA hardware constants.
  • We wish to kickstart compiler-in-the-loop, so that code can already be compiled for very silly accelerator designs at a very early stage, because we think iterative/agile hardware design is really important. E.g. extending the program synthesis idea mentioned in ILAng might also be interesting for creating simple hardware templates.
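
As a rough illustration of the kind of information these goals ask for, here is a minimal Python sketch of a target description with per-core capabilities, a memory hierarchy, and two queries a scheduler might make. All names (`Core`, `MemoryLevel`, `Platform`, the op names) and all numbers are hypothetical; this is a toy platform, not a description of DIANA or DPU, and not an existing MLIR/CIRCT or ILAng API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryLevel:
    name: str
    size_bytes: int


@dataclass(frozen=True)
class Core:
    name: str
    ops: frozenset                        # operations this core can execute
    needs_explicit_barriers: bool = False  # sync model the compiler must honour


@dataclass
class Platform:
    cores: list
    memories: list  # ordered from closest-to-compute outward

    def cores_supporting(self, op: str) -> list:
        """Names of the cores whose capability set contains `op`."""
        return [c.name for c in self.cores if op in c.ops]

    def largest_square_tile(self, level: str, bytes_per_element: int,
                            num_buffers: int) -> int:
        """Largest N such that `num_buffers` N x N tiles fit in `level`."""
        mem = next(m for m in self.memories if m.name == level)
        elements_per_buffer = mem.size_bytes // (bytes_per_element * num_buffers)
        return math.isqrt(elements_per_buffer)


# Toy platform with made-up numbers.
platform = Platform(
    cores=[
        Core("ctrl", frozenset({"add", "branch"})),
        Core("digital", frozenset({"conv2d", "matmul"})),
        Core("analog", frozenset({"matmul"}), needs_explicit_barriers=True),
    ],
    memories=[MemoryLevel("l1", 64 * 1024), MemoryLevel("l2", 512 * 1024)],
)

matmul_cores = platform.cores_supporting("matmul")
# Three int8 buffers (two inputs, one output) sharing the 64 KiB level.
tile = platform.largest_square_tile("l1", bytes_per_element=1, num_buffers=3)
```

Here `matmul_cores` is `["digital", "analog"]` and `tile` is 147, since three 147 x 147 one-byte tiles fit in 64 KiB while three 148 x 148 tiles do not. A real description would of course need much richer capability semantics, which is where something like ILAng could come in.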

Kind of yes, but I’d like to stress again that we are not looking for a single algorithm deployment on this hardware, since we deploy to ASICs instead of FPGAs and we cannot reconfigure in between applications. So it’s really important for us that we can try out multiple ML algorithms already at design time. I guess a commercial product that comes to mind with a similar flow is Synopsys’ ASIP Designer, which allows users to design their processor core in a language named nML. I’ve never used ASIP Designer, but a colleague of mine told me that it only supports single-core CPUs, and we would like to go beyond that.

Yes, I agree that it’s not fully what I’m looking for, but I think it’s quite similar in the sense that the properties mentioned there are also mostly structural, and it’s not super clear if these properties can be or should be lowered/optimized into some other form.

Thanks for sharing, but I cannot open this link (“this recording does not exist”). I think ILAng can indeed be a very important component in this proposed flow (also see my above comments), yet I would like to use it for compiler construction as opposed to verification.

Thanks for attempting to write down those goals, I know that is often way harder than it seems. I find your stated goals quite interesting, and I think your understanding of what exists in the MLIR ecosystem is spot on. I’m especially interested in this notion of “compiler in the loop”.

My question at this point is what comes next? We generally follow the MLIR and LLVM development guidelines, so you are welcome to post an RFC on this forum to start speccing out a new dialect. If it’s still early for that, perhaps we could discuss these efforts informally at an upcoming open design meeting, and that could bring some feedback from the broader community.

My two cents: it might be worth looking into an MLIR dialect for ILAng. I’m not sure if anyone has explored this, but it seems like we agree something like that could be useful for this work, and could be a useful abstraction for other, similar flows.