[RFC] MDL: A Micro-Architecture Description Language for LLVM

TL;DR:

We’ve created a DSL and compiler for modeling micro-architecture that handles a very broad class of architectures - CPUs, GPUs, VLIWs, DSPs, ML accelerators, and embedded devices. This effort grew out of a need to quickly develop and experiment with high-quality compilers and tools to facilitate rapid architecture exploration. We named the DSL “MDL” for “Microarchitecture Description Language”.

While significantly more expressive than the TableGen Schedules and Itineraries used in LLVM, MDL is also more concise and simpler to read and write, and it supports a much broader class of embedded and accelerator architectures. We can currently _generate_ MDL descriptions automatically for all upstream targets, and in many cases they are 1/10 the size of the equivalent TableGen descriptions. We’ve integrated this with LLVM, and are sending out this RFC because we believe it could be valuable to the larger LLVM community.

The MDL compiler, associated tools, and documentation are available as open source (on GitHub, in the “work” branch of MPACT-ORG/llvm-project). We would like to explore adding this to the LLVM project, and we encourage contributions from others.

Background

Over the last few years, we have been using LLVM to develop a compiler backend for Google’s TPU machine learning accelerators. TPUs have complex microarchitectures and pose a number of challenges that are not seen in typical LLVM targets:

  • Clustered VLIW with partitioned register files
  • Extremely deep pipelines with complex hazard conditions
  • Instructions with functional-unit-specific and/or cluster-specific behaviors
    • Non-trivial and/or instance-specific latencies
    • Complex resource usage
    • Functional-unit-specific register constraints
  • Shared/allocated encoding resources (instructions need 1…M of N resources)
  • Explicitly managed hardware resources (register ports, internal datapaths, busses, etc.)

While some of these problems manifest in a few upstream targets, this collection of problems is a superset of the problems directly addressed by LLVM - Schedules and Itineraries are simply not sufficient to model everything. Supporting this class of architecture is therefore code-intensive - it takes around 20,000 lines of C++ code to model the TPU sub-targets. That code is brittle and hard to write, debug, test, and evolve over time. In contrast, the MDL description for these sub-targets is ~2,000 lines of text.

Status

  • We’ve created the MDL language and compiler for describing microarchitecture details, a methodology for integrating it with TableGen files for any target, and a set of APIs that can be used in a machine-independent way to inform back-end passes such as bundle-packing, instruction scheduling, and register allocation.
  • To facilitate integration with LLVM, we built a tool which scrapes architectural information from TableGen files, and produces our MDL language for all upstream targets.
  • We’ve modified the CodeGen and MC libraries to (optionally) use our methodology for latency management.

There is a lot more to do. For example, we plan to enhance existing back-end scheduling passes and register allocation passes to cleanly handle a larger class of embedded and accelerator architectures, based on MDL-generated information.

We welcome feedback on the language design and associated tools and use model. You can find the MDL design documentation in our GitHub repo in llvm/docs/Mdl.

-Reid


I wonder if this can improve the precision of MCA, given the fact that it can create a model with finer granularity (@adibiagio @RKSimon).

(Also, I just tried to build the repo to see what MDL looks like. In case anyone here also wants to try it, please build LLVM with RTTI.)

How are the VLIW instruction scheduling restrictions implemented, e.g. that X and Y cannot go into the same bundle/packet?

@reidtatge This looks interesting, and anything that can reduce the pain of maintaining the more complex aspects of arch modelling would be awesome. Do you have examples of the definitions that are consumed by MdlCompiler? Is there a reason you use ANTLR4 instead of tblgen?

It can provide more detailed information, although for current upstream targets it doesn’t provide any information beyond what’s already in tablegen (since the descriptions are scraped from tablegen). But you’re right - we can provide a lot more information than tablegen can, and this could be used to build tools like MCA for more complex architectures.

(BTW, you shouldn’t have to build with RTTI, so I’m wondering why you had to do that.)

The language is designed to allow you to describe the overall architecture in such a way that we can derive bundle-packing attributes about each instruction. In your example, if X and Y can’t be bundled together, typically it’s because they both use a particular resource: a functional unit, an issue slot, encoding bits, or some other hardware component. You can also define abstract shared resources that allow you to define arbitrary constraints on classes of instructions. The MDL has first-class language structures for things like functional units, issue slots, and register ports, but you can define your own too.

So, the MDL compiler builds a “database” of instruction behaviors, and we have a solver (in MDLBundle.h) which determines not only whether a set of instructions can be issued in parallel, but also which resources each instruction in the bundle would use, what their latencies would be, and any additional register constraints that the bundling implies.
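As a rough illustration of the idea (not the actual solver in MDLBundle.h, and not MDL syntax), here is a toy Python sketch of resource-based bundle packing. It assumes each instruction lists the functional units it could issue on and how many slots it needs from a shared pool; the instruction names, unit names, and the `pack_bundle` helper are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical instruction description: (name, candidate functional units,
# number of slots needed from a shared pool such as encoding bits).
# Instruction names are assumed to be unique within a bundle.
Instr = Tuple[str, List[str], int]


def pack_bundle(instrs: List[Instr], pool_size: int) -> Optional[Dict[str, str]]:
    """Return {instruction: unit} if all instructions fit in one bundle, else None."""
    # The shared pool is a simple counted resource: reject if oversubscribed.
    if sum(need for _, _, need in instrs) > pool_size:
        return None

    assignment: Dict[str, str] = {}
    used = set()

    def solve(i: int) -> bool:
        # Backtracking search: give instruction i any free candidate unit.
        if i == len(instrs):
            return True
        name, units, _ = instrs[i]
        for unit in units:
            if unit not in used:
                used.add(unit)
                assignment[name] = unit
                if solve(i + 1):
                    return True
                used.remove(unit)
                del assignment[name]
        return False

    return assignment if solve(0) else None


x = ("X", ["ALU0"], 1)
y = ("Y", ["ALU0"], 1)
z = ("Z", ["ALU0", "ALU1"], 2)

print(pack_bundle([x, y], pool_size=4))  # None: X and Y both need ALU0
print(pack_bundle([z, x], pool_size=4))  # {'Z': 'ALU1', 'X': 'ALU0'}
print(pack_bundle([z, x], pool_size=2))  # None: pool needs 3 slots, has 2
```

In this toy model, "X and Y cannot share a bundle" is never stated as a rule; it falls out of both instructions requiring the same unit, which is the point made above.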

There’s some related reading in llvm/docs/Mdl/BundlePacking.md.
Hope this helps!


Hi Simon. Take a look at llvm/docs/Mdl/MachineDescriptionNotes.md. It’s meant to be a “users’ guide” for the MDL language, and has lots of simple examples, as well as a complete grammar for the language, and a complete RISCV (generated) description. As I mentioned in the RFC, this was built to handle much more complex architectures (TPUs), but unfortunately I can’t publish a TPU machine description (it’s proprietary), which would better demonstrate the power of the language. But the docs describe most of this stuff with simple examples. If you build the repo, it will generate descriptions for AArch64, AMDGPU, ARM, Hexagon, Lanai, Mips, PowerPC, RISCV, Sparc, SystemZ, and X86 - all the targets that have Schedules and/or Itineraries. They’re generated into build/lib/Target//.mdl (the same place that tablegen “.inc” files are generated to). Caveat: since these are generated files, they’re not particularly pretty, but they will give you some examples to look at.

Why Antlr rather than Tablegen? It seems I ought to make a tablegen joke here, but seriously: I wanted a concise, more expressive, purpose-built language that didn’t require the writer to explicitly connect a lot of information, and I couldn’t find a clean way to do that in tablegen. To be honest, I also wanted a very low-touch methodology - I wasn’t thinking that major hacks to tablegen were something the community wanted to see! Antlr allowed us to experiment with and build a first-class language that was much easier to use and integrate with llvm.

It throws some link time errors complaining about “undefined reference: typeinfo for <some LLVM symbols from LLVMSupport>” when building the mdl tool. I’m not sure if this is caused by the fact that I’m using ANTLR4 4.7.2 (the one bundled with Ubuntu 20.04, I did make some tweaks in your cmake file).

Yeah, the current version of Antlr is 4.11.1, so that’s a bit old, and there was a significant change at some point. I’m using 4.10. It’s possible that the older versions of Antlr used RTTI, and the newer ones don’t. I’ll look into it. In the meantime, you might want to just download the latest release of Antlr.

Can this also be used for the disassembler functions in LLVM, i.e., going from bytes to assembler instructions? If so, would it also provide extra information about the disassembled instruction, such as timings?

Yes, the exact same API is available for MachineInstr and MCInst. So if you disassemble to MCInst, it just works.

This looks very interesting! Do you have a timeline for when you will be using this in production?

I tried this for AMDGPU but the generated files are almost empty. The only non-comment lines are:

AMDGPU_instructions.mdl:
family AMDGPU;

AMDGPU.mdl:
import "AMDGPU_instructions.mdl"
protected phases AMDGPU { F1, E[1..1] };

Hmm, that’s exactly what it would do with an empty input file. Did you do this by hand (run tablegen, then run tdscan) or build clang? I can check to see if there’s a bug in the clang cmake.

It was fully integrated with a proprietary processor compiler well over a year ago, and extensively tested against the production version of that compiler. That processor exercised all the features of the language and the MDL compiler, so we’re quite confident in the robustness and utility of the language and tools for latency management, resource management, and bundle packing (the three things we were focused on).

The integration with upstream targets isn’t complete. While we can scrape and correctly compile descriptions for all the upstream targets, and have fully integrated the latency management, we haven’t finished integration of the resource management and bundle packing code into the CodeGen and MC libraries. That work is ongoing. We are quite confident that the latency management infra is working well in that context, and expect to have the rest integrated in the next 3-6 months.

I hope that kinda answers your question.


Can you please explain how this hooks into the existing scheduling schemes, i.e. SchedMachineModel or Itineraries? Or does it work at a lower level?

That’s quite interesting. I’ve just started reading the MDL documentation, and maybe my question is actually covered somewhere deeper in the docs, but still.

We have a family of customizable processor cores. Usually, there is a “baseline” core, and some derivative cores that differ in minor details - e.g., a different FPU, or a different memory subsystem configuration (resulting in different resources and latencies), and so on. Currently we use TableGen to assemble scheduling models for such cores from parameterized blocks. So, a typical core model looks something like this (just to give a basic idea):

def Foo1Model : FooSchedModel { ... }

let SchedModel = Foo1Model in {
  def FOO_ALU : ProcResource<1>;
  def FOO_MDU : ProcResource<1>;
  def FOO_BRU : ProcResource<1>;
  def FOO_LSU : ProcResource<1>;
  def FOO_FPU : ProcResource<1>;
  
  defm : FooBRU<FOO_BRU>;
  defm : FooALU<FOO_ALU>;
  defm : FooMDU<FOO_MDU, /* ... MDU parameters ... */ >;
  defm : FooLSU<FOO_LSU, /* ... LSU parameters ... */ >;
  defm : FooFPU<FOO_FPU, /* ... FPU parameters ... */ >;
}

where FooXXX are multiclasses defining common building blocks for the Foo processor family.

Is something similar doable in MDL?

It’s actually an alternative to SchedMachineModels and Itineraries. The motivation is that SchedMachineModels and Itineraries aren’t expressive enough to handle aspects of some accelerators, and extending them to do so quickly becomes unwieldy, so we created a DSL to succinctly describe those architectures, while also supporting existing upstream targets cleanly.

Great question. Yes, you can do that, and it doesn’t even look that different, at least at the top level:

     cpu Foo1 {
           func_unit FOOBRU FooBRU();
           func_unit FOOALU FooALU();
           func_unit FOOMDU FooMDU(/* MDU parameters */);
           ...
     }
     cpu Foo2 {
           func_unit FOOBRU FooBRU();
           ...
     }

That’s where the language similarities mostly end. Functional units can be clustered, tied to issue slots, and specialized for the resources and register classes they use. The way we tie instructions to functional units and describe their pipeline behavior is quite different from tablegen.

But to answer your question directly, the language was specifically designed to easily handle differences between processor family members.
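To make the "family member" idea concrete outside of either TableGen or MDL, here is a small Python sketch of a baseline core and a derivative that overrides one parameterized block. It is only an analogy: the `FuncUnit`, `Cpu`, and `derive` names and the `mul_stages` parameter are hypothetical and do not correspond to MDL constructs.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class FuncUnit:
    """A reusable building block; `params` stands in for unit-specific settings."""
    template: str
    params: Dict[str, int] = field(default_factory=dict)


@dataclass
class Cpu:
    """A family member: a named collection of functional unit instances."""
    name: str
    units: Dict[str, FuncUnit]


def derive(base: Cpu, name: str, **overrides: FuncUnit) -> Cpu:
    """Copy the baseline's units and replace only the ones that differ."""
    units = dict(base.units)
    units.update(overrides)
    return Cpu(name, units)


foo1 = Cpu("Foo1", {
    "FOOBRU": FuncUnit("FooBRU"),
    "FOOALU": FuncUnit("FooALU"),
    "FOOMDU": FuncUnit("FooMDU", {"mul_stages": 3}),  # hypothetical parameter
})
# Foo2 differs from the baseline only in its multiply/divide unit.
foo2 = derive(foo1, "Foo2", FOOMDU=FuncUnit("FooMDU", {"mul_stages": 4}))

changed = sorted(k for k in foo2.units if foo2.units[k] != foo1.units[k])
print(changed)  # ['FOOMDU']
```

The baseline is left untouched, and the derivative states only what differs from it.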

In TableGen SchedModels, instruction latencies are just numbers. So if, say, a different processor configuration targeting a higher frequency has an extra stage in a multiplier (while other uarch details, such as pipeline bypasses and so on, are mostly unchanged), in TableGen we can just pass latencies as constructor parameters.
From the commit “First commit of MDL integration changes” (llvm/llvm-project@ca16ce2), it looks like MDL uses a more detailed pipeline model.
How would you handle such a case in MDL?
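To make the contrast in this question concrete, here is a small Python sketch. It is neither MDL syntax nor the MDL authors' answer. It assumes a simplified convention in which a producer writes its result in one named pipeline phase, a consumer reads its operand in another, and the latency is the phase distance plus one. The phase names follow the `F1, E[1..1]` style seen in the generated files earlier in the thread; the `latency` helper is hypothetical.

```python
from typing import List


def latency(phases: List[str], def_phase: str, use_phase: str) -> int:
    """Issue-to-issue distance needed so the consumer reads after the producer writes.

    Assumed convention: latency = index(def) - index(use) + 1, never negative.
    """
    return max(0, phases.index(def_phase) - phases.index(use_phase) + 1)


# Baseline core: the multiplier result is written in E3, operands are read in E1.
base = ["F1", "E1", "E2", "E3"]
# Higher-frequency variant: one extra execute stage, so the write moves to E4.
fast = ["F1", "E1", "E2", "E3", "E4"]

print(latency(base, "E3", "E1"))  # 3
print(latency(fast, "E4", "E1"))  # 4
```

Under this assumption, the extra multiplier stage shows up as a change in which phase the result is written, and the latency number is derived rather than passed in directly.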