[RFC] MDL: A Micro-Architecture Description Language for LLVM

Yes, there are a few ways to do that. We explicitly didn't support parameterizing functional units with pipeline phases because we felt it didn't scale very well: it works well for a few pipeline changes, but beyond that you end up with a long list of undifferentiated constants passed as parameters.

So first some basic ideas:
We model pipelines a little differently than LLVM. First, you explicitly name the phases of a pipeline, much like an enum definition, then refer to these names when you define a pipeline behavior. Here’s a trivial example:

    phases MY_CPU { E1, E2, E3 };

which defines a simple 3-stage pipeline. Like an enum, you can assign values to these:

      phases MY_OTHER_CPU { READ, EXECUTE, WRITE=6, WRITE_MULTIPLY=42 };

In this case, the phases have the following values: READ=0, EXECUTE=1, WRITE=6, and WRITE_MULTIPLY=42.

You then use these names in latency template definitions to define when things happen:

   use(READ, $src); 
   def(WRITE, $dst);
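
For example, a complete latency template using the MY_OTHER_CPU phases above might look roughly like this (the Load name and the $addr operand are made up for illustration):

    latency Load() {
        use(READ, $addr);    // the address operand is read in the READ phase (cycle 0)
        def(WRITE, $dst);    // the result register is written in the WRITE phase (cycle 6)
    }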

So, the MDL allows you to define a pipeline with global scope, used by all CPUs. And you can define CPU-specific pipelines, which override the global definitions. So, for example:

     phases GLOBAL { READ, EXECUTE, WRITE, WRITE_SLOW=5 }

     cpu MY_CPU1 {
               phases MY_PHASES { WRITE_SLOW=10 };
     }
     cpu MY_CPU2 {
                phases { WRITE_SLOW=20 }
     }
     ....
     latency Multiply() {
                def(WRITE_SLOW, $dst);
     }

The other method is to explicitly handle it in the latency template on a CPU-by-CPU (or functional-unit-by-functional-unit) basis:

     latency Multiply() {
                MY_CPU8 : def(WRITE, $dst);
                MY_CPU9 : def(WRITE_SLOWER, $dst);
     }

BTW, you can also do some very basic arithmetic on the phase names, if you prefer that:

     latency Multiply() {
                MY_CPU12 : def(WRITE, $dst);
                MY_CPU5 : def(WRITE+4, $dst);
     }

Any of these methods produces exactly the same information in the compiler output, so there's no "good" or "bad" way to do it.

Does that make sense?

Yes, it makes sense. As I read the docs more, it looks more like a "structural" ADL. I have a feeling that the "default" level of abstraction may actually be somewhat lower than in "operand scheduling". I understand where it comes from (clustered VLIW with complex pipelines). But I'm somewhat worried that it would result in tighter coupling between the ISA definition and the processor scheduling model.

I have to admit I haven't actually built your branch yet, where, supposedly, you generate MDL descriptions equivalent to the TableGen models. Maybe looking into those generated descriptions would answer more questions than I have right now.

Let's consider a simple example - e.g., a simple RISC load-store ISA that has ALU instructions ("load immediate", "move", "add", etc.) with identical scheduling behavior, a mul instruction, and load and store instructions. Now, you have 2 different CPUs with the same ISA - Foo and Bar. They are both single-issue in-order processors, with fully pipelined ALU, MUL and LSU. Instruction latencies are:

                      Foo  Bar
ALU                    1    1
mul                    2    3
load/store             2    4

(Suppose Bar is targeting a higher frequency and has memory protection that adds an extra stage to the LSU pipeline.)
Now, suppose you want to decouple the Foo and Bar models from the ISA description.
How would you do that in MDL?

Using either of the two methods I described earlier, that's pretty easy (I'll use the first method here). Note that in this example most names can be overloaded:

     cpu Foo {
            phases { ALU=1, MUL=2, LOAD=3 }
            func_unit Unit FooUnit();
     }
     cpu Bar {
            phases { ALU=1, MUL=3, LOAD=4 }
            func_unit Unit BarUnit();
     }
     func_unit Unit() {
            subunit ALU();
            subunit MUL();
            subunit LOAD();
      }
      subunit ALU() { latency ALU(); }
      subunit MUL() { latency MUL(); }
      subunit LOAD() { latency LOAD(); }

      latency ALU() { def(ALU, $dst); }
      latency MUL() { def(MUL, $dst); }
      latency LOAD() { def(LOAD, $dst); }

Not shown: the ALU instruction definitions (in tablegen) need to specify that they use an ALU subunit, the MUL instructions that they use a MUL subunit, and the LOAD instructions that they use a LOAD subunit.
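
For illustration, here is a rough sketch of what those bindings surface as on the MDL side, using the instruction syntax that appears later in this thread (the operand lists are made up):

      instruction ADD (GPR rd(O), GPR rs1(I), GPR rs2(I)) { subunit(ALU); }
      instruction MUL (GPR rd(O), GPR rs1(I), GPR rs2(I)) { subunit(MUL); }
      instruction LOAD(GPR rd(O), GPR rs1(I))             { subunit(LOAD); }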

Regarding coupling:
For a simple RISC CPU, the level of abstraction is equivalent to SchedModels, but we have the capability to model things at a much finer granularity if we need to, which is why it's relatively easy to convert SchedModels and Itineraries to MDL descriptions. But the way we connect an instruction description to the CPU definition is rather different, and I think it provides more separation than SchedModels. So let me explain that:

Greatly simplified: In SchedModels, each instruction has a set of SchedReadWrite resources for each processor it’s valid on. These are either defined in the instruction, or provided via InstRW records. Each resource is (or can be) tied to a SchedModel, and each SchedWrite resource can be associated with a set of ProcResources. Resources also have latencies associated with them.

So, using InstRW records and associated SchedReadWrite resource sets, each instruction definition can be tied to a set of processors, and on each processor it can be tied to specific functional units (ProcResources), and we can determine latencies.

I think of this as two layers (instruction definitions and SchedModels), glued together with InstRW records. So we could think of that as three levels of hierarchy, with much of the burden carried by InstRWs.

In MDL, the schema looks like this:

A CPU definition describes the functional units it implements:

         cpu CPU1 {   <functional unit instances>   }

A functional unit template definition describes the “subunits” it implements:

        func_unit FU(<parameters>) {   <subunit instances>  }

A subunit template definition describes which latencies it implements:

       subunit SU(<parameters>)   {   <latency instances>   }

And a latency template definition describes a pipeline behavior:

       latency LAT(<parameters>) {  <uses and defs of operands and resources>  }
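
Putting the four levels together, a minimal unparameterized chain looks something like this (the names are made up; it's essentially the Foo/Bar example above in compressed form):

       phases MY_CPU { E1, E2, E3 };

       cpu CPU1       { func_unit FU fu0(); }              // CPU instantiates a functional unit
       func_unit FU() { subunit SU(); }                    // functional unit instantiates a subunit
       subunit SU()   { latency LAT(); }                   // subunit instantiates a latency
       latency LAT()  { use(E1, $src); def(E2, $dst); }    // latency describes the pipeline behavior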

Tablegen instruction definitions specify one or more subunits they can execute on. We added a new tablegen instruction attribute “SubUnits”, which is a list of subunit names that an instruction can use, for example:

       def ADD : <attributes...>, SubUnits<[A, B, C]>;

which asserts that ADD can run on subunits A, B, or C. We scrape instruction information from tablegen, and generate our MDL representation of that:

      instruction ADD(<operand definitions>) { subunit(A, B, C); }

(Note that in the repo, we didn’t modify the tablegen descriptions for upstreamed targets. We simply generated all the subunit information from the schedule information). The SubUnits class is trivial:

     class SubUnitEncoding<string name> { string subunit_name = name; }
     class SubUnits<list<SubUnitEncoding> subunits> {
       list<SubUnitEncoding> SubUnits = subunits;
     }

So, a few things to note about this approach:

  • Instructions aren’t directly tied to functional units or processors. They are only tied to subunit template definitions, which are specialized and instantiated in one or more functional units, which are specialized and instantiated on one or more processors.
  • A subunit template abstractly represents a particular behavior which is typically shared by a set of instructions. The details of that behavior are provided in the specified latency instance.
  • This model doesn't really use "instruction latencies" as a thing. We have a more formal definition of the pipeline phases, and describe when registers and resources are accessed by phase. From that (and forwarding information) we derive latencies (a rough sketch follows below).
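
Roughly speaking, if a producer defs its result in a later phase than a consumer uses that operand, the difference is the latency. For example (an illustrative sketch, not taken from a real target):

       latency MUL() { def(E4, $dst); }                   // producer writes its result in E4
       latency ALU() { use(E1, $src); def(E2, $dst); }    // consumer reads its source in E1
       // If an ALU op consumes the MUL result, the derived latency is E4 - E1 = 3 cycles,
       // reduced further when forwarding between the units applies.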

Another way to think about this is that we consider the pipeline to be part of the CPU's behavior, not the instructions' behavior. The latency templates map instruction operands to the CPU's pipelines. As you point out, we use actual operand names to do that, rather than operand indexes. Generally, I think that's safer than using indexes: we can trivially handle reordered operands, and if you rename an operand without updating the CPU model, we flag it with an error check.


Is there some way to reuse a group of pipeline stage definitions between some CPUs with the same ISA? I’m looking for something like a “matrix” of processor configurations, defined by shared implementations of components such as FPU, memory subsystem, and so on, and, as always, some variations in uarch details here and there.

Yes, absolutely. You can have a globally shared pipeline definition, and each CPU can override parts of it, or add CPU-specific stages. Functional unit definitions, subunit definitions, and latency definitions are also shared across all architectures, and can be optionally specialized for each instance.
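
A rough sketch of what that sharing could look like (all names here are made up):

     phases GLOBAL { FETCH, EX, MEM, WB, FP_WB=6 };

     func_unit FPU() { subunit FADD(); subunit FMUL(); }     // shared FPU implementation
     func_unit LSU() { subunit LOAD(); subunit STORE(); }    // shared memory subsystem

     cpu CoreA {
           func_unit FPU fpu0();
           func_unit LSU lsu0();
     }
     cpu CoreB {
           phases OVERRIDES { FP_WB=8 };    // same components, but a slower FP write-back
           func_unit FPU fpu0();
           func_unit LSU lsu0();
     }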

I’ve finally got some free cycles to take a deeper look at MDL.

I’m getting compiler errors when building your branch, a bunch of errors like

lib/Target/AArch64/AArch64GenMdlInfo.inc:61121: error: could not convert ... from ‘<brace-enclosed initializer list>’ to ‘llvm::mdl::CpuTableDict’ {aka ‘std::map<std::__cxx11::basic_string<char>, llvm::mdl::CpuInfo>’}

Still, MDL files are generated.
BTW, I suppose there is a separate target that would just generate MDLs - what is it called?

I'd propose generating separate files for individual processor definitions and related subunits. I understand that those generated MDLs were probably not intended to be maintained manually, but maybe that would be a better showcase for what you want manually written MDL files to look like, and a better learning tool.

In the generated files for RISC-V I see the following:
RISCV_instructions.mdl:

instruction MUL(GPR  rd(O), GPR  rs1(I), GPR  rs2(I)) {
     subunit(sub32,sub41);
     // "mul $rd, $rs1, $rs2"
}

RISCV.mdl:

subunit sub32() {{ def(E3, $rd); fus(E3, SiFive7PipeB); }}
// ...
subunit sub41() {{ def(E4, $rd); fus(E4, RocketUnitIMul); }}

That coupling between the ISA and uarch definitions doesn't look good. Imagine that you have several independent vendors using the same ISA, maintaining a bunch of processor models, and introducing new ones every now and then. The current TableGen model might be a bit wordy here, but it doesn't require such coupling.

Yeah, I just noticed some compile/cmake problems. I’ll push a fix shortly.

Regarding targets: there is a build target that generates the MDL files for each relevant target: TdScan<target>.
So, “make TdScanARM” should build the ARM mdl files.

Regarding associating instructions with a uarch definition: a few thoughts…

You make a good point that different CPU vendors may want a separable way of associating each instruction with their CPU uarch - without modifying the instruction definitions. It's quite easy to replicate the InstRW approach in the MDL language. I had thought about doing that, but I personally don't love the approach - adding or renaming instructions can create duplications or omissions, but I guess I could check for that. But given that it's what people are used to, I can understand a preference for that approach. I'll take a look at adding that, and using it when I generate descriptions.

The way I'm generating MDL files isn't the way I'd generally recommend writing them by hand, so I agree that looking at generated MDL files isn't terribly instructive, although it does provide an overview of the different components of the language. The primary goal of tdscan was to test the integration of the MDL approach for upstream targets. I wouldn't expect upstream targets to necessarily switch to it, since the MDL's real goal is to support targets that can't be easily described in tablegen.

The prescribed way of hand-writing a description is "top-down". Each CPU defines what functional units it instantiates. Each functional unit template defines what subunits it implements. Each subunit defines which latency it uses. Instructions define which subunits they use. This works essentially like template expansion: a functional unit is specialized for each instance in each CPU, and in turn specializes its subunits, which in turn specialize their latencies, which are specialized on a per-instruction basis.

In the case where functional units, subunits, and latencies don’t need to be parameterized and specialized (which is true for most upstreamed targets), tdscan omits explicit functional unit template definitions. Instead, each CPU defines which functional units it implements, and each subunit associates itself with a functional unit. Internally, we simply infer each functional unit template’s definition. This results in a much smaller description, albeit less flexible.

Finally, regarding generating CPUs into separate files: TdScan automatically merges identical behaviors, so I'd have to disable that, which is probably easy. As you point out, there would be a lot of the same duplication that currently exists in tablegen for separate targets, but perhaps that's ok if it helps people understand the language better. FWIW, it wouldn't impact the generated database, since the MDL compiler also merges duplicate information.

Anyway, thanks for sharing your ideas!

Yes, that's the only usage scenario for us, and for many other CPU vendors, I suppose. We use a shared ISA (RISC-V in our case) and build custom CPU cores on top of it. The ISA description can be really big. CPU models, on the other hand, are usually quite compact.

If you come up with something better than the current SchedMachineModel we would be glad to adopt it. An ADL designed from scratch definitely has more possibilities for that than a generalized data table description language.

From what you've said before, I had an impression that phases can be viewed as some sort of SchedRead/SchedWrite equivalents: the ISA definition can use those phases as abstract names of a sort, and particular CPU definitions can define particular timings as they see fit. Did I get it right?
If that's true, then it looks like the only place where the ISA definition is coupled with the CPU definition now is resource usage. If we could get rid of that coupling somehow…

So phases are literally just symbolic names of pipeline phases; they are independent of instruction definitions and have no relationship to SchedReadWrite resources.

In tablegen, you associate SchedReadWrite resources with instructions - either in the instruction definition, or with InstRW records on a subtarget-by-subtarget basis. You then associate read and write latencies with each resource, represented simply as an integer constant.

In MDL, we use a completely different approach for tying latencies, functional units, etc. to instructions. Each instruction is associated with one or more independent subunit names. A subunit definition is part of the architecture spec, and is an abstract representation of the behavioral class of the instruction - without specifying the behavior - and is typically shared by a set of instructions. The architecture description ties the subunit to functional units, which are tied to CPU definitions. It's also tied to pipeline behaviors (part of the architecture spec), which are described in latency definitions (or inlined into subunit definitions, which is done in generated descriptions).

So I definitely think we have the separation you're looking for, but maybe not exactly the syntactic sugar you'd like. While tablegen instruction definitions currently specify a set of SchedReadWrite resources (which can be supplemented by InstRW records), in our approach the instruction instead specifies a subunit name. The architectural spec implements the behavioral classes, independently of the instruction definitions. Subunits can be specialized for different CPUs or functional units, which I think is what you're looking for. But today, that specialization syntax may not provide the level of separation you'd like to see. And that makes sense, and is easy to fix (since it's a new DSL!).

So, let me play around with the syntax a bit, and I’ll write up some examples of how we could do it.

I just pushed an update (to GitHub - MPACT-ORG/llvm-project at work) with some new syntax that adds an alternate way to associate instructions with instruction behaviors, similar to what InstRW does. This update also:

  • cleans up the generated CPU information to streamline the integration with instruction scheduling,
  • implements forwarding in the latency management code,
  • does a much better job of handling the various scenarios that arise with Itineraries, particularly when combined in a single target with SchedModels,
  • updates the language documentation,
  • and fixes a bunch of minor bugs.

One enhancement is that you can now associate a subunit template with a set of instructions, specified the same way an InstRW works:

     subunit my_add : "ADD*" (...) {...}
     subunit my_sub : "SUB*" : "SUBX*" (...) {...}

You can also base a subunit on another subunit, inheriting all of its instruction bindings:

     subunit my_add_sub : my_add : my_sub (...) {...}

There's also a command-line flag for tdscan (the tablegen scraper) that will generate these for a target (--gen_base_subunits). But it's strictly optional.

Dimitri, I think this addresses your concerns about having to modify instruction definitions for different targets. This methodology allows you to build up arbitrary hierarchies of instruction groups, which is useful in a single target or across targets.
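
For example (hypothetical file and unit names, using the binding syntax above), a vendor could keep all of its instruction bindings in a file of its own, without touching the shared instruction definitions:

     // vendor_cpu.mdl - a hypothetical vendor-maintained file
     subunit vendor_alu : "ADD*" : "SUB*" : "MOV*" () { latency ALU(); }
     subunit vendor_mul : "MUL*" ()                   { latency MUL(); }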

I've pushed a significant update (to GitHub - MPACT-ORG/llvm-project at work) which integrates the MDL infrastructure with the LLVM schedulers. I've also cleaned up the build process so that you can build with or without the MDL stuff. When enabled, it's used by default, so I've added an llvm command line flag to explicitly disable it.

The new CMake flag is LLVM_ENABLE_MDL, which indicates you want to build a compiler that uses MDL: -DLLVM_ENABLE_MDL=ON. It is "OFF" by default, which disables the directives that scrape tablegen, compile MDL files, and include the generated MDL database in the MC and Target libraries. Since this has been integrated with the CodeGen, MC, and Target libraries, to avoid clutter the integration code is still in those libraries, but it's effectively disabled (since it is always predicated on the command-line flag).

The llvm command line flag is --noschedmdl, which disables use of the MDL infrastructure (in builds where it was included).

I've modified all the schedulers to optionally use MDL APIs vs Sched or Itin APIs. Swing integration is still in progress - I'd like to expand the class of architectures Swing can handle (with MDL), so that will be in the next push. MachineScheduler occasionally generates slightly different schedules when using MDL vs SchedModel. The MDL database has the exact same information as the SchedModel infra, but it's organized differently, so I suspect I'm presenting the data to the heuristics incorrectly, or at least in a different order. I'll continue to investigate that.

There are also a number of changes to the MDL language to more cleanly support out-of-order processors. These produced cleaner and more accurate CPU information. I’ve updated the documentation quite a bit to include these new capabilities.

Please take a look and let me know what you think! If anyone would like to help, let me know!


I am just wondering - I have an architecture where the microarchitecture can recognize a sequence of instructions and then execute the entire sequence as if it were a single instruction. Right now, I am only doing this for simple loop constructs. Roughly speaking:

 for( i = 0; i < max; i++ )
 {
       looping statements
 }

gets translated into:

     MOV     Ri,#0
loop_top:
     looping statements
     ADD     Ri,Ri,#1
     CMP     Rt,Ri,Rmax
     BLT     Rt,loop_top
loop_exit:

Where the ADD-CMP-BR sequence is "performed" in 1 cycle.

So how would one use this top-down strategy to properly tell MDL that these 3 instructions, seen sequentially, have a total cost of 1 rather than 3 (or 2 + branch latency)? And that when the 3 instructions are not seen sequentially, they each have their normal cost?

So it's not uncommon for superscalar processors to fuse logically adjacent instructions in a reorder buffer, but what you're proposing is different from a superscalar issuing/executing them in parallel, and it's definitely not VLIW. The MDL currently has direct support for the latter (superscalar and VLIW), but I'd have to think about how to model the former.

You could certainly model it as a superscalar machine, like ARM, X86, etc, with forwarding between the functional units so they could issue in parallel, but restricting that to a few instruction sequences would probably be tedious. In general, it's not really necessary to explicitly model it, since the LLVM schedulers don't have a way of dealing with that explicitly. But if the instructions must be adjacent in order to be fused, I think I'd be inclined to just model the sequence as a single instruction (add_cmp_blt). But that doesn't scale very well if you have a more general pipeline fusion capability.
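
If you did go the single-instruction route, a sketch (made-up names, using the instruction/subunit syntax from earlier in the thread) could be:

     // A fused pseudo-instruction that retires in one cycle, like a plain ALU op.
     instruction ADD_CMP_BLT(GPR rt(O), GPR ri(I), GPR rmax(I)) { subunit(ALU); }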

Sorry, I don’t know if this answers your question. :-/

As we work on using TableGen to generate disassemblers for Capstone in the auto-sync project, we found that some of the necessary information is sometimes missing in the TableGen files; moreover, x86 has a different model from most other architectures. See more of the changes to the LLVM code here: GitHub - capstone-engine/llvm-capstone: llvm with tablegen backend for capstone disassembler (and unmerged PRs). In addition, the current TableGen code generation for instruction decoding is quite clunky and inflexible; see the discussion at ⚙ D138323 [TableGen] RegisterInfo backend - Add abstraction layer between code generation logic and syntax output

I wonder if MDL could help alleviate these problems in the future. cc @Rot127

A few thoughts:

Currently, the MDL instruction descriptions are solely for the purpose of tying an MDL-based microarchitecture description to instructions as defined in TableGen. That said, of course we could expand the language to capture more information about each instruction, including encoding and decoding information, assembly formatting rules, semantic rules, and general instruction attributes. In fact, a previous, proprietary implementation of MDL did exactly that kind of thing. In the context of LLVM, TableGen already does all those things, so we didn’t consider that a priority, but rather focused on making it easier to model much more complex microarchitectures. But we could scrape all of the instruction information from TableGen descriptions - much like we do for scheduling information today - and represent it in an expanded MDL model.

I haven't had a chance to go over your project in detail, but maybe you can clarify for me what your goals are. Do you want to read compiled object code and produce MCInst records, or do you just want to produce a string representation of an encoded instruction? Are you interested in modeling VLIW architectures? Things can get very messy in that class of machine. What about stateful machines (like MIPS and ARM/Thumb) where you need some context to decide how the instructions were originally encoded?

Do you want to read compiled object code and produce MCInst records, or do you just want to produce a string representation of an encoded instruction? Are you interested in modeling VLIW architectures? Things can get very messy in that class of machine. What about stateful machines (like MIPS and ARM/Thumb) where you need some context to decide how the instructions were originally encoded?

Yes to all these questions - both MCInst structure and a string representation. We are definitely interested in both stateful and VLIW architectures too. I agree that this is a somewhat ambitious and complex task, but we have already progressed quite well for ARM (classic and modern AArch64), PPC, Tricore, and some other architectures.

Thanks.

How does this interface with existing TableGen files? In the RISCV backend for example we have many TableGen files which is the result of a lot of engineering effort:

RISCVCallingConv.td    RISCVInstrInfoA.td  RISCVInstrInfo.td             RISCVInstrInfoXCV.td       RISCVInstrInfoZc.td       RISCVInstrInfoZicond.td  RISCVRegisterInfo.td        RISCVScheduleV.td
RISCVFeatures.td       RISCVInstrInfoC.td  RISCVInstrInfoVPseudos.td     RISCVInstrInfoXSf.td       RISCVInstrInfoZfa.td      RISCVInstrInfoZk.td      RISCVSchedRocket.td         RISCVScheduleZb.td
RISCVInstrFormatsC.td  RISCVInstrInfoD.td  RISCVInstrInfoVSDPatterns.td  RISCVInstrInfoXTHead.td    RISCVInstrInfoZfbfmin.td  RISCVInstrInfoZvfbf.td   RISCVSchedSiFive7.td        RISCVSystemOperands.td
RISCVInstrFormats.td   RISCVInstrInfoF.td  RISCVInstrInfoV.td            RISCVInstrInfoXVentana.td  RISCVInstrInfoZfh.td      RISCVInstrInfoZvk.td     RISCVSchedSyntacoreSCR1.td  RISCV.td
RISCVInstrFormatsV.td  RISCVInstrInfoM.td  RISCVInstrInfoVVLPatterns.td  RISCVInstrInfoZb.td        RISCVInstrInfoZicbo.td    RISCVProcessors.td       RISCVSchedule.td

Are you suggesting that we get rid of all the existing TableGen files and redo them using MDL? Are you suggesting that we'd only need to redo the scheduling related parts, and if that's the case, will MDL and TableGen be able to interact with each other? For example, Pseudo instructions that describe the MIR have scheduling information attached to them. Teasing scheduling apart from everything else may be difficult if the two languages cannot work together.

The improvement in description size is impressive, which surely lowers the code size of the compiler itself.

Have you collected any data on how existing TableGen scheduler models compare to MDL scheduler models with respect to the quality of scheduling? By quality of scheduling I mean the performance impact between a “dumb” TableGen model and a “smart” MDL model.

The MDL isn’t really meant to replace any part of tablegen. Using MDL is just an alternative way to write SchedModels and Itineraries, in terms of normal tablegen instructions, operands, etc.

The way we do that is we have a tool that scrapes information about instructions, operands, registers, SchedModels, and Itineraries. From that information, we produce a new description of all the schedule information for all instructions and pseudo-instructions - anything that has scheduling information - and have a simple methodology to tie the information back to MachineInstrs and MCInstrs. So MDL coexists with Schedules and Itineraries - you can choose which approach you want to use.

So for all existing targets with scheduling information, we scrape tablegen and generate an MDL description of the target microarchitecture, and use it instead of the normal SchedModel/Itinerary stuff. We don’t really expect - nor require - people to throw away tablegen descriptions and use the generated MDL-based ones. For existing targets, the MDL doesn’t really add much value (except succinctness in most cases), since the tablegen descriptions already exist and are in production. But it enabled us to test out all this infrastructure for all the current targets without having to write descriptions from scratch.

So the real purpose behind this language is to describe things that TableGen can't effectively describe. One example of that is the work you're currently doing with your Acquire/Release stuff - this is straightforward in the MDL approach. So we're really focused on VLIW-class machines with many deep, complex, and often statically scheduled pipelines. Tablegen really can't scratch the surface of these - Itineraries just aren't sophisticated enough. Some examples are Google's TPU, TI's C6000s, and QC's Hexagon family. These are quite easy to describe in MDL, and the language then automatically does a lot of really nice things for you.

As a side effect of that, superscalar processors are pretty straightforward to accurately describe.