[RFC] MDL: A Micro-Architecture Description Language for LLVM

Yes, there are a few ways to do that. We explicitly didn’t support parameterizing functional units with pipeline phases because we felt it didn’t scale very well: it works for a few pipeline changes, but beyond that you end up with a long list of undifferentiated constants passed as parameters.

So first some basic ideas:
We model pipelines a little differently than LLVM. First, you explicitly name the phases of a pipeline, much like an enum definition, then refer to these names when you define a pipeline behavior. Here’s a trivial example:

    phases MY_CPU { E1, E2, E3 };

which defines a simple 3-stage pipeline. Like an enum, you can assign values to these:

      phases MY_OTHER_CPU { READ, EXECUTE, WRITE=6, WRITE_MULTIPLY=42 };

In this case, the phases have the following values: READ=0, EXECUTE=1, WRITE=6, and WRITE_MULTIPLY=42.

You then use these names in latency template definitions to define when things happen:

   use(READ, $src); 
   def(WRITE, $dst);
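
For context, a complete latency template using those names might look like the following sketch (the template name “MyLatency” is just a placeholder):

     latency MyLatency() {
            use(READ, $src);
            def(WRITE, $dst);
     }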

So, the MDL allows you to define a pipeline with global scope, used by all CPUs, and you can also define CPU-specific pipelines which override the global definitions. For example:

     phases GLOBAL { READ, EXECUTE, WRITE, WRITE_SLOW=5 };

     cpu MY_CPU1 {
               phases MY_PHASES { WRITE_SLOW=10 };
     }
     cpu MY_CPU2 {
                 phases { WRITE_SLOW=20 };
     }
     ....
     latency Multiply() {
                def(WRITE_SLOW, $dst);
     }

The other method is to handle it explicitly in the latency template, on a CPU-by-CPU (or functional-unit-by-functional-unit) basis:

     latency Multiply() {
                MY_CPU8 : def(WRITE, $dst);
                MY_CPU9 : def(WRITE_SLOWER, $dst);
     }

BTW, you can also do some very basic arithmetic on the phase names, if you prefer that:

     latency Multiply() {
                MY_CPU12 : def(WRITE, $dst);
                MY_CPU5 : def(WRITE+4, $dst);
     }

All of these methods produce exactly the same information in the compiler output, so there’s no “good” or “bad” way to do it.

Does that make sense?

Yes, it makes sense. As I read the docs more, it looks more like a “structural” ADL. I have a feeling that the “default” level of abstraction may actually be somewhat lower than in “operand scheduling”. I understand where it comes from (clustered VLIW with complex pipelines). But I’m somewhat worried that it would result in tighter coupling between the ISA definition and the processor scheduling model.

I have to admit I haven’t built your branch yet, where, supposedly, you generate MDL descriptions equivalent to the TableGen models. Maybe looking into those generated descriptions would answer some of the questions I have right now.

Let’s consider a simple example: a basic RISC load-store ISA that has ALU instructions (“load immediate”, “move”, “add”, etc.) with identical scheduling behavior, a mul instruction, and load and store instructions. Now, you have two different CPUs with the same ISA, Foo and Bar. They are both single-issue in-order processors with fully pipelined ALU, MUL, and LSU. Instruction latencies are:

                      Foo  Bar
ALU                    1    1
mul                    2    3
load/store             2    4

(suppose Bar is targeting a higher frequency and has memory protection that adds an extra stage to the LSU pipeline)
Now, suppose you want to decouple Foo and Bar models from an ISA description.
How would you do that in MDL?

Using either of the two methods I described earlier, that’s pretty easy; I’ll use the first method here. (Note that in this example most names can be overloaded.)

     cpu Foo {
            phases { ALU=1, MUL=2, LOAD=2 };
            func_unit Unit FooUnit();
     }
     cpu Bar {
            phases { ALU=1, MUL=3, LOAD=4 };
            func_unit Unit BarUnit();
     }
     func_unit Unit() {
            subunit ALU();
            subunit MUL();
            subunit LOAD();
      }
      subunit ALU() { latency ALU(); }
      subunit MUL() { latency MUL(); }
      subunit LOAD() { latency LOAD(); }

      latency ALU() { def(ALU, $dst); }
      latency MUL() { def(MUL, $dst); }
      latency LOAD() { def(LOAD, $dst); }

Not shown: the ALU instruction definitions (in tablegen) need to specify that they use the ALU subunit, MUL instructions need to specify the MUL subunit, and LOAD instructions the LOAD subunit.
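
For example, using the SubUnits tablegen attribute described further below, those instruction definitions might look roughly like this sketch (the instruction names and the elided attributes are placeholders):

       def ADDI : <attributes...>, SubUnits<[ALU]>;
       def MUL  : <attributes...>, SubUnits<[MUL]>;
       def LW   : <attributes...>, SubUnits<[LOAD]>;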

Regarding coupling:
For a simple RISC CPU, the level of abstraction is equivalent to SchedModels, but we have the capability to model things at a much finer granularity if we need to, which is why it’s relatively easy to convert SchedModels and Itineraries to MDL descriptions. But the way we connect an instruction description to the CPU definition is rather different, and I think it provides more separation than SchedModels do. So let me explain that:

Greatly simplified: In SchedModels, each instruction has a set of SchedReadWrite resources for each processor it’s valid on. These are either defined in the instruction, or provided via InstRW records. Each resource is (or can be) tied to a SchedModel, and each SchedWrite resource can be associated with a set of ProcResources. Resources also have latencies associated with them.

So, using InstRW records and associated SchedReadWrite resource sets, each instruction definition can be tied to a set of processors, and on each processor it can be tied to specific functional units (ProcResources), and we can determine latencies.
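
For reference, that TableGen pattern typically looks something like this simplified sketch (the names here are made up, not taken from any particular target):

      def WriteIMul : SchedWrite;                      // abstract resource referenced by instructions

      let SchedModel = MyCpuModel in {
        def MyCpuIMulUnit : ProcResource<1>;           // a functional unit on this subtarget
        def : WriteRes<WriteIMul, [MyCpuIMulUnit]> {   // bind the resource to the unit and give it a latency
          let Latency = 3;
        }
        def : InstRW<[WriteIMul], (instrs MUL)>;       // tie the MUL instruction to this resource on this subtarget
      }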

I think of this as two layers (instruction definitions and SchedModels), glued together with InstRW records. So we could think of that as three levels of hierarchy, with much of the burden carried by InstRWs.

In MDL, the schema looks like this:

A CPU definition describes the functional units it implements:

         cpu CPU1 {   <functional unit instances>   }

A functional unit template definition describes the “subunits” it implements:

        func_unit FU(<parameters>) {   <subunit instances>  }

A subunit template definition describes which latencies it implements:

       subunit SU(<parameters>)   {   <latency instances>   }

And a latency template definition describes a pipeline behavior:

       latency LAT(<parameters>) {  <uses and defs of operands and resources>  }
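
Putting those four levels together, a minimal unparameterized instance of this schema could look like the following sketch (all names are placeholders; the phases and use/def syntax are the same as in the earlier examples):

       phases P { E1, E2, E3 };

       cpu CPU1 { func_unit FU my_fu(); }
       func_unit FU() { subunit SU(); }
       subunit SU() { latency LAT(); }
       latency LAT() { use(E1, $src); def(E3, $dst); }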

Tablegen instruction definitions specify one or more subunits they can execute on. We added a new tablegen instruction attribute “SubUnits”, which is a list of subunit names that an instruction can use, for example:

       def ADD : <attributes...>, SubUnits<[A, B, C]>;

which asserts that ADD can run on subunits A, B, or C. We scrape instruction information from tablegen, and generate our MDL representation of that:

      instruction ADD(<operand definitions>) { subunit(A, B, C); }

(Note that in the repo we didn’t modify the tablegen descriptions for upstreamed targets; we simply generated all the subunit information from the scheduling information.) The SubUnits class is trivial:

     class SubUnitEncoding<string name> { string subunit_name = name; }
     class SubUnits<list<SubUnitEncoding> subunits> {
       list<SubUnitEncoding> SubUnits = subunits;
     }

So, a few things to note about this approach:

  • Instructions aren’t directly tied to functional units or processors. They are only tied to subunit template definitions, which are specialized and instantiated in one or more functional units, which are specialized and instantiated on one or more processors.
  • A subunit template abstractly represents a particular behavior which is typically shared by a set of instructions. The details of that behavior are provided in the specified latency instance.
  • This model doesn’t really use “instruction latencies” as a thing. We have a more formal definition of the pipeline phases, and describe when registers and resources are accessed by phase. From that (and forwarding information) we derive latencies.

Another way to think about this is that we consider the pipeline part of the CPU’s behavior, not the instructions’ behavior. The latency templates map instruction operands to the CPU’s pipelines. As you point out, we use actual operand names to do that, rather than operand indexes. Generally, I think that’s safer than using indexes: we can trivially handle reordered operands, and if you rename an operand without updating the CPU model, we catch it with an error.

Is there some way to reuse a group of pipeline stage definitions between some CPUs with the same ISA? I’m looking for something like a “matrix” of processor configurations, defined by shared implementations of components such as FPU, memory subsystem, and so on, and, as always, some variations in uarch details here and there.

Yes, absolutely. You can have a globally shared pipeline definition, and each CPU can override parts of it, or add CPU-specific stages. Functional unit definitions, subunit definitions, and latency definitions are also shared across all architectures, and can be optionally specialized for each instance.
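
As a rough sketch of what that could look like (all the names here are hypothetical), two cores could share the same FPU functional unit, subunit, and latency templates, with one core overriding a pipeline phase:

     phases FAMILY { FETCH, EX, FP_WRITE=4 };

     // Shared templates, written once for the whole family.
     func_unit FPU() { subunit FPMUL(); }
     subunit FPMUL() { latency FPMUL(); }
     latency FPMUL() { def(FP_WRITE, $dst); }

     cpu CoreA {
            func_unit FPU fpu0();
     }
     cpu CoreB {
            // Same shared FPU, but this core writes FP results back two stages later.
            phases { FP_WRITE=6 };
            func_unit FPU fpu0();
     }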

I’ve finally got some free cycles to take a deeper look at MDL.

I’m getting compiler errors when building your branch, a bunch of errors like

lib/Target/AArch64/AArch64GenMdlInfo.inc:61121: error: could not convert ... from ‘<brace-enclosed initializer list>’ to ‘llvm::mdl::CpuTableDict’ {aka ‘std::map<std::__cxx11::basic_string<char>, llvm::mdl::CpuInfo>’}

Still, MDL files are generated.
BTW, I suppose there is a separate build target that just generates the MDLs - what is it called?

I’d propose generating separate files for individual processor definitions and their related subunits. I understand that those generated MDLs may not be intended to be maintained manually, but it might be a better showcase for what you want manually written MDL files to look like, and a better learning tool.

In the generated files for RISC-V I see the following:
RISCV_instructions.mdl:

instruction MUL(GPR  rd(O), GPR  rs1(I), GPR  rs2(I)) {
     subunit(sub32,sub41);
     // "mul $rd, $rs1, $rs2"
}

RISCV.mdl:

subunit sub32() {{ def(E3, $rd); fus(E3, SiFive7PipeB); }}
// ...
subunit sub41() {{ def(E4, $rd); fus(E4, RocketUnitIMul); }}

That coupling between the ISA and the uarch definition doesn’t look good. Imagine that you have several independent vendors using the same ISA, maintaining a bunch of processor models, and introducing new ones every now and then. The current TableGen model might be a bit wordy here, but it doesn’t require such coupling.

Yeah, I just noticed some compile/cmake problems. I’ll push a fix shortly.

Regarding targets: there is a build target that generates the MDL files for each relevant target: TdScan<target>.
So, “make TdScanARM” should build the ARM MDL files.

Regarding associating instructions with a uarch definition: a few thoughts…

You make a good point that different CPU vendors may want a separable way of associating each instruction with their CPU uarch, without modifying the instruction definitions. It’s quite easy to replicate the InstRW approach in the MDL language. I had thought about doing that, but I personally don’t love the approach: adding or renaming instructions can create duplications or omissions, though I guess I could check for that. But given that it’s what people are used to, I can understand a preference for it. I’ll take a look at adding that, and using it when I generate descriptions.

Generally, the way I’m generating MDL files isn’t the way I’d recommend writing them, so I agree that looking at generated MDL files isn’t terribly instructive, although it does provide an overview of the different components of the language. The primary goal of tdscan was to test the integration of the MDL approach for upstream targets. I wouldn’t expect upstream targets to necessarily switch to using it, since the MDL’s real goal is to support targets that can’t be easily described in tablegen.

The prescribed way of hand-writing a description is “top-down”. Each CPU defines which functional units it instantiates. Each functional unit template defines which subunits it implements. Each subunit defines which latency it uses. Instructions define which subunits they use. This works essentially like template expansion: a functional unit is specialized for each instance in each CPU, and in turn specializes its subunits, which specialize their latencies, which are specialized on a per-instruction basis.

In the case where functional units, subunits, and latencies don’t need to be parameterized and specialized (which is true for most upstreamed targets), tdscan omits explicit functional unit template definitions. Instead, each CPU defines which functional units it implements, and each subunit associates itself with a functional unit. Internally, we simply infer each functional unit template’s definition. This results in a much smaller, albeit less flexible, description.

Finally, regarding generating CPUs into separate files: TdScan automatically merges identical behaviors, so I’d have to disable that, which is probably easy. As you point out, there would be a lot of the same duplication that currently exists in tablegen for separate targets, but perhaps that’s ok if it helps people understand the language better. FWIW, it wouldn’t impact the generated database, since the MDL compiler also merges duplicate information.

Anyway, thanks for sharing your ideas!

Yes, that’s the only usage scenario for us, and for many other CPU vendors, I suppose. We use a shared ISA (RISC-V in our case), and build custom CPU cores on top of it. The ISA description can be really big. CPU models, on the other hand, are usually quite compact.

If you come up with something better than the current SchedMachineModel, we would be glad to adopt it. An ADL designed from scratch definitely has more possibilities there than a generalized data-table description language.

From what you’ve said before, I had the impression that phases can be viewed as SchedRead/SchedWrite equivalents of a sort: the ISA definition can use those phases as abstract names, and particular CPU definitions can define the particular timings as they see fit. Did I get that right?
If that’s true, then it looks like the only place where the ISA definition is coupled with the CPU definition now is resource usage. If we could get rid of that coupling somehow…

So phases are literally just symbolic names of pipeline phases; they are independent of instruction definitions and have no relationship to SchedReadWrite resources.

In tablegen, you associate SchedReadWrite resources with instructions - either in the instruction definition, or with InstRW records on a subtarget-by-subtarget basis. You then associate read and write latencies with each resource, represented simply as an integer constant.

In MDL, we use a completely different approach for tying latencies, functional units, etc. to instructions. Each instruction is associated with one or more independent subunit names. A subunit definition is part of the architecture spec; it is an abstract representation of the behavioral class of an instruction, without specifying the behavior, and is typically shared by a set of instructions. The architecture description ties the subunit to functional units, which are tied to CPU definitions. It’s also tied to pipeline behaviors (part of the architecture spec), which are described in latency definitions (or inlined into subunit definitions, as is done in generated descriptions).

So I definitely think we have the separation you’re looking for, though maybe not exactly the syntactic sugar you want. While tablegen instruction definitions currently specify a set of SchedReadWrite resources (which can be supplemented by InstRW records), in our approach the instruction instead specifies a subunit name. The architectural spec implements the behavioral classes independently of the instruction definitions. Subunits can be specialized for different CPUs or functional units, which I think is what you’re after. But today, that specialization syntax may not provide the level of separation you’d like to see. That makes sense, and it’s easy to fix (since it’s a new DSL!).

So, let me play around with the syntax a bit, and I’ll write up some examples of how we could do it.