Using either of the two methods I described earlier, that's pretty easy; I'll use the first method here. (Note that in this example most names can be overloaded.)
cpu Foo {
  phases { ALU=1, MUL=2, LOAD=3 }
  func_unit Unit FooUnit();
}
cpu Bar {
  phases { ALU=1, MUL=3, LOAD=4 }
  func_unit Unit BarUnit();
}

func_unit Unit() {
  subunit ALU();
  subunit MUL();
  subunit LOAD();
}

subunit ALU()  { latency ALU(); }
subunit MUL()  { latency MUL(); }
subunit LOAD() { latency LOAD(); }

latency ALU()  { def(ALU, $dst); }
latency MUL()  { def(MUL, $dst); }
latency LOAD() { def(LOAD, $dst); }
Not shown: in tablegen, ALU instruction definitions need to specify that they use an ALU subunit, MUL instructions that they use a MUL subunit, and LOAD instructions that they use a LOAD subunit.
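For concreteness, the generated MDL instruction records for this example might look something like the sketch below (the instruction names are illustrative and the operand lists are elided; the subunit() syntax follows the ADD example later in this note):

instruction ADD(<operand definitions>)  { subunit(ALU); }
instruction MPY(<operand definitions>)  { subunit(MUL); }
instruction LDW(<operand definitions>)  { subunit(LOAD); }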
Regarding coupling:
For a simple RISC CPU, the level of abstraction is equivalent to SchedModels, but we can model things at a much finer granularity when we need to, which is why it's relatively easy to convert SchedModels and Itineraries to MDL descriptions. However, the way we connect an instruction description to the CPU definition is rather different, and I think it provides more separation than SchedModels do. Let me explain that:
Greatly simplified: In SchedModels, each instruction has a set of SchedReadWrite resources for each processor it’s valid on. These are either defined in the instruction, or provided via InstRW records. Each resource is (or can be) tied to a SchedModel, and each SchedWrite resource can be associated with a set of ProcResources. Resources also have latencies associated with them.
So, using InstRW records and their associated SchedReadWrite resource sets, each instruction definition can be tied to a set of processors, and on each processor to specific functional units (ProcResources), from which we can determine latencies.
I think of this as two layers (instruction definitions and SchedModels), glued together with InstRW records. So we could think of that as three levels of hierarchy, with much of the burden carried by InstRWs.
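As a sketch of that glue in standard TableGen scheduling classes (the WriteIALU, FooModel, FooALU, ADD, and SUB names are placeholders, not from any particular target):

def WriteIALU : SchedWrite;

def FooModel : SchedMachineModel;
let SchedModel = FooModel in {
  def FooALU : ProcResource<1>;

  // On this CPU, WriteIALU maps to the FooALU resource with latency 1.
  def : WriteRes<WriteIALU, [FooALU]> { let Latency = 1; }

  // Glue: attach the SchedReadWrite list to particular instructions.
  def : InstRW<[WriteIALU], (instrs ADD, SUB)>;
}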
In MDL, the schema looks like this:
A CPU definition describes the functional units it implements:
cpu CPU1 { <functional unit instances> }
A functional unit template definition describes the “subunits” it implements:
func_unit FU(<parameters>) { <subunit instances> }
A subunit template definition describes which latencies it implements:
subunit SU(parameters) { <latency instances> }
And a latency template definition describes a pipeline behavior:
latency LAT(parameters) { <uses and defs of operands and resources> }
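For example, a latency template can reference both operands and resources. The following sketch assumes a use() rule symmetric to the def() rule shown in the earlier example, with purely illustrative names:

latency MAC() {
  use(ALU, $src1);   // source operands are read in the ALU phase
  use(ALU, $src2);
  def(MUL, $dst);    // the result is written in the MUL phase
}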
Tablegen instruction definitions specify one or more subunits they can execute on. We added a new tablegen instruction attribute “SubUnits”, which is a list of subunit names that an instruction can use, for example:
def ADD : <attributes...>, SubUnits<[A, B, C]>;
which asserts that ADD can run on subunits A, B, or C. We scrape instruction information from tablegen, and generate our MDL representation of that:
instruction ADD(<operand definitions>) { subunit(A, B, C); }
(Note that in the repo we didn't modify the tablegen descriptions for upstreamed targets; we simply generated all the subunit information from the existing schedule information.) The SubUnits class is trivial:
class SubUnitEncoding<string name> {
  string subunit_name = name;
}
class SubUnits<list<SubUnitEncoding> subunits> {
  list<SubUnitEncoding> SubUnits = subunits;
}
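So for the ADD example above to type-check, A, B, and C are themselves SubUnitEncoding records, something like (illustrative, not copied from the repo):

def A : SubUnitEncoding<"A">;
def B : SubUnitEncoding<"B">;
def C : SubUnitEncoding<"C">;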
So, a few things to note about this approach:
- Instructions aren’t directly tied to functional units or processors. They are only tied to subunit template definitions, which are specialized and instantiated in one or more functional units, which are specialized and instantiated on one or more processors.
- A subunit template abstractly represents a particular behavior which is typically shared by a set of instructions. The details of that behavior are provided in the specified latency instance.
- This model doesn’t really treat “instruction latencies” as first-class things. Instead, we have a more formal definition of the pipeline phases and describe when registers and resources are accessed, by phase. From that (and forwarding information) we derive latencies.
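To make that last point concrete with the Bar CPU above, and assuming the straightforward derivation (roughly, def phase minus use phase): a producer whose latency rule is def(MUL, $dst) writes its result in phase 3, a consumer that reads the operand in the ALU phase reads it in phase 1, so the derived latency between them is 3 - 1 = 2 cycles, before any forwarding adjustments.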
Another way to think about this is that we consider the pipeline part of the CPU’s behavior, not the instructions’ behavior. The latency templates map instruction operands onto the CPU’s pipelines. As you point out, we use actual operand names to do that, rather than operand indexes. Generally, I think that’s safer than using indexes: we can trivially handle reordered operands, and if you rename an operand without updating the CPU model, we catch it with an error.