[RFC] llvm-mca: a static performance analysis tool

Hi all,

At Sony we developed an LLVM based performance analysis tool named llvm-mca. We
currently use it internally to statically measure the performance of code, and
to help triage potential problems with target scheduling models. We decided to
post this RFC because we are interested in the feedback from the community, and
we also believe that other people might be interested in a tool like this.

This is a dream come true!

llvm-mca uses information which is already available in LLVM (e.g. scheduling
models) to statically measure the performance of machine code on a specific
CPU. Performance is measured in terms of throughput as well as processor
resource consumption. The tool currently works for processors with an
out-of-order backend, for which there is a scheduling model available in LLVM.

The main goal of this tool is not just to predict the performance of the code
when run on the target, but also to help with diagnosing potential performance
issues.

You can’t really have one without the other, which is why this is a dream come true.

Given an assembly code sequence, llvm-mca estimates the IPC (instructions per
cycle), as well as hardware resource pressure. The analysis and reporting style
were inspired by the IACA tool from Intel.

The presence of long data dependency chains, as well as poor usage of hardware
resources may lead to bottlenecks in the back-end. The tool is able to generate
a detailed report which should help with identifying and analyzing sources of
bottlenecks.

Scheduling models are mostly used to compute instruction latencies, to obtain
read-advance information, and to understand how processor resources are used by
instructions. By design, the quality of the performance analysis conducted by
the tool is inevitably affected by the quality of the target scheduling models.

However, scheduling models intentionally do not describe all processor details,
since the goal is just to enable the scheduling of machine instructions during
compilation. That means there are processor details which are not important for
the purpose of scheduling instructions (and therefore not described by the
scheduling model), but are very important for this tool.

The LLVM machine model can have as much detail as we want, as long as it’s opt-in for targets.
We’ve always had more detail than the generic scheduler actually needed. e.g. the scheduler doesn’t currently care how big the per-resource buffer sizes are.
Targets have their own scheduling strategies that can pick and choose.

At one point I wrote code that could be called by the scheduler to simulate OOO execution at much greater detail, but the benefit didn’t justify the complexity and compile time. I always felt that a static analysis tool was the right place for this kind of simulation.

A few examples of details that are missing in scheduling models are:

  • Maximum number of instructions retired per cycle.

MicroOpBufferSize is presumed to cover register renaming and retirement, assuming they are well-balanced. For your tool, you certainly want to be more precise.

  • Actual dispatch width (it often differs from the issue width).

This was always a hard one to generalize in a machine independent way, and half the world seems to swap the meaning of those terms. If you’re going to make this distinction please define the terms clearly in the machine model and explain how tools are expected to use the information.

Currently IssueWidth is used to tell the scheduler that ’N’ microops will definitely take ’N’ / ‘IssueWidth’ cycles regardless of which functional units or dispatch pipeline is involved.

[Reading below, I saw that you define instruction “dispatch” as what the LLVM machine model calls “issue”. LLVM doesn’t model dispatch directly because, by its definition, it is redundant with the number of function units, modulo some dynamic behavior that can’t be statically predicted anyway.]

We also don’t model decoding because, presumably, you hit the micro-op limit first.

  • Number of temporary registers available for renaming.

See MicroOpBufferSize above.

  • Number of read/write ports in the register file(s).

The assumption is that we hit micro-op issue width first. I suppose it’s good to have though.

  • Length of the load/store queue in the LSUnit.

That was supposed to be covered by per-processor-resource buffer size. Maybe you want to simulate the load/store queue differently from other functional units? e.g. one shared queue across multiple load store units?

It is also very difficult to find a “good” abstract model to describe the
behavior of out-of-order processors. So, we have to keep in mind that all of
these aspects are going to affect the quality of the static analysis performed
by the tool.

Like the scheduler, the tool should be extensible so that different targets can simulate behavior that doesn’t fit some abstract model.

An extensive list of known limitations is reported in one of the last sections
of this document. There is also a section related to design problems which must
be addressed (hopefully with the help of the community). At the moment, the
tool has been mostly tested for x86 targets, but there are still several
limitations, some of which could be overcome by integrating extra information
into the scheduling models.

As was mentioned before, this tool has been (and is still being) used internally
in Sony to debug/triage issues in the btver2 scheduling model. We have also
tested it on other targets to check how generic the tool is. In our experience,
the tool makes it easy to identify simple mistakes like “wrong number of micro
opcodes specified for an instruction”, or “wrong set of hardware resources”.
Some of these mistakes are quite common (sometimes on mature models too), and
often difficult to catch. Reports generated by this tool are simple to analyze,
and contain enough details to help triage most performance problems.

  1. How the tool works
    ——————————

Nice documentation. Please find a good place for it to live and include it in your patch.

Timeline View

A detailed report of each instruction’s state transitions over time can be
enabled using the command line flag ‘-timeline’. This prints an extra section
in the report which contains the so-called “timeline view”. Below is the
timeline view for the dot-product example from the previous section.

///////////////
Timeline view:
                    012345
Index     0123456789

[0,0]     DeeER.    .    .   vmulps   %xmm0, %xmm1, %xmm2
[0,1]     D==eeeER  .    .   vhaddps  %xmm2, %xmm2, %xmm3
[0,2]     .D====eeeER    .   vhaddps  %xmm3, %xmm3, %xmm4

[1,0]     .DeeE-----R    .   vmulps   %xmm0, %xmm1, %xmm2
[1,1]     . D=eeeE---R   .   vhaddps  %xmm2, %xmm2, %xmm3
[1,2]     . D====eeeER   .   vhaddps  %xmm3, %xmm3, %xmm4

[2,0]     .  DeeE-----R  .   vmulps   %xmm0, %xmm1, %xmm2
[2,1]     .  D====eeeER  .   vhaddps  %xmm2, %xmm2, %xmm3
[2,2]     .   D======eeeER   vhaddps  %xmm3, %xmm3, %xmm4
///////////////

Truly awesome.

Extra statistics to further diagnose performance issues.

Flag ‘-verbose’ enables extra statistics and performance counters for the
dispatch logic, the reorder buffer, the retire control unit and the register
file.

Below is an example of verbose output generated by the tool for the dot-product
example discussed in the previous sections.

///////////////////
Iterations: 300
Instructions: 900
Total Cycles: 610
Dispatch Width: 2
IPC: 1.48

Dynamic Dispatch Stall Cycles:
RAT - Register unavailable: 0
RCU - Retire tokens unavailable: 0
SCHEDQ - Scheduler full: 272
LQ - Load queue full: 0
SQ - Store queue full: 0
GROUP - Static restrictions on the dispatch group: 0

Register Alias Table:
Total number of mappings created: 900
Max number of mappings used: 35

Dispatch Logic - number of cycles where we saw N instructions dispatched:
[# dispatched], [# cycles]
0, 24 (3.9%)
1, 272 (44.6%)
2, 314 (51.5%)

Schedulers - number of cycles where we saw N instructions issued:
[# issued], [# cycles]
0, 7 (1.1%)
1, 306 (50.2%)
2, 297 (48.7%)

Retire Control Unit - number of cycles where we saw N instructions retired:
[# retired], [# cycles]
0, 109 (17.9%)
1, 102 (16.7%)
2, 399 (65.4%)

Scheduler’s queue usage:
JALU01, 0/20
JFPU01, 18/18
JLSAGU, 0/12
///////////////////

So great!

LLVM-MCA instruction flow

This section describes the instruction flow through the out-of-order backend, as
well as the functional units involved in the process.

An instruction goes through a default sequence of stages:

  • Dispatch (Instruction is dispatched to the schedulers).
  • Issue (Instruction is issued to the processor pipelines).
  • Write Back (Instruction is executed, and results are written back).
  • Retire (Instruction is retired; writes are architecturally committed).

The tool only models the out-of-order portion of a processor. Therefore, the
instruction fetch and decode stages are not modeled. Performance bottlenecks in
the frontend are not diagnosed by this tool. The tool assumes that instructions
have all been decoded and placed in a queue. Also, the tool doesn’t know
anything about branch prediction.

The following sections are fantastic documentation for your tool and the machine model in general. I hope you check all this in somewhere.

Instruction Dispatch

During the Dispatch stage, instructions are picked in program order from a queue
of already decoded instructions, and dispatched in groups to the hardware
schedulers. The dispatch logic is implemented by class DispatchUnit in file
Dispatch.h.

The size of a dispatch group depends on the availability of hardware resources,
and it cannot exceed the value of field ‘DispatchWidth’ in class DispatchUnit.
Note that field DispatchWidth defaults to the value of field ‘IssueWidth’ from
the scheduling model.

Users can override the DispatchWidth value with flag “-dispatch=N” (where N is
an unsigned quantity).

An instruction can be dispatched if all of the following conditions are met (a
rough sketch of this check is given after the list):

  • The size of the dispatch group is smaller than DispatchWidth.
  • There are enough entries in the reorder buffer.
  • There are enough temporary registers to do register renaming.
  • The schedulers are not full.
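
Conceptually, the per-instruction check combines the conditions above into
something like the following sketch (the names below are made up for
illustration and are not the actual DispatchUnit code):

// Hypothetical, simplified dispatch state; none of these names come from
// the real llvm-mca sources.
struct DispatchState {
  unsigned DispatchWidth;        // defaults to IssueWidth from the model
  unsigned DispatchedThisCycle;  // micro-opcodes already dispatched this cycle
  unsigned FreeROBEntries;       // free reorder buffer entries
  unsigned FreeRenameRegs;       // free temporaries for register renaming
  unsigned FreeSchedulerSlots;   // free entries in the target scheduler queue
};

// Can an instruction made of NumMicroOps micro-opcodes, which writes
// NumRegWrites registers, be dispatched in the current cycle?
static bool canDispatch(const DispatchState &S, unsigned NumMicroOps,
                        unsigned NumRegWrites) {
  return S.DispatchedThisCycle + NumMicroOps <= S.DispatchWidth &&
         NumMicroOps <= S.FreeROBEntries &&
         NumRegWrites <= S.FreeRenameRegs &&
         S.FreeSchedulerSlots != 0;
}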

This is really what’s meant by LLVM’s IssueWidth. I think Intel always preferred to call it “dispatch”. Presumably, instructions are buffered before being dispatched/issued to functional units, so LLVM’s “IssueWidth” is meant to model a hardware restriction that is independent from the number of functional units (processor resources).

It’s especially confusing because some µArchs have dispatch pipelines decoupled from the functional units.

Scheduling models don’t describe register files, and therefore the tool doesn’t
know whether there is more than one register file, or how many temporaries are
available for register renaming.

The LLVM machine model can model the size of the register renaming pool if you like.

By default, the tool (optimistically) assumes a single register file with an
unbounded number of temporary registers. Users can limit the number of
temporary registers available for register renaming using flag
-register-file-size=<N>, where N is the number of temporaries. A value of
zero for N means ‘unbounded’. Knowing how many temporaries are available for
register renaming, the tool can predict dispatch stalls caused by the lack of
temporaries.

The number of reorder buffer entries consumed by an instruction depends on the
number of micro-opcodes it specifies in the target scheduling model (see field
‘NumMicroOps’ of tablegen class ProcWriteResources and its derived classes in
TargetSchedule.td).

The reorder buffer is implemented by class RetireControlUnit (see Dispatch.h).
Its goal is to track the progress of instructions that are “in-flight”, and
retire instructions in program order. The number of entries in the reorder
buffer defaults to the value of field ‘MicroOpBufferSize’ from the target
scheduling model.
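
For reference, both of these numbers are already reachable from the MC layer.
A minimal sketch (assuming the usual MCSubtargetInfo/MCSchedModel APIs and
ignoring variant scheduling classes; this is not the actual tool code):

#include "llvm/MC/MCInstrInfo.h"
#include "llvm/MC/MCSchedule.h"
#include "llvm/MC/MCSubtargetInfo.h"

// Number of micro-opcodes declared by the scheduling model for an opcode.
static unsigned getNumMicroOps(const llvm::MCSubtargetInfo &STI,
                               const llvm::MCInstrInfo &MCII,
                               unsigned Opcode) {
  const llvm::MCSchedModel &SM = STI.getSchedModel();
  unsigned SCIdx = MCII.get(Opcode).getSchedClass();
  return SM.getSchedClassDesc(SCIdx)->NumMicroOps;
}

// Default number of reorder buffer entries assumed by the tool.
static unsigned getReorderBufferSize(const llvm::MCSubtargetInfo &STI) {
  return STI.getSchedModel().MicroOpBufferSize;
}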

Instructions that are dispatched to the schedulers consume scheduler buffer
entries. The tool queries the scheduling model to figure out the set of
buffered resources consumed by an instruction. Buffered resources are treated
like “scheduler” resources, and the field ‘BufferSize’ (from the processor
resource tablegen definition) defines the size of the scheduler’s queue.
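
As an illustration (again, not the actual llvm-mca code), the buffered
resources and their queue sizes can be enumerated from the MC layer roughly as
follows:

#include "llvm/MC/MCSchedule.h"
#include "llvm/MC/MCSubtargetInfo.h"
#include "llvm/Support/raw_ostream.h"

// Print the size of every buffered resource ("scheduler") in the model.
static void printSchedulerBufferSizes(const llvm::MCSubtargetInfo &STI) {
  const llvm::MCSchedModel &SM = STI.getSchedModel();
  // Index 0 is the reserved invalid resource; real resources start at 1.
  for (unsigned I = 1, E = SM.getNumProcResourceKinds(); I < E; ++I) {
    const llvm::MCProcResourceDesc &PRD = *SM.getProcResource(I);
    // BufferSize > 0 means the resource has its own queue of that size.
    if (PRD.BufferSize > 0)
      llvm::outs() << "Resource #" << I << ": BufferSize="
                   << PRD.BufferSize << "\n";
  }
}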

Zero latency instructions (for example NOP instructions) don’t consume scheduler
resources. However, those instructions still reserve a number of slots in the
reorder buffer.

As currently modeled by dummy µOps:

// A single nop micro-op (uX).
def WriteX : SchedWriteRes<[]> { let Latency = 0; }

You’ll need MachineInstrs though.

Instruction Issue

As mentioned in the previous section, each scheduler resource implements a queue
of instructions. An instruction has to wait in the scheduler’s queue until its
input register operands become available. Only at that point does the
instruction become eligible for execution, and it may then be issued
(potentially out-of-order) to a pipeline for execution.

Instruction latencies can be computed by the tool with the help of the
scheduling model; latency values are defined by the scheduling model through
ProcWriteResources objects.
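
As a simplified sketch (ignoring variant scheduling classes and read-advance
information), the worst-case write latency of a scheduling class can be derived
from the MC layer like this:

#include <algorithm>
#include "llvm/MC/MCSchedule.h"
#include "llvm/MC/MCSubtargetInfo.h"

// Maximum latency among all the register writes of a scheduling class.
static int getMaxWriteLatency(const llvm::MCSubtargetInfo &STI,
                              const llvm::MCSchedClassDesc &SC) {
  int Latency = 0;
  for (unsigned I = 0, E = SC.NumWriteLatencyEntries; I != E; ++I)
    Latency = std::max(Latency,
                       int(STI.getWriteLatencyEntry(&SC, I)->Cycles));
  return Latency;
}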

Class Scheduler (see file Scheduler.h) knows how to emulate multiple processor
schedulers. A Scheduler is responsible for tracking data dependencies, and for
dynamically selecting which processor resources are consumed/used by
instructions.

Internally, the Scheduler class delegates the management of processor resource
units and resource groups to the ResourceManager class. ResourceManager is also
responsible for selecting resource units that are effectively consumed by
instructions. For example, if an instruction consumes 1cy of a resource group,
the ResourceManager object selects one of the available units from the group; by
default, it uses a round-robin selector to guarantee that resource usage is
uniformly distributed between all units of a group.

To be a cross-subtarget tool, this needs to be a customization point. It’s not always so simple.

Load/Store Unit and Memory Consistency Model

The tool attempts to emulate out-of-order execution of memory operations. Class
LSUnit (see file LSUnit.h) emulates a load/store unit implementing queues for
speculative execution of loads and stores.

Each load (or store) consumes an entry in the load (or store) queue. The number
of slots in the load/store queues is unknown to the tool, since there is no
mention of it in the scheduling model. In practice, users can specify flag
-lqueue=N (and, similarly, -squeue=N) to limit the number of entries in the
queue to exactly N (an unsigned value). If N is zero, then the tool assumes an
unbounded queue (this is the default).

LSUnit implements a relaxed consistency model for memory loads and stores. The
rules are:

  1. A younger load is allowed to pass an older load only if there is no
    intervening store in between the two loads.
  2. A younger store is not allowed to pass an older store.
  3. A younger store is not allowed to pass an older load.
  4. A younger load is allowed to pass an older store provided that the load does
    not alias with the store.

By default, this class conservatively (i.e. pessimistically) assumes that loads
always may-alias store operations. Essentially, this LSUnit doesn’t perform any
sort of alias analysis to rule out cases where loads and stores don’t overlap
with each other. The downside of this approach, however, is that younger loads
are never allowed to pass older stores. To make it possible for a younger load
to pass an older store, users can use the command line flag -noalias. Under
‘noalias’, a younger load is always allowed to pass an older store.

I’m surprised it isn’t optimistic by default. Being pessimistic hides other bottlenecks and seems less accurate. I guess consistency with other similar tools (IACA) should be the aim.

Note that, in the case of write-combining memory, rule 2 could be relaxed a bit
to allow reordering of non-aliasing store operations. That being said, at the
moment, there is no way to further relax the memory model (flag -noalias is the
only option). Essentially, there is no option to specify a different memory
type (for example: write-back, write-combining, write-through, etc.) and
consequently to weaken or strengthen the memory model.

Other limitations are:

  • LSUnit doesn’t know when store-to-load forwarding may occur.
  • LSUnit doesn’t know anything about the cache hierarchy and memory types.
  • LSUnit doesn’t know how to identify serializing operations and memory fences.

Combining the static model with a sampled dynamic hardware event profile would be amazing.

No assumption is made on the store buffer size. As mentioned before, LSUnit
conservatively assumes a may-alias relation between loads and stores, and it
doesn’t attempt to identify cases where store-to-load forwarding would occur in
practice.

The LSUnits have a buffer size of course. The question is whether you really need to separately model issuing the instructions to a separate memory pipeline via a shared queue.

LSUnit doesn’t attempt to predict whether a load or store hits or misses the L1
cache. It only knows if an instruction “MayLoad” and/or “MayStore”. For loads,
the scheduling model provides an “optimistic” load-to-use latency (which usually
matches the load-to-use latency when there is a hit in the L1D).

You’re optimistic here, which is good, but pessimistic with aliasing.

Class MCInstrDesc in LLVM doesn’t know about serializing operations or
memory-barrier-like instructions. LSUnit conservatively assumes that an
instruction which has both ‘MayLoad’ and ‘UnmodeledSideEffects’ behaves like a
“soft” load-barrier. That means it serializes loads without forcing a flush of
the load queue. Similarly, instructions flagged with both ‘MayStore’ and
‘UnmodeledSideEffects’ are treated like store barriers. A full memory barrier
is a ‘MayLoad’ and ‘MayStore’ instruction with ‘UnmodeledSideEffects’. This is
inaccurate, but it is the best that we can do at the moment with the current
information available in LLVM.
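
For reference, this heuristic only needs flags that MCInstrDesc already
exposes. A sketch (not necessarily the exact llvm-mca code):

#include "llvm/MC/MCInstrDesc.h"

// Conservative barrier classification based on existing MC flags.
static bool isLoadBarrier(const llvm::MCInstrDesc &D) {
  return D.mayLoad() && D.hasUnmodeledSideEffects();
}

static bool isStoreBarrier(const llvm::MCInstrDesc &D) {
  return D.mayStore() && D.hasUnmodeledSideEffects();
}

static bool isFullMemoryBarrier(const llvm::MCInstrDesc &D) {
  return D.mayLoad() && D.mayStore() && D.hasUnmodeledSideEffects();
}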

LLVM should have this information. It needs some design work though.

A load/store barrier consumes one entry of the load/store queue. A load/store
barrier enforces ordering of loads/stores. A younger load cannot pass a load
barrier. Also, a younger store cannot pass a store barrier. A younger load has
to wait for the memory/load barrier to execute. A load/store barrier is
“executed” when it becomes the oldest entry in the load/store queue(s). That
also means, by construction, all the older loads/stores have been executed.

In conclusion, the full set of rules is (a simplified sketch of the pairwise
check is given after the list):

  1. A store may not pass a previous store.
  2. A load may not pass a previous store unless flag ‘NoAlias’ is set.
  3. A load may pass a previous load.
  4. A store may not pass a previous load (regardless of flag ‘NoAlias’).
  5. A load has to wait until an older load barrier is fully executed.
  6. A store has to wait until an older store barrier is fully executed.
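
Ignoring the barrier rules (5 and 6), the pairwise check implied by rules 1-4
can be summarized by the following sketch (hypothetical names; this is not the
real LSUnit code):

// Can a younger memory operation be issued before an older one that is still
// pending? Barrier handling (rules 5 and 6) is intentionally left out.
struct MemOp { bool IsStore; };

static bool mayPass(const MemOp &Younger, const MemOp &Older, bool NoAlias) {
  if (Older.IsStore)
    return !Younger.IsStore && NoAlias; // rules 1 and 2
  return !Younger.IsStore;              // rules 3 and 4
}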

Known limitations

Previous sections described cases where the tool is missing information to give
an accurate report. For example, the first sections of this document explained
how the lack of knowledge about the processor negatively affects the performance
analysis. The lack of knowledge is often a consequence of how scheduling models
are defined; as mentioned before, scheduling models intentionally don’t describe
processors in fine detail.

LLVM’s machine model should be optionally extended to model whatever a static analysis tool needs. There’s some dynamic behavior that can’t be modeled statically—predictive structures, the state of the pipeline entering a loop—but machine model precision shouldn’t be the limiting factor for your tool.

The accuracy of the performance analysis is also affected by assumptions made by
the processor model used by the tool.

Most recent Intel and AMD processors implement a dedicated LoopBuffer/OpCache in
the hardware frontend to speed up throughput in the presence of tight loops.
The presence of these buffers complicates the decoding logic, and requires
knowledge of the branch predictor too. Class ‘SchedMachineModel’ in tablegen
provides a field named ‘LoopMicroOpBufferSize’ which is used to describe loop
buffers. However, the purpose of that field is to enable loop unrolling of
tight loops; essentially, it affects the cost model used by pass loop-unroll.

By design, the tool only cares about the out-of-order portion of a processor,
and consequently doesn’t try to predict the frontend throughput. Processors may
implement complex decoding schemes; statically predicting the frontend
throughput is in general beyond the scope of this tool. For the same reasons,
this tool intentionally doesn’t model branch prediction. That being said, this
tool could definitely be extended in the future to also account for the hardware
frontend when doing performance analysis. This would inevitably require extra
(extensive) processor knowledge related to all the available decoding paths in
the hardware frontend.

If loops are ever definitely limited by decoder or fetch throughput, it would be good to know that. You don’t need to model the predictive aspect of it, or “all the paths”.

You might as well say you don’t need to model issue/dispatch width, or retirement logic, because it’s decoupled from OOO execution via instruction buffers. The point of this tool is to tell you that there is definitely a bottleneck in a particular area of the pipeline.

Would it be more appropriate to say that a simple abstract model of decoding is too inaccurate for your particular subtarget so there was not enough benefit to implementing it?

When computing the IPC, the tool assumes a zero-latency “perfect” fetch&decode
stage; the full sequence of decoded instructions is immediately visible to the
dispatch logic from the start.

The tool doesn’t know about simultaneous multithreading. According to the tool,
processor resources are not statically/dynamically partitioned. Processor
resources are fully available to the hardware thread executing the
microbenchmark.

The execution model implemented by this tool assumes that instructions are
firstly dispatched in groups to hardware schedulers, and then issued to
pipelines for execution. The model assumes dynamic scheduling of instructions.
Instructions are placed in a queue and potentially executed out-of-order (based
on the operand availability). The dispatch stage is definitely distinct from the
issue stage.

I wonder why in-order analysis doesn’t mostly fall out as a degenerate case in your tool. There’s some grey area where in-order processors have hardware interlocks that cause stalls that aren’t explicit in the scheduling groups. Those processors would still benefit from your tool. Even if that target doesn’t benefit from the OOO simulation, it would be nice to compare the output of your tool with the observed performance to find bugs in the model.

This model doesn’t correctly describe processors where the dispatch/issue is a
single stage. This is what happens for example in VLIW processors, where
instructions are packaged and statically scheduled at compile time; it is up to
the compiler to predict the latency of instructions and package issue groups
accordingly. For such targets, there is no dynamic scheduling done by the
hardware.

Existing classes (DispatchUnit, Scheduler, etc.) could be extended/adapted to
support processors with a single dispatch/issue stage. The execution flow would
require some changes in the way existing components (i.e. DispatchUnit,
Scheduler, etc.) interact. This can be a future development.

Ah…. Extended with future development sounds better.

The following sections describe other known limitations. The goal is not to
provide an extensive list of limitations; we want to report what we believe are
the most important limitations, and suggest possible methods to overcome them.

Load/Store barrier instructions and serializing operations

Section “Load/Store Unit and Memory Consistency Model” already mentioned how
LLVM doesn’t know about serializing operations and memory barriers. Most of it
boils down to the fact that class MCInstrDesc (intentionally) doesn’t expose
those properties. Instead, both serializing operations and memory barriers
“have side-effects” according to MCInstrDesc. That is because, at least for
scheduling purposes, knowing that an instruction has unmodeled side effects is
often enough to treat the instruction like a compiler scheduling barrier.

A performance analysis tool could use the extra knowledge on barriers and
serializing operations to generate a more accurate performance report. One way
to improve this is by reserving a couple of bits in field ‘Flags’ from class
MCInstrDesc: one bit for barrier operations, and another bit to mark
instructions as serializing operations.

Lack of support for instruction itineraries

The current version of the tool doesn’t know how to process instruction
itineraries. This is probably one of the most important limitations, since it
affects a few out-of-order processors in LLVM.

I don’t think OOO LLVM targets should be using itineraries. If those targets are still actively maintained, they could migrate.

As mentioned in section ‘Instruction Issue’, class Scheduler delegates to an
instance of class ResourceManager the handling of processor resources.
ResourceManager is where most of the scheduling logic is implemented.

Adding support for instruction itineraries requires that we teach
ResourceManager how to handle functional units and instruction stages. This
development can be a future extension, and it would probably require a few
changes to the ResourceManager interface.

Instructions that affect control flow are not correctly modeled

Examples of instructions that affect the control flow are: return, indirect
branches, calls, etc. The tool doesn’t try to predict/evaluate branch targets.
In particular, the tool doesn’t model any sort of branch prediction, nor does it
attempt to track changes to the program counter. The tool always assumes that
the input assembly sequence is the body of a microbenchmark (a simple loop
executed for a number of iterations). The “next” instruction in sequence is
always the next instruction to dispatch.

Call instructions default to an arbitrary high latency of 100cy. A warning is
generated if the tool encounters a call instruction in the sequence. Return
instructions are not evaluated, and therefore control flow is not affected.
However, the tool still queries the processor scheduling model to obtain latency
information for instructions that affect the control flow.

By decompiling to MachineInstrs it would be easy to build a mostly complete CFG and call graph.

Possible extensions to the scheduling model

Section “Instruction Dispatch” explained how the tool doesn’t know about the
register files, and temporaries available in each register file for register
renaming purposes.

The LLVM scheduling model could be extended to better describe register files.
Ideally, the scheduling model should be able to define:

  • The size of each register file
  • How many temporary registers are available for register renaming
  • How register classes map to register files

The scheduling model doesn’t specify the retire throughput (i.e. how many
instructions can be retired every cycle). Users can specify flag
-max-retire-per-cycle=<uint> to limit how many instructions the retire control
unit can retire every cycle. Ideally, every processor should be able to specify
the retire throughput (for example, by adding an extra field to the scheduling
model tablegen class).

Known limitations on X86 processors

  1. Partial register updates versus full register updates.

MachineOperand handles this. You just need to create the machine instrs.

  2. Macro Op fusion.

The tool doesn’t know about macro-op fusion. On modern x86 processors, a
‘cmp/test’ followed by a conditional jump is fused into a single macro
operation. The advantage is that the fused pair only consumes a single slot in
the dispatch group.

As a future development, the tool should be extended to address macro-fusion.
Ideally, we could have LLVM generate a table enumerating all the opcode pairs
that can be fused together. That table could be exposed to the tool via the
MCSubtargetInfo interface. This is just an idea; there may be better ways to
implement this.

This is already expressed by subtarget code. That code just needs to be exposed as a common subtarget interface. Typically the interfaces take MachineInstr, but in this case the opcode is probably sufficient.

  3. Zero-latency register moves and zero-idioms.

Most modern AMD/Intel processors know how to optimize out register-register
moves and zero idioms at the register renaming stage. The tool doesn’t know
about these patterns, and this may negatively impact the performance analysis.

The machine model has this, but requires proper MachineInstrs.

Known design problems

This section describes two design issues that are currently affecting the tool.
The long term plan is to “fix” these issues.

  1. Variant instructions not correctly modeled.

The tool doesn’t know how to analyze instructions with a “variant” scheduling
class descriptor. A variant scheduling class needs to be resolved dynamically.
The “actual” scheduling class often depends on the subtarget, as well as
properties of the specific MachineInstr object.

Unfortunately, the tool manipulates MCInst, and it doesn’t know anything about
MachineInstr. As a consequence, the tool cannot use the existing machine
subtarget hooks that are normally used to resolve the variant scheduling class.
This is a major design issue which mostly affects ARM/AArch64 targets. It
mostly boils down to the fact that the existing scheduling framework was meant
to work for MachineInstr.

There are good reasons for the scheduler to work with MachineInstrs, and any static analysis tool should work with MachineInstrs for the same reasons…

  2. MCInst and MCInstrDesc.

Performance analysis tools require data dependency information to correctly
predict the runtime performance of the code. This tool must always be able to
obtain the set of implicit/explicit register defs/uses for every instruction of
the input assembly sequence.

In the first section of this document, it was mentioned how the tool takes as
input an assembly sequence. That sequence is parsed into an MCInst sequence with
the help of assembly parsers available from the targets.

An MCInst is a very low-level instruction representation. The tool can inspect
the MCOperand sequence of an MCInst to identify register operands. However,
there is no way to tell which register operands are definitions and which are
uses.

In LLVM, class MCInstrDesc is used to fully describe target instructions and
their operands. The opcode of a machine instruction (a MachineInstr object) can
be used to query the instruction set through method ‘MCInstrInfo::get’ to obtain
the associated MCInstrDesc object.

However, class MCInstrDesc describes properties and operands of MachineInstr
objects. Essentially, MCInstrDesc is not meant to be used to describe MCInst
objects. To be more specific, MCInstrDesc objects are automatically generated
via tablegen from the instruction set description in the target .td files. For
example, field ‘MCInstrDesc::NumDefs’ is always equal to the cardinality of the
‘(outs)’ set from the tablegen instruction definition.

By construction, register definitions always appear at the beginning of the
MachineOperands list in MachineInstr. Basically, the (outs) are the first
operands of a MachineInstr, and the (ins) will come after in the machine operand
list. Knowing the number of register definitions is enough to identify
all the register operands that are definitions.

In a normal compilation process, MCInst objects are generated from MachineInstr
objects through a lowering step. By default the lowering logic simply iterates
over the machine operands of a MachineInstr, and converts/expands them into
equivalent MCOperand objects.

The default lowering strategy has the advantage of preserving all of the
above mentioned assumptions on the machine operand sequence. That means, register
definitions would still be at the beginning of the MCOperand sequence, and
register uses would come after.

Targets may still define custom lowering routines for specific opcodes. Some of
these routines may lower operands in a way that potentially breaks (some of) the
assumptions on the machine operand sequence which were valid for MachineInstr.
Luckily, this is not the most common form of lowering done by the targets, and
the vast majority of MachineInstrs are lowered based on the default strategy,
which preserves the original machine operand sequence. This is especially true
for x86, where the custom lowering logic always preserves the original (i.e.
from the MachineInstr) operand sequence.

This tool currently works under the strong (and potentially incorrect)
assumption that register def/uses in a MCInst can always be identified by
querying the machine instruction descriptor for the opcode. This assumption made
it possible to develop this tool and get good numbers at least for the
processors available in the x86 backend.

That being said, the analysis is still potentially incorrect for other targets.
So we plan (with the help of the community) to find a proper mechanism to map,
when possible, MCOperand indices back to MachineOperand indices of the
equivalent MachineInstr. This would be equivalent to describing the changes made
by the lowering step which affected the operand sequence. For example, we could
have an index for every register MCOperand (or -1, if the operand didn’t exist
in the original MachineInstr). The mapping could look like this: <0,1,3,2>.
Here, MCOperand #2 was obtained from the lowering of MachineOperand #3, and so
on.
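
To make the idea concrete, the per-opcode mapping could be as simple as an
array of indices (purely hypothetical; no such table is generated by tablegen
today):

// Entry I holds the index of the MachineOperand that produced MCOperand I,
// or -1 if MCOperand I has no counterpart in the original MachineInstr.
// For the <0,1,3,2> example above:
static const int OperandMapping[] = {0, 1, 3, 2};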

This information could be automatically generated via tablegen for all the
instructions whose custom lowering step breaks assumptions made by the tool on
the register operand sequence (In general, these instructions should be the
minority of a target’s instruction set). Unfortunately, we don’t have that
information now. As a consequence, we assume that the number of explicit
register definitions is the same number specified in MCInstrDesc. We also
assume that register definitions always come first in the operand sequence.

In conclusion, these are the strong assumptions currently made by the tool (a
sketch of how they are applied is given after the list):

  • The number of explicit and implicit register definitions in an MCInst
    matches the number of explicit and implicit definitions specified by the
    MCInstrDesc object.
  • Register uses always come after register definitions.
  • If an opcode specifies an optional definition, then the optional
    definition is always the last register operand in the sequence.
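
The following sketch shows how these assumptions are applied when partitioning
the register operands of an MCInst into definitions and uses (simplified;
implicit operands and optional definitions are handled separately):

#include "llvm/ADT/SmallVector.h"
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstrInfo.h"

// Partition the explicit register operands of an MCInst, relying on the
// assumption that definitions always come first.
static void collectExplicitRegOperands(const llvm::MCInstrInfo &MCII,
                                       const llvm::MCInst &MI,
                                       llvm::SmallVectorImpl<unsigned> &Defs,
                                       llvm::SmallVectorImpl<unsigned> &Uses) {
  const llvm::MCInstrDesc &Desc = MCII.get(MI.getOpcode());
  for (unsigned I = 0, E = MI.getNumOperands(); I != E; ++I) {
    const llvm::MCOperand &Op = MI.getOperand(I);
    if (!Op.isReg())
      continue;
    if (I < Desc.getNumDefs())
      Defs.push_back(Op.getReg()); // assumed to be a definition
    else
      Uses.push_back(Op.getReg()); // assumed to be a use
  }
}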

Note that some of the information accessible from the MCInstrDesc is always
valid for MCInst. For example: implicit register defs, implicit register uses
and ‘MayLoad/MayStore/HasUnmodeledSideEffects’ opcode properties still apply to
MCInst. The tool knows about this, and uses that information during its
analysis.

You just made a very strong argument for building the MachineInstrs before running mca. So I wonder why you didn’t do that.

What to do next

The source code has been uploaded for review on phabricator at this link: https://reviews.llvm.org/D43951.

The review covers two patches:

  • A first (very small) patch that always enables the generation of processor
    resource names in the SubtargetEmitter. Currently, the processor resource
    names are only generated for debugging purposes, but are needed by the tool
    to generate user friendly reports, so we would like to always generate them.
  • A second patch with the actual static analysis tool (in llvm/tools).

Once these first two patches are committed, the plan is to keep working on the
tool with the help of the community to address all of the limitations described
by the previous sections, and find good solutions/fixes for the design issues
described by section “Known design problems”.

We hope the community will find this tool as useful as we have.

Special thanks to Simon Pilgrim, Filipe Cabecinhas and Greg Bedwell who really
helped me a lot by suggesting improvements and testing the tool.

Thanks for your time.
-Andrea

There are a number of people on llvm-dev who can explain better than I how to decompile into MachineInstrs. I’m not totally opposed to checking in something that works with MCInst, but this does run strongly contrary to the design of LLVM’s subtarget support.

-Andy

Hi Andrea,

Thanks for this great RFC! I’ve put some high-level comments here, and I’ll give more focused comments in the review on Phabricator.

Hi Andrew,

Thanks for the feedback!

(sending this message again. Apparently it didn’t make it into llvm-dev because the message was too long. Sorry for the spam).

Hi Clement!

+Matthias

I was thinking of APIs like MachineOperand::readsReg().

I guess if you’re only asking whether an instruction zeros the upper part of the register, that information should be available from MCInstr/MCRegisterInfo, but I’m not very familiar with the API.

Matthias?

-Andy

+Ahmed

There are a number of people on llvm-dev who can explain better than I how to decompile into MachineInstrs. I’m not totally opposed to checking in something that works with MCInst, but this does run strongly contrary to the design of LLVM’s subtarget support.

That would be great! I would be very happy if somebody suggests how to do it (or does it for me :-)).

I know that it’s been done before, it’s an extremely useful technique, and work in that direction should be strongly encouraged. I don’t know the current state of in-tree support.

Ahmed?

Do you think the current design (modulo the changes suggested in the review) would be acceptable to start?

I’ll let others review the patch. The fact that you have collaborators in the community is good enough for me.

-Andy

This looks absolutely awesome. Thank you for sharing your work; I can see many ways this will be useful.

Philip

Reading through this last night got me thinking about how to model control flow. Given most of my source code tends to be very branchy, being limited to straight-line code is quite restrictive. The main thing is that it requires a lot of hand simplification, which can be rather error prone at times.

It occurs to me that we could remove the restriction around branches without necessarily fully modeling fetch, decode, or prediction. If, in addition to an assembly file, we also feed the tool a trace of branch executions, the tool could essentially unroll all loops dynamically. The type of thing I’m thinking of would look something like this: (ENTRY, branch_dest, branch_dest, branch_dest), where each “branch_dest” is the taken successor of the next encountered branch instruction.

As an example, consider a simple loop which executes 3 times:

ENTRY:
  (fallthrough)
bb1:
  (various instructions)
  jne bb1
bb2:
  ret

The trace for this would look like: (ENTRY, bb1, bb1, bb1, FALLTHROUGH).

The nice thing about such a branch trace is that it separates the source of the trace from the analysis tool. You could provide an actual trace collected through instrumentation, or a predicted trace generated from sample profiling information.

(There are obvious extensions to handle indirect branches, calls, and such here. I’m leaving that as an exercise for the reader at the moment.)

By itself, this seems like a really useful usability improvement.

You could also build on top of this notion to model both branch prediction mispredicts and resource limitations. (Or only one or the other.) From the perspective of the predictor, the trace would be ground truth, so you could analyze how well a particular prediction strategy works on that particular trace. Distinct from the prediction, you could model delay between branch execution and condition satisfaction to identify resource limits in the actual speculative execution. (I’m very deliberately being vague here; it’s been a long time since I looked into this aspect of things and don’t remember the appropriate terminology. Figured it was better to be vague than potentially misleading by getting it wrong.)

Philip

Hi Philip,

+Matthias

I was thinking of APIs like MachineOperand::readsReg().

I guess if you’re only asking whether an instruction zeros the upper part of the register, that information should be available from MCInstr/MCRegisterInfo, but I’m not very familiar with the API.

I don’t think we have this information in an explicit form today:

  • It’s usually not a correctness problem because we cannot really address the upper register parts independently on those targets.
  • We work around some ISEL shortcomings via SUBREG_TO_REG (see TargetOpcode.def) which I consider a nasty hack as stating assumptions about the predecessor node violates the referential transparency that you would expect from SSA.
  • Coalescing/regalloc is using ‘%vregX:sub32<undef> = …’ to represent this.

So today you are probably out of luck when coming from the MC side of things. I think adding an OperandFlag in MCInstrDesc would be a great idea and could be a first step towards retiring SUBREG_TO_REG.

- Matthias

I haven’t actually worked on analyzing assembler so take this with a grain of salt/as a less informed opinion:

But I think MachineInstrs wouldn’t bring much of a benefit over MC here: MI is designed for lowering of a higher level IR towards a machine representation; that is a different goal and comes with various things (pseudo instructions, virtual register management, stack frame management, etc.) that don’t help for this use case and rather have the potential to be in the way…

I’d rather work towards remodeling the code/datastructures to make reuse in llvm-mca simpler where it makes sense.

- Matthias

Hello,

I just wanted to say I like this a lot. I can think of at least three projects I worked on in the past that could have used such a tool as a sort of rough indicator even by itself but especially in combination with performance testing on hardware as you mentioned.

Having worked on a project that went round trip from executable to MCInst & analysis to executable again I don’t see any problems with using that as the layer for analyzing in this way.

I agree with your reasons for not modelling certain things like branch prediction fully. Once you get into things like that you have to start thinking about loop alignment, speculative execution (how do we count the penalty for executing code that won’t run on a mispredict, or the benefit of doing it on a correct one?), the list goes on. Same with the SSE + AVX thing, and similar situations like AVX-512 downclocking and such. A simple model for branches and loads / stores also seems to fit in with what AMD has said about optimizing for Ryzen because of the prediction being based on a hardware neural network; just going on their documentation on that one.

The important thing is that it seems like it would make testing for new processor scheduling models nicer; it seems like a good tool to include in a full LLVM build. It could even enhance the test suite with cases basically like what you mentioned with the btver2 work that would allow for a sort of spot check. The biggest problem there is that microcode updates are capable of changing things that might show up there, but that’s getting into more detail than might be necessary.

Thanks for the work on this guys,

-Gordon

It’s not necessary to use virtual registers or stack frame management to use MI. The point of MI is to have a common interface to LLVM backend APIs so you can query the scheduling model, register usage, a variety of TTI/STI information, analyze control flow, and analyze the call graph.

Why would you ever want to duplicate all of MI’s goodness rather than using it directly?

I was under the impression that MC → MI has been useful in the past and that it’s something we would want to support long term.

-Andy

Thanks Andrea for working on this!
I’ve been wanting to do this for quite some time now. Looks like procrastination was the right approach here ;).

+Ahmed

There are a number of people on llvm-dev who can explain better than I how to decompile into MachineInstrs. I’m not totally opposed to checking in something that works with MCInst, but this does run strongly contrary to the design of LLVM’s subtarget support.

That would be great! I would be very happy if somebody suggests how to do it (or does it for me :-)).

I know that it’s been done before, it’s an extremely useful technique, and work in that direction should be strongly encouraged. I don’t know the current state of in-tree support.

When Ahmed and I worked on the decompiler, we first targeted MC. Going to MI was more difficult and really wouldn’t have gotten us a lot of benefits. Instead, Ahmed pushed for directly decompiling to IR (look for dagger).

I would actually be in favor of more infrastructure in MC for this kind of thing. For instance, dealing with the code layout is not something MI is good at, whereas MC is perfect.

When Ahmed and I worked on the decompiler, we first targeted MC. Going to MI was more difficult and really wouldn’t have gotten us a lot of benefits. Instead, Ahmed pushed for directly decompiling to IR (look for dagger).

Thanks for the pointer Quentin.

I would actually be in favor of more infrastructure in MC for this kind of thing. For instance, dealing with the code layout is not something MI is good at, whereas MC is perfect.

Most of LLVM’s target-specific information is in TargetInstrInfo. TargetSchedModel depends on that. I don’t understand how you would build a performance analysis tool based on LLVM without either decompiling to MI, or rewriting those interfaces to be based on MC. I think you have to pick one direction or the other.

-Andy

You are right, those are the two choices. Given that the actual data is in MCSchedModel, and that TargetSchedModel is mostly interested in the MCInstrDesc you can query from an MI, it seemed to me like the easier/cleaner route to not build up MI data structures but to stay with MC-specific things.
You are right, though, that there is a whole bunch of callbacks in TargetInstrInfo that would need alternatives if we want to go this route. Indeed, building up MI makes things fit in better with the APIs today; it just feels less clean to me with all the extra codegen info around those data structures. Either way, I think this choice has to be made by whoever is working on decompiling tools.

- Matthias

When Ahmed and I worked on the decompiler, we first targeted MC. Going to MI was more difficult and really wouldn’t have gotten us a lot of benefits. Instead, Ahmed pushed for directly decompiling to IR (look for dagger).

Thanks for the pointer Quentin.

I would actually be in favor of more infrastructure in MC for this kind of thing. For instance, dealing with the code layout is not something MI is good at, whereas MC is perfect.

Most of LLVM’s target-specific information is in TargetInstrInfo. TargetSchedModel depends on that. I don’t understand how you would build a performance analysis tool based on LLVM without either decompiling to MI, or rewriting those interfaces to be based on MC. I think you have to pick one direction or the other.

You are right, those are the two choices. Given that the actual data is in MCSchedModel, and that TargetSchedModel is mostly interested in the MCInstrDesc you can query from an MI, it seemed to me like the easier/cleaner route to not build up MI data structures but to stay with MC-specific things.

That’s exactly how I approached things when I was printing some stats in assembly :). Things were readily accessible through MC.

Unfortunately targets have historically been building hooks into TargetInstrInfo.

To be clear then, resolveSchedClass should be moved from TargetSchedModel into MCSchedModel (which is where I originally wanted it). Any TargetInstrInfo APIs called from SchedPredicate should be moved to MCInstrInfo, which should be straightforward but annoying.

I just looked at the x86 target, and I’m surprised that none of the in tree x86 machine models are using SchedPredicate, so this isn’t really a limitation for the current incarnation of the MCA tool.

-Andy

At Sony we developed an LLVM based performance analysis tool named llvm-mca. We
currently use it internally to statically measure the performance of code, and
to help triage potential problems with target scheduling models. We decided to
post this RFC because we are interested in the feedback from the community, and
we also believe that other people might be interested in a tool like this.

This is really really cool. Thank you for building and contributing this!

-Chris