[You can find an easier to read and more complete version of this RFC here.]
Knowing instruction scheduling properties (latency, uops) is the basis for all scheduling work done by LLVM.
Unfortunately, vendors usually release only partial (and sometimes incorrect) information. Updating the information is painful and requires careful guesswork and analysis. As a result, scheduling information is incomplete for most X86 models (this bug tracks some of these issues). The goal of the tool presented here is to automatically (in)validate the TableGen scheduling models. In the long run we envision automatic generation of the models.
At Google, we have developed a tool that, given an instruction mnemonic, uses the data in MCInstrInfo to generate a code snippet that makes execution as serial (resp. as parallel) as possible so that we can measure the latency (resp. uop decomposition) of the instruction. The code snippet is jitted and executed on the host subtarget. The time taken (resp. resource usage) is measured using hardware performance counters. More details can be found in the ‘implementation’ section of the RFC.
For people familiar with the work of Agner Fog, this is essentially an automation of the process of building the code snippets using instruction descriptions from LLVM.
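The serial-versus-parallel idea can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the real tool builds snippets from MCInstrInfo operand data, whereas `make_snippet`, its text template, and the register list below are hypothetical. The sketch also assumes an instruction whose destination is write-only, so that reusing a destination register does not create a true dependency.

```python
from itertools import cycle, islice


def make_snippet(template, mode, registers, repetitions=4):
    """Return assembly lines for a template with {dst} and {src} placeholders.

    mode="latency": every instruction reads the register the previous one
    wrote, so the instructions form one serial dependency chain.
    mode="uops": the source register is never written and the destinations
    rotate, so the instructions carry no true dependency on each other.
    """
    if mode == "latency":
        reg = registers[0]
        return [template.format(dst=reg, src=reg) for _ in range(repetitions)]
    if mode == "uops":
        if len(registers) < 2:
            raise ValueError("uops mode needs a source and at least one destination")
        src = registers[0]
        dsts = islice(cycle(registers[1:]), repetitions)
        return [template.format(dst=dst, src=src) for dst in dsts]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    for mode in ("latency", "uops"):
        print(mode)
        for line in make_snippet("imul {dst}, {src}, 1", mode, ["ax", "bx", "cx"]):
            print("  " + line)
```

In latency mode the time per instruction approximates its latency; in uops mode the instructions can overlap, so the port counters reflect resource usage rather than the dependency chain.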
Results
- Solving this bug (sandybridge):
> llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode latency
---
asm_template:
  name: latency IMUL16rri8
cpu_name: sandybridge
llvm_triple: x86_64-grtev4-linux-gnu
num_repetitions: 10000
measurements:
  - { key: latency, value: 4.0115, debug_string: '' }
error: ''
...
> llvm-exegesis -opcode-name IMUL16rri8 -benchmark-mode uops
---
asm_template:
  name: uops IMUL16rri8
cpu_name: sandybridge
llvm_triple: x86_64-grtev4-linux-gnu
num_repetitions: 10000
measurements:
  - { key: '2', value: 0.5232, debug_string: SBPort0 }
  - { key: '3', value: 1.0039, debug_string: SBPort1 }
  - { key: '4', value: 0.0024, debug_string: SBPort4 }
  - { key: '5', value: 0.3693, debug_string: SBPort5 }
error: ''
...
Running both of these commands took about 0.2 seconds, including printing.
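One way to read the uops output above is to add up the per-port values. The sketch below does that for the numbers printed above; the `summarize` helper and its `noise_floor` threshold are assumptions chosen for illustration, not something the tool defines.

```python
# Per-port values copied from the sandybridge uops run above.
PORT_PRESSURE = {
    "SBPort0": 0.5232,
    "SBPort1": 1.0039,
    "SBPort4": 0.0024,
    "SBPort5": 0.3693,
}


def summarize(pressure, noise_floor=0.05):
    """Return (total uops per instruction, ports above the noise floor)."""
    total = sum(pressure.values())
    # Ports with a tiny value are treated as measurement noise (assumption).
    significant = {port: v for port, v in pressure.items() if v >= noise_floor}
    return total, significant


if __name__ == "__main__":
    total, significant = summarize(PORT_PRESSURE)
    print(f"total uops per instruction: {total:.2f}")
    for port, value in significant.items():
        print(f"  {port}: {value:.2f}")
```

For this run the total is about 1.9 and SBPort4 falls below the threshold, which is consistent with roughly two uops per instruction.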
- List of measured latencies for sandybridge, haswell and skylake processors including diffs with LLVM latencies. Excerpt:
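| mnemonic | sandybridge llvm-exegesis | sandybridge TD file | haswell llvm-exegesis | haswell TD file | skylake llvm-exegesis | skylake TD file |
|---|---|---|---|---|---|---|
| SHR32r1 | 1.01 | 1.00 | 1.00 | 1.00 | 1.01 | 1.00 |
| IMUL16rri | 4.02 | 3.00 | 4.01 | 3.00 | 4.01 | 3.00 |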
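The comparison behind such a diff can be sketched as follows, using only the excerpt values above. The `find_mismatches` helper and the 0.5-cycle tolerance are hypothetical choices for illustration; the RFC does not state how the diffs were computed.

```python
# (measured by llvm-exegesis, value in TD file), latencies in cycles.
EXCERPT = {
    "SHR32r1": {
        "sandybridge": (1.01, 1.00),
        "haswell": (1.00, 1.00),
        "skylake": (1.01, 1.00),
    },
    "IMUL16rri": {
        "sandybridge": (4.02, 3.00),
        "haswell": (4.01, 3.00),
        "skylake": (4.01, 3.00),
    },
}


def find_mismatches(table, tolerance=0.5):
    """List (mnemonic, cpu, measured, modeled) where the two values disagree."""
    mismatches = []
    for mnemonic, per_cpu in table.items():
        for cpu, (measured, modeled) in per_cpu.items():
            if abs(measured - modeled) > tolerance:
                mismatches.append((mnemonic, cpu, measured, modeled))
    return mismatches


if __name__ == "__main__":
    for mnemonic, cpu, measured, modeled in find_mismatches(EXCERPT):
        print(f"{mnemonic} on {cpu}: measured {measured:.2f}, TD file {modeled:.2f}")
```

With this tolerance, only the IMUL16rri rows are reported, on all three CPUs.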
- Some instructions have different implementations depending on which registers are assigned. This is well known for cases like `xor eax, eax` and `xor eax, ebx`, where the first emits no uops (this happens during register renaming, see Agner Fog’s “Register Allocation and Renaming” in microarchitecture.pdf). But we found that this can go further. For example, SHLD64rri8 takes one cycle and runs on P06 in the `shld rax, rax, 0x1` case, but takes 3 cycles and runs on P1 in the `shld rbx, rax, 0x1` case. To the best of our knowledge, this has not yet been described.
Future Work
- [easy] Fix Intel Scheduling Models.
- [easy] Extend to memory operands.
- [easy] Make the tool work reliably for x87 instructions.
- [medium] A tool that automatically creates patches to TD files.
- [medium] Measure the effect of immediate/register values: Some instructions have performance characteristics that depend on the values they operate on. We should explore the value space (0, 1, ~1, 2^{8,16,32,64}, inf, nan, denorm…).
- [medium] Measure the effect of changing registers on instruction implementation (see results section above). Model this in LLVM TD schema.
- [hard] Make the tool work for instructions that have side effects (e.g. PUSH/POP, JMP, …). This might involve extending the TD schema with information on how to set up measurements for specific instructions.
- [??] Make the tool work for other CPUs. This mainly depends on the presence of performance counters.
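As a rough sketch of the value-space item above, the listed probe values could be enumerated like this. Both helper functions are hypothetical, and the choice to drop powers of two that do not fit in the register width is an assumption; the RFC only names the values, not how they would be generated.

```python
import math
import sys


def integer_probe_values(width=64):
    """Return 0, 1, ~1 and the listed powers of two that fit in `width` bits."""
    mask = (1 << width) - 1
    values = [0, 1, ~1 & mask]
    # 2^bits needs bits + 1 bits, so it is kept only when it fits (assumption).
    values += [1 << bits for bits in (8, 16, 32, 64) if bits < width]
    return values


def float_probe_values():
    """Return 0, 1, inf, nan and the smallest positive denormal double."""
    smallest_denormal = sys.float_info.min * sys.float_info.epsilon
    return [0.0, 1.0, math.inf, math.nan, smallest_denormal]


if __name__ == "__main__":
    print([hex(v) for v in integer_probe_values(64)])
    print(float_probe_values())
```

Each value would then be loaded into the instruction's input register or immediate before running the same latency and uops measurements.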