[llvm-mca] What's the difference between Rthroughput and "total cycles" in llvm-mca

What is the difference between the two? I thought “Rthroughput” is basically the number of cycles required to execute a single iteration at steady state, but this does not seem to match with the schedule/timeline generated by llvm-mca.
Thanks in advance,
Tom

Hi Tom,

Field ‘Total Cycles’ from the summary view simply reports the elapsed number of cycles for the entire simulation.

Rthroughput (from the “Instruction Info” view) is the reciprocal of the instruction throughput.

Throughput is computed as the maximum number of instructions of a same type that can be executed per clock cycle in the absence of operand dependencies.

Example (x86 - AMD Jaguar):

ADD EAX, ESI

The integer unit in Jaguar has two ALU pipelines. An ADD instruction can issue to any of those pipelines. That means, two independent ADD can be issue during a same cycle. Therefore, throughput is 2 (instructions per cycle), and RThrougput (1/throughput) is 0.5.

I hope it helps,
-Andrea

Hi Andrea,
So does this definition make sense for basic blocks with more than one instructions? E.g. how should one interpret a basic block with RThroughput of 2.3?

In the absence of data dependencies, throughput of a block of code is superiorly limited by the dispatch rate (i.e. our DispatchWidth), and the availability of hardware resources.

DispatchWidth is the maximum number of micro opcodes that can be dispatched to the out-of-order every cycle. That value inevitably affects the block throughput. Example: if a block in input decodes to 4 micro-opcodes in total, and the processor can only dispatch up to 2 opcodes per cycle, then the maximum block throughput cannot exceed 0.5 (i.e. one block every two cycles).

Block throughput is also constrained by the availability of hardware resources.

Example: if we have 4 ADD micro-opcodes, and each opcode consumes 1cy of ALU pipeline, then the block throughput is superiorly limited by N/4, where N is the number of ALU pipelines available on the target, and 4 is the number of ALU cycles consumed. So, if there is only 1 ALU pipeline, then the block throughput is superiorly limited to 1/4 = 0.25 (blocks per cycle)

Back to the computation of the “Block Throughput”.
It is statically computed as the reciprocal of the block throughput. As for the normal instruction throughput, the computation doesn’t take into account operand dependencies. Therefore, we could say that it is computed as the MAX of:

  • #MicroOpcodes of a block / DispatchWidth

  • #Consumed resource cycles / #Resources [ for every resource kind ].

In the absence of loop-carried dependencies between different iterations, the observed ‘uOps Per Cycle’ tends to a theoretical maximum throughput which can be computed by dividing the total number of uOps of a block by the Block RThroughput.

You can find more information about it in the llvm-mca docs under section “How LLVM-MCA works”.

I hope it helps!
-Andrea

In the absence of data dependencies, throughput of a block of code is superiorly limited by the dispatch rate (i.e. our DispatchWidth), and the availability of hardware resources.

DispatchWidth is the maximum number of micro opcodes that can be dispatched to the out-of-order every cycle. That value inevitably affects the block throughput. Example: if a block in input decodes to 4 micro-opcodes in total, and the processor can only dispatch up to 2 opcodes per cycle, then the maximum block throughput cannot exceed 0.5 (i.e. one block every two cycles).

Block throughput is also constrained by the availability of hardware resources.

Example: if we have 4 ADD micro-opcodes, and each opcode consumes 1cy of ALU pipeline, then the block throughput is superiorly limited by N/4, where N is the number of ALU pipelines available on the target, and 4 is the number of ALU cycles consumed. So, if there is only 1 ALU pipeline, then the block throughput is superiorly limited to 1/4 = 0.25 (blocks per cycle)

Back to the computation of the “Block Throughput”.

Sorry, I should have written “Block RThroughput” here.