[llvm-mca] Resource consumption of ProcResGroups

Hi,

I’m trying to work out the behavior of llvm-mca on instructions with ProcResGroups. My current understanding is:

When an instruction requests a port group (e.g., HWPort015) and all of its atomic sub-resources (e.g., HWPort0, HWPort1, HWPort5), HWPort015 is marked as “reserved” and is issued in parallel with HWPort0, HWPort1, and HWPort5. This blocks future instructions from reserving HWPort015 for the duration, but does not explicitly block any of HWPort0, HWPort1, or HWPort5 (although those ports are in fact blocked, because the instruction also requested usage of those ports).

When an instruction requests a port group (e.g., HWPort015) but only some of its atomic sub-resources (e.g., HWPort0 and HWPort1 but not HWPort5), then HWPort015 is scheduled according to the round robin scheduler, which in this case would decide to dispatch it on Port5.

This (I believe) explains the following reported timings on a basic block which consists of a single instruction with no dependencies and a small NumMicroOps (i.e., bottlenecked only by resource availability), where I have tried out different port maps and ResourceCycles; all of these are for 100 iterations, and a sketch of how the mappings can be encoded follows the list:

• When the resource mapping is: { HWPort0: 2 cycles, HWPort01: 2 cycles }, the instruction has a Total Cycles of 200, because the round-robin scheduler always assigns the HWPort01 resource to execute on HWPort1, so each iteration requires 2 cycles total.
• When the resource mapping is: { HWPort0: 2 cycles, HWPort1: 2 cycles, HWPort01: 2 cycles }, the instruction still has a Total Cycles of 200, because HWPort01 is marked as “reserved” and therefore issued in parallel to HWPort0 and HWPort1, so each iteration still requires 2 cycles total.

The one case that still confuses me is:

• When the resource mapping is: { HWPort0: 2 cycles, HWPort01: 4 cycles }, the instruction has a Total Cycles of 300. This seems to be because issuing to HWPort01 does not always block when I intuitively think it should (e.g., instructions are issued on Cycle 1, Cycle 3 (?), Cycle 7, Cycle 9 (?), and so on, where (?) indicates that I don’t think it should be possible to issue then, but it does anyway).
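Concretely, in HWWriteResPair form these mappings correspond to something like the following sketch (the scheduling class, latency, and uOP count are placeholders; only the port lists and ResourceCycles matter, and only one line would be active at a time):

// Sketch: WriteALU, the latency, and the uOP count are placeholders; each
// line stands in for one of the mappings above, one at a time.
defm : HWWriteResPair<WriteALU, [HWPort0, HWPort01], 1, [2, 2], 1>;              // { HWPort0: 2, HWPort01: 2 }
defm : HWWriteResPair<WriteALU, [HWPort0, HWPort1, HWPort01], 1, [2, 2, 2], 1>;  // { HWPort0: 2, HWPort1: 2, HWPort01: 2 }
defm : HWWriteResPair<WriteALU, [HWPort0, HWPort01], 1, [2, 4], 1>;              // { HWPort0: 2, HWPort01: 4 }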

Is this understanding of llvm-mca’s behavior correct? Are these observations intentional design decisions, limitations of llvm-mca’s model, bugs, or something else?

Thanks!
-Alex

Hi Alex,

Hi,

I’m trying to work out the behavior of llvm-mca on instructions with
ProcResGroups. My current understanding is:

When an instruction requests a port group (e.g., HWPort015) and all of its
atomic sub-resources (e.g., HWPort0, HWPort1, HWPort5), HWPort015 is marked
as “reserved” and is issued in parallel with HWPort0, HWPort1, and HWPort5.
This blocks future instructions from reserving HWPort015 for the duration,
but does not explicitly block any of HWPort0, HWPort1, or HWPort5 (although
those ports are in fact blocked, because the instruction also requested
usage of those ports).

Your understanding is correct.
The issuing of micro opcodes from future instructions is blocked until the
"reserved" group is released.

From the point of view of llvm-mca, a group behaves like a hardware
scheduler. On Haswell, HWPort0 is a resource shared by multiple schedulers
(for example: HWPort0 is contained in both HWPort01 and HWPort0156).
An instruction that writes to port HWPort0 implicitly "uses" (i.e. is
implicitly dispatched to) both HWPort01 and HWPort0156.
For that reason, instructions that only consume HWPort0 are still
artificially delayed by llvm-mca if HWPort0156 is temporarily unavailable.
llvm-mca internally knows which groups are implicitly used by which
instruction (see the return statement at the end of method
`ResourceManager::checkAvailability(const InstrDesc &Desc)`).
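For reference, this is (roughly) how those groups are declared in the Haswell model; a group is just a ProcResGroup whose units are the individual ports:

// X86SchedHaswell.td (abridged): any write to HWPort0 implicitly uses every
// group that lists HWPort0 among its units.
def HWPort01   : ProcResGroup<[HWPort0, HWPort1]>;
def HWPort0156 : ProcResGroup<[HWPort0, HWPort1, HWPort5, HWPort6]>;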

When an instruction requests a port group (e.g., HWPort015) but only some
of its atomic sub-resources (e.g., HWPort0 and HWPort1 but not HWPort5),
then HWPort015 is scheduled according to the round robin scheduler, which
in this case would decide to dispatch it on Port5.

Correct. It is mostly a round-robin.

This (I believe) explains the following reported timings on a basic block
which consists of a single instruction with no dependencies and a small
NumMicroOps (i.e., only bottlenecked by resource availability), where I
have tried out different port maps and ResourceCycles (all of these are for
100 iterations):

   - When the resource mapping is: { HWPort0: 2 cycles, HWPort01: 2
   cycles }, the instruction has a Total Cycles of 200, because the
   round-robin scheduler always assigns the HWPort01 resource to execute on
   HWPort1, so each iteration requires 2 cycles total.
   - When the resource mapping is: { HWPort0: 2 cycles, HWPort1: 2
   cycles, HWPort01: 2 cycles }, the instruction still has a Total Cycles of
   200, because HWPort01 is marked as “reserved” and therefore issued in
   parallel to HWPort0 and HWPort1, so each iteration still requires 2 cycles
   total.

The one case that still confuses me is:

   - When the resource mapping is: { HWPort0: 2 cycles, HWPort01: 4
   cycles }, the instruction has a Total Cycles of 300. This seems to be
   because issuing to HWPort01 does not always block when I intuitively think
   it should (e.g., instructions are issued on Cycle 1, Cycle 3 (?), Cycle 7,
   Cycle 9 (?), and so on, where (?) indicates that I don’t think it should be
   possible to issue then, but it does anyway).

Strange. I am unable to replicate your results.

That last case is semantically equivalent to the other case where you have
the following resource usage: { HWPort0: 2 cycles, HWPort01: 2 cycles }.
The only difference is that HWPort01 is consumed for 4cy instead of 2cy, so
HWPort1 is effectively consumed for 4cy.

To verify that point, I tried the following experiment:

In the Haswell model, I have replaced the following line:

-defm : HWWriteResPair<WriteALU,    [HWPort0156], 1>;
+defm : HWWriteResPair<WriteALU,    [HWPort0, HWPort01], 1, [2, 4], 1>;

Then I have executed 100 iterations of the following kernel:

        addq    %rax, %rcx
        addq    %rax, %rdx
        addq    %rax, %rsi
        addq    %rax, %rdi

This was the output from the summary view:

Iterations:        100
Instructions:      400
Total Cycles:      1600
Total uOps:        400

Dispatch Width:    4
uOps Per Cycle:    0.25
IPC:               0.25
Block RThroughput: 8.0

As expected, this is equivalent to 400 cycles per 100 iterations of a single
instruction (not 300): one instruction executes every 4 cycles (due to the
pressure on HWPort1).

This is the resource pressure view:

Resources:
[0]   - HWDivider
[1]   - HWFPDivider
[2]   - HWPort0
[3]   - HWPort1
[4]   - HWPort2
[5]   - HWPort3
[6]   - HWPort4
[7]   - HWPort5
[8]   - HWPort6
[9]   - HWPort7

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]
 -      -     8.00   16.00   -      -      -      -      -      -

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -     2.00   4.00    -      -      -      -      -      -     addq %rax, %rcx
 -      -     2.00   4.00    -      -      -      -      -      -     addq %rax, %rdx
 -      -     2.00   4.00    -      -      -      -      -      -     addq %rax, %rsi
 -      -     2.00   4.00    -      -      -      -      -      -     addq %rax, %rdi

Is this understanding of llvm-mca’s behavior correct? Are these observations
intentional design decisions, limitations of llvm-mca’s model, bugs, or
something else?

These are intentional design decisions (I will try to explain why below).

Note that (as far as I remember) these decisions should only affect Haswell
and Broadwell. Those two models were probably ported from existing processor
descriptors in IACA. Presumably, the original intention of the author was to
describe the resource consumption of each individual micro-opcode; that is
because (as far as I remember) IACA is able to independently track/simulate
individual uOPs of an instruction. That also explains why, if we look at the
Haswell model in llvm, the number of consumed resources almost always
matches the number of uOPs declared by a scheduling class.

The llvm scheduling model is quite simple and doesn't allow mca to
accurately simulate the execution of individual uOPs. That limitation is
sort-of acceptable if you consider that the scheduling model framework was
originally designed with a different goal in mind (i.e. machine scheduling).
The lack of expressiveness of the llvm scheduling model unfortunately limits
the accuracy of llvm-mca: we know the number of uOPs of an instruction, but
we don't know which resources are consumed by which micro-opcodes. So we
cannot accurately simulate the independent execution of the individual
micro-opcodes of an instruction.

Another "problem" is that it is not possible to describe when uOPs
effectively start consuming resources. At the moment, the expectation is
that resource consumption always starts at relative cycle #0 (relative to
the instruction issue cycle).
Example: a horizontal add on x86 is usually decoded into a pair of shuffle
uOPs and a single (data-dependent) vector ADD uOP.
The ADD uOP doesn't execute immediately because it needs to wait for the
other two shuffle uOPs. That means the ALU pipe is still available at
relative cycle #0 and is only consumed starting from relative cycle #1
(assuming that both shuffles can start execution at relative cycle #0). In
practice, the llvm scheduling model only allows us to declare which
pipeline resources are consumed, and for how long (in number of cycles). So
we cannot accurately describe to mca the delayed consumption of the ALU
pipe.
Now think about what happens if the first shuffle uOP consumes 1cy of
HWPort0, the second shuffle uOP consumes 1cy of HWPort1, and the ADD
consumes 1cy of HWPort01. We end up in that "odd" situation you described,
where HWPort01 is "reserved" for 1cy.
In reality, that 1cy of HWPort01 should have started 1cy after the other
two opcodes; at that point, both pipelines would have been seen as available.
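To make that concrete, the best the current model can express is something like the following sketch (the def name is made up and the numbers are illustrative); since all three resources start being consumed at relative cycle #0, and HWPort0 plus HWPort1 cover every unit of HWPort01, the group ends up "reserved":

// Sketch only: HWWriteHAddExample is a made-up name, numbers are illustrative.
// All resource consumption starts at relative cycle #0, so HWPort01 becomes
// "reserved" even though the ADD uOP would really start one cycle later.
def HWWriteHAddExample : SchedWriteRes<[HWPort0, HWPort1, HWPort01]> {
  let Latency        = 5;
  let NumMicroOps    = 3;
  let ResourceCycles = [1, 1, 1];
}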

In conclusion, the presence of a "reserved" flag is not ideal, but it is
sort-of a consequence of the two limitations mentioned above (plus the way
the Haswell and Broadwell models were originally designed).

I hope it helps,
-Andrea

Food for thought…

It would be easy to add a DelayCycles vector to SchedWriteRes to indicate the relative start cycle for each reserved resource. That would effectively model dependent uOps.

NumMicroOps is only meant to model any general limitation of the cpu frontend to issue/rename/retire micro-ops. So, yes, there’s no way to associate resources with specific uOps. You can mark any kind of resource as “dynamically scheduled” (BufferSize = -1). If an instruction uses different kinds of dynamic resources, then those need not be reserved at the same time. If we had the DelayCycles vector, it could be interpreted as “this resource must be reserved N cycles after prior reservations of other resources”.
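For example, the horizontal-add case Andrea described could then be written along these lines (hypothetical sketch: DelayCycles does not exist in SchedWriteRes today, and the def name and numbers are only illustrative):

// Hypothetical sketch of the proposal; DelayCycles is not an existing field.
def HWWriteHAddExample : SchedWriteRes<[HWPort0, HWPort1, HWPort01]> {
  let Latency        = 5;
  let NumMicroOps    = 3;
  let ResourceCycles = [1, 1, 1];
  let DelayCycles    = [0, 0, 1];  // the dependent ADD uOP reserves HWPort01 1cy later
}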

-Andy

Thanks, that’s very helpful!

Also, sorry for the miscue on that bug with the 2/4 cycles — I realize now that that’s an artifact of a change that I made to not crash when resource groups overlap without all atomic subunits being specified:

`echo 'fxrstor (%rsp)' | llvm-mca -mtriple=x86_64-unknown-unknown -march=x86-64 -mcpu=haswell`
crashes (because fxrstor requests `HWPort0,HWPort6,HWPort23,HWPort05,HWPort06,HWPort15,HWPort0156`, so HWPort0156 ends up asserting because 0,1,5, and 6 are all taken), so I added:

--- a/llvm/lib/MCA/HardwareUnits/ResourceManager.cpp
+++ b/llvm/lib/MCA/HardwareUnits/ResourceManager.cpp
@@ -292,7 +292,7 @@ void ResourceManager::issueInstruction(
ResourceState &RS = *Resources[Index];

- if (!R.second.isReserved()) {
+ if (!R.second.isReserved() && RS.isReady()) {
ResourceRef Pipe = selectPipe(R.first);
use(Pipe);
BusyResources[Pipe] += CS.size();

which is probably the cause of that weird behavior I reported.

I’m also somewhat curious about what “NumUnits” is modeling: I haven’t totally worked through the code yet, but it seems that when more (not necessarily atomic) sub-resources of a resource group are requested, more “NumUnits” of that group are requested. This doesn’t seem particularly intuitive to me, at least in my mental model of Haswell scheduling, and it also leads to some infinite loops: for example, `echo 'fldenv (%rsp)' | llvm-mca -mtriple=x86_64-unknown-unknown -march=x86-64 -mcpu=haswell` requests HWPort0, HWPort1, HWPort01, HWPort05, and HWPort015, meaning that HWPort015 never schedules because it requests 4 “NumUnits” but only 3 are ever available. Is there some particular behavior that this is modeling?

-Alex

Hi Alex,


The issue with fxrstor is unfortunately a bug. I'll see if I can fix it.

Strictly speaking, that "NumUnits" quantity is literally the number of
processor resource units consumed. By construction, it should never exceed
the actual number of resource units declared by a processor resource in the
scheduling model (i.e. this particular field of MCProcResourceDesc:
https://llvm.org/doxygen/structllvm_1_1MCProcResourceDesc.html#a9d4d0cc34fcce4779dc4445d8265fffc).
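For example, a group's unit count comes straight from its declaration in the Haswell model, so HWPort015 can never provide more than 3 units:

// X86SchedHaswell.td (abridged): HWPort015 has exactly 3 units (its ports),
// so no scheduling class should ever ask for 4 units of it.
def HWPort015 : ProcResGroup<[HWPort0, HWPort1, HWPort5]>;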

The reason why you see the crash/assert is that the number of resource
units for scheduler resources is incorrectly set for that instruction by
the instruction builder in llvm-mca (lib/MCA/InstrBuilder.cpp - function
`initializeUsedResources()`). I am looking at it.

About what Andy suggested: I really like the idea of having a vector of
delay-cycles for consumed resources. It would fix most of these problems in
Haswell, and it could be used to improve the description of some
instructions on other processor models too. I agree that it should not be
difficult to implement (at least, the tablegen part should be pretty
mechanical); the resource allocation logic in class ResourceManager (in
llvm-mca) would require a bit of refactoring. Other than that, the rest
should be doable.

-Andrea

Hi Alex,

This issue should be fixed by commit 47b95d7cf462.
I now understand why there were so many questions about the meaning of
reserved resources :-).
Could you please check if that commit fixes the issue for you too?

As Andy wrote, in the future we should really look into adding an optional
DelayCycles vector for SchedWriteRes.
That would be the ideal improvement; it would also allow us to get rid of
the "reserved" bit.

Thanks,
-Andrea

Hi Andrea,

Yep, that commit fixes that issue for me too.

One last question, just to make 100% sure I understand the specification that llvm-mca is solving, and its design constraints:

{HWPort01: 3} should _always_ mean (at the instruction specification level, even if not in the llvm-mca implementation) that there are 3 cycles dispatched to either of HWPort0 or HWPort1, regardless of whether HWPort0 or HWPort1 are additionally specified, right? And so the decision to use the “reserved” flag is one way of implementing that specification, and not necessarily a reflection of any desired execution behavior beyond that?

Thanks!
-Alex

Hi Alex,

Correct.

It literally only means 3cy on HWPort0 or HWPort1.
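In scheduling-model terms, that mapping is nothing more than the following (minimal sketch; the def name is made up):

// Minimal sketch; which of HWPort0/HWPort1 actually executes those 3 cycles
// is left entirely to the simulated scheduler.
def ThreeCyOnHWPort01 : SchedWriteRes<[HWPort01]> {
  let ResourceCycles = [3];
}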

The "reserved" flag is an implementation detail. It is one way to solve the
issue caused by the lack of available units in a group.
As you wrote, the specs don't require the presence of a reserved bit. To be
fair, there is no good way to address that particular "issue", so different
approaches could be used.

When I first designed mca, I tried a couple of different approaches;
however, I wasn't really happy with any of them. Eventually I opted for the
"reserved" flag approach. Generally speaking, I think using that flag is
"fine" for now (especially considering that, in practice, it is only
required to describe a few heavily microcoded instructions in a couple of
x86 processor models). That being said, I hope that in the future we will be
able to get rid of that flag by implementing the DelayCycles vector
suggested by Andy. I have raised PR45873 to track progress on that
particular task (https://bugs.llvm.org/show_bug.cgi?id=45873).

I hope it helps,
-Andrea
