How can I model an instruction that gets broken down into two µ-ops?

Some AArch64 targets (and others) have instructions that internally get broken down into two or more µ-ops.

For example, the instruction

          str x0, [x1], #8

which performs a store + index update, can be broken down into

          str x0, [x1]
          add x1, x1, #8

The STR+ADD sequence makes the dependency on the post-increment result explicit through register X1. As a consequence, dependent instructions that need the value of X1 only have to wait for the ADD, and not for the whole STR+ADD (which requires both X0 and X1).
In other words, instructions waiting on X1 do not have to wait until the whole store finishes (and X0 becomes available and committed, for example); they only have to wait for X1 itself to become available.

Could anyone give me some pointers on how this can be modelled? All scheduling models I checked seem to model the instruction as a whole, and therefore cases like the one above are modelled incorrectly.

If I try to rephrase the problem, the issue is that the post-increment STR instruction has two outputs with two different latencies, and the increment operation does not need to wait for the availability of all the inputs. Is that right?

Yes, pretty much. In this case, the increment operation only needs one of the operands (X1).


Maybe a more illustrative example would be:

          ldr   x0, [x1], #8

which writes both x0 (when the load occurs) and x1 (when the increment occurs).

Tablegen lets you independently specify when each of these operands is written, regardless of whether the instruction is decomposed into micro-ops or not. The schedulers all use that information, so this shouldn’t be a problem. This also applies to STR instructions - the schedulers would all see the increment of x1 happening in cycle 1, and subsequent dependent instructions would use that to calculate the latency.
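
To make that concrete, here is a minimal sketch of what per-operand writes can look like in a scheduling model. The model, unit, and write names are made up for illustration, and the write-to-def ordering is shown for exposition only (the upstream AArch64 files use the WriteLD/WriteAdr pair for this):

      // Sketch only - MyModel, MyLdSt, MyALU etc. are illustrative names.
      let SchedModel = MyModel in {
        def MyLdSt : ProcResource<1>;
        def MyALU  : ProcResource<1>;

        // One SchedWriteRes per output, each with its own latency.
        def MyWriteLD  : SchedWriteRes<[MyLdSt]> { let Latency = 4; } // loaded value
        def MyWriteAdr : SchedWriteRes<[MyALU]>  { let Latency = 1; } // base update

        // The writes map, in order, onto the instruction's defs, so the
        // scheduler sees the base update ready at cycle 1 and the loaded
        // value at cycle 4.
        def : InstRW<[MyWriteLD, MyWriteAdr], (instrs LDRXpost)>;
      }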

(Similarly, the aforementioned MDL language allows you to specify when each operand reference is performed, independently of all the others).

If you want to decompose instructions into instructions that reflect micro-ops in the compiler, you’re going to have to cope with the explicitly dependent instructions you create. There’s more than one way to do that:

            ldr    x0, [x1], #8             // the original instruction

or (as you suggested; subsequent instructions then depend on the ldr and the add)

            ldr    x0, [x1]
            add    x1, x1, #8

or (alternatively put the add before the load)

            add    x1, x1, #8
            ldr    x0, [x1, #-8]

or (make the add completely independent of the load - and the pointer now lives in x2)

            add    x2, x1, #8
            ldr    x0, [x1]

But there’s no need to do any of these, unless you want to fold the add instruction into some other operation.

Yep, I agree that using different writes for loads generally works (with some caveats), because a load only depends on its address operand.

However, consider this snippet for stores:

Timeline view:
                    0
Index     0123456789 

[0,0]     DeeeER    .   umulh	x2, x1, x1
[0,1]     D===eER   .   str	x2, [x1], #0
[1,0]     D====eeeER.   umulh	x2, x1, x1
[1,1]     D=======eER   str	x2, [x1], #0

Should the second umulh be waiting for the end of the first str? The umulh only needs to wait until x1 is updated, which could happen much sooner. Compare the above snippet with the following:

Timeline view:
Index     01234567

[0,0]     DeeeER .   umulh	x2, x1, x1
[0,1]     D===eER.   str	x2, [x1]
[0,2]     DeE---R.   add	x1, x1, #0
[1,0]     D=eeeER.   umulh	x2, x1, x1
[1,1]     D====eER   str	x2, [x1]
[1,2]     D=eE---R   add	x1, x1, #0

Here, as you see, the second umulh can start much sooner, because we make the address increment explicit with the add. This is a much more accurate approximation of what the hardware does.
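
For reference, the timelines above are llvm-mca timeline views; something along these lines reproduces them (the CPU name is a placeholder for whichever model you are testing):

      # -timeline prints the per-instruction pipeline view shown above;
      # -iterations=2 produces the two [0,*]/[1,*] iterations.
      llvm-mca -mtriple=aarch64 -mcpu=<your-cpu> -timeline -iterations=2 test.s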

I hope this helps explain why I’d like to model the stores as two independent µ-ops. Also, would you mind giving me some pointers on how the decomposition you mentioned can be performed in LLVM?

I haven’t looked into MDL yet, but thanks for the link, it looks quite powerful! Do you know if it works with MCA?

I don’t know.

If I read the thread correctly, there are effects in cores that cannot be modelled with the current scheduling models - namely the two µ-ops.

Assuming there were a next-gen way to describe scheduling models and a new machine scheduler, would there be a noticeable performance effect?

It should - I’ve (re)implemented all the APIs that MCA uses, but haven’t done anything more than run the default tests.

Disclaimer: I have no real practical experience with MCA. But a few thoughts anyway… :slight_smile:

Looking through the AArch64 tablegen files, the STR instructions all seem to use the WriteAdr resource to model the address increment. For all targets except Kryo, this happens in cycle 0 or 1. In Kryo it’s modeled to happen in cycle 6, which seems to reflect what’s going on in your example. (Use

      grep "def.*WriteAdr.*Latency" *.td

to find them all).

But that doesn’t seem right - typically the increments happen early in the pipeline, so I wonder if this is a “bug” in the td file. Note that there really isn’t any tablegen description (AFAIK) of when store instructions actually finish writing to memory. All the instruction timing information has to do with when registers are written (and read).

So if I’m right, you might look into when the increments actually take place (in the hardware spec), and perhaps experiment with setting the latency of WriteAdr to 1 rather than 6. I’d hazard a guess that you shouldn’t have to restructure your code to avoid this problem. If, in fact, the increment happens in cycle 6, perhaps you don’t want to use that addressing mode in the first place.
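
Concretely, the experiment would be a one-line change along these lines in the Kryo model - a sketch, since I haven’t checked the actual resource name used there:

      // In AArch64SchedKryo.td (KryoUnitXY is a placeholder for
      // whatever resource the file really uses):
      def : WriteRes<WriteAdr, [KryoUnitXY]> { let Latency = 1; } // was 6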

Regarding MDL - currently, our MDL descriptions for upstream targets are generated by scraping information from tablegen, so if you used MDL with MCA, you’d likely get the exact same results.

As for hints on how to do these transformations in LLVM, that’s not something I would be terribly helpful with. Using instcombine-style rewrite rules could be problematic, since rewriting one instruction as two often leads to cycles in the rewrite process (where the two later become one again).

Sorry for the slow response. I found a talk by Apple. They are trying to improve scheduling:

I’m sorry too, it’s been a hectic week.

Thanks for the link, that’s a great talk! They make this definition:

Instruction is Ready @<cycle> → All input data needed by an instruction is ready

which precisely touches on my point: what if an instruction doesn’t need all the input data at the same time?

In my previous example we have:

str x0, [x1], #8  ->  str x0, [x1]     // Waits on x0, x1
                      add x1, x1, #8   // Waits on x1

So, effectively, the str needs to wait for x0 and x1 to become available, but the add only needs to wait on x1. This small nuance can have a big impact on the accuracy of the estimates, especially if x0 is produced by a chain of high-latency instructions.
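
For instance (illustrative registers; the comments describe the intended hardware behaviour):

      mul  x0, x2, x3      // stands in for a long-latency chain producing x0
      str  x0, [x1], #8    // the store itself must wait for x0...
      ldr  x4, [x1]        // ...but this load only needs the updated x1,
                           // which the increment µ-op can produce early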

Either I’m missing something, or this isn’t possible to model with the current infrastructure in LLVM.

In the example I showed, I was using a CPU that defines WriteAdr in cycle 0, so that shouldn’t be the problem. I believe the problem lies in the way dependencies are defined blindly for the whole instruction. In most cases this view is correct: the whole instruction depends on all its inputs. However, sometimes, especially if the instruction is broken down into independent micro-ops internally, this view can lead to inaccuracies where micro-ops wait on “faux” dependencies - which can lead to very incorrect estimates. In the example above, in practice, the ADD part of the post-increment STR does not have to wait on x0, but it is led to believe it does due to the way the inputs of the STR are modelled.

I’ve built MDL, but haven’t had a chance to try it out. Once I do I’ll report back!

I copied this from your V2 model:

// Strip pointer authentication code
def : InstRW<[V2Write_2cyc_1M0], (instrs XPACD, XPACI, XPACLRI)>;

Maybe in future scheduling models, instructions could be described by an aggregate:

instruction<"str"> {
  ports = [xxxx]
  inputs = [x0, x1]
  outputs = [
    memoryStore = 2 uops /*after*/,
    scalar = 1 uops /*after*/
  ]
}

In my opinion the key thing this aggregate is missing is a way to specify the relationship between inputs and outputs, so that you can specify what depends on what explicitly.
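
For example, one could imagine extending it with per-µ-op operand lists, so each µ-op carries its own dependencies (again, purely invented syntax):

      instruction<"str_post"> {
        uops = [
          uop<store> { reads = [x0, x1] },          // data + address
          uop<incr>  { reads = [x1], writes = [x1], // base update only
                       latency = 1 }
        ]
      }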

This was just one idea. I am afraid that it might hardcode AArch64 terms and be hard to generalize to other architectures.

I’ve been looking at your example of what MCA produces for your original sequence, and to me it looks like MCA is misinterpreting the tablegen information, since MCA isn’t reflecting what the processor actually does with this sequence.

In general there’s kind of an impedance mismatch between what LLVM is trying to model vs what MCA is trying to model. For any kind of dynamic-issue machine, LLVM can’t accurately model dynamic instruction execution timings - it has an inherently static model, so the best it can do is use heuristics to order instructions so that a dynamic machine will see instructions in a reasonable order - minimizing unit traffic and pipeline stalls in the “typical” case. It has two general heuristics: instruction latencies, and aggregate instruction unit requirements. While neither is “precise”, it is generally sufficient to produce efficient code. However, it is not sufficient for modeling precise pipeline behavior in MCA, so MCA has to make some assumptions.

It appears that MCA assumes that “str” doesn’t start execution until all of its inputs are available, which isn’t accurate for out-of-order architectures with uOps, where each uOp can start executing as soon as an appropriate unit and each of its inputs are available. In your case, MCA models the str’s read of x2 and the writeback of x1 as happening in the same pipeline phase - so if one gets stalled, the other is stalled as well. This is what TableGen models, because that’s sufficient for doing the “right thing” in instruction scheduling - even if the hardware treats them completely independently.

So I don’t believe that “better resource modeling” in tablegen (per the mentioned presentation) will fix your MCA problem. Tablegen does in fact tie register references to units/resources, but only requires it for writes, so most instruction reads are implicit in tablegen and therefore unassociated with units. And of course in most cases a processor’s uOps aren’t even documented - sometimes even internally to the vendor - so you can’t model dependencies between issued uOps. Tablegen doesn’t even try to do that - it just models which resources are required, and uses that as a pretty effective heuristic for ordering instructions so that they will typically issue and execute efficiently.

The bottom line is that I strongly suspect (without further knowledge of V2) that your second sequence of instructions will not run faster than the original sequence - ignoring any external state. Since it’s slightly longer, it may even run marginally slower (due to instruction cache).

Regarding the presentation: it looks like their “performance improvements” are based on MCA’s output - which I’m pretty sure isn’t accurate - rather than the actual behavior of the hardware. After a quick glance at their examples, I’m guessing the hardware would execute those schedules in virtually the same time. And FWIW, MDL can model all of this kind of stuff - much more than MachineScheduler can actually grok at this point. And it can also handle the significantly more complex information required for VLIW machines.

[One final note: if you’re going to play around with MDL, please wait until I’ve done a push - the current one doesn’t include V2.]

Hi @reidtatge, thank you very much for your very informative answers!
JFYI: we will be presenting our initial findings at the upcoming LLVM dev conference (“LLVM-MCA correlation for AArch64”). We have ‘discovered’ quite a few things already, but appreciate that this is just the beginning; we definitely need to have a look at MDL (and other things).