Some AArch64 targets (and others) have instructions that internally get broken down into two or more µ-ops.
For example, the instruction
    str x0, [x1], #8
which performs a store plus an index update, can be broken down into
    str x0, [x1]
    add x1, x1, #8
The STR+ADD sequence makes the dependency of the post-increment operation explicit through register X1. As a consequence, dependent instructions needing the value of X1 only need to wait for the ADD, and not the whole STR+ADD (which requires both X0 and X1).
In other words, instructions waiting on X1 do not necessarily have to wait until the whole store finishes (for example, until X0 becomes available and the store is committed); they only have to wait for the updated X1 to become available.
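To make the difference concrete, here is a minimal Python sketch of the two ways of modelling the instruction. It is only a toy: the function names, the cycle numbers, and the latencies are all made up for illustration and are not taken from any real core or from any existing scheduling model.

```python
# Toy model: every cycle count below is an illustrative assumption,
# not a measurement from any real AArch64 core.

def x1_ready_whole_instruction(x0_ready, x1_ready, store_latency):
    """Post-indexed STR modelled as one unit: the updated X1 only counts
    as available once the whole instruction has completed."""
    issue = max(x0_ready, x1_ready)  # needs both the data and the base
    return issue + store_latency


def x1_ready_split_uops(x1_ready, add_latency):
    """Post-indexed STR modelled as separate STR and ADD uops: the ADD
    reads only X1, so the updated X1 does not depend on X0 at all."""
    return x1_ready + add_latency


# Hypothetical scenario: X1 is ready at cycle 0, X0 arrives late at cycle 10.
x0_ready, x1_ready = 10, 0
store_latency, add_latency = 4, 1

whole = x1_ready_whole_instruction(x0_ready, x1_ready, store_latency)
split = x1_ready_split_uops(x1_ready, add_latency)

print(f"whole-instruction model: updated X1 ready at cycle {whole}")
print(f"split uop model:         updated X1 ready at cycle {split}")
```

With these assumed numbers, the whole-instruction model makes a consumer of X1 wait for the late X0 as well, while the split model lets it proceed as soon as the ADD is done.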
Could anyone give me some pointers on how this can be modelled? All scheduling models I checked seem to model the instruction as a whole, and therefore cases like the one above are modelled incorrectly.