That’s certainly a good point, and there is a relationship between uops and throughput on larger cores. Even there, however, more uops often correspond to higher latency (e.g., a combined load + mul might get split into uops for address generation, cache access, and the arithmetic, but in the end there’s still a dependency chain that needs to complete before the entire instruction retires). Regardless, I’m perfectly happy saying that what we’re trying to measure here is execution time, and adding latencies and counting uops might be equally poor approximations of it across many types of cores.
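To make that concrete, here is a tiny sketch (hypothetical latencies, not any real core’s figures) of why counting uops and adding latencies say different things about the same load + mul:

    #include <cstdio>
    #include <vector>

    struct Uop {
      const char *Name;
      unsigned Latency; // cycles before dependent uops can start
    };

    int main() {
      // The combined load + mul splits into one dependency chain:
      // address generation -> cache access -> multiply.
      std::vector<Uop> Chain = {{"agu", 1}, {"load", 4}, {"mul", 3}};

      unsigned LatencySum = 0;
      for (const Uop &U : Chain)
        LatencySum += U.Latency;

      // "Count uops" says 3; "add latencies" says 8. On a wide core,
      // independent work overlaps with this chain, but the chain itself
      // still has to complete before the instruction retires.
      std::printf("uops = %zu, chain latency = %u\n", Chain.size(), LatencySum);
      return 0;
    }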
Here’s how I think about this:
For inlining, to a lesser extent unrolling, and perhaps other use cases as well, the problem is that we really have (at least) two effects that we’re trying to balance with one metric. One is execution time: on an in-order core, adding the latencies might be a good approximation, but for larger cores we likely need to account for the number of uops and throughput as well. The second is code size, or perhaps the ratio of code size to execution time, and the problem here is that we’re trying to solve a global optimization problem using a local indicator. Globally, we need to worry about i-cache pressure, TLB pressure, and so on, all of which are made worse by increasing code size, but there’s no real way to reason about these one function at a time. Yet we need to try (using some heuristic). There can certainly be instructions that have large encodings relative to their execution speed (perhaps even after accounting for any reduced throughput due to decoder limitations), and we might need to account for this somehow.
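As a purely illustrative way of stating the two effects, here is roughly the shape of the local proxies we end up with; none of these names or formulas come from LLVM:

    #include <algorithm>
    #include <vector>

    struct InstCost {
      unsigned Latency;   // cycles on the dependency chain
      unsigned NumUops;   // decoded micro-ops
      unsigned SizeBytes; // encoded size
    };

    // Execution-time proxy: a block is bounded either by its dependency
    // chain (sum of latencies) or by issue width (uops / width). An
    // in-order core is closer to the first term; a large out-of-order
    // core needs the second as well.
    double estimateCycles(const std::vector<InstCost> &Block,
                          unsigned IssueWidth) {
      unsigned LatencySum = 0, Uops = 0;
      for (const InstCost &I : Block) {
        LatencySum += I.Latency;
        Uops += I.NumUops;
      }
      return std::max<double>(LatencySum, double(Uops) / IssueWidth);
    }

    // Code-size proxy: locally, all we can see is bytes. The things we
    // actually care about (i-cache pressure, TLB pressure) are global.
    unsigned estimateSize(const std::vector<InstCost> &Block) {
      unsigned Bytes = 0;
      for (const InstCost &I : Block)
        Bytes += I.SizeBytes;
      return Bytes;
    }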
There is a small caveat here: for some cores, we really do try to estimate uops, at least for unrolling. Large Intel cores, for example, have a uop buffer used when dispatching small loops (the LSD, etc.), and its size limit is specified in uops.
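For example (with a made-up buffer size, just to show the shape of the constraint):

    // Sketch: cap the unroll factor by a uop budget, the way an LSD-style
    // loop buffer would. The 64-uop limit is illustrative, not a claim
    // about any particular core.
    unsigned maxUnrollForUopBuffer(unsigned UopsPerIteration,
                                   unsigned UopBufferSize = 64) {
      if (UopsPerIteration == 0 || UopsPerIteration > UopBufferSize)
        return 1; // the loop does not fit at all; leave it alone
      return UopBufferSize / UopsPerIteration;
    }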
Of course, the way that inlining thresholds are tuned, they also end up accounting for follow-on simplification opportunities enabled by inlining, which also cannot be reasoned about locally. This is, in part, what makes changing anything in this space so tricky: anything you change will cause some regressions that no local, causal reasoning will be able to fix.
So maybe we cannot do better right now than “size and latency”, but I feel that we should certainly think about how we might do better, and in addition, about how we might provide guidance to backend developers on setting these values (other than just playing with things and running a lot of tests).
I think the way forward is, like you say, for the backend to be given the opportunity to report a cost for each transform. Ideally, the backend can then choose which cost is important to it, as well as a suitable threshold, and be given a larger region of code if necessary. It would be great to move the ad-hoc cost modelling out of the analysis and transform passes, while they can always fall back to their original costing if a backend doesn’t exist or doesn’t return anything.
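A rough sketch of the shape such an interface might take (every name here is hypothetical; this is not an existing LLVM interface):

    #include <optional>

    struct Region; // stand-in for whatever IR region the pass hands over

    enum class CostKind { Latency, Throughput, CodeSize, Uops };

    struct TransformCost {
      CostKind Kind; // the metric this target says matters for the region
      int Cost;      // the target's estimate for the transformed region
      int Threshold; // the target's own acceptance threshold
    };

    struct TargetCostReporter {
      virtual ~TargetCostReporter() = default;
      // Return std::nullopt if the target has no opinion, in which case
      // the pass falls back to its existing ad-hoc cost model.
      virtual std::optional<TransformCost>
      costOfTransform(const Region &Candidate) = 0;
    };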
Sounds good to me.
Thanks again,
Hal