[RFC] Being explicit in the cost model

Hi,

I have been working in the TargetTransformInfo layer for the last few weeks with several ongoing goals:

Remove ambiguity.
Improve cost modelling performance, accuracy and utility.
Help empower the backends.
Reduce the surface area of the API.
Reduce the dependencies between the layers - of which there are many!

My latest patch is an NFC to help address the issue of ambiguity; I have uploaded it here: https://reviews.llvm.org/D79002. It’s a biggy, adding TargetCostKind as an argument to almost all the get*Cost methods. My hope is that, as well as adding clarity, it will allow some backends to re-use and correlate costs if they wish. It would also allow adding a hook to query the backend for which cost kind is most important to it, which could then be passed around.

But as the patch currently stands, even as an NFC, it could still lead users astray because nothing else is set up to differentiate and return a specific cost. I’d like to hear from the backend people what kinds of costs they were modelling for and how we should handle the ‘default’ behaviour: for instance, whether they’re happy to return the same cost for all cost kinds, or whether a check should be performed on the cost kind first, returning something specific only for the expected kind. There’s also the question of default argument values - I’ve used RecipThroughput for the calls used by the vectorizers and SizeAndLatency for the rest, but I wonder if default values should be used at all.
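To make the shape of the change concrete, here is a hedged sketch of what such a query could look like - the function name echoes the TTI API, but the signature, default argument, and cost numbers below are purely illustrative, not the actual interface:

```cpp
#include <cassert>

// Illustrative sketch: the cost kind becomes an explicit parameter on each
// query, and a backend may either return one blended number for every kind
// or branch on the kind and return something specific.
enum TargetCostKind {
  TCK_RecipThroughput, // Reciprocal throughput.
  TCK_Latency,         // The latency of instruction.
  TCK_CodeSize,        // Instruction code size.
  TCK_SizeAndLatency   // The weighted sum of size and latency.
};

// A vectorizer-facing query might default to reciprocal throughput.
int getArithmeticInstrCost(int Opcode,
                           TargetCostKind Kind = TCK_RecipThroughput) {
  (void)Opcode; // A real backend would also dispatch on the opcode.
  switch (Kind) {
  case TCK_RecipThroughput:
    return 1; // e.g. a fully pipelined ALU op: one result per cycle
  case TCK_Latency:
    return 3; // e.g. three cycles until the result is available
  case TCK_CodeSize:
  case TCK_SizeAndLatency:
    return 1; // a single instruction / single uop
  }
  return 1;
}
```

This also shows the default-argument question in miniature: any caller that omits the kind silently asks for reciprocal throughput, which may or may not be what it meant.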

Regards,
Sam

Hi, Sam,

Thanks for sending this out and I think that it’s great that this is being worked on.

It looks like we currently have this:

enum TargetCostKind {
  TCK_RecipThroughput, ///< Reciprocal throughput.
  TCK_Latency,         ///< The latency of instruction.
  TCK_CodeSize,        ///< Instruction code size.
  TCK_SizeAndLatency   ///< The weighted sum of size and latency.
};

Three of the four of these are measurable quantities, at least in an average sense, independent of their use in the optimizer. I think that’s a good thing, in part because we can continue to create good tools to measure them. TCK_SizeAndLatency, however, is not like the others. Is there a way to set this value independent of all of its uses in the optimizer? I suspect that the answer is: no. Moreover, it’s not just the use by a single transformation that matters, but use by all users in the pipeline. The only way to tune these costs is to run large test sets through the entire optimization pipeline.

While I understand that this reflects the current state of things, it seems less than ideal. There’s no way that a single blended ratio of size and latency will be optimal for all potential users (inliner, unroller, PRE, etc.), or at least it seems very unlikely. The transformation also won’t know what the right mixture is across all targets. If we really want to be able to tune these things, maintain stability as we evolve the optimizer, and understand what’s going on, I think we’ll want more flexibility in the system than this.

Also, the “size and latency” mixture is, as I recall, mostly measuring # of uops. Maybe we should understand how much it deviates from that and, if appropriate, call it that?

We might also consider a system allowing for per-pass TTI modifications of the costs returned by the model: a TCK_Blend, taking an explicit pass ID (and some discriminator code). This way a backend can return explicit heuristic costs for particular clients when there’s really no other way to reason about the needed costs.
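As a rough illustration of that idea - nothing like this exists in TTI today, and the pass IDs and weights below are invented - a backend could blend size and latency differently per client:

```cpp
#include <cassert>

// Invented names: a per-client "blend" where the cost query carries an
// identifier for the requesting pass, so a backend can return a heuristic
// cost tuned for that client when no single measurable quantity fits.
enum ClientPassID { CP_Inliner, CP_Unroller, CP_Other };

int getBlendedCost(int Size, int Latency, ClientPassID Client) {
  switch (Client) {
  case CP_Inliner:
    return 2 * Size + Latency; // code growth matters most to the inliner
  case CP_Unroller:
    return Size + 2 * Latency; // the loop's critical path matters most
  case CP_Other:
    break;
  }
  return Size + Latency; // plain "size and latency" fallback
}
```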

To your other questions: the vectorizers should certainly use recip throughput – that’s how they’re designed. Targets do specifically return throughput costs for these separate from the costs used by the inliner. “size and latency” is the right default for the others (although, as I mention above, I think this might be better called TCK_MicroOps or similar). It is certainly a good idea to be explicit everywhere about what costs are being requested. Please do continue with that.

Thanks,
Hal

Hi Hal,

Thanks for the support Hal. Yes, I actually only added SizeAndLatency at the start of the week to make it explicit when we’re trying to use that vague cost and to highlight that it’s falsely interchanged with CodeSize too. On the naming, I took it from the comments in TTI, though I’m sure each backend has derived slightly different meanings. I’m not sure whether reporting micro-ops translates to all targets, certainly not the small microcontrollers that I work on where latency is key, and a micro-op cost seems more related to throughput to me. I’d happily receive more feedback though so we can gain a consensus though.

I think the way forward is, like you say, for the backend to be given the opportunity to report a cost for each transform. Ideally the backend can then choose which cost is important to it, as well as a suitable threshold, and be given a larger region of code if necessary. It would be great to move the ad-hoc cost modelling out of the analysis and transform passes, while they can always fallback to their original costing if a backend doesn’t exist or doesn’t return anything.

Thanks,
Sam

That’s a good point, and there is certainly a relationship between uops and throughput on larger cores. Even there, however, more uops often correspond to higher latency (e.g., a combined load + mul might be split into uops for address generation, cache access, and the arithmetic, but in the end there’s still a dependency chain that must complete before the entire instruction retires). Regardless, I’m perfectly happy saying that what we’re trying to measure here is execution time, and adding latencies and counting uops might be equally bad in this regard across many types of cores.

Here’s how I think about this:

For inlining, to a lesser extent unrolling, and perhaps for other use cases as well, the problem is that we really have (at least) two effects that we’re trying to balance with one metric. One is execution time, and on an in-order core adding the latencies might be a good approximation of this, but for larger cores we likely need to account for #uops and throughput as well. The second is code size, or perhaps the ratio of code size to execution time, and the problem here is that we’re trying to solve a global optimization problem using a local indicator. Globally, we need to worry about i-cache pressure, TLB pressure, and so on, all of which are made worse by increasing code size, but there’s no real way to reason about these one function at a time. Yet, we need to try (using some heuristic). There can certainly be instructions that have large encodings relative to their execution speed (perhaps even after accounting for any reduced throughput from decoder capabilities), and we might need to account for this somehow.

There is a small caveat here: for some cores, we really do try to estimate uops, at least for unrolling. Large Intel cores, for example, have a uop buffer associated with the dispatch of small loops (the LSD, etc.), and the size limit of these is specified in uops.

Of course, the way that inlining thresholds are tuned, they also end up accounting for follow-on simplification opportunities enabled by inlining that also cannot be reasoned about locally. This is, in part, what makes changing anything in this space so tricky: anything that you change will cause some regressions that no local, causal reasoning will be able to fix.

So maybe we cannot do better right now than “size and latency”, but I feel like we should certainly think about how we might do better, and in addition, we should think about how we might provide guidance to backend developers as to how they might think about setting these values (other than just playing with things and running a lot of tests).

> I think the way forward is, like you say, for the backend to be given the opportunity to report a cost for each transform. Ideally the backend can then choose which cost is important to it, as well as a suitable threshold, and be given a larger region of code if necessary. It would be great to move the ad-hoc cost modelling out of the analysis and transform passes, while they can always fallback to their original costing if a backend doesn’t exist or doesn’t return anything.

Sounds good to me.

Thanks again,
Hal

Sorry for replying to such an old thread, but I hit the same confusion in my patch, D132216 ([CostModel][X86] Support cost kind specific look up tables), where I’m trying to add better x86 numbers for non-throughput cost kinds.

Should we add a hint to the TCK_SizeAndLatency comment saying that it’s target-specific (e.g. it might be a uop count on an out-of-order CPU)? We might also need to explain that some passes’ thresholds (LoopMicroOpBufferSize etc.) probably need to be on a comparable scale.
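To illustrate why the scales must match (the arithmetic below is a simplification, not the actual LoopUnrollPass logic): if the per-iteration cost is effectively a uop count, the unroller in essence divides the uop buffer size by it, so mixing units gives meaningless unroll counts:

```cpp
#include <cassert>

// Illustrative only: if TCK_SizeAndLatency is effectively a uop count on a
// target, thresholds such as LoopMicroOpBufferSize must be on the same
// scale, because an unroll count is roughly BufferSize / BodyCost.
int pickUnrollCount(int LoopBodyCost, int MicroOpBufferSize) {
  if (LoopBodyCost <= 0 || MicroOpBufferSize < LoopBodyCost)
    return 1; // no profitable unrolling
  return MicroOpBufferSize / LoopBodyCost;
}
```

If the body cost were returned in, say, code-size bytes while the buffer size is in uops, the quotient would be an apples-to-oranges number.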

Just looping your patch in here: D132288 ([TTI] Improve description of TargetCostKind enums to aid targets in choosing cost values).