I was looking at the InlineCost calculations, and I noticed that most intrinsic calls have only two possible costs at the moment: free, or equivalent to one instruction. (It actually calls getInstructionCost() to compute the real cost, but then only checks whether the returned cost is zero.
See the two relevant spots in llvm-project/InlineCost.cpp at a4d48e3b0b66cacb8c42be8421608f7efd170c24 on GitHub.) That doesn’t really reflect reality; many intrinsics end up getting expanded to a much longer sequence of instructions. In a case I was looking at, this underestimate led to a substantial size increase at -Os.
Would it make sense to adjust InlineCost to take the cost reported by the target into account? Has anyone else looked at this?
Yes, that should be the way to go. @kazutakahirata @mingmingl
It definitely makes sense to query target-specific cost (many middle-end passes already do this), and https://groups.google.com/g/llvm-dev/c/7He3qWVgM3I (from a few years ago) mentioned the idea of “augment Inliner pass to use existing TargetTransformInfo API to figure out cost of particular call on a given target” — which has since happened.
Here are my two cents on this topic (I realize it doesn’t help the immediate question of intrinsic costs not being modeled correctly; just some general thoughts on target-specific parameterization of the inliner pass):
- The current heuristic uses target-agnostic costs from command-line options. To move from the current state to fully using TTI-reported costs in the inliner, a fair amount of tuning of the existing inline thresholds (HotCallsiteThreshold, InlineHintThreshold, etc.) is expected, so that cost and threshold work together (at least to the extent they are effective today).
- The exact numerical cost values may differ between two backends, while the relative costs (across instructions on the same backend) likely agree (e.g., a memory operation could be more expensive than arithmetic) for processors in a similar domain. Given that it’s the comparison (cost vs. threshold) that actually decides the inline result, the threshold should probably be target-specific as well.
Besides, there is a CallPenalty parameter for general function calls. I wonder whether using this number would help the intrinsic use case, say, without considering target-specific cost differences.
The inliner pass should probably weight size cost more than latency cost, so memory vs. arithmetic is not a great example. I’m just trying to say that the relative size costs of different IR instructions could agree across backends: if an arithmetic intrinsic is implemented with a SIMD instruction on one backend, another backend likely has a comparable SIMD instruction, as opposed to emitting element-by-element scalar operations.
We intentionally don’t apply the call penalty to intrinsics at the moment, and that probably makes sense: most intrinsics don’t end up getting lowered to calls. More generally, a generic intrinsic penalty probably isn’t helpful. It won’t help the inliner understand that, for example, on AArch64 without NEON, ctpop expands to 12 instructions while ctlz is a single instruction.
I’m thinking I can incrementally introduce extra costs for intrinsics the target knows are expensive, without substantially reworking the overall cost model; such intrinsics should be relatively rare, so the impact should be minimal in most code.
This is a fair point.
Also, I guess intrinsics are lowered to a moderate sequence of instructions (compared with an arbitrary function), so it’s likely that the inaccuracies are not obvious with some compile options (e.g.