Are there any plans/directions for the inliner cost model? Looking at the code it seems to just inline all feasible calls.
I have a program where the current inliner cost model results in significant code size bloat (about 8x). (IR before inlining + call count stats). Happy to help implement when this starts to block me. Also wondering if other folks in the community are running into inliner cost model issues.
River and I had a nice chat offline. Just a quick braindump of some stuff that came up:
cost model for inlining, especially in high level dialects is really hard. It seems to depend transitively on a potentially vast number of later lowering decisions.
code-size based heuristics (like LLVM):
- Consider: high level ops or similar library calls in high level (pytorch/tf/numpy) frameworks such as “add two tensors, which might have any shape and dtype” – which data types? what shapes to support? codegen size might vary wildly based on that, but calling out to a library might be tiny either way!
- Consider: linalg-on-tensors ops. Vastly different code size depending on fusion. E.g. if an “add” is fused, then it is just one “add” machine instruction, but if it isn’t, then it is at least a loop around an “add”, but depending on the runtime and device, it might expand into queuing a GPU grid dispatch, etc. which is way more.
performance-based heuristics are totally dependent on the backend/runtime, and require similar sorts of thinking as code-size
- would inlining enable fusion?
- would it enable intraprocedural shape inference to do a better job?
- does the function have context-dependent shapes and is beneficial to specialize?
inliner can do some amount of abstract interpretation of the callee to find unreachable blocks and adjust cost accordingly (e.g. prove we are only in the “fast” push_back case). LLVM does this, but it is unclear how to extend it to the MLIR level of generality.
At some level we’ll need an interface where we can plug in machine learning to learn good cost models. Maybe that’s the way out.
Some trivial “always beneficial” inlining is possible to do, like identity functions and private functions with only one callsite.
I think that this has got to be an opinterface sort of thing. You might want to look at
llvm::TargetTransformInfo::getInstructionCost as one example existing implementation of this idea.
I’d specifically recommend splitting the “measuring an operation” in code size or performance or whatever, from the policy of “should I inline a function containing a bunch of operations of some size”
From my perspective, and I think Sean touched on this, there can’t feasibly be one MLIR Inliner cost model. I just don’t see how we can define a reasonable model that crosses all possible domains. What I had in mind is that we provide an abstract CostModel class that clients can override to determine when to inline or not. In some abstractions, this can be based on code size using some dialect or target interface. In others, it can be based on some analysis that is performed by the frontend. In other cases it can be an ML model that does some seemingly random computation to compute a value score for the function. The point of all of this being, I agree with you and I don’t think any kind of “cost” notion is going to scale to all MLIR users. We may have several “default” cost models that users can pick from that use some combination of interfaces (and what not), but I don’t think the inliner should embed any of this into its core logic.
Yup, that seems right to me. I don’t think it is precisely an OpInterface though, because in general it is not a property of the Op, but a property of the “target”. (i.e. it is
FSub::getInstructionCost). We don’t really have a “target” abstraction (which for MLIR needs to cover so many layers potentially). So we’re probably looking more like a “plugin” or InlinerCostModelBase to inherit rather than OpInterface, which a user will configure based on their understanding of their pipeline setup.
Also, it probably needs to reason at a granularity larger than a single instruction, to enable parameterization sufficient for ML model heuristics to do a great job.
Sure, but that isn’t why LLVM is putting this in its target libraries. This is done because llvm intrinsics are the “logical extension point” for targets, and the mid-level optimizer wants to be able to do queries without enumerating all the targets.
That is the MLIR related problem I was referring to. The “ops can lower to different ways on different targets” problem is going to be intractable in general I think, particularly for something like a tensor op. An accurate cost model is going to be nuts.
So, what is the correct thing to do to avoid unconditional inlining?
We, the sparse compiler team, are also facing the same issue as the sparse compiler deliberately generates a few complex function calls for operations like sparse tensor insert/sort.
If there is no such a way, can we introduce a filter callback to allow users to filter out call sites that should not be inlined at least? (similar to L102@MLIR: include/mlir/Dialect/Bufferization/IR/BufferizableOpInterface.h Source File)
Could I do something like the following:
when I am initialising my dialect?