For a while we’ve been discussing a cost model in MLIR and how to make informed choices in the upstream passes. Other threads touch on cost models (ex. Inliner cost model) and target description (ex. RFC: Enhancing Machine Retargetability in MLIR), but not much has been developed upstream in that direction.
We (PCL - Intel Labs) worked on a compiler prototype that has target decisions embedded in its passes to show near-peak performance on CPUs, but that doesn’t scale. We have to manually pass options to the passes to get the best parameters for each target, not to mention that decisions will be very different for GPUs and other targets.
So we started thinking about cost models and target descriptions.
Assumptions
Our main assumptions are largely those that the threads mentioned above already develop:
- We don’t want to duplicate code, and LLVM already has a very rich target description infrastructure. It would make no sense to replicate that in MLIR.
- LLVM passes already use those structures to make code generation and optimization decisions, so we can piggyback on them for the low-level transformations (ex. vectorization).
- LLVM target descriptions are based on common patterns / questions, not a full description of all properties of all micro-architectural and implementation variations in the wild.
- MLIR cost models do not operate at the same level as LLVM’s, so only using LLVM’s existing models won’t cut it. We need decisions for tiling, fusing, packing, sharding, and distributing, on a single thread, a single node, or multiple nodes. There are high-level costs that are simply not suitable to store in LLVM’s classes and are often not derivable from target descriptions alone.
- MLIR costs may involve more than one target, for example CPU+GPU, in the same IR (offloading). While LLVM IR already has such a case (ex. OpenCL), it’s mostly handled orthogonally. In MLIR we need a cost model that understands not only the compute costs of kernels, but also the communication costs between them (especially when offloading), to find the right balance between host and target compute.
- MLIR targets hardware that does not have a representation in LLVM, so there is no available information to start with. Does it make sense to add targets to LLVM that are only used in MLIR?
- Downstream projects may have their own cost models, and we don’t want to break the world; but to build an upstream cost model infrastructure we need to find common ground. It would not be reasonable to go to all this trouble just to create a skeleton infrastructure upstream and still require downstream tools to create their own cost models, or worse, make it so upstream users can’t use the cost model.
Cost model hierarchy
There are three parts to a successful high-level cost model:
- Static target information, such as caches, threads, warps, registers, latency, etc. This does not change for a given target and can be used to calculate costs that are specific to known patterns (ex. a particular vectorized code). Most of the cost model comes from here, and this is what this RFC is about.
- Compiler behaviour information, such as whether we unroll a loop before or after trying vectorization, whether we inline aggressively or not, etc. These change with implementation (upstream/downstream, different versions, even different pass order). This is a lot harder to maintain and will need some form of configuration files, dynamic behaviour (changing with pass order), etc. This can be used for machine learning, but it’s a topic for another time.
- Run time behaviour information, for example whether a certain branch starts being taken more often than not, or whether the tensors are getting more and more sparse. This is exclusively for PGO+JIT run time optimization and is not in the cards right now, at least not for our group.
Implementation details
The working idea right now is to have a composable infrastructure:
- MLIR passes ask questions about transformation / analysis validity, profitability, and best parameter guesses to a “target descriptor”, which is not necessarily tied to a particular LLVM target (non-LLVM targets, more than one target, etc).
- The target descriptor contains not only static information about the hardware but also “second order” information derived from those costs or added as a “heuristic value”. Replacing the heuristic value with proper infrastructure is point (2) above.
- The costs can come from multiple sources:
  - LLVM target description
  - Heuristic values in an MLIR-specific TableGen / config file
  - Command line options (including an external config file)
  - Inline C++ builder pattern for dynamic target information
  - Pass-specific decisions based on the current state of the IR
- MLIR builds the “target descriptor” from command line options (target triple? config files?) or by detecting host properties.
- All targets must answer the basic questions (those asked upstream); individual targets can also have additional questions for specific upstream/downstream decisions.
- LLVM code generation passes the target information along with the respective IR snippets to compile for each separate target.
- SPIRV code generation can have an adaptor interface (possibly downstream) for the following toolchain / back end.
Plan
We have begun working on a target descriptor that pulls LLVM target info plus some additional info, and we use it to drive transforms. We do not plan to use TableGen on the MLIR side; we very much prefer config files (JSON, YAML, whatever) that can be kept upstream.
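As a purely hypothetical illustration of what such a config file could look like (the keys, schema, and values below are made up, not a proposal), a YAML overlay for a single CPU target might be:

```yaml
# Hypothetical target-descriptor overlay; key names are illustrative only.
target: x86_64-unknown-linux-gnu
static:                  # normally derived from the LLVM target description
  l1-cache-size: 32768
  num-threads: 8
heuristics:              # MLIR-specific "second order" values
  tile-sizes: [32, 32, 64]
  unroll-factor: 4
```

A file like this would act as one of the composable cost sources described above, overriding or extending what the LLVM target description already provides.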
The first iteration is to have at least one upstream pass making decisions based on this descriptor for at least one LLVM target.
Who else is working on this, upstream or downstream, with input and work in progress, so that we can collaborate to speed up the creation of an upstream infrastructure?
@nhasabni @stellaraccident @nicolasvasilache @jpienaar @mehdi_amini @clattner