[RFC] Target Description and Cost Model in MLIR

For a while we’ve been discussing a cost model in MLIR and how to make informed choices in the upstream passes. There are other threads that discuss cost models (ex. Inliner cost model) and target description (ex. RFC: Enhancing Machine Retargetability in MLIR), but not much has been developed upstream in that direction.

We (PCL - Intel Labs) worked on a compiler prototype that has target decisions embedded in its passes to show near-peak performance on CPUs, but that doesn’t scale. We have to manually pass options to the passes to get the best parameters for different targets, not to mention that decisions will be very different for GPUs and other targets.

So we started thinking about cost models and target descriptions.

Assumptions

Our main assumptions are the ones the threads mentioned above already develop:

  • We don’t want to duplicate code and LLVM already has a very rich target description infrastructure. It would make no sense at all to replicate that in MLIR.
  • LLVM passes already use those structures to make code generation and optimization decisions, so we can piggyback on them for the low-level transformations (ex. vectorization).
  • LLVM target descriptions are based on common patterns / questions, not a full description of all properties of all micro-architectural and implementation variations in the wild.
  • MLIR cost models do not operate at the same level as LLVM’s, so only using LLVM’s existing models won’t cut it. We need decisions for tile, fuse, pack, shard, and distribute, on a single thread, a single node, and multiple nodes. There are high-level costs that are simply not suitable to store in LLVM’s classes and often not derivable from target descriptions alone.
  • MLIR costs may involve more than one target, for example CPU+GPU, on the same IR (offloading). While LLVM IR already has such cases (ex. OpenCL), they are mostly handled orthogonally. In MLIR we need a cost model that understands not only the compute costs of kernels, but also the communication costs between them (especially when offloading), to find the right balance between host and target compute.
  • MLIR targets hardware that does not have a representation in LLVM, so there is no available information to start with. Does it make sense to add targets to LLVM that are only used in MLIR?
  • Downstream projects may have their own cost models and we don’t want to break the world, but to have an upstream cost model infrastructure we need to find the common ground. It would not be reasonable to go to all this trouble just to create a skeleton infrastructure upstream and still require downstream tools to create their own cost models, or worse, make it so upstream users can’t use the cost model.

Cost model hierarchy

There are three parts of a successful high-level cost model:

  1. Static target information, such as caches, threads, warps, registers, latency, etc. This does not change for a given target and can be used to calculate costs that are specific to known patterns (ex. a particular vectorized code). Most of the cost model comes from here, and this is what this RFC is about (see the sketch after this list).
  2. Compiler behaviour information, such as whether we unroll a loop before or after trying vectorization, whether we inline aggressively or not, etc. These change with implementation (upstream/downstream, different versions, even different pass order). This is a lot harder to maintain and will need some form of configuration files, dynamic behaviour (changing with pass order), etc. This can be used for machine learning, but it’s a topic for another time.
  3. Run time behaviour information, for example, if a certain branch starts being taken more often than not, or if the tensors are getting more and more sparse. This is exclusively for PGO+JIT run time optimization and is not in the cards right now, at least not for our group.
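
To give a flavour of point (1), a minimal sketch of what such static information could look like as a plain struct is below. All names are hypothetical and only illustrate the kind of data involved, not a proposed API.

```cpp
#include "llvm/ADT/StringMap.h"
#include <cstdint>

// Hypothetical sketch of the "static target information" in point (1).
// None of these names exist upstream; they only illustrate the kind of
// data a target descriptor would hold.
struct StaticTargetInfo {
  // Memory hierarchy, in bytes.
  uint64_t l1CacheSize = 0;
  uint64_t l2CacheSize = 0;
  uint64_t l3CacheSize = 0;
  uint64_t cacheLineSize = 0;

  // Parallelism and register resources.
  unsigned maxHardwareThreads = 0; // or warps / wavefronts on a GPU
  unsigned numVectorRegisters = 0;
  unsigned vectorWidthInBits = 0;

  // Coarse costs (ex. latency in cycles) for known patterns.
  llvm::StringMap<unsigned> patternLatency; // ex. "vector.fma" -> 4
};
```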

Implementation details

The working idea right now is to have a composable infrastructure (a rough sketch of the pass-facing interface follows this list):

  • MLIR passes ask questions about transformation / analysis validity, profitability, and best parameter guesses to a “target descriptor”, which is not necessarily tied to a particular LLVM target (non-LLVM targets, more than one target, etc.).
  • The target descriptor contains not only static information about the hardware but also “second order” information derived from those costs or added as a “heuristic value”. Replacing the heuristic values with proper infrastructure is point (2) above.
  • The costs can come from multiple sources:
    • LLVM target description
    • Heuristics values in an MLIR-specific TableGen / Config file
    • Command line options (including an external config file)
    • Inline C++ builder pattern for dynamic target information
    • Pass specific decisions based on current state of IR
  • MLIR builds the “target descriptor” from command line options (target triple? config files?) or detects host properties.
  • All targets must answer the basic questions (those asked upstream); individual targets can have additional questions for specific upstream/downstream decisions.
  • LLVM code generation passes the target information, with the respective IR snippets, to compile to each separate target.
  • SPIR-V code generation can have an adaptor interface (possibly downstream) for the toolchain / back end that follows.
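
As a rough sketch of the pass-facing side of the list above (all names are hypothetical placeholders, not a proposed API; the real set of questions would be driven by what upstream passes actually ask):

```cpp
#include "mlir/IR/Operation.h"
#include "llvm/ADT/SmallVector.h"
#include <cstdint>
#include <optional>

// Hypothetical pass-facing interface for the "target descriptor".
// Everything here is illustrative only.
class TargetDescriptor {
public:
  virtual ~TargetDescriptor() = default;

  // Validity / profitability questions asked by transforms.
  virtual bool isVectorizationProfitable(mlir::Operation *op) const = 0;

  // "Best parameter guess" questions, ex. tile sizes for a contraction.
  virtual llvm::SmallVector<int64_t> suggestTileSizes(mlir::Operation *op) const = 0;

  // Raw static properties, possibly pulled from LLVM's TTI, a config file,
  // command-line options, or a C++ builder, as listed above.
  virtual std::optional<uint64_t> getCacheSizeInBytes(unsigned level) const = 0;
  virtual unsigned getNumVectorRegisters() const = 0;
};

// A pass would then query the descriptor instead of hard-coding numbers:
//   if (desc.isVectorizationProfitable(op))
//     tileSizes = desc.suggestTileSizes(op);
```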

Plan

We began working on a target descriptor that pulls LLVM target info plus some additional info, and we use this to drive transforms. We do not plan to use TableGen for the MLIR side; we very much prefer config files (JSON, YAML, whatever) that can be kept upstream.

The first iteration is to have at least one upstream pass making decisions based on this descriptor for at least one LLVM target.

Who else is working on this, upstream or downstream, with input or work in progress that we can collaborate on, to speed up the creation of an upstream infrastructure?

@nhasabni @stellaraccident @nicolasvasilache @jpienaar @mehdi_amini @clattner


I am generally a proponent of externalizing heuristics from passes as much as possible, or at least isolating them in a way that makes them easily replaceable. This is a bit orthogonal to having a target descriptor, so I’m just making sure that we don’t end up with too rigid a descriptor that’s pervasive in passes.

Semi-crazy suggestion: have you considered storing configurations as IR?

It is, and yes, the thing we’re struggling with the most right now is that balance.

The problem is that a lot of high-level decisions can be made using “formulas over compiler passes, types and target properties”. For example, deciding a tile size depends on the L1/L2/L3 caches per thread, the size and shape of the tensors, and the algorithm that the compiler will lower to for the tile compute, which in itself depends on the number and nature of the registers (scalar, 1D vector, 2D matrices).
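
To make the kind of “formula” concrete, here is a toy sketch; the occupancy factor, the three-tile rule and the vector width are made-up assumptions, not a proposed heuristic:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Toy "formula over types and target properties": pick a square tile for a
// matmul so that the three operand tiles (A, B, C blocks) fit in half of the
// per-thread L1 cache. The 0.5 occupancy factor, the three-tile rule and the
// vector width of 16 elements are made-up assumptions for illustration.
int64_t pickTileSize(uint64_t l1BytesPerThread, unsigned elementBytes) {
  double budget = 0.5 * static_cast<double>(l1BytesPerThread);
  // 3 * T^2 * elementBytes <= budget  =>  T <= sqrt(budget / (3 * elementBytes))
  double t = std::sqrt(budget / (3.0 * elementBytes));
  // Round down to a multiple of the (assumed) vector width.
  int64_t tile = static_cast<int64_t>(t) / 16 * 16;
  return std::max<int64_t>(tile, 16);
}
```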

At the high level, I want to ask the cost model “what would the compiler do down there when I select this pattern up here?” and iterate until I find a reasonable match.

I also want to optimize multiple sub-graphs independently (say, with transforms) and then look at the overall picture (CFG of sub-graphs) and propagate shapes and distribution techniques.

But doing the transforms first would need multi-versioning, akin to VPlan, at a whole-graph level, which is very heavy. If I could probe the cost model, generate transforms for the sub-graphs (not execute them) and then ask a higher-level cost model about the (supposed) final state, I could discard bad candidates before I execute any transform and iterate until a reasonable graph is (presumably) achieved.

Of course, none of that would be in the initial phase, but this is definitely a state we want to reach.

I have not. :thinking:

In theory, it doesn’t matter what textual format it’s represented in, just as long as we have a parser and we can get the info into some C++ classes for the compiler to use.

The main issue I can think of is that an old JSON is still JSON, while an old MLIR file may not be parseable by a new MLIR tool.

Do you foresee additional benefits of it being in IR? Like adding “schedules” (like Halide) in MLIR program files with information that the front-end has and the optimizer doesn’t? This to me would be a killer app for this idea.

There is a need for this. Making it pluggable would at least allow different folks to disagree / inject external cost models, etc. It would probably need to start on some specific flows and grow with actual use/benefit. E.g., it would be easy to spend a few PhDs working on models here :slight_smile:

Yes, but only in the sense that you can parse it. Whether it is still a sensible parse, or whether the interpretation is consistent, is on you :slight_smile: That being said, we do need to have a discussion again about serialization stability on the MLIR side (for the builtin dialect). That doesn’t mean I’m saying this is the right way, just that we can drive the feature from the needs, and if this makes sense (IREE has been using device attributes in the IR as part of lowering for a bit, and it’s generally been useful), then we should discuss. I mean, one could also have this as a StringAttr with JSON nested inside… but a DictionaryAttr probably won’t be changing in how it’s serialized either (it’s a list of keywords, =, and values; there isn’t much I’d consider changing there).

Fair enough. Perhaps in this case, IR is even better, because there’s less chance of silently ignoring things.

Indeed. There’s an older discussion about versioning in MLIR, and I was vaguely looking at getting this working for linalg so we can start introducing more ops without breaking existing code. This would be a “running counter” not attached to releases, because most of us use trunk anyway.

For the builtin parts, it depends on what you want and what would be considered a breaking change or not. New dialects or changes in upstream dialects? Any change in semantics / parsing?

There is an uncommitted Phabricator review:
https://reviews.llvm.org/D58736

Oh, I think I remember that thread. IIRC, we were discussing high-level costs like VPlan or Polly. But that’s in LLVM, TableGen-driven, and I would not put MLIR-specific knowledge in there.

I would not be opposed to such a class in LLVM, by all means, but if anything, that work would just compose with this one, not replace it.

Thanks for the proposal, this is something really important for MLIR, and unfortunately hard to build considering the variety of use cases for MLIR.
The first time we seriously considered this was in the context of TensorFlow and TFLite actually, where the kind of “Target Description” is widely different from the information exposed by the CPU descriptions in LLVM.
I would expect that, targeting something like a TPU, the kind of information needed to drive the optimizations done on Linalg for tile&fuse (or Affine) would also be different from LLVM TTI.

This is a good point; it makes sense as a first approximation. I would have two concerns though:

  • TTI is very targeted at CPU-like information, and at information that is useful for the LLVM kind of transformation. It’s not clear to me that describing the interconnect of a heterogeneous chip would fit the LLVM target description machinery. You touch on this in one of the other points, and the “target descriptor” you mention later is meant not to be tied to LLVM targets, but I’m not sure how you want to reconcile that?
  • The MLIR lowering should be independent of LLVM: that is, I want to be able to optimize and lower down to NVVM, for example, without having to configure and build the NVVM backend (and pay 60MB of extra binary size!) just to access TTI.

The main difficulty, that blocked progress on this work in the past, is that contrary to TTI we can’t easily standardize the kind of information exposed by the “target descriptor”, so it becomes some sort of unstructured “JSON” API.

The best direction we could come up with was that the “target descriptor” would be an opaque attribute, and passes would define individual AttributeInterfaces for the cost model they require. The “target descriptor” implementation (provided by the target) can implement the AttributeInterfaces that make sense for this target. A pass would be applicable to a given TargetDescriptorAttr only if this descriptor implements the interface this pass needs.
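
On the pass side, that could look roughly like the sketch below. The attribute name and TileCostInterface are made up for illustration (the interface would be defined elsewhere); only the “check whether the attribute implements the interface” pattern is the point.

```cpp
#include "mlir/IR/BuiltinOps.h"

// Hypothetical: a pass checks whether the opaque target descriptor attribute
// implements the cost-model interface it needs, and bails out otherwise.
// `TileCostInterface` and the "target.descriptor" attribute name are made up.
void runOnModule(mlir::ModuleOp module) {
  mlir::Attribute desc = module->getAttr("target.descriptor");
  if (!desc)
    return; // no descriptor: keep the current (hard-coded) behaviour

  auto tileInfo = mlir::dyn_cast<TileCostInterface>(desc);
  if (!tileInfo)
    return; // the descriptor doesn't implement what this pass needs

  // Use the interface to drive the transformation, e.g.:
  //   auto sizes = tileInfo.getPreferredTileSizes(...);
}
```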

Re opaque target descriptor attributes, I think there’s an argument here for using the ones that are defined for GPU offloading for GPU targets and extending them as follows, roughly:

  1. If you’ve got a gpu.module or the like that you know is going for a specific target, you’ll specifically have one target descriptor on it instead of the current array - there may end up being a -gpu-make-per-target-module (or -submodule) pass or the like whose job is to duplicate things so you can run target-dependent passes on said modules. (Note that while I just said gpu.module above, you could probably have those attributes on a builtin.module or what have you).
  2. Said target attributes start implementing DLTI (Data Layout and Target Information) to spell out both backend-specific facts (ex. LLVM addrspace(5) is 32 bits on rocdl) and probably more general info, like memory space remappings. This’d allow -convert-gpu-to-rocdl and its counterparts to not need the target-specific preambles that prevent GPU lowerings from going through *ToLLVM.
  3. Or, in parallel/alternatively, targets would implement a ConfigureLLVMLoweringAttrInterface whose job is to add all the target-specific type conversions, data layouts, what have you.
  4. For that general configurability, we could imagine some other useful interfaces - say MemoryHierarchyInfoAttrInterface.

Correct. Though some information can be collected from LLVM’s TTI, like cache configuration, number of registers, and availability of certain extensions, which can affect tile sizes, loop order, 2D parallelization, etc.

Exactly. As pointed out above, there are some things we can get, and those are the ones that I said “I’d hate to duplicate them”. But the intersection between the LLVM TTI stuff and what we’d use in Linalg is small.

Yes, but the information that LLVM has in TTI is (reasonably) independent of the LLVM pass pipeline. I’m not proposing we use LLVM passes’ usage of TTI, just the raw information; MLIR target descriptors would use it in different ways, for MLIR purposes (has-a, not is-a).

Future work could use the LLVM usage as a metric iff you know you’re lowering through the LLVM pipeline, in the same way we (PCL) tile Linalg in a special way because we “know our libxsmm library operates in a particular way”. This is the second part of the cost model I alluded to in my original post and is a subject for future work.

This could work, but I’d start from the things that every target has and get something working for existing upstream passes (like tile and fuse, bufferize, packing) for at least two targets (CPU, GPU).

This could end up being prohibitively expensive for targets that do not need any of that or don’t change from a default implementation.

I’m assuming there are a few classes of targets (ex. CPUs, GPUs) where most properties have equivalent semantics, even if the numbers are different, while other targets (ex. TPUs, XPUs, IPUs) have very specific properties that no other target has.

If you look at TTI, there’s a large number of interface methods (with a default implementation) where only a single target implements anything other than the default. While this “works”, it bloats those classes and makes it hard to compose.

Not to mention absurd methods that mean nothing but are used for real codegen decisions (ex. isLikeA9). I spent months trying to get rid of that one many years ago, but to my surprise, it still lingers in the code today.

We may end up falling into the same trap, but I’d like initially to just do something basic for, say, some CPU and some GPU. Then we iterate on properties, grow the classes, refactor early and often, until we’re happy with the state.

So the model would be a sort of abstract interpretation of the program plus some transformation decisions?

Well, I happen to have a transform dialect that could be adapted for this :wink: Besides that, I’m thinking of the overall complexity of the system if we keep adding configuration languages and DSLs. We already have ODS and DRR, which I see as DSLs embedded into TableGen (ouch), YAML, Python OpDSL for linalg (though it’s optional), PDLL that lowers to PDL, and finally the dialects like IRDL and Transform. This starts to be a bit too much, though one can argue that we can just use YAML since it’s already there.

Coincidentally, I was thinking about exactly that now.

For the first few iterations, I think we need to keep it simple and stupid. Not everything can end up as a schedule decision or transform action. And even if we do have that, we still don’t have a good way to hook transforms/rewrites with pass decisions, etc.

Right now, we just have structs that load the LLVM TTI if available, use what it has that we need, and populate the rest.
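
For reference, the TTI side of that looks roughly like the sketch below (the plumbing to obtain a TargetTransformInfo from a TargetMachine and Function is elided; the query calls are the existing LLVM API, with signatures as of recent LLVM):

```cpp
#include "llvm/Analysis/TargetTransformInfo.h"
#include <optional>

// Raw static info we can pull from LLVM's TTI; the rest gets populated from
// other sources (config files, options, defaults).
struct RawTargetInfo {
  std::optional<unsigned> l1DataCacheBytes;
  std::optional<unsigned> l2CacheBytes;
  unsigned cacheLineBytes = 0;
  unsigned fixedVectorRegisterBits = 0;
};

RawTargetInfo pullFromTTI(const llvm::TargetTransformInfo &tti) {
  RawTargetInfo info;
  info.l1DataCacheBytes =
      tti.getCacheSize(llvm::TargetTransformInfo::CacheLevel::L1D);
  info.l2CacheBytes =
      tti.getCacheSize(llvm::TargetTransformInfo::CacheLevel::L2D);
  info.cacheLineBytes = tti.getCacheLineSize();
  info.fixedVectorRegisterBits =
      tti.getRegisterBitWidth(llvm::TargetTransformInfo::RGK_FixedWidthVector)
          .getFixedValue();
  return info;
}
```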

When we start talking about the MLIR side (that has no counterpart in LLVM), we can also start simple, some header with magic constants or config files (but no TableGen please!).

Meanwhile, the meta dialects you mention can start to use them, and if we find duplication, we favour the dialects to store the info, which will help them grow into what we want. It also means we’ll have info in LLVM, in MLIR dialects, and possibly in headers and config files. Maybe that’s a bit too much for an interim state that will last years?

This is one very important bit of infra that’s missing in MLIR, thanks!

In addition to finding the optimal config (that’s your main motivation, right?), this would help with SVE + SME compilation. Specifically, there are things that are valid/legal for SME but illegal in the SVE context (e.g. vectors with 2 scalable dimensions). These things tend to be annoying to work around. A “target descriptor” would help us decide what paths to avoid depending on config.
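
As a small illustration of that kind of query (the TargetDescriptor and supportsMultiDimScalableVectors names below are hypothetical; VectorType::getScalableDims is the existing MLIR API):

```cpp
#include "mlir/IR/BuiltinTypes.h"
#include "llvm/ADT/STLExtras.h"

// Hypothetical: instead of hard-coding SVE vs. SME knowledge in a pass,
// ask a (made-up) target descriptor whether vectors with more than one
// scalable dimension are legal on this target.
bool canUseVectorType(mlir::VectorType vt, const TargetDescriptor &desc) {
  // Count scalable dimensions (ex. vector<[4]x[4]xf32> has two).
  int64_t numScalableDims = llvm::count(vt.getScalableDims(), true);
  if (numScalableDims > 1 && !desc.supportsMultiDimScalableVectors())
    return false; // legal for SME, but not in an SVE-only context
  return true;
}
```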

Could you point us at some examples? Sorry for being lazy, just no clue where to start :sweat_smile:. I don’t really have an opinion on what the right format would be, hence asking for a reference.

Btw, right now we are using IREE for this kind of thing. It would be great to build something in MLIR that could be plugged into IREE. Nothing against IREE, that stuff is great. Just want to avoid duplication :slight_smile:

-Andrzej


Exactly! Because LLVM supports SVE, some of that info is already in TTI, so we should pull it in, but there are other things that we’ll need to have in MLIR itself. With more vectorization happening in MLIR, this will become more and more important.

Some examples of our (hard-coded) options:

Indeed. We want something that works upstream and also for IREE, and also for downstream MLIR-based compilers.


I’ve set up my initial PR for the prototype implementation on LLVM GitHub ([mlir] Target Description and Cost Model in MLIR by nhasabni · Pull Request #85141 · llvm/llvm-project · GitHub).

Of course, it is not complete by any means, so I’m happy to discuss it here as well as extend it.

@rengolin
