[RFC] MLIR interpreter framework

I would like to propose an interpreter framework that allows multiple dialects to register their op implementations. The interpreter traverses SSACFG regions, manages the mapping between SSA names and evaluated values, and calls the op implementations with the evaluated values of their operands.

I uploaded a prototype here: D129035 Add MLIR interpreter framework

Motivating use cases

Enable dialect developers to write a reference implementation of their ops. This reference implementation can be used to:

  1. Define the semantics of the ops
  2. Validate the compilation results
  3. Check whether a transformation introduces differences

Design

The core of the interpreter is the InterpreterOpInterface. If a dialect developer wants to run an operation with the interpreter, they must implement InterpreterOpInterface and provide an interpret method. The interpret method takes (1) the operation being interpreted, (2) a reference to the interpreter instance, and (3) an array of evaluated values.
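A rough sketch of the shape this hook might take, purely illustrative (the exact signature in the D129035 prototype may differ):

```cpp
// Hypothetical shape of the interpret hook described above. `Interpreter`
// and `EvalValue` are the RFC's names; the exact signature is a guess, not
// the prototype's verbatim API.
LogicalResult interpret(Operation *op,                 // op being interpreted
                        Interpreter &interpreter,      // interpreter instance
                        ArrayRef<EvalValue> operands); // evaluated operands
```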

EvalValue consists of (1) an MLIR Type that describes the type of the evaluated value and (2) a raw buffer used by the op implementation. EvalValue carries type information because the static type (the type in the program) may differ from the dynamic type (the type during execution). For example, in the UnrankedTensorType use case, the static type might be tensor<*xf32> while the SSA name binds to an evaluated value of type tensor<2x3xf32>. The current proposal leaves the raw buffer to the dialect implementation.
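One plausible layout for EvalValue matching this description (a sketch, not the prototype's actual definition):

```cpp
// Sketch of EvalValue: a dynamic MLIR Type plus an opaque, dialect-owned
// buffer. The field names and shared_ptr ownership model are assumptions.
struct EvalValue {
  // Dynamic type: may be more precise than the static IR type, e.g.
  // tensor<2x3xf32> for an SSA value whose static type is tensor<*xf32>.
  Type type;
  // Raw storage; its layout and lifetime are left to the dialect.
  std::shared_ptr<void> buffer;
};
```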

The Interpreter class provides the APIs to manipulate the interpreter state. This includes binding evaluated values to SSA names, branching to a different block, entering a region, calling a function, and error handling.
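For example, a terminator's interpret hook might drive these APIs roughly like this (bindValue and branchTo are hypothetical spellings for the capabilities listed above):

```cpp
// Sketch: interpreting an unconditional branch. All Interpreter method
// names are illustrative, not the prototype's actual API.
LogicalResult interpretBranch(Operation *op, Interpreter &interpreter,
                              ArrayRef<EvalValue> operands) {
  Block *dest = op->getSuccessor(0);
  // Bind the evaluated operands to the destination block's arguments.
  for (auto [arg, value] : llvm::zip(dest->getArguments(), operands))
    interpreter.bindValue(arg, value);
  // Transfer control; execution resumes at `dest`.
  return interpreter.branchTo(dest);
}
```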

Future work

  1. I leave the evaluated value buffer to the dialect developers. Technically, to enable interoperability between dialects, we would need to define something like a calling convention. However, different dialects have different needs, and it is unclear whether one convention could fit all dialects, so this is kept out of this RFC.

  2. The implementations of dialect operations can be decoupled from the dialect implementations themselves. For example, if an operation intentionally leaves its semantics undefined (to allow different implementations), different groups can build their own implementations and define their own semantics.

Please let me know what you think about this RFC. Thank you.

6 Likes

This looks quite cool… I’ve been wanting to build something like this for a while! I’m excited to see how this will work.

Thanks for the proposal! This is a component I thought about quite a few times without taking the time to really code it.
It seems to me that we would want a strong typing capability from an interpreter: the difficult part of the design is how to map an IR type to a runtime object representing this type, align the implementations on this runtime object definition, and have the descriptive capability (“poor man’s RTTI”) to also verify that everything is consistent between the op implementation and what the IR describes. It seems like you punted a bit on the “type safety” aspect for now? I can’t totally tell what is dynamically checked right now, or what static mechanism ensures the consistency of the system.

When I’ve seen anything like this done for real use (which may be different from being a little toolkit for enabling easier construction of little interpreters), I’ve found that the implementation had to take reasonably strong opinions on memory ownership and type representations, among other things. Then such strong opinions play against a desire for generality. I suspect you are going to find that mlir dialects are quite diverse and that many of them do not define a real “runtime representation”. As examples, even taking the tensor type and dialect, mlir itself is silent on runtime aspects, and the memref dialect, in generality, defines types and values that cannot always be represented by a fixed runtime buffer at any given point (at least without specific lowerings being done).

I think that it is unlikely to work to just have “dialects provide reference implementations for ops”. However, making it easier for a developer who is using a fixed number of dialects in a well-defined way to build a little interpreter for them is likely more tractable. This may still be some way from building a “good” interpreter: getting into a good interpreted form, in my experience, requires lowering with an eye to a chosen runtime model.

I’m not opposed to tooling to help people solve problems in this area (or the broader topic of building simulators of various kinds). I mainly just think that this is task-focused and not coupled to the dialects (i.e. the interpreter depends on the dialects it is scoped to support, not the dialects depending on the interpreter).

1 Like

Agreed.

First, there’s the problem you mention of runtime representation. This is hard on its own, as the complexity of things like QEMU and the JVM attests. But when you now allow different dialects to co-exist and implement their layouts independently of one another, it gets a lot harder.

Things like floating-point semantics, matrix layout, sparse data implementation, structure layout, and ABI issues can become impossible to handle if I can’t identify what the “other side” needs when the output of my dialect’s operation goes into another dialect’s operand.

Having a generic glue would only work if there were a “generic” definition of everything, including data layout and op semantics, but then you’d have to convert everything to that layout and back, or force dialects to use that layout directly.

If I understand your proposal correctly, having a generic ABI would be the only way to allow dialects to define their own operations. But then that generic layout would be suited to a specific class of hardware (say, 64-bit CPUs), and that means emulation on other classes (e.g. GPUs).

It’s either that, or having specific implementations of specific ABIs for specific hardware, and we’d have to create a new compiler from scratch.

In a nutshell, while basic stuff like control flow and integer arithmetic should work fine, work in MLIR is generally tailored to the more complex side of things…

I don’t understand that, actually: types are defined in the IR, including their semantics, etc.
They also belong to a dialect, which can provide the runtime representation for any given type. Of course there are multiple possible implementation choices for a given type (like tensor), but that does not mean that we can’t implement one. Some of these can even be “virtualized”: what’s important for a consumer is the interface to access the value, more than the actual concrete implementation details.

There are various trade-offs, but I think it is impossible to discuss all this without stating the goals of an IR interpreter: trade-offs are always about optimizing for some use cases / requirements.

Right, my point was related to differences in hardware support. We can implement anything on any hardware (there are enough emulators out there), but after a certain point in the pipeline the IR gets closer and closer to hardware (e.g. ABI arguments and return values, or assumptions about data layout), which can be hard to reproduce on different hardware.

While dialects themselves don’t usually care about such things, and MLIR isn’t usually that low-level (unlike LLVM IR), the connections between dialects, and the assumptions dialects make about one another, can create more problems than it’s worth.

I also see a conflict between the design goal of having very generic regions (SSACFG, Graph) without strong semantics, which can be used by many different dialects in very different ways, and the goal of providing a reference implementation, which basically enforces (or strongly encourages) specific semantics.

In some cases, dialects get lowered to hardware directly, avoiding the LLVM pipeline, because the underlying semantics are very different; but the same dialects can still be lowered to LLVM (for example, cfg or linalg). How would we define what the default semantics are?

I don’t get the conflicts: any dialect already defines semantics, so I am entirely on the opposite side here, and it seems to me that a reference implementation should always be possible (of course there are limitations, in particular in terms of performance, when one emulates something that has a poor mapping to the CPU host).

I don’t completely agree with this. While dialects do their best to define semantics, it’s not a precise definition (like ISA or ABI documents), and that leaves room for other dialects or tools to use them in different ways.

To reiterate, I’m talking about hardware execution semantics. As @stellaraccident describes, it’s hard to write an interpreter that does not take into account the runtime model.

Like her, I think it’s a really great idea to build an interpreter framework that allows people to implement their own rules on a set of dialects, for a set of problems, under a set of hardware restrictions, but I’d be wary of any implementation that aims to be a reference without taking those factors in, as it will very likely be restrictive for the more interesting cases.

Can you elaborate with examples? I still don’t get your point, in particular at this level of abstraction.

If I understand correctly, the intention is to enable a quick and easy “reference implementation” for ops, runnable on CPUs, and for that I think it’s not unimaginable that a dialect (a default interpreter for a dialect?) defines a C++ type behind the EvalValue. Specific ABIs for specific hardware are simply out of scope for this RFC.

1 Like

I’m skeptical, but I think my concrete critique is that these extensions need to be isolated from the dialect implementations: I don’t want dependencies on a dialect to carry the overhead of reference implementations. If this is to be developed and used for common dialects, then it needs to be possible to depend on the interpreter completely separately from the dialects. I can envision how this can be done with some combination of interfaces and extensions, but that might also not be the best way to do it (if biasing towards easy vs industrial strength, I would invert it and use some sugar to associate little lambdas compactly in one implementation file per dialect).

I’d also like to see a more comprehensive prototype with some of the hard parts illustrated before talking about in-tree development and claiming top-level names like “interpreter”. Many domains will require simulators of a completely different design, and this proposal is one of a set of such options.

I’m not sure “performance” is the right lens with which to evaluate reference implementations when talking generically about dialects that define execution semantics: some of the best simulators (i.e. reference implementations) are very non-performant because they are simulating a very detailed set of circumstances with high fidelity. I would go so far as to say that once you stray outside of dialects that are operating in an imperative regime (and even some in it), they can’t be meaningfully simulated with an op-by-op interpreter as proposed… or such a simulation is one valid way among many.

Again, I’m not opposed to providing tooling to do some of this kind of thing if it is useful – I get that a lot of people probably just want to know if their math op is producing the right results, etc. I’m just pretty leery of wiring a limited-purpose piece of infra like that too much into the core, or setting it up to be the only thing like it: make it a non-exclusive bolt-on that doesn’t require surgery on the dialects to support it, and my concerns are satisfied.

2 Likes

This is exactly what I’d like to see: a “reference semantics” defined operationally, without undue attention paid to performance. The ‘performance-oriented’ path should be a lowering into some other dialect.

We already have part of this in core: the constant folder. Do you see this being different (beyond a desire for completeness in some sense, while the constant folder can always safely do nothing)?

1 Like

If the motivating use case is to give dialect developers a framework for writing reference implementations, I’m not sure this “interpreter framework”, in the sense that dialects define for each operation a direct implementation, is the correct approach. Others have pointed out myriad issues with this.

Injecting op reference implementations away from the implementation of the dialect itself is possible with interfaces, as sketched below. I can imagine an mlir/lib/RefImpl/Default that adds default implementations to simple arithmetic and matrix operations, for example, with runtime type checks on EvalValue in each interpret hook.
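Concretely, MLIR’s existing external-model mechanism could attach the reference implementation to an op from a separate library, so dialect users never pay for the interpreter. A sketch, where InterpreterOpInterface and its interpret hook are this RFC’s proposal (not an existing upstream interface) and dialect class names may vary by MLIR version:

```cpp
// Sketch: attaching a reference implementation to arith.addi from outside
// the dialect. The ExternalModel / attachInterface / addExtension pattern
// is standard MLIR; InterpreterOpInterface itself is hypothetical.
struct AddIOpInterpreterModel
    : public InterpreterOpInterface::ExternalModel<AddIOpInterpreterModel,
                                                   arith::AddIOp> {
  LogicalResult interpret(Operation *op, Interpreter &interpreter,
                          ArrayRef<EvalValue> operands) const {
    // Runtime type checks on the EvalValues would go here, followed by the
    // actual evaluation.
    return success();
  }
};

// Called only by tools that want interpretation, e.g. from a library like
// the mlir/lib/RefImpl/Default mentioned above.
void registerArithInterpreterModels(DialectRegistry &registry) {
  registry.addExtension(+[](MLIRContext *ctx, arith::ArithDialect *dialect) {
    arith::AddIOp::attachInterface<AddIOpInterpreterModel>(*ctx);
  });
}
```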

But to me the more robust method is to define a “high-level interpreter” dialect/target that is easy to extend and has a generous surface for other dialects to target, which provides semantic guarantees for its operations and types. In practice, this dialect will end up having to duplicate a lot of the “bread-and-butter” MLIR dialects and operations and consequently have a lot of trivial converters from them, but having a clear separation between what operations cannot be interpreted, what can, and what their semantics (ABI, layout, etc. and all that good stuff) are neatly sidesteps the problem of, e.g., what the hardware execution semantics of an operation are as @rengolin mentioned.

I tried to expand it here: Discord
And here: Discord

It’s a good question. Just taking things like the arith dialect for now (because I can’t quite see the semantic match for many of the others), the constant folders all operate on attributes – which are likely to be a poor choice for anything “runtimey”. But in this case, they do define their reference domain and operations properly (i.e. IntegerType maps to APInt, FloatType to APFloat, etc.). If they just had that, and not the tie-in to the uniqued attribute/context, I suspect they would be a good general-purpose tool for not just constant folding but also simulation (i.e. both could be built on the more common thing).
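To illustrate the tie-in: a typical arith integer folder does its math on APInt but round-trips through context-uniqued attributes, roughly like this (a simplified sketch, not the exact upstream code):

```cpp
// Simplified sketch of an arith-style integer folder. The value domain is
// APInt, but inputs and outputs are uniqued IntegerAttrs -- the
// attribute/context tie-in mentioned above.
OpFoldResult arith::AddIOp::fold(ArrayRef<Attribute> operands) {
  auto lhs = operands[0].dyn_cast_or_null<IntegerAttr>();
  auto rhs = operands[1].dyn_cast_or_null<IntegerAttr>();
  if (!lhs || !rhs)
    return {};
  // The arithmetic happens on APInt; re-wrapping the result requires the
  // MLIRContext to unique the attribute.
  return IntegerAttr::get(lhs.getType(), lhs.getValue() + rhs.getValue());
}
```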

Looking at the other dialects, I don’t see a similarly strong reference execution model for their types. The tensor folders, in my experience, are pretty ad-hoc/not great/based on DenseElementsAttr, etc.

I think it would be an interesting exercise to look at this from the perspective of “if I were creating an opset and evaluation facilities for a high fidelity simulator, how would I structure it”.

I also have a hard time getting excited about this proposal. Back in the day we had an LLVM IR interpreter, but it was ridiculously slow: directly interpreting something in SSA form requires the use of hashtables for the register file. Also, unlike constant folders, you have to keep the “register file” around (or keep track of last uses etc. somehow).
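Concretely, the “register file” here is something like a hash map from each SSA Value to its evaluated result (a sketch, reusing this RFC’s EvalValue name):

```cpp
// Sketch of the per-execution register file described above. Every op
// result goes through a hash-table bind/lookup.
class RegisterFile {
  llvm::DenseMap<Value, EvalValue> values;

public:
  void bind(Value v, EvalValue ev) { values[v] = std::move(ev); }
  EvalValue lookup(Value v) const { return values.lookup(v); }
  // A real implementation would also erase entries after a value's last
  // use; otherwise the map grows for the entire execution.
};
```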

Furthermore, arith isn’t a simple integer and floating-point thing: it supports vectors and tensors, which have their own fairly fundamental representational questions. I don’t think that a “reference” implementation will be useful for anything more than toy examples, in which case it could live in the tutorial or somewhere out of tree.

The LLVM interpreter was eventually removed because it was a bunch of weird code that no one used and we got tired of maintaining and updating it as LLVM changed.

-Chris

If anything, I would expect an interpreter approach to be more effective for coarse-granularity (e.g. tensor) operations, for exactly the reason you point out: The value lookup overhead would be relatively speaking less significant.

1 Like

Well indeed yes, but tensors and aggregate operations have far more representational questions than an i32. The two different cases are problematic for very different reasons.