[RFC] A New "One-Shot" Dialect Conversion Driver

TLDR: Design a new (additional) dialect conversion driver without the quirks of the current dialect conversion driver.

Dialect Conversion Recap

The dialect conversion is a pattern-based IR rewrite driver that:

  • traverses the IR from top to bottom,
  • looks for “illegal” ops (as per ConversionTarget),
  • rewrites illegal ops by applying a ConversionRewritePattern,
  • keeps track of all IR modifications, so that already applied patterns can be rolled back and different patterns can be tried (backtracking).

To allow for rollback of IR modifications, the dialect conversion materializes some IR modifications in a delayed fashion at the end of the dialect conversion process. E.g., ops are not erased immediately and uses of op results are not replaced immediately.

What’s wrong with the current Dialect Conversion?

It is slow. @mehdi_amini micro-benchmarked parts of MLIR (see “Efficient Idioms in MLIR” keynote at EuroLLVM 2024; do we have the slides somewhere?) and the dialect conversion was by far the slowest part. The main reason for that is the extra housekeeping that enables rollbacks. You may have noticed when running the test suite that one of the dialect conversion-heavy NVVM tests (I think it’s this one) is quite slow.

It is too complicated, even for long-standing MLIR contributors, and to an extent that contributors have been reluctant to touch the code base and have instead implemented standalone infrastructure (the 1:N dialect conversion).

It reuses common MLIR API but imposes partly undocumented restrictions that make it difficult to use correctly. Some examples:

  • ConversionRewritePattern derives from RewritePattern, but it is generally unsafe to use a conversion pattern in a greedy rewrite or a rewrite pattern in a dialect conversion.
  • Walking IR and/or SSA use-def chains is generally unsafe in a ConversionRewritePattern because some IR modifications are applied to the IR only at the end of the dialect conversion. (In particular: Users can still reach/see IR that was already scheduled for erasure.) It is unclear/undocumented what exactly is allowed in a ConversionRewritePattern.
  • Because of the delayed materialization of certain IR modifications, ConversionRewritePatterns cannot assume that the IR is in a valid state at the beginning of a pattern application, which makes it difficult to write ConversionRewritePatterns. It is unclear/undocumented what patterns can assume. (Side note: Regular rewrite patterns can assume that the IR is valid before each pattern application, and we have assertions for that.)
  • The dialect conversion driver calls op folders, which generally assume that the IR is in a valid state. Using op folders in a dialect conversion is generally unsafe.

It is difficult to use and to debug. Some examples:

  • IR dumps during a dialect conversion (before/during/after a pattern application) are broken: ops that were already scheduled for erasure are still shown, value replacements have not manifested yet (they are tracked in an IRMapping instead).
  • A discourse search for “failed to legalize operation” yields 46 results since Nov. 2020.
  • A bug in a ConversionRewritePattern may become apparent only much later, at the end of the dialect conversion.

Proposal: One-Shot Dialect Conversion (name TBD)

Provide a new, simpler dialect conversion driver. (In addition to the existing dialect conversion driver. No changes needed for existing users of the current dialect conversion driver, if they do not want to use the new driver.)

  • Rollbacks (the main source of complexity in the current dialect conversion) are no longer supported. Once a pattern has been applied, it can no longer be undone.
  • All state is materialized in IR. In particular, unrealized_conversion_cast ops are eagerly materialized every time a value is replaced with a value of a different type. No more state in IRMapping. This means that the IR is valid after each pattern application and can be dumped at any time.
  • In essence, the new dialect conversion is a simple walk over the IR. A conversion pattern is applied for each illegal op, in the same way as regular RewritePatterns are applied (see the driver sketch after this list). The only difference is that replaceOp/replaceAllUsesWith are relaxed so that replacement values can have a different type. The type converter is not needed for this process; it is only needed to convert block signatures.
  • Provide only a new driver, but reuse the existing infrastructure. In particular, ConversionRewritePattern and TypeConverter can be reused as is. Existing ConversionRewritePatterns can be used with the new dialect conversion driver. (But patterns designed for the new driver cannot be used with the old driver, as they may assume that IR is valid when they start executing.)
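
To make the intended simplicity concrete, here is a rough sketch of what such a driver could boil down to. This is not part of the proposal text: OneShotConversionPatternRewriter and applyOneShotConversion are hypothetical names, and the sketch glosses over details such as ops being erased or created while iterating.

    LogicalResult applyOneShotConversion(Operation *root,
                                         const ConversionTarget &target,
                                         const FrozenRewritePatternSet &patterns) {
      // Collect illegal ops up front so the walk does not observe IR that
      // patterns create while we iterate.
      SmallVector<Operation *> illegalOps;
      root->walk([&](Operation *op) {
        if (target.isIllegal(op))
          illegalOps.push_back(op);
      });

      // The proposed rewriter (does not exist yet); see the outline below.
      OneShotConversionPatternRewriter rewriter(root->getContext());
      PatternApplicator applicator(patterns);
      applicator.applyDefaultCostModel();

      for (Operation *op : illegalOps) {
        // A real implementation must also skip ops erased by earlier patterns
        // and enqueue newly created / in-place modified ops.
        if (failed(applicator.matchAndRewrite(op, rewriter)))
          return failure(); // No pattern could legalize this op.
      }
      return success();
    }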

Implementation Outline

A new OneShotConversionPatternRewriter that derives from ConversionPatternRewriter will be provided, along with a thin driver that walks the IR and applies patterns.

   OpBuilder
       ^
       |
  RewriterBase
       ^
       |
 PatternRewriter
       ^
       |
ConversionPatternRewriter
       ^
       |
OneShotConversionPatternRewriter    // this one is new

The new rewriter behaves mostly like a PatternRewriter, but allows ops/values to be replaced with ops/values of a different type, upon which an unrealized_conversion_cast is automatically inserted. The new rewriter must be a subclass of ConversionPatternRewriter so that ConversionRewritePattern can be reused. All functions that are overridden in the ConversionPatternRewriter (except for replaceOp/replaceAllUsesWith) are overridden in OneShotConversionPatternRewriter and dispatch to the original PatternRewriter/RewriterBase implementation.
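
As an illustration of the relaxed replacement behavior (a sketch only, assuming the class outlined above exists; this is not upstream code), replaceOp could look roughly like this:

    void OneShotConversionPatternRewriter::replaceOp(Operation *op,
                                                     ValueRange newValues) {
      OpBuilder::InsertionGuard guard(*this);
      setInsertionPoint(op);

      SmallVector<Value> replacements;
      for (auto [result, newValue] : llvm::zip(op->getResults(), newValues)) {
        if (result.getType() == newValue.getType()) {
          replacements.push_back(newValue);
          continue;
        }
        // Type mismatch: bridge it with an unrealized_conversion_cast. The cast
        // is expected to fold away once the other end of the use-def edge is
        // converted as well.
        auto cast = create<UnrealizedConversionCastOp>(
            op->getLoc(), result.getType(), newValue);
        replacements.push_back(cast.getResult(0));
      }

      // Dispatch to the plain RewriterBase behavior: uses are replaced and the
      // op is erased immediately; no delayed bookkeeping.
      RewriterBase::replaceOp(op, replacements);
    }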

The conversion pattern “adapter” object is populated by taking the operands from the unrealized_conversion_cast op (instead of querying them from IRMapping).
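
For illustration, gathering the adaptor operands could then be as simple as the following hypothetical helper (assuming the common 1:1 replacement case):

    static SmallVector<Value> getAdaptorOperands(Operation *op) {
      SmallVector<Value> remapped;
      for (Value operand : op->getOperands()) {
        if (auto cast = operand.getDefiningOp<UnrealizedConversionCastOp>()) {
          // The cast inputs are the already-converted values.
          llvm::append_range(remapped, cast.getInputs());
          continue;
        }
        // Operand was not replaced, or was replaced with a value of the same type.
        remapped.push_back(operand);
      }
      return remapped;
    }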

One change is needed for existing conversion patterns: they must not modify the IR if they return failure. (Same as ordinary rewrite patterns.) Most patterns should already be written in such a way; for all other patterns, this should be a simple fix.
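
To make that contract concrete, here is a minimal hypothetical pattern (foo.some_op and bar.some_op are made-up ops): all checks happen before the first IR modification, so returning failure leaves the IR untouched.

    struct FooToBarPattern : public OpConversionPattern<foo::SomeOp> {
      using OpConversionPattern<foo::SomeOp>::OpConversionPattern;

      LogicalResult
      matchAndRewrite(foo::SomeOp op, OpAdaptor adaptor,
                      ConversionPatternRewriter &rewriter) const override {
        // Match phase: no IR modifications allowed here.
        if (!op.getType().isIntOrIndex())
          return failure();

        // Rewrite phase: the root op must be replaced, erased, or updated in place.
        rewriter.replaceOpWithNewOp<bar::SomeOp>(op, op.getType(),
                                                 adaptor.getOperands());
        return success();
      }
    };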

Open Questions

The main idea of the proposal is to design a dialect conversion driver without rollback/backtracking. I’d like to hear from dialect conversion users whether they actually depend on the rollback mechanism or not. I am not aware of any passes that depend on the backtracking.

What if multiple patterns can apply to an illegal op? Which one should be chosen? A ConversionRewritePattern must erase, replace or in-place modify the illegal root op, so there is generally no chance for a second pattern to be applied. (In practice, there may only be one applicable pattern in most cases.)

The current dialect conversion has a cost model that guides the selection of patterns. Depending on how we handle the case where multiple patterns can apply, this cost model may still or may no longer be needed.

For compatibility reasons, I proposed reusing existing infrastructure and having OneShotConversionPatternRewriter derive from ConversionPatternRewriter. Alternatively, we could completely decouple the new dialect conversion and make more radical changes. E.g., ConversionRewritePattern should ideally not derive from RewritePattern because it is not safe to use with other RewritePattern-based infrastructure such as the greedy pattern rewrite driver. However, this also means that we cannot easily reuse existing functionality such as the PatternApplicator.

Next Steps

Based on the feedback for this proposal, I plan to prepare a prototype / proof of concept. But I’d like to first fix important design decisions (e.g., whether to reuse existing infrastructure, etc.), as those are difficult to change later.

@ftynse @ingomueller-net @jpienaar @mehdi_amini @Mogball @nicolasvasilache


How would this differ from using GreedyPatternDriver + legality function?


I’m in favour of simplifying the conversion, but I have a few extra points.

There is a tension around supporting too many variations in the same dialect in order to avoid partial conversion. For example, linalg supports both tensor and memref, so that you can run bufferization without having to convert dialects. We tried to create a dialect (tpp) that worked only on memrefs, lowered directly from linalg, but that conflicted with one-shot bufferization. The “solution” was to support tensor in our dialect, but then we were forced to duplicate bufferization. Not good.

Vectorization provides a similar impedance mismatch. If my list of legal output dialects doesn’t include vector, I have to lower to memref (and my lower dialect has to support it). If it does, then am I forced to vectorize as I convert? And if there is no rollback, I can’t try variants. Do I just insert a lot of extract/insert vector instructions and try to clean them up in the end? Do I have to list those ops in my legal output?

An interesting approach could be to apply one-shot within boundaries. Same for bufferization, vectorization, etc. You mention function arguments; perhaps this could be a function pass, not a module pass, and then we only need to add various “unrealised casts” at prologue/epilogue, which would be much easier to “clean” later than if they are added everywhere in the code. You could even multi-version functions with different one-shot parameters, and then have a later pass to pick one and remove the others.

Thanks Matthias!
The current dialect conversion is definitely an expensive infrastructure (the bookkeeping cost of the powerful features it comes with), and a one-shot simpler approach could be interesting to try.

The delayed materialization is needed because, that way, patterns can inspect the operands as they were before the rewrite.

It’s not clear what in this proposal would avoid any of these?
I picked a few through random sampling, and they seem like users just reporting that a pass failed to convert an input IR, when the input IR contains operations that the pass is not supposed to handle. I would expect a one-shot approach to fail in exactly the same way, and these users to ask the same question.

This does not seem like the only difference to me: in particular, the contract changes quite a bit in that you lose the rollback ability, and so a pattern can’t generate illegal operations or types anymore (I am not seeing this as a blocker, just saying!).

One nit: through the current operands, patterns can at the moment inspect the producer of an operand pre-conversion, which your change would break (the operand type is still unchanged, but the producer is now an unrealized_cast).

I am missing what makes the type converter not needed anymore where it was used before?

I’d think that the “one shot” wouldn’t be iterative, and doesn’t need to keep a worklist updated like GreedyPatternDriver does.

+1 to this proposal! I have long stopped using DialectConversion. It’s often easier and simpler to just walk over the operations and dyn_cast. DialectConversion to me is a bit of a relic; building compilers that might fail to compile valid input programs seems like an anti-pattern these days.

However, as you and others have noted, type conversion continues to be a challenge.

This one to me is tricky. Eagerly inserting these ops can result in lots of temporary IR bloat, which increases allocator pressure.

Also, I don’t think patterns should be the “functor” used in this framework. The framework needs a set of OperationName -> function_ref<void(OpT, AdaptorT, IRRewriter&)>. This API guarantees that there are not multiple patterns that can match and removes the ability of the functors to “fail”.

It would still be nice to have some kind of registration mechanism so that lowerings can be composed across different dialects.
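
For concreteness, the registration scheme suggested above could look roughly like this (made-up names, type-erased for storage; not something that exists today):

    using LoweringFn = std::function<void(Operation *op, ValueRange convertedOperands,
                                          IRRewriter &rewriter)>;

    struct LoweringRegistry {
      llvm::DenseMap<OperationName, LoweringFn> lowerings;

      // Registering a second lowering for the same op name is a programming
      // error, which rules out overlapping matches by construction.
      void registerLowering(OperationName name, LoweringFn fn) {
        bool inserted = lowerings.try_emplace(name, std::move(fn)).second;
        assert(inserted && "exactly one lowering per operation is allowed");
        (void)inserted;
      }

      // The driver looks up the single lowering for an illegal op and invokes it.
      void lower(Operation *op, ValueRange convertedOperands,
                 IRRewriter &rewriter) const {
        auto it = lowerings.find(op->getName());
        assert(it != lowerings.end() && "no lowering registered for this op");
        it->second(op, convertedOperands, rewriter);
      }
    };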

Would this be manageable if the driver is pooling them and eagerly reusing them when their users are converted, instead of deleting/creating them?

Why is this a good property to have? How do you account, for example, for the ability to convert an op only if it is in a particular form (like some subset of attributes)?

Yeah, this seems like something the driver should do. I’ve been meaning to write a “dumb dialect conversion” framework to replace the grotty for loops and passes that need to navigate transient illegal IR. Great to see this being pushed forward!

IMO valid IR should never fail to lower, assuming the lowering is written.

It should never fail to lower in aggregate but a single pattern may not be comprehensive.

So purely forward walk, one iteration, no pattern expanding into ops that need further lowering/always legal expansions?

I think its use is where one has many ops with many different attributes/variants and is composing different paths. It’s for when your input language is a rather large target and it’s doubtful that you’d ever really be able to support all of it. There, that’s fine, but it’s being used where the start and end are rather fixed/small and the type converter is the convenience part. E.g., it’s being used where the cost is not offset by any benefit.

+1 on avoiding bloat which has to be cleaned up.

This might be a key idea. I’ve also almost written a “dumb dialect conversion driver” a couple of times, but what to do about type conversion always holds me back. If the current dialect conversion had a way to do this one thing when it was created, it might have ended up better. Instead, it went through several iterations of trying to balance type conversion and the resulting transiently illegal IR – and I feel like each iteration just added a lot of complexity.

The other thing I’ve often wondered is whether there should be a few such conversion drivers. There are a lot of use cases for how to lower something and I feel like by trying to solve them all, it kind of didn’t do any of them very well.

Finally, while it would be a big change, the matchAndRewrite which does both predicate and rewrite in one has been a continual problem. The most common bug and CR feedback I see comes from people mutating as part of the match and then returning failure, usually indirectly in some helper where it is non-obvious.

Edit: please disregard. I found this on a second read of the proposal: “The conversion pattern “adapter” object is populated by taking the operands from the unrealized_conversion_cast op (instead of querying them from IRMapping ).”

Original question below.

What happens to the OpAdaptors in this world? Without arguing good or bad, I’ve seen many, many patterns that use the op to be able to reference the original types of the operands while the adaptor references current op values and types. This working is a side effect of the current backtracking iirc.

This view of the original type signature, based on memory/reviews, always seemed to me the biggest way that the backtracking implementation leaked into patterns themselves. In such patterns, being able to access the original type signature does have a use sometimes, but maybe it can be fulfilled in a different way?

Maybe I’m misunderstanding something.

I believe this would still work here: the existing operands are still in the old type system thanks to the unrealized_cast and the adaptor can be constructed using the operands of the unrealized_cast.

This is an interesting question.

I think we still need a “list of all ops to be converted” because walking the IR while modifying it sounds dangerous. The current dialect conversion maintains such a list.

It also has a kind of “worklist”: newly created operations and in-place modified operations must be (re)processed, but this is implemented via a recursive call to legalize.

Imo, the main difference from the greedy pattern driver is:

  • Patterns are applied only to legal ops. Legal ops are never put on the worklist.
  • Conversion patterns must erase / replace / in-place modify the root op. And the driver asserts that.
  • The property that causes an op to be enqueued during the rewrite process is different.
    • Greedy pattern driver: users of modified ops, ops defining operands of modified ops, in-place modified ops, newly created ops, etc., are enqueued. Basically an op is enqueued whenever something is changing “near” the op.
    • Dialect conversion: Newly created ops and in-place modified ops are enqueued.

If we add a customization point for the “enqueueing logic” of the greedy pattern driver, we may be able to use it as a dialect conversion driver.

You could still generate new illegal ops. These would be put on the worklist. But if they are not lowered into legal ops later, the conversion would be “stuck” and fail.

What’s the meaning of “pooling” in this context?

This part is also still a bit fuzzy in my head, but based on my understanding there are two use cases for the type converter:

  • It controls type conversion when calling helper functions such as ConversionPatternRewriter::convertRegionTypes.
  • It can be used to parameterize a pattern. E.g., the SCFStructuralTypeConversions define a generic scf.for rewrite pattern that works with any kind of type conversion. This is the same as adding a field to a regular rewrite pattern to customize its behavior. We just have a standardized API for type conversions.

This means that a dialect conversion can actually work without a type converter. E.g., if you have a conversion from my_dialect_A to my_dialect_B, you could compute the new types directly in your patterns. (In many cases that won’t even be necessary because op builders usually infer the result type.) Imo, the main benefit of the type converter is the fact that you don’t have to reimplement patterns for upstream loops, control flow ops, functions, etc.
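
A hedged sketch of that point, with made-up dialects a and b: the pattern spells out the converted result type itself, so no TypeConverter is involved.

    struct ABoxToBBox : public OpConversionPattern<a::BoxOp> {
      using OpConversionPattern<a::BoxOp>::OpConversionPattern;

      LogicalResult
      matchAndRewrite(a::BoxOp op, OpAdaptor adaptor,
                      ConversionPatternRewriter &rewriter) const override {
        // Dialect "b" boxes an i64 payload where dialect "a" used index; the
        // new result type is computed right here instead of via a TypeConverter.
        Type newType = IntegerType::get(op.getContext(), 64);
        rewriter.replaceOpWithNewOp<b::BoxOp>(op, newType, adaptor.getOperands());
        return success();
      }
    };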

So I would think of the type converter as an optional/non-load-bearing part of the infrastructure.

This hints at the second open question that I mentioned. It is unclear what happens when there are multiple patterns that could be applied. If the first one fails, we try the next one until a pattern succeeds. So far so good. But what if two patterns could succeed? Which one should we apply? Should there be a priority/benefit? Is it undefined (kind of like in the greedy pattern rewriter)?

Now that I think of it, it is also undefined in the current dialect conversion: there may be multiple lowering paths that lead from the illegal input IR to (different) legal output IR, and it is undefined which path is chosen (unless you work with benefit or the cost model). Maybe this is not really a problem…

+1 for simplifying dialect conversion! Especially materializing IR changes immediately should help with debugging / understanding what is going on.

We have such a use case downstream. In essence, we have two patterns that can apply to a specific operation:

  1. A main pattern that can lift an operation to a higher-level representation; it has a higher benefit and should apply if possible (the condition for whether it can apply is somewhat lengthy).
  2. A fallback pattern that has a lower benefit and that handles all operations that we cannot lift. It wraps the operations that cannot be converted in a region operation with extra semantics.

We implemented two patterns instead of one since they do fundamentally different things and since the fallback pattern is generic and applies to multiple different operation types. Theoretically, we could strengthen the matching logic in the fallback pattern to succeed only if the main pattern does not match. However, this logic is quite complex and expressing a pattern order seems more favorable in terms of runtime and code complexity.

I would thus vote for having some form of benefit mechanism for such fallback patterns. It is a bit like an if-else for pattern application where we don’t need to write the full boolean condition for the else block. Instead of a benefit, I could also imagine that the patterns for a specific operation type are registered in the order they should be tried. This solution would be a bit more explicit and there is always an order (there cannot be two patterns with the same benefit and undefined execution order).

Oh: to me the concept of “one shot” meant that there is no worklist and no recursive application of patterns! Probably worth finding a better name if all you want to avoid is the transactional aspect of the dialect conversion.
But having worklists and other complex machinery to track IR modification makes the proposal a much less important departure from the current driver (you still need a lot of indirection through the rewriter to keep state managed somehow).

I meant being able to recycle the unrealized_cast operations in a free list in the driver instead of constantly erasing them and creating new ones.

In what you’re describing, I don’t quite follow whether anything changes in your mind for the type converter compared to the existing dialect conversion?

It isn’t really “undefined” in the greedy pattern driver: there is a priority/benefit associated with patterns, and otherwise the order of insertion of the pattern is a tie breaker.

I was referring to “type conversion” broadly vs the type converter specifically. Indeed, that seems like where the type converter is used.

What I was getting at: the original dialect conversion got the feature late in its development to materialize illegal type conversions in the IR, and it is still conservative about it. I think I recall the reason being a desire to avoid the overhead of creating so many cast ops. Commentary at the time was that the right way is to materialize in IR, but the design damage was already done with respect to the code.

The critical things in my mind to get right for the next one are: either prove this overhead is not important (or the cost of accounting without it exceeds the overhead) or implement some kind of op pooling scheme to eliminate it, and eliminate or dramatically scope down the backtracking.

Both of those together, imo, would be more than enough to justify a V2, and it seems likely would produce the desired simplification. As Mehdi says, though, without a further diversion on the requirements, it will still have a fair amount of complexity to it. But that may be unavoidable.

Does dropping the rollback require a new driver? Is it possible to just add an option to the existing dialect conversion driver that avoids the rollback?

I suspect it would be better for code quality to rewrite from scratch. I think doing that is separable from how we roll this out.

First of all, thank you very much for proposing to tackle this problem! The complexity of the dialect conversion framework has bothered me for a long time and, despite being one of the people who introduced it in the first place, I have never found the energy to actively push for a redesign.

Let me first provide a brief historical excursion to explain some of the early design decisions. The initial client of the dialect conversion infrastructure was the conversion from the standard to the LLVM dialect. Unlike previous pattern-based transformations, it needed to convert types in addition to operations. Moreover, patterns needed to see the original operand types of operations, potentially beyond the root operation being converted. As a result, the initial design included maintaining the original IR module and having a new, converted IR module being constructed next to it. Hence the IRMapping (which back in the day was a simple pair of maps for values and blocks). The alternative considered was to make the type conversion invertible, but it was deemed infeasible as type conversions are not injective, e.g. both i64 and index were mapping to !llvm.int<64> back in the day. The implementation was inspired by, but completely disconnected from, RewritePattern and the greedy rewriter because it didn’t need the fixpoint iteration.

Later changes were based on the premises of (1) reducing code duplication between dialect conversion and canonicalization and (2) maintaining two modules being excessively expensive. The dialect conversion was relayered on top of RewritePattern and drivers/builders were similarly unified. This required the delayed erasure mechanism to still be able to see the original operand types of the ops beyond the current one. The rollback mechanism was designed as a response to criticism that the new mechanism could produce IR that breaks core IR invariants, in particular around types on the two ends of a use-def link, should one of the patterns fail. (Note that unrealized_conversion_cast is a much later addition and was not considered at that point; rather, the consensus was that it would be an unprincipled op that has no semantics.) While the original two-module implementation allowed one to just discard the new IR module on any error and keep the original module, the new implementation could not do that anymore.

Even later changes centered on the so-called A->B->C conversion, driven by the vision that either the lowering or the entire compiler could be structured as a bag of patterns and controlled by a target description that specifies which operations are valid on the target, and what is the cost/benefit of applying a specific pattern. The infrastructure would, in theory, consider the current and the desired state of the IR, analyze the possible paths through the lowering graph and choose the most beneficial one. This has led to significant improvements in the rollback mechanism, a more advanced IR mapping that can see through multiple type changes, and the “conversion target” infrastructure. This has also introduced a way for patterns to specify the kinds of operations they might produce, needed to support the lowering graph construction, but this feature hasn’t seen wide adoption in conversion patterns. In the absence of such a specification, the rollback mechanism can be used to backtrack from the partially rewritten state if the desired legal state cannot be reached.


My current assessment (it may and does differ from my previous arguments about the direction of the infrastructure) of this state is as follows:

  • The introduction of the original rollback mechanism is based on a wrong assumption and a misplaced desire for backward compatibility: the infrastructure should not necessarily guarantee the validity of the IR when the transformation fails. It is okay to indicate that the IR was rendered invalid and leave it up to the caller to deal with that; many passes simply signal a failure.
  • The repurposing of the rollback mechanism to do backtracking doesn’t pay for itself.
  • The assumption that having two parallel modules is too expensive is simply wrong. I don’t remember seeing any data to back that originally, and now we ended up building a parallel, less efficient, uninspectable and non-roundtrippable IR representation for the purpose of rollbacks: when using updateOpInPlace, we maintain the list of operands, attributes and result types, which is most of the operation state, but as a simple list of SmallVectors stored in another vector.
  • The layering of ConversionRewritePattern on top of RewritePattern may be a mistake. It creates a false sense of compatibility when multiple edge cases exist as described above. Until recently, some modifications were not possible with a simple PatternRewriter and required direct IR mutation, which is disallowed in conversion patterns. This has also introduced some of the most vicious bugs in MLIR that I have seen, specifically when the address of a deleted operation gets recycled by the system allocator to store another operation, rendering it invisible to asan. I personally have spent O(weeks) debugging these, and the project-level tally is clearly on the order of person-months.

Based on these, I would strongly consider recreating something that looks closer to the original state described in my first historical paragraph: no rollback, no additional state, intentionally not compatible with the greedy rewriter. This may require the conversion to happen in several stages, most likely having one “type system change” at a time, but that would have a significantly lower overall complexity than the current state IMO.

Two high-level questions:

  • Given that we originally needed to look at the original types of the values, how certain are we that the locally-greedy approach will work now? From vague memories, this was originally connected to extracting static sizes from memref types when lowering memref operations, potentially when the memref itself comes from a block argument.
  • Can this solve the ordering issue we currently have between -convert-func-to-llvm and -convert-cf-to-llvm where one order is valid and another is not?

As far as I remember, this was motivated by partial conversion: if you just want to lower SCF to CF for example, the cost of cloning the entire module does not seem trivial at all to me.

2 Likes