First of all, thank you very much for proposing to tackle this problem! The complexity of the dialect conversion framework has bothered me for a long time and, despite being one of the people who introduced it in the first place, I have never found the energy to actively push for a redesign.
Let me first provide a brief historical excursion to explain some of the early design decisions. The initial client of the dialect conversion infrastructure was the conversion from the standard to the LLVM dialect. Unlike previous pattern-based transformations, it needed to convert types in addition to operations. Moreover, patterns needed to see the original operand types of operations, potentially beyond the root operation being converted. As a result, the initial design included maintaining the original IR module and having a new, converted IR module being constructed next to it. Hence the IRMapping (that back in the day was a simple pair of maps for values and blocks). The alternative considered was to make the type conversion invertible, but it was deemed infeasible as type conversions are not injective, e.g. both i64 and index where mapping to !llvm.int<64> back in the day. The implementation was inspired by, but completely disconnected from RewritePattern and the greedy rewriter because it didn’t need the fixpoint iteration.
Later changes were based on the premises of (1) reducing code duplication between dialect conversion and canonicalization and (2) maintaining two modules being excessively expensive. The dialect conversion was relayered on top of RewritePattern and drivers/builders were similarly unified. This required the delayed erasure mechanism to still be able to see the original operand types of the ops beyond the current one. The rollback mechanism was designed as a response to criticism that the new mechanism could produce IR that breaks core IR invariants, in particular around types on the two ends of a use-def link, should one of the patterns fail. (Note that unrealized_conversion_cast is a much later addition and was not considered at that point, rather the consensus was that it would be an unprincipled op that has no semantics.) If the original two-module implementaiton allowed one to just discard the new IR module on any error and keep the original module, the new implementation could not do that anymore.
Even later changes proposed on the so-called A->B->C conversion that was driven by the vision that either the lowering or the entire compiler could be structured as a bag of patterns and controlled by a target description that specifies which operations are valid on the target, and what is the cost/benefit of applying a specific pattern. The infrastructure would, in theory, consider the current and the desired state of the IR, analyze the possible paths through the lowering graph and choose the most beneficial one. This has led to significant improvements in the rollback mechanism, more advanced IR mapping than can see through multiple type changes, and the “conversion target” infrastructure. This has also introduced the way for patterns to specify kinds of operations they might produce, needed to support the lowering graph construction, but this feature hasn’t seen wide adoption in conversion patterns. In absence of such specification, the rollback mechanism can be used to backtrack from the partially rewritten state if the desired legal state cannot be reached.
My current assessment (it may and does differ from my previous arguments about the direction of the infrastructure) of this state is as follows:
- The introduction of the original rollback mechanism is based on a wrong assumption and misplaced desire for backward compatibility: the infrastructure should not necessarily guarantee the validity of the IR when it the transformation fails. It is okay to indicate that the IR was rendered invalid and it’s up to the caller to deal with that, many passes simply signal a failure.
- The repurposing of the rollback mechanism to do backtracking doesn’t pay for itself.
- The assumption that having two parallel modules is too expensive is simply wrong. I don’t remember seeing any data to back that originally, and now we ended up building a parallel, less efficient, uninspectable and non-roundtrippable IR representation for the purpose of rollbacks: when using
updateOpInPlace, we maintain the list of operands, attributes and result types, which is most of the operation state, but as a simple list ofSmallVectors stored in another vector. - The layering of
ConversionRewritePatternon top ofRewritePatternmay be a mistake. It creates a false sense of compatibility when multiple edge cases exist as described above. Until recently, some modifications were not possible with a simplePatternRewriterand required direct IR mutation, which is disallowed in conversion patterns. This has also introduced some of the most vicious bugs in MLIR that I have seen, specifically when the address of a deleted operation gets recycled by the system allocator to store another operation rendering it invisible to asan. I personally have spent O(weeks) debugging these, and the project-level tally is clearly in on O(person-months).
Based on these, I would strongly consider recreating something that looks closer to the original state described in my first historic paragraph: no rollback, no additional state, intentionally not compatible with the greedy rewriter. This may require the conversion to happen in several stages, most likely having one “type system change” at a time, but that would have a significantly lower overall complexity than the current state IMO.
Two high-level questions:
- Given that we originally needed to look at the original types of the values, how certain are we that the locally-greedy approach will work now? From vague memories, this was originally connected to extracting static sizes from memref types when lowering memref operations, potentially when the memref itself comes from a block argument.
- Can this solve the ordering issue we currently have between
-convert-func-to-llvmand-convert-cf-to-llvmwhere one order is valid and another is not?