[RFC] A New "One-Shot" Dialect Conversion Driver

Context-dependent type conversion has been mentioned multiple times as a useful but missing feature. I don’t plan to address this as part of the One-Shot effort, but it could be a follow-up.

The way we would have to do this today is to have a pass run the analysis, populate attributes on the ops, and then convert the types manually in the pattern.

I don’t see a way around this. Presumably you have a conversion pattern for each op and that pattern computes + rewrites the op with new types for all block arguments and op results. It sounds like the type conversion rule is op-specific, so I would expect this type conversion rule to be part of the pattern. (Alternatively: What kind of TypeConverter API do you think would be useful?)

The downside is that many of the helpers (FunctionOpInterfaceSignatureConversion and similar for interface-backed ops) break, because the conversion of region/block argument types cannot take the op into account.

What kind of functionality do you need, let’s say for a func.func or scf.for conversion?

Yes, sorry, I was just tacking on since the existing comments on this topic are in this thread.

Something like the following would suffice.

class ContextAwareTypeConverter : public TypeConverter {
  // Returns a TypeRange with the converted types of the values in the input range,
  // taking into account the context of the values (e.g., defining ops or uses).
  TypeRange convertValueRangeTypes(ValueRange values);

  // Returns a TypeRange with the converted types of the results of an op,
  // taking into account the context of the op when selecting the new type.
  TypeRange convertOpResultTypes(Operation *op, TypeRange resultTypes);
};

For example, the func.func op may have argument attributes that inform the lowering of those types but are not part of the types themselves. In my case, having access to the attributes on the op would suffice, and the above API (convertValueRangeTypes) is enough to get from the SSA values to their defining ops and check for any attributes.
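To make that concrete, the per-value lookup I have in mind is roughly the following. This is only a minimal sketch: the attribute name "my.context" and the helper are made up, and I’m assuming the usual MLIR headers.

#include "mlir/IR/Value.h"
#include "mlir/Interfaces/FunctionInterfaces.h"

using namespace mlir;

// Hypothetical helper: find the context attribute for a value by inspecting
// its defining op, or the enclosing function's argument attributes if the
// value is a block argument. "my.context" is a placeholder name.
static Attribute lookupContextAttr(Value value) {
  if (auto arg = dyn_cast<BlockArgument>(value)) {
    auto funcOp =
        dyn_cast_or_null<FunctionOpInterface>(arg.getOwner()->getParentOp());
    return funcOp ? funcOp.getArgAttr(arg.getArgNumber(), "my.context")
                  : Attribute();
  }
  return value.getDefiningOp()->getAttr("my.context");
}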

Why do you have both convertValueRangeTypes and convertOpResultTypes? It looks like convertValueRangeTypes(op->getResults()) could be used instead of convertOpResultTypes.

Ah, right, it seems that we could get away with just one function there.

For reference, our downstream project currently implements a ContextAwareTypeConverter with the following API:

struct ContextAwareTypeConverter : public TypeConverter {
 public:
  // Convert types of the values in the input range, taking into account the
  // context of the values (e.g., defining ops or uses).
  // NOTE that this also converts the types of the values themselves,
  // beyond just calculating the new types.
  virtual void convertValueRangeTypes(
      ValueRange values, SmallVectorImpl<Type> &newTypes) const = 0;

  // Convert types of the results of an op, taking into account the context of
  // the op when selecting the new type.
  // NOTE that this just calculates the new types.
  virtual void convertOpResultTypes(
      Operation *op, SmallVectorImpl<Type> &newResultTypes) const = 0;

  // Convert types of the arguments and results of a function, taking into
  // account the context of the function when selecting the new types.
  // Note that this method is not used for converting the function type itself.
  // NOTE that this also converts the types of the arguments/results themselves,
  // beyond just calculating the new types.
  virtual void convertFuncArgumentAndResultTypes(
      FunctionOpInterface funcOp, SmallVectorImpl<Type> &newArgTypes,
      SmallVectorImpl<Type> &newResultTypes) const = 0;
};

The reason I still keep convertOpResultTypes is that it does not change the result types, whereas convertValueRangeTypes does the setType for me (check here), because OpAdaptor is not aware of this TypeConverter. I know it is unsafe, but until I get all the way down to OpAdaptor I have to use such a workaround.

Further, the detailed context here is that each Value has a special Attribute associated with it, attached via its defining op or block argument, so we have a

struct TypeWithAttrTypeConverter : public ContextAwareTypeConverter {
  // The derived converter should implement this to do the actual type conversion.
  virtual Type convertTypeWithAttr(Type type, Attribute attr) const = 0;
};
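For completeness, the value-range hook then composes with convertTypeWithAttr roughly like this. This is a simplified sketch: the override would be declared in TypeWithAttrTypeConverter, the lookup only handles defining ops (block arguments would go through their parent op's attributes), the "my.context" name is a placeholder, and the in-place setType workaround mentioned above is left out.

// Rough sketch: derive the new type of each value from the attribute attached
// to its defining op, delegating to convertTypeWithAttr.
void TypeWithAttrTypeConverter::convertValueRangeTypes(
    ValueRange values, SmallVectorImpl<Type> &newTypes) const {
  newTypes.reserve(values.size());
  for (Value value : values) {
    Attribute attr;
    if (Operation *defOp = value.getDefiningOp())
      attr = defOp->getAttr("my.context"); // placeholder attribute name
    newTypes.push_back(convertTypeWithAttr(value.getType(), attr));
  }
}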

Now that the 1:N dialect conversion refactoring is done, I am continuing the work on this RFC.

Outline of Next Steps

  1. Remove rollback functionality from the dialect conversion driver. This is a breaking API change.
    1.1. Update all conversion patterns in MLIR / Flang / … that currently trigger rollbacks. These are mostly patterns that start modifying the IR and then return failure(); see the sketch after this list for what a typical fix looks like. Most patterns have already been updated over the last few weeks, but a few remain, mostly in the SPIRV dialect.
    1.2. Add a new allowPatternRollback flag to ConversionConfig that existing users of the dialect conversion framework can use to find patterns that trigger a rollback. These patterns must be updated.
    1.3. After some time, once all downstream users have had enough time to migrate their patterns, delete all rollback-related code.
  2. Internal refactoring: materialize all IR changes immediately. This step is mostly NFC from a user’s perspective. I say “mostly” because some things change: e.g., an operation can no longer be removed when it still has uses. (This is allowed today, as long as all users are guaranteed to be removed by the end of the conversion.)
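To illustrate what the pattern updates in step 1.1 typically look like, here is a made-up before/after of the same matchAndRewrite method (MyOp, SomeOp, and isSupported are placeholders, not real MLIR APIs). The only change is that the precondition check moves in front of the first IR modification, so failure() is returned before anything has been rewritten.

// Before: the pattern creates IR and only then discovers it cannot finish.
// Returning failure() at this point forces the driver to roll back.
LogicalResult matchAndRewrite(MyOp op, OpAdaptor adaptor,
                              ConversionPatternRewriter &rewriter) const {
  Value repl = rewriter.create<SomeOp>(op.getLoc(), adaptor.getInput());
  if (!isSupported(op)) // too late: the IR was already modified
    return failure();
  rewriter.replaceOp(op, repl);
  return success();
}

// After: all checks happen before the first modification, so a failure no
// longer requires any rollback.
LogicalResult matchAndRewrite(MyOp op, OpAdaptor adaptor,
                              ConversionPatternRewriter &rewriter) const {
  if (!isSupported(op))
    return rewriter.notifyMatchFailure(op, "unsupported variant");
  Value repl = rewriter.create<SomeOp>(op.getLoc(), adaptor.getInput());
  rewriter.replaceOp(op, repl);
  return success();
}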

I expect Step 1 to improve the robustness of the dialect conversion framework. The rollback mechanism is error-prone, and many bugs/crashes/… in the past were rollback-related. I am quite certain that there are more bugs in the rollback mechanism that are either unknown or that we are working around today.

I expect Step 2 to simplify the code base and improve compile-time performance. Internal data structures such as ConversionValueMapping and IRRewrite (including its subclasses) can be deleted. No additional bookkeeping is needed to keep track of changes that have only partly materialized. Step 2 will also enable use cases that are currently not safely supported, such as mixing conversion patterns and rewrite patterns.

Status of Step 1.1:

Failed Tests (14):
  MLIR :: Conversion/ArithToSPIRV/arith-to-spirv.mlir
  MLIR :: Conversion/ArithToSPIRV/fast-math.mlir
  MLIR :: Conversion/ConvertToSPIRV/vector.mlir
  MLIR :: Conversion/MathToSPIRV/math-to-gl-spirv.mlir
  MLIR :: Conversion/SCFToGPU/parallel_loop.mlir
  MLIR :: Conversion/TosaToLinalg/tosa-to-linalg-invalid.mlir             (fixed by #136308)
  MLIR :: Conversion/VectorToSPIRV/vector-to-spirv.mlir
  MLIR :: Dialect/SPIRV/IR/target-env.mlir
  MLIR :: Integration/Dialect/Linalg/CPU/unpack-dynamic-inner-tile.mlir   (unrelated ASAN error)
  MLIR :: Integration/Dialect/SparseTensor/CPU/reshape_dot.mlir           (unrelated ASAN error)
  MLIR :: Transforms/test-legalize-type-conversion.mlir                   (testing rollback)
  MLIR :: Transforms/test-legalizer-full.mlir                             (testing rollback)
  MLIR :: Transforms/test-legalizer.mlir                                  (testing rollback)
  MLIR :: Transforms/test-merge-blocks.mlir                               (testing rollback)

Most patterns have already been updated. (The changes were quite simple.) I have not looked at Flang and other projects yet. The patterns above are the remaining ones that are failing when rollback is disallowed. There are likely additional patterns with bad test coverage that trigger a rollback in certain cases and are not listed above.

While I am working on updating the last SPIRV and GPU-related patterns, I’d like to ask people to run their test suites with this PR to get a feeling for how many patterns must be updated.

Please let me know if you have any concerns.


Nice! If by any chance you can measure the time it takes to exercise conversions before and after this refactoring, it would give you an objective improvement metric. This is arguably one of the slowest parts we have in the infra.


We have found a somewhat common case where we currently rely on rollbacks: we sometimes have several possible patterns, one that creates ops that cannot be lowered further in the conversion step, and another where lowering works. With rollbacks, we would eventually hit a failure with the “wrong” pattern and it would be rolled back. Without rollbacks, we now need to figure out up front which pattern should actually be used. Given that we have shared pattern “libraries”, this is somewhat complicated. I guess we will have to provide different populate methods for different users, so that each user only gets the patterns they actually want to use? Any other suggestions?

That’s one way to do it: make pattern application fail if no lowering progress can be made down the road. This gets more and more difficult as the number of pattern applications on the path from the “branching point” to the “dead end” increases. For the few cases that I found in MLIR itself, it was typically the immediate next pattern that failed, so it was easy to refactor.
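Roughly, such a refactoring can look like this (a sketch with made-up names: FooOp, BazOp, and canLowerBazOp are placeholders; the real patterns I touched differ in the details):

// Hypothetical branching-point pattern: it decomposes FooOp into a BazOp. If
// BazOp cannot be lowered further in this configuration, bail out before
// modifying any IR, so the alternative FooOp pattern is tried instead.
struct DecomposeFooOp : public OpConversionPattern<FooOp> {
  using OpConversionPattern::OpConversionPattern;

  LogicalResult
  matchAndRewrite(FooOp op, OpAdaptor adaptor,
                  ConversionPatternRewriter &rewriter) const override {
    // Duplicated precondition of the downstream BazOp lowering (placeholder).
    if (!canLowerBazOp(op))
      return rewriter.notifyMatchFailure(op, "BazOp cannot be lowered further");
    rewriter.replaceOpWithNewOp<BazOp>(op, adaptor.getOperands());
    return success();
  }
};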

This would be the ideal solution. It’s also going to improve compilation time and reduce the complexity of your passes.

Conversely, it can be seen as increasing the complexity and coupling of the system: the complexity moves from the framework into the passes, which now have a more ad-hoc setup.

@matthias-springer, TensorFlow → TFLite legalization relies fairly heavily on this rollback behavior, unfortunately. How long do you expect to support the allowPatternRollback flag?

allowPatternRollback is just a debugging flag that helps with the transition. I don’t have a clear timeline. Once all patterns in MLIR have been updated and downstream users have had enough time to update their code bases (based on the feedback in this thread), I plan to remove the rollback mechanism.
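For anyone who wants to try this on their own pipeline, usage is intended to look roughly like the following, assuming the flag lands on ConversionConfig with the name proposed above (the exact spelling may still change):

// Sketch: run an existing partial conversion with pattern rollback disallowed
// to surface patterns that still rely on it. "module", "target", and
// "patterns" come from the surrounding pass as usual.
ConversionConfig config;
config.allowPatternRollback = false; // report patterns that would need a rollback
if (failed(applyPartialConversion(module, target, std::move(patterns), config)))
  signalPassFailure();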

Is the TensorFlow → TFLite legalization open source? Can you send me a link to the code?

What @arfaian mentioned is actually the case I was also talking about. The part I looked at that caused problems is here:

We call the TF::PopulateLoweringTFPatterns method, which contains some patterns that lower to other TF ops that are not lowered further by any of the patterns registered in this dialect conversion. For example, FakeQuantWithMinMaxVarsOp creates a ClipByValueOp. There is another pattern to lower FakeQuantWithMinMaxVarsOp that is used if the TF pattern is rolled back, so currently it works, but without rollbacks it no longer does. There are more issues along those lines.

In addition to what Adrian mentioned, most of the legalization patterns for TF → TFL do not have any constraints on the patterns themselves, relying instead on verification logic defined on the target TFL op.

For example, consider this tf.Mul → tfl.mul legalization pattern:

It will generate a tfl.mul op in all cases, but can get rolled back if this verification trait fails:

@matthias-springer If possible, can you give us another warning one week in advance of when you plan to remove this option, so that we can prioritize fixing this before it becomes really urgent?

IMO we should not remove the option if the feature is useful and there are users for it, unless there is an easy path to move to the new flow.
At this point this could be just a new driver instead.

@akuegel I will leave any change of that “size” out for review for at least 2 weeks and post it here before merging, so there should be enough time to fix things. (And if you need more time, comment on the PR.)

@mehdi_amini I’m a bit stuck here. I don’t want to remove anything that is used and not easy to migrate away from. I still have to study the above-mentioned patterns in TF in depth… Adding a new driver (as discussed in the RFC in the beginning) would be an option, but after talking with a few folks in person (this was many months ago), they had me convinced that building a parallel infrastructure in MLIR is not ideal.

Even if not ideal, I’d say it is justified in this case. We are trading off slightly more complexity and compiler build time (if both are used) for increased compilation speed and robustness. It would also help if the outcome included clear documentation on the cost of each option, along the lines of:

  (a) only use the rollback rewriter when you actually need rollback; there’s 10x overhead.
  (b) only use dialect conversion when you are converting types; there’s 4x overhead.
  (c) use the greedy rewriter when you need neither but want a fixed point; there’s 2x overhead.
  (d) use a simple walk otherwise; there’s no overhead.

The numbers are a guesstimate, but you get the point.


I agree with Alex and Mehdi here (I say this without knowing the complexity as well as you do, Matthias; e.g., I don’t know whether we could constexpr/template it away to “just” a compile-time cost).

And definitely +1 to having some doc/comment as to cost: I’ve seen folks use dialect converter liberally for no good reason other than it has “dialect” in the name and they are doing a rewrite that involves dialects.

My main disappointment, if we can’t get a simplification of the existing conversion infra out of this, is that it would have been great to lower the barrier to changing RewriterBase. I remember it missing a couple of useful features (notably around rewriting block arguments, IIRC) that would have been nice to have. Because the conversion infra’s implementation of the rewriter is so complex and lives upstream, it is really hard to make those small improvements that would benefit all rewriters, and I guess a lot of that complexity comes from rollbacks.