[RFC] CanonicalizerPass convergence / error handling

The CanonicalizerPass uses the greedy pattern rewriter to apply canonicalization patterns until a fixpoint is reached. However, the pass succeeds even if no fixpoint is reached, as the return value of applyPatternsAndFoldGreedily is ignored.

void CanonicalizerPass::runOnOperation() {
  /* ... */
  (void)applyPatternsAndFoldGreedily(getOperation(), patterns, config);
}

This means that the IR is not guaranteed to be in canonical form after running the canonicalizer pass. This can be seen as either a bug or a case of incomplete documentation.

Based on a discussion in ⚙ D140482 [mlir] Signal CanonicalizerPass failure due to non-convergence, canonicalization could be considered “best-effort”, in which case the current implementation would be correct.

The problem with the current situation is:

  • According to the MLIR docs, users expect that the IR is in canonical form after running -canonicalize, but this is not necessarily the case.
  • Users often run canonicalization as a pre-processing pass to enable other transforms/canonicalizations. But this is incorrect, as the canonicalizer may not bring the IR into canonical form.
  • -canonicalize may apply some canonicalization patterns while “ignoring” others. Which patterns are applied and which are not (if any) can change if new patterns are added or additional dialects are loaded. This is hard to debug.
  • Faulty canonicalization patterns (e.g., patterns that do not converge) can remain unnoticed. They can start breaking the CanonicalizerPass at any point in a non-predictable way. We currently have faulty patterns in the GPU and OpenACC dialects (that nobody noticed), as indicated by the failed tests of ⚙ D140482 [mlir] Signal CanonicalizerPass failure due to non-convergence. There are probably even more in downstream projects. (This is what the docs say about canonicalization patterns: Repeated applications of patterns should converge. Unstable or cyclic rewrites will cause infinite loops in the canonicalizer.)

Here are two proposals to fix this. (My preferred solution is the first one.)

Proposal 1: Signal pass failure on non-convergence

void CanonicalizerPass::runOnOperation() {
  /* ... */
  if (failed(applyPatternsAndFoldGreedily(getOperation(), patterns, config)))
    signalPassFailure();
}

and do not limit the number of iterations by default (pass option of CanonicalizerPass):

Option<"maxIterations", "max-iterations", "int64_t",
       /*default=*/"-1",
       "Max. iterations between applying patterns / simplifying regions">,

The downside of this approach is that the CanonicalizerPass may run longer. It is, however, guaranteed to terminate; otherwise, we must have a bug in a pattern. We can add additional debugging flags to the CanonicalizerPass that help track down such bugs (e.g., ⚙ D140525 [mlir] Limit number of pattern rewrites in CanonicalizerPass).
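For illustration, this is roughly what the combination could look like (a sketch only; the maxNumRewrites field follows the naming in GreedyRewriteConfig, and treating -1 as “no limit” is an assumption of this proposal):

void CanonicalizerPass::runOnOperation() {
  /* ... */
  GreedyRewriteConfig config;
  // -1 is treated as "no limit": iterate until a fixpoint is reached.
  config.maxIterations = maxIterations;
  // Debugging aid in the spirit of D140525: bound the total number of
  // rewrites so a cycling pattern surfaces as a failure, not a hang.
  config.maxNumRewrites = maxNumRewrites;
  if (failed(applyPatternsAndFoldGreedily(getOperation(), patterns, config)))
    signalPassFailure();
}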

Proposal 2: Keep ignoring convergence (NFC) and update documentation + fuzzer flags

Update the MLIR docs (Operation Canonicalization - MLIR) as follows:

MLIR has a single canonicalization pass, which iteratively applies
canonicalization transformations in a greedy way. Canonicalization
is best-effort and not guaranteed to bring the entire IR into a canonical
form. In the worst case, the canonicalizer could be a no-op; every pass
pipeline should work correctly under such an assumption.

Note: If canonicalization is a necessary pre-processing step, it should
be done in a dedicated pass or integrated into the pass that requires
pre-processing. (Ideally with a greedy pattern rewriter that applies
only the required patterns and applies them to a fixpoint.)

This change would likely mean that many current pass pipelines are incorrect. To help fix them, we could add a debug flag that replaces the CanonicalizerPass with a no-op. This can be used as a fuzzer.
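A minimal sketch of that debug flag (the flag name and wiring are hypothetical):

// Hypothetical test-only flag: when set, -canonicalize leaves the IR
// untouched, so pipelines that rely on canonicalization for correctness
// fail visibly instead of silently depending on it.
static llvm::cl::opt<bool> clTestNoopCanonicalizer(
    "test-noop-canonicalizer",
    llvm::cl::desc("Replace the canonicalizer with a no-op (fuzzing aid)"),
    llvm::cl::init(false));

void CanonicalizerPass::runOnOperation() {
  if (clTestNoopCanonicalizer)
    return; // No-op: pretend canonicalization ran.
  /* ... */
  (void)applyPatternsAndFoldGreedily(getOperation(), patterns, config);
}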

@mehdi_amini @ftynse @River707 @nicolasvasilache @pifon2a @jpienaar


I haven’t had time to digest much (effectively holiday in US), but my quick take is that I am somewhat strongly opposed to proposal 1. Canonicalization should not be necessary for correctness; that seems like a very troubling sign to me. It’s also somewhat concerning to allow unbounded recursion in the hopes that it does eventually converge; that almost always indicates a bug in either the processing or a pattern, or an edge case that we do not deem reasonable to continually iterate for. The benefits of a canonical form are that optimizations have fewer forms that they need to check, but anything with a bearing on correctness should handle all of the possible forms of IR that can reach it.

– River


Can we just add a pass option to optionally check for convergence so we can exercise it in tests? The normal use of the pass will not have that flag enabled.
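Something along these lines, perhaps (option and field names hypothetical, following the pass-option style above):

// Hypothetical bool pass option "test-convergence", false by default, so
// the normal use of the pass is unchanged.
void CanonicalizerPass::runOnOperation() {
  /* ... */
  LogicalResult converged =
      applyPatternsAndFoldGreedily(getOperation(), patterns, config);
  if (testConvergence && failed(converged))
    signalPassFailure();
}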


InstCombine in LLVM has similar issues. They eventually came up with an upper limit with command line flags:
https://reviews.llvm.org/D71673


I don’t consider this “incorrect”, but these passes must not miscompile in the absence of canonical IR. That is, lowering and other flows can lead to suboptimal code generation, but we should always be correct.

Loading new dialects should not affect anything if no operation from those dialects is involved, so I’m not sure what kind of situation you have in mind here?

How do you propose to make this robust? That is: how can we guarantee that we won’t run into infinite loop with some new IR input that we haven’t tested on?
It is a non-starter to me to ship a pipeline that can infinite loop when fed some inputs that would trigger corner cases in the convergence algorithm.

IMO, unless we can prove that we will converge, I don’t see how to get there right now. (But if we had a principled and sound way to ensure convergence, then yes, proposal 1 would be ideal.)

Proposal 2 matches my mental model of canonicalization. About this note, I would clarify what “necessary” means: there is “necessary for an optimization to kick in” and “necessary for correct code to be generated”, and these are fundamentally different.


I was thinking of situations like “pass X assumes canonical form”, otherwise its patterns do not match and the IR cannot be lowered any further. So not really “incorrect”.

Right, just loading is fine. It must be “loading and starting to use”. This can lead to situations where adding ops from dialect B to a test that used to use only ops from dialect A stops canonicalization of ops from dialect A (because maxIterations was reached). This is confusing to the user, because modifying an unrelated part of the IR affects a different part of the IR.

This does not seem so bad to me. The documentation states that repeated pattern application must converge. So an infinite loop would be a compiler bug. We also don’t ban while loops in C++, even though they may not terminate due to a bug in the code.

What makes it tricky though is cross-dialect canonicalization. This can lead to situations where the canonicalization patterns of dialect A by itself and dialect B by itself are fine, but both put together (where A and B may even come from different code bases) are not, because there are 2 patterns that undo each other. So that would indeed be a problem… And neither the author of dialect A nor of B is to blame.
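To make this concrete, here is a sketch of such a cycle (MulOp, ShlOp, and the helpers are hypothetical): each pattern is a plausible canonicalization in isolation, but registered together they undo each other, so the greedy driver never reaches a fixpoint.

// Dialect A's idea of canonical: x * 2 -> x << 1.
struct MulToShl : public OpRewritePattern<MulOp> {
  using OpRewritePattern<MulOp>::OpRewritePattern;
  LogicalResult matchAndRewrite(MulOp op,
                                PatternRewriter &rewriter) const override {
    if (!isConstantInt(op.getRhs(), 2)) // hypothetical helper
      return failure();
    rewriter.replaceOpWithNewOp<ShlOp>(op, op.getLhs(),
                                       buildConstantInt(rewriter, op, 1));
    return success();
  }
};

// Dialect B's idea of canonical: x << 1 -> x * 2. Undoes MulToShl exactly.
struct ShlToMul : public OpRewritePattern<ShlOp> {
  using OpRewritePattern<ShlOp>::OpRewritePattern;
  LogicalResult matchAndRewrite(ShlOp op,
                                PatternRewriter &rewriter) const override {
    if (!isConstantInt(op.getRhs(), 1)) // hypothetical helper
      return failure();
    rewriter.replaceOpWithNewOp<MulOp>(op, op.getLhs(),
                                       buildConstantInt(rewriter, op, 2));
    return success();
  }
};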

Maybe it’s also worth noting that many of the patterns that I added to the vector/linalg dialects are exposed through populate... functions, whereas I would’ve made them canonicalization patterns a year ago. It is not always clear what the “canonical” form is and with different use cases in mind people have different opinions.
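For example (all names hypothetical), instead of hooking such rewrites into getCanonicalizationPatterns, they are offered as an opt-in set:

// Opt-in pattern set: each pipeline decides whether this rewrite direction
// is desirable for it, instead of baking the choice into -canonicalize.
void populateFlattenVectorTransferPatterns(RewritePatternSet &patterns) {
  patterns.add<FlattenTransferRead, FlattenTransferWrite>(
      patterns.getContext());
}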

This and setting max-iterations = -1 for canonicalize.mlir test cases (maybe also other test cases) would be a nice improvement imo.

Yes, but this would be a compiler bug that is easy to ship in production and hard to defend against. So I’m opposed to encouraging a system that is fragile in this way.

I don’t understand the analogy: we let users shoot themselves in the foot, but that seems to me to be unrelated to the robustness of the tooling / system we provide them. We’re talking about a user changing something in their program and the compiler suddenly entering an infinite loop here.
InstCombine or SelectionDAG have a similar conceptual problem; however, they operate on a fixed operation set and a fixed, predefined number of patterns. So while it is “fragile” in some way, it is also much more doable to reason about this “closed world” than anything related to MLIR.

That would mean that both A and B are operating in a cross-dialect way and both of them have a different idea of what the canonical form looks like? This can happen, but it is likely very rare and would also be caught quickly in testing (in general, the trickier version involves a larger number of patterns forming such a loop, so rather more than 2 dialects together).

Criteria should be: easier to analyze, fewer operations, no loss of information.
And at some point: pick an arbitrary direction between equivalent forms! A single arbitrary form is better than two equivalent ones, because all transformations can focus on matching one.

I don’t think so: we want our tests to alert us to issues with lack of quick convergence. Exhausting a few iterations in unit tests seems wrong to me.

I agree with the first part, but transformations would still have to match the two equivalent forms (or run the canonicalization patterns themselves), because there is no guarantee that canonicalization would actually happen when running -canonicalize.

Then let’s add two // RUN: lines: one with max-iterations=-1 and failure on non-convergence, and one without. Then we can find broken canonicalization patterns (the ones in the GPU and OpenACC dialects that I mentioned above) and test for quick convergence at the same time.

Optimization passes leave the IR in a better (by some metric) state for the next pass. Canonicalization is an optimization pass. I expect these to fail only if an invariant is violated. A failure means pulling down the pipeline and creating a reproducer. We also attempt to write our passes to be robust against unknown ops. So say you have op foo that is in a canonicalization loop with op bar: what aspect is your consuming pass relying on that would necessitate failing the pipeline if canonicalization fails, given that you don’t care about either foo or bar? The postcondition relied on is vague, given that canonicalization is wide, and I’d argue there is not too good a signal in it failing. E.g., for the inliner one could run a verifier afterwards checking that all calls have been inlined (if that’s important for a given target), and we have an explicit property to check [1]. With canonicalization there is no such contract, so failure to converge doesn’t tell me anything useful beyond that maybe there is a set of patterns that either don’t converge or trigger a long rewrite sequence. But there is no clear invariant or postcondition for the general case that is actionable.

“No loss of information” / “good in all cases” is an important metric for canonicalization vs. regular optimization. We have a few patterns that are only better in some cases.

I think this is too broad a brush, and it comes back to Mehdi’s point about correctness vs. performance. We match on the canonical form for optimized lowering, but handling of the other cases is probably needed (it might not be an exact match; it might be a fallback that handles the general case), as that is valid input IR (to River’s point). Things that can never happen should be verified. This also points to the final part of your comment in proposal 2: if you rely on a specific condition for correctness, you should either have a dedicated pass or verifier, or you add that canonicalization pattern into your own pattern rewrite set.

We could also just toggle max-iterations to max-iterations+1 and diff in the test (doesn’t work if we have a split test, but split tests should be used in error checking cases).

I fear broken canonicalization patterns are indeed often due to interactions when combined (well, they shouldn’t be if they are actually canonicalizations, but that is the state today). Fuzzing end-to-end and checking for any divergent behavior would be interesting.


  1. and it is correct to only inline where safe, so not inlining isn’t a failure; but for the sake of discussion, assume it is here: for this target and platform, the user may have wanted a pass option to fail the pass in such cases.

We have to be more precise here: “transformations” is ambiguous: there are transformations that are required for correctness (mostly lowerings) and others that aren’t. Only the first category needs to match all possible IR; the second category can just miss some optimizations (and when we see missed optimizations, we may find that we need to improve canonicalization).


I agree with the upthread sentiment - canonicalizer is not a guarantee, and IR/transformations shouldn’t rely on it for correctness.

If you’re effectively doing a lowering from one level of abstraction to another, then you can define your own pass and use the pattern rewriter within it.
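A minimal sketch of that structure (the pass and pattern names are hypothetical): the pass applies exactly the patterns its lowering needs, iterates them to a fixpoint, and owns the decision of what non-convergence means:

struct MyLoweringPass
    : public PassWrapper<MyLoweringPass, OperationPass<>> {
  void runOnOperation() override {
    RewritePatternSet patterns(&getContext());
    // Register only the patterns this lowering requires, not the
    // open-ended canonicalization set.
    patterns.add<LowerFooOp, LowerBarOp>(&getContext());
    // Unlike -canonicalize, this pass defines its own contract: if its own
    // patterns fail to converge, that is a bug worth failing on.
    if (failed(applyPatternsAndFoldGreedily(getOperation(),
                                            std::move(patterns))))
      signalPassFailure();
  }
};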


To Alex’s point, a test-only flag seems like a good idea, but given that it is prone to abuse, maybe name it explicitly with a “test-” prefix. I’d rather not have unbounded unit tests. They will be a nightmare to debug.

To multiple points made here, I suspect that the reason this is being asked for is because we have let a number of things through as canonicalizations when they are actually a lowering/specific kind of optimization. It is important to get to the bottom of those and factor them properly. It is already too easy/tempting to make everything a canonicalization, and I’m -1 on reducing more barriers to that (vs fixing root causes and structuring lowering pipelines more appropriately). I could be +1 on adding options that make testing more precise if they were scoped specifically to that.

Yes. I (and probably some other contributors) didn’t know about the intended use of the canonicalizer until we had this discussion here.

We could hide it behind a fuzzer option (that could also be used to randomly deactivate some patterns). That should make clear that it’s for testing only.

Another issue is the continuing conflation between the greedy pattern rewrite driver and the canonicalization pass, combined with dialect conversion being a misnomer. From the names alone, it looks like the greedy driver is the only way to implement a pattern-based pass. We also don’t have another way to test individual patterns or the driver…

Another common thing I’ve seen is folks reaching for the pattern rewriter when it is overkill for the problem. A simple lowering pass implemented with a for loop can be much less code and more efficient than using the rewriter.
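For instance (FooOp/BarOp are hypothetical), a one-to-one rewrite can be a plain walk, with no driver and no convergence question:

// Each FooOp is rewritten exactly once and patterns never interact, so a
// direct post-order walk is simpler and cheaper than the greedy driver.
void MySimpleLoweringPass::runOnOperation() {
  getOperation()->walk([](FooOp op) {
    OpBuilder builder(op);
    Value replacement =
        builder.create<BarOp>(op.getLoc(), op.getOperand());
    op.getResult().replaceAllUsesWith(replacement);
    op.erase();
  });
}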

The rewriter pays for itself when composed patterns are needed or when the source of those patterns isn’t in one place (e.g., different dialects/ops contribute patterns through an interface like canonicalize).

-Chris


Expanded documentation based on this discussion: ⚙ D140729 [mlir][transforms][NFC] Expand CanonicalizerPass documentation

I would even be +1 on deleting the canonicalizer pass and replacing it with better APIs for building passes that run certain controlled sets of patterns.

Sadly, my personal experience is that 99% of compilers in the ML space require an “optimizing hard enough is necessary for correctness” philosophy, and anything we can do to corral that kind of thing and make it harder/more explicit would be welcome.

Even in Torch-MLIR, which I designed specifically to avoid “optimizing hard enough is necessary for correctness”, we ended up giving in to it eventually. I concluded that the only way to actually avoid this is more radical frontend changes (which thankfully PyTorch 2 is doing, but they are a massive engineering effort). (For those in the know, this is what I am referring to.)

My experience is similar: Canonicalizing is necessary to make progress during lowering, otherwise I have to handle many edge cases. So canonicalizing is useful, the canonicalizer pass not so much.

While I agree that the canonicalizer is abused, that doesn’t make it a bad thing! Good tools in the wrong hands will produce wrong results.

Besides that, removing it would really be tantamount to removing the getCanonicalizationPatterns hooks, which would be a far more significant change.

I don’t see any motivation to make these changes; PyTorch being a mess doesn’t mean that all MLIR compilers are for PyTorch.

-Chris


That is already possible with the GreedyPatternRewriteDriver. If parts of some systems have bags of patterns which don’t meet the definition of “canonicalization”, then they should not be populating canonicalizer patterns – instead having their own mechanism/passes for “trying harder” to lower.

There are plenty of things that are doing it right. It just may apply less to the heroics needed to fix up things like Torch IR.