EuroLLVM 2026 Round Table Summary: MLIR Canonicalization

The round table follows the recent topics:

And is related to this lightning talk at EuroLLVM:


We began by making a widely accepted statement that “canonicalization is not required for correctness” of either passes or lowering. However, we also made the strong point that it is only useful if it simplifies matching against commonly expected IR forms for optimization transformations.

So, while a non-canonical form should always be lowered to base dialects (like llvm) in the end, it won’t always be recognized and optimized before that. Since we’re building optimizing compilers, the whole point of such a process is to ease matching and increase optimization. Since we’re building multiple upstream and downstream compilers, an upstream canonical form needs to cater to all of them in the best possible way and not be “best effort”.


We started with the discussion about @ftynse’s example of the canonicalize pass increasing compile time from minutes to hours. We couldn’t come up with a reason that would have such an impact on compile time, since the matchers fail early. Since Alex wasn’t at EuroLLVM, we decided to skip this discussion and investigate later.

We speculated there are certain adversarial patterns where two or more rewrites keep adding work items to the queue in a multi-fixed-point scenario, but more data is needed to know for sure. But the key aspect of not wanting the canonicalization to run everything every time is still there.

The reasons I collected were:

  • Not wanting to run irrelevant passes at every level when a subset would work equally, and have less chance of the scenario above. Less scope, less interference.
  • There is no guarantee the canonicalization will finish, let alone finish at a particular point, let alone at a canonical representation. In a nutshell, the canonicalize pass does not canonicalize the IR.
  • Since the rewrites are defined at an operation level, there’s no expectation that they’ll consider (or know about) other operations’ own patterns. Unintentional interference is likely and unpredictable.
  • Since operation rewrites are allowed to create any arbitrary graph, there’s no control on the final state of the IR, and the generated new operations cannot control in which state they were created by other rewrites.
  • There were other variations on those themes…

There was a consensus that canonicalization should have a strong mandate to not only terminate, but also to generate an expected, canonical form. While it was generally accepted that canonical forms depend on the use (transforms / destinations), there was agreement that such a canonical form upstream would provide value to users, even if other normal forms were also accepted / allowed.

We also discussed “canonicalization by construction”, where transforms should (best effort) generate and/or maintain canonical (or even normal) forms as much as possible.

A discussion in linalg was used as a proxy:

  • Tiling a named op (linalg.add) should produce a loop of named ops ({ scf.for { linalg.add } }). Currently, most (all?) tiling produces linalg.generic.
  • Named linalg ops (one of its normal forms) are easier for loop fusion, while generic ops (another normal form) are easier for linalg fusion. We can, and want to, have both at different points.
  • Canonical operation syntax is not the same as canonical dialect usage:
    • A linalg with affine maps inside the op or outside the op are the same op represented differently. We should allow only one of them as the canonical form.
    • A sequence of linalg ops (e.g. transpose + matmul) is the same as a generic with transpose affine map and both forms should be allowed.
  • An operation that has more than one normal form does not have a canonical form, and perhaps should not have a canonicalization pattern (but perhaps multiple normalization patterns).

A proposal to describe canonical forms as requirements followed from @ftynse’s PR. His work allows the transform dialect to encode verifiers for a canonical form on the matcher (returns null if not canonical). A question was raised: can we do that in a more generic way?

This is, in spirit, similar to my own RFC above, but it goes further. Can we describe what I want from previous passes? For example, which forms need to exist, which dialects must not exist? One concrete example was: “Give me linalg in category form where the scf loops are parallel and not forall”.
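To make the "requirements" idea a bit more tangible, here is a minimal Python sketch of a form check over a toy IR, where the IR is just a list of operation names. Everything in it is hypothetical: `FormRequirement`, its fields, and the choice to model "category form" as "no `linalg.generic`" are illustration only and do not correspond to any existing MLIR API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormRequirement:
    """Hypothetical description of what a pass expects from its input IR."""

    required_ops: frozenset = frozenset()
    forbidden_ops: frozenset = frozenset()
    forbidden_dialects: frozenset = frozenset()

    def violations(self, op_names):
        """Return a sorted list of human-readable reasons the IR is rejected."""
        present = set(op_names)
        # The dialect is the part of the op name before the first dot.
        dialects = {name.split(".", 1)[0] for name in present}
        problems = []
        for op in sorted(self.required_ops - present):
            problems.append(f"missing required op: {op}")
        for op in sorted(self.forbidden_ops & present):
            problems.append(f"forbidden op present: {op}")
        for dialect in sorted(self.forbidden_dialects & dialects):
            problems.append(f"forbidden dialect present: {dialect}")
        return problems


# "Parallel loops, not forall, no generic linalg" as an illustrative requirement.
REQUIREMENT = FormRequirement(
    forbidden_ops=frozenset({"scf.forall", "linalg.generic"}),
    forbidden_dialects=frozenset({"affine"}),
)

GOOD_IR = ["scf.parallel", "linalg.elementwise", "arith.constant"]
BAD_IR = ["scf.forall", "linalg.generic"]

if __name__ == "__main__":
    print(REQUIREMENT.violations(GOOD_IR))
    print(REQUIREMENT.violations(BAD_IR))
```

A check like this only says whether the input is acceptable. It says nothing about how to get there, which is where the A-to-B conversion patterns come in.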

Of course, a combinatorial explosion of rewrite patterns that convert forms would do the trick, with each pass/rewrite that needs a particular form mandating those specific A-to-B patterns to run before, but that doesn’t scale. However, it was accepted that the idea is good, but implementation is invasive and fuzzy.


Action items:

  • Understand the high cost of the current canonicalize pass
  • Continue working on transforms and canonical matchers
  • Inspect interference between canonicalization rewrites and propose solutions
  • Work towards a model where canonicalization not only terminates but also terminates in a guaranteed canonical form

Further discussions:

  • How to split canonicalization patterns to make it easier to compose
  • Create a prototype where such composition is beneficial (linalg based would be easier)
  • How to query those rewrites for forms, guarantees and invariants, and reach them from outside an “all-or-nothing” greedy canonicalize pass.
  • Reduce the number of opinionated best effort canonical rewrites and re-brand them as normal form conversions (like we have in linalg), at least until we can agree on forms that are canonical

“Add a dialect-specific canonicalization pass (#1172)” (EnzymeAD/Enzyme-JAX@ff6c6d8 on GitHub) is the downstream patch I made to address that problem. The setup registers/loads nearly all upstream dialects + stablehlo + custom dialects, but they don’t all co-exist at the same point in the pipeline. So running a generic canonicalizer, which collects patterns for all known dialects, was spending cycles iterating over all canonicalization patterns from all dialects, most of which are unnecessary. I don’t have the input IR anymore, maybe @wsmoses can regenerate a similar one, but it was a ~100 MB IR file (basically a whole application with aggressive inlining) with LLVM dialect in the first input that we were progressively raising. So the first stage would do -dialect-canonicalize=llvm,nvvm, then -dialect-canonicalize=arith,cf, then -dialect-canonicalize=arith,scf,affine, etc.

I haven’t checked whether the recent work on the pattern infrastructure could have improved the situation a bit. I also considered walking through IR to find operations/dialects that are actually present, but that was adding some overhead as well. I discarded the idea of applying patterns from only the currently loaded dialects since we never unload a dialect, so later invocations of canonicalization would try to run patterns for llvm/arith on IR where they are no longer present.

There are some complex patterns involved in that project so other points are equally valid, but my example was specifically not patterns undoing each other’s work. It was checking thousands of patterns from dialects not present in the input IR but known to the context on each of several million operations.


We have some mileage in two downstreams with the approach where each transformation declares pre- and post-conditions explicitly. It may be possible to tell a transformation that you want certain post-conditions, but it feels like this shouldn’t be up to individual transformations to implement (except for maybe a small subset where you can hardcode 2-3 alternatives like named vs generic linalg) but rather a generic mechanism. This intertwines with the “canonical by construction” discussion.
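As a rough illustration of explicit pre- and post-conditions, the sketch below checks a pipeline statically: each pass declares the forms it requires, establishes, and invalidates, and a checker walks the pipeline to find passes whose requirements are not met. The names (`PassSpec`, `check_pipeline`, the form labels) are invented for this example and are not taken from either downstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PassSpec:
    """Hypothetical declaration of a pass's pre- and post-conditions."""

    name: str
    requires: frozenset = frozenset()
    establishes: frozenset = frozenset()
    invalidates: frozenset = frozenset()


def check_pipeline(passes, initial=frozenset()):
    """Return one error string per pass whose preconditions do not hold."""
    holds = set(initial)
    errors = []
    for spec in passes:
        missing = spec.requires - holds
        if missing:
            errors.append(f"{spec.name}: missing {sorted(missing)}")
        # Post-conditions: drop what the pass breaks, add what it guarantees.
        holds -= spec.invalidates
        holds |= spec.establishes
    return errors


TO_NAMED = PassSpec("to-named", establishes=frozenset({"linalg-named"}))
# Models tiling that emits generic ops and therefore loses the named form.
TILE = PassSpec(
    "tile",
    requires=frozenset({"linalg-named"}),
    establishes=frozenset({"tiled"}),
    invalidates=frozenset({"linalg-named"}),
)
FUSE = PassSpec("loop-fusion", requires=frozenset({"linalg-named", "tiled"}))

if __name__ == "__main__":
    print(check_pipeline([TO_NAMED, TILE, FUSE]))
    print(check_pipeline([TO_NAMED, TILE, TO_NAMED, FUSE]))
```

In this toy, the first pipeline is rejected because tiling invalidates the named form, and re-running the normalization before fusion fixes it. A generic mechanism could insert that conversion automatically instead of reporting an error.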

Here are a few more concrete action items that we could work on:

  • Let’s clarify: Are canonicalization and normalization orthogonal? Let’s document the difference somewhere with a short paragraph, maybe here. In my mind, canonicalization is a (small-ish) set of patterns that we can easily agree on and, most importantly, is not specific to a certain follow-up transformation. Normalization is for cases where there are two equally useful forms (e.g., rank-reducing tensor.extract_slice vs. tensor.collapse_shape) and/or rewrites that are specific to a certain follow-up transformation (e.g., tiling).
  • A few folks told me after the round table that we did not really answer the question “What is a canonicalization?” I put some examples here a while ago. Is this list comprehensive? Did I miss some important patterns?
  • There was a consensus that canonicalization should have a strong mandate to not only terminate: We may want to reconsider Proposal 1 of this RFC.
  • We could also try to document the canonical form on a per-dialect basis.

Yes. This is where I was going with my earlier RFC. I think each dialect should be very clear what are its forms and which is canonical.

Small nit: None of the normal forms may be canonical. Hypothetical example for the tensor dialect.

Canonical form: We prefer static dimensions over dynamic dimensions. When possible, fold tensor.cast into other tensor ops if it introduces more static dimensions.

Normal form 1: Prefer tensor.extract_slice for rank reductions.

Normal form 2: Prefer tensor.collapse_shape for rank reductions.

Indeed, and whether there is a canonical form or not.

Linalg named / category / generic normal forms are also another example.

Thanks for writing a report, I couldn’t attend this year unfortunately.

There seems to be a distinction implied here between “the best possible way” and “best effort”, but I’m not grasping the subtlety, do you mind elaborating what you mean?

You mean “converge” I guess? We have a limit on the number of iterations to ensure it’ll finish, albeit without convergence.
Upstream we don’t have a test case going beyond 3 iterations if I remember correctly (I could redo the experiment on the test suite), even though this is limited by the fact that tests in the test suites are small by nature.
It is also known that you can craft tests that specifically exploit the limit of the “requeue” mechanism to trigger more iterations than the minimum. To ensure convergence you can play with top-down vs bottom-up traversals and increasing the iteration limit (at the price of compile time).

Definitely! This just came up a couple of days ago in a pattern creating a broadcast, but the test was matching a shape_cast for vectors. It turned out that if the result type was a vector, the generated broadcast would always be turned into a shape_cast, so we were able to save one pattern application by emitting the shape_cast directly instead of a broadcast.

There is something fishy here, because the process for the canonicalizer is to collect all the patterns ahead of time for all the loaded (not registered) dialects, once on initialization, but only the ones specific to an op are considered (with an asterisk, see below).
The idea behind pre-loading is to share the patterns across all instances of the pass (when running in parallel) and across re-runs of the pass-manager.
Now when patterns are loaded, they are stored in a map: DenseMap<OperationName, SmallVector<const RewritePattern *, 2>> patterns;
When processing the IR we look up this map using the operation name and try all the patterns registered for this operation.
So having patterns for all the ops in the other dialects not present in the IR should be zero cost for canonicalizing here.

One wrinkle: there is however in FrozenRewritePatternSet the notion of AnyOpPatterns. While not the common case, these patterns run on every operation. So it is possible for a dialect to inject many patterns (or a few costly ones) there and have significant impact.
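The lookup described above can be modeled in a few lines of Python. This is only a toy: `FrozenPatterns` and its methods are made-up names that mimic the idea of a per-operation map plus a list of match-any patterns, and it does not reproduce the real FrozenRewritePatternSet.

```python
from collections import defaultdict


class FrozenPatterns:
    """Toy model: patterns keyed by root op name, plus match-any patterns."""

    def __init__(self):
        self.by_op = defaultdict(list)
        self.any_op = []

    def add(self, pattern_name, root_op=None):
        # A pattern without a root op is tried on every operation.
        if root_op is None:
            self.any_op.append(pattern_name)
        else:
            self.by_op[root_op].append(pattern_name)

    def candidates(self, op_name):
        """Patterns the driver would try on an operation with this name."""
        return self.by_op.get(op_name, []) + self.any_op


def build_example():
    patterns = FrozenPatterns()
    # Many patterns from dialects that never appear in the IR.
    for i in range(1000):
        patterns.add(f"other_pattern_{i}", root_op=f"other.op{i}")
    patterns.add("fold_add_zero", root_op="llvm.add")
    patterns.add("commute_add_constant", root_op="llvm.add")
    return patterns


if __name__ == "__main__":
    patterns = build_example()
    print(len(patterns.candidates("llvm.add")))   # only the two llvm.add patterns
    patterns.add("expensive_generic_check")       # a match-any pattern
    print(len(patterns.candidates("llvm.add")))   # now three
    print(len(patterns.candidates("llvm.load")))  # one, even with no specific patterns
```

In this model the thousand unrelated patterns never show up as candidates, while a single match-any pattern is tried on every operation, including ones that have no patterns of their own.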

If these AnyOpPatterns are a problem, we can:

  • address these patterns directly: they would not be great regardless.
  • change the way we collect patterns in the canonicalizer. I can easily add an option that would make the canonicalizer only load patterns right before running, by looking up the dialects present in the IR to process, instead of the dialects loaded in the context. This would also make the canonicalizer independent of the state of the context before running the pass manager (helping reproducibility).
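The second option could look roughly like the following Python sketch, which walks a toy IR and keeps only the patterns of dialects that actually occur. The IR encoding (nested `(name, children)` tuples) and the function names are assumptions made for the example. The real option would live in the C++ canonicalizer and may be designed differently.

```python
def dialects_present(op):
    """Collect dialect names from a toy op tree of (name, children) tuples."""
    name, children = op
    # The dialect is the part of the op name before the first dot.
    found = {name.split(".", 1)[0]}
    for child in children:
        found |= dialects_present(child)
    return found


def patterns_to_load(patterns_by_dialect, root):
    """Keep only the pattern lists whose dialect occurs in the IR under root."""
    present = dialects_present(root)
    return {
        dialect: patterns
        for dialect, patterns in patterns_by_dialect.items()
        if dialect in present
    }


MODULE = (
    "builtin.module",
    [
        (
            "func.func",
            [
                ("scf.for", [("arith.addi", []), ("scf.yield", [])]),
                ("func.return", []),
            ],
        )
    ],
)

# Hypothetical pattern names, grouped by the dialect that owns them.
ALL_PATTERNS = {
    "llvm": ["llvm_fold_a", "llvm_fold_b"],
    "arith": ["arith_fold_a"],
    "scf": ["scf_simplify_a"],
    "linalg": ["linalg_fold_a"],
}

if __name__ == "__main__":
    print(sorted(dialects_present(MODULE)))
    print(sorted(patterns_to_load(ALL_PATTERNS, MODULE)))
```

The walk itself costs one visit per operation, which is the overhead mentioned earlier in the thread for the "find what is actually present" approach.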

Anyway: a trace / repro would be useful!

I cannot quite follow what you’re describing here, can you explain a bit more?

I think the -opt tool in that project loads all dialects upfront. This is arguably a bad idea, but somewhat common in downstreams.

The “no unloading” problem even without forced preload is as follows:

  1. all dialects are registered, none are loaded
  2. the input IR only contains the LLVM dialect, the parser loads the LLVM dialect
  3. the canonicalizer will only apply LLVM dialect canonicalization patterns
  4. further passes load Arith and CF dialects by the dependent dialect mechanism
  5. passes introduce Arith+CF and remove LLVM operations
  6. the canonicalizer will now apply patterns from Arith, CF and LLVM dialect since the LLVM dialect is still loaded, even if no longer present
  7. further passes go to Affine+SCF and remove Arith+CF
  8. the canonicalizer will now apply patterns from Affine, SCF, Arith, CF, and LLVM since all are still loaded, even though the three latter dialects are not present in the IR anymore
  9. induction…
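The steps above can be replayed with a small Python simulation that tracks which dialects are loaded versus which are still present in the IR after each stage. It is a toy model of the scenario only: the stage list and the `simulate` helper are invented for illustration.

```python
def simulate(stages):
    """Each stage is (dialects newly loaded, dialects present afterwards).

    Returns, per stage, the loaded dialects and the subset that is loaded
    but no longer present in the IR. Loaded dialects are never unloaded.
    """
    loaded = set()
    report = []
    for newly_loaded, present in stages:
        loaded |= set(newly_loaded)
        stale = loaded - set(present)
        report.append((sorted(loaded), sorted(stale)))
    return report


STAGES = [
    ({"llvm"}, {"llvm"}),                    # parser loads LLVM
    ({"arith", "cf"}, {"arith", "cf"}),      # raising removes LLVM ops
    ({"affine", "scf"}, {"affine", "scf"}),  # raising removes Arith+CF ops
]

if __name__ == "__main__":
    for loaded, stale in simulate(STAGES):
        print("loaded:", loaded, "| loaded but absent:", stale)
```

The "loaded but absent" set only ever grows in this model, which is the accumulation the list describes.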

My bad! :sweat_smile:

We can start with the restrictive notion that upstream canonical form is the intersection of all existing normal forms in user projects. If the intersection is the null set, then there’s no canonical form. However, the number of downstream users is much larger than purely upstream ones, so that’s not a good upstream design.

So we look at each form and have a discussion about the pros/cons of potential normal forms, and agree to pick one as the canonical form upstream. Meaning downstream projects that don’t use that form will either change their matchers or use a (possibly upstream) form conversion pass before their match.

My aversion to the “best effort” term is that it has been used by multiple parties when trying to add or reject a particular form as being canonical. This is what @ftynse, @matthias-springer and I have often highlighted: it is natural to only have normal forms, not necessarily canonical forms.

@matthias-springer described a situation where the pass would not finish: rewrite1 adds a new item2 to the queue that is matched by rewrite2, which adds a new item1, which is matched by the rewrite1. In this case, the queue is never empty and the limit is never reached.
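A toy worklist driver makes this scenario easy to reproduce. The Python sketch below is not the MLIR greedy driver: the IR is a list of op names, a pattern is a `(match, replacement)` pair, and the `max_rewrites` cap exists only so that the toy terminates. The point it illustrates is that an outer iteration limit does not help if the worklist never drains within a single iteration.

```python
from collections import deque


def greedy_rewrite(ops, patterns, max_iterations=10, max_rewrites=None):
    """Return (ops, iterations used, rewrites applied, converged)."""
    ops = list(ops)
    rewrites = 0
    for iteration in range(1, max_iterations + 1):
        worklist = deque(range(len(ops)))
        changed = False
        while worklist:
            index = worklist.popleft()
            for match, replacement in patterns:
                if ops[index] == match:
                    ops[index] = replacement
                    rewrites += 1
                    changed = True
                    # The rewritten op goes back on the worklist.
                    worklist.append(index)
                    if max_rewrites is not None and rewrites >= max_rewrites:
                        return ops, iteration, rewrites, False
                    break
        if not changed:
            # A full sweep without changes: fixed point reached.
            return ops, iteration, rewrites, True
    return ops, max_iterations, rewrites, False


PING_PONG = [("item1", "item2"), ("item2", "item1")]  # rewrite1 and rewrite2
WELL_BEHAVED = [("item1", "item2")]

if __name__ == "__main__":
    print(greedy_rewrite(["item1"], WELL_BEHAVED))
    print(greedy_rewrite(["item1"], PING_PONG, max_rewrites=1000))
```

With the well-behaved pattern set the toy converges on the second sweep. With the ping-pong pair it is still inside the first iteration when the safety cap stops it, so the iteration limit is never consulted.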

We guessed this was the cause of @ftynse’s 3-hour compile time issue. I guess this can be done in special cases, but we should avoid it on cleanup passes like canonicalize.

Interesting! This is similar to @matthias-springer’s example above. I remember having to fix a few bugs around similar canonical assembly syntax in the Arm backend back then, so this really resonates.

The tiling, inlining, and other “surrounding code transformations” should also preserve the normal forms, but now the rewrite needs to “know what it was before”. Usually, you just do rewriter.replaceOpWithNewOp<...>, which is easier. Adding a switch statement for all forms won’t scale, we need to think of a better way here.

This is strictly better than the current state, but also means we’ll be passing linalg canonicalization patterns in llvm dialect.

There was a comment at the round table that we should also notify when a dialect has “ceased to exist in the IR”. The comment was more about full conversion (guarantee it has converted fully), then we could use that to unregister the dialect and the canonicalization patterns.

My earlier RFC was a staging ground to do that for late registering and possibly also unregistering those rewrites.

Thanks for clarifying. I think it is misguided to aim at canonicalization being the intersection of the normal forms, actually. I like the concept of “normal forms” exactly because it makes it clearer that they are disjoint from the canonical form (not a subset, nor a superset).
Since one of the basic principles of canonicalization is to not lose information and to keep it easily recoverable, downstream “normal forms” should always be able to be built on top of a canonical form and move to whatever shape they prefer.
Canonicalization should help regardless: even if I need a “normal form” of mine downstream, I would want to match as few variations of the same pattern as possible, so canonicalizing first is valuable. Even if I have to undo a few things, these things could also be present in the IR for other reasons, and without canonicalization I would need to match more variations of them.

Following up on my post above about the distinction between op-specific patterns and op-agnostic ones (since only the latter should have runtime cost here).

I collected the set of patterns when loading all the upstream dialects:

=== Canonicalizer Patterns for dialects: xevm xegpu x86 wasmssa vector ub transform tosa test_irdl_to_cpp test_dyn test tensor spirv sparse_tensor smt shard shape scf rocdl quant ptr pdl_interp pdl omp nvvm nvgpu mpi ml_program memref math llvm linalg irdl index gpu func emitc dlti complex cf builtin bufferization async arm_sve arm_sme arm_neon arith amdgpu affine acc  ===
[Op-Specific Native Patterns] Count: 719
[Match-Any-Op Native Patterns] Count: 0

Seems like we don’t have any “Match-Any-Op” canonicalization patterns upstream? (I may have missed something, you can double check my work here)

Anyway that would put quite a dent in the story about llvm/arith/… dialects persistence having any cost on subsequent canonicalization right now. Feel free to correct me if I forgot anything about the GreedyPatternDriver…

Here’s a PR that adds documentation to describe the canonical form of arith. Is this useful? [mlir][arith] Add documentation for the canonical form by matthias-springer · Pull Request #192845 · llvm/llvm-project · GitHub. This is mainly AI generated, then edited by hand. I asked the coding agent for a high-level overview. I don’t think documenting every single canonicalization is useful.

Here are a few canonicalizations that don’t really fit into the “categories” that are mentioned in the PR:

  • select(%x, c1, c0) → extui(%x) — opinionated “prefer extui over select” rewrite.
  • arith.cmpf → arith.cmpi when the float rhs maps losslessly back to an integer (CmpFIntToFPConst) — changes op kind and is a heavy rewrite.
  • mulsi_extended(x, 1) → [x, extsi(cmpi slt, x, 0)] — an op decomposition rather than a simplification.
  • trunci(shrsi x, c) → trunci(shrui x, c) — preferring one shift kind only because of a downstream truncation; the kind of “specific to a follow-up transformation” rewrite the RFC flags as a normalization smell.

But maybe these are actually the interesting ones. Folding constants is kind of expected, but the fact that %res = arith.select %arg0, %c0_i64, %c1_i64 : i64 turns into %0 = arith.xori %arg0, %true : i1; %res = arith.extui %0 : i1 to i64 came as a surprise to me.
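For anyone who wants to convince themselves that these rewrites preserve semantics, here is a small brute-force check in Python over toy bit widths (i8 inputs, i4 truncation results). The helper names mirror the arith ops but are my own plain-integer models, not the dialect's implementation. For the shift rewrite I am assuming the precondition that the shift amount does not exceed the number of truncated bits. The upstream pattern's exact condition may be narrower.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def select(cond, if_true, if_false):
    return if_true if cond else if_false


def extui_i1(bit):
    return 1 if bit else 0


def extsi_i1(bit):
    # Sign-extending an i1 "true" yields all ones, i.e. -1.
    return -1 if bit else 0


def mulsi_extended(lhs, rhs, bits):
    """Signed multiply returning (low, high) halves, each `bits` wide."""
    product = lhs * rhs
    return to_signed(product, bits), to_signed(product >> bits, bits)


def trunci(value, bits):
    """Truncate to `bits` bits, returned as an unsigned bit pattern."""
    return value & ((1 << bits) - 1)


def shrsi(value, amount):
    return value >> amount  # Python's >> on signed ints is arithmetic


def shrui(value, amount, bits):
    return (value & ((1 << bits) - 1)) >> amount


def check_all(bits=8, narrow=4):
    for x in (False, True):
        assert select(x, 1, 0) == extui_i1(x)
        # The surprising case: swapped constants need an extra xori with true.
        assert select(x, 0, 1) == extui_i1(x ^ True)
    for x in range(-(1 << (bits - 1)), 1 << (bits - 1)):
        assert mulsi_extended(x, 1, bits) == (x, extsi_i1(x < 0))
        for c in range(bits - narrow + 1):
            assert trunci(shrsi(x, c), narrow) == trunci(shrui(x, c, bits), narrow)
    return True


if __name__ == "__main__":
    print(check_all())
```

Outside the assumed precondition the shift rewrite is not valid in this model: for x = -1 and c = 5 the arithmetic shift keeps sign bits that the logical shift replaces with zeros, and the truncation no longer hides the difference.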