Min/Max/Abs/Relu recognition (starter project?)

While hacking out a ReLU recognizer in the sparsifier to get some PyTorch models working with sparse tensor inputs, I started to wonder why we don’t have proper min/max/abs (and thus ReLU) recognition as an MLIR pass. Or at least I could not find one; we do have the opposite, arith-expand, which maps these higher-level ops back to lower-level ops to enable lowering to LLVM.

Pretty much every compiler I have ever worked on benefited greatly from a unified analyzer that extracts min/max/abs (and others) from conditional constructs. The rewriting is pretty straightforward, with the caveat that, for floating-point, care has to be taken to preserve the semantics around special cases such as +0, -0, NaN, and Inf.
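To make that caveat concrete, here is a hedged sketch of the kind of rewrite such a recognizer would perform (using what I believe are the current arith op spellings; these have changed over time):

```mlir
// ReLU as it often arrives from upstream: compare + select.
func.func @relu(%x: f32) -> f32 {
  %zero = arith.constant 0.0 : f32
  %cmp = arith.cmpf ogt, %x, %zero : f32
  %sel = arith.select %cmp, %x, %zero : f32
  return %sel : f32
}
```

A recognizer could raise %sel to `arith.maxnumf %x, %zero : f32`. Note the special cases: for %x = NaN the select yields 0.0, which matches the NaN-discarding maxnumf but not the NaN-propagating maximumf, while for %x = -0.0 the select deterministically yields the +0.0 constant, so signed zeros need the same scrutiny.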

I don’t have the bandwidth currently to start a side project on doing this in a generally useful way (hence the quick recognizer for ReLU(x) = Max(x, 0) alluded to above), but this seems like a really fun starter project for somebody who wants to contribute to MLIR.

Thoughts?


I just got an excellent off-line question on why we need to reconstruct ReLU if we start with PyTorch, which is a very good observation. We may indeed have a bit of a lower-too-early-raise-again issue, but the current setup takes the linalg from torch-mlir as input, where many operations are lowered to select. So perhaps that is really where the problem occurs (and we plan to migrate to StableHLO as incoming IR eventually), but for the short term the issue remains that MLIR may receive input from upstream that needs some cleaning up to work better.

Also, shameless self-plug: I could not help but reminisce about some fun work done at Intel, a lifetime ago by now, on recognizing saturation and clipping arithmetic for SSE vectorization:

Automatic Detection of Saturation and Clipping Idioms

That’s why we added named ops to Linalg, so that torch-mlir doesn’t have to do that.

Linalg now has linalg.max, which is basically ReLU.

Also, we will be writing the semantics of Linalg ops one by one, so that it’s clear and easy to raise from generic code just by replacing arith.foo with linalg.foo for elementwise Foo operations.
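As a hedged sketch of that raising (assuming linalg.max uses the NaN-propagating float max in its region, which I believe is the case), a generic whose body is a single elementwise max is exactly the named form:

```mlir
#map = affine_map<(d0) -> (d0)>
func.func @relu_generic(%x: tensor<8xf32>, %zeros: tensor<8xf32>,
                        %init: tensor<8xf32>) -> tensor<8xf32> {
  // Elementwise max written as a generic, the way it arrives from
  // upstream lowerings.
  %r = linalg.generic
      {indexing_maps = [#map, #map, #map],
       iterator_types = ["parallel"]}
      ins(%x, %zeros : tensor<8xf32>, tensor<8xf32>)
      outs(%init : tensor<8xf32>) {
  ^bb0(%a: f32, %b: f32, %out: f32):
    %m = arith.maximumf %a, %b : f32
    linalg.yield %m : f32
  } -> tensor<8xf32>
  return %r : tensor<8xf32>
}
```

The raised form would simply be `linalg.max ins(%x, %zeros : ...) outs(%init : ...)`.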

This should make it even easier to do what you want, I think.

But from PyTorch (and TensorFlow and ONNX and TOSA), yeah, they should be lowering to named ops directly from now on.

We also created a “structured matcher” to map Linalg generics to named ops. Maybe this can help bootstrap your project?
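(To bootstrap, here is a minimal hedged sketch of the first stage using the transform dialect; transform.structured.match collects candidate ops by name, and the actual generic-to-named matching would hang off of that:)

```mlir
module attributes {transform.with_named_sequence} {
  transform.named_sequence @__transform_main(
      %root: !transform.any_op {transform.readonly}) {
    // Collect all linalg.generic ops; a follow-up matcher would then
    // decide which named op, if any, each generic corresponds to.
    %generics = transform.structured.match ops{["linalg.generic"]}
        in %root : (!transform.any_op) -> !transform.any_op
    transform.yield
  }
}
```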

Yeah, I completely agree that is the right approach, along the lines of the MLIR philosophy we have all learned to love. I was merely trying to balance the pragmatic with the perfect here. For example, I would never propose writing a vectorizer that raises sequential loops in MLIR back to SIMD vector semantics (although we have one :wink: ), but having a few structural matchers to recover structure that was lost prematurely seemed a pragmatic way to deal with the reality of today.

But if this raises too many objections, I am of course perfectly fine with just fixing this upstream in torch-mlir’s lowering (and other frameworks of your choice).

Thanks for your input and pointing me to the matcher.

I’m not sure if this is entirely on topic, but we have had similar problems recognizing, say, a gelu when our entry point is tosa, which doesn’t have that named op. We were considering writing our own matcher for it.


If TOSA lowers to a sequence of generics, then our matcher can help the first stage of detecting which kind of generics they are. You can either convert them to named ops and then match the def-use chain or do that directly in the compiler.

I’d also like to extract some higher-level matcher that can see across multiple ops and match a graph with a simple declarative form. That way, we could bundle the underlying ops into GeLU, etc. Again, this can be either a grouped op or something solely inside the compiler.
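For example, the erf-based GeLU arrives as a handful of elementwise ops that such a matcher would have to bundle back together. A hedged sketch of the shape of that input (scalar form, not any particular frontend’s exact output):

```mlir
// gelu(x) = 0.5 * x * (1 + erf(x / sqrt(2))), spread over several ops.
func.func @gelu(%x: f32) -> f32 {
  %half = arith.constant 0.5 : f32
  %one = arith.constant 1.0 : f32
  %rsqrt2 = arith.constant 0.70710678 : f32  // 1 / sqrt(2)
  %0 = arith.mulf %x, %rsqrt2 : f32
  %1 = math.erf %0 : f32
  %2 = arith.addf %1, %one : f32
  %3 = arith.mulf %2, %half : f32
  %4 = arith.mulf %3, %x : f32
  return %4 : f32
}
```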

The complexity comes when we start getting casts. Broadcasts change the shape of the inputs, and we need generic descriptions of “constants” (for example, the zero in max(x, zero), or a splat bias).

It’ll also need to understand multi-dimensional arguments. A 2D matmul can be tiled into an ND block-matmul with fused element-wise ops, forming patterns that you can potentially only uncover after tiling and fusing. Or type packing (VNNI, BFMMLA), where some arguments have an extra dimension created by a type transpose.

Generic descriptions for all these shapes aren’t easy, and creating a long list of match patterns (like we did) does not scale. We need something that composes well (can reuse patterns) but also something that does not have quadratic or cubic complexity.

We discussed a tree-matching system where the descriptive language builds a map of all matchers, which gets flattened into a decision tree (possibly a DAG), and this tree is the pattern matcher. The initial stage finds the right “entry point”, and from there matchers are reused until either a pattern is found or the end of the branch is reached, signaling failure.

But we never got to implement this internally, as such complexity is better designed and implemented upstream.

Yes, very relevant. In an ideal world, MLIR would only see high-level ops coming out of upstream frameworks. In practice, however, many of the current paths into MLIR lower some of the very relevant ops too early, and having a reasonable matcher to correct such cases would go a long way (since writing a single matcher is easier than fixing all those upstream paths).

But I recognize this is a pragmatic, not perfect solution.

Another path is to make a project like torch-mlir a canonicalizer of ingress. It can already understand HLO, Torch, and ONNX inputs; if it can lower them all to the same Linalg representation (high-level named ops), then such a matcher would only be relevant for special cases.

This is important because the matchers take considerable effort to build correctly and to run at compile time.

This was one of the motivations when we started with PDL: the ability to combine pattern matches into a more optimized/merged DAG of matchers (looking at old slides: “takes inspiration from LLVM and the Hoffmann-O’Donnell FSM matcher”).
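For flavor, a minimal PDL sketch of the kind of declarative pattern this enables (hedged: mydialect.relu is a placeholder for whatever higher-level op one raises to, and I’m assuming maxnumf as the matched spelling):

```mlir
// Match maxnumf(x, 0.0) and raise it to a single higher-level op.
pdl.pattern @raise_relu : benefit(1) {
  %f32 = pdl.type : f32
  %zero_attr = pdl.attribute = 0.0 : f32
  %zero_op = pdl.operation "arith.constant" {"value" = %zero_attr} -> (%f32 : !pdl.type)
  %zero = pdl.result 0 of %zero_op
  %x = pdl.operand : %f32
  %max = pdl.operation "arith.maxnumf" (%x, %zero : !pdl.value, !pdl.value) -> (%f32 : !pdl.type)
  pdl.rewrite %max {
    // "mydialect.relu" is hypothetical, for illustration only.
    %relu = pdl.operation "mydialect.relu" (%x : !pdl.value) -> (%f32 : !pdl.type)
    pdl.replace %max with %relu
  }
}
```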

I believe it can produce HLO, but not ingest it. I’d only recommend using it for ingress to folks who only (mostly?) want PyTorch. But these patterns are not restricted to ML. That said, I’m definitely a fan of not having to recover information the user gave you to begin with, so ideally information is only lowered away in a flow when it is no longer needed.

If we look at it from a historical point of view, this would be an accurate assessment.

My point here is that we should look forward: can we have an ingress project that consumes many graph dialects and produces canonicalized Linalg output?

The torch-mlir project had ONNX support added not long ago, and it does output linalg and “knows about” HLO, so why not make that into a more generic ingress project instead of just PyTorch?

PDL is great for user code, but having it implement core functionality would perhaps be overkill: parse, build, optimize at run time, every time.

If we can reuse PDL’s infrastructure and drive it from the compiler side (C++, TableGen), we can perhaps build the optimized tree matchers at compile time and only walk them at run time.

And of course, pair it with actual PDL code for the user side of things.