[RFC] Op explosion in Linalg

We have too many similar operations in Linalg (*_conv_*, *_matmul_*) that cannot be decomposed because they were defined in OpDSL.

We have discussed this in the past, and the conclusion was to move these ops out of OpDSL (Python, YAML) and into ODS (TableGen, C++) so that we can common up implementation details (such as broadcast, transpose, quantization, loop order).

There is ongoing work to move all matmul operations to a common format, with the first patch ([mlir][linalg] Introduce transpose semantic to 'linalg.matmul' ops. by shahidact · Pull Request #104783 · llvm/llvm-project · GitHub) landed and follow-ups in development.
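For context, the direction from that patch (a sketch going by the PR description; the types below are only illustrative) is that the transposed variants become the plain linalg.matmul with explicit indexing_maps:

```mlir
// A transposed-A matmul expressed on the plain op via indexing_maps, instead of a
// dedicated linalg.matmul_transpose_a op (sketch based on PR #104783; shapes illustrative).
%0 = linalg.matmul
       indexing_maps = [affine_map<(m, n, k) -> (k, m)>,  // A is KxM (transposed)
                        affine_map<(m, n, k) -> (k, n)>,  // B is KxN
                        affine_map<(m, n, k) -> (m, n)>]  // C is MxN
       ins(%A, %B : tensor<8x4xf32>, tensor<8x16xf32>)
       outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>
```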

Convolutions are currently at about 30 ops and counting, with recent PRs adding more and more ([linalg] Add quantized version of `conv_3d_ncdhw_fcdhw` by ubfx · Pull Request #113953 · llvm/llvm-project · GitHub, [mlir][linalg] Add Grouped Convolution Ops: conv_2d_nhwgc_gfhwc and conv_2d_nhwgc_gfhwc_q by stefankoncarevic · Pull Request #108192 · llvm/llvm-project · GitHub).

Can we have a discussion on how to stop the op explosion and focus on a few operations that we agree are reasonable for such a core dialect?

Here are the discussions I had personally with different people and where my plan stands for the near future:

  • No implicit cast, broadcast, transpose, reduction. These can all be represented by affine maps (or attributes), including compute type and quantization.
  • There should be only one matmul op, one batch_matmul and one batch_reduce_matmul. No *_matmul_transpose_* operations should be necessary. We're not going to delete any op now, but they should be left over in OpDSL and be removed with it. @mshahid is working on this.
  • matmul has only 2D operands, batch_matmul has 3D operands with the outer-most batch dimension, and batch_reduce_matmul has 3D inputs and a 2D accumulator, with the outer batch reduced (see the shape sketch after this list). We could reuse matmul for the latter two if we can easily validate shapes and affine maps, but that's not our current goal. Anything else should be implemented as a linalg.generic or the upcoming linalg.contract.
  • Element-wise ops with an operation attribute. We already had this, but with implicit casts. I added a lot of named ops, which was the wrong thing to do but necessary due to OpDSL. Now @javedabsar will add those to ODS with the right semantics.
  • Convolutions have the same issues, but with the addition of loop order (NCHW variations), which can be attributes and a C++ implementation detail in ODS. We can use dimensionality to lower ND correctly in C++, which was impossible in OpDSL. Depthwise can also be an attribute.
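To make the shape conventions above concrete, here is a small sketch using the existing ops (element types and sizes are only illustrative):

```mlir
// matmul: 2D operands only.
%m  = linalg.matmul ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
                    outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>

// batch_matmul: 3D operands, batch on the outer-most dimension.
%bm = linalg.batch_matmul ins(%A3, %B3 : tensor<2x4x8xf32>, tensor<2x8x16xf32>)
                          outs(%C3 : tensor<2x4x16xf32>) -> tensor<2x4x16xf32>

// batch_reduce_matmul: 3D inputs, 2D accumulator; the outer batch dimension is reduced.
%br = linalg.batch_reduce_matmul ins(%A3, %B3 : tensor<2x4x8xf32>, tensor<2x8x16xf32>)
                                 outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>
```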

I want to repeat @stellaraccident's words in the latest PR:
"I am sympathetic to the need to have a practical solution for them in short order. I'm also cringing at the addition of technical debt this adds because the project overall doesn't have a plan for how to manage the expansion."

Been there, done that, and now I'm trying to undo my mistakes from last year. I want to avoid other people making the same mistakes. This is why we started with matmul, but we can't work efficiently while the pile is still growing and people are working against these goals.

So, let's just agree on a goal and have everyone follow it.


I'd love to hear from the authors of those PRs what motivated them and what is depending on getting them in now. One of them looks like part of a TOSA implementation plan, but I can't tell whether these are one-offs or part of a stream of things we should expect. Usually the TOSA side has a pretty good handle on the scope for things like this, and if they have a full need expressed somewhere, that is probably a good place to start to make sure that the linalg conv op(s) are comprehensive.

Thanks @rengolin for writing this up. I think simplifying the named ops and their implementation, and making them orthogonal (as you say, we have too many similar operations), is a good idea, and I am happy to help with this.

"Can we have a discussion on how to stop the op explosion" - I am not sure whether you mean new ops in flight or ones recently added (e.g. softmax was the most recent?). On that point, I am wondering whether having higher-level ops in linalg, e.g. linalg.attention/softmax/... (if frameworks can lower to them), is good or bad? Keen to hear opinions. Personally I think it increases the scope for kernel replacement and avoids the pain of re-discovery.

"the upcoming linalg.contract" - could you please share more info on that? I think it's a good generic one (not meaning... linalg.generic). Would we still need matmul then?

Anyways, thanks again @rengolin and @stellaraccident for discussions on this.

So, as one of the people behind the grouped conv2d PR (I didn't do the implementation work, which is why my name isn't on it), we needed it in rocMLIR for CPU-side testing of grouped convolutions (which were hackily plumbed through Tosa).

Those arose from clients that were consuming ONNX, which allows a native group = G attribute on convolutions (in that system, the group is implicitly folded into the channel dimension, but that caused impedance mismatches with our code). Some folks are of the opinion that grouped convolution should be represented by G successive convolutions, but that's both ugly and doesn't let you encode the placement of the group dimension with any precision.
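For readers who haven't seen the PR, a rough sketch of keeping the group as its own dimension with the op proposed there (going by the PR title; the exact shapes and attributes below are only illustrative):

```mlir
// Grouped 2D convolution with an explicit group (G) dimension: NHWGC activations,
// GFHWC filters, NHWGF output. Shapes are made up for illustration.
%0 = linalg.conv_2d_nhwgc_gfhwc
       {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
       ins(%input, %filter : tensor<1x34x34x4x8xf32>, tensor<4x16x3x3x8xf32>)
       outs(%init : tensor<1x32x32x4x16xf32>) -> tensor<1x32x32x4x16xf32>
```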

The other half of that PR would've been an improvement to Tosa to allow for grouped convolutions there (which we patched in downstream, leading to needing NGCHW / NHWGC convolutions for TosaToLinalg).

What all this means is that, if we move to a more general linalg.conv, it should support groups, and ideally as their own dimension alongside batch / channels / ...

I want to preface this by saying that having a temporary named op for a certain variation of quantized 3D convolution is certainly no hill I want to die on (I've never even used a 3D convolution personally). I'm just arguing this because I genuinely don't understand the downsides of "Mahesh's plan no. 2" ([linalg] Add quantized version of `conv_3d_ncdhw_fcdhw` by ubfx · Pull Request #113953 · llvm/llvm-project · GitHub).

I agree with everything said in the GitHub PRs and here about the positives of having a few unified linalg.conv ops. The reason I made the PR(s) in the first place is that I seem to lack understanding of the negatives of continuing the "old" approach in the meantime.

To me, it looked like having 20 named variation-ops is no worse than having 10 because any good solution to the problem would be categorical and replace all variations in the end. It seemed like the dialect was already cluttered and adding more of the same would not change that. Again, it seemed like this was a quality, not quantity kind of problem.

It's not my intention to work against any goals within the project. I just didn't understand how the growing pile of the same kind of Ops stops or hinders the parallel development of a solution to the problem. I would have argued that the addition of the (temporary) named Ops indirectly contributes to that goal because these Ops represent the different downstream uses and span the space that the permanent solution will have to fill later.

Seems like this is another good example of this. Once their PR is merged, this requirement is baked into the dialect until we get the unified Op. Then it becomes trivial to test coverage of the new op, e.g. by trying to rewrite all tests of the existing named op variations to the new unified Op.

Some more thoughts on advantages of temporary named op variations:

  • They require little to no review (or so I thought), because they follow an existing pattern in OpDSL and no existing functionality is changed.
  • They are actually decent stand-ins for what I imagine the permanent solution to be in the end. I.e. they allow expressing a convolution with arbitrary dimension ordering as a single named Op (see the example after this list), which is extremely useful in graph partitioning. Any current alternative (another named op + transpositions, or linalg.generic) behaves fundamentally differently during pattern matching.
  • Once the permanent solution is there, they can be replaced by the new Op(s) almost in a "search and replace" manner. Again, the current alternatives can require more code changes.
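For reference, this is what one of the existing fixed-layout named variants looks like; the dimension ordering is baked into the op name, which is what makes each variant easy to match but also forces one op per layout (shapes illustrative):

```mlir
// NHWC input, HWCF filter: the layout is encoded in the op name itself, so every
// other layout needs its own *_conv_* variant (or explicit transposes around it).
%0 = linalg.conv_2d_nhwc_hwcf
       {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
       ins(%input, %filter : tensor<1x34x34x8xf32>, tensor<3x3x8x16xf32>)
       outs(%init : tensor<1x32x32x16xf32>) -> tensor<1x32x32x16xf32>
```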

To summarize my thoughts, I was motivated by cleaning up the code in torch-mlir to get similar logic for FP and quantized convolutions (e.g. a channel-first variant existed for the standard version but not for _q) and by trying to improve pattern matching on quantized convolutions in another project. Nothing really depends on us getting it in now, but I do still believe that having all these variations allows us to continue developing almost as if we already had the permanent solution, and then switch over using search and replace once we actually get it.

Thank you for that. What pushed this to the forums was that there were two patches at the same time both adding to the convolution op count, and comments/reviews were being done differently across them. It is sometimes hard to tell who people are, and my first thought was that these were related efforts, but it just seems to be happenstance.

I'm fine proceeding carefully at need (Mahesh's plan 2). If nothing else, this discussion has pushed me from being merely concerned about the explosion to believing that we need to be more proactive about addressing it.

Apologies, it was not my intention to imply that.

I was talking about the "disembodied problem", not the people who created it, which, for a big chunk of it, was myself. I know why we're all adding to it, but there are enough voices calling stop right now, and this is what this RFC is about.

Not to lay blame, but to find a common solution and march towards it.

This is a curse more than a solution, because your partitioning will be specific to this one op, and the other op, and the other. Every pattern becomes a large switch/case statement.

If we're lucky. Having done this before, it can take a long time to weed out the odd pattern that you missed because it was in the else branch, or worse, someone else adds another case, your else doesn't apply anymore, and you have silent code-generation failures.

That's a worthwhile goal, but here we're cleaning up torch, hlo, onnx by just moving the "dirt" from those projects into Linalg by adding all of their special variations.


Could someone please provide a link to 'plan 2' if it's written out somewhere? I am also keen to learn what Mahesh's plan 1 was.

Comment buried on one of the PR threads:


I think we need to make a decision here.

  1. We freeze any additions to Linalg named ops.
  2. We allow uses of named ops of this form to come in and then build a "generalized convolution op" in Linalg that accounts for all convolution named ops.

I am not really comfortable with (1) because we don't have a plan to fix it for downstream users. Without a solid plan and someone signed up to work on it, I would rather go with (2). I know I blocked this PR to kick-start a discussion, but that isn't really fair to the contributor here (it is on us as de facto maintainers to provide a better solution).
If we are on board with (2), we can let this one go in and add it to the list of "ops we need to deprecate".


Agreed. My reaction also. Removed my block on one of the PRs.

I don't mind merging these two if we agree we have to stop and get it right, now. Even if it takes multiple iterations, several combining stages, etc.

Shall we discuss this further here, then? I am interested in it.

Yes! Sorry, didn't mean to "close the thread", just unblock the parallel thread so we can continue here.

I want to hear from others before I give more opinions... :slight_smile:


+1 To the plan outlined by Mahesh and Stella.

My only concern is that we don't have volunteers for the "then" part:

:slight_smile: But at least we know what's the next big project in Linalg.

-Andrzej

@mshahid is doing the matmul part, @javedabsar the element-wise part. It'd be fair if some of the people who want to add more convolutions would do the convolution part.


I guess we'd want 3 Ops:

  • linalg.conv for "regular convs" (1D, 2D and 3D),
  • linalg.depthwise_conv for all depthwise convs,
  • linalg.pooling_conv for the pooling convs.

Probably with sub-categories for 1D, 2D and 3D?

-Andrzej

I see your point. In our case this was the lesser problem, because the frameworks only emit a single Op variant (according to their preferred dimension ordering), so matching on that is less annoying than matching on 5 Ops (result transposition + conv + 3 input operand transpositions), plus potential DQ/Q ops in between. Thus my problem became making that one preferred variant available in the first place.

But yeah, we agree that the unified op solution is superior in all cases. And seeing as this thread has already started to actually catalyze this solution, I have no reason to argue for the named op variants anymore.

Sounds good :slightly_smiling_face:

Would you merge the quantized versions with the zero point addition into the normal versions? Matmul has an additional quantized_matmul version.
What are the advantages and disadvantages of having depthwise separately? Are there any advantages to splitting out conv_1d, conv_2d, ..?

Honest question: why can't these just be linalg.conv with an attribute for depthwise and pooling?
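Purely to illustrate the idea (this op and these attribute names do not exist upstream; they are hypothetical), something along these lines:

```mlir
// Hypothetical syntax, for illustration only: a single conv op where the
// depthwise/pooling flavour and the layout are attributes instead of separate named ops.
%0 = linalg.conv {kind = "depthwise", layout = "NHWC",
                  dilations = dense<1> : tensor<2xi64>,
                  strides = dense<1> : tensor<2xi64>}
       ins(%input, %filter : tensor<1x34x34x8xf32>, tensor<3x3x8xf32>)
       outs(%init : tensor<1x32x32x8xf32>) -> tensor<1x32x32x8xf32>
```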

On the matcher side, the difference is between:

  • if conv - elif depthwise_conv - else, where the latter assumes pool_conv and fails when we add another one. This also repeats code and promotes usage of lambdas, which are hard to debug, etc.
  • if conv and handle the iteration order via depth/width in the algorithm and add pooling at the right place if needed.

The distance from that to just having linalg.generic is pretty short, isn't it? Maybe a fair question to ask, to guide us, is "what are the benefits of having named ops?". I can think of two: (a) lowering from frameworks, where instead of a generic the generated IR has the named op; (b) specific recipes for lowering specific named ops.

Sounds nice, but it seems like a different paradigm from what has been applied to the matmul?

What is the reasoning for separating these out? To me, this seems like the equivalent of separating out depthwise or the different dimensionalities.

Pragmatism changes. :slight_smile:

The current approach is to have the three versions, but my proposal above is to look into maybe fusing them and using the operand dimensions to distinguish them.

In the end, what really matters is how we use it. We do a lot of pattern-matching, so not having too many operations helps. But only if we can reuse the code from one pattern to another; otherwise, we'll end up with a single match but a dozen internal cases, which don't help.

This is not just for dialect conversion, but also tiling, fusing, sharding and matching against micro-kernels or special lowering. If the three matmuls and the three convolutions are distinct enough that code can't be reused across a variety of cases, then we keep them.

[Edited to add] And if having three convolutions is a step towards realizing reuse or reducing code churn from the move, then let's do it! I don't use convolutions directly, so that's not going to affect me directly, but having churn in Linalg does affect me indirectly. I'm happy with big local churn or small slow churn, as long as we agree we're going in the right direction and not just going round in circles.

This thread was meant exactly to figure that out.

Correct. My follow-up comment is...

Which is what I describe above in this reply. It's not our (Intel Labs) current goal, but I'm making the case that it should be our (upstream MLIR) medium-term goal.

Yes and no. You can encode a lot of it as just a generic, but not all patterns can become a generic (non-perfectly nested loops). For the ones you can't, a named op is the only recourse, but that's the trivial case, so let's ignore it.

For the ones you can represent as linalg.generic there are choices, and it goes back to pattern-matching and semantics.

Matching against a generic is really hard. You have to "understand" affine maps, iterator types, the semantics of the inner basic block, etc. Even for element-wise operations, one argument can be anything from a constant to a higher-dimensional transposed tensor; as long as the affine map matches, it's element-wise.
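To make that concrete, here is a plain matmul written as a linalg.generic (types illustrative); a matcher has to recover "this is a matmul" from the maps, the iterator types and the body:

```mlir
#mapA = affine_map<(m, n, k) -> (m, k)>
#mapB = affine_map<(m, n, k) -> (k, n)>
#mapC = affine_map<(m, n, k) -> (m, n)>

// The same computation as linalg.matmul, but the "matmul-ness" is only implied by
// the indexing maps, the iterator types and the mul+add in the body.
%0 = linalg.generic
       {indexing_maps = [#mapA, #mapB, #mapC],
        iterator_types = ["parallel", "parallel", "reduction"]}
       ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
       outs(%C : tensor<4x16xf32>) {
  ^bb0(%a: f32, %b: f32, %acc: f32):
    %mul = arith.mulf %a, %b : f32
    %add = arith.addf %acc, %mul : f32
    linalg.yield %add : f32
} -> tensor<4x16xf32>
```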

However, tiling a generic is reasonably simple: you just look at the affine maps. Fusion is also reasonable: just make sure the maps are the same (or compatible, and nest). So, we want the transformational power of affine maps with the semantic simplicity of named ops. Our current matmul design brings this closer to the goal.

This is one of the goals of Linalg, for sure. But it's not the only one, and definitely not the "most important" one. PyTorch, XLA and ONNX all have their own dialects for that reason.

As an ingress dialect, we want Linalg to represent the common patterns of all those frameworks, not every pattern of every framework. XLA got pretty convoluted once it started supporting every variation of every ML hero op and their fusion buddies, and this is what's happening to Linalg. Moving the complexity lower down may help XLA and PyTorch, but ultimately doesn't fix the compilation pipeline. We're still solving the same old problems in the same old way and reaching the same conclusions in the end.

If here you mean pattern-matching and lowering to hardware-aware sequences (special lowering, micro-kernel, hw dialects), then yes, this is an important goal, too. But there's a third one that I think may be why you're missing the point:

(3) hardware-agnostic transformations for distribution, tiling, fusion, memory-compute density parity, etc.

Being able to split and merge operations, pattern-match against sequences of operations, and represent special semantics (like casts, transposes, quantization) without compromising the optimization goals or blowing up compiler complexity is perhaps the most important Linalg goal.

In my view, (1) and (2) are strong requirements, while (3) is the main goal.