[RFC] Op explosion in Linalg

We have too many similar operations in Linalg (*_conv_*, *_matmul_*) that cannot be decomposed because they were defined in OpDSL.

We have discussed this in the past, and the conclusion was to move these ops out of OpDSL (Python, YAML) and into ODS (TableGen, C++) so that we can common up implementation details (such as broadcast, transpose, quantization, loop order).

There is ongoing work to move all matmul operations to a common format, with the first patch ([mlir][linalg] Introduce transpose semantic to 'linalg.matmul' ops. by shahidact · Pull Request #104783 · llvm/llvm-project · GitHub) landed and follow-ups in development.
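For context, the direction from that patch (a sketch going by the PR description; the types below are only illustrative) is that the transposed variants become the plain linalg.matmul with explicit indexing_maps:

```mlir
// A transposed-A matmul expressed on the plain op via indexing_maps, instead of a
// dedicated linalg.matmul_transpose_a op (sketch based on PR #104783; shapes illustrative).
%0 = linalg.matmul
       indexing_maps = [affine_map<(m, n, k) -> (k, m)>,  // A is KxM (transposed)
                        affine_map<(m, n, k) -> (k, n)>,  // B is KxN
                        affine_map<(m, n, k) -> (m, n)>]  // C is MxN
       ins(%A, %B : tensor<8x4xf32>, tensor<8x16xf32>)
       outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>
```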

Convolutions are currently at about 30 ops and counting, with recent PRs adding more and more ([linalg] Add quantized version of `conv_3d_ncdhw_fcdhw` by ubfx · Pull Request #113953 · llvm/llvm-project · GitHub, [mlir][linalg] Add Grouped Convolution Ops: conv_2d_nhwgc_gfhwc and conv_2d_nhwgc_gfhwc_q by stefankoncarevic · Pull Request #108192 · llvm/llvm-project · GitHub).

Can we have a discussion on how to stop the op explosion and focus on a few operations that we agree are reasonable for such a core dialect?

Here are the discussions I had personally with different people and where my plan stands for the near future:

  • No implicit cast, broadcast, transpose, reduction. These can all be represented by affine maps (or attributes), including compute type and quantization.
  • There should be only one matmul op, one batch_matmul and one batch_reduce_matmul. No *_matmul_transpose_* operations should be necessary. We're not going to delete any op now, but they should be left over in OpDSL and be removed with it. @mshahid is working on this.
  • matmul has only 2D operands, batch_matmul has 3D operands with the outer-most batch dimension, and batch_reduce_matmul has 3D inputs and a 2D accumulator, with the outer batch reduced (see the shape sketch after this list). We could reuse matmul for the latter two if we can easily validate shapes and affine maps, but that's not our current goal. Anything else should be implemented as a linalg.generic or the upcoming linalg.contract.
  • Element-wise ops with an operation attribute. We already had this, but with implicit casts. I added a lot of named ops, which was the wrong thing to do but necessary due to OpDSL. Now @javedabsar will add those to ODS with the right semantics.
  • Convolutions have the same issues, but with the addition of loop order (NCHW variations), which can be attributes and a C++ implementation detail in ODS. We can use dimensionality to lower ND correctly in C++, which was impossible in OpDSL. Depthwise can also be an attribute.
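To make the shape conventions above concrete, here is a small sketch using the existing ops (element types and sizes are only illustrative):

```mlir
// matmul: 2D operands only.
%m  = linalg.matmul ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
                    outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>

// batch_matmul: 3D operands, batch on the outer-most dimension.
%bm = linalg.batch_matmul ins(%A3, %B3 : tensor<2x4x8xf32>, tensor<2x8x16xf32>)
                          outs(%C3 : tensor<2x4x16xf32>) -> tensor<2x4x16xf32>

// batch_reduce_matmul: 3D inputs, 2D accumulator; the outer batch dimension is reduced.
%br = linalg.batch_reduce_matmul ins(%A3, %B3 : tensor<2x4x8xf32>, tensor<2x8x16xf32>)
                                 outs(%C : tensor<4x16xf32>) -> tensor<4x16xf32>
```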

I want to repeat @stellaraccident's words in the latest PR:
"I am sympathetic to the need to have a practical solution for them in short order. I'm also cringing at the addition of technical debt this adds because the project overall doesn't have a plan for how to manage the expansion."

Been there, done that, and now I'm trying to undo my mistakes from last year. I want to avoid other people making the same mistakes. This is why we started with matmul, but we can't work efficiently while the pile is still growing and people are working against these goals.

So, let's just agree on a goal and have everyone follow it.


I'd love to hear from the authors of those PRs what motivated them and what is depending on getting them in now. One of them looks like part of a TOSA implementation plan, but I can't tell whether these are one-offs or part of a stream of things we should expect. Usually the TOSA side has a pretty good handle on the scope for things like this, and if they have a full need expressed somewhere, that is probably a good place to start to make sure that the linalg conv op(s) are comprehensive.

Thanks @rengolin for writing this up. I think simplifying the named ops and their implementation, and making them orthogonal (as you say, we have too many similar operations), is a good idea, and I am happy to help with this.

"Can we have a discussion on how to stop the op explosion" - I am not sure whether you mean new ops in flight or ones recently added (e.g. softmax was the most recent?). On that point, I am wondering whether having higher-level ops in linalg, e.g. linalg.attention/softmax/... (if frameworks can lower to them), is good or bad? Keen to hear opinions. Personally I think it increases the scope for kernel replacement and avoids the pain of re-discovery.

"the upcoming linalg.contract" - could you please share more info on that? I think it's a good generic one (not meaning... linalg.generic). Would we still need matmul then?

Anyways, thanks again @rengolin and @stellaraccident for discussions on this.

So, as one of the people behind the grouped conv2d PR (I didn't do the implementation work, which is why my name isn't on it), we needed it in rocMLIR for CPU-side testing of grouped convolutions (which were hackily plumbed through Tosa).

Those arose from clients that were consuming ONNX, which allows a native group = G attribute on convolutions (in that system, the group is implicitly folded into the channel dimension, but that caused impedance mismatches with our code). Some folks are of the opinion that grouped convolution should be represented by G successive convolutions, but that's both ugly and doesn't let you encode the placement of the group dimension with any precision.
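For readers who haven't seen the PR, a rough sketch of keeping the group as its own dimension with the op proposed there (going by the PR title; the exact shapes and attributes below are only illustrative):

```mlir
// Grouped 2D convolution with an explicit group (G) dimension: NHWGC activations,
// GFHWC filters, NHWGF output. Shapes are made up for illustration.
%0 = linalg.conv_2d_nhwgc_gfhwc
       {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
       ins(%input, %filter : tensor<1x34x34x4x8xf32>, tensor<4x16x3x3x8xf32>)
       outs(%init : tensor<1x32x32x4x16xf32>) -> tensor<1x32x32x4x16xf32>
```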

The other half of that PR would've been an improvement to Tosa to allow for grouped convolutions there (which we patched in downstream, leading to needing NGCHW / NHWGC convolutions for TosaToLinalg).

What all this means is that, if we move to a more general linalg.conv, it should support groups, and ideally as their own dimension alongside batch / channels / ...

I want to preface this by saying that having a temporary named op for a certain variation of quantized 3D convolution is certainly no hill I want to die on (I've never even used a 3D convolution personally). I'm just arguing this because I genuinely don't understand the downsides of "Mahesh's plan no. 2" ([linalg] Add quantized version of `conv_3d_ncdhw_fcdhw` by ubfx · Pull Request #113953 · llvm/llvm-project · GitHub).

I agree with everything said in the GitHub PRs and here about the positives of having a few unified linalg.conv ops. The reason I made the PR(s) in the first place is that I seem to lack understanding of the negatives of continuing the "old" approach in the meantime.

To me, it looked like having 20 named variation-ops is no worse than having 10 because any good solution to the problem would be categorical and replace all variations in the end. It seemed like the dialect was already cluttered and adding more of the same would not change that. Again, it seemed like this was a quality, not quantity kind of problem.

It's not my intention to work against any goals within the project. I just didn't understand how the growing pile of the same kind of Ops stops or hinders the parallel development of a solution to the problem. I would have argued that the addition of the (temporary) named Ops indirectly contributes to that goal because these Ops represent the different downstream uses and span the space that the permanent solution will have to fill later.

Seems like this is another good example of this. Once their PR is merged, this requirement is baked into the dialect until we get the unified Op. Then it becomes trivial to test coverage of the new op, e.g. by trying to rewrite all tests of the existing named op variations to the new unified Op.

Some more thoughts on advantages of temporary named op variations:

  • They require little to no review (or so I thought), because they follow an existing pattern in OpDSL and no existing functionality is changed.
  • They are actually decent stand-ins for what I imagine the permanent solution to be in the end. I.e. they allow expressing a convolution with arbitrary dimension ordering as a single named Op (see the example after this list), which is extremely useful in graph partitioning. Any current alternative (another named op + transpositions, or linalg.generic) behaves fundamentally differently during pattern matching.
  • Once the permanent solution is there, they can be replaced by the new Op(s) almost in a "search and replace" manner. Again, the current alternatives can require more code changes.
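For reference, this is what one of the existing fixed-layout named variants looks like; the dimension ordering is baked into the op name, which is what makes each variant easy to match but also forces one op per layout (shapes illustrative):

```mlir
// NHWC input, HWCF filter: the layout is encoded in the op name itself, so every
// other layout needs its own *_conv_* variant (or explicit transposes around it).
%0 = linalg.conv_2d_nhwc_hwcf
       {dilations = dense<1> : tensor<2xi64>, strides = dense<1> : tensor<2xi64>}
       ins(%input, %filter : tensor<1x34x34x8xf32>, tensor<3x3x8x16xf32>)
       outs(%init : tensor<1x32x32x16xf32>) -> tensor<1x32x32x16xf32>
```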

To summarize my thoughts, I was motivated by cleaning up the code in torch-mlir to get similar logic for FP and quantized convolutions (e.g. a channel-first variant existed for the standard version but not for _q) and by trying to improve pattern matching on quantized convolutions in another project. Nothing really depends on us getting it in now, but I do still believe that having all these variations allows us to continue developing almost as if we already had the permanent solution, and then switch over using search and replace once we actually get it.

Thank you for that. What pushed this to the forums was that there were two patches at the same time both adding to the convolution op count, and comments/reviews were being done differently across them. It is sometimes hard to tell who people are, and my first thought was that these were related efforts, but it just seems to be happenstance.

I'm fine proceeding carefully at need (Mahesh's plan 2). If nothing else, this discussion has pushed me from being merely concerned about the explosion to believing that we need to be more proactive about addressing it.

Apologies, it was not my intention to imply that.

I was talking about the "disembodied problem", not the people who created it, which, for a big chunk of it, was myself. I know why we're all adding to it, but there are enough voices calling stop right now, and this is what this RFC is about.

Not to lay blame, but to find a common solution and march towards it.

This is a curse more than a solution, because your partitioning will be specific to this one op, and the other op, and the other. Every pattern becomes a large switch/case statement.

If we're lucky. Having done this before, it can take a long time to weed out the odd pattern that you missed because it was in the else branch, or worse, someone else adds another case, your else doesn't apply anymore, and you have silent code-generation failures.

That's a worthwhile goal, but here we're cleaning up torch, hlo, onnx by just moving the "dirt" from those projects into Linalg by adding all of their special variations.


Could someone please provide a link to 'plan 2' if it's written out somewhere? I am also keen to learn what Mahesh's plan 1 was.

Comment buried on one of the PR threads:


I think we need to make a decision here.

  1. We freeze any additions to Linalg named ops.
  2. We allow uses of named ops of this form to come in and then build a "generalized convolution op" in Linalg that accounts for all convolution named ops.

I am not really comfortable with (1) because we don't have a plan to fix it for downstream users. Without a solid plan and someone signed up to work on it, I would rather go with (2). I know I blocked this PR to kick-start a discussion, but that isn't really fair to the contributor here (it is on us as de facto maintainers to provide a better solution).
If we are on board with (2), we can let this one go in and add it to the list of "ops we need to deprecate".


Agreed. My reaction also. Removed my block on one of the PRs.

I don't mind merging these two if we agree we have to stop and get it right, now. Even if it takes multiple iterations, several combining stages, etc.

Shall we discuss this further here, then? I am interested in it.

Yes! Sorry, didn't mean to "close the thread", just unblock the parallel thread so we can continue here.

I want to hear from others before I give more opinions... :slight_smile:


+1 To the plan outlined by Mahesh and Stella.

My only concern is that we don't have volunteers for the "then" part:

:slight_smile: But at least we know what's the next big project in Linalg.

-Andrzej

@mshahid is doing the matmul part, @javedabsar the element-wise part. It'd be fair if some of the people who want to add more convolutions would do the convolution part.


I guess we'd want 3 Ops:

  • linalg.conv for "regular convs" (1D, 2D and 3D),
  • linalg.depthwise_conv for all depthwise convs,
  • linalg.pooling_conv for the pooling convs.

Probably with sub-categories for 1D, 2D and 3D?

-Andrzej

I see your point. In our case this was the lesser problem, because the frameworks only emit a single Op variant (according to their preferred dimension ordering), so matching on that is less annoying than matching on 5 Ops (result transposition + conv + 3 input operand transpositions), plus potential DQ/Q ops in between. Thus my problem became making that one preferred variant available in the first place.

But yeah, we agree that the unified op solution is superior in all cases. And seeing as this thread has already started to actually catalyze this solution, I have no reason to argue for the named op variants anymore.

Sounds good :slightly_smiling_face:

Would you merge the quantized versions with the zero point addition into the normal versions? Matmul has an additional quantized_matmul version.
What are the advantages and disadvantages of having depthwise separately? Are there any advantages to splitting out conv_1d, conv_2d, ..?

Honest question: why can't these just be linalg.conv with an attribute for depthwise and pooling?
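Purely to illustrate the idea (this op and these attribute names do not exist upstream; they are hypothetical), something along these lines:

```mlir
// Hypothetical syntax, for illustration only: a single conv op where the
// depthwise/pooling flavour and the layout are attributes instead of separate named ops.
%0 = linalg.conv {kind = "depthwise", layout = "NHWC",
                  dilations = dense<1> : tensor<2xi64>,
                  strides = dense<1> : tensor<2xi64>}
       ins(%input, %filter : tensor<1x34x34x8xf32>, tensor<3x3x8xf32>)
       outs(%init : tensor<1x32x32x8xf32>) -> tensor<1x32x32x8xf32>
```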

On the matcher side, the difference is between:

  • if conv - elif depthwise_conv - else, where the latter assumes pool_conv and fails when we add another one. This also repeats code and promotes usage of lambdas, which are hard to debug, etc.
  • if conv and handle the iteration order via depth/width in the algorithm and add pooling at the right place if needed.

The distance from that to just having linalg.generic is pretty short, isn't it? Maybe a fair question to ask, to guide us, is "what are the benefits of having named ops?". I can think of two: (a) lowering from frameworks, where instead of a generic the generated IR has the named op; (b) specific recipes for lowering specific named ops.

Sounds nice, but it seems like a different paradigm from what has been applied to the matmul?

What is the reasoning for separating these out? To me, this seems like the equivalent of separating out depthwise or the different dimensionalities.

Pragmatism changes. :slight_smile:

The current approach is to have the three versions, but my proposal above is to look into maybe fusing them and using the operand dimensions to distinguish them.

In the end, what really matters is how we use it. We do a lot of pattern-matching, so not having too many operations helps. But only if we can reuse the code from one pattern to another; otherwise, we'll end up with a single match but a dozen internal cases, which don't help.

This is not just for dialect conversion, but also tiling, fusing, sharding and matching against micro-kernels or special lowering. If the three matmuls and the three convolutions are distinct enough that code can't be reused across a variety of cases, then we keep them.

[Edited to add] And if having three convolutions is a step towards realizing reuse or reducing code churn from the move, then let's do it! I don't use convolutions directly, so that's not going to affect me directly, but having churn in Linalg does affect me indirectly. I'm happy with big local churn or small slow churn, as long as we agree we're going in the right direction and not just going round in circles.

This thread was meant exactly to figure that out.

Correct. My follow-up comment is...

Which is what I describe above in this reply. It's not our (Intel Labs) current goal, but I'm making the case that it should be our (upstream MLIR) medium-term goal.

Yes and no. You can encode a lot of it as just a generic, but not all patterns can become a generic (non-perfectly nested loops). For the ones you can't, a named op is the only recourse, but that's the trivial case, so let's ignore it.

For the ones you can represent as linalg.generic there are choices, and it goes back to pattern-matching and semantics.

Matching against a generic is really hard. You have to "understand" affine maps, iterator types, the semantics of the inner basic block, etc. Even for element-wise operations, one argument can be anything from a constant to a higher-dimensional transposed tensor; as long as the affine map matches, it's element-wise.
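To make that concrete, here is a plain matmul written as a linalg.generic (types illustrative); a matcher has to recover "this is a matmul" from the maps, the iterator types and the body:

```mlir
#mapA = affine_map<(m, n, k) -> (m, k)>
#mapB = affine_map<(m, n, k) -> (k, n)>
#mapC = affine_map<(m, n, k) -> (m, n)>

// The same computation as linalg.matmul, but the "matmul-ness" is only implied by
// the indexing maps, the iterator types and the mul+add in the body.
%0 = linalg.generic
       {indexing_maps = [#mapA, #mapB, #mapC],
        iterator_types = ["parallel", "parallel", "reduction"]}
       ins(%A, %B : tensor<4x8xf32>, tensor<8x16xf32>)
       outs(%C : tensor<4x16xf32>) {
  ^bb0(%a: f32, %b: f32, %acc: f32):
    %mul = arith.mulf %a, %b : f32
    %add = arith.addf %acc, %mul : f32
    linalg.yield %add : f32
} -> tensor<4x16xf32>
```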

However, tiling a generic is reasonably simple: you just look at the affine maps. Fusion is also reasonable: just make sure the maps are the same (or compatible, and nest). So, we want the transformational power of affine maps with the semantic simplicity of named ops. Our current matmul design brings this closer to the goal.

This is one of the goals of Linalg, for sure. But it's not the only one, and definitely not the "most important" one. PyTorch, XLA and ONNX all have their own dialects for that reason.

As an ingress dialect, we want Linalg to represent the common patterns of all those frameworks, not every pattern of every framework. XLA got pretty convoluted once it started supporting every variation of every ML hero op and their fusion buddies, and this is what's happening to Linalg. Moving the complexity lower down may help XLA and PyTorch, but ultimately doesn't fix the compilation pipeline. We're still solving the same old problems in the same old way and reaching the same conclusions in the end.

If here you mean pattern-matching and lowering to hardware-aware sequences (special lowering, micro-kernel, hw dialects), then yes, this is an important goal, too. But there's a third one that I think may be why you're missing the point:

(3) hardware-agnostic transformations for distribution, tiling, fusion, memory-compute density parity, etc.

Being able to split and merge operations, pattern-match against sequences of operations, and represent special semantics (like casts, transposes, quantization) without compromising the optimization goals or blowing up compiler complexity is perhaps the most important Linalg goal.

In my view, (1) and (2) are strong requirements, while (3) is the main goal.