[RFC]: Extend Linalg elemwise named ops semantics

Written in collaboration with @rengolin, @MaheshRavishankar, @banach-space

Context and Background

In Oct 2024, Renato put out [RFC]: Op Explosion in Linalg highlighting the problem – “We have too many similar operations in Linalg (conv, matmul) that cannot be decomposed because they were defined in OpDSL. We discussed about this in the past and the end result is to move it out of OpDSL (Python, YAML), into ODS (table-gen, C++) so that we can common up implementation details”.

The proposal identified – matmul, contraction, convolution, and element-wise, amongst others – as areas needing attention. Subsequent to Renato’s RFC, Rolf put out [RFC]: introduce linalg.contract. More threads on the same topic can be found here – [RFC]: Linalg Operation Tree, and move matmul to ODS.

Problem Description
Linalg operations such as linalg.add or linalg.exp currently do not support broadcast, transpose (projected permutation in general), or type conversion. User-defined affine maps cannot be supplied (unlike linalg.generic and now linalg.matmul), and the current semantics of element-wise require the same type (i.e. shape and element type) for all operands and results. While the following is currently valid IR –

%add = linalg.add
          ins(%x, %y : tensor<4x8x16xf32>, tensor<4x8x16xf32>)
          outs(%z : tensor<4x8x16xf32>) -> tensor<4x8x16xf32>

The following, which tries to express a transpose (and element-type conversion) of the first operand as part of the operation, is currently not valid IR –

%add_transpose_a = linalg.add
         ins(%x, %y : tensor<8x4x16xf16>, tensor<4x8x16xf32>)
         outs(%z: tensor<4x8x16xf32>) -> tensor<4x8x16xf32>

Any required transpose, broadcast, type conversion, etc. must therefore be done before or after this op.

What is the benefit of extending the semantics of elementwise ops such as linalg.add? Today the graph IR expresses a single logical operation as several ops – input reshapes and type conversions, the actual computation (e.g. linalg.add), and conversion of the result to the desired output type – in a series of linalg.generics and other ops, and then relies on folding optimizations to produce a compact, low-entropy IR. With a new linalg.elemwise whose semantics overcome the above limitations, this can be expressed compactly in a single op. Also, linalg.elemwise will be defined in ODS (in the spirit of ‘move out of OpDSL’).

Proposal: Introduce a new linalg.elemwise
This RFC proposes a new linalg.elemwise defined in ODS with semantics that overcome the above limitations. We seek feedback; to add weight to this proposal, a prototype implementation of what is proposed here accompanies this RFC.

Syntax

elemwise-op ::= `linalg.elemwise`
                 `func_type` `=` func-op
                 (`comp_type` `=` comp-type)?
                 (`indexing_maps` `=` `[` affine-map+ `]`)?
                 `ins( `$A`,` $B `:` tensor-or-memref-type `,`
                                     tensor-or-memref-type `)`
                 `outs( `$C `:` tensor-or-memref-type `)`
                 (`->` tensor-type)?
                
func-op ::= `#linalg.elemwise_fn` `<` func-op-kind `>`
func-op-kind ::=  `exp` | .. | `erf` | `add` | .. | `powf` | `select`
comp-type ::= `#linalg.comp_type` `<` comp-kind `>`
comp-kind ::= `i1` | `i8` | .. | `f64`

The attribute func_type describes the operation kind (e.g. add, exp). The attribute comp_type defines the type on which the elementwise operation is to be performed. If all types match there is no need to specify this attribute. The affine maps for the operands and result must be projected permutations with no zero constants.
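As a hypothetical illustration (the exact comp_type semantics are deliberately left open below, and f32 is just one of the elided comp-kind values), mixed-type operands with the computation carried out in f32 might read:

%sum = linalg.elemwise
         func_type = #linalg.elemwise_fn<add>
         comp_type = #linalg.comp_type<f32>
         ins(%x, %y : tensor<4x8xf16>, tensor<4x8xf32>)
         outs(%z : tensor<4x8xf32>) -> tensor<4x8xf32>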

Verifier
The verifier checks -

  1. The func_type is a valid unary, binary, or ternary operation.
  2. The semantics of comp_type (extension/truncation/type conversion) probably needs some discussion here, so it is kept open for now.
  3. The indexing_maps attribute of the input operands must consist of affine maps which are projected permutations; the indexing map of the output must be the identity (a sketch of IR this check rejects follows this list).
  4. The number of input operands must match the arity (unary, binary, or ternary) inferred from the func_type.
  5. Some more checks are done via the already-available verifyStructuredOpInterface.
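For instance, here is a sketch (illustrative only) of IR that check 3 would reject – the first map is not a projected permutation because of the d0 + d1 expression:

#bad = affine_map<(d0, d1) -> (d0 + d1, d1)>
#id = affine_map<(d0, d1) -> (d0, d1)>

// Invalid: the input indexing map is not a projected permutation.
%r = linalg.elemwise
       func_type = #linalg.elemwise_fn<exp>
       indexing_maps = [#bad, #id]
       ins(%x : tensor<4x8xf32>)
       outs(%y : tensor<4x8xf32>) -> tensor<4x8xf32>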

Inference and Deduction

  1. When user-defined indexing_maps are not provided, identity maps are inferred for all operands: N identity maps, where N depends on the arity of the elementwise op. The number of dims is inferred from the rank of the output type. With the default indexing maps, the input and output shapes must all match (see the equivalence sketched after this list).
  2. The iterator types are always inferred as all ‘parallel’ for element-wise ops. Iterator types are needed for constructing the structured op that lowers to linalg.generic; their number is inferred from the indexing_map dims.
  3. The arity is inferred from func_type.
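Concretely, under the proposed syntax the two forms below would be equivalent – the three identity maps are inferred from the binary arity and the rank-3 output (a sketch, not from the prototype):

#id = affine_map<(d0, d1, d2) -> (d0, d1, d2)>

// Without maps: three rank-3 identity maps are inferred.
%0 = linalg.elemwise
       func_type = #linalg.elemwise_fn<add>
       ins(%a, %b : tensor<4x8x16xf32>, tensor<4x8x16xf32>)
       outs(%c : tensor<4x8x16xf32>) -> tensor<4x8x16xf32>

// Equivalent explicit form.
%1 = linalg.elemwise
       func_type = #linalg.elemwise_fn<add>
       indexing_maps = [#id, #id, #id]
       ins(%a, %b : tensor<4x8x16xf32>, tensor<4x8x16xf32>)
       outs(%c : tensor<4x8x16xf32>) -> tensor<4x8x16xf32>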

Einsum Notation
Elementwise binary op –

 C[I^C] = A[I^A] ⊕ B[I^B]

where I^A, I^B, and I^C are multi-indices, i.e. ordered sequences of dimension identifiers (meant to range over index ranges) corresponding to the co-domains of the respective affine_maps. ⊕ is the selected kind of element-wise operation.

Similarly, the elementwise unary op –

C[I^C] = λ A[I^A]

where λ is the selected unary operator.
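For instance, the binary op of the worked example below (transpose on A, broadcast on B, identity on C) instantiates to –

 C[d0, d1, d2] = A[d0, d2, d1] ⊕ B[d0, d2]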

Worked out Example
Below we show a working example IR:

#identity = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#transpose = affine_map<(d0, d1, d2) -> (d0, d2, d1)>
#broadcast = affine_map<(d0, d1, d2) -> (d0, d2)>

func.func @test(%arg0: tensor<4x16x8xf32>,
                %arg1: tensor<4x16xf32>,
                %arg2: tensor<4x8x16xf32>) -> tensor<4x8x16xf32> {
  %e = tensor.empty() : tensor<4x16x8xf32>
  
  // unary op (exp). no user-defined map.
  %exp = linalg.elemwise
           func_type=#linalg.elemwise_fn<exp>
           ins(%arg0 : tensor<4x16x8xf32>) outs(%e: tensor<4x16x8xf32>) -> tensor<4x16x8xf32>

  // binary op, user defined map
  %add = linalg.elemwise
            func_type=#linalg.elemwise_fn<add>
            indexing_maps = [#transpose, #broadcast, #identity]
            ins(%exp, %arg1 : tensor<4x16x8xf32>, tensor<4x16xf32>) outs(%arg2: tensor<4x8x16xf32>) -> tensor<4x8x16xf32>

  return %add :  tensor<4x8x16xf32>
}

Lowering of Example to linalg.generic

The above lowers to the following linalg.generic (mlir-opt --linalg-generalize-named-ops test.mlir):

#map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d2, d1)>
#map2 = affine_map<(d0, d1, d2) -> (d0, d2)>
module {
  func.func @test(%arg0: tensor<4x16x8xf32>, %arg1: tensor<4x16xf32>, %arg2: tensor<4x8x16xf32>) -> tensor<4x8x16xf32> {
    %0 = tensor.empty() : tensor<4x16x8xf32>
    %1 = linalg.generic {indexing_maps = [#map, #map], iterator_types = ["parallel", "parallel", "parallel"]} ins(%arg0 : tensor<4x16x8xf32>) outs(%0 : tensor<4x16x8xf32>) {
    ^bb0(%in: f32, %out: f32):
      %3 = math.exp %in : f32
      linalg.yield %3 : f32
    } -> tensor<4x16x8xf32>
    %2 = linalg.generic {indexing_maps = [#map1, #map2, #map], iterator_types = ["parallel", "parallel", "parallel"]} ins(%1, %arg1 : tensor<4x16x8xf32>, tensor<4x16xf32>) outs(%arg2 : tensor<4x8x16xf32>) {
    ^bb0(%in: f32, %in_0: f32, %out: f32):
      %3 = arith.addf %in, %in_0 : f32
      linalg.yield %3 : f32
    } -> tensor<4x8x16xf32>
    return %2 : tensor<4x8x16xf32>
  }
}

Actions

  1. First PR: I have an implementation and some tests of this working, e.g. the test example and lowering above are from my working diff. I will put it up in a couple of days. It does not have lots of tests or cover all bases, as some implementation details are likely to change based on input here.
  2. Gather feedback: get feedback on this RFC and revise the implementation.
  3. Discuss with the community the timeline and order of the next steps – e.g. adding transformations that leverage this new linalg.elemwise.

This RFC needs to be handled similarly to [MLIR][RFC]: introduce linalg.contract so that we have a consistent story for named ops. What is missing here for me is that we should not have a single comp_type but rather a list. So I would suggest making this

elemwise-op ::= `linalg.elemwise`
                 `func_type` `=` func-op
                 (`cast_info` `=` `[` (cast-type)+ `]`)?
                 (`indexing_maps` `=` `[` affine-map+ `]`)?
                 `ins( `$A`,` $B `:` tensor-or-memref-type `,`
                                     tensor-or-memref-type `)`
                 `outs( `$C `:` tensor-or-memref-type `)`
                 (`->` tensor-type)?
                
func-op ::= `#linalg.elemwise_fn` `<` func-op-kind `>`
func-op-kind ::=  `exp` | .. | `erf` | `add` | .. | `powf` | `select`
cast-type ::= `#linalg.cast_type` `<` comp-kind `>`
comp-kind ::= `i1` | `i8` | .. | `f64`

with the same constraint as what I described here: [MLIR][RFC]: introduce linalg.contract - #30 by MaheshRavishankar. (Please look at what is written there w.r.t. why this needs to be a list, IMO.)
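A hypothetical instance of this spelling, for illustration only (one cast_info entry per in/out operand; none of the attribute spellings here are settled):

// Illustrative: entry i of cast_info governs operand i (ins first, then outs),
// e.g. how %a (f16) and %b (bf16) are cast before computing, and how the
// result is cast back to the element type of %c.
%r = linalg.elemwise
       func_type = #linalg.elemwise_fn<add>
       cast_info = [#linalg.cast_type<f32>,
                    #linalg.cast_type<f32>,
                    #linalg.cast_type<f16>]
       ins(%a, %b : tensor<4x8xf16>, tensor<4x8xbf16>)
       outs(%c : tensor<4x8xf16>) -> tensor<4x8xf16>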

Note that for elementwise-operations there is an alternative where

  1. We add a “compute-type”
  2. The cast_info could be a list whose size is the number of ins plus the number of outs, which tells how to
    a. Convert from input element type to compute type for each ins operand
    b. Convert from compute type to output element type for each outs operand.

This would be perfectly valid, but the only argument to not do this is to be consistent across named ops. If there is a strong consensus to go with the alternative then that would be OK with me as well (even though the consistency would be lost which would probably make it confusing for “clients” of Linalg named ops.)

Thanks @MaheshRavishankar for your inputs. W.r.t. comp_type, let’s see what other folks say, and what is concluded on the thread ([MLIR][RFC]: introduce linalg.contract).

PR for the above RFC

From what I understand, the main utility that the new linalg.elementwise provides over the similar linalg.map op is that the latter doesn’t allow any broadcast/transpose semantics like the former does? However, linalg.map has a nice feature where the elementwise operation can be a composition of simple arith/math ops, whereas linalg.elementwise doesn’t seem to have this feature. Is there any plan to extend it to support more complex operations with a region?

Follow-up question: why have both linalg.map and linalg.elementwise? Why not just extend linalg.map instead of adding another op? It seems like adding another op just contributes to the issue you mentioned at the beginning of your RFC,

In Oct 2024, Renato put out [RFC]: Op Explosion in Linalg highlighting the problem – “We have too many similar operations in Linalg (conv, matmul) that cannot be decomposed because they were defined in OpDSL. We discussed about this in the past and the end result is to move it out of OpDSL (Python, YAML), into ODS (table-gen, C++) so that we can common up implementation details”.

but if there is good reason to have both, I would like to hear people’s thoughts on that.

Thanks!

Correct.

That’s a map. :slight_smile:

The main reason is that map allows an arbitrary number of ops in the inner region, which makes it hard to reason about the semantics when lowering to hardware (some ops have direct counterparts, others do not). Mix and match and you have a lowering nightmare. We already have that semantics with generic, so map itself is just an alias to a generic that happens to have an element-wise map for multiple ops.

You may now have the following question: “Why not add custom affine maps to map and have an attribute to force it to be a single op?”. Well, that’s precisely what elementwise is.

We are trying to fix the op explosion problem, but also trying not to break existing patterns. It’s a delicate balance between keeping existing ops, creating new ones, changing their semantics and deprecating unused ones.

In my original post I propose a structure of aliasing, where both map and elementwise alias generic. When certain properties match (affine maps, number of operands, number and types of operations) they “coalesce” to their aliases (map, elementwise, contract). This can be seen as a canonical representation, but in essence, they’re the same as before.

In this case, a fusion of two elementwise operations would yield a map, a packing of a matmul would yield a contract, and so on.

The main benefit of this is that, in the pattern matchers, matching a map guarantees a multi-op element-wise semantics, so you don’t have to check attributes, etc. It’s more about the transforms that use them than their actual textual representations.

That’s a map. :slight_smile:

Well not quite, because map doesn’t have the extra feature of supporting broadcasts, transposes, etc.

We are trying to fix the op explosion problem

I’m very much in support of this. I really hate this problem as well and think that most people put zero thought into this when they add new ops for their specific purposes.
But this is also why I asked my questions in the first place. You guys definitely put thought into this, but I wanted to understand why both map and elementwise have to exist and why their co-existence doesn’t contribute to this exact problem.

We already have that semantics with generic, so map itself is just an alias to a generic that happens to have an element-wise map for multiple ops.

Well actually, it’s not quite treated as an alias to generic, because there is no conversion from map to generic in linalg-generalize-named-ops. Moreover, you can apply this argument to elementwise, so I’m not seeing how making this identification justifies the separation of map and elementwise.

Thanks for your feedback. I’ll have to absorb some of the points you are making.

The amount of "not quite"s in your response clearly illustrates the issues we’re having. :smiley:

The generalization / de-generalization passes are “work-in-progress”, so they don’t have all the conversion patterns. They should. But before we start implementing them all, we need to get the minimum set of operations and deprecate the rest.

The fact that you can apply the alias argument to them all is precisely why it’s ok to have them.

If we reduce the commonality argument to the maximum, we’re back with just generic. There would be no reason to have anything else. For a while, this was the truth behind linalg.

But pattern matchers need to do a lot of work to separate the multiple “types” of generic ops, and “linalg is a transform dialect”, i.e. the core purpose of this dialect is to be “easy to transform”. If it’s hard to pattern match, then it’s also hard to transform.

So, think about the named ops as memoization. You start with a generic, and some transform realizes it has element-wise maps, so it converts to an alias. Which one? Well, depends on how many operations are inside the region.

But it gets weirder. You could have a region with three ops (cast, cast, payload) and still alias to an elementwise with different input and output types. But match two payload ops (say, exp and add) and now you don’t know what it is, so it becomes a map. But if you match mul + add, and there is a reduction, well, that’s a matmul, or at least a contract.
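For concreteness, here is a sketch (illustrative IR, not from any prototype) of such a (cast, cast, payload) generic that could still coalesce to an elementwise add with mixed element types:

#id = affine_map<(d0, d1) -> (d0, d1)>
%r = linalg.generic {indexing_maps = [#id, #id, #id],
                     iterator_types = ["parallel", "parallel"]}
       ins(%a, %b : tensor<4x8xf16>, tensor<4x8xf16>)
       outs(%c : tensor<4x8xf32>) {
  ^bb0(%in: f16, %in_0: f16, %out: f32):
    %0 = arith.extf %in : f16 to f32   // cast
    %1 = arith.extf %in_0 : f16 to f32 // cast
    %2 = arith.addf %0, %1 : f32       // payload
    linalg.yield %2 : f32
} -> tensor<4x8xf32>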

If you run de-generalization passes in between your transforms, you can simplify the graph, and make it easier for further transforms to match the alias directly, and avoid all the checks on affine maps, iterator types, input types, etc. Memoization.

Once this is all working, you can start talking about meta aliases, like einsum, which is the union of contract and elementwise (but not map). And if you change the iteration order of a matmul, your tile can’t be another matmul, but that’s ok, because it becomes a contract. As long as you know how to lower those efficiently, you’re fine.

Finally, things like batch_matmul and batch_reduce_matmul become unnecessary, but can also continue as aliases, because some transforms can still match them without checking the reduction dimensions from a contract.

This is why I proposed an “operation tree” of core operations (like generic) and aliases. But einsum would turn that tree into a DAG, so it’s not as simple as I made it look last year.

Makes sense?

Lol, got it. :slight_smile:

yah makes sense

I totally agree with this. I’m definitely a big fan of named ops. But yes they should also be kept at a minimum. And I do appreciate that it’s a bit tricky to get a good balance for this.

Yes I think I follow most of your reasoning here and I appreciate all the details. I would like to keep track of this effort and perhaps even help if I can. Is there another larger scoped thread for this going on?

Thanks again for the response and for putting in the effort to improve linalg. I’ve been thinking it was sorely needed for a while, but didn’t have the bandwidth, connections, or intelligence to start anything myself. I did start trying to come up with a better way to represent convolutions a while back, because I was sick of all the named variants that currently exist. But no one cared at the time, so I stopped caring myself (and haven’t needed to deal with convolutions in a while). If you’re curious: [mlir][linalg] Implement `LinalgGroupedConvolutionOpInterface` to unify grouped convs by srcarroll · Pull Request #94796 · llvm/llvm-project · GitHub

Right now I think we’re done creating more ops, and will start to move old ones to new ones. The larger-scope project where we’ll work on this is the Lighthouse project. The plan is to use that infra to validate the usages, canonical forms, pattern-match issues, and common pipelines, to have strong arguments for/against all the variants and reach a minimum set. You’re very welcome to join us there!

Ah, yes, convolutions don’t fit the operation tree as is, because they’re not all derived from generic. That thread was being driven by @MaheshRavishankar and @banach-space, but I’m not sure what the current status is.

Hopefully we can drive both through the lighthouse and reach a common design.

If an op doesn’t have a generic counterpart, does it even belong in linalg? Moreover, don’t plenty of convolutions fit to justify cleaning up that mess?

That’s a good question. linalg has evolved from just linear algebra to more of a payload dialect that represents structured (perfectly nested or not) operations on large data (tensor/memref), so it’s not totally out of line.

But the convolution list explosion made us re-think that strategy. If we could common up into a tree (as you were trying to do partially), even if the parent wasn’t generic, it would be ok to be in the same dialect, because we can still do a lot of linear algebra (albeit with some creativity) on them.

Got it. Thanks again

We have not discussed it much since https://discourse.llvm.org/t/rfc-op-explosion-in-linalg/. Mahesh drafted some ideas, but that discussion has stalled.

Personally, I’d like to avoid multiple big refactors in-flight and would rather focus on finalising this, see e.g. https://discourse.llvm.org/t/psa-deprecating-linalg-named-elementwise-ops/.

-Andrzej

I had forgotten about that! :sweat_smile:

I agree. This is closer to the topic of this thread, also.