[RFC] Declarative "named ops" in the Linalg dialect

Hi everyone,

I am proposing adding support to declaratively create semantically charged “named ops” in Linalg.

The end goal is that we will be able to automatically generate a large fraction of the ops in a dialect from a declarative specification that resembles the following (e.g. for linalg.matmul):

  "C(i,j) +=! A(i,k) * B(k,j) where i in 0:M, k in 0:K, j in 0:N"

Concretely, the plan is to start with a specific Tablegen backend to automatically populate the attributes of linalg.generic from a textual specification. As things progress and evolve, refactorings are expected to take place and be reusable across dialects (e.g. more idiomatic support for affine maps in Tablegen).
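
To make the target concrete, here is roughly the linalg.generic form that the matmul spec above would expand to. The exact linalg.generic syntax has evolved over time, so the spelling below (ins/outs, iterator_types, std arithmetic ops) is illustrative rather than normative:

// Illustrative sketch only; linalg.generic syntax is approximate.
#map_A = affine_map<(i, j, k) -> (i, k)>
#map_B = affine_map<(i, j, k) -> (k, j)>
#map_C = affine_map<(i, j, k) -> (i, j)>

func @matmul(%A: memref<?x?xf32>, %B: memref<?x?xf32>, %C: memref<?x?xf32>) {
  // i and j are parallel iterators, k is the reduction iterator from the spec.
  linalg.generic {indexing_maps = [#map_A, #map_B, #map_C],
                  iterator_types = ["parallel", "parallel", "reduction"]}
      ins(%A, %B : memref<?x?xf32>, memref<?x?xf32>)
      outs(%C : memref<?x?xf32>) {
    ^bb0(%a: f32, %b: f32, %c: f32):
      %0 = mulf %a, %b : f32
      %1 = addf %c, %0 : f32   // the "+=" accumulation at each (i, j) point
      linalg.yield %1 : f32
  }
  return
}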

Such a mechanism is expected to help build a system to understand tradeoffs related to:

  1. succinct declaration of many ops from different application domains, ML in particular
  2. source of mathematical truth that carries the semantics of the op at a high level
  3. op “class” vs op “instance” specification and rank-polymorphism
  4. roundtripping between the “named” form and the “generic” form of an op and implications on matchers and library calls / CISC ISAs
  5. transformations that apply to the “named” form (e.g. TASO/TensorRT/Algebraic rewrites) or the “generic” form (fusion, tiling etc) of the op.
  6. shape inference
  7. specification language design

Progress on these fronts will be on a per-need basis and with input from the community, but at least we can start from something concrete.

While the backend aims to be unopinionated about the choice of syntax, it needs to start somewhere, so I am proposing a syntax inspired by TensorComprehensions to get started.

Concretely, the first revision will assume some parser exists and add the ability to write:

def PointwiseAddOp : LinalgNamedStructured_Op<"pointwise_add", [
    NInputs<2>,
    NOutputs<1>,
    NamedStructuredOpTraits]> {
  let arguments = (ins Variadic<LinalgOperand>:$views);
  let results = (outs Variadic<AnyRankedTensor>:$output_tensors);
  let spec = "A(...) = B(...) + C(...)";
  let assemblyFormat = "`(` operands `)` attr-dict `:` "
    "functional-type(operands, results)";
}

def PointwiseMulOp : LinalgNamedStructured_Op<"pointwise_mul", [
    NInputs<2>,
    NOutputs<1>,
    NamedStructuredOpTraits]> {
  let arguments = (ins Variadic<LinalgOperand>:$views);
  let results = (outs Variadic<AnyRankedTensor>:$output_tensors);
  let spec = "A(...) = B(...) * C(...)";
  let assemblyFormat = "`(` operands `)` attr-dict `:` "
    "functional-type(operands, results)";
}

For now, the spec strings will just be checked for equality; a parser will be developed later.
This will result in the following legal IR on arbitrarily mixed tensors and buffers:


func @pointwise1d(%a: memref<?xf32>, %b: memref<?xf32>, %c: memref<?xf32>,
                  %ta: tensor<?xf32>, %tb: tensor<?xf32>, %tc: tensor<?xf32>) {
  linalg.pointwise_add(%a, %b, %c): (memref<?xf32>, memref<?xf32>, memref<?xf32>) -> ()
  %dadd = linalg.pointwise_add(%a, %b): (memref<?xf32>, memref<?xf32>) -> tensor<?xf32>
  %eadd = linalg.pointwise_add(%ta, %b): (tensor<?xf32>, memref<?xf32>) -> tensor<?xf32>
  linalg.pointwise_add(%ta, %tb, %c): (tensor<?xf32>, tensor<?xf32>, memref<?xf32>) -> ()

  linalg.pointwise_mul(%a, %b, %c): (memref<?xf32>, memref<?xf32>, memref<?xf32>) -> ()
  %dmul = linalg.pointwise_mul(%a, %b): (memref<?xf32>, memref<?xf32>) -> tensor<?xf32>
  %emul = linalg.pointwise_mul(%ta, %b): (tensor<?xf32>, memref<?xf32>) -> tensor<?xf32>
  linalg.pointwise_mul(%ta, %tb, %c): (tensor<?xf32>, tensor<?xf32>, memref<?xf32>) -> ()

  return
}

Follow-up revisions will consist of (1) enabling roundtripping between the named and generic forms and (2) lowering to loops.
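
As a rough illustration of (2), lowering the buffer form of linalg.pointwise_add could produce a loop nest along these lines (the spellings of scf.for, dim, load and store have changed across MLIR revisions, so treat this as a sketch):

// Hypothetical lowering sketch; op spellings are approximate.
func @pointwise_add_loops(%a: memref<?xf32>, %b: memref<?xf32>, %c: memref<?xf32>) {
  %c0 = constant 0 : index
  %c1 = constant 1 : index
  %n = dim %a, %c0 : memref<?xf32>
  // One parallel loop over the single "pointwise" iterator.
  scf.for %i = %c0 to %n step %c1 {
    %0 = load %a[%i] : memref<?xf32>
    %1 = load %b[%i] : memref<?xf32>
    %2 = addf %0, %1 : f32
    store %2, %c[%i] : memref<?xf32>
  }
  return
}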

This is also related to the discussion on TCP ops and tensors.

I will send an RFC revision soon so people can comment on the approach.

Please let me know what you think, thanks!

That’s pretty nice!

When I look at a declaration like this:

def PointwiseAddOp : LinalgNamedStructured_Op<"pointwise_add", [
    NInputs<2>,
    NOutputs<1>,
    NamedStructuredOpTraits]> {
  let arguments = (ins Variadic<LinalgOperand>:$views);
  let results = (outs Variadic<AnyRankedTensor>:$output_tensors);
  let spec = "A(...) = B(...) + C(...)";
  let assemblyFormat = "`(` operands `)` attr-dict `:` "
    "functional-type(operands, results)";
}

It seems like you could infer a lot of the other information from the spec field (like the number of inputs/outputs, for example).
Have you considered having something that takes the spec and the name of the op and generates the ODS definition, instead of writing a TableGen backend?

Sounds like a great idea. Is the model of mlir/tools/mlir-tblgen/LLVMIRIntrinsicGen.cpp a good one? It generates Tablegen from Tablegen.
Note that it is still crammed into the mlir-tblgen binary; is there also a suggestion that such things should be standalone binaries?

This revision sketches what it takes without a parser and without the Tablegen → Tablegen meta step.

I believe both could be considered as future extensions or refactorings once the ball is rolling.

This approach really opened my eyes when I saw it for the first time. I really support this direction. Would like to see a bit more discussion about the specific incremental implementation path and desired end state.

Hey Sean, I think I intuitively agree with you but I was wondering if you could elaborate a bit more on what attracts you specifically (whole approach, source of truth, syntax, etc)?

I really like the symmetry and consistency from top to bottom that this approach provides. I also like grounding these op monikers in a plausible mathematical definition: I think that is going to help bridge the compiler and non-compiler sides of the project in a way that the syntax of the underlying affine maps does not quite make clear.

So from me:
+1 on nicely named ops for canonical instances of the primitives
+1 on a mathematically inspired spec language, and I’m fine starting with tc-esque

On approach, it might help unblock things to follow the suggestion of generating the ODS. That would give you the most freedom to fiddle and get this right, potentially even having the generator itself in a more experimental area. Since the source of truth of this is basically going to be a file somewhere that a small number of people add specs to, it wouldn’t need to be build-system integrated in one go: if we had a binary off to the side that someone had to run to generate the ODS, that seems like a fine way to me to start. You’d want it more integrated eventually, but I’d bias towards iterating to get the definitions right for a bit before solving the integration question.

This sounds great and I think we should explore this further here. Some questions/comments.

  1. I assume the ‘spec’ is looked at to expand out the op from its named form to its fully operational form (where a region says what happens at the ‘point’ for the pointwise op for example, and there may be other attributes capturing things in other cases).

  2. This is more of a detail, but how does ‘spec’ refer to iterators? Where are these iterators defined? I think this is related to the support for regions in ODS, since these iterators are region args (always?) in the generic form.

  3. I assume we would also have enough information to go from the tensor type operand/result form to the memref type operand/result form. Such a conversion could be easily auto-generated? I think this has been one of the key missing parts in MLIR proper, postponed for too long since the very early days because there was no standard ML dialect.

  4. I didn’t understand why you state ‘roundtripping between the ‘named’ form and the ‘generic’ form’. Roundtripping the named form or the generic form will obviously be there, and the conversion from named to generic as well, but why is it a priority to go from the generic form to the named one, at least to start with? That may be a lot of matching.

  5. Lowering to loops, when desired, is typically desired from the memref form of those ops. I assume the plan is to also auto-generate these conversions? (e.g. conversion to affine for the ops that can be converted, given that affine-to-loops already exists.) In some cases there may be multiple lowering paths depending on which dialect one would like to go through (e.g. hardware that doesn’t want to go through even the loop or affine dialect right away, but rather a slightly higher-level dialect on the way, and perhaps partly or later through 1-d loops).

This is a great point, I was also fiddling to try and better surface @andydr’s Teckyl work.

To be clear, do we think it is an acceptable point to just copy/paste auto-generated ODS into MLIR?

Ideally, we’d have a way to integrate this binary but I don’t think it is acceptable to call git submodule as part of the ODS generation (plus it would break on Win etc).

Creating a separate binary that is not upstreamed to MLIR core for now, and pasting the generated ODS sounds like a low-friction alternative to me.

I can’t answer that exclusively, but I would be clear about where you want the end state to be (integrated, with a parser, upstreamed) and that this is a pragmatic step to get there. I would evaluate this based on: if the generator were never upstreamed, could a human still make sense of the generated code and maintain it? Having seen the attempts at this, I believe the answer is yes.

Thanks @nicolasvasilache for the RFC. The idea of capturing the semantics of the operation from the “spec” sounds like a great way to build a generic DSL infrastructure. Some questions, related to (or duplicates of) the questions above.

  1. What is the grammar of the spec? I am thinking the spec could also be used to extract the indexing maps for the inputs/outputs, similar to what was probably done in TC. This would effectively bring TC/Halide into MLIR AFAICS. Could you also use the spec to capture iterator types (again, something that probably happens in TC and Halide)?

  2. One of the things I like about Linalg is the small surface area of ops that capture most of the semantics. These are “named” ops that are a layering over generic/indexed-generic ops (AFAIK, apologies if this mental model is wrong). What are the guidelines for adding new named ops? There is an overhead to having named ops (compilation of additional Tablegen-generated methods that results in increased symbol counts in the final library/binary, compilation overhead, etc.). For pw_add and pw_mul, it seems using a named op is overkill (they are adequately captured by generic ops without too much loss of abstraction). So what is the guidance on when a named op would be useful? I see how having a named op for matmul/conv/pooling etc. is useful, but I have some concerns about the increased surface area of ops.
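
For reference, a minimal sketch of pw_add written directly with linalg.generic; the syntax is approximate and the function/map names are hypothetical:

// Illustrative sketch only; linalg.generic syntax is approximate.
#id = affine_map<(i) -> (i)>

func @pw_add_as_generic(%a: memref<?xf32>, %b: memref<?xf32>, %c: memref<?xf32>) {
  linalg.generic {indexing_maps = [#id, #id, #id],
                  iterator_types = ["parallel"]}
      ins(%a, %b : memref<?xf32>, memref<?xf32>)
      outs(%c : memref<?xf32>) {
    ^bb0(%lhs: f32, %rhs: f32, %out: f32):
      %0 = addf %lhs, %rhs : f32
      linalg.yield %0 : f32
  }
  return
}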

To this point, there may be two dialects: one for the generic ops and one for the canonical named ops. That doesn’t answer your question directly, but having the separation may keep the standalone utility of the former “honest”.

As discussed in Do we need all ops to take tensors? there’s actually value in having semantically charged “add” and “mul” ops.

I like that it covers a large number of ops in a very low-entropy way. Since there isn’t that much essential complexity (which is a tremendous statement to make considering how much functionality this covers :slight_smile: ), it’s easy to be very agile with it and evolve it.

Folks seem generally supportive, so let me ask some specific technical questions:

  • how will this handle unranked?
  • how will this handle broadcasting?

Could you be more specific about broadcasting? I have seen this come up many times in discussions around Linalg. The Linalg generic op (and indexed generic) already supports broadcasting. So what are you looking for here?

for example, suppose you have:

%2 = pw_add(%0, %1) : (tensor<?x?xf32>, tensor<?xf32>) -> tensor<?x?xf32>

what are the semantics? Do we disallow this and require separate broadcasting ops? Do we implicitly do numpy-style “right-aligned” broadcast? Do we have an explicit attribute mapping the dimensions between the operands to allow for non-“right-aligned” broadcasting?

I understand that in the generic form this is not a problem at all and it all works out in a simple and principled way. The question is what the named form concretely looks like.
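
For reference, a sketch of how the generic form can spell out the 2-d + 1-d case above, with the broadcast made explicit in the indexing maps; %init is a hypothetical output tensor and the syntax is approximate:

// Illustrative sketch only; tensor-result syntax is approximate.
#map_lhs = affine_map<(i, j) -> (i, j)>
#map_rhs = affine_map<(i, j) -> (j)>      // broadcast %1 along i
#map_out = affine_map<(i, j) -> (i, j)>

%2 = linalg.generic {indexing_maps = [#map_lhs, #map_rhs, #map_out],
                     iterator_types = ["parallel", "parallel"]}
    ins(%0, %1 : tensor<?x?xf32>, tensor<?xf32>)
    outs(%init : tensor<?x?xf32>) {
  ^bb0(%l: f32, %r: f32, %o: f32):
    %s = addf %l, %r : f32
    linalg.yield %s : f32
} -> tensor<?x?xf32>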

+1.
Writing a parser and an upstream replacement for the tool is the kind of thing you can then add to the list of Open Projects and maybe get GSoC students to work on.

Also if we have:

%2 = pw_add(%0, %1) : (tensor<?xf32>, tensor<?xf32>) -> tensor<?xf32>

Do you allow the dynamic values of the input shapes to differ as long as they are broadcast-compatible? (assuming unit dimensions can be broadcast)