Notes from the MLIR Upstream Round Table @ EuroLLVM 2024

The two main topics discussed at the round table were cost models / target descriptors and named operations in Linalg. We ran out of time (1h) and had to stop.

The examples below are not meant to be accurate; they only sketch the idea of the design. The actual implementation may end up very different and will be discussed in the respective PRs.

Cost model / Target Descriptor

Reasonable consensus that we need a generic way to represent multiple targets in MLIR and that cost models need to be the interface between those target descriptions and the information that the passes need. Our PR goes in that direction, so it’s not a big departure.

The main delta, discussed after the round table, was how to represent that in MLIR. The PR itself uses the MLIRContext, which is global state, but after speaking with @mehdi_amini and @jpienaar, there is a much better way to do it: some TargetQueryAttribute on the module. This would not be a dictionary (a map of key/value pairs) on the IR itself, but an identifier for how to acquire the information.

Composition can be made via lists of attributes, and can be used to override target descriptions (e.g. TTI, JSON file, command-line options), to hold multiple targets (e.g. CPU, GPU, XPU), etc. All of that can be encoded into different query attributes and query attribute lists.

Example:

module {
  // This creates an LLVM TTI using the X86 target and the rest as arguments
  target = #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx']
}

module {
  // This parses a file and creates a run-time target (API TBD)
  target = #target.json['my-device.json']
  ...
}

module {
  // This allows two targets in the same IR, indexed by position
  target = #target.list[
               #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
               #target.json['my-device.json']
  ]

  // This is a CPU function
  func.func cpu_func(...) #target.id['0'] {
    ...
  }

  // This is an XPU function
  func.func xpu_func(...) #target.id['1'] {
    ...
  }
}

module {
  // This allows two targets in the same IR, indexed by string
  target = #target.dict[
               #target.id['CPU'],
               #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
               #target.id['XPU'],
               #target.json['my-device.json']
  ]

  // This is a CPU function
  func.func cpu_func(...) #target.id['CPU'] {
    ...
  }

  // This is an XPU function
  func.func xpu_func(...) #target.id['XPU'] {
    ...
  }
}

module {
  // This overrides the TTI info with JSON info
  target = #target.override[
               // This is the baseline
               #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
               // This overrides TTI on intersection, adds the rest
               #target.json['spr-special.json']
  ]
  ...
}

None of this is implemented yet, and the PR won’t have all of it (just JSON for now), but this is the idea.

Named Operations in Linalg

The main discussions revolved around semantics. The consensus seems to be that we want strong, documented semantics, and to encode the lowering/generalization to match those semantics, not the other way around.

Currently, the named ops have forced generalizations and that’s what we use for semantics. Some of them will change; all of them will be documented on the website.

This is important so that front-ends have strong expectations and lower their implicit behaviour into explicit Linalg. Given that not all of them share the same expectations, we need to agree on a common language.

The main semantic agreements are:

  • Named ops will not have implicit casts (type, shape)
  • Element-wise ops will require same types for input/output
  • Matmul/conv will keep their existing, appropriate type restrictions
  • Broadcast will have to use linalg.broadcast on the appropriate operand
  • Type cast / quantization should use appropriate quantization strategies

None of this changes the linalg.generic operation, which continues to represent all of those casts as affine maps and/or arithmetic casts inside the region block.
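
For illustration (the shapes, types and value names here are made up), a generic can keep expressing both a broadcast, via its indexing map, and a cast, inside its region:

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%vec : tensor<4xf16>)
    outs(%acc : tensor<10x4xf32>) {
  ^bb0(%in : f16, %out : f32):
    // The map broadcasts %vec along d0; the cast is spelled out in the body.
    %ext = arith.extf %in : f16 to f32
    %sum = arith.addf %ext, %out : f32
    linalg.yield %sum : f32
} -> tensor<10x4xf32>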

Another important discussion was surrounding a grouping op. Currently we have scf.execute_region, which can already group based on the idea of a “target” or “thread block”, and it could be used with the above discussion of target descriptors (for some operations in a region, not all).

But that doesn’t translate to tiling and fusion opportunities. When using named ops, being able to fuse ops at multiple nest levels, then tile them (as a group), then fuse again is very powerful. We’ll need guarantees about what can be put into those regions (thus we can’t use scf.execute_region), for example only ops that implement the TilingInterface, or something similar.
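
As a purely hypothetical sketch (the op name, syntax and verifier are placeholders, not a concrete proposal), such a grouping region could look roughly like this, with only TilingInterface ops admitted inside:

  // Hypothetical grouping op: everything inside implements TilingInterface,
  // so the whole region can be tiled and fused as one unit.
  %res = "linalg.ext.group"(%x, %bias, %init) ({
    %b = linalg.broadcast ins(%bias : tensor<4xf32>)
                          outs(%init : tensor<10x4xf32>) dimensions = [0]
    %s = linalg.add ins(%x, %b : tensor<10x4xf32>, tensor<10x4xf32>)
                    outs(%init : tensor<10x4xf32>) -> tensor<10x4xf32>
    "linalg.ext.yield"(%s) : (tensor<10x4xf32>) -> ()
  }) : (tensor<10x4xf32>, tensor<4xf32>, tensor<10x4xf32>) -> tensor<10x4xf32>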

A good example in the discussion was linalg.softmax. How we lower it will define how we tile & fuse. For example, we can teach reduction/broadcast to be fusable, or we can split the lowering into three groups: pre-reduction, reduction/broadcast, and post-broadcast, so that we tile and fuse the first group with the producer and the last group with the consumer, and do special lowering for the middle one.

Other topics

There were other topics that were not discussed at the round table but are also important; we should work on them soon. Most of those I discussed separately throughout the conference and can report on later.

  • ML-guided optimization in sync with cost models, target descriptors, etc. (this was a bigger topic at CGO than at EuroLLVM).
  • Packing for CPU extensions (we want to upstream our code in tpp-mlir)
  • Linalg to GPU lowering upstream (we want to upstream our code in tpp-mlir)
  • Pipeline composition, deps, canon, ordering, multiple downstreams (we want to expose compilers like IREE to various upstream/downstream passes)
  • Temporary buffers, memory address space, shared memory, arena allocators, stack scope and other memref/vector allocation techniques to expose software pipelining across multiple threads, where not all of them do the same thing (@matthias-springer ?)
  • Vector layout for GPUs and CPU extensions (this was addressed by @Groverkss at EuroLLVM)
  • Transform schedules, multi-versioning (being addressed by @aniragil [PR] @martin.luecke Rolf Morel)

@ftynse @nicolasvasilache @stellaraccident

[Edited to make clear we don’t want to use the current quant dialect, but some explicit quantization semantics, be it its own dialect or inside linalg]
[Edited to add multiple types of target composition]


Thanks for the notes. Helpful for those of us who didn’t make it.

Could you give some context to what you mean by “implicit casts” here? I ask because I would already have considered that the named ops that exist abide by this, and I’m wondering whether you are reinforcing a position, trying to define a new one, or see some bugs in specific ops that folks would like to see straightened out?

As an example, it is common in frontends for the accumulator type to be implicit, but in lowering to linalg, we have to decide on a concrete accumulator type, and that is encoded into the op (so that, for example, a torch.convolution on f16 is expressed in linalg for most backends as a convolution from f16 → f32, followed by a cast to f16). It’s a pita (because this is often left quite under-defined by the frontends), but it is correct and I don’t see a better way than to make sure it is nailed down like this going into linalg.

I’d be curious to hear more about this. From my perspective, the quant dialect is basically abandonware at this point. It hasn’t had any substantial patches landed against it in years, and it is woefully incomplete for representing modern schemes. As the de facto code owner of it, I think that changing that would require a careful evaluation prior to saying it should be the basis for anything new.

I’ll take this to mean that “we feel there is some missing type casting infra and it seems like that maybe should exist in the charter of something called the quant dialect”. Tell me if I got the thought process right or if there was a more concrete analysis :slight_smile:

Great questions!

We were picking one and sticking with it. No implicit broadcast, no implicit type cast, no implicit reduction.

The original named ops in Linalg said they implemented the “Numpy Broadcast Semantics” but that’s not directly translatable to MLIR because of the shape lowering and affine maps.

A linalg.generic can use a 1D vector for broadcasting into any dimension (row, col, etc.) of an ND tensor, and IIRC some front-ends do. But once you move to named ops (without an explicit affine map), it becomes ambiguous.

  // Broadcast %1 10x before adding
  %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 
  // Broadcast %1 4x before adding
  %0 = linalg.add %1, %2 : tensor<10x1xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 
  // Broadcast %1 4x before adding?
  %0 = linalg.add %1, %2 : tensor<10xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 

  // What now?
  %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<4x4xf32> -> tensor<4x4xf32> 

If every broadcast is explicit, via linalg.broadcast, then all named ops need compatible shapes (same ones in element-wise, special ones for matmul, conv).

Now you go from matching all possible representations of affine maps, checking iterator types and region bodies for semantics, to just matching a def-use chain of two ops.

Simpler, explicit, well defined.
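
For instance, the first example above would become a def-use chain of two named ops (the DPS operands %empty and %init are illustrative):

  // Explicit broadcast first, then a same-shape add; a matcher only needs to
  // walk this def-use chain of two ops.
  %b = linalg.broadcast ins(%1 : tensor<4xf32>)
                        outs(%empty : tensor<10x4xf32>) dimensions = [0]
  %0 = linalg.add ins(%b, %2 : tensor<10x4xf32>, tensor<10x4xf32>)
                  outs(%init : tensor<10x4xf32>) -> tensor<10x4xf32>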

Exactly! In a generic, you can always lower the accumulator type (casts inside the region body), but in a named op, you can’t. Moreover, “accumulator type” can mean different things, for example actually using a downcast later, or maybe the hardware op already does that (so IR without it cannot be lowered).

Linalg’s type cast semantics are broken in that respect: they assume you always want to cast the input type to the output type before the operation, which is rarely what you want. They also don’t allow you to represent a “pure” FP16 op that just happens to accumulate in FP32 internally.

So, we want to remove auto-casting from Linalg named ops altogether and make that an explicit feature.
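
As a rough sketch of what “explicit” could look like, assuming an f16 matmul that accumulates in f32 (shapes and names are illustrative, not a settled design):

  // The casts are separate ops instead of being implied by mixed operand/result types.
  %lhs32 = arith.extf %lhs : tensor<8x16xf16> to tensor<8x16xf32>
  %rhs32 = arith.extf %rhs : tensor<16x4xf16> to tensor<16x4xf32>
  %acc = linalg.matmul ins(%lhs32, %rhs32 : tensor<8x16xf32>, tensor<16x4xf32>)
                       outs(%init : tensor<8x4xf32>) -> tensor<8x4xf32>
  %res = arith.truncf %acc : tensor<8x4xf32> to tensor<8x4xf16>

Note that this chain still materializes the f32 result, i.e. it does not express the “pure” FP16 op with internal FP32 accumulation mentioned above.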

Correct. Apologies for the misdirection.

We don’t use quant, never used it, never cared about it. But the more we discuss type casting, the more I realise this is a quantization problem (mostly because of the accumulation issue that isn’t always clear with just C++-style casts).

We can get away with this just by having explicit type casts plus some annotation for the accumulator type, with the default being the same as the input/output types.

But I’d be remiss if I didn’t mention quantization, as there may be patterns I’m missing in the jungle of custom hardware where that strategy would fail.

This is quite similar to how DLTI/data layout was implemented. Admittedly, the current implementation is over-engineered; I’d be in favor of cleaning it up and using it for more target information.

An additional note on this is that we have been historically opposed to semantics-by-lowering because it overspecifies semantics. For example, the fact that Linalg named ops contain/lower to Arith operations without fastmath flags may mean that their semantics prevents fastmath.

Many years ago, we had a series of discussions on nested linalg.generic to allow outer tiling/fusion. Maybe it’s time to revive those discussions.


Ok, I think this is what is tripping me up: linalg.broadcast is one way to get a compatible shape, but there are many others when considering all of the ways that indexing maps can be composed. But the principle makes sense.

Yeah, we’ve discussed this internally: certain hardware intrinsics in the current regime can only be targeted by fusion. But that isn’t going away any time soon, regardless of whether the encoding of some of the ops is tweaked to make f16 a bit more natural (which is the case where this most often comes up in the wild)… I don’t see a general solution in the cards, but I struggle to say more without getting into very detailed specifics. We just decided to embrace the fusion.

If that means more explicit type parameters at the cast points, +1. I believe I had argued that many years ago for mixed precision ops and was then convinced that the current way was a reasonable use of magic.

Welcome - word of advice: I’ve never found a problem to become less obtuse by promoting it to a “quantization problem” :slight_smile: quantization is a needlessly high level concept (well, family of concepts) that never in nature survives to a low level realization in the hardware. Usually taking a bottom up view of a spread of ways the concepts are realized gives the symmetries you are after. Linalg is fundamentally about encoding those lower level things in a way that is unambiguous.

I think where that breaks is when you craft something that yields a gap between the natural level of expression that linalg provides and what a frontend can produce with known information. Folks have often filled that gap with additional high level representations to provide a half step down that makes things match up. But for codegen, I’m yet to see any of those things be relevant, simplify the problem, or be judged well by history.

But I buy that you are latching onto some missing expressiveness in the way that mixed precision operations are composed.

Probably not helpful ones. Although if you look at all of the bespoke software riding on top and try to target that, you’ll see all kinds of things that run the gamut. The ones I know about call for dedicated ops that do what the hardware needs vs magic or complicated representations that try to generalize.

Whew! For a second there, the entire world shifted on me in a funny way :slight_smile:

Ah, yes! In theory, anything that gets into the right shape is “valid” (even other non-linalg ops that auto broadcast), but in practice you can only def-use-match against a fixed set of ops (closed world). That’s the only reason I named linalg.broadcast in there, not because it’s the only one that can do it.

We could always add interfaces (open world), but that’s a separate discussion.

One interpretation is that accumulation type is micro-architecture specific and therefore it’s an “implementation detail”.

For example, one could have a vanilla linalg.add FP16, FP16 -> FP16 and still accumulate in FP32, as long as the op still outputs FP16. But to feed a temporary FP32 from that op into another, I’d have to use an explicit type cast from that result to FP32 before the next op. This is very ugly, but it’s how we did SelectionDAG hacks.

Another interpretation is that we want to carry even more information. As @ftynse says, there are also fast-math flags that we may want to propagate, and accumulation type can be another property of the op. But in the latter case, would the add op output the larger type? Wouldn’t that violate the same-type property?

That’s why I mention “quantization”. Now I realize I’m misusing the term. My point was just “how do we encode mixed-type awareness into operations without changing their original semantics?”.
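
If accumulation type did become a property of the op, one hypothetical encoding (the attribute name is invented for illustration) might be:

  // Externally still f16 -> f16, so the same-type rule holds; the internal
  // accumulation type is carried as an attribute.
  %0 = linalg.add {acc_type = f32}
       ins(%a, %b : tensor<64xf16>, tensor<64xf16>)
       outs(%init : tensor<64xf16>) -> tensor<64xf16>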

:heart:

It’s funny what 5 misplaced letters do to your brain when they come from a tangent.

I’ll stop using that term and just cope when people say “oh, you mean quantization?”. Far less harmful. :slight_smile:

Depending on audience, I do sometimes close down my internal monologue and say “quantization”. It’s just super imprecise and means something different to everyone. And I have battle scars… Which don’t need to be anyone else’s problem. I can deal :slight_smile:

… Except that it is actually defined concretely in torch. Our priority when lowering torch has been a faithful encoding of the op semantics, such as they are. A backend is of course free to do any number of… Hazardous… Things to the numerics, but I think it is important that the default representation doesn’t abstract frontend constructs that are concrete.

Agreed!

The main reason why we wanted strong encoded semantics (and not weak “lowering” semantics) is to be able to faithfully lower from different front-end semantics without having to second-guess.


Thanks for the summary @rengolin and for organising the roundtable!

+1 Here’s one specific case to support this:

-Andrzej

Thanks for summarizing (and organizing) that round table, @rengolin !

Some more details on the named ops topic: one question was whether we could retain the semantics/idiom of named Linalg ops that lower into a graph of linalg.generic ops, e.g. linalg.softmax.

One possible solution might be to encapsulate the lowered generic ops within the body of some isolated-from-above op, making it hard for e.g. cse to break the idiom these ops form while facilitating moving/lowering/optimizing them as a whole (kind of a local, ad-hoc alternative for function outlining). Here’s how such an op might look if added to the scf dialect.

Other uses for ad-hoc grouping of ops might be limiting transformation scopes (as proposed by the PR Renato mentioned) or capturing the result of partitioning.

I am a bit late to this thread. Thanks @rengolin for the detailed notes.

In general I am -1 on almost all of these. I will describe more about the “broadcasting behavior” and “type cast” later on, but the main premise for all of this is that Linalg operations are essentially any computation that can be represented using “perfectly nested loops”. Named ops are useful, but Linalg is a transformation-focused dialect, and any restriction that tries to limit the “perfectly nested loops” aspect of this seems artificial to me. For the most part, a named op is just a linalg.generic with fixed indexing maps, iterator types and region definition.

Broadcasting behavior

In general, broadcasting gets a bit of a bad rap (with good reason), but IMO not all broadcasting behavior is bad. In certain restricted categories, having some broadcasting behavior can lead to a better representation of the program. I’ll use the examples above to walk through which ones I think are well defined under these rules of thumb for elementwise parallel operations.

  1. All operations need to be in destination-passing style. If any named op is not (I actually don’t know how that is legal for the Linalg interface), it should be; that is a requirement IMO.
  2. The indexing maps for the outputs of such ops need to be the identity.
  3. All operands’ indexing maps need to be a minor identity (using this term loosely: no transposition, but missing dimensions are allowed).
  4. The broadcasting behavior is essentially that all operands not of the same rank as the output get broadcast to the output shape.
 // Broadcast %1 10x before adding
 %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<10x4xf32> -> tensor<10x4xf32>

This IMO is perfectly well defined (apart from the fact that it should be in destination-passing style and it isn’t), since it is essentially this:

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins( %1, %2 : tensor<4xf32>, tensor<10x4xf32>)
    outs(%3 : tensor<10x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<10x4xf32>

There is no ambiguity here. I think one issue is that these ops are trying to infer broadcast behavior based on the shapes. Instead, the broadcasting behavior should be based on the indexing maps.

%0 = linalg.add %1, %2 : tensor<10x1xf32>, tensor<10x4xf32> -> tensor<10x4xf32>

This is numpy-style size-1 based broadcasting. This should just be illegal.

// Broadcast %1 4x before adding?
  %0 = linalg.add %1, %2 : tensor<10xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 

No issue here either. It is basically

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %2 : tensor<10xf32>, tensor<10x4xf32>)
    outs(%3 : tensor<10x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<10x4xf32>

Now coming to this

%0 = linalg.add %1, %2 : tensor<4xf32>, tensor<4x4xf32> -> tensor<4x4xf32> 

If you are trying to “infer” based on shapes, this gets stuck… Instead, the named ops should infer broadcasting behavior based on indexing maps; then there will be no ambiguity. For this specific case, actually, both

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins( %1, %2 : tensor<4xf32>, tensor<4x4xf32>)
    outs(%3 : tensor<4x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<4x4xf32>

and

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins( %1, %2 : tensor<4xf32>, tensor<4x4xf32>)
    outs(%3 : tensor<4x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<4x4xf32>

are equivalent. One is just the loop interchanged version of the other. So you can just pick one.

I think one addition might be to make the indexing maps explicit even on named ops, to represent broadcasting behavior precisely. Then there is no ambiguity at all.
Doing this for the matmul and convolution ops could also be useful and could help collapse the tens of matmul and convolution variants into a handful of ops with different indexing maps.
This also means we might be at the end of the road for the Python OpDSL → YAML → TableGen definition of named ops. Instead, we would add a set of named ops explicitly in TableGen, which gives us more freedom to make sure the ops carry enough semantic information to keep things as local as possible, as well as unambiguous when generalizing to linalg.generic.
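
As a hypothetical sketch of that suggestion (named ops do not carry this attribute today), the ambiguous case above could state its maps directly:

  // Explicit maps pick the broadcast dimension, so nothing is inferred from shapes.
  %0 = linalg.add {indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                                    affine_map<(d0, d1) -> (d0, d1)>,
                                    affine_map<(d0, d1) -> (d0, d1)>]}
       ins(%1, %2 : tensor<4xf32>, tensor<4x4xf32>)
       outs(%3 : tensor<4x4xf32>) -> tensor<4x4xf32>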

Type mismatches

AFAIU the only type issue that happens with “named elementwise ops” is understanding how to extend the input type to the output type. Because Linalg ops fundamentally work on signless types, the semantics of the element type conversion is carried in the body of the Linalg op. For named ops we just need a way to represent how to convert from the operand type to the result type, with all the computation happening in the result type. (You could also relax this and have an “operation type”, where all the operands are converted to this type, the operation is done in that type, and the result is then converted to the result type.) All of this could just be represented as an enum per operand/result. In general it is useful to allow mixed types in a single operation. The whole premise of Linalg is that you want to keep all information needed to “transform” an operation local. This is more a guiding principle than a red line, but we should try to make it easy to keep things as local as possible.

To make the above more concrete, you could add this to an elementwise operation

%0 = linalg.add {
    extension_types = [SIGN_EXTEND_I32, UNSIGNED_EXTEND_I32],
    ...}
    ins(%1, %2 : tensor<..xi4>, tensor<..xi4>)

which essentially means

%0 = linalg.generic {...}
    ins(%1, %2 : tensor<..xi4>, tensor<..xi4>)
  ^bb0(%b0: i4, %b1 : i4,...):
    %3 = arith.extsi %b0 : i4 to i32
    %4 = arith.extui %b1 : i4 to i32
}

This essentially boils down to how you can generate the body of the linalg.generic given the named op.

I think it would be useful to even adapt the linalg.matmul and linalg.conv named ops to have explicitly stated semantics for the conversion from input type to accumulator type and from accumulator type to result type. For now we are essentially using two separate ops and relying heavily on fusion to fold things back for us, which is doable, but it would be easier if we didn’t have to. The flip side of this is that quantization schemes are getting inherently more complicated. We might not be future-proof with named ops to handle these evolutions and may have to build up interfaces that can query information from a generic op directly (like which dimensions are contracted, or which are batch dimensions) rather than rely on named ops to handle things for us.

That’s what we mean by implicit behaviour and this is what we all agreed should not happen. If my comments led you astray, it’s my reporting, not the consensus.

We don’t want to deal with affine maps once we go into non-perfectly nested territory, so this needs explicit ops, which is a perfectly valid usage of Linalg and one we have been agreeing with for the past year.

It does not interfere with the generics part or the perfectly-generalizable ops either. We’re just expanding the ability to transform linalg code for non-perfectly nested ops.

Creating a new dialect for that is a non-starter. We discussed this last year and the consensus then and direction since have been to improve linalg, and that’s what we have been doing. It would be great if IREE could participate more actively in this discussion.

Nope. The main problem is accumulator type and controlling the casts (pre/post) to avoid breaking precision expectations. Your proposal does not address that. I’m open to suggestions, but we need to start thinking beyond generics, affine maps and perfectly nested loops. There are a lot of fusion opportunities that do not work at all in those cases.

I don’t think I follow this. We were talking about representing elementwise add and similar operations, not imperfectly nested loops. I don’t think having Linalg ops represent non-perfectly nested loop computation is a good usage of Linalg. Those are represented by loops + Linalg ops.

That seems like something that shouldn’t be in Linalg. This seems like a new dialect, or really a new interface: something that builds upon the DecomposeAggregateOpInterface (just as all Linalg ops are effectively implementations of the Linalg interface).

In what sense? I suggested adding an array attribute of enums that explicitly state what the truncation/extension expectations are to take ambiguity out of it. Not wedded to that approach, but more a suggestion to get the ball rolling.

Overall, what was suggested at the top is no different from TOSA- or Torch-like dialects. This would just become one more of those if we take that direction. For Linalg ops we should aim for a maximally compact representation within what can be handled by transformations.

We have discussed this for the past year, since my first RFC. There are many other threads in the forum that touch on this, and the consensus is strong on the proposals at the round table. I wish you had participated from the beginning and voiced your concerns (or at least resolved your questions), so we wouldn’t have to do this all over again.

That is exactly the proposal. Loops + named ops.

All your other comments seem to come from the misunderstanding that we’re trying to extend generics or named ops to non-perfectly nested loops. We’re not.

As I said above, we’ve gone through all these options throughout the last year. Attributes don’t scale. Named op variations don’t scale. Look at the number of convolution and matmul variants that are in Linalg right now.

The only two things that scale are:

  1. Linalg generic with arbitrary code in the perfectly-nested inner loop body.
  2. Loops + strict semantics named ops for the cases a generic can’t represent.

For 2 above, having attributes or variations won’t scale. Everyone will want to introduce their own variation. And in the end, they can all be represented as a graph of simpler named ops that can easily be matched with def-use matchers.

We already have tiling and fusing for linalg generics and that works really well. That’s why we generalize named ops in our compiler. But we’d like to add tiling from named ops (or even DAGs of named ops) into tiled named ops, so that we can more easily match them to micro-kernels or special hardware ops.

Today, we take a named op, generalize it when tiling, then re-specialize the tiled generics to match. It works really well, so the consensus (at the round table, and before that in the multiple threads in this forum) was that we should just complete the named op catalogue, write exact semantics, validate their lowering, and tile them into named ops still.
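
As a rough sketch of “tile named ops into named ops” (sizes and names illustrative, roughly the shape of IR that tiling already produces), the tiled body still contains a linalg.matmul rather than a generic, so it can be matched against a micro-kernel or hardware op:

  %res = scf.forall (%i, %j) = (0, 0) to (128, 128) step (32, 32)
         shared_outs(%out = %init) -> (tensor<128x128xf32>) {
    %lhs_t = tensor.extract_slice %lhs[%i, 0] [32, 64] [1, 1]
             : tensor<128x64xf32> to tensor<32x64xf32>
    %rhs_t = tensor.extract_slice %rhs[0, %j] [64, 32] [1, 1]
             : tensor<64x128xf32> to tensor<64x32xf32>
    %out_t = tensor.extract_slice %out[%i, %j] [32, 32] [1, 1]
             : tensor<128x128xf32> to tensor<32x32xf32>
    // Still a named op on the tile, not a linalg.generic.
    %mm = linalg.matmul ins(%lhs_t, %rhs_t : tensor<32x64xf32>, tensor<64x32xf32>)
                        outs(%out_t : tensor<32x32xf32>) -> tensor<32x32xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %mm into %out[%i, %j] [32, 32] [1, 1]
          : tensor<32x32xf32> into tensor<128x128xf32>
    }
  }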

We should only generalize in three cases:

  1. The case gets too complex and a generic becomes a better description of the problem.
  2. I don’t know how to lower a named op (no HW/MK available) so I lower to loops.
  3. I know how to vectorize generics better than named ops.

Note that none of that stops anyone from using generics from the beginning and never touching named ops. It’s already possible to represent everything in a set of loops and generics. It’s just SO much simpler to match a DAG of named ops than a DAG of generics.

You really should read the original proposal’s thread; all of this has been discussed there. None of those dialects have LLVM governance; Linalg does. We created a new dialect (TPP), but it ended up just being a clone of Linalg (down to the interfaces we implement).

Torch/HLO/ONNX come from high-level graphs which have named ops; they convert to generics, and then we convert to named ops. We just want to be able to convert directly to named Linalg ops, work with them, and then lower them to hardware without having to pass through generics (if we don’t really need to).