Notes from the MLIR Upstream Round Table @ EuroLLVM 2024

The two main topics discussed at the round table were cost models / target descriptors and named operations in Linalg. We ran out of time (1h) and had to stop.

The examples below are not meant to be accurate; they only sketch the idea of the design. The actual implementation may end up very different and will be discussed in the respective PRs.

Cost model / Target Descriptor

Reasonable consensus that we need a generic way to represent multiple targets in MLIR and that cost models need to be the interface between those target descriptions and the information that the passes need. Our PR goes in that direction, so it’s not a big departure.

The main delta, discussed after the round table, was how to represent that in MLIR. The PR itself uses the MLIRContext, which is global state, but after speaking with @mehdi_amini and @jpienaar, there is a much better way to do it: some TargetQueryAttribute on the module. This would not be a dictionary (a map of key/value pairs) on the IR itself, but an identifier for how to acquire the information.

Composition can be made via lists of attributes, and can be used to override target descriptions (e.g. TTI, JSON file, command-line options), to hold multiple targets (e.g. CPU, GPU, XPU), etc. All of that can be encoded into different query attributes and query attribute lists.

Example:

module {
  // This creates an LLVM TTI using the X86 target and the rest as arguments
  target = #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx']
}

module {
  // This parses a file and creates a run-time target (API TBD)
  target = #target.json['my-device.json']
  ...
}

module {
  // This allows two targets in the same IR, indexed by position
  target = #target.list[
               #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
               #target.json['my-device.json']
  ]

  // This is a CPU function
  func.func cpu_func(...) #target.id['0'] {
    ...
  }

  // This is an XPU function
  func.func xpu_func(...) #target.id['1'] {
    ...
  }
}

module {
  // This allows two targets in the same IR, indexed by string
  target = #target.dict[
               #target.id['CPU'],
               #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
               #target.id['XPU'],
               #target.json['my-device.json']
  ]

  // This is a CPU function
  func.func cpu_func(...) #target.id['CPU'] {
    ...
  }

  // This is an XPU function
  func.func xpu_func(...) #target.id['XPU'] {
    ...
  }
}

module {
  // This overrides the TTI info with JSON info
  target = #target.override[
               // This is the baseline
               #target.tti['x86_64-linux-gnueabi','sapphirerapids','amx'],
               // This overrides TTI on intersection, adds the rest
               #target.json['spr-special.json']
  ]
  ...
}

None of this is implemented yet, and the PR won’t have all of it (just JSON for now), but this is the idea.

Named Operations in Linalg

The main discussions revolved around semantics. The consensus seems to be that we want strong, documented semantics, and to encode the lowering/generalization to match those semantics, not the other way around.

Currently, the named ops have forced generalizations and that’s what we use for semantics. Some of them will change; all of them will be documented on the website.

This is important so that front-ends have strong expectations and lower their implicit behaviour into explicit Linalg. Given that not all of them share the same expectations, we need to agree on a common language.

The main semantic agreements are:

  • Named ops will not have implicit casts (type, shape)
  • Element-wise ops will require same types for input/output
  • Matmul/conv will keep their existing, appropriate type restrictions
  • Broadcast will have to use linalg.broadcast on the appropriate operand
  • Type cast / quantization should use appropriate quantization strategies

None of this changes the linalg.generic operation, which continues to represent all of those casts as affine maps and/or arithmetic casts inside the region block.
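
For illustration (the shapes, types and value names here are made up), a generic can keep expressing both a broadcast, via its indexing map, and a cast, inside its region:

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%vec : tensor<4xf16>)
    outs(%acc : tensor<10x4xf32>) {
  ^bb0(%in : f16, %out : f32):
    // The map broadcasts %vec along d0; the cast is spelled out in the body.
    %ext = arith.extf %in : f16 to f32
    %sum = arith.addf %ext, %out : f32
    linalg.yield %sum : f32
} -> tensor<10x4xf32>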

Another important discussion was surrounding a grouping op. Currently we have scf.execute_region, which can already group based on the idea of a “target” or “thread block”, and it could be used with the above discussion of target descriptors (for some operations in a region, not all).

But that doesn’t translate to tiling and fusion opportunities. When using named ops, being able to fuse ops at multiple nest levels, then tile them (as a group), then fuse again is very powerful. We’ll need guarantees about what can be put into those regions (thus we can’t use scf.execute_region), for example only ops that implement the TilingInterface, or something similar.
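
As a purely hypothetical sketch (the op name, syntax and verifier are placeholders, not a concrete proposal), such a grouping region could look roughly like this, with only TilingInterface ops admitted inside:

  // Hypothetical grouping op: everything inside implements TilingInterface,
  // so the whole region can be tiled and fused as one unit.
  %res = "linalg.ext.group"(%x, %bias, %init) ({
    %b = linalg.broadcast ins(%bias : tensor<4xf32>)
                          outs(%init : tensor<10x4xf32>) dimensions = [0]
    %s = linalg.add ins(%x, %b : tensor<10x4xf32>, tensor<10x4xf32>)
                    outs(%init : tensor<10x4xf32>) -> tensor<10x4xf32>
    "linalg.ext.yield"(%s) : (tensor<10x4xf32>) -> ()
  }) : (tensor<10x4xf32>, tensor<4xf32>, tensor<10x4xf32>) -> tensor<10x4xf32>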

A good example in the discussion was linalg.softmax. How we lower it will define how we tile & fuse. For example, we can teach reduction/broadcast to be fusable, or we can split the lowering into three groups: pre-reduction, reduction/broadcast, and post-broadcast, so that we tile and fuse the first group with the producer and the last group with the consumer, and do special lowering for the middle one.

Other topics

There were other topics that were not discussed at the round table but are also important; we should work on them soon. Most of those I discussed separately throughout the conference and can report on later.

  • ML-guided optimization in sync with cost models, target descriptors, etc. (this was a bigger topic at CGO than at EuroLLVM).
  • Packing for CPU extensions (we want to upstream our code in tpp-mlir)
  • Linalg to GPU lowering upstream (we want to upstream our code in tpp-mlir)
  • Pipeline composition, deps, canon, ordering, multiple downstreams (we want to expose compilers like IREE to various upstream/downstream passes)
  • Temporary buffers, memory address space, shared memory, arena allocators, stack scope and other memref/vector allocation techniques to expose software pipelining across multiple threads, where not all of them do the same thing (@matthias-springer ?)
  • Vector layout for GPUs and CPU extensions (this was addressed by @Groverkss at EuroLLVM)
  • Transform schedules, multi-versioning (being addressed by @aniragil [PR] @martin.luecke Rolf Morel)

@ftynse @nicolasvasilache @stellaraccident

[Edited to make clear we don’t want to use the current quant dialect, but some explicit quantization semantics, be it its own dialect or inside linalg]
[Edited to add multiple types of target composition]


Thanks for the notes. Helpful for those of us who didn’t make it.

Could you give some context to what you mean by “implicit casts” here? I ask because I would already have considered that the named ops that exist abide by this, and I’m wondering whether you are reinforcing a position, trying to define a new one, or see some bugs in specific ops that folks would like to see straightened out?

As an example, it is common in frontends for the accumulator type to be implicit, but in lowering to linalg, we have to decide on a concrete accumulator type, and that is encoded into the op (so that, for example, a torch.convolution on f16 is expressed in linalg for most backends as a convolution from f16 → f32, followed by a cast to f16). It’s a pita (because this is often left quite under-defined by the frontends), but it is correct and I don’t see a better way than to make sure it is nailed down like this going into linalg.

I’d be curious to hear more about this. From my perspective, the quant dialect is basically abandonware at this point. It hasn’t had any substantial patches landed against it in years, and it is woefully incomplete for representing modern schemes. As the de facto code owner of it, I think that changing that would require a careful evaluation prior to saying it should be the basis for anything new.

I’ll take this to mean that “we feel there is some missing type casting infra and it seems like that maybe should exist in the charter of something called the quant dialect”. Tell me if I got the thought process right or if there was a more concrete analysis :slight_smile:

Great questions!

We were picking one and sticking with it. No implicit broadcast, no implicit type cast, no implicit reduction.

The original named ops in Linalg said they implemented the “Numpy Broadcast Semantics” but that’s not directly translatable to MLIR because of the shape lowering and affine maps.

A linalg.generic can use a 1D vector for broadcasting into any dimension (row, col, etc.) of an ND tensor, and IIRC some front-ends do. But once you move to named ops (without an explicit affine map), it becomes ambiguous.

  // Broadcast %1 10x before adding
  %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 
  // Broadcast %1 4x before adding
  %0 = linalg.add %1, %2 : tensor<10x1xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 
  // Broadcast %1 4x before adding?
  %0 = linalg.add %1, %2 : tensor<10xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 

  // What now?
  %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<4x4xf32> -> tensor<4x4xf32> 

If every broadcast is explicit, via linalg.broadcast, then all named ops need compatible shapes (same ones in element-wise, special ones for matmul, conv).

Now you go from matching all possible representations of affine maps, checking iterator types and region bodies for semantics, to just matching a def-use chain of two ops.

Simpler, explicit, well defined.
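
For instance, the first example above would become a def-use chain of two named ops (the DPS operands %empty and %init are illustrative):

  // Explicit broadcast first, then a same-shape add; a matcher only needs to
  // walk this def-use chain of two ops.
  %b = linalg.broadcast ins(%1 : tensor<4xf32>)
                        outs(%empty : tensor<10x4xf32>) dimensions = [0]
  %0 = linalg.add ins(%b, %2 : tensor<10x4xf32>, tensor<10x4xf32>)
                  outs(%init : tensor<10x4xf32>) -> tensor<10x4xf32>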

Exactly! In a generic, you can always lower the accumulator type (casts inside the region body), but in a named op, you can’t. Moreover, “accumulator type” can mean different things, for example actually using a downcast later, or maybe the hardware op already does that (so IR without it cannot be lowered).

Linalg’s type cast semantics are broken in that respect: they assume you always want to cast the input type to the output type before the operation, which is rarely what you want. They also don’t allow you to represent a “pure” FP16 op that just happens to accumulate in FP32 internally.

So, we want to remove auto-casting from Linalg named ops altogether and make that an explicit feature.
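
As a rough sketch of what “explicit” could look like, assuming an f16 matmul that accumulates in f32 (shapes and names are illustrative, not a settled design):

  // The casts are separate ops instead of being implied by mixed operand/result types.
  %lhs32 = arith.extf %lhs : tensor<8x16xf16> to tensor<8x16xf32>
  %rhs32 = arith.extf %rhs : tensor<16x4xf16> to tensor<16x4xf32>
  %acc = linalg.matmul ins(%lhs32, %rhs32 : tensor<8x16xf32>, tensor<16x4xf32>)
                       outs(%init : tensor<8x4xf32>) -> tensor<8x4xf32>
  %res = arith.truncf %acc : tensor<8x4xf32> to tensor<8x4xf16>

Note that this chain still materializes the f32 result, i.e. it does not express the “pure” FP16 op with internal FP32 accumulation mentioned above.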

Correct. Apologies for the misdirection.

We don’t use quant, never used it, never cared about it. But the more we discuss type casting, the more I realise this is a quantization problem (mostly because of the accumulation issue that isn’t always clear with just C++-style casts).

We can get away with this just by having explicit type casts plus some annotation for the accumulator type, with the default being the same as the input/output types.

But I’d be remiss if I didn’t mention quantization, as there may be patterns I’m missing in the jungle of custom hardware where that strategy would fail.

This is quite similar to how DLTI/data layout was implemented. Admittedly, the current implementation is over-engineered; I’d be in favor of cleaning it up and using it for more target information.

An additional note on this is that we have been historically opposed to semantics-by-lowering because it overspecifies semantics. For example, the fact that Linalg named ops contain/lower to Arith operations without fastmath flags may mean that their semantics prevents fastmath.

Many years ago, we had a series of discussions on nested linalg.generic to allow outer tiling/fusion. Maybe it’s time to revive those discussions.


Ok, I think this is what is tripping me up: linalg.broadcast is one way to get a compatible shape, but there are many others when considering all of the ways that indexing maps can be composed. But the principle makes sense.

Yeah, we’ve discussed this internally: certain hardware intrinsics in the current regime can only be targeted by fusion. But that isn’t going away any time soon, regardless of whether the encoding of some of the ops is tweaked to make f16 a bit more natural (which is the case where this most often comes up in the wild)… I don’t see a general solution in the cards, but I struggle to say more without getting into very detailed specifics. We just decided to embrace the fusion.

If that means more explicit type parameters at the cast points, +1. I believe I had argued that many years ago for mixed precision ops and was then convinced that the current way was a reasonable use of magic.

Welcome - word of advice: I’ve never found a problem to become less obtuse by promoting it to a “quantization problem” :slight_smile: quantization is a needlessly high level concept (well, family of concepts) that never in nature survives to a low level realization in the hardware. Usually taking a bottom up view of a spread of ways the concepts are realized gives the symmetries you are after. Linalg is fundamentally about encoding those lower level things in a way that is unambiguous.

I think where that breaks is when you craft something that yields a gap between the natural level of expression that linalg provides and what a frontend can produce with known information. Folks have often filled that gap with additional high level representations to provide a half step down that makes things match up. But for codegen, I’m yet to see any of those things be relevant, simplify the problem, or be judged well by history.

But I buy that you are latching onto some missing expressiveness in the way that mixed precision operations are composed.

Probably not helpful ones. Although if you look at all of the bespoke software riding on top and try to target that, you’ll see all kinds of things that run the gamut. The ones I know about call for dedicated ops that do what the hardware needs vs magic or complicated representations that try to generalize.

Whew! For a second there, the entire world shifted on me in a funny way :slight_smile:

Ah, yes! In theory, anything that gets into the right shape is “valid” (even other non-linalg ops that auto broadcast), but in practice you can only def-use-match against a fixed set of ops (closed world). That’s the only reason I named linalg.broadcast in there, not because it’s the only one that can do it.

We could always add interfaces (open world), but that’s a separate discussion.

One interpretation is that accumulation type is micro-architecture specific and therefore it’s an “implementation detail”.

For example, one could have a vanilla linalg.add FP16, FP16 -> FP16 and still accumulate in FP32, as long as the op still outputs FP16. But to feed a temporary FP32 from that op into another, I’d have to use an explicit type cast from that result to FP32 before the next op. This is very ugly, but it’s how we did SelectionDAG hacks.

Another interpretation is that we want to carry even more information. As @ftynse says, there are also fast-math flags that we may want to propagate, and accumulation type can be another property of the op. But in the latter case, would the add op output the larger type? Wouldn’t that violate the same-type property?

That’s why I mention “quantization”. Now I realize I’m misusing the term. My point was just “how do we encode mixed-type awareness into operations without changing their original semantics?”.
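
If accumulation type did become a property of the op, one hypothetical encoding (the attribute name is invented for illustration) might be:

  // Externally still f16 -> f16, so the same-type rule holds; the internal
  // accumulation type is carried as an attribute.
  %0 = linalg.add {acc_type = f32}
       ins(%a, %b : tensor<64xf16>, tensor<64xf16>)
       outs(%init : tensor<64xf16>) -> tensor<64xf16>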

:heart:

It’s funny what 5 misplaced letters do to your brain when they come from a tangent.

I’ll stop using that term and just cope when people say “oh, you mean quantization?”. Far less harmful. :slight_smile:

Depending on audience, I do sometimes close down my internal monologue and say “quantization”. It’s just super imprecise and means something different to everyone. And I have battle scars… Which don’t need to be anyone else’s problem. I can deal :slight_smile:

… Except that it is actually defined concretely in torch. Our priority when lowering torch has been a faithful encoding of the op semantics, such as they are. A backend is of course free to do any number of… Hazardous… Things to the numerics, but I think it is important that the default representation doesn’t abstract frontend constructs that are concrete.

Agreed!

The main reason why we wanted strong encoded semantics (and not weak “lowering” semantics) is to be able to faithfully lower from different front-end semantics without having to second-guess.


Thanks for the summary @rengolin and for organising the roundtable!

+1 Here’s one specific case to support this:

-Andrzej

Thanks for summarizing (and organizing) that round table, @rengolin !

Some more details on the named ops topic: one question was whether we could retain the semantics/idiom of named Linalg ops that lower into a graph of linalg.generic ops, e.g. linalg.softmax.

One possible solution might be to encapsulate the lowered generic ops within the body of some isolated-from-above op, making it hard for e.g. cse to break the idiom these ops form while facilitating moving/lowering/optimizing them as a whole (kind of a local, ad-hoc alternative for function outlining). Here’s how such an op might look if added to the scf dialect.

Other uses for ad-hoc grouping of ops might be limiting transformation scopes (as proposed by the PR Renato mentioned) or capturing the result of partitioning.

I am a bit late to this thread. Thanks @rengolin for the detailed notes.

In general I am -1 on almost all of these. I will describe more about the “broadcasting behavior” and “type cast” later on, but the main premise for all of this is that Linalg operations are essentially any computation that can be represented using “perfectly nested loops”. Named ops are useful, but Linalg is a transformation-focused dialect, and any restriction that tries to limit the “perfectly nested loops” aspect of this seems artificial to me. For the most part, a named op is just a linalg.generic with fixed indexing maps, iterator types and region definition.

Broadcasting behavior

In general, broadcasting gets a bit of a bad rap (with good reason), but IMO not all broadcasting behavior is bad. In certain restricted categories, having some broadcasting behavior can lead to a better representation of the program. I’ll use the examples above to walk through which ones I think are well defined under these rules of thumb for elementwise parallel operations.

  1. All operations need to be in destination-passing style. If any named op is not (I actually don’t know how that is legal for the Linalg interface), it should be; that is a requirement IMO.
  2. The indexing maps for the outputs of such ops need to be the identity.
  3. All operands’ indexing maps need to be a minor identity (using this term loosely: no transposition, but missing dimensions are allowed).
  4. The broadcasting behavior is essentially that all operands not of the same rank as the output get broadcast to the output shape.
 // Broadcast %1 10x before adding
 %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<10x4xf32> -> tensor<10x4xf32>

This IMO is perfectly well defined (apart from the fact that it should be in destination-passing style and it isn’t), since it is essentially this:

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins( %1, %2 : tensor<4xf32>, tensor<10x4xf32>)
    outs(%3 : tensor<10x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<10x4xf32>

There is no ambiguity here. I think one issue is that these ops are trying to infer broadcast behavior based on the shapes. Instead, the broadcasting behavior should be based on the indexing maps.

%0 = linalg.add %1, %2 : tensor<10x1xf32>, tensor<10x4xf32> -> tensor<10x4xf32>

This is numpy-style size-1 based broadcasting. This should just be illegal.

// Broadcast %1 4x before adding?
  %0 = linalg.add %1, %2 : tensor<10xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 

No issue here either. It is basically

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %2 : tensor<10xf32>, tensor<10x4xf32>)
    outs(%3 : tensor<10x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<10x4xf32>

Now coming to this

%0 = linalg.add %1, %2 : tensor<4xf32>, tensor<4x4xf32> -> tensor<4x4xf32> 

If you are trying to “infer” based on shapes, this gets stuck… Instead, the named ops should infer broadcasting behavior based on indexing maps; then there will be no ambiguity. For this specific case, actually, both

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins( %1, %2 : tensor<4xf32>, tensor<4x4xf32>)
    outs(%3 : tensor<4x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<4x4xf32>

and

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins( %1, %2 : tensor<4xf32>, tensor<4x4xf32>)
    outs(%3 : tensor<4x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %sum = arith.addf %b0, %b1 : f32
    linalg.yield %sum : f32
} -> tensor<4x4xf32>

are equivalent. One is just the loop interchanged version of the other. So you can just pick one.

I think one addition might be to make the indexing maps explicit even on named ops, to represent broadcasting behavior precisely. Then there is no ambiguity at all.
Doing this for the matmul and convolution ops could also be useful and could help collapse the tens of matmul and convolution variants into a handful of ops with different indexing maps.
This also means we might be at the end of the road for the Python OpDSL → YAML → TableGen definition of named ops. Instead, we would add a set of named ops explicitly in TableGen, which gives us more freedom to make sure the ops carry enough semantic information to keep things as local as possible, as well as unambiguous when generalizing to linalg.generic.
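
As a hypothetical sketch of that suggestion (named ops do not carry this attribute today), the ambiguous case above could state its maps directly:

  // Explicit maps pick the broadcast dimension, so nothing is inferred from shapes.
  %0 = linalg.add {indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                                    affine_map<(d0, d1) -> (d0, d1)>,
                                    affine_map<(d0, d1) -> (d0, d1)>]}
       ins(%1, %2 : tensor<4xf32>, tensor<4x4xf32>)
       outs(%3 : tensor<4x4xf32>) -> tensor<4x4xf32>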

Type mismatches

AFAIU the only type issue that happens with “named elementwise ops” is understanding how to extend the input type to the output type. Because Linalg ops fundamentally work on signless types, the semantics of the element type conversion is carried in the body of the Linalg op. For named ops we just need a way to represent how to convert from the operand type to the result type, with all the computation happening in the result type. (You could also relax this and have an “operation type”, where all the operands are converted to this type, the operation is done in that type, and the result is then converted to the result type.) All of this could just be represented as an enum per operand/result. In general it is useful to allow mixed types in a single operation. The whole premise of Linalg is that you want to keep all information needed to “transform” an operation local. This is more a guiding principle than a red line, but we should try to make it easy to keep things as local as possible.

To make the above more concrete, you could add this to an elementwise operation

%0 = linalg.add {
    extension_types = [SIGN_EXTEND_I32, UNSIGNED_EXTEND_I32],
    ...}
    ins(%1, %2 : tensor<..xi4>, tensor<..xi4>)

which essentially means

%0 = linalg.generic {...}
    ins(%1, %2 : tensor<..xi4>, tensor<..xi4>)
  ^bb0(%b0: i4, %b1 : i4,...):
    %3 = arith.extsi %b0 : i4 to i32
    %4 = arith.extui %b1 : i4 to i32
}

This essentially boils down to how you can generate the body of the linalg.generic given the named op.

I think it would be useful to even adapt the linalg.matmul and linalg.conv named ops to have explicitly stated semantics for the conversion from input type to accumulator type and from accumulator type to result type. For now we are essentially using two separate ops and relying heavily on fusion to fold things back for us, which is doable, but it would be easier if we didn’t have to. The flip side of this is that quantization schemes are getting inherently more complicated. We might not be future-proof with named ops to handle these evolutions and may have to build up interfaces that can query information from a generic op directly (like which dimensions are contracted, or which are batch dimensions) rather than rely on named ops to handle things for us.

That’s what we mean by implicit behaviour and this is what we all agreed should not happen. If my comments led you astray, it’s my reporting, not the consensus.

We don’t want to deal with affine maps once we go into non-perfectly nested territory, so this needs explicit ops, which is a perfectly valid usage of Linalg and one we have been agreeing with for the past year.

It does not interfere with the generics part or the perfectly-generalizable ops either. We’re just expanding the ability to transform linalg code for non-perfectly nested ops.

Creating a new dialect for that is a non-starter. We discussed this last year and the consensus then and direction since have been to improve linalg, and that’s what we have been doing. It would be great if IREE could participate more actively in this discussion.

Nope. The main problem is accumulator type and controlling the casts (pre/post) to avoid breaking precision expectations. Your proposal does not address that. I’m open to suggestions, but we need to start thinking beyond generics, affine maps and perfectly nested loops. There are a lot of fusion opportunities that do not work at all in those cases.

I don’t think I follow this. We were talking about representing elementwise add and similar operations, not imperfectly nested loops. I don’t think having Linalg ops represent non-perfectly nested loop computation is a good usage of Linalg. Those are represented by loops + Linalg ops.

That seems like something that shouldn’t be in Linalg. This seems like a new dialect, or really a new interface: something that builds upon the DecomposeAggregateOpInterface (just as all Linalg ops are effectively implementations of the Linalg interface).

In what sense? I suggested adding an array attribute of enums that explicitly state what the truncation/extension expectations are to take ambiguity out of it. Not wedded to that approach, but more a suggestion to get the ball rolling.

Overall, what was suggested at the top is no different from TOSA- or Torch-like dialects. This would just become one more of those if we take that direction. For Linalg ops we should aim for a maximally compact representation within what can be handled by transformations.

We have discussed this for the past year, since my first RFC. There are many other threads in the forum that touch on this, and the consensus is strong on the proposals at the round table. I wish you had participated from the beginning and voiced your concerns (or at least resolved your questions), so we wouldn’t have to do this all over again.

That is exactly the proposal. Loops + named ops.

All your other comments seem to come from the misunderstanding that we’re trying to extend generics or named ops to non-perfectly nested loops. We’re not.

As I said above, we’ve gone through all these options throughout the last year. Attributes don’t scale. Named op variations don’t scale. Look at the number of convolution and matmul variants that are in Linalg right now.

The only two things that scale are:

  1. Linalg generic with arbitrary code in the perfectly-nested inner loop body.
  2. Loops + strict semantics named ops for the cases a generic can’t represent.

For 2 above, having attributes or variations won’t scale. Everyone will want to introduce their own variation. And in the end, they can all be represented as a graph of simpler named ops that can easily be matched with def-use matchers.

We already have tiling and fusing for linalg generics and that works really well. That’s why we generalize named ops in our compiler. But we’d like to add tiling from named ops (or even DAGs of named ops) into tiled named ops, so that we can more easily match them to micro-kernels or special hardware ops.

Today, we take a named op, generalize it when tiling, then re-specialize the tiled generics to match. It works really well, so the consensus (at the round table, and before that in the multiple threads in this forum) was that we should just complete the named op catalogue, write exact semantics, validate their lowering, and tile them into named ops still.
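
As a rough sketch of “tile named ops into named ops” (sizes and names illustrative, roughly the shape of IR that tiling already produces), the tiled body still contains a linalg.matmul rather than a generic, so it can be matched against a micro-kernel or hardware op:

  %res = scf.forall (%i, %j) = (0, 0) to (128, 128) step (32, 32)
         shared_outs(%out = %init) -> (tensor<128x128xf32>) {
    %lhs_t = tensor.extract_slice %lhs[%i, 0] [32, 64] [1, 1]
             : tensor<128x64xf32> to tensor<32x64xf32>
    %rhs_t = tensor.extract_slice %rhs[0, %j] [64, 32] [1, 1]
             : tensor<64x128xf32> to tensor<64x32xf32>
    %out_t = tensor.extract_slice %out[%i, %j] [32, 32] [1, 1]
             : tensor<128x128xf32> to tensor<32x32xf32>
    // Still a named op on the tile, not a linalg.generic.
    %mm = linalg.matmul ins(%lhs_t, %rhs_t : tensor<32x64xf32>, tensor<64x32xf32>)
                        outs(%out_t : tensor<32x32xf32>) -> tensor<32x32xf32>
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %mm into %out[%i, %j] [32, 32] [1, 1]
          : tensor<32x32xf32> into tensor<128x128xf32>
    }
  }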

We should only generalize in three cases:

  1. The case gets too complex and a generic becomes a better description of the problem.
  2. I don’t know how to lower a named op (no HW/MK available) so I lower to loops.
  3. I know how to vectorize generics better than named ops.

Note that none of that stops anyone from using generics from the beginning and never touching named ops. It’s already possible to represent everything in a set of loops and generics. It’s just SO much simpler to match a DAG of named ops than a DAG of generics.

You really should read the original proposal’s thread; all of this has been discussed there. None of those dialects have LLVM governance; Linalg does. We created a new dialect (TPP), but it ended up just being a clone of Linalg (down to the interfaces we implement).

Torch/HLO/ONNX come from high-level graphs which have named ops; they convert to generics, and then we convert to named ops. We just want to be able to convert directly to named Linalg ops, work with them, and then lower them to hardware without having to pass through generics (if we don’t really need to).