[RFC] Primitive Ops: add MapOp, ReductionOp, TransposeOp, BroadcastOp to Linalg

Motivation

linalg.generic is a powerful Swiss Army knife of an operation. It can model elementwise operations, reductions, static broadcasts and reshapes, transposes, convolutions, matmuls and more. There are several problems with it, and I think they can be addressed relatively easily.

Layering

There are examples of compilation pipelines in TF and IREE that go from a front-end dialect, e.g. mHLO, directly to linalg.generic and then perform tiling, fusion and other related transforms such as tree reductions. In order to do so, linalg.generic ops are pattern-matched to understand what they actually model. Is it a reduction? Is it an elementwise op? Is it an elementwise op that also broadcasts one of the inputs? Is it a reduction that had multiple ops fused into it?

We lose the structure and then try to recover it with pattern matching. That is unnecessary and indicates a missing layer of ops.

Readability

Ideally, the IR that we generate should be readable and concise, and not only for the “initiated”. Here is an example of a simple add op on tensors lowered to linalg.generic.

%add = linalg.generic {
  indexing_maps = [affine_map<(d0) -> (d0)>,
                   affine_map<(d0) -> (d0)>,
                   affine_map<(d0) -> (d0)>],
  iterator_types = ["parallel"]}
  ins(%lhs, %rhs : tensor<10xf32>, tensor<10xf32>)
  outs(%out : tensor<10xf32>) {
  ^bb0(%l: f32, %r: f32, %o: f32):
    %0 = arith.addf %l, %r: f32
    linalg.yield %0 : f32
  } -> tensor<10xf32>

and this is an example of the same op converted to the missing linalg.map.

%add = linalg.map ins(%lhs:tensor<10xf32>, %rhs:tensor<10xf32>)
                  outs(%out:tensor<10xf32>) optional-attrs
                  (%l: f32, %r: f32) {
                    %0 = arith.addf %l, %r: f32
                    linalg.yield %0: f32
                  }

I argue that having simpler operations that model elementwise, reduction, transpose and static broadcast computations would improve readability and reduce the need for pattern matching in transformation passes, for example when clustering operations to perform fusion.

Therefore, it might be a good time to reconsider whether the front-end dialects should be converted to linalg.generic or to a set of primitive ops, possibly within the Linalg dialect, that can be trivially lowered to linalg.generic.

Linalg Named Ops

All of the above is suspiciously similar to Linalg Named Ops, a way to define structured ops that can be converted to linalg.generic using YAML. Unfortunately, it requires defining a lot of operations for every possible rank and combination of what gets reduced, transposed, etc. For example, if we want to define an op for reduction, then not one, but many ops have to be created: reduction_1d, reduction_2d, column_reduction_2d, row_reduction_2d, …

Design

Add MapOp, IotaOp, ReductionOp, TransposeOp and BroadcastOp to LinalgStructuredOps.td. All of the ops will implement LinalgOpInterface and DestinationStyleOpInterface, which will allow for tiling/fusion/lowering to loops and bufferization.

These ops can be seen as complementary to Linalg Named Ops, but defined in TableGen and C++ instead of being generated out of the YAML config.

Map

This is an n-ary operation where the shapes of the inputs and the output are the same. It does not have any implicit broadcasting behaviour.

%add = linalg.map ins(%lhs:tensor<10xf32>, %rhs:tensor<10xf32>)
                  outs(%out:tensor<10xf32>) optional-attrs
                  (%l: f32, %r: f32) {
                    %0 = arith.addf %l, %r: f32
                    linalg.yield %0: f32
                  }

linalg.map has a region with a block argument for every input. Note that, unlike linalg.generic, it does not have a block argument for the output.

Bike-shed: Should it be called linalg.map, linalg.cwise, linalg.pointwise or linalg.elementwise?

Iota

Iota is an operation that does not have any inputs, but it uses the values of the induction variables to populate the output.

%iota = linalg.iota outs(%output:tensor<8x16xf32>) dim = [0] optional-attrs

Bike-shed: Do we really need this operation? Can’t we just use linalg.map with linalg.index inside to model it?

%iota = linalg.map outs(%out:tensor<8x16xf32>) optional-attrs () {
           %0 = linalg.index 0 : index
           %1 = arith.index_cast %0 : index to i64
           %2 = arith.sitofp %1 : i64 to f32
           linalg.yield %2 : f32
         }

Transpose

The transpose operation specifies a permutation of the input dimensions.

%transpose = linalg.transpose
               ins(%input:tensor<16x64xf32>)
               outs(%output:tensor<64x16xf32>)
               permutation = [1, 0] optional-attrs
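
For illustration, a sketch of the linalg.generic this transpose could be trivially lowered to (indexing maps inferred from permutation = [1, 0]; purely illustrative, not the output of an existing pass):

%transpose = linalg.generic {
  indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>,
                   affine_map<(d0, d1) -> (d0, d1)>],
  iterator_types = ["parallel", "parallel"]}
  ins(%input : tensor<16x64xf32>)
  outs(%output : tensor<64x16xf32>) {
  ^bb0(%in: f32, %o: f32):
    linalg.yield %in : f32
  } -> tensor<64x16xf32>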

Reduction

The reduction operation specifies which dimensions are reduced and has a body region containing the combiner.

%sum = linalg.reduction
         ins(%input:tensor<16x64xf32>)
         outs(%output:tensor<16xf32>) dimensions = [1] optional-attrs
         (%in: f32, %out: f32) {
           %0 = arith.addf %in, %out: f32
           linalg.yield %0: f32
         }
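
For illustration, a sketch of the linalg.generic this reduction could be trivially lowered to (iterator types and indexing maps inferred from dimensions = [1]; purely illustrative):

%sum = linalg.generic {
  indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                   affine_map<(d0, d1) -> (d0)>],
  iterator_types = ["parallel", "reduction"]}
  ins(%input : tensor<16x64xf32>)
  outs(%output : tensor<16xf32>) {
  ^bb0(%in: f32, %out: f32):
    %0 = arith.addf %in, %out: f32
    linalg.yield %0: f32
  } -> tensor<16xf32>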

Broadcast

This is a static broadcast operation, i.e. there is no ambiguity at compile time about which size-1 dimensions should be expanded.

%bcast = linalg.broadcast
         ins(%input:tensor<16xf32>)
         outs(%output:tensor<16x64xf32>) dimensions = [0] optional-attrs
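
For illustration, a sketch of the equivalent linalg.generic (assuming dimensions = [0] means the input maps to output dimension 0, broadcast_in_dim-style):

%bcast = linalg.generic {
  indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                   affine_map<(d0, d1) -> (d0, d1)>],
  iterator_types = ["parallel", "parallel"]}
  ins(%input : tensor<16xf32>)
  outs(%output : tensor<16x64xf32>) {
  ^bb0(%in: f32, %o: f32):
    linalg.yield %in : f32
  } -> tensor<16x64xf32>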

Question: Do we need size-1 expansion in this version of broadcasting?

A dynamic broadcast operation cannot be modeled with linalg.generic, and it is too specific to TensorFlow, which is why it should not be part of Linalg.

Reshape

Reshapes are already modeled with tensor.expand_shape, tensor.collapse_shape, or a combination of a tensor.collapse_shape that flattens to 1-D followed by a tensor.expand_shape that expands to the target shape.
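
For example, reshaping tensor<2x6xf32> to tensor<3x4xf32> can be written as a collapse to 1-D followed by an expansion (shapes here are illustrative; the reassociation syntax follows the upstream tensor dialect at the time of writing):

%flat = tensor.collapse_shape %input [[0, 1]]
          : tensor<2x6xf32> into tensor<12xf32>
%reshaped = tensor.expand_shape %flat [[0, 1]]
          : tensor<12xf32> into tensor<3x4xf32>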

There are also dynamic reshapes that cannot be modeled with linalg.generic, e.g. reshaping tensor<?x?xf32> to tensor<?x?x?xf32>. Such an op cannot implement LinalgOpInterface, but it can be converted to tensor.reshape.

It is an interesting question whether we need a dst-style linalg.reshape op at all, or whether we can live with tensor.[expand_shape, collapse_shape, reshape]. Usually, linalg.reshape does not allocate during bufferization and just becomes a memref descriptor operation, so I am not sure how important it is to have it in dst-style.

%reshape = linalg.reshape
         ins(%input:tensor<?x?xf32>)
         outs(%output:tensor<?x?x?xf32>) optional-attrs

Also, it is quite different from the ops above w.r.t. tiling and fusion, since reshapes are only tileable to size-1 tiles, i.e. to the point level.

Matmul, Fill and Convolutions

These ops are already defined in Linalg, and HLO gets converted to them instead of to linalg.generic. They are already readable and useful.
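
For reference, the existing named ops already read well, e.g. (using the current upstream syntax for linalg.fill and linalg.matmul):

%cst = arith.constant 0.0 : f32
%fill = linalg.fill ins(%cst : f32) outs(%init : tensor<16x32xf32>) -> tensor<16x32xf32>
%mm = linalg.matmul ins(%lhs, %rhs : tensor<16x8xf32>, tensor<8x32xf32>)
                    outs(%fill : tensor<16x32xf32>) -> tensor<16x32xf32>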

There are many operations for convolutions, e.g. Conv1DNwcWcfOp, Conv2DNhwcHwcfOp, Conv3DNdhwcDhwcfOp, DepthwiseConv1DNwcWcmOp and more. It might be possible to find a smaller set of ops to cover them all.

@stellaraccident, @nicolasvasilache, @MaheshRavishankar, @herhut, @frgossen, @akuegel, @mehdi_amini, @matthias-springer, @okkwon

Thank you for the proposal. I want to chime in and give my opinion on the topic as I have the same problem. I agree with the motivation but wonder if there are other alternatives to what is proposed. For example, why not provide a reliable mechanism to go from linalg.generic to linalg named operations? (see for example: Linalg to LLVM Lowering - #7 by nicolasvasilache). Also, I am not completely sure what limitations would prevent us from using OpDSL to generate the suggested ops. Can we not generalize at the OpDSL level?

My two main fears are:

  1. If I need to define a new op in Linalg, which way should I prefer, OpDSL or TableGen and C++, and why?
  2. In my case, it is not just about detecting whether an operation is an elementwise op or a reduction, but also what kind of operation I am dealing with (e.g., add or sub). Thus, at the end of the day, I still need to pattern-match the body of the operation (e.g., to map to library calls).

Again, thank you for the proposal. I think this is a problem that should be solved, and I agree 100% with the motivation.

Ps: the links redirect to the Google login page.

I guess it depends on the granularity of the op. Linalg named ops already include linalg.elemwise_unary and linalg.elemwise_binary, which introduce a separate fun attribute for every possible body of the operation. That makes mapping to function calls easier, although I would argue that this particular kind of pattern matching leads to the same amount of code, whether you map every fun attribute or every possible body of linalg.map to a function call.

Like @chelini, first let me say thank you for the proposal, and I also 100% agree with the motivation and need for better primitives. linalg.generic is indeed powerful but being used for too much (and it is very complicated to manipulate when prematurely lowered to that form). I think we would all like fewer affine map laden attributes to read through for these common things.

I’m thinking on the proposal of the specific operations you mentioned and will followup later. Comments below on design and approach.

I think it is perfectly fine to just call them Linalg Named Ops, regardless of the specific mechanism for instantiating them. To answer the specific point raised, OpDSL (which I wrote and have no real attachment to making it a “one true way” for everything) is not really set up to deal with polymorphism at a contained region level (map and reduction in your example), and I suspect that you are going to want to exert a level of control over the pretty-printing, which runs counter to using a highly systematic representation tool meant for stamping out ops where the name+attributes define the entire semantics. For those reasons, I expect that map/reduction should just be implemented in TableGen/C++. If OpDSL ever grows to handle that level of polymorphism, then that can be revisited (that was how it came to be what it is: by form-fitting it to handle all of the C++ defined ops that existed at the time).

I suspect that the rest may not fare well with OpDSL today due to the rank polymorphism they imply. Even if representable, you are likely going to need/want more control than it provides.

This is a design point that we probably want to keep: always open to refinements but the goal with these ops is to bias more towards the name being the discriminator in terms of the resulting computation structure and specializations that exist in the field (as opposed to higher level op sets, which tend to be very attribute laden and open ended). In this situation, named ops are cheap and I would prefer keeping them defined roughly at the granularity they are (and can come up with a better defense of that viewpoint if there are strong feelings on this/it wasn’t just a passing comment).

We can, but adding new levels of polymorphism to OpDSL comes with its own hazards in terms of maintenance and opacity. The criteria for me boil down to: a) does the op leverage more polymorphism than OpDSL has (and we would rather wait and see some examples before generalizing), and b) does the op require more fine grained C++ control than the one-size-fits all conception of OpDSL provides (i.e. verification, custom print formats, C++ helpers, etc). As mentioned above, mostly OpDSL was evolved to meet the needs of a fairly uniform subset of ops which make up the bulk of specializations in the field.

Same for me. Having the name does make such pattern matching code a good deal easier though (fewer AffineMap laden heuristics about projected permutations and such for these common cases). I believe that it was always part of the plan to be able to transform both to generic form (generalize) and pattern match back to the named op level as a very specifically supported raising with constrained scope. The way OpDSL is defined should make it systematic to generate such transformations for the forms it supports, but we probably also want it for the bespoke ops (i.e. there are many ways that these ops can be “entangled” in a lowering flow). In these cases, the DecomposeLinalgOps pass does peel the generic and would make those transforms work.

I do think this proposal pushes us further down the path of needing that named op specialization pass for full generality.

Some additional comments on the specifics of the ops proposed. I think the utility of new ops comes from two directions:

  • Do they provide readability, convenience or less information loss for lowerings into Linalg? For this, we look at a few of the usual suspects (and you are specifically looking at advanced fusion heuristics at the high level).
  • Do they enable better mapping to specializations in the field (e.g. library calls, hardware blocks, etc.)?

When lowering to libraries or microkernels, I don’t think this is going to be a generally useful op, so this is in the frontend convenience category, I think.

The reason is that common targets for these things actually do match a fused linalg.generic defined over projected permutations quite well: in such targets, you typically have strides available for free, and the elementwise-fused form is what you want (i.e. broadcasts and transposes get “folded in”). I don’t love pattern matching linalg.generic for such cases, but it is actually the right representation for the use cases I see. Separate to this proposal, I wonder whether, if we had a better name/spelling for such n-ary, parallel, projected-permutation mappings, that wouldn’t be the right op to define. I’m not precisely sure how to define that formally with better rigor than generic is defined now, though – more that I’ve seen a lot of code that does the same preamble dance to identify whether that is the form.

Naming: Elsewhere, we already have the ElementwiseMappable trait which captures the semantic. Maybe linalg.elementwise_map for consistency? There is also the ElementwiseToLinalg pass which is presently converting anything with this trait to a linalg.generic. Minimally that should be updated to build this new op.

Even if they get fused later on, I would like a better representation for transpose specifically, even if just for the pure readability at the frontend->linalg level. Same comment about linalg.map above: it is rare in real programs that this should survive as a standalone op because generic represents its trivial fusions (and those are what you want to lower), but I think the ergonomic benefit makes it worth it.

I’m not actually sure what you are proposing here? It looks to be a fully static form of HLO’s broadcast_in_dim? As written, I believe it can support dynamic broadcasting but does not allow ambiguity on unknown dimensions triggering expansion.

I’m going to leave this to the better qualified.

I think this is only partially true: Perhaps at the mid-level we are over-pattern matching, but there are a lot of (even library call) cases where generic is the most direct match. Yes, you have to pattern match cases out of it, but the associations don’t exist otherwise. But especially at the higher levels, less information loss is better, so I can get behind the “missing layer of ops” analysis. Ideally, I would always be able to do access-pattern fusion and get into generic form for some code generation or lowering flows, but your analysis is that we don’t need to start there. (I think I’m just restating what you said in my own words, so please tell me if we are saying something different)

It might be a matter of more clearly defining what the demarcation of Linalg is. At the outset (also biased by my experience with Linalg), these ops seem to be describing something that sits “above” Linalg.
IMO, a Linalg op is nothing but a perfectly nested loop nest, where the operands, indexing maps and iterator types give you information about the access pattern within the loop body, while treating the body of the loop as a black box. Anything that is trying to look into the region of the operation to re-discover some semantics of the computation is inherently raising. So that has always been a bit of a red flag for me.

I can’t talk about TF pipelines, but in IREE I have tried to actively avoid this. I’d argue that if you indeed need to look at the region of the linalg.generic to know what it is doing, the transformation is being implemented at the wrong layer. Either MHLO, or some layer between Linalg and MHLO, is where such transformations should live. MLIR allows you to write analyses/transformations on dialects that capture the information needed, without having to do complex lifting.

I am not sure readability is as much of a concern for the Linalg dialect (at least the way I think of it). It is more of a front end for code generation, so it seems reasonable that someone working at this level needs to invest more time to understand the operation.

Coming to the specific ops here

  • The proposed map operation seems to be nothing more than a linalg.generic with all iterator types being parallel and all indexing maps being identities. This could just be a utility function to use as needed.
  • The iota and transpose operations likewise seem like very specific use cases. What is the difference between these ops and the MHLO versions of these ops? This seems like a “lateral translation” rather than a lowering then.
  • The static broadcast seems even more problematic; all Linalg ops work with dynamic dimensions. (Side note: I continue to believe that “broadcast_in_dim” and “dynamic_broadcast_in_dim” are legacy constructs that came about before sufficiently powerful compiler technology existed to better model the required computation. “Handling broadcasting behavior in front ends and codegen” has more discussion on this, which itself is a bit dated. In IREE, for example, we explicitly avoid lowering to these constructs to begin with.)
  • Collapse/expand shape are not Linalg operations at all. Linalg operations can be used to implement these ops, but that is IMO strictly a pessimization. These operations serve a slightly different purpose and are mostly meant to capture changes in tensor shape (they need to have copy semantics at the SSA level, but are mostly expected to lower to the memref versions of expand/collapse shape, which have no copy semantics after bufferization).

To summarize, I feel like Linalg is not the right level for these operations. We can have a separate discussion about where these ops should go (in MHLO, in some “front-end” dialect in core, etc.). A lot of work is in progress to move things that don’t really belong in Linalg out of the dialect, while this seems to be going against that trend.

Exactly :slight_smile: tHLO is the right level :wink:

Leaving aside where they live, I think we’ve learned some things from “the linalg journey” that are influencing the desire for these higher level ops, but I can also buy that they are neither “Linalg” nor “MHLO/TOSA/etc” but somewhere in between. Perhaps we were just a couple of years too early for the “TCP” (tensor compute primitives) idea.

The “elements of Linalg” that I think these ops are drawing from are:

  1. Destination passing style as a gateway to progressive bufferization, memory planning and grouping heuristics: We are already factoring this out as a dedicated concept, and even though the starting point is to just disaggregate in-situ within linalg, I expect that this will end up being a first order thing we support at this mid-level as we exit from strict value-based semantics.
  2. Default lowerings to buffers/loops: This has been a big boon for multiple projects to have the dumbest possible reference lowerings. Torch-MLIR relies on this exclusively for its reference backend, which is used to gate all changes.
  3. Ranked with explicit extents: We just can’t relitigate this. For the subset of the universe that is ranked, having extents modeled explicitly in the IR is a requirement. This composes well with destination passing style.

I think what hasn’t worked so well in Linalg for a generic frontend, mid-level dialect is:

  • There is a desire for more “algebra centrism” in the opset. I struggle with the right categorization for this, but I’ve heard it described as “I want something I can write grappler-style algebra passes on”, “I want ops that are representative of what the user programmed, not what we are lowering to”, and (tongue in cheek) “why do I need to know about affine and loop structures to spell ‘transpose’?”.
  • Higher level fusions: these tend to rely on grouping and are typically heuristic based on a vocabulary of domain ops. I am separating this from lower-level fusions as befits codegen, which linalg represents quite well.

While plenty of people are lowering frontends to Linalg, actually covering pretty much the whole space of ML frameworks at this point, what I’m hearing is that it is codegen biased and not something people want to be doing high level transformations on directly.

If we had it to do over again, with the benefit of hindsight with respect to DPS/bufferization/dynamic-shapes, I would have advocated that most of these higher level “algebra” ops should be in tensor or possibly some specific dialect, but follow the #1-3 conventions above. We’re pattern matching to linalg because it does satisfy those requirements/conventions while nothing else does. None of MHLO/TOSA/ATen model things at that level – and for real compilers, those conventions are really useful. We could just (either in or out of linalg) define these ops to those conventions and make it all compose together. I think the key is those three conventions defining a specific level in the IR as you exit frontend/value/dataflow and enter codegen.

This doesn’t sound to me like “MHLO” (as in the dialect) anymore than it sounds like “TOSA” or “ATen”.

Lol, I completely missed the “t” there and was like “but these aren’t HLO at all!”. Time to increase my browser zoom again, I guess… or we work on the name :slight_smile:

In all seriousness, if you have found your way to a useful intermediate level that follows the principles I laid out above and is better for lowering from frontends, then I think we want that – and if you are offering, it would be great if it wasn’t project specific.

@pifon2a @stellaraccident What is tHLO? A quick search did not turn up anything.

We (at Cruise) need something like this as well, but we were thinking of proposing a dialect that layers above Linalg instead of extending it. It would also contain operations like “sort” that perhaps don’t mesh well with Linalg, but do show up in our use cases.

CC @raghavanr

Good question: I’ve not actually seen it yet, just had it described (and didn’t find it the couple of times I went looking). One of the reasons I am supportive of this RFC (in the abstract) is that we all have a common need for the things that I’ve heard described as “tHLO” – and hearing from you adds more weight to that assessment. I’d rather see that in the commons.

I’d be supportive of a new dialect like you suggest, especially if we could identify some of the design principles (i.e. I took a stab at that with my 3 points above). Did you all have a sketch of what you are thinking?

I think that would be very interesting to see and discuss. How far along is this work? Something that could be discussed at ODM?

Did you all have a sketch of what you are thinking?
How far along is this work?

We don’t have a Cruise-internal implementation at the moment. @raghavanr will start a thread in the next week or so to get the ball rolling, and yes, an in-person discussion at a subsequent ODM sounds like a great idea!

I took a stab at that with my 3 points above

Regarding #3, have you seen issues with operations like {torch|tf}.squeeze? These turn shape dynamism into rank dynamism.

@jpienaar @sanjoyd @stellaraccident I can prepare something for the ODM.

At the moment, we are working on set-based code generation. The presentation for it will be at an ODM at the end of August/beginning of September, but the code and tests can already be found in TensorFlow/MLIR_HLO.

There is a gml_st_extension_ops.td file, which at the moment contains DynamicBroadcastInDimOp, ConcatOp, ScatterOp, GatherOp, i.e. all ops that cannot be modeled with Linalg but can still be tiled/fused, etc. Briefly, the idea is to move these ops to a tHLO (tileable HLO) dialect which will contain destination-style ops with tiling/fusion interfaces attached to them. Also, I thought, why not extend this set of ops with thlo.map, thlo.reduce, …, i.e. everything I am talking about in this RFC.

Then I talked to @nicolasvasilache, who was interested in adding it to the Linalg dialect, especially since ScatterOp is planned to be moved from the linalg_ext dialect in IREE to MLIR core. Originally, I wanted a separate dialect, and maybe that is the right thing to do.

@sanjoyd Do you use HLO at Cruise? If so, would it be convenient to have this new dialect in the OpenXLA repo?

I think those are two orthogonal questions, i.e., whether these ops make sense is independent of HLO.

We don’t use HLO or XLA at this time.

We also prefer open governance so having something under LLVM/MLIR is ideal.

tHLO will have some HLO-specific things, like the different attributes that ScatterOp or GatherOp have. The rest of the ops, e.g. map, reduce and such, make sense beyond HLO.

We (IREE) also strongly prefer this. I think doing in-situ early work in the various downstreams is a good way to get going and get the problem characterized, but I would like to see us rendezvous on an upstream dialect.

I think we should do more work on these, and in principle I don’t see them as being in violation of where I was going with #3. I don’t have an immediate answer to how to compose them, but they do represent a relatively limited form of rank dynamism that, while bespoke, is quite a bit different from a fully unranked mechanism. It would be nice if, however we specified them, the result was still structured somehow vs. fully unstructured. But also, if they were the primary exceptions to the rule while waiting for more work to better characterize them, I don’t think that would be the end of the world.