[RFC] Primitive Ops: add MapOp, ReductionOp, TransposeOp, BroadcastOp to Linalg

Some additional comments on the specifics of the ops proposed. I think the utility of new ops comes from two directions:

  • Do they provide readability, convenience, or less information loss for lowerings into Linalg? For this, we look at a few of the usual suspects (and you are specifically looking at advanced fusion heuristics at the high level).
  • Do they enable better mapping to specializations in the field (e.g. library calls, hardware blocks, etc.)?

When lowering to libraries or microkernels, I don’t think this is going to be a generally useful op, so I think it falls in the frontend-convenience category.

The reason is that common targets for these things actually match a fused linalg.generic defined over projected permutations quite well: in such targets, you typically have strides available for free, and the elementwise-fused form is what you want (i.e. broadcasts and transposes get “folded in”). I don’t love pattern matching linalg.generic for such cases, but it is actually the right representation for the use cases I see. Separate from this proposal, I wonder whether, if we had a better name/spelling for such n-ary, parallel, projected-permutation mappings, that wouldn’t be the right op to define. I’m not precisely sure how to define that formally with more rigor than generic is defined now, though; it’s more that I’ve seen a lot of code that does the same preamble dance to identify whether an op is in that form.
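To make that concrete, here is a minimal sketch of the fused form I mean (the shapes and the add are illustrative, not from the RFC): the transpose and broadcast show up only as projected permutations in the indexing maps of a single all-parallel generic.

```mlir
#transposed = affine_map<(d0, d1) -> (d1, d0)>  // transposed access into %lhs
#bcast      = affine_map<(d0, d1) -> (d1)>      // %bias broadcast along d0
#identity   = affine_map<(d0, d1) -> (d0, d1)>

// "transpose(%lhs) + broadcast(%bias)" folded into one elementwise generic:
%0 = linalg.generic
       {indexing_maps = [#transposed, #bcast, #identity],
        iterator_types = ["parallel", "parallel"]}
       ins(%lhs, %bias : tensor<4x8xf32>, tensor<4xf32>)
       outs(%init : tensor<8x4xf32>) {
^bb0(%a: f32, %b: f32, %out: f32):
  %sum = arith.addf %a, %b : f32
  linalg.yield %sum : f32
} -> tensor<8x4xf32>
```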

Naming: elsewhere, we already have the ElementwiseMappable trait, which captures this semantic. Maybe linalg.elementwise_map for consistency? There is also the ElementwiseToLinalg pass, which presently converts anything with this trait to a linalg.generic. Minimally, that should be updated to build this new op.

Even if they get fused later on, I would like a better representation for transpose specifically, even if just for pure readability at the frontend->linalg level. Same comment as for linalg.map above: it is rare in real programs that this should survive as a standalone op, because generic represents its trivial fusions (and those are what you want to lower), but I think the ergonomic benefit makes it worth it.

I’m not actually sure what you are proposing here. It looks to be a fully static form of HLO’s broadcast_in_dim? As written, I believe it can support dynamic shapes, but it does not allow ambiguity about whether an unknown dimension triggers expansion.
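For reference, the distinction I’m drawing, sketched with the existing MHLO ops (attribute spellings from memory, so treat as approximate):

```mlir
// broadcast_in_dim: which result dimensions the operand maps to is fixed by
// the static broadcast_dimensions attribute, even if extents end up dynamic.
%0 = "mhlo.broadcast_in_dim"(%vec)
       {broadcast_dimensions = dense<[1]> : tensor<1xi64>}
       : (tensor<4xf32>) -> tensor<3x4xf32>

// dynamic_broadcast_in_dim: the result shape is an SSA value, so whether a
// size-1 dimension actually expands is only known at runtime.
%1 = "mhlo.dynamic_broadcast_in_dim"(%vec, %shape)
       {broadcast_dimensions = dense<[1]> : tensor<1xi64>}
       : (tensor<?xf32>, tensor<2xindex>) -> tensor<?x?xf32>
```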

I’m going to leave this to the better qualified.

I think this is only partially true: perhaps at the mid-level we are over-pattern-matching, but there are a lot of cases (even library-call cases) where generic is the most direct match. Yes, you have to pattern match cases out of it, but the associations don’t exist otherwise. Especially at the higher levels, though, less information loss is better, so I can get behind the “missing layer of ops” analysis. Ideally, I would always be able to do access-pattern fusion and get into generic form for some code-generation or lowering flows, but your analysis is that we don’t need to start there. (I think I’m just restating what you said in my own words, so please tell me if we are saying something different.)

It might be a matter of more clearly defining what the demarcation of Linalg is. At the outset (also biased by my experience with Linalg), these ops seem to be describing something that sits “above” Linalg.
IMO, a Linalg op is nothing but a perfectly nested loop nest, where the operands, indexing maps and iterator types give you information about the access pattern within the loop body, while treating the body of the loop as a black box. Anything that is trying to look into the region of the operation to re-discover some semantics of the computation is inherently raising. So that has always been a bit of a red flag for me.

I can’t speak for TF pipelines, but in IREE I have tried to actively avoid this. I’d argue that if it is indeed necessary to look at the region of a linalg.generic to know what it is doing, the transformation is being implemented at the wrong layer. Either MHLO, or some layer between Linalg and MHLO, is where such transformations should live. MLIR lets you write analyses/transformations on dialects that capture the needed information directly, without having to do complex lifting.

I am not sure readability is as much of a concern for the Linalg dialect (at least the way I think of it). It is more of a front end for code generation, so it seems reasonable that someone working at this level needs to invest more time to understand an operation.

Coming to the specific ops here:

  • The proposed map operation seems to be nothing more than a linalg.generic with all iterator types parallel and all indexing maps identities (see the sketch after this list). This could just be a utility function to use as needed.
  • The iota and transpose operations also seem like very specific use cases. What’s the difference between these ops and the MHLO versions of them? This seems like a “lateral translation” rather than a lowering.
  • A static broadcast seems even more problematic: all Linalg ops work with dynamic dimensions. (Side note: I continue to believe that “broadcast_in_dim” and “dynamic_broadcast_in_dim” are legacy constructs that came about before sufficiently powerful compiler technology existed to better model the required computation. Handling broadcasting behavior in front ends and codegen has more discussion on this, though it is itself a bit dated. In IREE, for example, we explicitly avoid lowering to these constructs to begin with.)
  • Collapse/Expand shape are not Linalg operations at all. Linalg operations can be used to implement these ops, but that is IMO strictly a pessimization. These operations serve a slightly different purpose and are mostly meant to capture changes in tensor shape (they need to have copy semantics at the SSA level, but are mostly expected to lower to the memref versions of expand/collapse shape, which have no copy semantics after bufferization).
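To illustrate the map bullet: the proposed op for, say, an elementwise add is already spellable today as an all-parallel, identity-map generic (shapes illustrative):

```mlir
#id = affine_map<(d0, d1) -> (d0, d1)>
%sum = linalg.generic
         {indexing_maps = [#id, #id, #id],
          iterator_types = ["parallel", "parallel"]}
         ins(%a, %b : tensor<8x16xf32>, tensor<8x16xf32>)
         outs(%init : tensor<8x16xf32>) {
^bb0(%x: f32, %y: f32, %out: f32):
  %r = arith.addf %x, %y : f32
  linalg.yield %r : f32
} -> tensor<8x16xf32>
```

This is also essentially what the ElementwiseToLinalg pass emits for ElementwiseMappable ops today, so a builder utility that produces this form would give the same convenience without a new op.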

To summarize, I feel like Linalg is not the right level for these operations. We can have a separate discussion about where these ops should go (in MHLO, in some “front-end” dialect in core, etc.). A lot of work is in progress to move things that don’t really belong in Linalg out of the dialect, and this seems to be going against that trend.


Exactly :slight_smile: tHLO is the right level :wink:

Leaving aside where they live, I think we’ve learned some things from “the linalg journey” that are influencing the desire for these higher-level ops, but I can also buy that they are neither “Linalg” nor “MHLO/TOSA/etc.” but somewhere in between. Perhaps we were just a couple of years too early for the “TCP” (tensor compute primitives) idea.

The “elements of Linalg” that I think these ops are drawing from are:

  1. Destination passing style as a gateway to progressive bufferization, memory planning and grouping heuristics: We are already factoring this out as a dedicated concept, and even though the starting point is just to disaggregate in-situ within Linalg, I expect that this will end up being a first-order thing we support at this mid-level as we exit from strict value-based semantics (see the sketch after this list).
  2. Default lowerings to buffers/loops: Having the dumbest possible reference lowerings has been a big boon for multiple projects. Torch-MLIR relies on this exclusively for its reference backend, which is used to gate all changes.
  3. Ranked with explicit extents: We just can’t relitigate this. For the subset of the universe that is ranked, having extents modeled explicitly in the IR is a requirement. This composes well with destination passing style.
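As a rough sketch of how #1 and #3 compose (names and shapes are illustrative; older versions spell tensor.empty as linalg.init_tensor): the destination operand carries the result type, its dynamic extents are explicit SSA values, and the op writes into it and returns the result as a value.

```mlir
%c0 = arith.constant 0 : index
%c1 = arith.constant 1 : index
// Explicit extents for the destination of a transpose-like op.
%d0 = tensor.dim %arg, %c0 : tensor<?x?xf32>
%d1 = tensor.dim %arg, %c1 : tensor<?x?xf32>
%init = tensor.empty(%d1, %d0) : tensor<?x?xf32>
// Destination passing style: %init is the destination, %t the SSA result.
%t = linalg.generic
       {indexing_maps = [affine_map<(d0, d1) -> (d1, d0)>,
                         affine_map<(d0, d1) -> (d0, d1)>],
        iterator_types = ["parallel", "parallel"]}
       ins(%arg : tensor<?x?xf32>)
       outs(%init : tensor<?x?xf32>) {
^bb0(%in: f32, %out: f32):
  linalg.yield %in : f32
} -> tensor<?x?xf32>
```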

I think what hasn’t worked so well in Linalg for a generic frontend, mid-level dialect is:

  • There is a desire for more “algebra centrism” in the opset. I struggle with the right categorization for this, but I’ve heard it described as “I want something I can write grappler-style algebra passes on”, “I want ops that are representative of what the user programmed, not what we are lowering to”, and (tongue in cheek) “why do I need to know about affine and loop structures to spell ‘transpose’?”.
  • Higher-level fusions: these tend to rely on grouping and are typically heuristics over a vocabulary of domain ops. I am separating this from the lower-level fusions that codegen needs, which Linalg represents quite well.

While plenty of people are lowering frontends to Linalg, covering pretty much the whole space of ML frameworks at this point, what I’m hearing is that it is codegen-biased and not something people want to be doing high-level transformations on directly.

If we had it to do over again, with the benefit of hindsight with respect to DPS/bufferization/dynamic shapes, I would have advocated that most of these higher-level “algebra” ops should be in tensor, or possibly some specific dialect, but follow conventions #1-3 above. We’re pattern matching on linalg because it does satisfy those requirements/conventions while nothing else does. None of MHLO/TOSA/ATen models things at that level, and for real compilers those conventions are really useful. We could just (either in or out of linalg) define these ops to those conventions and make it all compose together. I think the key is those three conventions defining a specific level in the IR as you exit frontend/value/dataflow and enter codegen.

This doesn’t sound to me like “MHLO” (as in the dialect) anymore than it sounds like “TOSA” or “ATen”.


Lol, I completely missed the “t” there and was like “but these aren’t HLO at all!”. Time to increase my browser zoom again, I guess… or we work on the name :slight_smile:

In all seriousness, if you have found your way to a useful intermediate level that follows the principles I laid out above and is better for lowering from frontends, then I think we want that – and if you are offering, it would be great if it wasn’t project specific.

@pifon2a @stellaraccident What is tHLO? A quick search did not turn up anything.

We (at Cruise) need something like this as well, but we were thinking of proposing a dialect that layers above Linalg instead of extending it. It would also contain operations like “sort” that perhaps don’t mesh well with Linalg, but do show up in our use cases.

CC @raghavanr


Good question: I’ve not actually seen it yet, just had it described (and didn’t find it the couple of times I went looking). One of the reasons I am supportive of this RFC (in the abstract) is that we all have a common need for the things that I’ve heard described as “tHLO” – and hearing from you adds more weight to that assessment. I’d rather see that in the commons.

I’d be supportive of a new dialect like you suggest, especially if we could identify some of the design principles (i.e. I took a stab at that with my 3 points above). Did you all have a sketch of what you are thinking?


I think that would be very interesting to see and discuss. How far along is this work? Something that could be discussed at ODM?

Did you all have a sketch of what you are thinking?
How far along is this work?

We don’t have a Cruise-internal implementation at the moment. @raghavanr will start a thread in the next week or so to get the ball rolling, and yes, an in-person discussion at a subsequent ODM sounds like a great idea!

I took a stab at that with my 3 points above

Regarding #3, have you seen issues with operations like {torch|tf}.squeeze? These turn shape dynamism into rank dynamism.

@jpienaar @sanjoyd @stellaraccident I can prepare something for the ODM.

At the moment, we are working on set-based code generation. The presentation for it will be at an ODM at the end of August/beginning of September, but the code and tests can already be found in TensorFlow/MLIR_HLO.

There is a gml_st_extension_ops.td file, which at the moment contains DynamicBroadcastInDimOp, ConcatOp, ScatterOp, GatherOp, i.e. all the ops that cannot be modeled with Linalg but can still be tiled/fused, etc. Briefly, the idea is to move these ops to a tHLO (tileable HLO) dialect which will contain destination-style ops with tiling/fusion interfaces attached to them. Also, I thought, why not extend this set of ops with thlo.map, thlo.reduce, …, i.e. everything I am talking about in this RFC.

Then I talked to @nicolasvasilache, who was interested in adding it to the Linalg dialect, especially since ScatterOp is planned to be moved from the linalg_ext dialect in IREE to MLIR core. Originally, I wanted a separate dialect, and maybe that is the right thing to do.

@sanjoyd Do you use HLO at Cruise? If so, would it be convenient to have this new dialect in the OpenXLA repo?

I think those are two orthogonal questions; e.g., these ops may make sense independently of whether HLO is used.

We don’t use HLO or XLA at this time.

We also prefer open governance so having something under LLVM/MLIR is ideal.


tHLO will have some HLO-specific things, like the attributes of ScatterOp or GatherOp. The rest of the ops, e.g. map, reduce and such, make sense beyond HLO.

We (IREE) also strongly prefer this. I think doing in-situ early work in the various downstreams is a good way to get going and get the problem characterized, but I would like to see us rendezvous on an upstream dialect.

I think we should do more work on these, and in principle I don’t see them in violation of where I was going with #3. I don’t have an immediate answer for how to compose them, but they do represent a relatively limited form of rank dynamism that, while bespoke, is quite a bit different from a fully unranked mechanism. It would be nice if, however we specified them, they stayed somewhat structured rather than fully unstructured. But also, if they were the primary exceptions to the rule while waiting for more work to characterize them better, I don’t think that would be the end of the world.

IREE has some of these in its compilation stack: ops that are not necessarily within the StructuredOps purview but can be transformed in similar ways. It has Scatter, Sort and others.

As recently as today, I submitted a change to IREE to move all these operations to implement the TilingInterface. The current implementation only parallelizes the “parallel” dimensions, i.e. the batch dimensions of sort, etc. So these can be tiled and fused with other operations that implement the TilingInterface using the patterns here. A lot of these ops can simply be copied over to core once we decide on a place to land them.

Edit: All operations in the above linked file work with tiling, bufferization and lowering to loops. They are already in destination passing style. They are tested e2e within IREE. There might be other transformations that can be implemented on these ops (like an algorithm to use merge sort for an efficient sort implementation, which might fit into a more general tile + distribute along reduction dimensions), but they fit within the broader structured-op-based lowering.


From an implementation perspective: all the ops in the original post implement LinalgOpInterface, so all existing code that operates on LinalgOps works out of the box. (I rarely see code that operates on GenericOp.) If I understand correctly, this change is essentially a bunch of new ops in a .td file, with little or no new C++ code/refactorings. So we would get a better abstraction with little implementation effort. From that perspective, +1.

Agreeing on specific operation semantics might be hard (because different projects use different infeed dialects that they might want to maintain when transforming to destination passing style or under tiling). Sharing interfaces upstream might be the better choice in this instance.

For the operations mentioned here, which just add a layer above what is expressible with linalg today, adding them upstream makes sense.

Which they are fine to do. I don’t think this would be a requirement that everyone use them or that everything automatically be decomposed to them, same as lowering to Linalg or Affine or SCF today vs. retaining the op. Especially if they are usable in different input compilation flows (e.g., TOSA or MHLO), this could make for a more codegen-focused set to complement those (not that I’m making that an extra criterion on top of the regular dialect criteria).

Makes sense. I suspect constraining torch.squeeze to be statically resolvable would work for us (e.g. perhaps we only allow the squeeze variant with an explicit dim that points to a static dim).
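A sketch of that constraint (torch-mlir-style spelling from memory, so treat as approximate): squeezing all unit dims of a tensor with unknown extents loses the static rank, whereas the explicit-dim variant on a statically-unit dimension stays ranked.

```mlir
// Rank of the result depends on how many runtime extents equal 1.
%r0 = torch.aten.squeeze %t0 : !torch.vtensor<[?,?,3],f32> -> !torch.vtensor<*,f32>

// Squeezing an explicit, statically-unit dimension keeps the result ranked.
%int1 = torch.constant.int 1
%r1 = torch.aten.squeeze.dim %t1, %int1
        : !torch.vtensor<[?,1,3],f32>, !torch.int -> !torch.vtensor<[?,3],f32>
```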