[RFC] Primitive Ops: add MapOp, ReductionOp, TransposeOp, BroadcastOp to Linalg

@jpienaar @sanjoyd @stellaraccident I can prepare smth for the ODM.

At the moment, we are working on set-based code generation. The presentation for it will at ODM in the end of August/beginning of September, but the code and tests can be already found in TensorFlow/MLIR_HLO.

There is gml_st_extension_ops.td file, which at the moment contains DynamicBroadcastInDimOp, ConcatOp, ScatterOp, GatherOp i.e. all ops that cannot be modeled with Linalg, but still can be tiled/fused, etc. Briefly, the idea is to move these ops to a tHLO (tileable HLO) dialect which will contains destination-style ops with Tiling/Fusion interfaces attached to them. Also, I thought, why not extend this set of ops with thlo.map, thlo.reduce, …, i.e. everything I am talking about in this RFC.

Then I talked to @nicolasvasilache who was interested in adding it to Linalg dialect, especially, since ScatterOp is planned to be moved from linalg_ext dialect in IREE to MLIR Core. Originally, I wanted a separate dialect. And maybe it’s the right thing to do.

@sanjoyd Do you use HLO at Cruise? If yes, then would it be convenient to have this new dialect in OpenXLA repo?

I think that’s two orthogonal questions, e.g., these ops may make sense independent of HLO or not.

We don’t use HLO or XLA at this time.

We also prefer open governance so having something under LLVM/MLIR is ideal.

2 Likes

tHLO will have some hlo-specific things, like different attributes that ScatterOp or GatherOp. The rest of the ops make sense not only for hlo, e.g. map, reduce and such.

We (IREE) also strongly prefer this. I think doing in-situ early work in the various downstreams is a good way to get going and get the problem characterized, but I would like to see us rendezvous on an upstream dialect.

I think we should do more work on these, and in principle I don’t see them in violation with where I was going on #3. I don’t have an immediate answer to how to compose them, but they do represent a relatively limited form of rank dynamism that, while bespoke, is quite a bit different from a fully unranked mechanism. It would be nice if however we specified them, it was structured somehow still vs fully unstructured. But also, if they were the primary exceptions to the rule while waiting for more work to better characterize them, I don’t think that would be the end of the world.

IREE has some of these that we added to our compilation stack that are ops that are not necessarily within the StructuredOps perview but can be transformed in similar ways. It has Scatter, Sort and others.

As recently as today, I submitted a change to IREE to move all these operations to implement the TilingInterface. The current implementation of these only parallelize the “parallel” dimension, i.e batch dimensions of sort, etc. So these can be tiled and fused with other operations that implement the TilingInterface using the patterns here . So a lot of these ops can simply be copied over to core once we decide on a place to land these operations.

Edit : All operations in the above linked file work with tiling, bufferization and lowering to loops. They are already in destination passing style. They are tested e2e within IREE. There might be different transformations that can be implemented on these ops (like an algorithm to use merge sort for efficient sort implementation which might fit into a more general tile +distribute along reduction dimensions) but they fit within the broader structured op based lowering

1 Like

From an implementation perspective: All the ops in the original post implement LinalgOpInterface, so all existing code that operates on LinalgOps works out of the box. (I rarely see code that operates on GenericOp.) If I understand correctly, this change is essentially a bunch of new ops in a .td file, no/little new C++ code/refactorings/… So we would get better abstraction with little implementation effort. From that perspective +1

Agreeing on specific operation semantics might be hard (because different projects use different infeed dialects that they might want to maintain when transforming to destination passing style or under tiling). Sharing interfaces upstream might be the better choice in this instance.

For the operations mentioned here, which just add a layer above what is expressible with linalg today, adding them upstream makes sense.

Which they are fine to do. I don’t think this would be a requirement that all would use it and everything would automatically be decomposed to these. Same as lowering to LinAlg or Affine or SCF today vs retaining the op. Especially if usable in different input compilation flows (e.g., TOSA or MHLO), this could make for more codegen focussed set to complement those (not that I’m making that an extra criteria on the regular dialect criteria already).

Makes sense. I suspect constraining torch.squeeze to be statically resolvable would work for us (e.g. perhaps we only allow the squeeze variant with an explicit dim that points to a static dim).

I know there is a fine line between having the user and compiler-author hat on, but with the former, I can concretely say that every time I’ve short-cutted and used torch.squeeze (or its equiv) yolo-style without the dims I’m squeezing – I have eventually regretted my life choices. Code gets copied, the context gets lost and then you find that you are crashing or getting wrong results on wrong shape expectations. I think there is a path to supporting it in more generality, but also, I would never approve patches to live ML models using the more generic form because it is so bug prone.

Adding my two cents.

I’d like to give big thumbs up for the direction described here.

A big question here for me is whether to use a high level dialect or extending Linalg.

Extending Linalg raises a question like what kind of specific instructions we want to support as a named op. The information provided by the proposed ops can be easily extracted from a high level ops. Does your compiler stack provide a dialect between the input and Linalg?

In IREE, there is no common high level dialect between the input and Linalg, so the proposed ops will be very useful if we want to do a specific pattern matching instead of recovering info from the generic ops. But Mahesh said above this is not a recommended approach in Linalg; solving the problems in a unified way is a goal of Linalg, and it puts some restrictions in how we implement in Linalg.

Having the broadcast and transpose op is great also because the ops are target independent; some target HWs directly support those operations and having them separately makes propagation of the info in the graph and codegen easy.

Drawing a fine line in Linalg might be a difficult task because each customer may have different criteria about how specific is enough for their need. One extreme case might be supporting a super set of mHLO and TOSA or picking one from a bigger set by Linalg. (I am talking about an extreme case here because the case would be simple to understand and easy to use.) Even without going extreme, having the proposed ops puts two semantic layers of Linalg. I am not clear whether supporting two abstractions in a signle dialect is a desired outcome because it seems to violate KISS.

1 Like

MapOp
Iota
Transpose
Reduction
Broadcast

+1, easier to read and all occur often enought to make the shorthand worth the effort

Reshape

It is an interesting question whether we need a dst-style linalg.reshape op at all or we can live with tensor.[expand_shape, collapse_shape, reshape]. Usually, linalg.reshape does not allocate during bufferization and just becomes a memref descriptor operation, so I am not sure how important it is to have it in dst-style.
Also it is quite different from the ops above w.r.t. to tiling and fusion, since reshapes are only tileable to size-1 tiles, i.e. to point level.

-1. It seems like there are insufficient tools/abstractions in linalg right now to do this right now. All the other proposed ops are basically shorthand for something that can be represented as a linalg.generic. This definitely isn’t true for the reshape case. The most basic restriction is currently you can’t use dimension size symbols in the indexing maps, which rules out a bunch of dynamic shape cases

It is possible to tile a reshape with arbitrary tile size, depending on what exactly you mean by “tiling”. If we take it to mean that you can replace op -> tensor.extract_slice with some IR that depends only on the slice parameters and the inputs to op, then you can implement TilingInterface for arbitrary tiles for these ops as well. See for example my recent RFC on extracting a slice from tensor.collapse_shape. This sort of transform requires another construct to deal with the discontinuous iteration space (in the RFC I propose a loop).

However trying to embed that functionality in linalg via a linalg.reshape breaks some nice properties. The linalg.generic operation can be tiled such that the tile itself is another linalg.generic. You can combine/fuse linalg.generic into new linalg.generic. All that breaks if you throw in these reshape operations, because you are lacking something else which needs to go in first (gather? a new type of view?).

It seems like a much longer-term objective and outside the scope of the rest of the proposal.

1 Like

I’m not aware of a real program where the user’s programming intent is for a squeeze op to produce more than a single rank dynamically. Our experience in Torch-MLIR is that usually after enough simplifications we discover the user’s intent and the program is in fact fully ranked.

The same applies for eliminating the dynamism of “dynamic size-1 broadcasting”. On the Torch-MLIR side we have not encountered any program in practice where the user in fact intended to have it be decided dynamically whether a broadcast in fact needs to happen. (we emit a runtime guard to check that it doesn’t and then proceed to assume that it doesn’t – we have not had a report of such a guard being hit by any program).

1 Like

This would be very valuable data. Do you have any stats to underscore this? E.g., perhaps in benchmark set X and Y, N of these ops occur and none occurred. There is of course a question between what random researcher wrote (and what they intended) vs a benchmark or deployed model, where the former I can’t of an easy way to measure. This also fits in with what GitHub - google-research/dex-lang: Research language for array processing in the Haskell/ML family authors have said. I recall having to deal with squeeze in TFlite converter but can’t recall exact usage/if a higher level pattern would have removed the need.

It is going to be very hard to get stats on this (but would be super helpful corroborating evidence of it exists). In my experience, the reason you don’t see this in real programs is because if it actually produces rank variance, it is a bug in the computation (absolutes are hard but I can’t think of a legitimate reason for true rank variance here). Since it is a shortcut in order to be more concise, this results in two situations: either the program is self consistent and not rank variant in reality or the programmer hasn’t happened to have run it with an input pattern that makes it blow up yet.

In the first case, there may or may not be enough information in the surrounding context to prove that it is not rank variant, and that is the tricky part. These cases tend to happen at the boundaries or in toy programs. In the former (ie. Squeeze to return a simplified shape), it is fairly easy to make a compiler that deals with this as a special op, and if you do that, you can also hoist it and see if you can get it to a point that you can infer.

As I said UT though, this is a slippery slope. If designing a numeric processing language, if I had such a convenience op I would require that it could always be statically rank resolvable (or error) as, I believe, that is the only sane semantic. But here, some of these cases are trying to be compatible with less strict “languages”, so someone does need to decide what to do with it. Maybe it is necessary, but building out an entire arm of compiler tech to reproduce a program bug seems a shame to me.

Another option might be to look at how it is derived from those less strict systems. Since it is a trivial metadata op in any eager executor, maybe something can be done at the jit level to isolate it? I believe there are already such heuristics and it turns a hard compiler problem into a jit speedbump. For full AOT cases, I am almost certain you just want to flag this as an error and reject the program: the resulting damage it will do, beyond a few trivial special cases, to the compilation quality or ergonomics of the result if it can’t be inferred will never be what the user wanted, imo.

Indeed, the only such anecdotes I know of are folks closer to ML researchers who’ve asked others. Even the simpler question of actual usage in benchmarks are interesting, although there may be no such usages as they’ve been optimized away manually (which is a sign too :slight_smile: ).

Yes that’s what I meant with higher level patterns. So one may never actually need the squeeze as the ops behavior individually is.

And what torch-mlir does is one extreme (failing when needed) and another extreme may be falling back to op-by-op execution (sort of like JS execution fallback paths). Of course the latter has a real cost in terms of deployed model and performance. And again if you can afford the worst performance of that extreme you may not be in deployment scenario vs in some part of iterating on model.

As mentioned, the cost of this op is minimal to any op by op executor (is. Twiddle some strides). The real cost of a failure to infer deep within a program is that the optimization potential of what follows goes down a lot if your compilation mode can’t account for that (and in the limit requires the compiler to degrade to op by op execution).

I think these are cases where you don’t want to squash both representations into one. It’s perfectly reasonable to have a higher level op set without such constraints and then make decisions to lower in the vast majority of cases that it is conforming to lower level constraints (or split the program and use a different strategy). I’m just not sure that higher level thing is worth abstracting. Let torch represent that and make the decisions: it is in the position to do so and nothing lower level is (or tf, as the other loose higher level form for these parts).

Thanks @pifon, I’ll mostly repost prior internal comments here for visibility.

Generally, +1 on the proposal: Map / Reduce / Static Broadcast / Transpose / Window (and combinations thereof) “isa” LinalgOp but they should not always be read and manipulated as a linalg.generic.

Syntactic sugar is important to simplify both:

  1. understanding of the IR and
  2. specific targeting of the IR by transformations and patterns.

Instead of trying to infer back the semantics from the generic at every single op.

What we are missing is the ability to write generic logic to define high-level op semantics from attributes + interfaces. I agree the OpDSL granularity is too fine to be a one-size-fits-all atm and that the cosmetics are super important for usability and algebraic rewrites.

Trying to fit more general rank-polymorphism in OpDSL than what exists today is likely not worth the effort IMO. I don’t think we have a better tool than Tablegen for this.

One big aspect is they need to reuse/generalize the existing interfaces we have been developing (e.g. TilingInterface, LinalgOp interface, in-flight DestinationPassingStyleInterface etc). One big red flag would be to have custom C++ logic that depends on op-specific attributes or op-specific knowledge to implement tiling, fusion or lowering to loops: we happily dropped all special cases a long while back.

Beyond that, the spelling of ops is a matter of taste and people have many strong diverging opinions.

The semantics of the ops you propose generally LGTM.

On to other specific points:

Like @cbate and @MaheshRavishankar, I’d keep that out for now as the design space for this op is different.

A DPS version of these ops has no clear advantage to me as they are closer to a tensor.extract_slice.
If you want to allow reshapes that copy, we could also add a first-class copy attribute to tensor.expand_shape/collapse_shape to allow the reshape of non-contiguous stuff.
This would lower to the proper alloc + copies where needed.
This is in principle close to the tensor.pad op and its nofold attribute.

We should try to have that: the named form should be syntactic sugar around linalg.generic.
Unfortunately it is unclear atm how to determine that a linalg.generic is equivalent to some linalg.named_op + some_yet_to_be_determined_attribute_instance.

Let me elaborate a bit:
A simple procedure follows what we have in IREE: StructuredTransformOpsExt.cpp - iree-org/iree - Sourcegraph

The objective is to avoid the “inverse” problem of matching and instead just compare a linalg.generic against what a named op would “generate”.

The difficulty is we don’t know beforehand all the possible attributes of a named op or the ones that will appear in a program. To make it clearer, consider a 4-D permutation; there are 4! possible permutation vectors. Given a generic that may be a permutation, we don’t know which one is the right one and we will need some inference. It gets worse with convs for which we need to find strides and dilations etc and that can’t just be enumerated in a generic fashion.

We could still make progress with an InferrableNamedOpInterface whose default can capture the majority of the easy cases and require the other ops to implement the interface explicitly given only a generic.

To avoid quadratic complexity, there should also be some sort of registry of hashes.

This is the traditional top-down / bottom-up design iteration cycle, linalg.generic is designed in a bottom-up fashion with a focus on codegen and vectorization: it is likely we have reached the point where the next round of top-down investigations are strongly needed for more user-friendly named ops (i.e. the “readable IR” + “ease of matching” points I made a bit earlier).

The top-down design should be informed by the first-principle findings of the codegen work (the principles you list + a few others). The nice thing is that these are captured in the interfaces so as long as the implementation does not try to get too clever with custom C++ from ops + attributes, all should work out of the box, as was shown by OpDSL.

+1: convolutions is one such abstraction where good intentions of generalizations lead straight into the lava pit (I think gather is another :slight_smile: ).

+1

I don’t subscribe to this viewpoint: yes there is some “inference” in finding a matching “name+attributes” from a generic but everything else is literally the same (same operands of the same types, same op granularity and especially same transformations and implementations).

Without drilling deeper into the other discussions that forked off from this thread, I agree with the direction here as it will greatly improve the ergonomics. How this is sliced in the future when a new potential TCP dialect shapes up can be separated and will likely involve a larger set of folks.