[RFC] Primitive Ops: add BroadcastOp to Linalg

Context

[RFC] Primitive Ops: add MapOp, ReductionOp, TransposeOp, BroadcastOp to Linalg proposed adding new ops to the Linalg dialect. There were still open questions about what BroadcastOp should look like. This document goes into concrete detail about how BroadcastOp can be implemented.

Design

linalg.broadcast is a static broadcast operation: there is no ambiguity at compile time about shape information.

Representation

The new op will have the following representation:

%bcast = linalg.broadcast
    ins(%input:tensor<16xf32>)
    inits(%init:tensor<16x64xf32>)
    dimensions = [0]
    optional-attrs

Each index in the dimensions attribute maps an input dimension to the corresponding target dimension. The length of the dimensions list must match the input rank.
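
For example, a rank-2 input could be broadcast into a rank-3 init as follows (a sketch using the proposed syntax; the shapes and SSA names are illustrative):

%bcast2 = linalg.broadcast
    ins(%input2:tensor<16x64xf32>)
    inits(%init2:tensor<16x32x64xf32>)
    dimensions = [0, 2]

Here input dimension 0 maps to output dimension 0 and input dimension 1 maps to output dimension 2; output dimension 1 (size 32) is the broadcasted dimension.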

Semantics

Dimensions attribute should be sorted

This means that linalg.broadcast does not support an implicit transpose of the input. Broadcast ops from other dialects that do support unsorted dimensions can be canonicalized by inserting an additional linalg.transpose and then lowered to linalg.broadcast, as sketched below.
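
For illustration, a hypothetical broadcast with unsorted dimensions [2, 0] (input dim 0 to output dim 2, input dim 1 to output dim 0) could be rewritten as a linalg.transpose followed by a linalg.broadcast with sorted dimensions. This sketch assumes the linalg.transpose syntax from the companion primitive-ops RFC; %transpose_init is a hypothetical init tensor:

%transposed = linalg.transpose
    ins(%input:tensor<16x32xf32>)
    inits(%transpose_init:tensor<32x16xf32>)
    permutation = [1, 0]
%bcast = linalg.broadcast
    ins(%transposed:tensor<32x16xf32>)
    inits(%init:tensor<32x64x16xf32>)
    dimensions = [0, 2]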

No size-1 expansion

In linalg.broadcast, all mapped dimensions must match in size between the input and the init.

The following IR is illegal:

%illegal_bcast = linalg.broadcast
    ins(%input:tensor<16x1xf32>)
    inits(%init:tensor<16x32x64xf32>)
    dimensions = [0, 1]

Broadcast ops from other dialects can be canonicalized by removing size-1 dimensions with tensor.collapse_shape. The previous example can then be rewritten into an equivalent form without size-1 expansions:

%shaped_input = tensor.collapse_shape %input [[0, 1]] : tensor<16x1xf32> into tensor<16xf32>
%bcast = linalg.broadcast
    ins(%shaped_input:tensor<16xf32>)
    inits(%init:tensor<16x32x64xf32>)
    dimensions = [0]

It is permitted for mapped dimensions to have size 1 in both the input and the init. In that case, tensor.collapse_shape is not required, which avoids unnecessary reshapes; see the example below.
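
For example, the following is legal even though the mapped dimension 0 has size 1 in both the input and the init (a sketch using the proposed syntax; shapes are illustrative):

%bcast_keep1 = linalg.broadcast
    ins(%input:tensor<1x16xf32>)
    inits(%init:tensor<1x32x16xf32>)
    dimensions = [0, 2]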

@stellaraccident, @nicolasvasilache, @MaheshRavishankar, @herhut, @frgossen, @akuegel, @mehdi_amini, @matthias-springer, @okkwon

There seems to be a lot of redundancy between this class of operations and the operations being proposed with the TCP dialect. If TCP has to make its way upstream eventually, then are these ops even necessary?


cc @sanjoyd and @raghavanr

This design matches my priors of what a broadcast should look like for codegen and mirrors my comments in the TCP ODM about orthogonality and avoiding redundancy.

In particular, a broadcast as represented here translates trivially to a generic op and therefore is a good abstraction for tiling, is easily fusible into a larger generic op, lowers to loops and bufferizes. All this out of the box without any effort.

This can be viewed as useful syntactic sugar that helps target and implement transformations (i.e. it is a generic without iterator_types and maps).
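
For reference, the broadcast from the representation section corresponds to roughly the following linalg.generic (a hand-written sketch, not taken from the RFC; map names are illustrative):

#bcast_in  = affine_map<(d0, d1) -> (d0)>
#bcast_out = affine_map<(d0, d1) -> (d0, d1)>
%bcast = linalg.generic
    {indexing_maps = [#bcast_in, #bcast_out],
     iterator_types = ["parallel", "parallel"]}
    ins(%input : tensor<16xf32>)
    outs(%init : tensor<16x64xf32>) {
  ^bb0(%in: f32, %out: f32):
    linalg.yield %in : f32
} -> tensor<16x64xf32>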

+1 from me.

I agree that this is a better design for the broadcast op itself, but it is redundant w.r.t. TCP. If there is broad agreement that TCP is working towards being in core, then that end state means this is redundant with TCP. That does not seem desirable to me. Also, this syntactic sugar is not giving much: a linalg.generic would serve (and does serve) equally well for all the backend use cases.

If it matters, i’m -1 on adding such ops.

Edit: FTR, more than anything, I am registering my protest against such ops since IMO they don’t serve a purpose. I will not block any patches related to this, but I am raising my concern based on having interacted with people outside of MLIR who have a very tough time navigating the different dialects in MLIR. Of course it is not a blocker, since there are ways to avoid using these ops if they don’t fit a particular use case.

Yes, this is mostly syntactic sugar, but from my subjective view it’s easier to parse than indexing maps in linalg.generic. That is probably because I have limited experience with reading linalg IR, but the same is true for people who try to navigate MLIR dialects.

As for redundancy, the original design of TCP says that

TCP will be conceptually layered above Linalg.

so I’m not sure that it’s fair to compare linalg.broadcast and the (upcoming) tcp.broadcast. The only question I can see is whether tcp.broadcast should be lowered to linalg.generic or linalg.broadcast, but that we can leave to pass authors.

This is also the essence of my comment, despite having experience reading linalg IR: usability improvements are really important (for instance, I haven’t seen anyone prefer to use a generic version of linalg.fill).

It is crucial however to write transformations to work on the generic form, and to “just work” with both the named and generic variant.

I agree with @MaheshRavishankar here. An op like this should go into TCP and not into Linalg. It can be lowered from TCP into whatever abstraction in Linalg. When there was no proposal for TCP on the table and nothing upstream in that space, it would have been fine to add such named ops to linalg. But that isn’t the case now.

Thank you for the feedback @MaheshRavishankar and @bondhugula. I can see that there is some redundancy with the linalg.generic operation, but I would object to the claim that this operation should be part of tcp rather than linalg. Let me try to provide some context to clarify why we believe linalg.broadcast is useful (and who this “we” is).

We have been building out an mlir-based code generation pipeline for Google’s ML compiler product XLA. This pipeline is built around the same principles as linalg, which we commonly refer to as structured code generation. We have been working on this in our separate GitHub repository to solidify the design before suggesting upstream changes. First results of this work upstream have been our changes to split and clean up interfaces (like LinalgStructuredInterface and DestinationStyleOpInterface).

Our pipeline starts with StableHLO as our unified input dialect. We then lower this to mhlo, which is roughly equivalent to tcp in level of abstraction. We perform all graph-level optimizations at the mhlo dialect level. In particular, we perform most hardware/target agnostic transformations at this level. To my understanding, this is the level that tcp aims to model and support.

Using mhlo directly in structured code generation has its issues. For example, we rely on operations to be in destination passing style to support bufferization and we expect operations to support tensors and memrefs as operands. Neither of these properties are available at the mhlo level and I suspect they won’t be at the tcp level. I have explicitly asked about destination passing style and tcp before.

Hence, we have created a dialect called thlo, which is short for tilable-hlo. Operations in that dialect support destination passing style, implement the linalg structured interface and a variant of the tiling interface. Every operation in mhlo that has no upstream equivalent in linalg is modeled in thlo. This includes the broadcast operation that we propose to upstream in this RFC.

As an aside, we use a different tiling interface as we have taken a slightly different approach to model fusion and loop structures than upstream. We introduced gml_st.parallel as an evolution on linalg.tiled_loop and scf.parallel and you can find the details in our GitHub repo. We plan to submit an RFC shortly to suggest an evolution to the upstream tiling interface that enables supporting our (and other) approaches additionally to what is done in linalg upstream. We are also open to upstreaming the loop structures themselves but feel we need a little bit more time to solidify their design further.

Looking at our code generation pipeline, linalg.broadcast is not the same as broadcasting in mhlo and I suspect it will be sufficiently different from what tcp will have to offer. So from our perspective, it makes sense to have this operation. We aim to keep thlo as minimal as possible and upstream all operations that have fairly generic semantics. However, if the community would prefer to wait for tcp to mature, we can also move broadcast back downstream.

As a final note, we initially used linalg.generic extensively instead of more customized operations but found the latter much easier to operate with when debugging issues with larger graphs (i.e. reading lots of IR). So we decided to prioritize readability over generality. This has not had a big negative impact on the code structure, as we are using the LinalgStructuredInterface so that we can still uniformly reason about these operations.

We shared our design a few weeks back (link), can you PTAL and let us know if it works? I believe mhlo.broadcast should be equivalent to tensor.expand_shape + tcp.broadcast.

Thanks for sharing your design!

Yes, that’s true, but it’s not related to this RFC. mhlo and tcp are similar in their level of abstraction, so it’s likely they can be trivially converted.

This RFC is about linalg. As the original tcp design clearly states, tcp doesn’t aim to replace linalg, because tcp is at a different level. If I read the comments correctly, tcp will not have destination passing style and buffer semantics; tcp will lower to linalg for such things.

This RFC is about a better abstraction in linalg to lower from tcp or mhlo.

Right now the only option is to lower to linalg.generic. We find IR with many generics hard to read and debug, so we propose something that is nicer to use and easier to read. Since linalg.generic and linalg.broadcast share the same interface, they can be trivially exchanged, and most of the existing code for linalg.generic will just work on linalg.broadcast.

I’m worried that this clear separation will be harder to maintain if linalg & tcp overlap in their levels of abstraction.

If readability is the major concern, have you looked into leveraging pretty-printing & parsing while keeping the core data structures & IR representations the same?

This is precisely what is done: add a custom pretty-printer, parser and nicer verification messages and attach that to a new op name. Analyses and transformations are automatically available by virtue of the op implementing the DestinationStyleOpInterface and LinalgStructuredInterface interfaces.

It feels hard to get an even more modular or composable design, but if you have ideas for additional improvements (e.g. to specify even less parser/printer/verifier code), please share :slight_smile:

Thanks Stephan! Reading through this though, I am trying to find the link between the project evolution here and the need to add the linalg.broadcast operation.

I think these are the relevant parts… I will buy that this is syntactic sugar over linalg.generic. I think what Sanjoy suggested was basically using a recognizer in the parsing/printing of linalg.generic that “recognizes” a broadcast and prints it that way. I am not saying I approve of this approach, but it solves the syntactic sugar part of the equation.

Another thing that is of concern to me is that these ops are not defined using the Python OpDSL (I recently found this out when I was trying to triage a bug). The map, reduce, transpose (and now broadcast) ops are all added outside of the python → yaml → tablegen definition of named ops in Linalg. A significant effort was put in to move all named ops in Linalg to the python-based OpDSL, to the extent that only linalg.generic is defined in tablegen. Now there are more ops added. It’s a break in “development practices” for Linalg that warrants some attention. When do we use the python OpDSL, and when do we define individual ops this way? I am not proposing we change any plans, but I am urging that we establish some guidelines w.r.t. when we use the python OpDSL and when we define an op directly in Linalg.

This was more in response to @bondhugula and, looking at recent responses, @sanjoyd. The point I was trying to make was that linalg.broadcast serves a different purpose at a different layer in our stack than mhlo.broadcast. This is not because they have different semantics (that is not the point) but because we use them for different things. This would not change even if we were to switch to tcp. I hoped that giving more background on what our pipeline looks like would make this clearer.

This is something for @nicolasvasilache to comment on. I am not very opinionated as to how operations are defined as long as it works and is reasonably well documented. thlo, where this operation originated, uses tablegen-based definitions, and fewer ways to do things is better. I see the benefit of OpDSL for the different flavors of convolutions though.

We use the python DSL wherever possible.

However there is no magic bullet to specify the mapping between attributes+op semantics to an op name: this does require a bit of custom C++.

If/when we can extend OpDSL to have more powerful attribute specification behavior, we should switch (and it will be low cost).

I think that the goal here of having linalg-level IR be easier to read is definitely worthwhile.

It might be worth exploring some sort of hardcoded logic in linalg.generic’s pretty printer to print out broadcasts “nicely”, but as Nicolas points out, this is all interface-driven anyway, so using a new op is really just shorthand for a pretty printer. So I’m generally supportive of the new op for this.

TCP and MHLO are red herrings in this discussion – as Stephan points out, they don’t exist at the point in the pipeline where we want this improvement to readability.

Just so I understand correctly, would this op need to literally be converted to linalg.generic in order for our existing linalg-on-tensors passes to merge it with other parallel-only generics (what we could consider “elementwise-ish fusion”)? Or will it “just fuse” naturally with elementwise linalg.generic’s?

Yes, if you go down the element-wise fusion path (which our pipeline does not), you will need to convert these operations to a linalg.generic. Note that when coming from mhlo, where these operations are currently used, you still have the option to use the old path and directly go to linalg.generic to ensure that adding these operations is not a breaking change.