Handling broadcasting behavior in front ends and codegen

Several discussions here have touched upon the topic of dynamic broadcasts. I wanted to get some feedback on whether the prevalent broadcasting semantics adds unnecessary complications to the front-end and codegen. Specifically, I am talking about the broadcast semantics where, if one of the dimensions is of size 1, then that dimension is broadcasted. For example:

# Snippet 1:
>>> c = np.add(a, b)
>>> a.shape
(7, 1, 6)
>>> b.shape
(7, 5, 6)
>>> c.shape
(7, 5, 6)

This is effectively c[i][j][k] = a[i][0][k] + b[i][j][k].
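For concreteness, here is a quick numpy check of that equivalence, using the shapes from Snippet 1:

import numpy as np

# Verify that np.add with a size-1 dim behaves like
# c[i][j][k] = a[i][0][k] + b[i][j][k].
a = np.random.rand(7, 1, 6)
b = np.random.rand(7, 5, 6)
c = np.add(a, b)

expected = np.empty_like(b)
for i in range(7):
    for j in range(5):
        for k in range(6):
            expected[i, j, k] = a[i, 0, k] + b[i, j, k]
assert np.allclose(c, expected)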
One way of expressing this in XLA (which for the most part makes broadcasts explicit) is:

// Snippet 2:
%t = xla_hlo.broadcast_in_dim %a 
     {broadcast_dimensions = dense<[0, 1, 2]> : tensor<3xi64>} :
     tensor<?x1x?xf32> -> tensor<?x?x?xf32>
%c = xla_hlo.add %t, %b : (
     tensor<?x?x?xf32>, tensor<?x?x?xf32>) -> tensor<?x?x?xf32>

In the above snippet I left all dimensions, except the broadcasting dimension, dynamic. If that dimension were dynamic as well, i.e. tensor<?x?x?xf32> instead of tensor<?x1x?xf32>, then this falls into the “dynamic broadcast” category. This adds a lot more complexity to code generation, since you effectively need to check at runtime whether the size is 1, which changes the semantics of the operation.

An alternative would be to move away from the special status of size-1 dims and have a purer form of broadcasting that doesn’t rely on it. The above computation could then be rewritten as:

// Snippet 3:
%t = xla_hlo.broadcast %a ... :  tensor<?x?xf32> -> tensor<?x?x?xf32>
%c = xla_hlo.add %t, %b : (
     tensor<?x?x?xf32>, tensor<?x?x?xf32>) -> tensor<?x?x?xf32>

I am using xla_hlo.broadcast here, but its specification actually doesn’t allow this kind of broadcast. It should be possible to generalize it to express such semantics, though. What you would need is a map from the original dimensions to the dimensions in the result that they are broadcasted to. So the above broadcast would be captured using the maps { 0 -> {0, 1}, 1 -> 2 }, which say that dims 0 and 1 of the result get their values from dim 0 of the operand, and that dim 2 of the result gets its value from dim 1 of the operand.

This specification is immune to the dynamic broadcasting behavior that Snippet 2 suffered from. Expressed in Linalg, Snippet 3 becomes:

#accesses =  
  [affine_map<(d0, d1, d2) -> (d0, d2)>,
   affine_map<(d0, d1, d2) -> (d0, d1, d2)>,
   affine_map<(d0, d1, d2) -> (d0, d1, d2)>]
#trait = { ..., 
  indexing_maps = #accesses, ...
}
%c = linalg.generic #trait %a, %b {
     ^bb0(%arg2 : f32, %arg3 : f32, %arg4 : f32):
       %2 = addf %arg2, %arg3 : f32
       linalg.yield %2 : f32
     } : (tensor<?x?xf32>, tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
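Read as a loop nest, the trait above says the following (a Python sketch with made-up static shapes; the real loop bounds come from the operand shapes). The map (d0, d1, d2) -> (d0, d2) means a is indexed only by (i, k), so the j loop broadcasts a without any size-1 dimension appearing in a's shape.

import numpy as np

a = np.random.rand(7, 6)     # plays the role of tensor<?x?xf32>
b = np.random.rand(7, 5, 6)  # plays the role of tensor<?x?x?xf32>
c = np.empty_like(b)
I, J, K = b.shape            # loop bounds come from the operands
for i in range(I):
    for j in range(J):
        for k in range(K):
            c[i, j, k] = a[i, k] + b[i, j, k]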

Apart from the advantage of not having the dynamic broadcasting problem, the above Linalg specification is IMO more canonical. It plays well with Linalg’s approach of determining loop bounds from the shapes of the operands (not going into more detail here). Indeed, I spent a few days prototyping some canonicalizations and folding at the Linalg level to undo the presence of such size-1 dims, since they lead to problems with loop range inference. (Looking back at the original computation to see why this happened in the first place led me to the points highlighted in this post.)

My takeaway from this is that having special semantics for size-1 dims is problematic. It seems to be a complication introduced by abstractions rather than anything fundamental to the computational needs of the end user. From my experience, end users typically end up doing a lot of reshapes to get the shapes in the right form for broadcasting. (This also leads to other problems of having to deal with reshapes in codegen, which deserves a whole other post.) If this is not the case, I’d be happy to hear about use cases that cannot use broadcast semantics similar to Snippet 3.

I have two questions here:

  1. Can we rely on the dialects/conversions at the very top of the stack (like TCP/TCF) to take inputs that might have such size-1-dim-based broadcast semantics and “canonicalize” them away, so that broadcast operations similar to Snippet 3 are used instead? Though I suspect that in the dynamic broadcasting case there is very little that can be done.
  2. I think it would actually benefit end users as well to not have to use size-1-dim-based broadcasting. I would like to hear about more cases where it cannot be expressed using a purer broadcasting form.

Yep, at this point it seems like it is up to frontends to eliminate size-1 broadcasting before it gets to codegen. I think it is pretty safe to say that “tcp.add” will not have size-1 broadcasting. In fact, @stellaraccident is in the process of migrating the broadcasting “add” in XLA into a “client hlo” dialect, with the “xla_hlo.add” op not supporting the size-1 broadcasting behavior. That’s actually the main thing, other than the TCF/TCP/npcomp prototype, that we are working on right now (both are solving the same problems from different, complementary angles: “how do we upgrade the existing XLA infra to support this for real workloads” and “a clean-slate, for-discussion implementation with upstream infra, using lessons learned from the former and not burdened by legacy”).

@stellaraccident and I chatted with @nicolasvasilache and it seems like maybe there is something we can do for codegen to handle the dynamic broadcasting case, but nothing concrete yet (more like “maybe if we absolutely had to do something with it maybe we could” than “oh, there’s a nice clean generalization of what we have”).

I believe we were talking about the stride-0 case :slight_smile:.

Stride-0 has internal aliasing, but with enough canonicalizations + carrying it in the type system (strided memrefs such as (d0, d1, d2)[s0] -> (d0 * s0 + d2)), we could start seeing vector broadcasts appear. Such broadcasts would materialize in memory at a point where it is cheap (e.g. after enough levels of tiling).
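For intuition, this is the same mechanism numpy already uses for its broadcasting views: the broadcast dimension is simply a stride-0 dimension of a view and nothing is materialized (a small illustration only, not tied to the memref type-system point above).

import numpy as np

a = np.random.rand(7, 1, 6)
t = np.broadcast_to(a, (7, 5, 6))  # read-only view, no copy
print(a.strides)                   # (48, 48, 8) for C-contiguous float64
print(t.strides)                   # (48, 0, 8): the size-1 dim became stride 0
assert not t.flags.owndata         # confirms t aliases a's storage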

Re. implicit broadcast, I’ve long been of the position that these rules are far from ideal (see the footnote on p. 10 here, which links to this discussion).

My impression is that in many cases, the implicit broadcast rules creep in due to:

  1. wanting to guarantee that large data does not get materialized in memory. Combined with:
  2. not having a way to express custom computations.

As you point out, this is just c[i][j][k] = a[i][0][k] + b[i][j][k], which is how Linalg would prefer to model it. Modeling it in this “fused” form has an additional advantage: the relationship between the broadcast and the value j is well-defined. j == dim b, 1; a[i][0][k] is accessed for all j (broadcast); and dim c, 1 == dim b, 1.

This is not so well defined in the dynamic case for the snippet:

%t = xla_hlo.broadcast %a ... :  tensor<?x?xf32> -> tensor<?x?x?xf32>
%c = xla_hlo.add %t, %b : (
     tensor<?x?x?xf32>, tensor<?x?x?xf32>) -> tensor<?x?x?xf32>

I.e. what is in the ..., what are the rules under which we can fuse the add through the broadcast, and what does the IR look like after such fusion?

Maybe we can get enough canonicalizations at the HLO → Linalg boundary to make some of these issues disappear in the more expressive Linalg generic form.

Still, if it is possible to avoid creating problems that we shouldn’t want to be solving in the first place, I’m all for it.

Correct, largely because that is the way numpy implements it and we were seeing if the concept maps in any way (see rule 4 here). In the worst case, you’ve got to introduce a runtime bit for every could-be-1-broadcast dimension and handle it somehow. The big lever here is multi-versioning closer to the frontend level, which has the benefit of being fairly aligned with our capabilities today. However, if we find that we are not being effective at eliminating certain classes of these bits in practice, then the observation was that we may benefit from sinking them down to the LinAlg level in some fashion, assuming that some strategy could be employed for handling them at that level without upsetting the whole flow.
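As a sketch of what that frontend-level multi-versioning could look like (all function names here are hypothetical stand-ins, not an existing API): branch once on the runtime could-be-1 bit so that each specialized version has static broadcast behavior.

import numpy as np

def compiled_add_broadcast_dim1(a, b):
    # Stand-in for a kernel compiled with "a broadcasts along dim 1" known statically.
    return np.broadcast_to(a, b.shape) + b

def compiled_add_elementwise(a, b):
    # Stand-in for a kernel compiled as a plain elementwise add (shapes match).
    return a + b

def add_3d(a, b):
    # The runtime could-be-1 bit is resolved once here, outside the "compiled" code.
    # (Only the "a broadcasts along dim 1" case is versioned, for brevity.)
    if a.shape[1] == 1 and b.shape[1] != 1:
        return compiled_add_broadcast_dim1(a, b)
    return compiled_add_elementwise(a, b)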

I’m still very much of the opinion that I want to see more examples of how these dynamic could-be-1 bits emerge in practice, focusing on correctness through frontend multi-versioning before committing to a research project on the codegen side (unless there is something trivial that unlocks it).

I’ve long suspected this. Aside from a few cases where I’ve seen intentional uses of this, as a user, I often would have preferred if there was a way that I could have been explicit about not having the size-1 auto-broadcast behavior (i.e. I’d rather not have iteration 100 act differently just because of a quirk of how the shapes happen to mod-out to a 1-dim). I’ve wasted too much of my life tracking down such silly bugs…

As I mentioned up-thread, we can multi-version this at the top level to start. In addition, there are some (I think) fairly straight-forward analyses that can be done to prove that this cannot impact a lot of common cases. For the rest, I think we just need to assess in real programs some more, looking to either do more complicated analysis/codegen or (my preference) invent a new kind of frontend dynamic dimension that does not broadcast. I believe that Theano had some nice solutions for this (but I haven’t seen/used them). At this stage, I’m more concerned that we have a way to correctly ingest programs with this behavior than I am that such a thing is in any way optimal.

I am not sure why this is not well defined; IMO it has better semantics. As I mentioned, the missing attribute (what would be filled in instead of the ... above) would contain a mapping from the dimensions of the result to the dimensions of the source. I haven’t worked out the exact attribute that would capture this, but I can express the two operations above using linalg.generic as follows:

#map0 = affine_map<(d0, d1, d2) -> (d0, d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
%t = linalg.generic {
       indexing_maps = [#map0, #map1],
       iterator_types = ["parallel", "parallel", "parallel"]
     } %a {
     ^bb0(%arg0 : f32, %arg1 : f32):
       linalg.yield %arg0 : f32
     } : tensor<?x?xf32> -> tensor<?x?x?xf32>
%c = linalg.generic {
       indexing_maps = [#map1, #map1, #map1],
       iterator_types = ["parallel", "parallel", "parallel"],
       ...
     } %t, %b {
     ^bb0(%arg2 : f32, %arg3 : f32, %arg4 : f32):
       %0 = addf %arg2, %arg3 : f32
       linalg.yield %0 : f32
     } : tensor<?x?x?xf32>, tensor<?x?x?xf32> -> tensor<?x?x?xf32>

These can be fused to get the specification I had in the first post. We need the same information captured by the indexing_maps at the front-end/xla_hlo level. This does away with the fake size-1 dims and is actually a fairly clean representation.
In terms of the rules under which we can fuse these: the current fusion for the tensor version of linalg.generic just requires the producer to have all “parallel” iterator types; then it is always valid to fuse. You will replicate computation here, but one aspect of fusion is that you favor replicating computation to avoid trips to memory.

I don’t think this is equivalent, if I understand your encoding right. Stating that 0 -> {0, 1} does not specify how elements get distributed. If you had a dimension of extent 32 that gets broadcast to dimensions of extent 8 and 4, then you would redistribute the elements. If the original extent does not match the product of the broadcasted-to dimensions, you would need to repeat elements. Using degenerate dimensions with extent 1 in the source operand is a way to specify along which dimension the repeating is to happen.
I think what you are more after here is to specify the dimensions in the result where you repeat the input. Would this just be a vector? So essentially, you want some static attribute that tells you where the dynamic ‘1’ values in the shape will be.
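In numpy terms, the two readings differ roughly like this (just an illustration of the ambiguity, not a proposed semantics):

import numpy as np

x = np.arange(32)
# Reading 1: redistribute the 32 elements over extents (8, 4).
redistributed = x.reshape(8, 4)                        # every element used exactly once
# Reading 2: repeat the elements along a new dimension (what a degenerate
# extent-1 dim in the source would pin down).
repeated = np.broadcast_to(x.reshape(1, 32), (4, 32))  # each row is a copy of x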

I would argue that you no longer have the dynamic broadcasting problem, because you now require a static specification of the broadcasting behavior. For input languages that require a specification like this, this could work well, and expanding the original xla.broadcast operation to also support permutations would be a good way to model it.

I cannot judge whether implicit broadcast along size-1 dimensions is useful at the input-language level. It would be interesting to understand this better. Irrespective of whether it is useful, we have models written using the current semantics, and changing the semantics would need user intervention. So if we want to support existing models (and for TensorFlow I believe we do), we will have to find a way to support this in the presence of dynamic shapes. For statically shaped workloads, this is not an issue anyway.

But this would require statically knowing where the 0 stride is. If this were dynamic, we would no longer have the internal-aliasing guarantees, but much of the mechanics would be the same. I proposed earlier that we could model at the type-system level whether a memref has 0 strides or not. That would not take away any optimization potential for the currently supported cases but would open up some optimizations (like fusion?) for the broadcasting case as well.

Mahesh’s proposal gives you full static knowledge of where the expansion happens. So other than duplicating computations, I do not see any issues.

More generally, if we had dynamic stride-0 memrefs, why would (input-)fusion not work? You might also duplicate computations. What else am I missing? I do see that other things get harder (like vectorization).

I’m missing this as well. To support the dynamic case completely, you need an additional input (not an attribute) that encodes this distinction in some way (i.e. a list of could-be-1 dims).

In both of the snippets above that produce %t as a tensor<?x?x?xf32>, there is not enough information to derive dim t, 1.

This prompted my comments about “modeling in fused form” and canonicalizations at the “HLO → Linalg” boundary: writing the equivalent of c[i][j][k] = a[i][0][k] + b[i][j][k] in Linalg implies that dim c, 1 == dim b, 1.

Other fun cases:

%t = xla_hlo.broadcast %a ... :  tensor<?x?xf32> -> tensor<?x?x?xf32>
%c = xla_hlo.add %t, %b : (
     tensor<?x?x?xf32>, tensor<?x?x?xf32>) -> tensor<?x?x?xf32>
%e = xla_hlo.add %t, %d : (
     tensor<?x?x?xf32>, tensor<?x?x?xf32>) -> tensor<?x?x?xf32>

would seem to imply that dim b, 1 == dim d, 1, but I don’t think that is well specified.

You are right. I assumed that the ... were hiding the target shape as well. But that would probably have been an xla_hlo.dynamic_broadcast then.

So what you are saying is that, even assuming we had such a dynamic_broadcast, this still cannot be compiled to a linalg.generic as the output shape cannot be derived from the shape of the input alone.

So instead, one would either need a special operation (maybe linalg.broadcast) that encodes this transformation of the indexing maps, or a variant of linalg.generic that allows specifying a target shape.

Doing this in the HLO → LinAlg lowering would require all dialects that lower into LinAlg to implement this kind of ad-hoc fusion.

I would somewhat invert the discussion here.

Case 1: one wants to materialize the broadcast in memory.
An indexing map (possibly with permutations if we wanted) and a target shape need to be specified.
This can be achieved either by passing the tensor that the broadcast wants to model or by passing some “fake memref” that only serves the purpose of giving structure to the op.

Case 2: one does not want to materialize the broadcast in memory.
I would just argue that things should be written in fused form at the tensor level as early as possible.
Canonicalizations of broadcast + X into fused form don’t sound too scary.

Case 3: one does not want to materialize the broadcast in memory and the semantics are not powerful enough to write the fused form (e.g. fusing with some combination of reductions).
There are multiple options here, including specifying fusion transformations that must happen in the future, but I won’t speculate at this point.

My understanding is that XLA semantics force case 2 to become case 3 because of HLO expressiveness. But that the vast majority of concrete use cases are of type 2.
@herhut does this seem a reasonable characterization?

Technically, XLA can fuse the broadcast into operations. In the client version of the XLA dialect, all operations can implicitly broadcast, so you can fuse a broadcast_in_dim into an operation. This is not fully generic, as only one operand can be fused. Cases like

%1 = broadcast_in_dim ...
%2 = broadcast_in_dim ...
%3 = add %1, %2

fall into category 3. But for these we also do not have a fused form in linalg, as the output shape might be under-specified.

I would expect (though I have no data for this) that it is common for dialects at the level of abstraction of HLO to not be able to represent fused operations. After all, that is why we lower into linalg: to have a more transformation-friendly format that makes the computation more visible, at the cost of being more verbose (leaving linalg’s named operations aside here).

For LinAlg on buffers, we already have the “fake” memref as the output needs to be pre-allocated. Just in the case of LinAlg on tensors, we do not have enough information about the result shape. Maybe it would make sense in such cases to add a value or expression that computes the result shape, if it cannot be derived.

Apologies for radio silence (was catching up on some backlog).

That is a separate point AFAICS. The actual shape of the tensor needs to be provided separately (@herhut mentioned this too). What I am focusing on here is not having indexing such as a[i][0][k] in the first place. Getting more specific and starting from the broadcast op

%t = xla_hlo.broadcast %a ... : tensor<?x?xf32> -> tensor<?x?x?xf32>

There are six different “computations” that this might be used to specify.

t[i][j][k] = a[i][j];
t[i][j][k] = a[i][k];
t[i][j][k] = a[j][k];
t[i][j][k] = a[k][i];
t[i][j][k] = a[k][j];
t[i][j][k] = a[j][i];

The choice of which one to use can come either from a legalization at the top of the stack or from the end user. My conjecture is that in the common case the choice of which one to use is known statically. (I can’t myself think of a case where you would dynamically want to pick between these variants.)

There is a separate issue of how to “infer” the shape of the result tensor. Even in the broadcast_in_dim case, where c[i][j][k] = a[i][0][k] + b[i][j][k], you can infer that dim c, 1 == dim b, 1, but you can’t infer the shape value without additional info, either through a shape inference dialect or some other mechanism. My point is that instead getting to c[i][j][k] = a[i][k] + b[i][j][k] removes the complexity of “dynamic broadcasting” based on whether a dimension is of size 1 at runtime or not.

Yes, that’s the better/canonical form indeed, and also how you’d write it in TC, not the one with the 0.

Understood; my point is that putting a and b in the same linalg.generic (i.e. in the same shape inference scope) enables that, and I don’t think there is an XLA equivalent for tying a and b together in a custom op.

Yes, the place I’ve seen these pop up is when a user wants to use an existing 1-D library for a 2-D purpose (the stride-0 case that Stella mentioned). The commonly used trick is the expert user subverting the system to get broadcast semantics without having to write a new kernel and without having to materialize the broadcast. In my experience it is a pretty deliberate and static way of getting it to “just work”, with the fun caveats that Stella mentioned: that k mod n == 1 may buy you an unexpected broadcast materialization somewhere else and an OOM…

As an experiment, I would go as far as removing implicit broadcast from the frontend (when custom ops work well enough) and then reevaluate when/if we see legitimate dynamic use cases.

Let’s be clear on terminology. In my understanding, there are two implicit broadcasting modes: a) an implicit broadcast to a higher rank by inserting should-broadcast-1-dims, and b) assuming that any 1-dim is broadcastable. I call the first implicit rank broadcasting and the second implicit 1-dim degenerate broadcasting.

(a) is generally useful IME. (b) is problematic because the extent 1 is ambiguous between being a literal unit dimension and being a special symbol that activates broadcast behavior. Depending on context, the term “implicit broadcasting” can refer to either or both.

If, instead of a 1-extent signaling dimensional broadcast behavior, we had a special symbol bcast that could be used for any dimension, the (a) case could be defined unambiguously in terms of it (and, I would argue, would be useful).

In the world where we don’t control the frontend, we just have to do our best. However, in the numpy case, I could envision a special strict_ndarray which carried an extra bitmask for broadcastable dims and enforced it when dispatching to a ufunc. I don’t think it would actually change any kernel definition, just add some more preconditions. This would allow you to write eager-dispatch code that functions the same as if you compiled it (assuming that when compiling a graph of such operations, we added the rule that dynamic dimensions cannot be declared to have the broadcastable bit set; they would need to go through a special broadcast_dims op that flips the bits on results).
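A very rough sketch of what such a strict_ndarray could look like (a hypothetical API, with equal ranks assumed for brevity; nothing here exists today):

import numpy as np

class StrictNDArray:
    """Wraps an ndarray with an explicit per-dimension 'broadcastable' bit."""

    def __init__(self, data, broadcastable=None):
        self.data = np.asarray(data)
        self.broadcastable = tuple(broadcastable or (False,) * self.data.ndim)

    def check_broadcast_against(self, other_shape):
        # Refuse the implicit size-1 broadcast unless the dim is marked broadcastable.
        for d, (extent, bcast) in enumerate(zip(self.data.shape, self.broadcastable)):
            if extent == 1 and other_shape[d] != 1 and not bcast:
                raise ValueError(f"dim {d} has extent 1 but is not marked broadcastable")

def strict_add(a, b):
    a.check_broadcast_against(b.data.shape)
    b.check_broadcast_against(a.data.shape)
    return np.add(a.data, b.data)

With something like this, strict_add would raise on an accidental size-1 dim instead of silently broadcasting, while dims explicitly marked broadcastable keep working.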

The semantic is so prevalent that I don’t see us being able to produce a compiler that ignores it (it has to do the right thing), but developing some well-defined semantics for running without this degenerate 1-broadcast behavior would give us a path to let people write stricter programs and to let the compiler focus on being optimal for cases that comply.

It all needs more definition and thought, though.

I fully agree with this. I was wondering along the same lines, coming from the compiler side. Would it make sense to have a special type for strided memrefs that might broadcast internally (and hence alias), so that we can statically differentiate the two cases? Then, if we have that knowledge, we could produce code that does not have to care about degenerate broadcasts and reap the benefit of tighter semantics. At the same time, we would be general enough for current users.

I like the terminology.

This would work, and you could canonicalize the bcast away.

That’s exactly what I would like to happen. We have to support this usage, but if the compilation stack is inefficient for this case (no fusion, etc.) then maybe it would push people to avoid using it. That is just my hope, and I’m not sure that even happens in practice.

Thanks for all this! Next I will try to send out the canonicalizations I have in Linalg to remove all the implicit 1-dim degenerate broadcasting (it’s been languishing for a week, so it needs some cleanup).

To follow up on this, the stack of patches ending with https://reviews.llvm.org/D79766 canonicalizes away the use of unit-extent dims for broadcasting.

So the following computation

func @broadcast_test(%arg0 : tensor<5xf32>, %arg1 : tensor<5xf32>)
  -> tensor<5x5xf32> {
  %0 = xla_hlo.reshape %arg0 : (tensor<5xf32>) -> tensor<1x5xf32>
  %1 = xla_hlo.broadcast_in_dim %0
         {broadcast_dimensions = dense<[0, 1]> : tensor<2xi64>} :
         (tensor<1x5xf32>) -> tensor<5x5xf32>
  %2 = xla_hlo.reshape %arg1 : (tensor<5xf32>) -> tensor<5x1xf32>
  %3 = xla_hlo.broadcast_in_dim %2
         {broadcast_dimensions = dense<[0, 1]> : tensor<2xi64>} :
         (tensor<5x1xf32>) -> tensor<5x5xf32>
  %4 = xla_hlo.add %1, %3 : tensor<5x5xf32>
  return %4 : tensor<5x5xf32>
}

which gets lowered to linalg (and after some fusion) becomes:

#accesses = [
  affine_map<(d0, d1) -> (0, d1)>,
  affine_map<(d0, d1) -> (d0, 0)>,
  affine_map<(d0, d1) -> (d0, d1)>
]

#trait = {
  args_in = 2,
  args_out = 1,
  indexing_maps = #accesses,
  iterator_types = ["parallel", "parallel"],
  library_call = "some_external_fn"
}

func @broadcast_test(%arg0 : tensor<5xf32>, %arg1 : tensor<5xf32>) -> tensor<5x5xf32>
{
  %0 = linalg.tensor_reshape %arg0 [affine_map<(d0, d1) -> (d0, d1)>] :
       tensor<5xf32> into tensor<1x5xf32>
  %1 = linalg.tensor_reshape %arg1 [affine_map<(d0, d1) -> (d0, d1)>] :
       tensor<5xf32> into tensor<5x1xf32>
  %2 = linalg.generic #trait %0, %1 {
       ^bb0(%arg2: f32, %arg3: f32):
         %3 = addf %arg2, %arg3 : f32
         linalg.yield %3 : f32
       } : tensor<1x5xf32>, tensor<5x1xf32> -> tensor<5x5xf32>
  return %2 : tensor<5x5xf32>
}

which then gets canonicalized to

#map0 = affine_map<(d0, d1) -> (d1)>
#map1 = affine_map<(d0, d1) -> (d0)>
#map2 = affine_map<(d0, d1) -> (d0, d1)>

func @broadcast_test(%arg0: tensor<5xf32>, %arg1: tensor<5xf32>) -> tensor<5x5xf32> {
  %0 = linalg.generic {args_in = 2 : i64, args_out = 1 : i64,
                       indexing_maps = [#map0, #map1, #map2],
                       iterator_types = ["parallel", "parallel"]}
        %arg0, %arg1 {
        ^bb0(%arg2: f32, %arg3: f32):  // no predecessors
          %1 = addf %arg2, %arg3 : f32
          linalg.yield %1 : f32
         }: tensor<5xf32>, tensor<5xf32> -> tensor<5x5xf32>
  return %0 : tensor<5x5xf32>
}