Notes from the MLIR Upstream Round Table @ EuroLLVM 2024

I am a bit late to this thread. Thanks @rengolin for the detailed notes.

In general I am -1 on almost all of these. I'll describe the “broadcasting behavior” and “type cast” points in more detail below, but the main premise for all of this is that Linalg operations are essentially any computation that can be represented using “perfectly nested loops”. Named ops are useful, but Linalg is a transformation-focused dialect, and any restriction that tries to limit the “perfectly nested loops” aspect of this seems artificial to me. For the most part a named op is just a linalg.generic with fixed indexing maps, iterator types, and region definition.

Broadcasting behavior

In general broadcasting gets a bit of a bad rap (with good reason), but IMO not all broadcasting behavior is bad. In certain restricted categories, having some broadcasting behavior can lead to a better representation of the program. I'll use the examples above to walk through which of them I think are well defined under these rules of thumb for elementwise parallel operations.

  1. All operations need to be in destination-passing style. (If any named op is not, I actually don't know how that is legal for the Linalg interface, but it is a requirement IMO.)
  2. The indexing maps for the outputs of such ops need to be identities.
  3. All operand indexing maps need to be a minor identity (using this term loosely: no transposition, but dimensions may be dropped).
  4. The broadcasting behavior is then that every operand not at the same rank as the output gets broadcast along its missing dimensions to the output shape.
 %0 = linalg.add %1, %2 : tensor<4xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 
  // Broadcast %1 10x (along d0) before adding

This IMO is perfectly well defined (apart from the fact that it should be in destination-passing style and isn't), since it is essentially this

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %2 : tensor<4xf32>, tensor<10x4xf32>)
    outs(%3 : tensor<10x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %4 = arith.addf %b0, %b1 : f32
    linalg.yield %4 : f32
} -> tensor<10x4xf32>

There is no ambiguity here. I think one issue is that these ops are trying to infer broadcast behavior based on the shapes. Instead, the broadcasting behavior should be based on indexing maps.

%0 = linalg.add %1, %2 : tensor<10x1xf32>, tensor<10x4xf32> -> tensor<10x4xf32>

This is NumPy-style size-1 broadcasting. This should just be illegal.
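If a producer really has the tensor<10x1xf32> in hand, it can drop the unit dimension explicitly first (a sketch using tensor.collapse_shape):

 %1c = tensor.collapse_shape %1 [[0, 1]] : tensor<10x1xf32> into tensor<10xf32>

and then use the rank-reduced form in the next example.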

// Broadcast %1 4x before adding?
  %0 = linalg.add %1, %2 : tensor<10xf32>, tensor<10x4xf32> -> tensor<10x4xf32> 

No issue here either. It is basically

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %2 : tensor<10xf32>, tensor<10x4xf32>)
    outs(%3 : tensor<10x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %4 = arith.addf %b0, %b1 : f32
    linalg.yield %4 : f32
} -> tensor<10x4xf32>

Now coming to this

%0 = linalg.add %1, %2 : tensor<4xf32>, tensor<4x4xf32> -> tensor<4x4xf32> 

If you are trying to “infer” the broadcast based on shapes, this gets stuck: there are two valid interpretations. Instead the named ops should derive broadcasting behavior from indexing maps; then there is no ambiguity. For this specific case, actually, both

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %2 : tensor<4xf32>, tensor<4x4xf32>)
    outs(%3 : tensor<4x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %4 = arith.addf %b0, %b1 : f32
    linalg.yield %4 : f32
} -> tensor<4x4xf32>

and

 %0 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>,
                     affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]}
    ins(%1, %2 : tensor<4xf32>, tensor<4x4xf32>)
    outs(%3 : tensor<4x4xf32>) {
  ^bb0(%b0 : f32, %b1 : f32, %b2 : f32):
    %4 = arith.addf %b0, %b1 : f32
    linalg.yield %4 : f32
} -> tensor<4x4xf32>

are equivalent; one is just the loop-interchanged version of the other, so you can just pick one.

I think one addition might be to make the indexing maps explicit even on named ops, to represent broadcasting behavior precisely. Then there is no ambiguity at all.
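For example, the ambiguous case above could be written unambiguously as (a hypothetical syntax sketch, assuming named ops grow an optional indexing_maps attribute):

 %0 = linalg.add {
     indexing_maps = [affine_map<(d0, d1) -> (d1)>,
                      affine_map<(d0, d1) -> (d0, d1)>,
                      affine_map<(d0, d1) -> (d0, d1)>]}
     ins(%1, %2 : tensor<4xf32>, tensor<4x4xf32>)
     outs(%3 : tensor<4x4xf32>) -> tensor<4x4xf32>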
Doing this for the matmul and convolution ops could also be useful, and would help collapse the tens of variants of matmul + convolution into a handful of ops with different indexing maps.
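For instance, a transposed variant like linalg.matmul_transpose_b would then just be linalg.matmul with a different map on the second operand (again a hypothetical sketch):

 %0 = linalg.matmul {
     indexing_maps = [affine_map<(m, n, k) -> (m, k)>,
                      affine_map<(m, n, k) -> (n, k)>,
                      affine_map<(m, n, k) -> (m, n)>]}
     ins(%A, %B : tensor<?x?xf32>, tensor<?x?xf32>)
     outs(%C : tensor<?x?xf32>) -> tensor<?x?xf32>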
This also means we might be at the end of the road for the Python OpDSL → YAML → TableGen definition of named ops. Instead we add a set of named ops explicitly in TableGen, which gives us more freedom to make sure the ops carry enough semantic information to keep things as local as possible, as well as unambiguous when “generalizing” to linalg.generic.

Type mismatches

AFAIU the only type issue that happens with “named elementwise ops” is understanding how to extend the input type to the output type. Because Linalg ops fundamentally work on signless types, the semantics of the element type conversion are carried in the body of the linalg op. For named ops we just need a way to represent how to convert from each operand type to the result type, with all of the computation happening in the result type. (You could also relax this and have an “operation type”, where all the operands are converted to this type, the operation is done in that type, and the result is then converted to the result type.) All of this could just be represented as an enum per operand/result. In general it is useful to allow mixed types in a single operation.

The whole premise of Linalg is that you want to keep all information needed to “transform” an operation local. This is more a guiding principle than a red line, but we should try to make it easier to keep things as local as possible.

To make the above more concrete, you could add this to an elementwise operation

%0 = linalg.add {
    extension_types = [SIGN_EXTEND_I32, UNSIGNED_EXTEND_I32],
    ...}
    ins(%1, %2 : tensor<..xi4>, tensor<..xi4>)

which essentially means

%0 = linalg.generic {...}
    ins(%1, %2 : tensor<..xi4>, tensor<..xi4>)
    outs(%3 : tensor<..xi32>) {
  ^bb0(%b0 : i4, %b1 : i4, %b2 : i32):
    %4 = arith.extsi %b0 : i4 to i32
    %5 = arith.extui %b1 : i4 to i32
    %6 = arith.addi %4, %5 : i32
    linalg.yield %6 : i32
}

This essentially boils down to how you can generate the body of the linalg.generic given the named op.
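The “operation type” variant generates a body the same way (a sketch; the attribute selecting f32 as the operation type is hypothetical): f16 operands would be extended, computed on in f32, and truncated back to the f16 result:

 ^bb0(%b0 : f16, %b1 : f16, %b2 : f16):
   %3 = arith.extf %b0 : f16 to f32    // operand type -> operation type
   %4 = arith.extf %b1 : f16 to f32
   %5 = arith.addf %3, %4 : f32        // compute in the operation type
   %6 = arith.truncf %5 : f32 to f16   // operation type -> result type
   linalg.yield %6 : f16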

I think it would be useful to even adapt the linalg.matmul and linalg.conv named ops to have explicitly stated semantics for the conversion from input type to accumulator type and from accumulator type to result type. For now we are essentially using two separate ops and relying heavily on fusion to fold things back for us, which is doable, but it would be easier if we didn't have to. The flip side is that quantization schemes are inherently getting more complicated. We might not be future-proof with named ops to handle these evolutions, and may have to build up interfaces that can query information from a generic op directly (like which dimensions are contracted, or which are batch dimensions) rather than rely on named ops to handle things for us.
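To illustrate the current two-op pattern (a sketch; it uses linalg.matmul's existing behavior of implicitly extending the i8 inputs to the i32 accumulator element type):

 // Accumulate in i32 ...
 %acc = linalg.matmul ins(%lhs, %rhs : tensor<?x?xi8>, tensor<?x?xi8>)
                      outs(%zero : tensor<?x?xi32>) -> tensor<?x?xi32>
 // ... then a separate elementwise op converts the accumulator to the
 // result type, and we rely on fusion to fold the two back together.
 %res = linalg.generic {
     indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                      affine_map<(d0, d1) -> (d0, d1)>],
     iterator_types = ["parallel", "parallel"]}
     ins(%acc : tensor<?x?xi32>)
     outs(%init : tensor<?x?xi8>) {
   ^bb0(%b0 : i32, %b1 : i8):
     %t = arith.trunci %b0 : i32 to i8
     linalg.yield %t : i8
 } -> tensor<?x?xi8>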