[RFC] Changes to linalg::TiledLoopOp to unblock reductions

It isn’t clear to me why a « minimal change » would be the best metric to optimize for.
In particular « reducing churn » is also another one.

The point of an RFC should be first and foremost to agree on a design.
This isn’t about perfection, but at minimal to flush out any obvious issues with the design.

Alternatives (like the one I showed abode) don’t seem much harder to achieve so I am not sure why we should knowingly go with flawed design when we can avoid it?

It is much harder to achieve. It needs introduction of tensor.tile_range, a new type !tensor.range, then tensor.extract_slice should start either accepting !tensor.range or another new op is needed to extract offsets-sizes-strides from the !tensor.range and feed it into tensor.extract_slice. Some of these ops also require an agreement on design.

This op is supposed to be an experimental one. It might change again in a foreseeable future. I don’t think it makes sense to block it. It is in Linalg, it does not break anyone, because it is not used by anyone currently.

I’m not sure what is the argument here: it seems like it can be read as that we should just blankly accepts anything that is labelled “experimental” because it “does not break anyone”?
I object to this approach for in-tree development, I’d like to keep the bar higher than that on design.

In particular, if you send an RFC you need to accept that you may have to revisit your initial design to be able to move forward upstream, including if it requires significant changes.

I am not sure what is “blankly” accepted here. I don’t see a lot of objections from anybody actually involved in Linalg development.

I am not sure thats the bar here. I think there are some valid points raised here. I asked the same question above. It seems like

linalg.tiled_yield %transpose_sub in %out_sub

there is an implicit assumption that %out_sub is a result of tensor.extract_slice and that the use here is just to get the slice shape and not its values. So given that, I think Mehdi’s proposal address that. It is true that following through on this would involve a lot more work to make sure everything fits properly (new types, ops, etc. etc.) There is another subtle issue, the tensor.extract_slice looks like a “read” in the IR, but is actually a slice to write through. So in some sense this construct is mimicing an in-place update of a tensor value till bufferization can actually make this an in-place update. I totally understand the thought behind this (IREE does the same thing but slightly differently using different types).

I might not personally block this work because there are some very hairy abstractions gaps that need to be addressed that I dont have a definitive answer for, but these are valid concerns IMO.

Just a fly on the wall perspective from someone who has been dipping in and out of this thread for the last week: this topic doesn’t seem irreconcilably distant in design space but does not seem converged. Sometimes f2f discussions can help when things are in this state. It looks like we have a demo scheduled for the ODM on Thursday. Do we think that will take the whole time, and if not, would this be something worth discussing? (or could be another time to discuss)

(I don’t have a stake in this topic – more just recognizing some comms patterns that might be good to flush out)

1 Like

If we are doing something that involves a lot more work, then I propose a slightly different design that allows to avoid the weird terminator syntax with the variadic pair of args.

%sum = linalg.tiled_loop (%i) = (%c0) to (%100) step (%c10) {
    input(%in: tensor<100xf32>) {
      %0 = linalg.tile_range %in [%i][%c10][%c1] : !linalg.subset
      linalg.yield %0 : !linalg.subset
    output(%out: tensor<f32>) {
      %0 = linalg.full_range %out : !linalg.subset
      linalg.yield %0 : !linalg.subset
    computation(%in_: tensor<10xf32>, %out_: tensor<f32>) {
      %sum = <some_computation>(%in_, %out_) : tensor<f32>
      linalg.yield %sum : tensor<f32>

Every input and output tensor/memref will have a corresponding region. These regions define subset transformations to get arguments for the “computation” region. At first, only tiled and non-tiled args will be produced with linalg.tile_range and linalg.full_range operations. Later, we might add support for non-rectangular subsets.

We can also bufferize this op.

linalg.tiled_loop (%i) = (%c0) to (%100) step (%c10) {
    input(%in: memref<100xf32>) {
      %0 = linalg.tile_range %in [%i][%c10][%c1] : !linalg.subset
      linalg.yield %0 : !linalg.subset
    output(%out: memref<f32>) {
      %0 = linalg.full_range %out : !linalg.subset
      linalg.yield %0 : !linalg.subset
    computation(%in_: memref<10xf32, #map>, %out_: memref<f32>) {
      <some_computation>(%in_, %out_)

linalg.yield can still be used as a terminator in all of these regions. In “computation” region it yields slices or entire tensors for the corresponding tensor output argument.

One type !linalg.subset and two subset ops linalg.tile_range, linalg.full_range will have to be added. Also, in order to develop linalg.tiled_loop 2.0 incrementally, I would suggest to create a separate operation, make sure that every pass works correctly and then replace the original linalg.tiled_loop with it.

Oh, I was out of this thread for a couple of days and did not notice this. I am fine discussing it in one of the MLIR meetings if the new design is not accepted.

This looks nice to me!

Some questions:

  • I think there is an invariant that the %in_ input to the computation region has to have the shape of the tile computed in the input region, right? But we can’t have any verification it seems: is it just defined as UB if there is a mismatch?
  • Same question for the output region and the value yielded in the computation region?
  • I’m confused by how it works different tiles have overlapping output linalg.subset? It seems like it could describe a reduction but there would be a need to express the reduction itself. Unless the semantics is that the tiles are sequential and that the tile extracted at iteration N is from the %out_ tensor produced after iterations N-1?
    If so, would that mean that with a parallel iterator_types we’d have to forbid overlapping yielded tiles?

Yep, both inputs and outputs “computation” region args should correspond to the shapes computed in “input/output” regions. I thought that we could actually add verification for these types similarly to how tensor.extract_slice or memref.subview verify the output type.

Yes, a parallel “iterator_type” would require non-overlapping tiles. And yes, you are right about reductions. I have a prototype for a simple reduction kernel that can actually be distributed. With the “old” (unsubmitted) syntax, a tiled reduction looked like

func @reduce_sum(%arg: tensor<800x600xf32>) -> tensor<800xf32> {                                      
  %cst = constant 0.000000e+00 : f32                                                                    
  %c2 = constant 2 : index                                                                              
  %c4 = constant 4 : index                                                                              
  %c800 = constant 800 : index                                                                          
  %c600 = constant 600 : index                                                                                                                                                                                    
  %c0 = constant 0 : index                                                                              
  %init = linalg.init_tensor [800] : tensor<800xf32>                                                       
  %result = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%c800, %c600) step (%c2, %c4)
       ins (%in_ = %arg: tensor<800x600xf32>)
       outs (%out_ = %init: tensor<800xf32>)
       iterators["parallel", "reduction"] {                                                                                                
    %sub_in = tensor.extract_slice %in_[%i, %j] [2, 4] [1, 1]
      : tensor<800x600xf32> to tensor<2x4xf32>
    %sub_out = tensor.extract_slice %out_[%i] [2] [1]
      : tensor<800xf32> to tensor<2xf32>                   
    // after bufferization, %fill will result in an `alloca` of a buffer for the temp result         
    %fill = linalg.fill(%cst, %sub_out) : f32, tensor<2xf32> -> tensor<2xf32>          
    // actual linalg.generic with reduction with a temp "out" arg                       
    %local_sum = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, 
                                     affine_map<(d0, d1) -> (d0)>],
        iterator_types = ["parallel", "reduction"]}
        ins(%sub_in : tensor<2x4xf32>)
        outs(%fill : tensor<2xf32>) {
    ^bb0(%arg6: f32, %arg7: f32):                                                
      %8 = addf %arg6, %arg7 : f32                                                                      
      linalg.yield %8 : f32                                                                             
    } -> tensor<2xf32>          
    // add the local sum to the output. If the loop is distributed, then it should be done atomically.                                                                        
    %sum = linalg.generic {
        indexing_maps = [affine_map<(d0) -> (d0)>,
                                     affine_map<(d0) -> (d0)>],
       iterator_types = ["parallel"]}
       ins(%local_sum : tensor<2xf32>) 
       outs(%sub_out : tensor<2xf32>) {
    ^bb0(%arg6: f32, %arg7: f32):                                                 
      %8 = addf %arg6, %arg7 : f32                                                                                                                                                                                
      linalg.yield %8 : f32                                                                                                                                                                                       
    } -> tensor<2xf32>                                                                                  
    linalg.tiled_yield %sum in %sub_out : tensor<2xf32>                                                         
  return %result : tensor<800xf32>                                                                           

Aren’t these verification very local? (Just based on the input types you can infer the correct output type)
Here in your example above the computation has %in_: tensor<10xf32> but the input region can be arbitrary complex, and even in this case recovering the slice would require to analyse linalg.tile_range %in [%i][%c10][%c1] and figure out that the inputs are constants, etc.
But I’m fine with not having a verifier (we can’t just verify everything, in particular in the static case): we just need to document clearly what’s UB.

Something unclear to me reading this is how do you figure from the tiled_loop op that a reduction is present and how do you identify that the second linalg.generic is actually the “reduction operator”?
In the scf dialect this is made unambiguous by the use of a specific op (scf.reduce).

Most likely, it is going to be enough to have a composition of functions like linalg.yield f_1(f_2( ...(f_N(%input_or_output_tensor_or_memref, %other_args_N), ...), %more_args_2), %other_args_2), %other_args_1). I was planning to restrict each input/output region to having a single block with 2 ops: a shape op and a terminator in the first version. That way we can verify the types. At least in the beginning, while implementing the op and attempting to use it, it is going to be way too easy to make a mistake without a verifier.

This will require a separate RFC.

So, if there are no objections, I can start implementing this version.

No objection for me, I’m interested to hear what @MaheshRavishankar and @nicolasvasilache think of this!

Nicolas might have a better take on this, but a summary of comments

This might be trying to do too much in one op. I understand the need for this, but doing transformations on this op is going to be very involved. I need to replace the input region and then change the computation. More things to keep track off, and that are harder to verify, etc. There is no objective criteria to measure what point is too far, but IMO this might not pay for itself.

Have had a few discussions with Nicolas on this. One factor to consider here is that if linalg.tiled_loop starts moving closer to what an scf.parallel can do, then whats the point of linalg.tiled_loop in the first place. I think this operation was added initially with the intent to “generalize” linalg.generic to work on tiles, but that seems to be challenging. So it maybe time to re-evaluate if this abstraction is carrying its weight.
Specifically w.r.t reduction, whether a linalg.tiled_loop has an extra region for reduction or not might be actually influenced by how a particular compilation workflow is structured. For example, have been looking at “parallel” reduction support in IREE, with linalg.tiled_loop + region as a candidate, my conclusion here is that this is not needed. If you want to do tree based reductions, and scf.for with a linalg.tiled_loop as it exists today can be used to implement it.

Having said that, I dont have a strong opinion on this apart from saying for IREEs use-case the generalization for reduction does not seem to be necessary. I can stamp it to favor experimentation, but probably requires more input from Nicolas who probably has a better idea of the long-term evolution of this.

This is a long thread, but as a fly on the wall comment, another approach here that is isomorphis to what is being proposed is

%sum = linalg.tiled_loop (%i, %j) = (%c0, %c0) to (%size_0, %size_1)
    step (%c10, %c10)
    ins (%in_ =  %in: tensor<100x100xf32>)
    outs (%out_ =  %out: tensor<100x100xf32>)
    iterator_types ("parallel", "parallel")  {
  %in_sub = tensor.extract_slice %in_[%i, %j][%c10, %c20][%c1, %c1]
  %transpose_sub = linalg.generic {
      indexing_maps =  [#id, #tr],
      iterator_types =  ["parallel", "parallel"]}
      ins(%in_sub: tensor<10x10xf32>)
      outs(%out_sub: tensor<10x10xf32>)  {
    ^bb0(%in_elem: f32,  %out_elem: f32):
      linalg.yield  %in_elem : f32
  } -> tensor<10x10xf32>
  linalg.tiled_loop_terminator {
    tiled_yield %transpose_sub at [%j, %i][%c20, %c10][%c1, %c1]

That is, the terminator has a region, with one op per outs, and that op holds the offsets/strides to insert into for this iteration. I think this dodges the weirdness of linalg.tiled_yield %transpose_sub in %out_sub : tensor<10x10xf32> needing %out_sub to be defined by a tensor.extract_slice. It’s the same information, but represented without needing to traverse a use-def chain to reach the dummy “read”.

Nice! This is definitely the best solution I have seen so far.

It does remove insert_slice, but I would prefer to get rid of the indices completely, because that way you can see that a tile of the output that you read is actually the same as the tile that you overwrite. Also it restricts subset types to rectangular tiles only. In future, we want to support more types.
What @mehdi_amini suggested above with

and what I suggested with

can model this.

It is in Linalg, it does not break anyone, because it is not used by anyone currently.

I notice your above comment on LinAlg and wonder if this statement is still accurate at this point. Is LinAlg considered “experimental” and not ready for the community to adopt? Is there some guideline to distinguish experimental and non-experimental work in MLIR? Thanks.

I think the comment was specifically about the linalg.tiled_loop op and not linalg in general. At the moment linalg.tiled_loop isn’t used in the lowering pipelines used in IREE/XLA/TensorFlow/… ; but linalg itself is heavily used and beyond what I’d consider “experimental”.

1 Like

Yes, I can confirm what Mehdi wrote. I was talking about linalg.tiled_loop only.

1 Like