[RFC] Parallel loops on tensors in MLIR

scf::ForeachThreadOp (TableGen) and gml_st::ParallelOp (TableGen) are two loop-like operations that came to replace linalg.tiled_loop (RFC) after its deprecation. They have gone through many changes, and in their current form the difference between them is small enough to merge them, especially since there is interest in making that happen (Discourse).

Overview

scf.foreach_thread is a multi-dimensional loop that expresses the computation in terms of thread IDs. It has a special scf.foreach_thread.perform_concurrently terminator that contains one or more tensor.parallel_insert_slice operations, each of which specifies what slice of the shared output is updated by the thread.

scf.foreach_thread

#map = affine_map<(d0) -> (d0 * 8)>
#map1 = affine_map<(d0) -> (d0 * 4)>

%matmul = scf.foreach_thread (%threadIdI, %threadIdJ) in (%c16, %c16) 
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  %offset_I = affine.apply #map(%threadIdI)
  %offset_J = affine.apply #map1(%threadIdJ)

  %lhs_slice = tensor.extract_slice %lhs[%offset_I, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %offset_J] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%offset_I, %offset_J] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>

  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>

  scf.foreach_thread.perform_concurrently {
    tensor.parallel_insert_slice %matmul_slice
      into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
      : tensor<8x4xf32> into tensor<128x64xf32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

The loop can also annotate its dimensions with the mapping attribute.

gml_st.parallel

gml_st.parallel is also a multi-dimensional loop; it has lower bounds, upper bounds, and steps like scf.parallel. It has a gml_st.set_yield terminator that specifies what slice/element of the shared output can be updated, using a !gml_st.tile-typed SSA value.

gml_st.set_yield takes a variadic number of src-dst-set triples and can also have an accumulator region that specifies an RMW-like update of the output, as sketched after the example below.

%matmul = gml_st.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
    outs (%out_ = %fill: tensor<128x64xf32>) distribution ("block") {
  %lhs_slice = tensor.extract_slice %lhs[%i, 0] [8, 16] [1, 1] :
          tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %j] [16, 4] [1, 1] :
          tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%i, %j] [8, 4] [1, 1] :
          tensor<128x64xf32> to tensor<8x4xf32>

  %matmul_slice = linalg.matmul
   ins(%lhs_slice, %rhs_slice: tensor<8x16xf32>, tensor<16x4xf32>)
   outs(%out_slice: tensor<8x4xf32>) -> tensor<8x4xf32>

  %tile = gml_st.tile [%i, %j] [8, 4] [1, 1] : !gml_st.tile<8x4>
  gml_st.set_yield %matmul_slice into %out_[%tile] :
          tensor<8x4xf32> into tensor<128x64xf32>[!gml_st.tile<8x4>]
} : tensor<128x64xf32>
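
For illustration, a set_yield with an accumulator might look roughly as follows (a sketch; the exact gml_st syntax for the acc region may differ slightly, and the arith.addf combiner is just an example):

gml_st.set_yield %matmul_slice into %out_[%tile]
    acc (%new, %old: tensor<8x4xf32>) {
  // Combine the freshly computed slice with the value already stored in
  // the output before writing it back (RMW-style update).
  %sum = arith.addf %new, %old : tensor<8x4xf32>
  gml_st.yield %sum : tensor<8x4xf32>
} : tensor<8x4xf32> into tensor<128x64xf32>[!gml_st.tile<8x4>]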

The loop also has a distribution attribute that is used to map the dimensions of the loop to threads, blocks, etc.

Difference

When it does not update the same slice concurrently and does not insert individual elements, the gml_st.set_yield terminator can always be rewritten as scf.foreach_thread.perform_concurrently. Accumulator regions can be added to tensor.parallel_insert_slice (proposal); single-element updates can be added as well, if they are really needed.
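
To make that concrete, a purely hypothetical accumulator region on tensor.parallel_insert_slice could look like this (this syntax does not exist today; the acc region and its yield are illustration only, following the linked proposal):

scf.foreach_thread.perform_concurrently {
  // Hypothetical: the acc region merges the new slice with the value
  // currently stored in the destination instead of overwriting it.
  tensor.parallel_insert_slice %matmul_slice
      into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
      acc (%new, %old: tensor<8x4xf32>) {
    %sum = arith.addf %new, %old : tensor<8x4xf32>
    yield %sum : tensor<8x4xf32>
  } : tensor<8x4xf32> into tensor<128x64xf32>
}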

Proposal

Introduce loop bounds for scf.foreach_thread. It will still have two modes of tiling, i.e. one where tile sizes are computed from a given thread count and one where tile sizes are given directly.

At this point I would also omit _thread from the op name to make it less confusing.

Updated IR: tiling w.r.t. thread count.

#map = affine_map<(d0) -> (d0 * 8)>
#map1 = affine_map<(d0) -> (d0 * 4)>

%matmul = scf.foreach (%threadIdI, %threadIdJ)
    = (%c0, %c0) to (%c16, %c16) step (%c1, %c1)
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  %offset_I = affine.apply #map(%threadIdI)
  %offset_J = affine.apply #map1(%threadIdJ)

  %lhs_slice = tensor.extract_slice %lhs[%offset_I, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %offset_J] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%offset_I, %offset_J] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>

  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>

  scf.foreach_yield {
    tensor.update_slice %matmul_slice
      into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
      : tensor<8x4xf32> into tensor<128x64xf32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

Updated IR: tiling w.r.t. tile size.

%matmul = scf.foreach (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4) 
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  %lhs_slice = tensor.extract_slice %lhs[%i, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %j] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%i, %j] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>

  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>

  scf.foreach_yield {
    tensor.update_slice %matmul_slice
      into %out_[%i, %j] [8, 4] [1, 1]
      : tensor<8x4xf32> into tensor<128x64xf32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}

Note about scf.parallel

scf.parallel works only on buffers (memrefs) right now. I think that, as a next step after the gml_st.parallel and scf.foreach_thread unification, we should reconsider whether scf.parallel is pulling its weight or whether its use case can be covered by scf.foreach as well. This is out of scope for the current RFC, but I would be interested to learn how scf.parallel is used when it has scf.reduce ops inside. @bondhugula, @ftynse, @gysit
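
For context, a minimal sketch of such a loop, reducing the elements of a buffer (%buf, %n, and %zero are assumed to be defined elsewhere; the scf.reduce form shown is the one documented at the time):

%sum = scf.parallel (%i) = (%c0) to (%n) step (%c1) init (%zero) -> f32 {
  %elem = memref.load %buf[%i] : memref<?xf32>
  // The reduction region combines the per-iteration value with the
  // running result to produce the loop's f32 result.
  scf.reduce(%elem) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %add = arith.addf %lhs, %rhs : f32
    scf.reduce.return %add : f32
  }
}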

@nicolasvasilache @stellaraccident @herhut

At the time, there was a popular vote favoring scf.forall.
I do not have strong opinions about the name.

The main difference between the current form:

scf.foreach_thread (%threadIdI, %threadIdJ) in (%c16, %c16)

and the proposed form:

scf.foreach (%threadIdI, %threadIdJ)
    = (%c0, %c0) to (%c16, %c16) step (%c1, %c1)

is that the first, current form is normalized: the lower bounds of 0 and the steps of 1 are implicitly omitted.

Future evolutions want to support dynamic slices and other subset types that may not be easy to mix with the non-normalized form, but it is not a big stretch to require normalization to be applied when we need it.

Looking around the TableGen code, how would you feel about evolving scf.for into an scf.multi_for in some follow-up?

I suspect this will also improve the uniformization of transforms that @chelini mentioned in another post.

I always thought that scf.for should also be multi-dimensional.

scf.forall is even better. I like short names.

I don’t see the point of changing the name now; it is what the documentation says it is, and having a multi_for would imply there should be a single_for, which we don’t have or necessarily want.

I would not change the name of scf.for. Ideally, we should have scf.parallel and scf.for and that’s it. Both multi-dimensional, both work on tensors and on memrefs.

Also changing scf.for is a bit out of scope of this RFC 🙂

Thanks for writing this up. I like the general direction. I have two general questions:

  1. Do you expect scf.foreach to capture non-parallel loops? It would be nice to have a single loop-like construct at the tensor level and then lower it (maybe after bufferization) to either scf.for or scf.parallel. (A rough sketch of such a lowering follows after the answers below.)

  2. In your example, the mapping attributes seem related to GPU only (mapping = [#gpu.block<y>, #gpu.block<x>]), but I assume we would evolve the operation to target CPU as well, correct?

  1. No, I don’t expect scf.forall to capture non-parallel loops. linalg.tiled_loop used to do that, but it was not very helpful to have parallel/sequential dimension types for the loops, and it was quite tricky to read. There are some other differences; for example, scf.for cannot have distribution attributes, while scf.forall/parallel can.

  2. Yes, correct, there will be CPU-related mapping attributes.
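
As a sketch of the lowering mentioned in question 1, the matmul loop above could conceivably bufferize and lower to scf.parallel on memrefs along these lines (not the output of any existing pass; the strided layouts are illustrative):

scf.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4) {
  // After bufferization, subviews replace the tensor slices and the
  // matmul updates %out_slice in place, so no insertion is needed.
  %lhs_slice = memref.subview %lhs[%i, 0] [8, 16] [1, 1]
    : memref<128x16xf32> to memref<8x16xf32, strided<[16, 1], offset: ?>>
  %rhs_slice = memref.subview %rhs[0, %j] [16, 4] [1, 1]
    : memref<16x64xf32> to memref<16x4xf32, strided<[64, 1], offset: ?>>
  %out_slice = memref.subview %out[%i, %j] [8, 4] [1, 1]
    : memref<128x64xf32> to memref<8x4xf32, strided<[64, 1], offset: ?>>
  linalg.matmul
    ins(%lhs_slice, %rhs_slice
          : memref<8x16xf32, strided<[16, 1], offset: ?>>,
            memref<16x4xf32, strided<[64, 1], offset: ?>>)
    outs(%out_slice : memref<8x4xf32, strided<[64, 1], offset: ?>>)
}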

D144072 ([mlir] Add loop bounds to scf.foreach_thread and rename it to scf.forall) adds loop bounds.

https://reviews.llvm.org/rGcdf8f064694c37d9f89cfe24203efdc4804a00cc
https://reviews.llvm.org/rGeb2f946e780cc0e82260c775930fa0509885b1ea

I guess it’s done. Unification of scf.parallel and scf.forall needs a separate RFC.