# Parallel loops on tensors in MLIR

`scf::ForeachThreadOp` (TableGen) and `gml_st::ParallelOp` (TableGen) are two loop-like operations that came to replace `linalg.tiled_loop` (RFC) after its deprecation. They went through many changes, and in their current form the difference is small enough to merge them, especially since there is interest in making it happen (Discourse).

## Overview

`scf.foreach_thread` is a multi-dimensional loop that expresses the computation in terms of thread IDs. It has a special `scf.foreach_thread.perform_concurrently` terminator that contains one or more `tensor.parallel_insert_slice` operations, each specifying which slice of the shared output is updated by the thread.

### scf.foreach_thread

```
#map = affine_map<(d0) -> (d0 * 8)>
#map1 = affine_map<(d0) -> (d0 * 4)>
%matmul = scf.foreach_thread (%threadIdI, %threadIdJ) in (%c16, %c16)
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  %offset_I = affine.apply #map(%threadIdI)
  %offset_J = affine.apply #map1(%threadIdJ)
  %lhs_slice = tensor.extract_slice %lhs[%offset_I, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %offset_J] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%offset_I, %offset_J] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>
  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
  scf.foreach_thread.perform_concurrently {
    tensor.parallel_insert_slice %matmul_slice
      into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
      : tensor<8x4xf32> into tensor<128x64xf32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```

The loop can also annotate its dimensions with the `mapping` attribute.
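For instance, a variant of the loop above that maps its dimensions to GPU threads instead of blocks would differ only in the attribute (a sketch, with the body elided):

```
// Sketch: same loop as above, but mapped to GPU threads rather than blocks.
// The body is identical and elided here.
%matmul = scf.foreach_thread (%threadIdI, %threadIdJ) in (%c16, %c16)
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  ...
} {mapping = [#gpu.thread<y>, #gpu.thread<x>]}
```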

### gml_st.parallel

`gml_st.parallel` is also a multi-dimensional loop; it has lower bounds, upper bounds, and steps like `scf.parallel`. It has a `gml_st.set_yield` terminator that specifies which slice/element of the shared output can be updated, using a `!gml_st.tile`-typed SSA value.

`gml_st.set_yield` has a variadic number of `src`-`dst`-`set` triples and can also have an accumulator that specifies an RMW-like update of the output (a sketch of this form follows the example below).

```
%matmul = gml_st.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
    outs (%out_ = %fill: tensor<128x64xf32>) distribution ("block") {
  %lhs_slice = tensor.extract_slice %lhs[%i, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %j] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%i, %j] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>
  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
  %tile = gml_st.tile [%i, %j] [8, 4] [1, 1] : !gml_st.tile<8x4>
  gml_st.set_yield %matmul_slice into %out_[%tile]
    : tensor<8x4xf32> into tensor<128x64xf32>[!gml_st.tile<8x4>]
} : tensor<128x64xf32>
```
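For reference, the accumulator form of `gml_st.set_yield` mentioned above looks roughly like this; the sketch below is approximate and the exact `acc` region syntax may differ from the dialect:

```
// Approximate sketch of the RMW-like form: the acc region combines the
// freshly computed slice with the value currently stored in the destination.
gml_st.set_yield %matmul_slice into %out_[%tile]
    acc (%new, %old: tensor<8x4xf32>) {
  %sum = arith.addf %new, %old : tensor<8x4xf32>
  gml_st.yield %sum : tensor<8x4xf32>
} : tensor<8x4xf32> into tensor<128x64xf32>[!gml_st.tile<8x4>]
```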

The loop also has a `distribution` attribute that is used to map dimensions of the loop to threads, blocks, etc.
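An analogous thread-level loop would carry a different label (a sketch; which labels are accepted depends on the pipeline):

```
// Sketch: same loop header, distributed over threads instead of blocks.
// The body is identical and elided here.
%matmul = gml_st.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
    outs (%out_ = %fill: tensor<128x64xf32>) distribution ("thread") {
  ...
} : tensor<128x64xf32>
```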

### Difference

The `gml_st.set_yield` terminator, when it does not update the same slice concurrently and does not insert scalar elements, can always be rewritten as `scf.foreach_thread.perform_concurrently`. Accumulator regions can be added to `tensor.parallel_insert_slice` (proposal). Single-element updates can be added as well, if they are really needed.
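A hypothetical accumulator region on `tensor.parallel_insert_slice` (not part of the op today; the `acc` syntax below is made up for illustration) could mirror the `gml_st.set_yield` form:

```
// Hypothetical syntax: combine the inserted slice with the current contents
// of the destination instead of overwriting it.
tensor.parallel_insert_slice %matmul_slice
    into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
    acc (%new, %old: tensor<8x4xf32>) {
  %sum = arith.addf %new, %old : tensor<8x4xf32>
  tensor.yield %sum : tensor<8x4xf32>
} : tensor<8x4xf32> into tensor<128x64xf32>
```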

## Proposal

Introduce loop bounds for `scf.foreach_thread`. It will still have two modes of tiling, i.e. when tile sizes are computed given the thread count and when tile sizes are given directly. At this point I would also omit `_thread` from the op name to make it less confusing.

### Updated IR: tiling w.r.t. thread count.

```
#map = affine_map<(d0) -> (d0 * 8)>
#map1 = affine_map<(d0) -> (d0 * 4)>
%matmul = scf.foreach (%threadIdI, %threadIdJ)
    = (%c0, %c0) to (%c16, %c16) step (%c1, %c1)
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  %offset_I = affine.apply #map(%threadIdI)
  %offset_J = affine.apply #map1(%threadIdJ)
  %lhs_slice = tensor.extract_slice %lhs[%offset_I, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %offset_J] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%offset_I, %offset_J] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>
  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
  scf.foreach_yield {
    tensor.update_slice %matmul_slice
      into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
      : tensor<8x4xf32> into tensor<128x64xf32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```

### Updated IR: tiling w.r.t. tile size.

```
%matmul = scf.foreach (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  %lhs_slice = tensor.extract_slice %lhs[%i, 0] [8, 16] [1, 1]
    : tensor<128x16xf32> to tensor<8x16xf32>
  %rhs_slice = tensor.extract_slice %rhs[0, %j] [16, 4] [1, 1]
    : tensor<16x64xf32> to tensor<16x4xf32>
  %out_slice = tensor.extract_slice %out_[%i, %j] [8, 4] [1, 1]
    : tensor<128x64xf32> to tensor<8x4xf32>
  %matmul_slice = linalg.matmul
    ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
    outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
  scf.foreach_yield {
    tensor.update_slice %matmul_slice
      into %out_[%i, %j] [8, 4] [1, 1]
      : tensor<8x4xf32> into tensor<128x64xf32>
  }
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
```

## Note about `scf.parallel`

`scf.parallel` works only on memrefs right now. I think that, as a next step after the `gml_st.parallel` and `scf.foreach_thread` unification, we should reconsider whether `scf.parallel` is pulling its weight or whether we can cover its use case with `scf.foreach` as well. This is out of scope for the current RFC, but I would be interested to learn how `scf.parallel` is used when it has `scf.reduce` ops inside (a sketch of the pattern I mean follows). @bondhugula, @ftynse, @gysit
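For context, the pattern I have in mind is the buffer-based reduction form, roughly as below (a sketch based on the `scf.parallel`/`scf.reduce` syntax as I understand it):

```
// Sketch: parallel sum over a buffer; scf.reduce combines per-iteration
// values into the single loop result.
%sum = scf.parallel (%i) = (%c0) to (%n) step (%c1) init (%zero) -> f32 {
  %elem = memref.load %buf[%i] : memref<?xf32>
  scf.reduce(%elem) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %add = arith.addf %lhs, %rhs : f32
    scf.reduce.return %add : f32
  }
}
```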