Parallel loops on tensors in MLIR
scf::ForeachThreadOp (TableGen) and gml_st::ParallelOp (TableGen) are two loop-like operations that came to replace linalg.tiled_loop (RFC) after its deprecation. They have gone through many changes, and in their current form the differences are small enough to merge them, especially since there is interest in making that happen (Discourse).
Overview
scf.foreach_thread
scf.foreach_thread is a multi-dimensional loop that expresses the computation in terms of thread IDs. It has a special scf.foreach_thread.perform_concurrently terminator that contains one or more tensor.parallel_insert_slice operations, each specifying which slice of the shared output is updated by the given thread.
#map = affine_map<(d0) -> (d0 * 8)>
#map1 = affine_map<(d0) -> (d0 * 4)>
%matmul = scf.foreach_thread (%threadIdI, %threadIdJ) in (%c16, %c16)
shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
%offset_I = affine.apply #map(%threadIdI)
%offset_J = affine.apply #map1(%threadIdJ)
%lhs_slice = tensor.extract_slice %lhs[%offset_I, 0] [8, 16] [1, 1]
: tensor<128x16xf32> to tensor<8x16xf32>
%rhs_slice = tensor.extract_slice %rhs[0, %offset_J] [16, 4] [1, 1]
: tensor<16x64xf32> to tensor<16x4xf32>
%out_slice = tensor.extract_slice %out_[%offset_I, %offset_J] [8, 4] [1, 1]
: tensor<128x64xf32> to tensor<8x4xf32>
%matmul_slice = linalg.matmul
ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
scf.foreach_thread.perform_concurrently {
tensor.parallel_insert_slice %matmul_slice
into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
: tensor<8x4xf32> into tensor<128x64xf32>
}
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
The loop can also annotate its dimensions with the mapping attribute.
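For example, distributing the same loop over GPU threads instead of blocks only requires changing the attribute (a minimal sketch with the body elided; the #gpu.thread mapping attributes come from the GPU dialect):

%matmul = scf.foreach_thread (%threadIdI, %threadIdJ) in (%c16, %c16)
    shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
  // Same body as in the example above.
} {mapping = [#gpu.thread<y>, #gpu.thread<x>]}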
gml_st.parallel
gml_st.parallel is also a multi-dimensional loop that has lower bounds, upper bounds, and steps like scf.parallel. It has a gml_st.set_yield terminator that specifies which slice/element of the shared output is updated, using a !gml_st.tile-typed SSA value. gml_st.set_yield has a variadic number of src-dst-set triples and can also have an accumulator that specifies an RMW-like update of the output.
%matmul = gml_st.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
outs (%out_ = %fill: tensor<128x64xf32>) distribution ("block") {
%lhs_slice = tensor.extract_slice %lhs[%i, 0] [8, 16] [1, 1] :
tensor<128x16xf32> to tensor<8x16xf32>
%rhs_slice = tensor.extract_slice %rhs[0, %j] [16, 4] [1, 1] :
tensor<16x64xf32> to tensor<16x4xf32>
%out_slice = tensor.extract_slice %out_[%i, %j] [8, 4] [1, 1] :
tensor<128x64xf32> to tensor<8x4xf32>
%matmul_slice = linalg.matmul
ins(%lhs_slice, %rhs_slice: tensor<8x16xf32>, tensor<16x4xf32>)
outs(%out_slice: tensor<8x4xf32>) -> tensor<8x4xf32>
%tile = gml_st.tile [%i, %j] [8, 4] [1, 1] : !gml_st.tile<8x4>
gml_st.set_yield %matmul_slice into %out_[%tile] :
tensor<8x4xf32> into tensor<128x64xf32>[!gml_st.tile<8x4>]
} : tensor<128x64xf32>
The loop also has a distribution attribute that is used to map the dimensions of the loop to threads, blocks, etc.
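For example, the same loop could be distributed over threads instead of blocks by changing the label (a minimal sketch with the body elided; the set of accepted labels, e.g. "thread" or "warp", is defined by the downstream pipeline and is an assumption here):

%matmul = gml_st.parallel (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
    outs (%out_ = %fill: tensor<128x64xf32>) distribution ("thread") {
  // Same body as in the example above.
} : tensor<128x64xf32>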
Difference
When the gml_st.set_yield terminator does not update the same slice concurrently and does not insert individual elements, it can always be rewritten as scf.foreach_thread.perform_concurrently. Accumulator regions can be added to tensor.parallel_insert_slice (proposal). Single-element updates can be added as well, if they are really needed.
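A sketch of what an accumulator region on tensor.parallel_insert_slice could look like for an RMW-style update (hypothetical syntax for the proposed extension; the acc clause and the tensor.yield terminator used here are assumptions, not existing IR):

scf.foreach_thread.perform_concurrently {
  tensor.parallel_insert_slice %matmul_slice
      into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
      acc (%new, %old : tensor<8x4xf32>) {
    // Hypothetical combinator: accumulate into the destination slice
    // instead of overwriting it.
    %sum = arith.addf %new, %old : tensor<8x4xf32>
    tensor.yield %sum : tensor<8x4xf32>
  } : tensor<8x4xf32> into tensor<128x64xf32>
}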
Proposal
Introduce loop bounds for scf.foreach_thread. It will still support two modes of tiling, i.e. when the tile sizes are computed from a given thread count and when the tile sizes are given directly. At this point I would also drop _thread from the op name to make it less confusing.
Updated IR: tiling w.r.t. thread count.
#map = affine_map<(d0) -> (d0 * 8)>
#map1 = affine_map<(d0) -> (d0 * 4)>
%matmul = scf.foreach (%threadIdI, %threadIdJ)
= (%c0, %c0) to (%c16, %c16) step (%c1, %c1)
shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
%offset_I = affine.apply #map(%threadIdI)
%offset_J = affine.apply #map1(%threadIdJ)
%lhs_slice = tensor.extract_slice %lhs[%offset_I, 0] [8, 16] [1, 1]
: tensor<128x16xf32> to tensor<8x16xf32>
%rhs_slice = tensor.extract_slice %rhs[0, %offset_J] [16, 4] [1, 1]
: tensor<16x64xf32> to tensor<16x4xf32>
%out_slice = tensor.extract_slice %out_[%offset_I, %offset_J] [8, 4] [1, 1]
: tensor<128x64xf32> to tensor<8x4xf32>
%matmul_slice = linalg.matmul
ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
scf.foreach_yield {
tensor.update_slice %matmul_slice
into %out_[%offset_I, %offset_J] [8, 4] [1, 1]
: tensor<8x4xf32> into tensor<128x64xf32>
}
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
Updated IR: tiling w.r.t. tile size.
%matmul = scf.foreach (%i, %j) = (%c0, %c0) to (%c128, %c64) step (%c8, %c4)
shared_outs(%out_ = %fill) -> (tensor<128x64xf32>) {
%lhs_slice = tensor.extract_slice %lhs[%i, 0] [8, 16] [1, 1]
: tensor<128x16xf32> to tensor<8x16xf32>
%rhs_slice = tensor.extract_slice %rhs[0, %j] [16, 4] [1, 1]
: tensor<16x64xf32> to tensor<16x4xf32>
%out_slice = tensor.extract_slice %out_[%i, %j] [8, 4] [1, 1]
: tensor<128x64xf32> to tensor<8x4xf32>
%matmul_slice = linalg.matmul
ins(%lhs_slice, %rhs_slice : tensor<8x16xf32>, tensor<16x4xf32>)
outs(%out_slice : tensor<8x4xf32>) -> tensor<8x4xf32>
scf.foreach_yield {
tensor.update_slice %matmul_slice
into %out_[%i, %j] [8, 4] [1, 1]
: tensor<8x4xf32> into tensor<128x64xf32>
}
} {mapping = [#gpu.block<y>, #gpu.block<x>]}
Note about scf.parallel
scf.parallel works only on buffers right now. I think that as a next step, after the gml_st.parallel and scf.foreach_thread unification, we should reconsider whether scf.parallel is pulling its weight or whether its use case can be covered by scf.foreach as well. This is out of scope for the current RFC, but I would be interested to learn how scf.parallel is used when it has scf.reduce ops inside (a typical reduction pattern is sketched below for reference). @bondhugula, @ftynse, @gysit
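A typical scf.parallel reduction on buffers looks roughly like this (a sketch; %buffer, %lb, %ub, %step, and %init are assumed to be defined elsewhere):

%sum = scf.parallel (%iv) = (%lb) to (%ub) step (%step) init (%init) -> f32 {
  %elem = memref.load %buffer[%iv] : memref<100xf32>
  // scf.reduce carries the combinator region that merges per-iteration values.
  scf.reduce(%elem) : f32 {
  ^bb0(%lhs: f32, %rhs: f32):
    %res = arith.addf %lhs, %rhs : f32
    scf.reduce.return %res : f32
  }
}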