Some recent patches have landed to implement pipelining of an `scf.for` operation. The current setup is that the transformation uses a callback through which the caller supplies two things:

- A list of lists of instructions, where each inner list is the sequence of instructions that belong to a stage.
- A valid global ordering of instructions (across stages) that the generated pipelined code should use.

While this is functionally enough to implement the transformation, there also seems to be a case for better modeling of pipelined computation. @ThomasRaoux and I have been iterating over a design, and I am posting what we have so far. There are still some unanswered questions, so we are also looking for feedback from the broader community. This is not an RFC or a final design, but more a WIP.
## Pipelined loop
Pipelining is meant to increase the dependence distance between the definition of a value and its use, to allow interspersing instructions that hide the latency of long-latency instructions (like loads). For a 1-D loop, the dependence between a definition and a use can be modeled by a dependence vector `{i, j}`, where

- `i` represents the number of iterations of the loop between the def and the use
- `j` represents the number of instructions, in lexicographic order, between the def and the use.

These form an ordered pair. For the untransformed loop, the `i` value is `0`. Pipelining increases this value to `> 0`. Also, at the MLIR level the lexicographic ordering of instructions is meaningless: the final generated code is only guaranteed to respect the ordering implied by SSA use-def chains. So the design below tries to explicitly represent the intended dependence distance across iterations, `i`, after pipelining. The `j` value is represented through SSA use-def chains.
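To make the effect on `i` concrete, here is a small Python sketch (purely illustrative, not any MLIR API) of a loop before and after pipelining: in the original loop a load is defined and used in the same iteration (`i = 0`); in the pipelined loop, iteration `k` consumes the value loaded for iteration `k - 1` (`i = 1`), so the next load can overlap with the use.

```python
# Illustrative sketch: pipelining moves a load's def and its use
# one iteration apart (dependence distance i goes from 0 to 1).

def sequential(A):
    # i = 0: the load is defined and used in the same iteration.
    out = []
    for k in range(len(A)):
        v = A[k]           # def
        out.append(v + 1)  # use, same iteration
    return out

def pipelined(A):
    # i = 1: iteration k uses the value loaded in iteration k - 1,
    # so the load's latency can be hidden behind the add.
    out = []
    v = A[0]                   # prologue: issue the first load early
    for k in range(1, len(A)):
        nxt = A[k]             # def for iteration k (stage 0)
        out.append(v + 1)      # use of iteration k-1's load (stage 1)
        v = nxt
    out.append(v + 1)          # epilogue: drain the last in-flight value
    return out

A = [3, 1, 4, 1, 5]
assert pipelined(A) == sequential(A) == [4, 2, 5, 2, 6]
```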
The design we have so far is to use two ops:

- `scf.pipeline_for`, which has:
  - Three operands for the bounds of the loop: lower bound, upper bound, and step.
  - A variadic list of `Value`s, `iter_args`, for loop-carried values.
  - A region with a single block that is not isolated from above and contains only `scf.pipeline_stage` operations and an `scf.yield` terminator that yields the values to be forwarded to the next iteration. The op does not represent the induction variable of the loop explicitly.
- `scf.pipeline_stage`, which has:
  - A variadic list of `Value`s that are either results of other `scf.pipeline_stage` operations or elements of the `iter_args` of the surrounding `scf.pipeline_for` operation.
  - A list of integer attributes, with size equal to the number of operands. Each integer represents the intended dependence distance (along the loop iteration dimension) between the producer stage and the consumer stage.
  - A region with a single block that is not isolated from above and has an `scf.yield` terminator that yields the values for the results of the operation. The first block argument is the induction variable of the loop; the remaining block arguments correspond to the operands of the stage.
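To make the shape of the two ops concrete, here is a hypothetical Python mock-up (class and field names are illustrative only, not any MLIR API) of the information they carry:

```python
# Hypothetical mock-up of the proposed ops' payload; names are illustrative.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PipelineStage:
    # Operands: results of earlier stages or iter_args of the pipeline_for.
    ins: List[str]
    # One intended inter-iteration distance per operand.
    dependence_distance: List[int]
    # Region body: takes (iv, *ins); the j component of the dependence
    # is implied by SSA use-def chains inside the body.
    body: Callable

@dataclass
class PipelineFor:
    lb: int
    ub: int
    step: int
    iter_args: List[str] = field(default_factory=list)
    stages: List[PipelineStage] = field(default_factory=list)

# A two-stage loop: stage 1 consumes stage 0's result at distance 1.
loop = PipelineFor(lb=0, ub=4, step=1, stages=[
    PipelineStage(ins=[], dependence_distance=[], body=lambda iv: None),
    PipelineStage(ins=["%0"], dependence_distance=[1],
                  body=lambda iv, a: None),
])
# Invariant from the spec: one distance per operand.
assert all(len(s.ins) == len(s.dependence_distance) for s in loop.stages)
```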
As an example, here is one of the existing tests:
```mlir
scf.for %i0 = %c0 to %c4 step %c1 {
  %A_elem = memref.load %A[%i0] { __test_pipelining_stage__ = 0, __test_pipelining_op_order__ = 2 } : memref<?xf32>
  %A1_elem = addf %A_elem, %cf { __test_pipelining_stage__ = 1, __test_pipelining_op_order__ = 0 } : f32
  memref.store %A1_elem, %result[%i0] { __test_pipelining_stage__ = 1, __test_pipelining_op_order__ = 1 } : memref<?xf32>
} { __test_pipelining_loop__ }
```
This could be represented as:
```mlir
scf.pipeline_for %lb to %ub step %step {
  %0 = scf.pipeline_stage {
  ^bb0(%iv : index):
    %1 = memref.load %A[%iv] : memref<?xf32>
    scf.yield %1 : f32
  } -> f32
  scf.pipeline_stage ins(%0 : f32) dependence_distance = [1] {
  ^bb0(%iv : index, %arg0 : f32):
    %2 = addf %arg0, %cf : f32
    memref.store %2, %result[%iv] : memref<?xf32>
  }
}
```
When lowered to `scf.for`, the distance between the stage producing `%0` and the stage consuming it is intended to be `1` (it is `0` to begin with). One could also use a value greater than `1` if a longer latency needs to be hidden. The `dependence_distance` attribute captures the intended dependence distance between the def and the use of the value along the loop iteration dimension, i.e. the `i` value above.
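The intended semantics of `dependence_distance` can be sketched in Python (an assumed execution model, not the actual lowering): a two-stage pipeline where the consuming stage runs `D` iterations behind the producing stage, so `D` loads are kept in flight. `D = 0` is the untransformed loop; `D = 1` matches the example above; larger `D` corresponds to hiding a longer latency.

```python
# Assumed execution model for dependence_distance = [D] between a
# load stage and an add+store stage; not the actual lowering algorithm.
from collections import deque

def run_pipeline(A, cf, D):
    result = [None] * len(A)
    in_flight = deque()              # values produced but not yet consumed
    for k in range(len(A) + D):      # D extra iterations drain the epilogue
        if k < len(A):
            in_flight.append(A[k])   # stage 0: load for iteration k
        if k >= D:
            # stage 1: consume the value loaded D iterations earlier
            result[k - D] = in_flight.popleft() + cf
    return result

A = [3.0, 1.0, 4.0, 1.0]
ref = [a + 0.5 for a in A]
assert run_pipeline(A, 0.5, 0) == ref  # original loop (distance 0)
assert run_pipeline(A, 0.5, 1) == ref  # distance 1, as in the example
assert run_pipeline(A, 0.5, 2) == ref  # longer latency: 2 loads in flight
```

Whatever the distance, the results are the same; only how much load latency can be overlapped with other work changes.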
The following example shows the representation in cases with loop-carried dependences.
```mlir
%r = scf.for %i0 = %c0 to %c4 step %c1 iter_args(%arg0 = %cf) -> (f32) {
  %A_elem = memref.load %A[%i0] { __test_pipelining_stage__ = 0, __test_pipelining_op_order__ = 2 } : memref<?xf32>
  %A1_elem = addf %A_elem, %arg0 { __test_pipelining_stage__ = 1, __test_pipelining_op_order__ = 1 } : f32
  %A2_elem = mulf %cf, %A1_elem { __test_pipelining_stage__ = 2, __test_pipelining_op_order__ = 0 } : f32
  scf.yield %A2_elem : f32
} { __test_pipelining_loop__ }
return %r : f32
```
This could be represented as:
```mlir
%res = scf.pipeline_for %lb to %ub step %step iter_args(%arg0 = %init) {
  %0 = scf.pipeline_stage {
  ^bb0(%iv : index):
    %1 = memref.load %A[%iv] : memref<?xf32>
    scf.yield %1 : f32
  } -> f32
  %1 = scf.pipeline_stage ins(%0, %arg0 : f32, f32) dependence_distance = [1, 0] {
  ^bb0(%iv : index, %arg1 : f32, %arg2 : f32):
    %2 = addf %arg1, %arg2 : f32
    scf.yield %2 : f32
  } -> f32
  %2 = scf.pipeline_stage ins(%1 : f32) dependence_distance = [1] {
  ^bb0(%iv : index, %arg1 : f32):
    %3 = mulf %cf, %arg1 : f32
    scf.yield %3 : f32
  } -> f32
  scf.yield %2 : f32
}
```
Notice that the stage producing `%1` uses a value from the `iter_args` of the surrounding `scf.pipeline_for`. Since the value comes from the previous iteration of the loop, the dependence distance is `1` to begin with, but the `dependence_distance` attribute specifies that the intended distance is `0`. This implies that the dependence between the def and the use is to be reduced.

Note that this representation seems to easily allow changing the intended dependence distance between stages until the op is lowered to `scf.for`.
## Lowering to `scf.for`
One thing that we haven't worked out fully is the exact algorithm to lower the `scf.pipeline_for` to `scf.for`. The op itself, though, has all the information needed to be lowerable to an `scf.for`, i.e.

- The DAG of dependences between stages can help emit the prologue, the body of the loop, and the epilogue.
- For cases where a dependence distance has to be reduced, the stages can be "rotated" such that none of the `dependence_distance` values for operands coming from `iter_args` is `> 0`.
## Next steps
To fully evaluate this design, we lack a good idea of what use cases it might serve. If it is intended to be just an intermediate step before lowering to `scf.for`, then the utility of this abstraction is limited (what is currently implemented does work); at best it would help with debugging. Specifically, it would be good to get an idea of whether there are use cases that motivate transformations after lowering to `scf.pipeline_for` but before lowering to `scf.for`. If those transformations are made easier by this representation, that would motivate the way forward.
@nicolasvasilache @herhut @mehdi_amini any thoughts?