Hello everyone,
I’d like to suggest an additional operation in the affine dialect to represent parallel for loops. Please see the inline markdown text below; I look forward to everyone’s comments. We are currently experimenting with an implementation of this operation, and based on the feedback here, we hope to finalize the design and implementation for a future code review.
-Jeremy
# [RFC] Add affine.parallel
## Table of Contents
- Background
- Goals
- Proposal
- Summary
- IR Representation
- Examples
- Future Work
- FAQ
## Background

This proposal is intended to provide an effective representation of loops over affine induction variables whose iterations may safely be run in parallel. It draws on techniques and experience from the Stripe dialect discussed in the open design meeting on 2019-11-7, as well as a proposal for a parallel loop operation / dialect made by Stephan Herhut. Additionally, it was informed by the recent commit that adds a parallel for to the loop dialect.
## Goals

- Add a representation of parallel for loops with affine memory access patterns
- Allow lowering from implicitly parallel representations to maintain semantic parallelism
- Allow dependence analysis and other techniques to infer parallelism and capture it in the IR
- Enable rearrangement of iterations (such as in tiling) without needing to make changes to the loop body or break perfect nesting via `affine.apply`
- Allow representation of parallelizable reductions within parallel for loops
- As a specific use case, we want to be able to replace our Stripe dialect with the Affine dialect. Adding parallel for loops that meet the above goals would provide most of the functionality we need for this.
## Proposal

We propose adding a parallel for operation to the Affine dialect with a pretty-printed form like:

```mlir
affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
  %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
  affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
}
```
This parallel for operation iterates over induction variables (in this case `%i0` and `%i1`), each of which has a lower and upper bound, along with an optional step size. The lower and upper bounds can additionally take affine dimensions and symbols passed to the parallel for as SSA operands, providing further generality.
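For instance, a non-unit step might be written as follows (a sketch only; the `step` keyword and the reuse of `%I` and `%O` from the example above are illustrative assumptions, since the pretty syntax for steps is not finalized in this proposal):

```mlir
// Visit every other row of %I: %i0 takes the values 0, 2, ..., 14.
affine.parallel %i0 = 0 to 16 step 2 {
  %0 = affine.load %I[%i0, 0] : memref<16x10xf32>
  affine.store %0, %O[0, %i0] : memref<10x16xf32>
}
```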
## IR Representation

We propose the corresponding IR specification (with details elided) to be:

```tablegen
def ParallelOp {
  let arguments = (ins
      AffineMapAttr:$lowerBounds,
      AffineMapAttr:$upperBounds,
      IndexArrayAttr:$steps,
      Variadic<IndexType>:$mapOperands);
  let regions = (region SizedRegion<1>:$region);
}
```
where `$lowerBounds` and `$upperBounds` are affine maps that each produce N results, N being the number of induction variables in the parallel for. The bounds represent an inclusive lower bound and an exclusive upper bound for each induction variable. `$steps` is an array attribute holding N positive step sizes, one per induction variable. The values in `$mapOperands` will be used as the map arguments of `lowerBounds` and `upperBounds`, in that order. In the actual implementation, helper methods similar to those in `affine.for` will be provided to manage `mapOperands` more easily.
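To make the operand ordering concrete, here is a minimal sketch in the generic form used in the examples below, showing a one-dimensional loop whose lower bound takes a dimension operand and whose upper bound takes a symbol operand (`%d` and `%n` are assumed to be `index` values defined earlier):

```mlir
// mapOperands = (%d, %n): the arguments of lowerBounds come first,
// followed by the arguments of upperBounds.
affine.parallel(%d, %n) ({
^bb0(%i0: index):
  affine.terminator() : () -> ()
}) {
  lowerBounds = (d0) -> (d0),
  upperBounds = ()[s0] -> (s0),
  steps = [1]
} : (index, index) -> ()
```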
## Examples
Here are some examples of simple operations expressed using parallel for.
### Fixed Size Transpose

Use a parallel for to transpose a 16 by 10 tensor `I` to a 10 by 16 tensor `O` (this is the same example shown above):
```mlir
func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
    %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
    affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
  }
  return
}
```
A direct representation (without pretty printing) of the IR for this would be:

```mlir
func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel() ({
  ^bb0(%i0: index, %i1: index):  // no predecessors
    %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
    affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
    affine.terminator() : () -> ()
  }) {
    lowerBounds = () -> (0, 0),
    upperBounds = () -> (16, 10),
    steps = [1, 1]
  } : () -> ()
  return
}
```
### Dynamic Size Transpose

Use a parallel for to transpose an `N` by `M` tensor `I` to an `M` by `N` tensor `O`:
```mlir
func @sym_transpose(%I: memref<?x?xf32>, %O: memref<?x?xf32>, %N: index, %M: index) {
  affine.parallel %i0 = 0 to sym(%N), %i1 = 0 to sym(%M) {
    %0 = affine.load %I[%i0, %i1] : memref<?x?xf32>
    affine.store %0, %O[%i1, %i0] : memref<?x?xf32>
  }
  return
}
```
A direct representation (without pretty printing) of the IR for this would be:

```mlir
func @sym_transpose(%I: memref<?x?xf32>, %O: memref<?x?xf32>, %N: index, %M: index) {
  affine.parallel(%N, %M) ({
  ^bb0(%i0: index, %i1: index):  // no predecessors
    %0 = affine.load %I[%i0, %i1] : memref<?x?xf32>
    affine.store %0, %O[%i1, %i0] : memref<?x?xf32>
    affine.terminator() : () -> ()
  }) {
    lowerBounds = () -> (0, 0),
    upperBounds = ()[s0, s1] -> (s0, s1),
    steps = [1, 1]
  } : (index, index) -> ()
  return
}
```
### Tiling of a Transpose

The transpose from the first example, tiled into 4 by 5 tiles:
```mlir
func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel %j0 = 0 to 4, %j1 = 0 to 2 {
    affine.parallel %i0 = 4*%j0 to 4*%j0 + 4, %i1 = 5*%j1 to 5*%j1 + 5 {
      %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
      affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
    }
  }
  return
}
```
A direct representation (without pretty printing) of the IR for this would be:

```mlir
func @fixed_transpose(%I: memref<16x10xf32>, %O: memref<10x16xf32>) {
  affine.parallel() ({
  ^bb0(%j0: index, %j1: index):  // no predecessors
    affine.parallel(%j0, %j1, %j0, %j1) ({
    ^bb0(%i0: index, %i1: index):  // no predecessors
      %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
      affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
      affine.terminator() : () -> ()
    }) {
      lowerBounds = (d0, d1) -> (4*d0, 5*d1),
      upperBounds = (d0, d1) -> (4*d0 + 4, 5*d1 + 5),
      steps = [1, 1]
    } : (index, index, index, index) -> ()
    affine.terminator() : () -> ()
  }) {
    lowerBounds = () -> (0, 0),
    upperBounds = () -> (4, 2),
    steps = [1, 1]
  } : () -> ()
  return
}
```
## Future Work

This can be combined with an affine atomic read-modify-write operation to represent parallel reductions. We intend to also implement such an operation; see also this diff adding LLVM’s `atomicrmw` op to the LLVM dialect.
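As a sketch of how such a reduction might look (the `affine.atomic_rmw` name and syntax below are hypothetical and not part of this proposal):

```mlir
// Sum each column of %I into %O. Iterations over %i0 may race on
// %O[%i1], so the update must be an atomic read-modify-write.
affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
  %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
  // Hypothetical op, analogous to LLVM's atomicrmw with a float add.
  affine.atomic_rmw "addf" %0, %O[%i1] : memref<10xf32>
}
```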
It would also be possible to model reductions using a design similar to the newly added `loop.reduce` operation.

Additionally, it may be useful to support symbolic step sizes for both `affine.parallel` and `affine.for`, since such symbolic steps should be compatible with affine analysis.
## FAQ
Here are some questions that came up as we were making design decisions for parallel for, and the corresponding rationales for our decisions.
### Why use multiple induction variables in a single for loop?

There is no semantic difference between different nesting orders of multiple parallel for loops. By allowing multiple induction variables, we eliminate redundant forms of expressing the same operation, which yields a more concise representation that is easier to canonicalize.
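For example, the two forms below (reusing the transpose from earlier) describe exactly the same set of parallel iterations; the multi-variable form gives that iteration space a single canonical encoding:

```mlir
// Nested single-variable form: interchanging the two loops produces
// textually different but semantically identical IR.
affine.parallel %i0 = 0 to 16 {
  affine.parallel %i1 = 0 to 10 {
    %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
    affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
  }
}

// Multi-variable form: one canonical encoding of the same space.
affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
  %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
  affine.store %0, %O[%i1, %i0] : memref<10x16xf32>
}
```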
Moreover, we believe a single parallel for operation with multiple induction variables provides a more intuitive representation of the semantics. We are concerned that nested parallel for loops might be read as “first iterate over this index in parallel, then iterate over that index in parallel”. The intended semantics of “iterate over these indices in parallel” should be clearer with a single loop op and multiple induction variables.
This also means that nesting of parallel for loops is left free to represent structures that are genuinely meaningful. For example, a tiling pass might transform a single parallel for into two nested parallel fors, representing loops whose memory will live at different levels of the cache hierarchy.
### Why put this in the Affine dialect instead of using something more general (e.g. `loop.parallel`)?
Our use cases must interact with other components of the Affine dialect, such as `affine.if`, `affine.load`, and `affine.store`. We also want optimizations that make heavy use of the easier aliasing analysis available with affine accesses. Thus, we benefit both from integrating with other ops within the same dialect and from being able to write optimizations knowing we are dealing specifically with an affine parallel for.
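As a small illustration of this interplay, here is a sketch (the `@copy_interior` function, shapes, and `#interior` set are our own illustrative choices) of `affine.parallel` guarding accesses with `affine.if`, which keeps the whole nest analyzable by affine machinery:

```mlir
// Integer set selecting interior points of a 16x10 iteration space.
#interior = affine_set<(d0, d1) : (d0 - 1 >= 0, 14 - d0 >= 0,
                                   d1 - 1 >= 0, 8 - d1 >= 0)>

func @copy_interior(%I: memref<16x10xf32>, %O: memref<16x10xf32>) {
  affine.parallel %i0 = 0 to 16, %i1 = 0 to 10 {
    // The guard is an affine constraint, so dependence analysis can
    // still reason precisely about the load and store below.
    affine.if #interior(%i0, %i1) {
      %0 = affine.load %I[%i0, %i1] : memref<16x10xf32>
      affine.store %0, %O[%i0, %i1] : memref<16x10xf32>
    }
  }
  return
}
```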
### Why have a separate `parallel` operation instead of adding an attribute to the current `for` indicating whether it’s parallel?

We think it is better to keep serial and parallel for as separate operations because their ideal structures differ enough, both in the operations as they exist today and in how possible extensions to them would be represented.
In terms of current structure, we think a single parallel for loop should allow multiple induction variables for semantic clarity. With serial for this is less clear, as serial for requires an ordering of the induction variables, which is typically represented by nesting multiple loop operations. It is certainly possible to represent multiple induction variables in a single serial for operation, and perhaps even desirable; however, we prefer not to make that decision for serial for while designing parallel for.

In terms of possible extensions, there have been discussions on the mailing list regarding adding support for loop-carried values to serial for. However, loop-carried values only make sense in the context of serial for.
### Why not make parallelizability a flag on individual induction variables (instead of on the entire multivariable loop)?

Primarily because that would mean having a single unified serial and parallel for operation, which we decided against for the reasons described in the previous question. Also, parallel induction variables can be freely interchanged while serial induction variables cannot. Including both serial and parallel induction variables in a single operation on equal footing strikes us as too likely to cause confusion, especially for optimization pass writers who are primarily thinking about one case (serial or parallel).