Currently, the affine dialect uses a very simple terminator (affine.terminator) for its flow control structures: affine.if, affine.for, and affine.parallel. However, the scf dialect uses scf.yield as its terminator, and provides a method to pass SSA values to the enclosing operation to useful effect. For example, both branches of an scf.if may yield a value, which becomes a result of the scf.if operation; scf.for uses yield to pass loop-carried values; etc.
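For reference, this is what a result-carrying scf.if looks like today (assuming %cond, %a, and %b are in scope):

%x = scf.if %cond -> (f32) {
  scf.yield %a : f32
} else {
  scf.yield %b : f32
}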
It seems reasonable to add similar functionality to the affine dialect (including the appropriate lowerings to scf) by replacing affine.terminator with affine.yield. Yielding no values would be equivalent to the existing implementation. Blocks which yield no values would be printed/parsed with an implicit yield (also similar to scf). In general, this change should be 100% backward compatible.
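That is, existing IR like the following would keep its current appearance, with a zero-operand affine.yield implied at the end of the block (%v and %buf assumed in scope):

affine.for %i = 0 to 10 {
  affine.store %v, %buf[%i] : memref<10xf32>
}  // implicit 'affine.yield' with no operands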
I’ve put up an initial diff for review making some of the needed changes (affine.terminator -> affine.yield, IR changes to affine.if, IR changes to affine.parallel), which is available here: https://reviews.llvm.org/D82600. I’ll make additional changes based on this discussion, as well as provide more complete documentation as suggested by @bondhugula, once we have a basic consensus.
As for the detailed semantics of each instruction, I think there is some room for discussion. I’ll give my current opinions to get things started.
affine.if
Similar semantics to scf.if. Specifically, affine.if can return any number of values, and for each returned value, both branches of the if must ‘affine.yield’ the same number of values with matching types. The values returned by the if are the values yielded by whichever branch was executed. The ‘else’ branch of affine.if can now be omitted only in the case of no returns.
Example: Initializing a buffer with edge padding
#interior = affine_set<(i, j) : (i - 1 >= 0, j - 1 >= 0, 10 - i >= 0, 10 - j >= 0)>
func @pad_edges(%I : memref<10x10xf32>) -> (memref<12x12xf32>) {
  %O = alloc() : memref<12x12xf32>
  affine.parallel (%i, %j) = (0, 0) to (12, 12) {
    %1 = affine.if #interior(%i, %j) -> f32 {
      %2 = affine.load %I[%i - 1, %j - 1] : memref<10x10xf32>
      affine.yield %2 : f32
    } else {
      %2 = constant 0.0 : f32
      affine.yield %2 : f32
    }
    affine.store %1, %O[%i, %j] : memref<12x12xf32>
  }
  return %O : memref<12x12xf32>
}
Currently my diff includes the IR changes for these semantics, but not the lowering changes.
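I’d expect the eventual lowering to extend the existing affine.if -> scf.if pattern naturally, mapping yields one-to-one. A hand-written sketch of what the result might look like for the example above (condition materialization elided; the existing lowering already builds an i1 from the integer set constraints):

%cond = ...  // #interior(%i, %j) evaluated via the usual cmpi/and expansion
%1 = scf.if %cond -> (f32) {
  %2 = load %I[...] : memref<10x10xf32>  // indices lowered to std arithmetic
  scf.yield %2 : f32
} else {
  %2 = constant 0.0 : f32
  scf.yield %2 : f32
}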
affine.for
Nearly identical semantics to scf.for. affine.for would gain a new ‘iter_args’ section which specifies a set of input values; these would initialize a set of loop-carried values. The inner block arguments would then be augmented to include not only the induction variables, but also the ‘current state’ of the loop-carried values. An affine.yield at the end of the block would specify the new values for the next loop iteration, and finally, the ultimate yielded values would be returned by the affine.for instruction itself.
Example: Serial accumulation:
func @serial_sum(%In : memref<100xf32>) -> f32 {
  %cst0 = constant 0.0 : f32
  %tot = affine.for %i = 0 to 100 iter_args(%subtot = %cst0) -> (f32) {
    %0 = affine.load %In[%i] : memref<100xf32>
    %1 = addf %subtot, %0 : f32
    affine.yield %1 : f32
  }
  return %tot : f32
}
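For comparison, the same accumulation written against today’s scf.for, which already supports this pattern (constants written out since scf bounds are SSA values):

%c0 = constant 0 : index
%c1 = constant 1 : index
%c100 = constant 100 : index
%tot = scf.for %i = %c0 to %c100 step %c1 iter_args(%subtot = %cst0) -> (f32) {
  %0 = load %In[%i] : memref<100xf32>
  %1 = addf %subtot, %0 : f32
  scf.yield %1 : f32
}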
I don’t intend to make this change in my initial commit, as it is perhaps the most complex and is also not relevant to my particular use case.
affine.parallel
Here my current diff diverges slightly from scf.parallel. As with scf.parallel, the interior of affine.parallel is allowed to produce values, which are then reduced and returned by the affine.parallel instruction. However, rather than the method used in scf.parallel, I suggest simply ‘yielding’ the values from the interior, and for each value specifying a reduction operation (which implies an ‘identity’ value, defining the results for 0-trip loops).
Example 1: Parallel accumulation:
func @sum(%A : memref<100x100xf32>) -> f32 {
  %out = affine.parallel (%i, %j) = (0, 0) to (100, 100) reduce ("addf") -> (f32) {
    %0 = affine.load %A[%i, %j] : memref<100x100xf32>
    affine.yield %0 : f32
  }
  return %out : f32
}
Example 2: 2d 3x3 centered valid convolution:
func @conv_2d(%D : memref<100x100xf32>, %K : memref<3x3xf32>) -> (memref<98x98xf32>) {
  %O = alloc() : memref<98x98xf32>
  affine.parallel (%x, %y) = (0, 0) to (98, 98) {
    %0 = affine.parallel (%kx, %ky) = (0, 0) to (3, 3) reduce ("addf") -> (f32) {
      %1 = affine.load %D[%x + %kx, %y + %ky] : memref<100x100xf32>
      %2 = affine.load %K[%kx, %ky] : memref<3x3xf32>
      %3 = mulf %1, %2 : f32
      affine.yield %3 : f32
    }
    affine.store %0, %O[%x, %y] : memref<98x98xf32>
  }
  return %O : memref<98x98xf32>
}
Here, the reduction op specified comes from a fixed set (currently I use the set from std.atomicrmw, although @dcaballe correctly points out this should probably be renamed at a minimum). The advantages of a fixed set are:
- Easier pattern matching for lowerings to hardware that has fixed function parallel reduction units.
- Less complex IR.
- Initializations (and 0-trip results) can be defined as identity values for the operation in question (i.e. 0 for sum, 1 for mul, -inf for max); see the sketch after this list.
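To make the identity behavior concrete, here is a sketch under the proposed syntax: a loop whose bounds give a 0-trip count would return the identity of its reduction op rather than an undefined value (%A assumed in scope):

// 0 iterations: %r is the identity of "addf", i.e. 0.0
%r = affine.parallel (%i) = (0) to (0) reduce ("addf") -> (f32) {
  %v = affine.load %A[%i] : memref<100xf32>
  affine.yield %v : f32
}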
The advantages of following the method from scf are:
- More consistent with scf, of course
- More general (see the sketch below)
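For concreteness, the scf style expresses the combinator as a region on scf.reduce; the parallel accumulation example from above would look roughly like this (bounds constants and %zero assumed defined):

%out = scf.parallel (%i, %j) = (%c0, %c0) to (%c100, %c100) step (%c1, %c1) init (%zero) -> f32 {
  %0 = load %A[%i, %j] : memref<100x100xf32>
  scf.reduce(%0) : f32 {
  ^bb0(%lhs : f32, %rhs : f32):
    %2 = addf %lhs, %rhs : f32
    scf.reduce.return %2 : f32
  }
}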
Perhaps it is worth the additional complexity? I’m somewhat skeptical, given that our prior work (which basically implements all of the Keras backend API) never ran into any case needing more than a fixed set of reduction operations. However, I recognize that the MLIR community is broader, and so there may be lots of use cases out there I’m not considering. Certainly some concrete examples of places where more general reduction is desirable would help frame the discussion, if anyone knows some good ones.