Returning here with some more questions, this time also about the scalar replacement pass.
Consider the following function, which adds an input array to itself three times and writes the result to an output array, using some temporaries along the way:
func.func @testing(%input : memref<100xf32>, %output : memref<100xf32>) {
  %int1 = memref.alloc() : memref<100xf32>
  %int2 = memref.alloc() : memref<100xf32>
  affine.for %i = 0 to 100 {
    %0 = affine.load %input[%i] : memref<100xf32>
    %1 = affine.load %input[%i] : memref<100xf32>
    %2 = arith.addf %0, %1 : f32
    affine.store %2, %int1[%i] : memref<100xf32>
  }
  affine.for %i = 0 to 100 {
    %0 = affine.load %input[%i] : memref<100xf32>
    %1 = affine.load %int1[%i] : memref<100xf32>
    %2 = arith.addf %0, %1 : f32
    affine.store %2, %int2[%i] : memref<100xf32>
  }
  affine.for %i = 0 to 100 {
    %0 = affine.load %input[%i] : memref<100xf32>
    %1 = affine.load %int2[%i] : memref<100xf32>
    %2 = arith.addf %0, %1 : f32
    affine.store %2, %output[%i] : memref<100xf32>
  }
  return
}
First, why does the loop fusion pass also rewrite the sizes of the allocated temporary memrefs? This is its output:
func.func @testing(%arg0: memref<100xf32>, %arg1: memref<100xf32>) {
  %alloc = memref.alloc() : memref<1xf32>
  %alloc_0 = memref.alloc() : memref<1xf32>
  affine.for %arg2 = 0 to 100 {
    %0 = affine.load %arg0[%arg2] : memref<100xf32>
    %1 = affine.load %arg0[%arg2] : memref<100xf32>
    %2 = arith.addf %0, %1 : f32
    affine.store %2, %alloc_0[0] : memref<1xf32>
    %3 = affine.load %arg0[%arg2] : memref<100xf32>
    %4 = affine.load %alloc_0[0] : memref<1xf32>
    %5 = arith.addf %3, %4 : f32
    affine.store %5, %alloc[0] : memref<1xf32>
    %6 = affine.load %arg0[%arg2] : memref<100xf32>
    %7 = affine.load %alloc[0] : memref<1xf32>
    %8 = arith.addf %6, %7 : f32
    affine.store %8, %arg1[%arg2] : memref<100xf32>
  }
  return
}
While this is not the final form I’d like to use, it seems strange that the loop fusion pass would shrink the temporaries to single-element memrefs. In particular, the loop is no longer parallelizable, since every iteration now reads and writes the same two memory locations! Running the scalar replacement pass on the result eliminates these temporaries:
func.func @testing(%arg0: memref<100xf32>, %arg1: memref<100xf32>) {
  affine.for %arg2 = 0 to 100 {
    %0 = affine.load %arg0[%arg2] : memref<100xf32>
    %1 = arith.addf %0, %0 : f32
    %2 = affine.load %arg0[%arg2] : memref<100xf32>
    %3 = arith.addf %2, %1 : f32
    %4 = affine.load %arg0[%arg2] : memref<100xf32>
    %5 = arith.addf %4, %3 : f32
    affine.store %5, %arg1[%arg2] : memref<100xf32>
  }
  return
}
On to the second question: the scalar replacement pass didn’t get rid of all of the redundant loads (%arg0[%arg2] is still loaded three times)! Why not, and is that expected? Running the scalar replacement pass a second time gets us the desired computation:
func.func @testing(%arg0: memref<100xf32>, %arg1: memref<100xf32>) {
  affine.for %arg2 = 0 to 100 {
    %0 = affine.load %arg0[%arg2] : memref<100xf32>
    %1 = arith.addf %0, %0 : f32
    %2 = arith.addf %0, %1 : f32
    %3 = arith.addf %0, %2 : f32
    affine.store %3, %arg1[%arg2] : memref<100xf32>
  }
  return
}
Given some arbitrary function, is the proper way to use the affine scalar replacement pass to apply it repeatedly until a fixed point is reached?
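To make "apply it until a fixed point is reached" concrete, here is a minimal sketch of such a driver. It is only an illustration under stated assumptions: `mlir-opt` is on the PATH, the pass is invoked as `--affine-scalrep`, and comparing the printed IR as text is a good enough convergence check. The helper names `run_to_fixed_point` and `affine_scalrep` are hypothetical, not part of MLIR.

```python
import subprocess
from typing import Callable, Tuple


def run_to_fixed_point(
    transform: Callable[[str], str], text: str, max_iters: int = 10
) -> Tuple[str, int]:
    """Apply `transform` until its output stops changing.

    Returns the final text and the number of applications that changed it.
    Raises RuntimeError if no fixed point is reached within `max_iters`.
    """
    for changes in range(max_iters):
        new_text = transform(text)
        if new_text == text:
            # The last application was a no-op: fixed point reached.
            return text, changes
        text = new_text
    raise RuntimeError(f"no fixed point after {max_iters} iterations")


def affine_scalrep(ir: str, mlir_opt: str = "mlir-opt") -> str:
    """Run the affine scalar replacement pass once (assumes mlir-opt is installed).

    The IR is piped through stdin and the transformed IR is read from stdout.
    """
    result = subprocess.run(
        [mlir_opt, "--affine-scalrep"],
        input=ir,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With the fused function from above stored in a string `ir`, calling `run_to_fixed_point(affine_scalrep, ir)` would return the final IR together with how many runs actually changed it. Whether iterating like this is the intended usage of the pass is exactly what I am asking.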
Thanks.