[RFC][HLFIR] Optimized Buffierization for elemental array updates
TL;DR
In this RFC I propose a change to MLIR’s CSE pass and a new optimization pass for HLFIR. A prototype implementation demonstrates substantial improvement to spec2017 exchange2 and cam4 performance.
Introduction
HLFIR currently produces very bad code for exchange2 (spec2017). @jeanPerier narrowed down the problem to a kernel which looks like
module m
integer, parameter :: r = 9
integer :: block(r, r, r)
contains
subroutine repro(imin, imax, row)
integer :: imin, imax, row, i1
do i1 = imin,imax
block(row, 2:9, i1) = block(row, 2:9, i1) - 10
end do
end subroutine
end module
The (canonicalised) HLFIR lowering for the do loop is
%15:2 = fir.do_loop %arg3 = %11 to %13 step %c1 iter_args(%arg4 = %14) -> (index, i32) {
fir.store %arg4 to %6#1 : !fir.ref<i32> // %6 is i1
%16 = fir.load %9#0 : !fir.ref<i32> // %9 is row
%17 = fir.convert %16 : (i32) -> i64
%18 = fir.load %6#0 : !fir.ref<i32>
%19 = fir.convert %18 : (i32) -> i64
%20 = fir.shape %c8 : (index) -> !fir.shape<1>
%21 = hlfir.designate %2#0 (%17, %c2:%c9:%c1, %19) shape %20 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
%22 = hlfir.elemental %20 unordered : (!fir.shape<1>) -> !hlfir.expr<8xi32> {
^bb0(%arg5: index):
%33 = hlfir.designate %21 (%arg5) : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
%34 = fir.load %33 : !fir.ref<i32>
%35 = arith.subi %34, %c10_i32 : i32
hlfir.yield_element %35 : i32
}
%23 = fir.load %9#0 : !fir.ref<i32> // the next few values duplicate values %16 to %21
%24 = fir.convert %23 : (i32) -> i64
%25 = fir.load %6#0 : !fir.ref<i32>
%26 = fir.convert %25 : (i32) -> i64
%27 = fir.shape %c8 : (index) -> !fir.shape<1>
%28 = hlfir.designate %2#0 (%24, %c2:%c9:%c1, %26) shape %27 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
// it isn't obvious from trivial analysis that the array being assigned to is the same array being read from
hlfir.assign %22 to %28 : !hlfir.expr<8xi32>, !fir.box<!fir.array<8xi32>>
hlfir.destroy %22 : !hlfir.expr<8xi32>
%29 = arith.addi %arg3, %c1 : index
%30 = fir.convert %c1 : (index) -> i32
%31 = fir.load %6#1 : !fir.ref<i32>
%32 = arith.addi %31, %30 : i32
fir.result %29, %32 : index, i32
}
If the above HLFIR is bufferized using the existing HLFIR bufferization pass, a new array temporary is created for %22 and this is assigned to %28 using the assign runtime function.
Lowering directly to FIR (not using HLFIR) does much better. That produces a do loop which modifies the array in place, much as the source code is written.
I propose a new HLFIR optimized bufferization pass which detects special cases like this where a more profitable bufferization can be used than the catch-all implementation in the HLFIR bufferization pass. I would like this to be a separate pass to the existing HLFIR bufferization pass to make it easier to read and debug.
To achieve this there are two problems to solve:
- it isn’t obvious that the array being assigned to is the same one being read from because the hlfir.designate (and supporting loads and conversions) are duplicated before and after the hlfir.elemental operation.
- the pass needs to determine when it is safe to perform the transformation (e.g. spotting conflicting array element accesses)
MLIR CSE pass and recursive memory effects
CC: @River707 @clementval
To achieve goal (1) above, the obvious solution is to run MLIR’s common subexpression eleimination pass. This pass is already part of the FIR optimization pipeleine.
When the CSE pass finds two operations which appear to be duplicates, it looks at all operations in between the duplicates for any operations that have the write memory effect - if no writes are found the duplicated operation is replaced with the original value. If any operation is found which does not have a memory effects interface, the pass assumes a write effect.
There are no operations with write memory effects in between %16 and %23, but the elemental operation has recursive memory effects. Currently the CSE pass treats recursive memory effects as though the operaiton does not specify any memory effects.
I propose to add support for operations with recursive memory effects by taking their effects as the union of all the effects of operations in their regions. There is precendent for this in the implementation of wouldOpBeTriviallyDeadImpl()
in mlir/lib/Interfaces/SideEffectInterfaces.cpp
.
With this change, values %23-%28 can be removed by CSE, giving the following:
%15:2 = fir.do_loop %arg3 = %11 to %13 step %c1 iter_args(%arg4 = %14) -> (index, i32) {
fir.store %arg4 to %6#1 : !fir.ref<i32>
%16 = fir.load %9#0 : !fir.ref<i32>
%17 = fir.convert %16 : (i32) -> i64
%18 = fir.load %6#0 : !fir.ref<i32>
%19 = fir.convert %18 : (i32) -> i64
%20 = fir.shape %c8 : (index) -> !fir.shape<1>
%21 = hlfir.designate %2#0 (%17, %c2:%c9:%c1, %19) shape %20 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
%22 = hlfir.elemental %20 unordered : (!fir.shape<1>) -> !hlfir.expr<8xi32> {
^bb0(%arg5: index):
%27 = hlfir.designate %21 (%arg5) : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
%28 = fir.load %27 : !fir.ref<i32>
%29 = arith.subi %28, %c10_i32 : i32
hlfir.yield_element %29 : i32
}
hlfir.assign %22 to %21 : !hlfir.expr<8xi32>, !fir.box<!fir.array<8xi32>>
hlfir.destroy %22 : !hlfir.expr<8xi32>
%23 = arith.addi %arg3, %c1 : index
%24 = fir.convert %c1 : (index) -> i32
%25 = fir.load %6#1 : !fir.ref<i32>
%26 = arith.addi %25, %24 : i32
fir.result %23, %26 : index, i32
}
HLFIR pass
Given the above IR (after CSE), it is obvious that the elemental body reads from the same array that the elemental expression is assigned to (as both have the same SSA value). It is also obvious that the elemental has the same shape as the array.
The pattern should match if the following is true:
- There are exactly two uses of the elemental expression: one in hlfir.assign and one in hlfir.destroy
- The above assignment is to a
fir::SequenceType
(afterfir::unwrapPassByRefType
), I will call this “the array”. A hlfir.expr could be supported in the future. - Elements of the array are of a trivial type (to keep the pass simple, this could be extended later)
- The array value dominates the elemental
- The array and the elemental have the same shape
- There are no writes to the array in the elemental body (the source code write is performed by hlfir.assign)
- All reads from the array in the elemental body are indexed by the elemental block argument (this ensures that different iterations cannot effect eachother)
If all of that is the case, the hlfir.elemental can be rewritten as a do-loop modifying the array in place:
%15:2 = fir.do_loop %arg3 = %11 to %13 step %c1 iter_args(%arg4 = %14) -> (index, i32) {
fir.store %arg4 to %6#1 : !fir.ref<i32>
%16 = fir.load %9#0 : !fir.ref<i32>
%17 = fir.convert %16 : (i32) -> i64
%18 = fir.load %6#0 : !fir.ref<i32>
%19 = fir.convert %18 : (i32) -> i64
%20 = fir.shape %c8 : (index) -> !fir.shape<1>
%21 = hlfir.designate %2#0 (%17, %c2:%c9:%c1, %19) shape %20 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
fir.do_loop %arg5 = %c1 to %c8 step %c1 unordered {
%26 = hlfir.designate %21 (%arg5) : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
%27 = fir.load %26 : !fir.ref<i32>
%28 = arith.subi %27, %c10_i32 : i32
%29 = hlfir.designate %21 (%arg5) : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
hlfir.assign %28 to %29 temporary_lhs : i32, !fir.ref<i32>
}
%22 = arith.addi %arg3, %c1 : index
%23 = fir.convert %c1 : (index) -> i32
%24 = fir.load %6#1 : !fir.ref<i32>
%25 = arith.addi %24, %23 : i32
fir.result %22, %25 : index, i32
}
Results
- 11x shorter execution time for SPEC17 exchange2 than the current HLFIR code (still around 12% slower than the FIR lowering)
- 20% shorter execution time for SEPC17 cam4 than the current HLFIR code
- All gfortran tests which currently pass for HLFIR still pass
- spec2017 bwaves, fotonik and roms all pass (but with no significant performance change vs existing HLFIR)