[RFC][HLFIR] Optimized Bufferization for elemental array updates

[RFC][HLFIR] Optimized Buffierization for elemental array updates

TL;DR

In this RFC I propose a change to MLIR’s CSE pass and a new optimization pass for HLFIR. A prototype implementation demonstrates substantial improvement to spec2017 exchange2 and cam4 performance.

Introduction

HLFIR currently produces very bad code for exchange2 (spec2017). @jeanPerier narrowed down the problem to a kernel which looks like

 module m
  integer, parameter :: r = 9
  integer :: block(r, r, r)
contains
subroutine repro(imin, imax, row)
  integer :: imin, imax, row, i1
   do i1 = imin,imax
       block(row, 2:9, i1) =  block(row, 2:9, i1)  - 10
   end do
end subroutine
end module

The (canonicalised) HLFIR lowering for the do loop is

    %15:2 = fir.do_loop %arg3 = %11 to %13 step %c1 iter_args(%arg4 = %14) -> (index, i32) {
      fir.store %arg4 to %6#1 : !fir.ref<i32> // %6 is i1
      %16 = fir.load %9#0 : !fir.ref<i32>     // %9 is row
      %17 = fir.convert %16 : (i32) -> i64
      %18 = fir.load %6#0 : !fir.ref<i32>
      %19 = fir.convert %18 : (i32) -> i64
      %20 = fir.shape %c8 : (index) -> !fir.shape<1>
      %21 = hlfir.designate %2#0 (%17, %c2:%c9:%c1, %19)  shape %20 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
      %22 = hlfir.elemental %20 unordered : (!fir.shape<1>) -> !hlfir.expr<8xi32> {
      ^bb0(%arg5: index):
        %33 = hlfir.designate %21 (%arg5)  : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
        %34 = fir.load %33 : !fir.ref<i32>
        %35 = arith.subi %34, %c10_i32 : i32
        hlfir.yield_element %35 : i32
      }
      %23 = fir.load %9#0 : !fir.ref<i32>  // the next few values duplicate values %16 to %21
      %24 = fir.convert %23 : (i32) -> i64
      %25 = fir.load %6#0 : !fir.ref<i32>
      %26 = fir.convert %25 : (i32) -> i64
      %27 = fir.shape %c8 : (index) -> !fir.shape<1>
      %28 = hlfir.designate %2#0 (%24, %c2:%c9:%c1, %26)  shape %27 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
      // it isn't obvious from trivial analysis that the array being assigned to is the same array being read from
      hlfir.assign %22 to %28 : !hlfir.expr<8xi32>, !fir.box<!fir.array<8xi32>>
      hlfir.destroy %22 : !hlfir.expr<8xi32>
      %29 = arith.addi %arg3, %c1 : index
      %30 = fir.convert %c1 : (index) -> i32
      %31 = fir.load %6#1 : !fir.ref<i32>
      %32 = arith.addi %31, %30 : i32
      fir.result %29, %32 : index, i32
    }

If the above HLFIR is bufferized using the existing HLFIR bufferization pass, a new array temporary is created for %22 and this is assigned to %28 using the assign runtime function.

Lowering directly to FIR (not using HLFIR) does much better. That produces a do loop which modifies the array in place, much as the source code is written.

I propose a new HLFIR optimized bufferization pass which detects special cases like this where a more profitable bufferization can be used than the catch-all implementation in the HLFIR bufferization pass. I would like this to be a separate pass to the existing HLFIR bufferization pass to make it easier to read and debug.

To achieve this there are two problems to solve:

  1. it isn’t obvious that the array being assigned to is the same one being read from because the hlfir.designate (and supporting loads and conversions) are duplicated before and after the hlfir.elemental operation.
  2. the pass needs to determine when it is safe to perform the transformation (e.g. spotting conflicting array element accesses)

MLIR CSE pass and recursive memory effects

CC: @River707 @clementval

To achieve goal (1) above, the obvious solution is to run MLIR’s common subexpression eleimination pass. This pass is already part of the FIR optimization pipeleine.

When the CSE pass finds two operations which appear to be duplicates, it looks at all operations in between the duplicates for any operations that have the write memory effect - if no writes are found the duplicated operation is replaced with the original value. If any operation is found which does not have a memory effects interface, the pass assumes a write effect.

There are no operations with write memory effects in between %16 and %23, but the elemental operation has recursive memory effects. Currently the CSE pass treats recursive memory effects as though the operaiton does not specify any memory effects.

I propose to add support for operations with recursive memory effects by taking their effects as the union of all the effects of operations in their regions. There is precendent for this in the implementation of wouldOpBeTriviallyDeadImpl() in mlir/lib/Interfaces/SideEffectInterfaces.cpp.

With this change, values %23-%28 can be removed by CSE, giving the following:

%15:2 = fir.do_loop %arg3 = %11 to %13 step %c1 iter_args(%arg4 = %14) -> (index, i32) {
      fir.store %arg4 to %6#1 : !fir.ref<i32>
      %16 = fir.load %9#0 : !fir.ref<i32>
      %17 = fir.convert %16 : (i32) -> i64
      %18 = fir.load %6#0 : !fir.ref<i32>
      %19 = fir.convert %18 : (i32) -> i64
      %20 = fir.shape %c8 : (index) -> !fir.shape<1>
      %21 = hlfir.designate %2#0 (%17, %c2:%c9:%c1, %19)  shape %20 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
      %22 = hlfir.elemental %20 unordered : (!fir.shape<1>) -> !hlfir.expr<8xi32> {
      ^bb0(%arg5: index):
        %27 = hlfir.designate %21 (%arg5)  : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
        %28 = fir.load %27 : !fir.ref<i32>
        %29 = arith.subi %28, %c10_i32 : i32
        hlfir.yield_element %29 : i32
      }
      hlfir.assign %22 to %21 : !hlfir.expr<8xi32>, !fir.box<!fir.array<8xi32>>
      hlfir.destroy %22 : !hlfir.expr<8xi32>
      %23 = arith.addi %arg3, %c1 : index
      %24 = fir.convert %c1 : (index) -> i32
      %25 = fir.load %6#1 : !fir.ref<i32>
      %26 = arith.addi %25, %24 : i32
      fir.result %23, %26 : index, i32
    }

HLFIR pass

CC: @szakharin @jeanperier

Given the above IR (after CSE), it is obvious that the elemental body reads from the same array that the elemental expression is assigned to (as both have the same SSA value). It is also obvious that the elemental has the same shape as the array.

The pattern should match if the following is true:

  1. There are exactly two uses of the elemental expression: one in hlfir.assign and one in hlfir.destroy
  2. The above assignment is to a fir::SequenceType (after fir::unwrapPassByRefType), I will call this “the array”. A hlfir.expr could be supported in the future.
  3. Elements of the array are of a trivial type (to keep the pass simple, this could be extended later)
  4. The array value dominates the elemental
  5. The array and the elemental have the same shape
  6. There are no writes to the array in the elemental body (the source code write is performed by hlfir.assign)
  7. All reads from the array in the elemental body are indexed by the elemental block argument (this ensures that different iterations cannot effect eachother)

If all of that is the case, the hlfir.elemental can be rewritten as a do-loop modifying the array in place:

%15:2 = fir.do_loop %arg3 = %11 to %13 step %c1 iter_args(%arg4 = %14) -> (index, i32) {
      fir.store %arg4 to %6#1 : !fir.ref<i32>
      %16 = fir.load %9#0 : !fir.ref<i32>
      %17 = fir.convert %16 : (i32) -> i64
      %18 = fir.load %6#0 : !fir.ref<i32>
      %19 = fir.convert %18 : (i32) -> i64
      %20 = fir.shape %c8 : (index) -> !fir.shape<1>
      %21 = hlfir.designate %2#0 (%17, %c2:%c9:%c1, %19)  shape %20 : (!fir.ref<!fir.array<9x9x9xi32>>, i64, index, index, index, i64, !fir.shape<1>) -> !fir.box<!fir.array<8xi32>>
      fir.do_loop %arg5 = %c1 to %c8 step %c1 unordered {
        %26 = hlfir.designate %21 (%arg5)  : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
        %27 = fir.load %26 : !fir.ref<i32>
        %28 = arith.subi %27, %c10_i32 : i32
        %29 = hlfir.designate %21 (%arg5)  : (!fir.box<!fir.array<8xi32>>, index) -> !fir.ref<i32>
        hlfir.assign %28 to %29 temporary_lhs : i32, !fir.ref<i32>
      }
      %22 = arith.addi %arg3, %c1 : index
      %23 = fir.convert %c1 : (index) -> i32
      %24 = fir.load %6#1 : !fir.ref<i32>
      %25 = arith.addi %24, %23 : i32
      fir.result %22, %25 : index, i32
    }

Results

  • 11x shorter execution time for SPEC17 exchange2 than the current HLFIR code (still around 12% slower than the FIR lowering)
  • 20% shorter execution time for SEPC17 cam4 than the current HLFIR code
  • All gfortran tests which currently pass for HLFIR still pass
  • spec2017 bwaves, fotonik and roms all pass (but with no significant performance change vs existing HLFIR)
1 Like

How about an array update idiom detection pass before bufferization? For me

reads like a read-modify write. You will have a lot of these update assignments:

a[I] = b[I-1] + b[I+2]

Could you detect these update idioms and turn into them a more suitable hlfir expression?

Thank you for the detailed explanation of the problem, Tom! It looks like a good start to me.

Could you please elaborate on the placement of the new pass in the pipeline? I suppose we will have to put it after the elementals inlining or it does not matter?

Should not the HLFIR pass also prove that the elemental block arguments are used for array indexing in the “natural” order (e.g. we cannot optimize such a pattern for the transpose case)?

I guess we should also mention somewhere that by the lowering process (and probably some Fortran standard clauses) it is not possible to have modifications of the array between the elemental and the assign operations. Otherwise, we would have to prove that.

Thanks for taking a look.

@tschuett, so far as I can tell, the example you give is already lowered to a do loop without additional temporary buffers. But yes I think the pass I propose would be able to bufferize a worse lowering of this using hlfir.elemental, because there are no reads from a.

@szakharin yes I was planning to run this pass before bufferization. The current pipeline looks like (createHLFIRToFIRPassPipeline, CLOptions.inc)

  • If optimizing for speed:
    • canonicalize
    • simplify HLFIR Intrinsics
  • Inline elementals
  • lower hlfir ordered assignments
  • lower hlfir intrinsics
  • bufferize hlfir
  • convert to fir

I think that lowering hlfir ordered assignments and lowering hlfir intrinsics won’t interact with this pass so it doesn’t matter much if this pass happens before or after those two, but I would like to place it after intrinsic inlining so that this pass doesn’t have to reason about hlfir.apply.

Yes you’re right about the order of the block arguments. I did implement this but didn’t think to write it up, thanks!

MLIR change is here ⚙ D156805 [mlir][Transforms] teach CSE about recursive memory effects

Thank you for the replies, Tom!

Yes, placing it after elementals inlining sounds right.

I think you should also specify the analysis done on the body of the elemental operation regarding the other memory accesses that might be there. For example, if there is a read from memory potentially overlapping with “the array”, we cannot do the transformation. In your description of the HLFIR pass you focus on the accesses of “the array”, but I do not see any explanation what happens regarding the other accesses.

Currently, I only check uses of the array (and the CSE pass has already checked that there aren’t any operations with MemWrite effects on any value). I don’t check reads to values other than the array.

So I guess the solution would be to dissallow reads which alias, or might alias array?

So I guess the solution would be to dissallow reads which alias, or might alias array?

Yes, I believe this should limit the transformation.

⚙ D157107 [flang][hlfir] add an optimized bufferization pass implements the HLFIR side of this RFC. With the suggested change to detect read aliasing, we loose the 20% improvement to cam4. I will see if there’s an easy fix and follow up with another patch if so.