[RFC] Changes to fircg.xarray_coor CodeGen to allow better hoisting

How things are right now

For a 3D (x,y,z) unsliced array in a 3D loop nest indexed by (i,j,k) we generate something like

offset = (i - lb(x))*stride(x) + (j - lb(y))*stride(y) + (k - lb(z))*stride(z)

In this example, everything other than the loop indices (i,j,k) is loop invariant. Assume that the innermost loop variable is i and that stride(x) = 1.
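
In LLVM IR terms this is roughly the following per iteration (a sketch only, with hypothetical value names and i64 arithmetic throughout for simplicity):

  %dx = sub i64 %i, %lb_x              ; not invariant in the innermost loop (uses %i)
  %tx = mul i64 %dx, %stride_x
  %dy = sub i64 %j, %lb_y              ; invariant in the innermost loop
  %ty = mul i64 %dy, %stride_y
  %dz = sub i64 %k, %lb_z              ; invariant in the innermost loop
  %tz = mul i64 %dz, %stride_z
  %t0 = add i64 %tx, %ty
  %offset = add i64 %t0, %tz
  %addr = getelementptr float, ptr %base, i64 %offset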

The problem

After some LLVM optimization passes, the y and z terms of offset are hoisted out of the inner loops, leaving something roughly like

offset = (i - lb(x)) + offset_outer

When the LICM pass sees an operation like %0 = sub %i, %lb_x, it cannot hoist it out of the innermost loop because %i is not loop invariant.

This results in llvm-flang having more integer operations for array index calculation in hot loop bodies than Classic Flang. This has been observed to lead to register spills and fills in important loops in SPEC 2017 on AArch64.

Proposed solution

Algebraically (ignoring integer overflow), the above expression is equivalent to something like

offset = (offset_outer - lb(x)) + i

This allows the subtraction to be hoisted outside of the innermost loop.

This could be achieved by changing the order in which the operations are generated in flang codegen for xarray_coor.
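
As a rough sketch (hypothetical value names, assuming stride(x) = 1 as above), the shape we would want to end up with after LICM is:

  ; hoisted out of the innermost loop:
  %x_base = sub i64 %offset_outer, %lb_x
  ; left in the innermost loop:
  %offset = add i64 %x_base, %i
  %addr = getelementptr float, ptr %base, i64 %offset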

Other information

Classic Flang adds the nsw flag to the arithmetic operations in the offset calculation. On its own this isn’t enough for LLVM to automatically re-order these operations so that the subtraction can be hoisted, but it might still be useful, so I’d be interested if anyone has thoughts on this.

Can you share a minimal reproducer, source + godbolt link, so people can check what we do now?
Maybe also include the IR you generate after the change.

Thanks for taking a look. The llvm-flang on godbolt is a bit out of date, so I’ll post some files here.

module m
contains
subroutine repro(a, b, xlb, ylb, zlb, nx, ny, nz, xlbb, ylbb, zlbb)
  integer, intent(in) :: xlb, ylb, zlb, nx, ny, nz, xlbb, ylbb, zlbb

  real, intent(inout), dimension(xlb:nx,ylb:ny,zlb:nz) :: a
  real, intent(in), dimension(xlbb:nx,ylbb:ny,zlbb:nz) :: b

  integer i, j, k
  do k=zlb,nz
    do j=ylb,ny
      do i=xlb,nx
        a(i,j,k) = a(i,j,k) + b(i,j,k)
      end do
    end do
  end do
end subroutine
end module

Building with -Ofast -mcpu=neoverse-v1, the vector body contains subtractions for the lower bounds:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %active.lane.mask = phi <vscale x 4 x i1> [ %active.lane.mask.entry, %vector.ph ], [ %active.lane.mask.next, %vector.body ]
  %.cast18 = trunc i64 %index to i32
  %offset.idx = add i32 %12, %.cast18
  %81 = sext i32 %offset.idx to i64
  %82 = sub nsw i64 %81, %13
  %83 = getelementptr float, ptr %72, i64 %82
  %wide.masked.load = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr %83, i32 4, <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x float> poison), !tbaa !25
  %84 = sub nsw i64 %81, %26
  %85 = getelementptr float, ptr %75, i64 %84
  %wide.masked.load19 = tail call <vscale x 4 x float> @llvm.masked.load.nxv4f32.p0(ptr %85, i32 4, <vscale x 4 x i1> %active.lane.mask, <vscale x 4 x float> poison), !tbaa !27
  %86 = fadd fast <vscale x 4 x float> %wide.masked.load19, %wide.masked.load
  tail call void @llvm.masked.store.nxv4f32.p0(<vscale x 4 x float> %86, ptr %83, i32 4, <vscale x 4 x i1> %active.lane.mask), !tbaa !25
  %index.next = add i64 %index, %80
  %active.lane.mask.next = tail call <vscale x 4 x i1> @llvm.get.active.lane.mask.nxv4i1.i64(i64 %index, i64 %78)
  %87 = extractelement <vscale x 4 x i1> %active.lane.mask.next, i64 0
  br i1 %87, label %vector.body, label %._crit_edge.us.us.us, !llvm.loop !29

Whole files:

Unrelated, but we really want to fix that.

Is this what you’d expect (w.r.t. the optimized output):

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 %index, %13
  %80 = getelementptr float, ptr %66, i64 %index
  %wide.load = load <vscale x 4 x float>, ptr %80, align 4, !tbaa !23
  %81 = getelementptr float, ptr %80, i64 %73
  %wide.load17 = load <vscale x 4 x float>, ptr %81, align 4, !tbaa !23
  %82 = sub i64 %offset.idx, %26
  %83 = getelementptr float, ptr %69, i64 %82
  %wide.load18 = load <vscale x 4 x float>, ptr %83, align 4, !tbaa !25
  %84 = getelementptr float, ptr %83, i64 %75
  %wide.load19 = load <vscale x 4 x float>, ptr %84, align 4, !tbaa !25
  %85 = fadd contract <vscale x 4 x float> %wide.load, %wide.load18
  %86 = fadd contract <vscale x 4 x float> %wide.load17, %wide.load19
  store <vscale x 4 x float> %85, ptr %80, align 4, !tbaa !23
  %87 = getelementptr float, ptr %80, i64 %77
  store <vscale x 4 x float> %86, ptr %87, align 4, !tbaa !23
  %index.next = add nuw i64 %index, %79
  %88 = icmp eq i64 %index.next, %n.vec
  br i1 %88, label %middle.block, label %vector.body, !llvm.loop !27

Did you change something here? I can’t reproduce the same results using opt locally.

It looks like there is still a subtraction for a lower bound, but one lower bound is gone, and the cost is further reduced by the unrolling (although unrolling wouldn’t help my original use case with the register exhaustion, which I unfortunately cannot share).

The sub seems loop carried, is it not?

With my input your opt should produce the same result.
What I did was avoid mixing i64 and i32: I manually promoted all i32 to i64.
I’m not saying that this will happen automatically, but I am sure that if we generate fewer intermixed sizes we will get better results.
Is there an option to adjust codegen to avoid mixing i32 and i64 all over the place?
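
To illustrate the kind of mixing I mean (hypothetical value names), it is the sign extension of a 32-bit value that keeps the loop-invariant part chained to the loop-varying one:

  %off32 = add i32 %start32, %iv32     ; no nsw, so it may wrap
  %off64 = sext i32 %off32 to i64
  %idx = sub nsw i64 %off64, %lb       ; cannot be rewritten as
                                       ; (sext %start32 - %lb) + sext %iv32
                                       ; unless the i32 add is known not to overflow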

I believe the concern is that the following reassociation (which does not happen) would expose the invariant computation:

  %invariant = sub i64 %13, %26
  %82 = add i64 %index, %invariant

Fair point, but that is something we want LLVM to do (for all languages). Reassociate then LICM should realize this. It does work fine if you rerun O3 on the result, and it also works fine on the small reproducer:

We need to see why it doesn’t catch it in the above and fix it (plus the i32/i64 mixing).

FYI, when the loop variables are declared as:

 integer(8) i, j, k

There are no truncations/extensions interfering, but the subtracts stay in the IR as late as loop strength reduction:
before:

; Loop:
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 %index, %103
  %121 = sub i64 %offset.idx, %13
  %122 = getelementptr float, ptr %110, i64 %121
  %wide.load = load <4 x float>, ptr %122, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  %123 = getelementptr float, ptr %122, i64 4
  %wide.load30 = load <4 x float>, ptr %123, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  %124 = sub i64 %offset.idx, %26
  %125 = getelementptr float, ptr %113, i64 %124
  %wide.load31 = load <4 x float>, ptr %125, align 4, !tbaa !3, !alias.scope !10
  %126 = getelementptr float, ptr %125, i64 4
  %wide.load32 = load <4 x float>, ptr %126, align 4, !tbaa !3, !alias.scope !10
  %127 = fadd fast <4 x float> %wide.load31, %wide.load
  %128 = fadd fast <4 x float> %wide.load32, %wide.load30
  store <4 x float> %127, ptr %122, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  store <4 x float> %128, ptr %123, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  %index.next = add nuw i64 %index, 8
  %129 = icmp eq i64 %index.next, %n.vec
  br i1 %129, label %middle.block, label %vector.body, !llvm.loop !12

after:

; Loop:
vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %145 = shl i64 %index, 2
  %scevgep51 = getelementptr i8, ptr %scevgep49, i64 %145
  %scevgep52 = getelementptr i8, ptr %scevgep51, i64 -16
  %wide.load = load <4 x float>, ptr %scevgep52, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  %146 = shl i64 %index, 2
  %scevgep50 = getelementptr i8, ptr %scevgep49, i64 %146
  %wide.load30 = load <4 x float>, ptr %scevgep50, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  %147 = shl i64 %index, 2
  %scevgep41 = getelementptr i8, ptr %scevgep40, i64 %147
  %wide.load31 = load <4 x float>, ptr %scevgep41, align 4, !tbaa !3, !alias.scope !10
  %scevgep42 = getelementptr i8, ptr %scevgep41, i64 16
  %wide.load32 = load <4 x float>, ptr %scevgep42, align 4, !tbaa !3, !alias.scope !10
  %148 = fadd fast <4 x float> %wide.load31, %wide.load
  %149 = fadd fast <4 x float> %wide.load32, %wide.load30
  store <4 x float> %148, ptr %scevgep52, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  store <4 x float> %149, ptr %scevgep50, align 4, !tbaa !3, !alias.scope !7, !noalias !10
  %index.next = add nuw i64 %index, 8
  %150 = icmp eq i64 %n.vec, %index.next
  br i1 %150, label %middle.block, label %vector.body, !llvm.loop !12

But they are optimized away eventually, so this is not necessarily a good reproducer.

So the question is how to enable reassociation for the following IR (this is the cleanest we can get after LICM):

81:                                               ; preds = %.lr.ph, %81
  %82 = phi i64 [ %72, %.lr.ph ], [ %93, %81 ]
  %83 = phi i32 [ %67, %.lr.ph ], [ %92, %81 ]
  %84 = sext i32 %83 to i64
  %85 = sub nsw i64 %84, %13
  %86 = getelementptr float, ptr %77, i64 %85
  %87 = load float, ptr %86, align 4, !tbaa !3
  %88 = sub nsw i64 %84, %33
  %89 = getelementptr float, ptr %80, i64 %88
  %90 = load float, ptr %89, align 4, !tbaa !3
  %91 = fadd fast float %90, %87
  store float %91, ptr %86, align 4, !tbaa !3
  %92 = add i32 %83, 1
  %93 = add nsw i64 %82, -1
  %94 = icmp sgt i64 %93, 0
  br i1 %94, label %81, label %._crit_edge
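
For concreteness, the reassociated form we are after would look something like this (a sketch reusing the value numbers above; the rewrite is valid modulo wrap, so the nsw flags would need to be re-derived):

  ; hoisted out of the loop:
  %lb.diff = sub i64 %13, %33
  ; in the loop body, since (%84 - %33) == (%84 - %13) + (%13 - %33):
  %88 = add i64 %85, %lb.diff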

I took the output after the vectorizer (which introduced the sub → add chain) and modified it (see line 171) to use i64 instead of a mix; then instcombine + reassociate do the right thing, and LICM takes care of the rest:

Thanks for your help @jdoerfert, @szakharin.

I’ve verified that building 549.fotonik3d_r with -fdefault-integer-8 does allow LLVM passes to hoist correctly (except for vectorized gather loads, which I guess are an uncommon case and tend to be slow anyway). The benchmark sees a speedup of around 7% on my setup.

We would like to achieve this performance out of the box. So either LLVM’s instcombine and reassociate passes need to be taught to understand the sign extension (if that isn’t part of the issue in itself (*)), or flang needs to promote the integers used for array index calculations to i64 (which would also avoid the sign extension).

(*) For broader context: on my test system I believe the speedup comes from avoiding spills/fills by using fewer registers in the body of the loop, which is why the sign extension could be important.

I don’t know what combination of steps we should take, but I’m glad we figured out what is going on and that we have options. I believe it’s generally not helpful that Flang generates i32 arithmetic on a 64-bit machine, so if it’s just a choice, rather than mandated, we should avoid it. That said, if we could reason about these things better, we’d get a win across the board; it’s just harder.

We generate 32-bit arithmetic because the default integer type is 32-bit. I don’t know why that is. We have -fdefault-integer-8 to make 64-bit integers the default size.

The Flang front end just obeys the user code in this case: the loops’ induction variables (do-variables) are declared as 32-bit integers, so the reads/writes of i, j, k are 32-bit:

  %10:2 = hlfir.declare %9 {uniq_name = "_QMmFreproEi"} : (!fir.ref<i32>) -> (!fir.ref<i32>, !fir.ref<i32>)
...
      %100:2 = fir.do_loop %arg15 = %96 to %98 step %c1_12 iter_args(%arg16 = %99) -> (index, i32) {
        fir.store %arg16 to %10#1 : !fir.ref<i32>
        %105 = fir.load %10#0 : !fir.ref<i32>
        %106 = fir.convert %105 : (i32) -> i64
...
        %111 = hlfir.designate %46#0 (%106, %108, %110)  : (!fir.box<!fir.array<?x?x?xf32>>, i64, i64, i64) -> !fir.ref<f32>

It is no different from what clang does (e.g. Compiler Explorer), except that the Flang front end also introduces a loop “counter” variable of the extended index type (i64) that controls the loop so that it iterates inclusively from the loop’s lower bound to its upper bound.

Flang does use index arithmetic for addressing, so the induction variables’ values are extended to index before the memory accesses (again, the same as clang).

So for the mixed data type case, adding nsw to the increment of the i variable helps get rid of one subtract: Compiler Explorer (see the nsw flag on %141).
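
As I understand it (so treat this as a sketch), the reason is that nsw on the 32-bit increment lets the sign extension be folded through the induction variable, so it can be widened to i64:

  %i.next = add nsw i32 %i, 1
  %i.ext = sext i32 %i.next to i64     ; with nsw this is equivalent to
                                       ; add nsw i64 (sext i32 %i to i64), 1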

This is what we have before the vectorizer:

68:                                               ; preds = %68, %.lr.ph.us.us.us
  %indvars.iv = phi i64 [ %indvars.iv.next, %68 ], [ %13, %.lr.ph.us.us.us ]
  %69 = phi i64 [ %77, %68 ], [ %46, %.lr.ph.us.us.us ]
  %70 = sub nsw i64 %indvars.iv, %13
  %71 = getelementptr float, ptr %64, i64 %70
  %72 = load float, ptr %71, align 4, !tbaa !25
  %73 = sub nsw i64 %indvars.iv, %26
  %74 = getelementptr float, ptr %67, i64 %73
  %75 = load float, ptr %74, align 4, !tbaa !27
  %76 = fadd contract float %72, %75
  store float %76, ptr %71, align 4, !tbaa !25
  %indvars.iv.next = add nsw i64 %indvars.iv, 1
  %77 = add nsw i64 %69, -1
  %78 = icmp sgt i64 %69, 1
  br i1 %78, label %68, label %._crit_edge.us.us.us

So when the vectorizer normalizes the loop index to start from 0 we get:

vector.body:                                      ; preds = %vector.body, %vector.ph
  %index = phi i64 [ 0, %vector.ph ], [ %index.next, %vector.body ]
  %offset.idx = add i64 %13, %index
  %75 = add i64 %offset.idx, 0
...
  %81 = sub nsw i64 %75, %13
  %83 = getelementptr float, ptr %67, i64 %81
...
  %89 = sub nsw i64 %75, %26
  %91 = getelementptr float, ptr %70, i64 %89

So instcombine easily optimizes away the %13 subtraction, but nothing happens for %26. Feeding the output back to opt gets rid of the second subtract: Compiler Explorer
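
Spelling out the difference between the two cases (a sketch using the value numbers from the snippet above):

  ; %81 = (%13 + %index) - %13   folds to %index directly (instcombine)
  ; %89 = (%13 + %index) - %26   needs reassociation into
  ;       %index + (%13 - %26)   before LICM can hoist the (%13 - %26) part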

So I think we can start by adding the nsw flag and then look for the pass (or collection of passes) that optimizes the second subtract on the second opt run. It may be useful to place this pass properly in the pipeline so that the subtracts are optimized even when the loop’s lower bound does not match the dimension’s lower bound (i.e. the %26 case).

Ack, I have a prototype for nsw already, so I’ll get a patch up ASAP. Thanks for your help.