Potentially missed GVN/PRE optimization for prefix sum pattern?

Consider this piece of prefix sum function written in C:

void prefix_sum(int *a, int n) {
  for (int i = 1; i < n; i++)
    a[i] += a[i - 1];
}

The function is transformed into the following LLVM IR before GVN pass:

define dso_local void @prefix_sum(ptr nocapture noundef %a, i32 noundef %n) local_unnamed_addr {
entry:
  %cmp7 = icmp sgt i32 %n, 1
  br i1 %cmp7, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  %wide.trip.count = zext i32 %n to i64
  br label %for.body

for.cond.cleanup.loopexit:                        ; preds = %for.body
  br label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond.cleanup.loopexit, %entry
  ret void

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 1, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %0 = getelementptr i32, ptr %a, i64 %indvars.iv
  %arrayidx = getelementptr i8, ptr %0, i64 -4
  %1 = load i32, ptr %arrayidx, align 4
  %2 = load i32, ptr %0, align 4
  %add = add nsw i32 %2, %1
  store i32 %add, ptr %0, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond, label %for.body, label %for.cond.cleanup.loopexit
}

I found that the GVN pass does not perform PRE for load in %1, since the PHI transform for %0 fails. The load does get optimized into phi in LoopLoadElimination pass, but this got me curious as one of the tests (see test5 in llvm/test/Transforms/GVN/PRE/rle-phi-translate.ll) for GVN/PRE uses similar pattern, and ensures that GVN eliminates load for (i - 1)-th index of the array. Is this an intended behavior for the pass, or a missed optimization opportunity?

In my test, llvm can handle this case from c src code: Compiler Explorer
But for your input ir, PRE fails.

Can you double check? In the opt pipeline, the load is not eliminated in GVN pass even in armv8-a clang.

Before GVN, loop body has two load, a[i] and a[i-1].
After GVN, a[i-1] is only load when entering loop; still need load/store for a[i]

According to the opt pipeline viewer, GVN does not transform the IR in your armv8-a clang example. In my inspection, the load for a[i - 1] is removed in LoopLoadElimination pass instead.

Oh, sorry abort that.
You are right.
Not GVN, but LoopLoadEliminationPass elimi load of a[i-1]

1 Like

Oh, I found where the problem.

I change

; Case 1
%10 = getelementptr i32, ptr %0, i64 %9
%11 = getelementptr i32, ptr %10, i64 -1

to

; Case 2
  %10 = getelementptr i32, ptr %0, i64 %9
  %tmp = add i64 %9, -1
  %11 = getelementptr i32, ptr %0, i64 %tmp

Then the gvn works…
And I thought case 1 and case 2 have the same semantic.