Split GEP in Loop

Hi all,

For input IR:

; Function Attrs: argmemonly nofree norecurse nosync nounwind
define dso_local void @vec_mov(ptr noalias nocapture noundef readonly %s, ptr noalias nocapture noundef writeonly %d, i64 noundef %N) local_unnamed_addr #0 {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp ult i64 %i.0, %N
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  ret void

for.body:                                         ; preds = %for.cond
  %arrayidx = getelementptr inbounds i32, ptr %s, i64 %i.0
  %0 = load i32, ptr %arrayidx, align 4, !tbaa !6
  %arrayidx1 = getelementptr inbounds i32, ptr %d, i64 %i.0
  store i32 %0, ptr %arrayidx1, align 4, !tbaa !6
  %inc = add nuw i64 %i.0, 1
  br label %for.cond, !llvm.loop !10
}
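For context, this IR corresponds roughly to C source like the following (my reconstruction, not the original source — a plain element-by-element copy compiled without vectorisation):

```c
#include <stddef.h>

/* Reconstructed source for the IR above (an assumption on my part):
   a simple copy loop with restrict-qualified pointers. */
void vec_mov(const int *restrict s, int *restrict d, size_t N) {
  for (size_t i = 0; i < N; ++i)
    d[i] = s[i];  /* load from s[i], store to d[i], as in for.body */
}
```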

I’m wondering whether there is a pass that transforms the input IR into this:

; Function Attrs: argmemonly nofree norecurse nosync nounwind
define dso_local void @vec_mov(ptr noalias nocapture noundef readonly %s, ptr noalias nocapture noundef writeonly %d, i64 noundef %N) local_unnamed_addr #0 {
entry:
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i64 [ 0, %entry ], [ %inc, %for.body ]
  %s.phi = phi ptr [ %s, %entry ], [ %s.next, %for.body ]
  %d.phi = phi ptr [ %d, %entry ], [ %d.next, %for.body ]
  %s.offset = getelementptr inbounds i32, ptr %s.phi, i64 0
  %d.offset = getelementptr inbounds i32, ptr %d.phi, i64 0
  %cmp = icmp ult i64 %i.0, %N
  br i1 %cmp, label %for.body, label %for.cond.cleanup

for.cond.cleanup:                                 ; preds = %for.cond
  ret void

for.body:                                         ; preds = %for.cond
  %0 = load i32, ptr %s.offset, align 4, !tbaa !6
  store i32 %0, ptr %d.offset, align 4, !tbaa !6
  %s.next = getelementptr inbounds i32, ptr %s.offset, i64 1
  %d.next = getelementptr inbounds i32, ptr %d.offset, i64 1
  %inc = add nuw i64 %i.0, 1
  br label %for.cond, !llvm.loop !10
}

I’m a total newbie in this area, but I suspect it is related to SCEV. I skimmed the passes in the llvm/lib/Transforms/Scalar folder but found nothing that does such a job.

So is there something that can already do this job? If not, what knowledge do I need to dive into to write a new pass for it?

Thank you in advance.

I know what you mean, but your code has a few problems.

First, it GEPs every element after the first twice: once as index 1 in the loop body, then again as index 0 in for.cond. LLVM may optimise that away, but you don’t need to create it in the first place.

Second, you don’t need pointer Phi nodes; plain integer offset Phi nodes would do. It shouldn’t matter to the GEP, and the offset is the same for both pointers, so we can still use %i.0 as the index, even if the GEP comes after the load/store.

Finally, I’m unsure what you’re trying to gain here. GEP is just an address calculation, it’s not accessing memory at all (that’s for the load/store to do), so I don’t quite get what the optimisation here is.
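The C analogue of that point: a GEP is like taking the address of an array element, which computes a pointer but touches no memory (`element_addr` below is a hypothetical name of mine, just for illustration):

```c
/* A GEP is the IR analogue of C address arithmetic: &a[i] computes an
   address but accesses no memory; only a subsequent dereference
   (the load/store) does. */
int *element_addr(int *a, long i) {
  return &a[i];  /* pure address computation, like a GEP */
}
```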

See: The Often Misunderstood GEP Instruction — LLVM 16.0.0git documentation

Thank you very much for your reply.

I think the example I gave was not the right one to demonstrate the performance gain, so it is natural that you question the rationale. So bear with me through another example for further explanation:

Say a loop in c code:

void reversed_mov(int N, char *restrict a, char *restrict b) {
  for (int i = 0; i < N; ++i)
     b[i] = a[N - 1 - i];
}

With its current approach, LLVM will generate IR like:

; Function Attrs: argmemonly nofree norecurse nosync nounwind
define dso_local void @reversed_mov(i32 noundef signext %N, ptr noalias nocapture noundef readonly %a, ptr noalias nocapture noundef writeonly %b) local_unnamed_addr #0 {
entry:
  %cmp8 = icmp sgt i32 %N, 0
  br i1 %cmp8, label %for.body.preheader, label %for.cond.cleanup

for.body.preheader:                               ; preds = %entry
  %wide.trip.count = zext i32 %N to i64
  br label %for.body

for.cond.cleanup:                                 ; preds = %for.body, %entry
  ret void

for.body:                                         ; preds = %for.body.preheader, %for.body
  %indvars.iv = phi i64 [ 0, %for.body.preheader ], [ %indvars.iv.next, %for.body ]
  %0 = trunc i64 %indvars.iv to i32
  %1 = xor i32 %0, -1
  %sub1 = add i32 %1, %N
  %idxprom = sext i32 %sub1 to i64
  %arrayidx = getelementptr inbounds i8, ptr %a, i64 %idxprom
  %2 = load i8, ptr %arrayidx, align 1, !tbaa !6
  %arrayidx3 = getelementptr inbounds i8, ptr %b, i64 %indvars.iv
  store i8 %2, ptr %arrayidx3, align 1, !tbaa !6
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, %wide.trip.count
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body, !llvm.loop !9
}

which will lower the for.body BB to code like:

%a.gep = %a + %N - 1 - %iv
load %a.gep
...do the remaining stuff

What I’d like to achieve is to hoist the start part to the preheader of the loop, like:

for.body.preheader:
...
%a.start = %a + %N - 1
...

Then in the for.body, we just need to do:

%a.offset = phi ptr [ %a.start, %for.body.preheader ], [ %a.next, %for.body ]
load %a.offset
%a.next = %a.offset - 1
...do the remaining stuff

In other words, this pass (in my imagination) eliminates redundant addressing work in the loop body, letting the loop focus on the real computation.
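At the C level, the transformation I am imagining is equivalent to hand-rewriting the loop with pointer induction variables — a sketch, where `reversed_mov_ptr` is my hypothetical name for the rewritten function:

```c
void reversed_mov_ptr(int N, char *restrict a, char *restrict b) {
  const char *ap = a + N - 1;  /* hoisted to the preheader: %a.start */
  char *bp = b;                /* b walks forward from its base */
  for (int i = 0; i < N; ++i) {
    *bp++ = *ap--;             /* %a.next = %a.offset - 1 each iteration */
  }
}
```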

An old optimisation like that was to move the load of the next element to the end of the loop, just before the branch back, so that the load/store unit and the branch predictor could work in parallel.

Nowadays, branch predictors are good enough that they (almost?) always predict a loop back-branch as taken, because you want to optimise for the loop iterating, not the loop exit. So that optimisation is now mostly irrelevant (the branch is almost free, and the load will probably be cached when within cache lines, etc.).

So, if that’s what you’re trying to do, and you’re working with modern medium/large hardware (pretty much all x86/arm other than embedded), then it’s probably not going to have any measurable effect.

What is the benefit, for your hardware, for the loop body to “focus on real calculation stuff”? Is it just the boundaries of the branches or is there something else at play? Is this conventional hardware or some embedded / accelerator?

I do not have an answer but I think I can help to understand the question.
The thought here a bit more simplified is the following:

Assume we have

float a[3] = {1.1f, 2.2f, 3.3f};
int N = 1;

float x = *(a + N);
float y = *(a + N + 1);

To calculate x and y there are 3 integer adds needed.

If we formulate it as

float a[3] = {1.1f, 2.2f, 3.3f};
int N = 1;

float* r = a + N;

float x = *r;
float y = *(r + 1);

only two adds are needed.
So if I understand it correctly, the question is whether there is an optimization pass that does such a transformation.

I think what you want is loop strength reduction, but I’m not sure if our LSR pass will do what you want.
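For illustration, classic loop strength reduction replaces a per-iteration address computation (base + i × stride) with a pointer bumped each iteration. A hand-written sketch of the before/after shapes (function names are mine, not the pass's output):

```c
/* Before: the address a + i is recomputed from the induction variable
   every iteration. */
long sum_indexed(const int *a, int n) {
  long s = 0;
  for (int i = 0; i < n; ++i)
    s += a[i];
  return s;
}

/* After (what strength reduction aims for): a pointer induction
   variable stepped by the element size, with no per-iteration index
   arithmetic. */
long sum_strength_reduced(const int *a, int n) {
  long s = 0;
  for (const int *p = a, *end = a + n; p != end; ++p)
    s += *p;
  return s;
}
```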


This is just Common Subexpression Elimination? I believe SeparateConstOffsetFromGEP is what’s meant to do this.
