Enabling IRCE pass or Adding something similar in the pipeline of new pass manager

Hi All,

I am trying to vectorize some loops. Let 's see a simple loop.

while() {
...
if ()
...
}

As you can see, there is if statement inside loop. As you know, LoopVectorizer tries to predicate the block of if statement with its condition and it asks target machines the cost of the instructions. If there are load or store instructions in the block to be predicated and target machine does not support masked load and store, LoopVectorizer assigns big number to the load and store instructions and it means the loop is not vectorized with loop vectorizer. In this case, it is important to remove the if statement inside loop before LoopVectorizer.

We need to consider two cases to remove the if statement inside loop as below.

1. if statement's condition with loop invariant variables
2. if statement's condition with induction variables

For the first one, UnSwitch/SimpleUnSwitch pass handles the case. The passes hoist the loop invariant condition outside loop and unswitch loop.
For the second one, there could be several transformations but I think the IRCE pass is general solution. The pass splits the iteration space following the condition with induction variable.

At this moment, the only SimpleUnSwitch pass is in pipeline of new pass manager but IRCE is not.

Let's see an IR function to see the impact of IRCE pass.

target triple = "aarch64"

define void @foo(i64 %a, i64* noalias %src, i64* noalias %dst, i64 %n) {
entry:
  br label %bb

bb:
  %if.cond = icmp slt i64 1, %n
  br i1 %if.cond, label %if.then, label %exit

if.then:
  %if.cond.2 = icmp slt i64 10, %n
  br i1 %if.cond.2, label %loop.ph, label %if.then.else

if.then.else:
  br label %loop.ph

loop.ph:
  br label %loop

loop:
  %iv = phi i64 [ %inc, %for.inc ], [ 1, %loop.ph ]
  %cmp = icmp slt i64 %iv, %a
  br i1 %cmp, label %if.then.2, label %for.inc

if.then.2:
  %src.arrayidx = getelementptr inbounds i64, i64* %src, i64 %iv
  %val = load i64, i64* %src.arrayidx
  %dst.arrayidx = getelementptr inbounds i64, i64* %dst, i64 %iv
  store i64 %val, i64* %dst.arrayidx
  br label %for.inc

for.inc:
  %inc = add nuw nsw i64 %iv, 1
  %cond = icmp eq i64 %inc, %n
  br i1 %cond, label %exit, label %loop

exit:
  ret void
}

If we try to vectorize above code, we can see below debug output.

opt -loop-vectorize -S ./test.ll -debug-only=loop-vectorize

LV: Scalar loop costs: 5.
...
LV: Found an estimated cost of 3000000 for VF 2 For instruction: %val = load i64, i64* %src.arrayidx, align 4
LV: Vector loop of width 2 costs: 1500004.
...
LV: Vectorization is possible but not beneficial

If we run IRCE pass ahead of loop vectorizer with SCEV changes ⚙ D101409 [SCEV] Check IDom with single predecessor on getPredecessorWithUniqueSuccessorForBB and ⚙ D100566 [SCEV] Add a ad-hoc pattern on isImpliedCondBalancedTypes, we can see the loop is vectorized with below debug output.

opt -irce -irce-skip-profitability-checks -simplifycfg -instcombine -loop-vectorize -S ./test.ll -debug-only=loop-vectorize

LV: Scalar loop costs: 6.
LV: Vector loop of width 2 costs: 2.

In order to vectorize more loops which has conditional branch inside, we need to enable IRCE pass or add something similar ahead of loop vectorizer in the pipeline of new pass manager.

I had a short discussion with Philip and Nikita on ⚙ D101409 [SCEV] Check IDom with single predecessor on getPredecessorWithUniqueSuccessorForBB and it looks we need more effort for IRCE pass or something similar features. If you have experience or idea about this, please share it.

Thanks
JinGu Kang

Let me start by copying my comment for the review.

@mkazantsev and I have spent a lot of time working on vectorization of similar loops. If you want to talk through this, I’d be happy to jump on a call. You may also find my talk at last years llvm conference on multiple exit loops useful background.

A couple of general comments.

I really don’t think that extending IRCE is the right path forward here. IRCE has some serious design defects, and I’m honestly quite nervous about it’s correctness. I think that iteration set splitting (the basic transform IRCE uses) is absolutely something we should implement for the main pipeline, but I’d approach it as building new infrastructure to replace IRCE, not as getting IRCE on by default. In particular, I suspect the value comes primarily from a cost model driven approach to splitting, not IRCE’s unconditional one.

Second, I advise being very cautious about going directly for the general case here. The general case for this is really really hard. If it wasn’t, we’d already have robust solutions. If you can describe your motivating examples in a bit more depth (maybe offline), we can see if we can find a specific sub-case which is both tractable and profitable.

Thank you for continuing the conversation on llvm-dev, a specific suggestion inline below.

I also forgot to mention the ongoing (currently stalled) work on multiple exit vectorization. That might also be of interest.

Philip, I appreciate your kind comments.

Hi Philip,

I have extended your suggestion slightly more as below.

newbound1 = min(n, c)

newbound2 = max(n, c)

while (iv < n) { while(iv < newbound1) {

A A

if (iv < c) B

B C

C }

} iv = newbound1

while (iv < newbound2) {

A

C

}

I have implemented a simple pass to split bound of loop, which has conditional branch with IV, as above example. https://reviews.llvm.org/D102234 It is initial version. If possible, please review it.

Thanks

JinGu Kang

as I know, current IRCE implementation relies on some preconditions. it’s intended to language runtime with garbage collection, not for loop vectorization.
the same is true for loop predication, which is also helpful for eliminating condition check within a loop.

Jie He
B.R

This is incorrect.

IRCE’s current sole known user happens to be a compiler for a GCed language, but there is no (intentional) dependence on that fact. It should work on arbitrary IR.

Loop predication (the form in IndVars) triggers for arbitrary IR. The separate pass depends on semantics of guards which is related to deopt semantics, but not GC.

Philip

JFYI, the addition of ‘A’ and ‘C’ in your example complicates cost modeling a bunch.

I’ll review the patch, just sharing a meta comment here.

Philip

yes, but current lowering deopt implementation would generate a statepoint IR which currently only supports X86-64, as mentioned in GC documentation in LLVM.

iRCE doesn’t reply on GCed language, I remember wrong. but it’s not smart right now, can’t handle bounds check well like java RCE did.

I believe this is supported on at least AArch64 if memory serves. Er, I think you’re either misunderstanding or need to clarify your point. IRCE does exactly the standard pre/main/post loop technique which was used in C2 back in the day. LoopPred does the widening transformation. Do you have a particular case in mind you’re thinking of?

hi Philip

yes, I submitted 2 issues about iRCE, 49012 and 49014.I don’t know if I misuse the pass, I have no comprehensive understanding about this pass and its background. just take some time to dive the code to find the reason.