Question about Unrolling Loop with Multiple Exits

Hi All,

While I am investigating benchmarks, I have found loops which llvm fails to unroll because the loops have multiple exits.

For example,

char *foo(void);

int boo(char *s);

int test(char *s, char *end, char *check1, char *check2) {

while (s <= end) {

if (*s++ != ‘\n’)

continue;

if (check1 || check2) {

s = foo();

if (!s)

goto ret1;

}

if (boo(s))

goto ret0;

}

goto ret1;

ret0:

return 0;

ret1:

return 1;

}

Above code causes below messages from LoopUnroll pass.

Bailout for multi-exit handling when latch exit has >1 predecessor.

Multiple exit/exiting blocks in loop and multi-exit unrolling not enabled!

I can see the option unroll-runtime-multi-exit and comments about it. I wonder there are already reviews for the work on phabriactor or some people are working on it.

If someone knows information about it, please share it.

Thanks

JinGu Kang

A) Are you measuring on tip of tree? There were changes for multiple exit unrolling which landed very recently.

B) One of your exits does not dominate your latch. Those are generally hard.Â

C) This example does not seem to require gotos. I strongly suggest reducing your test cases if you want more informed commentary.Â

Philip

Hi Philip,

Thanks for your kind reply.

A) Are you measuring on tip of tree?� There were changes for multiple exit unrolling which landed very recently.

Yep, I am investigating benchmarks with llvm tip and I can see the llvm fails to unroll some loops with multiple exits.

B) One of your exits does not dominate your latch.� Those are generally hard

C) This example does not seem to require gotos.� I strongly suggest reducing your test cases if you want more informed commentary.�

I am looking at perlbench recently and it has goto statements inside loop. The example is a reduced case. When I look at the gcc’s output of the example, it looks like gcc unrolls only the below if statement block…

if (*s++ != ‘\n’)

continue;

Thanks

JinGu Kang

Right, but the gotos aren’t relevant for your reduced test. You can reduce further. Your phrasing here does not parse for me. Can you restate this with different wording and maybe a fully worked example?

Sorry for poor example…

The AArch64 assembly output of the example from gcc is as below. The loop is unrolled 7 times. I have written some comments to explain how the assembly code is mapped to C source code. As you can see on .L3 label, the ‘if (*s++ != ‘\n’)’ block is unrolled 7 times.

stp x29, x30, [sp, -48]!

mov x29, sp

stp x19, x20, [sp, 16]

mov x19, x0 → x19 is char *s

mov x20, x1 → x20 is char *end

str x21, [sp, 32]

orr x21, x3, x2 → x21 is check1 | check2

.L70:

sub x0, x20, x19 → x0 = end - s;

add x1, x0, 1

ands x2, x1, 7 → unroll count is 7

beq .L3 → .L3 is inside while loop. if (*s++ != ‘\n’)

cmp x19, x20 → while(s <= end)

bhi .L2 → .L2 is label ret1.

ldrb w3, [x19], 1 → start of remainder

cmp w3, 10

beq .L71

cmp x2, 1

beq .L3

cmp x2, 2

beq .L49

cmp x2, 3

beq .L50

cmp x2, 4

beq .L51

cmp x2, 5

beq .L52

cmp x2, 6

beq .L53

ldrb w4, [x19], 1

cmp w4, 10

bne .L53

.L71:

cbz x21, .L4 → if(check1 || check2)

bl foo()

mov x19, x0

cbz x0, .L2 → if (!s) goto ret1;

.L4: → if(boo(s)) goto ret0;

mov x0, x19

bl boo(char*)

cbz w0, .L70

mov w0, 0

ldp x19, x20, [sp, 16]

ldr x21, [sp, 32]

ldp x29, x30, [sp], 48

ret

.L53: → if (*s++ != ‘\n’) for remainder

ldrb w5, [x19], 1

cmp w5, 10

beq .L71

.L52: → if (*s++ != ‘\n’) for remainder

ldrb w6, [x19], 1

cmp w6, 10

beq .L71

.L51: → if (*s++ != ‘\n’) for remainder

ldrb w7, [x19], 1

cmp w7, 10

beq .L71

.L50: → if (*s++ != ‘\n’) for remainder

ldrb w8, [x19], 1

cmp w8, 10

beq .L71

.L49: → if (*s++ != ‘\n’) for remainder

ldrb w9, [x19], 1

cmp w9, 10

beq .L71

.L3: → if (*s++ != ‘\n’), 7 times unrolled

cmp x19, x20

bhi .L2

ldrb w10, [x19]

add x19, x19, 1

mov x11, x19

cmp w10, 10

beq .L71

ldrb w12, [x19], 1

cmp w12, 10

beq .L71

ldrb w13, [x11, 1]

add x19, x11, 2

cmp w13, 10

beq .L71

ldrb w14, [x11, 2]

add x19, x11, 3

cmp w14, 10

beq .L71

ldrb w15, [x11, 3]

add x19, x11, 4

cmp w15, 10

beq .L71

ldrb w16, [x11, 4]

add x19, x11, 5

cmp w16, 10

beq .L71

ldrb w17, [x11, 5]

add x19, x11, 6

cmp w17, 10

beq .L71

ldrb w18, [x11, 6]

add x19, x11, 7

cmp w18, 10

beq .L71

b .L3

.L2: → label ret1

mov w0, 1

ldp x19, x20, [sp, 16]

ldr x21, [sp, 32]

ldp x29, x30, [sp], 48

ret

I am sorry for showing you assembly output directly… It looks like the rtl level’s unroll pass of gcc unrolls above loop and I need to check it next week. Once I have clear idea about above unrolling from gcc, let me reduce the example more and let you know.

Thanks

JinGu Kang

Jingu,

I’m not fluent in AArch64 assembly, sorry. If you want more meaningful commentary, you’ll need to reduce your c++ example further, and give a better explanation of what you expect the unroller to do for you.Â

Philip

Hi Philip,

I have reduced the test case roughly as below.

declare i32 @foo(i8 *)

define void @test(i8* %s, i64 %a) {

entry:

%s.addr.a = getelementptr i8, i8* %s, i64 %a

br label %while.body.us

while.body.us:

%s.addr = phi i8* [ %incdec.ptr, %while.cond.backedge ], [ %s, %entry ]

%incdec.ptr = getelementptr inbounds i8, i8* %s.addr, i64 1

%incdec.val = load i8, i8* %s.addr, align 1

%cmp1 = icmp eq i8 %incdec.val, 10

br i1 %cmp1, label %if.end8.us, label %while.cond.backedge

if.end8.us:

%call9 = tail call i32 @foo(i8* nonnull %incdec.ptr)

%cmp2 = icmp ult i32 %call9, 0

br i1 %cmp2, label %while.cond.backedge, label %return.loopexit

while.cond.backedge:

%cmp3 = icmp eq i8* %incdec.ptr, %s.addr.a

br i1 %cmp3, label %return.loopexit, label %while.body.us

return.loopexit:

ret void

}

Roughly, I unrolled the loop manually 2 times as below and ignored the remaining loop simply.

define void @test(i8* %s, i64 %a) {

entry:

%s.addr.a = getelementptr i8, i8* %s, i64 %a

br label %while.body.us

while.body.us:

%s.addr = phi i8* [ %incdec.ptr, %while.cond.backedge ], [ %s, %entry ]

%incdec.ptr = getelementptr inbounds i8, i8* %s.addr, i64 1

%incdec.val = load i8, i8* %s.addr, align 1

%cmp1 = icmp eq i8 %incdec.val, 10

br i1 %cmp1, label %if.end8.us, label %while.body.us.1

while.body.us.1:

%incdec.ptr.1 = getelementptr inbounds i8, i8* %incdec.ptr, i64 1

%incdec.val.1 = load i8, i8* %incdec.ptr, align 1

%cmp1.1 = icmp eq i8 %incdec.val.1, 10

br i1 %cmp1.1, label %if.end8.us, label %while.cond.backedge

if.end8.us:

%incdec.ptr.phi = phi i8* [ %incdec.ptr, %while.body.us ], [ %incdec.ptr.1, %while.body.us.1 ]

%call9 = tail call i32 @foo(i8* nonnull %incdec.ptr.phi)

%cmp2 = icmp ult i32 %call9, 0

br i1 %cmp2, label %while.cond.backedge, label %return.loopexit

while.cond.backedge:

%cmp3 = icmp eq i8* %incdec.ptr.1, %s.addr.a

br i1 %cmp3, label %return.loopexit, label %while.body.us

return.loopexit:

ret void

}

If possible, can we make loop unroll pass handle this kind of cases please?

If I missed something, please let me know.

Thanks

JinGu Kang

./opt -loop-unroll -enable-new-pm=0 < llvm-dev.ll -S -unroll-runtime -debug
Args: ./opt -loop-unroll -enable-new-pm=0 -S -unroll-runtime -debug
Loop Unroll: F[test] Loop %while.body.us
 Loop Size = 10
 runtime unrolling with count: 8
 Exiting block %if.end8.us: TripCount=0, TripMultiple=1, BreakoutTrip=1
 Exiting block %while.cond.backedge: TripCount=0, TripMultiple=1, BreakoutTrip=1
Trying runtime unrolling on Loop:
Loop at depth 1 containing: %while.body.us,%if.end8.us,%while.cond.backedge
Using prolog remainder.
Bailout for multi-exit handling when latch exit has >1 predecessor.
Multiple exit/exiting blocks in loop and multi-exit unrolling not enabled!
Won’t unroll; remainder loop could not be generated when assuming runtime trip count
; ModuleID = ‘’
source_filename = “”

declare i32 @foo(i8*)

define void @test(i8* %s, i64 %a) {
entry:
 %s.addr.a = getelementptr i8, i8* %s, i64 %a
 br label %while.body.us

while.body.us:Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ; preds = %while.cond.backedge, %entry
 %s.addr = phi i8* [ %incdec.ptr, %while.cond.backedge ], [ %s, %entry ]
 %incdec.ptr = getelementptr inbounds i8, i8* %s.addr, i64 1
 %incdec.val = load i8, i8* %s.addr, align 1
 %cmp1 = icmp eq i8 %incdec.val, 10
 br i1 %cmp1, label %if.end8.us, label %while.cond.backedge

if.end8.us:Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ; preds = %while.body.us
 %call9 = tail call i32 @foo(i8* nonnull %incdec.ptr)
 %cmp2 = icmp ult i32 %call9, 0
 br i1 %cmp2, label %while.cond.backedge, label %return.loopexit

while.cond.backedge:Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ; preds = %if.end8.us, %while.body.us
 %cmp3 = icmp eq i8* %incdec.ptr, %s.addr.a
 br i1 %cmp3, label %return.loopexit, label %while.body.us

return.loopexit:Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â Â ; preds = %while.cond.backedge, %if.end8.us
 ret void
}

The debug output appears to give a pretty good clue as to where you could start if desired.

Philip

Thanks for kind suggestion Philip.

Let me have a look at the points where the debug message mentions.

Regards

JinGu Kang

It looks like the runtime unroll does not support the trip count with pointer type…

Before handling the multiple exits, I need to resolve it first for my example…

If possible, can someone let me know that people are working on it or there are reviews for it on phabricator please?

Thanks

JinGu Kang

JFYI, works in this direction, and the reduced test case does unroll with that patch and the following command line.

$ opt -loop-unroll -enable-new-pm=0 < llvm-dev.ll -S -unroll-runtime -debug -unroll-runtime-multi-exit -unroll-runtime-epilog

Hi Philip,

I can see upstream llvm unrolls the loop in the test case with the command line options.

I really appreciate your patch!

Thanks,

JinGu Kang