A code-layout-related side effect introduced by rL318299

Hi,

Recently, a 10% performance regression on an important benchmark showed up after we integrated https://reviews.llvm.org/rL318299. The analysis showed that rL318299 triggered loop rotation on a multi-exit loop, and the loop rotation introduced a code layout issue; the performance regression is a side effect of rL318299. I have attached two testcases, a.ll and b.ll, to illustrate the problem. a.ll was generated by rL318298 and b.ll was generated by rL318299.

-------------------------- a.ll ----------------------------
declare void @_Z1fv() local_unnamed_addr #2
@i = global i8 0, align 1

define i8* @Z1gPcS_S(i8* nocapture readonly %d, i8* %h, i8* readnone returned %p3) local_unnamed_addr #3 {
entry:
br label %while.cond

while.cond: ; preds = %while.body, %entry
%h.addr.0 = phi i8* [ %h, %entry ], [ %add.ptr4, %while.body ]
%d.addr.0 = phi i8* [ %d, %entry ], [ %add.ptr3, %while.body ]
%cmp = icmp ugt i8* %h.addr.0, @i
br i1 %cmp, label %while.end, label %while.body

while.body: ; preds = %while.cond
%0 = bitcast i8* %d.addr.0 to i64*
%1 = load i64, i64* %0, align 1
%2 = bitcast i8* %h.addr.0 to i64*
store i64 %1, i64* %2, align 1
%add.ptr = getelementptr inbounds i8, i8* %d.addr.0, i64 8
%3 = bitcast i8* %add.ptr to i64*
%4 = load i64, i64* %3, align 1
store i64 %4, i64* %2, align 1
%add.ptr3 = getelementptr inbounds i8, i8* %d.addr.0, i64 6
%add.ptr4 = getelementptr inbounds i8, i8* %h.addr.0, i64 6
%cmp5 = icmp ult i8* %add.ptr4, %p3
br i1 %cmp5, label %while.cond, label %return

while.end: ; preds = %while.cond
tail call void @_Z1fv()
unreachable

return: ; preds = %while.body
ret i8* %p3
}

-------------------------- b.ll ----------------------------
declare void @_Z1fv() local_unnamed_addr #2
@i = global i8 0, align 1

define i8* @Z1gPcS_S(i8* nocapture readonly %d, i8* %h, i8* readnone returned %p3) local_unnamed_addr #3 {
entry:
br label %while.cond

while.cond: ; preds = %cleanup.cont, %entry
%h.addr.0 = phi i8* [ %h, %entry ], [ %add.ptr4, %cleanup.cont ]
%d.addr.0 = phi i8* [ %d, %entry ], [ %add.ptr3, %cleanup.cont ]
%cmp = icmp ugt i8* %h.addr.0, @i
br i1 %cmp, label %while.end, label %while.body

while.body: ; preds = %while.cond
%0 = bitcast i8* %d.addr.0 to i64*
%1 = load i64, i64* %0, align 1
%2 = bitcast i8* %h.addr.0 to i64*
store i64 %1, i64* %2, align 1
%add.ptr = getelementptr inbounds i8, i8* %d.addr.0, i64 8
%3 = bitcast i8* %add.ptr to i64*
%4 = load i64, i64* %3, align 1
store i64 %4, i64* %2, align 1
%add.ptr4 = getelementptr inbounds i8, i8* %h.addr.0, i64 6
%cmp5 = icmp ult i8* %add.ptr4, %p3
br i1 %cmp5, label %cleanup.cont, label %return

cleanup.cont: ; preds = %while.body
%add.ptr3 = getelementptr inbounds i8, i8* %d.addr.0, i64 6
br label %while.cond

while.end: ; preds = %while.cond
tail call void @_Z1fv()
unreachable

return: ; preds = %while.body
ret i8* %p3
}

The only difference between a.ll and b.ll is the basic block cleanup.cont: the loop backedge has been split and the computation of %add.ptr3 has been sunk into the new block.

-------------------------- a.ll after loop rotate, same as a.ll before loop rotate ----------------------------
~/workarea/llvm-r318298/dbuild/bin/opt -loop-rotate -S < a.ll

; ModuleID = ''
source_filename = ""

@i = global i8 0, align 1

declare void @_Z1fv() local_unnamed_addr

define i8* @Z1gPcS_S(i8* nocapture readonly %d, i8* %h, i8* readnone returned %p3) local_unnamed_addr {
entry:
br label %while.cond

while.cond: ; preds = %while.body, %entry
%h.addr.0 = phi i8* [ %h, %entry ], [ %add.ptr4, %while.body ]
%d.addr.0 = phi i8* [ %d, %entry ], [ %add.ptr3, %while.body ]
%cmp = icmp ugt i8* %h.addr.0, @i
br i1 %cmp, label %while.end, label %while.body

while.body: ; preds = %while.cond
%0 = bitcast i8* %d.addr.0 to i64*
%1 = load i64, i64* %0, align 1
%2 = bitcast i8* %h.addr.0 to i64*
store i64 %1, i64* %2, align 1
%add.ptr = getelementptr inbounds i8, i8* %d.addr.0, i64 8
%3 = bitcast i8* %add.ptr to i64*
%4 = load i64, i64* %3, align 1
store i64 %4, i64* %2, align 1
%add.ptr3 = getelementptr inbounds i8, i8* %d.addr.0, i64 6
%add.ptr4 = getelementptr inbounds i8, i8* %h.addr.0, i64 6
%cmp5 = icmp ult i8* %add.ptr4, %p3
br i1 %cmp5, label %while.cond, label %return

while.end: ; preds = %while.cond
tail call void @_Z1fv()
unreachable

return: ; preds = %while.body
ret i8* %p3
}

-------------------------- b.ll after loop rotate ----------------------------
~/workarea/llvm-r318298/dbuild/bin/opt -loop-rotate -S < b.ll

@i = global i8 0, align 1
declare void @_Z1fv() local_unnamed_addr

define i8* @Z1gPcS_S(i8* nocapture readonly %d, i8* %h, i8* readnone returned %p3) local_unnamed_addr {
entry:
%cmp1 = icmp ugt i8* %h, @i
br i1 %cmp1, label %while.end, label %while.body.lr.ph

while.body.lr.ph: ; preds = %entry
br label %while.body

while.cond: ; preds = %while.body
%h.addr.0 = phi i8* [ %add.ptr4, %while.body ]
%d.addr.0 = phi i8* [ %add.ptr3, %while.body ]
%cmp = icmp ugt i8* %h.addr.0, @i
br i1 %cmp, label %while.cond.while.end_crit_edge, label %while.body

while.body: ; preds = %while.body.lr.ph, %while.cond
%d.addr.03 = phi i8* [ %d, %while.body.lr.ph ], [ %d.addr.0, %while.cond ]
%h.addr.02 = phi i8* [ %h, %while.body.lr.ph ], [ %h.addr.0, %while.cond ]
%0 = bitcast i8* %d.addr.03 to i64*
%1 = load i64, i64* %0, align 1
%2 = bitcast i8* %h.addr.02 to i64*
store i64 %1, i64* %2, align 1
%add.ptr = getelementptr inbounds i8, i8* %d.addr.03, i64 8
%3 = bitcast i8* %add.ptr to i64*
%4 = load i64, i64* %3, align 1
store i64 %4, i64* %2, align 1
%add.ptr4 = getelementptr inbounds i8, i8* %h.addr.02, i64 6
%cmp5 = icmp ult i8* %add.ptr4, %p3
%add.ptr3 = getelementptr inbounds i8, i8* %d.addr.03, i64 6
br i1 %cmp5, label %while.cond, label %return

while.cond.while.end_crit_edge: ; preds = %while.cond
br label %while.end

while.end: ; preds = %while.cond.while.end_crit_edge, %entry
tail call void @_Z1fv()
unreachable

return: ; preds = %while.body
ret i8* %p3
}

a.ll and b.ll have different results after loop rotation because of http://llvm.org/viewvc/llvm-project?view=revision&revision=181230

-------------------------- a.s generated from a.ll ----------------------------
~/workarea/llvm-r318298/dbuild/bin/opt -loop-rotate -S < a.ll |~/workarea/llvm-r318298/dbuild/bin/llc

	.cfi_startproc
# BB#0:                                 # %entry
	pushq	%rax
	.cfi_def_cfa_offset 16
	movl	$i, %eax
	.p2align	4, 0x90
.LBB0_1:                                # %while.cond
                                        # =>This Inner Loop Header: Depth=1
	cmpq	%rax, %rsi
	ja	.LBB0_4
# BB#2:                                 # %while.body
                                        #   in Loop: Header=BB0_1 Depth=1
	movq	(%rdi), %rcx
	movq	%rcx, (%rsi)
	movq	8(%rdi), %rcx
	movq	%rcx, (%rsi)
	addq	$6, %rdi
	addq	$6, %rsi
	cmpq	%rdx, %rsi
	jb	.LBB0_1
# BB#3:                                 # %return
	movq	%rdx, %rax
	popq	%rcx
	retq
.LBB0_4:                                # %while.end
	callq	_Z1fv
.Lfunc_end0:
	.size	Z1gPcS_S, .Lfunc_end0-Z1gPcS_S
	.cfi_endproc

The call to _Z1fv is unreachable. If the loop .LBB0_1 has few iterations, a.s executes mostly fall-through branches.

-------------------------- b.s generated from b.ll ----------------------------
~/workarea/llvm-r318298/dbuild/bin/opt -loop-rotate -S < b.ll |~/workarea/llvm-r318298/dbuild/bin/llc

	.cfi_startproc
# BB#0:                                 # %entry
	pushq	%rax
	.cfi_def_cfa_offset 16
	movl	$i, %eax
	cmpq	%rax, %rsi
	ja	.LBB0_5
# BB#1:
	movl	$i, %eax
	.p2align	4, 0x90
.LBB0_3:                                # %while.body
                                        # =>This Inner Loop Header: Depth=1
	movq	(%rdi), %rcx
	movq	%rcx, (%rsi)
	movq	8(%rdi), %rcx
	movq	%rcx, (%rsi)
	addq	$6, %rsi
	cmpq	%rdx, %rsi
	jae	.LBB0_4
# BB#2:                                 # %while.cond
                                        #   in Loop: Header=BB0_3 Depth=1
	addq	$6, %rdi
	cmpq	%rax, %rsi
	jbe	.LBB0_3
.LBB0_5:                                # %while.end
	callq	_Z1fv
.LBB0_4:                                # %return
	movq	%rdx, %rax
	popq	%rcx
	retq
.Lfunc_end0:
	.size	Z1gPcS_S, .Lfunc_end0-Z1gPcS_S
	.cfi_endproc

The call to _Z1fv is unreachable. Here we have "jae .LBB0_4", which does not fall through when exiting the loop. This non-fall-through exit branch increases branch misses significantly and regresses the benchmark by 10%.

A possible way to fix this is to duplicate basic block .LBB0_3, just as tail duplication does; but .LBB0_3 contains 7 instructions, so duplicating it comes at some cost in code size (a rough sketch of such a layout follows).
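
To make the duplication idea concrete, here is a rough hand-written sketch (not actual llc output; block numbering, labels, and comments are made up) of a layout in which .LBB0_3 has been duplicated into the loop tail so that the hot loop exit falls through into the return block:

-------------------------- hypothetical b.s with .LBB0_3 duplicated ----------------------------
	.cfi_startproc
# BB#0:                                 # %entry
	pushq	%rax
	.cfi_def_cfa_offset 16
	movl	$i, %eax
	cmpq	%rax, %rsi
	ja	.LBB0_4                         # cold path to %while.end
# BB#1:                                 # peeled copy of %while.body (first iteration)
	movq	(%rdi), %rcx
	movq	%rcx, (%rsi)
	movq	8(%rdi), %rcx
	movq	%rcx, (%rsi)
	addq	$6, %rsi
	cmpq	%rdx, %rsi
	jae	.LBB0_3
	.p2align	4, 0x90
.LBB0_2:                                # %while.cond + duplicated %while.body
	addq	$6, %rdi
	cmpq	%rax, %rsi
	ja	.LBB0_4                         # cold path to %while.end
	movq	(%rdi), %rcx
	movq	%rcx, (%rsi)
	movq	8(%rdi), %rcx
	movq	%rcx, (%rsi)
	addq	$6, %rsi
	cmpq	%rdx, %rsi
	jb	.LBB0_2                         # backedge; loop exit falls through
.LBB0_3:                                # %return
	movq	%rdx, %rax
	popq	%rcx
	retq
.LBB0_4:                                # %while.end
	callq	_Z1fv
	.cfi_endproc

In this shape the hot exit from the loop falls through into %return and the cold path to %while.end remains a forward branch; the price is one extra copy of the 7-instruction body.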

Any suggestions on how to fix the issue are welcome!

a.ll and b.ll are generated from the same function by different versions of the compiler. a.s and b.s, generated from a.ll and b.ll respectively, show where the performance difference comes from.

What I can’t figure out is why we don’t get the same basic block layout for b.ll…

Specifically, I understand that we essentially have an iteration peeled out of the loop by loop rotation. But the result should still get the fancy layout that causes the hot conditional branch (away from the unreachable) to fall through into the loop header…

Looks like this is the loop-oriented layout really hurting us. We don’t even consider that while.body → return is less cold than while.cond → while.end because we don’t really do much at all in laying out loop exits. =[

For the hot conditional branch, I guess you mean "jbe .LBB0_3" in BB#2. If "jbe .LBB0_3" falls through into the loop header .LBB0_3, then it is impossible for "ja .LBB0_5" in BB#0 to also fall through into .LBB0_3. We cannot achieve both unless .LBB0_3 is duplicated (that is the possible solution listed at the end of the original post, but that solution has the code size increase cost).

The introduction of the cleanup.cont block in b.ll already makes the layout worse than a.ll, even without loop rotation.

Without the cleanup.cont block, the layout is

entry -> while.cond -> while.body -> ret

All the arrows are hot fall-through edges, which is good.

With cleanup.cont introduced in b.ll, the layout without tail duplication or loop rotation looks like:

entry -> while.cond -> while.body -> cleanup.cont, ret

Note that there is now no fall-through edge to the 'ret' block, and while.body needs to branch explicitly to 'ret'. If loop rotation happens, we will have

entry, cleanup.cont -> while.cond -> while.body -> ret

Note that there will be a hot branch from entry to while.cond.

LLVM actually does both tail duplication and loop rotation, and the layout looks like:

entry -> while.cond.dup, cleanup.cont -> while.cond -> while.body -> ret

This is better than the previous one, but there is still a hot branch from while.cond.dup to while.body.

David

(just relaying a comment I made in person)

This does seem like a less good layout, but as a consequence of it we do avoid one addition and comparison along the hot path. It’s not obvious to me which is better: the better layout (with added instruction) or the minimal set of instructions with less good layout.

If the loop has a high number of iterations, i.e., the while.body->ret path is relatively rarely taken, then sinking the increment into a split block brings very little performance gain. For loops with short trip counts, the sinking may seem beneficial, but it comes at the cost of a new taken branch for every execution of the loop exit path -- so the overall performance gain from splitting off a new block is also questionable -- and in this case it actually hurts performance.

Do you see a good way to model this in SimplifyCFG? (Which is where I think this is occurring…)

Perhaps avoid splitting the loop backedge (which is a critical edge)? Splitting it does allow sinking defs that are not live out of the loop into the split block, but it introduces problems like this.

David