Jump to Basic Block with only One Return Instruction

Hi all,

We have noticed a missed optimization opportunity in coremark. Here’s how it looks in isolation (on RISCV):

core_list_find:                         # @core_list_find  
# %bb.0:                                # %entry
	lh	a2, 2(a1)
	bltz	a2, .LBB0_5
# %bb.1:                                # %while.cond.preheader
	beqz	a0, .LBB0_9
# %bb.2:                                # %land.rhs.preheader
	zext.h	a1, a2
.LBB0_3:                                # %land.rhs
                                        # =>This Inner Loop Header: Depth=1
	ld	a2, 8(a0)
	lhu	a2, 2(a2)
	beq	a2, a1, .LBB0_9
# %bb.4:                                # %while.body
                                        #   in Loop: Header=BB0_3 Depth=1
	ld	a0, 0(a0)
	bnez	a0, .LBB0_3
(!)	j	.LBB0_9  # we could replace this with a ret instruction!
.LBB0_5:                                # %while.cond9.preheader
	beqz	a0, .LBB0_9
# %bb.6:                                # %land.rhs11.lr.ph
	lhu	a1, 0(a1)
.LBB0_7:                                # %land.rhs11
                                        # =>This Inner Loop Header: Depth=1
	ld	a2, 8(a0)
	lbu	a2, 0(a2)
	beq	a2, a1, .LBB0_9
# %bb.8:                                # %while.body19
                                        #   in Loop: Header=BB0_7 Depth=1
	ld	a0, 0(a0)
	bnez	a0, .LBB0_7
.LBB0_9:                                # %return
	ret

On the line marked with (!) The j .LBB0_9 instruction could be replaced with ret. We would like to ask the community what the best approach would be here.

To me it seems like this could be done in a late tail duplication pass. But TailDuplicator can’t duplicate into bb.4, because it has more than one successor (see first lines of TailDuplicator::canCompletelyDuplicateBB).

We can do the following transformation:

bb:
...
CondBr TBB
UncondBr FBB

becomes:

bb:
...
CondBr TBB
   |
   | # a fall-through to a new basic block bbF
   V
bbF:
UncondBr FBB

This creates new opportunities for tail-duplication (and maybe other optimizations?). Tail-duplicating FBB into bbF is perfectly fine now.

I have implemented this approach and compiled everything in llvm-test-suite/SingleSource for X86. There were 135 instances of the above transformation.

Thank you and looking forward to hearing your thoughts / suggestions,

Mikhail