We have noticed a missed optimization opportunity in coremark. Here’s how it looks in isolation (on RISCV):
core_list_find: # @core_list_find # %bb.0: # %entry lh a2, 2(a1) bltz a2, .LBB0_5 # %bb.1: # %while.cond.preheader beqz a0, .LBB0_9 # %bb.2: # %land.rhs.preheader zext.h a1, a2 .LBB0_3: # %land.rhs # =>This Inner Loop Header: Depth=1 ld a2, 8(a0) lhu a2, 2(a2) beq a2, a1, .LBB0_9 # %bb.4: # %while.body # in Loop: Header=BB0_3 Depth=1 ld a0, 0(a0) bnez a0, .LBB0_3 (!) j .LBB0_9 # we could replace this with a ret instruction! .LBB0_5: # %while.cond9.preheader beqz a0, .LBB0_9 # %bb.6: # %land.rhs11.lr.ph lhu a1, 0(a1) .LBB0_7: # %land.rhs11 # =>This Inner Loop Header: Depth=1 ld a2, 8(a0) lbu a2, 0(a2) beq a2, a1, .LBB0_9 # %bb.8: # %while.body19 # in Loop: Header=BB0_7 Depth=1 ld a0, 0(a0) bnez a0, .LBB0_7 .LBB0_9: # %return ret
On the line marked with (!) The
j .LBB0_9 instruction could be replaced with
ret. We would like to ask the community what the best approach would be here.
To me it seems like this could be done in a late tail duplication pass. But
TailDuplicator can’t duplicate into
bb.4, because it has more than one successor (see first lines of
We can do the following transformation:
bb: ... CondBr TBB UncondBr FBB
bb: ... CondBr TBB | | # a fall-through to a new basic block bbF V bbF: UncondBr FBB
This creates new opportunities for tail-duplication (and maybe other optimizations?). Tail-duplicating
bbF is perfectly fine now.
I have implemented this approach and compiled everything in llvm-test-suite/SingleSource for X86. There were 135 instances of the above transformation.
Thank you and looking forward to hearing your thoughts / suggestions,