If a target has an efficient way to test a bit at a given position (bt
on X86, tstbit
on Hexagon, bext
on RISCV), we can use that to combine or
s of several icmp eq
into one instruction. Note that gcc already does this optimization.
Consider the following code:
void foo(int x) {
a = (((x == 2) || (x == 5) || (x == 7)));
return;
}
Currently llvm generates inefficient assembly:
.text
.attribute 4, 16
.attribute 5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0_zba1p0_zbb1p0_zbc1p0"
.file "test.c"
.globl foo
.p2align 1
.type foo,@function
foo:
li a2, 2
li a1, 1
beq a0, a2, .LBB0_2
li a2, 5
bne a0, a2, .LBB0_3
.LBB0_2:
lui a0, %hi(a)
sw a1, %lo(a)(a0)
ret
.LBB0_3:
addi a0, a0, -7
seqz a1, a0
lui a0, %hi(a)
sw a1, %lo(a)(a0)
ret
.Lfunc_end0:
.size foo, .Lfunc_end0-foo
.type a,@object
.section .sbss,"aw",@nobits
.globl a
.p2align 2
a:
.word 0
.size a, 4
We should generate
addi a1, x0, 164 // 164 = 2^2 + 2^5 + 2^7
bext a1, a1, a0
lui a0, %hi(a)
sw a1, %lo(a)(a0)
ret
Not being able to recognize this idiom may also prevent other optimizations. Here is the example that I came across: The IR for the above code going into (2nd pass of) SimplifyCFG
looks as follows:
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%cmp = icmp eq i32 %x, 2
%cmp1 = icmp eq i32 %x, 5
%or.cond = or i1 %cmp, %cmp1
br i1 %or.cond, label %lor.end, label %lor.rhs
lor.rhs: ; preds = %entry
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
br label %lor.end
lor.end: ; preds = %lor.rhs, %entry
%lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ]
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}
The SimplifyBranchOnICmpChain
turns this into:
; Function Attrs: nounwind
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
switch i32 %x, label %lor.rhs [
i32 5, label %lor.end
i32 2, label %lor.end
]
lor.rhs: ; preds = %entry
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
br label %lor.end
lor.end: ; preds = %entry, %entry, %lor.rhs
%lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ], [ 1, %entry ]
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}
The switch
instruction is not handled by neither SpeculativelyExecuteBB
nor by FoldTwoEntryPHINode
which come later. If we disable SimplifyBranchOnICmpChain
on this example we get much better IR at the end of opt
:
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%cmp = icmp eq i32 %x, 2
%cmp1 = icmp eq i32 %x, 5
%or.cond = or i1 %cmp, %cmp1
%cmp2 = icmp eq i32 %x, 16
%narrow = or i1 %cmp2, %or.cond
%lor.ext = zext i1 %narrow to i32
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}
The chain of or
s
%cmp = icmp eq i32 %x, 2
%cmp1 = icmp eq i32 %x, 5
%or.cond = or i1 %cmp, %cmp1
%cmp2 = icmp eq i32 %x, 16
%narrow = or i1 %cmp2, %or.cond
can be combined into the bit-testing instruction.
Should we look for chains of or
s and combine them into a simpler instruction sequence? I modified the input to SimplifyCFG
to what it would be if we combined the or
s:
; Function Attrs: nounwind
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%lshr_ = lshr i32 36, %x
%and_ = and i32 %lshr_, 1
%cond = icmp eq i32 %and_, 1
br i1 %cond, label %lor.end, label %lor.rhs
lor.rhs: ; preds = %entry
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
br label %lor.end
lor.end: ; preds = %lor.rhs, %entry
%lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ]
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}
This does give the following IR at the end:
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%lshr_ = lshr i32 36, %x
%and_ = and i32 %lshr_, 1
%cond = icmp eq i32 %and_, 1
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
%lor.ext = select i1 %cond, i32 1, i32 %0
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}
I guess (and this is true for gcc) another place where the idiom can be useful is in lowering of switches when several cases go to the same destination.
Several questions come up:
(1) Should we do this optimization?
(2) If “yes”, where / how? Should we combine this idiom to icmp eq (and(lshr(...), 1), 1)
and later recognize this in DAG combine? Maybe we should have an intrinsic?
(3) Should we treat a switch with only two destinations as a branch on the condition expressed by the idiom.
Looking forward to your thoughts / comments / suggestions!
Thank you!