# Missed Optimization: Combine Ors to Testing Of a Single Bit

If a target has an efficient way to test a bit at a given position (`bt` on X86, `tstbit` on Hexagon, `bext` on RISCV), we can use that to combine `or`s of several `icmp eq` into one instruction. Note that gcc already does this optimization.

Consider the following code:

``````void foo(int x) {
a = (((x == 2) || (x == 5) || (x == 7)));
return;
}
``````

Currently llvm generates inefficient assembly:

``````	.text
.attribute	4, 16
.attribute	5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0_zba1p0_zbb1p0_zbc1p0"
.file	"test.c"
.globl	foo
.p2align	1
.type	foo,@function
foo:
li	a2, 2
li	a1, 1
beq	a0, a2, .LBB0_2
li	a2, 5
bne	a0, a2, .LBB0_3
.LBB0_2:
lui	a0, %hi(a)
sw	a1, %lo(a)(a0)
ret
.LBB0_3:
seqz	a1, a0
lui	a0, %hi(a)
sw	a1, %lo(a)(a0)
ret
.Lfunc_end0:
.size	foo, .Lfunc_end0-foo

.type	a,@object
.section	.sbss,"aw",@nobits
.globl	a
.p2align	2
a:
.word	0
.size	a, 4
``````

We should generate

``````	addi a1, x0, 164 // 164 = 2^2 + 2^5 + 2^7
bext	a1, a1, a0
lui	a0, %hi(a)
sw	a1, %lo(a)(a0)
ret
``````

Not being able to recognize this idiom may also prevent other optimizations. Here is the example that I came across: The IR for the above code going into (2nd pass of) `SimplifyCFG` looks as follows:

``````define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%cmp = icmp eq i32 %x, 2
%cmp1 = icmp eq i32 %x, 5
%or.cond = or i1 %cmp, %cmp1
br i1 %or.cond, label %lor.end, label %lor.rhs

lor.rhs:                                          ; preds = %entry
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
br label %lor.end

lor.end:                                          ; preds = %lor.rhs, %entry
%lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ]
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}

``````

The `SimplifyBranchOnICmpChain` turns this into:

``````; Function Attrs: nounwind
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
switch i32 %x, label %lor.rhs [
i32 5, label %lor.end
i32 2, label %lor.end
]

lor.rhs:                                          ; preds = %entry
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
br label %lor.end

lor.end:                                          ; preds = %entry, %entry, %lor.rhs
%lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ], [ 1, %entry ]
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}

``````

The `switch` instruction is not handled by neither `SpeculativelyExecuteBB` nor by `FoldTwoEntryPHINode` which come later. If we disable `SimplifyBranchOnICmpChain` on this example we get much better IR at the end of `opt`:

``````define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%cmp = icmp eq i32 %x, 2
%cmp1 = icmp eq i32 %x, 5
%or.cond = or i1 %cmp, %cmp1
%cmp2 = icmp eq i32 %x, 16
%narrow = or i1 %cmp2, %or.cond
%lor.ext = zext i1 %narrow to i32
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}

``````

The chain of `or`s

``````  %cmp = icmp eq i32 %x, 2
%cmp1 = icmp eq i32 %x, 5
%or.cond = or i1 %cmp, %cmp1
%cmp2 = icmp eq i32 %x, 16
%narrow = or i1 %cmp2, %or.cond

``````

can be combined into the bit-testing instruction.

Should we look for chains of `or`s and combine them into a simpler instruction sequence? I modified the input to `SimplifyCFG` to what it would be if we combined the `or`s:

``````; Function Attrs: nounwind
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%lshr_ = lshr i32 36, %x
%and_ = and i32 %lshr_, 1
%cond = icmp eq i32 %and_, 1
br i1 %cond, label %lor.end, label %lor.rhs

lor.rhs:                                          ; preds = %entry
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
br label %lor.end

lor.end:                                          ; preds = %lor.rhs, %entry
%lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ]
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}

``````

This does give the following IR at the end:

``````define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
%lshr_ = lshr i32 36, %x
%and_ = and i32 %lshr_, 1
%cond = icmp eq i32 %and_, 1
%cmp2 = icmp eq i32 %x, 16
%0 = zext i1 %cmp2 to i32
%lor.ext = select i1 %cond, i32 1, i32 %0
store i32 %lor.ext, ptr @a, align 4, !tbaa !4
ret void
}

``````

I guess (and this is true for gcc) another place where the idiom can be useful is in lowering of switches when several cases go to the same destination.

Several questions come up:
(1) Should we do this optimization?
(2) If “yes”, where / how? Should we combine this idiom to `icmp eq (and(lshr(...), 1), 1)` and later recognize this in DAG combine? Maybe we should have an intrinsic?
(3) Should we treat a switch with only two destinations as a branch on the condition expressed by the idiom.

Thank you!

I guess (and this is true for gcc) another place where the idiom can be useful is in lowering of switches when several cases go to the same destination.

LLVM already does this (when it thinks the circumstances are right), for example: Compiler Explorer

For your original example, if we return a and add another conditional, LLVM does at least partially use a bit test, but it does seem to not get it quite right: Compiler Explorer

EDIT: For the optimization to be correct we MAY need to check if the value in question is not bigger (as unsigned) than (bitwidth of register - 1).

Thanks for taking a look! I am wondering if LLVM should do this more often than it currently does. My example shows that some cases are missed. I’ll try to run an experiment on spec to see how often something like this occurs.