Missed Optimization: Combine Ors to Testing Of a Single Bit

If a target has an efficient way to test a bit at a given position (bt on X86, tstbit on Hexagon, bext on RISCV), we can use that to combine ors of several icmp eq into one instruction. Note that gcc already does this optimization.

Consider the following code:

void foo(int x) {
  a = (((x == 2) || (x == 5) || (x == 7)));
  return;
}

Currently llvm generates inefficient assembly:

	.text
	.attribute	4, 16
	.attribute	5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0_zba1p0_zbb1p0_zbc1p0"
	.file	"test.c"
	.globl	foo
	.p2align	1
	.type	foo,@function
foo:
	li	a2, 2
	li	a1, 1
	beq	a0, a2, .LBB0_2
	li	a2, 5
	bne	a0, a2, .LBB0_3
.LBB0_2:
	lui	a0, %hi(a)
	sw	a1, %lo(a)(a0)
	ret
.LBB0_3:
	addi	a0, a0, -7
	seqz	a1, a0
	lui	a0, %hi(a)
	sw	a1, %lo(a)(a0)
	ret
.Lfunc_end0:
	.size	foo, .Lfunc_end0-foo

	.type	a,@object
	.section	.sbss,"aw",@nobits
	.globl	a
	.p2align	2
a:
	.word	0
	.size	a, 4

We should generate

	addi a1, x0, 164 // 164 = 2^2 + 2^5 + 2^7
	bext	a1, a1, a0
	lui	a0, %hi(a)
	sw	a1, %lo(a)(a0)
	ret

Not being able to recognize this idiom may also prevent other optimizations. Here is the example that I came across: The IR for the above code going into (2nd pass of) SimplifyCFG looks as follows:

define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
  %cmp = icmp eq i32 %x, 2
  %cmp1 = icmp eq i32 %x, 5
  %or.cond = or i1 %cmp, %cmp1
  br i1 %or.cond, label %lor.end, label %lor.rhs

lor.rhs:                                          ; preds = %entry
  %cmp2 = icmp eq i32 %x, 16
  %0 = zext i1 %cmp2 to i32
  br label %lor.end

lor.end:                                          ; preds = %lor.rhs, %entry
  %lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ]
  store i32 %lor.ext, ptr @a, align 4, !tbaa !4
  ret void
}

The SimplifyBranchOnICmpChain turns this into:

; Function Attrs: nounwind
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
  switch i32 %x, label %lor.rhs [
    i32 5, label %lor.end
    i32 2, label %lor.end
  ]

lor.rhs:                                          ; preds = %entry
  %cmp2 = icmp eq i32 %x, 16
  %0 = zext i1 %cmp2 to i32
  br label %lor.end

lor.end:                                          ; preds = %entry, %entry, %lor.rhs
  %lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ], [ 1, %entry ]
  store i32 %lor.ext, ptr @a, align 4, !tbaa !4
  ret void
}

The switch instruction is not handled by neither SpeculativelyExecuteBB nor by FoldTwoEntryPHINode which come later. If we disable SimplifyBranchOnICmpChain on this example we get much better IR at the end of opt:

define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
  %cmp = icmp eq i32 %x, 2
  %cmp1 = icmp eq i32 %x, 5
  %or.cond = or i1 %cmp, %cmp1
  %cmp2 = icmp eq i32 %x, 16
  %narrow = or i1 %cmp2, %or.cond
  %lor.ext = zext i1 %narrow to i32
  store i32 %lor.ext, ptr @a, align 4, !tbaa !4
  ret void
}

The chain of ors

  %cmp = icmp eq i32 %x, 2
  %cmp1 = icmp eq i32 %x, 5
  %or.cond = or i1 %cmp, %cmp1
  %cmp2 = icmp eq i32 %x, 16
  %narrow = or i1 %cmp2, %or.cond

can be combined into the bit-testing instruction.

Should we look for chains of ors and combine them into a simpler instruction sequence? I modified the input to SimplifyCFG to what it would be if we combined the ors:

; Function Attrs: nounwind
define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
  %lshr_ = lshr i32 36, %x
  %and_ = and i32 %lshr_, 1 
  %cond = icmp eq i32 %and_, 1
  br i1 %cond, label %lor.end, label %lor.rhs

lor.rhs:                                          ; preds = %entry
  %cmp2 = icmp eq i32 %x, 16
  %0 = zext i1 %cmp2 to i32
  br label %lor.end

lor.end:                                          ; preds = %lor.rhs, %entry
  %lor.ext = phi i32 [ 1, %entry ], [ %0, %lor.rhs ]
  store i32 %lor.ext, ptr @a, align 4, !tbaa !4
  ret void
}

This does give the following IR at the end:

define dso_local void @foo(i32 noundef signext %x, i32 noundef signext %y) local_unnamed_addr #0 {
entry:
  %lshr_ = lshr i32 36, %x
  %and_ = and i32 %lshr_, 1
  %cond = icmp eq i32 %and_, 1
  %cmp2 = icmp eq i32 %x, 16
  %0 = zext i1 %cmp2 to i32
  %lor.ext = select i1 %cond, i32 1, i32 %0
  store i32 %lor.ext, ptr @a, align 4, !tbaa !4
  ret void
}

I guess (and this is true for gcc) another place where the idiom can be useful is in lowering of switches when several cases go to the same destination.

Several questions come up:
(1) Should we do this optimization?
(2) If “yes”, where / how? Should we combine this idiom to icmp eq (and(lshr(...), 1), 1) and later recognize this in DAG combine? Maybe we should have an intrinsic?
(3) Should we treat a switch with only two destinations as a branch on the condition expressed by the idiom.

Looking forward to your thoughts / comments / suggestions!
Thank you!

I guess (and this is true for gcc) another place where the idiom can be useful is in lowering of switches when several cases go to the same destination.

LLVM already does this (when it thinks the circumstances are right), for example: Compiler Explorer

For your original example, if we return a and add another conditional, LLVM does at least partially use a bit test, but it does seem to not get it quite right: Compiler Explorer

EDIT: For the optimization to be correct we MAY need to check if the value in question is not bigger (as unsigned) than (bitwidth of register - 1).

Thanks for taking a look! I am wondering if LLVM should do this more often than it currently does. My example shows that some cases are missed. I’ll try to run an experiment on spec to see how often something like this occurs.

1 Like