X86TargetLowering::LowerToBT

I’m tracking down a case of X86 code generation malfeasance regarding BT (bit test), and I have some questions.

This IR matches and then X86TargetLowering::LowerToBT is called:

%and = and i64 %shl, %val ; (val & (1 << index)) != 0 ; bit test with a register index

This IR does not match and so X86TargetLowering::LowerToBT is not called:

%and = lshr i64 %val, 25 ; (val & (1 << 25)) != 0 ; bit test with an immediate index

%conv = and i64 %and, 1

Let’s back that up a bit. Clang emits this IR. These expressions start out life in C as an AND with a left-shifted masking bit, and are then converted into IR as a right-shifted value ANDed with a masking bit.

This IR then remains untouched until Expand ISel Pseudo-instructions in llc (-O3). At that point, LowerToBT is called on the REGISTER version and substitutes in a BT reg,reg instruction:

btq %rsi, %rdi ## <MCInst #312 BT64rr

The IMMEDIATE version doesn’t match the pattern and so LowerToBT is not called.

Question: This is during pseudo instruction expansion. How could LowerToBT’s caller have enough context to match the immediate IR version? In fact, lli isn’t calling LowerToBT so it isn’t matching. But isn’t this really a peephole optimization issue?

LLVM has a generic peephole optimizer, CodeGen/PeepholeOptimizer.cpp which has exactly one subclass in NVPTXTargetMachine.cpp.

But isn’t it better to deal with X86 LowerToBT in a PeepholeOptimizer subclass, where you have a small window of instructions, rather than during pseudo instruction expansion, where you really have only one instruction? PeepholeOptimizer doesn’t seem to be getting much attention, and certainly no attention at the subclass level.

Bluntly, expansion is about expansion. Peephole optimization is the opposite.

Question: Regardless, why is LowerToBT not being called for the IMMEDIATE version? I suppose you could look at the preceding instruction in the DAG. That seems a bit hacky.

Another approach using LowerToBT would be to match lshr reg/imm first and then, if the following instruction was an and reg,1, replace both with a BT. It doesn’t look like LowerToBT as-is can do that right now, since it is matching the and instruction.

SDValue X86TargetLowering::LowerToBT(SDValue And, ISD::CondCode CC, SDLoc dl, SelectionDAG &DAG) const { … }

But I think this is better done in a subclass of CodeGen/PeepholeOptimizer.cpp.
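
For concreteness, here is roughly the shape I have in mind: a bare-bones MachineFunction pass matching a two-instruction window. This is a sketch only; the opcode names are my assumptions from the 3.7-era X86 target, it would have to live inside lib/Target/X86 to see them, and the actual rewrite is elided because the EFLAGS bookkeeping is the hard part.

  #include "X86InstrInfo.h"                     // for the X86::* opcodes
  #include "llvm/CodeGen/MachineFunctionPass.h"
  #include <iterator>
  using namespace llvm;

  namespace {
  struct X86BitTestPeephole : MachineFunctionPass {
    static char ID;
    X86BitTestPeephole() : MachineFunctionPass(ID) {}

    bool runOnMachineFunction(MachineFunction &MF) override {
      bool Changed = false;
      for (MachineBasicBlock &MBB : MF)
        for (auto MI = MBB.begin(), E = MBB.end(); MI != E; ++MI) {
          // Window of two: shrq $imm, %reg followed by andq $1, %reg.
          if (MI->getOpcode() != X86::SHR64ri)
            continue;
          auto Next = std::next(MI);
          if (Next == E || Next->getOpcode() != X86::AND64ri8 ||
              Next->getOperand(2).getImm() != 1)
            continue;
          // ...rewrite the pair as BT64ri8 + SETBr here, erase the
          // originals, and set Changed...
        }
      return Changed;
    }
  };
  char X86BitTestPeephole::ID = 0;
  }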

thanks.

Hi,

Can you provide a reproducible example? Your first IR sample in particular looks incomplete.
Can you also make more explicit how the generated code is wrong?

You can give a C file if you are sure that it is reproducible with the current clang.

Thanks,

Mehdi

Sure. Attached is the file, but here are the functions. The first uses a fixed bit offset. The second uses an indexed bit offset. Compiling with llc -O3 (LLVM version 3.7.0svn), the IR from IsBitSetB() compiles to btq %rsi, %rdi. Good. But IsBitSetA() compiles to shrq/andq, which is pretty much what Clang had generated as IR.

shrq $25, %rdi
andq $1, %rdi

LLVM should be able to replace these two with a single X86_64 instruction: btq reg,25
The generated code is correct in both cases. It just isn’t optimized in the immediate operand case.

unsigned long long IsBitSetA(unsigned long long val)
{
return (val & (1ULL<<25)) != 0ULL;
}

unsigned long long IsBitSetB(unsigned long long val, int index)
{
return (val & (1ULL<<index)) != 0ULL;
}

Do we want to use btq?

On many x86_64 processors, shrq/andq is “hard-coded”, but btq will execute in microcode, and will likely be worse performing.

Which BTQ? There are three flavors.

BTQ reg/reg
BTQ reg/mem
BTQ reg/imm

I can imagine that the reg/reg and especially the reg/mem versions would be slow. However the shrq/and versions with the same operands would be slow as well. There’s even a compiler comment about the reg/mem version saying “this is for disassembly only”.

But I doubt BTQ reg/imm would be microcoded.

Another approach using LowerToBT would be to match lshr reg/imm first and then, if the following instruction was an and reg,1, replace both with a BT. It doesn’t look like LowerToBT as-is can do that right now, since it is matching the and instruction.

I think it's actually matching the comparison: LowerToBT is called by
LowerSetCC, which has a comment saying:

  // Optimize to BT if possible.
  // Lower (X & (1 << N)) == 0 to BT(X, N).
  // Lower ((X >>u N) & 1) != 0 to BT(X, N).
  // Lower ((X >>s N) & 1) != 0 to BT(X, N).

This doesn't match the immediate/LSHR version, because the ANDed
result is returned directly, and there's no comparison with 0.

If it is indeed profitable to generate the BT (a quick glance at
Agner's tables for Merom/Haswell shows it probably is), I would start
by looking at PerformAndCombine, to replace the two nodes with an
X86ISD::BT.
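
Something like this, sketched against the 3.7-era hooks (the node order and guards are from memory, so treat it as untested pseudocode rather than a patch):

  // Hypothetical helper for PerformAndCombine: fold
  //   (and (srl X, C), 1)  ->  (zext (setb (bt X, C)))
  // mirroring what LowerToBT already builds for the register case.
  static SDValue combineShiftAndToBT(SDNode *N, SelectionDAG &DAG) {
    EVT VT = N->getValueType(0);
    SDValue Shift = N->getOperand(0);
    ConstantSDNode *One = dyn_cast<ConstantSDNode>(N->getOperand(1));
    if (!One || One->getZExtValue() != 1 || Shift.getOpcode() != ISD::SRL)
      return SDValue();
    if (!isa<ConstantSDNode>(Shift.getOperand(1)))
      return SDValue();   // only the immediate-index form
    SDLoc dl(N);
    // BT leaves the tested bit in CF; SETCC with COND_B materializes it
    // as 0/1 in a register.
    SDValue BT = DAG.getNode(X86ISD::BT, dl, MVT::i32,
                             Shift.getOperand(0), Shift.getOperand(1));
    SDValue SetCC = DAG.getNode(X86ISD::SETCC, dl, MVT::i8,
                                DAG.getConstant(X86::COND_B, dl, MVT::i8),
                                BT);
    // (Assumes VT is wider than i8; real code needs the usual guards.)
    return DAG.getNode(ISD::ZERO_EXTEND, dl, VT, SetCC);
  }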

-Ahmed

Looking at the Intel Optimization Reference Manual, page 14-14, for Atom:

BT m16, imm8 / BT mem, imm8    latency 2, 1    throughput 1
BT m16, r16  / BT mem, reg     latency 10, 9   throughput 8
BT reg, imm8 / BT reg, reg     latency 1       throughput 1

On page C-26 they lower that throughput to 0.5 clock cycles for BT reg, imm8.

The posted functions were simplified for tracking down the code generation problem. In general, the comparison between using BTQ reg,imm vs SHRQ/ANDQ for bit testing is even worse because you have to MOVE the tested reg to a temporary before the SHRQ/ANDQ. And all of these instructions require a REX prefix (well, not the AND). The result is some code bloat (3 instructions vs 1) and a little register pressure.

I’m not an X86 expert, but I’d still like to understand why you are comparing 1 instruction to 3; the result does not seem exactly the same, since (if I understand correctly) BT only sets the carry flag while the other combination provides the result in a register.

The full sequence is:

  btq %rsi, %rdi
  sbbq %rax, %rax
  andq $1, %rax
  popq %rbp

vs:

  shrq $25, %rdi
  andq $1, %rdi
  movq %rdi, %rax
  popq %rbp

(I’m not saying that btq is not preferable, just that I don’t see a difference in the number of instructions needed to get the result).

I agree with Ahmed that you probably should look into PerformAndCombine() in X86ISelLowering.cpp.

Best,

Mehdi

According to Agner’s docs, many CPUs have slower BT than TEST; Haswell has 0.5 inverse throughput as opposed to 0.25, Atom has 1 instead of 0.5, and Silvermont can’t even dual-issue BT (it locks both ALUs). So while BT does seem to have a shorter instruction encoding than TEST for TEST reg, imm32 where imm32 has one bit set, it might not be the best idea to always change TEST reg, 0x1000 to BT reg, 12…

Fiona

Even more importantly, TEST benefits from macro-fusion and BT does not; BT is vulnerable to partial-flags-update stalls and TEST is not.

Use TEST. Don’t use BT.

– Steve

Thanks Fiona and Stephen, let me go over this. I’ve been reading the Intel optimization manual and not Agner.

The bug actually has 2 parts:

(1) in LowerToBT() there is some logic that determines whether to use a BT or a TEST, but it gets the dividing line between the two wrong:

11994,11995c11994,11997
< // Use BT if the immediate can't be encoded in a TEST instruction.
< if (!isUInt<32>(AndRHSVal) && isPowerOf2_64(AndRHSVal)) {
---
> // 16 bit mode:       32 bit mode:       64 bit mode:
> // TEST reg,imm  24b  TEST reg,imm  40b  TEST reg,imm  80b
> // BT reg,imm    32b  BT reg,imm    32b  BT reg,imm    40b
> if (AndRHSVal >= 256 && isPowerOf2_64(AndRHSVal)) {

(2) For an expression like (var & (1 << 37)), Clang emits LSHR/AND.
This actually confuses the hell out of LLVM, which generates fairly similar assembly even at -O3. It would be better for Clang to emit AND reg, #(1<<37) and let LLVM recognize + optimize that.

You can argue whether BT should be emitted and I’ll work through the Intel/Agner docs. But LLVM is generating bad code from bad IR. I’ll have a patch for Clang later today and you can look these over.

BTW, macro-fusion of TEST only applies to conditional jumps, not to CMOV.

Sounds like we should use BT with -Os, but TEST otherwise. This is probably a common enough instruction that it might make a good impact on code size.
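
If someone tries that, the gate in the combine could presumably be as small as this (attribute name per the 3.7-era API, unverified):

  // Only form BT when the function is marked optsize; otherwise keep
  // the TEST / SHR+AND forms.
  const Function *F = DAG.getMachineFunction().getFunction();
  if (!F->hasFnAttribute(Attribute::OptimizeForSize))
    return SDValue();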

Pete

I think the partial-update issue isn’t really a valid concern; see Agner Fog, p. 142. I don’t think LLVM is going to emit this fragment.

That’s not how partial-flags update stalls work. There is no independent tracking of individual bits in EFLAGS. This means that BT + CMOVNZ has a false dependency on whatever instruction wrote to EFLAGS before BT and requires an extra µop vis-a-vis TEST + CMOVNZ or SHR + AND.

Please do not use BT. It is a performance hazard. If you don’t believe me for some reason, here’s the relevant quote from Agner:

"BT, BTC, BTR, and BTS change the carry flag but leave the other flags unchanged. This causes a false dependence on the previous value of the flags and costs an extra μop. Use TEST, AND, OR and XOR instead of these instructions.”

– Steve

The problem is that REX TEST reg,#(1<<37) is 10 bytes vs 5 bytes for REX BT reg,37.
That’s a large space penalty to pay for a possible partial update stall.

So the idea of generating BT for -Os and TEST for -Ofast makes sense to me.
Regardless, LowerToBT is generating TEST for small values.

I’m still tracking down the Clang issue and I’d like folks to look at that.

Is that even a valid instruction? I thought TEST only took 32-bit immediates.

Fiona

My bad on that. So that's what the comment meant.
That means BT is pretty much req'd in the 64b case.

Yeah, the alternative is to do movabs and then test, which is doable but I’m not sure if it’s worth it (surely BT + risk of flags merging penalty has to be better than two ops, one of which is ~9-10 bytes).

Fiona

In the 64b case we can TEST with mask from memory. This is *still* preferable to BT in most instruction streams.

One can reasonably argue that we should generate BT under -Os (I disagree, but only because I think that optimizing for code size to the point of completely disregarding performance is pointlessly stupid; if that’s what -Os is meant to be, then by all means use BT). I don’t think one can make a case for BT outside of -Os.

– Steve

You are currently paying a 16b penalty for TEST vs BT in the 32b case.
That may make it worth checking for the -Os flag.