rotl: undocumented LLVM instruction?

We’ve recently moved our project from LLVM 3.6 to LLVM 3.9. I noticed one of our code generation tests is breaking in 3.9.

The test is:

; RUN: llc < %s -march=xstg | FileCheck %s

define i64 @bclr64(i64 %a, i64 %b) nounwind readnone {
entry:
; CHECK: bclr r1, r0, r1, 64
%sub = sub i64 %b, 1
%shl = shl i64 1, %sub
%xor = xor i64 %shl, -1
%and = and i64 %a, %xor
ret i64 %and
}

I ran llc with -debug to get a better idea of what’s going on and found:

Initial selection DAG: BB#0 ‘bclr64:entry’
SelectionDAG has 14 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
t6: i64 = sub t4, Constant:i64<1>
t7: i64 = shl Constant:i64<1>, t6
t9: i64 = xor t7, Constant:i64<-1>
t10: i64 = and t2, t9
t12: ch,glue = CopyToReg t0, Register:i64 %R1, t10
t13: ch = XSTGISD::Ret t12, Register:i64 %R1, t12:1

Combining: t13: ch = XSTGISD::Ret t12, Register:i64 %R1, t12:1

Combining: t12: ch,glue = CopyToReg t0, Register:i64 %R1, t10

Combining: t11: i64 = Register %R1

Combining: t10: i64 = and t2, t9

Combining: t9: i64 = xor t7, Constant:i64<-1>
… into: t15: i64 = rotl Constant:i64<-2>, t6

Combining: t10: i64 = and t2, t15

Combining: t15: i64 = rotl Constant:i64<-2>, t6

Combining: t14: i64 = Constant<-2>

Combining: t6: i64 = sub t4, Constant:i64<1>
… into: t17: i64 = add t4, Constant:i64<-1>

Combining: t15: i64 = rotl Constant:i64<-2>, t17

These rotl instructions weren’t showing up when I ran llc 3.6 and that’s completely changing the generated code at the end which means the test fails (and it’s less optimal than it was in 3.6).

I’ve been looking in the LLVM language docs (3.9 version) and I don’t see any documentation on ‘rotl’. What does it do? Why isn’t it in the docs?

Phil

rotl is not an IR instruction, it's a node in the
instruction-selection dag (ISD::ROTL). It performs bitwise rotation to
the left.

The change to your code is probably due to this new transformation:
http://llvm.org/viewvc/llvm-project?view=revision&revision=232572

I believe some of the ISDs were introduced to allow for DAG optimizations under the assumption that some of the major architectures directly support these types of instructions.

-Ryan

Is there any way to get it to delay this optimization where it goes from this:

Initial selection DAG: BB#0 ‘bclr64:entry’
SelectionDAG has 14 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
t6: i64 = sub t4, Constant:i64<1>
t7: i64 = shl Constant:i64<1>, t6
t9: i64 = xor t7, Constant:i64<-1>
t10: i64 = and t2, t9
t12: ch,glue = CopyToReg t0, Register:i64 %R1, t10
t13: ch = XSTGISD::Ret t12, Register:i64 %R1, t12:1

Combining: t13: ch = XSTGISD::Ret t12, Register:i64 %R1, t12:1

Combining: t12: ch,glue = CopyToReg t0, Register:i64 %R1, t10

Combining: t11: i64 = Register %R1

Combining: t10: i64 = and t2, t9

Combining: t9: i64 = xor t7, Constant:i64<-1>
… into: t15: i64 = rotl Constant:i64<-2>, t6

…to this:

Optimized lowered selection DAG: BB#0 ‘bclr64:entry’
SelectionDAG has 13 nodes:
t0: ch = EntryToken
t2: i64,ch = CopyFromReg t0, Register:i64 %vreg0
t4: i64,ch = CopyFromReg t0, Register:i64 %vreg1
t17: i64 = add t4, Constant:i64<-1>
t15: i64 = rotl Constant:i64<-2>, t17
t10: i64 = and t2, t15
t12: ch,glue = CopyToReg t0, Register:i64 %R1, t10
t13: ch = XSTGISD::Ret t12, Register:i64 %R1, t12:1

That combining of the xor & and there ends up giving us suboptimal results as compared with 3.6.

For example, in 3.6 the generated code is simply:

bclr64: # @bclr64

BB#0: # %entry

addI r1, r1, -1, 64
bclr r1, r0, r1, 64
jabs r511

Whereas with 3.9 the generated code is:

bclr64: # @bclr64

BB#0: # %entry

addI r1, r1, -1, 64
movimm r2, -2, 64
rol r1, r2, r1, 64
bitop1 r1, r0, r1, AND, 64
jabs r511

… it seems to be negatively impacting some of our larger benchmarks as well that used to contains several bclr (bit clear) commands but now contain much less.

Phil

… or perhaps to rephrase:

In 3.9 it seems to be doing a smaller combine much sooner, whereas in 3.6 it deferred that till later in the instruction selection pattern matching - the latter was giving us better results because it seems to match a larger pattern than the former did in the earlier stage.

Phil

Setting the ISD::ROTL to Expand doesn’t work? (via SetOperation)

You could also do a Custom hook if that’s what you’re looking for.

I could try setting ISD::ROTL to Expand… however, we do have a rol op and we’d like the ISD::ROTL to map to it. If I set it to Expand it’s not going to do that, right?

I think in this case we’re just getting the ISD::ROTL a bit too soon in the process and that’s causing us to miss other optimization opportunities later on.

Phil

Change the DAGCombine.

One option may be to prevent the formation of ROTL, if possible, and then generating rol by hand.
Marking it as "expand" would likely stop the DAG combiner from creating it. Then you could "preprocess" the selection DAG before the instruction selection and do the pattern matching yourself.

-Krzysztof

What is the advantage of preventing combining the rotl instead of teach the selection to match this extended pattern that includes the rotl and generates the bclr code?

My reply wasn't entirely accurate---in the OP's case the desired output does not have rol at all. In that situation there isn't much that the selection can do, if the ROTL was generated. Adding selection patterns for bclr that handle all the DAGs that can come out of the combiner is just masking the problem. A change in the combiner may lead to generating yet another DAG that will not be covered, and it will not be detected unless someone checks that specific output.

In general, waiting until selection may not work, since the combiner may fold pieces of the larger pattern (that we would like to replace as a whole) into surrounding operations, effectively inhibiting that replacement.

-Krzysztof

This still isn’t clear to me that the rotl isn’t desired here. The fact that this specific example is pessimized by the rotl does not mean that other examples wouldn’t swing the other way.

I haven’t worked on the DAG for some time now, but as I remember it, the important thing to figure was if a combine is simplifying a pattern without losing information (i.e. that it is still analyzable).

I know that our target had a lot of target specific combine that are only running during the second combine phase and were producing target specific nodes in the DAG (like bclr).
I’m not sure there is a better way for a target that handling all the patterns exposed by (supported) generic nodes. (I expect the generic DAG combine to canonicalize the representation in the space of these generic nodes.)