InstCombine adds bit masks, confuses self, others

Look at this silly function:

$ cat small.c
unsigned f(unsigned a, unsigned *p) {
  unsigned x = a/4;
  p[0] = x;
  p[1] = x+x;
  return p[1] - 2*p[0];
}

GCC turns this into straightforward code and figures out the 0 return value (p[0] holds x and p[1] holds 2*x, so p[1] - 2*p[0] is always 0):

  shrl $2, %edi
  movl %edi, (%rsi)
  addl %edi, %edi
  movl %edi, 4(%rsi)
  movl $0, %eax
  ret

LLVM optimizes the code:

$ clang -O -S -o- small.c -emit-llvm

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4, !tbaa !0
  %1 = lshr i32 %a, 1
  %mul = and i32 %1, 2147483646
  %sub = sub i32 %add, %mul
  ret i32 %sub
}

InstCombine has obfuscated 'x+x' so badly that it can't even recognize it itself. DAGCombine will eventually untangle the mess and figure out that the return value is 0, but that is too late. The loop optimization passes and scalar evolution don't get to see the simple 'x' and '2*x' expressions.

The problem is these transformations in InstCombineShifts.cpp:

      // If we have ((X >>? C) << C), turn this into X & (-1 << C).
      // (X >>? C1) << C2 --> X << (C2-C1) & (-1 << C2)
      // (X >>? C1) << C2 --> X >>? (C1-C2) & (-1 << C2)

The shl instruction is just as much an arithmetic instruction as it is a bitwise logical one. It is used as the canonical form for x+x, 4*x, and so on. The transforms above turn an arithmetic expression into a purely logical one, and it is very hard to recover the original arithmetic expression.
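
As a minimal illustration of that duality (assuming a recent clang at -O; the function names are mine and the IR in the comments is the expected canonical form, not pasted compiler output):

  /* Both of these end up as a plain shl in the IR. */
  unsigned twice(unsigned x)     { return x + x; }   /* %add = shl i32 %x, 1 */
  unsigned quadruple(unsigned x) { return 4 * x; }   /* %mul = shl i32 %x, 2 */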

Disabling the transformations produces the straightforward:

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %add = shl nuw nsw i32 %div, 1
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4, !tbaa !0
  ret i32 0
}

The problem is not so much figuring out the 0 return value. The problem is that the 'canonical form' produced by InstCombine is hiding trivial arithmetic expressions like 'x+x' from scalar evolution.

I am not sure how best to fix this. If possible, InstCombine's canonicalization shouldn't hide arithmetic progressions behind bit masks. At least, it seems these transformations should be disabled unless (X >> C).hasOneUse(). They aren't exactly optimizations.

This:

  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %add = shl nuw nsw i32 %div, 1

is better than this:

  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4, !tbaa !0
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646

/jakob

The entire concept of cleverly converting arithmetic to bit masks seems like the perfect domain for DAGCombine instead of InstCombine:

  1. We know the architecture, so we can make intelligent decisions about what masks are cheap or expensive.
  2. We know the addressing modes, so we can fold arithmetic into them.
  3. There are no more high-level optimizations we’re trying to enable: GVN, SCEV, loop opts, and other deep math optimizations have all already had their shot at the code.

Does sinking these into the DAGCombine layer help? How much does it break?

In the past, LLVM has been incredibly sensitive to canonicalization changes, so I’m worried, but I’ve also seen a ton of code similar to what you posted, and it simply makes no sense to me.

I wrote a pile of heroics in the addressing mode matching for x86 to specifically work around these obnoxious representations.

I don’t know what would break, but DAGCombine already has these tricks:

$ cat small2.c
unsigned f(unsigned x) {
  return x >> 2 << 3 >> 2 << 5;
}

With the shift transforms disabled, we get:

define i32 @f(i32 %x) nounwind uwtable readnone ssp {
entry:
  %shr = lshr i32 %x, 2
  %shl = shl i32 %shr, 3
  %shr1 = lshr exact i32 %shl, 2
  %shl2 = shl i32 %shr1, 5
  ret i32 %shl2
}

But DAGCombine goes:

  shll $4, %edi
  andl $-64, %edi
  movl %edi, %eax
  ret

And you are right, we only get the bit masks when it is worthwhile.
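
For reference, the combined form corresponds to this C (the function name and constant are my own reduction of x >> 2 << 3 >> 2 << 5 for a 32-bit unsigned, not compiler output):

  /* One shift and one mask, matching the assembly above. */
  unsigned f_reduced(unsigned x) {
    return (x << 4) & ~63u;   /* shll $4, %edi ; andl $-64, %edi */
  }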

/jakob

I think we could try your hasOneUse idea. If we are going to keep
%div, we may as well keep using it and save one instruction in the
canonical form.

Cheers,
Rafael

I really dislike hasOneUse-based “solutions” at the IR / InstCombine layer. They result in strange artifacts during optimization: places where adding similar code turns off optimizations because we fold the similar bits together and reuse parts of the computation.

I would much rather see us devise a reasonable set of canonicalization rules at the IR level that don’t block optimization. Tricks like saving instructions, using bit masks, etc., should all be reserved for the codegen layer / DAGCombine. There, hasOneUse makes perfect sense because you’re trying to peephole through architectural gymnastics, not trying to get systematic scalar optimizations to fire.

Let’s see if we have any evidence that reserving this transform for the DAGCombiner hurts things, how it hurts things, and what we can do about it first?

<day dreaming>
In general I think I agree. Putting the subexpression in an inline
function, for example, would cause the hasOneUse hack to fail, since the
multiple uses would only show up after inlining. To be fully general we
would have to be able to perform the reverse combine for any
transformation we refuse to do because the value has more than one use.

I also agree that the pipeline should first make the IL canonical, and
then decide what is the best way to compute the canonical form. The one
thing I find unfortunate is that in LLVM the second step lives almost
entirely at the DAG and MI levels. There is no point where we have IL
passes that know more about the target.
</day dreaming>

Looking a bit more at this particular case, I think the problem is not
so much the representation we have chosen (with the masks), nor that the
DAG combiner is better at handling it. It is just that the DAG combiner
runs as a second pass. Running the example through opt a second time
produces:

define i32 @f(i32 %a, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i32 %a, 2
  store i32 %div, i32* %p, align 4
  %0 = lshr i32 %a, 1
  %add = and i32 %0, 2147483646
  %arrayidx1 = getelementptr inbounds i32* %p, i64 1
  store i32 %add, i32* %arrayidx1, align 4
  ret i32 0
}

I realize this is orthogonal to whether using the mask is a good idea,
since the result could be even simpler by some metric if it reused %div,
but it does show that the IL optimizers are stopping while there is more
they can do.

OK. In the end I think I agree we should try it out and decide on the
canonical representation first.

Cheers,
Rafael

In my example, the hasOneUse check won't even help. The second use of 'x+x' only emerges after GVN, after the harm has been done.

However, I am more concerned about hiding things from scalar evolution than from InstCombine itself. Here is an example I was expecting to fail:

$ cat small2.c
struct desc {
  unsigned size : 2;
  unsigned skip : 6;
  unsigned value;
};

void f(struct desc d, unsigned *p) {
  for (unsigned i = 0; i != 64; ++i) {
    *p = 0; p += d.skip;
    *p = 1; p += d.skip;
  }
}

It does the right thing, though:

define void @f(i64 %d.coerce, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %0 = lshr i64 %d.coerce, 2
  %bf.clear = and i64 %0, 63
  %add.ptr.sum = shl nuw nsw i64 %bf.clear, 1
  br label %for.body

for.body:                                         ; preds = %entry, %for.body
  %p.addr.06 = phi i32* [ %p, %entry ], [ %add.ptr3, %for.body ]
  %i.05 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  store i32 0, i32* %p.addr.06, align 4, !tbaa !0
  %add.ptr = getelementptr inbounds i32* %p.addr.06, i64 %bf.clear
  store i32 1, i32* %add.ptr, align 4, !tbaa !0
  %add.ptr3 = getelementptr inbounds i32* %p.addr.06, i64 %add.ptr.sum
  %inc = add i32 %i.05, 1
  %cmp = icmp eq i32 %inc, 64
  br i1 %cmp, label %for.end, label %for.body

for.end:                                          ; preds = %for.body
  ret void
}

Notice how the two gep offsets %bf.clear and %add.ptr.sum have a simple arithmetic relationship that scalar evolution can figure out.

With a very small change, that breaks:

$ cat small3.c
void f(unsigned long d, unsigned *p) {
  unsigned long skip = d/4;
  for (unsigned i = 0; i != 64; ++i) {
    *p = 0; p += skip;
    *p = 1; p += skip;
  }
}

define void @f(i64 %d, i32* nocapture %p) nounwind uwtable ssp {
entry:
  %div = lshr i64 %d, 2
  %0 = lshr i64 %d, 1
  %add.ptr.sum = and i64 %0, 9223372036854775806
  br label %for.body

for.body:                                         ; preds = %entry, %for.body
  %i.03 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %p.addr.02 = phi i32* [ %p, %entry ], [ %add.ptr1, %for.body ]
  store i32 0, i32* %p.addr.02, align 4, !tbaa !0
  %add.ptr = getelementptr inbounds i32* %p.addr.02, i64 %div
  store i32 1, i32* %add.ptr, align 4, !tbaa !0
  %add.ptr1 = getelementptr inbounds i32* %p.addr.02, i64 %add.ptr.sum
  %inc = add i32 %i.03, 1
  %cmp = icmp eq i32 %inc, 64
  br i1 %cmp, label %for.end, label %for.body

for.end:                                          ; preds = %for.body
  ret void
}

Now it suddenly becomes very difficult for scalar evolution to figure out the relationship between %div and %add.ptr.sum, even though %add.ptr.sum is still just 2 * %div.

I think the canonical form should look more like the bit field example - don’t hide the shl behind a bit mask.

I’ll try deferring the transforms that hide an shl instruction to DAGCombine. I’ll run some benchmarks.

/jakob

Note that adding the bit masks does make sense as a canonical form; it is not just a premature optimization.

Expressions like x >> 2 << 3 >> 5 << 6 get canonicalized to a single shift + a mask that way.

I think it would make sense to leave the last shl alone, so the above expression gets canonicalized to:

  ((x >> 4) & mask) << 6

DAGCombine can reduce it completely if it wants.
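
To make that concrete, here is my own hand-worked arithmetic for x >> 2 << 3 >> 5 << 6 on a 32-bit unsigned (the masks and function names are mine, not compiler output):

  /* Proposed canonical form: one right shift plus a mask, last shl left alone. */
  unsigned canonical(unsigned x) {
    return ((x >> 4) & 0x07ffffffu) << 6;
  }

  /* What DAGCombine could reduce it to: a single shift and a single mask. */
  unsigned reduced(unsigned x) {
    return (x << 2) & ~63u;
  }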

This also matches the code generated for the bit field access in my small2.c example:

entry:
  %and = lshr i64 %d, 2
  %div = and i64 %and, 63
  %add.ptr.sum = shl nuw nsw i64 %div, 1
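
In C terms, this is roughly the following (my own paraphrase of the three IR lines above; the function and variable names are hypothetical):

  /* Extract the 6-bit field with one shift + mask, then keep an explicit shl for 2*skip. */
  void g(unsigned long d, unsigned long *skip, unsigned long *sum) {
    *skip = (d >> 2) & 63;   /* %div: the bit-field value              */
    *sum  = *skip << 1;      /* %add.ptr.sum: 2 * skip, a visible shl  */
  }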

/jakob

I tried disabling just the InstCombine transforms that hide shl instructions behind bitmasks. Even though DAGCombine has the same transforms, it causes some pretty bad regressions:

External/SPEC/CINT95/147_vortex/147_vortex 0.294 0.322 +9.6% +40mB
MultiSource/Benchmarks/Olden/tsp/tsp 0.680 0.748 +9.9% +41mB
SingleSource/Benchmarks/Shootout-C++/except 0.116 0.128 +10.8% +45mB
SingleSource/Benchmarks/Shootout/strcat 0.102 0.113 +11.1% +46mB
SingleSource/Benchmarks/Shootout-C++/hash 0.455 0.507 +11.4% +47mB
External/Povray/povray 2.015 2.246 +11.5% +47mB
External/SPEC/CINT2000/255_vortex/255_vortex 1.814 2.044 +12.7% +52mB
SingleSource/Benchmarks/Shootout-C++/heapsort 1.871 2.132 +13.9% +57mB
SingleSource/Benchmarks/Shootout-C++/ary3 1.087 1.264 +16.3% +65mB
MultiSource/Benchmarks/SciMark2-C/scimark2 27.491 23.596 -14.2% -66mB
MultiSource/Benchmarks/Olden/bisort/bisort 0.360 0.428 +19.0% +75mB
MultiSource/Benchmarks/Olden/bh/bh 1.074 1.287 +19.9% +79mB

(Running on Sandy Bridge, x86-64)

I’ll try to figure out why.

/jakob

I wonder about your idea of still combining most of the shifts, but leaving the last bits uncombined… Certainly the combining effects seem particularly important to capture because we’ll only look through a finite number of shifts to determine demanded bits, etc.

I would expect that, even with the fix, demanded bits might need adjusting to cope with the pattern.

But this is all theorizing. =] The actual regressions will likely tell more.

Yes, the idea is to only preserve shl instructions because of their arithmetic/logical duality. I’ll keep folding ashr/lshr.

These transformations all still run (see the worked C examples further below):

X << C1 >> C2 → X << (C1-C2) & Mask (when C1 >= C2)

X << C1 >> C2 → X >> (C2-C1) & Mask (when C2 > C1)

X << C1 << C2 → X << (C1+C2)

X >> C1 >> C2 → X >> (C1+C2)

I also allow folding of exact right shifts:

X >>exact C1 <<nuw C2 → X <<nuw (C2-C1)

X >>exact C1 << C2 → X >>exact (C1-C2)

Since a bit mask isn’t needed, scalar evolution should be able to reason about this.

Long chains of shifts should still be collapsed this way.
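
For concreteness, here is my own worked example of the first two rules for 32-bit unsigned values (the folded constants and function names are computed by hand, not taken from the compiler):

  /* (x << 5) >> 2 folds to X << (C1-C2) & Mask, since C1 >= C2: */
  unsigned fold_left(unsigned x)  { return (x << 3) & 0x3ffffff8u; }

  /* (x << 2) >> 5 folds to X >> (C2-C1) & Mask, since C2 > C1: */
  unsigned fold_right(unsigned x) { return (x >> 3) & 0x07ffffffu; }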

As for the actual regressions telling more: all of the regressions I posted turned out to have identical assembly before and after the change. I'll try measuring again while holding my breath.

/jakob