[IR canonicalization] 6 ways to choose {-1,0,1}

I’m looking at the output of memcmp() expansion (D34904), and I noticed that there are many ways to produce the common positive/zero/negative comparison result in IR.

For the following 6 functionally equivalent C source functions, we produce 6 different versions of IR, which lead to 6 different asm outputs for x86. Which of these should we choose as the canonical IR form?

1. Two selects

int zero_negone_one(int x, int y) {
  if (x == y) return 0;
  if (x < y) return -1;
  return 1;
}

define i32 @zero_negone_one(i32, i32) {
  %3 = icmp eq i32 %0, %1
  %4 = icmp slt i32 %0, %1
  %5 = select i1 %4, i32 -1, i32 1
  %6 = select i1 %3, i32 0, i32 %5
  ret i32 %6
}

2. Two selects, but different

int zero_one_negone(int x, int y) {
  if (x == y) return 0;
  if (x > y) return 1;
  return -1;
}

define i32 @zero_one_negone(i32, i32) {
  %3 = icmp eq i32 %0, %1
  %4 = icmp sgt i32 %0, %1
  %5 = select i1 %4, i32 1, i32 -1
  %6 = select i1 %3, i32 0, i32 %5
  ret i32 %6
}

3. Select and zext

int negone_one_zero(int x, int y) {
  if (x < y) return -1;
  if (x > y) return 1;
  return 0;
}

define i32 @negone_one_zero(i32, i32) {
  %3 = icmp slt i32 %0, %1
  %4 = icmp sgt i32 %0, %1
  %5 = zext i1 %4 to i32
  %6 = select i1 %3, i32 -1, i32 %5
  ret i32 %6
}

4. Select and sext

int negone_zero_one(int x, int y) {
  int sel = x < y ? -1 : 0;
  if (x > y) return 1;
  return sel;
}

define i32 @negone_zero_one(i32, i32) {
  %3 = icmp sgt i32 %0, %1
  %4 = icmp slt i32 %0, %1
  %5 = sext i1 %4 to i32
  %6 = select i1 %3, i32 1, i32 %5
  ret i32 %6
}

5. Subs and shifts

int neg101_sub_shifty(int x, int y) {
  int r = (x - y) >> 31;
  r += (unsigned)(y - x) >> 31;
  return r;
}

define i32 @neg101_sub_shifty(i32, i32) {
  %3 = sub nsw i32 %0, %1
  %4 = ashr i32 %3, 31
  %5 = sub nsw i32 %1, %0
  %6 = lshr i32 %5, 31
  %7 = add nsw i32 %4, %6
  ret i32 %7
}

6. Zexts and sub

int neg101_cmp_sub(int x, int y) {
  return (x>y) - (x<y);
}

define i32 @neg101_cmp_sub(i32, i32) {
  %3 = icmp sgt i32 %0, %1
  %4 = zext i1 %3 to i32
  %5 = icmp slt i32 %0, %1
  %6 = zext i1 %5 to i32
  %7 = sub nsw i32 %4, %6
  ret i32 %7
}

https://godbolt.org/g/UnM9H7

Alive proof that these are logically equivalent:
http://rise4fun.com/Alive/b4D
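
For a quick local sanity check, a small C harness along these lines can cross-check the six variants (hedged: it assumes 32-bit int and an arithmetic right shift of negative values for the shifty variant, like the original code does, and keeps inputs small to avoid signed overflow):

#include <assert.h>

/* Compact restatements of the six variants, checked against each other. */
static int zero_negone_one(int x, int y) { return x == y ? 0 : (x < y ? -1 : 1); }
static int zero_one_negone(int x, int y) { return x == y ? 0 : (x > y ? 1 : -1); }
static int negone_one_zero(int x, int y) { return x < y ? -1 : (x > y ? 1 : 0); }
static int negone_zero_one(int x, int y) { return x > y ? 1 : (x < y ? -1 : 0); }
static int neg101_sub_shifty(int x, int y) {
  int r = (x - y) >> 31;          /* assumes arithmetic right shift */
  r += (unsigned)(y - x) >> 31;
  return r;
}
static int neg101_cmp_sub(int x, int y) { return (x > y) - (x < y); }

int main(void) {
  for (int x = -100; x <= 100; ++x)
    for (int y = -100; y <= 100; ++y) {
      int r = zero_negone_one(x, y);
      assert(zero_one_negone(x, y) == r);
      assert(negone_one_zero(x, y) == r);
      assert(negone_zero_one(x, y) == r);
      assert(neg101_sub_shifty(x, y) == r);
      assert(neg101_cmp_sub(x, y) == r);
    }
  return 0;
}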

Recent patch related to this pattern:
https://reviews.llvm.org/D34278

Hi Sanjay,
I’m wondering if there was any progress on this since there don’t appear to have been any responses. I don’t have a particularly strong opinion on which form we should canonicalize to, but I do think we should settle on one.

I’ve recently written a patch (not committed yet) for codegen that basically tries to detect all of these in the SDAG. I’d be quite happy if I could simplify that patch before committing.

Nemanja

Hi Nemanja -

No, I didn’t make any IR transforms based on this, but I agree we should canonicalize the IR. Like you, I was trying to improve the codegen (for x86 in particular), but that patch was growing every time I found a new way to express this operation!

I’m going to rule out option #4 because #3 is better (always prefer a zext over a sext for bit-tracking purposes). I’d also kill option #5 because it doesn’t offer any particular advantage, but needs more instructions than a version with select. #6 might be the best option for codegen, but it has more instructions and a subtract seems harder to reason about in IR than select-of-constants.

If you agree with the above reasoning, then that brings us to a choice about select vs. zext that goes back to the questions in:
https://reviews.llvm.org/D24480

I’m now in favor of creating more selects in IR (ie, abandon that patch and reverse the other pattern). We used to treat selects in IR as more complex operations than basic logic ops, but the backend has improved enough that we probably don’t need to do that anymore.

A few more DAGCombiner transforms in foldSelectOfConstants() are needed to ensure that, if we pick selects in IR, we transform them to zext/sext/logic when the select has -1/0/1 constants. This would probably happen under an expanded version of the TLI.convertSelectOfConstantsToMath() target hook.
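
To make that concrete, here is a sketch in C (not the actual DAGCombiner code; the exact folds the backend chooses may differ) of the arithmetic identities a select-of-{-1,0,1}-constants fold can rely on, with c standing in for the i1 condition:

#include <assert.h>

int main(void) {
  for (int c = 0; c <= 1; ++c) {
    assert((c ? -1 : 1) == ((-c) | 1));      /* select c, -1, 1 -> or (sext c), 1        */
    assert((c ? 1 : -1) == ((c << 1) - 1));  /* select c, 1, -1 -> sub (shl (zext c), 1), 1 */
    assert((c ? -1 : 0) == -c);              /* select c, -1, 0 -> sext c                */
    assert((c ? 0 : -1) == (c - 1));         /* select c, 0, -1 -> add (zext c), -1      */
    assert((c ? 1 : 0) == c);                /* select c, 1, 0  -> zext c                */
  }
  return 0;
}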

If anyone agrees with me this far…the choice is down to #1 or #2. This could be decided by putting the “larger” (signed or unsigned?) constant on the left, or we just let that go and expect the backend to deal with it.

Just to offer some random pieces of data:
From a GVN/CSE/etc. perspective, the non-zext’d versions are the most likely to be optimized, and of those, the icmp/select form is the most likely to be optimized.

Especially when combined as part of larger IR: for example, an “and of a zext’d icmp and a regular icmp” is less likely to be optimized.
I mean this from the perspective of being able to deduce things about it later.

But past that, I’d say pretty much any of them are fine.

OK, if it’s all the same to everyone else then, I’m in favour of 1/2 from a code-gen perspective given how we have the PPC back end set up.

5. Subs and shifts
int neg101_sub_shifty(int x, int y) {
  int r = (x - y) >> 31;

What if x is INT_MIN and y is greater than zero? Won't r be poison?

  r += (unsigned)(y - x) >> 31;

Ditto with respect to y being INT_MIN and x greater than zero.

Oops - yes. Ignore this option. When I initially modelled this case in
Alive, I only checked that the 'sub nsw' form could be transformed into a
form with cmp/select rather than the other way around:
http://rise4fun.com/Alive/SKBN
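
For concreteness, a small C snippet along these lines (not from the thread) trips the overflow; in C it is undefined behavior, and with the 'sub nsw' form in IR the result is poison:

#include <limits.h>
#include <stdio.h>

static int neg101_sub_shifty(int x, int y) {
  int r = (x - y) >> 31;          /* overflows when x == INT_MIN and y > 0 */
  r += (unsigned)(y - x) >> 31;   /* overflows when y == INT_MIN and x > 0 */
  return r;
}

int main(void) {
  /* Build with -fsanitize=undefined to see the overflow reported at runtime. */
  printf("%d\n", neg101_sub_shifty(INT_MIN, 1));
  return 0;
}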

FWIW: it would be nice to start canonicalizing more on select. Some of the instcombine rewrites we have at the moment canonicalize to arithmetic instead, and it is very hard to justify the correctness of these (to say the least).
So canonicalizing more stuff on select would give us critical mass to remove all the remaining canonicalizations to arithmetic.

Thanks,
Nuno

IMO, canonicalization is driven by a few priorities:

- How easy is it for the rest of the optimizer to reason about the IR
- How easy is it to lower the IR to "The Right Thing (TM)"

I think it is hard to know which is best without resorting to a case-by-case analysis, mostly because of stuff like ComputeKnownBits.

Outside a few special cases, ComputeKnownBits isn’t very smart when it comes to select: it just takes the intersection of the two possibilities. This can lose quite a bit of information.
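
To make the intersection point concrete, here is a rough sketch in C (not LLVM's actual KnownBits code, just the merge idea) of why a select between -1 and a {0,1} value ends up with nothing known:

#include <stdint.h>

typedef struct {
  uint32_t zero; /* bits known to be 0 */
  uint32_t one;  /* bits known to be 1 */
} KnownBits32;

/* Merge rule for a select: only bits known identically in both arms survive. */
static KnownBits32 known_select(KnownBits32 t, KnownBits32 f) {
  KnownBits32 k = { t.zero & f.zero, t.one & f.one };
  return k;
}

int main(void) {
  KnownBits32 true_arm  = { 0x00000000u, 0xFFFFFFFFu }; /* constant -1: all bits known one   */
  KnownBits32 false_arm = { 0xFFFFFFFEu, 0x00000000u }; /* value in {0,1}: top 31 bits zero  */
  KnownBits32 merged = known_select(true_arm, false_arm);
  /* merged.zero == 0 and merged.one == 0: nothing is known about the result. */
  return (int)(merged.zero | merged.one); /* 0 */
}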

I’d feel a lot better about always canonicalizing to select if the backends are strengthened to lower those selects to something good (without relying on CMOV, etc.) and if ComputeKnownBits dealt with select better.

I’m trying to get the backend cleaned up…but it is a mess.

On x86 at least, we could take a select in IR, combine it into logic ops, turn that back into a select, and then lower it into math ops by recognizing some extra special case.

The untangling involves hacks like:
https://reviews.llvm.org/rL307471

I don't see how ComputeKnownBits can be improved here. Because the return
value can be either 0 (all bits clear) or -1 (all bits set), every bit can
be 0 or 1, so nothing is known -- which carries no information, but is
already optimal.
This is a fundamental limitation of the ComputeKnownBits representation.

Note we could drastically improve it by having two sets of known bits:
- one for when the MSB is 0
- one for when the MSB is 1

But that is another discussion, I guess (a rough sketch of the idea follows below).
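
Here is a purely hypothetical illustration in C (no such analysis exists today) of what splitting the tracked facts on the sign bit would buy for a {-1,0,1} result:

#include <stdint.h>

typedef struct { uint32_t zero, one; } KnownBits32;

/* Hypothetical "split on the MSB" tracking: one fact set per sign-bit value. */
typedef struct {
  KnownBits32 if_msb_clear; /* facts valid when the sign bit is 0 */
  KnownBits32 if_msb_set;   /* facts valid when the sign bit is 1 */
} SignSplitKnownBits;

int main(void) {
  SignSplitKnownBits k;
  /* Sign bit clear: the value is 0 or 1, so the upper 31 bits are known zero. */
  k.if_msb_clear = (KnownBits32){ 0xFFFFFFFEu, 0x00000000u };
  /* Sign bit set: the only possible value is -1, so every bit is known one. */
  k.if_msb_set   = (KnownBits32){ 0x00000000u, 0xFFFFFFFFu };
  (void)k;
  return 0;
}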

Outside a few special cases, ComputeKnownBits isn't very smart when it
comes to select: it just takes the intersection of the two possibilities.
This can lose quite a bit of information.

This is essentially unfixable without a real bit range analysis.

I'd feel a lot better about always canonicalizing to select if the
backends are strengthened to lower those selects to something good (without
relying on CMOV, etc.) and if ComputeKnownBits dealt with select better.

The latter is going to take real design and work. :)