Failing to optimize boolean ops on cmps

We have several optimizations in InstCombine for bitwise logic ops (and/or/xor) that fail to fire when the equivalent logic is expressed with compares. Example:

define i8 @or_and_not(i8 %a, i8 %b) {
  %nota = xor i8 %a, -1
  %and = and i8 %nota, %b
  %res = or i8 %and, %a
  ret i8 %res
}

define i1 @or_and_cmp_not(i32 %a, i32 %b, i1 %c) {
  %cmp = icmp sgt i32 %a, %b
  %cmp_inv = icmp sle i32 %a, %b ; this is the 'not' of %cmp
  %and = and i1 %c, %cmp_inv
  %res = or i1 %cmp, %and
  ret i1 %res
}

$ ./opt -instcombine hidden_not.ll -S

define i8 @or_and_not(i8 %a, i8 %b) {
  %res = or i8 %b, %a
  ret i8 %res
}

define i1 @or_and_cmp_not(i32 %a, i32 %b, i1 %c) {
  %cmp = icmp sgt i32 %a, %b
  %cmp_inv = icmp sle i32 %a, %b
  %and = and i1 %cmp_inv, %c
  %res = or i1 %cmp, %and
  ret i1 %res
}
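
The first function is fully folded to 'or i8 %b, %a', but the equivalent compare-based pattern in the second function is left untouched. For reference, the result we would want there looks like this (a sketch of the desired fold, not anything opt produces today):

define i1 @or_and_cmp_not(i32 %a, i32 %b, i1 %c) {
  %cmp = icmp sgt i32 %a, %b
  %res = or i1 %cmp, %c
  ret i1 %res
}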

I would rather see this added to instsimplify than instcombine.
If you do that, GVN et al. will get this already.

This probably does require a special matcher though:

We have:
if (m_c_And(m_Cmp(Pred, m_Value(), m_Value()),
            m_Cmp(Pred, m_Value(), m_Value())))

and you really want:
if (m_c_And(m_Cmp(Pred, m_Value(), m_Value()),
            m_Cmp(m_Opposite(Pred), m_Value(), m_Value())))
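
For concreteness, here is a sketch of how such a fold could be written against the existing PatternMatch infrastructure. There is no m_Opposite matcher today, so the inverse-predicate check is done after matching; the placement (inside something like InstCombiner::visitOr, with I being the 'or' under inspection) and exact spelling are illustrative, not existing tree code:

// Fold: or (icmp Pred A, B), (and C, (icmp InvPred A, B))
//   --> or (icmp Pred A, B), C
// The second icmp is the 'not' of the first, and X | (C & ~X) == X | C.
using namespace llvm::PatternMatch;

Value *Cmp, *C, *A, *B;
ICmpInst::Predicate Pred1, Pred2;
if (match(&I,
          m_c_Or(m_CombineAnd(m_ICmp(Pred1, m_Value(A), m_Value(B)),
                              m_Value(Cmp)),
                 m_c_And(m_ICmp(Pred2, m_Specific(A), m_Specific(B)),
                         m_Value(C)))) &&
    Pred2 == ICmpInst::getInversePredicate(Pred1))
  return BinaryOperator::CreateOr(Cmp, C);

A real patch would of course also need the usual InstCombine care around multi-use values and commuted forms.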

This can't be an instsimplify though? The values we want in these cases do not already exist:

  %res = or i8 %b, %a
  %res = or i1 %cmp, %c

Ah yes, it can only return one instruction and you need two.

NewGVN could still handle it if instsimplify were changed to return the right binary operators, because it has infrastructure that can handle insertion (it would need a little work).

(At this point, InstCombine seems to have become worse to me than GCC's combining infrastructure, so I'd really like to see if we can stop putting things there. Otherwise, we just go farther down the path of "let's just make everything one big graph rewrite".)

Perhaps something of a digression: My general impression is that if InstCombine were actually doing a principled graph rewrite, then we wouldn't have these problems. The problem we have, however, is that we have phase-ordering problems (even within the fixed-point iteration scheme): patterns being matched depend on a canonical form; instructions nearly matching those patterns might be created, and then suboptimally consumed (i.e. further transformed), before being subject to the canonicalizing transformation(s) that would make the better pattern actually match.

Except for TableGen'ing the whole thing and making an actual graph automaton to match the patterns, it's not clear to me how to efficiently fix this. In the meantime, what we've been doing in practice is to have InstCombine patterns depend on canonical forms until we hit some phase-ordering problem, and then extend the pattern in question to match some non-canonical forms. We might do this here as well (if that's the problem here), but at some point I suspect we'll end up redoing everything. -Hal


Perhaps something of a digression:

My general impression is that if InstCombine were actually doing a
principled graph rewrite, then we wouldn't have these problems.

Define "these problems"?

:) I mean, you are right, it would be better in general. But the notion of adding more and more graph rewrites to a single part of the compiler, to get second/third/etc. order effects, is endless. Pretty much everything can be expressed as a set of graph rewrites. There is no principled stopping point. I.e., you have to decide what you want each pass to do, and if you have one pass whose job is "perform all possible graph rewrites", that pass is, in fact, your compiler :)

That's not such a bad thing, mind you; there are entire libraries/languages centered around this (see, e.g., Stratego, https://en.wikipedia.org/wiki/Stratego/XT, and the Stratego Tutorial/Reference in the Spoofax documentation), but it should be a deliberate thing.

The problem we have, however, is that we have phase-ordering problems (even within the fixed-point iteration scheme): patterns being matched depend on a canonical form; instructions nearly matching those patterns might be created, and then suboptimally consumed (i.e. further transformed), before being subject to the canonicalizing transformation(s) that would make the better pattern actually match.

Except for TableGen'ing the whole thing and making an actual graph automaton to match the patterns, it's not clear to me how to efficiently fix this.

That is what people used to do: make parsing grammars, etc.
Stratego is also quite efficient at term rewriting, from what I remember.

In the meantime, what we've been doing in practice is to have InstCombine patterns depend on canonical forms until we hit some phase-ordering problem, and then extend the pattern in question to match some non-canonical forms. We might do this here as well (if that's the problem here), but at some point I suspect we'll end up redoing everything.

The question I would have is "how do you know when you are done?". As I said, pretty much everything can be expressed as a graph transform, and the more that gets added here, the more people want to run instcombine 50 times because CSE/GVN/you name it "doesn't do what they want".

AFAIK, that’s true. I think you’re done when you get everything that you have that fits into a model of local transformations and that should be run to a fixed point. Certainly an interesting question, but I suspect that the practical answer is “what’s in InstCombine.” To do more in this sense probably requires some more-general scheme (e.g. equality saturation). -Hal

I think you're done when you get everything that you have that fits into a
model of local transformations and that should be run to a fixed point.

Sure, but to point out what you certainly already know, that's almost every optimization you can think of in a compiler. For example, I can do PRE this way with no issue (i.e., generate optimal results). Same with reassociation, most CSE, etc.

A very large set of global stuff can be done by "local transform repeated
until fixpoint is reached". It's just not optimal time-wise.

Certainly an interesting question, but I suspect that the practical answer is "what's in InstCombine."

At least to me, that's clearly the wrong answer, because instcombine
already does a lot of non-local stuff that could be done better elsewhere.

Let me bring the discussion back to this particular example. This *is* a
local optimization; it's the most basic 2-variable boolean logic
optimization that's not an instsimplify (where we return a constant or an
existing value).

I see 3 possibilities (let me know if there are others):
1. Hard: Declare instcombine bankrupt and closed for business because it
has already gone over the cliff. We should not add the new matcher type. We
should go further and remove the existing combine [ ((~A & B) | A) -> (A |
B) ] too because that's the responsibility of some other pass. This might
be the best theoretical solution, but practically impossible until someone
reinvents instcombine as multiple different passes that have the same
optimization power that we have today?

2. Medium: Take the suggestion from PR27431 (
https://bugs.llvm.org/show_bug.cgi?id=27431 ) to have CSE or GVN replace an
inverted compare predicate with an xor of an existing compare. I think this
would mean abandoning https://reviews.llvm.org/D35182 and maybe removing
that transform altogether to avoid conflicting with the expected canonical
form that includes an xor. How this plays out with the rest of
instcombine's existing folds and other IR transforms is not clear to me yet.

3. Easy: Add the new matcher and use it. It's then one 2-line patch after another to ensure that instcombine won't miss these pattern variants no matter what form they come in with. We acknowledge the added mass to instcombine, but live with it for the perf gains... and wait for the now better-performing machines to rewrite the whole thing for us. :)


Let me bring the discussion back to this particular example.

Sure, I'm happy to admit we've digressed pretty far :)

This *is* a local optimization; it's the most basic 2-variable boolean
logic optimization that's not an instsimplify (where we return a constant
or an existing value).

Right, but it also creates new instructions which will later be subject to more combining :)

I see 3 possibilities (let me know if there are others):
1. Hard: Declare instcombine bankrupt and closed for business because it
has already gone over the cliff. We should not add the new matcher type. We
should go further and remove the existing combine [ ((~A & B) | A) -> (A |
B) ] too because that's the responsibility of some other pass. This might
be the best theoretical solution, but practically impossible until someone
reinvents instcombine as multiple different passes that have the same
optimization power that we have today?

Do you believe the latter will *ever* happen until we become more stringent about instcombine?
There has to be some middle ground.

2. Medium: Take the suggestion from PR27431 ( https://bugs.llvm.org/show_bug.cgi?id=27431 ) to have CSE or GVN replace an inverted compare predicate with an xor of an existing compare. I think this would mean abandoning https://reviews.llvm.org/D35182 and maybe removing that transform altogether to avoid conflicting with the expected canonical form that includes an xor. How this plays out with the rest of instcombine's existing folds and other IR transforms is not clear to me yet.

3. Easy: Add the new matcher and use it. It's then one 2-line patch after another to ensure that instcombine won't miss these pattern variants no matter what form they come in with. We acknowledge the added mass to instcombine, but live with it for the perf gains... and wait for the now better-performing machines to rewrite the whole thing for us. :)

I'm not opposed to this, but everything I've ever seen in compilers tells me we will beat this horse well past the point it's dead.

Not sure about this last part. It is really going to require work by us to rewrite things. :) In the meantime, I think we should go ahead with this. -Hal

Not sure about this last part. It is really going to require work by us to rewrite things. :) In the meantime, I think we should go ahead with this.

FWIW: My problem is, when put in this framework, we will repeatedly make this same decision, this same way, again and again, and never actually get even started on fixing it :)

I.e., "it's just another small patch!"

But I'm also fine if we go ahead, because I'm not offering to do the work :)

You’re correct. However, as I’m sure you’re aware, developer time is not directly transferable in a way that makes any other decision optimal. It’s not like blocking all improvements to InstCombine will cause a movement to appear to rewrite InstCombine. It will only cause a shrink in the pool of developers with recent experience working on InstCombine (and a lot of out-of-tree patches). Frankly, we don’t even have a concrete plan for how we’d do this, or even a target design, and that’s the first item on the critical path to a rewrite. We should start an RFC and iterate on this until we have a concrete plan and migration plan. -Hal

Going back to the original problem: why shouldn't SimplifyUsingDistributiveLaws handle both of these cases?

For the bitwise test case, if you distribute, you get

(~A | A) & (B | A)

The left side of that simplifies to all 1s, which is the identity value for 'and'. So even though the right side doesn't simplify, it's a win.
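
As a quick sanity check of that algebra (a standalone program, not LLVM code), this verifies the whole chain exhaustively over all i8 values:

// Verify: (~A & B) | A == (~A | A) & (B | A) == B | A.
// The first step distributes 'or' over 'and'; then ~A | A is all-ones,
// the identity value for 'and', leaving B | A.
#include <cassert>
#include <cstdint>

int main() {
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b) {
      uint8_t A = (uint8_t)a, B = (uint8_t)b;
      uint8_t Orig = (uint8_t)((~A & B) | A);       // pattern from @or_and_not
      uint8_t Dist = (uint8_t)((~A | A) & (B | A)); // after distributing
      assert(Orig == Dist && Dist == (uint8_t)(B | A));
    }
  return 0;
}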


You're correct. However, as I'm sure you're aware, developer time is not
directly transferable in a way that makes any other decision optimal. It's
not like blocking all improvements to InstCombine will cause a movement to
appear to rewrite InstCombine.

Of course not, but this sets it up as an all-or-nothing thing, and it's not.
Though you probably will need a stop-loss point in the future.

It will only cause a shrink in the pool of developers with recent
experience working on InstCombine (and a lot of out-of-tree patches).

This is not the only effect it will have, and I don't believe you think that either (since this argument is globally applicable, to the degree that it would mean we should just accept *every patch* rather than push people to design things better). It would also, for example, push people towards putting transformations elsewhere.

Frankly, we don't even have a concrete plan for how we'd do this, or even
a target design, and that's the first item on the critical path to a
rewrite.

Sure, and frankly, you will never get one until either someone decides to
be super-altruistic, or you start pushing on people a little more than "not
at all".

We should start an RFC and iterate on this until we have a concrete plan and migration plan.

I don't believe that will actually achieve anything at this point (really, no offense meant), so I'm just going to leave this stuff alone, since it's pretty clear people just view me as being either obstructionist or unrealistic.

I don't see anything happening until we are willing to push more, and I don't see that happening until instcombine is even worse than it is now.

Correct; we should insist on the best technical decisions given the current infrastructure (or the current infrastructure with incremental improvements). No offense taken, but at this point, even if we were going to push people to work on a new system, we don't even have a concrete thing to suggest they do. I'd be really interested in exploring, for example, whether some Stratego/XT-like system would be better, how we might construct it, the degree to which such a system can remain sound and efficient in the presence of "arbitrary" C++ predicates, etc. There's also significant interest in encoding these transformations in a form that can be mechanically transformed such that they can be verified by some SMT-solver-backed checker (e.g. Alive). Thanks again, Hal

That's a good point. The factorization/reassociation in instcombine is just too limited, I think. This raises another pair of questions, though:

  1. Why doesn't -reassociate catch this either?
  2. And if we teach reassociate to match this, can we remove some redundant functionality from instcombine?

Reassociate very explicitly doesn't handle i1 at all. But even if it did, the processing for 'or' doesn't look at its arguments beyond finding duplicates and direct inverses; it won't look through the 'and' to find an inverse.

Just want to chime in here in support of what Hal said. I strongly believe we should not block incremental progress unless we have a plan in place for a better design. In particular, part of being able to come up with a better design is having multiple people with recent knowledge of the system that needs to be replaced, and stopping evolution of the old system by definition inhibits having such a group of people. Philip