[llvm] r303387 - [InstCombine] add more tests for xor-of-icmps; NFC

[adding llvm-dev for wider audience]

> Is "VRP" (value range propagation) in LLVM-speak "CVP" (correlated
> value
> propagation)?
>
> If so, we have this comment regarding compares:
> // As a policy choice, we choose not to waste compile time on
> anything
> where
> // the comparison is testing local values.
>
> Or this for div/rem:
> // As a policy choice, we choose not
> // to waste compile time on anything where the operands are local
> defs.
>
> "Local" means in the same basic block from what I can tell by the code
> here.
>
> I think this means that this pass explicitly defers handling simple
> cases
> like:
> ⚙ D33342 [InstCombine] try to canonicalize xor-of-icmps to and-of-icmps
> ...to another pass, and currently that pass is InstCombine (because the
> patterns really can't be any simpler than the tests in that patch,
> right?).
>
> I think that improving compile-time should not preclude improving
> InstCombine. We should do both.
>

Just thoughts, feel free to ignore them.
I didn't measure the impact in LLVM, but I'm sure you can do VRP
relatively fast (GCC does it both interprocedurally and
intraprocedurally, and their pass is in some cases much faster than
LLVM's), i.e. O(N) in practice, so maybe we could re-evaluate this
policy?

Yeah, that's kind of silly, we can do much better.

I think replacing a pass in LLVM is not trivial (I'm learning it the
hard way with NewGVN). OTOH, I'm still not entirely convinced
`InstCombine` should handle these cases given it's already a
compile-time sink?

Correct me where I'm going wrong.

1. We got a bug report with a perf regression that was root caused to this
IR:

  define i1 @xor_of_icmps(i64 %a) {
    %b = icmp sgt i64 %a, 0
    %c = icmp eq i64 %a, 1
    %r = xor i1 %b, %c
    ret i1 %r
  }

Because this was a regression, we know that the optimal/canonical IR for
this program is:

  define i1 @xor_of_icmps_canonical(i64 %a) {
    %r = icmp sgt i64 %a, 1
    ret i1 %r
  }
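
As a quick sanity check (mine, not from the bug report), the claimed
equivalence is easy to verify exhaustively at a narrower width, since the
fold is width-independent:

  #include <cassert>
  #include <cstdint>

  // Brute-force check over i8 that (a > 0) ^ (a == 1) == (a > 1),
  // using signed compares, mirroring the IR above.
  int main() {
    for (int a = INT8_MIN; a <= INT8_MAX; ++a) {
      bool b = a > 0;       // %b = icmp sgt i64 %a, 0
      bool c = a == 1;      // %c = icmp eq i64 %a, 1
      bool r = b != c;      // %r = xor i1 %b, %c
      assert(r == (a > 1)); // %r = icmp sgt i64 %a, 1
    }
  }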

2. I saw this as a bug in InstCombine because I think InstCombine's job is
to canonicalize simple IR sequences.

3. I assumed that matching an (xor icmp, icmp) pattern can be done
efficiently in InstCombine. In fact, I knew that we already did that match,
so there is zero incremental cost to find this pattern in InstCombine.

4. I knew that we already handle related and-of-cmps and or-of-cmps patterns
in InstSimplify/InstCombine.

5. Based on that, I have proposed a patch that mostly uses existing logic in
those passes to solve this bug in the most general way I am aware of. It
makes 2 calls to InstSimplify to find a potential fold before
morphing/creating new instructions that may get folded by other existing
logic.
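
To spell out one identity that can drive this (my illustration; the patch's
exact decomposition may differ): an xor of two i1 compares is an or of two
ands with one compare inverted in each half, and inverting an icmp is free
since it's just the inverse predicate, so each half can be offered to the
existing logic-of-icmps folding. A self-contained check of the boolean
identity:

  #include <cassert>

  // b ^ c == (b & ~c) | (~b & c) for i1 values; for icmps, ~b and ~c are
  // the same compares with inverted predicates, so each and-half can be
  // handed to the existing and-of-icmps folding logic.
  int main() {
    for (int b = 0; b <= 1; ++b)
      for (int c = 0; c <= 1; ++c)
        assert((b ^ c) == ((b & !c) | (!b & c)));
  }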

Questions:
1. Was/is InstCombine intended to handle this transform?

2. If there is a measurable compile-time cost for this patch, then there
must be a structural problem with InstSimplify, InstCombine, and/or the pass
pipeline?

3. Is there another pass that is more suitable to solve this bug than
InstCombine? CVP was mentioned as a candidate for enhancement. Any others?

4. If there is a more suitable pass, why is it more suitable?

I can provide an example + analysis: D33172: [InstCombine] Simpify inverted predicates in 'or'

5. If we add to that pass or create a new pass that specifically targets
logic-of-cmps, can we pull that code out of InstSimplify and InstCombine?

6. If compile-time has become unbearable, shouldn't we be moving
optimization logic and/or passes from -O1 to -O2 to -O3, etc? My
understanding as a compiler consumer has always been that I'll get better
performing code the higher up the -O# ladder I climb, but it will cost
increasingly more time to compile. Is that not the model for LLVM?

The real problem of InstCombine, IMHO, is that it suffers from an
additive effect: matchers upon matchers are added over time, each of
which per se doesn't slow down the compiler in any measurable manner,
but summed up they result in a significant compiler slowdown, and we
notice only when it's too late (i.e. when the code has been released).
We've all been under the assumption that the time in InstCombine was
spent inside auxiliary analyses computing the bitwise domain
(ComputeKnownBits et similia). Turns out we're not able to find a case
where caching speeds things up.

People seem intent on adding lots and lots of stuff to instcombine because
it's easy.
This makes it harder to ever replace, while simultaneously making it
slower.
It's not hard to see where that path ends up.
InstCombine does a lot of random crap that isn't even combining or graph
rewriting, too, because, well, second- and third-order effects are hard.

Can you provide some examples? I honestly don't know where I should draw the
line. If I've crossed the line, I'd like to fix it.

I don't have the code handy, but something that came up recently (and
I was a little surprised) is that instcombine checks for irreducible
PHI cycles, see the functions/comments in InstCombinePHI.

> 3. Is there another pass that is more suitable to solve this bug than
> InstCombine? CVP was mentioned as a candidate for enhancement. Any others?
>
> 4. If there is a more suitable pass, why is it more suitable?

I can provide an example + analysis: https://reviews.llvm.org/D33172#754563

Yes, I agreed that case should be solved more generally. There are
follow-up questions in that review that I'd like to reach consensus on
though. For example:

define i32 @cmp_br_1(i32 %a, i32 %b) {
entry:
  %cmp = icmp sge i32 %a, %b
  br i1 %cmp, label %t, label %f
t:
  ret i32 42
f:
  ret i32 31415
}

define i32 @cmp_br_2(i32 %a, i32 %b) {
entry:
  %cmp = icmp slt i32 %a, %b ; invert predicate
  br i1 %cmp, label %f, label %t ; swap successors

t:
  ret i32 42

f:
  ret i32 31415
}

Which of these is canonical? Is InstCombine responsible for that
canonicalization?
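
For clarity, both forms compute the same function; a C++ transliteration
(mine, purely for illustration) makes that easy to check:

  #include <cassert>

  // C++ transliteration of the two IR functions above: inverting the
  // predicate and swapping the successors leaves the result unchanged.
  static int cmp_br_1(int a, int b) { return a >= b ? 42 : 31415; }
  static int cmp_br_2(int a, int b) { return a < b ? 31415 : 42; }

  int main() {
    for (int a = -2; a <= 2; ++a)
      for (int b = -2; b <= 2; ++b)
        assert(cmp_br_1(a, b) == cmp_br_2(a, b));
  }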

> 5. If we add to that pass or create a new pass that specifically targets
> logic-of-cmps, can we pull that code out of InstSimplify and InstCombine?
>
> 6. If compile-time has become unbearable, shouldn't we be moving
> optimization logic and/or passes from -O1 to -O2 to -O3, etc? My
> understanding as a compiler consumer has always been that I'll get better
> performing code the higher up the -O# ladder I climb, but it will cost
> increasingly more time to compile. Is that not the model for LLVM?
>

The real problem of InstCombine, IMHO, is that it suffers from an
additive effect: matchers upon matchers are added over time, each of
which per se doesn't slow down the compiler in any measurable manner,
but summed up they result in a significant compiler slowdown, and we
notice only when it's too late (i.e. when the code has been released).
We've all been under the assumption that the time in InstCombine was
spent inside auxiliary analyses computing the bitwise domain
(ComputeKnownBits et similia). Turns out we're not able to find a case
where caching speeds things up.

This conclusion doesn't line up with the experimental results I showed here:
http://lists.llvm.org/pipermail/llvm-dev/2017-March/111416.html

Is there a bug report with a test case for the stated case of "matchers
over matchers"? I'd like to try some experiments with that test. From the
earlier thread, it sounded like improving that case was something that was
being investigated or would be investigated soon.

The general idea, as I understand it, is that we need to better
define/distinguish canonicalization from optimization and then separate
InstCombine into 'InstCanonicalize' and 'InstOptimize' and/or not run
InstCombine at full strength so many times.

(I'll reply in more depth a bit later this week, i'm slammed ATM)

Correct me where I'm going wrong.

1. We got a bug report with a perf regression that was root caused to
this IR:

  define i1 @xor_of_icmps(i64 %a) {
    %b = icmp sgt i64 %a, 0
    %c = icmp eq i64 %a, 1
    %r = xor i1 %b, %c
    ret i1 %r
  }

Yup

Because this was a regression, we know that the optimal/canonical IR for
this program is:

  define i1 @xor_of_icmps_canonical(i64 %a) {
    %r = icmp sgt i64 %a, 1
    ret i1 %r
  }

Yes.

2. I saw this as a bug in InstCombine because I think InstCombine's job
is to canonicalize simple IR sequences.

Where is the dividing line between what InstCombine does and what
something else does?

In practice? CFG transformations and inter-procedural reasoning are
forbidden. The driving philosophy of InstCombine, as I know it, is "do
reasonable stuff which exposes further canonicalization opportunities."
This mandate is widely defined but is not unreasonable in and of itself. It
is unreasonable to expect InstCombine to perform _all_ canonicalization
opportunities. InstCombine is a soupy combination of many passes because it
is an engineering trade-off between perfect separation of concerns
(performing no overlapping work with any other pass which has a specific
responsibility) and having one giant pass which can do everything and
anything.

3. I assumed that matching an (xor icmp, icmp) pattern can be done
efficiently in InstCombine. In fact, I knew that we already did that match,
so there is zero incremental cost to find this pattern in InstCombine.

4. I knew that we already handle related and-of-cmps and or-of-cmps
patterns in InstSimplify/InstCombine.

5. Based on that, I have proposed a patch that mostly uses existing logic
in those passes to solve this bug in the most general way I am aware of. It
makes 2 calls to InstSimplify to find a potential fold before
morphing/creating new instructions that may get folded by other existing
logic.

Questions:
1. Was/is InstCombine intended to handle this transform?

InstCombine has no real philosophy right now. It can handle *anything*.
The more it handles, the more people want to shove stuff in it. You can
see this already. It does random store sinking, phi conversion, etc.

Because if some other pass handles it, then they don't get as many
simplifications, because nobody has built out those places.
Without some real philosophy line drawing, instcombine expands to cover
everything, gets slow and can never be replaced without introducing
performance regressions.

It becomes the only thing that does certain things, and then people want
to run it more as a result. But you can't run it more because it's slow
(which also incentivizes people to make it do more, so that it catches
whatever case when it does run)

2. If there is a measurable compile-time cost for this patch, then there
must be a structural problem with InstSimplify, InstCombine, and/or the
pass pipeline?

Places like InstCombine get slower by very small percentages at a time.
Really.

People seem intent on adding lots and lots of stuff to instcombine because
it's easy.
This makes it harder to ever replace, while simultaneously making it
slower.
It's not hard to see where that path ends up.
InstCombine does a lot of random crap that isn't even combining or graph
rewriting, too, because, well, second- and third-order effects are hard.

Can you provide some examples?

It does:

Demanded bits analysis
Phi translation and attempts to fold
Instruction sinking
Reassociation
Factorization
Dead code elimination

It also does DSE, some memory forwarding optimizations, and a restricted
subset of SROA. It can also do some really weird stuff like rewriting
and/or coalescing allocas. It can even do speculation :)

I honestly don't know where I should draw the line. If I've crossed the
line, I'd like to fix it.

I'm sorry if you feel like this is people saying it's you. It's really
not. It's really "hey, uh, are we ever going to stop trying to shove stuff
in this thing?"
I think if you look, every so often people take notice of these kinds of
things and it's usually started in a random review thread. :)
As for where to draw the line, i think that's a good discussion for us all
to have.
I think we *should* say what the line is, draw it, and stick to it, and
probably split it up into multiple things (canonicalization, etc).

Truthfully, right now, Instcombine crossed whatever line you want to draw
a long time ago.
It's 30k lines of code.

For reference:
The entirety of transforms/Scalar, combined, is only 60k lines of code.

Of course, line counting is not a great measure of a lot, i'm just
pointing out it contains roughly half the code of the rest of our scalar
optimization infrastructure, and 30% of the total code of it overall. It's
twice as big as our entire vectorization infrastructure. It's big. It does
a lot.
I'd argue that even if most of these transforms belong there, there are
probably cleaner, more performant, more maintainable, and easier to reason
about ways at this point to produce the same result.

If there is a way to do what we do today which is more Pareto optimal, I
doubt you will hear any complaints. You will hear loud complaints if we made
InstCombine as enjoyable to debug as SelectionDAGISel :)

> The driving philosophy of InstCombine, as I know it, is "do reasonable
> stuff which exposes further canonicalization opportunities."

I'd also like to add that when you think about a combine that spans more
than two or three instructions (i.e. weird tree shapes), it should be
reconsidered whether this is the correct pass to perform the
transformation. Instcombine is *really good* at things that are very
localized.

> It also does DSE, some memory forwarding optimizations, and a restricted
> subset of SROA. It can also do some really weird stuff like rewriting
> and/or coalescing allocas. It can even do speculation :)

Fun fact (or maybe not so fun): there are some memory forwarding
xforms that InstCombine is able to catch and GVN is not (the same was
true for NewGVN, but that was fixed). I don't have the actual example
handy, but if you do some archaeology on Bugzilla, Michael Spencer
reported a NewGVN bug containing it.

> In practice? CFG transformations and inter-procedural reasoning are
> forbidden.

That's just about everything though :)

> InstCombine is a soupy combination of many passes because it is an
> engineering trade-off between perfect separation of concerns (performing
> no overlapping work with any other pass which has a specific
> responsibility) and having one giant pass which can do everything and
> anything.

FWIW: I think it's long past turned into the latter.
If you chart the code growth, it's also actually growing at a much faster
rate than all of our other optimizations combined :)

> If there is a way to do what we do today which is more Pareto optimal, I
> doubt you will hear any complaints. You will hear loud complaints if we
> made InstCombine as enjoyable to debug as SelectionDAGISel :)

Sure.

Personally, i would split it into a real canonicalization phase, and then a
combination/optimization phase (if you still need one).

The canonicalizer is a set of graph rewrite rules that apply in a fixed
order and are fairly easy to debug.
The combination/optimizer, ... I haven't thought about.
For some significant parts of instcombine and some of the underlying
instsimplify transforms it uses, newgvn subsumed it.

(For example, all of its thread/backtranslate-over-phi transformations are
subsumed, and we can do ones instcombine can't. This transform is also
probably exponential time worst case as implemented in instcombine, while
the newgvn impl is polynomial time. It's just not common to have long
chains that meet instcombine's condition.)

However, in general, will you lose some optimization by splitting it? Over
the course of a single pass, vs. what it does now, yes.

(Which kind of proves the point: if it didn't do everything, you wouldn't
have this issue. You should be able to make it significantly faster and run
it more, though. :)
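
To make "a set of graph rewrite rules that apply in a fixed order" concrete,
here is a toy sketch of just the control structure (nothing LLVM-specific;
all names are mine):

  #include <functional>
  #include <vector>

  struct Graph { /* stand-in for whatever IR the rewriter works on */ };

  struct Rule {
    const char *Name;                   // useful for -debug style tracing
    std::function<bool(Graph &)> Apply; // returns true if it changed G
  };

  // Each rule runs to fixpoint before the next one starts, in a fixed
  // order, so runs are reproducible and rules are debuggable in isolation.
  inline bool canonicalize(Graph &G, const std::vector<Rule> &Rules) {
    bool Changed = false;
    for (const Rule &R : Rules)
      while (R.Apply(G))
        Changed = true;
    return Changed;
  }

The point being that with a fixed order, a misbehaving rewrite can be
bisected to a single rule rather than to an interaction between hundreds of
matchers.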

Ping on this question. I'm offering to do some work on this, but I can't
guess what the important benchmarks and compile machines are for other
people, and clearly my experiment didn't match your reality. Test file(s)
to use as a gold standard and a hardware description for the compile
machine would be the minimum.

FWIW, I sloppily timed the same test as shown above using 'opt' built
release today, and we're significantly faster than 2 months ago:

r298514 (no knownbits caching):

user 0m0.302s
user 0m0.300s
user 0m0.296s
user 0m0.299s
user 0m0.296s

r304034 (no knownbits caching):

user 0m0.266s
user 0m0.268s
user 0m0.268s
user 0m0.267s
user 0m0.265s

This is actually the same speed as I measured 2 months ago *with*
knownbits caching. I couldn't apply the patch from:
https://reviews.llvm.org/D31239

...cleanly, so I'm not sure if there would still be an improvement from
caching or not. I'd be very interested to see one of those perf vs. commit
charts to see if there are commits we can pinpoint for this improvement. :)

Sorry, I realized this never got a reply.

If I were you, I'd start from
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.8495&rep=rep1&type=pdf

When I last looked at this area [for different purposes, many years
ago] I found GrGen nice to play with (to understand/visualize the
problem).
http://ai2-s2-pdfs.s3.amazonaws.com/a509/550d8427c210f6a5900c12733181e9a2e862.pdf

The algebraic approach to graph rewriting requires a very superficial
knowledge of categories, as the problem is stated in terms of graph
homomorphisms, but I wouldn't worry too much, TBH, unless you want to
dig into the theory behind this.

Thanks,