InstCombine, graph rewriting, Equality saturation

Hello all,

I’ve seen some discussion that InstCombine is “too general” and that LLVM should implement a proper graph rewrite mechanism:

Link to llvm-dev discussion about this: http://lists.llvm.org/pipermail/llvm-dev/2017-May/113219.html,
Link to review where this came up (and I first heard about it): https://reviews.llvm.org/D37195.

I wanted to understand what the current issues with InstCombine are, and whether a graph rewriter prototype is something people are interested in. I find the idea appealing, from what little I know of it, so I’d be interested in hacking something up.

What would such a framework look like? Is there past literature on the subject? From what I know, many functional-language compilers based on combinator compilation used graph rewriting. Is there other prior art?

Also, there is the idea of Equality Saturation (http://www.cs.cornell.edu/~ross/publications/eqsat/) that I found out about recently. Could this be used to augment the InstCombine infrastructure?

Thanks,
~Siddharth

Equality saturation is very cool work and Ross Tate mentioned to me that he'd be happy to chat with you about this.

But also, you might take a look at some existing projects in this general space that are already integrated with LLVM, such as Souper and Alive.

InstCombine is pretty cool but developing techniques to build these things automatically is awesome and on my wish list. There's potential for improvements in all of: speed, code quality, and correctness.

While we're on the subject: Are the canonicalization rules for LLVM IR documented anywhere?

Thanks,

John

https://github.com/google/souper
https://github.com/nunoplopes/alive

Thanks for the response.

Should I create a small prototype of equality saturation as an LLVM pass so that there can be some concrete discussion on this? I’d love pointers.

Thanks,
Siddharth

Should I create a small prototype of equality saturation as an LLVM pass so that there can be some concrete discussion on this? I'd love pointers.

I think that would be awesome.

John

Should I create a small prototype of equality saturation as an LLVM pass so that there can be some concrete discussion on this? I'd love pointers.

I think that would be awesome.

+1

The research implementation from Michael Stepp, Ross Tate, and friends is online as the Peggy Optimization Framework.

  -Hal

I’m not sure that I’d describe it as “too general”, but I suspect that it could be made much more efficient. Currently, the patterns are implemented in C++, which is often fairly verbose, and because many of the checks are implemented separately, we end up touching instructions many times as we attempt to match different patterns. Worse, for many of these instructions we end up checking properties, such as known bits, that are computed by yet more backward walks over the instructions. It is also hard to automatically check the correctness of the transformations, and moreover, it is hard to prove that we’ll eventually reach a fixed point (we’ve certainly had problems in the past where some inputs would cause an endless loop between two different InstCombine patterns).
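
For concreteness, here is a minimal sketch, not actual InstCombine code (the helper name and the particular fold are made up), of the style being described: a verbose C++ match() followed by a separate known-bits query, each of which walks the operands again.

// A minimal sketch, not actual InstCombine code: fold `and X, C` to `X`
// when known bits prove the mask does nothing.
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/Support/KnownBits.h"

using namespace llvm;
using namespace llvm::PatternMatch;

static Value *foldRedundantAnd(BinaryOperator &I, const DataLayout &DL) {
  Value *X;
  const APInt *C;
  // Structural match: one walk over the operands.
  if (!match(&I, m_And(m_Value(X), m_APInt(C))))
    return nullptr;

  // Semantic check: another (recursive, backward) walk over the operands.
  KnownBits Known = computeKnownBits(X, DL);

  // Every bit cleared by the mask is already known zero in X, so X & C == X.
  if ((~*C).isSubsetOf(Known.Zero))
    return X;
  return nullptr;
}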

I’d love to see a solution where most of the transformations were specified in TableGen files and:

  1. We, as a result, had an easy way to convert these into a form where some SMT-solver-based checker could certify correctness.
  2. We, as a result, had an easy way to somehow check soundness, as a rewrite system, so we’d know it would reach a fixed point for all inputs.
  3. We, as a result, reduced the number of lines of code we need to maintain for these peephole optimizations.
  4. We, as a result, could generate efficient code for matching the patterns using some automaton technique for minimizing unnecessary visiting of instructions (see the sketch after this list).
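
To make that concrete, here is a purely hypothetical C++ sketch of what a generated, table-driven rule set could look like; none of these types or names exist in LLVM today, and a real design would certainly differ.

// Hypothetical only: a generator (TableGen or otherwise) could emit a flat
// table of rules plus a dispatch keyed on the root opcode, so each
// instruction is inspected once instead of once per hand-written pattern.
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instruction.h"

using namespace llvm;

struct RewriteRule {
  const char *Name;    // for statistics / debug output
  unsigned RootOpcode; // used to index the rules by instruction opcode
  // Structural match; fills Captures with the matched operands.
  bool (*Match)(Instruction &I, SmallVectorImpl<Value *> &Captures);
  // Builds the replacement value from the captured operands.
  Value *(*Apply)(IRBuilder<> &B, ArrayRef<Value *> Captures);
};

// The generator would emit this array (and a per-opcode index into it),
// and the same rule descriptions could be fed to an SMT-based checker.
extern const RewriteRule GeneratedRules[];
extern const unsigned NumGeneratedRules;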

I’m not sure how closely it matches what we would need, but Stratego/XT comes to mind in terms of prior art (now part of the Spoofax Language Workbench). There’s been a lot of work over the years on terminating graph rewriting systems in compilers.

  -Hal

I'd love to see a solution where most of the transformations were specified in TableGen files and:
  1. We, as a result, had an easy way to convert these into a form where some SMT-solver-based checker could certify correctness.
  2. We, as a result, had an easy way to somehow check soundness, as a rewrite system, so we'd know it would reach a fixed point for all inputs.
  3. We, as a result, reduced the number of lines of code we need to maintain for these peephole optimizations.
  4. We, as a result, could generate efficient code for matching the patterns using some automaton technique for minimizing unnecessary visiting of instructions.

YES to all of this (except TableGen :).

It should also be possible to find chains of transformations that fire in sequence and make them happen all at once, avoiding intermediate results and associated allocations/deallocations. Nuno has some ideas along these lines.
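
To sketch what that could look like with LLVM's PatternMatch and IRBuilder (the helper name and the particular pair of rules are made up for illustration, and commutativity is ignored): today one rule might rewrite (B ^ -1) + 1 into 0 - B, and a second rule then rewrite A + (0 - B) into A - B; a fused rule matches the whole chain and never materializes the intermediate sub.

// Illustrative only; not actual InstCombine code.
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// Fused rule: A + ((B ^ -1) + 1)  ==>  A - B
// (since ~B + 1 == -B in two's complement).
static Value *foldAddOfNotPlusOne(BinaryOperator &I, IRBuilder<> &Builder) {
  Value *A, *B;
  if (match(&I, m_Add(m_Value(A),
                      m_Add(m_Xor(m_Value(B), m_AllOnes()), m_One()))))
    return Builder.CreateSub(A, B);
  return nullptr;
}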

I'm not sure how closely it matches what we would need, but Stratego/XT comes to mind in terms of prior art (now part of the Spoofax Language Workbench). There's been a lot of work over the years on terminating graph rewriting systems in compilers.

I went down this rabbit hole a few years ago; there's indeed a lot of existing material.

John

FWIW, before getting to "how to do it", I think InstCombine needs an actual goal.
Right now it's "do a bunch of instruction combination and canonicalization and random kinds of semi-local optimization with some weird analysis and other stuff in the middle. Iterate this as hard as we can."
Nobody can really draw any real dividing line for what should be there or not except based on how easy or expensive it is to shove it in.
That's a recipe for pass creep.

If instcombine becomes the thing that does everything, then everyone will want it to run every time they generate a bunch more code. We already see this now, where people want to add more optimizations to instcombine, and then more runs of it. Right now there are 4 or 5 runs in most pipelines (if I'm counting right :P).

If it becomes a more expensive, globally encompassing graph-rewriting pass, it almost certainly needs to run fewer than 5 times per function, whether by making other passes better at whatever use cases it drops, or by deciding we get the same result anyway, or whatever.
If it becomes a cheap, local graph-rewriting pass, running it more is not so bad, but other passes may need to be better at the more global use cases that are dropped.

But I'd suggest that step 1 is actually to define what we really are trying to get out of it, and what we want it to tackle, before we decide how it should tackle them.

I was thinking that the role of InstCombine would become clarified through the design of the table-based mechanism that Hal outlines. It needs to hit a sweet spot that is reasonably powerful (able to implement, say, 90% of today's InstCombine) and very efficient. At that point we have a pass that won't creep much, since it's clearly not general-purpose, and then there's a pile of residual C++ that isn't a lot of fun, but at least it's a lot smaller than the 30,000 lines of code we have now.

I'm not saying it's the answer, but Alive is an example of a restricted class of transformations that captures a lot of InstCombine-ish things, that has verification support, and that translates into relatively efficient C++.

John

Just thinking out loud…. I’m really not familiar with the vast majority of what instcombine does, so please excuse me if my naiveté is too obvious. I can’t help but notice all of the uses of ‘and’ in Daniel B’s description of what instcombine is right now:

FWIW, before getting to "how to do it", I think InstCombine needs an actual goal.
Right now it's "do a bunch of instruction combination and canonicalization and random kinds of semi-local optimization with some weird analysis and other stuff in the middle. Iterate this as hard as we can."
Nobody can really draw any real dividing line for what should be there or not except based on how easy or expensive it is to shove it in.
That's a recipe for pass creep.

This makes me wonder… is it sensible to be talking about splitting up instcombine into multiple separate passes? Would such a thing even be possible? For example, split by functionality into separate passes that each do one of:
* instruction combinations
* canonicalization
* semi-local optimizations
* etc…

Or something like that.

As separate passes, each would probably have a natural way to be implemented effectively and those implementations might vary.

-Daniel

One obstacle to that is that currently instcombine has an internal fixed-point iteration that it does.

So when splitting it into separate passes we would need to either add fixed-point iteration to the pass manager running the separate instcombine passes (extending the pass management in this way is doable and has been explored in the past, e.g. https://www.youtube.com/watch?v=c7iP43an5_Q ) or demonstrate/measure that we don't regress without the fixed-point iteration across separate instcombine passes.

-- Sean Silva

Or the analyses need to be thought about and extended to not require fixed-point iteration to get the same end result out of the other end of LLVM.

In situations like this, the real question is “does the same kind of optimized code come out the other end”, not “do we perfectly match every little optimization on a per-pass basis”.
Or at least, that’s the answer I got about GVN.
(Because trying to replace a do-everything pass that fixpoints with split-up passes rarely can match it perfectly. Too many second-order effects. It shouldn’t be your goal.)

The problem is that all of these are really the same thing. Almost all canonicalizations are also simplifications, the common underlying factor being that they’re mostly-local transformations that likely enable other optimizations.

I agree, but I’ll add that I view the fixed-point iteration here to be an asset. It increases the robustness and consistency of the compiler, and allows later passes to depend on the canonicalization (*) without worrying about phase-ordering effects (**).

(*) In my experience, the largest problem we have in this regard is our lack of documentation on what our canonical form is.
(**) To be clear, we still have phase-ordering effects within InstCombine. Using a better system for dealing with the rewriting rules, I hope, will help. Nevertheless, these effects are much rarer than if we ran InstCombine only as discrete passes.

I’d like to see us do more fixed-point iteration in our optimization pipeline, and our past work has shown this would be practical (i.e., only a few iterations will be needed in practice), but even that won’t remove the advantages of using a fixed-point iteration within InstCombine.

Regardless, I think that if we had a model for what InstCombine did (i.e., as a graph-rewriting system), then it would be clear what could be part of that system and what couldn’t. Otherwise, I think that the real challenge here is figuring out the underlying structural foundations to make the process efficient and sound.

  -Hal

I agree, but I'll add that I view the fixed-point iteration here to be an asset. It increases the robustness and consistency of the compiler, and allows later passes to depend on the canonicalization (*) without worrying about phase-ordering effects (**).

FWIW: I completely agree. At least as you are using it here, which is in the sense of "the system is built to apply a set of transformations; fixed-point iteration ensures that all such transformations that can occur, do, even if they are only exposed as a result of earlier transformations".

In that sense, a vast majority of our passes do or should be using fixed-point iteration.
The usual concern is cost, and it requires careful engineering (as you say below) to ensure you are doing this in an effective manner. I.e., ad-hoc solutions work but rarely scale; it requires a good understanding of theory and engineering to get something that will scale really well and work well.

Integrating multiple transformations/capabilities into a single pass is much trickier than one would expect. You can easily screw stuff up by doing things you think might only improve things.

A simple example from value numbering (a Pastebin snippet whose IR starts with "define void @hoge() { bb: br label %bb1 ... }"):

Note that if we ignore unreachable edges at first and just shove it in the algorithm, we no longer get optimal answers in certain cases. Even though we only changed exactly one initial state, and we're trying to make things better!
This is fixable, but requires a lot of thought to see that this will happen in the first place.

(The reason in this case is because the algorithm relies on the TOP state to resolve cycles. There is no perfect comes-before iteration ordering in this case. By changing the initial state that fed the cycle, without resetting that cycle to TOP, it can no longer resolve the cycle to the optimal answer.)
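
To illustrate that last point, here is a toy C++ sketch (not NewGVN code, and not the Pastebin example; a three-value lattice stands in for value numbers): the phi cycle resolves to the optimal answer only when both members start at TOP.

// Toy illustration: in an optimistic analysis, meet(TOP, x) = x, and a phi
// cycle only resolves to the optimal answer if its members start at TOP.
#include <cstdio>

// Tiny three-point lattice over "which constant is this value equal to".
struct Lattice {
  enum Kind { Top, Const, Bottom } K;
  int C; // valid when K == Const
};

static Lattice meet(Lattice A, Lattice B) {
  if (A.K == Lattice::Top) return B;
  if (B.K == Lattice::Top) return A;
  if (A.K == Lattice::Const && B.K == Lattice::Const && A.C == B.C) return A;
  return {Lattice::Bottom, 0};
}

int main() {
  // Model the phi cycle  p = phi(1, q);  q = phi(1, p)  from a loop.
  Lattice One = {Lattice::Const, 1};

  // Optimistic: start both phis at TOP and iterate to a fixed point.
  Lattice P = {Lattice::Top, 0}, Q = {Lattice::Top, 0};
  for (int i = 0; i < 4; ++i) {
    P = meet(One, Q);
    Q = meet(One, P);
  }
  std::printf("optimistic seed:  p is %s\n",
              P.K == Lattice::Const ? "const 1" : "unknown");

  // Pessimistic seed: start Q at BOTTOM (as if its initial state changed
  // without resetting the cycle to TOP); the cycle can no longer resolve.
  P = {Lattice::Top, 0};
  Q = {Lattice::Bottom, 0};
  for (int i = 0; i < 4; ++i) {
    P = meet(One, Q);
    Q = meet(One, P);
  }
  std::printf("pessimistic seed: p is %s\n",
              P.K == Lattice::Const ? "const 1" : "unknown");
  return 0;
}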

I'd like to see us do more fixed-point iteration in our optimization pipeline, and our past work has shown this would be practical (i.e., only a few iterations will be needed in practice), but even that won't remove the advantages of using a fixed-point iteration within InstCombine.

Completely agree on this

Regardless, I think that if we had a model for what InstCombine did (i.e., as a graph-rewriting system), then it would be clear what could be part of that system and what couldn't. Otherwise, I think that the real challenge here is figuring out the underlying structural foundations to make the process efficient and sound.

Also completely agree.

Regardless, I think that if we had a model for what InstCombine did (i.e., as a graph-rewriting system), then it would be clear what could be part of that system and what couldn't. Otherwise, I think that the real challenge here is figuring out the underlying structural foundations to make the process efficient and sound.

As a data point for this, I did a quick walkthrough of the top-level of instcombine, and one thing that stuck out to me is that it tries to sink instructions in certain "easy" scenarios. That doesn't fit a graph rewriting system.

As a side note, TryToSinkInstruction does a whole-BB scan to check safety for sinking loads, which will go quadratic (in the number of loads) on an input like:

void use(int);
void foo(int *p, int *q, bool b) {
  int p0 = p[0];
  int p1 = p[1];
  int p2 = p[2];
  int p3 = p[3];
  if (b) {
    use(p0);
    use(p1);
    use(p2);
    use(p3);
  }
}

(script attached)

The commit where it was introduced (rG39c98bb31cc4 / rGf17a2fb8498f; yes, that's from 2004) indicated that it actually caught a bunch of cases in SPEC, but it isn't clear that it's really providing much value integrated into the fixed-point iteration compared to an appropriate standalone global code motion pass (which would not have the quadratic complexity and would be more general to boot).

-- Sean Silva

instcombine_quadratic_load_sinking.py (477 Bytes)

The sinking in InstCombine also has a bogus single-use check on the front of it. Why does it matter how many uses there are, as long as they are all in the same basic block?

It shouldn’t. At least at this point in time, we should be able to do this in a better way. The other problem with this transformation for loads, besides being inefficient and seemingly overly specific, is that it is not information preserving (it sinks loads that we then can’t re-hoist). As a result, we should consider doing this later in the pipeline (or preserving the point-of-validity information some other way).

  -Hal

The sinking in instcombine is not always a win. We hit a case recently where instcombine made a hot loop in a benchmark significantly slower by moving an address-space cast instruction into a loop as part of its gep-of-addrspacecast to addrspacecast-of-gep transformation. On our architecture, this translates to an extra instruction in a hot loop for the address-space cast *and* means that we then can’t use the complex addressing modes for the loads and stores, because we have an address-space cast in between the GEP and the memory op. We end up increasing the number of instructions in the loop by around 20%, with a corresponding decrease in performance.

My somewhat long-winded point is that I’m not convinced that InstCombine is doing sufficient reasoning when it moves instructions between basic blocks to determine that the moves actually make sense. In this benchmark, this one misoptimisation offset all of the other gains from InstCombine and simply disabling it gave us better code.

I don’t know how many other examples of this there are, but it would probably be good to have an InstCombine that ran on single basic blocks (or sequences with trivial flow control) and did folding that was always a win and something separate that’s more path-aware (and probably doesn’t run at O1).

David

This sounds less like the explicit sinking code, where InstCombine tries to analyze users and just moves already-existing instructions. That code shouldn’t be able to move into loops, because it checks that the successor it’s moving into only has a single predecessor.

I think this case is a consequence of InstCombine creating new instructions in the basic block that the parent instruction of the transform lives in.

My somewhat long-winded point is instead that IC shouldn’t even bother trying to sink. I understand it may eventually expose other optimizations, but it doesn’t seem to be worth the complexity added. As you say, the heuristic is pretty simple (fold instructions into blocks with a single predecessor, IIRC, but don’t quote me on that). There are cases where it regresses, and I’m not surprised, as sinking is not done after a proper analysis (as, e.g., the ones GVNSink or LICM do). You could probably catch some cases by teaching IC about profile metadata when taking sink decisions, but that sounds like adding even more madness.