Saving Compile Time in InstCombine

Hi,

One of the most time-consuming passes in the LLVM middle end is InstCombine (see e.g. [1]). It is a very powerful pass, capable of doing all sorts of crazy stuff, and new patterns are constantly being introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. That sounds fine for O3, where we try to squeeze out as much performance as possible, but it is excessive for other opt levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.
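For context, gating a pattern on that parameter looks roughly like the sketch below. This is a simplified illustration rather than an excerpt from the tree: foldCheapICmpPatterns is a made-up placeholder, while ExpensiveCombines and foldICmpUsingKnownBits are the real names that come up later in this thread.

    // Illustrative only: cheap pattern matching runs on every invocation of
    // the pass, while folds that need whole-expression analysis (known bits
    // and friends) run only when the pipeline asked for expensive combines,
    // e.g. at -O3.
    Instruction *InstCombiner::visitICmpInst(ICmpInst &I) {
      if (Instruction *Res = foldCheapICmpPatterns(I)) // hypothetical helper
        return Res;

      if (ExpensiveCombines)
        if (Instruction *Res = foldICmpUsingKnownBits(I))
          return Res;

      return nullptr;
    }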

To find out which patterns are important and which are rare, I profiled clang using CTMark and got the following coverage report:

InstCombine_covreport.html (6.17 MB)

0001-InstCombine-Move-some-infrequent-patterns-under-if-E.patch (32.6 KB)

Hi,

One of the most time-consuming passes in the LLVM middle end is InstCombine (see e.g. [1]). It is a very powerful pass, capable of doing all sorts of crazy stuff, and new patterns are constantly being introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. That sounds fine for O3, where we try to squeeze out as much performance as possible, but it is excessive for other opt levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.

To find out which patterns are important and which are rare, I profiled clang using CTMark and got the following coverage report:
<InstCombine_covreport.html>
(beware, the file is ~6MB).

Guided by this profile, I moved some patterns under the "if (ExpensiveCombines)" check, which, as expected, turned out to be neutral for runtime performance but improved compile time. The testing results are below (measured for -Os).

It'd be nice to double-check that any runtime performance loss at -O2 is negligible. But this sounds like a great idea!

vedant

Hi,

One of the most time-consuming passes in the LLVM middle end is InstCombine (see e.g. [1]). It is a very powerful pass, capable of doing all sorts of crazy stuff, and new patterns are constantly being introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. That sounds fine for O3, where we try to squeeze out as much performance as possible, but it is excessive for other opt levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.

To find out which patterns are important and which are rare, I profiled clang using CTMark and got the following coverage report:
<InstCombine_covreport.html>
(beware, the file is ~6MB).

Guided by this profile, I moved some patterns under the "if (ExpensiveCombines)" check, which, as expected, turned out to be neutral for runtime performance but improved compile time. The testing results are below (measured for -Os).

It’d be nice to double-check that any runtime performance loss at -O2 is negligible. But this sounds like a great idea!

I forgot to mention that I ran SPEC2006/INT with “-Os” on ARM64 and didn’t see any changes in runtime performance. I can run O2 testing as well over the weekend.

Michael

Hi,

One of the most time-consuming passes in the LLVM middle end is InstCombine (see e.g. [1]). It is a very powerful pass, capable of doing all sorts of crazy stuff, and new patterns are constantly being introduced there. The problem is that we often use it just as a clean-up pass: it's scheduled 6 times in the current pass pipeline, and each time it's invoked it checks all known patterns. That sounds fine for O3, where we try to squeeze out as much performance as possible, but it is excessive for other opt levels. InstCombine has an ExpensiveCombines parameter to address that - but I think it's underused at the moment.

Yes, "ExpensiveCombines" was added recently (4.0? 3.9?), but I believe it has always been intended to be extended the way you're doing it. So I support this effort :slight_smile:

In general it is great that we investigate these things! We have been liberally adding pass invocations and patterns for years without checking the compile-time consequences.

However, intuitively it feels wrong to disable some patterns completely (there will always be that one program that gets so much better when you have a certain pattern).

  • Do you have an idea what would happen if we only disabled them in 5 of the 6 invocations?

  • Or, alternatively, what happens if we just don't put as many InstCombine instances into the pass pipeline at -Os?

Matthias

+1 Also, did your profiling reveal why the other combines are expensive? Among other things, I’m curious if the expensive ones tend to spend a lot of time in ValueTracking (getting known bits and similar)? -Hal

Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange, and many of them would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine’s xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).
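As a made-up illustration of such a duel (this exact pair of folds does not exist in the tree): suppose one combine canonicalized subtraction of a constant into addition, and someone later added the inverse fold. Neither would ever reach a fixed point:

    #include "llvm/IR/IRBuilder.h"
    #include "llvm/IR/PatternMatch.h"

    using namespace llvm;
    using namespace llvm::PatternMatch;

    // Fold A: sub X, C -> add X, -C  (one canonical direction).
    // The hypothetical inverse, "add X, C -> sub X, -C", would undo Fold A;
    // with both enabled the worklist would flip-flop between the two forms
    // forever, and downstream code could never rely on one canonical shape.
    static Value *canonicalizeSubOfConstant(BinaryOperator &I,
                                            IRBuilder<> &Builder) {
      Value *X;
      const APInt *C;
      if (match(&I, m_Sub(m_Value(X), m_APInt(C))))
        return Builder.CreateAdd(X, ConstantInt::get(I.getType(), -*C));
      return nullptr;
    }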

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.

I’d be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn’t change but is not intelligently reused. I imagine that the performance win might be quite comparable. Such a patch would have the benefit of keeping the set of available transforms constant throughout the pipeline while bringing execution time down; I wouldn’t be at all surprised if caching the ValueTracking functions resulted in a bigger time savings.

I agree with this up to a point. If we have these kinds of canonicalization dependencies that depend on ValueTracking's depth, this seems very fragile. Even if we introduce caching, thus making the depth often infinite, if it will still be finite in the face of updates then we still need to be careful (plus, if we're worried about being able to understand the canonical form, then depending on known-bits analysis can make defining this form subtle). I'd started working on this a few months ago; I didn't finish the patch (mostly because I discovered that there's also a need to invalidate the cache whenever we perform a transformation that drops nsw/nuw flags, and I never got to that part), but I'm happy to provide my work-in-progress to anyone interested. cc'ing Davide, who had also expressed an interest in this. -Hal
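For reference, the kind of known-bits cache being discussed looks roughly like the sketch below. This is illustrative only, written against the KnownBits-returning computeKnownBits overload, and is not the contents of the work-in-progress patch mentioned above; as noted, invalidation (including when nsw/nuw flags are dropped) is the hard part.

    #include "llvm/ADT/DenseMap.h"
    #include "llvm/Analysis/ValueTracking.h"
    #include "llvm/IR/DataLayout.h"
    #include "llvm/Support/KnownBits.h"

    using namespace llvm;

    // Sketch of a known-bits cache keyed on the Value. An entry must be
    // dropped whenever the value (or anything it transitively depends on) is
    // rewritten or loses nsw/nuw flags; otherwise the cached bits are stale
    // and may claim more than is actually known.
    class KnownBitsCache {
      DenseMap<const Value *, KnownBits> Cache;

    public:
      KnownBits get(const Value *V, const DataLayout &DL) {
        auto It = Cache.find(V);
        if (It != Cache.end())
          return It->second;
        KnownBits Known = computeKnownBits(V, DL); // the expensive recursive walk
        Cache.insert({V, Known});
        return Known;
      }

      void invalidate(const Value *V) { Cache.erase(V); }
      void clear() { Cache.clear(); }
    };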

The goal is to improve compile-time, so I think we should measure what’s taking the most time and work from there. I’m not sure how the code coverage report was made, but that seems like it is based on execution frequency rather than time?

Here are some Instruments (macOS) screenshots of a time profile of a release build of opt running on a Haswell machine:
$ opt -O2 row_common.bc -S -o /dev/null

The input is derived from the Chrome source file attached to:
https://bugs.llvm.org//show_bug.cgi?id=32037

It's an implicit assumption in this experiment that 'opt' time is important to overall compile time, but as you can see in the bug report, that's not really true for this case. Also, this profile may not be representative of other programs, or as important as other cases, but to summarize:

  1. 123 ms out of 326 ms (38%) of the -O2 opt time is going into InstCombine. So yes, InstCombine is a big chunk of opt time.

  2. Focusing on computeKnownBits() within the InstCombine subtree: more than 48% of the time consumed by InstCombine is spent in ValueTracking.

We know that most InstCombines don’t use ValueTracking, so I think transforms that use computeKnownBits are the ones that should be grouped under “ExpensiveCombines”. Maybe there’s another category for “RareCombines”, or we split InstCombine into multiple passes to further reduce compile-time?

In any case, the profile lines up with David’s comment about simplifyDemanded / computeKnownBits and previous anecdotes in the “llvm is getting slower” threads. Simple matchers are relatively short strands of compare and branch spaghetti. It’s the few transforms that use ValueTracking that have a high relative cost. If we attack that, the profile says we could recover much more than 1% if -O2 middle-end time is significant for most compiles.

Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange, and many of them would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine’s xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

OK, but is it true of all the combines today?

InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.

If a canonicalization is too costly to achieve, maybe it is not a reasonable one?
It is also not clear to me that canonicalizations which use complex analyses (ValueTracking / computeKnownBits) really make it easy to "understand what is canonical" anyway. This is my impression in general as the scope of what is needed to achieve the transformation gets larger: the more context is needed, the less it looks like a "canonicalization" to me.
WDYT?

Agreed, it would be nice if instcombine were more modular, such that we could choose whether we wanted a canonicalisation run or an optimisation run. And within the optimisations, it would also be good to be able to select which categories of combines to run. Splitting out canonicalisation would also make it clearer to everyone what the standard forms are for certain constructs.

Amara

Thanks for the interest, everyone! The intention of this patch was exactly to ignite the discussion; I expected that it would need to be changed/reworked before going in (if ever).

The profile I used was not a time profile; it was a frequency-based report, as Sanjay noticed. The idea was to find patterns that are (almost) unused, and those were the patterns I later moved under 'ExpensiveCombines' - I admit it was somewhat arbitrary on my side: I kept the ones that looked pretty cheap and moved the others.

Let me try to keep only the ValueTracking-related patterns under ExpensiveCombines and remeasure the numbers for now, and then we can decide whether we want to do anything about the others. I think it would still be nice to avoid checking for them - maybe we can keep one 'full' InstCombine run for canonicalization, and in subsequent runs only check frequent patterns.

Thanks,
Michael

Honestly, I'm not a huge fan of this change as-is. The set of transforms that were added behind ExpensiveCombines seems awfully strange, and many of them would not lead the reader to believe that they are expensive at all (the SimplifyDemandedInstructionBits and foldICmpUsingKnownBits calls being the obvious expensive routines).

The purpose of many of InstCombine’s xforms is to canonicalize the IR to make life easier for downstream passes and analyses.

As we get further along with compile-time improvements, one question we need to ask ourselves more frequently is about the effectiveness of optimizations/passes. For example - in this case - how can we make an educated assessment that running the combiner N times is a good cost/benefit investment of compute resources? The questions below are meant to figure out what technologies/instrumentation/etc. could help toward a more data-driven decision process when it comes to the effectiveness of optimizations. The instcombiner might just be an inspirational use case to see what is possible in that direction.

The combiner is invoked in full multiple times. But is it really necessary to run all of it for that purpose? After instcombine has run once, is there a mapping from transformation → combines? I suspect most transformations could invoke a subset of combines to re-canonicalize. Or, if there were a (cheap) verifier for canonical IR, it could invoke a specific canonicalization routine. Instrumenting the instcombiner and checking which patterns actually kick in (for different invocations) might give insight into how the combiner could be structured so that only a subset of patterns needs to be checked.
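For the instrumentation part, LLVM's statistics machinery already gets most of the way there; a minimal sketch follows (the counter name is made up, and per-invocation numbers would need a bit more plumbing than plain -stats):

    #include "llvm/ADT/Statistic.h"

    #define DEBUG_TYPE "instcombine"

    // One counter per fold (or per family of folds) shows how often each
    // pattern actually fires. The name below is made up for illustration.
    STATISTIC(NumICmpKnownBitsFolds, "icmp folds that needed known-bits analysis");

    // Inside the fold, bump the counter when the pattern rewrites something:
    //
    //   if (Instruction *Res = foldICmpUsingKnownBits(I)) {
    //     ++NumICmpKnownBitsFolds;
    //     return Res;
    //   }
    //
    // Running `opt -stats` (or `clang -mllvm -stats`) then prints the totals
    // for the whole run; even these totals show which patterns essentially
    // never fire on a given workload.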

InstCombine internally relies on knowing what transforms it may or may not perform. This is important: canonicalizations may duel endlessly if we get this wrong; the order of the combines is also important for exactly the same reason (SelectionDAG deals with this problem in a different way with its pattern complexity field).

Can you elaborate on this “duel endlessly” with specific examples? This is out of curiosity. There must be verifiers that check that this cannot happen. Or an implementation strategy that guarantees that. Global isel will run into the same/similar question when it gets far enough to replace SD.

Another concern with moving seemingly arbitrary combines under ExpensiveCombines is that it will make it that much harder to understand what is and is not canonical at a given point during the execution of the optimizer.

I’d be much more interested in a patch which caches the result of frequently called ValueTracking functionality like ComputeKnownBits, ComputeSignBit, etc. which often doesn’t change but is not intelligently reused. I imagine that the performance win might be quite comparable.

Can you back this up with measurements? Caching schemes are tricky. Is there a way to evaluate whether the results of ComputeKnownBits etc. are actually effective, meaning the result is used and gives faster instructions? E.g. it might well be that only the first instance of instcombine benefits from computing the bits.

As somebody who brought up this problem at least once on the mailing lists, I'm in agreement with David Majnemer here. I think we should consider a caching strategy before going this route. FWIW, I'm not a big fan of `ExpensiveCombines` at all; I can see the reason why it was introduced, but in my experience the "expensive" bits of Instcombine come from the implementation of the bitwise domain, i.e. known bits & friends, so at least evaluating caching is something I would try earlier.

Something else that could be tried (even if it doesn't improve compile time, it is still a nice cleanup) is moving combines that don't create new instructions from instcombine to instsimplify. Many passes use instruction simplification, so that might reduce the amount of code instcombine has to process and/or improve code quality. Just speculation, but a decent experiment if somebody has time to take a look.
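For what it's worth, the structural difference is that an InstSimplify-style fold may only hand back an existing value or a constant and never builds a new instruction, which is what keeps it cheap enough for other passes to call. A sketch (a simplified stand-in, not the real simplify entry points):

    #include "llvm/IR/Constants.h"
    #include "llvm/IR/PatternMatch.h"

    using namespace llvm;
    using namespace llvm::PatternMatch;

    // InstSimplify-style: return an existing value or a constant, or nullptr
    // if nothing applies. No IRBuilder, no new instructions created.
    static Value *simplifySubLike(Value *Op0, Value *Op1) {
      if (Op0 == Op1)                                // X - X  ->  0
        return Constant::getNullValue(Op0->getType());
      if (match(Op1, m_Zero()))                      // X - 0  ->  X
        return Op0;
      return nullptr;
    }

Combines in instcombine that fit this shape - ones that never need to create a new instruction - are the candidates for such a move.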

So, just a thought:
“The purpose of many of InstCombine’s xforms is to canonicalize the IR to make life easier for downstream passes and analyses.”
That sounds sane.

So, are the expensive things canonicalization?
If that is the case, why are we doing such expensive canonicalization? That seems strange to me.

If they are not canonicalization, should they really not be separated out (into some pass that possibly shares infrastructure)?

No compiler is going to get everything anyway, and instcombine needs to decide what “good enough” really means.

I would rather see us understand what we want out of instcombine, precisely, before we try to decide how to make it faster at doing whatever that thing is :slight_smile:

–Dan

To (hopefully) make it easier to answer this question, I've posted my (work-in-progress) patch which adds a known-bits cache to InstCombine. I rebased it yesterday, so it should be fairly easy to apply: https://reviews.llvm.org/D31239 - Seeing what this does to the performance of the benchmarks mentioned in this thread (among others) would certainly be interesting. -Hal

To (hopefully) make it easier to answer this question, I’ve posted my (work-in-progress) patch which adds a known-bits cache to InstCombine.
I rebased it yesterday, so it should be fairly easy to apply: https://reviews.llvm.org/D31239 - Seeing what this does to the performance of the
benchmarks mentioned in this thread (among others) would certainly be interesting.

Thanks! I only have the one rough data point based on PR32037, but caching does good things for compile-time on that example.

Trunk r298514 compiled release on macOS running on Haswell 4GHz:
$ time ./opt -O2 row_common.bc -S -o /dev/null

user 0m0.302s
user 0m0.300s
user 0m0.296s
user 0m0.299s
user 0m0.296s

With your patch applied:

user 0m0.264s
user 0m0.269s
user 0m0.269s
user 0m0.268s
user 0m0.268s

So the time for all of -O2 has dropped to 89.6% of the baseline (improvement of 11.5%).
A profile shows time spent in InstCombine->computeKnownBits dropped from 58 ms to 15 ms (lines up with the overall time drop), so we’re about 4x faster in ValueTracking with the caching.

Yay :slight_smile: – Unfortunately, I won’t have time to work on this in the near future, but if someone would like to pick this up and fix the nsw/nuw invalidation issue (which is exposed in a few of the regression-tests which fail with the patch applied), that would be great. -Hal

In my testing the results are not that impressive, but that's because I'm now focusing on -Os. For me, even completely disabling all KnownBits-related patterns in InstCombine puts the results very close to the noise level. In my original patch I also had some extra patterns moved under ExpensiveCombines, and that seems to make a difference too (without that part, or without the KnownBits part, I get results below 1%, which are not reported as regressions/improvements).

Personally I think caching results of KnownBits is a good idea, and should probably help O3 compile time (and obviously the cases from bug reports, like PR32037).

But I also think that the way we're currently doing combining/canonicalization is unnecessarily complicated. Do we really need to try canonicalizing all of these patterns? What happens if we don't? Has anyone tried replacing some of the InstCombine invocations with InstSimplify? Do we have a justification for the invocations we currently have? I realize that InstCombine doesn't usually do any harm if we don't care about compile time, but that's only the case for O3 (to some extent), not for other optimization levels. I think it's equally important to 1) make our implementations faster, and 2) not perform unnecessary work when possible. Adding caching for known bits makes InstCombine faster, but we can get even more if we don't invoke it when it's not needed.

Of course, we don't want to blindly disable patterns that just happen not to be used in the (admittedly small) test runs that I made. But what I'd like to do is make a deliberate decision on what's critical and what's optional here - what can be disabled to save some compile time. I'd be happy to carry out whatever experiments we need to run to figure this out, and to take part in implementing any missing parts or necessary refactoring. My current plan is to experiment with replacing some InstCombine invocations with InstSimplify and see what regresses, but if you know that's already been done by someone, or you have better ideas, I'd be happy to hear them!

Thanks,
Michael

In my testing the results are not that impressive, but that's because I'm now focusing on -Os. For me, even completely disabling all KnownBits-related patterns in InstCombine puts the results very close to the noise level. In my original patch I also had some extra patterns moved under ExpensiveCombines, and that seems to make a difference too (without that part, or without the KnownBits part, I get results below 1%, which are not reported as regressions/improvements).

Have you profiled a single InstCombine run to see where we actually
spend our cycles (as Sanjay did for his reduced testcase)?

I realize that InstCombine doesn't usually do any harm if we don't care about compile time, but that's only the case for O3 (to some extent), not for other optimization levels.

Independently of the optimization level, I think compile time is important. Note, for example, that we run a (kind of) similar pipeline at O3 and LTO (full, that is), where the impact of compile time is much more evident. Also, while people are not generally bitten by O3 compilation time, you may end up with terrible performance for large TUs (and I unfortunately learned this the hard way).