[RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

We would like to propose a new feature to disable optimizations on IR Functions that are considered “cold” by PGO profiles. The primary goal of this work is to improve code optimization speed (which also improves compilation and LTO speed) without significantly affecting target code performance.

The mechanism is simple: in the second phase of PGO (i.e. the optimization phase), we add the optnone attribute to functions that are considered “cold”, that is, functions with low profiling counts. A similar approach can be applied to loops. The rationale is simple as well: if a given IR Function will not be executed frequently, we shouldn’t waste time optimizing it. Similar approaches can be found in modern JIT compilers for dynamic languages (e.g. JavaScript and Python) that adopt a multi-tier compilation model: only “hot” functions or execution traces are handed to higher-tier compilers for aggressive optimization.

In addition to de-optimizing functions whose profiling counts are exactly zero (-fprofile-deopt-cold), we also provide a knob (-fprofile-deopt-cold-percent=<X percent>) to adjust the “cold threshold”. That is, after sorting the profiling counts of all functions, this knob de-optimizes the functions whose counts fall in the lower X percent.
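As a rough illustration of the selection rule described above, here is a minimal Python sketch. It is not the code from the patch: the function name `select_cold_functions`, the dictionary-based profile, and the rounding of the cutoff are all assumptions made for the example. It also assumes that zero-count functions are always selected, even when they exceed the requested percentage.

```python
from typing import Dict, Set


def select_cold_functions(counts: Dict[str, int], cold_percent: float = 0.0) -> Set[str]:
    """Return the names of functions that would be marked optnone.

    counts maps a function name to its profiling count (hypothetical input
    format). cold_percent is the "lower X percent" knob; 0 selects only
    functions whose count is exactly zero.
    """
    if not 0.0 <= cold_percent <= 100.0:
        raise ValueError("cold_percent must be between 0 and 100")

    # Assumption: functions that never ran are always treated as cold.
    cold = {name for name, count in counts.items() if count == 0}

    # Sort by ascending count (name breaks ties so the result is deterministic).
    ranked = sorted(counts, key=lambda name: (counts[name], name))

    # Take the lowest X percent of the sorted list (rounded down).
    cutoff = int(len(ranked) * cold_percent / 100.0)
    cold.update(ranked[:cutoff])
    return cold


profile = {"main": 1000, "hot_loop": 5000, "init": 10, "error_path": 0}
print(sorted(select_cold_functions(profile)))        # zero-count functions only
print(sorted(select_cold_functions(profile, 50.0)))  # lower half by count
```

With the sample profile, the default selects only `error_path`, while a 50% threshold also selects `init`.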

We evaluated this feature on the LLVM Test Suite (the Bitcode, SingleSource, and MultiSource sub-folders were selected). Both compilation speed and target program performance are measured by the number of instructions reported by Linux perf. The table below shows the percentage of compilation speed improvement and target performance overhead relative to the baseline that only uses (instrumentation-based) PGO.

Experiment Name       | Compile Speedup | Target Overhead
----------------------|-----------------|----------------
DeOpt Cold Zero Count | 5.13%           | 0.02%
DeOpt Cold 25%        | 8.06%           | 0.12%
DeOpt Cold 50%        | 13.32%          | 2.38%
DeOpt Cold 75%        | 17.53%          | 7.07%

(The “DeOpt Cold Zero Count” experiment only disables optimizations on functions whose profiling counts are exactly zero. The rest of the experiments disable optimizations on functions whose profiling counts are in the lower X%.)

We also evaluated FullLTO; here are the numbers:

Experiment Name       | Link Time Speedup | Target Overhead
----------------------|-------------------|----------------
DeOpt Cold Zero Count | 10.87%            | 1.29%
DeOpt Cold 25%        | 18.76%            | 1.50%
DeOpt Cold 50%        | 30.16%            | 3.94%
DeOpt Cold 75%        | 38.71%            | 8.97%

(The link time presented here includes the LTO and code generation time. We omitted the compile time numbers here since they are not really interesting in an LTO setup.)

From the above experiments we observed that compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already had 5~10% of “free ride” on compilation / linking speed improvement and barely had any target performance penalty.

We believe the above numbers show that this patch can improve build time with little overhead.

Here are the patches for review:

Credit: This project was originally started by Paul Robinson <paul.robinson@sony.com> and Edward Dawson <Edd.Dawson@sony.com> from the Sony PlayStation compiler team. I picked it up when I was interning there this summer.

Thank you for reading.
-Min

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Just looking at your test-suite numbers, not optimising functions “never used” during the profile run sounds like an obvious “default PGO behaviour” to me. The flag defining the percentage range is a good option for development builds.

I imagine you guys have run this on internal programs and found it beneficial, too, not just the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try that locally and spot any issues.

cheers,
–renato

Hello,

We use PGO to optimize clang itself. I can see if I have time to give this patch some testing. Is there anything special to look out for besides compile benchmarks and the time to build clang? Do you expect any changes in code size?

This sounds very interesting and the compile time gains in the conservative range (say under 25%) seem quite promising.

One concern that comes to mind is if it is possible for performance to degrade severely in the situation where a function has a hot call site (where it gets inlined) and some non-zero number of cold sites (where it does not get inlined). When we decorate the function with optnone, noinline it will presumably not be inlined into the hot call site any longer and will furthermore be unoptimized.
Have you considered such a case and if so, is it something that cannot happen (i.e. inlining has already happened, etc.) or something that we can mitigate in the future?

A more aesthetic comment I have is that personally, I would prefer a single option with a default percentage (say 0%) rather than having to specify two options.
Also, it might be useful to add an option to dump the names of functions that are decorated so the user can track an execution count of such functions when running their code. But of course, the debug messages may be adequate for this purpose.

Nemanja

A more aesthetic comment I have is that personally, I would prefer a single option with a default percentage (say 0%) rather than having to specify two options.

0% doesn’t mean “don’t do it”, just means “only do that to functions I didn’t see running at all”, which could be misrepresented in the profiling run.

If we agree this should be always enabled, then only one option is needed. Otherwise, we’d need negative percentages to mean “don’t do that” and that would be weird. :slight_smile:

Also, it might be useful to add an option to dump the names of functions that are decorated so the user can track an execution count of such functions when running their code. But of course, the debug messages may be adequate for this purpose.

Remark options should be enough for that.

–renato

Hi Renato,

From the above experiments we observed that compilation / link time improvement scaled linearly with the percentage of cold functions we skipped. Even if we only skipped functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already had 5~10% of “free ride” on compilation / linking speed improvement and barely had any target performance penalty.

Hi Min (Paul, Edd),

This is great work! Small, clear patch, substantial impact, virtually no downsides.

Thank you :slight_smile:

Just looking at your test-suite numbers, not optimising functions “never used” during the profile run sounds like an obvious “default PGO behaviour” to me. The flag defining the percentage range is a good option for development builds.

I imagine you guys have run this on internal programs and found it beneficial, too, not just the LLVM test-suite (which is very small and non-representative). It would be nice if other groups that already use PGO could try that locally and spot any issues.

Good point! We are aware that the LLVM Test Suite is too “SPEC-alike” and leans toward scientific computation rather than real-world use cases. So we also ran experiments on the V8 JavaScript engine, which is a huge code base and a good real-world example. It showed a 10~13% speed improvement on optimization + codegen time with up to 4% target performance overhead. (Note that due to some hacky reasons, for many of the V8 source files, over 80% or even 95% of compilation time was spent in the frontend, so measuring by total compilation time would be heavily skewed and would not reflect the impact of this feature.)

Best
-Min

Hi Tobias and Dominique,

I didn’t evaluate the impact on code size in the first place since it was not my primary goal. But thanks to the design of LLVM Test Suite benchmarking infrastructure, I can call out those numbers right away.

(Non-LTO)
Experiment Name       | Code Size Increase
----------------------|-------------------
DeOpt Cold Zero Count | 5.2%
DeOpt Cold 25%        | 6.8%
DeOpt Cold 50%        | 7.0%
DeOpt Cold 75%        | 7.0%

(FullLTO)
Experiment Name       | Code Size Increase
----------------------|-------------------
DeOpt Cold Zero Count | 4.8%
DeOpt Cold 25%        | 6.4%
DeOpt Cold 50%        | 6.2%
DeOpt Cold 75%        | 5.3%

For non-LTO the increase caps at around 7%. For FullLTO things got a little more interesting: code size actually decreased when we increased the cold threshold, but I’d say it’s around 6%. To dive a little deeper, the majority of the increased code size came (not surprisingly) from the .text section. The PLT section contributed a little, and the rest of the sections barely changed.

Though the overhead on code size is higher than the target performance overhead, I think it’s still acceptable in normal cases. In addition, David mentioned in D87337 that LLVM has used similar techniques on code size (not sure what he was referencing; my guess would be something related to hot-cold code splitting). So I think the feature we’re proposing here can be a complement to that one.

Finally: Tobias, thanks for evaluating the impact on Clang, I’m really interested to see the result.

Best,
Min

IIUC, it’s just using optsize instead of optnone. The idea is that, if the code really doesn’t run often/at all, then the performance impact of reducing the size is negligible, but the size impact is considerable.

I’d wager that optsize could even be faster than optnone, as it would delete a lot of useless code… but not noticeable, as it wouldn’t run much.

This is an idea that we (Verona Language) are interested in, too.

Would it make sense to have a flag to select optnone or optsize? We would probably also do the tradeoff for a smaller binary.

A more aesthetic comment I have is that personally, I would prefer a single option with a default percentage (say 0%) rather than having to specify two options.

0% doesn’t mean “don’t do it”, just means “only do that to functions I didn’t see running at all”, which could be misrepresented in the profiling run.

If we agree this should be always enabled, then only one option is needed. Otherwise, we’d need negative percentages to mean “don’t do that” and that would be weird. :slight_smile:

I am not sure I follow. My suggestion was to have one option that would give you a default of 0% (i.e. only add the attribute on functions that were never called). So the semantics would be fairly straightforward:

  • Default (i.e. no -profile-deopt-cold): do nothing
  • Option with no arg (i.e. -profile-deopt-cold): add attribute only to functions that have an execution count of zero
  • Option with an arg (i.e. -profile-deopt-cold=): add attribute to functions that account for % of total execution counts

I see. This looks confusing to me, but perhaps it’s just me.

Though, I’m not sure we can get this behaviour from the same flag, as you need to provide a default value if the flag isn’t passed (usually boolean or integer, not both).

It’s not just you. :slight_smile: Assuming “account for % of total execution counts” means “account for % or less of total execution counts,” then it seems like the proposed -profile-deopt-cold does the same thing as -profile-deopt-cold=0.

Also, for build-system-friendliness, IMHO every positive option should have a negative option — i.e., the default behavior should be regainable via an option such as -profile-no-deopt-cold. (Or -fno-profile-deopt-cold, if there was a missing f in all of the above.) That seems easier to do if the whole thing is controlled by just one option instead of two.

my $.02,
–Arthur
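To make the single-option idea in this sub-thread concrete, here is a small Python sketch of how one flag with an optional value plus a negative form could be resolved. It is only an illustration: the spellings `-fprofile-deopt-cold` and `-fno-profile-deopt-cold`, the "last occurrence wins" rule, and the helper name `resolve_deopt_cold` are assumptions for the example, not the behaviour of the actual patch.

```python
from typing import Iterable, Optional

# Hypothetical spellings, used only for this sketch.
POSITIVE = "-fprofile-deopt-cold"
NEGATIVE = "-fno-profile-deopt-cold"


def resolve_deopt_cold(args: Iterable[str]) -> Optional[float]:
    """Resolve the cold threshold from a command line.

    Returns None when the feature is off (the default), 0.0 when only
    zero-count functions should be de-optimized, or the requested percentage.
    Assumption: the last occurrence of the option wins.
    """
    threshold: Optional[float] = None  # default: do nothing
    for arg in args:
        if arg == NEGATIVE:
            threshold = None           # regain the default behaviour
        elif arg == POSITIVE:
            threshold = 0.0            # no argument: zero-count functions only
        elif arg.startswith(POSITIVE + "="):
            value = float(arg.split("=", 1)[1])
            if not 0.0 <= value <= 100.0:
                raise ValueError(f"invalid percentage in {arg!r}")
            threshold = value
    return threshold


print(resolve_deopt_cold([]))
print(resolve_deopt_cold(["-fprofile-deopt-cold"]))
print(resolve_deopt_cold(["-fprofile-deopt-cold=25", "-fno-profile-deopt-cold"]))
```

Using `None` for "off" keeps it distinct from 0%, which sidesteps the need for a negative percentage to mean "don’t do that".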

Hi All,

  • Default (i.e. no -profile-deopt-cold): do nothing

  • Option with no arg (i.e. -profile-deopt-cold): add attribute only to functions that have an execution count of zero

  • Option with an arg (i.e. -profile-deopt-cold=): add attribute to functions that account for % of total execution counts

I see. This looks confusing to me, but perhaps it’s just me.

It’s not just you. :slight_smile: Assuming “account for % of total execution counts” means “account for % or less of total execution counts,” then it seems like the proposed -profile-deopt-cold does the same thing as -profile-deopt-cold=0.

Also, for build-system-friendliness, IMHO every positive option should have a negative option — i.e., the default behavior should be regainable via an option such as -profile-no-deopt-cold. (Or -fno-profile-deopt-cold, if there was a missing f in all of the above.) That seems easier to do if the whole thing is controlled by just one option instead of two.

Actually there has always been a -fno-profile-deopt-cold driver flag in my second Phabricator review (D87338).
But to sum up, I think it’s a good idea to have only one driver flag, or even one LLVM CLI option.

FYI David is referring to PGSO (profile-guided size optimization) as it exists directly under that name, see: https://reviews.llvm.org/D67120. And yeah using PGSO is selecting optsize while this change is selecting optnone.

The 1.29% is pretty considerable on functions that should never be hit according to profile information. This can indicate that there might be something amiss with the profile quality and that certain hot functions are not getting caught. Alternatively, given the ~5% code size increase you mention in the other thread, the cold code may not be getting moved out to a cold page, so i-cache pollution ends up being a factor. I think it would be worthwhile to dig deeper into why there’s any performance degradation on functions that should never be called.

Also, if you’re curious about how to build clang itself with PGO, the documentation is here: How To Build Clang and LLVM with Profile-Guided Optimizations — LLVM 16.0.0git documentation

I think calling PGSO size opt is probably a bit misleading though. It’s more of an adaptive opt strategy, and it can improve performance too due to better locality. We have something similar internally for selecting opt level based on profile hotness too under AutoFDO.

Perhaps similar implementations can all be unified under a profile guided “adaptive optimization” framework to avoid duplication:

  • A unified way of setting hot/cold cutoff percentile (e.g. through PSI that’s already used by all PGO/FDO).
  • A unified way of selecting opt strategy for cold functions: default, none, size, minsize.

Thanks,

Wenlei
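The two bullets above (one shared cold cutoff, one selectable strategy for cold functions) could be modelled roughly as follows. This is a Python sketch of the idea only: `ColdStrategy`, `attributes_for`, and the `cold_count_threshold` parameter are hypothetical names, and the threshold is passed in directly rather than derived from a profile summary. The `none` strategy pairs optnone with noinline, since LLVM IR requires optnone functions to be noinline as well.

```python
from enum import Enum
from typing import List


class ColdStrategy(Enum):
    """Strategies for cold functions, mirroring the list in the proposal."""
    DEFAULT = "default"
    NONE = "none"
    SIZE = "size"
    MINSIZE = "minsize"


# Function attributes each strategy would add to a cold function.
_ATTRIBUTES = {
    ColdStrategy.DEFAULT: [],
    ColdStrategy.NONE: ["optnone", "noinline"],
    ColdStrategy.SIZE: ["optsize"],
    ColdStrategy.MINSIZE: ["minsize"],
}


def attributes_for(count: int, cold_count_threshold: int, strategy: ColdStrategy) -> List[str]:
    """Return the attributes to add to a function with the given profile count.

    cold_count_threshold stands in for whatever count the shared cutoff
    percentile resolves to (an assumption of this sketch).
    """
    if count > cold_count_threshold:
        return []  # not cold: leave the function alone
    return list(_ATTRIBUTES[strategy])


print(attributes_for(0, 10, ColdStrategy.NONE))
print(attributes_for(5, 10, ColdStrategy.SIZE))
print(attributes_for(500, 10, ColdStrategy.MINSIZE))
```

Keeping the cutoff and the strategy as separate inputs is what lets the optnone, optsize, and minsize variants share one hot/cold decision.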

1%+ overhead is indeed interesting. If you use lld as the linker (together with the new pass manager), you should be able to have a good profile-guided function-level layout so dead functions are moved out of the hot pages.

This may also be related to a subtle pass ordering issue. Pre-inline counts may not be super accurate, but we can’t use post-inline counts either, given that CGSCC inline is halfway through the opt pipeline. Looking at the patch, it seems the decision is made at PGO annotation time, which is between pre-instrumentation inline and CGSCC inline.

From: llvm-dev <llvm-dev-bounces@lists.llvm.org> on behalf of Modi Mo via llvm-dev <llvm-dev@lists.llvm.org>
Reply-To: Modi Mo <modimo@fb.com>
Date: Wednesday, September 9, 2020 at 5:55 PM
To: Tobias Hieta <tobias@plexapp.com>, Renato Golin <rengolin@gmail.com>
Cc: “ddevienne@gmail.com” <ddevienne@gmail.com>, llvm-dev <llvm-dev@lists.llvm.org>, “cfe-dev (cfe-dev@lists.llvm.org)” <cfe-dev@lists.llvm.org>
Subject: Re: [llvm-dev] [RFC] New Feature Proposal: De-Optimizing Cold Functions using PGO Info

FYI David is referring to PGSO (profile-guided size optimization) as it exists directly under that name, see: https://reviews.llvm.org/D67120. And yeah using PGSO is selecting optsize while this change is selecting optnone.

PGSO looks at the block-level profile, too.

Hi,

Thanks for all the feedback related to code size.

  1. Adding a new llvm::shouldOptimizeForSize framework that leverages BFI and PSI to provide block level and function level assessments on whether we should optimize for size.
  2. In Passes (mostly MachinePasses), they’ll change certain behaviors (e.g. whether adding pads or not) if llvm::shouldOptimizeForSize returns true OR there is an optsize or minsize attribute
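The decision described in the two points above can be sketched in Python as a simple "profile says cold OR explicit size attribute" check. This is a simplified model, not LLVM's implementation: the helper names, the use of a raw block count, and the single `cold_count_threshold` are assumptions made for illustration.

```python
from typing import Iterable, Optional

SIZE_ATTRIBUTES = frozenset({"optsize", "minsize"})


def profile_suggests_size_opt(block_count: Optional[int], cold_count_threshold: int) -> bool:
    """Stand-in for the profile-driven query.

    True only when profile data exists for the block (block_count is not None)
    and the count is at or below the cold threshold.
    """
    return block_count is not None and block_count <= cold_count_threshold


def pass_should_favor_size(fn_attrs: Iterable[str],
                           block_count: Optional[int],
                           cold_count_threshold: int) -> bool:
    """A pass favors size if the profile says so OR the function asks for it."""
    has_size_attr = bool(SIZE_ATTRIBUTES & set(fn_attrs))
    return has_size_attr or profile_suggests_size_opt(block_count, cold_count_threshold)


print(pass_should_favor_size(["optsize"], 10_000, 100))  # explicit attribute
print(pass_should_favor_size([], 3, 100))                # cold by profile
print(pass_should_favor_size([], None, 100))             # no profile, no attribute
```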

I totally agree with Wenlei that (somewhere in the future) we should have a unified FDO framework for both code size and compilation time. And I think Renato and Tobias’s suggestions to do the same thing for size-oriented attributes (i.e. minsize and optsize) are the low-hanging fruit we can support in a short time.
Engineering-wise, I’d prefer to send out a separate review for the size-oriented attributes work, since minsize / optsize are somewhat in conflict with optnone, so I don’t think it’s a good idea to put them into one flag / feature set.

Best,
Min

I’m happy for this unification to happen at a later stage. Just not too much later.

I worry that exposing the flags will get people to use them and then we’ll change them. The longer we leave it, the more people will be hit by the subtle change.

Worse still if we release with one behaviour now and then with a different behaviour in the next release.

Having conditional paths in build systems for different versions of the compiler isn’t fun.

–renato