We would like to propose a new feature that disables optimizations on IR functions that PGO profiles consider “cold”. The primary goal of this work is to improve optimization speed (and with it compilation and LTO speed) without significantly hurting target code performance.
The mechanism is pretty simple: In the second phase (i.e. optimization phase) of PGO, we would add
In addition to de-optimizing functions whose profiling counts are exactly zero (-fprofile-deopt-cold), we also provide a knob (-fprofile-deopt-cold-percent=<X percent>) to adjust the “cold threshold”. That is, after sorting the profiling counts of all functions, this knob de-optimizes the functions whose counts fall in the lowest X percent.
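The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in the patches: the function name select_cold_functions, the dictionary input, the tie-breaking by name, and the rounding of the cutoff are all assumptions made for the example.

```python
import math


def select_cold_functions(counts, percent=None):
    """Return the set of function names to treat as cold.

    counts:  dict mapping function name -> profile count (hypothetical
             input; the real feature reads counts from the PGO profile).
    percent: None selects only zero-count functions, mirroring
             -fprofile-deopt-cold. A number X selects the lowest X percent
             of functions after sorting by count, mirroring
             -fprofile-deopt-cold-percent.
    """
    if percent is None:
        return {name for name, count in counts.items() if count == 0}
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Sort by count; ties are broken by name so the result is deterministic.
    ranked = sorted(counts, key=lambda name: (counts[name], name))
    # Assumption: the cutoff is rounded down. The patch may round differently.
    cutoff = math.floor(len(ranked) * percent / 100)
    return set(ranked[:cutoff])


# Made-up profile counts, for illustration only.
counts = {"main": 1000, "parse": 250, "log_error": 0, "init": 1, "usage": 0}

print(sorted(select_cold_functions(counts)))      # ['log_error', 'usage']
print(sorted(select_cold_functions(counts, 60)))  # ['init', 'log_error', 'usage']
```

Note that with a percentage threshold, functions that did execute (such as init above) can also be selected, which is where the extra target overhead at higher thresholds comes from.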
We evaluated this feature on the LLVM Test Suite (the Bitcode, SingleSource, and MultiSource sub-folders were selected). Both compilation speed and target program performance are measured by the number of instructions reported by Linux perf. The table below shows the compilation speedup and the target performance overhead, both as percentages relative to a baseline that only uses (instrumentation-based) PGO.
Experiment Name          Compile Speedup   Target Overhead
DeOpt Cold Zero Count    5.13%             0.02%
DeOpt Cold 25%           8.06%             0.12%
DeOpt Cold 50%           13.32%            2.38%
DeOpt Cold 75%           17.53%            7.07%
(The “DeOpt Cold Zero Count” experiment only disables optimizations on functions whose profiling counts are exactly zero. The other experiments disable optimizations on functions whose profiling counts are in the lowest X%.)
We also evaluated the feature with FullLTO. Here are the numbers:
Experiment Name          Link Time Speedup   Target Overhead
DeOpt Cold Zero Count    10.87%              1.29%
DeOpt Cold 25%           18.76%              1.50%
DeOpt Cold 50%           30.16%              3.94%
DeOpt Cold 75%           38.71%              8.97%
(The link time presented here includes both LTO and code generation time. We omitted the compile time numbers since they are not really interesting in an LTO setup.)
From the above experiments we observe that the compilation / link time improvement scales roughly linearly with the percentage of cold functions we skip. Even if we only skip functions that never got executed (i.e. had counter values equal to zero, which is effectively “0%”), we already get a 5% to 10% “free ride” on compilation / linking speed, at a target performance penalty of only 0.02% (without LTO) to 1.29% (with FullLTO).
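As a quick sanity check of the linearity observation, one can fit a straight line through the four data points of each table. The sketch below is not part of the patches. It assumes the “Zero Count” experiment can be placed at 0% (as suggested above) and uses a plain least-squares fit; linear_fit is a helper written only for this example.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Cold threshold in percent; "Zero Count" is treated as 0% (an assumption).
percents = [0, 25, 50, 75]
# Speedups in percent, copied from the two tables above.
compile_speedup = [5.13, 8.06, 13.32, 17.53]
link_speedup = [10.87, 18.76, 30.16, 38.71]

for label, ys in [("compile", compile_speedup), ("FullLTO link", link_speedup)]:
    slope, intercept, r_squared = linear_fit(percents, ys)
    print(f"{label}: slope={slope:.3f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

With these numbers the fitted slopes come out to about 0.17 (compile) and 0.38 (FullLTO link) percentage points of speedup per percent of threshold, with R^2 around 0.99 in both cases. Four points are of course too few to draw strong conclusions.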
We believe the above numbers show that this patch is useful for improving build time with little overhead.
Here are the patches for review:
- Modifications on LLVM instrumentation-based PGO: https://reviews.llvm.org/D87337
- Modifications on Clang driver: https://reviews.llvm.org/D87338
Credit: This project was originally started by Paul Robinson <firstname.lastname@example.org> and Edward Dawson <Edd.Dawson@sony.com> from the Sony PlayStation compiler team. I picked it up while interning there this summer.
Thank you for reading.