GSoC: Profile-guided inlining and block positioning

1. Summary

I will implement two optimizations in the LLVM compiler, both based on runtime profile information: an inlining pass that takes call frequency into account, and an improved basic block placement algorithm.

2. The project

LLVM now includes an efficient framework [1] for instrumenting an application and collecting edge and path profile information. Profile-guided optimizations present an opportunity to improve application performance significantly, but so far no real optimization has been developed or adapted to take advantage of this profile information (there is a BasicBlockPlacement pass, but it can be considered naïve).

The first part of my project consists of implementing a function inlining pass that takes advantage of function call frequency, when available. Function inlining is one of the most important optimizations today: it reduces call overhead and enables a series of classic optimizations that are usually inhibited by calls. Although it is desirable to inline as many calls as possible, a limit must be imposed, otherwise application performance may suffer from excessive code size growth and instruction cache misses [2]. The current heuristic can be improved by considering the number of times the function to be inlined was actually called:

  • A function that was not called at all, or was called fewer times than a certain threshold, should not be inlined, or should at least be penalized.

  • A function that is called frequently should always be inlined, or should at least get a bonus that raises its chance of being inlined (see the sketch below).
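As a rough illustration of this penalty/bonus idea, the sketch below scales a call site's inline cost by its recorded call count. The function name, thresholds, and scaling factors are all hypothetical placeholders to be tuned experimentally, not existing LLVM API.

```cpp
#include <cstdint>

// Hypothetical helper: scale an inline cost by call frequency.
// Thresholds and factors are placeholders, to be tuned on benchmarks.
int adjustCostByFrequency(int InlineCostValue, uint64_t CallCount) {
  const uint64_t ColdThreshold = 1;    // "never called" boundary
  const uint64_t HotThreshold = 1000;  // "hot" boundary

  if (CallCount < ColdThreshold)
    return InlineCostValue * 2;        // penalize cold call sites
  if (CallCount > HotThreshold)
    return InlineCostValue / 2;        // reward hot call sites
  return InlineCostValue;              // leave warm sites unchanged
}
```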

I will use the algorithm presented in [3] for two reasons. First, according to the benchmarks in the paper, the performance improvement can reach 57% compared to inlining without any profile information, with a maximum code size growth of 10%. Second, the algorithm takes into consideration the cost of performing inlining at a call site, functionality that is already provided by the InlineCost analysis.

The algorithm will be implemented as a new optimization pass, which will perform the following sequence of operations:

  1. Determine a maximum cost for the whole inlining process. This can be S * 1.1, where S is the estimated size of the application (an estimate of the total number of instructions, computed the way CodeMetrics::analyzeBasicBlock does it). Multiplying by 1.1 allows a code growth of at most 10%, which has been shown to be more than enough. If optimization for size is enabled, we could allow a code growth of only 1%, which still provides most of the benefits.

  2. Determine the call sites and compute, for each one, a benefit/cost ratio, where benefit is the number of times the callee was actually called, and cost is the estimated cost of the inlining, as computed by the InlineCost analysis. The call site/ratio pairs are then inserted into a max-heap, because the algorithm needs them in decreasing-ratio order, and because the ratio of a call site may change as a result of an inlining decision, an update a heap supports efficiently (decrease-key).

  3. Run the actual algorithm, which can be formulated as a knapsack-like problem: the call sites are chosen in a greedy fashion, based on the benefit/cost ratio, until the maximum cost is reached (or no more call sites are available). After a callee is inlined, all call sites that may be affected by the inlining decision must be updated (a sketch of this loop follows the list).
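A minimal sketch of that greedy loop is shown below. Every name in it (the Candidate record, the budget, the inlining step) is a hypothetical placeholder rather than existing LLVM API; it only illustrates the knapsack-style selection.

```cpp
#include <queue>

// Hypothetical record for one call site; not LLVM API.
struct Candidate {
  unsigned CallSiteID; // stand-in for a real call-site handle
  double Ratio;        // benefit (call count) / cost (inline cost)
  int Cost;            // estimated cost of inlining this site
  bool operator<(const Candidate &O) const { return Ratio < O.Ratio; }
};

// Greedily inline the best-ratio candidates until the budget
// (roughly 10% of the estimated module size) is exhausted.
void runGreedyInlining(std::priority_queue<Candidate> &Heap, int Budget) {
  while (!Heap.empty() && Budget > 0) {
    Candidate C = Heap.top();
    Heap.pop();
    if (C.Cost > Budget)
      continue; // too expensive under the remaining budget
    Budget -= C.Cost;
    // Inline the callee here, then recompute the ratio of every
    // call site affected by the decision and re-insert it.
  }
}
```

Note that std::priority_queue has no decrease-key operation, so a real implementation would either use a heap structure that supports key updates or re-insert updated candidates and discard stale entries as they are popped.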

The benefit/cost ratio formula most likely won't be that simple, because it must also take into consideration attributes like alwaysinline, noinline, and optsize. Also, functions with an extremely high cost should not be inlined, even if they are called frequently. These rules will be fine-tuned in the later stages of the project; one possible ordering of the checks is sketched below.
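The sketch uses purely illustrative field names (none of this is LLVM API) to show how the attribute checks could take precedence over the benefit/cost ratio:

```cpp
// Hypothetical per-site summary; field names are illustrative only.
struct SiteInfo {
  bool AlwaysInline; // callee carries the alwaysinline attribute
  bool NoInline;     // callee carries the noinline attribute
  int Cost;          // estimated inline cost
};

// Attribute checks take precedence over the benefit/cost ratio.
bool shouldConsider(const SiteInfo &S, int CostCeiling) {
  if (S.NoInline)
    return false; // never inline, regardless of call frequency
  if (S.AlwaysInline)
    return true;  // inline unconditionally
  if (S.Cost > CostCeiling)
    return false; // extremely expensive callee: skip even if hot
  return true;    // otherwise decide by the benefit/cost ratio
}
```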

The second optimization I will implement is a better basic block placement algorithm. I will implement Algo2 from [4], which, according to the paper, gives better results than Algo1, the algorithm currently implemented in the BasicBlockPlacement pass. Basic block placement using profile information is especially helpful for control-intensive functions and for functions where many blocks execute infrequently or not at all (debugging or error-handling code, for example). Reordering the basic blocks helps reduce branch penalties, which is very important on today's heavily pipelined CPUs. It also helps reduce instruction cache misses, because the instruction stream becomes denser and unlikely code is moved to a "cold" area of the function. According to [4], the performance improvements are between 2% and 15% for most applications.

The basic idea of the algorithm is that it forms, in a bottom-up fashion, chains of basic blocks that should be placed as straight-line code. After these chains are computed, the doubly-linked list of basic blocks is reordered to reflect the order within the chains (sketched below).
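A minimal sketch of the chain-merging step, assuming every basic block starts in its own singleton chain; plain integers stand in for blocks and the edge counts come from the profile. This only illustrates the core idea, not the actual pass.

```cpp
#include <algorithm>
#include <cstdint>
#include <list>
#include <vector>

struct Edge {
  int Src, Dst;   // basic-block indices
  uint64_t Count; // profiled execution count of the edge
};

using Chain = std::list<int>; // blocks laid out as straight-line code

// Merge chains bottom-up, visiting edges from hottest to coldest.
void formChains(std::vector<Edge> Edges, std::vector<Chain> &Chains) {
  std::sort(Edges.begin(), Edges.end(),
            [](const Edge &A, const Edge &B) { return A.Count > B.Count; });
  for (const Edge &E : Edges) {
    Chain *SrcChain = nullptr, *DstChain = nullptr;
    for (Chain &C : Chains) {
      if (!C.empty() && C.back() == E.Src) SrcChain = &C;
      if (!C.empty() && C.front() == E.Dst) DstChain = &C;
    }
    // Merge only when Src ends one chain and Dst starts another,
    // so the hot edge becomes a fall-through.
    if (SrcChain && DstChain && SrcChain != DstChain)
      SrcChain->splice(SrcChain->end(), *DstChain); // DstChain empties
  }
}
```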

3. Benefits for LLVM

  • The optimizations I propose provide tangible performance benefits, unlike other profile-guided optimizations that are worth performing only for a small number of applications, or that may degrade performance if they introduce code growth that cannot be offset by the extra optimization opportunities (superblock formation, for example).

  • Taking advantage of the many opportunities opened by profile-guided optimizations.

  • Inspiring other developers to implement further profile-guided optimizations.

4. Success criteria

  • Implement both optimizations described above.

  • Pass the test suite with both optimizations enabled.

  • Performance with profile-guided inlining enabled should exceed that obtained with the standard inlining pass.

  • Performance with the new block placement algorithm should exceed that obtained with the current placement algorithm.