Instrumentation based Profile Guided Optimization (PGO) is a compiler technique that leverages important program runtime information, such as precise edge counts and frequent value information, to make frequently executed code run faster. It’s proven to be one of the most effective ways to improve program performance.
An important design point of PGO is deciding where to place the instrumentation. In current LLVM PGO, the instrumentation is done in the Clang front-end (referred to as FE based instrumentation in this document). Doing this early in the front-end gives better coverage information, as more precise line number information is available. The resulting profile also has relatively high tolerance to compiler changes: compiler changes after the instrumentation point will not lead to mismatched IR that invalidates the profile.
On the other hand, instrumenting this early gives the compiler fewer opportunities to optimize the code before instrumentation, which has a significant impact on the instrumented binary's runtime performance. In addition, it tends to produce larger binaries and profile data files. Our internal C++ benchmarks show that FE based instrumentation degrades performance (compared to the non-instrumented version) by 58.3%, and in some extreme cases the application speed/throughput decreases by 95% (a 21x runtime slowdown).
A slow instrumented binary is undesirable in PGO for the following reasons:
- It slows down the already lengthy build time. In the worst case, the instrumented binary is so slow that it fails to run a representative workload, because slow execution can lead to more time-outs in many server programs. Other typical issues include text size that is too big, failure to link instrumented binaries, and memory usage exceeding system limits.
- Slow runtime affects the program behavior. Real applications sometimes monitor their own runtime and take different execution paths when the program runs too slowly. This defeats the underlying assumption of PGO and makes it less effective.
This work proposes an option to turn on a new middle-end based instrumentation that aims to speed up the instrumented binary. The new instrumentation is referred to as ME based instrumentation in this document. Our experimental results show that ME based instrumentation can speed up the instrumented binary by 80% on average for typical C++ programs. Here are the two main design objectives:
- Co-existing with FE Instrumenter: We do not propose to replace the FE based instrumentation, because FE based instrumentation has its own advantages and applications. Users can choose which phase does the instrumentation via command line options.
- PGO Runtime Support Sharing: The ME instrumenter will completely re-use the existing PGO’s runtime support.
1. FE Based Instrumentation Runtime Overhead Analysis
Instrumented binaries are expected to run slower due to the inserted instrumentation code. With FE based instrumentation, the overhead is especially high, and the runtime slowdown can be unacceptable in many cases. Further analysis shows that three important factors contribute to the FE instrumentation slowdown:
- [Main] Redundant counter updates of inlined functions. C++ programs can introduce large abstraction penalties by using lots of small inline functions (assignment operators, getters, setters, ctors/dtors etc). The overhead of instrumenting those small functions can be very large, making training runs too slow and in some cases unusable;
- Non-optimal placement of the count updates;
- A third factor is related to value profiling (to be turned on in the future). Small, hot callee functions taking function pointers (callbacks) can incur overhead due to indirect call target profiling.
1.1 Redundant Counter Update
Examining the assembly of an instrumented binary generated by the current LLVM implementation, we can find many sequences of consecutive ‘incq’ instructions updating different counters in the same basic block. Here is an example extracted from a real binary:
…
incq 0xa91d80(%rip) # 14df4b8 <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
incq 0xa79011(%rip) # 14c6750 <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
incq 0xa79442(%rip) # 14c6b88 <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
incq 0x9c288b(%rip) # 140ffd8 <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
…
From the profile use point of view, many of these counter updates are redundant. Consider the following example:
void bar(){
sum++;
}
void foo() {
bar();
}
FE based instrumentation needs to insert a counter update for the only BB of bar():
bar: # @bar
BB#0: # %entry
incq .L__llvm_profile_counters_bar(%rip)
incl sum(%rip)
retq
It also needs to insert an update for the BB in function foo(). After inlining bar() into foo(), the code is:
foo: # @foo
BB#0: # %entry
incq .L__llvm_profile_counters_foo(%rip)
incq .L__llvm_profile_counters_bar(%rip)
incl sum(%rip)
retq
If bar() is always inlined, .L__llvm_profile_counters_bar(%rip) is redundant – the counter won’t help downstream optimizations. On the other hand, if bar() is a large function that may not be suitable for inlining at all callsites, this counter update is necessary in order to produce accurate profile data for the out-of-line instance of the callee.
If foo() is a hot function, the overhead of updating two counters can be significant. This is especially bad for C++ programs, which tend to have many small inline functions.
There is another missed opportunity in FE based instrumentation. The control flow of small functions can usually be simplified when they are inlined into caller contexts, and once the control flow is simplified, many counter updates can be eliminated. This is only possible for a middle-end based late instrumenter. Defining a custom clean-up pass to remove redundant counter updates is unrealistic: it cannot be done in a sane way without destroying the profile integrity of either the out-of-line or the inline instances of the callee.
A much simpler and cleaner solution is to do a pre-inline pass that inlines all the trivial inlines before instrumentation. In addition to removing the unnecessary counter updates for the inline instances, another advantage of pre-inlining is that it provides context sensitive profiles for these small inlined functions. Context sensitive profiles can further improve the PGO based optimizations. Here is a contrived example:
void bar (int n) {
if (n&1)
do_sth1();
else
do_sth2();
}
void caller() {
int s = 1;
for (; s<100; s+=2)
bar(s);
for (s = 102; s< 200; s+=2)
bar(s);
}
The direction of the branch inside bar() is exactly opposite at the two callsites in caller(). Without pre-inlining, the branch probability will be 50-50, which is useless for later optimizations. With pre-inlining, the profile will have the perfect branch count for each callsite. The positive performance impact of context sensitive profiles due to pre-inlining has been observed in real world large C++ programs. Supporting context sensitive profiling directly is another way to solve this, but it would introduce large additional runtime/memory overhead.
1.2 Non-optimal placement of count update
Another, much smaller slowdown factor is the placement of the counter updates. The current front-end based instrumentation applies instrumentation to each front-end lexical construct, and it minimizes the number of static instrumentation points. Note that it always instruments the entry count of the CFG. This may result in higher dynamic instruction counts. For example:
        BB0
         |  100
        BB1
    90 /    \ 10
     BB2     BB3
    90 \    /  10
        BB4
In the above example, FE based instrumentation will always insert a count update in BB0. The dynamic instrumentation count will be either 190 (instrumenting bb0->bb1 and bb1->bb2) or 110 (bb0->bb1 and bb1->bb3). A better instrumentation is to instrument bb1->bb2 and bb1->bb3, where the dynamic instrumentation count is 100.
Our experiments show that optimal placement based on edge hotness can improve instrumented code performance by about 10%. While it is hard to find the truly optimal placement of count updates, compiler heuristics can be used to get a better placement. These heuristics can be based on static profile prediction or user annotations (like __builtin_expect) to estimate the relative edge hotness and put instrumentation on the less hot edges. The initial late instrumentation has not fully implemented this placement strategy yet. With that implemented, we expect even better results than what is reported here.
For real world programs, another major source of slowdown is data races and false sharing of the counter updates in highly threaded programs. Pre-inlining can alleviate this problem, as the counters in the inline instances are no longer shared. But the complete solution to the data race issue is orthogonal to the problem we try to solve here.
2. High Level Design
We propose to perform a pre-profile inline pass before the PGO instrumentation pass. Since the instrumentation pass runs after inlining, it has to be done in the middle-end.
(1) The pre-inline pass
We will invoke a pre-inline pass before the instrumentation. When PGO is on, the inlining will be split into two passes:
- A pre-inline pass that is scheduled before the profile instrumentation/annotation
- A post-inline pass which is the regular inline pass after instrumentation/annotation
By design, all callsites that are beneficial to inline without profile data should be inlined in the pre-inline pass. This includes all callsites where inlining shrinks code size. All remaining callsites are left to the regular inline pass, which runs when profile data is available.
After pre-inlining, a CFG based profile instrumentation/annotation is done: a minimum weight spanning tree (MST) of the CFG is computed first, then only the edges not in the MST are instrumented. The counter update instructions are placed in basic blocks.
(2) Minimum Spanning Tree (MST) based instrumentation
A naive way of instrumentation is to insert a count update for every edge in the CFG, which results in many redundant updates that make the runtime very slow. Knuth [1] proposed a minimum spanning tree based method: given a CFG, first compute a spanning tree; all edges that are not in the MST are instrumented. In the profile use compilation, the counters are propagated (from the leaves of the spanning tree) to all the edges. Knuth proved that this method inserts the minimum number of instrumentation points. The MST based method only guarantees that the number of static instrumentation points is minimized, not the number of dynamic counter updates. To reduce the dynamic count, edges with potentially high counts are put into the MST first, so that they are less likely to be instrumented.
3. Experimental Results
3.1 Measurement of the efficiency of instrumentation
Other than the runtime of the instrumented binaries, a more direct measurement of the instrumentation overhead is the sum of the raw profile count values. Note that regardless of what kind of instrumentation is used, the raw profile counts must be able to reconstruct all the edge count values for the whole program. All raw profile values are obtained by incrementing a counter variable by one, so the sum of the raw profile count values is roughly the number of dynamic counter updates in the instrumented code. The lower the value, the more efficient the instrumentation.
3.2 LLVM instrumentation runtime for SPEC2006 C/C++ programs and SPEC2K eon
The performance speedup is computed as (FE_instrumentation_runtime / ME_instrumentation_runtime - 1).
We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon from SPEC2000. For C programs, except for the outlier 456.hmmer, there are only small ups and downs across the programs. Late instrumentation improves hmmer a lot, but that is probably due to unrelated loop optimizations (90% of its runtime is spent in one loop nest).
For C++ programs, the performance impact of late instrumentation is very large, as expected. The following table shows the results. For some C++ programs the speedup is huge: in 483.xalancbmk, for example, late instrumentation speeds up performance by ~60%. Among all the SPEC C++ programs, only 444.namd is an outlier – it uses a lot of macros and is a very C-like program.
Program Speedup
471.omnetpp 16.03%
473.astar 5.00%
483.xalancbmk 58.57%
444.namd -0.90%
447.dealII 60.47%
450.soplex 8.20%
453.povray 11.34%
252.eon 35.33%