RFC: PGO Late instrumentation for LLVM

Instrumentation based Profile Guided Optimization (PGO) is a compiler technique that leverages important program runtime information, such as precise edge counts and frequent value information, to make frequently executed code run faster. It’s proven to be one of the most effective ways to improve program performance.

An important design point of PGO is deciding where to place the instrumentation. In current LLVM PGO, the instrumentation is done in the Clang front-end (referred to as FE based instrumentation in this document). Doing this early in the front-end gives better coverage information because more precise line number information is available. The resulting profile also has relatively high tolerance to compiler changes: compiler changes after the instrumentation point will not lead to mismatched IR that invalidates the profile.

On the other hand, doing this too early gives the compiler fewer opportunities to optimize the code before instrumentation. This has a significant impact on the runtime performance of the instrumented binary. In addition, it tends to produce larger binaries and profile data files. Our internal C++ benchmarks show that FE based instrumentation degrades performance (compared to the non-instrumented version) by 58.3%, and in some extreme cases the application speed/throughput decreases by 95% (a 21x runtime slowdown).

An instrumented binary that runs too slowly is undesirable in PGO for the following reasons:

  • It slows down an already lengthy build. In the worst case, the instrumented binary is so slow that it fails to run a representative workload, because slow execution can lead to more time-outs in many server programs. Other typical issues include: text size that is too big, failure to link the instrumented binaries, and memory usage exceeding system limits.

  • A slow runtime affects the program's behavior. Real applications sometimes monitor the program's runtime and take different execution paths when the program runs too slowly. This defeats the underlying assumption of PGO and makes it less effective.

This work proposes an option to turn on a new middle-end based instrumentation that aims to speed up the instrumented binaries. The new instrumentation is referred to as ME based instrumentation in this document. Our experimental results show that ME based instrumentation can speed up instrumented binaries by 80% on average for typical C++ programs. Here are the two main design objectives:

  • Co-existing with the FE instrumenter: We do not propose to replace FE based instrumentation, because FE based instrumentation has its own advantages and applications. Users can choose the instrumentation phase via command-line options.
  • PGO runtime support sharing: The ME instrumenter will completely reuse the existing PGO runtime support.
1. FE Based Instrumentation Runtime Overhead Analysis

Instrumented binaries are expected to run slower because of the inserted instrumentation code. With FE based instrumentation, the overhead is especially high and the runtime slowdown can be unacceptable in many cases. Further analysis shows that there are three important factors contributing to the FE instrumentation slowdown:

  • [Main] Redundant counter updates of inlined functions. C++ programs can introduce large abstraction penalties by using lots of small inline functions (assignment operators, getters, setters, ctors/dtors, etc.). The overhead of instrumenting those small functions can be very large, making training runs too slow and, in some cases, unusable;
  • Non-optimal placement of the count updates;
  • A third factor is related to value profiling (to be turned on in the future). Small and hot callee functions taking function pointers (callbacks) can incur overhead due to indirect call target profiling.

1.1 Redundant Counter Update

Inspecting the assembly of an instrumented binary generated by the current LLVM implementation, we can find many sequences of consecutive 'incq' instructions that update different counters in the same basic block. Here is an example extracted from a real binary:

incq 0xa91d80(%rip) # 14df4b8 <__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
incq 0xa79011(%rip) # 14c6750 <__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
incq 0xa79442(%rip) # 14c6b88 <__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
incq 0x9c288b(%rip) # 140ffd8 <__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>

From the profile use point of view, many of these counter updates are redundant. Consider the following example:
void bar() {
  sum++;
}
void foo() {
  bar();
}

FE based instrumentation needs to insert a counter update for the only BB of bar():
bar: # @bar
# BB#0: # %entry
        incq .L__llvm_profile_counters_bar(%rip)
        incl sum(%rip)
        retq

It also needs to insert an update for the BB in function foo(). After inlining bar() into foo(), the code is:
foo: # @foo
# BB#0: # %entry
        incq .L__llvm_profile_counters_foo(%rip)
        incq .L__llvm_profile_counters_bar(%rip)
        incl sum(%rip)
        retq

If bar() is always inlined, the update of .L__llvm_profile_counters_bar(%rip) is redundant -- the counter won't help downstream optimizations. On the other hand, if bar() is a large function that may not be suitable for inlining at all callsites, this counter update is necessary in order to produce accurate profile data for the out-of-line instance of the callee.

If foo() is a hot function, the overhead of updating two counters can be significant. This is especially bad for C++ programs, where there are tons of small inline functions.

There is another missed opportunity in FE based instrumentation. A small function's control flow can usually be simplified when it is inlined into a caller context. Once the control flow is simplified, many counter updates can be eliminated. This is only possible for a middle-end based late instrumenter. Defining a custom clean-up pass to remove redundant counter updates is unrealistic: it cannot be done in a sane way without destroying the profile integrity of either the out-of-line or the inline instances of the callee.
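As a small hypothetical illustration (the function and variable names below are made up for this document, not taken from any real code base), consider a callee whose branch depends only on its argument. Once it is inlined at a callsite with a constant argument, the branch folds away during simplification, and the counter that would have guarded it in the inline instance can be dropped, while the out-of-line instance keeps its own counters:

int g;

inline void set_flag(bool enable) {
  if (enable)        // needs an edge counter in the out-of-line instance
    g = 1;
  else
    g = 0;
}

void init() {
  set_flag(true);    // after inlining and simplification this reduces to
                     // "g = 1;" -- no residual branch, no counter update left
}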

A much simpler and cleaner solution is to run a pre-inline pass that inlines all the trivial candidates before instrumentation. In addition to removing the unnecessary counter updates for the inline instances, another advantage of pre-inlining is that it provides a context-sensitive profile for these small inlined functions. This context-sensitive profile can further improve PGO based optimizations. Here is a contrived example:
void bar (int n) {
  if (n&1)
    do_sth1();
  else
    do_sth2();
}

void caller() {
  int s = 1;
  for (; s<100; s+=2)
    bar(s);

  for (s = 102; s< 200; s+=2)
    bar(s);
}

The direction of the branch inside bar() is completely opposite at the two different callsites in caller(). Without pre-inlining, the branch probability will be 50-50, which is useless for later optimizations. With pre-inlining, the profile will have the exact branch counts for each callsite. The positive performance impact of context-sensitive profiling due to pre-inlining has been observed in real-world large C++ programs. Supporting context-sensitive profiling directly is another way to solve this, but it would introduce large additional runtime/memory overhead.
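For illustration, the following is a hand-written sketch (not actual compiler output) of what caller() effectively looks like after bar() is pre-inlined at both callsites. Each inline instance carries its own counters, so the profile records the per-callsite branch behavior instead of the merged 50-50 distribution:

void do_sth1();
void do_sth2();

void caller() {
  int s = 1;
  for (; s < 100; s += 2) {
    // inline instance #1 of bar(s): s is always odd in this loop
    if (s & 1)              // profile: 50 taken, 0 not taken
      do_sth1();
    else
      do_sth2();
  }
  for (s = 102; s < 200; s += 2) {
    // inline instance #2 of bar(s): s is always even in this loop
    if (s & 1)              // profile: 0 taken, 49 not taken
      do_sth1();
    else
      do_sth2();
  }
}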

1.2 Non-optimal placement of count updates

Another, much smaller slowdown factor is the placement of the counter updates. The current front-end based instrumentation applies the instrumentation to each front-end lexical construct, and it minimizes the number of static instrumentation points. Note that it always instruments the entry count of the CFG. This may result in higher dynamic instruction counts. For example:
     BB0
      | 100
     BB1
 90 / \ 10
  BB2 BB3
 90 \ / 10
     BB4
In the above example, FE based instrumentation will always insert a count update in BB0. The dynamic instrumentation count will be either 110 (instrumenting bb0->bb1 and bb1->bb3) or 190 (instrumenting bb0->bb1 and bb1->bb2). A better choice is to instrument bb1->bb2 and bb1->bb3, for which the dynamic instrumentation count is 100.

Our experiments show that optimal placement based on edge hotness can improve instrumented code performance by about 10%. While it is hard to find the truly optimal placement of count updates, compiler heuristics can be used to get a better placement. These heuristics can be based on static profile prediction or user annotations (like __builtin_expect) to estimate relative edge hotness and put the instrumentation on the less hot edges. The initial late instrumentation has not fully implemented this placement strategy yet; with that implemented, we expect even better results than what is reported here. For real-world programs, another major source of slowdown is data races and false sharing of the counter updates in highly threaded programs. Pre-inlining can alleviate this problem, as the counters in the inline instances are no longer shared. But a complete solution to the data race issue is orthogonal to the problem we try to solve here.

2. High Level Design

We propose to perform a pre-profile inline pass before the PGO instrumentation pass. Since the instrumentation pass runs after inlining, it has to be done in the middle end.

(1) The pre-inline pass
We will invoke a pre-inline pass before the instrumentation. When PGO is on, the inlining will be split into two passes:

  • A pre-inline pass that is scheduled before the profile instrumentation/annotation
  • A post-inline pass, which is the regular inline pass after instrumentation/annotation

By design, all callsites that are beneficial to inline without profile data should be inlined in the pre-inline pass. This includes all callsites that will shrink code size after inlining. All the remaining callsites are left to the regular inline pass, which runs when profile data is available.

After pre-inlining, a CFG based profile instrumentation/annotation is done. A minimum weight spanning tree (MST) of the CFG is computed first, and then only the edges not in the MST are instrumented. The counter update instructions are placed in the corresponding basic blocks.

(2) Minimum Spanning Tree (MST) based instrumentation
A naive way to instrument is to insert a count update for every edge in the CFG, which results in many redundant updates and makes the runtime very slow. Knuth [1] proposed a spanning tree based method: given a CFG, first compute a spanning tree; all edges that are not in the tree are instrumented. In the profile-use compilation, the counters are propagated (starting from the leaves of the spanning tree) to all the edges. Knuth proved that this method inserts the minimum number of instrumentation points. The MST based method only guarantees that the number of static instrumentation points is minimized, not the dynamic instrumentation count. To reduce the dynamic count, edges with potentially high counts are put into the MST first so that they have less chance of being instrumented.
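To make the selection step concrete, below is a small self-contained C++ sketch written for this document (the Edge, UnionFind, and selectInstrumentedEdges names are ours, not part of the proposed pass): edges are considered in decreasing order of estimated weight, a spanning tree is grown with a union-find structure, and only the edges left outside the tree receive counter updates. Fake entry/exit edges, critical-edge splitting, and the actual counter lowering are omitted.

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <vector>

struct Edge { int Src, Dst; unsigned Weight; };

struct UnionFind {
  std::vector<int> Parent;
  explicit UnionFind(int N) : Parent(N) { std::iota(Parent.begin(), Parent.end(), 0); }
  int find(int X) { return Parent[X] == X ? X : Parent[X] = find(Parent[X]); }
  bool unite(int A, int B) {
    A = find(A); B = find(B);
    if (A == B) return false;   // would form a cycle: not a tree edge
    Parent[A] = B;
    return true;
  }
};

// Returns the indices of the edges that need an inserted counter update.
std::vector<std::size_t> selectInstrumentedEdges(int NumBBs, const std::vector<Edge> &Edges) {
  std::vector<std::size_t> Order(Edges.size());
  std::iota(Order.begin(), Order.end(), std::size_t{0});
  // Consider hot edges first so they land in the spanning tree and
  // therefore escape instrumentation.
  std::sort(Order.begin(), Order.end(), [&](std::size_t A, std::size_t B) {
    return Edges[A].Weight > Edges[B].Weight;
  });
  UnionFind UF(NumBBs);
  std::vector<std::size_t> ToInstrument;
  for (std::size_t I : Order) {
    if (!UF.unite(Edges[I].Src, Edges[I].Dst))
      ToInstrument.push_back(I);   // non-tree edge: gets a counter update
  }
  return ToInstrument;
}

int main() {
  // The diamond CFG from section 1.2, with estimated edge weights:
  // BB0->BB1 (100), BB1->BB2 (90), BB1->BB3 (10), BB2->BB4 (90), BB3->BB4 (10).
  std::vector<Edge> Edges = {{0, 1, 100}, {1, 2, 90}, {1, 3, 10}, {2, 4, 90}, {3, 4, 10}};
  for (std::size_t I : selectInstrumentedEdges(5, Edges))
    std::printf("instrument edge BB%d -> BB%d\n", Edges[I].Src, Edges[I].Dst);
}

With these weights, exactly one edge of weight 10 ends up outside the spanning tree, so the dynamic counter-update count for this CFG drops to 10; the counts of the remaining edges are recovered in the profile-use phase by flow conservation along the tree.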

3. Experimental Results

3.1 Measurement of the efficiency of instrumentation

Other than the runtime of the instrumented binaries, a more direct measurement of the instrumentation overhead is the sum of the raw profile count values. Note that regardless of what kind of instrumentation is used, the raw profile counts should be able to reconstruct all the edge count values for the whole program. All raw profile values are obtained by incrementing a counter variable by one, so the sum of the raw profile count values is roughly the dynamic instruction count of the instrumentation code. The lower the value, the more efficient the instrumentation.
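As a worked example, for the diamond CFG in section 1.2, instrumenting bb1->bb2 and bb1->bb3 yields a raw count sum of 90 + 10 = 100, while instrumenting bb0->bb1 and bb1->bb2 yields 100 + 90 = 190; under this metric, the first placement is clearly the more efficient one.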

3.2 LLVM instrumentation runtime for SPEC2006 C/C++ programs and SPEC2K eon
The performance speedup is computed as (FE_instrumentation_runtime / ME_instrumentation_runtime - 1).
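As an illustration with made-up numbers: if the FE-instrumented binary takes 150 seconds on a workload and the ME-instrumented binary takes 100 seconds, the reported speedup is 150/100 - 1 = 50%.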

We ran the experiments on all C/C++ programs in SPEC2006 and on 252.eon from SPEC2000. For the C programs, except for one outlier, 456.hmmer, there are only small ups and downs across the different programs. Late instrumentation improves hmmer a lot, but that is probably due to unrelated loop optimizations (90% of its runtime is spent in one loop nest).

For C++ programs, the performance impact of late instrumentation is very large, as expected. The following table shows the results. For some C++ programs the speedup is huge; for example, in 483.xalancbmk, late instrumentation improves performance by ~60%. Among all the SPEC C++ programs, only 444.namd is an outlier -- it uses a lot of macros and is a very C-like program.

Program         Speedup
471.omnetpp      16.03%
473.astar         5.00%
483.xalancbmk    58.57%
444.namd         -0.90%
447.dealII       60.47%
450.soplex        8.20%
453.povray       11.34%
252.eon          35.33%
-------------------------
Geomean          21.01%

PGOLateinstrumentationforLLVM.pdf (267 KB)

Rong Xu via llvm-dev <llvm-dev@lists.llvm.org> writes:

Instrumentation based Profile Guided Optimization (PGO) is a compiler
technique that leverages important program runtime information, such as
precise edge counts and frequent value information, to make frequently
executed code run faster. It's proven to be one of the most effective ways to
improve program performance.

An important design point of PGO is to decide where to place the
instrumentation. In current LLVM PGO, the instrumentation is done in the Clang
Front-End (will be referred as FE based instrumentation). Doing this early in
the front-end gives better coverage information as the there are more precise
line number information. The resulted profile also has relatively high
tolerance to compiler changes. All the compiler change after the
instrumentation point will not lead to mismatched IR that invalidates the
profile.

On the other hand, doing this too early gives the compiler fewer opportunities
to optimize the code before instrumentation. It has significant impact on
instrumentation runtime performance. In addition, it tends to produce larger
binary and profile data file. Our internal C++ benchmarks shows that FE
based instrumentation degrades the performance (compared to non-instrumented
version) by 58.3%, and in some extreme cases, the application speed/throughput
decreases by 95% (21x runtime slowdown).

While I agree that we can and should improve the overhead of profiling
instrumentation, I think that there's a lot of low hanging fruit in
improving the frontend instrumentation that makes some of the
comparisons below misleading.

The cost of maintaining two separate ways of profiling is fairly high,
so we really have to be sure that the gains are worth it if we're going
to go this route.

Running the instrumented binary too slow is not desirable in PGO for the
following reasons:
   * This slows down already lengthy build time. In the worst case, the
instrumented binary is so slow that it fails to run a representative workload,
because slow execution can leads to more time-outs in many server programs.
Other typical issues include: text size too big, failure to link instrumented
binaries, memory usage exceeding system limits.
   * Slow runtime affects the program behavior. Real applications sometimes
monitor the program runtime and have different execution path when the program
run too slow. This would defeat the underlying assumption of PGO and make it
less effective.

This work proposes an option to turn on middle-end based instrumentation (new)
that aims to speed up instrumentation runtime. The new instrumentation is
referred to as ME based instrumentation in this document. Our experimental
results show that ME instrumentation can speed up the instrumentation by 80%
on average for typical C++ programs. Here are the two main design objectives:
   * Co-existing with FE Instrumenter: We do not propose to replace the FE
based instrumentation because FE based instrumentation has its advantages and
applications. User can choose which phase to do instrumentation via command
line options.
   * PGO Runtime Support Sharing: The ME instrumenter will completely re-use
the existing PGO’s runtime support.

1. FE Based Instrumentation Runtime Overhead Analysis

Instrumented binaries are expected to run slower due to instrumentation code
inserted. With FE based instrumentation, the overhead is especially high and
runtime slowdown can be unacceptable in many cases. Further analysis shows
that there are 3 important factors contributing to FE instrumentation
slowdown :
   * [Main] Redundant counter updates of inlined functions. C++
     programs can introduce large abstraction penalties by using lots of
     small inline functions (assignment operators, getters, setters,
     ctors/dtors etc). Overhead of instrumenting those small functions
     can be very large, making training runs too slow and in some cases
     to usable;

As Sean pointed out, this problem might be solvable in other ways.
Beyond the naive approach of simply ignoring small functions with a
certain hotness that he suggested, the frontend instrumentation could be
easily tweaked to do better in these cases in a couple of ways.

1. It's pretty easy to recognize classes of function in the frontend. It
   would be trivial to optionally ignore getter-style functions or
   simple constructors, or whatever. This loses some information that's
   useful for the coverage use case, but not in a very harmful way.

2. These will generally be dominated by a counter update in the calling
   function. We could insert a pass at some point that recognizes redundant
   calls to instrprof.increment and combines them. This combine could
   even keep the fidelity of all of the counters by creating extra
   symbolic counters that map to multiple updates, if we're willing to
   pay the space cost.

   * Non-optimal placement of the count updates;

A lot of this has quite a bit of room for improvement in the frontend
instrumentation. More on this below.

   * A third factor is related value profiling (to be turned on in the
     future). Small and hot callee functions taking function pointer
     (callbacks) can incur overhead due to indirect call target
     profiling.

Can you elaborate on this? I can't see how a feature we don't have yet
can contribute to the slowdown, and I don't really understand how the
late instrumentation would help with whatever the hypothetical problem
is.

1.1 Redundant Counter Update

If checking the assembly of the instrumented binary generated by current LLVM
implementation, we can find many sequence of consecutive 'incq' instructions
that updating difference counters in the same basic block. As an example that
extracted from real binary:
  ...
incq 0xa91d80(%rip) # 14df4b8
<__llvm_profile_counters__ZN13LowLevelAlloc5ArenaC2Ev+0x1b8>
incq 0xa79011(%rip) # 14c6750
<__llvm_profile_counters__ZN10MallocHook13InvokeNewHookEPKvm>
incq 0xa79442(%rip) # 14c6b88
<__llvm_profile_counters__ZNK4base8internal8HookListIPFvPKvmEE5emptyEv>
incq 0x9c288b(%rip) # 140ffd8
<__llvm_profile_counters__ZN4base6subtle12Acquire_LoadEPVKl>
...

From profile use point of view, many of these counter update are redundant.
Considering the following example:
void bar(){
sum++;
}
void foo() {
bar();
}

FE based instrumentation needs to insert counter update for the only BB of the
bar().
bar: # @bar
# BB#0: # %entry
       incq .L__llvm_profile_counters_bar(%rip)
       incl sum(%rip)
       retq

It also need to insert the update the BB in function foo(). After inlining
bar to foo(), the code is:
foo: # @foo
# BB#0: # %entry
       incq .L__llvm_profile_counters_foo(%rip)
       incq .L__llvm_profile_counters_bar(%rip)
       incl sum(%rip)
       retq

If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is
redundant -- the counter won't help downstream optimizations. On the other
hand, if bar() is a large function and may not be suitable to be inlined for
all callsites, this counter updated is necessary in order to produce more
accurate profile data for the out-of-line instance of the callee.

If foo() is a hot function, the overhead of updating two counters can be
significant. This is especially bad for C++ program where there are tons of
small inline functions.

On the other hand, unless foo is bar's only caller, you're losing
information by dropping one of these counters. How do you determine when
this transformation is safe?

There is another missing opportunity in FE based instrumentation. The small
functions’ control flow can usually be simplified when they are inlined into
caller contexts. Once the control flow is simplified, many counter updates can
therefore be eliminated. This is only possible for a middle end based late
instrumenter. Defining a custom clean-up pass to remove redundant counter
update is unrealistic and cannot be done in a sane way without destroying the
profile integrity of neither the out-of-line nor inline instances of the
callee.

Wait, but doing it this way destroys the profile integrity of one of
those anyway - it isn't updating all of the counters. How is doing this
in a separate pass any different?

A much simpler and cleaner solution is to do a pre-inline pass to inline all
the trivial inlines before instrumentation. In addition to removing the
unnecessary count updates for the inline instances, another advantage of
pre-inline is to provide context sensitive profile for these small inlined
functions. This context senstive profile can further improve the PGO based
optimizations. Here is a contrived example:
void bar (int n) {
  if (n&1)
    do_sth1();
  else
    do_sth2();
}

void caller() {
  int s = 1;
  for (; s<100; s+=2)
    bar(s);

  for (s = 102; s< 200; s+=2)
    bar(s);
}

The direction of the branch inside bar will be totally opposite between two
different callsites in ‘caller’. Without pre-inlining, the branch probability
will be 50-50 which will be useless for later optimizations. With
pre-inlining, the profile will have the perfect branch count for each
callsite. The positive performance impact of context sensitive profiling due
to pre-inlining has been observed in real world large C++ programs. Supporting
context sensitive profiling is another way to solve this, but it will
introduce large additional runtime/memory overhead.

So this is a pretty interesting transformation. Do you have any data on
exactly how large the overhead of doing this with a context sensitive
approach would be, or how much extra complexity it would entail?

1.2 Non-optimal placement of count update

Another much smaller showdown factor is the placement of the counter updates.
Current front-end based instrumentation applies the instrumentation to each
front-end lexical construct. It also minimizes the number of static
instrumentations. Note that it always instruments the entry count of the CFG.
This may result in higher dynamic instruction counts. For example,
     BB0
     > 100
    BB1
90 / \ 10
  BB2 BB3
90 \ / 10
    BB4
Like the the above example, FE based instrumentation will always insert count
update in BB0. The dynamic instrumentation count will be either 110
(Instrument bb0->bb1 and bb1->bb2) or 190 (bb0->bb1 and bb1->bb3). A better
instrumentation is to instrument (bb1->bb2 and bb1->bb3) where the dynamic
instrumentation count is 100.

Our experimental shows that the optimal placement based on edge hotness can
improve instrumented code performance by about 10%. While it’s hard to find
the optimal placement of count update, compiler heuristics can be used the
get the better placement. These heuristics can be based on static profile
prediction or user annotations (like __buildin_expect) to estimate the
relative edge hotness and put instrumentations on the less hot edges. The
initial late instrumentation has not fully implemented this placement strategy
yet. With that implemented, we expect even better results than what is
reported here. For real world programs, another major source of the slowdown
is the data racing and false sharing of the counter update for highly threaded
programs. Pre-inlining can alleviate this problem as the counters in the
inline instances are not longer shared. But the complete solution to the data
racing issue is orthogonal to the problem we try to solve here.

I suspect most of these heuristics can be applied to the frontend
approach today. User annotations are visible to the frontend by
definition, so we can pretty easily modify our counter placement based
on those. Using previous profiles introduces a need to record somewhere
why we made an "abnormal" decision, but that problem exists with the
late instrumentation as well.

2. High Level Design

We propose to perform a pre-profile inline pass before the PGO instrumentation
pass. Since the instrumentation pass is after inine, it has to be done in the
middle-end.

(1) The pre-inline pass
We will invoke a pre-inline pass before the instrumentation. When PGO is on,
the inlining will be split into two passes:
   * A pre-inline pass that is scheduled before the profile instrumentation/
annotation
   * A post-inline pass which is the regular inline pass after instrumentation
/annotation
By design, all beneficial callsites without requiring profile data should be
inlined in the pre-inline pass. It includes all callsites that will shrink
code size after inlining. All the remaining callsites will be left to the
regular inline pass when profile data is available.

After pre-inline, a CFG based profile instrumentation/annotation will be done.
A minimum weight spanning tree (MST) in CFG is first computed, then only the
edges not in the MST will be instrumented. The counter update instructions are
placed in the basic blocks.

(2) Minimum Spanning Tree (MST) based instrumentation
A native way of instrumentation is to insert a count update for every edge in
CFG which will result in too many redundant updates that makes the runtime
very slow. Knuth [1] proposed a minimum spanning tree based method: given a
CFG, first compute a spanning tree. All edges that not in the MST will be
instrumented. In the profile use compilation, the counters are populated (from
the leaf of the spanning tree) to all the edges. Knuth proved this method
inserts the minimum number of instrumentations. MST based method only
guarantees the number static instrumentation are minimized, not the dynamic
instance of instrumentation. To reduce the number of dynamic instrumentation,
edges of potentially high counts will be put into MST first so that they will
have less chance to be instrumented.

3. Experimental Results

3.1 Measurement of the efficiency of instrumentation
Other than the runtime of the instrumented binaries, a more direct measurement
of the instrumentation overhead is the the sum of the raw profile count
values. Note that regardless what kind of instrumentations are used, the raw
profile count should be able to reconstruct all the edge count values for the
whole program. All raw profile value are obtained via incrementing the counter
variable value by one. The sum of the raw profile count value is roughly the
dynamic instruction count of the instrumented code. The lower of the value,
the more efficient of the instrumentation.

3.2 LLVM instrumentations runtime for SPEC2006 C/C++ programs and SPEC2K eon
The performance speedup is computed by (FE_instrumentation_runtime /
ME_instrumentation_runtime - 1)

We run the experiments on all C/C++ programs in SPEC2006 and 252.eon from
SPEC2000. For C programs, except for one outlier 456.hmmer, there are small
ups and downs across different programs. Late instrumentation improves hmmer a
lot, but it is probably due to unrelated loop optimizations (90% of runtime
spent in one loop nest).

For C++ programs, the performance impact of late instrumentation is very
large, which is as expected. The following table shows the result. For some
C++ program, the time speedup is huge. For example, in 483.xalancbmk, late
instrumentation speeds up performance by ~60%. Among all the SPEC C++
programs, only 444.namd is an outlier -- it uses a lot of macros and is a very
C like program.

Program Speedup
471.omnetpp 16.03%
473.astar 5.00%
483.xalancbmk 58.57%
444.namd -0.90%
447.dealII 60.47%
450.soplex 8.20%
453.povray 11.34%
252.eon 35.33%
-------------------------
Geomean 21.01%

3.3 Statistics of LLVM profiles for SPEC2006 C/C++ programs
We also collect some statistic of the profiles generated by FE based
instrumentation and late instrumentation, namely, the following information:
   1. the number of functions that being instrumented,
   2. the result profile file size,
   3. the sum of raw count values that was mentioned earlier -- we used it to
measure the efficiency of the instrumentation.
Next table shows the ratios of the each metrics by late instrumentation for
the C++ programs, with FE based instrumentation as the base : column (1) shows
the ratios of instrumented functions; column (2) shows the ratios of the
profile file size; column (3) shows the ratios of the sum of raw count values.

                (1) (2) (3)
471.omnetpp 85.36% 110.26% 46.52%
473.astar 64.86% 72.72% 63.13%
483.xalancbmk 51.83% 56.11% 35.77%
444.namd 75.36% 82.82% 85.77%
447.dealII 43.42% 46.46% 26.75%
450.soplex 71.80% 87.54% 51.19%
453.povray 78.68% 83.57% 64.37%
252.eon 72.06% 91.22% 30.02%
----------------------------------------
Geomean 66.50% 76.36% 47.01%

For FE based instrumentation, profile count variables generated for the dead
functions will not be removed (like __llvm_prf_names, __llvm_prf_data, and
__llvm_prf_cnts) from the data/text segment, nor in the result profile. There
is a recent patch that removes these unused data for COMDAT functions, but
that patch won’t touch regular functions. This is the main reason for the
larger number of instrumented functions and larger profile file size for the
FE based instrumentation. The reduction of the sum of raw count values is
mainly due to the elimination of redundant profile updates enabled by the
pre-inlining.

For C programs, we observe similar improvement in the profile size (geomean
ratio of 73.75%) and smaller improvement in the number of instrumented
functions (geomean ratio of 87.49%) and the sum of raw count values (geomean
of 82.76%).

3.4 LLVM instrumentations runtime performance for Google internal C/C++
benchmarks

We also use Google internal benchmarks (mostly typical C++ applications) to
measure the relative performance b/w FE based instrumentation and late
instrumentation. The following table shows the speedup of late
instrumentation vs FE based instrumentation. Note that C++benchmark01 is a
very large multi-threaded C++ program. Late instrumentation sees 4x speedup.
Larger than 3x speedups are also seen in many other programs.

C++_bencharmk01 416.98%
C++_bencharmk02 6.29%
C++_bencharmk03 22.39%
C++_bencharmk04 28.05%
C++_bencharmk05 2.00%
C++_bencharmk06 675.89%
C++_bencharmk07 359.19%
C++_bencharmk08 395.03%
C_bencharmk09 15.11%
C_bencharmk10 5.47%
C++_bencharmk11 5.73%
C++_bencharmk12 2.31%
C++_bencharmk13 87.73%
C++_bencharmk14 7.22%
C_bencharmk15 -0.51%
C++_bencharmk16 59.15%
C++_bencharmk17 358.82%
C++_bencharmk18 861.36%
C++_bencharmk19 29.62%
C++_bencharmk20 11.82%
C_bencharmk21 0.53%
C++_bencharmk22 43.10%
---------------------------
Geomean 83.03%

3.5 Statistics of LLVM profiles for Google internal benchmarks

The following shows the profile statics using Google internal benchmarks.
                         (1) (2) (3)
C++_bencharmk01 36.84% 40.29% 2.32%
C++_bencharmk02 39.20% 40.49% 42.39%
C++_bencharmk03 39.37% 40.65% 23.24%
C++_bencharmk04 39.13% 40.68% 17.70%
C++_bencharmk05 36.58% 38.27% 51.08%
C++_bencharmk06 29.50% 27.87% 2.87%
C++_bencharmk07 29.50% 27.87% 1.73%
C++_bencharmk08 29.50% 27.87% 4.17%
C_bencharmk09 53.95% 68.00% 11.18%
C_bencharmk10 53.95% 68.00% 31.74%
C++_bencharmk11 36.40% 37.07% 46.12%
C++_bencharmk12 38.44% 41.90% 73.59%
C++_bencharmk13 39.28% 42.72% 29.56%
C++_bencharmk14 38.59% 42.20% 13.42%
C_bencharmk15 57.45% 48.50% 66.99%
C++_bencharmk16 36.86% 42.18% 16.53%
C++_bencharmk17 37.82% 39.77% 13.68%
C++_bencharmk18 37.82% 39.77% 7.96%
C++_bencharmk19 37.52% 40.46% 1.85%
C++_bencharmk20 32.37% 30.44% 19.69%
C_bencharmk21 37.63% 40.42% 88.81%
C++_bencharmk22 36.28% 36.92% 21.62%
--------------------------------------------------
Geomean 38.22% 39.96% 15.58%

4. Implementation Details:

We need to add new option(s) for the alternative PGO instrumentation pass in
the middle end. It can in one of the following forms:

   (1) Complete new options that are on par with current PGO options:
-fprofile-late-instr-generate[=<profile_file>]? for PGO Instrumentation, and
-fprofile-late-instr-use[=<profile_file>]? for PGO USE.
   (2) Or, late instrumentation can be turned on with an additional option
-fprofile-instr-late with current PGO options. I. e. -fprofile-instr-late
-fprofile-instr-generate[=<profile_file>]? for PGO instrumentation, and
-fprofile-instr-late -fprofile-instr-use[=<profile_file>]? for PGO use.
   (3) Alternatively to (2), only keep -fprofile-instr-late option in PGO
instrumentation. Adding a magic tag in profile so that FE based profile and
late instrumented profile can be automatically detected by profile loader In
PGO use compilation. This requires a slight profile format change.

In our prototype implementation, two new passes are added in the beginning of
PassManagerBuilder::populateModulePassManager(), namely PreProfileInlinerPass
and PGOLateInstrumentationPass.

4.1 Pre-inline pass:

It is controlled by back-end option "-preinline" and "-disable-preinline". If
the user specifies any llvm option of "-fprofile-late-instr-{generate|use},
option "-mllvm -preinline" will be automatically inserted in the driver.. To
disable the pre-inliner when late instrumentation is enabled, use option
"-mllvm -disable-preinline".

For now, only minimum tuning is done for the pre-inliner, which simply adjusts
the inline threshold: If -Oz is specified, the threshold is set to 25.
Otherwise, it is 75.

The following clean up passes are added to PassManager, right after the
PreProfileInline pass:
  createEarlyCSEPass()
  createJumpThreadingPass()
  createCorrelatedValuePropagationPass()
  createCFGSimplificationPass()
  createInstructionCombiningPass()
  createGVNPass(DisableGVNLoadPRE)
  createPeepholePASS()
Some of them might not be necessary.

4.2 Late Instrumentation Pass:
The late instrumentation is right after the pre-inline pass and it's cleanup
passes. It is controlled by opt option "-pgo-late-instr-gen" and
"-pgo-late-instr-use". For "-pgo-late-instr-use" option, the driver will
provide the profile name.
For "-pgo-late-instr-gen", a pass that calls createInstrProfilingPass() is
also added to PassManager to lower the instrumentation intrinsics

PGOLateInstrumeatnion is a module pass that applies the instrumentation to
each function by class PGOLateInstrumentationFunc. For each function, perform
the following steps:
   1. First collect all the CFG edges. Assign an estimated weight to each
edge. Critical edges and back-edges are assigned to high value of weights. One
fake node and a few fake edges (from the fake node to the entry node, and from
all the exit nodes to the fake node) are also added to the worklist.
   2. Construct the MST. The edges with the higher weight will be put to MST
first, unless it forms a cycle.
   3. Traverse the CFG and compute the CFG hash using CRC32 of the index of
each BB.
The above three steps are the same for profile-generate and profile-use
compilation.

In the next step, for profile-generation compilation, all the edges that not
in the MST are instrumented. If this is a critical edge, split the edge first.
The actual instrumentation is to generate Intrinsic::instrprof_increment() in
the instrumented BB. This intrinsic will be lowed by pass
createInstrProfilingPass().
                                                

For -fprofile-use compilation, first read in the counters and the CFG hash
from the profile file. If the CFG hash matches, populate the counters to all
the edges in reverse topological order of the MST. Once having all the edge
counts, set the branch weights metadata for the IR having multiple branches.
Also apply the cold/hot function attributes based on function level counts.

4.3 Profile Format:

The late instrumentation profile is mostly the same as the one from front-end
instrument-ion. The difference is
   * Function checksums are different.
   * Function entry counts are no longer available.
For llvm-profdata utility, options -lateinstr needs to be used to
differentiate FE based and late instrumentation profiles, unless a magic tag
is added to the profile.

If we do this, we'll probably want to add something to the file to
differentiate the profiles, otherwise the error messages will be pretty
bad.

One aspect of this that I have not seen discussed is that middle-end
instrumentation enables PGO optimizations to front-ends other than Clang.

While I agree that FE instrumentation could be improved, it still requires
every FE to implement essentially the same common functionality. Having
PGO instrumentation generated in the middle-end allows every FE to
automatically take advantage of PGO.

Additionally, some of the overhead imposed by FE instrumentation is not
really all that easy to get rid of. You end up duplicating functionality
that is more naturally implemented in the middle end.

I see the two approaches as supplementary, rather than complementary. One
does not negate the other. Some of the optimizations we'd do in the FE,
may hurt coverage. Instead, by instrumenting in the middle end, you can
focus exclusively on performance (coverage be damned).

Diego.

Rong Xu via llvm-dev <llvm-dev@lists.llvm.org> writes:

Instrumentation based Profile Guided Optimization (PGO) is a compiler
technique that leverages important program runtime information, such as
precise edge counts and frequent value information, to make frequently
executed code run faster. It's proven to be one of the most effective ways to
improve program performance.

An important design point of PGO is to decide where to place the
instrumentation. In current LLVM PGO, the instrumentation is done in the Clang
Front-End (will be referred as FE based instrumentation). Doing this early in
the front-end gives better coverage information as the there are more precise
line number information. The resulted profile also has relatively high
tolerance to compiler changes. All the compiler change after the
instrumentation point will not lead to mismatched IR that invalidates the
profile.

On the other hand, doing this too early gives the compiler fewer opportunities
to optimize the code before instrumentation. It has significant impact on
instrumentation runtime performance. In addition, it tends to produce larger
binary and profile data file. Our internal C++ benchmarks shows that FE
based instrumentation degrades the performance (compared to non-instrumented
version) by 58.3%, and in some extreme cases, the application speed/throughput
decreases by 95% (21x runtime slowdown).

While I agree that we can and should improve the overhead of profiling
instrumentation, I think that there's a lot of low hanging fruit in
improving the frontend instrumentation that makes some of the
comparisons below misleading.

The cost of maintaining two separate ways of profiling is fairly high,

I think the initial bring up cost may be high, but the long term
maintenance cost may not -- I expect it to be fairly stable.

As Sean pointed out, this problem might be solveable in other ways.
Beyond the naive approach of simply ignoring small functions with a
certain hotness that he suggested, the frontend instrumentation could be
easily tweaked to do better in these cases in a couple of ways.

1. It's pretty easy to recognize classes of function in the frontend. It
   would be trivial to optionally ignore getter-style functions or
   simple constructors, or whatever. This loses some information that's
   useful for the coverage use case, but not in a very harmful way.

There are many limitations though:

1) functions that can be trivially cleaned up (CFG simplified) after inlining
2) wrapper functions which call another function when exception
handling is on

2. These will generally be dominated by a counter update in the calling
   function. We could insert a pass at some point that recognizes redundant
   calls to instrprof.increment and combines them. This combine could
   even keep the fidelity of all of the counters by creating extra
   symbolic counters that map to multiple updates, if we're willing to
   pay the space cost.

This sounds complicated. How can the pass determine which updates are
really redundant? It does not know which are for true inline
instances and which are for potential out-of-line instances.

   * Non-optimal placement of the count updates;

A lot of this has quite a bit of room for improvement in the frontend
instrumentation. More on this below.

   * A third factor is related value profiling (to be turned on in the
     future). Small and hot callee functions taking function pointer
     (callbacks) can incur overhead due to indirect call target
     profiling.

Can you elaborate on this? I can't see how a feature we don't have yet
can contribute to the slowdown, and I don't really understand how the
late instrumentation would help with whatever the hypothetical problem
is.

The following code is not uncommon:

void callback(void);

template <typename CallBack_> void runner (CallBack_ cbk) {

   cbk();
}

void test()
{
     runner(callback);
}

If runner is instantiated with 'callback', there will be a trivially
eliminatable indirect call in it -- this is similar to other CFG
cleanup that can be done with pre-cleanup passes.

Value Profiling is not in LLVM yet, but we know the work is pending.

If bar() should be always inlined, .L__llvm_profile_counters_bar(%rip) is
redundant -- the counter won't help downstream optimizations. On the other
hand, if bar() is a large function and may not be suitable to be inlined for
all callsites, this counter updated is necessary in order to produce more
accurate profile data for the out-of-line instance of the callee.

If foo() is a hot function, the overhead of updating two counters can be
significant. This is especially bad for C++ program where there are tons of
small inline functions.

On the other hand, unless foo is bar's only caller, you're losing
information by dropping one of these counters. How do you determine when
this transformation is safe?

That is the point of splitting inlining into 1) pre-profile inlining
and 2) regular post-profile inlining. Pre-inlining does not require
profile data and is safe to do. profile-use and profile-instr will
share the same pipeline, so what profile-use sees after pre-inlining
is a complete picture of the out-of-line instance's profile. Note that
regular inlining can also inline functions after instrumentation is
done, but those inline copies will still update the original counters,
just like the FE based instrumentation.

Note that the above serves the optimizer perfectly, but it does not
necessarily satisfy coverage testing's requirements, which are FE
instrumentation's strength.

There is another missing opportunity in FE based instrumentation. The small
functions’ control flow can usually be simplified when they are inlined into
caller contexts. Once the control flow is simplified, many counter updates can
therefore be eliminated. This is only possible for a middle end based late
instrumenter. Defining a custom clean-up pass to remove redundant counter
update is unrealistic and cannot be done in a sane way without destroying the
profile integrity of neither the out-of-line nor inline instances of the
callee.

Wait, but doing it this way destroys the profile integrity of one of
those anyway - it isn't updating all of the counters. How is doing this
in a separate pass any different?

See above: it does not destroy the profile integrity of the
out-of-line instances. For inline instances, since their control
flows are now integrated with the caller's CFG, it allows a more
globally optimal edge selection and profile update placement.

The direction of the branch inside bar will be totally opposite between two
different callsites in ‘caller’. Without pre-inlining, the branch probability
will be 50-50 which will be useless for later optimizations. With
pre-inlining, the profile will have the perfect branch count for each
callsite. The positive performance impact of context sensitive profiling due
to pre-inlining has been observed in real world large C++ programs. Supporting
context sensitive profiling is another way to solve this, but it will
introduce large additional runtime/memory overhead.

So this a pretty interesting transformation. Do you have any data on
exactly how large the overhead of doing this with a context sensitive
approach would be, or how much extra complexity would entail?

I would expect the cost of doing so to be quite high -- it is not
uncommon for some callees to have hundreds of static callsites, so
even one level of context sensitivity can incur a high cost. The
caller code also needs to set up the contexts in order for the callee
to update the right counter set. Profile-use also becomes more
complicated.

... the complete solution to the data racing issue is orthogonal to
the problem we try to solve here.

I suspect most of these heuristics can be applied to the frontend
approach today. User annotations are visible to the frontend by
definition, so we can pretty easily modify our counter placement based
on those.

The static prediction logic needs to be duplicated in Clang or any
other FEs in order to support this.

thanks,

David

One aspect of this that I have not seen discussed is that middle-end
instrumentation enables PGO optimizations to front-ends other than Clang.

While I agree that FE instrumentation could be improved, it still requires
every FE to implement essentially the same common functionality. Having
PGO instrumentation generated in the middle-end, allows us every FE to
automatically take advantage of PGO.

This is a really good point, and I agree with it. We may have gotten off on
the wrong foot since Rong's email focused so heavily on comparing with the
frontend instrumentation. As far as I see it, Rong's proposal has a couple
different parts:

1. Infrastructure for IR-level instrumentation-based PGO
2. Changes to the pass pipeline so that a hypothetical IR-level
instrumentation-based PGO is more effective
3. MST algorithm with profile feedback for optimal placement of counter
updates.

I think 1. is a no-brainer, if only so that all LLVM clients can benefit
from PGO, and also (as you pointed out below) so that it can have an
exclusive focus on performance. If it is sufficiently flexible, it may even
make sense to restrict clang's frontend instrumentation-based profiling to
non-performance stuff, and have clang directly interoperate with the
IR-level PGO for performance-related PGO use cases, just like any other
frontend would.

Philip and Sanjoy, out of curiosity do you guys use your own
instrumentation placement for PGO? Is an IR-level PGO infrastructure
upstream something you guys would be interested in?

I think that 2. is something that once we have 1. we will be able to
evaluate better, but for now my opinion is that we should be able to make
good progress without digging into that.

I think that 3. is a no-brainer if it provides a really significant win,
but without 1. we can't really measure its effect in isolation. It also has
a usability problem since it requires feeding in an existing profile for
the *instrumented* build, but if the benefit is very significant this may
be worth it for some users. We will probably be able to easily refactor 1.
as needed into an MST approach that degrades gracefully to using static
heuristics in the absence of real profile information, so is not a
maintenance burden (maybe even helps by providing a good framework in which
to develop effective static heuristics).

For the time being, I think we can avoid discussion of 2. and 3. until we
have more of 1. working. So I think it would be most productive if we focus
this discussion on 1.

Additionally, some of the overhead imposed by FE instrumentation is not
really all that easy to get rid of. You end up duplicating functionality
that is more naturally implemented in the middle end.

Yeah, I was looking into a couple of other simple approaches and quickly
found out that I was basically replicating much of the sort of logic that
the inliner already has.

-- Sean Silva

We have entirely separate infrastructure for this. An IR-level PGO instrumentation infrastructure would be of no immediate benefit to me. Even in the long term, we'd have some restrictions on the profile collection points which are easy to reason about in the frontend, but would be hard in the middle end.

Philip

We collected more data to address some of the questions from the reviewers. Note this time we use clang itself as the benchmark. We choose clang because we think it’s a typical C++ program and the reviewers here have good knowledge of the code base.

What we measure is running time for clang to compile a large preprocessed source file (4.98M lines of .ii file), using different compilation modes. All the numbers reported here are the average running time of 5 runs in seconds.

(1) Performance of late instrumentation vs. not instrumenting single-BB functions

We first compare the performance of the various instrumentation configurations.

Thank you for sharing the data. I haven't been following the discussion, but this data made for very interesting reading on its own.

Philip

Justin, Sean and other people interested in this proposal,

I'm wondering if you have had a chance to read the new experiment results in my last email, sent 2 weeks ago. Can you share your thoughts, or do you have other tests that you want to run?

I'm in the final stage of preparing the patch. If you are OK, I can send out the patch soon.

Thanks,

-Rong

Justin, Sean and other people interested in this proposal,

I'm wondering if you have chances to read the new experiment results in my
last email sent 2 weeks ago. Can you share you thoughts, or you have other
tests that you want to to run?

See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation
pass makes sense (you didn't need to provide any performance data to
support this; there are plenty of good reasons), but there are a couple
independent parts. Have you been able to work on splitting out any of them?

I'm in the final stage of preparing the patch. If you are OK, I can sent
out the patch soon.

I'm not sure what you mean by "the" patch. It seems pretty clear that there
are multiple sub-parts to this. Could you send an RFC for part 1 that I
described? We especially need to discuss the interface for frontends e.g.
clang command line interface, when a user passes a profile file how do we
thread that information back to the middle-end, details for the runtime
interoperation (things like function hash will have different meaning
between IR-level and Clang instrumentation), etc.

-- Sean Silva

Justin, Sean and other people interested in this proposal,

I'm wondering if you have chances to read the new experiment results in
my last email sent 2 weeks ago. Can you share you thoughts, or you have
other tests that you want to to run?

See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation
pass makes sense (you didn't need to provide any performance data to
support this; there are plenty of good reasons), but there are a couple
independent parts. Have you been able to work on splitting out any of them?

I'm in the final stage of preparing the patch. If you are OK, I can sent
out the patch soon.

I'm not sure what you mean by "the" patch. It seems pretty clear that
there are multiple sub-parts to this. Could you send an RFC for part 1 that
I described? We especially need to discuss the interface for frontends e.g.
clang command line interface, when a user passes a profile file how do we
thread that information back to the middle-end, details for the runtime
interoperation (things like function hash will have different meaning
between IR-level and Clang instrumentation), etc.

Those are good suggestions!

thanks,

David

This is a late reply -- the email somehow skipped my inbox.

Philip and Sanjoy, out of curiosity do you guys use your own instrumentation
placement for PGO? Is an IR-level PGO infrastructure upstream something you
guys would be interested in?

I think that 2. is something that once we have 1. we will be able to
evaluate better, but for now my opinion is that we should be able to make
good progress without digging into that.

I think that 3. is a no-brainer if it provides a really significant win, but
without 1. we can't really measure its effect in isolation. It also has a
usability problem since it requires feeding in an existing profile for the
*instrumented* build, but if the benefit is very significant this may be
worth it for some users. We will probably be able to easily refactor 1. as
needed into an MST approach that degrades gracefully to using static
heuristics in the absence of real profile information, so is not a
maintenance burden (maybe even helps by providing a good framework in which
to develop effective static heuristics).

Regarding 3, I am not sure what usability issue you are referring to.
Can you elaborate?

thanks,

David

Justin, Sean and other people interested in this proposal,

I'm wondering if you have chances to read the new experiment results in
my last email sent 2 weeks ago. Can you share you thoughts, or you have
other tests that you want to to run?

See my email from Aug 11 (3 weeks ago). Adding an IR-level instrumentation
pass makes sense (you didn't need to provide any performance data to
support this; there are plenty of good reasons), but there are a couple
independent parts. Have you been able to work on splitting out any of them?

I re-read your comments from Aug 11.

As far as I see it, Rong's proposal has a couple different parts:

1. Infrastructure for IR-level instrumentation-based PGO
2. Changes to the pass pipeline so that a hypothetical IR-level

instrumentation-based PGO is more effective

3. MST algorithm with profile feedback for optimal placement of counter

updates.

In my implementation, the MST algorithm is the main component of 1. The only
IR change is to insert instrprof_increment intrinsic calls (which will be
lowered in createInstrProfilingPass).
I'm not quite sure about 3. Do you mean the MST algorithm, or using one profile
to guide the MST algorithm to get the optimal placement? I do have the code
for both, but the latter was just for experimental purposes. It is going to be
hard to use in real applications (for example, profile-use would also need the
bootstrap profile in order to read the real profile).

I'm in the final stage of preparing the patch. If you are OK, I can sent
out the patch soon.

I'm not sure what you mean by "the" patch. It seems pretty clear that
there are multiple sub-parts to this. Could you send an RFC for part 1 that
I described? We especially need to discuss the interface for frontends e.g.
clang command line interface, when a user passes a profile file how do we
thread that information back to the middle-end, details for the runtime
interoperation (things like function hash will have different meaning
between IR-level and Clang instrumentation), etc.

I agree with your approach. When I said "the patch", I really meant 'a
series of patches'.

Thanks for the suggestion.

-Rong

This is a late reply -- the email somehow skipped my inbox.

> Philip and Sanjoy, out of curiosity do you guys use your own instrumentation
> placement for PGO? Is an IR-level PGO infrastructure upstream something you
> guys would be interested in?
>
> I think that 2. is something that once we have 1. we will be able to
> evaluate better, but for now my opinion is that we should be able to make
> good progress without digging into that.
>
> I think that 3. is a no-brainer if it provides a really significant win, but
> without 1. we can't really measure its effect in isolation. It also has a
> usability problem since it requires feeding in an existing profile for the
> *instrumented* build, but if the benefit is very significant this may be
> worth it for some users. We will probably be able to easily refactor 1. as
> needed into an MST approach that degrades gracefully to using static
> heuristics in the absence of real profile information, so is not a
> maintenance burden (maybe even helps by providing a good framework in which
> to develop effective static heuristics).

Regarding 3, I am not sure what usability issue you are referring to.
Can you elaborate?

A user will need to provide a profile to do the "instr-generate" phase, and as
Rong points out the "instr-use" phase will need to also be supplied that
same profile in order to interpret the counters. This adds significant
complexity to any PGO workflow and is not essential (it may be desirable at
some point, but it can be added to 1. as an incremental step at any time).

-- Sean Silva

This is a late reply -- the email somehow skipped my inbox.

> Philip and Sanjoy, out of curiosity do you guys use your own
> instrumentation
> placement for PGO? Is an IR-level PGO infrastructure upstream something
> you
> guys would be interested in?
>
> I think that 2. is something that once we have 1. we will be able to
> evaluate better, but for now my opinion is that we should be able to
> make
> good progress without digging into that.
>
> I think that 3. is a no-brainer if it provides a really significant win,
> but
> without 1. we can't really measure its effect in isolation. It also has
> a
> usability problem since it requires feeding in an existing profile for
> the
> *instrumented* build, but if the benefit is very significant this may be
> worth it for some users. We will probably be able to easily refactor 1.
> as
> needed into an MST approach that degrades gracefully to using static
> heuristics in the absence of real profile information, so is not a
> maintenance burden (maybe even helps by providing a good framework in
> which
> to develop effective static heuristics).

Regarding 3, I am not sure what usability issue you are referring to.
Can you elaborate?

A user will need to provide a profile to do the "instr-generate" phase

Late Instr-generate does not need a profile to work. It works the same
way (in terms of usability) as FE based instrumentation. You can also
use the opt tool to do the instrumentation.

and as
Rong points out the "instr-use" phase will need to also be supplied that
same profile in order to interpret the counters. This adds significant
complexity to any PGO workflow and is not essential (it may be desirable at
some point, but it can be added to 1. as an incremental step at any time).

The PGO workflow does not change at all. The only requirement is that the
profile annotation pass in profile-use needs to be at the same point
as where the instrumentation is inserted -- this is the same as the FE based
method, but more flexible.

David

Justin, Sean and other people interested in this proposal,

I'm wondering if you have had a chance to read the new experiment results in
my last email sent 2 weeks ago. Can you share your thoughts, or do you have
other tests that you want to run?

See my email from Aug 11 (3 weeks ago). Adding an IR-level
instrumentation pass makes sense (you didn't need to provide any
performance data to support this; there are plenty of good reasons), but
there are a couple independent parts. Have you been able to work on
splitting out any of them?

I re-read your comments from Aug 11.
> As far as I see it, Rong's proposal has a couple different parts:
>
> 1. Infrastructure for IR-level instrumentation-based PGO
> 2. Changes to the pass pipeline so that a hypothetical IR-level
>    instrumentation-based PGO is more effective
> 3. MST algorithm with profile feedback for optimal placement of counter
>    updates.

In my implementation, the MST algorithm is the main component of 1.

I guess for 3. I mean "profile feedback for optimal placement of counter
updates" since that has significant usability implications (needing to pass
extra files). I have nothing against the MST algorithm per se.

-- Sean Silva

>> Regarding 3, I am not sure what usability issue you are referring to.
>> Can you elaborate?
>
>
> A user will need to provide a profile to do the "instr-generate" phase

Late Instr-generate does not need a profile to work.

Exactly. Therefore, 3. is a separate step (in particular the complexity of
needing to pass extra files around).

-- Sean Silva

Passing an extra file to guide the MST is orthogonal to the proposal -- I
don't expect it to be included in the base implementation. So yes,
that step can be explored later once the base infrastructure is ready.

thanks,

David

We collected more data to address some of the questions from the reviewers. Note that this time we use clang itself as the benchmark. We chose clang because we think it’s a typical C++ program and the reviewers here have good knowledge of the code base.

What we measure is the running time for clang to compile a large preprocessed source file (a 4.98M-line .ii file) using different compilation modes. All the numbers reported here are the average running time of 5 runs, in seconds.

(1) Performance comparison between late instrumentation and not instrumenting single-BB functions

We first compare various instrumentation performance.

Config                      wall_time_for_instr (s)   ratio_vs_base   profile_size
(1) base O2                 80.386                    100.0%          --
(2) FE-based Instr          201.658                   250.8%          65238880
(3) late Instr              103.662                   129.0%          14860144
(4) (3) + w/o pre-inline    199.924                   248.7%          70762720
(5) (4) + Silva             119.904                   149.2%          24499528

Config (5) used the simple heuristic that Sean Silva proposed: not instrumenting single-BB functions that contain fewer than 10 instructions (excluding debug and phi statements).
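
For concreteness, this check can be sketched roughly as follows. This is a minimal sketch, assuming the instruction count excludes debug intrinsics and PHI nodes as described above; the function name and exact API usage are illustrative and not the code used in the experiment.

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"

// Roughly: skip instrumentation for functions with a single basic block
// containing fewer than 10 "real" instructions, where debug intrinsics
// and PHI nodes are not counted.  (Names and threshold are illustrative.)
static bool skipSmallSingleBBFunction(const llvm::Function &F) {
  if (F.size() != 1) // more than one basic block: always instrument
    return false;
  unsigned NumInsts = 0;
  for (const llvm::Instruction &I : F.getEntryBlock())
    if (!llvm::isa<llvm::PHINode>(I) && !llvm::isa<llvm::DbgInfoIntrinsic>(I))
      ++NumInsts;
  return NumInsts < 10;
}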

We can see:

  1. The simple heuristic of not instrumenting small single-BB functions improves instrumentation performance, as expected.
  2. Using the simple heuristic is still slower than late instrumentation with pre-inlining: the latter is about 15% faster.
  3. Late instrumentation produces the smallest profile: it’s 39% smaller than with the simple heuristic.

The result is expected as pre-inlining can handle more cases than the simple heuristic. There is still a significant performance gap between the simple heuristic (5) and late instrumentation (3).

We also used a few larger internal benchmarks to further validate the above result. The following table shows the slowdown compared to the base O2 (a more negative number means a larger performance drop). The labels (2) to (5) refer to the same configs as in the previous table.

Program          (2)       (3)       (4)       (5)
C++benchmark16   -45.24%   -12.93%   -43.84%   -24.74%
C++benchmark17   -90.86%   -58.19%   -87.77%   -80.62%
C++benchmark18   -95.32%   -54.75%   -91.21%   -82.56%

We can see the same trend as in the clang benchmark: the simple heuristic (5) recovers a lot of the performance loss compared with FE based instrumentation (2), but is still significantly worse than late instrumentation (3).

(2) Performance impact of context sensitivity

LLVM does not fully use the profile information in the back-end optimizations; for instance, inlining does not fully use the profile counts -- it only marks hot/cold function attributes based on function entry counts. To evaluate the impact of profile context sensitivity, GCC is used in this experiment. Note that GCC PGO improves clang performance a lot more than clang PGO does.
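
To make the context sensitivity point concrete, consider a small made-up C++ example (not one of the benchmarks). If the tiny accessor below is instrumented before inlining, it carries a single set of counters shared by all callers, so the profile cannot distinguish the hot loop, where the branch is almost always taken, from the cold error path, where it almost never is; instrumenting after early inlining / pre-inlining gives each call site its own counters:

// Illustrative example only: a tiny, trivially inlinable accessor called
// from a hot path and a cold path with very different branch behavior.
#include <vector>

struct Item {
  int flags;
  bool is_valid() const { return (flags & 1) != 0; } // tiny, always inlined
};

// Hot path: is_valid() is nearly always true here.
long count_valid(const std::vector<Item> &items) {
  long n = 0;
  for (const Item &it : items)
    if (it.is_valid())
      ++n;
  return n;
}

// Cold path: is_valid() is almost never true here.
bool recheck_on_error(const Item &it) {
  return it.is_valid();
}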

First we summarize the methodology used in the experiment:

  0) build clang with GCC O2 without early inlining and measure clang’s performance. GCC early inlining (einline) is similar to the pre-inlining used by late instrumentation.
  1) build clang with GCC O2 with early inlining and measure performance.

The performance difference between 1) and 0) is denoted as E; it measures the contribution of early inlining.

  2) build clang with GCC O2 + PGO without early inlining.
  3) build clang with GCC O2 + PGO with early inlining.

The performance difference between 3) and 2) is denoted as EC. It consists of roughly two parts: a) the contribution of early inlining, and b) the context sensitive profiling enabled by early inlining.

The contribution of context sensitive profiling can therefore be estimated as EC - E.

Config                        wall_time_for_use (s)   speedup_vs_(0)   speedup_vs_(1)
(0) base w/o einline          84.946                  1.000            0.934
(1) base O2                   79.310                  1.071            1.000
(2) profile-arcs w/o einline  63.518                  1.337            1.249
(3) profile-arcs              48.364                  1.756            1.640
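
Reading these numbers with the EC - E methodology above (speedup ratios, rounded):

  E  = 84.946 / 79.310 - 1 ~ 7.1%   (early inlining alone, no PGO)
  EC = 63.518 / 48.364 - 1 ~ 31.3%  (early inlining plus context sensitivity, with PGO)

which puts the estimated contribution of context sensitive profiling at roughly EC - E ~ 24%.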

Sorry for digging up an old thread.

The data looks interesting and promising. How did you measure clang’s performance? On which benchmarks?
Do you have similar performance data on Clang PGO + late instrumentation that you can share?

Thanks,
Manman