GSoC Proposal : Path Profiling Support

This proposal adds support for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime.
Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles. Superblocks constructed from path profiles offer more opportunities for speculative code motion thereby improving program performance.

About Me :
I am a PhD candidate at Simon Fraser University, Canada. My research primarily deals with computer architecture and workload analysis. Over the last year, I have built a toolchain which uses path profiling as its basis. With support from GSoC, I would like to integrate our path profiling pass into trunk. Our passes are currently built against LLVM 3.6.2. I am willing to maintain said passes as they form the core of our research infrastructure.

Prior Work :
Path profiling was implemented by Adam Preuss for LLVM [Preuss10]. However, this was before the unified profiling infrastructure was in place. It seems that the code was removed from trunk post LLVM 3.3.

Current Implementation :

  • HashTable based
  • Static overflow checks (64 bit counters) at instrumentation time with fallback to APInt counters
  • Implements efficient event counting [Ball94] to reduce instrumentation overheads
  • Context sensitive counters if no 64-bitoverflow (i.e. supports path profiling across recursion)
  • Optimizations for use cases with very large number of executed paths
  • Used to analyse C/C++ benchmarks from SPECCPU2006

Proposed Integration :

  • Support library code added to compiler-rt
  • llvm/Analysis gets PathEncoding, PathDecoding and PathProfileInfo module passes
  • llvm/Transforms/Instrumentation gets PathProfileInstrumentation module pass
  • PathProfileInfo offers const only public methods to query info

Open Issues :

  • Update PathProfileInfo on CFG transformations ?
  • Verify with PGOEdge info ?
  • Handle setjmp, longjmp, early program termination, noreturn calls

Please let me know your comments on this proposal as a GSoC project. Any comments on how to refine this proposal are welcome. I can also elaborate on specific details of our implementation if there is interest on the mailing list.

Thanks,
Snehasish Kumar
School of Computing Science
Simon Fraser University

References :
[Ball94] Ball, Thomas. “Efficiently counting program events with support for on-line queries.” ACM Transactions on Programming Languages and Systems (TOPLAS) 16.5 (1994): 1399-1410.
[Ball96] Ball, Thomas, and James R. Larus. “Efficient path profiling.” Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society, 1996.
[Young98] Young, Cliff, and Michael D. Smith. “Better global scheduling using path profiles.” Proceedings of the 31st annual ACM/IEEE international symposium on Microarchitecture. IEEE Computer Society Press, 1998.
[Preuss10] http://lists.llvm.org/pipermail/llvm-dev/2010-December/036790.html

Hi,

This sounds like a very interesting proposal!

This proposal adds support for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime.

In [Ball96], the authors mention that the worst case number of acyclic
paths in a program is exponential in its size. How do you manage this? Is
the amount of instrumentation necessary to handle this scenario also
exponential in the size of the program?

Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles.

I haven't had time to digest [Young98]. Could you explain in some more
detail which optimizations stand to benefit from path profile information?

Over the last year, I have built a toolchain which uses path profiling as its basis.

Awesome!

Do you have any numbers about the size of path profiles, compared to the
size of raw instrprof-based profiles?

In general I'm curious about how we can support users of instrprof-style
profiles with path profiles. Do you have plans to support code coverage
using path profiles? Can path profiles from dylibs be concatenated together
to yield valid profiles?

+ Used to analyse C/C++ benchmarks from SPECCPU2006

I'd be interested to see any numbers you have about this. In particular,
what's the compile-time overhead of path profiling instrumentation? What
are the sizes of instrumented binaries, and what's the slowdown factor?

Open Issues :
+ Update PathProfileInfo on CFG transformations ?

Could you clarify what this means?

+ Verify with PGOEdge info ?

Ditto.

+ Handle setjmp, longjmp, early program termination, noreturn calls

How do you handle indirect calls?

best,
vedant

Hi Vedant,

I would like to clarify that the proposal does *intra-procedural* path profiling as described in [Ball96].

> This proposal adds support for path profiling [Ball96] to LLVM. Path profiling compactly represents acyclic paths in a directed acyclic graph representation of the control flow graph of a routine. Instrumentation can be added to uniquely identify paths executed at runtime.

In [Ball96], the authors mention that the worst case number of acyclic
paths in a program is exponential in its size. How do you manage this? Is
the amount of instrumentation necessary to handle this scenario also
exponential in the size of the program?

Paths are numbered from 0 to NumPaths-1. This can be exponential in size. In practice we find that they fit in 64-bit wide counters are enough for numbering paths within a routine. The amount of instrumentation required is not exponential. For example, every if-condition in a routine will increase the number of paths by X2, but the instrumentation points required to disambiguate between each side of the branch is only 1. Furthermore, we reduce the amount of instrumentation using [Ball94] which is proven to be optimal.

> Path profiles enable precise enumeration of the sequence of basic blocks executed in order for a particular path. Using path profiles has been demonstrated by [Young98] to perform better global scheduling than edge profiles.

I haven't had time to digest [Young98]. Could you explain in some more
detail which optimizations stand to benefit from path profile information?

[Young98] target superblock enlargement optimizations. Having precise information about the paths enables additional optimizations using value numbering and subsequent dead code elimination. In our experience, isolating paths with reduced control flow allows aggressive dead-store-elimination, dead-code-elimination and constant propagation.

> Over the last year, I have built a toolchain which uses path profiling as its basis.

Do you have any numbers about the size of path profiles, compared to the
size of raw instrprof-based profiles?

No. We use a plain text representation internally which can be further optimized.

In general I'm curious about how we can support users of instrprof-style
profiles with path profiles. Do you have plans to support code coverage
using path profiles?

Perfect instrprof data can be derived from path profiles. Code coverage can be supported using path profiles though we do not do so as part of our implementation.

Can path profiles from dylibs be concatenated together
to yield valid profiles?

Yes. As each routine is independently instrumented, the profiles can be concatenated. However, it is difficult to reason about which paths in the caller invoke which paths in the callee. Melinski and Reps describe how to do so in Interprocedural Path Profiling (CC'99). This is beyond the scope of this proposal.

> + Used to analyse C/C++ benchmarks from SPECCPU2006

I'd be interested to see any numbers you have about this. In particular,
what's the compile-time overhead of path profiling instrumentation? What
are the sizes of instrumented binaries, and what's the slowdown factor?

You can find similar numbers reported by [Ball96] in Table 1 of the paper. Numbers from our selected workloads are shown below.
These numbers use 256 bit APInt counters (primary cause for slowdown) along with a DenseMap<APInt, APInt> for Id => Counter mapping.
The runtime is implemented as a shared library. The benchmarks are selected from SPEC2006 and Parsec 3.0. Each was
compiled to a single bitcode module and a particular function was picked for instrumentation. The function was
determined to be (one of) the primary work functions. The bitcodes were compiled with -O2 optimization post
instrumentation.

Note that the frequency of paths executed is not shown in this table and is a factor for slowdown.
Each path is executed multiple times and each path has varying number of instrumentation points.
This data shows the *worst possible case* where 256-bit APInt counters are used to calculate path ids as well
as maintain runtime frequency counts in the shared library. Note that all benchmarks are well within the
64-bit integer range for path-ids.

Thus the data in the [Ball96] paper shows the best case overheads, while the data I have included herein
shows the worst case overheads for instrumentation+runtime.

+ Key

numpaths = Number of possible paths
epp+compile = Time taken to compute encoding, insert instrumentation and compile to executable
compile = Time taken to compile to executable
execpaths = Number of paths dynamically executed
epp-exec-time = Execution time with instrumentation
exec-time = Normal execution time
epp-bin-size = Size of instrumented binary in bytes
bin-size = Size of binary
** size of shared library in bytes = 598042

Hi Kumar,

Are the data below all collected when only one function is picked for
instrumentation? Do you have data when such manual selection is not done?

thanks,

David

Hi David,

Are the data below all collected when only one function is picked for
instrumentation?

Yes, here is a list of the benchmarks and selected functions.

Hi

I am pinging to find out if there is any interest to mentor this
proposal for GSoC this year? I've submitted a draft via the GSoC
website.

David, Vedant it would be great if I could get some advice on refining
the goals and particulars of the implementation.
The version we use internally is not performance oriented and will
require refactoring.
Here is a link to the draft document [1].

Thanks,
Snehasish

[1] https://docs.google.com/document/d/18i9FvD7FSqX6tNEXb83gzc0EC_STeS3bWOVf167sFWk/edit?usp=sharing

I've added some inline comments to your proposal doc.

It would be nice to get feedback from potential users of path profile
information, like people who work on GVN.

Some general notes:

- It would be nice to measure the performance variation between programs
  optimized with edge-based profiles vs. with path profiles.

- I'd be curious to see comparisons with the instrprof framework (compile time,
  exec time, profile size, binary size). This would establish a concrete target
  to beat.

David, Vedant it would be great if I could get some advice on refining
the goals and particulars of the implementation.
The version we use internally is not performance oriented and will
require refactoring.
Here is a link to the draft document [1].

[1] https://docs.google.com/document/d/18i9FvD7FSqX6tNEXb83gzc0EC_STeS3bWOVf167sFWk/edit?usp=sharing

Open Issues :
+ Update PathProfileInfo on CFG transformations ?

Could you clarify what this means?

Changing the control flow graph of a routine may invalidate collected path
profiles. For example, splitting a block with an unconditional branch does
not change the profile, but introducing a conditional branch invalidates the
profile. The issue I would like to address is which transformations should
we allow as safe transformations and how should we update the internal path
profile data structures if we allow this at all.

It seems that existing optimization passes can make changes that could invalidate
a path profile. Is it expensive to detect these changes and invalidate the affected
portions of the profile?

vedant

Hi Vedant,

Thanks for the feedback. I've updated the draft.

It would be nice to get feedback from potential users of path profile

information, like people who work on GVN.

- It would be nice to measure the performance variation between programs
  optimized with edge-based profiles vs. with path profiles.

To my knowledge currently there are few consumers of edge profiles.
Replacing them with path profiles will not yield any different results
as it is used now.
Implementing deeper optimizations correlated to the paths executed will show
better results than edge profiles (eg. Superblock formation).

- I'd be curious to see comparisons with the instrprof framework (compile time,
  exec time, profile size, binary size). This would establish a concrete target
  to beat.

This is definitely something we can analyse further though in my
opinion it may not
be a fair target to "beat" as path profiles subsume edge profiles.

David, Vedant it would be great if I could get some advice on refining
the goals and particulars of the implementation.
The version we use internally is not performance oriented and will
require refactoring.
Here is a link to the draft document [1].

[1] https://docs.google.com/document/d/18i9FvD7FSqX6tNEXb83gzc0EC_STeS3bWOVf167sFWk/edit?usp=sharing

Open Issues :
+ Update PathProfileInfo on CFG transformations ?

Could you clarify what this means?

Changing the control flow graph of a routine may invalidate collected path
profiles. For example, splitting a block with an unconditional branch does
not change the profile, but introducing a conditional branch invalidates the
profile. The issue I would like to address is which transformations should
we allow as safe transformations and how should we update the internal path
profile data structures if we allow this at all.

It seems that existing optimization passes can make changes that could invalidate
a path profile. Is it expensive to detect these changes and invalidate the affected
portions of the profile?

*Speculation* Detecting portions to invalidate should be easy but depends on the
nature of the change.

Hi Snehasish, thanks for writing up the proposal.

As it stands today, path profiling still has serious scalability issue that prevents it from being usable by any optimization passes that may benefit from it. On the other hand, sampling based approach can still be promising. For instance, LBR can potentially together with static CFG constructed from the binary can be used to form path(let) samples, which is the area our intern will explore this summer. It will be interesting to see how the sampling based approach matches up instrumentation based method in detecting hot paths.

Independent of the method used in generating path profile data, your proposed work on the path profile info representation and query APIs can be shared.

thanks,

David

Hi David,

Hi Snehasish, thanks for writing up the proposal.

As it stands today, path profiling still has serious scalability issue that
prevents it from being usable by any optimization passes that may benefit
from it.

I agree; it would be an interesting to see how we can reduce the overheads
to bring it within acceptable limits.

It will be interesting to see how the
sampling based approach matches up instrumentation based method in
detecting hot paths.

I actually performed some experiments regarding this very issue. I
created an edge
profile from the collected path profile and constructed superblocks
for innermost loops
in 28 applications. I found

a) in 6 applications superblocks (i.e hot paths constructed from edge
profiles) are
*not present* as executed paths in the collected profile. For example,
in 401.bzip2 70 of 141 inner loop hot paths derived
from edge profiles do not corresponding to actual executed paths.

b) in 6 applications (overlap of 3 with the previous 6) some hot paths
constructed
from edge profiles are not the highest ranked path. For example, this occurs in
4 out of 32 innermost loops in 458.sjeng.

The high level reason this happens is due to overlapped paths. I can
go into more details if
there is interest.

Hi David,

> Hi Snehasish, thanks for writing up the proposal.
>
> As it stands today, path profiling still has serious scalability issue
that
> prevents it from being usable by any optimization passes that may benefit
> from it.

I agree; it would be an interesting to see how we can reduce the overheads
to bring it within acceptable limits.

One idea to reduce the overhead of instrumentation is to use edge profiling
to guide path profile instrumentation. With edge profiling, you can prune
perhaps >90% of the paths and focus one profiling potential hot paths. The
downside of this approach it requires additional steps to enable path
profiling -- but at least it can be put into a state that is usable.

> It will be interesting to see how the
> sampling based approach matches up instrumentation based method in
> detecting hot paths.

I actually performed some experiments regarding this very issue. I
created an edge
profile from the collected path profile and constructed superblocks
for innermost loops
in 28 applications. I found

a) in 6 applications superblocks (i.e hot paths constructed from edge
profiles) are
*not present* as executed paths in the collected profile. For example,
in 401.bzip2 70 of 141 inner loop hot paths derived
from edge profiles do not corresponding to actual executed paths.

b) in 6 applications (overlap of 3 with the previous 6) some hot paths
constructed
from edge profiles are not the highest ranked path. For example, this
occurs in
4 out of 32 innermost loops in 458.sjeng.

The high level reason this happens is due to overlapped paths. I can
go into more details if
there is interest.

For edge profiles, this is expected as not all paths are realizable but
your data is still interesting. For instance, the bzip2 data shows great
potential of using path profile information. The sample based method I
mentioned is not about edge profiling/sampling though, but doing sampling
of execution paths (with the length of the paths is limited by capacity of
the branch record buffer).

David