Proposal: add instrumentation for PGO and code coverage

I've been investigating better support for both profile-guided optimization (PGO) and code coverage in Clang/LLVM. I've spent some time experimenting with several options and have arrived at this proposal. I'd like to get feedback on the overall approach before I start submitting patches. There is a proof-of-concept patch attached here so you can see the details, but it needs more testing and cleanup before it is ready to commit.

Background:

PGO: LLVM already has support for branch weight metadata and can use that to compute block frequencies and generate better code. Those branch weights can come from __builtin_expect or from real profile data. We currently have some support for IR-level instrumentation to insert CFG edge counts and then read those counts back into the IR as branch weights. That support is not very well integrated into clang and does not work well if the IR changes at all after the profile data is collected. Diego Novillo has recently expressed interest in a sampling-based approach to collecting profile data for PGO. I believe that would be complementary to this proposal to use instrumentation. There are pros and cons of either approach, so it would be great to offer a choice between them. Both approaches should share the same metadata representation.
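
As a reminder of how the existing mechanism gets used, a __builtin_expect hint looks like this (the handle_error helper is hypothetical); clang turns the hint into branch weight metadata on the branch it generates:

extern void handle_error(int err);   /* hypothetical helper */

void process(int err) {
  /* Tell the compiler the error path is unlikely to be taken. */
  if (__builtin_expect(err != 0, 0))
    handle_error(err);
}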

Coverage: We currently support code coverage testing with gcov. We also have a very basic llvm-cov tool that is intended to work much like gcov, and it currently works from the same gcov data files. At least with the version of gcov that I have access to, it does not provide a great experience, especially with regard to showing accurate source ranges. For expressions spread across multiple source lines, typically only one of those lines is reported as having any execution counts at all. The same problem exists within one source line that has multiple basic blocks -- gcov can show that there are multiple blocks but it doesn't show the source range for each block.

Proposal:

I would like to add instrumentation code in clang to collect data for both PGO and coverage. Since the same data can be used for both, it makes sense to combine them, especially if you can then use a coverage tool to examine your profile data that is being used for PGO. The same code that inserts the instrumentation will be used (with a different build setting) to add the branch weight metadata to the IR. Here are my main design goals:

* PGO should degrade gracefully when the source changes: I started out trying to get around this and just require the profile data to be regenerated for every source change, but that really didn't work out. For a variety of reasons, it is important to avoid that. I'm not expecting much disagreement so I won't go into details. The proposal here is to match the profile data to the current code on a function-by-function basis. If the profile data is stale for a particular function, it will be ignored. More sophisticated solutions could be considered in the future, but as far as I know, that is still an open research problem.

* Profile data must not be invalidated by compiler changes: No one should have to regenerate their profile data just because they installed a new compiler. This is especially important to Apple, but I expect it applies to most people to some extent. Just as an example, while working on this I stumbled across some work-in-progress to support switch case-ranges in LLVM IR. Clang's current code-gen for case-ranges expands them out to either a series of separate cases or conditional branches in the default block, depending on the size of the range. I want profile data collected with the current code-gen to remain valid if we ever get native case-range support, even though the IR-level CFG might look significantly different. The proposed solution here is to associate the instrumentation counters with clang AST nodes, which are much less likely to change in a way that would invalidate the profile data. (See the short case-range example after this list.)

* Provide accurate coverage data: As described above, I think it will be quite useful to use the same instrumentation for both PGO and coverage, and the coverage data we collect should be as accurate as possible. Clang's IR-gen does a few optimizations, e.g., to remove dead code, that require some care to make sure the coverage data is correct.

* Minimize runtime overhead for instrumentation: If the instrumented code is very slow or requires far more memory than normal, that could be a barrier to using it. Since the proposal here is to insert the instrumentation in the front-end, we can take advantage of the structure of the ASTs to reduce the number of counters. (The effect should be similar to what you get by using a spanning tree to place counters in an arbitrary CFG.) We can also compile the instrumented code with full optimization.
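
To make the case-range example from the second goal concrete, here is roughly the kind of code in question (the helper functions are hypothetical, and the comments just restate the current lowering described above):

extern void small_range(int n);   /* hypothetical helpers, for illustration only */
extern void large_range(int n);
extern void other(int n);

void dispatch(int n) {
  switch (n) {
  case 1 ... 4:       /* small range: currently expanded into separate cases */
    small_range(n);
    break;
  case 10 ... 1000:   /* large range: currently a conditional branch in the default block */
    large_range(n);
    break;
  default:
    other(n);
    break;
  }
}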

Summary of the Instrumentation Phase:

In the attached patch, I added a "-finstrument-regions" option for this. Other name suggestions are welcome. With that option, clang's IR-gen for a function first walks through the AST in source order, assigning sequential counter indices for the nodes involved in conditional control flow. While generating the IR, for each AST node that has associated counters, it looks up the counter indices and emits code to increment those counters at the appropriate places. I have tried to minimize the number of counters. For an if-statement, I only put a counter on the then-block. If you know the count going into the if-statement, you can infer the count for the else-block and the fall-through block. For a while loop, I use 3 counters: one for the condition expression, one for the loop body, and one for the exit block. Gotos are handled by putting counters on the labels. Switches are more interesting: to get the right data for PGO I had to put the counters on the CFG edges from the switch to each case, i.e., not counting the fall through. For code coverage, and also for keeping track of the current count during PGO compilation, the fall through counts need to be added to the edge counts. I've prototyped all of that and it seems to work reasonably well. I haven't added instrumentation for all the control-flow constructs yet, since I was just trying to convince myself that this approach works, but the rest should be similar.
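
As a rough illustration of the counter placement described above (conceptual only; the __counters array and the index assignments are made up and not what the patch actually emits):

extern unsigned long long __counters[];   /* hypothetical counter storage */

int example(int x) {
  if (x > 0) {
    ++__counters[0];                 /* if: counter on the then-block only; the else and */
    x = -x;                          /* fall-through counts are inferred from the count  */
  }                                  /* entering the if-statement                        */
  while (++__counters[1], x < 0) {   /* while: one counter for the condition,            */
    ++__counters[2];                 /* one for the loop body,                           */
    ++x;
  }
  ++__counters[3];                   /* and one for the exit block                       */
  return x;
}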

Summary of PGO Compilation:

I added a "-fpgo" option for this. Again, name suggestions are welcome. The current prototype patch just uses a fixed filename for the profile data, but that should be changed to be an argument to the -fpgo option. The profile data is read in when the CodeGenModule is initialized. The data file is currently plain text but should be changed to something more efficient. The data for each function is identified by the mangled function name. For static functions, we'll need to add a file name to distinguish them uniquely. There is a single data file for the entire executable, so the compilation of a single source file may only use a fraction of the entries. For each function during IR-gen, the AST is traversed in the same way as during the instrumentation phase. If the number of counters does not match what was in the profile data, then the source must have changed and the profile data is ignored. We probably need to make that check more specific, e.g., by computing some sort of hash of the AST contents (in a way that doesn't depend on internal details of the AST representation). If the profile data doesn't match, we can add a warning diagnostic, and if that is too noisy in practice, we can investigate things like warning only when a certain proportion of the functions have stale data, or warning only for hot functions, etc. Assuming the profile data matches, during IR-gen the proposed approach is to keep track of the current execution count, updating it from the counter values when visiting code that was instrumented. Those counts are used to add the branch weight metadata to the generated IR. There are a few cases where the ASTs don't distinguish all the edges, e.g., a case-range that gets expanded to a bunch of separate cases. This is not a problem for code coverage, because we can only report the counts for things that exist in the source code. For PGO, these situations are not common, and it should be good enough to just divide the counts evenly across the possibilities, e.g., for the case-range all the case values are going to jump to the same code anyway, so there's no real advantage in distinguishing the branch weight for each value in the range.
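
For reference, attaching those weights uses LLVM's existing branch weight support; a minimal sketch for the if-statement case could look like this (the count bookkeeping around it is hypothetical, only the MDBuilder call is the existing interface):

#include "llvm/IR/Instructions.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/MDBuilder.h"
#include <cstdint>

// Attach branch weights to the conditional branch emitted for an if-statement,
// given the count entering the statement and the count recorded for the then-block.
static void addIfWeights(llvm::BranchInst *Br, uint32_t ParentCount,
                         uint32_t ThenCount) {
  uint32_t ElseCount = ParentCount - ThenCount;  // inferred, as described above
  llvm::MDBuilder MDB(Br->getContext());
  Br->setMetadata(llvm::LLVMContext::MD_prof,
                  MDB.createBranchWeights(ThenCount, ElseCount));
}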

Summary of Code Coverage:

I had implemented this with an earlier approach but do not currently have anything working. The ASTs have very accurate source locations, so it is no problem to accurately report the execution counts. One option here would be to have a code coverage system actually run the compiler again to correlate the profile data with the ASTs in the same way that it does when compiling with -fpgo. That would have the advantage of letting you see coverage counts from a profile data file after changing the source slightly. Actually given the desire to use a coverage tool to view the data for PGO, I'm strongly inclined to go that way. Alternatively, the compiler during the instrumentation phase could write out a description of how to calculate the counts for each source range. That would take some extra work and the results would be tied to the source locations at the time the instrumentation was done, but it would avoid the need to re-run the compiler to view the coverage.
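
If we went with that second option, the per-function description could be as simple as a list of records along these lines (purely hypothetical, just to show the idea):

/* A coverage tool would compute the count for each region as
   counters[CounterIndex] minus counters[SubtractIndex] (or just the raw
   counter when SubtractIndex is -1), without re-running the compiler. */
struct CoverageRegion {
  unsigned StartLine, StartCol;   /* source range this count applies to */
  unsigned EndLine, EndCol;
  int CounterIndex;               /* raw counter feeding this region */
  int SubtractIndex;              /* counter to subtract (e.g. a then-block counter), or -1 */
};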

Here are the proof-of-concept patches for clang and compiler-rt. I intend to clean them up and add more comments, but hopefully they are useful for anyone who wants a more detailed understanding of the proposal.

pgo-clang.patch (61.9 KB)

pgo-compiler-rt.patch (2.76 KB)

Hi Bob,

Two general comments (detailed inline with your proposal):
- PGO should be able to deal with several profile data files.
- We should do something about counters set to zero.

I've been investigating better support for both profile-guided optimization (PGO) and code coverage in Clang/LLVM. I've spent some time experimenting with several options and have arrived at this proposal. I'd like to get feedback on the overall approach before I start submitting patches. There is a proof-of-concept patch attached here so you can see the details, but it needs more testing and cleanup before it is ready to commit.

Background:

PGO: LLVM already has support for branch weight metadata and can use that to compute block frequencies and generate better code. Those branch weights can come from __builtin_expect or from real profile data. We currently have some support for IR-level instrumentation to insert CFG edge counts and then read those counts back into the IR as branch weights. That support is not very well integrated into clang and does not work well if the IR changes at all after the profile data is collected. Diego Novillo has recently expressed interest in a sampling-based approach to collecting profile data for PGO. I believe that would be complementary to this proposal to use instrumentation. There are pros and cons of either approach, so it would be great to offer a choice between them. Both approaches should share the same metadata representation.

Coverage: We currently support code coverage testing with gcov. We also have a very basic llvm-cov tool that is intended to work much like gcov, and it currently works from the same gcov data files. At least with the version of gcov that I have access to, it does not provide a great experience, especially with regard to showing accurate source ranges. For expressions spread across multiple source lines, typically only one of those lines is reported as having any execution counts at all. The same problem exists within one source line that has multiple basic blocks -- gcov can show that there are multiple blocks but it doesn't show the source range for each block.

Proposal:

I would like to add instrumentation code in clang to collect data for both PGO and coverage. Since the same data can be used for both, it makes sense to combine them, especially if you can then use a coverage tool to examine your profile data that is being used for PGO. The same code that inserts the instrumentation will be used (with a different build setting) to add the branch weight metadata to the IR. Here are my main design goals:

* PGO should degrade gracefully when the source changes: I started out trying to get around this and just require the profile data to be regenerated for every source change, but that really didn't work out. For a variety of reasons, it is important to avoid that. I'm not expecting much disagreement so I won't go into details. The proposal here is to match the profile data to the current code on a function-by-function basis. If the profile data is stale for a particular function, it will be ignored. More sophisticated solutions could be considered in the future, but as far as I know, that is still an open research problem.

* Profile data must not be invalidated by compiler changes: No one should have to regenerate their profile data just because they installed a new compiler. This is especially important to Apple, but I expect it applies to most people to some extent. Just as an example, while working on this I stumbled across some work-in-progress to support switch case-ranges in LLVM IR. Clang's current code-gen for case-ranges expands them out to either a series of separate cases or conditional branches in the default block, depending on the size of the range. I want profile data collected with the current code-gen to remain valid if we ever get native case-range support, even though the IR-level CFG might look significantly different. The proposed solution here is to associate the instrumentation counters with clang AST nodes, which are much less likely to change in a way that would invalidate the profile data.

* Provide accurate coverage data: As described above, I think it will be quite useful to use the same instrumentation for both PGO and coverage, and the coverage data we collect should be as accurate as possible. Clang's IR-gen does a few optimizations, e.g., to remove dead code, that require some care to make sure the coverage data is correct.

* Minimize runtime overhead for instrumentation: If the instrumented code is very slow or requires far more memory than normal, that could be a barrier to using it. Since the proposal here is to insert the instrumentation in the front-end, we can take advantage of the structure of the ASTs to reduce the number of counters. (The effect should be similar to what you get by using a spanning tree to place counters in an arbitrary CFG.) We can also compile the instrumented code with full optimization.

Summary of the Instrumentation Phase:

In the attached patch, I added a "-finstrument-regions" option for this. Other name suggestions are welcome. With that option, clang's IR-gen for a function first walks through the AST in source order, assigning sequential counter indices for the nodes involved in conditional control flow. While generating the IR, for each AST node that has associated counters, it looks up the counter indices and emits code to increment those counters at the appropriate places. I have tried to minimize the number of counters. For an if-statement, I only put a counter on the then-block. If you know the count going into the if-statement, you can infer the count for the else-block and the fall-through block. For a while loop, I use 3 counters: one for the condition expression, one for the loop body, and one for the exit block. Gotos are handled by putting counters on the labels. Switches are more interesting: to get the right data for PGO I had to put the counters on the CFG edges from the switch to each case, i.e., not counting the fall through. For code coverage, and also for keeping track of the current count during PGO compilation, the fall through counts need to be added to the edge counts. I've prototyped all of that and it seems to work reasonably well. I haven't added instrumentation for all the control-flow constructs yet, since I was just trying to convince myself that this approach works, but the rest should be similar.

Summary of PGO Compilation:

I added a "-fpgo" option for this. Again, name suggestions are welcome. The current prototype patch just uses a fixed filename for the profile data, but that should be changed to be an argument to the -fpgo option.

Sorry if I missed it, but I think it is important that we can pass several profile data files to the -fpgo option (or have a tool that aggregates several files into one).
The use case is:
- Gather profiling data for an application with different inputs, each run generating a profile.
- Recompile this application using one aggregated profile.

For optimization purposes, another consideration is that we may want to bias the numbers we collect a bit to avoid weird heuristic behavior.
The idea is to adjust the numbers for counters that are zero.
Let me illustrate with an example.
Consider the following code:

if (something_that_should_almost_never_occur) {
  do_specific_job();
} else {
  do_regular_job();
}

It is likely that the then block will have its counter set to zero.
Therefore, a frequency-based heuristic may make weird decisions in that block because it may think it is free to put code there.

Generally speaking, this may be okay w.r.t. performance since this path is -almost- never supposed to be taken, but it may increase the code size and may have other side effects on the cache, for instance. We may also not want to degrade performance on the almost-never-taken path.

My point is, we should not end up with basic blocks with 0 frequency or branches with 0 probability; otherwise we expose ourselves to weird behavior.
We could argue that each optimization has to deal with zero frequencies, but I guess that would complicate the overall LLVM code base.

Thanks,

-Quentin

Hi Bob,

Two general comments (detailed inline with your proposal):
- PGO should be able to deal with several profile data files.
- We should do something about counters set to zero.

Sorry if I missed it, but I think it is important that we can pass several profile data files to the -fpgo option (or have a tool that aggregates several files into one).
The use case is:
- Gather profiling data for an application with different inputs, each run generating a profile.
- Recompile this application using one aggregated profile.

Yes, definitely. I am planning to have a separate tool to aggregate profile data files, but if that doesn't work out, we would want the compiler to do the aggregation. One way or the other, we should definitely allow multiple profiles to be combined.
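
Conceptually, the aggregation itself is trivial; it amounts to something like this for each function's counter array (ignoring the on-disk format and error handling entirely):

/* Sketch only: fold the counters from one profiling run into an accumulated profile. */
void merge_counters(unsigned long long *accumulated,
                    const unsigned long long *run, unsigned num_counters) {
  for (unsigned i = 0; i < num_counters; ++i)
    accumulated[i] += run[i];
}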

For optimization purposes, another consideration is that we may want to bias the numbers we collect a bit to avoid weird heuristic behavior.
The idea is to adjust the numbers for counters that are zero.
Let me illustrate with an example.
Consider the following code:

if (something_that_should_almost_never_occur) {
  do_specific_job();
} else {
  do_regular_job();
}

It is likely that the then block will have its counter set to zero.
Therefore, a frequency-based heuristic may make weird decisions in that block because it may think it is free to put code there.

Generally speaking, this may be okay w.r.t. performance since this path is -almost- never supposed to be taken, but it may increase the code size and may have other side effects on the cache, for instance. We may also not want to degrade performance on the almost-never-taken path.

My point is, we should not end up with basic blocks with 0 frequency or branches with 0 probability; otherwise we expose ourselves to weird behavior.
We could argue that each optimization has to deal with zero frequencies, but I guess that would complicate the overall LLVM code base.

Yes. Jakob already pointed me at Laplace's rule of succession, which basically says the same thing (see the Wikipedia article on the rule of succession).

The patch already has code in it to use Count + 1 for the branch weights in the metadata.
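
So converting a raw count to a weight is roughly the following (illustrative only; the clamp is there because the branch_weights metadata operands are 32-bit):

/* Count + 1: a branch that was never taken in the profile still gets a small
   nonzero weight instead of a zero probability. */
unsigned branch_weight(unsigned long long count) {
  unsigned long long weight = count + 1;
  return weight > 0xffffffffULL ? 0xffffffffU : (unsigned)weight;
}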

I wonder if there’s a way to take advantage of Clang/LLVM’s library-based design to achieve this functionality in a less invasive way than “directly code in all the necessary logic into various relevant parts of clang’s IRGen”.

Off the top of my head, how far could you get with an LLVM pass that consumes debug info? That would keep the logic nice and separated.

Failing a solution with existing means, maybe it’s time for clang to grow some “pass”-like functionality, so that every new instrumentation that needs AST information can coexist in peace without incrementally adding complexity to IRGen. I think a lot of ground could be covered by a two-part scheme where a “clang pass” can look at the AST and tell IRGen to attach annotations to the code generated from particular AST nodes, coupled with a corresponding IR pass that interprets those annotations and transforms the IR.

Random thing I saw when looking at the patches:

+typedef unsigned int uint32_t;
+typedef unsigned int uint64_t;

This seems super sketchy.

– Sean Silva

I wonder if there’s a way to take advantage of Clang/LLVM’s library-based design to achieve this functionality in a less invasive way than “directly code in all the necessary logic into various relevant parts of clang’s IRGen”.

Off the top of my head, how far could you get with an LLVM pass that consumes debug info? That would keep the logic nice and separated.

It keeps the logic nicely separated but I don’t see any way to achieve my design goals with that approach.

Failing a solution with existing means, maybe it’s time for clang to grow some “pass”-like functionality, so that every new instrumentation that needs AST information can coexist in peace without incrementally adding complexity to IRGen. I think a lot of ground could be covered by a two-part scheme where a “clang pass” can look at the AST and tell IRGen to attach annotations to the code generated from particular AST nodes, coupled with a corresponding IR pass that interprets those annotations and transforms the IR.

I will think about it some more, but I don’t think that works well. Believe me, I’ve tried really hard to make this less invasive, but most of the things I have tried have not worked out (and I’ve already adopted the things that did work out). Fundamentally, we need code-gen to insert the instrumentation and also to add the branch weight metadata, so much of the code needs to be in code-gen regardless.

I’m essentially already doing a separate “pass” over the AST to assign the counter indices. Eventually I expect to have separate code outside code-gen to calculate code coverage statistics. I could possibly reuse that as a separate “pass” before code-gen, but that wouldn’t actually do much to simplify the changes to code-gen.

Random thing I saw when looking at the patches:

+typedef unsigned int uint32_t;
+typedef unsigned int uint64_t;

This seems super sketchy.

That is exactly the sort of thing that prompted me to put a disclaimer on this that it is a proof-of-concept patch and not something that I intend to submit in its current state.

Background:

PGO: LLVM already has support for branch weight metadata and can use that to compute block frequencies and generate better code. Those branch weights can come from __builtin_expect or from real profile data. We currently have some support for IR-level instrumentation to insert CFG edge counts and then read those counts back into the IR as branch weights. That support is not very well integrated into clang and does not work well if the IR changes at all after the profile data is collected. Diego Novillo has recently expressed interest in a sampling-based approach to collecting profile data for PGO. I believe that would be complementary to this proposal to use instrumentation. There are pros and cons of either approach, so it would be great to offer a choice between them. Both approaches should share the same metadata representation.

I agree, the auto profiler can and should co-exist with the instrumentation profiler. In a sense, the auto profiler is just a lossy version of the instrumentation profiler.

One thing I'm not so sure about in your approach is the decision to do PGO instrumentation at the source level. I would suspect that this may produce more instrumentation than needed (which is then hard to remove in the optimizers) and it also prevents other LLVM front ends from being able to use PGO.

Wouldn't it make it easier if the PGO instrumentation was generated by LLVM itself?

For code coverage, I think being closer to the source helps. For PGO, I'm not so sure. But I also don't know for certain that my suspicion of excessive instrumentation is correct. The only thing I'm sure of is that other front ends will not be able to take advantage of it.

In terms of the metadata representation, what are your thoughts on the on-disk format to use? Since we want to support multiple external profile sources, I would like to define a canonical on-disk representation that every profiling source should convert to.

This way, the profile loader in the compiler needs to handle that single file format.

* PGO should degrade gracefully when the source changes: I started out trying to get around this and just require the profile data to be regenerated for every source change, but that really didn't work out. For a variety of reasons, it is important to avoid that. I'm not expecting much disagreement so I won't go into details. The proposal here is to match the profile data to the current code on a function-by-function basis. If the profile data is stale for a particular function, it will be ignored. More sophisticated solutions could be considered in the future, but as far as I know, that is still an open research problem.

Absolutely. My current plan wrt auto profiling is to drop on the floor any samples that I can make no sense of. Perhaps offering a diagnostic option, but certainly giving up and falling back on static heuristics.

* Profile data must not be invalidated by compiler changes: No one should have to regenerate their profile data just because they installed a new compiler. This is especially important to Apple, but I expect it applies to most people to some extent. Just as an example, while working on this I stumbled across some work-in-progress to support switch case-ranges in LLVM IR. Clang's current code-gen for case-ranges expands them out to either a series of separate cases or conditional branches in the default block, depending on the size of the range. I want profile data collected with the current code-gen to remain valid if we ever get native case-range support, even though the IR-level CFG might look significantly different. The proposed solution here is to associate the instrumentation counters with clang AST nodes, which are much less likely to change in a way that would invalidate the profile data.

But ultimately, this is a special case of code drifting, is it not? Or maybe I just completely missed your point here.

* Provide accurate coverage data: As described above, I think it will be quite useful to use the same instrumentation for both PGO and coverage, and the coverage data we collect should be as accurate as possible. Clang's IR-gen does a few optimizations, e.g., to remove dead code, that require some care to make sure the coverage data is correct.

And this is the case I think why coverage instrumentation at the FE level makes a lot of sense.

* Minimize runtime overhead for instrumentation: If the instrumented code is very slow or requires far more memory than normal, that could be a barrier to using it. Since the proposal here is to insert the instrumentation in the front-end, we can take advantage of the structure of the ASTs to reduce the number of counters. (The effect should be similar to what you get by using a spanning tree to place counters in an arbitrary CFG.) We can also compile the instrumented code with full optimization.

This is the part that I have doubts about. Perhaps I'm too used to generating instrumentation from the compiler's CFG.

Summary of the Instrumentation Phase:

In the attached patch, I added a "-finstrument-regions" option for this. Other name suggestions are

-fprofile-generate and -fprofile-use? I'm drawing from my GCC background here. Sorry ;-)

If the profile data doesn't match, we can add a warning diagnostic, and if that is too noisy in practice, we can investigate things like warning only when a certain proportion of the functions have

Right. Here I was initially thinking of a diagnostic when the compiler drops more than N% of the records from the instrumentation file.

stale data, or warning only for hot functions, etc. Assuming the profile data matches, during IR-gen the proposed approach is to keep track of the current execution count, updating it from the counter values when visiting code that was instrumented. Those counts are used to add the branch weight metadata to the generated IR. There are a few cases where the ASTs don't distinguish all the edges, e.g., a case-range that gets expanded to a bunch of separate cases. This is not a problem for code coverage, because we can only report the counts for things that exist in the source code. For PGO, these situations are not common, and it should be good enough to just divide the counts evenly across the possibilities, e.g., for the case-range all the case values are going to jump to the same code anyway, so there's no real advantage in distinguishing the branch weight for each value in the range.

For execution-count profiles, this should be a good start. Other types of profiling, like data-based ones, may need other attributes (perhaps in types, or variables). I'm currently not sure how to handle those.

Diego.

[Eric] Yes, I am thinking in terms of embedded targets where instrumentation cannot work. I would like to be able to take trace data generated by the program running on target and then pull out the relevant profile data on branch counts to feed to PGO. How do I create the profile format such that the branch counts taken from the code addresses of the executable match up to the branches being optimized by the compiler?

Also, to the point of only having one data file per executable, it seems like it would be better to associate a data file per source file. That way I can more easily store my profile data along with my source file and it will get used individually when that source file gets pulled into different builds.

Thanks!

Also, to the point of only having one data file per executable, it seems like it would be better to associate a data file per source file. That way I can more easily store my profile data along with my source file and it will get used individually when that source file gets pulled into different builds.

This is probably not workable. Think about the LTO case when you want to merge compile units. What profile data goes with the particular file will depend on the executable that's created. Also the workload profile for a given translation unit could vary quite significantly depending on workflow.

-eric

Diego Novillo wrote on Saturday, September 07, 2013:

In terms of the metadata representation, what are your thoughts on the on-disk format to use? Since we want to support multiple external profile sources, I would like to define a canonical on-disk representation that every profiling source should convert to.

This way, the profile loader in the compiler needs to handle that single file format.

[Eric] Yes, I am thinking in terms of embedded targets where instrumentation cannot work. I would like to be able to take trace data generated by the program running on target and then pull out the relevant profile data on branch counts to feed to PGO. How do I create the profile format such that the branch counts taken from the code addresses of the executable match up to the branches being optimized by the compiler?

The idea used by AutoFDO (see the AutoFDO page on the GCC wiki) is to take the sampled data, plus the debug info from the binary, and map it to source+line numbers. This gives you an approximate map of block frequencies which you then use to estimate branch probabilities. This is what I am currently implementing.

The process is, by nature, lossy. But at Google we have found that it provides a good balance between ease-of-use and optimization opportunities. We avoid the intrusive instrumentation and training steps of traditional PGO, while keeping most of its advantages.

Also, to the point of only having one data file per executable, it seems like it would be better to associate a data file per source file. That way I can more easily store my profile data along with my source file and it will get used individually when that source file gets pulled into different builds.

Well, the data file will naturally map back to source files via its debug info. Is that what you are looking for?

Diego.

Diego Novillo wrote on Tuesday, September 10, 2013:

In terms of the metadata representation, what are your thoughts on the on-disk format to use? Since we want to support multiple external profile sources, I would like to define a canonical on-disk representation that every profiling source should convert to.

This way, the profile loader in the compiler needs to handle that single file format.

[Eric] Yes, I am thinking in terms of embedded targets where instrumentation cannot work. I would like to be able to take trace data generated by the program running on target and then pull out the relevant profile data on branch counts to feed to PGO. How do I create the profile format such that the branch counts taken from the code addresses of the executable match up to the branches being optimized by the compiler?

The idea used by AutoFDO (see the AutoFDO page on the GCC wiki) is to take the sampled data, plus the debug info from the binary, and map it to source+line numbers. This gives you an approximate map of block frequencies which you then use to estimate branch probabilities. This is what I am currently implementing.

The process is, by nature, lossy. But at Google we have found that it provides a good balance between ease-of-use and optimization opportunities. We avoid the intrusive instrumentation and training steps of traditional PGO, while keeping most of its advantages.

[Eric] Okay, that is interesting. I have a tool that analyzes, in real time, trace data streaming from an embedded target processor. After the test run I get a count of how many times each function has been executed. I also get, for each conditional branch in the object code, a count of how many times it branched and how many times it didn't branch. So the trick is to map this into something that the optimizer can use.

Also, to the point of only having one data file per executable, it seems like it would be better to associate a data file per source file. That way I can more easily store my profile data along with my source file and it will get used individually when that source file gets pulled into different builds.

Well, the data file will naturally map back to source files via its debug info. Is that what you are looking for?

[Eric] It works, I'm just thinking more of logistics. For example, if I have a module with some small number of files, and this goes into a build with a huge number of files, then the data file I include with my files would be quite large and only have a small amount of usable information in it. Granted, I could use that one data file for all of the files in my module. I guess it could also be an advantage if I get my profiling info from a small module test application.

Background:

PGO: LLVM already has support for branch weight metadata and can use that to compute block frequencies and generate better code. Those branch weights can come from __builtin_expect or from real profile data. We currently have some support for IR-level instrumentation to insert CFG edge counts and then read those counts back into the IR as branch weights. That support is not very well integrated into clang and does not work well if the IR changes at all after the profile data is collected. Diego Novillo has recently expressed interest in a sampling-based approach to collecting profile data for PGO. I believe that would be complementary to this proposal to use instrumentation. There are pros and cons of either approach, so it would be great to offer a choice between them. Both approaches should share the same metadata representation.

I agree, the auto profiler can and should co-exist with the instrumentation profiler. In a sense, the auto profiler is just a lossy version of the instrumentation profiler.

One thing I'm not so sure about in your approach is the decision to do PGO instrumentation at the source level. I would suspect that this may produce more instrumentation than needed (which is then hard to remove in the optimizers) and it also prevents other LLVM front ends from being able to use PGO.

The scheme I'm proposing here doesn't help other front-ends to provide PGO but it doesn't prevent them, either. IMO, the important point to standardize across various profiling schemes is the metadata format.

Wouldn't it make it easier if the PGO instrumentation was generated by LLVM itself?

I suppose that depends on how you define "easier". I don't think it is possible to meet my design goals with that approach.

For code coverage, I think being closer to the source helps. For PGO, I'm not so sure. But I also don't know for certain that my suspicion of excessive instrumentation is correct. The only thing I'm sure of is that other front ends will not be able to take advantage of it.

Yes, that's a tradeoff. All other things being equal I would have liked to do something that applies to all front ends, but not at the expense of providing a great experience with clang.

I'm intentionally pairing coverage and PGO here. We don't want to require rebuilding with different instrumentation for each of those. We also want to make it easy for developers to examine their PGO profile data using code coverage tools.

In terms of the metadata representation, what are your thoughts on the on-disk format to use? Since we want to support multiple external profile sources, I would like to define a canonical on-disk representation that every profiling source should convert to.

This way, the profile loader in the compiler needs to handle that single file format.

This scheme would not work with a profile loader like that. The profile data in this proposal is tied to clang's ASTs. It must be loaded directly by clang and then translated to metadata in the IR.

* PGO should degrade gracefully when the source changes: I started out trying to get around this and just require the profile data to be regenerated for every source change, but that really didn't work out. For a variety of reasons, it is important to avoid that. I'm not expecting much disagreement so I won't go into details. The proposal here is to match the profile data to the current code on a function-by-function basis. If the profile data is stale for a particular function, it will be ignored. More sophisticated solutions could be considered in the future, but as far as I know, that is still an open research problem.

Absolutely. My current plan wrt auto profiling is to drop on the floor any samples that I can make no sense of. Perhaps offering a diagnostic option, but certainly giving up and falling back on static heuristics.

* Profile data must not be invalidated by compiler changes: No one should have to regenerate their profile data just because they installed a new compiler. This is especially important to Apple, but I expect it applies to most people to some extent. Just as an example, while working on this I stumbled across some work-in-progress to support switch case-ranges in LLVM IR. Clang's current code-gen for case-ranges expands them out to either a series of separate cases or conditional branches in the default block, depending on the size of the range. I want profile data collected with the current code-gen to remain valid if we ever get native case-range support, even though the IR-level CFG might look significantly different. The proposed solution here is to associate the instrumentation counters with clang AST nodes, which are much less likely to change in a way that would invalidate the profile data.

But ultimately, this is a special case of code drifting, is it not? Or maybe I just completely missed your point here.

I think you missed my point. There are 2 kinds of drift:

1) The source code changes. This is inevitable but manageable because you can tell the developer to regenerate their profile data whenever they change the code. Or not. They can choose.

2) The compiler changes. See, for example, the case-range issue that I mentioned above, where the compiler may suddenly start producing IR with a drastically different CFG even when the source code did not change at all. This is not manageable, at least for us. If the new compiler determines that the old profile data doesn't apply, it should warn about that, which could break the build if the project uses -Werror, and even if it doesn't, there may still be a performance regression. We need to regularly release new compilers internally and we have no control over other people's profile data files. We need a solution that avoids the possibility of either broken builds or performance regressions when we release a new compiler.

* Provide accurate coverage data: As described above, I think it will be quite useful to use the same instrumentation for both PGO and coverage, and the coverage data we collect should be as accurate as possible. Clang's IR-gen does a few optimizations, e.g., to remove dead code, that require some care to make sure the coverage data is correct.

And this is the case I think why coverage instrumentation at the FE level makes a lot of sense.

I'm intentionally pairing coverage and PGO here. We don't want to require rebuilding with different instrumentation for each of those. We also want to make it easy for developers to examine their PGO profile data using code coverage tools.

But, there's no reason you would need to pair those if it didn't suit your needs. You could use this proposed feature for code coverage and still use the sampling-based profile data for PGO.

* Minimize runtime overhead for instrumentation: If the instrumented code is very slow or requires far more memory than normal, that could be a barrier to using it. Since the proposal here is to insert the instrumentation in the front-end, we can take advantage of the structure of the ASTs to reduce the number of counters. (The effect should be similar to what you get by using a spanning tree to place counters in an arbitrary CFG.) We can also compile the instrumented code with full optimization.

This is the part that I have doubts about. Perhaps I'm too used to generating instrumentation from the compiler's CFG.

I don't see why front-end instrumentation would have more overhead than instrumentation added at the IR level. Most of the usual optimizations that you would do to minimize the counters just fall out naturally by following the structured control flow of the ASTs to determine where to place counters.

I'll work on getting some numbers soon.

Summary of the Instrumentation Phase:

In the attached patch, I added a "-finstrument-regions" option for this. Other name suggestions are

-fprofile-generate and -fprofile-use? I'm drawing from my GCC background here. Sorry ;-)

Yeah, I like those names, too. I was trying to avoid confusion with GCC's implementation. This proposal will not be compatible with what GCC does, so I'm not sure that using the same names would be a good idea.

Diego Novillo wrote on Saturday, September 07, 2013:

In terms of the metadata representation, what are your thoughts on the on-disk format to use? Since we want to support multiple external profile sources, I would like to define a canonical on-disk representation that every profiling source should convert to.

This way, the profile loader in the compiler needs to handle that single file format.

[Eric] Yes, I am thinking in terms of embedded targets where instrumentation cannot work. I would like to be able to take trace data generated by the program running on target and then pull out the relevant profile data on branch counts to feed to PGO. How do I create the profile format such that the branch counts taken from the code addresses of the executable match up to the branches being optimized by the compiler?

I'm assuming that we will provide some API that you can use to replace the standard output routines to write the profile data. For an embedded target, you could just dump the data into a memory buffer or perhaps send it over a wire back to a host machine.
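
For example (completely hypothetical names, just to show the shape of it), the runtime's default writer could be a hook that an embedded project overrides to stream the counters over its debug channel:

/* Target-specific transport provided by the embedded project (hypothetical). */
extern int send_to_host(const void *data, unsigned long size);

/* Hypothetical override of the runtime's default writer: instead of writing a
   file, ship the raw counter data back to the host. */
int __profile_write(const void *data, unsigned long size) {
  return send_to_host(data, size);
}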

Also, to the point of only having one data file per executable, it seems like it would be better to associate a data file per source file. That way I can more easily store my profile data along with my source file and it will get used individually when that source file gets pulled into different builds.

As Eric already commented, that doesn't work very well. Besides what he wrote, it's really important to allow collecting separate profiles from different machines, different inputs, etc. If you have 50 different scenarios that you want to include in your profiling, you really don't want 50 separate data files for every source file, especially if they'll be stored alongside the source in the same directories.