RFC: Binary format for instrumentation based profiling data

Instrumentation-based profiling data is generated by instrumented
binaries through library functions in compiler-rt, and read by the clang
frontend to feed PGO.

The current data format is a naive textual format.

The files consist of a number of counters for each function in a
program.

Functions are represented by strings, determined by the part of the
frontend that both generates and uses this data. In our case, these are
generally whatever clang thinks of as the function's name, with minor
details added to disambiguate names that aren't necessarily unique.

Why not just use mangled names here? No need to add minor details, nor
ad-hoc disambiguation. If you're already using mangled names, then I'm
not sure why we need disambiguating details.

The counter data is simply an array of unsigned 64 bit values. Given an
offset found in the index, a sequence follows:

   <function hash> <number of counters> <counters...>

This is all of the data needed for a given function.
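
For illustration, reading one function's record out of that array might look
roughly like the following (a minimal sketch; the struct and function names
here are invented for the example):

  #include <cstdint>
  #include <vector>

  struct FunctionRecord {
    uint64_t Hash;                  // <function hash>
    std::vector<uint64_t> Counters; // <counters...>
  };

  // Data points at the file's array of unsigned 64 bit values; Offset is
  // the position found for this function in the index.
  FunctionRecord readRecord(const uint64_t *Data, uint64_t Offset) {
    FunctionRecord R;
    R.Hash = Data[Offset];
    uint64_t NumCounters = Data[Offset + 1]; // <number of counters>
    R.Counters.assign(Data + Offset + 2, Data + Offset + 2 + NumCounters);
    return R;
  }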

How are counters represented? Are these line numbers together with the
counter? Basic blocks? Edges?

I wonder if it would make sense to use the existing gcov format for
this. OTOH, we could provide a converter in the Profile library.

Thanks. Diego.

Functions are represented by strings, determined by the part of the
frontend that both generates and uses this data. In our case, these are
generally whatever clang thinks of as the function's name, with minor
details added to disambiguate names that aren't necessarily unique.

Why not just use mangled names here? No need to add minor details, nor
ad-hoc disambiguation. If you're already using mangled names, then I'm
not sure why we need disambiguating details.

We need to prepend the file name to distinguish functions with local linkage. Also, Objective-C methods do not get mangled, so we don’t want to say that we just use the mangled names.

The counter data is simply an array of unsigned 64 bit values. Given an
offset found in the index, a sequence follows:

  <function hash> <number of counters> <counters...>

This is all of the data needed for a given function.

How are counters represented? Are these line numbers together with the
counter? Basic blocks? Edges?

There are no line numbers, basic blocks, or edges. It is just a sequence of counters that the front-end knows how to map to the code (the same as with our current textual file format).

I wonder if it would make sense to use the existing gcov format for
this. OTOH, we could provide a converter in the Profile library.

This is pretty clearly a different format from gcov, and I don’t see how we could convert between them. But, I agree that it would be nice to collect the code for handling different kinds of profile formats in one library, even if those file formats are not interchangeable.

Bob Wilson <bob.wilson@apple.com> writes:

Functions are represented by strings, determined by the part of the
frontend that both generates and uses this data. In our case, these are
generally whatever clang thinks of as the function's name, with minor
details added to disambiguate names that aren't necessarily unique.

Why not just use mangled names here? No need to add minor details, nor
ad-hoc disambiguation. If you're already using mangled names, then I'm
not sure why we need disambiguating details.

We need to prepend the file name to distinguish functions with local
linkage. Also, Objective-C methods do not get mangled, so we don’t
want to say that we just use the mangled names.

Notably, we do use mangled names where we can, and we only add extra
information in cases where the mangled name isn't enough.

Just wanted to point out that Clang already has an on-disk hash table
implementation in:

include/clang/Basic/OnDiskHashTable.h

Dmitri

Sorry, you lost me. How exactly does the FE map them to the code? In
the sample profiler, each instrumented line consists of a line offset,
a discriminator (to distinguish distinct control flow paths on the
same line) and the counter. We match them by computing the absolute
line number from the offset and assign the counter to the
corresponding basic block.
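
For comparison, a bare-bones sketch of the sample record just described (the
names are illustrative, not the actual SampleProfile.cpp types):

  #include <cstdint>

  // One instrumented line in the sample profile: a line offset from the
  // start of the function, a discriminator for distinct control flow on
  // the same line, and the sample count.
  struct SampleRecord {
    uint32_t LineOffset;
    uint32_t Discriminator;
    uint64_t Count;
  };

  // The absolute line number is recovered from the function's starting
  // line; the count is then assigned to the basic block at that location.
  uint32_t absoluteLine(uint32_t FunctionStartLine, const SampleRecord &S) {
    return FunctionStartLine + S.LineOffset;
  }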

I think we should be able to use the same pass in
lib/Transforms/Scalar/SampleProfile.cpp to read profiles generated
from instrumentation. The information is basically the same, so a bit
of generalization of that code should be all we need to pass those
counters down into the analysis module.

Diego.

How are counters represented? Are these line numbers together with the
counter? Basic blocks? Edges?

There are no line numbers, basic blocks, or edges. It is just a sequence
of counters that the front-end knows how to map to the code (the same as
with our current textual file format).

Sorry, you lost me. How exactly does the FE map them to the code? In
the sample profiler, each instrumented line consists of a line offset,
a discriminator (to distinguish distinct control flow paths on the
same line) and the counter. We match them by computing the absolute
line number from the offset and assign the counter to the
corresponding basic block.

For GCC, it is CFG based matching -- it requires exact match of sources
between instrumentation and annotation (the counters are laid out in some
CFG order).

David

How are counters represented? Are these line numbers together with the
counter? Basic blocks? Edges?

There are no line numbers, basic blocks, or edges. It is just a sequence of counters that the front-end knows how to map to the code (the same as with our current textual file format).

Sorry, you lost me. How exactly does the FE map them to the code? In
the sample profiler, each instrumented line consists of a line offset,
a discriminator (to distinguish distinct control flow paths on the
same line) and the counter. We match them by computing the absolute
line number from the offset and assign the counter to the
corresponding basic block.

This is a proposal for the instrumentation-based approach that I talked about at the dev meeting. I don’t see how it can share a file format with the sample profiler, since the content is fundamentally different.

I think we should be able to use the same pass in
lib/Transforms/Scalar/SampleProfile.cpp to read profiles generated
from instrumentation. The information is basically the same, so a bit
of generalization of that code should be all we need to pass those
counters down into the analysis module.

?? The information is completely different.

It is? But we’re fundamentally emitting similar information, right?

Sample counts and instrumentation counts will be different values, sure. But they convey the same meaning. Higher values mean higher frequency of execution.

I’m not saying that the file format must be the same. I’m saying that we should be able to feed block frequency information and branch probability information in the same way from both instrumentation and sampling data.

So, in the backend pass that reads profile data we should be able to process either sample data or instrumentation data. The reader layer just needs to be smart enough to know what it’s reading. But it ultimately feeds the same information in the form of branch weights.

Diego.

The instrumentation based profiling is done in the front end, and the counters are related to parts of the AST. There isn’t enough information for a backend pass to map them to anything. We feed block frequency and branch probability through metadata, much like the sampling based approach, but it happens during irgen in the front end, rather than in a backend pass.

While it might be possible to translate the instrumentation format to the profile format, it wouldn’t be possible to go the other way, and the translation would be imperfect.

The instrumentation based profiling is done in the front end, and the
counters are related to parts of the AST. There isn't enough information
for a backend pass to map them to *anything*. We feed block frequency and
branch probability through metadata, much like the sampling based approach,
but it happens during irgen in the front end, rather than in a backend pass.

Ah, OK. At least we get the annotations through to the backend. So,
instrumentation gathers both block and edge weights? Or do you derive edge
weights from blocks?

While it might be possible to translate the instrumentation format to the
profile format, it wouldn't be possible to go the other way, and the
translation would be imperfect.

I'm only interested in passing the instrumentation down to the back end.
Are you folks thinking of doing other kinds of instrumentation? (value
profiling, for example).

Thanks. Diego.

The IR only has branch weight metadata, and we add that directly to the branch instructions. In most cases, we can just use the block counts to calculate the branch weights, but in a few cases, we insert counters on the edges when that is necessary to get the correct weights.
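
As a concrete illustration of deriving weights from block counts for a simple
if/else (a hypothetical helper, not the actual frontend code):

  #include <cstdint>
  #include <utility>

  // If the condition block and the "then" block carry counters, the two
  // branch weights fall out of the block counts: the taken weight is the
  // then-block's count, and the not-taken weight is the remainder of the
  // condition block's count.
  std::pair<uint64_t, uint64_t> ifElseWeights(uint64_t CondCount,
                                              uint64_t ThenCount) {
    return {ThenCount, CondCount - ThenCount};
  }

For example, a condition reached 100 times whose then-block ran 90 times
yields weights of 90 and 10.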

We’ve discussed that possibility but not in any detail.

Instrumentation-based profiling data is generated by instrumented
binaries through library functions in compiler-rt, and read by the clang
frontend to feed PGO.

Note that compiler-rt *cannot* link in LLVM libraries for many reasons:

1) compiler-rt must be built for the target, while the LLVM libraries are
built for the host.
2) The compiler-rt source is under a different license.
3) We typically want to keep runtime libraries that need to be extremely
lightweight, such as the profile runtime, free of LLVM's large
dependencies.

So I think you will at least need to duplicate the header file with basic
structs and the endian-aware writing code. =/ This may explain some of the
trouble you had with build systems, at least.

Use cases
=========

There are a few use cases that drive our requirements:

A. Looking up the counters for a particular function needs to be
  efficient, so as to not slow down compilation too much when using the
  data for PGO. This is also presumably how this data would be
  accessed if using it for coverage.

B. The file should be relatively quick to generate, and shouldn't
  require a large amount of memory overhead to write out. Instrumented
  programs will write this file out on termination or on demand.

C. Iterating over all of the data must be reasonably simple, to support
  tools that summarize or merge data files.

D. The format should be relatively compact. Programs with a large
  number of functions may be instrumented, and we don't want to create
  unnecessarily large files because of it.

E. The format should be portable between systems. That is, if Alice
  generates profile data for some program P on machine A, Bob should
  be able to use that same profile data with a copy of that same
  program on machine B.

Partially updating the file is not a goal. Instrumented programs write
out a profile for a single run, and multiple runs can be merged by a
separate tool.
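
To make the merging point concrete, a minimal sketch of what such a separate
merge tool would do (the per-function hash checks a real tool needs are
omitted):

  #include <cstdint>
  #include <map>
  #include <string>
  #include <vector>

  // A profile, reduced to its essentials: counters keyed by function name.
  using Profile = std::map<std::string, std::vector<uint64_t>>;

  // Merging runs is just summing counters for matching functions.
  void mergeInto(Profile &Dest, const Profile &Src) {
    for (const auto &KV : Src) {
      std::vector<uint64_t> &Counters = Dest[KV.first];
      if (Counters.size() < KV.second.size())
        Counters.resize(KV.second.size(), 0);
      for (size_t I = 0, E = KV.second.size(); I != E; ++I)
        Counters[I] += KV.second[I];
    }
  }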

The other assumption here is that you want the same file format written by
instrumentation and read back by the compiler. While I think that is an
unsurprising goal, I think it creates quite a few limitations that I'd like
to point out. I think it would be worthwhile to consider the alternative of
having the profile library write out data files in a format which is
essentially "always" transformed by a post-processing tool before being
used during compilation.

Limitations of using the same format in both places:
- High burden on writing the file constrains the format (must be fast, must
not use libraries, etc...)
- Have to write an index even though the writer doesn't really need it.
- Have to have the function name passed through the instrumentation,
potentially duplicating it with debug info.
- Can't use an extensible file format (like bitcode) to insulate readers of
profile data from format changes.

I'm imagining it might be nicer to have something along the lines of the
following counter proposal. Define two formats: the format written by
instrumentation, and the format read by the compiler. Split the use cases
up. Specialize the formats based on the use cases. It does require the user
to post-process the results, but it isn't clear that this is really a
burden. Historically it has been needed to merge gcov profiles from
different TUs, and it is still required to merge them from multiple runs.

I think the results could be superior for both the writer and reader:

Instrumentation written format:
- No index, just header and counters
- (optional) Omit function names, and use PC at a known point of the
function, and rely on debug info to map back to function names.
- Use a structure which can be mmap-ed directly by the instrumentation code
(at least on LE systems) so that "writing the file on close" is just
flushing the memory region to disk
- Explicitly version format, and provide no stability going forward

Profile reading format:
- Use a bitcoded format much like Clang's ASTs do (or some other tagged
format which allows extensions)
- Leverage the existing partial reading which has been heavily optimized
for modules, LLVM IR, etc.
- Use implicit-zero semantics for missing counters within a function where
we have *some* instrumentation results, and remove all zero counters (see
the sketch after this list)
- Maybe other compression techniques
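
A minimal sketch of the implicit-zero idea from the list above (the sparse
(index, value) representation is an assumption for illustration, not a
settled encoding):

  #include <cstdint>
  #include <utility>
  #include <vector>

  // On disk, only nonzero counters are kept as (index, value) pairs; the
  // reader rebuilds the full array, with missing entries defaulting to 0.
  std::vector<uint64_t>
  expandCounters(uint64_t NumCounters,
                 const std::vector<std::pair<uint32_t, uint64_t>> &NonZero) {
    std::vector<uint64_t> Counters(NumCounters, 0); // implicit zeros
    for (const auto &IV : NonZero)
      Counters[IV.first] = IV.second;
    return Counters;
  }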

Thoughts? Specific reasons to avoid this? I'm very much interested in
minimizing the space and runtime overhead of instrumentation, as well as
getting more advanced features in the format read by Clang itself.

Chandler Carruth <chandlerc@google.com> writes:

The other assumption here is that you want the same file format written by
instrumentation and read back by the compiler. While I think that is an
unsurprising goal, I think it creates quite a few limitations that I'd like to
point out. I think it would be worthwhile to consider the alternative of
having the profile library write out data files in a format which is
essentially "always" transformed by a post-processing tool before being used
during compilation.

Limitations of using the same format in both places:
- High burden on writing the file constrains the format (must be fast, must
  not use libraries, etc...)
- Have to write an index even though the writer doesn't really need it.
- Have to have the function name passed through the instrumentation,
  potentially duplicating it with debug info.
- Can't use an extensible file format (like bitcode) to insulate readers of
  profile data from format changes.

I'm imagining it might be nicer to have something along the lines of the
following counter proposal. Define two formats: the format written by
instrumentation, and the format read by the compiler. Split the use cases up.
Specialize the formats based on the use cases. It does require the user to
post-process the results, but it isn't clear that this is really a burden.
Historically it has been needed to merge gcov profiles from different TUs, and
it is still required to merge them from multiple runs.

This is an interesting idea. The counter data itself without index is
dead simple, so this approach for the instrumentation written format
would certainly be nice for compiler-rt, at the small cost of needing
two readers. We'd also need two writers, but that appears inevitable
since one needs to live in compiler-rt.

I think the results could be superior for both the writer and reader:

Instrumentation written format:
- No index, just header and counters
- (optional) Omit function names, and use PC at a known point of the function,
  and rely on debug info to map back to function names.

This depends a bit on whether or not the conversion tool should depend
on the debug info being available. We'd need to weigh the usability cost
against the size benefit.

- Use a structure which can be mmap-ed directly by the instrumentation code
  (at least on LE systems) so that "writing the file on close" is just
  flushing the memory region to disk

If this is feasible, we could also make the format host endian and
force the post-processing to byteswap as it reads. This avoids online
work in favour of offline.
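
A rough sketch of that offline byteswap, assuming the raw header records a
magic value in the producer's byte order (the header layout here is invented
for the example):

  #include <cstdint>

  // Hypothetical raw header: the writer stores a fixed magic constant in
  // its own byte order, so a reader on the other endianness sees it swapped.
  struct RawHeader {
    uint64_t Magic;
    uint64_t NumCounters;
  };

  static uint64_t byteSwap64(uint64_t V) {
    return __builtin_bswap64(V); // GCC/Clang builtin; swap bytes of a u64
  }

  // Offline post-processing: if the magic is byte-swapped, swap every word
  // once during conversion, so the compiler-facing format is always in one
  // known byte order.
  void normalizeCounters(const RawHeader &H, uint64_t *Counters,
                         uint64_t ExpectedMagic) {
    if (H.Magic == ExpectedMagic)
      return; // already in the reader's byte order
    uint64_t N = byteSwap64(H.NumCounters); // header fields are swapped too
    for (uint64_t I = 0; I != N; ++I)
      Counters[I] = byteSwap64(Counters[I]);
  }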

- Explicitly version format, and provide no stability going forward

Profile reading format:
- Use a bitcoded format much like Clang's ASTs do (or some other tagged format
  which allows extensions)

I'm not entirely convinced a bitcoded format is going to gain us much
over a simpler on disk hash table. The variable bit rate integers might
be worthwhile, but will it be efficient to look up the counters for a
particular function name?

That said, the ASTs also make use of the on disk hash that Dmitri
mentioned for various indexes, which is definitely worth looking at.

Chandler Carruth <chandlerc@google.com> writes:

The other assumption here is that you want the same file format written by
instrumentation and read back by the compiler. While I think that is an
unsurprising goal, I think it creates quite a few limitations that I'd like to
point out. I think it would be worthwhile to consider the alternative of
having the profile library write out data files in a format which is
essentially "always" transformed by a post-processing tool before being used
during compilation.

Limitations of using the same format in both places:
- High burden on writing the file constrains the format (must be fast, must
not use libraries, etc...)
- Have to write an index even though the writer doesn't really need it.
- Have to have the function name passed through the instrumentation,
potentially duplicating it with debug info.
- Can't use an extensible file format (like bitcode) to insulate readers of
profile data from format changes.

I'm imagining it might be nicer to have something along the lines of the
following counter proposal. Define two formats: the format written by
instrumentation, and the format read by the compiler. Split the use cases up.
Specialize the formats based on the use cases. It does require the user to
post-process the results, but it isn't clear that this is really a burden.
Historically it has been needed to merge gcov profiles from different TUs, and
it is still required to merge them from multiple runs.

This is an interesting idea. The counter data itself without index is
dead simple, so this approach for the instrumentation written format
would certainly be nice for compiler-rt, at the small cost of needing
two readers. We'd also need two writers, but that appears inevitable
since one needs to live in compiler-rt.

I'm in favour of two formats. Simplifying compiler-rt is a worthwhile goal.

Nevertheless, the current proposal with a naive index is straightforward to produce, especially after the changes I committed today. I think moving to that is a good incremental change.

Moving forward we can split the format in two and evolve them independently. In particular, compiler-rt's writer could be coded as a few memcpy calls plus a header, if there's some freedom around the format.
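
Something like the following is the level of simplicity being aimed at for
the runtime side (a sketch only; the header fields and names are
placeholders):

  #include <cstdint>
  #include <cstdio>

  // Placeholder raw header: just enough for the converter to find and
  // validate the counter dump that follows.
  struct RawProfileHeader {
    uint64_t Version;
    uint64_t NumCounters;
  };

  // "A few memcpy calls plus a header": dump the live counter array
  // straight to disk, with no index and no per-function bookkeeping.
  int writeRawProfile(const char *Path, const uint64_t *Counters,
                      uint64_t NumCounters) {
    std::FILE *F = std::fopen(Path, "wb");
    if (!F)
      return -1;
    RawProfileHeader H = {1, NumCounters};
    std::fwrite(&H, sizeof(H), 1, F);
    std::fwrite(Counters, sizeof(uint64_t), NumCounters, F);
    return std::fclose(F);
  }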

Chandler Carruth <chandlerc@google.com> writes:

I think it would be worthwhile to consider the alternative of having
the profile library write out data files in a format which is
essentially "always" transformed by a post-processing tool before
being used during compilation.

We seem to have some agreement that two formats for instrumentation
based profiling is worthwhile. These are that emitted by compiler-rt in
the instrumented program at runtime (format 1), and that which is
consumed by clang when compiling the program with PGO (format 2).

Format 1
--------

This format should be efficient to write, since the instrumented program
should run with as little overhead as possible. This also doesn't need
to be stable, and we can assume the same version of LLVM that was used
to instrument the program will read the counter data. As such, the file
format is versioned (so we can easily reject versions we don't
understand) and consists basically of a memory dump of the relevant
profiling counters.
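
Concretely, the converter's view of such a file could be as simple as the
following sketch, assuming a fixed header followed by the counter dump
(field names are illustrative):

  #include <cstdint>

  struct RawProfileHeader {
    uint64_t Version;
    uint64_t NumCounters;
  };

  // Mapping is the mmap-ed file contents. Check the version, then treat
  // the rest of the mapping as the counter array the instrumented program
  // had in memory.
  const uint64_t *viewCounters(const void *Mapping, uint64_t SupportedVersion,
                               uint64_t &NumCountersOut) {
    const auto *H = static_cast<const RawProfileHeader *>(Mapping);
    if (H->Version != SupportedVersion) {
      NumCountersOut = 0;
      return nullptr; // reject versions we don't understand
    }
    NumCountersOut = H->NumCounters;
    return reinterpret_cast<const uint64_t *>(H + 1);
  }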

The "same version" assertion isn't completely true, at a previous job
we had clients who preferred not to regenerate profile data unless they
actually had to (because it was a big pain and took a long time). But
as long as the versioning is based on actual format changes, not just
repurposing the current LLVM version number (making the previous data
unusable for no technical reason), that's okay.

As long as I'm bothering to say something, is there some way that the
tools will figure out that you're trying to apply old data to new files
that have changed in ways that make the old data inapplicable? Sorry
if this has been brought up elsewhere and I just missed it.
--paulr

Chandler Carruth <chandlerc@google.com> writes:
> I think it would be worthwhile to consider the alternative of having
> the profile library write out data files in a format which is
> essentially "always" transformed by a post-processing tool before
> being used during compilation.

We seem to have some agreement that two formats for instrumentation
based profiling is worthwhile. These are that emitted by compiler-rt in
the instrumented program at runtime (format 1), and that which is
consumed by clang when compiling the program with PGO (format 2).

Format 1
--------

This format should be efficient to write, since the instrumented program
should run with as little overhead as possible. This also doesn't need
to be stable, and we can assume the same version of LLVM that was used
to instrument the program will read the counter data. As such, the file
format is versioned (so we can easily reject versions we don't
understand) and consists basically of a memory dump of the relevant
profiling counters.

This makes perfect sense to me.

Format 2
--------

This format should be efficient to read and preferably reasonably
compact. We'll convert from format 1 to format 2 using llvm-profdata,
and clang will use format 2 for PGO.

Since the only particularly important operation in this use case is fast
lookup, I propose using the on disk hash table that's currently used in
clang for AST serialization/PTH/etc with a small amount of metadata in a
header.
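
To show the shape of that lookup, here is a toy illustration (not the clang
OnDiskHashTable API that Dmitri pointed at; the bucket layout and hash are
made up):

  #include <cstdint>
  #include <cstring>
  #include <string>

  // The profile file is mmap-ed; a header gives the bucket count, and each
  // bucket points at a name and at that function's counter record.
  // Collision handling and the real hash function are glossed over here.
  struct ToyBucket {
    uint64_t NameOffset; // offset of the NUL-terminated function name
    uint64_t DataOffset; // offset of <hash> <number of counters> <counters...>
  };

  uint64_t lookupFunctionData(const char *MappedFile, uint64_t NumBuckets,
                              const ToyBucket *Buckets,
                              const std::string &Name) {
    // Any stable string hash works, as long as reader and writer agree.
    uint64_t H = 5381;
    for (char C : Name)
      H = H * 33 + static_cast<unsigned char>(C);

    const ToyBucket &B = Buckets[H % NumBuckets];
    if (std::strcmp(MappedFile + B.NameOffset, Name.c_str()) == 0)
      return B.DataOffset; // offset of this function's counters
    return UINT64_MAX;     // not found (or a collision we don't handle)
  }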

The hash table implementation currently lives in include/clang/Basic and
consists of a single header. Moving it to llvm and updating the clients
in clang should be easy. I'll send a brief RFC separately to see if
anyone's opposed to moving it.

I can mention this and we can discuss this on the other thread if you would
rather, but I'm not a huge fan of this code. My vague memory was that this
was a quick hack by Doug that he never really expected to live long-term.

I have a general preference for from-disk lookups to use tries (for
strings, prefix tries) or other fast, sorted lookup structures. They have
the nice property of being inherently stable and unambiguous, and not
baking any hashing algorithm into it.

I've not thought enough about how to make a general purpose one of these to
have a stronger opinion though; perhaps I should do so...

We seem to have some agreement that two formats for instrumentation
based profiling is worthwhile. These are that emitted by compiler-rt in
the instrumented program at runtime (format 1), and that which is
consumed by clang when compiling the program with PGO (format 2).

Format 1
--------

This format should be efficient to write, since the instrumented program
should run with as little overhead as possible. This also doesn't need
to be stable, and we can assume the same version of LLVM that was used
to instrument the program will read the counter data. As such, the file
format is versioned (so we can easily reject versions we don't
understand) and consists basically of a memory dump of the relevant
profiling counters.

The "same version" assertion isn't completely true, at a previous job
we had clients who preferred not to regenerate profile data unless they
actually had to (because it was a big pain and took a long time). But
as long as the versioning is based on actual format changes, not just
repurposing the current LLVM version number (making the previous data
unusable for no technical reason), that's okay.

Format 1 (extension .profraw since r204676) should be run immediately through llvm-profdata to generate format 2 (extension .profdata). The only profiles that should be kept around are format 2.

As long as I'm bothering to say something, is there some way that the
tools will figure out that you're trying to apply old data to new files
that have changed in ways that make the old data inapplicable? Sorry
if this has been brought up elsewhere and I just missed it.
—paulr

There’s a hash for each function based on the layout of the counters assigned to it. If the hash from the data doesn’t match what the current frontend expects, the data is ignored. Currently, the hash is extremely naive: just the number of counters.
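
In other words, the reader's check amounts to something like this
(hypothetical helper, with the naive hash spelled out):

  #include <cstdint>

  // With the current naive scheme, the "hash" recorded for a function is
  // just its counter count, so a mismatch simply means the frontend now
  // assigns a different number of counters and the stale data is dropped.
  bool profileIsUsable(uint64_t StoredHash, uint64_t NumCountersNow) {
    uint64_t CurrentHash = NumCountersNow; // naive structural hash
    return StoredHash == CurrentHash;
  }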

Format 2
--------

This format should be efficient to read and preferably reasonably
compact. We'll convert from format 1 to format 2 using llvm-profdata,
and clang will use format 2 for PGO.

Since the only particularly important operation in this use case is fast
lookup, I propose using the on disk hash table that's currently used in
clang for AST serialization/PTH/etc with a small amount of metadata in a
header.

The hash table implementation currently lives in include/clang/Basic and
consists of a single header. Moving it to llvm and updating the clients
in clang should be easy. I'll send a brief RFC separately to see if
anyone's opposed to moving it.

I can mention this and we can discuss this on the other thread if you would rather, but I'm not a huge fan of this code. My vague memory was that this was a quick hack by Doug that he never really expected to live long-term.

I have a general preference for from-disk lookups to use tries (for strings, prefix tries) or other fast, sorted lookup structures.

These profiles will contain every function in a program. Relatively few of these will be needed per translation unit (per invocation of clang). I suspect that an on disk hash will perform better than a trie for this use case, since it requires fewer loads from disk.

But the main benefit of the clang on-disk hash is that it’s in use and it already works. Unless tries are significantly better, I prefer cleaning up the (working) hash table implementation to implementing (and debugging) something new.

They have the nice property of being inherently stable and unambiguous, and not baking any hashing algorithm into it.

It *is* harder to keep the hash table stable. I think it’s worth the cost here.