RFC: Binary format for instrumentation based profiling data

>> We seem to have some agreement that two formats for instrumentation
>> based profiling are worthwhile: the one emitted by compiler-rt in the
>> instrumented program at runtime (format 1), and the one consumed by
>> clang when compiling the program with PGO (format 2).
>>
>> Format 1
>> --------
>>
>> This format should be efficient to write, since the instrumented
>> program should run with as little overhead as possible. This also
>> doesn't need to be stable, and we can assume the same version of LLVM
>> that was used to instrument the program will read the counter data.
>> As such, the file format is versioned (so we can easily reject
>> versions we don't understand) and consists basically of a memory dump
>> of the relevant profiling counters.

The "same version" assertion isn't completely true, at a previous job
we had clients who preferred not to regenerate profile data unless

they

actually had to (because it was a big pain and took a long time). But
as long as the versioning is based on actual format changes, not just
repurposing the current LLVM version number (making the previous data
unusable for no technical reason), that's okay.

Format 1 (extension .profraw since r204676) should be run immediately
through llvm-profdata to generate format 2 (extension .profdata). The
only profiles that should be kept around are format 2.

> Okay, but the version comment still applies to format 2 then.

Right. Format 2 needs to be auto-upgraded if we have version changes.

> As long as I'm bothering to say something, is there some way that the
> tools will figure out that you're trying to apply old data to new
> files that have changed in ways that make the old data inapplicable?
> Sorry if this has been brought up elsewhere and I just missed it.
> -paulr

There's a hash for each function based on the layout of the counters
assigned to it. If the hash from the data doesn't match the current
frontend, the data is ignored. Currently, the hash is extremely naive:
the number of counters.

> Eww. Should be CFG-based. I think merge-similar-functions has a way
> to compute this?

Working on it in "[PATCH] InstrProf: Calculate a better function hash" :).
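
[Editorial note: to illustrate the difference under discussion, the
naive hash distinguishes functions only by their counter count, while a
CFG-aware hash also mixes in the structure that assigned those
counters, so reorganized control flow invalidates stale data. Both
functions below are illustrative sketches, not the actual InstrProf
implementation.]

    #include <cstdint>
    #include <vector>

    // Naive hash, as described above: two functions with the same
    // number of counters collide even if their control flow changed.
    uint64_t naiveFunctionHash(unsigned NumCounters) {
      return NumCounters;
    }

    // CFG-aware sketch: feed one kind byte per counter-bearing
    // construct (if/else, loop, switch case, ...) through FNV-1a, so a
    // function whose control flow is reorganized hashes differently
    // even when the counter count happens to stay the same.
    uint64_t cfgFunctionHash(const std::vector<uint8_t> &RegionKinds) {
      uint64_t Hash = 0xcbf29ce484222325ULL; // FNV-1a offset basis
      for (uint8_t Kind : RegionKinds) {
        Hash ^= Kind;
        Hash *= 0x100000001b3ULL; // FNV-1a prime
      }
      return Hash;
    }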

Chandler Carruth <chandlerc@google.com> writes:

    Format 2
    --------
   
    This format should be efficient to read and preferably reasonably
    compact. We'll convert from format 1 to format 2 using llvm-profdata,
    and clang will use format 2 for PGO.
   
    Since the only particularly important operation in this use case is fast
    lookup, I propose using the on disk hash table that's currently used in
    clang for AST serialization/PTH/etc with a small amount of metadata in a
    header.
   
    The hash table implementation currently lives in include/clang/Basic and
    consists of a single header. Moving it to llvm and updating the clients
    in clang should be easy. I'll send a brief RFC separately to see if
    anyone's opposed to moving it.
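
[Editorial note: for context, the generator side of that on-disk hash
table is parameterized by a client-supplied trait class. The sketch
below gives roughly its shape; the names and signatures follow the
OnDiskHashTable design from memory and should be treated as
approximate, and the key/data choices are just one plausible fit for
profile data.]

    #include "llvm/ADT/StringRef.h"
    #include "llvm/Support/raw_ostream.h"
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Trait class describing how keys (function names) and data
    // (counter values) are hashed and serialized into the table.
    struct ProfileLookupTrait {
      typedef llvm::StringRef key_type;        // function name
      typedef llvm::StringRef key_type_ref;
      typedef std::vector<uint64_t> data_type; // counter values
      typedef const data_type &data_type_ref;
      typedef uint32_t hash_value_type;
      typedef uint32_t offset_type;

      static hash_value_type ComputeHash(key_type_ref K);
      static std::pair<offset_type, offset_type>
      EmitKeyDataLength(llvm::raw_ostream &Out, key_type_ref K,
                        data_type_ref D);
      static void EmitKey(llvm::raw_ostream &Out, key_type_ref K,
                          offset_type KeyLen);
      static void EmitData(llvm::raw_ostream &Out, key_type_ref K,
                           data_type_ref D, offset_type DataLen);
    };

The 32-bit hash_value_type and offset_type in this sketch foreshadow
the width concerns discussed later in the thread.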

> I can mention this and we can discuss this on the other thread if you
> would rather, but I'm not a huge fan of this code. My vague memory was
> that this was a quick hack by Doug that he never really expected to
> live long-term.

It may not be the prettiest piece of code, but given that it's used in
several places in clang and hasn't needed any significant changes since
2010, I'd say it's fairly solid. It also has the very obvious advantage
of already existing, which makes it a pretty good candidate for a
version 1 format, IMHO.

I have a general preference for from-disk lookups to use tries (for strings,
prefix tries) or other fast, sorted lookup structures. They have the nice
property of being inherently stable and unambiguous, and not baking any
hashing algorithm into it.

I would like to experiment with a few trie-based approaches for this as
we try to optimize the PGO process further (both for space and for
lookup time). Even so, it's not a sure thing that this will work better,
and I don't think it's worth delaying getting something that people can
use out the door.

If you're opposed to moving the existing OnDiskHashTable into Support,
perhaps because you don't think it should proliferate to other uses,
the obvious alternative is to include a private copy of a stripped down
version of it for the profile reader and writer to use themselves. I'm
not sure if this is worth the copy pasted code, but it is an
option. What do you think?

Chandler, are you okay with this way forward?

Justin Bogner <mail@justinbogner.com> writes:

Ping.

Justin Bogner <mail@justinbogner.com> writes:

Sorry, just back from holiday (your email managed to catch the start of 2 weeks away, sorry about that) and I'm still working back through the email and review backlog.

Before I answer definitively, I want to catch up on all of the changes that have happened since this very high level discussion. I don’t know what the current state of the on disk hash table is, or what the implications of this being a long-term file format are. I’ll go investigate that, since it seems all the code is committed at this point anyways…

Chandler Carruth <chandlerc@google.com> writes:
> Format 2
> --------
>
> This format should be efficient to read and preferably reasonably
> compact. We'll convert from format 1 to format 2 using llvm-profdata,
> and clang will use format 2 for PGO.
>
> Since the only particularly important operation in this use case is
> fast lookup, I propose using the on disk hash table that's currently
> used in clang for AST serialization/PTH/etc with a small amount of
> metadata in a header.
>
> The hash table implementation currently lives in include/clang/Basic
> and consists of a single header. Moving it to llvm and updating the
> clients in clang should be easy. I'll send a brief RFC separately to
> see if anyone's opposed to moving it.
>
> I can mention this and we can discuss this on the other thread if you
> would rather, but I'm not a huge fan of this code. My vague memory was
> that this was a quick hack by Doug that he never really expected to
> live long-term.

> It may not be the prettiest piece of code, but given that it's used in
> several places in clang and hasn't needed any significant changes since
> 2010, I'd say it's fairly solid. It also has the very obvious advantage
> of already existing, which makes it a pretty good candidate for a
> version 1 format, IMHO.

So, I've gone and read all of it to try and get a good handle on the
current state rather than dredging up memories from so many years ago. =] I
can speak more confidently now.

The OnDiskHashTable stuff seems perfectly fine for emitting just that - an
on-disk-hash-table. However, it was never designed to support long-lived
file formats in that form. The use in serialized ASTs is specifically not
supporting a long-term file format. The biggest issue there is endianness,
and I see you've already very nicely added good support for that. The only
remaining concern I might have is the 32-bit offset limitation. While >2GB
of counter data may seem unlikely, I don't think it is inconceivable, and
even >4GB of counter data might happen. Using 64-bit offsets seems an easy
fix though. Essentially, I think this part of the code could quickly and
easily be made viable for this purpose, although it would require a bit
more cleanup and documenting the intended stability.
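
[Editorial note: on the endianness point, the property that matters for
a long-lived format is that every multi-byte field is written in one
fixed byte order rather than host order. LLVM's llvm/Support/Endian.h
helpers exist for exactly this; a dependency-free sketch of the idea:]

    #include <cstdint>
    #include <ostream>

    // Write a 64-bit value as little-endian bytes regardless of host
    // byte order, so the file means the same thing on every platform.
    void writeLE64(std::ostream &OS, uint64_t V) {
      for (int I = 0; I < 8; ++I)
        OS.put(static_cast<char>((V >> (8 * I)) & 0xff));
    }

    // Matching reader: reassemble the value from little-endian bytes.
    uint64_t readLE64(const unsigned char *P) {
      uint64_t V = 0;
      for (int I = 0; I < 8; ++I)
        V |= static_cast<uint64_t>(P[I]) << (8 * I);
      return V;
    }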

Anyways, the part I was truly concerned about is actually nicely factored
out -- the hashing bit. The AST's hashing is *completely* unsuitable for a
long-term file format, but my assumption is that you'd just use the
existing stable PGO hashing stuff for this table? If so, it should work
fine. If you want to hash other things (function names?), I would just urge
using something like their MD5 or some other fixed and uncontroversial
algorithm so we don't end up wondering how a bug snuck in there N years
later. So this seems workable too.

Essentially, to answer a later question:

> If you're opposed to moving the existing OnDiskHashTable into Support,
> perhaps because you don't think it should proliferate to other uses,
> the obvious alternative is to include a private copy of a stripped down
> version of it for the profile reader and writer to use themselves. I'm
> not sure if this is worth the copy pasted code, but it is an
> option. What do you think?

I think with the cleanups you've started plus a bit more documentation,
this could be a fine candidate for a generic on-disk (or, raw memory
buffer) hash table writer and reader.

OK, on to the general questions rather than ones concerning specific code...

> I have a general preference for from-disk lookups to use tries (for
> strings, prefix tries) or other fast, sorted lookup structures. They
> have the nice property of being inherently stable and unambiguous, and
> not baking any hashing algorithm into it.

> I would like to experiment with a few trie-based approaches for this as
> we try to optimize the PGO process further (both for space and for
> lookup time). Even so, it's not a sure thing that this will work better,

So the first question is whether it is really worth looking into other
solutions. I have a suspicion that there are better formats for this
because of one key idea: while the important operation is lookup, I don't
think it is truly *random* lookup. In fact, I suspect it will be extremely
structured lookup, with a few hot clusters of data due to similar function
"names" (where the names of things like file-local static functions get
file name prefixes and such to disambiguate them). So I think that there is
a real locality win possible in this space. Maybe not all the time, and
most of the time it may be below the noise floor, but I think it will still
be there.

> and I don't think it's worth delaying getting something that people can
> use out the door.

If the file format is wide open to change over the coming months, then I'm
totally down with this. Makes perfect sense. However, I get the feeling
that it isn't wide open to change, and we're going to quickly end up locked
into a particular format here to support the backwards compatibility
concerns. Personally, I'm happy to change the format *very* fluidly, at
least until there is an LLVM release, and potentially even after that, but
it would be good to hear from others that want to consume this data.

It wouldn’t surprise me at all if we eventually need larger than 32-bit offsets, but I don’t think that’s a priority right now. We have almost no knowledge of the typical size of these PGO profiles, and no one is even using this new feature yet. More importantly, it would be good to share the basic OnDiskHashTable implementation for clang’s serialized ASTs and modules, and I’m concerned that it might be disruptive to change that right now. We’re almost certainly going to be changing the file format over time and we can coordinate a change to 64-bit offsets later if that turns out to be important.

The PGO hashing is completely unrelated. The hash here is just of the function name (or more specifically, the name that we used in the PGO data, e.g., possibly mangled). We’re currently using Bernstein’s hash function, which is so simple that I’m not worried about bugs, but it isn’t a great hash function. MD5 would make sense to me.
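
[Editorial note: Bernstein's hash (djb2) referenced here is only a few
lines, which is why bugs aren't a worry; the concern is its weak mixing
and 32-bit output for a persistent format. For reference, the classic
algorithm:]

    #include <cstdint>
    #include <string>

    // Bernstein's string hash (djb2): h = h * 33 + c, seeded with 5381.
    // Trivial to re-implement identically anywhere, but a weak mixer
    // compared to a fixed cryptographic digest like MD5.
    uint32_t bernsteinHash(const std::string &S) {
      uint32_t H = 5381;
      for (unsigned char C : S)
        H = H * 33 + C;
      return H;
    }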

We need to settle on a file format ASAP for our internal work, but from the perspective of the LLVM community, it makes sense to me that this should remain wide open to change, at least until it goes out in an open-source LLVM release.

I’d like to proceed in the usual incremental fashion. This change is essentially just moving the OnDiskHashTable code from clang to llvm, and given that we seem to have agreement on the general direction, I see no reason why that can’t be committed now, along with the llvm-profdata changes to use it. It should be followed up with a change to address Chandler’s concern about the hash function.

Chandler, do you think we need to investigate the 64-bit offset issue now or can that wait until we have a better idea whether it matters?

Bob Wilson <bob.wilson@apple.com> writes:

    The OnDiskHashTable stuff seems perfectly fine for emitting just that - an
    on-disk-hash-table. However, it was never designed to support long-lived
    file formats in that form. The use in serialized ASTs is specifically not
    supporting a long-term file format. The biggest issue there is endianness,
    and I see you've already very nicely added good support for that. The only
    remaining concern I might have is the 32-bit offset limitation. While
    >2GB of counter data may seem unlikely, I don't think it is
    inconceivable, and even >4GB of counter data might happen. Using
    64-bit offsets seems an
    easy fix though. Essentially, I think this part of the code could quickly
    and easily be made viable for this purpose, although it would require a
    bit more cleanup and documenting the intended stability.

> It wouldn’t surprise me at all if we eventually need larger than 32-bit
> offsets, but I don’t think that’s a priority right now. We have almost no
> knowledge of the typical size of these PGO profiles, and no one is even
> using this new feature yet. More importantly, it would be good to share
> the basic OnDiskHashTable implementation for clang’s serialized ASTs and
> modules, and I’m concerned that it might be disruptive to change that
> right now. We’re almost certainly going to be changing the file format
> over time and we can coordinate a change to 64-bit offsets later if that
> turns out to be important.

Adding an extra template parameter (offset_type?) to the OnDiskHashTable
classes would be pretty trivial; we can even default it to uint32_t for
the pre-existing clang usages to minimize churn. Given how easy this is
to do, it's probably worth doing now.
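
[Editorial note: a sketch of what that template parameter could look
like - a hypothetical simplification of the generator class, not the
actual code:]

    #include <cstdint>

    // Parameterize the table on its offset type, defaulting to
    // uint32_t so existing clang clients are unchanged, while profile
    // data can opt into 64-bit offsets if files ever grow past 4GB.
    template <typename Info, typename OffsetT = uint32_t>
    class OnDiskChainedHashTableGenerator {
    public:
      typedef OffsetT offset_type;
      // insert() and Emit() would use offset_type for all file offsets
      // and for the key/data length prefixes they write.
    };

    // The profile writer could then instantiate, e.g.:
    //   OnDiskChainedHashTableGenerator<ProfileLookupTrait, uint64_t> G;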

    Anyways, the part I was truly concerned about is actually nicely factored
    out -- the hashing bit. The AST's hashing is *completely* unsuitable for a
    long-term file format, but my assumption is that you'd just use the
    existing stable PGO hashing stuff for this table? If so, it should work
    fine. If you want to hash other things (function names?), I would just
    urge using something like their MD5 or some other fixed and
    uncontroversial algorithm so we don't end up wondering how a bug snuck in
    there N years later. So this seems workable too.

> The PGO hashing is completely unrelated. The hash here is just of the
> function name (or more specifically, the name that we used in the PGO
> data, e.g., possibly mangled). We’re currently using Bernstein’s hash
> function, which is so simple that I’m not worried about bugs, but it
> isn’t a great hash function. MD5 would make sense to me.

Yes, these are hashes of the function names. The OnDiskHashTable
currently expects a 32 bit hash, so I think we should add a
hash_value_type template parameter as well, then use 64 bits of an MD5
here. Agreed?
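
[Editorial note: a sketch of the proposed scheme - hash the (possibly
mangled) function name with MD5 and keep 64 bits of the digest as the
table's hash value. The md5Digest helper is assumed rather than real;
any MD5 implementation, such as llvm::MD5, fits behind it.]

    #include <array>
    #include <cstdint>
    #include <string>

    // Assumed helper: returns the 16 raw MD5 digest bytes of Name.
    std::array<uint8_t, 16> md5Digest(const std::string &Name);

    // Proposed on-disk hash value: the low 64 bits of MD5(name),
    // assembled byte by byte so the value is independent of host
    // endianness and of any in-memory hashing implementation.
    uint64_t profileNameHash(const std::string &Name) {
      std::array<uint8_t, 16> D = md5Digest(Name);
      uint64_t H = 0;
      for (int I = 0; I < 8; ++I)
        H |= static_cast<uint64_t>(D[I]) << (8 * I);
      return H;
    }

Because the algorithm is fixed and well specified, a reader written
years later can recompute exactly the same value - the property being
asked for here.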

Chandler Carruth <chandlerc@google.com> writes:

> Essentially, to answer a later question:

    If you're opposed to moving the existing OnDiskHashTable into Support,
    perhaps because you don't think it should proliferate to other uses,
    the obvious alternative is to include a private copy of a stripped down
    version of it for the profile reader and writer to use themselves. I'm
    not sure if this is worth the copy pasted code, but it is an
    option. What do you think?

> I think with the cleanups you've started plus a bit more documentation,
> this could be a fine candidate for a generic on-disk (or, raw memory
> buffer) hash table writer and reader.

Great. I'll go ahead and move the hash table now, and I'll add some
documentation while I'm at it. I'm going to put it in llvm/Support/,
and keep the OnDiskHashTable.h name.

Then I'll go ahead and write up the patches for choosing the offset and
hash value types, and then move on to using this in InstrProf and
updating the clang clients.

> OK, on to the general questions rather than ones concerning specific code...

    > I have a general preference for from-disk lookups to use tries (for
    > strings, prefix tries) or other fast, sorted lookup structures. They
    > have the nice property of being inherently stable and unambiguous,
    > and not baking any hashing algorithm into it.
   
    I would like to experiment with a few trie-based approaches for this as
    we try to optimize the PGO process further (both for space and for
    lookup time). Even so, it's not a sure thing that this will work better,

> So the first question is whether it is really worth looking into other
> solutions. I have a suspicion that there are better formats for this
> because of one key idea: while the important operation is lookup, I don't
> think it is truly *random* lookup. In fact, I suspect it will be extremely
> structured lookup, with a few hot clusters of data due to similar function
> "names" (where the names of things like file-local static functions get
> file name prefixes and such to disambiguate them). So I think that there
> is a real locality win possible in this space. Maybe not all the time,
> and most of the time it may be below the noise floor, but I think it will
> still be there.

Right, data locality and an abundance of shared prefixes make this data
set kind of interesting. I agree it's worth experimenting, but I do
think we need this in as a baseline first.
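
[Editorial note: one concrete way the shared prefixes could pay off,
sketched purely for illustration - front coding a lexicographically
sorted name list, where each entry stores only the length of the prefix
it shares with its predecessor plus the differing suffix.]

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // Front coding: names like "file.c:helper1", "file.c:helper2"
    // compress to a shared-prefix length plus a short suffix, and
    // lookups of clustered names touch neighbouring entries.
    std::vector<std::pair<uint32_t, std::string>>
    frontCode(const std::vector<std::string> &SortedNames) {
      std::vector<std::pair<uint32_t, std::string>> Out;
      std::string Prev;
      for (const std::string &Name : SortedNames) {
        uint32_t Shared = 0;
        while (Shared < Prev.size() && Shared < Name.size() &&
               Prev[Shared] == Name[Shared])
          ++Shared;
        Out.push_back({Shared, Name.substr(Shared)});
        Prev = Name;
      }
      return Out;
    }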

    and I don't think it's worth delaying getting something that people can
    use out the door.

> If the file format is wide open to change over the coming months, then I'm
> totally down with this. Makes perfect sense. However, I get the feeling
> that it isn't wide open to change, and we're going to quickly end up
> locked into a particular format here to support the backwards
> compatibility concerns. Personally, I'm happy to change the format *very*
> fluidly, at least until there is an LLVM release, and potentially even
> after that, but it would be good to hear from others that want to consume
> this data.

I'm happy with making changes here too. Once we have an LLVM release
with this in it, then changes will need to come with a story on how
users upgrade their old data to the changed format, but we shouldn't let
that stop us from improving things.

Ok, I don't really know how to reconcile this.

Is it fine to make breaking, backwards incompatible changes to the file
format in the coming months or not?

I would really like it if we can avoid making those kinds of changes, but I don’t know how to justify that from an open-source LLVM project perspective.

I can’t expect you to care about Apple’s release schedules but neither can I change them. It would be nice if the open-source version of LLVM can work with the PGO profile data used by Apple (especially because it seems that you guys are taking a totally different approach to PGO which won’t even use these same files). If we can get a “good enough” solution into trunk now, and if we can continue to provide support for that format, then we can make that work. If you really think it’s important to continue iterating on the file format, without adding support for reading the earlier format, then we’ll lose that. It would be unfortunate for the community IMO, but you may not agree.

I think maintaining support for a legacy format that has never even been part of an open source LLVM release is a really lame burden to place on the community. It’ll add complexity to Clang and the profile data tools. But there are many things that make pragmatic sense even though they are lame.

However, I think even worrying about this is getting in the way of developing an actually good file format. I think that the design discussion, iteration, and evaluation should proceed regardless of what you happen to do at Apple. Later on, if you end up needing to add reading support for some fixed “external” format we don’t control, propose that as a separate patch that clearly factors all of the logic away into detection and reading of an alternative file format. That way it’s clear which is the supported and recommended format and design for the open source tools, and which is just a compatibility layer for some existing files and tools that can’t be changed.

For example, this is how I plan to suggest teaching the llvm-profdata tool to read and merge profile data produced by profiling tools outside of LLVM like the Linux perf events, etc. It’s going to come after we essentially have the pipeline well designed and worked out, as a clearly separated layer that just adds compatibility with specific fixed file formats we can’t easily control.

    I think maintaining support for a legacy format that has never even
    been part of an open source LLVM release is a really lame burden to
    place on the community. It’ll add complexity to Clang and the profile
    data tools. But there are many things that make pragmatic sense even
    though they are lame.

    However, I think even worrying about this is getting in the way of
    developing an actually good file format. I think that the design
    discussion, iteration, and evaluation should proceed regardless of
    what you happen to do at Apple.

I agree the top priority should always be "getting it right". But I can’t agree with this thinking completely. This has to be balanced with pragmatism. If we completely disregard the practical concerns of commercial use, it makes LLVM hostile towards an important group of users.

Evan

    I agree the top priority should always be "getting it right". But
    I can’t agree with this thinking completely. This has to be
    balanced with pragmatism. If we completely disregard the practical
    concerns of commercial use, it makes LLVM hostile towards an
    important group of users.

Clearly, I can't argue with that. =] We benefit from it as well. And I
don't think I'm arguing against a pragmatic approach in this case, although
I'm sorry if it comes off that way.

Just so we're on the same page, the only real question in my mind is: can
we make breaking changes as we iterate on the design.

What I would like to do is figure out the right design first,
incrementally, trying one format, and seeing how it does, but potentially
with backwards incompatible changes. Once we have the experience and know
what the right format is, *then* we should consider pragmatic concerns such
as how to structure support for reading legacy formats, or formats outside
of the control of the project. I think if we start off with a format that
we can't change because of external reasons (whatever they may be), it will
be much harder to incrementally get to the right design. Does that seem
reasonable (and hopefully help explain the concrete suggestion I'm making)?

Why can't we have our cake and eat it too?

Off the cuff and flame away
------------
What's wrong with a soft policy like this: feel free to make an incremental
improvement which breaks compatibility, and then if someone else wants to
contribute a patch that restores the old format/behavior, that's up to them
(patches welcome). Both changes would be accepted, which allows the parallel
paths (putting the burden of maintaining backward compatibility on those
stakeholders). Wouldn't this solve the problem Right Now(tm)?

I think it's important that there remain incentives for contributors to
address design feedback that comes up in review. Making it "patches
welcome" right away removes that incentive IMO, and would be a shame.
Sometimes it's unavoidable, but I think the project is better when we
hold that line.

As an example, when we wanted to contribute the AST matchers to Clang, it
was really useful that it was entirely on us to address all of the
(excellent) design review from Doug. It may not have mattered for our use
cases, but it mattered that the not-quite-right solution didn't just sit
rotting in mainline waiting for someone to take up the challenge of making
it the right solution.

The pragmatic tradeoff here is to do it incrementally, so you don't have to
leap to the right solution before the first patch lands. And I don't think
there has been any pushback against that in this thread.

I don't mean this as an attack.
It's very depressing to see passionate Apple vs Google people not be able
to come to middle ground very quickly.

=] I think we're actually pretty close, and the discussion continues to be
polite, and non-circular, so I'm not depressed yet.

/* last non-technical reply */

I empathize with one side in this case, but outside this thread and generally - holding the line is great, but sometimes you've gotta break a few eggs to make an omelet... (Or hold the line, but don't step on it.) Especially if all the commit noise happens between releases and the "patches"/code churn pragmatically gets squashed for any end users.

/* Probably not the case this time - I'd hope passionate developers don't block ${serious_contributor} to the point of hard deadline decisions. It would be a larger foul for the community of users to miss out on a compatible solution */

I didn’t think that’s what you were suggesting. But I wanted to make sure we are on the same page. Sometimes I felt like the arguments in these threads are starting to take on a “religious” tone. I really don’t want to see that happen.

Yes and no. I agree that we should try to get the design right. We should not be afraid to make changes to drop technical debt even if it’s inconvenient. In this specific discussion, I obviously agree we should come up with a method to support non-mainline formats. However, I don’t think it’s always possible to come up with a design that’s 100% perfect. And the only way we can get real world feedback is to get it out there so people can use it. I fear we are straying a little too far from the iterative approach that LLVM has always encouraged.

Evan

I’m not depressed either. The community is growing and I expect and respect the difference of opinions. As far as I can tell, we’re all focused on making LLVM better. We’ll sort this out.

Evan