RFC: Reducing Instr PGO size overhead

LLVM profile instrumentation incurs a very large size overhead (memory and
storage). For instance, the following is a binary size comparison of the
Clang binaries built (-O2 -DNDEBUG) with different configurations:

1) 60.9M (built with Clang itself)
2) 280.4M (same as 1, but added -fprofile-instr-generate)
3) 54.9M (built with GCC 4.8)
4) 156.5M (same as 3, but added -fprofile-generate)

In other words, Clang's instrumentation increases binary size by 4.6X,
while GCC's instrumentation overhead is only 2.8X.

The profile data size from Clang instrumented binary is also much
larger. The following is a comparison:

1) 114.5M (raw profile data emitted by Clang instrumented by Clang)
2) 63.7M (indexed profile data emitted by Clang instrumented by Clang)
3) 33.1M (total size of GCDA files emitted by Clang instrumented by GCC).

The large size overhead can greatly limit the usability of PGO:
a) devices with a small system partition do not have much room left -- an
image greatly bloated by instrumentation may not fit and cannot be
installed
b) the linker can be highly stressed when linking very large C++
applications -- a large size increase due to instrumentation can prevent
those apps from being successfully linked
c) large raw profile dumps may also cause problems, e.g. running out of
disk space. They can also make profile bootstrap builds really slow.

Looking at the various sources of size increase caused by instrumentation,
it is clear that the biggest contributor is the __llvm_prf_names
section. The current PGO implementation uses function assembly names
as the lookup keys in the indexed format, so the names need to be emitted
both in the rodata of the binary and in the raw/indexed profiles.

On the other hand, LLVM's indexed format also supports 'key collision'
-- it allows functions with the same key to share the same entry in the
hash table. A function's structural hash value is then used as a
secondary key to find the real match.
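To make the lookup mechanism concrete, here is a minimal, self-contained
sketch (illustrative only; these are not LLVM's actual reader types or
names) of a two-level lookup where the primary key may collide and the
structural hash disambiguates:

#include <cstdint>
#include <unordered_map>
#include <vector>

// Illustrative record: one set of counters plus the structural hash of
// the function body they were collected for.
struct ProfileRecord {
  uint64_t StructuralHash;
  std::vector<uint64_t> Counters;
};

// Primary key (e.g. the low 64 bits of MD5(function name)) maps to all
// records that happen to share that key.
using ProfileIndex = std::unordered_map<uint64_t, std::vector<ProfileRecord>>;

// The primary key narrows the search; the structural hash picks the real
// match among colliding entries.
const ProfileRecord *lookup(const ProfileIndex &Index, uint64_t NameKey,
                            uint64_t StructuralHash) {
  auto It = Index.find(NameKey);
  if (It == Index.end())
    return nullptr;
  for (const ProfileRecord &R : It->second)
    if (R.StructuralHash == StructuralHash)
      return &R;
  return nullptr; // same name key, but no structural match
}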

This key-collision support allows us to use a key other than the function
assembly name, and that can reduce binary/profile data size
significantly. The MD5 hash of the function name is a good candidate, and
I have a patch (in 3 parts: runtime, reader/writer, clang) that does just
that. The results are very promising: with this change, the Clang
instrumented binary size is now 216M (down from 280M); the raw profile
size is only 40.1M (a 2.85X size reduction); and the indexed profile
size is only 29.5M (a 2.2X size reduction).

With the change, the indexed format size is smaller than GCC's (but
the latter has value profile data). The binary size is still much
larger than GCC's, but with late instrumentation we will get further
size reduction.

A couple more details about the patch:

1) When -fcoverage-mapping is specified, __llvm_prf_names will still be
emitted into the binary, but the names won't be dumped into the profile
data. In other words, with -fcoverage-mapping, only the profile data will
be shrunk. The reason is that the llvm-cov tool reads function names from
that section (referenced from the covmap) to support name-based filtering
(including regular expressions) when dumping line coverage reports.
2) The change is backward compatible: old versions of both the raw and
indexed formats can still be read by the new profile reader (and
therefore by clients such as the clang, llvm-profdata, and llvm-cov tools).

I plan to submit the patch for review after some cleanups.

Thoughts, concerns?

thanks,

David

I think it is reasonable to simply replace the key we currently use with
MD5(key) for getting a size reduction. In practice for my use cases, I
have not observed any of the issues you mentioned under "Large size of
overhead can limit the usability of PGO greatly", but I can understand that
some of these issues could become problems in Google's use case. I would
personally prefer to keep the existing behavior as the default (see below),
and have MD5(key) as an option.

My primary concern is that if the function names are not kept at all stages,
then it becomes difficult to analyze the profile data in a standalone way.
Many times, I have used `llvm-profdata show -all-functions foo.profdata` on
the resulting profile data and then imported that data into Mathematica for
analysis. My understanding of your proposal is that `llvm-profdata show
-all-functions foo.profdata` will not show the actual function names but
instead MD5 hashes, which will make it more difficult for me to do this
kind of analysis (would require using nm on the original binary, hashing
everything, etc.).

btw, feel free to attach the patch even if it is in a rough state. It can
still help to clarify the proposal and be a good talking point.
Fine-grained patch review, where the rough parts matter, will happen on
llvm-commits; the rough parts will not distract from the discussion here on
llvm-dev.

-- Sean Silva

I think it is reasonable to simply replace the key we currently use with
MD5(key) for getting a size reduction. In practice for my use cases, I have
not observed any of the issues you mentioned under "Large size of overhead
can limit the usability of PGO greatly", but I can understand that some of
these issues could become problems in Google's use case. I would personally
prefer to keep the existing behavior as the default (see below), and have
MD5(key) as an option.

The problem is that this requires profile format changes. It will be
very messy to support multiple formats in instr-codegen and
instr-runtime. For compatibility, the reader is taught to support the
previous format, but the changes there are isolated (and are also
expected to be removed in the future).

My primary concern is that if the function name are not kept at all stages,
then it becomes difficult to analyze the profile data in a standalone way.
Many times, I have used `llvm-profdata show -all-functions foo.profdata` on
the resulting profile data and then imported that data into Mathematica for
analysis.

This is certainly a very valid use case.

My understanding of your proposal is that `llvm-profdata show
-all-functions foo.profdata` will not show the actual function names but
instead MD5 hashes,

Yes.

To support your use case, there are two solutions:

1) the user can add the -fcoverage-mapping option to the build
2) introduce a new option -fprofile-instr-names that forces the
emission of the name section in the .o file. This is similar to 1),
but no covmap section is needed.

The llvm-profdata tool will be taught to read the name section and attach
function names to the profile records (a sketch of this step is below).

Note that with 1) or 2), the user can still benefit from the reduced
profile size.
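For illustration only (this is just a sketch of the idea, not the actual
llvm-profdata change; the helper and types below are assumptions), the
name-attachment step amounts to building a hash-to-name map from the
emitted name section and using it to annotate the otherwise hash-keyed
records:

#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the 64-bit MD5-based name hash used as the profile key.
uint64_t hashName(const std::string &Name) {
  return std::hash<std::string>{}(Name); // placeholder, not actual MD5
}

// Assumed stand-in for a profile record read from the indexed profile.
struct NamedRecord {
  uint64_t NameHash;            // key stored in the profile
  std::string Name;             // filled in if the name section is present
  std::vector<uint64_t> Counters;
};

// Build hash -> name from the strings found in the name section, then
// attach readable names to the records read from the profile.
void attachNames(const std::vector<std::string> &NameSection,
                 std::vector<NamedRecord> &Records) {
  std::unordered_map<uint64_t, std::string> HashToName;
  for (const std::string &N : NameSection)
    HashToName[hashName(N)] = N;

  for (NamedRecord &R : Records) {
    auto It = HashToName.find(R.NameHash);
    if (It != HashToName.end())
      R.Name = It->second;      // otherwise the record keeps only the hash
  }
}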

thanks,

David

>
> I think it is reasonable to simply replace the key we currently use with
> MD5(key) for getting a size reduction. In practice for my use cases, I have
> not observed any of the issues you mentioned under "Large size of overhead
> can limit the usability of PGO greatly", but I can understand that some of
> these issues could become problems in Google's use case. I would personally
> prefer to keep the existing behavior as the default (see below), and have
> MD5(key) as an option.

The problem is that this requires profile format changes.

Why? AFAIK the "function name" is just an arbitrary string. Using s or
MD5(s) shouldn't matter. Of course, the user will need to pass consistent
flags to clang.

It will be
very messy to support multiple formats in instr-codegen and
instr-runtime. For compatibility concerns, the reader is taught to
support previous format, but the changes there are isolated (also
expected to be removed in the future).

>
> My primary concern is that if the function name are not kept at all stages,
> then it becomes difficult to analyze the profile data in a standalone way.
> Many times, I have used `llvm-profdata show -all-functions foo.profdata` on
> the resulting profile data and then imported that data into Mathematica for
> analysis.

This is certainly a very valid use case.

> My understanding of your proposal is that `llvm-profdata show
> -all-functions foo.profdata` will not show the actual function names but
> instead MD5 hashes,

Yes.

To support your use case, there are two solutions:

1) user can add -fcoverage-mapping option in the build
2) introduce a new option -fprofile-instr-names that force the
emission of the name sections in the .o file. This is similar to 1),
but no covmap section is needed.

llvm-profdata tool will be taught to read the name section and attach
function names to the profile records.

Needing to pass the executable to llvm-profdata would cause deployment
issues for my customers in practice.

Note that with 1) or 2), the user can still benefit from the reduced
profile size.

Let me reiterate that the size of the profile is not a problem I have
observed in practice (nor have I heard of this being a problem in practice
until this thread). Therefore I'm skeptical of any changes to our default
behavior or any new requirements that are not opt-in.

-- Sean Silva

>
> I think it is reasonable to simply replace the key we currently use with
> MD5(key) for getting a size reduction. In practice for my use cases, I have
> not observed any of the issues you mentioned under "Large size of overhead
> can limit the usability of PGO greatly", but I can understand that some of
> these issues could become problems in Google's use case. I would personally
> prefer to keep the existing behavior as the default (see below), and have
> MD5(key) as an option.

The problem is that this requires profile format changes.

Why? AFAIK the "function name" is just an arbitrary string. Using s or
MD5(s) shouldn't matter. Of course, the user will need to pass consistent
flags to clang.

The raw format for 64-bit targets can be kept 'unchanged', but not the
32-bit raw format -- the nameptr field is only 32 bits.

The indexed format cannot be kept the same -- the on-disk profile
record layout changes completely: the key field changes from a blob of
chars into a 64-bit integer.

It will be
very messy to support multiple formats in instr-codegen and
instr-runtime. For compatibility concerns, the reader is taught to
support previous format, but the changes there are isolated (also
expected to be removed in the future).

>
> My primary concern is that if the function name are not kept at all stages,
> then it becomes difficult to analyze the profile data in a standalone way.
> Many times, I have used `llvm-profdata show -all-functions foo.profdata` on
> the resulting profile data and then imported that data into Mathematica for
> analysis.

This is certainly a very valid use case.

>My understanding of your proposal is that `llvm-profdata show
> -all-functions foo.profdata` will not show the actual function names but
> instead MD5 hashes,

Yes.

To support your use case, there are two solutions:

1) user can add -fcoverage-mapping option in the build
2) introduce a new option -fprofile-instr-names that force the
emission of the name sections in the .o file. This is similar to 1),
but no covmap section is needed.

llvm-profdata tool will be taught to read the name section and attach
function names to the profile records.

Needing to pass the executable to llvm-profdata would cause deployment
issues for my customers in practice.

Why? The deployment needs to pass the profile data anyway, right? This
is no different from the llvm-cov usage model.

David

I think I can make the format changes in such a way that your use case
is not affected. What is needed is an optional name section in both
raw and indexed profile data. The default behavior can also be made
unchanged.

thanks,

David

>
>
>>
>> >
>> > I think it is reasonable to simply replace the key we currently use
with
>> > MD5(key) for getting a size reduction. In practice for my use cases,
I
>> > have
>> > not observed any of the issues you mentioned under "Large size of
>> > overhead
>> > can limit the usability of PGO greatly", but I can understand that
some
>> > of
>> > these issues could become problems in Google's use case. I would
>> > personally
>> > prefer to keep the existing behavior as the default (see below), and
>> > have
>> > MD5(key) as an option.
>>
>> The problem is that this requires profile format changes.
>
>
> Why? AFAIK the "function name" is just an arbitrary string. Using s or
> MD5(s) shouldn't matter. Of course, the user will need to pass consistent
> flags to clang.

The raw format for 64bit target can be made 'unchanged', but not for
the 32bit raw format -- the nameptr field is only 32bit.

The indexed format can not be made the same -- The ondisk profile
record layout totally changes. The key field changes from a blob of
chars into an 64bit integer.

An MD5 sum cannot be represented as a blob of chars?

Or to say it another way, suppose that Itanium mangling required, as a final
step, replacing the string with its md5 sum in hex. Then all symbol
names would be "small". My understanding is that this is effectively all your
patch is doing.

If this is not the case, you should show your current patch so that we can
discuss things concretely.

>
>
>>
>> It will be
>> very messy to support multiple formats in instr-codegen and
>> instr-runtime. For compatibility concerns, the reader is taught to
>> support previous format, but the changes there are isolated (also
>> expected to be removed in the future).
>>
>> >
>> > My primary concern is that if the function name are not kept at all
>> > stages,
>> > then it becomes difficult to analyze the profile data in a standalone
>> > way.
>> > Many times, I have used `llvm-profdata show -all-functions
foo.profdata`
>> > on
>> > the resulting profile data and then imported that data into
Mathematica
>> > for
>> > analysis.
>>
>> This is certainly a very valid use case.
>>
>> >My understanding of your proposal is that `llvm-profdata show
>> > -all-functions foo.profdata` will not show the actual function names
but
>> > instead MD5 hashes,
>>
>> Yes.
>>
>> To support your use case, there are two solutions:
>>
>> 1) user can add -fcoverage-mapping option in the build
>> 2) introduce a new option -fprofile-instr-names that force the
>> emission of the name sections in the .o file. This is similar to 1),
>> but no covmap section is needed.
>>
>> llvm-profdata tool will be taught to read the name section and attach
>> function names to the profile records.
>
>
> Needing to pass the executable to llvm-profdata would cause deployment
> issues for my customers in practice.

Why? The deployment needs to pass the profile data anyway right?

Yes, but not the executable.

The PGO training run is likely being run by a gameplay tester
(non-programmer). In general the binary will not be lying around as a loose
file anywhere; it will be part of a full package of the binary+assets
(think of what will end up on a Blu-ray disc). A game's binary is *completely
useless* without the assets, so except locally on a programmer's machine
while they iterate/debug, there is no reason for a binary to ever exist as
a standalone file.

I'm not saying that needing the binary is insurmountable in any particular
scenario. Just that it will cause a strict increase in the number of issues
to deploying PGO.

These are much bigger "compatibility concerns" for me than for newer
toolchains to accept the old format. For a change in format I can easily
tell my users to replace an exe with a newer one and that is all they need
to do and it takes 10 seconds, guaranteed. A workflow change is potentially
a massive disruption and guaranteed to take more than 10 seconds to fix
(perhaps hours or days).

  This
is no different from llvm-cov usage model.

In practice, getting the performance of PGO is a higher priority for my
users, so we should not assume that llvm-cov is being used.

-- Sean Silva

>
>
>>
>> >
>> > I think it is reasonable to simply replace the key we currently use
with
>> > MD5(key) for getting a size reduction. In practice for my use
cases, I
>> > have
>> > not observed any of the issues you mentioned under "Large size of
>> > overhead
>> > can limit the usability of PGO greatly", but I can understand that
some
>> > of
>> > these issues could become problems in Google's use case. I would
>> > personally
>> > prefer to keep the existing behavior as the default (see below), and
>> > have
>> > MD5(key) as an option.
>>
>> The problem is that this requires profile format changes.
>
>
> Why? AFAIK the "function name" is just an arbitrary string. Using s or
> MD5(s) shouldn't matter. Of course, the user will need to pass
consistent
> flags to clang.

The raw format for 64bit target can be made 'unchanged', but not for
the 32bit raw format -- the nameptr field is only 32bit.

The indexed format can not be made the same -- The ondisk profile
record layout totally changes. The key field changes from a blob of
chars into an 64bit integer.

An MD5 sum cannot be represented as a blob of chars?

Or to say it another way, suppose that Itanium mangling required as a
final step to replace the string with its md5 sum in hex. Therefore all
symbol names are "small". My understanding is that this is effectively all
your patch is doing.

(only when dealing with profile-related things; the actual symbol names in
the binary are kept the same as always; i.e. all PGO-related codepaths will
either 1) use the assembly name directly or 2) compute an MD5 sum of the
assembly name before using it; which of these is chosen depends on a
user-specified flag)

-- Sean Silva

>
>
>>
>> >
>> > I think it is reasonable to simply replace the key we currently use
>> > with
>> > MD5(key) for getting a size reduction. In practice for my use cases,
>> > I
>> > have
>> > not observed any of the issues you mentioned under "Large size of
>> > overhead
>> > can limit the usability of PGO greatly", but I can understand that
>> > some
>> > of
>> > these issues could become problems in Google's use case. I would
>> > personally
>> > prefer to keep the existing behavior as the default (see below), and
>> > have
>> > MD5(key) as an option.
>>
>> The problem is that this requires profile format changes.
>
>
> Why? AFAIK the "function name" is just an arbitrary string. Using s or
> MD5(s) shouldn't matter. Of course, the user will need to pass
> consistent
> flags to clang.

The raw format for 64bit target can be made 'unchanged', but not for
the 32bit raw format -- the nameptr field is only 32bit.

The indexed format can not be made the same -- The ondisk profile
record layout totally changes. The key field changes from a blob of
chars into an 64bit integer.

An MD5 sum cannot be represented as a blob of chars?

yes -- it is a fixed-length (8-byte) blob which may include a null byte in
the middle.

Or to say it another way, suppose that Itanium mangling required as a final
step to replace the string with its md5 sum in hex. Therefore all symbol
names are "small". My understanding is that this is effectively all your
patch is doing.

The key type before the change is StringRef, while after the change the
key type is uint64_t. Are you suggesting treating the uint64_t md5 sum key
as a string of 8 bytes, or storing the md5 hash in text form, which would
double the size?

In the raw format, the md5 sum key can be an embedded field in the
prf_data variable instead of a separate variable referenced by prf_data.
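As a rough illustration (a hypothetical layout, not the actual
__llvm_profile_data definition), embedding the 64-bit key removes the
pointer to a separately emitted name string:

#include <cstdint>

// Hypothetical per-function record in the current scheme: the key is a
// pointer into the __llvm_prf_names blob, which therefore has to be kept
// in the binary and copied into the raw profile.
struct ProfDataWithNamePtr {
  const char *NamePtr;    // points into the names section
  uint32_t NameSize;
  uint64_t StructuralHash;
  uint64_t *CounterPtr;
  uint32_t NumCounters;
};

// Hypothetical record in the proposed scheme: the 64-bit MD5-based key is
// embedded directly, so no name blob is needed for profile lookup.
struct ProfDataWithHashKey {
  uint64_t NameHashKey;   // low 64 bits of MD5(assembly name)
  uint64_t StructuralHash;
  uint64_t *CounterPtr;
  uint32_t NumCounters;
};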

If this is not the case, you should show your current patch so that we can
discuss things concretely.

It is not. See above about the difference.

>
>
>>
>> It will be
>> very messy to support multiple formats in instr-codegen and
>> instr-runtime. For compatibility concerns, the reader is taught to
>> support previous format, but the changes there are isolated (also
>> expected to be removed in the future).
>>
>> >
>> > My primary concern is that if the function name are not kept at all
>> > stages,
>> > then it becomes difficult to analyze the profile data in a standalone
>> > way.
>> > Many times, I have used `llvm-profdata show -all-functions
>> > foo.profdata`
>> > on
>> > the resulting profile data and then imported that data into
>> > Mathematica
>> > for
>> > analysis.
>>
>> This is certainly a very valid use case.
>>
>> >My understanding of your proposal is that `llvm-profdata show
>> > -all-functions foo.profdata` will not show the actual function names
>> > but
>> > instead MD5 hashes,
>>
>> Yes.
>>
>> To support your use case, there are two solutions:
>>
>> 1) user can add -fcoverage-mapping option in the build
>> 2) introduce a new option -fprofile-instr-names that force the
>> emission of the name sections in the .o file. This is similar to 1),
>> but no covmap section is needed.
>>
>> llvm-profdata tool will be taught to read the name section and attach
>> function names to the profile records.
>
>
> Needing to pass the executable to llvm-profdata would cause deployment
> issues for my customers in practice.

Why? The deployment needs to pass the profile data anyway right?

Yes, but not the executable.

The PGO training run is likely being run by a gameplay tester
(non-programmer). In general the binary will not be lying around as a loose
file anywhere, it will be part of a full package of the binary+assets (think
like what will end up on a bluray disc). A game's binary *completely
useless* without the assets, so except locally on a programmer's machine
while they iterate/debug, there is no reason for a binary to ever exist as a
standalone file.

I'm not saying that needing the binary is insurmountable in any particular
scenario. Just that it will cause a strict increase in the number of issues
to deploying PGO.

Your concern is acknowledged.

These are much bigger "compatibility concerns" for me than for newer
toolchains to accept the old format. For a change in format I can easily
tell my users to replace an exe with a newer one and that is all they need
to do and it takes 10 seconds, guaranteed. A workflow change is potentially
a massive disruption and guaranteed to take more than 10 seconds to fix
(perhaps hours or days).

ok.

  This
is no different from llvm-cov usage model.

In practice, getting the performance of PGO is a higher priority for my
users, so we should not assume that llvm-cov is being used.

Glad to hear that :-)

thanks,

David

>
>
>>
>> >
>> > I think it is reasonable to simply replace the key we currently use
>> > with
>> > MD5(key) for getting a size reduction. In practice for my use
>> > cases, I
>> > have
>> > not observed any of the issues you mentioned under "Large size of
>> > overhead
>> > can limit the usability of PGO greatly", but I can understand that
>> > some
>> > of
>> > these issues could become problems in Google's use case. I would
>> > personally
>> > prefer to keep the existing behavior as the default (see below), and
>> > have
>> > MD5(key) as an option.
>>
>> The problem is that this requires profile format changes.
>
>
> Why? AFAIK the "function name" is just an arbitrary string. Using s or
> MD5(s) shouldn't matter. Of course, the user will need to pass
> consistent
> flags to clang.

The raw format for 64bit target can be made 'unchanged', but not for
the 32bit raw format -- the nameptr field is only 32bit.

The indexed format can not be made the same -- The ondisk profile
record layout totally changes. The key field changes from a blob of
chars into an 64bit integer.

An MD5 sum cannot be represented as a blob of chars?

Or to say it another way, suppose that Itanium mangling required as a
final step to replace the string with its md5 sum in hex. Therefore all
symbol names are "small". My understanding is that this is effectively all
your patch is doing.

(only when dealing with profile-related things; the actual symbol names in
the binary are kept the same as always; i.e. all PGO-related codepaths will
either 1) use the assembly name directly or 2) compute and MD5 sum of the
assembly name before using it; which of these is chosen depends on a
user-specified flag)

2) is how it is implemented. The user-specified flag does not need to
pick between 1) and 2) -- the new format will use 2) consistently
regardless of the flag. The flag simply informs the compiler and profile
runtime whether to dump or omit function names in the profile.

David

>
>
>>
>> >
>> >
>> >>
>> >> >
>> >> > I think it is reasonable to simply replace the key we currently use
>> >> > with
>> >> > MD5(key) for getting a size reduction. In practice for my use
cases,
>> >> > I
>> >> > have
>> >> > not observed any of the issues you mentioned under "Large size of
>> >> > overhead
>> >> > can limit the usability of PGO greatly", but I can understand that
>> >> > some
>> >> > of
>> >> > these issues could become problems in Google's use case. I would
>> >> > personally
>> >> > prefer to keep the existing behavior as the default (see below),
and
>> >> > have
>> >> > MD5(key) as an option.
>> >>
>> >> The problem is that this requires profile format changes.
>> >
>> >
>> > Why? AFAIK the "function name" is just an arbitrary string. Using s or
>> > MD5(s) shouldn't matter. Of course, the user will need to pass
>> > consistent
>> > flags to clang.
>>
>> The raw format for 64bit target can be made 'unchanged', but not for
>> the 32bit raw format -- the nameptr field is only 32bit.
>>
>> The indexed format can not be made the same -- The ondisk profile
>> record layout totally changes. The key field changes from a blob of
>> chars into an 64bit integer.
>
>
> An MD5 sum cannot be represented as a blob of chars?

yes -- it is fixed length (8byte) blob which may include null byte in
the middle.

For reference, an MD5 sum is 16 bytes (128 bits).

>
> Or to say it another way, suppose that Itanium mangling required as a
final
> step to replace the string with its md5 sum in hex. Therefore all symbol
> names are "small". My understanding is that this is effectively all your
> patch is doing.

The key type before the change is StringRef, while the after the key
type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
a string of 8 bytes or storing md5 has in text form which will double
the size?

How much does this change the benefit? If most of the benefit is avoiding
extraordinarily long mangled names then it may be sufficient.

With IR-level instrumentation like Rong is pursuing the size may be reduced
sufficiently that we do not need the optimization proposed in this thread.
For example, Rong found >2x size reduction on Google's C++ benchmarks,
which I assume are representative of the extremely large Google binaries
that are causing the problems addressed by your proposal in this thread.
The measurements you mention for Clang in this thread provide similar size
reductions, so Rong's approach may be sufficient (especially because
functions with extremely large mangled names tend to be small inline
functions in header-only template libraries).

Of the points you mention in "Large size of overhead can limit the
usability of PGO greatly", many of the issues are hard limits that prevent
the use of PGO. Do you have a lower bound on how much the size of the PGO
data must be reduced in order to overcome the hard limits?

Obviously LLVM must be able to support the extremely large binaries in your
configuration (otherwise what use is LLVM as a compiler? ;)). My questions
are primarily aimed at establishing which tradeoffs are acceptable for
supporting this (both for LLVM and for you guys).

Btw, for us, the issue of PGO data size is not completely immaterial but is
very different from your use case. For us, the primary issue is the
additional memory use at run time, since PS4 games usually use "all"
available memory. We had a problem with UBSan where the large amount of
memory required for storing the UBSan diagnostic data at runtime required
the game programmers to manually change their memory map to make room.
+Filipe, do you remember how much memory UBSan was using that caused a
problem?

-- Sean Silva

>
>
>>
>> >
>> >
>> >>
>> >> >
>> >> > I think it is reasonable to simply replace the key we currently
use
>> >> > with
>> >> > MD5(key) for getting a size reduction. In practice for my use
cases,
>> >> > I
>> >> > have
>> >> > not observed any of the issues you mentioned under "Large size of
>> >> > overhead
>> >> > can limit the usability of PGO greatly", but I can understand that
>> >> > some
>> >> > of
>> >> > these issues could become problems in Google's use case. I would
>> >> > personally
>> >> > prefer to keep the existing behavior as the default (see below),
and
>> >> > have
>> >> > MD5(key) as an option.
>> >>
>> >> The problem is that this requires profile format changes.
>> >
>> >
>> > Why? AFAIK the "function name" is just an arbitrary string. Using s
or
>> > MD5(s) shouldn't matter. Of course, the user will need to pass
>> > consistent
>> > flags to clang.
>>
>> The raw format for 64bit target can be made 'unchanged', but not for
>> the 32bit raw format -- the nameptr field is only 32bit.
>>
>> The indexed format can not be made the same -- The ondisk profile
>> record layout totally changes. The key field changes from a blob of
>> chars into an 64bit integer.
>
>
> An MD5 sum cannot be represented as a blob of chars?

yes -- it is fixed length (8byte) blob which may include null byte in
the middle.

For reference, MD5 sum is 16 bytes (128-bit):
MD5 - Wikipedia

>
> Or to say it another way, suppose that Itanium mangling required as a
final
> step to replace the string with its md5 sum in hex. Therefore all symbol
> names are "small". My understanding is that this is effectively all your
> patch is doing.

The key type before the change is StringRef, while the after the key
type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
a string of 8 bytes or storing md5 has in text form which will double
the size?

How much does this change the benefit? If most of the benefit is avoiding
extraordinarily long mangled names then it may be sufficient.

With IR-level instrumentation like Rong is pursuing the size may be
reduced sufficiently that we do not need the optimization proposed in this
thread. For example, Rong found >2x size reduction

To clarify, I mean the size of the PGO data, not the binary size.
Column (2) in section 3.5 of Rong's RFC to be precise.

-- Sean Silva

>
>
>>
>> >
>> >
>> >>
>> >> >
>> >> > I think it is reasonable to simply replace the key we currently
>> >> > use
>> >> > with
>> >> > MD5(key) for getting a size reduction. In practice for my use
>> >> > cases,
>> >> > I
>> >> > have
>> >> > not observed any of the issues you mentioned under "Large size of
>> >> > overhead
>> >> > can limit the usability of PGO greatly", but I can understand that
>> >> > some
>> >> > of
>> >> > these issues could become problems in Google's use case. I would
>> >> > personally
>> >> > prefer to keep the existing behavior as the default (see below),
>> >> > and
>> >> > have
>> >> > MD5(key) as an option.
>> >>
>> >> The problem is that this requires profile format changes.
>> >
>> >
>> > Why? AFAIK the "function name" is just an arbitrary string. Using s
>> > or
>> > MD5(s) shouldn't matter. Of course, the user will need to pass
>> > consistent
>> > flags to clang.
>>
>> The raw format for 64bit target can be made 'unchanged', but not for
>> the 32bit raw format -- the nameptr field is only 32bit.
>>
>> The indexed format can not be made the same -- The ondisk profile
>> record layout totally changes. The key field changes from a blob of
>> chars into an 64bit integer.
>
>
> An MD5 sum cannot be represented as a blob of chars?

yes -- it is fixed length (8byte) blob which may include null byte in
the middle.

For reference, MD5 sum is 16 bytes (128-bit):
MD5 - Wikipedia

yes, LLVM's MD5 hash only takes the lower 64 bits.

>
> Or to say it another way, suppose that Itanium mangling required as a
> final
> step to replace the string with its md5 sum in hex. Therefore all symbol
> names are "small". My understanding is that this is effectively all your
> patch is doing.

The key type before the change is StringRef, while the after the key
type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
a string of 8 bytes or storing md5 has in text form which will double
the size?

How much does this change the benefit? If most of the benefit is avoiding
extraordinarily long mangled names then it may be sufficient.

With IR-level instrumentation like Rong is pursuing the size may be reduced
sufficiently that we do not need the optimization proposed in this thread.
For example, Rong found >2x size reduction on Google's C++ benchmarks, which
I assume are representative of the extremely large Google binaries that are
causing the problems addressed by your proposal in this thread. The
measurements you mention for Clang in this thread provide similar size
reductions, so Rong's approach may be sufficient (especially because
functions with extremely large mangled names tend to be small inline
functions in header-only template libraries).

Late instrumentation helps many cases. In some cases (as shown in
SPEC), the reduction in size is not as large. Reducing PGO overhead
will lower the bar for its adoption.

Of the points you mention in "Large size of overhead can limit the usability
of PGO greatly", many of the issues are hard limits that prevent the use of
PGO. Do you have a lower bound on how much the size of the PGO data must be
reduced in order to overcome the hard limits?

This is a static view: think about the situation where application
size is ever increasing; also think about situations where we want to
collect more types of profile data, or where users want to run PGO
binaries on small devices with tiny memory/storage.

Obviously LLVM must be able to support the extremely large binaries in your
configuration (otherwise what use is LLVM as a compiler :wink: My questions are
primarily aimed at establishing which tradeoffs are acceptable for
supporting this (both for LLVM and for you guys).

As I said, with the modified proposal (after getting your feedback),
no PGO user in LLVM land is going to lose any functionality. The
end result will be a net win for general users of LLVM (even though your
customers don't care about it), not just 'us' as you have mentioned
many times.

Btw, for us, the issue of PGO data size is not completely immaterial but is
very different from your use case. For us, the primary issue is the
additional memory use at run time, since PS4 games usually use "all"
available memory. We had a problem with UBSan where the large amount of
memory required for storing the UBSan diagnostic data at runtime required
the game programmers to manually change their memory map to make room.
+Filipe, do you remember how much memory UBSan was using that caused a
problem?

My proposal does help reducing rodata size significantly.

David

Hi Sean, Xinliang,

I don’t have the numbers for the UBSan problem we had (the binary was much bigger than clang/opt and made from many more files). But here’s how it changes in clang/opt (OS X):

-rwxr-xr-x 1 49M Aug 10 23:15 build-asserts/bin/clang-3.8
-rwxr-xr-x 1 164M Aug 10 21:15 build-debug/bin/clang-3.8
-rwxr-xr-x 1 370M Sep 5 10:24 build-ubsan/bin/clang-3.8
-rwxr-xr-x 1 44M Sep 4 20:43 build/bin/clang-3.8

-rwxr-xr-x 1 25M Aug 10 21:03 build-asserts/bin/opt

-rwxr-xr-x 1 67M Aug 10 21:16 build-debug/bin/opt
-rwxr-xr-x 1 156M Sep 5 10:24 build-ubsan/bin/opt
-rwxr-xr-x 1 22M Sep 4 20:45 build/bin/opt

build and build-ubsan are the same, except for -DLLVM_USE_SANITIZER=Undefined
build and build-asserts are the same, except for -DLLVM_ENABLE_ASSERTIONS=ON
build and build-debug are the same except for -DCMAKE_BUILD_TYPE=Debug (IIRC)

As you can see, a release build grows to more than 7.5x just by enabling UBSan! This is more than double the debug build!
Each check will add a few bytes (depending on the check; I think the median is more than 32B, but that's just recollection, I didn't go and double-check).
For every file which has checks, the full file path is in the binary as a string. (I want to, at least, make this (configurably) smaller, but will first check to make sure there's savings to be had.)

The key type before the change is StringRef, while the after the key
type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
a string of 8 bytes or storing md5 has in text form which will double
the size?

How much does this change the benefit? If most of the benefit is avoiding
extraordinarily long mangled names then it may be sufficient.

I implemented the same functionality using the approach suggested by you.

1) There is no format change required for the raw/indexed profile data
2) when a user option is specified, the md5 hash string is used as the
function key instead of the raw function name

In addition to that, the option specified is also passed to the
runtime and emitted into profile data. This allows different variants
of the profile data encoded in the same format to be automatically
recognized by the client (clang -fprofile-use, llvm-profdata etc)
without the need to specify the option again.
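A minimal sketch of the idea (using a stand-in hash rather than LLVM's
MD5 support, so the helper below is an assumption): the key written into
the profile is a short fixed-length hex string derived from the assembly
name, and the format itself is unchanged because the key is still just a
string:

#include <cstdint>
#include <cstdio>
#include <functional>
#include <string>

// Stand-in for the 64-bit MD5-based hash used by the patch.
uint64_t hashName(const std::string &AssemblyName) {
  return std::hash<std::string>{}(AssemblyName);
}

// The profile key: either the raw assembly name, or a 16-character hex
// string of the 64-bit hash, selected by a user-level option.
std::string makeProfileKey(const std::string &AssemblyName,
                           bool UseHashedKey) {
  if (!UseHashedKey)
    return AssemblyName;
  char Buf[17];
  std::snprintf(Buf, sizeof(Buf), "%016llx",
                (unsigned long long)hashName(AssemblyName));
  return std::string(Buf, 16);
}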

There is some size increase, but it seems to be in an acceptable range.
Using clang as the test case, the indexed profile size is now 33.6M
(compared with 29.5M, which is the optimal case). The raw profile size is
54.1M (compared with 40.1M from the previous patch).

David

Xinliang David Li <davidxl@google.com> writes:

The key type before the change is StringRef, while the after the key
type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
a string of 8 bytes or storing md5 has in text form which will double
the size?

How much does this change the benefit? If most of the benefit is avoiding
extraordinarily long mangled names then it may be sufficient.

I implemented the same functionality using the approach suggested by you.

1) There is no format change required for raw/indexed profile data
2) when a user option is specified, md5hash string is used as the
function key instead of of the raw function name

In addition to that, the option specified is also passed to the
runtime and emitted into profile data. This allows different variants
of the profile data encoded in the same format to be automatically
recognized by the client (clang -fprofile-use, llvm-profdata etc)
without the need to specify the option again.

There are some size increase, but increase seems to be in the range
that is acceptable. Using clang as the test case. The indexed profile
size is now 33.6 M (compared with 29.5M which is the optimal case).
The raw profile size is 54.1M (compared with 40.1M from previous
patch).

Have any of these patches you're referencing been posted at all? I came
to the thread late and it's a bit hard to follow which is which.

In any case, the idea sounds fine to me. Names in the profile can be
optional assuming we can safely differentiate the functions without them
(aside: on-by-default seems more user friendly than off-by-default, but
I don't have a strong opinion. Sean?).

I have a couple of high level thoughts on the approach:

Xinliang David Li via llvm-dev <llvm-dev@lists.llvm.org> writes:

Looking at various sources of size increase caused by instrumentation,
it is clear that the biggest contributor is the __llvm_prf_names
section. The current PGO implementation uses function assembly names
as the lookup keys in the indexed format so it needs to be emitted in
both rodata of the binary and in the raw/indexed profiles.

There's also the matter that we mangle the filename onto the symbol name
for non-ODR function names, which makes the names even larger than they
need to be. I imagine this isn't the biggest contributor (C++ mangled
names tend to be longer than filenames), but it is relatively easy to
fix independently of this change, by incorporating the filename into the
function's structural hash instead.
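A sketch of that alternative (illustrative only; the combiner and field
handling are assumptions, not the existing structural-hash code): instead
of prefixing the symbol name with the file name, fold the file name into
the secondary hash so local-linkage functions from different files still
end up with distinct keys:

#include <cstdint>
#include <functional>
#include <string>

// Stand-in hash combiner (the real code could use LLVM's hashing
// utilities instead).
uint64_t combineHash(uint64_t Seed, uint64_t Value) {
  return Seed ^ (Value + 0x9e3779b97f4a7c15ULL + (Seed << 6) + (Seed >> 2));
}

// For a function with local linkage, mix the defining file into the
// structural hash rather than into the name key.
uint64_t structuralHashWithFile(uint64_t StructuralHash,
                                const std::string &FileName,
                                bool HasLocalLinkage) {
  if (!HasLocalLinkage)
    return StructuralHash;
  return combineHash(StructuralHash, std::hash<std::string>{}(FileName));
}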

On the other hand, LLVM's indexed format also supports 'key collision'
-- it allows functions with the same key share the same entry in the
hash table. Function's structural hash values will be used as a
secondary key to find a real match.

The above feature allows us to use a different key other than function
assembly names and this can reduce binary/profile data size
significantly. Function name's MD5 hash is a good candidate, and I
have a patch (3 parts: runtime, reader/writer, clang) that does just
that. The results are very promising. With this change, the clang
instrumented binary size is now 216M (down from 280M); the raw profile
size is only 40.1M (a 2.85X size reduction); and the indexed profile
size is only 29.5M (a 2.2X size reduction).

It should be fairly unlikely, but we should probably consider hash
collisions briefly. If we hit a collision in function names that also
collides in the structural hash, there is no way to recover in this
scheme. So:

1. The effect is that we get bogus data in each of the affected
   functions (the sum of the counters in each set, essentially).

2. To hit this, we need two functions with basically the same control
   flow, that have names that collide. Functions having the same control
   flow is obviously valid and likely, so really we only care about MD5
   collisions on function names. Are C++ mangled names long enough that
   collisions are likely?

As long as we're comfortable that (1) isn't so bad and (2) is
sufficiently rare this should be fine. Otherwise, we should consider a
better hash - though that would probably mean changing from 64 bits to
128 and reducing the size less.

Regardless, we can probably take this to -commits and start discussing
the patches themselves. I don't think anyone's opposed to the general
idea, as long as you don't break the use case that Sean brought up.

Xinliang David Li <davidxl@google.com> writes:

The key type before the change is StringRef, while the after the key
type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
a string of 8 bytes or storing md5 has in text form which will double
the size?

How much does this change the benefit? If most of the benefit is avoiding
extraordinarily long mangled names then it may be sufficient.

I implemented the same functionality using the approach suggested by you.

1) There is no format change required for raw/indexed profile data
2) when a user option is specified, md5hash string is used as the
function key instead of of the raw function name

In addition to that, the option specified is also passed to the
runtime and emitted into profile data. This allows different variants
of the profile data encoded in the same format to be automatically
recognized by the client (clang -fprofile-use, llvm-profdata etc)
without the need to specify the option again.

There are some size increase, but increase seems to be in the range
that is acceptable. Using clang as the test case. The indexed profile
size is now 33.6 M (compared with 29.5M which is the optimal case).
The raw profile size is 54.1M (compared with 40.1M from previous
patch).

Have any of these patches you're referencing been posted at all? I came
to the thread late and it's a bit hard to follow which is which.

Not yet. I wanted to get the high level discussions done first. I will
post the patch soon.

In any case, the idea sounds fine to me. Names in the profile can be
optional assuming we can safely differentiate the functions without them
(aside: on-by-default seems more user friendly than off-by-default, but
I don't have a strong opinion. Sean?).

I have a couple of high level thoughts on the approach:

Xinliang David Li via llvm-dev <llvm-dev@lists.llvm.org> writes:

Looking at various sources of size increase caused by instrumentation,
it is clear that the biggest contributor is the __llvm_prf_names
section. The current PGO implementation uses function assembly names
as the lookup keys in the indexed format so it needs to be emitted in
both rodata of the binary and in the raw/indexed profiles.

There's also the matter that we mangle the filename onto the symbol name
for non-ODR function names, which makes the names even larger than they
need to be. I imagine this isn't the biggest contributor (C++ mangled
names tend to be longer than filenames), but it is relatively easy to
fix independently of this change, by incorporating the filename into the
function's structural hash instead.

The main file name is only added for functions with local linkage -- I
expect the size impact is small.

On the other hand, LLVM's indexed format also supports 'key collision'
-- it allows functions with the same key share the same entry in the
hash table. Function's structural hash values will be used as a
secondary key to find a real match.

The above feature allows us to use a different key other than function
assembly names and this can reduce binary/profile data size
significantly. Function name's MD5 hash is a good candidate, and I
have a patch (3 parts: runtime, reader/writer, clang) that does just
that. The results are very promising. With this change, the clang
instrumented binary size is now 216M (down from 280M); the raw profile
size is only 40.1M (a 2.85X size reduction); and the indexed profile
size is only 29.5M (a 2.2X size reduction).

It should be fairly unlikely, but we should probably consider hash
collisions briefly. If we hit a collision in function names that also
collides in the structural hash, there is no way to recover in this
scheme. So:

In an experiment done for the value profiler, the number of collisions
with the md5 sum (lower 64 bits) was zero with a set of 3 million symbol
names. The chance of two functions colliding in both the name hash and the
structural hash is indeed too low to be worried about.
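That matches a rough birthday-bound estimate (my own back-of-the-envelope
figure, not from the experiment itself): with n names hashed into d = 2^64
buckets, the probability of any collision is approximately

P(\text{collision}) \approx 1 - e^{-n(n-1)/(2d)} \approx \frac{n^2}{2^{65}},
\qquad n = 3 \times 10^6 \;\Rightarrow\; P \approx 2.4 \times 10^{-7}.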

1. The effect is that we get bogus data in each of the affected
   functions (the sum of the counters in each set, essentially).

yes -- I would classify this as a case only existing in theory though.

2. To hit this, we need two functions with basically the same control
   flow, that have names that collide. Functions having the same control
   flow is obviously valid and likely, so really we only care about MD5
   collisions on function names. Are C++ mangled names long enough that
   collisions are likely?

Our experiment shows that name collisions are unlikely. The structural
hash collision rate depends on the size of the function body (where
late instrumentation may help).

As long as we're comfortable that (1) isn't so bad and (2) is
sufficiently rare this should be fine. Otherwise, we should consider a
better hash - though that would probably mean changing from 64 bits to
128 and reducing the size less.

Actually, the current implementation is not collision free either. I
think if it becomes an issue (which I doubt) in the future, we can
even introduce an option to control the length of the md5 key size
(the benefit is that users can further shrink it if they desire to do so).

Regardless, we can probably take this to -commits and start discussing
the patches themselves. I don't think anyone's opposed to the general
idea, as long as you don't break the use case that Sean brought up.

Yes.

thanks,

David

>> The key type before the change is StringRef, while the after the key
>> type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
>> a string of 8 bytes or storing md5 has in text form which will double
>> the size?
>
>
> How much does this change the benefit? If most of the benefit is avoiding
> extraordinarily long mangled names then it may be sufficient.

I implemented the same functionality using the approach suggested by you.

1) There is no format change required for raw/indexed profile data
2) when a user option is specified, md5hash string is used as the
function key instead of of the raw function name

In addition to that, the option specified is also passed to the
runtime and emitted into profile data. This allows different variants
of the profile data encoded in the same format to be automatically
recognized by the client (clang -fprofile-use, llvm-profdata etc)
without the need to specify the option again.

Okay, this seems reasonable if it does not add a huge amount of complexity.

There are some size increase, but increase seems to be in the range
that is acceptable. Using clang as the test case. The indexed profile
size is now 33.6 M (compared with 29.5M which is the optimal case).
The raw profile size is 54.1M (compared with 40.1M from previous
patch).

Thanks for looking into this. If this is acceptable for you, then I think
this approach is simpler.

-- Sean Silva

>
>
>>
>> >
>> >
>> >>
>> >> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> > I think it is reasonable to simply replace the key we currently
>> >> >> > use
>> >> >> > with
>> >> >> > MD5(key) for getting a size reduction. In practice for my use
>> >> >> > cases,
>> >> >> > I
>> >> >> > have
>> >> >> > not observed any of the issues you mentioned under "Large size
of
>> >> >> > overhead
>> >> >> > can limit the usability of PGO greatly", but I can understand
that
>> >> >> > some
>> >> >> > of
>> >> >> > these issues could become problems in Google's use case. I would
>> >> >> > personally
>> >> >> > prefer to keep the existing behavior as the default (see below),
>> >> >> > and
>> >> >> > have
>> >> >> > MD5(key) as an option.
>> >> >>
>> >> >> The problem is that this requires profile format changes.
>> >> >
>> >> >
>> >> > Why? AFAIK the "function name" is just an arbitrary string. Using s
>> >> > or
>> >> > MD5(s) shouldn't matter. Of course, the user will need to pass
>> >> > consistent
>> >> > flags to clang.
>> >>
>> >> The raw format for 64bit target can be made 'unchanged', but not for
>> >> the 32bit raw format -- the nameptr field is only 32bit.
>> >>
>> >> The indexed format can not be made the same -- The ondisk profile
>> >> record layout totally changes. The key field changes from a blob of
>> >> chars into an 64bit integer.
>> >
>> >
>> > An MD5 sum cannot be represented as a blob of chars?
>>
>> yes -- it is fixed length (8byte) blob which may include null byte in
>> the middle.
>
>
> For reference, MD5 sum is 16 bytes (128-bit):
> MD5 - Wikipedia

yes, LLVM's MD5 hash only takes the lower 64bit.

>
>>
>>
>> >
>> > Or to say it another way, suppose that Itanium mangling required as a
>> > final
>> > step to replace the string with its md5 sum in hex. Therefore all
symbol
>> > names are "small". My understanding is that this is effectively all
your
>> > patch is doing.
>>
>> The key type before the change is StringRef, while the after the key
>> type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
>> a string of 8 bytes or storing md5 has in text form which will double
>> the size?
>
>
> How much does this change the benefit? If most of the benefit is avoiding
> extraordinarily long mangled names then it may be sufficient.
>
> With IR-level instrumentation like Rong is pursuing the size may be
reduced
> sufficiently that we do not need the optimization proposed in this
thread.
> For example, Rong found >2x size reduction on Google's C++ benchmarks,
which
> I assume are representative of the extremely large Google binaries that
are
> causing the problems addressed by your proposal in this thread. The
> measurements you mention for Clang in this thread provide similar size
> reductions, so Rong's approach may be sufficient (especially because
> functions with extremely large mangled names tend to be small inline
> functions in header-only template libraries).

Late instrumentation helps many cases. In some cases (as shown in
SPEC), the reduction in size is not as large. Reducing PGO overhead
will lower the bar for its adoption.

>
> Of the points you mention in "Large size of overhead can limit the
usability
> of PGO greatly", many of the issues are hard limits that prevent the use
of
> PGO. Do you have a lower bound on how much the size of the PGO data must
be
> reduced in order to overcome the hard limits?

This is a static view: Think about the situation where application
size is ever increasing; also think about situation where we want to

collect more types of profile data. Think about situation where user

want to run pgo binaries on small devices with tiny memory/storage ..

If we want to reduce memory overhead at runtime and reduce the size of the
raw profile data extracted from the target, there are clear solutions.
Consider that debug info does not need to be loaded into the memory image
of the target; why should information identifying each counter need to be?
A file containing raw profile counters is a subset of a core dump; in most
environments, a core dump does not need to have debug info or symbol names
in it, but can still be read in full detail in conjunction with the
original binary.

Thus, as we require that the binary be passed to llvm-profdata, there is no
fundamental reason that the memory image of the program, or the raw data
extracted from the program, must have any size overhead besides the raw
values of the counters themselves and any text size increase for
incrementing them. If we are willing to impose this requirement on users,
then as far as reducing memory overhead at runtime and reducing the size of
the raw profile data extracted from the target, using hashed function names
is clearly the wrong direction.

*Without* imposing the requirement of passing the binary to llvm-profdata,
I do like the ability to use hashed function names like you are proposing.
It is a simple solution for reducing size overhead of function name strings
with little complexity, as it is just swapping one string for another.

>
> Obviously LLVM must be able to support the extremely large binaries in
your
> configuration (otherwise what use is LLVM as a compiler :wink: My questions
are
> primarily aimed at establishing which tradeoffs are acceptable for
> supporting this (both for LLVM and for you guys).

As I said, with the modified proposal (after getting your feedback),
no PGO users in LLVM land is going to lose anything/functionality. The
end result will be net win for general users of LLVM (even though your
customers don't care about it), not just 'us' as you have mentioned
many times.

>
> Btw, for us, the issue of PGO data size is not completely immaterial but
is
> very different from your use case. For us, the primary issue is the
> additional memory use at run time, since PS4 games usually use "all"
> available memory. We had a problem with UBSan where the large amount of
> memory required for storing the UBSan diagnostic data at runtime required
> the game programmers to manually change their memory map to make room.
> +Filipe, do you remember how much memory UBSan was using that caused a
> problem?
>

My proposal does help reducing rodata size significantly.

Yes, that is why I think that this is a useful thing to do. I just want to
be careful about existing use cases and the relevant workflow issues.

-- Sean Silva

>
>
>>
>> >
>> >
>> >>
>> >> >
>> >> >
>> >> >>
>> >> >> >
>> >> >> > I think it is reasonable to simply replace the key we currently
>> >> >> > use
>> >> >> > with
>> >> >> > MD5(key) for getting a size reduction. In practice for my use
>> >> >> > cases,
>> >> >> > I
>> >> >> > have
>> >> >> > not observed any of the issues you mentioned under "Large size
of
>> >> >> > overhead
>> >> >> > can limit the usability of PGO greatly", but I can understand
that
>> >> >> > some
>> >> >> > of
>> >> >> > these issues could become problems in Google's use case. I
would
>> >> >> > personally
>> >> >> > prefer to keep the existing behavior as the default (see
below),
>> >> >> > and
>> >> >> > have
>> >> >> > MD5(key) as an option.
>> >> >>
>> >> >> The problem is that this requires profile format changes.
>> >> >
>> >> >
>> >> > Why? AFAIK the "function name" is just an arbitrary string. Using
s
>> >> > or
>> >> > MD5(s) shouldn't matter. Of course, the user will need to pass
>> >> > consistent
>> >> > flags to clang.
>> >>
>> >> The raw format for 64bit target can be made 'unchanged', but not for
>> >> the 32bit raw format -- the nameptr field is only 32bit.
>> >>
>> >> The indexed format can not be made the same -- The ondisk profile
>> >> record layout totally changes. The key field changes from a blob of
>> >> chars into an 64bit integer.
>> >
>> >
>> > An MD5 sum cannot be represented as a blob of chars?
>>
>> yes -- it is fixed length (8byte) blob which may include null byte in
>> the middle.
>
>
> For reference, MD5 sum is 16 bytes (128-bit):
> MD5 - Wikipedia

yes, LLVM's MD5 hash only takes the lower 64bit.

>
>>
>>
>> >
>> > Or to say it another way, suppose that Itanium mangling required as a
>> > final
>> > step to replace the string with its md5 sum in hex. Therefore all
symbol
>> > names are "small". My understanding is that this is effectively all
your
>> > patch is doing.
>>
>> The key type before the change is StringRef, while the after the key
>> type is uint64_t. Are you suggesting treating uint64_t md5 sum key as
>> a string of 8 bytes or storing md5 has in text form which will double
>> the size?
>
>
> How much does this change the benefit? If most of the benefit is
avoiding
> extraordinarily long mangled names then it may be sufficient.
>
> With IR-level instrumentation like Rong is pursuing the size may be
reduced
> sufficiently that we do not need the optimization proposed in this
thread.
> For example, Rong found >2x size reduction on Google's C++ benchmarks,
which
> I assume are representative of the extremely large Google binaries that
are
> causing the problems addressed by your proposal in this thread. The
> measurements you mention for Clang in this thread provide similar size
> reductions, so Rong's approach may be sufficient (especially because
> functions with extremely large mangled names tend to be small inline
> functions in header-only template libraries).

Late instrumentation helps many cases. In some cases (as shown in
SPEC), the reduction in size is not as large. Reducing PGO overhead
will lower the bar for its adoption.

>
> Of the points you mention in "Large size of overhead can limit the
usability
> of PGO greatly", many of the issues are hard limits that prevent the
use of
> PGO. Do you have a lower bound on how much the size of the PGO data
must be
> reduced in order to overcome the hard limits?

This is a static view: Think about the situation where application
size is ever increasing; also think about situation where we want to

collect more types of profile data. Think about situation where user

want to run pgo binaries on small devices with tiny memory/storage ..

If we want to reduce memory overhead at runtime and reduce the size of the
raw profile data extracted from the target, there are clear solutions.
Consider that debug info does not need to be loaded into the memory image
of the target; why should information identifying each counter need to be?
A file containing raw profile counters is a subset of a core dump; in most
environments, a core dump does not need to have debug info or symbol names
in it, but can be still be read in full detail in conjunction with the
original binary.

Thus, as we require

this should read "if we ant to require".

-- Sean Silva