IC profiling infrastructure

From: <betulb@codeaurora.org>
Date: Tue, Apr 7, 2015 at 12:44 PM
Subject: [LLVMdev] IC profiling infrastructure
To: llvmdev@cs.uiuc.edu

Hi All,

We had sent out an RFC in October on indirect call target profiling. The
proposal was about profiling target addresses seen at indirect call sites.
Using the profile data, we're seeing up to 8% performance improvements on
individual SPEC benchmarks where indirect call sites are present. We've
already started uploading our patches to Phabricator. I'm looking forward
to your reviews and comments on the code, and am ready to respond to your
design-related queries.

There were a few questions posted on the RFC that were not responded to.
Here are the much-delayed comments.

Hi Betul, thank you for your patience. I have completed an initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here: https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

1) Added dependencies: Our implementation adds a dependency on calloc/free,
as we're generating/maintaining a linked list at run time.

If it becomes a problem for some, there is a way to handle that -- but
at the cost of more memory (to be conservative). One of the good
features of using dynamic memory is that it allows counter arrays to be
allocated on the fly, which eliminates the need to allocate memory for
lots of cold/unexecuted functions.

We also added a dependency on mutexes, to prevent memory leaks in the
case of multiple threads trying to insert a new target address for the
same IC site into the linked list. To minimize the performance impact,
we only added mutexes around the pointer assignment and kept any dynamic
memory allocation/free operations outside of the mutex-guarded code.
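
To make that concrete, here is a minimal sketch of the insertion path as
described (illustrative names only, not the actual runtime code; it
assumes one list head per IC site and pthread mutexes):

#include <stdint.h>
#include <stdlib.h>
#include <pthread.h>

typedef struct ValueNode {
  uint64_t Target;            /* indirect call target address */
  uint64_t Count;             /* times this target was seen */
  struct ValueNode *Next;
} ValueNode;

static pthread_mutex_t ListLock = PTHREAD_MUTEX_INITIALIZER;

void recordTarget(ValueNode **SiteHead, uint64_t Target) {
  for (ValueNode *N = *SiteHead; N; N = N->Next)
    if (N->Target == Target) { N->Count++; return; } /* racy ++ tolerated */

  /* calloc happens outside the mutex-guarded region */
  ValueNode *New = (ValueNode *)calloc(1, sizeof(ValueNode));
  if (!New) return;
  New->Target = Target;
  New->Count = 1;

  /* only the pointer assignment is guarded, so no node can be lost
     (leaked) when two threads insert at the same site concurrently */
  pthread_mutex_lock(&ListLock);
  New->Next = *SiteHead;
  *SiteHead = New;
  pthread_mutex_unlock(&ListLock);
}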

This (using mutexes) can and should be avoided -- see the above report.

2) Indirect call data being present in sampling profile output: This is
unfortunately not helping in our case due to perf depending on LBR
support. To our knowledge, LBR support is not present on ARM platforms.

yes.

3) Losing profiling support on targets not supporting malloc/mutexes: The
added dependency on calloc/free/mutexes may perhaps be eliminated
(although our current solution does not handle this) by having a
separate runtime library for value profiling purposes. Instrumentation
can link in two runtime libraries when value profiling (an instance of it
being indirect call target profiling) is enabled on the command line.

See above.

4) Performance of the instrumented code: Instrumentation with the IC
profiling patches resulted in 7% degradation across SPEC benchmarks at
-O2. For the benchmarks that did not have any IC sites, no performance
degradation was observed. This data was gathered using the ref data set
for SPEC.

I'd like the runtime part of the change to be shared and used as a
general-purpose value profiler (not just for indirect call promotion),
but this can be done as a follow-up.

I will start with some reviews. Hopefully others will help with reviews too.

thanks,

David

Xinliang David Li <davidxl@google.com> writes:

From: <betulb@codeaurora.org>
Date: Tue, Apr 7, 2015 at 12:44 PM
Subject: [LLVMdev] IC profiling infrastructure
To: llvmdev@cs.uiuc.edu

Hi All,

We had sent out an RFC in October on indirect call target profiling. The
proposal was about profiling target addresses seen at indirect call sites.
Using the profile data, we're seeing up to 8% performance improvements on
individual SPEC benchmarks where indirect call sites are present. We've
already started uploading our patches to Phabricator. I'm looking forward
to your reviews and comments on the code, and am ready to respond to your
design-related queries.

There were a few questions posted on the RFC that were not responded to.
Here are the much-delayed comments.

Hi Betul, thank you for your patience. I have completed an initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here:
https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

Thanks for looking at this, David.

Betul: I also have some thoughts on the approach and implementation of
this, but haven't had a chance to go over it in detail. I hope to have
some feedback for you on all of this sometime next week, and I'll start
reviewing the individual patches after that.

From: <betulb@codeaurora.org>
Date: Tue, Apr 7, 2015 at 12:44 PM
Subject: [LLVMdev] IC profiling infrastructure
To: llvmdev@cs.uiuc.edu

Hi All,

We had sent out an RFC in October on indirect call target profiling. The
proposal was about profiling target addresses seen at indirect call sites.
Using the profile data, we're seeing up to 8% performance improvements on
individual SPEC benchmarks where indirect call sites are present. We've
already started uploading our patches to Phabricator. I'm looking forward
to your reviews and comments on the code, and am ready to respond to your
design-related queries.

There were a few questions posted on the RFC that were not responded to.
Here are the much-delayed comments.

Hi Betul, thank you for your patience. I have completed an initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here:
https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

Hi David,

Thanks for the detailed report and for working on this. We really
appreciate the feedback. We're looking forward to the comments and to
upstreaming the changes.

1) Added dependencies: Our implementation adds a dependency on calloc/free,
as we're generating/maintaining a linked list at run time.

If it becomes a problem for some, there is a way to handle that -- but
at the cost of more memory (to be conservative). One of the good
features of using dynamic memory is that it allows counter arrays to be
allocated on the fly, which eliminates the need to allocate memory for
lots of cold/unexecuted functions.

We also added a dependency on mutexes, to prevent memory leaks in the
case of multiple threads trying to insert a new target address for the
same IC site into the linked list. To minimize the performance impact,
we only added mutexes around the pointer assignment and kept any dynamic
memory allocation/free operations outside of the mutex-guarded code.

This (using mutexes) can and should be avoided -- see the above report.

I did read your report carefully. You suggested using atomic linked-list
link updates to avoid mutexes. We have a runtime written in C, so I was
not sure whether introducing C++11 features like std::atomic was OK or
not. Also, some operations can be performed atomically on x86 platforms
(depending on data being aligned at various bit-length/cache-line
boundaries), but ARM or other platforms would not support that.

2) Indirect call data being present in sampling profile output: This is
unfortunately not helping in our case due to perf depending on LBR
support. To our knowledge, LBR support is not present on ARM platforms.

yes.

3) Losing profiling support on targets not supporting malloc/mutexes: The
added dependency on calloc/free/mutexes may perhaps be eliminated
(although our current solution does not handle this) by having a
separate runtime library for value profiling purposes. Instrumentation
can link in two runtime libraries when value profiling (an instance of it
being indirect call target profiling) is enabled on the command line.

See above.

4) Performance of the instrumented code: Instrumentation with the IC
profiling patches resulted in 7% degradation across SPEC benchmarks at
-O2. For the benchmarks that did not have any IC sites, no performance
degradation was observed. This data was gathered using the ref data set
for SPEC.

I'd like the runtime part of the change to be shared and used as a
general-purpose value profiler (not just for indirect call promotion),
but this can be done as a follow-up.

My understanding of your analysis was that it only covered the runtime
library performance and did not really look into whether instrumentation
is enabled at the right sites.

I will start with some reviews. Hopefully others will help with reviews
too.

Thanks very much. We'll be responding to the reviews diligently.

From: <betulb@codeaurora.org>
Date: Tue, Apr 7, 2015 at 12:44 PM
Subject: [LLVMdev] IC profiling infrastructure
To: llvmdev@cs.uiuc.edu

Hi All,

We had sent out an RFC in October on indirect call target profiling. The
proposal was about profiling target addresses seen at indirect call sites.
Using the profile data, we're seeing up to 8% performance improvements on
individual SPEC benchmarks where indirect call sites are present. We've
already started uploading our patches to Phabricator. I'm looking forward
to your reviews and comments on the code, and am ready to respond to your
design-related queries.

There were a few questions posted on the RFC that were not responded to.
Here are the much-delayed comments.

Hi Betul, thank you for your patience. I have completed an initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here:
https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

Hi David,

Thanks for the detailed report and for working on this. We really
appreciate the feedback. We're looking forward to the comments and to
upstreaming the changes.

1) Added dependencies: Our implementation adds a dependency on calloc/free,
as we're generating/maintaining a linked list at run time.

If it becomes a problem for some, there is a way to handle that -- but
at the cost of more memory (to be conservative). One of the good
features of using dynamic memory is that it allows counter arrays to be
allocated on the fly, which eliminates the need to allocate memory for
lots of cold/unexecuted functions.

We also added a dependency on mutexes, to prevent memory leaks in the
case of multiple threads trying to insert a new target address for the
same IC site into the linked list. To minimize the performance impact,
we only added mutexes around the pointer assignment and kept any dynamic
memory allocation/free operations outside of the mutex-guarded code.

This (using mutexes) can and should be avoided -- see the above report.

I did read your report carefully. You suggested using atomic linked-list
link updates to avoid mutexes. We have a runtime written in C, so I was
not sure whether introducing C++11 features like std::atomic was OK or
not. Also, some operations can be performed atomically on x86 platforms
(depending on data being aligned at various bit-length/cache-line
boundaries), but ARM or other platforms would not support that.

The suggestion is to use the atomic builtins -- see the review comments.
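
For reference, a sketch of the same insertion using the GCC/Clang __sync
atomic builtins, which are usable from plain C (illustrative names, not
the actual patch; ValueNode is the same node type as in the earlier
sketch):

#include <stdint.h>
#include <stdlib.h>

typedef struct ValueNode {
  uint64_t Target;
  uint64_t Count;
  struct ValueNode *Next;
} ValueNode;

void recordTargetLockFree(ValueNode **SiteHead, uint64_t Target) {
  for (ValueNode *N = *SiteHead; N; N = N->Next)
    if (N->Target == Target) { N->Count++; return; }

  ValueNode *New = (ValueNode *)calloc(1, sizeof(ValueNode));
  if (!New) return;
  New->Target = Target;
  New->Count = 1;

  do {
    New->Next = *SiteHead;
    /* retry if another thread updated the head concurrently; the new
       node is never lost, so nothing leaks and no mutex is needed */
  } while (!__sync_bool_compare_and_swap(SiteHead, New->Next, New));
}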

2) Indirect call data being present in sampling profile output: This is
unfortunately not helping in our case due to perf depending on LBR
support. To our knowledge, LBR support is not present on ARM platforms.

yes.

3) Losing profiling support on targets not supporting malloc/mutexes: The
added dependency on calloc/free/mutexes may perhaps be eliminated
(although our current solution does not handle this) by having a
separate runtime library for value profiling purposes. Instrumentation
can link in two runtime libraries when value profiling (an instance of it
being indirect call target profiling) is enabled on the command line.

See above.

4) Performance of the instrumented code: Instrumentation with the IC
profiling patches resulted in 7% degradation across SPEC benchmarks at
-O2. For the benchmarks that did not have any IC sites, no performance
degradation was observed. This data was gathered using the ref data set
for SPEC.

I'd like the runtime part of the change to be shared and used as a
general-purpose value profiler (not just for indirect call promotion),
but this can be done as a follow-up.

My understanding of your analysis was that it only covered the runtime
library performance and did not really look into whether instrumentation
is enabled at the right sites.

It was mainly focused on the runtime library performance.

I will start with some reviews. Hopefully others will help with reviews
too.

I looked through one patch and sent the comments.

David

Xinliang David Li <davidxl@google.com> writes:

From: <betulb@codeaurora.org>
Date: Tue, Apr 7, 2015 at 12:44 PM
Subject: [LLVMdev] IC profiling infrastructure
To: llvmdev@cs.uiuc.edu

Hi All,

We had sent out an RFC in October on indirect call target profiling. The
proposal was about profiling target addresses seen at indirect call sites.
Using the profile data, we're seeing up to 8% performance improvements on
individual SPEC benchmarks where indirect call sites are present. We've
already started uploading our patches to Phabricator. I'm looking forward
to your reviews and comments on the code, and am ready to respond to your
design-related queries.

There were a few questions posted on the RFC that were not responded to.
Here are the much-delayed comments.

Hi Betul, thank you for your patience. I have completed an initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here:
https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub

Thanks for looking at this, David.

Betul: I also have some thoughts on the approach and implementation of
this, but haven't had a chance to go over it in detail. I hope to have
some feedback for you on all of this sometime next week, and I'll start
reviewing the individual patches after that.

Hi All,

I posted three more patches yesterday. They might be missing some
cosmetic fixes, but support for profiling multiple value kinds has been
added to the readers, writers, and runtime. I'd appreciate your comments
on the CLs.

Thanks,
-Betul

I have sent my review comments. I think most of my high-level concerns
have been addressed (except for the last few minor fix-ups).

Justin, do you have a chance to take a look?

thanks,

David

Hi Betul,

I've finally gotten around to going over this in detail - sorry for the
delay, and thanks for working on this.

I think that the general approach is a good one, and that this will end
up working well. I have a couple of points on a high level, and I'll
also send some review for the patches you've sent out.

- The raw profile format does not need to be backwards compatible,
  that's only important for the indexed one. Instead of adding
  RawInstrValueProfReader, just change RawInstrProfReader and reject the
  data if the version is 1. Similarly, don't change the raw profile
  magic - just bump the version to 2.

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

- Given that, the instrprof libraries needn't know about kinds at all,
  that can be dealt with locally in clang's CodeGenPGO, where we need to
  know what to do with the data. I'd just drop the kind completely for
  now. It'll be easy to add later without affecting much.

- We should be able to statically allocate the first entry for each
  callsite for __llvm_profile_value_data. It will always be there, and
  for call sites that always call the same target it's nice to avoid the
  allocation.

- The changes in the llvm and compiler-rt repos need their own tests.

- Please use clang-format on your changes. Several of the patches have
  unusual or non-llvm-style whitespace/wrapping/etc.

Finally, the LLVM patch is quite large. We can probably make it a bit
easier to stage and review by splitting out a couple of things and doing
them first:

1. Changing the __llvm_profile_data variables so they're no longer
   const.

2. Adding the InstrProfRecord type and updating the unittests. Also,
   this type should just go in the InstrProf.h header, no need for its
   own.

3. Arguably, we could update the raw format and implement the
   RawInstrProfReader changes (but ignoring the new data) first. If you
   think this is worthwhile go ahead, but if it'll lead to too much
   extra busywork it probably isn't worth it.

I'm sending some review on the patches themselves as well, but I expect
those to mostly be on the mechanical aspects since the high level points
are already written down here.

Xinliang David Li <davidxl@google.com> writes:

Hi Betul,

I've finally gotten around to going over this in detail - sorry for the
delay, and thanks for working on this.

I think that the general approach is a good one, and that this will end
up working well. I have a couple of points on a high level, and I'll
also send some review for the patches you've sent out.

- The raw profile format does not need to be backwards compatible,
  that's only important for the indexed one. Instead of adding
  RawInstrValueProfReader, just change RawInstrProfReader and reject the
  data if the version is 1. Similarly, don't change the raw profile
  magic - just bump the version to 2.

I second this.

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I was going to suggest this and there is a better way:

Note that:

const IntPtrT Values[instr_value_prof_kind::last]; // NULL, used at runtime

This field can actually be used to point to the start of the value
entries for each kind.
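
In other words, something like this (a hypothetical sketch of the
layout, not the actual patch):

enum { IPVK_IndirectCallTarget = 0, IPVK_Last = 1 };

typedef struct ValueNode ValueNode;

typedef struct {
  /* ... hash, counter array pointer, etc. ... */
  /* NULL initially; at runtime each slot points to the head of the
     value entry list for one kind, so the kinds never mix */
  ValueNode *Values[IPVK_Last];
} ProfData;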

On the other hand, storing the value kind does not waste any space -- it
is a 32-bit integer (paired with counterIdx, another 32-bit integer).

- Given that, the instrprof libraries needn't know about kinds at all,
  that can be dealt with locally in clang's CodeGenPGO, where we need to
  know what to do with the data. I'd just drop the kind completely for
  now. It'll be easy to add later without affecting much.

This is doable, but at the cost of some flexibility in the future. For
instance, for different value_profile_kinds the on-disk format may be
different -- e.g., in some cases we may only care about storing the
'average value' per site to reduce storage size.

- We should be able to statically allocate the first entry for each
  callsite for __llvm_profile_value_data. It will always be there, and
  for call sites that always call the same target it's nice to avoid the
  allocation.

Since we need dynamic allocation anyway, I don't see much benefit in
doing static allocation here. In my experience with large apps, a big
percentage (say >50%) of functions are really cold in the training run
and will never be exercised/called. Dynamic allocation has the
advantage of minimizing memory consumption.

thanks,

David

Xinliang David Li <davidxl@google.com> writes:

Hi Betul,

I've finally gotten around to going over this in detail - sorry for the
delay, and thanks for working on this.

I think that the general approach is a good one, and that this will end
up working well. I have a couple of points on a high level, and I'll
also send some review for the patches you've sent out.

- The raw profile format does not need to be backwards compatible,
  that's only important for the indexed one. Instead of adding
  RawInstrValueProfReader, just change RawInstrProfReader and reject the
  data if the version is 1. Similarly, don't change the raw profile
  magic - just bump the version to 2.

I second this.

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I was going to suggest this and there is a better way:

Note that:

const IntPtrT Values[instr_value_prof_kind::last]; // NULL, used at runtime

This field can actually be used to point to the start of the value
entries for each kind.

This array of pointers adds complexity all over each of the patches in
review. If it isn't necessary I'd much rather avoid it.

If we do keep the kind, I'll have further comments on how to reduce that
complexity (other than in the raw format), but I really don't think
having the kind here is worthwhile in its current form.

On the other hand, storing the value kind does not waste any space -- it
is a 32-bit integer (paired with counterIdx, another 32-bit integer).

- Given that, the instrprof libraries needn't know about kinds at all,
  that can be dealt with locally in clang's CodeGenPGO, where we need to
  know what to do with the data. I'd just drop the kind completely for
  now. It'll be easy to add later without affecting much.

This is doable, but at the cost of some flexibility in the future. For
instance, for different value_profile_kinds the on-disk format may be
different -- e.g., in some cases we may only care about storing the
'average value' per site to reduce storage size.

As implemented, all of the profile kinds need to be identical - to
support kinds that have a different format we need to change the formats
anyway. Adding the complexity now when we'll have to change it later
*anyway* isn't worth it.

- We should be able to statically allocate the first entry for each
  callsite for __llvm_profile_value_data. It will always be there, and
  for call sites that always call the same target it's nice to avoid the
  allocation.

Since we need dynamic allocation anyway, I don't see much benefit in
doing static allocation here. In my experience with large apps, a big
percentage (say >50%) of functions are really cold in the training run
and will never be exercised/called. Dynamic allocation has the
advantage of minimizing memory consumption.

If >50% of the "functions that contain indirect calls" are never called,
then I agree that the dynamic allocation is probably a better trade-off.
Note that this is functions that contain these calls, not "call sites",
given the way we're dynamically allocating here.

I don't have any data, but that seems superficially unlikely to me - I
would expect training runs to have somewhat reasonable code coverage to
be useful.

OTOH, statically allocating one node assumes that indirect call sites
that resolve to exactly one destination are reasonably common. Since
allocating several times isn't much more expensive than allocating once,
there isn't much benefit in avoiding the first allocation if the
"exactly one destination" case is unlikely.

The trade-off isn't exactly clear-cut, but elsewhere in the raw profile
format we've gone in the "low execution time overhead" direction, which
I think is the right choice unless the memory cost is particularly bad.
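
For what it's worth, the statically allocated first entry would look
roughly like this (an illustrative, single-threaded sketch; the
instrumentation would emit the static nodes per module):

#include <stdint.h>
#include <stdlib.h>

typedef struct ValueNode {
  uint64_t Target;
  uint64_t Count;
  struct ValueNode *Next;
} ValueNode;

enum { NUM_IC_SITES = 128 };           /* illustrative size */
static ValueNode FirstEntry[NUM_IC_SITES];

void recordTargetStaticFirst(uint32_t Site, uint64_t Target) {
  ValueNode *N = &FirstEntry[Site];
  if (N->Count == 0) {                 /* first target seen: no allocation */
    N->Target = Target;
    N->Count = 1;
    return;
  }
  for (;; N = N->Next) {               /* walk the list; append on miss */
    if (N->Target == Target) { N->Count++; return; }
    if (!N->Next) break;
  }
  ValueNode *New = (ValueNode *)calloc(1, sizeof(ValueNode));
  if (!New) return;
  New->Target = Target;
  New->Count = 1;
  N->Next = New;                       /* only multi-target sites allocate */
}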

Xinliang David Li <davidxl@google.com> writes:

Hi Betul,

I've finally gotten around to going over this in detail - sorry for the
delay, and thanks for working on this.

I think that the general approach is a good one, and that this will end
up working well. I have a couple of points on a high level, and I'll
also send some review for the patches you've sent out.

- The raw profile format does not need to be backwards compatible,
  that's only important for the indexed one. Instead of adding
  RawInstrValueProfReader, just change RawInstrProfReader and reject the
  data if the version is 1. Similarly, don't change the raw profile
  magic - just bump the version to 2.

I second this.

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I was going to suggest this and there is a better way:

Note that:

const IntPtrT Values[instr_value_prof_kind::last]; // NULL, used at runtime

This field can actually be used to point to the start of the value
entries for each kind.

This array of pointers adds complexity all over each of the patches in
review. If it isn't necessary I'd much rather avoid it.

If we do keep the kind, I'll have further comments on how to reduce that
complexity (other than in the raw format), but I really don't think
having the kind here is worthwhile in its current form.

On the other hand, storing the value kind does not waste any space -- it
is a 32-bit integer (paired with counterIdx, another 32-bit integer).

- Given that, the instrprof libraries needn't know about kinds at all,
  that can be dealt with locally in clang's CodeGenPGO, where we need to
  know what to do with the data. I'd just drop the kind completely for
  now. It'll be easy to add later without affecting much.

This is doable, but at the cost of some flexibility in the future. For
instance, for different value_profile_kinds the on-disk format may be
different -- e.g., in some cases we may only care about storing the
'average value' per site to reduce storage size.

As implemented, all of the profile kinds need to be identical - to
support kinds that have a different format we need to change the formats
anyway. Adding the complexity now when we'll have to change it later
*anyway* isn't worth it.

Keeping the number of value sites per value profile kind can be a useful
sanity check, but simplicity may be more attractive -- so I am fine with
dropping it for now.

- We should be able to statically allocate the first entry for each
  callsite for __llvm_profile_value_data. It will always be there, and
  for call sites that always call the same target it's nice to avoid the
  allocation.

Since we need dynamic allocation anyway, I don't see much benefit in
doing static allocation here. In my experience with large apps, a big
percentage (say >50%) of functions are really cold in the training run
and will never be exercised/called. Dynamic allocation has the
advantage of minimizing memory consumption.

If >50% of the "functions that contain indirect calls" are never called,
then I agree that the dynamic allocation is probably a better trade-off.
Note that this is functions that contain these calls, not "call sites",
given the way we're dynamically allocating here.

I don't have any data, but that seems superficially unlikely to me - I
would expect training runs to have somewhat reasonable code coverage to
be useful.

Actually, IIRC it is worse than that for very large apps. Those apps
link in lots of modules due to complicated dependencies.

Another reason is that we don't really care much about the cold
functions at all for optimization purposes (they should be optimized for
size, of course). One of the techniques to reduce instrumentation
overhead is to do sampling. With sampling turned on (we don't have it
now, but we may in the future), a lot of cold functions will be 'pruned'
and won't have any profile data produced.

OTOH, statically allocating one node assumes that indirect call sites
that resolve to exactly one destination are reasonably common. Since
allocating several times isn't much more expensive than allocating once,
there isn't much benefit in avoiding the first allocation if the
"exactly one destination" case is unlikely.

Very often, there are actually very few very hot indirect callsites.
For such callsites, if the number of targets is small, the allocation
cost per target will be very small. If there are lots of targets, saving
the allocation for one target does not really help either.

The trade-off isn't exactly clear-cut, but elsewhere in the raw profile
format we've gone in the "low execution time overhead" direction, which
I think is the right choice unless the memory cost is particularly bad.

I think reducing execution overhead is the right goal to shoot for, but
reducing memory overhead does not necessarily mean increasing runtime.
It can help the other way, e.g., via reduced page faults or DTLB misses.
On the other hand, reducing memory footprint can be critical in many
cases -- e.g., small-memory devices not configured with swap.

LLVM's PGO instrumentation already has gigantic memory usage -- but
that is another thread for discussion.

thanks,

David

Hi Betul,

I've finally gotten around to going over this in detail - sorry for the
delay, and thanks for working on this.

I think that the general approach is a good one, and that this will end
up working well. I have a couple of points on a high level, and I'll
also send some review for the patches you've sent out.

- The raw profile format does not need to be backwards compatible,
  that's only important for the indexed one. Instead of adding
  RawInstrValueProfReader, just change RawInstrProfReader and reject the
  data if the version is 1. Similarly, don't change the raw profile
  magic - just bump the version to 2.

Thanks. Not remaining backwards compatible in the RawInstrProfReader
removed much of the added complexity from the implementation. I'll
upload my changes to Phabricator later today.

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I think so too. However, value_kind should stay around until the kinds
of expressions whose values are computed are included in the hash
computation. Otherwise, if a value-profiled expression is removed from
the source, then all the rest of the values would get attached to MD
fields of the wrong types. Currently, hash checking verifies whether the
counter-assigned regions in the source have changed across profile
consumption runs.

- Given that, the instrprof libraries needn't know about kinds at all,
  that can be dealt with locally in clang's CodeGenPGO, where we need to
  know what to do with the data. I'd just drop the kind completely for
  now. It'll be easy to add later without affecting much.

I have not reflected this request in the latest changes that I'll upload
later today. If we can come up with a mechanism to make sure source
changes to profiled values can be detected at the function level, then I
may bring this change into my CLs.

- We should be able to statically allocate the first entry for each
  callsite for __llvm_profile_value_data. It will always be there, and
  for call sites that always call the same target it's nice to avoid the
  allocation.

I'm not in favor of this, as dynamic allocation seems to be doing fine.

- The changes in the llvm and compiler-rt repos need their own tests.

I'll add them in.

- Please use clang-format on your changes. Several of the patches have
  unusual or non-llvm-style whitespace/wrapping/etc.

I'll do this as well.

Finally, the LLVM patch is quite large. We can probably make it a bit
easier to stage and review by splitting out a couple of things and doing
them first:

1. Changing the __llvm_profile_data variables so they're no longer
   const.

2. Adding the InstrProfRecord type and updating the unittests. Also,
   this type should just go in the InstrProf.h header, no need for its
   own.

3. Arguably, we could update the raw format and implement the
   RawInstrProfReader changes (but ignoring the new data) first. If you
   think this is worthwhile go ahead, but if it'll lead to too much
   extra busywork it probably isn't worth it.

I'm sending some review on the patches themselves as well, but I expect
those to mostly be on the mechanical aspects since the high level points
are already written down here.

I'll first update the current CLs, responding to the review comments, and
then start uploading the CLs in the recommended order above.

-Betul

> Hi Betul,
>
> I've finally gotten around to going over this in detail - sorry for the
> delay, and thanks for working on this.
>
> I think that the general approach is a good one, and that this will end
> up working well. I have a couple of points on a high level, and I'll
> also send some review for the patches you've sent out.
>
> - The raw profile format does not need to be backwards compatible,
> that's only important for the indexed one. Instead of adding
> RawInstrValueProfReader, just change RawInstrProfReader and reject the
> data if the version is 1. Similarly, don't change the raw profile
> magic - just bump the version to 2.

Thanks. Not remaining backwards compatible in the RawInstrProfReader
removed much of the added complexity from the implementation. I'll
upload my changes to Phabricator later today.

> - We don't need to store the value profiling kind in the data at all.
> The frontend knows which invocations of the intrinsic are for each kind
> implicitly, much like it tells a counter for an "if" and one for a
> "for" apart implicitly. Just store one set of profiling data
> and let the frontend sort it out.

I think so too. However, value_kind should stay around until the kinds
of expressions whose values are computed are included in the hash
computation. Otherwise, if a value-profiled expression is removed from
the source, then all the rest of the values would get attached to MD
fields of the wrong types. Currently, hash checking verifies whether the
counter-assigned regions in the source have changed across profile
consumption runs.

Yes -- separating counter arrays for different value kinds is still very
useful; for instance, profile-generate and profile-use compilations can
turn on different sets of value profile options and still work.

However, by organizing the in-memory and persistent profile data
structures properly, we don't need to record the value kind or counter
index information in the value profile entries. See my most recent
review comments.

David

"Betul Buyukkurt" <betulb@codeaurora.org> writes:

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I think so too. However, value_kind should stay around until the kinds
of expressions whose values are computed are included in the hash
computation. Otherwise, if a value-profiled expression is removed from
the source, then all the rest of the values would get attached to MD
fields of the wrong types. Currently, hash checking verifies whether the
counter-assigned regions in the source have changed across profile
consumption runs.

This isn't a good reason to keep value_kind - the function hash not
reflecting these kinds of changes is a problem regardless of whether or
not we have multiple types. We should just add hash computations for
these indirect call counters. As long as we only add this when indirect
profiling is turned on, it won't invalidate old hashes, so it's trivial
to stay backwards compatible.
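
A sketch of that (hypothetical, not the actual CodeGenPGO code): fold
the value sites into the function hash only when the feature is on, so
existing hashes stay valid:

#include <stdint.h>

static uint64_t hashCombine(uint64_t H, uint64_t V) {
  /* simple 64-bit mix for illustration; the real code would extend the
     existing PGO function-hash scheme */
  return (H ^ V) * 0x100000001b3ULL;
}

uint64_t addICSitesToHash(uint64_t FuncHash, uint32_t NumICSites,
                          int ICProfilingEnabled) {
  if (!ICProfilingEnabled)
    return FuncHash;                   /* old hashes are untouched */
  return hashCombine(FuncHash, NumICSites);
}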

The other thing we should do is store which profiling options are
enabled, in both formats. When we specify profile-instr-use it'll be
less error prone if we can detect whether or not this includes indirect
call profiling without checking other options, and it's probably a good
idea for "llvm-profdata merge" to disallow merging profiles that are
gathering different sets of data. A bitfield seems suitable for this.
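
Something like the following header field, say (names and layout are
illustrative only):

#include <stdint.h>

enum ProfileFlagBits {
  PROFILE_FLAG_COUNTERS   = 1u << 0,
  PROFILE_FLAG_IC_TARGETS = 1u << 1,  /* indirect call target profiling */
};

typedef struct {
  uint64_t Magic;
  uint64_t Version;
  uint64_t Flags;                     /* bitwise OR of ProfileFlagBits */
  /* ... remaining raw-header fields ... */
} RawHeader;

/* llvm-profdata merge could then reject mismatched inputs: */
int flagsCompatible(const RawHeader *A, const RawHeader *B) {
  return A->Flags == B->Flags;
}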

"Betul Buyukkurt" <betulb@codeaurora.org> writes:

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I think so too. However, value_kind should stay around until the kinds
of expressions whose values are computed are included in the hash
computation. Otherwise, if a value-profiled expression is removed from
the source, then all the rest of the values would get attached to MD
fields of the wrong types. Currently, hash checking verifies whether the
counter-assigned regions in the source have changed across profile
consumption runs.

This isn't a good reason to keep value_kind - the function hash not
reflecting these kinds of changes is a problem regardless of whether or
not we have multiple types. We should just add hash computations for
these indirect call counters. As long as we only add this when indirect
profiling is turned on, it won't invalidate old hashes, so it's trivial
to stay backwards compatible.

I think there are other, better reasons why keeping the value kind is
desirable. Different kinds of value profile data need different
'preprocessing' during reading. For instance, indirect call targets need
to be translated from target addresses into function names. For length
data, we may want to compute the average length; for data addresses, we
may want to compute the min-alignment for each site. It will be hard to
do so without value kind info.

The other thing we should do is store which profiling options are
enabled, in both formats. When we specify profile-instr-use it'll be
less error prone if we can detect whether or not this includes indirect
call profiling without checking other options, and it's probably a good
idea for "llvm-profdata merge" to disallow merging profiles that are
gathering different sets of data. A bitfield seems suitable for this.

For value profiling, allowing the profile-gen and profile-use passes to
use different options is a useful feature. Consider the following
scenarios:
1) Collect the profile data once, with all kinds of value profile data
collected. The exact same profile data can be used in performance
experiments with different kinds of value profiling enabled/disabled.
2) Work around a profile-use/value transformation bug selectively for
some file without the need to change anything in the instrumentation
pass.

Besides, with the latest patch, the value_kind is not recorded in each
profile value, thus the overhead is minimized.

thanks,

David

Xinliang David Li <davidxl@google.com> writes:

"Betul Buyukkurt" <betulb@codeaurora.org> writes:

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I think so too. However, value_kind should stay around until the kinds
of expressions whose values are computed are included in the hash
computation. Otherwise, if a value-profiled expression is removed from
the source, then all the rest of the values would get attached to MD
fields of the wrong types. Currently, hash checking verifies whether the
counter-assigned regions in the source have changed across profile
consumption runs.

This isn't a good reason to keep value_kind - the function hash not
reflecting these kinds of changes is a problem regardless of whether or
not we have multiple types. We should just add hash computations for
these indirect call counters. As long as we only add this when indirect
profiling is turned on, it won't invalidate old hashes, so it's trivial
to stay backwards compatible.

I think there are other, better reasons why keeping the value kind is
desirable. Different kinds of value profile data need different
'preprocessing' during reading. For instance, indirect call targets need
to be translated from target addresses into function names. For length
data, we may want to compute the average length; for data addresses, we
may want to compute the min-alignment for each site. It will be hard to
do so without value kind info.

But since the consumer is the frontend, and it knows which counts are
which, it knows the kind, no? I don't understand how storing the kind
info helps or hurts - it's just redundant.

The other thing we should do is store which profiling options are
enabled, in both formats. When we specify profile-instr-use it'll be
less error prone if we can detect whether or not this includes indirect
call profiling without checking other options, and it's probably a good
idea for "llvm-profdata merge" to disallow merging profiles that are
gathering different sets of data. A bitfield seems suitable for this.

For value profiling, allowing the profile-gen and profile-use passes to
use different options is a useful feature. Consider the following
scenarios:
1) Collect the profile data once, with all kinds of value profile data
   collected. The exact same profile data can be used in performance
   experiments with different kinds of value profiling enabled/disabled.

This use case is pretty easy to accommodate with the model where the
profile stores its type. We could autodetect the type to decide how to
interpret the profile, and if more specific profile flags are set, simply
ignore some of the data. What the file format storing the type gives us
is that it's harder to misinterpret the data by supplying mismatching
flags between reading and writing.

2) Work around a profile-use/value transformation bug selectively for
   some file without the need to change anything in the instrumentation pass.

I'm not sure I understand what you mean here.

Besides, with the latest patch, the value_kind is not recorded in each
profile value, thus the overhead is minimized.

The overhead is low, sure, but the code complexity of dealing with the
multiple kinds in this way is quite real. Since I'm not convinced we're
getting real benefit from storing the kind, I don't believe this
trade-off is worth it.

Xinliang David Li <davidxl@google.com> writes:

"Betul Buyukkurt" <betulb@codeaurora.org> writes:

- We don't need to store the value profiling kind in the data at all.
  The frontend knows which invocations of the intrinsic are for each kind
  implicitly, much like it tells a counter for an "if" and one for a
  "for" apart implicitly. Just store one set of profiling data
  and let the frontend sort it out.

I think so too. However, value_kind should stay around until the kinds
of expressions whose values are computed are included in the hash
computation. Otherwise, if a value-profiled expression is removed from
the source, then all the rest of the values would get attached to MD
fields of the wrong types. Currently, hash checking verifies whether the
counter-assigned regions in the source have changed across profile
consumption runs.

This isn't a good reason to keep value_kind - the function hash not
reflecting these kinds of changes is a problem regardless of whether or
not we have multiple types. We should just add hash computations for
these indirect call counters. As long as we only add this when indirect
profiling is turned on, it won't invalidate old hashes, so it's trivial
to stay backwards compatible.

I think there are other, better reasons why keeping the value kind is
desirable. Different kinds of value profile data need different
'preprocessing' during reading. For instance, indirect call targets need
to be translated from target addresses into function names. For length
data, we may want to compute the average length; for data addresses, we
may want to compute the min-alignment for each site. It will be hard to
do so without value kind info.

But since the consumer is the frontend, and it knows which counts are
which, it knows the kind, no? I don't understand how storing the kind
info helps or hurts - it's just redundant.

The other thing we should do is store which profiling options are
enabled, in both formats. When we specify profile-instr-use it'll be
less error prone if we can detect whether or not this includes indirect
call profiling without checking other options, and it's probably a good
idea for "llvm-profdata merge" to disallow merging profiles that are
gathering different sets of data. A bitfield seems suitable for this.

For value profiling, allowing the profile-gen and profile-use passes to
use different options is a useful feature. Consider the following
scenarios:
1) Collect the profile data once, with all kinds of value profile data
   collected. The exact same profile data can be used in performance
   experiments with different kinds of value profiling enabled/disabled.

This use case is pretty easy to accommodate with the model where the
profile stores its type. We could autodetect the type to decide how to
interpret the profile, and if more specific profile flags are set, simply
ignore some of the data. What the file format storing the type gives us
is that it's harder to misinterpret the data by supplying mismatching
flags between reading and writing.

How would you plan to merge profile files where one profile is collected
with one value_kind turned on and another file has a different value_kind
turned on? This might be a valid use case. The value_kinds are at this
time not stored per raw value in the raw profile files. The newer design
is extremely compact. Having the value_kind is extremely useful when
doing the merges and when processing data differently per value_kind
(e.g., deriving function names from target addresses). You may read the
design discussions in the past threads.

But since the consumer is the frontend, and it knows which counts are
which, it knows the kind, no? I don't understand how storing the kind
info helps or hurts - it's just redundant.

Yes, the frontend consumer knows, but the raw reader does not. It is
natural to do that processing when reading the raw profile data (for
instance, while the function pointer-to-name mapping still exists).

The other thing we should do is store which profiling options are
enabled, in both formats. When we specify profile-instr-use it'll be
less error prone if we can detect whether or not this includes indirect
call profiling without checking other options, and it's probably a good
idea for "llvm-profdata merge" to disallow merging profiles that are
gathering different sets of data. A bitfield seems suitable for this.

For value profiling, allowing the profile-gen and profile-use passes to
use different options is a useful feature. Consider the following
scenarios:
1) Collect the profile data once, with all kinds of value profile data
   collected. The exact same profile data can be used in performance
   experiments with different kinds of value profiling enabled/disabled.

This use case is pretty easy to accommodate with the model where the
profile stores its type. We could autodetect the type to decide how to
interpret the profile, and if more specific profile flags are set, simply
ignore some of the data. What the file format storing the type gives us
is that it's harder to misinterpret the data by supplying mismatching
flags between reading and writing.

Yes, this is achievable with some extra effort -- for instance, by
collecting the profile sites of all kinds regardless of the flags.
After profile matching, simply drop selected sites according to the
flags.

This adds complexity to the code. With value kinds encoded in the
profile data, the handling is much simpler -- each value profiler only
needs to deal with its own kind.

2) Work around a profile-use/value transformation bug selectively for
   some file without the need to change anything in the instrumentation pass.

I'm not sure I understand what you mean here.

Similar to the above, but for correctness.

Besides, with the latest patch, the value_kind is not recorded in each
profile value, thus the overhead is minimized.

The overhead is low, sure, but the code complexity of dealing with the
multiple kinds in this way is quite real.

I actually believe having value-kind can help reduce code complexity
instead of the other way around. What is the extra complexity?

thanks,

David

Justin, do you have more concerns on keeping value_kind?

If there is still disagreement, can we agree to move on with it for now?
After the initial version of the patches is checked in, we can do more
serious testing with large apps and revisit this if any problems are
discovered.

thanks,

David

Justin, do you have more concerns on keeping value_kind?

Hi Justin,

The next CLs according to the merge plan all depend on the value_kind;
therefore, both http://reviews.llvm.org/D10471 and
http://reviews.llvm.org/D10674 introduce it. Could you please review
these CLs?

As David stated, we need to bring the value_kind discussion to a
decision. Since the original IC profiling patches were posted, it's been
over two months and we still have not arrived at a conclusion.

-Betul

Ping?