From: <betulb@codeaurora.org>
Date: Tue, Apr 7, 2015 at 12:44 PM
Subject: [LLVMdev] IC profiling infrastructure
To: llvmdev@cs.uiuc.edu

Hi All,
We had sent out an RFC in October on indirect call target profiling. The
proposal was about profiling target addresses seen at indirect call sites.
Using the profile data we're seeing up to 8% performance improvements on
individual SPEC benchmarks where indirect call sites are present. We've
already started uploading our patches to Phabricator. I'm looking
forward to your reviews and comments on the code and am ready to respond to
your design-related queries.

There were a few questions posted on the RFC that were not answered. Here
are the much delayed comments.
Hi Betul, thank you for your patience. I have completed initial
comparison with a few alternative value profile designs. My conclusion
is that your proposed approach should work well in practice. The study can
be found here: https://docs.google.com/document/u/1/d/1k-_k_DLFBh8h3XMnPAi6za-XpmjOIPHX_x6UB6PULfw/pub
1) Added dependencies: Our implementation adds a dependency on calloc/free
as we’re generating/maintaining a linked list at run time.
If it becomes a problem for some, there is a way to handle that -- but
at a cost of more memory required (to be conservative). One good
feature of using dynamic memory is that it allows counter array
allocation on the fly which eliminates the need to allocate memory for
lots of cold/unexecuted functions.
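
To illustrate the on-the-fly allocation point, here is a minimal
single-threaded sketch. The func_profile record and both function names
are hypothetical, made up for this example; they are not the data layout
or API used by the patches.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-function profile record (an assumption for illustration). */
typedef struct {
  uint32_t num_counters; /* counters this function would need */
  uint64_t *counters;    /* stays NULL until the function first executes */
} func_profile;

/* Return the counter array, allocating it zero-filled on first use.
   Returns NULL if there are no counters or the allocation fails. */
uint64_t *get_counters(func_profile *fp) {
  assert(fp != NULL);
  if (fp->counters == NULL && fp->num_counters > 0)
    fp->counters = calloc(fp->num_counters, sizeof(uint64_t));
  return fp->counters;
}

/* Increment one counter; the update is dropped if the index is out of
   range or memory could not be allocated. */
void bump_counter(func_profile *fp, uint32_t idx) {
  uint64_t *c = get_counters(fp);
  if (c != NULL && idx < fp->num_counters)
    c[idx]++;
}
```

A function that never runs never calls bump_counter, so its counters
pointer stays NULL and no memory is spent on it.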
We also added
a dependency on mutexes to prevent memory leaks when multiple threads
try to insert a new target address for the same IC site into the linked
list. To minimize the performance impact, we only added mutexes around
the pointer assignment and kept any dynamic memory allocation/free
operations outside of the mutexed code.
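
A rough sketch of that locking pattern, using POSIX mutexes. All type and
function names here (target_node, ic_site, record_target) are hypothetical
and not taken from the patches. As a further assumption, this sketch also
takes the lock for the lookup and the count update so that it stays free
of data races, which is more conservative than locking only the pointer
assignment.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node: one target address observed at an indirect call site. */
typedef struct target_node {
  uint64_t target;          /* callee address seen at the site */
  uint64_t count;           /* number of times it was seen */
  struct target_node *next;
} target_node;

/* Hypothetical per-site record: list head plus the lock guarding it. */
typedef struct {
  target_node *head;
  pthread_mutex_t lock;
} ic_site;

static target_node *find_target(target_node *n, uint64_t target) {
  for (; n != NULL; n = n->next)
    if (n->target == target)
      return n;
  return NULL;
}

/* Record one call to `target`. Returns 0 on success, -1 if out of memory. */
int record_target(ic_site *site, uint64_t target) {
  assert(site != NULL);

  /* First pass: bump the count if the target is already in the list. */
  pthread_mutex_lock(&site->lock);
  target_node *hit = find_target(site->head, target);
  if (hit != NULL)
    hit->count++;
  pthread_mutex_unlock(&site->lock);
  if (hit != NULL)
    return 0;

  /* Allocate and fill the new node outside the critical section. */
  target_node *fresh = calloc(1, sizeof(*fresh));
  if (fresh == NULL)
    return -1;
  fresh->target = target;
  fresh->count = 1;

  /* Second pass: re-check, then link with a single pointer assignment. */
  pthread_mutex_lock(&site->lock);
  hit = find_target(site->head, target);
  if (hit != NULL) {
    hit->count++; /* another thread inserted this target first */
  } else {
    fresh->next = site->head;
    site->head = fresh;
  }
  pthread_mutex_unlock(&site->lock);

  if (hit != NULL)
    free(fresh); /* lost the race: free outside the lock, nothing leaks */
  return 0;
}
```

The re-check under the lock is what prevents the leak: a thread that
loses the race frees its own node instead of linking a duplicate.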
This (using mutexes) should be and can be avoided -- see the above report.
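
For reference, one common way to drop the mutex is to publish new nodes
with a compare-and-swap on the list head. The sketch below uses C11
atomics and hypothetical names (cas_node, cas_site, cas_record_target).
It is only an illustration of the general idea under the assumption that
nodes are never removed while profiling runs; it is not necessarily the
scheme described in the report.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical node; `next` is written once, before the node is published. */
typedef struct cas_node {
  uint64_t target;
  _Atomic uint64_t count;
  struct cas_node *next;
} cas_node;

/* Hypothetical per-site record holding only an atomic list head. */
typedef struct {
  _Atomic(cas_node *) head;
} cas_site;

/* Record one call to `target` without taking a lock.
   Returns 0 on success, -1 if out of memory. */
int cas_record_target(cas_site *site, uint64_t target) {
  assert(site != NULL);
  cas_node *fresh = NULL;
  cas_node *seen = atomic_load(&site->head);
  for (;;) {
    /* Scan the list starting from the head we last observed. Nodes are
       never unlinked in this sketch, so the walk is safe. */
    for (cas_node *n = seen; n != NULL; n = n->next) {
      if (n->target == target) {
        atomic_fetch_add(&n->count, 1);
        free(fresh); /* free(NULL) is a no-op if nothing was allocated */
        return 0;
      }
    }
    if (fresh == NULL) {
      fresh = calloc(1, sizeof(*fresh));
      if (fresh == NULL)
        return -1;
      fresh->target = target;
      atomic_init(&fresh->count, 1);
    }
    fresh->next = seen;
    /* Publish only if the head is still `seen`. On failure `seen` is
       refreshed with the current head and the list is scanned again. */
    if (atomic_compare_exchange_weak(&site->head, &seen, fresh))
      return 0;
  }
}
```

A thread that loses the race rescans the updated list, so a duplicate
node is never linked and an unused allocation is freed before returning.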
2) Indirect call data being present in sampling profile output: This is
unfortunately not helping in our case due to perf depending on LBR
support. To our knowledge LBR support is not present on ARM platforms.
Yes.
3) Losing profiling support on targets not supporting malloc/mutexes: The
added dependency on calloc/free/mutexes may perhaps be eliminated
(although our current solution does not handle this) through having a
separate run time library for value profiling purposes. Instrumentation
can link in two run time libraries when value profiling (an instance of it
being indirect call target profiling) is enabled on the command line.
See above.
4) Performance of the instrumented code: Instrumentation with IC profiling
patches resulted in 7% degradation across SPEC benchmarks at -O2. For the
benchmarks that did not have any IC sites, no performance degradation was
observed. This data was gathered using the ref data set for SPEC.
I'd like the runtime part of the change to be shared and used
as a general-purpose value profiler (not just for indirect call
promotion), but this can be done as a follow-up.
I will start with some reviews. Hopefully others will help with reviews too.
thanks,
David