RFC: Sanitizer-based Heap Profiler

Hi Teresa,

A year has passed since you posted this RFC; could you please give a
quick update on the current state of heap profiler development?

(Sorry if you already did so; I looked through the llvm-dev mailing list
archives to no avail -- but perhaps I missed something?)

We (Huawei) are very interested in data cache optimizations; we are
discussing our plans with Maxim and others on the BOLT project GitHub
(https://github.com/facebookincubator/BOLT/issues/178). I would be
really interested to hear your perspective and plans -- either in the
BOLT project discussion or here.

One area of particular interest is the specific data cache optimizations
you plan (or don't?) to implement in the compiler / binary optimizer /
runtime optimizer based on heap profiler data.

Thank you!

Yours,
Andrey

Hi Andrey,

The compiler-side support for the necessary instrumentation went in a while back (D85948), as did the compiler-rt support (D87120 is the main one, but there were a number of preparatory and follow-on patches). Currently it dumps the profile to a text file. We've been working on designs for the binary profile format, IR handling for the profile data, and context disambiguation support. We plan to send RFCs for some of this early this quarter.
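For concreteness, here is a minimal way to exercise the instrumentation. The driver flag was spelled `-fmemprof` when these patches landed, if memory serves; later releases may spell it differently, so check your clang version:

```cpp
// demo.cpp: a minimal program to exercise the heap profiler.
// Build and run (flag spelling as of D85948; may differ in later clangs):
//   clang++ -fmemprof demo.cpp -o demo && ./demo
// On exit, the runtime dumps the per-allocation-context profile as text.
#include <cstdlib>
#include <cstring>

int main() {
  for (int i = 0; i < 1000; ++i) {
    // One allocation context: this malloc call stack.
    char *p = static_cast<char *>(malloc(4096));
    memset(p, i & 0xff, 4096); // accesses are counted via shadow memory
    free(p);                   // lifetime is recorded when the block dies
  }
  return 0;
}
```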

We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We’ll send more details on that when it is fleshed out too.

Teresa

Hi Teresa,

Thank you for the quick reply! I'm really happy to see the project is
moving forward!

> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

Just to double check: do you plan to open-source this runtime? --
perhaps as a part of LLVM?

Yours,
Andrey

> Hi Teresa,

> Thank you for the quick reply! I'm really happy to see the project is
> moving forward!

>> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

> Just to double check: do you plan to open-source this runtime? --

It will be in tcmalloc initially.

> perhaps as a part of LLVM?

A wrapper runtime layer in LLVM is possible, but not initially.

David

Hi Teresa,

One more thing, if you don’t mind.

> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

I played with the current implementation and became a bit concerned about whether the current data profile is sufficient for efficient data allocation optimization.

First, there is no information on temporal locality -- only the total_lifetime of an allocation block is recorded, not start / end times, let alone timestamps of actual memory accesses. I wonder what criteria a data-profile-based allocation runtime would use to allocate two blocks from the same memory chunk?

Second, according to the data from [Savage'20], memory access affinity (i.e., the spatial distance between temporally close memory accesses from two different allocated blocks) is crucial: Figure 12 demonstrates that this is vital for the omnetpp benchmark from SPEC CPU 2017.

That said, my concerns are based essentially on a single paper that employs specific algorithms to guide memory allocation and measures their impact on a specific set of benchmarks. I wonder if you have preliminary data that validates the sufficiency of the implemented data profile for efficient optimization of heap memory allocations?

References:

[Savage'20] Savage, J., & Jones, T. M. (2020). HALO: Post-Link Heap-Layout Optimisation. In CGO 2020: Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. https://doi.org/10.1145/3368826.3377914

Yours,

Andrey

Hi David,

>> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

>> Just to double check: do you plan to open-source this runtime? --

> It will be in tcmalloc initially.

>> perhaps as a part of LLVM?

> A wrapper runtime layer in LLVM is possible, but not initially.

I wonder how you plan to deliver guidance to tcmalloc on which
allocations should be made from the same memory chunk -- unless you plan
to read the data profile directly from the runtime (I doubt it...), this
should be done via some kind of instrumentation added by a compiler /
binary optimizer (a la BOLT) -- right?

Yours,
Andrey

> Hi David,

>> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

>> Just to double check: do you plan to open-source this runtime? --

>> It will be in tcmalloc initially.

>> perhaps as a part of LLVM?

>> A wrapper runtime layer in LLVM is possible, but not initially.

> I wonder how you plan to deliver guidance to tcmalloc on which
> allocations should be made from the same memory chunk -- unless you plan
> to read the data profile directly from the runtime (I doubt it...), this
> should be done via some kind of instrumentation added by a compiler /
> binary optimizer (a la BOLT) -- right?

Hi Andrey,

Responding to this one first, I’ll respond to your other email shortly. Initially we plan to provide hints to tcmalloc via new APIs to help it make allocation decisions (things like hotness and lifetime). The compiler will be responsible for adding these hints, using some method to disambiguate the calling context (e.g. via function cloning, new parameter, etc).
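To make the shape of such a hint-passing API concrete, here is a hypothetical sketch; the names and hint fields are illustrative, not the actual tcmalloc interface:

```cpp
// Hypothetical hint-passing API; names and fields are illustrative,
// not the actual tcmalloc interface.
#include <cstddef>
#include <cstdlib>

enum class Hotness { Cold, Warm, Hot };
enum class Lifetime { Short, Long };

struct AllocHint {
  Hotness hotness;
  Lifetime lifetime;
};

// The allocator is free to act on or ignore the hint; this stand-in
// simply forwards to malloc.
void *alloc_with_hint(size_t size, AllocHint hint) {
  (void)hint; // a real allocator would pick a span/region based on the hint
  return malloc(size);
}

// The compiler, having disambiguated the calling context, would rewrite a
// hot, short-lived allocation site from
//   p = malloc(n);
// into something like
//   p = alloc_with_hint(n, {Hotness::Hot, Lifetime::Short});
```

Treating the hint as advisory, as sketched here, would let an allocator that does not understand it simply fall back to its normal path.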

Teresa

Sounds good -- thanks!

(though this would increase code size and thus instruction cache pressure -- tsk, tsk... :-))

Hi Andrey,

I was actually just typing up a reply welcoming contributions and suggesting you give the existing profile support a try. I realized I need to add usage documentation to llvm/clang's docs, which I will do soon, but it sounds like you figured it out OK.

Some answers below.

> Hi Teresa,

> One more thing, if you don't mind.

>> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

> I played with the current implementation and became a bit concerned about whether the current data profile is sufficient for efficient data allocation optimization.

> First, there is no information on temporal locality -- only the total_lifetime of an allocation block is recorded, not start / end times, let alone timestamps of actual memory accesses. I wonder what criteria a data-profile-based allocation runtime would use to allocate two blocks from the same memory chunk?

It would be difficult to add all of this information for every allocation, and particularly every access, without being prohibitively expensive. Right now we have the avg/min/max lifetime, and just a single boolean per context indicating whether there was a lifetime overlap with the prior allocation for that context. We can probably expand this a bit to have somewhat richer aggregate information, but like I said, recording and emitting all start/end times and timestamps would be an overwhelming amount of information. As I mentioned in my other response, the initial goal is to provide hints about hotness and lifetime length (short vs long) to the memory allocator so that it can make smarter decisions about how and where to allocate data.
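For concreteness, a schematic of the per-context aggregate described above; the field names and update logic are guesses at the shape, not the actual compiler-rt record layout:

```cpp
// Schematic of the per-context aggregate described above; field names
// and update logic are guesses at the shape, not the actual record.
#include <algorithm>
#include <cstdint>

struct ContextStats {
  uint64_t alloc_count = 0;
  uint64_t total_lifetime_ms = 0; // average = total / count
  uint64_t min_lifetime_ms = UINT64_MAX;
  uint64_t max_lifetime_ms = 0;
  uint64_t last_dealloc_ts = 0;   // used only to detect overlap
  bool overlapped_prior = false;  // single bit: did any allocation overlap
                                  // the prior one from this context?

  void record_free(uint64_t alloc_ts, uint64_t free_ts) {
    const uint64_t lifetime = free_ts - alloc_ts;
    ++alloc_count;
    total_lifetime_ms += lifetime;
    min_lifetime_ms = std::min(min_lifetime_ms, lifetime);
    max_lifetime_ms = std::max(max_lifetime_ms, lifetime);
    if (alloc_ts < last_dealloc_ts) // born before the prior block died
      overlapped_prior = true;
    last_dealloc_ts = std::max(last_dealloc_ts, free_ts);
  }
};
```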

> Second, according to the data from [Savage'20], memory access affinity (i.e., the spatial distance between temporally close memory accesses from two different allocated blocks) is crucial: Figure 12 demonstrates that this is vital for the omnetpp benchmark from SPEC CPU 2017.

Right now we don't track this information. Part of the issue is that memory accesses themselves don't interact with the profile runtime library; rather, the code is instrumented to update shadow counters inline, which keeps the overhead reasonable. My understanding from reading the HALO paper and asking the authors at CGO is that the overheads are currently quite large (both the Pin-based runtime and the offline grouping algorithm), and it didn't yet support multithreaded applications.
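Roughly, the inline shadow-counter update described here has the following shape; the mapping constants are illustrative rather than the exact memprof layout, and the shadow region is assumed to be mapped by the runtime at startup:

```cpp
// Rough shape of the inline shadow-counter update; the mapping constants
// are illustrative, not the exact memprof layout, and the runtime is
// assumed to have mapped the shadow region at startup.
#include <cstdint>

constexpr uintptr_t kShadowOffset = 0x100000000000ULL; // assumed base
constexpr unsigned kGranuleShift = 6; // one counter per 64-byte granule
constexpr unsigned kCounterShift = 3; // each counter is 8 bytes

inline void count_access(const void *addr) {
  const uintptr_t a = reinterpret_cast<uintptr_t>(addr);
  const uintptr_t shadow =
      ((a >> kGranuleShift) << kCounterShift) + kShadowOffset;
  ++*reinterpret_cast<uint64_t *>(shadow); // plain add, no runtime call
}
// The compiler emits this short sequence directly at each instrumented
// load/store, so individual accesses never enter the profile runtime.
```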

Definitely interested in contributions or ideas on how we could collect richer information with the approach we’re taking (allocations tracked by the runtime per context and fast shadow memory based updates for accesses).

> That said, my concerns are based essentially on a single paper that employs specific algorithms to guide memory allocation and measures their impact on a specific set of benchmarks. I wonder if you have preliminary data that validates the sufficiency of the implemented data profile for efficient optimization of heap memory allocations?

I don’t have anything I can share yet but we will do so in the future. For an idea of how lifetime based allocation would work, here’s a related paper which used ML to identify context-sensitive lifetimes and used the info in a custom allocator:

https://research.google/pubs/pub49008/

Maas, M., Andersen, D., Isard, M., Javanmard, M. M., McKinley, K., & Raffel, C. (2020). Learning-based Memory Allocation for C++ Server Workloads. In Proceedings of the 25th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 541-556. https://doi.org/10.1145/3373376.3378525

Teresa

There are various methods for disambiguating the contexts (e.g. HALO inserts some instructions at a minimal number of callsites to do this), and I suspect a hybrid method will be best in practice. E.g. cloning for hot allocation contexts or callsites, and instructions or parameters or something else on colder allocation contexts and callsites, to balance the code size and dynamic instruction overheads.
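As an illustration of the trade-off between the two strategies (all names here are hypothetical, not actual compiler output):

```cpp
// Illustrative sketch of the two disambiguation strategies; none of
// these names come from the actual implementation.
#include <cstddef>
#include <cstdlib>

// Stand-in for a hypothetical hinted allocator entry point.
void *alloc_with_hint(size_t size, int hint) {
  (void)hint;
  return malloc(size);
}

// Strategy 1: clone the function for a hot allocation context. The hint is
// baked into the clone, costing code size but no extra dynamic instructions.
void *make_buffer_hot(size_t n) {
  return alloc_with_hint(n, /*hint=*/1);
}

// Strategy 2: thread a context parameter through shared code on colder
// paths, costing a few dynamic instructions but no cloning.
void *make_buffer(size_t n, int context_hint) {
  return alloc_with_hint(n, context_hint);
}
```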

Sorry if I sounded too demanding: the easiest thing in the world to do
is to sit on a sofa and throw one demand after another... :-)

I understand that you have a big field full of question marks in front
of you and an immensely challenging task. My "tsk, tsk" message was
said tongue in cheek.

> I was actually just typing up a reply welcoming contributions and suggesting you give the existing profile support a try. I realized I need to add usage documentation to llvm/clang's docs, which I will do soon, but it sounds like you figured it out OK.

Yeah, I can still read code -- believe it or not! :-)

> It would be difficult to add all of this information for every allocation, and particularly every access, without being prohibitively expensive. Right now we have the avg/min/max lifetime, and just a single boolean per context indicating whether there was a lifetime overlap with the prior allocation for that context. We can probably expand this a bit to have somewhat richer aggregate information, but like I said, recording and emitting all start/end times and timestamps would be an overwhelming amount of information. As I mentioned in my other response, the initial goal is to provide hints about hotness and lifetime length (short vs long) to the memory allocator so that it can make smarter decisions about how and where to allocate data.

Yes, I understand that the task is very challenging and there are a lot of limitations / compromises to make all around.

> Definitely interested in contributions or ideas on how we could collect richer information with the approach we're taking (allocations tracked by the runtime per context and fast shadow memory based updates for accesses).

I guess the best way to approach this is by trying real workloads and investigating what works and what doesn't. I would be happy to contribute to this -- please keep me in the loop and drop a line when you have something ready to play with.

Our team is busy at the moment with BOLT improvements (ARM64, shared libs, Golang support, etc.); after that, hopefully we'll be able to join the data profile development efforts.

Yours,
Andrey

> Sorry if I sounded too demanding: the easiest thing in the world to do
> is to sit on a sofa and throw one demand after another... :-)

> I understand that you have a big field full of question marks in front
> of you and an immensely challenging task. My "tsk, tsk" message was
> said tongue in cheek.

Oh no worries, I knew it was said in jest. =). But wanted to note the concern, which is totally valid, and that we’ve thought about how to balance various overheads.
Teresa

>> I was actually just typing up a reply welcoming contributions and suggesting you give the existing profile support a try. I realized I need to add usage documentation to llvm/clang's docs, which I will do soon, but it sounds like you figured it out OK.

> Yeah, I can still read code -- believe it or not! :-)

>> It would be difficult to add all of this information for every allocation, and particularly every access, without being prohibitively expensive. Right now we have the avg/min/max lifetime, and just a single boolean per context indicating whether there was a lifetime overlap with the prior allocation for that context. We can probably expand this a bit to have somewhat richer aggregate information, but like I said, recording and emitting all start/end times and timestamps would be an overwhelming amount of information. As I mentioned in my other response, the initial goal is to provide hints about hotness and lifetime length (short vs long) to the memory allocator so that it can make smarter decisions about how and where to allocate data.

> Yes, I understand that the task is very challenging and there are a lot of limitations / compromises to make all around.

>> Definitely interested in contributions or ideas on how we could collect richer information with the approach we're taking (allocations tracked by the runtime per context and fast shadow memory based updates for accesses).

> I guess the best way to approach this is by trying real workloads and investigating what works and what doesn't. I would be happy to contribute to this -- please keep me in the loop and drop a line when you have something ready to play with.

Yep, absolutely! Definitely welcome the ideas, feedback, and any contributions you are able to make.

> Our team is busy at the moment with BOLT improvements (ARM64, shared libs, Golang support, etc.); after that, hopefully we'll be able to join the data profile development efforts.

Sounds good. Thanks,
Teresa

> Hi Teresa,

> One more thing, if you don't mind.

>> We initially plan to use the profile information to provide guidance to the dynamic allocation runtime on data allocation and placement. We'll send more details on that when it is fleshed out too.

> I played with the current implementation and became a bit concerned about whether the current data profile is sufficient for efficient data allocation optimization.

> First, there is no information on temporal locality -- only the total_lifetime of an allocation block is recorded, not start / end times, let alone timestamps of actual memory accesses. I wonder what criteria a data-profile-based allocation runtime would use to allocate two blocks from the same memory chunk?

First, I think per-allocation start/end times should be added to approximate temporal locality.

Beyond that, detailed temporal locality information is not tracked by design, for various reasons:

  1. This can be done with static analysis. The idea is for the compiler to instrument a potentially hot access region and profile the start and end addresses of the accessed memory regions. This information can be combined with the regular heap profile data. In the profile-use phase, the compiler can perform access pattern analysis and produce an affinity graph (a sketch of this idea follows the list).

  2. We try to make use of the existing allocator runtime (tcmalloc) for locality optimization. The runtime has been tuned for years to have the most efficient code for fast-path allocation. For hot allocation sites, adding too much overhead (e.g. via wrappers, etc.) can totally eat up the gains from the locality optimization;

  3. tcmalloc currently uses size-class-based partitioning, which makes co-allocation of small objects of different size classes impossible. Even for objects with the same type/size, due to the use of free lists, there is no guarantee that consecutively allocated objects are placed together;

  4. a bump-pointer allocator has its own set of problems -- when not used carefully, it can lead to huge memory waste due to fragmentation. In reality it only helps group the initial set of allocations, while the pointer bumps continuously; in steady state, the allocations will again be all over the place and no contiguity can be guaranteed.
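To make point 1 above more concrete, here is a sketch of how profiled access regions could be turned into an affinity graph. Everything here (the sample record, the time-window heuristic, the names) is invented for illustration, and samples are assumed to be sorted by timestamp:

```cpp
// Hypothetical sketch: turn profiled access regions into an affinity graph.
// Assumes samples are sorted by timestamp; record layout, time window, and
// names are all invented for illustration.
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct RegionSample {
  uint32_t context_id; // allocation context owning the accessed region
  uint64_t start, end; // profiled start/end addresses of the accesses
  uint64_t timestamp;  // when the region was touched
};

// Edge weight = how often two contexts are accessed close together in time.
using AffinityGraph = std::map<std::pair<uint32_t, uint32_t>, uint64_t>;

AffinityGraph build_affinity(const std::vector<RegionSample> &samples,
                             uint64_t window) {
  AffinityGraph g;
  for (size_t i = 0; i < samples.size(); ++i)
    for (size_t j = i + 1;
         j < samples.size() &&
         samples[j].timestamp - samples[i].timestamp < window;
         ++j)
      if (samples[i].context_id != samples[j].context_id)
        ++g[{std::min(samples[i].context_id, samples[j].context_id),
             std::max(samples[i].context_id, samples[j].context_id)}];
  return g;
}
```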

This is why we initially focus on more coarse-grained locality optimization: 1) co-placement to improve DTLB performance, and 2) improving dcache utilization using only lifetime and hotness information.

Longer term, we need to beef up compiler-based analysis: objects with exactly the same lifetimes can be safely co-allocated via compiler-based transformation, and objects with similar lifetimes can be co-allocated without introducing too much fragmentation.

Thanks,

David

There are lots of ways to control the overhead: 1) merging contexts whose properties are similar; 2) merging contexts whose memory is short-lived; 3) context trimming; etc.
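As a toy illustration of the third technique, context trimming: capping the recorded call-stack depth makes contexts that differ only in deep frames share one profile entry. The depth cap and hash here are hypothetical:

```cpp
// Toy illustration of context trimming: cap the recorded call-stack depth
// so contexts that differ only in deep frames merge into one profile entry.
// The depth cap and hash are hypothetical.
#include <cstddef>
#include <cstdint>

constexpr size_t kMaxFrames = 8; // frames beyond this depth are dropped

uint64_t context_key(const uintptr_t *frames, size_t n) {
  uint64_t h = 1469598103934665603ULL; // FNV-1a style mix
  for (size_t i = 0; i < n && i < kMaxFrames; ++i) {
    h ^= frames[i];
    h *= 1099511628211ULL;
  }
  return h;
}
```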

>> Responding to this one first, I'll respond to your other email shortly. Initially we plan to provide hints to tcmalloc via new APIs to help it make allocation decisions (things like hotness and lifetime). The compiler will be responsible for adding these hints, using some method to disambiguate the calling context (e.g. via function cloning, new parameter, etc).

>> Sounds good -- thanks!

>> (though this would increase code size and thus instruction cache pressure -- tsk, tsk... :-))

> There are various methods for disambiguating the contexts (e.g. HALO inserts some instructions at a minimal number of callsites to do this), and I suspect a hybrid method will be best in practice. E.g. cloning for hot allocation contexts or callsites, and instructions or parameters or something else on colder allocation contexts and callsites, to balance the code size and dynamic instruction overheads.

Right -- static disambiguation is suitable for hot allocation sites, which are sensitive to any added overhead.

David

This is a big undertaking with good potential, but also some uncertainty about how effective such optimizations are for larger workloads, so we really appreciate the pioneering effort in LLVM.

We (Facebook) are very interested in this too. I reached out to David and Teresa a while ago about this, and was going to wait for the RFC before having more detailed discussions. But now that we're discussing it, here's my two cents about the division of responsibility between compiler and allocator, and about the API.

I think it'd be beneficial to let the compiler do more of the heavy lifting instead of relying heavily on the allocator. If we rely on less magic inside an allocator, we will likely benefit more users, who may use different allocators. Otherwise there's a risk that the compiler part becomes too coupled with a specific allocator, which limits the overall effectiveness of PGHO outside of that allocator.

This also affects what we want to expose in the new API for hinting the allocator (e.g. providing a grouping or arena-like hint computed by the compiler, vs. passing a number of factors through the API from which the allocator computes it). With a general, stable API, we hopefully won't need to change the API when we want to take more runtime info (temporal, etc., even just for experiments) into account, or when we improve and leverage more from compiler analysis (I agree that in the long run we should improve compiler analysis).

I've talked with the jemalloc folks on our side, and we're open to API changes. In this case, it makes sense to avoid the abstraction overhead from wrappers.

Looking forward to the RFC and more discussions on this.

Thanks,

Wenlei

> This is a big undertaking with good potential, but also some uncertainty about how effective such optimizations are for larger workloads, so we really appreciate the pioneering effort in LLVM.

> We (Facebook) are very interested in this too. I reached out to David and Teresa a while ago about this, and was going to wait for the RFC before having more detailed discussions. But now that we're discussing it, here's my two cents about the division of responsibility between compiler and allocator, and about the API.

> I think it'd be beneficial to let the compiler do more of the heavy lifting instead of relying heavily on the allocator. If we rely on less magic inside an allocator, we will likely benefit more users, who may use different allocators. Otherwise there's a risk that the compiler part becomes too coupled with a specific allocator, which limits the overall effectiveness of PGHO outside of that allocator.

I agree with this in general. At a high level, there are multiple types of locality optimizations, each requiring different treatment for peak performance:

  1. Locality policies fully enforced by the compiler. Examples include co-allocation of small objects of different types with the same lifetime (where the lifetime is statically determinable).
  2. Locality policies determined by the compiler but provided as hints to the allocator(s). Examples include coarse-grained locality optimizations, as well as co-allocation of objects whose lifetimes cannot be proved identical statically. The allocator provides APIs to pass the hints, but can choose to ignore them. Due to free lists and fragmentation, there is no guarantee that two small objects in the same locality group (a compiler hint) will be allocated in the same cache line.
  3. A hybrid of the two: some compiler transformations are needed, but they rely on collaboration from the allocator. An example is struct splitting (see the sketch after this list): blindly splitting a struct without a hint to the runtime won't prevent the cold parts from polluting the cache.
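To make the struct-splitting case concrete, a sketch under the assumption of a hypothetical hinted allocator entry point (alloc_with_hint is not a real API):

```cpp
// Struct-splitting sketch; alloc_with_hint is a hypothetical hinted
// allocator entry point, not a real API.
#include <cstdint>
#include <cstdlib>

void *alloc_with_hint(size_t size, int hot) { (void)hot; return malloc(size); }

struct NodeCold {   // rarely accessed fields, allocated with a "cold" hint
  char debug_name[64];
  uint64_t create_time;
};

struct Node {       // hot fields stay small and cache-friendly
  uint64_t key;
  Node *next;
  NodeCold *cold;   // split-out cold part lives elsewhere
};

Node *new_node() {
  Node *n = static_cast<Node *>(alloc_with_hint(sizeof(Node), /*hot=*/1));
  n->cold = static_cast<NodeCold *>(
      alloc_with_hint(sizeof(NodeCold), /*hot=*/0)); // grouped with cold data
  return n;
}
```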

> This also affects what we want to expose in the new API for hinting the allocator (e.g. providing a grouping or arena-like hint computed by the compiler, vs. passing a number of factors through the API from which the allocator computes it). With a general, stable API, we hopefully won't need to change the API when we want to take more runtime info (temporal, etc., even just for experiments) into account, or when we improve and leverage more from compiler analysis (I agree that in the long run we should improve compiler analysis).

Agree.

> I've talked with the jemalloc folks on our side, and we're open to API changes. In this case, it makes sense to avoid the abstraction overhead from wrappers.

Yes. A mature allocator has been tuned over many years to reduce allocation overhead and minimize memory fragmentation. Duplicating that in wrappers would be a huge task, except for special cases.

Thanks,

David